Meeting_2006-10-30
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 30.10.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:45.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact writers who already have received contracts
- Elle Márjá Vars and Aage Solbakk. Both said they would give us texts.
- Move Norwegian documents in Min Áigi from sme to nob
- Not done
- finish Forrest i18n work (pdf)
- Fixed i18n in pdf together with Sjur. Cleaned up broken links.
- cvs synching of the risten.no code in eXist (read-only)
- Sjur did this
- consider a script for automatic testing of the spell checker
- Nothing done
- consider more testing routines
- Nothing done
- update Maaren's Forrest installation to r430284
- Not done
- fix bugs!
- Not done
Maaren
- investigate the generated word form list sent to Polderland - use the command
- done some
- investigate unrecognised word forms in the hyphenator
- done some
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- Paradigm generator implemented. Final testing still going on. The text
- generate parallel corpus files manually (with Trond)
- export corpus tools to location available to all (with cron), cf news disc.
- not done.
- help Trond with some shell commands
- done some.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- done some more
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- finish i18n work of Forrest
- helped Børre with the PDF i18n
- install our local copy of risten.no and propnouns on the G5
- done
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- proposal posted to the news
- fix bugs!
- other:
- participated in a Nordic language technology seminar in Gothenburg
- cvs syncing of risten.no code on the G5 (with help from Børre)
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
- find and study all derived verbs in our corpus
- nothing this week
- suggest which derivations could be generated
- nothing this week
- investigate unrecognised word forms in hyphenator
- done some serious investigations
- investigate the generated word form list sent to Polderland - use the command
- done some serious investigations
- decide how to specify compounding behaviour info in the lexicon
- working on it
- fix bugs!
- worked
Tomi
- continue implementation of the speller lexicon conversion
- continued
- make generator as server, based on Saara's code
- Saara did
- add lexc2xspell code to cvs
- done
- add hyphenation points to the generated output
- not done
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Not done, haven't done more than saying "bures" to Thomas a couple of times.
- Get more sma texts to improve language recognition
- I got a whole bunch of texts, but lost my memory stick (!)
- study paragraphs with mixed content
- Done some. There are systematic traits there, cf. discussion to come.
- add corpus user accounts and access issues to Bugzilla
- Not done.
- investigate unrecognised word forms in the hyphenator
- Don't remember this. Worked on sma hyphenator, though.
- decide how to specify compounding behaviour info in the lexicon
- Blank.
- fix bugs!
3. Documentation
One small problem: Forrest seems to crash on raw HTML. Børre will check it.
TODO:
- finish i18n work (Børre and Sjur)
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- i18n does not work in PDF ("Table of Content" won't translate)
- done - two strings still missing (we couldn't find their location)
- check potential raw HTML bug/problem (Børre)
4. Corpus gathering
Two more authors contacted, both positive. Åge Solbakk is coming to Tromsø
Børre has been digging more in the SD archives, and found some more texts.
sma
smj
TODO:
- continue to help NSI to get their corpus (Børre)
- nothing last week, will visit them when going to Kautokeino this or next week
- sma:
- Bible (Trond).
- Discussions with the Sámi Parliament (Børre, Sjur)
- add as many smj texts as possible (Børre)
5. Corpus infrastructure
User accounts and access
TODO:
- add the issue with subissues to Bugzilla (Trond)
- not yet
More texts to the graphical corpus interface:
We have sent approximately 10 texts to Oslo, aligned and with sme analysis. Now,
TODO:
- add text to the server (Lars)
- Lars came back from holiday last week, which means it will probably soon be
Aligner
TODO:
- report improvements in aligner back to Øystein (Børre)
- gather more parallel texts (Trond)
- try out NT alignment strategies (Saara)
Language recognition
Trond and Saara have done some work on paragraphs with mixed content.
Types of mixed paragraphs in the newspaper texts:
- Norwegian quotations (titles, lines of dialogue, etc.)
- Bilingual text, separated by some separator: (/)
- Systematic omissions in the original translations
- Technical text for the typographers
- Names
- Unsystematic Norwegian parts of sentences
Examples of the types:
- Muhtomin láve friddjavuođadovdu ja eará háve fas dakkár dovdu ahte "Dere
- Vi har spurt eldre samer om hvordan de hadde det før i tiden./ Mii leat
- Du lihkkologut: 1, 14, 27 og 31 ¶
- BILDE: Kjell Kemi og Mai Britt Utsi ¶ HOVEDSAK: Bilde av Sponheim og rein ¶
- Eambbo dieđuid daid ortnegiid birra ja ohcanskoviid gávnnat min
- 1992: s lei NSR sámi delegašuvnnas mii soabadii Justisministariin om opplegget
The first is the most common one. In the MÁ corpus, there are 4000 strings with
Suggestion for handling the types:
- Quoted strings: Pick out the quoted strings and check them separately
- When? In the preprocessor or at conversion to XML? At conversion to XML, see below
- Do nothing or look for known separators when the recognition returns
- Do nothing, and add "og" as a loan word in the lexicon (with !SUB!!!)
- Identify the technical words BILDE, HOVEDSAK, then mark them as unwanted (?)
- Do nothing. (we go for CC "og")
- Do nothing for the time being (bilingual analysis in the future?)
Conversion of quotations:
<p lang=a>...dovdu ahte «Dere... » ...</p>
-- converted to: --
<p lang=a>...dovdu ahte <span type="quote" lang="nb">«Dere... »</span> ...</p>
Types of quotations:
- Directed: «», “”
- Undirected: "" (if even number, easy, if odd, hard) <==
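The conversion of directed quotations could be sketched roughly as follows. This is only an illustration, not the actual corpus conversion code: the function name wrap_directed_quotes is hypothetical, and the language code is passed in as a parameter, whereas in practice the quoted string would be checked separately by the language recogniser. Undirected quotes ("") are deliberately left untouched, since an odd number of them cannot be paired reliably.

```python
import re

# Directed quotation marks have distinct opening and closing characters,
# so each pair can be matched without ambiguity.
DIRECTED_QUOTE = re.compile(r"«[^«»]*»|“[^“”]*”")


def wrap_directed_quotes(paragraph, lang="nb"):
    """Wrap each directed quotation in a <span type="quote"> element.

    `lang` is an assumption here; the real value would come from running
    the language recogniser on the quoted string.
    """
    def _wrap(match):
        return '<span type="quote" lang="{}">{}</span>'.format(lang, match.group(0))

    return DIRECTED_QUOTE.sub(_wrap, paragraph)


if __name__ == "__main__":
    print(wrap_directed_quotes("...dovdu ahte «Dere... » ..."))
```

The sketch works on the text content of a paragraph only; embedding it in the XML conversion step is left open.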
Norwegian sequences could be strung together, and treated as an unanalysable
Language distribution of paragraphs, as identified by the language recogniser:
LANG   # hits   reality
sme    68431    true
nob    10595    true
smj     8468    mostly sme, some smj
nno     1220    mostly nob
eng      994    true
fin      956    some true, most sme?
dan      482    false
ger      252    false
sma       81    mostly true, some short paragraphs may be false
isl        9    false
TODO:
- get more sma texts, first the Bible / NT (Trond)
- add <span> to the corpus processing, encapsulating identifiable sequences
6. Infrastructure
Xerox tools wrapped as servers
Paradigm generator is now finished (some problems with the XML still). The
Saara needs paradigm grammars for all POSes, see the example for N:
N+Subclass?+Number+Case+Possessive?+Clitic?
V+
A+
Adv+
Pron+
...
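To illustrate how such a paradigm grammar could drive the generator, here is a minimal sketch that expands a grammar string into all tag strings it licenses, treating slots marked with ? as optional. The function names and the tag values in EXAMPLE_VALUES are assumptions made for the example (Qst in particular is just a placeholder clitic tag); the real tag inventories come from the lexc files.

```python
from itertools import product

# Hypothetical slot values, for illustration only.
EXAMPLE_VALUES = {
    "Number": ["Sg", "Pl"],
    "Case": ["Nom", "Gen"],
    "Clitic": ["Qst"],
}


def parse_grammar(grammar):
    """Split 'N+Number+Clitic?' into the POS and (slot, optional) pairs."""
    pos, *slots = grammar.split("+")
    return pos, [(slot.rstrip("?"), slot.endswith("?")) for slot in slots]


def expand_paradigm_grammar(grammar, values):
    """Return every tag string the grammar allows, in slot order."""
    pos, slots = parse_grammar(grammar)
    choices = []
    for name, optional in slots:
        options = list(values[name])
        if optional:
            # The empty string stands for "slot left out".
            options = [""] + options
        choices.append(options)
    return [
        "+".join([pos] + [tag for tag in combo if tag])
        for combo in product(*choices)
    ]


if __name__ == "__main__":
    for tags in expand_paradigm_grammar("N+Number+Case+Clitic?", EXAMPLE_VALUES):
        print(tags)
```

Each resulting tag string could then be sent to the inflector as "Lemma + tag string" input.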
The inflector (generator as server) has four output options:
- short paradigm (nom, gen, gen pl)
- standard paradigm (full w/o poss and clitics)
- complete (incl poss. clitics)
- take any single string including tags, return inflected form
Input is one of:
- Lemma + POS and grammar type (short / standard / complete)
- Lemma + tag string
Next: add the hyphenation filter to the hyphenator server
TODO:
- improve and finish the present prototype (Saara)
- fix the corpus tag list in the cwb/ directory (Trond)
- add the hyphenation filter to the hyphenator server (Saara)
- create / check the paradigm grammar as exemplified above (Maaren)
Hyphenator
sma
Trond had some discussions with Ove Lorentz. We have done "maximize
Unrecognised word forms
The unrecognised forms are forms generated by the nonrec transducer, but
The command sequence is:
- log in to victorio, move to gt/
- make wordlist TARGET=sme (the result is: sme/wordlist-sme.txt.gz)
- move wordlist-sme.txt.gz to local computer (or G5?)
- make TARGET=sme (gives sme.fst)
- make hyph TARGET=sme (gives hyph-sme.fst)
- gunzip wordlist-sme.txt.gz
- cat wordlist-sme.txt | lookup -flags mbTT -utf8 bin/hyph-sme.fst > output.txt
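Once output.txt exists, the unrecognised forms can be pulled out with a small script. The sketch below assumes that lookup writes tab-separated lines and marks a failed lookup with +? in the last field, as in the examples quoted elsewhere in these minutes; the function name is hypothetical.

```python
import sys


def unrecognised_forms(lines):
    """Yield the input form of every lookup line that ends in '+?'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed format: input form, echoed form, then the '+?' marker.
        if len(fields) >= 2 and fields[-1] == "+?":
            yield fields[0]


if __name__ == "__main__":
    # Usage: python unrecognised.py output.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        for form in unrecognised_forms(handle):
            print(form)
```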
TODO:
- Update the sma hyphenator rule set with the insights gained from smj
- done several updates, still more to be done
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
M4
It is problematic for the CG rules, as the rule numbering gets mixed up. The
7. Linguistics
Names and multilinguality
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
Observations:
1) Multilinguality is always optional.
2) We can observe that "foreign" names in texts follow a domination pattern:
3) When looking at our name classification, multilinguality varies according to:
Ani - weak/none? (pet, myth anim. names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign names
Sur - none
Tit - strong (titles)
Suggestion:
We need to reconsider the "all names in all languages" policy. That policy is
A further issue is whether we should reconsider our cohort policy. Today, Sur
"<Trosterud>"
	"Trosterud" N Prop Sur Sg Nom <<< @HNOUN
	"Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
	"Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
	"Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
Derivation and spellers like Aspell
TODO:
- find and study all derived words in our corpus (Thomas and Trond)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
Unwanted word forms:
- comparison of -laš derivations (they should not be generated in comparative
Questionable forms:
a-a a-a+Interj
á-a a-a+Interj !SUB
ASDF:a ASDF+N+ACR+Sg+Gen
A:a A/S+N+ACR+Sg+Acc
f:a f:a +? (wanted: f Gen, f:s Loc)
From SGL meeting 003/05: North Sámi words for normalisation (the questions came from the developers of the Divvun program)
1) How should the inflection of abbreviations be marked, e.g. NRK:as, NRKas, NRK-as?
The former Sámi Language Council (in Norway) proposed that they be written like this:
Nom. NSR
Acc. NSR
Gen. NSR`
The question is whether abbreviations should still be written this way.
Decision: Abbreviations are inflected as follows:
Nom. NSR
Acc. NSR (not NSR:a) <== a and NOT a:a
Gen. NSR <== a and NOT a:a
Ill. NSR:i
Loc. NSR:s
Com. NSR:in
Ess. NSR:n
Correct:
abstrávttabuinnán
abstrávttabuinnán abstrákta+A+Comp+Sg+Com+PxSg1
abstrávttabuinnán abstrákta+A+Comp+Pl+Loc+PxSg1
Error?
abstrávttaboiinnán abstrávttaboiinnát abstrávttaboiinnis abstrávttaboiinniset
abstrávttaboiinnán
abstrávttaboiinnán abstrávttaboiinnán +?
** These are included even though they are !SUB *** must be removed!
accompagnerejun V+TV+Der/j+Pass+PrfPrc
ábuhuvvože ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne V+IV+Der/alla+Der/goahti+Pot+Prs+Du1
* Some were marked !sub (in lowercase, by me? Does it make any difference?). I have changed them. Note, however, that the entries above did NOT have lowercase letters.
NB! SUB marking has to be with uppercase SUB to be removed.
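The SGL decision on abbreviations quoted above can be summarised as a small lookup table: nominative, accusative and genitive are bare, while the remaining cases take a suffix after a colon. The following is only a sketch of that rule for the singular forms listed in the decision; the function name and the case labels used as dictionary keys are choices made for this example.

```python
# Singular case suffixes as listed in the SGL decision; an empty string
# means the abbreviation is written without colon and suffix.
ACRONYM_SUFFIXES = {
    "Nom": "",
    "Acc": "",
    "Gen": "",
    "Ill": "i",
    "Loc": "s",
    "Com": "in",
    "Ess": "n",
}


def inflect_acronym(acronym, case):
    """Return the inflected form, using the colon as case separator."""
    suffix = ACRONYM_SUFFIXES[case]
    return "{}:{}".format(acronym, suffix) if suffix else acronym


if __name__ == "__main__":
    for case in ACRONYM_SUFFIXES:
        print(case, inflect_acronym("NSR", case))
```

The same table gives the wanted forms for single letters noted above (f for Gen, f:s for Loc).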
TODO:
- investigate the generated word form list sent to Polderland - use the command
- alphabet letters need to be correctly inflected (colon as case separator)
- check why some SUB-marked entries got included in the normative transducer
- remove comparison from -laš derivations (Thomas, Sjur)
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- add the proper noun interface to the G5 (Sjur)
- done, you can now try out the proper noun lexicon in risten.no.
- cvs synching of the risten.no code in eXist (read-only) (Børre)
- done
- add a simple password protection to risten.no in the G5 (Børre)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correctly:
9. Spellers
Speller data generation
It reads the lexc files, communicates with the server, xml communication needs
TODO:
- add code to cvs (Tomi)
- done
- implement generator server based on Saara's code (Tomi)
- done (by Saara)
- decide how to specify compounding behaviour info for the lexicon
- discussion started in news - please respond!
- add hyphenation points to the generated output (Tomi)
- planning meeting for the word form generator / data conversion script
- discussion started in news
Automatic testing of the Word spellchecker
TODO:
- consider a script for automatic testing (Sjur, Børre)
- ask Polderland about testing tools (Sjur)
- done
- consider more testing routines (Sjur, Børre)
- consider infra for testing feedback (Børre, Sjur)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
10. Other
Bug fixing
64 open Divvun/Disamb bugs, and 24 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Task lists as iCal entries
Børre should have a look at Maaren's computer when he is in Kautokeino.
TODO:
- update Maaren's Forrest installation to r430284 (Børre)
Employee seminar in Alta
SD has an employee seminar in Alta 7.-8. December - should we go there? Sjur
TODO:
- ask Julie Eira about SD employee seminar (Sjur)
11. Next meeting, closing
Next meeting 6.11.2006 at 9:30 (on the Swedish day in Finland - Swedish as the
Closed at 12:14.
Appendix - task lists for the next week
Børre
- contact writers who already have received contracts
- move Norwegian documents in Min Áigi from sme to nob
- consider a script for automatic testing of the spell checker in Word
- consider more testing routines
- update Maaren's Forrest installation to r430284
- check potential raw HTML bug/problem
- sma discussions with SD (with Sjur, Trond)
- add as many smj texts as possible
- report improvements in aligner back to Øystein
- add a simple password protection to risten.no in the G5
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
- create / check the paradigm grammar as exemplified above
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- improve text_cat with paragraphs of mixed content
- generate parallel corpus files manually (with Trond)
- export corpus tools to location available to all (with cron), cf news disc.
- help Trond with some shell commands
- plan the word form generator / data conversion script
- add <span> to the corpus processing, encapsulating identifiable sequences
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- finish i18n work of Forrest with Børre
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- remove comparison from -laš derivations
- plan the word form generator / data conversion script
- consider a script for automatic testing of the spell checker in Word
- consider more testing routines
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- ask Julie Eira about SD employee seminar
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived words in our corpus (with Trond)
- suggest which derivations could be generated
- investigate unrecognised word forms in hyphenator
- decide how to specify compounding behaviour info in the lexicon
- check why some SUB-marked entries got included in the normative transducer
- remove comparison from -laš derivations
- fix bugs!
Tomi
- continue implementation of the speller lexicon conversion
- make generator as server, based on Saara's code
- add lexc2xspell code to cvs
- add hyphenation points to the generated output
- plan the word form generator / data conversion script
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts, first the Bible / NT
- add corpus user accounts and access issues to Bugzilla
- fix the corpus tag list in the cwb/ directory
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- sma discussions with SD (with Børre, Sjur)
- find and study all derived words in our corpus (with Thomas)
- fix bugs!