Meeting_2006-12-04
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 4.12.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 48.
Present: Børre, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- consider a script for automatic testing of the spell checker in Word
- Began work. Some AppleScript done
- Began work. Some AppleScript done
- meeting Wednesday at 9.30 with Sjur to plan testing
- Done
- Done
-
sma discussions with SD (with Sjur, Trond)
- Not done
- Not done
- add a simple password protection to risten.no in the G5
- Not done
- Not done
- consider infra for testing feedback
- Some plans made during the meeting with Sjur
- Some plans made during the meeting with Sjur
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- Not done
- Not done
- update all forrest installations, including local patches
- Not done
- Not done
-
fix bugs!
- Done some work on bug 348
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
Saara
- finalize server of the Xerox tools.
- help Trond with some shell commands
- re-analyze parallel files
- consider implementing some new features to the corpus files
- add closed POSes to the paradigm gen, if needed.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- finally got multiuser editing safe against data corruption - after more than
- finally got multiuser editing safe against data corruption - after more than
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- hire linguist and programmer
- investigate unrecognised word forms in the hyphenator
- nothing done, but this has become much easier with the debug output from
- nothing done, but this has become much easier with the debug output from
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- had a meeting, but didn't finish, and the second part disappeared in other
- had a meeting, but didn't finish, and the second part disappeared in other
- meeting Tuesday 12.30
-
sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- consider a script for automatic testing of the spell checker in Word
- discussed with Børre, pseudocode in place
- discussed with Børre, pseudocode in place
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- make sure we get one/more licenses in Alta
- make sure we get one/more licenses in Alta
- publish corpus contracts and project infra on NoDaLi-sta
- not yet
- not yet
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts
- check whether lookup can be used to generate paradigms for closed POSes
- done, doesn't look like it
- done, doesn't look like it
- meeting Wednesday at 9.30 with Børre to plan testing
- done, memo available soon (not checked in) in the gt/doc/proof/ dir
- done, memo available soon (not checked in) in the gt/doc/proof/ dir
-
fix bugs!
- done some for risten.no, gleaned through others
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- not done
- redo the derived words work for smj
- done
- done
- decide how to specify compounding behaviour info in the lexicon
- begun
- meeting Tuesday 12.30
- done
- done
- begun
- check why some SUB-marked entries got included in the normative transducer
- not done
- not done
- investigate unrecognised word forms in hyphenator
- not done
- not done
- test editing of the xml files
- not done
- not done
- clean terms-sme.xml such that all names have the correct tag for their use
- not done
- not done
- write paradigm grammar for the closed POSes
- done
- done
-
fix bugs!
- just added new bugs
Tomi
- make PLX lexicon for all open POSes
- not done - how much is done, what is left?
- there is something fishy in the server, Saara has got it working but I
- not done - how much is done, what is left?
- add derivations to the PLX generation
- not done
- not done
- add compound stems to the PLX generation
- added noun
- added noun
- add closed POS and clitics to PLX generation
- not done
- not done
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts
- Still waiting
- Still waiting
- investigate unrecognised word forms in the hyphenator
- Not done
- Not done
- decide how to specify compounding behaviour info in the lexicon
- Not done
- Not done
- meeting Tuesday 12.30
- Meeting kept, but more to discuss.
- Meeting kept, but more to discuss.
-
sma discussions with SD (with Børre, Sjur)
- Not done.
- Not done.
- redo the derived words work for smj
- Not done.
- Not done.
3. Documentation
TODO:
- update all forrest installations, including local patches (Børre)
- not yet, will do it in Alta
4. Corpus gathering
Nothing new during last week.
TODO:
- get sma Bible / NT texts (Trond)
- Discussions with the Sámi Parliament about sma ( Børre, Sjur, Trond)
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)
5. Corpus infrastructure
More texts to the graphical corpus interface
Trond has talked with Lars, who is writing documentation for the users.
TODO:
- Consider whether to implement the <p> only policy or not (Saara)
- add text to the server (Lars)
- Discuss with Lars (Trond)
Aligner
No more texts yet, Saara has included the aligner in the relevant perl script.
TODO:
- gather more parallel texts (Trond, Børre)
- re-analyze parallel files using the command-line version (Saara)
6. Infrastructure
Xerox tools wrapped as servers
The server hasn't been working for Tomi, but is now working again. The
sajin NIR # !SUB sa^jis NIR sa^jiin NIR sa^ji NIR sad^jái NIR sad^ji NIR sad^je NIRL - this is done by client sa^je NIRL sa^ji NIRL sa^je NIRL sajiis NIR # !SUB sa^jiin NIR sa^jii^guin NIR sa^jiid NIR sa^jii^de NIR sa^jit NIR sa^jiid NIRL sajin NIR #N+Sg+Loc sa^jis NIR #N+Sg+Loc sa^jiin NIR #N+Sg+Com sa^ji NIR #N+Sg+Acc sad^jái NIR #N+Sg+Ill sad^ji NIR sad^je NIRL #N+Sg+Nom sa^je NIRL #N+Sg+Gen sa^ji NIRL #N+Sg+Gen sa^je NIRL #N+Sg+Gen sajiis NIR #N+Pl+Loc sa^jiin NIR #N+Pl+Loc sa^jii^guin NIR #N+Pl+Com sa^jiid NIR #N+Pl+Acc sa^jii^de NIR #N+Pl+Ill sa^jit NIR #N+Pl+Nom sa^jiid NIRL #N+Pl+Gen Using fsts: /opt/smi/sme/bin/isme.fst /opt/smi/sme/bin/hyph-sme.fst
It should have been using: ifst-norm:inverse-norm.fst.
-rwxrwxr-x 1 root cvs 2257 jun 21 10:13 allcaps.fst -rwxrwxr-x 1 root cvs 92 jun 21 10:16 cap-sme -rwxrwxr-x 1 root cvs 6995574 des 4 00:38 hyph-sme.fst -rwxrwxr-x 1 root cvs 1206092 des 4 00:38 isme.fst -rwxrwxr-x 1 root cvs 3106957 des 4 00:38 isme-norm.fst -rwxrwxr-x 1 root cvs 674609 des 4 00:38 sme-dis.rle -rwxrwxr-x 1 root cvs 1251450 des 4 00:38 sme.fst
TODO:
- remove clitics from the paradigm generator (Saara)
- done
- done
- investigate why possessives have disappeared from the paradigm generator
- make sure the normative generator is used when generating paradigms (Tomi)
Hyphenator
TODO:
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- possibly the main cause was found in the meeting (non-normative forms
- possibly the main cause was found in the meeting (non-normative forms
7. Linguistics
Names and multilinguality
TODO:
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are errouneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
Derivation and spellers like Aspell
TODO:
- redo the work for smj, including discussion regarding Actio
- done by Thomas
North Sámi
The following words are included in the normative list despite being marked
accompagnerejun -JUVVON accompagnerejun accompagneret+V+TV+Der1+Der/j+Der/Pass+PrfPrc ábuhuvvože -ETNE ábuhuvvože ábuhit+V+TV+Pass+Pot+Prs+Du1 áccohallagođežedne -ETNE áccohallagođežedne áccohallat+V+TV+Der3+Der/goahti+Pot+Prs+Du1 In sme-lex.txt: +Der2+Der/Pass:uvvo DOHPPEINCH ; +Der/Pass+PrfPrc:un K ; !SUB +Du1:e K ; !SUB +Du1:edne K ; !SUB +Du1:etne K ;
These are generated by make wordlist TARGET=sme, which uses
The last version of the wordlist does not include the errouneous words anymore.
TODO:
- investigate the generated word form list sent to Polderland - use the command
- check why some SUB-marked entries got included in the normative transducer
- done in the meeting
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
9. Spellers
Polderland data generation
All other open POSes now included in the paradigm generator, below is a verb
galggaže VIR #V+Pot+Prs+Du1 galggažedne VIR #V+Pot+Prs+Du1 galg^ga^žet^ne VIR #V+Pot+Prs+Du1 galg^ga^žeh^pet VIR #V+Pot+Prs+Pl2 galg^gaš VIR #V+Pot+Prs+ConNeg galg^ga^žat VIR #V+Pot+Prs+Pl1 galg^ga^žit VIR #V+Pot+Prs+Pl1 galg^ga^žan VIR #V+Pot+Prs+Sg1 galg^ga^žea^ba VIR #V+Pot+Prs+Du3 galg^ga^žat VIR #V+Pot+Prs+Sg2 galg^ga^žeahp^pi VIR #V+Pot+Prs+Du2 galg^ga^žit VIR #V+Pot+Prs+Pl3 galg^ga^ža VIR #V+Pot+Prs+Sg3 galg^ga^leim^me VIR #V+Cond+Prs+Du1 galg^ga^šeim^me VIR #V+Cond+Prs+Du1 galg^ga^leid^det VIR #V+Cond+Prs+Pl2 galg^ga^šeid^det VIR #V+Cond+Prs+Pl2 galg^ga^le VIR #V+Cond+Prs+ConNeg galg^ga^še VIR #V+Cond+Prs+ConNeg galg^ga^leim^met VIR #V+Cond+Prs+Pl1 galg^ga^šeim^met VIR #V+Cond+Prs+Pl1 ...
TODO:
- write paradigm grammar for the closed POSes (Thomas)
- done
- done
- check whether lookup can be used to generate paradigms for closed POSes
- couldn't find anything
- couldn't find anything
- decide how to specify compounding behaviour info for the lexicon
- meeting Tuesday at 12.30
- done, needs to be followed up
- done, needs to be followed up
- meeting Tuesday at 12.30
- make inflection PLX lexicon for all open POSes (Tomi)
- done
- done
- add closed POS and clitics to PLX generation (Tomi)
- add derivations to the PLX generation (Tomi)
- add compound stems to the PLX generation (Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
When the PLX-based speller is ready: use the generated word list as test input:
We need a meeting to plan testing. We'll do it shortly this week, and perhaps
TODO:
- meeting Wednesday at 9.30 (Sjur, Børre)
- done
- done
- consider a script (shell+AppleScript?) for automatic testing of MS Word
- pseudocode written, some AppleScript
- pseudocode written, some AppleScript
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- not yet
Bug fixing
56 open Divvun/Disamb bugs, and 23 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Task lists as iCal entries
TODO:
- update Maaren's Forrest installation (Børre)
11. Next meeting, closing
The next meeting is 11.12.2006, 09: 30 Norwegian time.
The meeting was closed at 11: 22.
Appendix - task lists for the next week
Boerre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
-
sma discussions with SD (with Sjur, Trond)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- update all forrest installations, including local patches
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
Saara
- finalize server of the Xerox tools.
- help Trond with some shell commands
- re-analyze parallel files
- consider implementing some new features to the corpus files
- add closed POSes to the paradigm gen, if needed.
- investigate why possessives have disappeared from the paradigm generator
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
-
sma discussions with SD (with Børre, Trond)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- publish corpus contracts and project infra on NoDaLi-sta
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- make sure the normative generator is used when generating paradigms
- investigate why possessives have disappeared from the paradigm generator
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts
- decide how to specify compounding behaviour info in the lexicon
-
sma discussions with SD (with Børre, Sjur)