Meeting_2007-01-15
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 15.01.2007
- Time: 09.00 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 9:44.
Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- not done
- continue work on script for automatic testing of the spell checker in Word
- not done
- fix sme texts in corpus this month
- not done
- find missing nob parallel texts in corpus
- not done
- translate Windows installer to sme
- some done, helped Thomas
- work on the Polderland data generation (PLX format conversion)
- done, not finished.
- go through other directories, fix parallelity information for other documents
- not done
- fix bugs!
- not done
Maaren
- investigate the generated word form list sent to Polderland - use the command
- not done
Saara
- fix sme texts in corpus this month
- in progress
- send aligned, xml nob texts to Lars
- add correction markup to the xml files (string-to-correction markup)
- done, but see newsgroup message
- first new version of xml2lexc in Perl
- done
- fix bugs!
- fixed a couple of bugs
Sjur
- name lexicon:
- rewrite the integration with Forrest, to get a more flexible integration
- search interface finished, editor half-way; still needs some javascript and
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- finally done!
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- fix bugs!
Steinar
- conversion error screening
- not done
- missing lists
- done some work
- report conversion errors to Saara
- not done
- Go through the Num bugs
- not done
- Look at the actio compound issue when adding from missing lists
- added words
- fix bugs!
- not done
- worked with cg-sets
- done some
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
- decide how to specify compounding behaviour info in the lexicon
- decided
- translate Windows installer to sme and smj
- ready soon
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
- nothing this week
- Lack of lowering before hyphen: Twol rewrite.
- nothing this week
- include numbers in the non-recursive transducers
- not done
- Go through the Num bugs
- not done
- Write diphthong hyphenation pseudocode
- done
- fix stuorra-oslolaš lower case o
- not done
- fix bugs!
- worked
Tomi
- add closed POS and clitics to PLX generation
- done with help from Børre
- add derivations to the PLX generation
- not done
- add compound stems to the PLX generation
- not done
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- No smj last week.
- decide how to specify compounding behaviour info in the lexicon
- Decided
- Set up work on missing and conversion screening with Steinar and Ilona.
- Done.
- fix sme texts in corpus this month
- Continuously working on this one.
- find missing nob parallel texts in corpus, go through Saara's list
- report conversion errors to Saara
- Saara has been leading this work...
- Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Not done
- Go through the Num bugs
- Not done
- Make numeral testbed
- Not done.
- Rewrite hyphenation-code (pseudocode from Thomas) sme, smj
- Done
- Get input on sma hyphenations
- Not done.
- fix stuorra-oslolaš lower case o
- This one I would like to pass over to Tomi.
- include numbers in the non-recursive transducers for sme, smj
- Started work on this one. Split the closed-smX-lex.txt file with Børre.
- fix bugs!
3. Documentation
Nothing this week.
4. Corpus gathering
Trond finally got the sma texts from Snåsa, quite a lot of text, but not
The relevant persons have worked on the tasks below.
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon Saara's
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Lars Nygård has left UiO. Anders Nøklestad is back in his old position.
Alignment
TODO:
- go through other directories, fix parallelity information for other documents
- Still to be done.
- re-analyze parallel files using the command-line version (Saara)
- done all existing files
- when aligned, send aligned, xml nob texts to Kristin (Saara)
- not yet done
Conversion issues
TODO:
- add correction markup to the xml files (string-to-correction markup)
- see news discussion - we will and should allow text corrections concerning
- report conversion errors to Saara (Trond, Steinar)
- Not done.
6. Infrastructure
Nothing this week.
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
Numbers:
One problem we have is to correctly identify base forms of numerals, cf:
guhttanuppelohkái
guhttanuppelohkái guhtta+Num+Sg+Nom
guhttanuppelohkái guhtta+Num+Sg+Acc
TODO:
- discontinuous case inflection (but only for maximally three-part compound
- produce correct base forms in the analyzer (Thomas, Trond)
- include numbers in the non-recursive transducers (i.e. split the recursive and
- Set up test bed for numerals, test and revise (who?)
- Make a test bed: make num-paradigm (Trond)
- Go through the Num bugs (Trond, Thomas, Steinar)
- Preprocessing of ordinals at the end of sentences - reported as bug #368.
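As an illustration of the base-form problem noted above, the following is a minimal sketch of how a numeral test bed could flag analyses whose base form differs from the expected one. It is not the project's test code: the function names, the two-column input format (word form, then analysis) and the table of expected base forms are assumptions made for this example only.

```python
# Hypothetical helper for illustration; not part of the project's tools.

def parse_analysis(line):
    """Split one 'wordform lemma+Tag+Tag' line into (wordform, lemma, tags)."""
    wordform, analysis = line.split()
    lemma, *tags = analysis.split("+")
    return wordform, lemma, tags


def find_base_form_mismatches(lines, expected):
    """Return (wordform, lemma, expected_lemma) for analyses with an unexpected lemma."""
    mismatches = []
    for line in lines:
        if len(line.split()) != 2:
            continue  # skip blank lines and bare input lines without an analysis
        wordform, lemma, _tags = parse_analysis(line)
        want = expected.get(wordform)
        if want is not None and lemma != want:
            mismatches.append((wordform, lemma, want))
    return mismatches


# The two analyses quoted in the minutes.
sample = [
    "guhttanuppelohkái\tguhtta+Num+Sg+Nom",
    "guhttanuppelohkái\tguhtta+Num+Sg+Acc",
]

# Assumption: the desired base form is the full numeral, not its first part.
expected = {"guhttanuppelohkái": "guhttanuppelohkái"}

for wordform, lemma, want in find_base_form_mismatches(sample, expected):
    print(f"{wordform}: got base form {lemma!r}, expected {want!r}")
```

A list of expected base forms like this would have to be maintained by the linguists; the sketch only shows the comparison step.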
Hyphenation problem
TODO:
- write diphthong hyphenation pseudocode (Thomas)
- done for both sme and smj
- rewrite hyphenation code (Trond)
- done for both sme and smj
- ask Ove Lorentz to report on our sma hyphenator (Trond)
- Not done.
Lule Sámi
It could actually be that the smj numerals are not recursive. They were made
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- include numbers in the non-recursive transducers
- Set up a test bed for numerals, test and revise (who?)
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in the meeting memo.
Postponed:
- data synchronisation between risten.no and the cvs repo
TODO:
- try to make a first version of xml2lexc in Perl for testing and preparation
- done
- restructure interface code for easier maintenance, coding and use
- well under way, still some work
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
Polderland data generation
There is now a decision on compound parts, and compounding can now be
We have a UTF-8 problem with the paradigm server in some cases, some characters
Suggestion: Just use the G5, and not victorio, since there is no time to fix the
TODO:
- decide how to specify compounding behaviour info for the lexicon
- Done!
- add closed POS and clitics to PLX generation (Børre, Tomi)
- Progressing.
- add compound stems to the PLX generation (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- Include numerals in the speller (Børre, Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
- nothing yet
Localisation
TODO:
- translate Windows installer text to sme and smj (Børre, Thomas)
- progressing (smj is mostly done, lots lacking in sme)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- not done
Bug fixing
56 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 22.1.2007, 09:30 Norwegian time.
The meeting was closed at 10:44.
Appendix - task lists for the next week
Børre
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- translate Windows installer text to sme and smj
- work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
- go through other directories, fix parallelity information for other documents
- add sma texts to the corpus repository
- fix bugs!
Maaren
- tasks according to Thomas
Saara
- fix sme texts in corpus this month
- send aligned, xml nob texts to Kristin
- fix problems with xml2lexc if needed
- fix bugs!
Sjur
- name lexicon:
- restructure interface code for easier maintenance, coding and use
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- fix bugs!
Steinar
- Complete the semantic sets in sme-dis.rle
- missing lists
- report conversion errors to Saara
- Look at the actio compound issue when adding from missing lists
- Go through the Num bugs
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- work with compounding
- translate Windows installer to sme and smj
- lexicalise actio compounds
- Lack of lowering before hyphen: Twol rewrite.
- Go through the Num bugs
- fix stuorra-oslolaš lower case o
- include basic numbers in the non-recursive transducers
- implement discontinuous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!
Tomi
- add compound stems to the PLX generation
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- report conversion errors to Saara
- Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Go through the Num bugs
- Make numeral testbed
- Get input on sma hyphenations
- include numbers in the non-recursive transducers for sme, smj
- implement discontinuous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!