Meeting_2007-01-08
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 08.01.2007
- Time: 09.00 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 9: 24.
Present: Børre, Saara, Sjur, Steinar, Thomas, Trond
Absent: Maaren, Tomi
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- still not done
- still not done
- continue work on script for automatic testing of the spell checker in Word
- nothing done
- nothing done
- recreate our forrest tarball
- done
- done
- update setup and installation instructions for new users/computers
- done
- done
- create new forrest tarball
- done
- done
- cvs up of the public server should be done for xtdoc/sd/documentation/
- done
- done
- fix sme texts in corpus this month
- nothing done
- nothing done
- find missing nob parallel texts in corpus
- nothing done
- nothing done
- aligner status quo
- works
- works
- translate Windows installer to sme and smj
- translated a few strings
- translated a few strings
-
fix bugs!
- not done
- not done
- other
- had to setup (www.)divvun.no to serve static content directly via apache.
- had to setup (www.)divvun.no to serve static content directly via apache.
Maaren
- investigate the generated word form list sent to Polderland - use the command
Saara
- help Trond with some shell commands
- this is related to bug 355.
- this is related to bug 355.
- fix sme texts in corpus this month
- done some
- done some
- aligner status quo
- aligner is working
- aligner is working
- send aligned, xml nob texts to Lars
- not done
- not done
- add correction markup to the xml files (string-to-correction markup)
- faced some problems
- faced some problems
- first new version of xml2lexc in Perl
- started, test version will be available today.
- started, test version will be available today.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- continued refactoring and redesign of the risten.no code: risten.no now has
- continued refactoring and redesign of the risten.no code: risten.no now has
- refactor SD-terms editor code
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix forrest installations for Maaren, Disamb
- send Windows installer translation text to Børre and Thomas
- done
- done
- fix bugs!
Steinar
- conversion error screening
- not done
- not done
- missing lists
- done some work
- done some work
- report conversion errors to Saara
- not done
- not done
-
fix bugs!
- not done
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- not done
- decide how to specify compounding behaviour info in the lexicon
- nothing this week
- nothing this week
- translate Windows installer to sme and smj
- not done
- not done
-
fix bugs!
- worked hard with bugs
Tomi
- On sick leave.
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- No smj this week.
- No smj this week.
- get more sma texts
- Also no news.
- Also no news.
- aligner status quo
- done - working
- done - working
- decide how to specify compounding behaviour info in the lexicon
- Set up work on missing and conversion screening with Steinar.
- Done (missing only).
- Done (missing only).
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- Not yet.
- Not yet.
- report conversion errors to Saara
- Done. Progress on conversion last week.
- Done. Progress on conversion last week.
-
fix bugs!.
- Fixed some, actually, and found new ones
3. Documentation
Børre has done all(!) open issues - thanks!
TODO:
- either fix forrest installations (Sjur), or create a new tarball
- done (new tarball)
- done (new tarball)
- cvs up of the public server should be done for xtdoc/sd/documentation/
- done
- done
- update setup and installation instructions (Børre)
- done
4. Corpus gathering
TODO:
-
sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such wholes are found
- Go through the list of missing or errouneous nob texts, based upon Saaras
5. Corpus infrastructure
Aligner
TODO:
- make an overview of status quo (Trond, Børre, Saara).
- done
- done
- go through other directories, fix parallelity information for other documents
- the sme/admin/ files are now double checked.
- the sme/admin/ files are now double checked.
- gather more parallel texts (Trond, Børre)
- not any more? (depends on what we are missing and how hard they are to get,
- not any more? (depends on what we are missing and how hard they are to get,
- re-analyze parallel files using the command-line version (Saara)
- progressing
- progressing
- when aligned, send aligned, xml nob texts to Lars ( Saara)
Conversion issues
Initial nbsp? Cf these:
ccat -l sme -r zcorp/bound/sme/admin/depts/
preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup -flags mbTT -utf8 -f bin/missing | grep '\?' | cut -f1 | sort | uniq -c | sort -nr | l |
19 –Danne (EM SPACE, U+2003) 84 –Lea 62 Bueng
Use UnicodeChecker to check spurious characters that look inocent (nbsp, em
admin/depts still have many errors stemming from the problematic PDF
34 kapih 33 ttalat
TODO:
- add correction markup to the xml files (string-to-correction markup)
- report conversion errors to Saara ( Trond, Steinar)
6. Infrastructure
Nothing
7. Linguistics
North Sámi
Compounds with Actio as the first part. Earlier, we accepted such compounds, but
bajásšaddaneavttuid olgunastinlága ávkkástallanvugiide
Another problem:
- we get heajus-Oslo instead of heajos-Oslo
- we get lojis-Oslo instead of lojes-Oslo
- we get álkis-Oslo instead of álkes-Oslo
The fix here is the same as for smj below: Hyphen must be included in the
Third issue: stuorra-oslolaš does not work (that is, initial lower case oslolaš
TODO:
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
- Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- fix stuorra-oslolaš lower case o ( Sjur, Thomas, Trond)
Numbers:
Needs to be included in the non-rec transducers, and their inflection needs to
- inflection (case, number, ordinality)
- generation of higher numerals (not inflected)
TODO:
- include numbers in the non-recursive transducers (i.e. split the recursive and
- Set up test bed for numerals, test and revise (who?)
- Make a test bed make num-paradigm ( Trond)
- Go through the Num bugs (Trond, Thomas, Steinar)
Hyphenation problem
Vowel combinations that are not defined as diphthongs behave like diphthongs
laam laam laam (expected: la-am) liam liam liam luam luam luam lyam lyam lyam lium lium lium luum luum luum lyum lyum lyum
For Lule sámi ALL Vowel combos behave like diphtongs.
South Sámi needs supervising, too, there are reports indicating it has onset
TODO:
- write diphthong hyphenation pseudocode (Thomas)
- rewrite hyphenation code (Trond)
- ask Ove Lorentz to report on our sma hyphenator (Trond)
Lule Sámi
We had a twol problem -- now fixed by Thomas. We still have a problem with
The fix is to include the hyphen in the right context triggering vowel fronting.
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- include numbers in the non-recursive transducers
- Set up a test bed for numerals, test and revise (who?)
8. Name lexicon infrastructure
Sjur continued refactoring and redesign of the risten.no code: risten.no now
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
TODO:
- try to make a first version of xml2lexc in Perl for testing and preparation
- soon ready: -)
- soon ready: -)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are errouneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
Polderland data generation
TODO:
- decide how to specify compounding behaviour info for the lexicon
- add closed POS and clitics to PLX generation (Børre, Tomi)
- add compound stems to the PLX generation (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- Include numerals in the speller (Børre, Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
Localisation
TODO:
- send translation text to Børre and Thomas ( Sjur)
- done
- done
- translate Windows installer text to sme and smj ( Børre, Thomas)
-
Børre has started, Thomas (and perhaps Steinar and Maaren)
-
Børre has started, Thomas (and perhaps Steinar and Maaren)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
Bug fixing
57 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 15.1.2007, 09: 00 Norwegian time.
The meeting was closed at 10: 50.
Appendix - task lists for the next week
Boerre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- translate Windows installer to sme
-
fix bugs!
- work on the Polderland data generation (PLX format conversion)
- go through other directories, fix parallelity information for other documents
Maaren
- investigate the generated word form list sent to Polderland - use the command
Saara
- fix sme texts in corpus this month
- send aligned, xml nob texts to Lars
- add correction markup to the xml files (string-to-correction markup)
- first new version of xml2lexc in Perl
- fix bugs!
Sjur
- name lexicon:
- rewrite the integration with forrest, to get a more flexible integration
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- rewrite the integration with forrest, to get a more flexible integration
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- fix bugs!
Steinar
- conversion error screening
- missing lists
- report conversion errors to Saara
- Go through the Num bugs
- Look at the actio compound issue when adding from missing lists
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- translate Windows installer to sme and smj
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
- Lack of lowering before hyphen: Twol rewrite.
- include numbers in the non-recursive transducers
- Go through the Num bugs
- Write diphthong hyphenation pseudocode
- fix stuorra-oslolaš lower case o
- fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- decide how to specify compounding behaviour info in the lexicon
- Set up work on missing and conversion screening with Steinar and Ilona.
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- report conversion errors to Saara
- Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Go through the Num bugs
- Make numeral testbed
- Rewrite hyphenation-code (pseudocode from Thomas) sme, smj
- Get input on sma hyphenations
- fix stuorra-oslolaš lower case o
- include numbers in the non-recursive transducers for sme, smj
- fix bugs!.