Meeting_2007-08-27
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 27.8.2007
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:39.
Present: Børre, Ilona, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Per-Eric
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- run all known spelling errors in the prooftest corpus through the speller
- not done
- add extraction of all known spelling errors in the regular corpus (not the
- not done
- move Steinar's error markup in the xml files to (a copy of) the original
- not done
- fix bugs!
- nothing fixed
- other
- sent contracts to Berit Johnskareng regarding her Davvi Girji books
- fixed contracts with Lene Antonsen regarding her book "Jámešgušbákti
- contacted Johan Jernsletten about the Ginna, Gálka, Borta, Snorra books.
- Began modifying jazzy-0.52 to include support
Ilona
- lexicalise missing words
- add sme names from FIN
- working on it
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
- add compounding tags to adjectives
Saara
- add new XSL/XML headers for proofing test docs
- fix bugs!
Sjur
- improve speller test bench:
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- done, very helpful, it has identified a major hole in the PLX conversion
- integrate the ccat speller testing options in the make file
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- nothing done
- fix stuorra-oslolaš lower case o
- nothing done
- ä/æ in smj speller
- compiled new speller - use it as the basis for further testing
- restart work on the XML name editor/risten.no integration
- nothing real last week
- plan the rest of the project period
- did some, not finished
- resend smj speller-correct document to Kåre Tjikkom
- done
- fix bugs!
- worked on them, added more reports, and commented on existing ones
- other things:
- recompiled spellers
- tested the typos lists on the new spellers
Thomas
- work with compounding
- worked
- fix stuorra-oslolaš lower case o
- not done
- ä/æ in smj speller
- not done
- fix bugs!
- not done
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- not done
- add Hunspell data generation/conversion
- not done
- fix bugs!
- fixed
Trond
- Work on the web corpus issues
- update the smj proper noun lexicon, and refine the morphological
- Not done.
- fix stuorra-oslolaš lower case o
- Not done.
- add sma texts to the corpus repository
- Not done
- ä/æ in smj speller
- Not done.
- fix bugs!
- Worked on bugs.
3. Documentation
Nothing new.
4. Corpus gathering
Børre has contacted several people, see task status above.
TODO:
- add sma texts to the corpus repository (Trond)
5. Corpus infrastructure
Saara has removed the *.html lists from the xdoc folder, and our
6. Infrastructure
We are ordering a new server for faster processing. - Order not yet placed, we
7. Linguistics
North Sámi
Remaining twol issues: see
Ilona is working on the list of sme names from Finland. The list
There is an empty file, gt/smn/src/propernoun-smn-lex.txt, in cvs. Inari
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- fix twol bug (Sjur, Thomas, Trond)
- meet online this week - Thursday around 12 AM Norwegian time
- add the sme place names from Finland (Ilona)
- working on it
Lule Sámi
The æ-ä issue: see bug 411.
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- ä/æ in speller (Tomi, Sjur)
- Works in transducer, not in speller, see bug report (#411)
- lexicalise words from the Olavi missing list, but check against the pdf
- add compounding tags to (some weeks of work):
- nouns (Thomas)
- a few weeks left
- adjs (Per-Eric)
- almost finished
- resend smj speller-correct document to Kåre Tjikkom (Sjur)
- done
8. Name lexicon infrastructure
This sub-project needs to get up and running soon. Mainly Sjur's task.
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
OOo spellers
Tomi is working on the lexicon conversion to the Hunspell format. It is
  plx source -> transducer     -> wordlist in plxformat -> speller binary
  src/*         *-plx.fst         > 60 GB                  2 MB polderland/*

  hun source -> transducer     -> java/perl-server program -> huncode
  src/*         *-hunspell.fst
My question:
- generate full paradigm per word with xfst (as for polderland today)
- extract stems automatically <= from the generated paradigm (60 GB)
- turn the result into hunspell stem / cont
The hunspell generation process thus mirrors the plx generation process. Yes,
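The stem extraction step in the list above can be illustrated with a minimal sketch. It assumes that one generated paradigm is available as a plain list of word forms, and it takes the longest common prefix as the stem, which is a naive stand-in for whatever extraction the real conversion ends up using. The function names, the flag `A`, and the toy word forms are hypothetical and not taken from the project sources.

```python
import os


def split_paradigm(forms):
    """Split one paradigm into (stem, endings).

    The stem is simply the longest common prefix of all forms; this is an
    assumption made for illustration, not the project's actual method.
    """
    forms = list(forms)
    stem = os.path.commonprefix(forms)
    # Collect each distinct ending once, sorted for stable output.
    endings = sorted({form[len(stem):] for form in forms})
    return stem, endings


def to_dic_line(stem, flag):
    """Format a stem with a (hypothetical) affix flag as a Hunspell-style
    dictionary entry of the form stem/FLAG."""
    return f"{stem}/{flag}"


if __name__ == "__main__":
    # Toy paradigm, for illustration only.
    stem, endings = split_paradigm(["biila", "biilla", "biillat", "biilii"])
    print(to_dic_line(stem, "A"), endings)
```

Because the sketch handles one paradigm at a time, it could be applied while streaming the generated output instead of loading all of it into memory.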
We have, in parallel, been looking at sfst. The results are good, sfst seems
The sfst version of smX will be put on hold until after New Year.
TODO:
- add Hunspell data conversion (Tomi)
- working on it
Testing
Spelling Error Markup
TODO:
- Set up ways of adding meta-information (source info, used in testing or not,
- move Steinar's error markup in the xml files to (a copy of) the original
Automated testing
We need a separate speller pre-processor, to turn ccat output into suitable
TODO:
- document the AppleScript testing tool (Sjur)
- improve speller test bench (Sjur)
- create a speller preprocessor (Børre or Sjur)
- integrate the ccat speller testing options in the Makefile (Sjur)
- add extraction of all known spelling errors in the corpus (not the
- put on hold until we have such markup in the regular corpus
- add regression self-test:
- integrate these in the make file (Sjur)
- done
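As a rough illustration of the speller pre-processor discussed in this section, the sketch below turns running text into a list of unique word tokens, one per line. It assumes that the text extracted from the corpus is plain running text and that the speller test wants one word per line; neither assumption is confirmed by these minutes. The function name `words_for_speller` is hypothetical.

```python
import re

# Letters only (no digits or underscores), with optional word-internal hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def words_for_speller(text):
    """Return the unique word tokens of `text` in order of first appearance."""
    tokens = WORD_RE.findall(text)
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(tokens))


if __name__ == "__main__":
    sample = "Stuorra-oslolaš biila, biila 2007!"
    print("\n".join(words_for_speller(sample)))
```

Keeping the pre-processor as a separate filter would let the same token list be fed to different spellers without changing the corpus tools.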
Lexicon conversion to the PLX format
We have found a bug in
We need a compounding form without a hyphen in speller. In xfst processing
But is this a lexc problem? Probably not, see below (sme.fst, sme-norm.fst,
-bash-3.00$ lookup -flags mbTT -utf8 sme/bin/spellernonrec-sme.save
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
biila-
biila-    biila+N+SgNomCmp+Cmpnd
biila--
biila--    biila--    +?
-bash-3.00$ lookup -flags mbTT -utf8 sme/bin/spellernonrec-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
bii^la
bii^la    biila+N+Sg+Nom
bii^la-
bii^la-    biila+N+SgNomCmp+Cmpnd
bii^la--
bii^la--    bii^la--    +?
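Checks like the one in the session above could be automated with a small parser for the lookup output. The sketch below assumes that each result line is tab-separated, with the input form first and the analysis second, and that an unrecognised form ends in a `+?` field, as the session suggests. The function name `parse_lookup` is hypothetical.

```python
def parse_lookup(output):
    """Map each input form to its list of analyses.

    A form that only produced a '+?' line maps to an empty list.
    Lines without a tab (prompts, progress bar, echoed input) are skipped.
    """
    analyses = {}
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        readings = analyses.setdefault(fields[0], [])
        if fields[-1] == "+?":
            continue  # form not recognised by the transducer
        readings.append(fields[1])
    return analyses


if __name__ == "__main__":
    sample = "biila-\tbiila+N+SgNomCmp+Cmpnd\nbiila--\tbiila--\t+?\n"
    for form, readings in parse_lookup(sample).items():
        print(form, "OK" if readings else "UNKNOWN", readings)
```

Running the same word list through two transducers and comparing the resulting dictionaries would show at which step a form such as the hyphenless compounding form is lost.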
TODO:
- fix bug 484 (Tomi)
- fix bug 458 (Trond, Sjur)
New public beta
Delayed till the majority of the present bugs are fixed. The twolc bug
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
Bug fixing
When fixing bugs, record the version number containing the fix in the Bugzilla
57 open Divvun/Disamb bugs (29 of these 56 are speller-related bugs,
Project meeting
We'll meet in September, 24-28, in Tromsø to work on the hardest remaining
11. Next meeting, closing
The next meeting is 3.9.2007, 09:30 Norwegian time.
The meeting was closed at 10:54.
Appendix - task lists for the next week
Børre
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
- move Steinar's error markup in the xml files to (a copy of) the original
- create a speller preprocessor
- fix bugs!
Ilona
- lexicalise missing words
- add sme names from FIN
- make smn propernoun-list
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
- add compounding tags to adjectives
Saara
- add new XSL/XML headers for proofing test docs
- fix bugs!
Sjur
- improve speller test bench:
- document the AppleScript testing tool
- create a speller preprocessor
- integrate the ccat speller testing options in the make file
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- ä/æ in smj speller
- work on the XML name editor/risten.no integration
- plan the rest of the project period
- fix sme twol bug (#460), meeting Thursday at 12 AM
- fix bug 458
- fix bugs!
Thomas
- work with compounding
- fix stuorra-oslolaš lower case o
- ä/æ in smj speller
- fix sme twol bug (#460), meeting Thursday at 12 AM
- fix bugs!
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- add Hunspell data generation/conversion
- fix bug 484
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological
- fix stuorra-oslolaš lower case o
- add sma texts to the corpus repository
- ä/æ in smj speller
- fix sme twol bug (#460), meeting Thursday at 12 AM
- fix bug 458
- fix bugs!