Meeting_2007-09-10
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 10.9.2007
- Time: 09:30 Norwegian time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:41.
Present: Sjur, Thomas, Trond
Absent: Børre, Ilona, Per-Eric, Tomi
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- move Steinar's error markup in the xml files to (a copy of) the original
- add semi-automatic updates of fixed and open issues to README files
- fix bugs!
Ilona
- lexicalise missing words
- add sme names from FIN
- worked on it
- make smn propernoun-list
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
- finish with the compounding tags to adjectives
Saara
- add new XSL/XML headers for proofing test docs
- not done
- Set up ways of adding meta-information for proofing correct corpus docs
- not done
Sjur
- document the AppleScript testing tool
- expanded the present documentation, still needs more.
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- not done
- fix stuorra-oslolaš lower case o
- not done
- ä/æ in smj speller
- not done
- work on the XML name editor/risten.no integration
- not done
- plan the rest of the project period
- major milestones set, needs more details
- fix sme twol bug (#460), meeting Thursday at 9 AM
- done
- fix bug 458
- done
- bug Kåre Tjikkom about the smj correct document
- done and received
- fix bugs!
- done some, reported others
- other tasks:
- refined the speller test bench more
- compiled new spellers
- reran self-test - the test itself contains errors, and needs to be cleaned
Thomas
- work with compounding
- finished
- fix stuorra-oslolaš lower case o
- not done
- ä/æ in smj speller
- not done
- fix sme twol bug (#460), meeting Thursday at 9 AM
- done
- fix bugs!
- worked
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- add Hunspell data generation/conversion
- fix bug 484
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological
- No smj work
- fix stuorra-oslolaš lower case o
- Not done
- add sma texts to the corpus repository
- Not done
- ä/æ in smj speller
- Not done
- fix sme twol bug (#460), meeting Thursday at 9 AM
- The bug is still open, but significant progress has been made. We now
- fix bug 458
- Closed.
- fix bugs!
3. Documentation
TODO:
- add semi-automatic updates of fixed and open issues to README files
4. Corpus gathering
We received the correct-marked corpus from Kåre. Some of the errors identified
- orthographic: wrong§correct, e.g. leif§feil
- morphosyntactic: {wrong}£{correct}, e.g. {mun muitalat}£{mun muitalan}
- lexical/terminological: wrong€correct, e.g. guossodimieddne€biebbmoieddne
The corresponding xml should look like:
<error type="ort" correct="feil">leif</error>
<error type="synt" correct="mun muitalan">mun muitalat</error>
<error type="lex" correct="biebbmoieddne">guossodimieddne</error>
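The mapping from the plain-text markup to the xml elements above could be sketched roughly as follows. This is only an illustration, not the project's actual conversion code: the function name markup_to_xml and the regular expression are assumptions, and an operand is assumed to be either a {braced phrase} or a single run of non-space characters.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Separator character -> value of the type attribute, as in the examples above.
ERROR_TYPES = {"§": "ort", "£": "synt", "€": "lex"}

# Assumption: an operand is a {braced phrase} or one run of non-space characters.
OPERAND = r"(?:\{([^{}]*)\}|([^\s{}§£€]+))"
ERROR_RE = re.compile(OPERAND + r"([§£€])" + OPERAND)


def markup_to_xml(text):
    """Replace every wrong<sep>correct pair in text with an <error> element."""
    def to_element(match):
        wrong_braced, wrong_word, sep, right_braced, right_word = match.groups()
        wrong = wrong_braced if wrong_braced is not None else wrong_word
        correct = right_braced if right_braced is not None else right_word
        return "<error type=%s correct=%s>%s</error>" % (
            quoteattr(ERROR_TYPES[sep]), quoteattr(correct), escape(wrong))
    return ERROR_RE.sub(to_element, text)


if __name__ == "__main__":
    for sample in ("leif§feil",
                   "{mun muitalat}£{mun muitalan}",
                   "guossodimieddne€biebbmoieddne"):
        print(markup_to_xml(sample))
```

Note that in this sketch punctuation directly attached to an unbraced word becomes part of the operand, so real corpus text would probably need a stricter operand pattern.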
Decision: processing of the above markup will be implemented for future use, but
TODO:
- add sma Bible texts to the corpus repository (Trond)
- bug Kåre Tjikkom about the smj correct document (Sjur)
- done, received, very useful, but we could use even more.
- add correct type differentiation to XSL processing (Saara)
- add correct type differentiation to ccat (Tomi)
5. Corpus infrastructure
Nothing.
6. Infrastructure
Nothing.
7. Linguistics
North Sámi
Fixed a long-standing bug last week :)
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- fix twol bug (Sjur, Thomas, Trond)
- done!
- add the sme place names from Finland (Ilona)
- still working
Lule Sámi
smj propernoun bug issue:
- convert from common base (which means sme base)
- Words not convertible should be added to a separate smj lexicon, and words that
- send to smj morphology
The original todo was to correct the smj morphology.
- conversion errors
- words that should not have been converted
- missing smj-unique names
- errors in the morphology
Testing procedures:
- analyse baseforms (as for sme)
- generate a couple of caseforms from the baseforms, and inspect result
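The second testing step could be prepared with a small helper like the one below. It is a sketch only: it builds one input line per base form and tag string, to be fed to whatever generator is compiled from the smj morphology. The function name and the tag strings are hypothetical; the real smj tag set may differ.

```python
from itertools import product


def generation_inputs(baseforms, tag_strings):
    """Return one generator input line per (base form, tag string) pair."""
    return [base + tags for base, tags in product(baseforms, tag_strings)]


if __name__ == "__main__":
    # Hypothetical tag strings, for illustration only.
    tags = ["+N+Prop+Sg+Gen", "+N+Prop+Sg+Ill"]
    for line in generation_inputs(["Várjjat", "Fatjatj"], tags):
        print(line)
```

The generated forms would still have to be inspected by hand, as the procedure says.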
Suggestion:
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- ä/æ in speller, see bug report #411 (Tomi, Sjur)
- lexicalise words from the Olavi missing list, but check against the pdf
- add compounding tags to:
- nouns (Thomas)
- finished
- adjs (Per-Eric)
- finished
8. Name lexicon infrastructure
This sub-project needs to get up and running soon. Mainly Sjur's task.
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
OOo spellers
TODO:
- add Hunspell data conversion (Tomi)
Testing
Spelling Error Markup
See the discussion of the smj corrected document above, under 4. Corpus gathering.
TODO:
- Set up ways of adding meta-information (source info, used in testing or not,
- move Steinar's error markup in the xml files to (a copy of) the original
Automated testing
cat gt/doc/proof/spelling/testing/selftest-pl-sme-20070909.xml | \
  grep '<original>' | cut -d">" -f2 | cut -d"<" -f1 | \
  lookup -flags mbTT -utf8 gt/sme/bin/sme.fst | \
  grep '\?' | cut -f1 | wc -l
Lex test of the lule-specific words. Three were not recognised:
Sálatduottar Sálatduottar +?
Várjjat Várjjat +?
Fatjatj Fatjatj +?
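The extraction and filtering steps around lookup in the pipeline above could also be done in Python, which returns the unrecognised words themselves rather than only a count. This is a sketch under two assumptions: each `<original>` element sits on one line without attributes, and lookup prints tab-separated lines whose last field is +? for unknown input. Both function names are illustrative.

```python
import re

# Assumption: <original> has no attributes and does not span lines.
ORIGINAL_RE = re.compile(r"<original>(.*?)</original>")


def extract_originals(xml_text):
    """Return the text of every <original> element in the test file."""
    return ORIGINAL_RE.findall(xml_text)


def unrecognised(lookup_output):
    """Return the input words that lookup marked with +? (no analysis)."""
    words = []
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        # Assumed layout for unknown input: word <TAB> word <TAB> +?
        if len(fields) >= 3 and fields[-1].strip() == "+?":
            words.append(fields[0])
    return words


if __name__ == "__main__":
    sample = "Várjjat\tVárjjat\t+?\nFatjatj\tFatjatj\t+?\n"
    print(unrecognised(sample))
```

Feeding the words to lookup itself is left out here; the sketch only covers the steps before and after that call.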
Abbreviations are currently printed twice if they should be followed by a full
TODO:
- document the AppleScript testing tool (Sjur)
- enhanced, not finished
- document the testing procedures (Sjur)
Lexicon conversion to the PLX format
TODO:
- fix bug 484 (Tomi)
- fix bug 458 (Trond, Sjur, Tomi)
- done
New public beta
Delayed till the majority of the present bugs are fixed. The twolc bug
Update: twolc error fixed, as well as 458.
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
Bug fixing
When fixing bugs, record the version number containing the fix in the Bugzilla
56 open Divvun/Disamb bugs (25 of these 56 are speller-related bugs,
Project meeting
We'll meet in September, 24-28, in Tromsø to work on the hardest remaining
11. Next meeting, closing
The next meeting is 17.9.2007, 09:30 Norwegian time.
The meeting was closed at 11:32 (but it included a lot of regular work as well).
Appendix - task lists for the next week
Børre
- move Steinar's error markup in the xml files to (a copy of) the original
- add semi-automatic updates of fixed and open issues to README files
- fix bugs!
Ilona
- lexicalise missing words
- add sme names from FIN
- make smn propernoun-list
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
- add correct type differentiation to XSL processing - bug 504
Sjur
- document the AppleScript testing tool
- document the testing procedures
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- ä/æ in smj speller
- work on the XML name editor/risten.no integration
- plan the rest of the project period
- fix bugs!
Thomas
- fix stuorra-oslolaš lower case o
- ä/æ in smj speller
- fix bugs!
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- add Hunspell data generation/conversion
- fix bug 484
- add correct type differentiation to ccat - bug 505
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological
- fix stuorra-oslolaš lower case o
- add sma texts to the corpus repository
- ä/æ in smj speller
- fix bugs!