Meeting_2007-02-19
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 19.02.2007
- Time: 09:00 Norwegian time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:49.
Present: Børre, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Saara, Steinar
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- write form to request corpus user account
- not done
- document how to apply for access to closed corpus, and details on the corpus
- some done
- update and fix our documentation and infrastructure as Steinar finds
- not done
- continue work on script for automatic testing of the spell checker in Word
- not done
- fix sme texts in corpus this month
- not done
- find missing nob parallel texts in corpus
- not done
- work on the Polderland data generation (PLX format conversion)
- not done
- go through other directories, fix parallelity information for other documents
- not done
- add sma texts to the corpus repository
- not done
- move the G5 to the basement
- done
- add info to front page (incl. download links)
- not done
- write separate page with detailed info (incl. download links)
- not done
- fix bugs!
- not done
Maaren
- lexicalise actio compounds
Saara
- fix sme texts in corpus this month
- continue aligning the rest of the parallel files
- fix problems with xml2lexc if needed
- have some holiday first
- start improving the corpus interface for Sámi in Oslo.
- fix bugs!
Sjur
- name lexicon:
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- add cvs synchronisation
- worked on this one - settled on a standardised pretty-print format, made
- hire linguist and programmer
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- get an Intel Mac for Tomi
- write press release for the beta
- first version done
- fix bugs!
- other tasks done:
- helped with PLX generation of number
- received beta from Polderland, installed and tested
- other work with the beta release
- installed resources and tools for compiling PLX files into binary spellers
- wrote a start on a page for the beta release
Steinar
- test our infrastructure and documentation - follow the documentation exactly,
- Complete the semantic sets in sme-dis.rle
- missing lists
- report conversion errors to Saara
- Look at the actio compound issue when adding from missing lists
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- Go through the Num bugs
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- fix stuorra-oslolaš lower case o
- implement discontinuous case inflection for sme numbers
- produce correct number base forms in the sme analyzer
- translate beta release docs to sme and smj
- fix bugs!
Tomi
- improve numerals in the speller
- add prefixes to the PLX
- add derivations to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- Worked with Thomas on this, not finished yet.
- fix sme texts in corpus this month
- Worked on the alignment of the texts.
- find missing nob parallel texts in corpus, go through Saara's list
- Found some, but it turned out we had them already! We must align what we
- Go through the Num bugs
- Looked at some num paradigms, but no bug closed.
- fix bugs!
3. Documentation
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- document how to apply for access to closed corpus, and details on the corpus
- correct and improve it based on feedback from Steinar (Børre)
4. Corpus gathering
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Alignment
Main news: We have a working parallel corpus online.
Notes about the interface (or lack of documentation): the first search field in
TODO:
- go through other directories (nob directories, sd directories), fix
Conversion issues
TODO:
- report conversion errors to Saara (Trond, Steinar)
6. Infrastructure
Børre and Steinar have both started on the task of testing and
TODO:
- test our infrastructure and documentation - follow the documentation exactly,
- update and fix our documentation and infrastructure as Steinar finds
7. Linguistics
Numbers:
Thomas is almost finished with correcting the number part of the sme
TODO:
- discontinuous case inflection in sme (but only for maximally three-part
- produce correct number base forms in the sme analyzer (Thomas)
- Go through the sme Num bugs (Thomas)
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in the meeting memo.
Postponed:
TODO:
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
Polderland data generation
TODO:
- improve number conversion (Børre, Tomi)
- add prefixes to the PLX (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- next after numbers are fixed
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for Tomi (Sjur)
- not yet
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme (Thomas)
- translate beta release docs to smj (Thomas)
Beta release
Tentative beta release: Thursday 15.2. - but it might be delayed till later in
In the beta, the sme speller is presented to MS Office as Catalan, whereas the smj speller is presented as Basque.
All beta packages (mklex tools, Win and Mac tools) can be copied from Sjur's
/Users/sjur/mklex.zip
/Users/sjur/SamiProofingtools_beta-Mac.dmg
/Users/sjur/SamiProofingtools_beta-Win.zip
mklex -M256 -p sami_north_phon* revInputSamiNort.plx mssp3samiNorthern.lex
SamiNortAsCatalan2007-02sp
The PLX compilers (one each for sme, smj), which compiles the specified source
As a last step after the lexicon is compiled, use the tool
cd gt/
export RIncludes=/System/Library/Frameworks/Carbon.framework/Headers/
/Developer/Tools/Rez sme/polderland/CatalanLex.rsrc.hex -a -o $SpellerLexiconFile
/Developer/Tools/SetFile -a CI -c MSOF -t HMSD $SpellerLexiconFile
This step is necessary to make MS Office recognise the speller lexicon file as a
PLX files:
- adjectives
- AdjectiveRoot
- verbs
- VerbRoot
- Copula
- Negativeverb
- nouns
- NounRoot
- propernouns
- ProperNoun
- Adverb
- Conjunction
- Interjection
- Particles
- Adposition (pp)
- Pronoun
- Subjunction
- Numerals
- pp
All ok, except numerals.
DONE:
- delivered PLX data of sme and smj including compounding
- translated Windows installer to sme and smj
- installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for
- added resources needed for compiling PLX lexicons to our cvs repo
- tested the beta drop from Polderland - good we did, it is absolutely
TODO:
- write press release (Sjur)
- done first draft, see xtdoc/sd/.../xdocs/pr/
- add info to front page (incl. download links) (Børre)
- write separate page with detailed info (incl. download links) (Børre)
- Sjur wrote a start
- test the beta release from Polderland thoroughly before it is released
- download and installation
- documentation
- technical performance
- linguistic performance:
- true positives (correctly recognised misspellings)
- false positives (correct words erroneously marked as misspellings)
- false negatives (misspellings not recognised by the speller)
- true negatives (correctly spelled words recognised as such by the speller)
- suggestions
- all tests on both Mac and Win - Windows only (Børre, Sjur, Thomas)
- compile new speller lexicons using the mklex* tools on the G5, following the
- add compilation of MS Office spellers part of the Makefile (Tomi)
- install Windows and MS Office; test tools on Windows (Børre, Thomas)
- collect a list of PR recipients, forward to
- questions for Polderland (Børre):
- version info in the speller?
- remaking/updating the installer packages with linguistic updates - who?
Compilation
Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 ≈ 83 hrs
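The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the figures 5000 and 60 are the ones quoted in the memo, while the function name `compile_hours` is made up for the illustration.

```python
def compile_hours(entries: int, seconds_per_entry: float) -> float:
    """Estimated compile time in hours for a given number of lexicon entries."""
    return entries * seconds_per_entry / 3600


# Figures from the memo: 5000 adjectives at 60 seconds each.
adjective_hours = compile_hours(5000, 60)
print(f"{adjective_hours:.1f} hours, about {adjective_hours / 24:.1f} days")
```

Changing either argument shows how sensitive the timetable is to the per-entry compile time, e.g. `compile_hours(5000, 30)` halves the estimate.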
Testing
Different ways of testing:
- Impressionistic, functionality: try the program, try all the functions
- Impressionistic, coverage: try the program on different texts, look for false positives
- Systematic (in order of importance):
- Make a corpus of texts, from different genres (can be done before 0.2 release)
- For each text, detect precision
- For each text, detect recall
- For each text, detect accuracy
Before beta release: precision is important, but have a look at recall as well.
Recall and precision
- precision = tp / ( tp + fp ) = true redlines / all redlines
- can we trust that the redlines are actually errors?
- Task: check all hits
- (test p, are they tp or fp?)
- recall = tp / ( tp + fn) = true redlines / all errors in doc
- can we trust that all errors are actually found?
- Task: check every single word
- (test p, are they tp or fp, test n, are they tn or fn?)
- accuracy = ( tp + tn ) / ( tp + fp + fn + tn ) = overall performance
Definitions:
- true positives (correctly recognised misspellings)
- false positives (correct words erroneously marked as misspellings)
- false negatives (misspellings not recognised by the speller)
- true negatives (correctly spelled words recognised as such by the speller)
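As a rough illustration of the three measures, the sketch below computes them from the four counts defined above. The class name `SpellerCounts` and the example counts are invented for the illustration; they are not results from any actual test of the beta.

```python
from dataclasses import dataclass


@dataclass
class SpellerCounts:
    tp: int  # misspellings the speller marked (true redlines)
    fp: int  # correct words the speller marked (false redlines)
    fn: int  # misspellings the speller did not mark
    tn: int  # correct words the speller accepted

    def precision(self) -> float:
        """True redlines divided by all redlines."""
        redlines = self.tp + self.fp
        return self.tp / redlines if redlines else 0.0

    def recall(self) -> float:
        """True redlines divided by all errors in the document."""
        errors = self.tp + self.fn
        return self.tp / errors if errors else 0.0

    def accuracy(self) -> float:
        """Correct decisions divided by all words checked."""
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0


# Invented counts for a hypothetical 1000-word test text.
counts = SpellerCounts(tp=40, fp=10, fn=20, tn=930)
print(f"precision = {counts.precision():.2f}")
print(f"recall    = {counts.recall():.2f}")
print(f"accuracy  = {counts.accuracy():.2f}")
```

With counts like these, accuracy looks high mainly because most words in a text are correctly spelled, which is one reason to report precision and recall separately.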
Timetable
- The next beta version (beta 0.2) is ready Tuesday at xx h?
- Testing 0.2: Thomas, Steinar, Trond, Ilona, ...
- 0.3 compilation starts on Thursday
- The next beta version (beta 0.3) is ready Sunday
- Monday: Testing beta 0.3 for unpleasant surprises
- We release beta 0.3 on Tuesday, unless there are surprises
- If there are surprises, we must compile again, this time 0.4
- Deadline for documentation as already(?) stated
compile a:
- sm(e|j)-lex.txt to *-plx.txt = 83 hrs?
- -plx.txt
two-phase sort:
now:
- cat *plx.txt > sme-plx.txt
- cat sme-plx.txt | sort -r | uniq > all_except_nouns-sme-plx.txt
tomorrow:
- cat noun-sme-plx.txt all_except_nouns-sme-plx.txt | sort -r | uniq > sme-plx.txt
- sort complexity = N
one-phase sort:
- tomorrow:
- cat *plx.txt > sme-plx.txt
- cat sme-plx.txt | sort -r | uniq > all_except_nouns-sme-plx.txt
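The idea behind the two-phase sort can be sketched in Python: sort and deduplicate everything that is already compiled, then merge in the nouns once they are ready. This is an illustration only. The entries are placeholder words rather than real PLX lines, the function names are hypothetical, and Python orders strings by code point, which may differ from the locale-dependent order that `sort` uses.

```python
import heapq
from itertools import groupby


def sort_unique(lines):
    """Rough equivalent of `sort -r | uniq`: reverse-sorted, duplicates removed."""
    return sorted(set(lines), reverse=True)


def merge_unique(first, second):
    """Merge two reverse-sorted lists into one, dropping duplicate lines."""
    merged = heapq.merge(first, second, reverse=True)
    return [line for line, _ in groupby(merged)]


# Placeholder entries standing in for the contents of the *-plx.txt files.
others = ["ja", "buorre", "leat", "ja"]
nouns = ["guolli", "beana", "leat"]

# Phase one can run while the nouns are still compiling.
phase_one = sort_unique(others)
# Phase two runs once the noun file exists.
two_phase = merge_unique(sort_unique(nouns), phase_one)

# The one-phase variant sorts everything in a single pass.
one_phase = sort_unique(others + nouns)
print(two_phase == one_phase)
```

Both variants give the same final list; the two-phase variant only moves part of the sorting work to a time when the noun data is not yet available.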
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
Bug fixing
57 open Divvun/Disamb bugs, and 23 risten.no bugs
Moving G5
TODO:
- move the G5 to the basement (Børre)
- moved, new IP 129.242.220.113
11. Next meeting, closing
The next meeting is 26.2.2007, 09:30 Norwegian time.
The meeting was closed at 11:12.
Appendix - task lists for the next week
Børre
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- update and fix our documentation and infrastructure as Steinar finds
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- work on the Polderland data generation (PLX format conversion)
- go through other directories, fix parallelity information for other documents
- add sma texts to the corpus repository
- move the G5 to the basement
- add info to front page (incl. download links)
- write separate page with detailed info (incl. download links)
- fix bugs!
Maaren
- lexicalise actio compounds
Saara
- fix sme texts in corpus this month
- continue aligning the rest of the parallel files
- fix problems with xml2lexc if needed
- have some holiday first
- start improving the corpus interface for Sámi in Oslo.
- fix bugs!
Sjur
- name lexicon:
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- get an Intel Mac for Tomi
- write press release for the beta
- fix bugs!
Steinar
- test our infrastructure and documentation - follow the documentation exactly,
- Complete the semantic sets in sme-dis.rle
- missing lists
- report conversion errors to Saara
- Look at the actio compound issue when adding from missing lists
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- Go through the Num bugs
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- fix stuorra-oslolaš lower case o
- implement discontinuous case inflection for sme numbers
- produce correct number base forms in the sme analyzer
- translate beta release docs to sme and smj
- fix bugs!
Tomi
- improve numerals in the speller
- add prefixes to the PLX
- add derivations to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- Go through the Num bugs
- fix bugs!