Meeting_2007-02-26
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 26.02.2007
- Time: 09.00 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:10, continued at 13.02.
Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- write form to request corpus user account
  - Not done
- document how to apply for access to closed corpus, and details on the corpus
  - Not done
- update and fix our documentation and infrastructure as Steinar finds
  - Begun working
- continue work on script for automatic testing of the spell checker in Word
  - Not done
- find missing nob parallel texts in corpus
  - Not done
- work on the Polderland data generation (PLX format conversion)
  - Not done
- go through other directories, fix parallelism information for other documents
  - Not done
- add sma texts to the corpus repository
  - Not done
- add info to front page (incl. download links)
  - Not done
- write separate page with detailed info (incl. download links)
  - Not done
- fix bugs!
Maaren
- lexicalise actio compounds
- done some
Saara
- fix sme texts in corpus this month
- continue aligning the rest of the parallel files
- fix problems with xml2lexc if needed
- have some holiday first
  - done
- start improving the corpus interface for Sámi in Oslo.
- fix bugs!
  - done some
- other
  - started writing article(s)
Sjur
- name lexicon:
  - refactor the rest of the SD-terms editor code
  - implement missing propnouns editing functions
  - implement improvements decided upon in Tromsø
  - synchronisation between cvs and running db-s
- hire linguist and programmer
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- get an Intel Mac for Tomi
- write press release for the beta
- fix bugs!
- other:
  - was on Winter Holiday last week
  - did some work on the beta release
Steinar
- test our infrastructure and documentation - follow the documentation exactly,
  - tested most, reported some major problems, info about access to corpus is
- Complete the semantic sets in sme-dis.rle
  - no work this week
- missing lists
  - no work this week
- report conversion errors to Saara
  - not done
- Look at the actio compound issue when adding from missing lists
  - not done
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- Go through the Num bugs
  - not done
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  - not done
- work with compounding
  - begun adding tags
- Lack of lowering before hyphen: Twol rewrite.
  - not done
- fix stuorra-oslolaš lower case o
  - not done
- implement discontinuous case inflection for sme numbers
  - finished
- produce correct number base forms in the sme analyzer
  - finished
- translate beta release docs to sme and smj
  - not done
- fix bugs!
  - fixed some
Tomi
- improve numerals in the speller
- add prefixes to the PLX
- add derivations to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
  - Not done
- fix sme texts in corpus this month
  - Worked on getting an overview
- find missing nob parallel texts in corpus, go through Saara's list
  - Worked on getting an overview
- Go through the Num bugs
  - No bugs resolved
- fix bugs!
  - No bugs resolved
- Last week was speller week.
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- document how to apply for access to closed corpus, and details on the corpus
- correct and improve it based on feedback from Steinar (Børre)
4. Corpus gathering
The disamb project would want parallel texts relevant for terminological work
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Alignment
The aligner output has too many errors. Possible tasks to improve it added to
TODO
- go through other directories (nob directories, sd directories), fix
- Improve the automatic process:
  - Improve the anchor list and realign (Trond, Børre)
  - Only adding words does not improve alignment; you have to consider the format
  - The documents still have some formatting issues which cause trouble in
  - Test and improve settings in the aligner
- Align manually (Trond, Steinar) (especially shorter terminological texts)
Conversion issues
TODO:
- report conversion errors to Saara (Trond, Steinar)
- should be done anyway, all the time, by all of us, for all tools :-)
6. Infrastructure
Steinar has gone through all of the documentation except access to the corpus,
TODO:
- test our infrastructure and documentation - follow the documentation exactly,
  - done
- add report to gt/doc/infra/, probably as infrareport.jspwiki
- update and fix our documentation and infrastructure as Steinar finds
7. Linguistics
Numbers:
TODO:
- discontinuous case inflection in sme (but only for maximally three-part
  - done
- produce correct number base forms in the sme analyzer (Thomas)
  - done
- Go through the sme Num bugs (Thomas)
  - done
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
  - continuous work
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
  - postponed till after the public beta
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Nothing new.
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
lexc2xml conversion bug
<entry id="Budejju">
  <infl lexc="ACCRA">
    <stem>Budej3ju</stem>
  </infl>
  <infl lexc="ACCRASUB" type="secondary">
    <stem>Budeju</stem>
  </infl>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log/>
</entry>
Where do we draw the line between "words" and variants belonging to the same
Trondheim ? Trondhjem ? Nidaros (separate entry)
An example where stem variation is just an expression of paradigm variation:
Buckingham^shire:Buckingham^shire3 ACCRA-plc ;
!SUB Buckingham^shire:Buckingham^shire ACCRA-plc ;
! Note: the final vowel e3, used for e-stems that NEVER may have illatives

Buckinghamshire+N+Sg+Ill
Buckinghamshire+N+Sg+Ill    Buckinghamshirii
Buckinghamshire+N+Sg+Ill    Buckinghamshirej <== sic!
In SD-terms, Budejju and Budeju would be stored as separate entries, but
<entry id="Budejju">
  <infl lexc="ACCRA">
    <stem>Budej3ju</stem>
  </infl>
  <variant ref=""/>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log>
    <comment date="date-of-conversion" who="xxx">
      Comment text
    </comment>
  </log>
</entry>
<entry id="Budejju" type="secondary">
  <infl lexc="ACCRASUB">
    <stem>Budeju</stem>
  </infl>
  <variant ref=""/>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log/>
</entry>
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
MS Office speller
There are no language codes available for Office 2004, and there won't be any,
Apologies about the delay. This is indeed a shortcoming of the application. The product team is currently looking at how it can be addressed moving forward. For Office 2004, the only global language setting I'm aware of is the one in the Custom tab in the document properties window. There you can set the Name field to "Language" then set the Type to text and enter the value "Saami." Obviously this doesn't provide object-model support via language settings. I hope this helps a little...
The official Microsoft language codes for Sámi languages are:
hex   dec   language-country combo
----  ----  ----------------------
043b  1083  Northern Sami - Norway
083b  2107  Northern Sami - Sweden
0c3b  3131  Northern Sami - Finland
103b  4155  Lule Sami - Norway
143b  5179  Lule Sami - Sweden
183b  6203  Southern Sami - Norway
1c3b  7227  Southern Sami - Sweden
203b  8251  Skolt Sami - Finland
243b  9275  Inari Sami - Finland
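The hex and decimal columns in the list above can be cross-checked mechanically. The sketch below assumes the usual Windows language identifier layout, where the low 10 bits hold the primary language and the upper bits hold the sublanguage. The names `SAMI_CODES` and `split_langid` are made up for this illustration and do not come from any Microsoft API.

```python
# Hex code -> (decimal value, label), copied from the list above.
SAMI_CODES = {
    0x043B: (1083, "Northern Sami - Norway"),
    0x083B: (2107, "Northern Sami - Sweden"),
    0x0C3B: (3131, "Northern Sami - Finland"),
    0x103B: (4155, "Lule Sami - Norway"),
    0x143B: (5179, "Lule Sami - Sweden"),
    0x183B: (6203, "Southern Sami - Norway"),
    0x1C3B: (7227, "Southern Sami - Sweden"),
    0x203B: (8251, "Skolt Sami - Finland"),
    0x243B: (9275, "Inari Sami - Finland"),
}


def split_langid(langid):
    """Return (primary language, sublanguage) of a 16-bit language identifier.

    Assumption: low 10 bits = primary language, remaining bits = sublanguage.
    """
    return langid & 0x3FF, langid >> 10


for code, (dec, label) in SAMI_CODES.items():
    assert code == dec  # the hex and decimal columns must agree
    primary, sub = split_langid(code)
    print(f"{code:04x}  {dec}  primary=0x{primary:02x}  sublanguage={sub}  {label}")
```

Under that assumption all nine codes share the primary language value 0x3b and differ only in the sublanguage number (1 to 9).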
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
Precision and recall testing
A testbed has been set up (Trond), and some texts are marked for errors and
Types of tests:
- Technical testing
- Testing for linguistic functionality
- Testing for lexical coverage
- Testing for normativity
- Testing the suggestions
TODO:
- get an Intel Mac for Tomi (Sjur)
  - not yet
- Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
  - textid - nu_wds - tp - fp - tn - fn - prec - rec - acc - spellid - ref_to_txt
- Store the tested texts, for reference (Trond, Børre)
The tester should identify these 4 values:
- wds - number of words in the text
- tp - correctly identified errors
- fp - correctly written but marked as errors
- fn - errors not marked as such
The spreadsheet will then calculate precision, recall and accuracy. Steinar
Testing of suggestion should follow the same lines:
- errs - number of errors in the text
- tp - the intended word is among the suggestions
- fp - the intended word is not among the suggestions
- fn - no suggestions
- tn - (not relevant??)
Ordering of suggestions:
- place in the list of the intended correction
  - ordered first
  - ordered top-five
  - ordered below top-five
"Perceived Quality", i.e. for all recognised errors/tp:
- number of correct suggestions at top
- number of correct suggestions among top-five
- number of correct suggestions below top-five
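The ordering categories above could be tallied along the following lines. This is only a sketch: the function names `rank_bucket` and `tally`, the bucket labels and the sample data are invented for illustration, and it assumes that "top-five" means positions 2 to 5, so that the three ordering buckets do not overlap.

```python
from collections import Counter


def rank_bucket(intended, suggestions):
    """Classify where the intended correction sits in the suggestion list."""
    if not suggestions:
        return "no suggestions"       # fn in the suggestion scheme above
    if intended not in suggestions:
        return "not suggested"        # fp in the suggestion scheme above
    position = suggestions.index(intended) + 1  # 1-based rank
    if position == 1:
        return "first"
    if position <= 5:
        return "top-five"
    return "below top-five"


def tally(cases):
    """Count buckets over (intended, suggestions) pairs, one per recognised error."""
    return Counter(rank_bucket(intended, sugg) for intended, sugg in cases)


# Placeholder data, not real speller output.
cases = [
    ("w1", ["w1", "x", "y"]),
    ("w2", ["a", "b", "w2"]),
    ("w3", ["a", "b", "c", "d", "e", "w3"]),
    ("w4", ["a", "b"]),
    ("w5", []),
]
print(tally(cases))
```

Keeping the per-error bucket separate from the tally makes it possible to store the bucket next to each gold-standard error and to recount after every new beta.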
Testing on unseen texts
We need to use unknown texts in order to measure the performance of the speller.
Regression tests
We need to ensure that we do not take steps backwards, i.e. all known spelling
Storing test texts
Test texts should be stored in the corpus catalogue, separated from the ordinary
TODO:
- get speller test tool from Polderland (Sjur)
- Set up (sub)directories (Børre, Saara)
- Add potential test texts (Børre, Thomas, Trond, anyone, really)
- Manually mark them for typos (making them into gold standards)
- Format the added texts in appropriate ways - use our existing xml format, with
- Set up ways of adding meta-information (source info, used in testing or not,
- Set up test record page in gt/doc/proof/spelling/testing/ (Børre)
- Conduct tests on new beta versions on the basis of the unspoiled gold standard
- alternatively: make test scripts that will run the tests automatically,
include the ones already tested in the testing/ catalogue
The b0.3 / 2007.02.26 version
Subjective impression: With the b0.3 version we have surpassed the alpha
Writing biillaviessu, ránskkabiila (gen+nom) we get error, and suggestion
SgNom does not have any L tag - but SgNomCmp does. Thus, biilaviessu is NOT
sát^ne^gir^ji NIR      +N+Sg+Nom
sát^ne^gir^ji GaIALR   +N+Sg+Gen
sát^ne^gir^ji GaAL     +SgGenCmp
vies^su NIR
bii^la NIR             +N+Sg+Nom
biil^la GaIALR
bii^la NAL             +SgNomCmp
biil^la GaAL
The two bii^la entries above in effect create one entry
Known errors:
- clitics do not work with W class words (uninflected words)
  - two options:
    - generate these with clitics (adds words from 6700 -> 100 000)
    - ask Polderland to look at it - Sjur will do that
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme (Thomas)
- translate beta release docs to smj (Thomas)
Lexicon conversion to the PLX format
We need to test that the conversion is correct and gives expected results in all
TODO:
- add derivations to the PLX generation (Tomi)
- add prefixes to the PLX (Børre)
- middle nouns (Børre)
- make conversion test sample; add conversion testing to the make file
- improve number conversion (Børre, Tomi)
Public Beta release
Tentative public beta release: after the initial linguistic bugs and poor
Internal deadlines:
- A date for when lexical updates should be checked in, in
- A plan for how many pre-betas we should compile, and when(?)
  - alpha = Dutch (sme) + French (smj)
  - beta 0.1 = the first Catalan (sme) + Basque (smj)
  - beta 0.2 = the second Catalan
  - beta 0.3 = 26. or 27.: compound beta
  - beta 0.4 = 2.3.: first derivation beta, also including numbers, prefixes.
  - beta 0.5 = 7.3.: final derivation beta, also including middle nouns
Linguistic issues still open:
- derivations (Tomi)
- numbers 1-20 (Børre)
- prefixes (eahpe, ii-) (Børre)
- middle nouns (LEXICON: lexc: Rmiddle, plx: L) (Børre)
The middle nouns are: beai, beal, geaš, oahpaheai, oai, vuol. They are also
beai+ShCmp:beai Rreal ; (not used init in our corpus)
beal+ShCmp:beal Rreal ; (init with Num -goalmmat, -guđát, -nuppi, lexicalized)
geaš+ShCmp:geaš Rreal ; (not used init in our corpus)
oahpaheai+ShCmp:oahpaheai Rreal ; init, but then actually 2-part
oai+ShCmp:oai Rreal ; (not used in corpus init oaivuolli (SUB? yes!)
vuol+ShCmp:vuol Rreal ; (not used in our corpus)
The PLX format does not allow encoding a stem as middle-only. For the public
DONE:
- delivered PLX data of sme and smj including compounding
- translated Windows installer to sme and smj
- installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for
- added resources needed for compiling PLX lexicons to our cvs repo
- tested the beta drop from Polderland - good we did, it is absolutely
TODO:
- write press release (Sjur)
  - done first draft, see xtdoc/sd/.../xdocs/pr/
- add info to front page (incl. download links) (Børre)
- write separate page with detailed info (incl. download links) (Børre)
  - Sjur wrote a start
- compile new speller lexicons using the mklex* tools on the G5, following the
  - done regularly now
- add compilation of MS Office spellers to the Makefile (Tomi)
- install Windows and MS Office; test tools on Windows (Børre, Thomas)
- collect a list of PR recipients, forward to Berit Karen Paulsen
- questions for Polderland (Børre):
  - version info in the speller?
  - remaking/updating the installer packages with linguistic updates - who?
Version identification of speller lexicons
See the Norwegian spellers for an example, with the trigger string tfosgniL.
Suggestion:
nuvviD -> Divvun
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007 (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek
These correction rules (and their corresponding PLX entries) should be added
Conversion from LexC to PLX
Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 ≈ 83 hrs
Nouns compile at 3 sec/noun, i.e. (23600*3) / 3600 ≈ 20 hrs
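The estimates above can be reproduced with a few lines of Python. The helper name `compile_hours` is made up for this sketch; the entry counts and per-entry times are the ones quoted above.

```python
def compile_hours(entries, seconds_per_entry):
    """Estimated total compile time in hours for a lexicon of the given size."""
    return entries * seconds_per_entry / 3600


adjectives = compile_hours(5000, 60)   # 5000 adjectives at 60 s each
nouns = compile_hours(23600, 3)        # 23600 nouns at 3 s each
print(f"adjectives: {adjectives:.1f} h, nouns: {nouns:.1f} h")
# prints "adjectives: 83.3 h, nouns: 19.7 h"
```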
This is so far acceptable for nouns, but on the edge of being unacceptable for
We need to investigate why adjectives are so slow, and try to improve on the
Testing
Different ways of testing:
- Impressionistic, functionality: try the program, try all the functions
- Impressionistic, coverage: try the program on different texts, look for
- Systematic (in order of importance):
  - Make a corpus of texts, from different genres (can be done before 0.2
    - For each text, detect precision
    - For each text, detect recall
    - For each text, detect accuracy
Before beta release: precision is important, but have a look at recall as well.
Recall and precision
- precision = tp / (tp + fp) = true redlines / all redlines
  - can we trust that the redlines are actually errors?
  - Task: check all hits
  - (test p, are they tp or fp?)
- recall = tp / (tp + fn) = true redlines / all errors in doc
  - can we trust that all errors are actually found?
  - Task: check every single word
  - (test p, are they tp or fp, test n, are they tn or fn?)
- accuracy = (tp + tn) / (tp + fp + fn + tn) = overall performance
Definitions:
- true positives (correctly recognised misspellings)
- false positives (correct words erroneously marked as misspellings)
- false negatives (misspellings not recognised by the speller)
- true negatives (correct words not marked by the speller)
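As a sketch of what the test spreadsheet is meant to calculate, the three measures can be derived from the four values the tester records (wds, tp, fp, fn). The function name `speller_scores` and the sample counts are invented for illustration. The number of true negatives is derived under the assumption that every word in the text falls into exactly one of the four classes.

```python
def speller_scores(wds, tp, fp, fn):
    """Precision, recall and accuracy from the counts recorded for one text."""
    # Assumption: every word is exactly one of tp, fp, fn or tn.
    tn = wds - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / wds if wds else 0.0
    return {"tn": tn, "prec": precision, "rec": recall, "acc": accuracy}


# Invented counts, for illustration only.
scores = speller_scores(wds=1000, tp=40, fp=10, fn=5)
print(scores)
```

The guards against empty denominators are a design choice of this sketch; a text with no marked words would otherwise raise a division error.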
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
Bug fixing
57 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 5.3.2007, 09:30 Norwegian time.
The meeting was closed at 11:05 the first time, then at 14.55.
Appendix - task lists for the next week
Børre
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- update and fix our documentation and infrastructure as Steinar finds
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- add prefixes to the PLX conversion
- add middle nouns to the PLX conversion
- improve number PLX conversion
- go through other directories, fix parallelism information for other documents
- add sma texts to the corpus repository
- add info to front page (incl. download links)
- write separate page with detailed info (incl. download links)
- Improve automatic alignment process
- Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
- Store the tested texts, for reference
- Set up (sub)directories for speller test documents
- Add potential speller test texts
- Mark up the added speller test texts, using our existing xml format
- Set up ways of adding meta-information to speller test docs
- Set up test record page in gt/doc/proof/spelling/testing/
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- continue aligning the rest of the parallel files
- prepare files for manual alignment
- add ABBR, ACR, clitics to closed classes + ADV to paradigm generator
- add correction markup to test documents
- update lexc2xml with comment field
- start improving the corpus interface for Sámi in Oslo.
- Set up (sub)directories for speller test documents
- Mark up the added speller test texts, using our existing xml format
- fix bugs!
Sjur
- name lexicon:
  - refactor the rest of the SD-terms editor code
  - implement missing propnouns editing functions
  - implement improvements decided upon in Tromsø
- hire linguist and programmer
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- get an Intel Mac for Tomi
- write press release for the beta
- get speller test tool from Polderland
- Set up ways of adding meta-information to speller test docs
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards)
- Infrastructure test: add report to gt/doc/infra/, probably as
- Complete the semantic sets in sme-dis.rle
- missing lists
- Look at the actio compound issue when adding from missing lists
- Align corpus manually
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- fix stuorra-oslolaš lower case o
- translate beta release docs to sme and smj
- Add potential speller test texts
- fix bugs!
Tomi
- add derivations to the PLX generation
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- fix bugs!
Trond
- Participate in the beta testing setup
- Test the beta versions
- Work on the parallel corpus issues
  - Discuss with Anders
  - Work on the aligner (with Børre)
  - fix sme texts in corpus this month
  - find missing nob parallel texts in corpus, go through Saara's list
- Postpone these tasks to after the beta:
  - update the smj proper noun lexicon, and refine the morphological
  - Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
- Store the tested texts, for reference
- Add potential speller test texts
- fix bugs!