Meeting_2007-03-12
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 12.03.2007
- Time: 09.00 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:17.
Present: Børre, Sjur, Steinar, Thomas, Trond
Absent: Maaren, Saara, Tomi
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- write form to request corpus user account
- not done (can be delayed till after beta)
- document how to apply for access to closed corpus, and details on the corpus
- not done (can be delayed till after beta)
- update and fix our documentation and infrastructure as Steinar finds
- begun with it
- continue work on script for automatic testing of the spell checker in Word
- not done
- Sjur did it
- fix sme texts in corpus this month
- not done (can be delayed till after beta)
- find missing nob parallel texts in corpus
- not done (can be delayed till after beta)
- add prefixes to the PLX conversion
- not done
- add middle nouns to the PLX conversion
- not done
- improve number PLX conversion
- not done
- go through other directories, fix parallellity information for other documents
- not done
- add sma texts to the corpus repository
- not done (can be delayed till after beta)
- add info to front page (incl. download links)
- done
- write separate page with detailed info (incl. download links)
- done
- Improve automatic alignment process
- not done
- Store the tested texts, for reference
- not done
- Add potential speller test texts
- not done
- Set up ways of adding meta-information to speller test docs
- not done
- get an Intel Mac for Tomi
- ordered
- collect a list of PR recipients, forward to Berit Karen Paulsen
- not done
- add version info to the generated speller lexicons
- not done
- run all known spelling errors in the corpus through the speller
- not done
- test the typos.txt list, and check that all entries are properly corrected
- not done
- consider how to do a regression self-test
- not done
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- continue aligning the rest of the parallel files
- prepare more files for manual alignment
- update lexc2xml with comment field
- done
- start improving the corpus interface for Sámi in Oslo.
- Set up corpus directories for proofing test documents
- done
- Mark-up the added speller test texts, using our existing xml format
- infrastructure is ready, the files?
- fix bugs!
Sjur
- name lexicon:
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- postponed till after beta release
- hire linguist
- planning interview
- fix stuorra-oslolaš lower case o
- not done (postponed till after beta)
- write form to request corpus user account
- not done (postponed till after beta)
- document how to apply for access to closed corpus, and details on the corpus
- not done (postponed till after beta)
- write press release for the beta
- nothing more done since first draft
- get speller test tool from Polderland
- got them, but they were buggy - Polderland is working on a fix
- Set up ways of adding meta-information to speller test docs
- not done, but speller test docs are in principle just regular corpus docs,
- collect a list of PR recipients
- not done
- add version info to the generated speller lexicons
- not done
- run all known spelling errors in the corpus through the speller
- not done, we're not ready for that yet
- test the typos.txt list, and check that all entries are properly corrected
- ran typos.txt (first column) through the speller - found many slip-throughs
- also ran the second column of typos.txt through the speller, and found
- consider how to do a regression self-test
- been thinking, but no action so far
- fix bugs!
- other issues:
- wrote an AppleScript to run texts through the speller via MS Word itself, and
- discussed derivation conversion with Tomi
- started discussing improvements to the conversion - presently it is way
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
- some texts finished (marked typos) but not added to a directory yet
- Infrastructure test: add report to gt/doc/infra/, probably as
- the report is hopefully there very soon (if not I will contact the technician)
- Complete the semantic sets in sme-dis.rle
- no work this week
- missing lists
- no work this week
- Look at the actio compound issue when adding from missing lists
- not done
- Align corpus manually
- not done
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- done
- work with compounding
- worked and still working
- Lack of lowering before hyphen: Twol rewrite.
- not done
- fix stuorra-oslolaš lower case o
- not done (delayed till after beta)
- translate beta release docs to sme and smj
- not done
- Add potential speller test texts
- not done
- fix bugs!
- all the time
Tomi
- add derivations to the PLX generation
- done
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- update ccat to handle error/correction markup
- add version info to the generated speller lexicons
- fix bugs!
Trond
- Participate in the beta testing setup
- done
- Test the beta versions
- done
- Work on the parallel corpus issues
- done
- Discuss with Anders
- done, i.e., with Lars
- Work on the aligner with (Børre)
- Not done.
- fix sme texts in corpus this month
- Not done.
- find missing nob parallel texts in corpus, go through Saara's list
- Not done.
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Not done.
- Align corpus manually
- Not done.
- Store the tested texts, for reference
- Not done.
- Add potential speller test texts
- Not done.
- collect a list of PR recipients
- Not done.
- fix bugs!
- Not done.
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- document how to apply for access to closed corpus, and details on the corpus
- correct and improve it based on feedback from Steinar (Børre)
- beta documentation (see separate beta section below)
4. Corpus gathering
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Trond has been in Odense, in the Panorama meeting. The goal of Panorama is to
Alignment
TODO
- go through other directories (nob directories, sd directories), fix
- Improve the automatic process:
- Improve the anchor list and realign (Trond, Børre)
- Only adding words does not improve alignment, you have to consider the format
- The documents still have some formatting issues which cause trouble in
- Test and improve settings in the aligner
- Align manually (Trond, Steinar) (especially shorter terminological texts)
6. Infrastructure
TODO:
- add report to gt/doc/infra/, probably as infrareport.jspwiki
- update and fix our documentation and infrastructure as Steinar finds
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- postponed till after the public beta
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
Different ways of testing
- Impressionistic, functionality: try the program, try all the functions
- Impressionistic, coverage: try the program on different texts, look for
- Systematic (in order of importance):
- Make a corpus of texts, from different genres (can be done before 0.2
- For each text, detect precision
- For each text, detect recall
- For each text, detect accuracy
Before beta release: precision is important, but have a look at recall as well.
Definitions
- tp - true positives (correctly recognised misspellings)
- fp - false positives (correct words erroneously marked as misspellings)
- fn - false negatives (misspellings not recognised by the speller)
Recall and precision
- precision = tp / ( tp + fp ) = true redlines / all redlines
- can we trust that the redlines are actually errors?
- Task: check all hits
- (test p, are they tp or fp?)
- recall = tp / ( tp + fn ) = true redlines / all errors in doc
- can we trust that all errors are actually found?
- Task: check every single word
- (test p, are they tp or fp, test n, are they tn or fn?)
- accuracy = ( tp + tn ) / ( tp + fp + fn + tn ) = overall performance
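The three measures above can be sketched in a few lines of Python. This is only an illustration of the formulas: the function name and the sample counts are made up, and tn (correct words that are not marked) is assumed to be counted alongside tp, fp and fn.

```python
def speller_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return precision, recall and accuracy for one tested text."""
    redlines = tp + fp            # everything the speller marked
    errors = tp + fn              # every real error in the document
    words = tp + fp + fn + tn     # every word that was checked
    return {
        "precision": tp / redlines if redlines else 0.0,
        "recall": tp / errors if errors else 0.0,
        "accuracy": (tp + tn) / words if words else 0.0,
    }


# Made-up counts for a 1000-word text, for illustration only.
metrics = speller_metrics(tp=40, fp=10, fn=10, tn=940)
print(metrics)
```

The guards against empty denominators are a design choice of this sketch; a text with no redlines or no errors could just as well be reported as "not applicable".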
Precision and recall testing
A testbed has been set up (Trond), and some texts are marked for errors and
Types of tests:
- Technical testing
- Testing for linguistic functionality
- Testing for lexical coverage
- Testing for normativity
- Testing the suggestions
The tester should identify these 4 values:
- wds - number of words in the text
- tp - correctly identified errors
- fp - correctly written but marked as errors
- fn - errors not marked as such
The spreadsheet will then calculate precision, recall and accuracy. Steinar
Testing of suggestion should follow the same lines:
- errs - number of errors in the text
- tp - the intended word is among the suggestions
- fp - the intended word is not among the suggestions
- fn - no suggestions
- tn - (not relevant??)
Ordering of suggestions:
- place in the list of the intended correction
- ordered first
- ordered top-five
- ordered below top-five
"Perceived Quality", ie for all recognised errors/tp:
- number of correct suggestions at top
- number of correct suggestions among top-five
- number of correct suggestions below top-five
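A minimal Python sketch of how the suggestion ordering above could be tallied. The function names and the test data are hypothetical, and the buckets are treated as mutually exclusive here (so "top-five" means positions 2 to 5), which is an assumption and not something decided in the meeting.

```python
from collections import Counter


def rank_bucket(intended: str, suggestions: list) -> str:
    """Classify one recognised error by where the intended word is suggested."""
    if not suggestions:
        return "no suggestions"          # fn in the list above
    if intended not in suggestions:
        return "not suggested"           # fp in the list above
    position = suggestions.index(intended) + 1
    if position == 1:
        return "first"
    if position <= 5:
        return "top-five"
    return "below top-five"


def tally(cases) -> Counter:
    """Count buckets over (intended word, suggestion list) pairs."""
    return Counter(rank_bucket(intended, sugg) for intended, sugg in cases)


# Placeholder data, not real speller output.
cases = [
    ("a", ["a", "b"]),
    ("b", ["x", "b"]),
    ("c", ["1", "2", "3", "4", "5", "c"]),
    ("d", ["x"]),
    ("e", []),
]
counts = tally(cases)
print(counts)
```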
Testing on unseen texts
We need to use unknown texts in order to measure the performance of the speller.
Regression tests
We need to ensure that we do not take steps backwards, ie all known spelling
We also need to regression test the PLX conversion. In principle this is easy -
TODO:
- add extraction of all known spelling errors in the corpus (not the
- test the typos.txt list, and check that all entries are properly corrected
- consider how to do a regression self-test, ie, how to test the full
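As a rough sketch of the typos.txt check, assuming the file has one typo/correction pair per line with the typo in the first column and the correction in the second, separated by a tab (the separator is an assumption). The speller is represented by a hypothetical `is_accepted` callable, since the real speller interface is not described here.

```python
def check_typos(lines, is_accepted):
    """Return (slip_throughs, rejected_corrections) for a typo list.

    slip_throughs: typos the speller wrongly accepts.
    rejected_corrections: corrections the speller wrongly rejects.
    """
    slip_throughs = []
    rejected_corrections = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        typo, correction = fields[0], fields[1]
        if is_accepted(typo):
            slip_throughs.append(typo)
        if not is_accepted(correction):
            rejected_corrections.append(correction)
    return slip_throughs, rejected_corrections


# Toy stand-in for the speller: a set of accepted word forms.
toy_lexicon = {"error", "teh"}
sample = ["erorr\terror\n", "teh\tthe\n"]
slips, rejected = check_typos(sample, toy_lexicon.__contains__)
print(slips, rejected)
```

Run after every lexicon build, both lists should stay empty or shrink; any entry that reappears is a regression.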
Testing tools
We have received a set of testing tools from Polderland. They have some
Sjur has written an AppleScript to run arbitrary texts through the MS Word
TODO:
- get updated Polderland testing tools (Sjur)
- document the AppleScript testing tool (Sjur)
- write tools for statistical analysis of test results (Sjur)
Storing test texts
Test texts should be stored in the corpus catalogue, separated from the ordinary
TODO:
- Store the tested texts, for reference (Trond, Børre)
- Set up (sub)directories (Saara)
- top-level dir corpus/prooftest/orig/ and corpus/prooftest/xml/
- Add potential test texts (Børre, Thomas, Trond, anyone, really)
- Manually mark them for typos (making them into gold standards)
- erorr§error
- Format the added texts in appropriate ways - use our existing xml format, with
- requires changes to ccat to handle error/correction markup (Tomi)
- Set up ways of adding meta-information (source info, used in testing or not,
- Conduct tests on new beta versions on the basis of the unspoiled gold standard
- alternatively: make test scripts that will run the tests automatically,
- include the ones already tested in the testing/ catalogue
- test 0.3 on the same texts
- test each version before beta release
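The erorr§error notation used for marking typos could be read back with something like the Python sketch below. It assumes the markup is attached to single whitespace-separated tokens, which the notes do not state; the function name is made up, and this is not the ccat implementation.

```python
def split_markup(text: str):
    """Split gold-standard text into the text as written and its corrections.

    Returns (text_with_errors, [(error, correction), ...]).
    """
    as_written = []
    pairs = []
    for token in text.split():
        if "§" in token:
            error, correction = token.split("§", 1)
            pairs.append((error, correction))
            as_written.append(error)   # the speller should see the typo
        else:
            as_written.append(token)
    return " ".join(as_written), pairs


plain, gold = split_markup("one erorr§error here")
print(plain)
print(gold)
```

Punctuation glued to a marked token would end up in the correction, so a real tool would need a proper tokeniser.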
The b0.3 / 2007.02.26 version
Known errors:
- clitics do not work with W class words (uninflected words). Two options:
- generate these with clitics (adds words from 6700 -> 100 000)
- done
- ask Polderland to look at it - Sjur will do that
- Tomi did it, follow-up e-mail discussions by Sjur and Tomi
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme (Thomas)
- translate beta release docs to smj (Thomas)
Conversion from LexC to PLX
Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 = 83.3 hrs.
Nouns compile at 3 sec/noun, i.e. (23600*3) / 3600 = 19.7 hrs.
Verbs take 13-15 hours to compile.
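The estimates above are plain arithmetic and can be re-run when the per-entry timings change. A small Python sketch, using only the entry counts and timings quoted above (the function name is made up):

```python
def compile_hours(entries: int, seconds_per_entry: float) -> float:
    """Estimated total conversion time in hours."""
    return entries * seconds_per_entry / 3600


adjective_hours = compile_hours(5000, 60)
noun_hours = compile_hours(23600, 3)
print(f"adjectives: {adjective_hours:.1f} h, nouns: {noun_hours:.1f} h")
```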
This is so far acceptable for nouns, but on the edge of being unacceptable for
We need to investigate why adjectives are so slow, and try to improve on the
Update:
Saara has several ideas for how to improve the speed on her end (ie the
abohtta GOAHTI "abbot N" ; !+SgNomCmp +SgGenCmp +PlGenCmp
↓ <- perl-script
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ; !
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ; !
↓
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta+N+Sg+Gen GOAHTI "abbot N" ; !
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta:+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ; !
-- filter which removes the Cmp-tags from all case forms except SgNom, SgGen and PlGen - thus, all forms without Cmp tags will be L only --
The basic idea is to use the Xerox tools to do the conversion for us, by
Some brief profiling done by Børre during the meeting showed that
We also played further with using the Xerox tools all the way to producing ready
We need to test that the conversion is correct and gives expected results in all
TODO:
- Look at bottlenecks in existing code (Tomi, Børre)
- Look at xfst ways of doing it (Sjur, Trond, ...)
- add derivations to the PLX generation (Tomi)
- working on it
- add prefixes to the PLX (Børre)
- middle nouns (Børre)
- make conversion test sample; add conversion testing to the make file
- improve number conversion (Børre, Tomi)
Public Beta release
Due to the problems with generating the PLX files discussed above, we need to
Linguistic issues still open:
- derivations (Tomi)
- "solved" in the existing code
- numbers 1-20 (Børre)
- prefixes (eahpe, ii-) (Børre)
- middle nouns (LEXICON: lexc: Rmiddle, plx: L) (Børre)
DONE:
- delivered PLX data of sme and smj including compounding
- translated Windows installer to sme and smj
- installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for
- added resources needed for compiling PLX lexicons to our cvs repo
- tested the beta drop from Polderland - good we did, it is absolutely
- questions for Polderland:
- version info in the speller
- remaking/updating the installer packages with linguistic updates
- add compilation of MS Office spellers part of the Makefile
- install Windows and MS Office; test tools on Windows
TODO:
- finish press release (Sjur)
- add info to front page (incl. download links) (Børre)
- write separate page with detailed info (incl. download links) (Børre)
- collect a list of PR recipients, forward to Berit Karen Paulsen
Version identification of speller lexicons
See the Norwegian spellers for an example, with the trigger string tfosgniL.
Suggestion:
nuvviD -> Divvun
nuvviD -> Dávvisámegiella
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007 (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek
These correction rules (and their corresponding PLX entries) should be added
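One possible way to generate the suggested rules automatically at build time is sketched below in Python. It only reproduces the `nuvviD -> value` lines from the suggestion; the function name and parameters are hypothetical, and the corresponding PLX entry syntax is not covered because it is not shown in these notes.

```python
from datetime import date

TRIGGER = "nuvviD"  # trigger string from the suggestion above


def version_rules(language: str, version: str, build_date: date, contributors):
    """Build the list of 'trigger -> value' correction rules."""
    stamp = f"{build_date.day}.{build_date.month}.{build_date.year}"
    values = ["Divvun", language, f"Veršuvdna_{version}", stamp, *contributors]
    # Values are written as single tokens, so spaces become underscores.
    return [f"{TRIGGER} -> {value.replace(' ', '_')}" for value in values]


rules = version_rules(
    "Dávvisámegiella",
    "1.0b1",
    date(2007, 2, 12),
    ["Sjur Nørstebø Moshagen", "Børre Gaup"],
)
print("\n".join(rules))
```

In a real build the version string would presumably come from the cvs tag and the date from the build machine, as the suggestion hints.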
TODO:
- add version info to the generated speller lexicons (Børre, Sjur, Tomi)
10. Other
Project meeting IRL
Reserve the whole week after easter for a project gathering, probably in
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- delayed until the public beta is out the door
Bug fixing
57 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 19.3.2007, 09:30 Norwegian time.
The meeting was closed at 11:50.
Appendix - task lists for the next week
Børre
- write form to request corpus user account
- document how to apply for access to closed corpus, and details on the corpus
- update and fix our documentation and infrastructure as Steinar finds
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- add prefixes to the PLX conversion
- add middle nouns to the PLX conversion
- improve number PLX conversion
- go through other directories, fix parallellity information for other documents
- add sma texts to the corpus repository
- Improve automatic alignment process
- Store the tested texts, for reference
- Add potential speller test texts
- Set up ways of adding meta-information to speller test docs
- get an Intel Mac for Tomi
- collect a list of PR recipients, forward to Berit Karen Paulsen
- add version info to the generated speller lexicons
- run all known spelling errors in the corpus through the speller
- test the typos.txt list, and check that all entries are properly corrected
- consider how to do a regression self-test
- Look at bottlenecks in existing PLX conversion code
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- continue aligning the rest of the parallel files
- prepare more files for manual alignment
- start improving the corpus interface for Sámi in Oslo.
- mark-up the added speller test texts, using our existing xml format
- improve cgi-bin scripts
- fix bugs!
Sjur
- hire linguist
- finish press release for the beta
- Set up ways of adding meta-information to speller test docs
- collect a list of PR recipients
- add version info to the generated speller lexicons
- run all known spelling errors in the corpus through the speller
- consider how to do a regression self-test
- get updated Polderland testing tools
- document the AppleScript testing tool
- write tools for statistical analysis of test results
- Look at xfst ways of doing PLX conversion
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
- Infrastructure test: add report to gt/doc/infra/, probably as
- Complete the semantic sets in sme-dis.rle
- missing lists
- Look at the actio compound issue when adding from missing lists
- Align corpus manually
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- translate beta release docs to sme and smj
- Add potential speller test texts
- fix bugs!
Tomi
- Look at bottlenecks in existing PLX conversion code
- improve PLX conversion speed
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- update ccat to handle error/correction markup
- add version info to the generated speller lexicons
- fix bugs!
Trond
- Test the beta versions
- Work on the parallel corpus issues
- Discuss with Anders
- Work on the aligner with (Børre)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Store the tested texts, for reference
- Add potential speller test texts
- collect a list of PR recipients
- Look at xfst ways of doing PLX conversion
- fix bugs!