Meeting_2007-03-19
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 19.03.2007
- Time: 09.00 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:04.
Present: Børre, Sjur, Steinar, Thomas, Tomi, Trond
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- write form to request corpus user account
- delayed till after beta
- document how to apply for access to closed corpus, and details on the corpus
- delayed till after beta
- update and fix our documentation and infrastructure as Steinar finds
- started to set up a clear path in the documentation for newbies to find their
- fix sme texts in corpus this month
- not done
- find missing nob parallel texts in corpus
- not done
- add prefixes to the PLX conversion
- not done
- add middle nouns to the PLX conversion
- not done
- improve number PLX conversion
- not done
- go through other directories, fix parallelism information for other documents
- most directories seem to have the necessary info
- add sma texts to the corpus repository
- not done
- Improve automatic alignment process
- not done
- Store the tested texts, for reference
- not done
- Add potential speller test texts
- gathered some text from samediggi.no
- Set up ways of adding meta-information to speller test docs
- not done
- get an Intel Mac for Tomi
- it arrived in Kauto last week and is now on its way to Tromsø.
- collect a list of PR recipients, forward to Berit Karen Paulsen
- not done
- add version info to the generated speller lexicons
- not done
- run all known spelling errors in the corpus through the speller
- we don't have the needed infra for this in place yet
- test the typos.txt list, and check that all entries are properly corrected
- ran a brief test. Not everything is corrected.
- consider how to do a regression self-test
- not done
- Look at bottlenecks in existing PLX conversion code
- did it last week, found hyphenation
- Tomi and Sjur are fixing this
- fix bugs!
- not done
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- continue aligning the rest of the parallel files
- prepare more files for manual alignment
- start improving the corpus interface for Sámi in Oslo.
- mark-up the added speller test texts, using our existing xml format
- the directories and infrastructure are created and waiting for files.
- improve cgi-bin scripts
- fix bugs!
Sjur
- hire linguist
- preparing interview this week
- finish press release for the beta
- not done
- Set up ways of adding meta-information to speller test docs
- not done
- collect a list of PR recipients
- not done
- add version info to the generated speller lexicons
- gave it a try, but it didn't work out as intended - will check with PL
- run all known spelling errors in the corpus through the speller
- not yet
- consider how to do a regression self-test
- not done
- get updated Polderland testing tools
- got them - they are now working as expected, although we still need to
- document the AppleScript testing tool
- started on it, not finished
- write tools for statistical analysis of test results
- only planning done
- Look at xfst ways of doing PLX conversion
- did it for verbs - all verbs now converted to PLX in about 15 mins - down
- another candidate POS is numbers - only problem is that it is circular
- fix bugs!
- got the bug count down to less than 50, by closing several old issues that
Steinar
- Beta testing: Align manually (shorter texts)
- not started
- Manually mark speller test texts for typos (making them into gold standards),
- done some work
- Infrastructure test: add report to gt/doc/infra/, probably as
- done
- Complete the semantic sets in sme-dis.rle
- no work this week
- missing lists
- no work this week
- Look at the actio compound issue when adding from missing lists
- not done
- Align corpus manually
- not started
- fix bugs!
Thomas
- work with compounding
- progressing
- Lack of lowering before hyphen: Twol rewrite.
- not done
- translate beta release docs to sme and smj
- not done
- Add potential speller test texts
- not done
- fix bugs!
- no bug fixing this week
Tomi
- Look at bottlenecks in existing PLX conversion code
- was resolved partially
- improve PLX conversion speed
- dramatically faster now
- make PLX conversion test sample; add conversion testing to the make file
- not done
- improve number PLX conversion
- not done
- update ccat to handle error/correction markup
- not done
- add version info to the generated speller lexicons
- not done
- fix bugs!
Trond
- Test the beta versions
- Done, albeit not systematically
- Work on the parallel corpus issues
- Done some screening.
- Discuss with Anders
- Lars, that is. Done. More talks today.
- Work on the aligner with (Børre)
- Not done.
- fix sme texts in corpus this month
- Not done.
- find missing nob parallel texts in corpus, go through Saara's list
- Not done.
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Store the tested texts, for reference
- Add potential speller test texts
- collect a list of PR recipients
- Look at xfst ways of doing PLX conversion
- fix bugs!
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- delayed till after the beta release
- document how to apply for access to closed corpus, and details on the corpus
- delayed till after the beta release
- correct and improve it based on feedback from Steinar (Børre)
- low priority
- beta documentation (see separate beta section below)
4. Corpus gathering
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Alignment
TODO
- go through other directories (nob directories, sd directories), fix
- Improve the automatic process:
- Improve the anchor list and realign (Trond, Børre)
- Only adding words does not improve alignment, you have to consider the format
- The documents still have some formatting issues which cause trouble in
- Test and improve settings in the aligner
- Align manually (Trond, Steinar) (especially shorter terminological texts)
6. Infrastructure
TODO:
- add report to gt/doc/infra/, probably as infrareport.jspwiki
- done
- update and fix our documentation and infrastructure as Steinar finds
- started, working on it
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- postponed till after the public beta
Lule Sámi
Trond fixed a bug where an initial capital vowel prevented the CG rule from working.
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
Selecting test texts
In principle, we need the same text types as the ones we already aim at in our
- Min Áigi net issue
- Blogs
- Our own linguistic texts
- New department texts from the net
Storing test texts
TODO:
- Set up (sub)directories (Saara)
- top-level dir corpus/prooftest/orig/ and corpus/prooftest/xml/
- done
- Manually mark test texts for typos (making them into gold standards)
- erorr§errors
- Add the marked test texts to the orig/ catalogue
- Format the added texts in appropriate ways - use our existing xml format, with
- done
- change ccat to handle error/correction markup (Tomi)
- extract the document text with original errors (input to standard speller
- extract the document text with the available corrections (the correct docu),
- extract all and only the spelling errors with their corrections, in a tab
- Set up ways of adding meta-information (source info, used in testing or not,
- add the wanted xml elements to the XSL header (Saara?) (source info is
- outworned (ie not suitable for speller testing any more, only for regression
- lexicalised (all unknown, correctly spelled words added to lexicon)
- Conduct tests on new beta versions on the basis of the unspoiled gold standard
- alternatively: make test scripts that will run the tests automatically,
- include the ones already tested in the testing/ catalogue
- test 0.3 on the same texts
- test each version before beta release
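The three extraction modes planned for ccat above can be illustrated with a minimal Python sketch. It assumes the error markup is a single misspelled word immediately followed by § and its correction, as in erorr§errors. The function names are hypothetical, and this is not the actual ccat implementation.

```python
import re

# Assumption: markup is "<misspelled word>§<corrected word>" with no spaces,
# e.g. "erorr§errors". Multi-word or hyphenated corrections are not handled.
MARKUP = re.compile(r"(\w+)§(\w+)")

def with_errors(text):
    """Return the text with the original errors (input to a speller test)."""
    return MARKUP.sub(lambda m: m.group(1), text)

def with_corrections(text):
    """Return the text with the available corrections applied."""
    return MARKUP.sub(lambda m: m.group(2), text)

def error_pairs(text):
    """Return all and only the errors with their corrections, tab-separated."""
    return "\n".join(f"{err}\t{corr}" for err, corr in MARKUP.findall(text))

sample = "This has erorr§errors and a tpyo§typo."
print(with_errors(sample))
print(with_corrections(sample))
print(error_pairs(sample))
```

Keeping the three views as separate functions over the same marked-up source means the gold standard is stored only once and cannot drift out of sync with the test input.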
Testing tools
TODO:
- get updated Polderland testing tools (Sjur)
- got them - they're working excellently (somewhat weird output, though)
- document the AppleScript testing tool (Sjur)
- not finished
- write tools for statistical analysis of test results (Sjur)
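As a starting point for the statistical analysis, here is a minimal Python sketch that compares the words a speller flags against the gold-standard errors of a test text. The choice of precision and recall as measures is an assumption (the minutes do not say which statistics are wanted), and the function name is hypothetical.

```python
def speller_scores(gold_errors, flagged_words):
    """Compare flagged words with gold-standard errors.

    Both arguments are iterables of word forms. Precision and recall are
    assumed measures, not something decided in the meeting.
    """
    gold = set(gold_errors)
    flagged = set(flagged_words)
    true_pos = len(gold & flagged)    # real errors that were flagged
    false_pos = len(flagged - gold)   # correct words wrongly flagged
    false_neg = len(gold - flagged)   # real errors that were missed
    precision = true_pos / (true_pos + false_pos) if flagged else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return {"precision": precision, "recall": recall,
            "false_positives": false_pos, "false_negatives": false_neg}

scores = speller_scores({"erorr", "tpyo", "wrod"}, {"erorr", "tpyo", "Divvun"})
print(scores)
```

Comparing by word form ignores repeated occurrences and word position; a fuller tool would probably key on position in the document as well.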
Regression tests
TODO:
- add extraction of all known spelling errors in the corpus (not the
- test the typos.txt list, and check that all entries are properly corrected
- consider how to do a regression self-test, ie, how to test the full
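The typos.txt check could be sketched roughly as follows in Python. Two things are assumptions here: that each line of the list holds a typo and its expected correction separated by a tab, and that the speller can be called through a function returning a list of suggestions. The `suggest` argument is a hypothetical stand-in for whatever interface the testing tools actually offer.

```python
def failed_corrections(lines, suggest):
    """Return (typo, expected) pairs that the speller does not correct.

    lines:   iterable of "typo<TAB>correction" strings (assumed format)
    suggest: hypothetical callable mapping a word to a list of suggestions
    """
    failures = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        typo, expected = line.split("\t", 1)
        if expected not in suggest(typo):
            failures.append((typo, expected))
    return failures

# Stand-in speller used only for illustration.
fake_suggestions = {"erorr": ["errors", "error"], "tpyo": ["type"]}
result = failed_corrections(
    ["erorr\terrors\n", "tpyo\ttypo\n"],
    lambda word: fake_suggestions.get(word, []),
)
print(result)
```

An empty result would mean that every listed typo is properly corrected, which makes the function usable as a pass/fail regression check.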
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme (Thomas)
- translate beta release docs to smj (Thomas)
Lexicon conversion to the PLX format
TODO:
- Look at bottlenecks in existing code (Tomi, Børre)
- done - solved
- Look at xfst ways of doing it (Sjur, Trond, ...)
- done for verbs
- add derivations to the PLX generation (Tomi)
- done
- add gt/cwb/paradigm.smj.txt file into gt/script/server_anl.pl
- add prefixes to the PLX (Børre)
- middle nouns (Børre)
- add Makefile target for PLX conversion of lexc files (Tomi):
- adjectives
- nouns
- propernouns
- verbs derived into other POSes
- verbs - must be done on gtsvn.uit.no
- produced by the paradigm server on victorio? or regenerate every night, and
- make conversion test sample; add conversion testing to the make file
- improve number conversion (Børre, Tomi)
Public Beta release
Due to the problems with generating the PLX files discussed above, we need to
Linguistic issues still open:
- prefixes (eahpe, ii-) (Børre)
- middle nouns (LEXICON: lexc: Rmiddle, plx: L) (Børre)
DONE:
- delivered PLX data of sme and smj including compounding
- translated Windows installer to sme and smj
- installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for
- added resources needed for compiling PLX lexicons to our cvs repo
- tested the beta drop from Polderland - good we did, it is absolutely
- questions for Polderland:
- version info in the speller
- remaking/updating the installer packages with linguistic updates
- add compilation of MS Office spellers part of the Makefile
- install Windows and MS Office; test tools on Windows
TODO:
- improved smj speller (incl. derivations and compounds) (Sjur, Tomi)
- finish press release (Sjur)
- add info to front page (incl. download links) (Børre)
- write separate page with detailed info (incl. download links) (Børre)
- collect a list of PR recipients, forward to Berit Karen Paulsen
Version identification of speller lexicons
See the Norwegian spellers for an example, with the trigger string tfosgniL.
Suggestion:
nuvviD -> Divvun
nuvviD -> Dávvisámegiella
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007 (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek
These correction rules (and their corresponding PLX entries) should be added
TODO:
- add version info to the generated speller lexicons (Børre, Sjur, Tomi)
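The suggested version rules could be generated automatically along these lines. This is a minimal Python sketch: the trigger string is the project name reversed, as in the suggestion, and spaces are replaced with underscores. The function name and the output as plain "trigger -> value" lines are assumptions; the real PLX entry format is not shown here.

```python
def version_rules(trigger_word, values):
    """Build 'reversed-trigger -> value' correction rules.

    trigger_word is reversed to form the trigger string ("Divvun" gives
    "nuvviD"); spaces in values become underscores, as in the suggestion.
    """
    trigger = trigger_word[::-1]
    return [f"{trigger} -> {value.replace(' ', '_')}" for value in values]

rules = version_rules(
    "Divvun",
    ["Divvun", "Dávvisámegiella", "Veršuvdna 1.0b1", "Sjur Nørstebø Moshagen"],
)
print("\n".join(rules))
```

The version and date values would be filled in by the build, for example from the cvs tag and the build date as the suggestion proposes.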
10. Other
Project meeting IRL
Reserve the whole week after easter for a project gathering, probably in
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- delayed until the public beta is out the door
Updates:
- MacOS: 10.4.9
- SubEthaEdit: 2.6.2
Bug fixing
48 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 26.3.2007, 09:30 Norwegian time.
The meeting was closed at 11:36.
Appendix - task lists for the next week
Børre
- update and fix our documentation and infrastructure as Steinar finds
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- add prefixes to the PLX conversion
- add middle nouns to the PLX conversion
- improve number PLX conversion
- add sma texts to the corpus repository
- Improve automatic alignment process
- Add potential speller test texts
- collect a list of PR recipients, forward to Berit Karen Paulsen
- add version info to the generated speller lexicons
- run all known spelling errors in the corpus through the speller
- test the typos.txt list, and check that all entries are properly corrected
- consider how to do a regression self-test
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- continue aligning the rest of the parallel files
- prepare more files for manual alignment
- start improving the corpus interface for Sámi in Oslo.
- mark-up the added speller test texts, using our existing xml format
- improve cgi-bin scripts
- add new features to the paradigm generator
- add gt/cwb/paradigm.smj.txt file into gt/script/server_anl.pl
- add new XSL/XML headers for proofing test docs
- fix bugs!
Sjur
- hire linguist
- finish press release for the beta
- collect a list of PR recipients
- add version info to the generated speller lexicons
- run all known spelling errors in the corpus through the speller
- consider how to do a regression self-test
- document the AppleScript testing tool
- write tools for statistical analysis of test results
- make improved smj speller (incl. derivations and compounds)
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
- Complete the semantic sets in sme-dis.rle
- missing lists
- Look at the actio compound issue when adding from missing lists
- Align corpus manually
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- translate beta release docs to sme and smj
- Add potential speller test texts
- fix bugs!
Tomi
- make improved smj speller (incl. derivations and compounds)
- add Makefile target for PLX conversion of lexc files
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- update ccat to handle error/correction markup
- add version info to the generated speller lexicons
- fix bugs!
Trond
- Test the beta versions
- Work on the parallel corpus issues
- Discuss with Anders
- Work on the aligner with (Børre)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Add potential speller test texts
- collect a list of PR recipients
- fix bugs!