Meeting_2007-04-10
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 10.04.2007
- Time: 10.30 Norwegian time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:51.
Present: Børre, Maaren, Sjur, Steinar, Thomas, Tomi
Absent: Saara, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- update and fix our documentation and infrastructure as Steinar finds
- not done
- find missing nob parallel texts in corpus
- not done
- improve number PLX conversion
- not done
- add sma texts to the corpus repository
- not done
- improve automatic alignment process
- Had some discussions with Trond. We found out that the anchor list must
- collect a list of PR recipients, forward to Berit Karen Paulsen
- not done
- run all known spelling errors in the corpus through the speller
- not done
- add extraction of all known spelling errors in the regular corpus (not the
- not done
- fix bugs!
- not done
- other:
- built and uploaded fink containing gobby for intel macs. began work on ppc
- made scripts to download articles from minaigi.no and added them to the
Maaren
- lexicalise actio compounds
- trying to do this
- Manually mark speller test documents for typos
- started working
Saara
- prepare more files for manual alignment
- in progress
- mark-up the added speller test texts, using our existing xml format
- done
- improve cgi-bin scripts
- add new features to the paradigm generator
- not done
- add new XSL/XML headers for proofing test docs
- started discussion in newsgroup
- continue with speller test data
- done
- compilation of verb lists
- not done
- speed in smj conversion
- not done
- fix bugs!
Sjur
- finish press release for the beta
- not yet
- collect a list of PR recipients
- not yet
- improve version info in the speller lexicons
- not yet
- run all known spelling errors in the corpus through the speller
- not yet
- document the AppleScript testing tool
- not yet
- write tools for statistical analysis of test results
- done
- integrate regression self tests with the make file
- not yet
- make improved smj speller (incl. derivations and compounds)
- worked a lot on this - we generate 66 Gb(!) of data, and in the end the
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
- Started work, added two XML texts (correct.xml)
- Complete the semantic sets in sme-dis.rle
- no work this week
- missing list
- added terminology from missing sami linguistics and literature lists
- Look at the actio compound issue when adding from missing lists
- not done
- Align corpus manually
- fix bugs!
Thomas
- work with compounding
- working hard
- Lack of lowering before hyphen: Twol rewrite.
- nothing this week
- translate beta release docs to sme and smj
- not done
- Add potential speller test texts
- not done
- fix bugs!
- participated
Tomi
- make improved smj speller (incl. derivations and compounds)
- worked on it
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- improve prefix and middle-noun PLX conversion
- fix bugs!
Trond
- Test the beta versions
- Work on the parallel corpus issues
- Discuss with Anders
- Work on the aligner (with Børre)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Add potential speller test texts
- collect a list of PR recipients
- fix bugs!
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- delayed till after the beta release
- document how to apply for access to closed corpus, and details on the corpus
- delayed till after the beta release
- correct and improve it based on feedback from Steinar (Børre)
- low priority
- beta documentation (see separate beta section below)
4. Corpus gathering
Børre has added texts from Min Áigi to the prooftest corpus dir.
TODO:
- sme texts: no new additions, fix corpus errors during this month
- missing nob parallel texts should be added if such holes are found
- Go through the list of missing or erroneous nob texts, based upon
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Alignment
TODO
- go through other directories (nob directories, sd directories), fix
- Improve the automatic process:
- Improve the anchor list and realign (Trond, Børre)
- Only adding words does not improve alignment, you have to consider the format
- The documents still have some formatting issues which cause trouble in
- Test and improve settings in the aligner
- Align manually (Trond, Steinar) (especially shorter terminological texts)
6. Infrastructure
TODO:
- update and fix our documentation and infrastructure as Steinar finds
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- postponed till after the public beta
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- implement data synchronisation between risten.no and
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are erroneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
9. Spellers
OOo speller(s)
TODO after the MS Office Beta is delivered:
- add Hunspell data generation to the lexc2xspell (Tomi - after the
- study the Hunspell formalism in detail (Børre, Sjur, Tomi)
Testing
Spelling Error Markup
Procedure for marking up:
- pick a file in:
- rename it from .xml to .correct.xml:
- copy to your own computer
- open in SEE or XMLEditor
- add manual markup according to the established convention
- when done, copy the file back to victorio - see dir structure below
Directory structure and file locations for manually corrected files:
1  prooftest/.../orig/file.html
   loading
1  prooftest/.../orig/file.html.xsl
   converting
2  prooftest/.../bound/file.html.xml
   to this file, copying back to orig as 3a
3a prooftest/.../orig/file.html.correct.xml    speling§spelling
   working on this manually, using RCS to check in each generation of manual markup
3b prooftest/.../bound/file.html.xml           <error corr="spelling">speling</error>
- missing: last changes in eror$error => <error> conversion, + last ccat option
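As an illustration of the conversion step from 3a to 3b, here is a minimal Python sketch that turns error§correction markup into <error> elements. It is not the project's actual conversion script. The function name markup_to_xml and the regular expression are hypothetical, and the sketch assumes that an error is a single whitespace-free token and that a correction is either one such token or a parenthesised string when it contains spaces.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed convention (hypothetical regex, based only on the examples in the minutes):
#   wrong§right            single-token correction
#   wrong§(right. words)   correction containing spaces, wrapped in parentheses
ERROR_MARKUP = re.compile(
    r"(?P<err>[^\s§]+)§(?:\((?P<multi>[^)]*)\)|(?P<single>[^\s§]+))"
)


def markup_to_xml(text: str) -> str:
    """Convert error§correction markup in plain text to <error corr="..."> elements."""
    pieces = []
    pos = 0
    for match in ERROR_MARKUP.finditer(text):
        # Copy the untouched text before this match, escaped for XML.
        pieces.append(escape(text[pos:match.start()]))
        corr = match.group("multi")
        if corr is None:
            corr = match.group("single")
        pieces.append(
            "<error corr=%s>%s</error>" % (quoteattr(corr), escape(match.group("err")))
        )
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(markup_to_xml("speling§spelling"))
    print(markup_to_xml("erorr.Og§(error. Og)"))
```

One known limitation of this sketch: punctuation directly attached to a single-token correction (for example a trailing full stop) becomes part of the corr attribute, so such cases would need the parenthesised form.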
TODO:
- Manually mark test texts for typos (making them into gold standards)
- erorr§errors
- when correcting to multiple strings: erorr.Og§(error. Og)
- update correction markup xml conversion to handle the second case
- change ccat to handle error/correction markup (Tomi)
- extract the whole text, both the original text and their corrections, in a
- not done
- Set up ways of adding meta-information (source info, used in testing or not,
- add the wanted xml elements to the XSL header (Saara) (source info is
- outworn (i.e. not suitable for speller testing any more, only for regression
- lexicalised (all unknown, correctly spelled words added to lexicon)
- Conduct tests on new beta versions on the basis of the unspoiled gold standard
- alternatively: make test scripts that will run the tests automatically,
- include the ones already tested in the testing/ catalogue
- test each version before beta release
Testing tools
A first version of statistics and test result processing is finished and
Test output can temporarily be found on
TODO:
- document the AppleScript testing tool (Sjur)
- write tools for statistical analysis of test results (Sjur)
- started, Saara continued, Sjur will continue with Forrest integration
- done first version, see link above.
- improve speller test bench (Sjur)
Regression tests
TODO:
- add extraction of all known spelling errors in the corpus (not the
- ccat now ready, it should be integrated in the Makefile (Sjur, Tomi)
- test the typos.txt list, and check that all entries are properly corrected
- consider how to do a regression self-test, ie, how to test the full
- extract all the base forms in the lexicon, and run them through the speller
- extract all SUB-marked entries, and run them through the lexicon
- integrate these in the make file (Sjur)
Localisation
We need to translate the info added to our front page (and a separate page)
TODO:
- translate beta release docs to sme (Thomas)
- translate beta release docs to smj (Thomas)
Lexicon conversion to the PLX format
Postverbal clitics
Numbers
Numbers as figures need to be generated up front. The output should be:
1 UILH
2 UILH
...
1000000 UILH
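A minimal Python sketch of generating this list up front. The function names are hypothetical, the single space between the figure and the UILH code is an assumption taken from the example above (the real PLX source may use another separator), and the range 1 to 1000000 is simply the one shown in the minutes.

```python
import sys
from typing import Iterator, TextIO


def number_entries(first: int = 1, last: int = 1_000_000, code: str = "UILH") -> Iterator[str]:
    """Yield one entry per figure, e.g. '1 UILH' (separator is an assumption)."""
    for n in range(first, last + 1):
        yield f"{n} {code}"


def write_number_entries(out: TextIO, first: int = 1, last: int = 1_000_000,
                         code: str = "UILH") -> int:
    """Write all entries to an open text stream and return how many were written."""
    count = 0
    for entry in number_entries(first, last, code):
        out.write(entry + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Small range for demonstration; the minutes ask for 1 to 1000000.
    write_number_entries(sys.stdout, 1, 5)
```

Generating the entries lazily keeps memory use flat even for the full range, and the output can be redirected into whatever file the PLX sources expect.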
TODO:
- add numbers as figures to the PLX sources (Børre)
Compounding restrictions
How to include compounding restriction comment tags in the transducers:
giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
=> (using a perl script or similar)
+SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !
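The rewrite above could be sketched as follows, here in Python rather than perl. This is only an illustration under stated assumptions: the function name hoist_compound_tags is hypothetical, restriction tags are assumed to match +...Cmp and to sit in the comment after the first "!" on the line, and lines with an escaped "%!" in the entry itself are not handled.

```python
import re

# Assumed tag shape: a plus sign, letters, and a trailing "Cmp" (e.g. +SgNomCmp).
CMP_TAG = re.compile(r"\+[A-Za-z]+Cmp\b")


def hoist_compound_tags(line: str) -> str:
    """Move compounding restriction tags from the comment to the front of the entry."""
    entry, bang, comment = line.partition("!")
    if not bang:
        return line  # no comment on this line, nothing to do
    tags = CMP_TAG.findall(comment)
    if not tags:
        return line  # comment carries no restriction tags
    rest = CMP_TAG.sub("", comment).strip()
    result = "".join(tags) + entry + "!"
    if rest:
        result += " " + rest  # keep any other comment text
    return result


if __name__ == "__main__":
    print(hoist_compound_tags("giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp"))
```

Lines without a comment or without restriction tags are returned unchanged, so the function can be applied to every line of a lexicon file.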
TODO:
- improve prefix conversion to PLX (Tomi)
- improve middle noun conversion to PLX (Tomi)
- improve noun + adjective PLX conversion: (Tomi)
- compounding stems - how do we generate them? Using the java client?
- compounding tags - we need to obey them when making the transducers.
- add propernouns to xfst-based conversion
- make conversion test sample; add conversion testing to the make file
- improve number conversion (Børre, Tomi)
- run xfst-based PLX conversion on victorio, make the result available on our
Public Beta release
Due to the problems with generating the PLX files discussed above, we need to
DONE:
- delivered PLX data of sme and smj including compounding
- translated Windows installer to sme and smj
- installed PLX compiler in G5 at /usr/local/bin/mklex (one version for
- added resources needed for compiling PLX lexicons to our cvs repo
- tested the beta drop from Polderland - good we did, it is absolutely
- add compilation of MS Office spellers part of the Makefile
- install Windows and MS Office; test tools on Windows
TODO:
- improved smj speller (incl. derivations and compounds) (Sjur, Tomi)
- add numbers, compound restrictions to sme speller if time permits
- add names to smj speller (Sjur)
- finish press release (Sjur)
- add info to front page (incl. download links) (Børre)
- write separate page with detailed info (incl. download links) (Børre)
- translate press release, web pages (Børre, Thomas, whoever)
- collect a list of PR recipients, forward to Berit Karen Paulsen
- test speller installers on Windows and Mac (Børre)
- update installer packages with latest speller lexicon (Børre, Sjur)
Version identification of speller lexicons
The date stamp isn't updated automatically; it needs to be.
TODO:
- make the date stamp reflect the compilation date automatically (Sjur)
10. Other
Project meeting IRL
The planned gathering will have to be on 16.-20.4., in Guovdageaidnu. All of
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- delayed until the public beta is out the door
Bug fixing
51 open Divvun/Disamb bugs, and 23 risten.no bugs
New team member
Per-Eric Kuoljok started working in the Divvun project April 1. He needs to get
TODO:
- set up computer (Børre, Sjur)
- install all required software (Børre, Sjur)
- set up all user accounts (Sjur, Trond)
11. Next meeting, closing
The next meeting is 23.4.2007, 09:30 Norwegian time.
The meeting was closed at 11:34.
Appendix - task lists for the next week
Børre
- update and fix our documentation and infrastructure as Steinar finds
- find missing nob parallel texts in corpus
- improve number PLX conversion
- add sma texts to the corpus repository
- improve automatic alignment process
- collect a list of PR recipients, forward to Berit Karen Paulsen
- run all known spelling errors in the corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
- test speller installers on Windows and Mac
- set up Per-Eric's computer
- install all required software on Per-Eric's computer
- update installer packages with latest speller lexicon
- add numbers, compound restrictions to both spellers if time permits
- add numbers as figures to the PLX sources
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Saara
- prepare more files for manual alignment
- improve cgi-bin scripts
- add new features to the paradigm generator
- add new XSL/XML headers for proofing test docs
- compilation of verb lists
- speed in smj conversion
- fix bugs!
Sjur
- finish press release for the beta
- collect a list of PR recipients
- make the version info date stamp reflect the compilation date automatically
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- make improved smj speller (incl. derivations and compounds)
- set up Per-Eric's computer
- install all required software on Per-Eric's computer
- set up all user accounts for Per-Eric
- improve speller test bench
- add names to smj speller
- update installer packages with latest speller lexicon
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
- Complete the semantic sets in sme-dis.rle
- missing lists
- Look at the actio compound issue when adding from missing lists
- Align corpus manually
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
- translate beta release docs to sme and smj
- Add potential speller test texts
- fix bugs!
Tomi
- make improved smj speller (incl. derivations and compounds)
- add numbers, compound restrictions to both spellers if time permits
- make PLX conversion test sample; add conversion testing to the make file
- improve number PLX conversion
- improve prefix and middle-noun PLX conversion
- fix bugs!
Trond
- Test the beta versions
- Work on the parallel corpus issues
- Discuss with Anders
- Work on the aligner (with Børre)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara's list
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
- Go through the Num bugs
- Improve automatic alignment process
- Align corpus manually
- Add potential speller test texts
- collect a list of PR recipients
- set up all user accounts for Per-Eric
- fix bugs!