Meeting_2006-11-27
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 27.11.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10: 07.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact writers who already have received contracts
- Not done
- Not done
- consider a script for automatic testing of the spell checker in Word
- Downloaded documentation on Microsoft Office and AppleScript
- Downloaded documentation on Microsoft Office and AppleScript
- consider more testing routines
- Not done
- Not done
- update Maaren's Forrest installation to r430284
- Not done
- Not done
-
sma discussions with SD (with Sjur, Trond)
- Not done
- Not done
- add a simple password protection to risten.no in the G5
- Not done
- Not done
- consider infra for testing feedback
- Not done
- Not done
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- Had a short discussion with Sjur about it.
- Had a short discussion with Sjur about it.
- check corpus contract issue
- Done
- Done
- port all i18n work to the main branch (from the i18n branch)
- Done, but haven't set up the web server to use it.
- Done, but haven't set up the web server to use it.
- update all forrest installations, including local patches
- It is available as: http: //divvun.no/static_files/forrest_divvun.tar.gz
- It is available as: http: //divvun.no/static_files/forrest_divvun.tar.gz
- send in hotel room list for employee seminar
- Done
- Done
-
fix bugs!
- Not done
- Not done
- Other work:
- Aspell: Contacted mailing list, doing some research.
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
- not done
- not done
- send in hotel room list for employee seminar
- done
Saara
- finalize server of the Xerox tools.
- did some testing with parallel processing.
- did some testing with parallel processing.
- help Trond with some shell commands
- update namelex2xml
- done
- done
- add nor output file to namelex2xml
- done by Sjur.
- done by Sjur.
- update hyph-filter.pl
- done
- done
-
fix bugs!
- done some
- done some
- other
- work with analyzed corpus: bug fixing, more efficient analysis,
- work with analyzed corpus: bug fixing, more efficient analysis,
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- done bug fixing in the existing risten.no code, + some additional testing
- done bug fixing in the existing risten.no code, + some additional testing
- refactor SD-terms editor code
- hire linguist and programmer
- make new list of unrecognised forms in the hyphenator
- done, Trond sent it to Maaren
- done, Trond sent it to Maaren
- investigate unrecognised word forms in the hyphenator
- not done
- not done
- decide how to specify compounding behaviour info in the lexicon
- still not decided, we need a short meeting to settle this one
- still not decided, we need a short meeting to settle this one
-
sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- consider a script for automatic testing of the spell checker in Word
- consider more testing routines
- topic added to this meeting schedule
- topic added to this meeting schedule
- consider infra for testing feedback
- topic added to this meeting schedule
- topic added to this meeting schedule
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- redo the derived words work for smj
- mostly done by Thomas
- mostly done by Thomas
- revitalize the Aspell work
- done by Børre
- done by Børre
- update the generating transducer to only be normative
- done
- done
- discuss clitics in PLX format with Polderland
- done
- done
- publish corpus contracts and project infra on NoDaLi-sta
- not done
- not done
- send in hotel room list for employee seminar
- everyone sends in him-/herself
- everyone sends in him-/herself
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- not done
- redo the derived words work for smj
- finished soon
- finished soon
- investigate unrecognised word forms in hyphenator
- not done
- not done
- decide how to specify compounding behaviour info in the lexicon
- not done
- not done
- check why some SUB-marked entries got included in the normative transducer
- not done
- not done
- create / check the paradigm grammar
- done
- done
- investigate the generated word form list sent to Polderland - use the command
- done
- done
- send in hotel room list for employee seminar
- done
- done
-
fix bugs!
- not this week
Tomi
- make PLX lexicon for all open POSes
- Not done
- Not done
- add derivations to the PLX generation
- not done
- not done
- add compound stems to the PLX generation
- not done
- not done
- add closed POS and clitics to PLX generation
- not done
- not done
- revitalize the Aspell work
- send in hotel room list for employee seminar
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Not done
- Not done
- get more sma texts
- Will have to wait until people return from sick leave.
- Will have to wait until people return from sick leave.
- update the corpus tag list in the cwb/ directory
- Done, will have to work more on this.
- Done, will have to work more on this.
- investigate unrecognised word forms in the hyphenator
- Not done.
- Not done.
- decide how to specify compounding behaviour info in the lexicon
- Not done.
- Not done.
-
sma discussions with SD (with Børre, Sjur)
- Not done.
- Not done.
- redo the derived words work for smj
- Not done.
- Not done.
-
fix bugs!.
- At least some done.
3. Documentation
There's a problem with the reliability of the Jetty (forrest run) version of
TODO:
- update all forrest installations, including local patches (Børre)
- created a tailored forrest distribution to download and install
- created a tailored forrest distribution to download and install
- port all i18n work to the main branch (from the i18n branch) (Børre)
- done
4. Corpus gathering
We finally received the Áššu and Min Áigi files from 1995 to 2001. Many of the
No newer files received from Áššu, Børre should contact them again.
To be able to convert WriteNow files (WriteNow is a long dead MacOS Classic application), we need the
More South Sámi files are forthcoming soon.
TODO:
- get sma Bible / NT texts (Trond)
- not done due to illness
- not done due to illness
- Discussions with the Sámi Parliament about sma ( Børre, Sjur, Trond)
- send the filename renaming script to NSI; get their corpus (Børre)
- done
- done
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)
- not done
5. Corpus infrastructure
More texts to the graphical corpus interface:
Lars suggested changes (<p> only policy) to the corpus DTD (in news). Good
TODO:
- Consider whether to implement the <p> only policy or not (Saara)
- add text to the server (Lars)
- Discuss with Lars (Trond)
Aligner
The latest cvs version of our copy of the Aligner can be run from the command
The exact instructions on how to compile and run the program is in the README.txt file. Basic usage as follows:
tca2 -a anchorfile samifile otherlangfile
TODO:
- gather more parallel texts (Trond)
- re-analyze parallel files using the command-line version (Saara)
- make a runnable version available on both the g5 and victorio (Børre)
- Done
Language recognition
TODO:
- get more sma texts, first the Bible / NT (Trond)
6. Infrastructure
Xerox tools wrapped as servers
The PLX format for clitics have been checked with Polderland, and we should be
TODO:
- finalise the paradigm generator (requires paradigm grammar) (Saara)
- done, except needs to remove the clitics
- done, except needs to remove the clitics
- fix the corpus tag list in the cwb/ directory (Trond)
- The list now matches the actual tags.
- The list now matches the actual tags.
- create / check the paradigm grammar as exemplified above (Thomas)
- done
- done
- update the generating transducer to only be normative, and without the
- done
- done
- discuss clitics in PLX format with Polderland (Sjur)
- done
Hyphenator
The list of unrecognised words is found in victorio in /tmp/hyph-errors.txt,
Typical problem words/reports (in hyph-errors.txt) are:
No forms for šveicalažžanat No forms for šveicalažžaneamet No forms for šveicalažžaneame No forms for šveicalažžanan No forms for šveicalažžanis No forms for šveicalažžaneaset ....
TODO:
- make new list of unrecognised forms (Sjur)
- done
- done
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- not done
7. Linguistics
Names and multilinguality
TODO:
- finish first version of the editing (Sjur)
- working on it
- working on it
- add @type=secondary and @excl=speller,hyph to all names marked with !SUB
- done
- done
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
- merge placenames which are errouneously in different entries: e.g. Helsinki,
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
Derivation and spellers like Aspell
TODO:
- redo the work for smj, including discussion regarding Actio
- Thomas is working on it.
North Sámi
The following words are included in the normative list despite being marked
accompagnerejun ábuhuvvože ábuhuvvože ábuhit+V+TV+Pass+Pot+Prs+Du1 áccohallagođežedne
TODO:
- investigate the generated word form list sent to Polderland - use the command
- check why some SUB-marked entries got included in the normative transducer
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
- new suggestion in the meeting
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- add a simple password protection to risten.no in the G5 (Børre)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
9. Spellers
Polderland data generation
All nouns are now generated in PLX, not yet the other POSes. Also the closed
Also, there might be possible to ask lookup to print out the whole paradigm.
We need to close the compound specification discussion. We'll have a meeting for
TODO:
- write paradigm grammar for the closed POSes (Thomas)
- check whether lookup can be used to generate paradigms for closed POSes
- decide how to specify compounding behaviour info for the lexicon
- meeting Tuesday at 12.30
- meeting Tuesday at 12.30
- update the hyphenation filter to always return one - if there are multiple
- done
- done
- make PLX lexicon for all open POSes (Tomi)
- add derivations to the PLX generation (Tomi)
- add compound stems to the PLX generation (Tomi)
- add closed POS and clitics to PLX generation (Tomi)
Aspell
In preparation for the Aspell work, Børre has sent an e-mail to the Aspell
> What features in hunspell would you specifically like to have in aspell? Possibly: - Max. 65535 affix classes and twofold affix stripping - Handling conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms. - Support complex compoundings I believe some of these will benefit you. However I only want to implement them if these is a clear benefit to it. For example based on what several people have told be complex compounding rules are not worth it. Aspell is far more complex then Myspell and each feature needs to implemented carefully so that it will behave sensibly with the suggestion code. Also it is important that the addition of the feature won't degrade performance, Especially when the feature isn't used.
The whole thread is available here .
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
When the PLX-based speller is ready: use the generated word list as test input:
We need a meeting to plan testing. We'll do it shortly this week, and perhaps
TODO:
- meeting Wednesday at 9.30 (Sjur, Børre)
- consider a script (shell+AppleScript?) for automatic testing of MS Word
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
Bug fixing
57 open Divvun/Disamb bugs, and 24 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Task lists as iCal entries
TODO:
- update Maaren's Forrest installation to r430284 (Børre)
Employee seminar in Alta
TODO:
- send in hotel room list (Divvun)
- done (all)
11. Next meeting, closing
The next meeting is 4.12.2006, 09: 30 Norwegian time.
The meeting was closed at 11: 27.
Appendix - task lists for the next week
Boerre
- contact authors who have already received the corpus licensing contract
- consider a script for automatic testing of the spell checker in Word
- meeting Wednesday at 9.30 with Sjur to plan testing
-
sma discussions with SD (with Sjur, Trond)
- add a simple password protection to risten.no in the G5
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- update all forrest installations, including local patches
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
Saara
- finalize server of the Xerox tools.
- help Trond with some shell commands
- re-analyze parallel files
- consider implementing some new features to the corpus files
- add closed POSes to the paradigm gen, if needed.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- hire linguist and programmer
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- meeting Tuesday 12.30
-
sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- consider a script for automatic testing of the spell checker in Word
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- publish corpus contracts and project infra on NoDaLi-sta
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts
- check whether lookup can be used to generate paradigms for closed POSes
- meeting Wednesday at 9.30 with Børre to plan testing
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- redo the derived words work for smj
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- meeting Tuesday 12.30
- check why some SUB-marked entries got included in the normative transducer
- investigate unrecognised word forms in hyphenator
- test editing of the xml files
- clean terms-sme.xml such that all names have the correct tag for their use
- write paradigm grammar for the closed POSes
- fix bugs!
Tomi
- make PLX lexicon for all open POSes
- add derivations to the PLX generation
- add compound stems to the PLX generation
- add closed POS and clitics to PLX generation
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- meeting Tuesday 12.30
-
sma discussions with SD (with Børre, Sjur)
- redo the derived words work for smj