Meeting_2006-10-16
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix -task lists for the next week
Meeting setup
- Date: 16.10.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10: 11.
Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond
Absent: Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- contact Ája (Kåfjord)
- Not done
- Not done
- discuss access to older Min Áigi and Áššu files with Richard Valkeapää
- I will provide him with a renaming script, and then we will get the Áššu and Min Áigi files.
- I will provide him with a renaming script, and then we will get the Áššu and Min Áigi files.
- contact Ája (Kåfjord)
- Move norwegian documents in Min Áigi from sme to nob
- Not done
- Not done
- set up Bugzilla automatic reminders for open issues
- Not done
- Not done
- finish Forrest i18n work (pdf)
- Not done
- Not done
- set up Tomcat for use with eXist and the propnouns db on the G5
- Not done
- Not done
- investigate use of proxies for Maaren, for her to be able to use iChat AV
- Done. SquidMan is a possible solution, will test as soon as possible
- Done. SquidMan is a possible solution, will test as soon as possible
-
fix bugs!
- Not done
- Not done
- Other
- Worked a lot with the aligner, tca2.
Maaren
- download and install latest Marratech
- with the help of SquidMan and http proxy, this is now obsolote - good bye,
- with the help of SquidMan and http proxy, this is now obsolote - good bye,
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- generate parallel corpus files manually (with Trond)
- Improve text_cat
- export corpus tools to location available to all (with cron), cf news disc.
- Improve hyph-filter.pl
- Implement M4-rules for sme-dis.rle
- improve smj LM based on the new, larger corpus
- help Trond with some shell commands
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- started
- started
- implement missing propnouns editing functions
- nothing
- nothing
- implement improvements decided upon in Tromsø
- nothing
- nothing
- refactor SD-terms editor code
- move corpus user doc issue to Bugzilla
- nope
- nope
- hire linguist and programmer
- nothing last week
- nothing last week
- finish i18n work of Forrest
- nothing
- nothing
- install eXist and our local copy of risten.no and propnouns on the G5
- nope
- nope
- add normative M4 macros to the twol code; utilize it in the make file
- done
- done
-
fix bugs!
- fixed one with Thomas and Trond
- fixed one with Thomas and Trond
- other things done:
- spent a lot of time improving hyphenation and speller generation for
- spent a lot of time improving hyphenation and speller generation for
Thomas
- work with Polderland phonetic rules for smj
- done
- done
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
- nothing this week
- find and study all derived verbs in our corpus (depends on Saara)
- have not started yet
- have not started yet
- suggest which derivations could be generated (depends on Saara)
- not yet started
Tomi
- continue implementation of the speller lexicon conversion
- continued
- continued
-
fix bugs!
- fixed some bug
Trond
- Discuss shell script wrapper teaksta.sh with Saara
- Not done
- Not done
- write user account form, probably ask for copy of existing ones from the IT
- Not done
- Not done
- write documentation for our bound users, with pointers to the ordinary
- Not done
- Not done
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Not done
- Not done
- Get more sma texts to improve language recognition
- Not done
- Not done
- study paragraphs with mixed content
- Not done
- Not done
- add corpus artice framework to Bugzilla
- Not done
- Not done
- add corpus user accounts and access issues to Bugzilla
- Not done
- Not done
-
fix bugs!.
- As can be seen, this week has been devoted to totally different things, first
3. Documentation
TODO:
- finish i18n work (Børre and Sjur)
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- i18n does not work in PDF ("Table of Content" won't translate)
- still doesn't work
- still doesn't work
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- move these issues into Bugzilla (Sjur)
- User documentation probably in several languages. This covers how to apply
4. Corpus gathering
Børre talked to Richard Valkeapää, they have several computers, and have
TODO:
- discuss corpus transfer with NSI (Børre)
- continue to contact authors (Børre)
5. Corpus infrastructure
User accounts and access
The whole issue should be moved into Bugzilla as several tasks.
TODO:
- add the issue with subissues to Bugzilla (Trond)
More texts to the graphical corpus interface:
Trond, Børre and Saara aligned and sent the first parallell corpus
TODO:
- align texts, analyse, and send to Lars (Trond, Saara)
- add text to the server (Lars)
Aligner
Børre has made some important improvements to the Bergen Aligner. The
TODO:
- gather parallel texts (Trond)
-
Trond is working on it
-
Trond is working on it
- try out NT alignment strategies (Saara)
Language recognition
TODO:
- get more sma texts, first the Bible / NT (Trond)
- improve the smj LM (Saara)
- done
- done
- what about paragraphs with mixed content? Build a corpus of such paragraphs
6. Infrastructure
Xerox tools wrapped as servers
The Divvun project needs generation pretty soon, and Tomi can add generation
TODO:
- improve and finish the present prototype (Saara)
- add generator (Tomi)
Hyphenator
Sjur has hyphenated the pl-wordlist.txt file (the full-form word list
Hyphenation for smj still does not work, as the hyphenation rules file does
Overgeneration problem:
Abakalikilaččabuččaideamet A^ba^ka^li^kilač^ča^buč^čai^dea^met Abakalikilaččabuččaideamet A^ba^ka^li^kilač^ča^buč^čai^dea^met Abakalikilaččabuččaideamet A^ba^ka^li^kilep^po^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^kilep^po^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^kilab^bo^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^kilab^bo^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^ki#lač^ča^buč^čai^dea^met Abakalikilaččabuččaideamet A^ba^ka^li^ki#lep^po^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^ki#lep^po^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^ki#lab^bo^žiid^dá^met Abakalikilaččabuččaideamet A^ba^ka^li^ki#lab^bo^žiid^dá^met
Unrecognised input:
abbalaččaboiid abbalaččaboiid +? Abbalaččaboiidda Abbalaččaboiidda +? abbalaččaboiidda abbalaččaboiidda +? Abbalaččaboiiddáde Abbalaččaboiiddáde +? abbalaččaboiiddáde abbalaččaboiiddáde +? Abbalaččaboiiddádeguin Abbalaččaboiiddádeguin +?
What is the percentage of unrecognized input? (i.e. of word forms generated by
- Wordlist generated = 27 354 884
- unrecognised wordforms = 11 284 570
- percentage unrecognised = 41,3 %
This is way too high, and must be investigated. The input is generated from our
TODO:
- Update the sma hyphenator rule set with the insights gained from smj
- linguistic issues in the hyphenator: ( Sjur using M4)
- shortening in compounds (grossly overgenerates in the hyphenator output)
- done - the shortening rules are now filtered out by M4 when compiling
- done - the shortening rules are now filtered out by M4 when compiling
- shortening in compounds (grossly overgenerates in the hyphenator output)
- consider case conversion problems (Saara)
- check #- when comparing input string with hyphenated string - the hyphen
- investigate unrecognised input word forms (Maaren, Thomas, Trond, Sjur)
Automatic Bugzilla reminder for untouched bugs
Some perl-libraries needed by Bugzilla weren't in the path, causing it to not
TODO:
- fix the remaining issues (Børre)
M4
TODO:
- make speller make targets that utilise M4 to produce normative
- done
- done
- disamb variants for derivations (see next) (Saara)
- include make as part of a shell script, which will run the M4 processor on
- done - but the disambiguation REMOVE rules need to be added in again for
- done - but the disambiguation REMOVE rules need to be added in again for
- include make as part of a shell script, which will run the M4 processor on
7. Linguistics
Names and multilinguality
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
Observations:
1) Multilinguality is always optional.
2) We can observe that "foreign" names in texts follows a domination pattern:
3) When looking at our name classification, multilinguality varies according to:
Ani - weak/none? (pet, myth anim. names) Fem - weak (informative) Mal - weak (informative) Obj - strong Org - strong Plc - strong for the national and country names, weak (informative) for foreign names Sur - none Tit - strong (titles)
Suggestion:
We need to reconsider the all names in all languages policy. That policy is
A further issue is whether we should reconsider our cohort policy. Today, Sur
"<Trosterud>" "Trosterud" N Prop Sur Sg Nom <<< @HNOUN "Trosterud" N Prop Plc Sg Nom <<< @HNOUN "<Trosterud>" "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN "<Trosterud>" "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
Derivation and spellers like Aspell
- enclose the CG rule that preferres lexicalised forms over derivations in M4
- done
- done
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
We have a lot of overgeneration, mainly possible derivations that are spelled
TODO:
- investigate the generated word form list sent to Polderland - use the command
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- turn Tomcat on on the G5; send admin username and password to Sjur
- add eXist and the proper noun interface to the G5 (Sjur)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
9. Spellers
News
The Greenlandic speller was released today. It is available at the website of
Speller data generation
It reads the lexc files, reads each word, and requests (or will request) the
Still missing: the generator isn't yet available, Tomi should implement a
- single word forms
- reduced paradigm
- full paradigm (all possessives, clitics etc, including compound forms)
The Java code needs to be checked in into cvs, suggested location:
gt/src/lexc2xspell
TODO:
- add code to cvs (Tomi)
- implement generator server based on Saara's code (Tomi)
- specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
- add hyphenation points to the generated output (Tomi)
10. Other
Bug fixing
66 open Divvun/Disamb bugs, and 24 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and the SD Firewall
TODO:
- investigate use of proxies for Maaren ( Børre)
- done and solved nicely!
- done and solved nicely!
- document SquidMan use at SD (Børre)
Task lists as iCal entries
TODO:
- update Maaren's installation to r430284 (Børre)
Words section
You all need to check out CVS/words, and link to the relevant place, cf how
11. Next meeting, closing
Next meeting 23.10.2006 at 9: 30.
Sjur will be away on Thursday and Friday this week.
Closed at 11: 16.
Appendix -task lists for the next week
Boerre
- corpus collection:
- contact Ája (Kåfjord)
- discuss access to older Min Áigi and Áššu files with Richard Valkeapää
- contact Ája (Kåfjord)
- Move norwegian documents in Min Áigi from sme to nob
- set up Bugzilla automatic reminders for open issues
- finish Forrest i18n work (pdf)
- set up Tomcat on the G5 for use with eXist and the propnouns db, as well as
- document SquidMan use at SD
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised input word forms in the hyphenator
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- generate parallel corpus files manually (with Trond)
- Improve text_cat
- export corpus tools to location available to all (with cron), cf news disc.
- Improve hyph-filter.pl
- help Trond with some shell commands
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- move corpus user doc issue to Bugzilla
- hire linguist and programmer
- finish i18n work of Forrest
- install eXist and our local copy of risten.no and propnouns on the G5
- investigate unrecognised input word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived verbs in our corpus
- suggest which derivations could be generated
- investigate unrecognised input word forms in hyphenator
- investigate the generated word form list sent to Polderland - use the command
- decide how to specify compounding behaviour info in the lexicon
Tomi
- continue implementation of the speller lexicon conversion
- make generator as server, based on Saara's code
- add lexc2xspell code to cvs
- add hyphenation points to the generated output
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma texts to improve language recognition
- study paragraphs with mixed content
- add corpus user accounts and access issues to Bugzilla
- investigate unrecognised input word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!.