Meeting_2006-10-23
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 23.10.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:58.
Present: Børre, Maaren, Saara, Sjur
Absent: Thomas, Tomi, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- contact Ája (Kåfjord)
- They will send us documents on a monthly basis
- discuss access to older Min Áigi and Áššu files with Richard Valkeapää
- Done ...
- Move Norwegian documents in Min Áigi from sme to nob
- Not done
- set up Bugzilla automatic reminders for open issues
- Done!
- finish Forrest i18n work (pdf)
- set up Tomcat on the G5 for use with eXist and the propnouns db, as well as
- tomcat is up and running
- document SquidMan use at SD
- Available at /doc/infras/ichat-through-firewalls.html
- fix bugs!
- Other
- Done a lot of work on the Bergen aligner.
Maaren
- investigate the generated word form list sent to Polderland - use the command
- have started working
- investigate unrecognised input word forms in the hyphenator
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- implemented xml-conversions.
- generate parallel corpus files manually (with Trond)
- prepared and analyzed a set of files for alignment
- Improve text_cat
- not finalized
- export corpus tools to location available to all (with cron), cf news disc.
- not done
- Improve hyph-filter.pl
- done, with respect to the case conversion and check for #-
- help Trond with some shell commands
- not done
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- partly done, refactored and completed the classification editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- move corpus user doc issue to Bugzilla
- done
- hire linguist and programmer
- finish i18n work of Forrest
- install eXist and our local copy of risten.no and propnouns on the G5
- investigate unrecognised input word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
- other tasks:
- hyphenation and normative improvements to the data delivered to Polderland
- delivered hyphenated data to Polderland
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived verbs in our corpus
- suggest which derivations could be generated
- investigate unrecognised input word forms in hyphenator
- investigate the generated word form list sent to Polderland - use the command
- decide how to specify compounding behaviour info in the lexicon
Tomi
- continue implementation of the speller lexicon conversion
- implementing, it will be very nice, maybe too nice and complicated
- make generator as server, based on Saara's code
- add lexc2xspell code to cvs
- still moving and renaming codefiles
- add hyphenation points to the generated output
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma texts to improve language recognition
- study paragraphs with mixed content
- add corpus user accounts and access issues to Bugzilla
- investigate unrecognised input word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
3. Documentation
TODO:
- finish i18n work (Børre and Sjur)
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- i18n does not work in PDF ("Table of Content" won't translate)
- nothing
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- move these issues into Bugzilla (Sjur)
- finally done (see Bug 348)
4. Corpus gathering
Børre wrote the renaming script for NSI, but it didn't work on their
Børre has also talked with Ája. They agreed that they will send monthly all
He has also been digging more into the SD document hierarchy, to try to find
One author has signed the corpus contract last week, and sent us a book:
TODO:
- discuss corpus transfer with NSI (Børre)
- done
- continue to contact authors and text producers (Børre)
- done
5. Corpus infrastructure
User accounts and access
TODO:
- add the issue with subissues to Bugzilla (Trond)
More texts to the graphical corpus interface:
TODO:
- align texts, analyse, and send to Lars (Trond, Saara)
- add text to the server (Lars)
Aligner
Børre has been working hard on the Bergen aligner: fixed compiling errors
The aligner now works somewhat automatically, it handles memory issues
We'll stop working on it for this week, as the fixes already done are very
Some parallel texts in the corpus are now aligned (but many of the
TODO:
- gather parallel texts (Trond)
- try out NT alignment strategies (Saara)
Language recognition
TODO:
- get more sma texts, first the Bible / NT (Trond)
- what about paragraphs with mixed content? Build a corpus of such paragraphs
6. Infrastructure
Xerox tools wrapped as servers
Tomi tried to make the generator, but no success. Saara will cooperate
TODO:
- improve and finish the present prototype (Saara)
- improved
- add generator to the server setup (Tomi)
- tried
Hyphenator
TODO:
- Update the sma hyphenator rule set with the insights gained from smj
- consider case conversion problems (Saara)
- done
- check #- when comparing input string with hyphenated string - the hyphen
- done
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- nothing
Automatic Bugzilla reminder for untouched bugs
We now receive reminders for untouched bug reports, once a week.
TODO:
- fix the remaining issues (Børre)
- done!
M4
Still anything?
It is problematic for the CG rules, as the rule numbering gets mixed up. The
7. Linguistics
Names and multilinguality
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
Observations:
1) Multilinguality is always optional.
2) We can observe that "foreign" names in texts follow a domination pattern:
3) When looking at our name classification, multilinguality varies according to:
- Ani - weak/none? (pet, myth anim. names)
- Fem - weak (informative)
- Mal - weak (informative)
- Obj - strong
- Org - strong
- Plc - strong for the national and country names, weak (informative) for foreign names
- Sur - none
- Tit - strong (titles)
Suggestion:
We need to reconsider the all names in all languages policy. That policy is
A further issue is whether we should reconsider our cohort policy. Today, Sur
"<Trosterud>"
    "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
    "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
    "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
    "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
Derivation and spellers like Aspell
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
Unwanted word forms:
- comparison of -laš derivations (they should not be generated in comparative
TODO:
- investigate the generated word form list sent to Polderland - use the command
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
8. Name lexicon infrastructure
Sjur has cleaned a lot of risten.no code during last week.
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- turn on Tomcat on the G5; send admin username and password to Sjur
- done
- add eXist and the proper noun interface to the G5 (Sjur)
- eXist installed.
- cvs synching of the risten.no code in eXist (read-only) (Børre)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correctly:
9. Spellers
Speller data generation
Derivations during generation of word forms: how do we generate derivations of
Problem example: how do we get from Oslo to oslolaš?
Input: Oslo
Output: Oslo Oslos ... oslolaš oslolaččat ... (+ other derivations and their inflections)
Oslo ->
Oslo+N+Sg+Nom
oslolaš+N+Sg+Nom
Make a list of all possible derivations (see
- if it succeeds, take the new stem, and inflect it
- if not, try the next
What about compounding stem? How do we generate it?
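The derivation loop described above could be sketched roughly as follows. This is only an illustration under assumptions: `try_derive`, `inflect` and the suffix list are hypothetical stand-ins (a real implementation would call the generator with the appropriate tags), and the toy lookup tables hold only the forms from the Oslo example.

```python
# Toy stand-ins for the real generator; all names and tables are hypothetical.
KNOWN_DERIVATIONS = {("Oslo", "laš"): "oslolaš"}
KNOWN_INFLECTIONS = {
    "Oslo": ["Oslo", "Oslos"],
    "oslolaš": ["oslolaš", "oslolaččat"],
}
# The second entry is a placeholder that always fails, to show the fallback.
DERIVATION_SUFFIXES = ["laš", "placeholder-suffix"]


def try_derive(stem, suffix):
    """Return the derived stem, or None if the derivation fails."""
    return KNOWN_DERIVATIONS.get((stem, suffix))


def inflect(stem):
    """Return the inflected forms of a stem (here: a plain table lookup)."""
    return KNOWN_INFLECTIONS.get(stem, [stem])


def generate_forms(stem):
    """Inflect the base stem, then every derivation that succeeds."""
    forms = list(inflect(stem))
    for suffix in DERIVATION_SUFFIXES:
        derived = try_derive(stem, suffix)
        if derived is None:
            continue  # derivation failed: try the next one
        forms.extend(inflect(derived))
    return forms


print(generate_forms("Oslo"))
```

The sketch does not address the compounding stem; that would need its own entry point once the lexicon says how compounding behaviour is specified.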
TODO:
- add code to cvs (Tomi)
- implement generator server based on Saara's code (Tomi)
- specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
- add hyphenation points to the generated output (Tomi)
- planning meeting for the word form generator / data conversion script
Automatic testing of the Word spellchecker
Ask MS Word to spell check the open documents, and store all unrecognised words
We should also ask Polderland whether they have tools for this.
This will only test unrecognised words. We also need to test the suggestions,
TODO:
- consider a script for automatic testing (Sjur, Børre)
- ask Polderland about testing tools (Sjur)
- consider more testing routines (Sjur, Børre)
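One possible shape for the comparison step of such a test script, sketched under assumptions: the words MS Word leaves unrecognised are assumed to have been exported to a list already (how to do that is still open), and the gold list marking each test word as correct or misspelled is hypothetical. Function name and sample words are illustrative only.

```python
def evaluate_speller(gold, flagged):
    """Compare the speller's unrecognised words against a gold standard.

    gold:    dict mapping each test word to True (correct) or False (misspelled)
    flagged: set of words the spell checker reported as unrecognised
    """
    # Correct words that were flagged anyway.
    false_alarms = sorted(w for w, ok in gold.items() if ok and w in flagged)
    # Misspelled words that slipped through.
    missed_errors = sorted(w for w, ok in gold.items() if not ok and w not in flagged)
    return {"false_alarms": false_alarms, "missed_errors": missed_errors}


# Hypothetical sample data; real input would come from the test documents.
gold = {"oslolaš": True, "Oslo": True, "osllolaš": False, "Oslso": False}
flagged = {"osllolaš", "oslolaš"}

report = evaluate_speller(gold, flagged)
print(report)
```

This covers only the unrecognised-word side; checking suggestions would need the expected correction for each misspelling as additional input.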
10. Other
Bug fixing
66 open Divvun/Disamb bugs, and 24 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and the SD Firewall
TODO:
- document SquidMan use at SD (Børre)
- done
How do we set environment variables effective for all users
Look into /etc/environment on victorio - NOT FOUND on the Mac! (only
Task lists as iCal entries
TODO:
- update Maaren's Forrest installation to r430284 (Børre)
Employee seminar in Alta
SD has an employee seminar in Alta in December - should we go there? We'll
11. Next meeting, closing
Next meeting 30.10.2006 at 9:30.
Closed at 11:24.
Appendix - task lists for the next week
Børre
- contact writers who already have received contracts
- Move Norwegian documents in Min Áigi from sme to nob
- finish Forrest i18n work (pdf)
- cvs synching of the risten.no code in eXist (read-only)
- consider a script for automatic testing of the spell checker
- consider more testing routines
- update Maaren's Forrest installation to r430284
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- generate parallel corpus files manually (with Trond)
- export corpus tools to location available to all (with cron), cf news disc.
- help Trond with some shell commands
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- finish i18n work of Forrest
- install our local copy of risten.no and propnouns on the G5
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived verbs in our corpus
- suggest which derivations could be generated
- investigate unrecognised word forms in hyphenator
- investigate the generated word form list sent to Polderland - use the command
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
Tomi
- continue implementation of the speller lexicon conversion
- make generator as server, based on Saara's code
- add lexc2xspell code to cvs
- add hyphenation points to the generated output
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma texts to improve language recognition
- study paragraphs with mixed content
- add corpus user accounts and access issues to Bugzilla
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!