Meeting_2006-11-06
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 7.11.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
The meeting was delayed one day.
Opened at 10: 56.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi
Absent: Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact writers who already have received contracts
- Not done
- Not done
- move norwegian documents in Min Áigi from sme to nob
- Mostly done
- Mostly done
- consider a script for automatic testing of the spell checker in Word
- Not done
- Not done
- consider more testing routines
- Not done
- Not done
- update Maaren's Forrest installation to r430284
- check potential raw HTML bug/problem
- Done
- Done
-
sma discussions with SD (with Sjur, Trond)
- add as much smj texts as possible
- Done
- Done
- report improvements in aligner back to Øystein
- Not done
- Not done
- add a simple password protection to risten.no in the G5
- Not done
- Not done
- consider infra for testing feedback
- Not done
- Not done
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- done some
- done some
- investigate unrecognised word forms in the hyphenator
- create / check the paradigm grammar as exemplified above
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- some performance testing done, other testing with Tomi.
- some performance testing done, other testing with Tomi.
- improve text_cat with paragraphs of mixed content
- done
- done
- generate parallel corpus files manually (with Trond)
- export corpus tools to location available to all (with cron), cf news disc.
- in progress
- in progress
- help Trond with some shell commands
- not done
- not done
- plan the word form generator / data conversion script
- what was this?
- what was this?
- add <span>to the corpus processing, encapsulating identifiable sequences
- done otherwise, the processing is not implemented in the preprocessor, nor
- done otherwise, the processing is not implemented in the preprocessor, nor
- fine-tuning of the character conversion
- done
- done
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- hire linguist and programmer
- one more candidate contacted
- one more candidate contacted
- finish i18n work of Forrest with Børre
- pdf finished - the forrest run version is ready now
- pdf finished - the forrest run version is ready now
- investigate unrecognised word forms in the hyphenator
- done some updates to the compilation, and some corrections
- the generator generates non-existent word forms - needs to be investigated!
- done some updates to the compilation, and some corrections
- decide how to specify compounding behaviour info in the lexicon
- further discussions in news
- further discussions in news
-
sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- remove comparation from -laš derivations
- done (by Thomas)
- done (by Thomas)
- plan the word form generator / data conversion script
- nohting last week
- nohting last week
- consider a script for automatic testing of the spell checker in Word
- checked the AppleScript dictionary in Word - it looks both positive and
- checked the AppleScript dictionary in Word - it looks both positive and
- consider more testing routines
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- ask Julie Eira about SD employee seminar
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
- nothing this week
- find and study all derived words in our corpus (with Trond)
- nothing this week
- nothing this week
- suggest which derivations could be generated
- nothing this week
- nothing this week
- investigate unrecognised word forms in hyphenator
- worked
- worked
- decide how to specify compounding behaviour info in the lexicon
- worked
- worked
- check why some SUB-marked entries got included in the normative transducer
- worked
- worked
- remove comparation from -laš derivations
- done
- done
-
fix bugs!
- worked
Tomi
- continue implementation of the speller lexicon conversion
- worked, refactored the code
- worked, refactored the code
- add hyphenation points to the generated output
- plan the word form generator / data conversion script
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Made a new dump from sme, gone through the lexica, more to be done.
- Made a new dump from sme, gone through the lexica, more to be done.
- get more sma texts, first the Bible / NT
- Got text, but not Bible yet.
- Got text, but not Bible yet.
- add corpus user accounts and access issues to Bugzilla
- fix the corpus tag list in the cwb/ directory
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
-
sma discussions with SD (with Børre, Sjur)
- find and study all derived words in our corpus (with Thomas)
- Done, helped Thomas with the setup of this
- Done, helped Thomas with the setup of this
- fix bugs!.
3. Documentation
The new documentation is up and running on the giellatekno server:
The main difference from our existing version: i18n and l10n - all menus, tabs
TODO:
- finish i18n work (Børre and Sjur)
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- finish i18n in PDF
- done
- done
- set up Tomcat at the faculty server, and install Forrest as war. Needs
- check potential raw HTML bug/problem (Børre)
- done and fixed (it was not a bug or a problem, it was our own sitemap: -( )
- done and fixed (it was not a bug or a problem, it was our own sitemap: -( )
- port all i18n work to the main branch (from the i18n branch) (Børre)
- update all forrest installations, including local patches (Børre)
4. Corpus gathering
Børre got in contact with an author Erling Persen, got a book. Also got
Number of words in our corpus is now:
-
sme - 4.38 mill words
- smj - 376 000 words
TODO:
- continue to help NSI to get their corpus (Børre)
- sma:
- get Bible / NT texts (Trond).
- nohting new since last week
- nohting new since last week
- Discussions with the Sámi Parliament (Børre, Sjur)
- get Bible / NT texts (Trond).
- add as much smj texts as possible (Børre)
- done
5. Corpus infrastructure
User accounts and access
TODO:
- add the issue with subissues to Bugzilla (Trond)
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
TODO:
- report improvements in aligner back to Øystein ( Børre)
- gather more parallel texts (Trond)
- try out NT alignment strategies (Saara)
Language recognition
TODO:
- get more sma texts, first the Bible / NT (Trond)
- add <span>to the corpus processing, encapsulating identifiable sequences
- done in text categorisation; all spans are dropped in ccat
6. Infrastructure
Xerox tools wrapped as servers
TODO:
- improve and finish the present prototype (Saara)
- it is now ready, but it is slow if the input is multiline - make sure that
- the multiline input option will go
- still some issues with the paradigm generation
- running the script and starting the demon needs to be done
- the multiline input option will go
- it is now ready, but it is slow if the input is multiline - make sure that
- fix the corpus tag list in the cwb/ directory (Trond)
- add the hyphenation filter to the hyphenator server (Saara)
- done
- done
- create / check the paradigm grammar as exemplified above (Maaren)
Hyphenator
TODO:
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
7. Linguistics
Names and multilinguality
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
Observations:
1) Multilinguality is always optional.
2) We can observe that "foreign" names in texts follows a domination pattern:
3) When looking at our name classification, multilinguality varies according to:
Ani - weak/none? (pet, myth anim. names) Fem - weak (informative) Mal - weak (informative) Obj - strong Org - strong Plc - strong for the national and country names, weak (informative) for foreign names Sur - none Tit - strong (titles)
Suggestion:
We need to reconsider the all names in all languages policy. That policy is
A further issue is whether we should reconsider our cohort policy. Today, Sur
"<Trosterud>" "Trosterud" N Prop Sur Sg Nom <<< @HNOUN "Trosterud" N Prop Plc Sg Nom <<< @HNOUN "<Trosterud>" "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN "<Trosterud>" "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
Derivation and spellers like Aspell
TODO:
- find and study all derived words in our corpus (Thomas and Trond)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
The sme generator (make wordlist TARGET=sme) (possibly also smj)
Example of generated string that's not recognised by the analyser:
šuorpmoskáidilašruoksatčeavžžatvuođaineattet
Two lessions learned:
- make sure to always use identical versions of source files and compiled files
- there was a hidden circularity in the propernoun file
TODO:
- investigate the generated word form list sent to Polderland - use the command
- alphabet letters need to be correctly inflected (colon as case separator)
- done
- done
- check why some SUB-marked entries got included in the normative transducer
- remove comparation from -laš derivations (Thomas, Sjur)
- done, but appearently not completely
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- hire new linguist (Sjur)
- one more contacted
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- add a simple password protection to risten.no in the G5 (Børre)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
9. Spellers
Speller data generation
Tomi has done more work. We are now awaiting full specification of the PLX
There was discussions in the newsgroup last week, but no conclusion yet. We need
TODO:
- decide how to specify compounding behaviour info for the lexicon
- add hyphenation points to the generated output (Tomi)
Automatic testing of the Word spellchecker
It should be possible to write a script that runs texts through Word from the
TODO:
- consider a script for automatic testing (Sjur, Børre)
- checked some features of Word
- checked some features of Word
- consider more testing routines (Sjur, Børre)
- consider infra for testing feedback (Børre, Sjur)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
10. Other
Report from Gothenburg
The seminar focused on actions to establish common infrastructure for language
Sjur would like our projects to take some first public steps, by announcing
Prerequisites:
- the corpus contracts are checked for any outstanding issues
- anonymous cvs is working and regularly updated
- done now
TODO:
- check corpus contract issue (Børre, Sjur, Trond)
Bug fixing
64 open Divvun/Disamb bugs, and 24 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Task lists as iCal entries
TODO:
- update Maaren's Forrest installation to r430284 (Børre)
Employee seminar in Alta
SD has an employee seminar in Alta 7.-8. December - should we go there? Sjur
TODO:
- ask Julie Eira about SD employee seminar (Sjur)
11. Next meeting, closing
Closed at 12: 27.
Appendix - task lists for the next week
Boerre
- contact writers who already have received contracts
- consider a script for automatic testing of the spell checker in Word
- consider more testing routines
- update Maaren's Forrest installation to r430284
-
sma discussions with SD (with Sjur, Trond)
- report improvements in aligner back to Øystein
- add a simple password protection to risten.no in the G5
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- check corpus contract issue
- port all i18n work to the main branch (from the i18n branch)
- update all forrest installations, including local patches
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
- investigate unrecognised word forms in the hyphenator
- create / check the paradigm grammar as exemplified above
Saara
- add more texts to the graphical corpus interface
- finalize server of the Xerox tools.
- generate parallel corpus files manually (with Trond)
- export corpus tools to location available to all (with cron), cf news disc.
- help Trond with some shell commands
- plan the word form generator / data conversion script
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- refactor SD-terms editor code
- hire linguist and programmer
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
-
sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- remove comparation from -laš derivations
- plan the word form generator / data conversion script
- consider a script for automatic testing of the spell checker in Word
- consider more testing routines
- consider infra for testing feedback
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- ask Julie Eira about SD employee seminar
- check corpus contract issue
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived words in our corpus (with Trond)
- suggest which derivations could be generated
- investigate unrecognised word forms in hyphenator
- decide how to specify compounding behaviour info in the lexicon
- check why some SUB-marked entries got included in the normative transducer
- fix bugs!
Tomi
- continue implementation of the speller lexicon conversion
- make generator as server, based on Saara's code
- add lexc2xspell code to cvs
- add hyphenation points to the generated output
- plan the word form generator / data conversion script
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts, first the Bible / NT
- add corpus user accounts and access issues to Bugzilla
- fix the corpus tag list in the cwb/ directory
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
-
sma discussions with SD (with Børre, Sjur)
- check corpus contract issue
- fix bugs!.