Meeting_2006-09-11
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting follow-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 11.09.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:03.
Present: Børre, Saara, Sjur, Thomas, Tomi, Trond
Absent: Maaren
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- Not contacted Ája, talked with Lene, she has a lot of texts.
- send contracts to Čálliid Lágádus
- Not done
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- Not done
- contact Bård Eriksen again
- Phoned him. He said they (Báhko) would like to help us, but would charge
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- not done
- convert fin, swe to paratext or directly to our XML
- not done
- review the paratext2xml converter
- not done
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Not done
- make a test user
- Done
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- I have started on this
- set up Bugzilla automatic reminders for open issues
- Not done
- create document & document entry for semantic double-tagging of names (for
- Not done
- update all Forrest installations to svn version r430284
- Me, Trond, Thomas and victorio ok.
- finish Forrest i18n and Sámi in PDF work
- Some work done. Still absolute paths in pdf work.
- Get more sma, smj texts to improve language recognition
- Will contact Stig Gælok in Tromsø. Talked to him in the weekend, he was
- set up Tomcat for use with eXist and the propnouns db on the G5
- Not done
- download and install latest Marratech
- Done
- fix bugs!
- None fixed this week
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- in progress
- Implement server of the analysis tools.
- not done
- Add more languages to the lexc2xml propernoun conversion.
- done
- Refine the namelex output
- done
- convert roughly 100 smj names from gt/smj/propernoun-smj-lex.txt (lines
- implemented in namelex2xml.pl
- download and install latest Marratech
- done
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- nothing last week
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- progress! XIncludes (and XML/w3 standard) works and will replace the planned
- implement improvements decided upon in Tromsø
- continued
- review user and admin documentation for corpus access
- nothing
- write user account form, probably ask for copy of existing ones from the IT
- nothing
- start hiring process of linguist and programmer
- continued
- help Børre finish i18n work of Forrest with a language override menu
- done, but not finished
- consider the problems of lexicalised derivations skewing analyses of
- nothing more since last meeting
- install eXist and our local copy of risten.no and propnouns on the G5
- nothing
- speller follow-up from the Tromsø meeting
- not yet
- fix bugs!
Thomas
- send more even-syllable VNAs to cover all stem types, with derivations
- done
- fix multiple inflection for identical name
- done
- sme G3 issue
- not this week
- bug-fixing!
- worked and still working
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- review user documentation for corpus access
- not done
- find and study all derived verbs in our corpus (depends on Trond)
- not done
- download and install latest Marratech
- done
- suggest which derivations could be generated (depends on Trond)
- not done
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- fix compilation on Victorio
- download and install latest Marratech
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- Continue discussion with Bergen about aligner issues
- Discussed with them, they will work on a command-line version this month
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Part two still not done.
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on semantic double-tagging of names
- Jotted down some notes on the subject, on prop-noun-editing.jspwiki.
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Not done, this must be a joint work with Thomas.
- discuss web-only user access management with Oslo
- Wrote letter, awaiting response.
- write short user guide for the corpus web interface
- Discussed it with Oslo, they are going to write some, awaiting that.
- install Gobby
- Done (thanks, Børre!) But it seems UTF-8 isn't in place!
- Get more sma, smj texts to improve language recognition
- Not done
- study corpus for language recognition errors, as well as paragraphs with mixed
- Done some work here with Saara and Ilona, progress underway, but still work
- consider the problems of lexicalised derivations skewing analyses of
- fix compilation on Victorio
- Done (well, Tomi, actually)
- download and install latest Marratech
- Done. Now ready to test this out.
- fix bugs!
- Went through the list, at least...
3. Documentation
TODO:
- finish i18n work by adding a list of available language versions to each
- on its way
- make pdf set-up work using relative paths (Børre)
- we'll have to use fixed paths working on victorio, so that the public site is
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
4. Corpus gathering
Contacted Bård Eriksen regarding smj texts. He would like to help us
Børre has gathered phone numbers, will phone writers and then send letters
Some authors run their own publishing house, they should be contacted (Lene
NSI: Børre to contact the IT consultant with an offer for help on relevant issues.
Authors: parallel work: Contact authors while waiting for publishers, not
TODO:
- contact NSI (Børre)
- contact authors (Børre, eventually Lene)
- evaluate an agreement with Bård Eriksen on helping us collect smj
5. Corpus infrastructure
General
TODO:
- remove headers and footers from antiword documents, other improvements
- implemented some with JPedal; would like help from Tomi on the Java issues in
- fix Min Áigi filenames (Saara)
- some done, but still some issues
- Go through the Java issues of JPedal (Saara, Tomi)
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- Decision:
- compiled transducers to /opt also in the future
- scripts etc to /usr/local/share/bin/
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Some script (location) discussion: we have two different locations in CVS for
- gt/src/see-tools/ - source files for tools to be installed
- gt/script/emacs/ - ready-to-use scripts (but might still need installation)
Conclusion: This is fine for the time being.
Web browser access
TODO:
- discuss with Oslo (Trond)
- written a letter to Lars
- delay other tasks until we are ready to go public?
- user management for access to bound texts
- short user guide needed before going public (either write one or take whatever
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
TODO:
- continue Bergen discussions (Trond)
- Bergen will work on a command-line version during the next two weeks
- use the present aligner to generate some initial input for Oslo to test
Language recognition
TODO:
- Get more text of the poorly covered languages: sma, smj (Trond, Børre)
- sma: get the Bible texts (Trond)
- study the paragraphs of 20 or fewer characters, where the errors will be
- study the mistakes our recogniser makes today (Trond)
- what about paragraphs with mixed content? Needs more investigation (Trond)
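As a rough illustration of the short-paragraph study mentioned above, the following Python sketch splits paragraphs into those at or below a character limit and those above it, so the short ones can be inspected separately. The function name and sample data are hypothetical; only the limit of 20 characters comes from the task list, and the project's actual tooling may work quite differently.

```python
def split_by_length(paragraphs, limit=20):
    """Partition paragraphs into (short, long) by stripped character count.

    Hypothetical helper: `limit` defaults to the 20-character threshold
    mentioned in the task list. Empty paragraphs are skipped.
    """
    short, long_ = [], []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue
        if len(text) <= limit:
            short.append(text)
        else:
            long_.append(text)
    return short, long_


# Placeholder sample data, not taken from the corpus.
sample = ["Bures!", "   ", "a" * 20, "a" * 21]
short, long_ = split_by_length(sample)
```

The short list could then be fed to the language recogniser on its own to see how often it guesses wrong.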
6. Infrastructure
Xerox tools wrapped as servers
TODO:
- decide the programming language to use (Saara)
- find some (almost-)ready-to-use code to build on (Saara)
- implement it (Saara)
- nothing so far
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
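A minimal sketch of what "untouched bugs" selection could look like, written in Python for illustration only. The record fields (`id`, `status`, `last_changed`) and the 30-day idle threshold are assumptions, not something decided in the meeting, and the sketch does not use any Bugzilla API.

```python
from datetime import date, timedelta


def untouched_bugs(bugs, today, max_idle_days=30):
    """Return open bugs whose last change is older than the idle threshold.

    `bugs` is a list of dicts with hypothetical keys 'id', 'status' and
    'last_changed' (a datetime.date). The threshold is an assumption.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return [
        bug for bug in bugs
        if bug["status"] == "open" and bug["last_changed"] < cutoff
    ]


# Placeholder records, not real bug data.
bugs = [
    {"id": 1, "status": "open", "last_changed": date(2006, 7, 1)},
    {"id": 2, "status": "open", "last_changed": date(2006, 9, 5)},
    {"id": 3, "status": "closed", "last_changed": date(2006, 6, 1)},
]
stale = untouched_bugs(bugs, today=date(2006, 9, 11))
```

The resulting list is what a reminder mail would be built from.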
M4
Setup and infra finished. Now we are ready to start using M4.
- What can we use M4 for? (programmers)
- Select and/or exclude different parts of the twol files.
- Specialised make-targets that depend on a profiled twol (=M4-processed)
- What do we want to use M4 for? (linguists)
- Hyphenation
- Regional diphthong simplification (oahpahe(a)ddjiid)
- shortening in 3-part compounds
- explicit output of G3 mark ' and of allophones e2, o2 etc. for text-to-speech
- more?
TODO:
- make speller and hyphenator make targets that utilise M4 to produce normative
- started, Tomi and Sjur will continue after this meeting
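To make the idea of selecting or excluding parts of a rule file per profile more concrete, here is a small Python sketch of the principle. It is not M4 and not the project's set-up: the `!BEGIN name` / `!END` markers, the function name and the sample lines are all hypothetical, and nested blocks are not handled.

```python
def select_sections(lines, profile):
    """Keep unmarked lines plus lines inside blocks named in `profile`.

    Blocks are delimited by the hypothetical markers '!BEGIN <name>' and
    '!END'. Marker lines themselves are never emitted.
    """
    out = []
    active = True  # whether the current line should be kept
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("!BEGIN "):
            active = stripped.split(None, 1)[1] in profile
        elif stripped == "!END":
            active = True
        elif active:
            out.append(line)
    return out


# Placeholder rule file content.
source = [
    "common rule 1",
    "!BEGIN hyphenation",
    "hyphenation rule",
    "!END",
    "!BEGIN tts",
    "tts rule",
    "!END",
    "common rule 2",
]
hyphenator_rules = select_sections(source, {"hyphenation"})
```

A make target for a given tool would then depend on the profiled output rather than on the full source file.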
7. Linguistics
Derivation and spellers like Aspell
- add an option to lookup2cg to keep +Der/ tags (Saara)
- revert the CG rule that prefers lexicalised forms over derivations with M4
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as
- lexicalise the rest (Thomas)
Semantic double-tagging of names
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
- write guidelines for annotators wrt. name tagging and put them under Systematic
- done (first draft) by adding to an existing document
- make sure all linguists are aware of the guidelines (Trond, Sjur)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
Issue: Lene has found quite a few non-words as base forms in the verb lexicon.
This issue is linked to the question of whether our lexicon files should be
Nothing this week, but see above re: derivations.
TODO:
- check all XXX cases (Thomas, Lene)
- consider checking all verbs for non-verbs (Thomas, Lene)
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- convert roughly 100 smj names from that file (lines 740-843) to XML
- done, but requires converting two times, and then manually combining the
8. Name lexicon infrastructure
Decided in Tromsø:
- add smj proper noun lexicon file to the output
- done
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- been thinking
- all names in all languages by default
- done in the conversion
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- finish refactoring for multiple collections in the search interface
- waiting for a bug fix (Tomi is investigating it)
- doesn't have to wait anymore, we're using another component with the same
- develop the needed XQueries and UI (Sjur, Tomi)
- continued
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- fix multiple inflection for identical names (Trond and Thomas)
- done
- add eXist and the proper noun interface to the G5 using Tomcat
9. Tromsø meeting follow-up
TODO:
- speller development - see the meeting memo. Separate follow-up next week.
- Lule Sámi linguist (Sjur)
- a second candidate can only work on an hourly basis, and probably not that
10. Other
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Gobby
TODO:
- install Gobby (Trond)
- Done (Thanks, Børre)
Compilation on victorio
Hypothesis: closed-sme-lex.txt is broken
TODO:
- Fix compilation on Victorio (Tomi, Trond)
- Done. Compilation order had to be changed.
Meetings and Marratech
TODO:
- download and install newest Marratech
- Downloaded by: Trond, Saara, Thomas, Børre, Tomi
- we need instructions on how to use it, and test it
Task lists as iCal entries
TODO:
- update all forrest installations to r430284 (Børre)
- Still not updated: Maaren, Saara (the command was not found)
11. Next meeting, closing
Next meeting 18.9.2006 at 9:30.
Closed at 11:52.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- discuss with Bård Eriksen about collecting smj texts (with Sjur)
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- finish Forrest i18n and Sámi in PDF work
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- fix bugs!
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- add an option for including derivational tags to lookup2cg output
- examine text_cat for character limit 20 char
- generate parallel corpus files manually (with Trond)
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- help Børre finish i18n work of Forrest with a language override menu
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- discuss with Bård Eriksen about collecting smj texts (with Børre)
- fix bugs!
Thomas
- sme G3 issue
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
- check all XXX cases in verb-file, consider marking them sub
- consider checking all verbs for non-verbs
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- help Saara with JPedal
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- discuss web-only user access management with Oslo
- write short user guide for the corpus web interface
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- generate parallel corpus files manually (with Saara)
- block out the CG rule(s) that remove(s) the Der readings using M4
- fix bugs!