Meeting_2006-09-18
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting follow-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 18.09.2006
- Time: 09.30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:58.
Present: Børre, Saara, Sjur, Thomas, Tomi, Trond
Absent: Maaren
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Contracts to Rauni Lukkari and Saimi Kaarina Lukkari
- Gather public texts, preferably also parallel ones
- Not done
- Send out letters to the rest of the Iđut authors
- Not done
- contact Ája (Kåfjord), talk to Lene
- Not contacted Ája. No texts from Lene so far.
- send contracts to Čálliid Lágádus
- Not done
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- Not done
- discuss with Bård Eriksen about collecting smj texts (with Sjur)
- Haven't heard from him last week.
- corpus conversion:
- convert nob and nno Bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- Not done
- review the paratext2xml converter
- Not done
- Move Norwegian documents in Min Áigi from sme to nob
- Most of 2003 is done
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- Nothing this week
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- finish Forrest i18n and Sámi in PDF work
- Still working.
- Get more sma, smj texts to improve language recognition
- Will talk to Stig Gælok later today, he said he has lots of text
- set up Tomcat for use with eXist and the propnouns db on the G5
- Not done
- fix bugs!
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- in progress
- Implement a server for the analysis tools.
- A prototype is ready.
- add an option for including derivational tags to lookup2cg output
- not done, I forgot it completely. I'll do it straight away. There
- examine text_cat for character limit 20 char
- not done
- generate parallel corpus files manually (with Trond)
- some planning done.
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- done, but needs more work - it overgenerates
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- some work done
- review user and admin documentation for corpus access
- nothing
- write user account form, probably ask for copy of existing ones from the IT
- nothing
- start hiring process of linguist and programmer
- continued, got a list of possible
- help Børre finish i18n work of Forrest with a language override menu
- progressing, but not finished
- consider the problems of lexicalised derivations skewing analyses of
- nothing
- install eXist and our local copy of risten.no and propnouns on the G5
- nothing
- speller follow-up from the Tromsø meeting
- nothing
- discuss with Bård Eriksen about collecting smj texts (with Børre)
- nothing
- get instructions on how to use Marratech, and test it
- nothing received
- fix bugs!
Thomas
- sme G3 issue
- more or less solved, though not in the way that was intended!
- bug-fixing!
- worked
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- no
- review user documentation for corpus access
- no
- find and study all derived verbs in our corpus (depends on Trond)
- no
- suggest which derivations could be generated (depends on Trond)
- no
- check all XXX cases in verb-file, consider marking them sub
- done
- consider checking all verbs for non-verbs
- talked with Lene; this is not necessary, as they were checked recently.
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- help Saara with JPedal
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user friendliness
- The missing one is teaksta.sh
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- discuss web-only user access management with Oslo
- Discussed, they will administer it when we add closed text. Discussion of
- write short user guide for the corpus web interface
- The user interface will be unstable this month, and a new version will be
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- Errors are spotted on a regular basis, the mixed paragraph issue still awaits
- generate parallel corpus files manually (with Saara)
- We need a catalogue of parallel texts
- block out the CG rule(s) that remove(s) the Der readings using M4
- Pseudocode written, now awaiting the m4 literates.
- fix bugs!
3. Documentation
TODO:
- finish i18n work by adding a list of available language versions to each
- Sjur has done quite a lot, but is not finished
- make pdf set-up work on victorio (Børre)
- working on victorio (and thus on the external site), but it does not work
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
4. Corpus gathering
Børre will focus on gathering this week; we need more material.
TODO:
- contact NSI (Børre)
- contact authors (Børre, possibly Lene)
- evaluate an agreement with Bård Eriksen to help us collect smj
5. Corpus infrastructure
General
Our way of dealing with the conversion of input documents has now reached an
TODO:
- remove headers and footers in the PDF conversion (Saara)
- some work done, see JPedal below
- fix Min Áigi filenames (Saara)
- done
- Go through the Java issues of JPedal (Saara, Tomi)
- soon ready?
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- Decision:
- compiled transducers to /opt also in the future
- scripts etc to /usr/local/share/bin/
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
- short user guide needed before going public (either write one or take whatever
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
TODO:
- use the present aligner to generate some initial input for Oslo to test
Language recognition
TODO:
- Get more text of the poorly covered languages: sma, smj (Trond, Børre)
- sma: get the Bible texts (Trond)
- study the paragraphs of 20 or fewer characters, where the errors will be
- study the mistakes our recogniser makes today (Trond)
- what about paragraphs with mixed content? Build a corpus of such paragraphs
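To study the short paragraphs separately, they first have to be pulled out of the corpus. The following is a minimal sketch of that step, assuming paragraphs are already available as plain strings; the function name `split_by_length` and the sample strings are hypothetical and not part of the existing corpus tools.

```python
from typing import Iterable, List, Tuple

# Threshold taken from the TODO item on paragraphs of 20 or fewer characters.
SHORT_LIMIT = 20


def split_by_length(
    paragraphs: Iterable[str], limit: int = SHORT_LIMIT
) -> Tuple[List[str], List[str]]:
    """Return (short, long) paragraph lists.

    Surrounding whitespace is stripped before measuring, and paragraphs
    that are empty after stripping are dropped.
    """
    short: List[str] = []
    long_: List[str] = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue
        if len(text) <= limit:
            short.append(text)
        else:
            long_.append(text)
    return short, long_


if __name__ == "__main__":
    # Placeholder strings, not real corpus data.
    sample = [
        "Short title",
        "This paragraph is clearly longer than twenty characters.",
        "   ",
    ]
    short, long_ = split_by_length(sample)
    print("short:", short)
    print("long:", long_)
```

The short list could then be run through the language recogniser on its own, so that its error rate can be compared with that of the longer paragraphs.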
6. Infrastructure
Xerox tools wrapped as servers
Saara has made a prototype, available as server_anl.pl (the server) and
The server communicates purely over TCP/IP, which means that in principle any
Very brief user instructions:
In one window: server_anl.pl. In another window: client_anl.pl -p.
TODO:
- improve and finish the present prototype (Saara)
- feature request: option for XML output from server
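The idea of wrapping an analysis tool as a TCP/IP server can be sketched as follows. This is not the code of server_anl.pl or client_anl.pl, whose protocol is not described in this memo; the line-based protocol (one word in, one line out), the function names and the dummy `analyse` stub are all assumptions made for illustration.

```python
import socket
import socketserver
import threading


def analyse(word: str) -> str:
    """Hypothetical stand-in for the real analyser; returns a dummy reading."""
    return f"{word}\t{word}+?"


class AnalysisHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        # Assumed protocol: one word per line in, one analysis line out.
        for raw in self.rfile:
            word = raw.decode("utf-8").strip()
            if not word:
                break
            self.wfile.write((analyse(word) + "\n").encode("utf-8"))


def start_server(host: str = "127.0.0.1", port: int = 0) -> socketserver.ThreadingTCPServer:
    """Start the server in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), AnalysisHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(host: str, port: int, word: str) -> str:
    """Send one word to the server and return the single reply line."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((word + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().rstrip("\n")


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    print(query(host, port, "gahpira"))
    server.shutdown()
    server.server_close()
```

Because the exchange is plain text over a socket, a client in any language can talk to such a server. An XML output option, as in the feature request, would only change what `analyse` returns.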
Hyphenator
The first hyphenating transducer was made last week, but it produces wrong output
gahpira => gah-pi-ra and ga-hpir; only the first one should be produced.
Data flow: input -> analyser -> baseform/analysis -> generator (overgeneration) -> filter.fst & hyph.fst (hyphenation rules) -> hyphenated output
We need a "filter" fst: a-z... -:0 -:- ^:0 #:0
Sketch:
[%-, ^, %# ] (<-) 0 ;
and the rest by default: a = a:a.
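The deletion side of the filter (-:0, ^:0, #:0, everything else mapped to itself) can be illustrated outside the fst tools. The sketch below is plain Python, not xfst, and rests on an assumption: that a valid hyphenation must spell the input word once all marks are removed. The function names are hypothetical.

```python
from typing import List

# The hyphen and the two boundary symbols named in the filter sketch.
MARKS = "-^#"


def strip_marks(marked: str) -> str:
    """Delete every hyphenation/boundary mark; all other symbols map to themselves."""
    return "".join(ch for ch in marked if ch not in MARKS)


def filter_candidates(word: str, candidates: List[str]) -> List[str]:
    """Keep only the candidates that spell the input word once marks are removed."""
    return [cand for cand in candidates if strip_marks(cand) == word]


if __name__ == "__main__":
    # The two outputs reported for gahpira.
    print(filter_candidates("gahpira", ["gah-pi-ra", "ga-hpir"]))
    # prints ['gah-pi-ra']
```

This check only rejects candidates whose letters differ from the input. It cannot choose between two candidates that both spell the word correctly, so the hyphenation rules themselves still have to prevent that kind of overgeneration.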
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- done (by Tomi and Sjur), needs improvement because of overgeneration
- implement the 'filter.fst' above (Sjur, Tomi, Trond)
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
M4
TODO:
- make speller and hyphenator make targets that utilise M4 to produce normative
- done for hyphenation (Sjur and Tomi)
7. Linguistics
Derivation and spellers like Aspell
- add an option to lookup2cg to keep +Der/ tags (Saara)
- doesn't seem to remove any tags (or at most very few)
- revert the CG rule that prefers lexicalised forms over derivations with M4
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
Semantic double-tagging of names
TODO:
- move the existing guidelines to a separate document (Børre)
- done (by Sjur)
- make sure all linguists are aware of the guidelines (Trond, Sjur)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
TODO:
- check all XXX cases (Thomas, Lene)
- done
- consider checking all verbs for non-verbs (Thomas, Lene)
- Thomas talked with Lene; this is not necessary, Lene's "non-words" were all
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- finish refactoring for multiple collections in the search interface
- develop the needed XQueries and UI (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- add eXist and the proper noun interface to the G5 using Tomcat
9. Tromsø meeting follow-up
TODO:
- speller development - see the meeting memo. Separate
- Lule Sámi linguist (Sjur)
10. Other
Bug fixing
64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and Marratech
TODO:
- download and install newest Marratech
- we need instructions on how to use it, and test it (Sjur)
Task lists as iCal entries
TODO:
- update Maaren's and Saara's installations to r430284 (Børre)
11. Next meeting, closing
Next meeting: 25.09.2006 at 09:30.
Closed at 11:03.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- discuss with Bård Eriksen about collecting smj texts (with Sjur)
- corpus conversion:
- convert nob and nno Bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- finish Forrest i18n and Sámi in PDF work
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- fix bugs!
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- Implement a server for the analysis tools.
- examine text_cat for character limit 20 char
- generate parallel corpus files manually (with Trond)
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- help Børre finish i18n work of Forrest with a language override menu
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- discuss with Bård Eriksen about collecting smj texts (with Børre)
- get instructions on how to use Marratech, and test it
- fix bugs!
Thomas
- sme G3 issue
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- help Saara with JPedal
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- generate parallel corpus files manually (with Saara)
- block out the CG rule(s) that remove(s) the Der readings using M4
- fix bugs!