Meeting_2006-09-25
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting follow-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 25.09.2006
- Time: 09:30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:32.
Present: Børre, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- discuss with Bård Eriksen about collecting smj texts (with Sjur)
- Asked him to send us a book catalogue, so that we can contact authors.
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
- finish Forrest i18n and Sámi in PDF work
- Get more sma, smj texts to improve language recognition
- Will get smj text today, 25th.
- set up Tomcat for use with eXist and the propnouns db on the G5
- fix bugs!
Maaren
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from PDF documents
- Tomi did his part. There were some drawbacks, so the tool is not
- Implement server of the analysis tools.
- Parallel processing implemented. Not otherwise finalized.
- generate parallel corpus files manually (with Trond)
- Started, but waiting for pdf-conversion.
- Improve text_cat
- The code is ready. I'll generate better language models for some
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- gymnastics done earlier - now a perl script is underway that will clean the
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- continued to work on the specifications
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- help Børre finish i18n work of Forrest with a language override menu
- almost DONE! It is working, only i18n of the language menu left, and sending
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- discuss with Bård Eriksen about collecting smj texts (with Børre)
- get instructions on how to use Marratech, and test it
- fix bugs!
Thomas
- sme G3 issue
- this is still fixed, but not in the intended way
- bug-fixing!
- yeah!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
- review user documentation for corpus access
- not done
- find and study all derived verbs in our corpus (depends on Trond)
- not done
- suggest which derivations could be generated (depends on Trond)
- not done
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- read aligner documentation, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- make speller and hyphenator make targets using M4
- help Saara with JPedal
- fix bugs!
- other tasks:
- worked on JPedal, to help Saara fix PDF conversion
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Worked on this with Sjur, it turned out to be quite hard. Will discuss it
- make shell script wrappers for the most common commands for user-friendliness
- This issue was passed on to the programmers.
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- Done some work here, with Ilona.
- generate parallel corpus files manually (with Saara)
- Not done, but the aligner is now available in a debugged and faster version.
- block out the CG rule(s) that remove(s) the Der readings using M4
- Also this issue has been passed on to the programmers, as the pseudocode is
- fix bugs!
3. Documentation
TODO:
- finish i18n work by adding a list of available language versions to each
- Sjur and Børre finished most late last Friday night (stopped around midnight)
- make PDF set-up work on Victorio (Børre)
- working as it should on Victorio.
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- add the new Words section to the site
4. Corpus gathering
Børre contacted several authors:
- Jovnna Ánde Vest
- Stig Gælok
- Aage Solbakk
Børre will meet Stig Gælok today; he has a lot of texts in Lule Sámi.
Bård Eriksen was concerned that it would be too much work for them to
TODO:
- contact NSI (Børre)
- not yet
- contact authors (Børre, eventually Lene)
- done, see above; no discussions with Lene
- evaluate an agreement with Bård Eriksen on helping us collect smj
- discussed with him
5. Corpus infrastructure
General
Our way of dealing with the conversion of input documents has now reached an
JPedal work: Tomi went through the source code and added an option that
TODO:
- remove headers and footers in the PDF conversion (Saara)
- still needs some work
- Go through the Java issues of JPedal (Saara, Tomi)
- isn't quite delivering what we hoped, will need more work
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- Decision:
- compiled transducers to /opt also in the future
- scripts etc to /usr/local/share/bin/
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
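The export decision in the list above (compiled transducers to /opt, scripts etc. to /usr/local/share/bin/) could be captured in a small helper that a cron-driven export job calls for each file. The Python sketch below is only an illustration: the suffix list, the flat layout under /opt, the function name and the example paths are assumptions, not project conventions.

```python
from pathlib import PurePosixPath

# Assumption: compiled transducers are recognised by these file suffixes.
TRANSDUCER_SUFFIXES = {".fst", ".save"}

# Target directories taken from the decision in the memo.
OPT_DIR = PurePosixPath("/opt")
SCRIPT_DIR = PurePosixPath("/usr/local/share/bin")


def export_destination(source):
    """Return the path a file should be exported to.

    Compiled transducers go to /opt, everything else (scripts etc.)
    goes to /usr/local/share/bin/. Only the file name is kept; any
    sub-directory layout below the targets is left open here.
    """
    src = PurePosixPath(source)
    target_dir = OPT_DIR if src.suffix in TRANSDUCER_SUFFIXES else SCRIPT_DIR
    return target_dir / src.name


if __name__ == "__main__":
    # Hypothetical source paths, for illustration only.
    for path in ("sme.fst", "script/teaksta.sh"):
        print(path, "->", export_destination(path))
```

Keeping the rule in one function means a later change of target directories only has to be made in one place.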
Web browser access
Has been discussed with Oslo. They will release a new version of the web
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
There has been a bug in the Bergen aligner; we will get a new (graphical)
TODO:
- use the present aligner to generate some initial input for Oslo to test.
- gather parallel texts (Trond)
Language recognition
New .wm files have been made, with better performance. Saara, Ilona and
TODO:
- Get more text of the poorly covered languages: sma, smj (Trond, Børre)
- sma: get the Bible texts (Trond)
- study the mistakes our recogniser makes today (Trond, Ilona)
- what about paragraphs with mixed content? Build a corpus of such paragraphs
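For reference when studying recognition errors: n-gram language guessers of the text_cat kind rank the most frequent character n-grams of a text and compare that ranking with one stored profile per language. The Python sketch below shows that general technique only. It is not the project's text_cat code, and the function names, the profile size and the toy training strings are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, models):
    """Return the key of the language model closest to the text."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


if __name__ == "__main__":
    # Toy training strings, not real language data.
    models = {
        "A": ngram_profile("aaa bbb"),
        "X": ngram_profile("xxx yyy"),
    }
    print(guess_language("aaa", models))
```

Short inputs give short, noisy profiles, which is one reason more training text for poorly covered languages helps. For paragraphs with mixed content, one option under the same assumptions would be to score smaller units such as sentences separately.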
6. Infrastructure
Xerox tools wrapped as servers
Feature request:
- option for XML output from server
TODO:
- improve and finish the present prototype (Saara)
- done some, still more work to do
Hyphenator
Sjur got help from Saara to sketch a Perl solution to the overgeneration
TODO:
- finish the hyphenator clean-up script (Sjur)
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
- At last I found a solution. Will implement it today!
M4
TODO:
- make speller make targets that utilise M4 to produce normative
7. Linguistics
Derivation and spellers like Aspell
- revert the CG rule that prefers lexicalised forms over derivations with M4
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
Semantic double-tagging of names
Waiting for the name conversion to take place before the disamb rules can be
North Sámi
Nothing this week?
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Schedule a T-T meeting this week - Wednesday.
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in the meeting memo.
TODO:
- finish refactoring for multiple collections in the search interface
- worked on a specification (in the new CVSROOT/words/ section)
- develop the needed XQueries and UI (Sjur, Tomi)
- data synchronisation between risten.no and the CVS repository (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- add eXist and the proper noun interface to the G5 using Tomcat
9. Tromsø meeting follow-up
TODO:
- speller development - see the meeting memo. Separate
- Lule Sámi linguist (Sjur)
Speller data generation
We need to convert our Xerox lexicons to the format required by Polderland,
TODO:
- start to plan the implementation of the speller data conversion/generation
10. Other
Bug fixing
64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and Marratech
TODO:
- download and install newest Marratech
- we need instructions on how to use it, and test it (Sjur)
Task lists as iCal entries
TODO:
- update Maaren's and Saara's installations to r430284 (Børre)
11. Next meeting, closing
Next meeting: 02.10.2006 at 09:30.
Closed at 10:10.
Appendix - task lists for the next week
Børre
- corpus collection:
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- set up Bugzilla automatic reminders for open issues
- finish Forrest i18n and Sámi in PDF work
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- add the new Words section to the site
- fix bugs!
Maaren
- On sick leave
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from PDF documents
- Implement server of the analysis tools.
- generate parallel corpus files manually (with Trond)
- Improve text_cat
- fix bugs!
Sjur
- finish the hyphenator clean-up script
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- finish i18n work of Forrest
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- get instructions on how to use Marratech, and test it
- fix bugs!
Thomas
- work with Polderland phonetic rules
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
- meeting with Trond Wednesday on smj proper nouns
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- implement improvements decided upon in Tromsø
- export corpus tools to /opt (with cron)
- make speller make targets using M4
- start to plan the implementation of the speller data conversion/generation
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma, smj texts to improve language recognition
- study corpus for language recognition errors, as well as paragraphs with mixed
- generate parallel corpus files manually (with Saara)
- block out the CG rule(s) that remove(s) the Der readings using M4
- meeting with Thomas Wednesday on smj proper nouns
- fix bugs!