Meeting_2006-10-02
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting follow-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 02.10.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:39.
Present: Børre, Saara, Sjur, Thomas, Tomi
Absent: Maaren, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- contact Ája (Kåfjord), talk to Lene
- Got documents from Lene
- send contracts to Čálliid Lágádus
- Done
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- Not done
- Move Norwegian documents in Min Áigi from sme to nob
- Will have to check for other folders than 2003 ...
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- Not done
- set up Bugzilla automatic reminders for open issues
- Not done
- finish Forrest i18n and Sámi in PDF work
- Some more done
- Get more sma, smj texts to improve language recognition
- Got lots of texts from Stig Gælok
- set up Tomcat for use with eXist and the propnouns db on the G5
- not done
- add the new Words section to the site
- Not done
- fix bugs!
- None fixed this week
- Fixed a computer for Thomas
Maaren
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- I'll add this to the bug db.
- remove headers and footers from pdf documents
- the documents with multiple columns still not ready
- Implement server of the analysis tools.
- not yet ready
- generate parallel corpus files manually (with Trond)
- the routines are mostly done.
- Improve text_cat
- the new language models are still not done.
- fix bugs!
Sjur
- finish the hyphenator clean-up script
- done, but still too much output because we're using circular transducers
- name lexicon:
- implement editing functions
- waiting for the item below to be finished
- finalise refactoring for multiple collections
- search interface ready and converted to the new code for SD-terms
- will add basic searching for the other collections as well, to demonstrate
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- nothing done last week, tried to move substantially forward with the name db
- finish i18n work of Forrest
- what was done is now working completely as it should
- but finishing it turned out to be not so simple because it needs to work with
- consider the problems of lexicalised derivations skewing analyses of
- awaits M4-implementation in Disamb
- install eXist and our local copy of risten.no and propnouns on the G5
- nothing
- speller follow-up from the Tromsø meeting
- done, planning and code design has started
- get instructions on how to use Marratech, and test it
- talked briefly with Leif Åge, asked him to push Geir or Roy;
- fix bugs!
Thomas
- work with Polderland phonetic rules
- soon finished
- bug-fixing!
- worked some
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- begun to look at it
- review user documentation for corpus access
- not done
- find and study all derived verbs in our corpus (depends on Trond)
- not done
- suggest which derivations could be generated (depends on Trond)
- not done
- meeting with Trond Wednesday on smj proper nouns
- we did meet
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- partly
- implement improvements decided upon in Tromsø
- not done
- export corpus tools to /opt (with cron)
- not done
- make speller make targets using M4
- not done
- start to plan the implementation of the speller data conversion/generation
- started
- fix bugs!
Trond
- better smj NT text
- Will give our .doc text a second try with Børre.
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- We use Sjur and Saara's perl solution instead.
- make shell script wrappers for the most common commands for user friendliness
- Issue passed on to Tomi
- write user account form, probably ask for copy of existing ones from the IT
- Discussed with Thor Øivind, this is up to us.
- write documentation for our bound users, with pointers to the ordinary
- Not done.
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Gone through the lexicon with Thomas, done minor adjustments.
- Get more sma, smj texts to improve language recognition
- Have started working on sma, Børre works on smj.
- study corpus for language recognition errors, as well as paragraphs with mixed
- generate parallel corpus files manually (with Saara)
- Worked with this issue, provided nob texts.
- block out the CG rule(s) that remove(s) the Der readings using M4
- Pseudocode written, issue passed on to M4 literates.
- meeting with Thomas Wednesday on smj proper nouns
- Had short meeting, will return to the issue.
- fix bugs!
3. Documentation
TODO:
- finish i18n work (Børre and Sjur)
- done when running in 'forrest run' mode - it works perfectly now
- does NOT work when running 'forrest site', instead it generates a heap of
- i18n does not work in PDF ("Table of Content" won't translate)
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
- add the new Words section to the site (Børre)
- not yet done
4. Corpus gathering
Børre got texts from Lene and Stig Gælok; we now have 24 000 words in
TODO:
- contact NSI (Børre)
- bug Bård Eriksen about the book list (Børre)
- continue to contact authors (Børre)
5. Corpus infrastructure
General
Our way of dealing with the conversion of input documents has now reached an
TODO:
- remove headers and footers in the PDF conversion (Saara)
- improving, but still needs some work
- consider an article on our corpus framework (all)
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
- Decision:
- compiled transducers to /opt also in the future
- scripts etc to /usr/local/share/bin/
- make shell script wrappers for the most common commands for user friendliness
- (first version of first script, teaksta.sh, was checked in, but it is still
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
Next week: discuss NT parallel corpus
Trond and Saara have done manual alignment for testing.
TODO:
- use the present aligner to generate some initial input for Oslo to test.
- done
- gather parallel texts (Trond)
Language recognition
TODO:
- Get more text of the poorly covered languages: sma, smj (Trond, Børre)
- sma: get the Bible texts (Trond)
- we now have more smj texts, which should be used to improve the smj LM
- study the mistakes our recogniser makes today (Trond, Ilona)
- done
- what about paragraphs with mixed content? Build a corpus of such paragraphs
6. Infrastructure
Xerox tools wrapped as servers
Work will continue this week; any issues will be raised at the next meeting.
Feature request:
- option for XML output from server
TODO:
- improve and finish the present prototype (Saara)
- done some, still more work to do
Hyphenator
Problem: we overgenerate because we are using a circular transducer:
A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me
A-#fi^ná^laid#ea^me
A-#fi^ná^lai^dea^me
A-#fi^ná^lai^dea^me
A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me
Pseudocode for hyph cleanup:
- read cohort
- remove all but the readings with the fewest word boundaries
- compare the rest with the input string, disregarding ^ and #:
-- delete forms that do not correspond to the input string
- unique the final set
- print what is left (it should normally be only one form)
The input that corresponds to the partially cleaned data above (input word, then hyphenator output):
A-finálaideame A--fi^ná^laid-ea^mi A-finálaideame A--fi^ná^laid-ea^me A-finálaideame A--fi^ná^laid#ea^mi A-finálaideame A--fi^ná^laid#ea^me A-finálaideame A--fi^ná^lai^dea^mi A-finálaideame A--fi^ná^lai^dea^me A-finálaideame A--fi^ná^lai^dea^me A-finálaideame A-fi^ná^laid-ea^mi A-finálaideame A-fi^ná^laid-ea^me A-finálaideame A-fi^ná^laid#ea^mi A-finálaideame A-fi^ná^laid#ea^me A-finálaideame A-fi^ná^lai^dea^mi A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-#fi^ná^laid-ea^mi A-finálaideame A-#fi^ná^laid-ea^me A-finálaideame A-#fi^ná^laid#ea^mi A-finálaideame A-#fi^ná^laid#ea^me A-finálaideame A-#fi^ná^lai^dea^mi A-finálaideame A-#fi^ná^lai^dea^me A-finálaideame A-#fi^ná^lai^dea^me A-finálaideame A-fi^ná^laid-ea^mi A-finálaideame A-fi^ná^laid-ea^me A-finálaideame A-fi^ná^laid#ea^mi A-finálaideame A-fi^ná^laid#ea^me A-finálaideame A-fi^ná^lai^dea^mi A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-fi^ná^lai^dea^me A-finálaideame a--fi^ná^laid-ea^mi A-finálaideame a--fi^ná^laid-ea^me A-finálaideame a--fi^ná^laid#ea^mi A-finálaideame a--fi^ná^laid#ea^me A-finálaideame a--fi^ná^lai^dea^mi A-finálaideame a--fi^ná^lai^dea^me A-finálaideame a--fi^ná^lai^dea^me A-finálaideame a-fi^ná^laid-ea^mi A-finálaideame a-fi^ná^laid-ea^me A-finálaideame a-fi^ná^laid#ea^mi A-finálaideame a-fi^ná^laid#ea^me A-finálaideame a-fi^ná^lai^dea^mi A-finálaideame a-fi^ná^lai^dea^me A-finálaideame a-fi^ná^lai^dea^me A-finálaideame a-#fi^ná^laid-ea^mi A-finálaideame a-#fi^ná^laid-ea^me A-finálaideame a-#fi^ná^laid#ea^mi A-finálaideame a-#fi^ná^laid#ea^me A-finálaideame a-#fi^ná^lai^dea^mi A-finálaideame a-#fi^ná^lai^dea^me A-finálaideame a-#fi^ná^lai^dea^me
The fst used is:
sme/bin/hyph-sme.fst
To make:
make TARGET=sme hyph-sme.fst
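The clean-up steps listed above can be sketched in Python as follows. This is only an illustration, not the project's actual clean-up script: the function name clean_cohort and its parameters are invented here, and the sketch assumes that # marks a word boundary and that ^ and # are the only marks to disregard when comparing a reading with the input word.

```python
def clean_cohort(word, readings, boundary="#", marks="^#"):
    """Reduce the hyphenator readings for one input word (a cohort).

    All names and defaults are illustrative assumptions, not project code.
    """
    if not readings:
        return []
    # Keep only the readings with the fewest word boundaries.
    fewest = min(r.count(boundary) for r in readings)
    candidates = [r for r in readings if r.count(boundary) == fewest]
    # Drop readings that differ from the input once the marks are removed.
    strip_marks = str.maketrans("", "", marks)
    matching = [r for r in candidates if r.translate(strip_marks) == word]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(matching))


# A small subset of the readings listed above for the input A-finálaideame.
cohort = [
    "A--fi^ná^laid-ea^mi",
    "A-fi^ná^laid#ea^me",
    "A-fi^ná^lai^dea^me",
    "A-fi^ná^lai^dea^me",
    "A-#fi^ná^lai^dea^me",
    "a-fi^ná^lai^dea^me",
]
for form in clean_cohort("A-finálaideame", cohort):
    print(form)
```

For this sample cohort the sketch leaves the single form A-fi^ná^lai^dea^me. Because the word-boundary filter runs before the comparison with the input, as in the steps above, the result can be empty when none of the readings with the fewest boundaries match the input word.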
TODO:
- finish the hyphenator clean-up script (Sjur)
- done, but multi-word expressions are not handled correctly. Change separator
- improve the clean-up script to remove more-complex word forms (Saara)
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
Some Perl libraries needed by Bugzilla weren't in the path, causing it to not
TODO:
- give mail reminders a second try (Børre)
- investigated, but libraries or path not updated.
M4
TODO:
- make speller make targets that utilise M4 to produce normative
- disamb variants for derivations (see next) (Saara)
7. Linguistics
Derivation and spellers like Aspell
- revert the CG rule that prefers lexicalised forms over derivations with M4
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
Nothing this week? No.
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- started
- Schedule a T-T meeting this week - Wednesday.
- done
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in
TODO:
- finish refactoring for multiple collections in the search interface
- almost done - SD-terms now converted and working, still needs additions for a
- worked on a specification (in the new CVSROOT/words/ section)
- finished
- develop the needed XQueries and UI (Sjur, Tomi)
- turn Tomcat on on the G5; send admin username and password to Sjur
- add eXist and the proper noun interface to the G5 (Sjur)
- data synchronisation between risten.no and the cvs repo - postponed
- new version of xml2lexc (based on ccat), should handle complex names correctly:
9. Tromsø meeting follow-up
TODO:
- speller development - see the meeting memo. Separate
- done
- Lule Sámi linguist (Sjur)
- nothing done last week, but Børre has another possible candidate
Speller data generation
TODO:
- start to plan the implementation of the speller data conversion/generation
- started; plan documents should be added to gt/doc/proof/
10. Other
Bug fixing
64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and Marratech
TODO:
- download and install newest Marratech
- we need instructions on how to use it, and test it (Sjur)
Task lists as iCal entries
TODO:
- update Maaren's and Saara's installations to r430284 (Børre)
11. Next meeting, closing
Next meeting 9.10.2006 at 9:30.
Closed at 10:38.
Appendix - task lists for the next week
Børre
- corpus collection:
- contact Ája (Kåfjord)
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- remind Bård Eriksen about the book catalogue
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- set up Bugzilla automatic reminders for open issues
- finish Forrest i18n work (pdf)
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- add the new Words section to the site
- fix bugs!
Maaren
- download and install latest Marratech
Saara
- add more texts to the graphical corpus interface
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- generate parallel corpus files manually (with Trond)
- Improve text_cat
- export corpus tools to location available to all (with cron), cf news disc.
- Write a script for cleaning sme-hyph-output.
- Implement M4-rules for sme-dis.rle
- improve smj LM based on the new, larger corpus
- fix bugs!
Sjur
- finish the hyphenator clean-up script
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- XQuery refactoring and code development for our proper noun editor
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- start hiring process of linguist and programmer
- finish i18n work of Forrest
- consider the problems of lexicalised derivations skewing analyses of
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- get instructions on how to use Marratech, and test it
- fix bugs!
Thomas
- work with Polderland phonetic rules
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- start to plan the implementation of the speller data conversion/generation
- fix bugs!
Trond
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma texts to improve language recognition
- study paragraphs with mixed content
- fix bugs!