Meeting_2006-10-02
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Tromsø meeting follow-up
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 02.10.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:39.
Present: Børre, Saara, Sjur, Thomas, Tomi
Absent: Maaren, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
  - contact Ája (Kåfjord), talk to Lene
    - Got documents from Lene
  - send contracts to Čálliid Lágádus
    - Done
  - contact Richard Valkepää at NSI about older Min Áigi and Áššu files
    - Not done
- Move Norwegian documents in Min Áigi from sme to nob
  - Will have to check for other folders than 2003 ...
- corpus access:
  - possibly deploy the user account form as an HTML form
  - Write both user and admin documentation (Børre, review: Sjur, Thomas)
    - Not done
- set up Bugzilla automatic reminders for open issues
  - Not done
- finish Forrest i18n and Sámi in PDF work
  - Some more done
- Get more sma, smj texts to improve language recognition
  - Got lots of texts from Stig Gælok
- set up Tomcat for use with eXist and the propnouns db on the G5
  - not done
- add the new Words section to the site
  - Not done
- fix bugs!
  - None fixed this week
- Fixed a computer for Thomas
Maaren
- download and install latest Marratech
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
  - I'll add this to the bug db.
- remove headers and footers from pdf documents
  - the documents with multiple columns still not ready
- Implement server of the analysis tools.
  - not yet ready
- generate parallel corpus files manually (with Trond)
  - the routines are mostly done.
- Improve text_cat
  - the new language models are still not done.
- fix bugs!
Sjur
- finish the hyphenator clean-up script
  - done, but still too much output because we're using circular transducers
- name lexicon:
  - implement editing functions
    - waiting for the item below to be finished
  - finalise refactoring for multiple collections
    - search interface ready and converted to the new code for SD-terms
    - will add basic searching for the other collections as well, to demonstrate and test
  - implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
- start hiring process of linguist and programmer
  - nothing done last week, tried to move substantially forward with the name db and the required changes to risten.no
- finish i18n work of Forrest
  - what was done is now working completely as it should
  - but finishing it turned out to be not so simple, because it needs to work with the command-line client (CLI), which has no knowledge of locales or how to handle them.
- consider the problems of lexicalised derivations skewing analyses of derivation patterns
  - awaits M4-implementation in Disamb
- install eXist and our local copy of risten.no and propnouns on the G5
  - nothing
- speller follow-up from the Tromsø meeting
  - done, planning and code design has started
- get instructions on how to use Marratech, and test it
  - talked briefly with Leif Åge, asked him to push Geir or Roy; nothing received
- fix bugs!
Thomas
- work with Polderland phonetic rules
  - soon finished
- bug-fixing!
  - worked some
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  - begun to look at it
- review user documentation for corpus access
  - not done
- find and study all derived verbs in our corpus (depends on Trond)
  - not done
- suggest which derivations could be generated (depends on Trond)
  - not done
- meeting with Trond Wednesday on smj proper nouns
  - we did meet
Tomi
- new proper name lexicon
  - data synchronisation of proper nouns between risten.no and CVS
    - not done
  - XQuery refactoring and code development for our proper noun editor
    - not done
  - new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
    - partly
  - implement improvements decided upon in Tromsø
    - not done
- export corpus tools to /opt (with cron)
  - not done
- make speller make targets using M4
  - not done
- start to plan the implementation of the speller data conversion/generation
  - started
- fix bugs!
Trond
- better smj NT text
  - Will give our .doc text a second try with Børre.
- get fin, swe, nob and nno NT and OT in paratext format
- fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  - We use Sjur and Saara's perl solution instead.
- make shell script wrappers for the most common commands for user friendliness
  - Issue passed on to Tomi
- write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
  - Discussed with Thor Øivind, this is up to us.
- write documentation for our bound users, with pointers to the ordinary documentation
  - Not done.
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  - Gone through the lexicon with Thomas, done minor adjustments.
- Get more sma, smj texts to improve language recognition
  - Have started working on sma, Børre works on smj.
- study corpus for language recognition errors, as well as paragraphs with mixed content
- generate parallel corpus files manually (with Saara)
  - Worked with this issue, provided nob texts.
- block out the CG rule(s) that remove(s) the Der readings using M4
  - Pseudocode written, issue passed on to M4 literates.
- meeting with Thomas Wednesday on smj proper nouns
  - Had short meeting, will return to the issue.
- fix bugs!
3. Documentation
TODO:
- finish i18n work (Børre and Sjur)
  - done when running in 'forrest run' mode - it works perfectly now
  - does NOT work when running 'forrest site', instead it generates a heap of errors
  - i18n does not work in PDF ("Table of Content" won't translate)
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
  - User documentation, probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
  - Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks are needed to set up a corpus account.
- add the new Words section to the site (Børre)
  - not yet done
4. Corpus gathering
Børre got text from Lene and Stig Gælok, now we have 24 000 words in
TODO:
- contact NSI (Børre)
- bug Bård Eriksen about the book list (Børre)
- continue to contact authors (Børre)
5. Corpus infrastructure
General
Our way of dealing with the conversion of input documents has now reached an
TODO:
- remove headers and footers in the PDF conversion (Saara)
  - improving, but still needs some work
- consider an article on our corpus framework (all)
User accounts and access
For details, see a previous meeting memo, as well as the
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Saara)
  - Decision:
    - compiled transducers to /opt also in the future
    - scripts etc to /usr/local/share/bin/
- make shell script wrappers for the most common commands for user friendliness (we must think of what commands they are) (Tomi)?
  - first version of the first script, teaksta.sh, was checked in, but it is still not working (the problem is a simple matter of input-output handling; some shell script literates should have a look at it)
- write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
- write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (Børre)
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
Next week: discuss NT parallel corpus
Trond and Saara have tested manual alignment.
TODO:
- use the present aligner to generate some initial input for Oslo to test (Trond and Saara)
  - done
- gather parallel texts (Trond)
Language recognition
TODO:
- Get more text of the poorly covered languages: sma, smj (Trond, Børre)
  - sma: get the Bible texts (Trond)
  - we now have more smj texts, should be used to improve the smj LM (Saara)
- study the mistakes our recogniser makes today (Trond, Ilona)
  - done
- what about paragraphs with mixed content? Build a corpus of such paragraphs (Trond)
6. Infrastructure
Xerox tools wrapped as servers
Will continue this week, possible issues in the next meeting.
Feature request:
- option for XML output from server
TODO:
- improve and finish the present prototype (Saara)
  - done some, still more work to do
Hyphenator
Problem: we overgenerate because we are using a circular transducer:
A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me
A-#fi^ná^laid#ea^me
A-#fi^ná^lai^dea^me
A-#fi^ná^lai^dea^me
A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me
Pseudocode for hyph cleanup:
- read cohort
- remove all but the readings with the least word boundaries
- compare the rest with the input string, disregarding ^ and #:
  - delete forms that do not correspond to the input string
- unique the final set
- print what is left (it should normally be only one form)
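The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not the project's actual clean-up script (which, per the task list, is written in Perl). It assumes that each line holds an input word and one hyphenator reading separated by a tab (the separator change proposed in the TODO below), that `#` marks a word boundary and `^` a hyphenation point, and that the readings for one input word arrive on consecutive lines. The function names `strip_marks`, `clean_cohort` and `clean_lines` are hypothetical.

```python
from itertools import groupby

HYPHEN_MARK = "^"    # assumed: hyphenation point
BOUNDARY_MARK = "#"  # assumed: word boundary


def strip_marks(form):
    """Remove hyphenation and word boundary marks from a reading."""
    return form.replace(HYPHEN_MARK, "").replace(BOUNDARY_MARK, "")


def clean_cohort(word, readings):
    """Apply the pseudocode steps to all readings of one input word."""
    if not readings:
        return []
    # Keep only the readings with the fewest word boundaries.
    fewest = min(r.count(BOUNDARY_MARK) for r in readings)
    kept = [r for r in readings if r.count(BOUNDARY_MARK) == fewest]
    # Delete forms that do not correspond to the input string.
    kept = [r for r in kept if strip_marks(r) == word]
    # Unique the final set, preserving the original order.
    return list(dict.fromkeys(kept))


def clean_lines(lines, sep="\t"):
    """Group 'input<sep>reading' lines into cohorts and clean each one."""
    pairs = [line.rstrip("\n").split(sep, 1) for line in lines if sep in line]
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield from clean_cohort(word, [pair[1] for pair in group])


if __name__ == "__main__":
    sample = [
        "A-finálaideame\tA--fi^ná^laid-ea^mi",
        "A-finálaideame\tA-fi^ná^laid#ea^me",
        "A-finálaideame\tA-fi^ná^lai^dea^me",
        "A-finálaideame\tA-fi^ná^lai^dea^me",
        "A-finálaideame\ta-#fi^ná^lai^dea^me",
    ]
    for form in clean_lines(sample):
        print(form)  # prints: A-fi^ná^lai^dea^me
```

The sketch follows the order of the pseudocode, so a cohort whose fewest-boundary readings do not match the input yields nothing; swapping the two filters would be one way to avoid that. Passing `sep=" "` handles the space-separated format shown in this memo, at the cost of mishandling multi-word expressions.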
The input that correspond to the partially-cleaned data above:
A-finálaideame A--fi^ná^laid-ea^mi A-finálaideame A--fi^ná^laid-ea^me A-finálaideame A--fi^ná^laid#ea^mi A-finálaideame A--fi^ná^laid#ea^me A-finálaideame A--fi^ná^lai^dea^mi A-finálaideame A--fi^ná^lai^dea^me A-finálaideame A--fi^ná^lai^dea^me A-finálaideame A-fi^ná^laid-ea^mi A-finálaideame A-fi^ná^laid-ea^me A-finálaideame A-fi^ná^laid#ea^mi A-finálaideame A-fi^ná^laid#ea^me A-finálaideame A-fi^ná^lai^dea^mi A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-#fi^ná^laid-ea^mi A-finálaideame A-#fi^ná^laid-ea^me A-finálaideame A-#fi^ná^laid#ea^mi A-finálaideame A-#fi^ná^laid#ea^me A-finálaideame A-#fi^ná^lai^dea^mi A-finálaideame A-#fi^ná^lai^dea^me A-finálaideame A-#fi^ná^lai^dea^me A-finálaideame A-fi^ná^laid-ea^mi A-finálaideame A-fi^ná^laid-ea^me A-finálaideame A-fi^ná^laid#ea^mi A-finálaideame A-fi^ná^laid#ea^me A-finálaideame A-fi^ná^lai^dea^mi A-finálaideame A-fi^ná^lai^dea^me A-finálaideame A-fi^ná^lai^dea^me A-finálaideame a--fi^ná^laid-ea^mi A-finálaideame a--fi^ná^laid-ea^me A-finálaideame a--fi^ná^laid#ea^mi A-finálaideame a--fi^ná^laid#ea^me A-finálaideame a--fi^ná^lai^dea^mi A-finálaideame a--fi^ná^lai^dea^me A-finálaideame a--fi^ná^lai^dea^me A-finálaideame a-fi^ná^laid-ea^mi A-finálaideame a-fi^ná^laid-ea^me A-finálaideame a-fi^ná^laid#ea^mi A-finálaideame a-fi^ná^laid#ea^me A-finálaideame a-fi^ná^lai^dea^mi A-finálaideame a-fi^ná^lai^dea^me A-finálaideame a-fi^ná^lai^dea^me A-finálaideame a-#fi^ná^laid-ea^mi A-finálaideame a-#fi^ná^laid-ea^me A-finálaideame a-#fi^ná^laid#ea^mi A-finálaideame a-#fi^ná^laid#ea^me A-finálaideame a-#fi^ná^lai^dea^mi A-finálaideame a-#fi^ná^lai^dea^me A-finálaideame a-#fi^ná^lai^dea^me
The fst used is:
sme/bin/hyph-sme.fst
To make:
make TARGET=sme hyph-sme.fst
TODO:
- finish the hyphenator clean-up script (Sjur)
  - done, but multi-word expressions are not handled correctly. Change separator from space to tab to fix it.
- improve the clean-up script to remove more-complex word forms (Saara)
- Update the sma hyphenator rule set with the insights gained from smj updates (Trond during weekends)
Automatic Bugzilla reminder for untouched bugs
Some perl-libraries needed by Bugzilla weren't in the path, causing it to not
TODO:
- give mail reminders a second try (Børre)
  - investigated, but libraries or path not updated.
M4
TODO:
- make speller make targets that utilise M4 to produce normative and hyphenation transducers - postponed a while (Tomi or Sjur)
- disamb variants for derivations (see next) (Saara)
7. Linguistics
Derivation and spellers like Aspell
- revert the CG rule that prefers lexicalised forms over derivations with M4 (Trond wrote the M4 pseudocode, M4-literates to translate). In the beginning of sme-dis.rle there is an explanation of the pseudocode. Just search for the rules as explained there.
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- lexicalise the rest (Thomas)
North Sámi
Nothing this week? No.
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
  - started
- Schedule a T-T meeting this week - Wednesday.
  - done
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in
TODO:
- finish refactoring for multiple collections in the search interface (Sjur)
  - almost done - SD-terms now converted and working, still needs additions for a couple of other collections (our name lexicon, and the mechanical collection)
  - worked on a specification (in the new CVSROOT/words/ section)
    - finished
- develop the needed XQueries and UI (Sjur, Tomi)
- turn Tomcat on on the G5; send admin username and password to Sjur (Børre)
- add eXist and the proper noun interface to the G5 (Sjur)
- data synchronisation between risten.no and the cvs repo - postponed
- new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry - postponed
9. Tromsø meeting follow-up
TODO:
- speller development - see the meeting memo. Separate follow-up next week.
  - done
- Lule Sámi linguist (Sjur)
  - nothing done last week, but Børre has another possible candidate
Speller data generation
TODO:
- start to plan the implementation of the speller data conversion/generation (Tomi)
  - started; plan documents should be added to gt/doc/proof/
10. Other
Bug fixing
64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Meetings and Marratech
TODO:
- download and install newest Marratech (Maaren)
- we need instructions on how to use it, and test it (Sjur)
Task lists as iCal entries
TODO:
- update Maaren's and Saara's installations to r430284 (Børre)
11. Next meeting, closing
Next meeting 9.10.2006 at 9:30.
Closed at 10:38.
Appendix - task lists for the next week
Børre
- corpus collection:
  - contact Ája (Kåfjord)
  - contact Richard Valkepää at NSI about older Min Áigi and Áššu files
  - remind Bård Eriksen about the book catalogue
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
  - possibly deploy the user account form as an HTML form
  - Write both user and admin documentation (Børre, review: Sjur, Thomas)
- set up Bugzilla automatic reminders for open issues
- finish Forrest i18n work (pdf)
- Get more sma, smj texts to improve language recognition
- set up Tomcat for use with eXist and the propnouns db on the G5
- add the new Words section to the site
- fix bugs!
Maaren
- download and install latest Marratech
Saara
- add more texts to the graphical corpus interface
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- generate parallel corpus files manually (with Trond)
- Improve text_cat
- export corpus tools to location available to all (with cron), cf news disc.
- Write a script for cleaning sme-hyph-output.
- Implement M4-rules for sme-dis.rle
- improve smj LM based on the new, larger corpus
- fix bugs!
Sjur
- finish the hyphenator clean-up script
- name lexicon:
  - implement editing functions
  - finalise refactoring for multiple collections
  - implement improvements decided upon in Tromsø
  - XQuery refactoring and code development for our proper noun editor
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
- start hiring process of linguist and programmer
- finish i18n work of Forrest
- consider the problems of lexicalised derivations skewing analyses of derivation patterns
- install eXist and our local copy of risten.no and propnouns on the G5
- speller follow-up from the Tromsø meeting
- get instructions on how to use Marratech, and test it
- fix bugs!
Thomas
- work with Polderland phonetic rules
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- start to plan the implementation of the speller data conversion/generation
- fix bugs!
Trond
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
- write documentation for our bound users, with pointers to the ordinary documentation
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Get more sma texts to improve language recognition
- study paragraphs with mixed content
- fix bugs!

