Meeting_2006-05-09
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Public tender
- 11. Other
- 12. Summary, task list
- 13. Next meeting, closing
Meeting setup
- Date: 09.05.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:00.
Present: Børre, Saara, Sjur, Thomas, Trond, Tomi
Absent: Maaren
Main secretary: Sjur
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Two contracts signed, hurray!
- Gather public texts, preferably also parallel ones
- Done
- Continue converting text from input format to our xml
- Not done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not done
- Send out letters to the Iđut authors
- Not done
- write docu for how to apply for a corpus user account (forms, recipients,
- Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- Status?
- integrate generated corpus repository summaries in the Forrest site
- This is done
- mirror Odin URL (create cron task to do it automatically?)
- Not done
- fix bugs!
- 243 is fixed …
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- make free texts available
- set up a weekly cron script (but only if there are new files)
- move this to bugzilla
- Implement links to parallel files in corpus header.
- not done
- Implement parallel corpus upload in web upload script
- not ready
- Implement turning off the language recognition in the xsl-file
- not ready
- Refine language recognition
- not ready
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- I'll move these to bugzilla.
- Investigate the problem with decomposed Unicode characters in file names.
- Found no solution.
- Convert Min Áigi file names.
- Done, but the decomposed characters are left as is.
- Convert Min Áigi documents
- Converted 2005 and 2006. The character conversion for text files
- fix bugs!
Sjur
- read through and evaluate the public tender offers
- done, still more work
Thomas
- add all word boundaries
- done
- work on compounding and derivation
- begun to work again
- smj G3 issue
- nothing
- sme G3 issue
- nothing
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- Not done
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- Not done
- XQuery refactoring and code development for our proper noun editor
- Not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- Not done
- read aligner docu, install, provide feedback
- Not done
- install and test Gobby, install new version of SEE
- Not done
- Set up the mechanism for the hash-mark transducer package
- Not done
- fix bugs!
- Fixed some
Trond
- Make nob into nor in the anchor list (A, B, C is done)
- Now all done.
- better smj NT text, get fin NT texts
- Not done
- Discuss Gobby 0.4 with Sjur
- Done, there is no 0.4, only 0.3, is testing it out with different people,
- get a key for Maaren in May
- Will get one today, another on Thursday
- install aligner, test it and give feedback
- Not done.
- correct hyphenation of word boundaries and exceptions with Thomas
- We did something.
- get/upgrade keys for Børre's room for Tomi and Thomas
- Got the key numbers, but not upgraded yet.
- fix bugs!
- Not done.
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- The item will be moved to the TODO list, again.
4. Corpus gathering
Trond added sme bureaucratic texts, roughly 0.4 million words, total size
Trip to Sámi municipalities
Børre back from his trip.
- Min Áigi
- 3-4 years of texts ready, both uncorrected and corrected (with bullet)
- Sámi parliament
- Tana
- Isak Saba guovddáš (Nesseby, talked with Signe Iversen)
- Utsjoki (a lot of admin texts)
- pdf files as free texts
- orig MS Word files require a closed contract (and a Finnish language version
- Sámi ráđđi (Utsjoki)
- Inari (Sámi parliament)
- got four texts, will ask for more
- Giellagas instituhtta (Oulu, talked with Seija Risten Somby)
The Isak Saba guovddáš will not sign unless the contract is in Sámi, and they
Note to be placed somewhere:
http://www.oqaasileriffik.gl/dk/ Greenland's language secretariat have a
The Min Áigi format should be dealt with: \@ingress etc should be dealt with for
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Signed contracts since last meeting:
- NSI (=> a lot of Min Áigi and Áššu texts)
- Seija Risten Somby, giving her gradu-paper.
Odin
Sæth replied by e-mail, hasn't had time to follow-up, but will try to
- Børre will weekly mirror the URL mentioned in Sæth's e-mail,
Olavi Korhonen's Lule Sámi dictionary.
- No news this week
TODO: Børre to contact Olavi Korhonen and Kuhmunen
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- wait for the discussions with Davvi Girji
- A talk with Brita Kåven revealed that they would have a look at
Bible texts
We will get text from Finland, but still haven't received any. We have got the
Swedish html has arrived, no paratext. Norsk bibelselskap has not sent
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Min Áigi
Børre has received texts, and forwarded them to Trond. Problems with
The files (appr 2000 files) are added, here:
We have problems with Unicode characters in filenames. All characters with
a84-231-8-254:~ sjur$ l a+TAB
áda áde ádo åde
a84-231-8-254:~ sjur$ l a
This was solved once before, and we need to look at this again. The old Bugzilla
TODO:
- Børre to contact Per Christian Biti on technical issues (how to
- reopen Bugzilla issue, and study the previous discussion and solution
- add filename extensions to files not having one
- investigate whether the bullet has a meaning or could be removed
- Space in file names should be changed to underscore (and not to hyphen!).
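Note on the file name TODOs above: a minimal sketch of how the cleanup could be done, assuming the file names arrive with decomposed characters (base letter plus combining accent) and that we want the precomposed form. The function name clean_filename is hypothetical and not part of our scripts; only the standard unicodedata module is used.

```python
import unicodedata


def clean_filename(name: str) -> str:
    """Return the name in composed (NFC) form, with spaces changed to underscores."""
    # NFC turns "a" + U+0301 (combining acute) into the single character "á".
    composed = unicodedata.normalize("NFC", name)
    # The memo asks for underscore, not hyphen, in place of spaces.
    return composed.replace(" ", "_")


if __name__ == "__main__":
    # Decomposed input, written with explicit escapes so it is visible in the source.
    print(clean_filename("Min A\u0301igi 2005.txt"))
```

Renaming the actual files (for example with os.rename) is left out on purpose, since the earlier Bugzilla discussion should be checked first.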
Min Áigi seems to have been changing from text files to MS Word around issue
Kåfjord
Promised to send us texts, but nothing has arrived yet.
TODO: Børre to contact them.
Sámi Instituhtta
Audhild Schanche has signed the contract. We will have to contact them about
TODO: Børre to contact them.
5. Corpus infrastructure
https://giellalt.uit.no/lang/corp/corpus-summary.html
TODO:
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- make free texts available
- done weekly by a cron script (but only if there are new files) (Saara)
- move to bugzilla
- Make a link, easily available, to these texts.
- done as a downloadable tar package.
Name change again?
gt -> gtbound/
gtbound -> some nifty new letter... ?
gtfree -> some nifty new letter... ?
Trond to come up with some new suggestion.
Free and non-free texts
More info in a previous meeting memo.
TODO:
- Check the status of the texts, again. (Børre, Trond, Saara)
- Rerun the conversion afterwards (Saara is the one with the magic spell)
- check against bugs
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface
- We add the largest texts first.
- We would like to have grammatical searchability, not only POS. (Saara,
- This presupposes a discussion with Oslo. (Trond and Saara to continue
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- The list of good candidates: The longest (admin) texts.
- We need a ccat version of the script for analysing text, still keeping xml
Top-two priorities:
- Trond and Saara to discuss with Lars.
- Lars to add text to the server.
- Tomi to prepare for the parallel corpus.
Language recognition
TODO:
- refine language recognition (Saara)
- in progress
- create a short word list to help the trigram heuristics
- Trond has made such lists for all lgs except sme, smj and nob.
- send Saara sme, smj and nob files (sort, but not uniq -c) (Trond)
- Add some flag to write into the xsl file (Saara):
- method: do not run lg recognition
- method: Choose between these 2: nob, sme
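Note on the language recognition TODOs above: a rough sketch of how a short word list could support the trigram heuristics, and how a restricted candidate list (for example only nob and sme) or an empty one (do not run recognition) could be honoured. This is not our actual recogniser. The function names, the scoring, the word_bonus weight, and the tiny profiles and word lists are all assumptions made for illustration.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams in the text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def guess_language(text, profiles, word_lists, candidates=None, word_bonus=5):
    """Return the best-scoring language code, or None if recognition is switched off.

    candidates=None means "try every language in profiles";
    candidates=[] means "do not run language recognition".
    """
    langs = list(profiles) if candidates is None else list(candidates)
    if not langs:
        return None
    text_trigrams = trigrams(text)
    words = text.lower().split()
    scores = {}
    for lang in langs:
        # Trigram part: how many trigram occurrences the text shares with the profile.
        shared = sum(min(n, profiles[lang][tri]) for tri, n in text_trigrams.items())
        # Word list part: each hit on a short, frequent word gets a fixed bonus.
        hits = sum(1 for word in words if word in word_lists[lang])
        scores[lang] = shared + word_bonus * hits
    return max(scores, key=scores.get)


# Toy data only; real profiles would be built from the corpus.
profiles = {
    "sme": trigrams("mun lean sámi ja don leat"),
    "nob": trigrams("jeg er norsk og du er ikke"),
}
word_lists = {
    "sme": {"ja", "lea", "ii"},
    "nob": {"og", "er", "ikke"},
}

if __name__ == "__main__":
    print(guess_language("dat lea ja", profiles, word_lists, candidates=["nob", "sme"]))
```

The candidates argument corresponds to the flag that would be written into the xsl file; how that flag is named and read is still open.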
6. Infrastructure
Aligner
Today, we have two anchor files in addition to the original one.
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
- conflate nno with nob into 'nor' (Trond, letters a, b, c done)
- Partly done, must go through and see.
- Saara to install the aligner, everyone to read the documentation
- done, waiting for the test files from Bergen.
- Trond and Saara will continue this issue.
Hyphenator
Trond and Thomas have been updating the propernoun file with ^ tags. We need the
TODO:
- Remove the ^ signs from the UPPER level when generating (just like the TV and
- done
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done.
- add a possibility to upload whole documents for hyphenation (and also
- we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
- add exceptions marks to the smj lexicon (boks^távva)
7. Linguistics
General - hyphenation
See discussion, open questions and decisions in the previous meeting memo.
TODO:
- add all word boundaries (Thomas)
- Done
- Set up the mechanism for the hash transducer package - fst gymnastics, see
- Lule Sámi behaviour around -st- is still somewhat unclear. Trond has had a
North Sámi
There are some heavy bugs:
- he(a)ddjiid is one
- compounding cleanup - no shortening when normative, still shortening when
- vowel-shortening when compounding (we need the input from Pekka!)
We should have some linguistic workshops while Maaren is here.
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehto#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehto#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehto#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat diehto#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat diehto#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat diehto#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
After postprocessing, obeying Karlsson's law (choose the wordform with the least
"<diehtojuohkindoaimmat>"
    "diehtojuohkin#doaibma" N Sg Acc PxSg2
    "diehtojuohkin#doaibma" N Pl Nom
    "diehtojuohkin#doaibma" N Sg Gen PxSg2
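Note on the postprocessing step: a minimal sketch of the counting idea, assuming that the rule amounts to keeping only the analyses with the fewest compound boundaries (the # marks). That reading is an assumption, and the function name fewest_boundaries is hypothetical. The sketch only shows the filter on two of the raw analyses; it does not reproduce the postprocessed readings shown above, which come from the real pipeline.

```python
def fewest_boundaries(analyses):
    """Keep only the analyses that have the smallest number of '#' compound boundaries."""
    if not analyses:
        return []
    fewest = min(analysis.count("#") for analysis in analyses)
    return [analysis for analysis in analyses if analysis.count("#") == fewest]


if __name__ == "__main__":
    raw = [
        # Three boundaries: diehtu # juohku # itna # doaibma
        "diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom",
        # Two boundaries: diehtu # juohkin # doaibma
        "diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom",
    ]
    for analysis in fewest_boundaries(raw):
        print(analysis)
```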
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- everything added that is possible now, about 50 unknown words left+2 abbr.
- There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. These become more visible as we get real texts.
- names
- compounds? Shortening here as well, but not in written language (some
- loanwords?
8. Name lexicon infrastructure
TODO:
- refactor and prepare risten.no for multiple collections:
- refactor the code into more and more specific components according to our
- things are moving forward
- write down the most common editing scenarios (to be used as guides for making
- develop the needed XQueries and interface (Sjur, Tomi)
- developing
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing this week
- test and review when ready
- Rethink the doubletagging procedure for names, consider grammatically
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
- aspell
- hunspell
10. Public tender
TODO:
- read the offers (Børre, Sjur, Trond, Tomi)
- meet on Tuesday 13 to sum up the findings (Børre, Sjur, Tomi, Trond)
- telephone meeting next Wednesday with Finnut (Børre, Sjur)
11. Other
Bug fixing
50 open Divvun/Disamb bugs, and 25 risten.no bugs
Please help Saara with bug 279. Not much help...
After the corpus issues have been somewhat settled, we should do a bug
Move to victorio
xerox tools: update PATH to
/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/
/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/lib/
Victorio still does not compile, despite a path fix, cf. bug #282.
*** Building sme.save ***
printf "compile-source sme/src/sme-lex.txt sme/src/adv-sme-lex.txt ... \n\
read-rules sme/bin/twol-sme.bin \n\
compose-result \n\
save-result sme/bin/sme.save \n\
quit \n" > tmp/save-script
lexc -utf8 < tmp/save-script
/bin/sh: lexc: command not found
make: *** [sme/bin/sme.save] Error 127
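Note on the build failure: error 127 from /bin/sh means that lexc was not found on the PATH that make was using. A small sketch for checking this before running make, assuming that lexc is expected in the bin directory named above. The function name missing_tools is hypothetical; shutil.which is the standard library lookup.

```python
import os
import shutil

# Directory taken from the memo; whether lexc really lives there is an assumption.
XEROX_BIN = "/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/"


def missing_tools(tools, path):
    """Return the tools that cannot be found as executables on the given PATH string."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    # Search the Xerox directory first, then whatever PATH the shell already has.
    search_path = XEROX_BIN + os.pathsep + os.environ.get("PATH", "")
    for tool in missing_tools(["lexc"], search_path):
        print(f"{tool}: not found on PATH")
```

If the check passes in an interactive shell but make still fails, the PATH seen by make is probably different from the login shell's, which would fit bug #282.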
12. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the Iđut authors
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- mirror Odin URL (create cron task to do it automatically?)
- read & evaluate received offers
- telephone meeting Wednesday with Finnut
- Check the status & license of the corpus texts
- fix bugs!
Maaren
- Reso
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- move to bugzilla:
- set up a weekly cron script to make free texts available
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- Implement links to parallel files in corpus header.
- Implement parallel corpus upload in web upload script
- Implement turning off the language recognition in the xsl-file (and corpus.dtd).
- Refine language recognition
- Investigate the problem with decomposed Unicode characters in file names.
- Correct decomposed Unicode in Min Áigi file names
- Check the status & license of the corpus texts
- Rerun corpus conversion
- fix bugs!
Sjur
- read & evaluate received offers
- telephone meeting Wednesday with Finnut
- fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- name lexicon:
- implement editing functions
- write down the most common editing scenarios, to guide development
Thomas
- correct hyphenation of exceptions
- correct hyphenation of smj -st-
- work on compounding and derivation
- smj G3 issue
- sme G3 issue
Tomi
- move to Bugzilla:
- aspell UTF-8 suffix bug
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- write down the most common editing scenarios (to be used as guides for making
- read aligner docu, install, provide feedback
- install and test Gobby, install new version of SEE
- Set up the mechanism for the hash-mark transducer package
- read & evaluate received offers
- fix bugs!
Trond
- better smj NT text, get fin NT texts
- get a key for Maaren in May
- install aligner, test it and give feedback
- correct hyphenation of word boundaries and exceptions with Thomas
- fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- get/upgrade keys for Børre's room for Tomi and Thomas
- fix bugs!
- Rethink the doubletagging procedure for names, consider grammatically
- write down the most common editing scenarios (to be used as guides for making
- read & evaluate received offers
- Check the status & license of the corpus texts
13. Next meeting, closing
15.05.2006 09:30
Closed at 11:18