Meeting_2006-04-24
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 24.04.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:11.
Present: Børre, Thomas, Tomi, Trond, Saara
Absent: Maaren (sick leave), Sjur (paternity leave)
Agenda accepted with additions under "Other".
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Not done
- Gather public texts, preferably also parallel ones
- Done
- Continue converting text from input format to our xml
- Not done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not done
- Send out letters to the Iđut authors
- Not done
- write docu for how to apply for a corpus user account (forms, recipients,
- Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- Not done
- integrate generated corpus repository summaries in the Forrest site
- Not done
- make free texts available
- possibly e-mail Finnut about the resource (if Sjur goes on paternal
- check if this has been done
- Not done
- fix bugs!
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- file a bug report of utf-8 check in xml-validation of the corpus files.
- done
- make free texts available
- also provide an xml-free version?
- done
- set up a weekly cron script (but only if there are new files)
- not done.
- Discuss: allow for more than one file at a time when uploading a file.
- done
- Implement links to parallel files in corpus header.
- not done
- Turn on language recognition, skipping Finnish.
- done
- add a possibility to upload whole documents for hyphenation (and also
- not done, discussed in news
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- not done
- fix bugs!
Sjur
- on paternity leave
Thomas
- add all word boundaries
- still working with sme-propernames
- work on compounding and derivation
- not
- smj G3 issue
- not
- sme G3 issue
- not
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- not done
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- doing
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner docu, install, provide feedback
- not done
- install and test Gobby, install new version of SEE
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- fix bugs!
- not
Trond
- Unify anchor lists, and conflate nno and nob into nor
- Done the unification, done letter A on nob/nno issue.
- remove deleted files from the CVS repository (in the Attic)
- Done
- better smj NT text, get fin NT texts
- This one has not been followed up.
- Prepare a list of good candidates for first inclusion into the corpus.
- Decided to follow the principle "biggest texts first".
- start and lead discussion and work on semantic features for disamb
- Not done.
- Install Gobby
- Done; was able to install 0.3. According to Sjur, 0.41 should also be within reach for Mac users. We should try to get 0.41, since the two versions are not compatible with each other.
- get a key for Maaren in May
- Not done
- install aligner, test it and give feedback
- Not done
- correct hyphenation of word boundaries and exceptions
- Worked quite a lot on this issue (as has Thomas); the file is starting to look OK now.
- get/upgrade keys for Børre's room for Tomi and Thomas
- fix bugs!
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- The item will be moved to the TODO list, again.
4. Corpus gathering
Trip to Sámi municipalities and institutions
Børre is going on a trip to Karasjok, Tana, Nesseby, Utsjoki and Inari to visit the municipalities, parliaments and the Sámi Council. In Karasjok he will try to visit Davvi Girji as well.
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Sæth replied by e-mail, hasn't had time to follow up, but will try to
- Børre will mirror the URL mentioned in Sæth's e-mail weekly,
Olavi Korhonen's Lule Sámi dictionary.
- No news this week
TODO: Børre to contact Olavi Korhonen and Kuhmunen
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- wait for the discussions with Davvi Girji
- A talk with Brita Kåven revealed that they would have a look at
Bible texts
We will get text from Finland, but still haven't received any. We have got the
Swedish html has arrived, no paratext. Norsk bibelselskap has not sent
TODO:
- convert smj NT to paratext (Børre)
- Not done
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
- No news
Min Áigi
Everything ok here.
TODO:
- Børre to contact Per Christian Biti on technical issues (how to transfer texts).
Kåfjord
Promised to send us texts, but nothing has arrived yet.
TODO: Børre to contact them.
Sámi Instituhtta
Audhild Schanche has signed the contract. We will have to contact them about transferring the texts.
TODO: Børre to contact them.
5. Corpus infrastructure
https://giellalt.uit.no/lang/corp/corpus-summary.html
TODO:
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- make free texts available
- done weekly by a cron script (but only if there are new files) (Saara)
- Make a link, easily available, to these texts.
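The "only if there are new files" condition for the weekly job could be sketched roughly as below. This is an illustration only: the stamp file, the directory layout and the function names `files_newer_than` and `publish_if_new` are assumptions, not part of the existing corpus scripts.

```python
from pathlib import Path


def files_newer_than(root: Path, stamp: Path) -> list[Path]:
    """Return files under root modified after the stamp file was last touched."""
    # If the job has never run, there is no stamp and every file counts as new.
    last_run = stamp.stat().st_mtime if stamp.exists() else 0.0
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p != stamp and p.stat().st_mtime > last_run
    )


def publish_if_new(root: Path, stamp: Path, publish) -> bool:
    """Call publish(new_files) only when something changed since the last run."""
    new_files = files_newer_than(root, stamp)
    if not new_files:
        return False  # nothing new: the weekly job exits quietly
    publish(new_files)
    stamp.touch()  # record this run so the same files are not published again
    return True
```

A weekly cron entry would then call `publish_if_new` with the directory of free texts and whatever publishing step is eventually chosen; both are left open here.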
Free and non-free texts
More info in a previous meeting memo.
TODO:
- Check the status of the texts, again. (Børre, Trond)
- Rerun the conversion afterwards (Saara is the one with the magic spell)
Linking parallel files
DECISION:
TODO:
- develop the web interface for uploading to make it easier to add several
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface http://omilia.uio.no/CE/sami/ (Saara)
- We add the largest texts first.
- We would like to have grammatical searchability, not only POS. (Saara,
- This presupposes a discussion with Oslo. (Trond and Saara to continue this discussion)
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- The list of good candidates: The longest (admin) texts.
- We need a ccat version of the script for analysing text, still keeping xml tags. (Tomi).
Top-two priorities:
- Trond and Saara to discuss with Lars.
- Lars to add text to the server.
- Tomi to prepare for the parallel corpus.
Language recognition
TODO:
- turn on language recognition, skipping Finnish (Saara)
- Done, it seems.
Some flags to write into the xsl file:
- method: do not run lg recognition
- method: Choose between these 2: nob, sme
TODO: Saara to implement this and to write short documentation on how to enter the appropriate commands in the file-specific xsl documents.
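The two per-file settings discussed above (no recognition at all, or a choice restricted to a given set such as nob and sme) might behave roughly as in the following sketch. The mode names, the `recognise` function, the `guess` callback and the default language list are all assumptions made for illustration; the actual flag names in the xsl files are still to be decided.

```python
from typing import Callable, Iterable, Optional

# Assumed default candidate set; Finnish is left out, as decided for the recogniser.
DEFAULT_LANGS = ("sme", "smj", "nob", "nno")


def recognise(
    paragraph: str,
    guess: Callable[[str, Iterable[str]], str],
    main_lang: str,
    mode: str = "default",
    candidates: Optional[Iterable[str]] = None,
) -> str:
    """Return the language code for a paragraph under a per-file setting.

    guess(paragraph, langs) stands in for the real recogniser (hypothetical).
    """
    if mode == "off":
        # Do not run language recognition: trust the document's main language.
        return main_lang
    if mode == "restrict":
        # Choose only between the languages listed for this file, e.g. nob and sme.
        langs = tuple(candidates or ())
        if not langs:
            raise ValueError("mode 'restrict' needs at least one candidate language")
        return guess(paragraph, langs)
    if mode == "default":
        return guess(paragraph, DEFAULT_LANGS)
    raise ValueError(f"unknown mode: {mode!r}")
```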
6. Infrastructure
Aligner
Today, we have two anchor files in addition to the original one.
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
- conflate nno with nob into 'nor' (Trond)
- Partly done, must go through and see.
- Saara to install the aligner; everyone to read the documentation
- Trond and Saara will continue this issue.
Hyphenator
Trond and Thomas have been updating the propernoun file with ^ tags. We need the tag in front of compound parts beginning in a vowel or in two or more consonants. Compound parts beginning with one consonant are handled correctly.
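The rule for where the ^ tag is needed can be illustrated with a small sketch. The vowel inventory, the function names and the example words are assumptions chosen for illustration; the real tags are entered by hand in the lexicon files, and digraphs are not treated specially here.

```python
# Assumed vowel inventory (North Sámi plus letters common in names).
VOWELS = set("aáeiouyæøåäö")


def needs_boundary_tag(part: str) -> bool:
    """True if a non-initial compound part needs a ^ in front of it.

    Parts starting with a vowel or with two or more consonants need the tag;
    parts starting with a single consonant are already hyphenated correctly.
    """
    letters = part.lower()
    if not letters:
        return False
    if letters[0] in VOWELS:
        return True
    return len(letters) > 1 and letters[1].isalpha() and letters[1] not in VOWELS


def mark_compound(parts: list[str]) -> str:
    """Join compound parts, inserting ^ only where the rule requires it."""
    if not parts:
        return ""
    first, *rest = parts
    return first + "".join(("^" + p) if needs_boundary_tag(p) else p for p in rest)
```

Under these assumptions, `mark_compound(["mánáid", "skuvla"])` gives `mánáid^skuvla`, while `mark_compound(["sátne", "girji"])` stays untagged as `sátnegirji`.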
TODO:
- Remove the ^ signs from the UPPER level (just like the TV and IV tags are
- Otherwise we wait with the following items until Sjur is back.
- correct hyphenation of word boundaries and exceptions (Sjur, Trond)
- add a possibility to upload whole documents for hyphenation (and also
- we should log each and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
7. Linguistics
General - hyphenation
See discussion, open questions and decision in the previous meeting memo.
TODO:
- add all word boundaries (Thomas)
- Done: All sme except propernoun-file. Trond has gone through the propernoun
- Set up the mechanism for the hash transducer package. (Sjur, Tomi, Trond)
- Not done, let us wait until Sjur is back.
TODO:
- Thomas and Trond to carry on marking ^ on the compound parts beginning in a
North Sámi
Semantic feature system
We postpone this issue until Linda has met with Eckhard in early May.
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- everything added that is possible now; about 50 unknown words left, plus 2 abbreviations.
8. Name lexicon infrastructure
TODO:
- refactor and prepare risten.no for multiple collections:
- refactor the code into more and more specific components according to our
- things are moving forward
- develop the needed XQueries and interface (Sjur, Tomi)
- developing
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing this week
- test and review when ready
Discussion postponed until Sjur is back.
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
Discussion postponed until Sjur is back.
10. Other
Bug fixing
50 open Divvun/Disamb bugs, and 25 risten.no bugs
Please help Saara with bug 279. Not much help...
Saara will contact Roy on this issue.
After the corpus issues have been somewhat settled, we should do a bug barnraising. ... and then a new one after the name lexicon is fixed.
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Contact Per Christian Biti (Min Áigi) on technical issues (how to transfer texts)
- Contact Kåfjord
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the Iđut authors
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- contact Olavi Korhonen and Henrik Mikael Kuhmunen
- fix bugs!
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- make free texts available
- set up a weekly cron script (but only if there are new files)
- Implement links to parallel files in corpus header.
- Implement turning off the language recognition in the xsl-file (and corpus.dtd).
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- fix bugs!
Sjur
- on paternity leave
Thomas
- add all word boundaries
- work on compounding and derivation
- smj G3 issue
- sme G3 issue
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- install and test Gobby, install new version of SEE
- Set up the mechanism for the hash-mark transducer package
- fix bugs!
Trond
- Make nob into nor in the anchor list (A is done)
- better smj NT text, get fin NT texts
- Discuss Gobby 0.4 with Sjur
- get a key for Maaren in May
- install aligner, test it and give feedback
- correct hyphenation of word boundaries and exceptions with Thomas
- get/upgrade keys for Børre's room for Tomi and Thomas
- fix bugs!
12. Next meeting, closing
02.05.2006 09:30
Sjur is on paternity leave.
Closed at 12:04