Meeting_2006-04-18
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 18.04.2006
- Time: 09.30 Norw. time
- Place: Room 4453, University of Tromsø
- Tools: SubEthaEdit, eye-to-eye contact, sound-to-ear
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 43.
Present: Børre, Thomas, Tomi, Trond
Absent: Maaren (sick leave), Sjur (paternal leave), Saara (not
Main secretary: Børre
Agenda accepted with additions under "Other".
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Not done
- Not done
- Gather public texts, preferrably also parallel ones
- Not done
- Not done
- Continue converting text from input format to our xml
- Done
- Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- Not done
- review the paratext2xml converter
- Didn't work one and a half week ago
- Didn't work one and a half week ago
- convert smj NT to paratext
- Not done
- Not done
- Send out letters to the Iđut authors
- Not fonr
- Not fonr
- Add corpus security re G5 syncing as an issue to Bugzilla
- Done
- Done
- write docu for how to apply for a corpus user account (forms, recipients,
- Not done
- Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- Not done
- Not done
- integrate generated corpus repository summaries in the Forrest site
- Not done
- Not done
- install and test Gobby, install new version of SEE
- Not done
- Not done
- make free texts available
- add license info DEADLINE: Tuesday 4.4.)
- zip up the xml files in gtfree/ and put it up on the Divvun download area
- also provide an xml-free version?
- possibly e-mail Finnut about the resource (if Sjur goes on paternal
- add license info DEADLINE: Tuesday 4.4.)
- fix bugs!
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- file a bug report of utf-8 check in xml-validation of the corpus files.
- make free texts available
- rerun conversion (DEADLINE: Wednesday 5.4.)
- done, started fixing errors.
- done, started fixing errors.
- zip up the xml files in gtfree/ and put it up on the Divvun download area
- done
- done
- also provide an xml-free version?
- set up a weekly cron script (but only if there are new files)
- not done.
- not done.
- rerun conversion (DEADLINE: Wednesday 5.4.)
- Discuss: allow for more than one file at a time when uploading a file.
- Implement links to parallel files in corpus header.
- Turn on language recognition, skipping Finnish.
- done
- done
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- corpus repo access to free texts (with Børre)
-
Saara did it
-
Saara did it
- answer requests/questions
- e-mail Finnut about the availability of the free corpus and the download link
- done
- done
- conversion of corpus repo summary xml to Forrest xml
- done
- done
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- done some
- done some
- code design for XQueries needed for dict/term editing
- not really done yet
- not really done yet
- other eXist/risten.no/proper noun related:
- implemented a simple classification scheme editor as a test case for XQuery
- added browsing of terms based on the classification scheme (to help develop
- implemented a simple classification scheme editor as a test case for XQuery
- refactor code
- correct hyphenation of word boundaries and exceptions
- the same as the next task
- the same as the next task
- Set up the mechanism for the hash-mark transducer package
- fix bugs!
Thomas
- add incoming Lule Sámi word
- everything added that is possible now, about 50 unknown words left+2 abbr.
- everything added that is possible now, about 50 unknown words left+2 abbr.
- work on North Sámi compounding and derivation
- not
- not
- smj G3 issue
- not
- sme G3 issue
- not
- not
- add all word boundaries and exception hyphenation marks
- Done: All sme except propernoun-file. Trond has gone through the propernoun
- Done: All sme except propernoun-file. Trond has gone through the propernoun
- SubEthaEdit update
- done
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- not done
- not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- not done
- not done
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- not done
- not done
- XQuery refactoring and code development for our proper noun editor
- doing
- doing
- new version of xml2lexc (based on ccat), should handle complex names correct:
- not done
- not done
- implement data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- not done
- not done
- install and test Gobby, install new version of SEE
- not done
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- not done
-
fix bugs!
- not
Trond
- Unify anchor lists, and conflate nno and nob into nor
- Forgotten.
- Forgotten.
- remove deleted files from the CVS repository (in the Attic)
- Not done.
- Not done.
- grammatical searchability in the graphical corpus interface: revise taglist
- Done.
- Done.
- better smj NT text, get fin and swe NT texts
- Swe ok, fin promised but still not arrived. Nothing done to smj.
- Swe ok, fin promised but still not arrived. Nothing done to smj.
- Prepare a list of good candicates for first inclusion into the corpus.
- Not done.
- Not done.
- start and lead discussion and work on semantic features for disamb
- Not done.
- Not done.
- Install Gobby
- Not done.
- Not done.
- get a key for Maaren in May
- Not done.
- Not done.
- install aligner, test it and give feedback
- Not done.
- Not done.
- correct hyphenation of word boundaries and exceptions
- Not done.
- Not done.
- Set up the mechanism for the hash-mark transducer package
- Discussed with Sjur and set up some
- Discussed with Sjur and set up some
- get/upgrade keys for Børre's room for Tomi and Thomas
- Also not done.
- Also not done.
-
fix bugs!.
- Read through the list...
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- Not done
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Sæth replied by e-mail, hasn't had time to follow-up, but will try to
-
Børre will weekly mirror the URL mentioned in Sæths e-mail,
Olavi Korhonen's Lule Sámi dictionary.
- No news this week
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- wait for the discussions with Davvi Girji
- A talk with Brita Kåven, revealed that they would have a look at
- A talk with Brita Kåven, revealed that they would have a look at
Bible texts
We will get text from Finland, but still haven't received any. We have got the
Swedish html has arrived, no paratext. Norsk bibelselskap has not sent
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Min Áigi
Everything ok here.
TODO:
- Børre to contact Per Christian Biti on technical issues.
Kåfjord
Promised to send us texts, but nothing has arrived yet.
Sámi Instituhtta
Audhild Schanche promised to sign the contract and send us texts they
5. Corpus infrastructure
TODO:
- remove deleted files from the CVS repository (Trond, still not done!!)
- add UTF-8 check as part of the validation (Saara, not finished yet)
- Will file a bug report to Bz and continue the work.
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- convert from that xml to Forrest document format (Sjur)
- nothing last week
- nothing last week
- integrate the final Forrest documents into Forrest, and make sure it gets
- waiting for the above
- waiting for the above
- make free texts available
- zip up the xml files in gtfree/ and put it up on the Divvun download area
- Done
- Done
- also provide an xml-free version? I.e. only paragraphs, whatever, as given by
- done weekly by a cron script (but only if there are new files) (Saara)
- e-mail Finnut about the availability of the free corpus, and the download
- Børre to check if this is done.
- zip up the xml files in gtfree/ and put it up on the Divvun download area
Free and non-free texts
More info in a previous meeting memo.
TODO:
- we need to rerun the conversion, and add/check copyright/license status
- add license info (Børre, DEADLINE: Tuesday 4.4.)
- Done
- Done
- rerun conversion (Saara, DEADLINE: Wednesday 5.4.)
- Børre to check if this is done.
- add license info (Børre, DEADLINE: Tuesday 4.4.)
Linking parallel files
DECISION:
TODO:
- develop the web interface for uploading to make it easier to add several
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface http: //omilia.uio.no/CE/sami/ (Saara)
- We would like to have grammatical searchability, not only POS. (Saara,
- The tag list is revised, and Lars has implemented a graphical demo version of it. He will install the revised list to the demo corpus.
- The tag list is revised, and Lars has implemented a graphical demo version of it. He will install the revised list to the demo corpus.
- This presupposes a discussion with Oslo. (Trond to start discussion
- The discussion is ongoing, now we will only have to choose and install texts.
- The discussion is ongoing, now we will only have to choose and install texts.
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- Better Lule NT text still not made.
- preparations: gather more texts (we are doing this)
- Review the tag list and have it ready for inclusion (gt/cwb/korpustags.txt)
- Done.
- Done.
- Prepare a list of good candicates for first inclusion into the corpus.
- Not done.
Top-two priorities:
- Linda and Trond to find text to add.
- Lars to add them.
Language recognition
TODO:
- turn on language recognition, skipping Finnish (Saara)
- Done, it seems.
Now, there are gross errors in the material, the texts must be reconverted, it seems.
6. Infrastructure
Aligner
Today, we have two anchor files in addition to the original one.
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
- conflate nno with nob into 'nor' (Trond)
- Partly done, must go through and see.
- Partly done, must go through and see.
-
Saara to install the aligner, everyone to read the documentation
- Move smj to the anchor.txt file, so that we get eng/nor/sme/smj/fin.
- Thomas has finished smj, Trond to add the smj column to the rest.
Not much has happened, we must work more on this issue.
Hyphenator
TODO:
- correct hyphenation of word boundaries and exceptions (Sjur, Trond)
- add a possibility to upload whole documents for hyphenation (and also
- we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
Trond and Sjur have had a look at the issue. We postpone it until Sjur is back.
7. Linguistics
General - hyphenation
See discussion, open questions and decission in the previous meeting memo.
TODO:
- add all word boundaries (Thomas)
- Done: All sme except propernoun-file. Trond has gone through the propernoun
- Done: All sme except propernoun-file. Trond has gone through the propernoun
- Set up the mechanism for the hash transducer package. (Sjur, Tomi, Trond)
- Not done, let us wait until Sjur is back.
TODO:
Fix the rest of the propernoun file (Trond, Thomas)
North Sámi
Semantic feature system
Further discussion and details in
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- everything added that is possible now, about 50 unknown words left+2 abbr.
- everything added that is possible now, about 50 unknown words left+2 abbr.
- translate Northern Sámi lists and sets to Lule Sámi
- Hmm, this is done, isn't it? We have a working smj-dis.rle file.
8. Name lexicon infrastructure
TODO:
- refactor and prepare risten.no for multiple collections:
- refactor the code into more and more specific components according to our
- things are moving forward
- things are moving forward
- refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi)
- developing
- developing
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing this week
- nothing this week
- test and review when ready
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
10. Other
Bug fixing
47 open Divvun/Disamb bugs, and 25 risten.no bugs
After the corpus issues have been somewhat settled, we should do a bug barnraising. ... and then a new one after the name lexicon is fixed.
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the Iđut authors
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- make free texts available
- possibly e-mail Finnut about the resource (if Sjur goes on paternal
- check if this has been done
- check if this has been done
- possibly e-mail Finnut about the resource (if Sjur goes on paternal
- fix bugs!
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- file a bug report of utf-8 check in xml-validation of the corpus files.
- make free texts available
- rerun conversion (DEADLINE: Wednesday 5.4.)
- zip up the xml files in gtfree/ and put it up on the Divvun download area
- also provide an xml-free version?
- set up a weekly cron script (but only if there are new files)
- rerun conversion (DEADLINE: Wednesday 5.4.)
- Discuss: allow for more than one file at a time when uploading a file.
- Implement links to parallel files in corpus header.
- Turn on language recognition, skipping Finnish
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- fix bugs!
Sjur
- on paternal leave
Thomas
- add all word boundaries
- work on compounding and derivation
- smj G3 issue
- sme G3 issue
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct:
- implement data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- install and test Gobby, install new version of SEE
- Set up the mechanism for the hash-mark transducer package
- fix bugs!
Trond
- Unify anchor lists, and conflate nno and nob into nor
- remove deleted files from the CVS repository (in the Attic)
- better smj NT text, get fin NT texts
- Prepare a list of good candicates for first inclusion into the corpus.
- start and lead discussion and work on semantic features for disamb
- Install Gobby
- get a key for Maaren in May
- install aligner, test it and give feedback
- correct hyphenation of word boundaries and exceptions
- get/upgrade keys for Børre's room for Tomi and Thomas
- fix bugs!.
12. Next meeting, closing
24.04.2006 09: 30
Sjur is on paternal leave.
Closed at 11: 47