Meeting_2006-04-03
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 03.04.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:43.
Present: Børre, Saara, Sjur, Thomas, Tomi, Trond
Absent: Maaren (sick leave)
Main secretary: Børre
Agenda accepted with additions under "Other".
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
  - Not done
- Gather public texts, preferably also parallel ones
  - Not done
- Continue converting text from input format to our xml
  - Some done
- convert nob and nno bible texts to be used as part of a parallel corpus
  - Not done
- review the paratext2xml converter
  - Not done
- convert smj NT to paratext
  - Not done
- Move complex name lexicon issue to bugzilla
  - Done
- Send out letters to the Iđut authors
  - Waiting for address list from Åge Persen
- Add corpus security re G5 syncing as an issue to Bugzilla
  - Not done
- write docu for how to apply for a corpus user account (forms, recipients,
  - Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
  - Not done
- integrate generated corpus repository summaries in the Forrest site
  - Not done
- Ask for email-address: corpus@giellatekno.uit.no
  - Done. It works.
- install and test Gobby, install new version of SEE (also for Thomas)
  - Haven't installed SEE. Tried to compile Gobby, but it failed when linking.
- fix bugs!
- Other
  - Contacted various publishers
    - Davvi Girji: will look at the contracts and give an answer after Easter.
    - Báhko (Lule Sámi): Bård Eriksen will discuss the contract with some other
    - DAT is fine with the contract, will compile an address list and send it
    - NSI: Audhild Schanche will discuss our request with her co-workers, but
Maaren
- will be on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- change the name of gt/ to gtbound/ and add a symbolic link.
  - done. The scripts are not yet updated.
- fix the email address for corpus upload.
  - done.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- add utf-8 check to xml-validation of the corpus files.
  - not done.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
  - nothing
- Follow up on place names from Norge Digitalt
  - nothing
- Evaluate SFST as speller (and analyzer) lexicon
  - nothing
- write a background document on the corpus contracts
  - nothing
- public tender:
  - answer requests/questions
    - wrote a lengthy e-mail formulated as a FAQ
  - corpus repo access to free texts (with Børre)
    - not done yet
- conversion of corpus repo summary xml to Forrest xml
  - nothing last week
- call EDD/ Christian Emil Ore about national place name lexicon
  - not done
- risten.no/proper noun lexicon development:
  - refactor code
    - done some
  - implement inheritance/collection overriding for css using sitemaps
    - done also for CSS now
  - code design for XQueries needed for dict/term editing
    - some initial discussions in the newsgroup, based on the file naming schemes
- fix bugs!
Thomas
- add incoming Lule Sámi words
  - added a few
- work on North Sámi compounding and derivation
  - nothing
- smj G3 issue
  - nothing
- sme G3 issue
  - nothing
- translate stopword list into smj (aligner; list from Trond)
  - finished
- assist Trond and Linda with the smj disamb work
  - assisted
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
  - not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  - not done
- new proper name lexicon
  - implement data synchronisation of proper nouns between risten.no and CVS
    - done some
  - XQuery refactoring and code development for our proper noun editor
    - started
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
    - not done
- read aligner docu, install, provide feedback
  - not done
- translate stopword list into sme (aligner; list from Trond)
  - done
- install and test Gobby, install new version of SEE
  - not done
- fix bugs!
Trond
- Translate anchor list into nno, work on sme, fin.
  - Not done for nno (which I think should be conflated with nob), but worked
  - good enough to be used for development, they only need correction.
- Add the anchor list translations to cvs
  - Done (the naming is not consistent: anchor.txt contains eng-nob-sme-fin, and
- remove deleted files from the CVS repository (in the Attic)
  - Forgot that.
- grammatical searchability in the graphical corpus interface: revise taglist
  - Not done.
- better smj NT text
  - Not done.
- Prepare a list of good candidates for first inclusion into the corpus.
  - Not done
- translate Northern Sámi lists and sets to Lule Sámi
  - What was this?
- work on semantically based sets (sme, smj)
  - Minor work done.
- start and lead discussion and work on semantic features for disamb
  - Not done.
- Install Gobby with support programs, see, etc.
  - Installed all the prerequisites, plan to ask Børre to help me out on this one
- get a key for Maaren in May
  - Not done.
- install aligner, test it and give feedback
  - Not done.
- fix bugs!
  - Not done.
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- Not done
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Sæth replied by e-mail, hasn't had time to follow up, but will try to
Olavi Korhonen's Lule Sámi dictionary.
- No news this week
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- wait for the discussions with Davvi Girji
Bible texts
We will get text from Finland, but still haven't received any. We have got the
TODO:
- convert smj NT to paratext (Børre)
- get fin and swe NT and OT in paratext format. (Trond)
Min Áigi
Had a meeting last week, nothing heard of it yet.
Kåfjord
Contacted us last week, they would like to give us texts. Excellent initiative
5. Corpus infrastructure
TODO:
- remove deleted files from the CVS repository (Trond, still not done.)
- we need to develop strong enough security routines for the G5 to fulfill our
  - move this to bugzilla (Børre, not done yet.)
- add UTF-8 check as part of the validation (Saara, not finished yet)
- Will file a bug report to Bz and continue the work.
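A minimal sketch of the UTF-8 check mentioned above, written in Python. The function names are hypothetical and this is not the project's validation script; it only illustrates rejecting a corpus file whose bytes do not decode as UTF-8 before XML validation is attempted.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_corpus_file(path: Path) -> bool:
    """Hypothetical helper: read a corpus file as raw bytes and test it."""
    return is_valid_utf8(path.read_bytes())
```

A file saved in Latin-1 with Sámi letters such as á would typically fail this check, since those byte values are not valid UTF-8 sequences on their own.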
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- convert from that xml to Forrest document format (Sjur)
  - nothing last week
- integrate the final Forrest documents into Forrest, and make sure it gets
  - waiting for the above
- make free texts available
  - zip up the xml files in gtfree/ and put it up on the Divvun download area
  - also provide an xml-free version? I.e. only paragraphs, whatever, as given by
  - done weekly by a cron script (but only if there are new files) (Saara)
  - e-mail Finnut about the availability of the free corpus, and the download
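The "zip up, but only if there are new files" step could look roughly like the Python sketch below. It is an illustration under assumptions, not the actual cron script: the function name and the idea of comparing file modification times against the existing archive are hypothetical, and the weekly scheduling itself would be left to cron.

```python
import zipfile
from pathlib import Path


def refresh_archive(corpus_dir: Path, archive: Path) -> bool:
    """Rebuild the zip archive if any xml file is newer than it.

    Returns True when a new archive was written, False otherwise.
    """
    xml_files = sorted(corpus_dir.rglob("*.xml"))
    if not xml_files:
        return False  # nothing to publish
    if archive.exists():
        newest = max(f.stat().st_mtime for f in xml_files)
        if newest <= archive.stat().st_mtime:
            return False  # no new files since the last run
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in xml_files:
            # Store paths relative to the corpus root inside the archive.
            zf.write(f, f.relative_to(corpus_dir).as_posix())
    return True
```

Called as `refresh_archive(Path("gtfree"), Path("gtfree.zip"))`, where the archive name is only an example, the function does nothing on weeks without new material.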
Free and non-free texts
More info in a previous meeting memo.
TODO:
- update scripts to handle this dichotomy. (Saara)
  - done
- gt/ vs gtbound/: change to gtbound/, add symbolic link from gt/ to gtbound/
  - done
- we need to rerun the conversion, and add/check copyright/license status
  - add license info (Børre, DEADLINE: Tuesday 4.4.)
  - rerun conversion (Saara, DEADLINE: Wednesday 5.4.)
Linking parallel files
How do we know that two (or more) files are parallel language versions of each
One option:
    samefilename.sme.doc.xml
    samefilename.nob.doc.xml

    nno/facta/samefilename.nno.html.xml
    sme/facta/samefilename.sme.html.xml  <== parallel file
    sme/facta/somefilename.html.xml      <== file in one lg only
The other option: to store the parallel files as links in the meta info/header
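To make the first option concrete, here is a small Python sketch that groups files into parallel sets purely from the file name pattern `name.lang.format.xml`. The function names and the set of language codes are assumptions made for the example; the memo does not fix either.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed set of language codes; extend as needed.
LANGS = {"sme", "smj", "nob", "nno", "fin", "swe", "eng"}


def split_name(path: str):
    """Return (stem, lang) for name.lang.format.xml, or (stem, None)."""
    parts = PurePosixPath(path).name.split(".")
    if len(parts) >= 4 and parts[1] in LANGS:
        return parts[0], parts[1]
    return parts[0], None


def parallel_groups(paths):
    """Map each shared stem to {lang: path} when 2+ languages exist."""
    groups = defaultdict(dict)
    for p in paths:
        stem, lang = split_name(p)
        if lang is not None:
            groups[stem][lang] = p
    return {stem: langs for stem, langs in groups.items() if len(langs) > 1}
```

With the three html example paths above, only `samefilename` is reported as a parallel set (nno and sme); `somefilename.html.xml` carries no language code and is treated as a single-language file. The header-link option would avoid relying on identical file names, at the cost of maintaining the links.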
Should we allow for more than one file at a time when uploading? Use cases:
DECISION:
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface (Saara)
- We would like to have grammatical searchability, not only POS. (Saara,
- This presupposes a discussion with Oslo. (Trond to start discussion
- For Lule Sámi: We would like to have a parallel corpus interface with NT
  - Better Lule NT text still not made.
- preparations: gather more texts (we are doing this)
- Review the tag list and have it ready for inclusion (gt/cwb/korpustags.txt)
- Prepare a list of good candidates for first inclusion into the corpus.
Top-two priorities:
- Linda and Trond to go through the taglist
- Saara and Trond to contact Anders in Oslo
Text upload
TODO:
- Ask for email-address: corpus@giellatekno.uit.no (Børre)
  - done
- Make a setup for this email address so that it goes to Børre, and then
  - done, but not tested on the real server
  - now also tested, and working
Language recognition
As a work-around before Finnish recognition is reliable, treat all "Finnish"
TODO:
- turn on language recognition, skipping Finnish (Saara)
6. Infrastructure
Aligner
Today, we have two anchor files in addition to the original one.
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
  - Trond to send relevant documents to Tomi.
- Translate the anchor list anchor-eng-nor.txt into sme (and fin?)
  - done for all but nno
  - nno should be conflated with nob into 'nor' (Trond)
- Saara to install the aligner, everyone to read the documentation
- Add the anchor list translations to cvs (Trond)
  - add to cvs location: gt/common/src/anchor.txt
    - done
  - "eng / nob / sme / smj / fin".
- Move smj to the anchor.txt file, so that we get eng/nor/sme/smj/fin.
Hyphenator
TODO:
- make target for the hyphenator(s)
  - done
- add a web interface to the hyphenator
  - done
- correct hyphenation of word boundaries and exceptions (Sjur, Trond)
- add a possibility to upload whole documents for hyphenation (and also
- we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
7. Linguistics
General - hyphenation
We need to add word boundaries in our lexicons. All compounds need explicit word
It is not clear how this will be done, but Sjur has ideas.
Problematic word boundaries:
    CVCV#CVCV   OK  CV-CV-CV-CV                   need no fix
    CVCVC#CVCV  OK  CV-CVC-CV-CV                  need no fix
    CVCVC#VCV   !!  *CV-CV-C#V-CV -> CV-CVC#V-CV  <= manually fix only these
Exceptions:
    geografiija  ge-og-ra-fii-ja
    Voionmaa     Voion-maa => Voi-on-maa (oi no diphth)
    tak-si-eaig-gi :-)
These need to be marked in the lexicons in each case, probably something like:
    geo^grafiija
    Voi^on#maa
    Voi^on#maa
    geo^grafiija  geo^gra-fii-ja
    torne^träsk   tor-ne^träsk
    -- or --
    torne#träsk   tor-ne#träsk
We need to introduce one new symbol: ^:0
Goal: Analyse divvunáhkus as div-vun#áh-kus and not as div-vu-náh-kus.
    output level
    twh upper  divvun#áhkus           <= first analysis
                                         how do you get this transducer?
    twh lower  divvunáhkus            <= text input

    hyp upper  div-vun#áh-kus         => hyph-sme.fst   ok, we know this one
    twh lower  divvun#áhkus
    twl lower  divvun#áh0ku0s         => twolhash-sme.bin   #:# ^:^, not #:0 ^:0
    twl upper  divvun#áhkkuX4s        \  smehash.fst
    lex lower  divvun#áhkkuX4s        /
    lex upper  divvun#áhkku+N+Sg+Loc  => sme.save
    ----- mirror: below, regular order, above mirrored order
    lex upper  divvun#áhkku+N+Sg+Loc  => sme.save
    lex lower  divvun#áhkkuX4s        \  sme.fst
    twl upper  divvun#áhkkuX4s        => twol-sme.bin  /
    twl lower  divvun0áh0ku0s
    input level
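The effect of keeping the `#` boundary can be illustrated with a toy Python sketch. It does not reproduce hyph-sme.fst or any of the transducers above; it applies one simplified rule, assumed only for this example (put a hyphen before a consonant that is followed by a vowel, once a vowel has been seen), and applies it separately to each part of a compound. The `^` mark is not handled here.

```python
# Simplified vowel set, assumed for the example only.
VOWELS = set("aáeiouyæøåäö")


def hyphenate_part(part: str) -> str:
    """Toy rule: hyphen before any consonant directly preceding a vowel,
    except at the start of the part (before the first vowel)."""
    out = []
    seen_vowel = False
    for i, ch in enumerate(part):
        is_vowel = ch.lower() in VOWELS
        next_is_vowel = i + 1 < len(part) and part[i + 1].lower() in VOWELS
        if not is_vowel and next_is_vowel and seen_vowel:
            out.append("-")
        out.append(ch)
        seen_vowel = seen_vowel or is_vowel
    return "".join(out)


def hyphenate(marked: str) -> str:
    """Hyphenate each compound part on its own, keeping '#' in place."""
    return "#".join(hyphenate_part(p) for p in marked.split("#"))


print(hyphenate("divvun#áhkus"))  # div-vun#áh-kus
print(hyphenate("divvunáhkus"))   # div-vu-náh-kus
```

Without the boundary the final n of the first part is drawn into the next syllable, which is the CVCVC#VCV problem listed above; with the boundary each part is syllabified independently.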
TODO:
- add all word boundaries and exception hyphenation marks (Thomas)
  - Done: The noun file a-h (only word boundaries?)
- Set up the mechanism for the hash transducer package. (Sjur, Tomi, Trond)
OPEN questions:
- what about compound names?
  - The compound boundaries should be added to the propernoun lexicon
- loan words breaking Sámi syllable structure Voionmaa,
  - Also here, boundaries should be added. The transducer should be equipped with
- do we want to differentiate between degrees of (mental) lexicalisation?
DECISION:
- use # and ^ for all compounds and hyphenation points in the lexicon
- later find a way to generate without having to specify # and ^ (this is also
North Sámi
Semantic feature system
Further discussion and details in the previous meeting memos
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
  - done, still some more
- translate Northern Sámi lists and sets to Lule Sámi
8. Name lexicon infrastructure
TODO:
- refactor and prepare risten.no for multiple collections:
  - develop the Cocoon sitemap to delegate requests to the proper folder level,
    - Done, now also for CSS, thus complete
  - refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
  - committing is moving forward
- test and review when ready
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
10. Other
Easter vacation/absences
Who? | When? |
---|---|
Børre | from the 10th to the 12th of April |
Saara | at work normally |
Sjur | no vacation, possibly paternity leave |
Thomas | from the 10th to the 12th of April, 3 days |
Tomi | from the 10th to the 12th of April, might be at work offline |
Trond | don't know yet |
No meeting during Easter.
Gobby
TODO:
- install and test it, to prepare for cooperation with non-Mac users (use case:
SubEthaEdit update
TODO:
- upgrade SEE
- install jspwiki mode from Sjur (all interested)
Bug fixing
35 open Divvun/Disamb bugs, and 25 risten.no bugs
Min Áigi letters
There are four texts on language correction, two interesting to us:
- genitive and spelling of "puma"
- why are there so few "ordinary" words in risten.no? Risten/SD should answer.
Key to the G5 room
All Tromsø people need access to Børre's office, to be able to initiate
TODO:
- give/upgrade keys to Tomi and Thomas as well (Trond)
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- install and test Gobby, install new version of SEE
- make free texts available
  - add license info (DEADLINE: Tuesday 4.4.)
  - zip up the xml files in gtfree/ and put it up on the Divvun download area
  - also provide an xml-free version?
  - possibly e-mail Finnut about the resource (if Sjur goes on paternity
- fix bugs!
Maaren
- on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- file a bug report of utf-8 check in xml-validation of the corpus files.
- make free texts available
  - rerun conversion (DEADLINE: Wednesday 5.4.)
  - zip up the xml files in gtfree/ and put it up on the Divvun download area
  - also provide an xml-free version?
  - set up a weekly cron script (but only if there are new files)
- Discuss: allow for more than one file at a time when uploading a file.
- Implement links to parallel files in corpus header.
- Turn on language recognition, skipping Finnish
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
  - answer requests/questions
  - corpus repo access to free texts (with Børre)
- e-mail Finnut about the availability of the free corpus and the download link
- conversion of corpus repo summary xml to Forrest xml
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
  - refactor code
  - code design for XQueries needed for dict/term editing
- correct hyphenation of word boundaries and exceptions
- Set up the mechanism for the hash-mark transducer package
- fix bugs!
Thomas
- add incoming Lule Sámi words
- work on North Sámi compounding and derivation
- smj G3 issue
- sme G3 issue
- add all word boundaries and exception hyphenation marks
- SubEthaEdit update
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
  - implement data synchronisation of proper nouns between risten.no and CVS
  - XQuery refactoring and code development for our proper noun editor
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- install and test Gobby, install new version of SEE
- Set up the mechanism for the hash-mark transducer package
- fix bugs!
Trond
- Unify anchor lists, and conflate nno and nob into nor
- remove deleted files from the CVS repository (in the Attic)
- grammatical searchability in the graphical corpus interface: revise taglist
- better smj NT text, get fin and swe NT texts
- Prepare a list of good candidates for first inclusion into the corpus.
- start and lead discussion and work on semantic features for disamb
- Install Gobby
- get a key for Maaren in May
- install aligner, test it and give feedback
- correct hyphenation of word boundaries and exceptions
- Set up the mechanism for the hash-mark transducer package
- get/upgrade keys for Børre's room for Tomi and Thomas
- fix bugs!
12. Next meeting, closing
18.04.2006 09:30
Sjur is on paternity leave.
Closed at 12:47