Meeting_2006-03-27
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 27.03.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:56.
Present: Børre, Saara, Sjur, Thomas, Tomi, Trond
Absent: Maaren
Main secretary: Trond
Agenda accepted with additions under "Other".
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Davvi Girji, NSI (Sámi Instituhtta), Min Áigi, Aššu, DAT,
- Gather public texts, preferably also parallel ones
- Some gathered, but not converted
- Continue converting text from input format to our xml
- Tried to convert html documents, but didn't succeed
- convert nob and nno bible texts to be used as part of a parallel corpus
- waiting for Saara and Tomi
- review the paratext2xml converter
- same as above
- convert smj NT to paratext
- waiting for the two issues above
- Call Ove Sæth
- Impossible to reach on the phone, sent a mail
- Move complex name lexicon issue to bugzilla
- Done
- Send out letters to the Iđut authors
- waiting for address list from Åge Persen leader of Iđut.
- Add corpus security re G5 syncing as an issue to Bugzilla
- Not done
- write docu for how to apply for a corpus user account (forms, recipients,
- Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- Not done
- copy updated DTDs to the permlink location, or help Saara do it
- Done, and given Saara instructions on how to do it herself.
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- tried another approach
- fix bugs!
- Resolved 197 (Sjur and Thor-Øivind), 241, 259 (by Sjur)
- Misc:
- Added the GPL to our cvs repositories.
Maaren
- work with new missing lists
- done
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
- done
- Create a parallel corpus of the New Testaments.
- Implement validation of xml corpus against the dtd.
- Validation is implemented. There were new errors found during this
- Finish corpus dtd documentation, dtd location and permlink reference
- done
- update the corpus dtd with option for correction tags
- done
- copy updated DTDs to permanent external location
- done (by Børre)
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- done, but the name of gt-tree is not yet changed.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- review paratext2xml converter.
- the paratext2xml was not implemented. now it's written and part of
- install sentence aligner.
- Aligner has a graphical interface, so it was not installed on
- test anonymous cvs access and review documentation.
- done
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- test anon. read-only cvs, review docu, and send link to Finnut
- done
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- nothing
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- done
- code design for XQueries needed for dict/term editing
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- sent to Anne-Britt and Per Edvard instead
- add manual editing of corpus files as an issue to Bugzilla (error tags)
- done
- fix bugs!
Thomas
- add incoming Lule Sámi words
- not this week
- work on North Sámi compounding and derivation
- not this week
- smj G3 issue
- not this week
- sme G3 issue
- not this week
- translate stopword list into smj (aligner; list from Trond)
- translated half of it so far
- assist Trond and Linda with the smj disamb work
- done
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- fix bugs!
Trond
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Contacted both. The Finnish one is open to research use, we will get
- translate stopword list into nno?
- Not done, but partly into Finnish. cvs?
- double check all remaining docs in gt/sme/corp/ for copyright issues
- Done.
- grammatical searchability in the graphical corpus interface
- Important issue, not done.
- better smj NT text
- Asked the Bible society, but have still not received any.
- work on semantically based sets (sme, smj)
- Not done.
- start and lead discussion and work on semantic features for disamb
- Done some thinking, that's all.
- fix bugs!
- Tested anon. cvs and corpus upload. Both worked very well.
3. Documentation
Changes and updates because of the Divvun public tender
TODO:
- review anon. cvs: Sjur, Saara, by Wednesday morning
- done
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo
- Not done
- we need to finish the corpus dtd documentation (Saara)
- done
TODO:
- copy updated DTDs to the permlink location (Børre or Saara)
- done
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Børre has sent a letter to the publishers, has talked to Brita Kåven (she
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
TODO:
- call Sæth (Børre)
- I have mailed him, not able to reach him by phone. No answer yet, though …
Olavi Korhonen's Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
- No news this week
KIO Grafisk and the Iđut books
- Sjur has sent an e-mail explaining the issues as we see them, to Anne Britt
TODO:
- Børre will send letters to the authors
Bible texts
We will get text from Finland. We are awaiting an answer from Sweden. As for the
TODO:
- review paratext2xml converter (Saara)
- converter corrected/made, use suffix .ptx when converting.
- convert smj NT to paratext (Børre)
- Will be done now that the paratext2xml has been finished.
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Work in progress/texts underway.
5. Corpus infrastructure
TODO in transferring the old gt/sme/corp files to the new corpus repo:
- make sure there's nothing left with a copyright attached to it (Trond)
- Trond will go a second round
- done
- remove the deleted files from the CVS repository (Trond)
Further discussion about corpus analysis and computer use:
- we need to develop strong enough security routines for the G5 to fulfill our
- TODO: Børre to move this to bugzilla
TODO dtd usage and documentation:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location = permlink):
- TODO: Saara to write and finish the docu, also check the dtd link
- done
- add xml validation against our dtd to the corpus conversion process
- done. Some new errors were found, they are almost fixed now.
- add UTF-8 check as part of the validation (Saara)
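The UTF-8 check could be as small as the sketch below. It is only an illustration in Python, not part of the existing conversion scripts; the function names `utf8_error_offset` and `check_file` are made up for this memo.

```python
from pathlib import Path
from typing import Optional


def utf8_error_offset(data: bytes) -> Optional[int]:
    """Return None if data is valid UTF-8, else the offset of the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset where decoding failed
        return err.start
    return None


def check_file(path: str) -> Optional[str]:
    """Return an error message for a file that is not valid UTF-8, else None."""
    offset = utf8_error_offset(Path(path).read_bytes())
    if offset is None:
        return None
    return f"{path}: invalid UTF-8 at byte offset {offset}"
```

A check like this could run before the DTD validation, so that encoding problems are reported separately from structural ones.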
Correction tags?
TODO:
- update the DTD (Saara)
- done.
OPEN ISSUES:
- since this is manual editing, we break the automatic regeneration/reconversion
- done
- the proposed markup is too simplistic for describing more complex error
- discussed, and nesting added as well
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- convert from that xml to Forrest document format (Sjur)
- nothing last week
- integrate the final Forrest documents into Forrest, and make sure it gets
- waiting for the above
Free and non-free texts
More info in a previous meeting memo.
Newsgroup discussion - whether to rename gt/ to gtbound/ or not:
Saara:
Sjur:
Solution:
TODO:
- update scripts to handle this dichotomy. (Saara)
- almost finished
- gt/ vs gtbound/: change to gtbound/, add symbolic link from gt/ to gtbound/
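As a rough illustration of the rename plus symbolic link, here is a Python sketch at file-system level. It assumes a POSIX system and a local directory tree; the helper name `rename_with_compat_link` is hypothetical, and the sketch says nothing about how the change is carried out in the CVS repository itself.

```python
import os
import tempfile


def rename_with_compat_link(parent, old="gt", new="gtbound"):
    """Rename parent/old to parent/new and leave a symlink old -> new."""
    old_path = os.path.join(parent, old)
    new_path = os.path.join(parent, new)
    os.rename(old_path, new_path)
    # Relative link target, so the whole tree can be moved without breaking it.
    os.symlink(new, old_path, target_is_directory=True)
    return new_path


# Small demonstration in a throw-away directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "gt"))
    rename_with_compat_link(tmp)
    link_target = os.readlink(os.path.join(tmp, "gt"))
    still_reachable = os.path.isdir(os.path.join(tmp, "gt"))
```

With the link in place, scripts that still refer to gt/ keep working until they have been updated.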
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface (Saara)
- We would like to have grammatical searchability, not only POS. (Saara,
- This presupposes a discussion with Oslo. (Trond to start discussion
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- preparations: gather more texts (we are doing this)
- Review the tag list and have it ready for inclusion (gt/cwb/korpustags.txt)
- Prepare a list of good candidates for first inclusion into the corpus.
Text upload
The upload is working, but Børre doesn't receive an automatic message
TODO:
- Ask for email-address: corpus@giellatekno.uit.no (Børre)
- Make a setup for this email address so that it goes to Børre, and then
6. Infrastructure
Aligner
We are working on it, there are problems, and the test files are not good
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
- Trond to send relevant documents to Tomi.
- Translate the anchor list anchor-eng-nor.txt into sme (and fin?)
- Saara to install the aligner, everyone to read the documentation on
- Add the anchor list translations to cvs (Trond)
- add to cvs location: gt/common/src/anchor.txt
- "eng / nob / sme / smj / fin".
- contra mono: hard to align
- contra bi: each lg twice
- usage: for eng/nob alignment, use eng/nob, for nob/sme alignment, use
Perhaps best to have all lgs in one list, and extract pairs via
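One possible way to pull a language pair out of a single multilingual list is sketched below in Python. The file layout is an assumption: one anchor entry per line, with the five columns separated by "/" in the order eng / nob / sme / smj / fin. The function name `extract_pair` and the sample rows are invented for illustration.

```python
LANGS = ["eng", "nob", "sme", "smj", "fin"]  # assumed column order


def extract_pair(lines, lang_a, lang_b, langs=LANGS):
    """Return (lang_a, lang_b) anchor pairs from multi-column anchor lines."""
    ia, ib = langs.index(lang_a), langs.index(lang_b)
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = [c.strip() for c in line.split("/")]
        if len(cols) != len(langs):
            continue  # skip rows without exactly one column per language
        a, b = cols[ia], cols[ib]
        if a and b:  # keep only rows translated into both languages
            pairs.append((a, b))
    return pairs


# Invented sample rows; the smj column of the second row is left empty.
sample = [
    "and / og / ja / ja / ja",
    "church / kirke / girku / / kirkko",
    "# comment line",
]
nob_sme = extract_pair(sample, "nob", "sme")
eng_smj = extract_pair(sample, "eng", "smj")
```

Rows with a missing translation are dropped for pairs that need that language, so the list can be filled in gradually.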
7. Linguistics
North Sámi
Semantic feature system
TODO:
- decide on a semantic feature system for nouns (Linda).
- Work with semantically based sets (Trond, Linda)
- Return to the infrastructure issue (Trond)
- A full semantic encoding of the lexicon is a future project, outside the
Further discussion and details in the previous meeting memo.
TODO (Trond):
- Discussion testing
- infrastructure
- semiautomatic retagging
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- nothing done this week
- name morphology (Thomas)
- handed Tomi list
- translate Northern Sámi lists and sets to Lule Sámi
- Linda, Trond, with help from mother tongue speakers (Thomas, others).
8. Name lexicon infrastructure
Complex names
TODO:
- Move xml2lexc complex name issue to bugzilla (Børre)
- Done!
Editing
TODO on eXist as editor:
- refactor and prepare risten.no for multiple collections:
- develop the Cocoon sitemap to delegate requests to the proper folder level,
- done for XQueries and XSLT; only CSS left (needs to be handled differently)
- refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing last week
- test and review when ready
Data synchronisation task list/specification:
Details in the previous meeting memo.
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
Divvun admin
The project manager would like all Divvun project
Making such lists is necessary to be able to document to the SD administration
I have been doing the same thing for myself for a long time, and the benefit is
TODO:
- keep a list of worked hours (all Divvun team members)
- start this week, then every week
Divvun project management while Sjur is on paternity leave
Sjur will soon go on paternity leave (expected April 6), and most likely be
TODO:
- set up Monday meetings
- conduct the meeting (or let Trond do it :-)
- finalize the meeting memo afterwards, making sure all tasks discussed have
- also add the meeting memo template for the next meeting, so that people can
- be the main contact person for Finnut Consult AS, and
Børre is temp. Project Manager :-)
Easter vacation/absences
Who? | When? |
---|---|
Børre | from the 10th to the 12th of April |
Saara | at work normally |
Sjur | no vacation, possibly paternity leave |
Thomas | from the 10th to the 12th of April, 3 days |
Tomi | from the 10th to the 12th of April, might be at work offline |
Trond | don't know yet |
Gobby
TODO:
- install and test it, to prepare for cooperation with non-Mac users (use case:
SubEthaEdit update
SEE 2.3 is released. It is now commercial only, but 2.2 is still available for
Sjur: I have made a simple, but useful jspwiki mode, for syntax coloring of
Sjur: I have also made a first attempt at an XQuery mode, but that one isn't
TODO:
- upgrade SEE (all)
- install jspwiki mode from Sjur (all interested)
Bug fixing
35 open Divvun/Disamb bugs, and 25 risten.no bugs
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Move complex name lexicon issue to bugzilla
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- Ask for email-address: corpus@giellatekno.uit.no
- install and test Gobby, install new version of SEE (also for Thomas)
- fix bugs!
Maaren
- will be on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- change the name of gt/ to gtbound/ and add a symbolic link.
- fix the email address for corpus upload.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- add utf-8 check to xml-validation of the corpus files.
- install aligner, test it and give feedback
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for css using sitemaps
- code design for XQueries needed for dict/term editing
- fix bugs!
Thomas
- add incoming Lule Sámi words
- work on North Sámi compounding and derivation
- smj G3 issue
- sme G3 issue
- translate stopword list into smj (aligner; list from Trond)
- assist Trond and Linda with the smj disamb work
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- install and test Gobby, install new version of SEE
- fix bugs!
Trond
- Translate anchor list into nno, work on sme, fin.
- Add the anchor list translations to cvs
- remove deleted files from the CVS repository (in the Attic)
- grammatical searchability in the graphical corpus interface: revise taglist
- better smj NT text
- Prepare a list of good candidates for first inclusion into the corpus.
- translate Northern Sámi lists and sets to Lule Sámi
- work on semantically based sets (sme, smj)
- start and lead discussion and work on semantic features for disamb
- Install Gobby with support programs, see, etc.
- get a key for Maaren in May
- install aligner, test it and give feedback
- fix bugs!
12. Next meeting, closing
03.04.2006 09:30
Closed at 12:19