Meeting_2006-03-06
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 06.03.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:04.
Present: Maaren, Sjur, Thomas, Tomi
Absent: Børre, Saara, Trond
Main secretary: all
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
On Winter holiday.
Maaren
- work with risten.no
- not done. I have worked with the top-ten list lately. Almost finished.
- discuss with relevant people regarding seminar on proofing tools, normativity
- Laila wants us to send the list to the SGL-members (as we have done before).
- get the normativity decisions from the December SGL meeting
- I have sent the decisions to Thomas and asked him to phone Laila.
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- Try to add numeral treatment as part of the analyzer.
- Move gt2ga.sh to G5 and implement copying of the gt-dir.
- Create a parallel corpus of the New Testaments.
- Routine for adding new languages to the propernoun xml-structure.
- Move to Bugzilla: the analyzer needs to be optimized.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permalink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- not done
- Lule Sámi twol problems, with Thomas and Trond
- not done
- project planning with Trond, continued
- not done
- Follow up on place names from Norge Digitalt
- not done
- Evaluate SFST as speller (and analyzer) lexicon
- not done
- write a background document on the corpus contracts
- not done
- public tender:
- review draft tender document from Finnut
- done and published
- the public tender would benefit a lot from anonymous cvs access (read-only)
- smj G3 issue with Thomas and Trond
- not done
- sme G3 issue with Thomas and Trond
- not done
- call EDD/ Christian Emil Ore about national place name lexicon
- not done
- risten.no/proper noun lexicon development:
- refactor code
- not done
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- tried to work out how to match parts of a request parameter, and continue
- data synchronisation between risten.no and the cvs repository
- not done
- fix bugs!
Thomas
- work with Lule Sámi 90% goal
- worked and reached goal 92%
- work on North Sámi compounding and derivation
- worked
- smj G3 issue with Sjur and Trond
- not had time to
- sme G3 issue with Sjur and Trond
- not had time to
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- not done
- corpus infrastructure:
- dtd location (both public and internal)
- not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- not done
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- not done
- Look into data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- fix bugs!
Trond
On Winter holiday.
3. Documentation
Changes and updates because of the Divvun public tender
- document anonymous, read-only access to our cvs repo
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo
- we need to finish the corpus dtd documentation
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
TODO:
- call Sæth (Børre)
Olavi Korhonen's Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
KIO Grafisk and the Iđut books
We need a test file in order to find out whether file conversion from Quark
TODO:
- Trond to ask for a Quark test file from his sister
- Børre to ask KIO to send a more elaborate test file, representative for
- Børre will send letters to the authors.
Bible texts
TODO:
- review paratext2xml converter (Børre)
- convert smj NT to paratext (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Still not done. Trond will contact Bibelselskapet for a new sme
5. Corpus infrastructure
We need more "version control" in the corpus work - we don't know which version
Transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
TODO:
- Access control to corpus repo resolved through Unix groups: one group for
- Saara will ask Roy to do what should be done.
Further discussion about corpus analysis and computer use:
The new G5 is tremendously faster than cochise, thus we want to use it. But
- the corpus/gt/ dir will be synchronised with the G5
- TODO: set up copying script (Saara)
- we need to develop strong enough security routines for the G5 to fulfill our
- TODO: Børre to move this to bugzilla
- we are still using only one processor when analysing - making some simple
- script implemented, but not tested due to copying not in place yet (see
New tasks:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location =
- the link isn't working
- finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
- add xml validation against our dtd to the corpus conversion process
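The validation task could be wired into the conversion process roughly as follows. This is a minimal sketch, assuming the `xmllint` command-line tool from libxml2 is available on the machine; the function names are hypothetical and not part of the existing corpus scripts.

```python
import subprocess
from typing import List


def xmllint_command(xml_path: str, dtd_path: str) -> List[str]:
    """Build the xmllint call that validates xml_path against dtd_path."""
    # --noout suppresses the echoed document; --dtdvalid names the DTD to use.
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def is_valid(xml_path: str, dtd_path: str) -> bool:
    """Return True when xmllint exits with status 0 (document is valid)."""
    # Assumption: xmllint is installed and on the PATH.
    result = subprocess.run(
        xmllint_command(xml_path, dtd_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A conversion script could call `is_valid` on each converted file and refuse to add files that fail the check to the corpus repo.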
Changes and updates because of the Divvun public tender
- routines for setting up new users of our corpus
- create the final text corpus license: the regular end user computer license
- Børre will be the recipient of the SD end user license (approving corpus
- who will do the actual account setup? What type of account is needed?
- an automatic build of the content of our corpus repo:
- for each text:
- license attached to each text
- length (words, characters, sentences, paragraphs)
- (source) language and other properties of the text
- for freely available texts, a download link?
- an overview document with statistics for the whole corpus
- the automatic build should generate one (or several) Forrest XML document(s)
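The per-text length figures listed above could be computed along these lines. This is only a sketch: the counting rules (whitespace-separated words, sentences ending in `.`, `!` or `?`, paragraphs separated by blank lines) and the names `TextStats`, `text_stats` and `corpus_totals` are assumptions for illustration, not decisions made at the meeting.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TextStats:
    words: int
    characters: int
    sentences: int
    paragraphs: int


def text_stats(text: str) -> TextStats:
    """Count words, characters, sentences and paragraphs in one plain text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return TextStats(
        words=len(text.split()),
        characters=len(text),
        sentences=len(sentences),
        paragraphs=len(paragraphs),
    )


def corpus_totals(stats: Iterable[TextStats]) -> TextStats:
    """Sum the per-text figures for the corpus-wide overview document."""
    total = TextStats(0, 0, 0, 0)
    for s in stats:
        total.words += s.words
        total.characters += s.characters
        total.sentences += s.sentences
        total.paragraphs += s.paragraphs
    return total
```

The totals could then be written into the overview document, with the per-text figures listed next to each text's license and language.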
6. Infrastructure
We need to set up anonymous, read-only access to our cvs repo as outlined by our
We might even consider setting up patching routines, such that people with
Our open-source policy now demands that we really take the step, and not only
Should we set up view-cvs (web interface to cvs)?
Problems:
- IP issues (Do we have the complete rights to the content, or does the work
- Sámi place names - we need to ask our sources:
- Statens kartverk (Trond) - the material is free, cf. the Storting!
- Finnish same (Thomas) - we don't have it, so this is no issue.
- Are there any parts in smj, sms, etc that shouldn't be publicly available?
- What about the inc-*.txt files in smj/src/ ? Are they the Anders K source
- do we have any corpus files in smX/corp/ with copyright/unchecked license?
- No other problems? Pekka is fine with this.
- = No other problems.
- Do we dare to show our work?
- yes
Howto/who:
- what do we need?
- web interface? maybe
- command line check-out? yes (Roy Dragseth / Børre)
- need to be able to restrict anon. cvs to only specific modules
7. Linguistics
North Sámi
We want to get the decisions from the SGL meeting in December.
- Laila sent those to Thomas by email ... a long time ago ...
- Thomas will include the responses in our normativity document.
SGL do
We have to make a list and send the list to SGL. They
TODO:
- Find the book by O. H. Magga on normativity decisions made in the 1980s, and
- Dieđut nr 2 1985: GIELLA, dutkan, dikšun ja oahpaheapmi. Sámi instituhtta. s.
- documentation on later decisions from Laila, Maaren will get copies
- make sure we only send them issues that are really open
- new issues probably need to be better explained from the point of view of
TODO:
- get the decisions (Maaren)
- done
Lule Sámi
Goal for March 1st: Reach the 90 percent lexical recall limit.
TODO:
- Trond and Thomas: work on the 90 % limit
- we reached 92%!
- Trond: Write a report to NFR.
- done
8. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
TODO:
- Move this issue to bugzilla (Børre)
XML format
TODO on eXist as editor:
- refactor and prepare risten.no for multiple collections:
- develop the Cocoon sitemap to delegate requests to the proper folder level,
- refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- test whether eXist as editor is actually working well (linguists)
Data synchronisation task list/specification:
- the xml file needs to be stored/updated in cvs
- there should be no diffs on whitespace and sorting order (to ensure we get
- the prop name update cycle should be something like:
- dump the xml from eXist (in proper sorting order)
- check whether there are diffs against cvs; continue only if there are
- update from cvs
- error check: are there conflicts? if yes, send report to <somebody>
- are there still diffs? if yes, continue:
- check in/commit w. generated comment
- error check: is the document valid and conformant xml? if no, stop and send a
- reimport the xml file into eXist
- question: do we need to lock the file in eXist through this update cycle?
- the update cycle should be a nightly cron job
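The update cycle specified above can be sketched as a small control-flow function. Everything here is an assumption for illustration: the `Steps` hooks stand in for the real eXist and cvs calls, which are not defined in this memo, and the position of the XML check simply follows the order in which the steps are listed. Note also that the sketch only checks well-formedness with the Python standard library; validation against the dtd would need an external tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable


@dataclass
class Steps:
    """Hypothetical hooks; each would wrap an eXist or cvs call in practice."""
    dump_from_exist: Callable[[], str]           # sorted, whitespace-stable dump
    read_working_copy: Callable[[], str]         # the file as checked out from cvs
    write_working_copy: Callable[[str], None]
    cvs_update_has_conflicts: Callable[[], bool]
    has_local_diffs: Callable[[], bool]
    cvs_commit: Callable[[str], None]
    reimport_into_exist: Callable[[str], None]
    send_report: Callable[[str], None]


def is_well_formed(xml_text: str) -> bool:
    """Well-formedness only; this does not validate against a dtd."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def update_cycle(steps: Steps) -> str:
    """Run one pass of the cycle and return a short status string."""
    dumped = steps.dump_from_exist()
    if dumped == steps.read_working_copy():
        return "unchanged"                       # no diffs against cvs: stop
    steps.write_working_copy(dumped)
    if steps.cvs_update_has_conflicts():
        steps.send_report("cvs update produced conflicts")
        return "conflict"
    if not steps.has_local_diffs():
        return "unchanged"                       # the update removed all diffs
    steps.cvs_commit("Automatic update from eXist")
    merged = steps.read_working_copy()
    if not is_well_formed(merged):
        steps.send_report("merged file is not well-formed XML")
        return "invalid"
    steps.reimport_into_exist(merged)
    return "committed"
```

Keeping the external calls behind hooks makes it possible to test the control flow without touching eXist or the repository; a nightly cron job would then call `update_cycle` with the real implementations.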
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
SGL Seminar
SGL don't want a seminar - at least not yet.
Bug fixing
35 open Divvun/Disamb bugs, and 24 risten.no bugs
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Ask KIO Grafisk to make a test Quark document based on a Word document from us
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- fix bugs!
Maaren
- work with the top-ten list
- send copies of normativity decisions from 1985 to 1992 (2003?) to Thomas
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- Try to add numeral treatment as part of the analyzer.
- Move gt2ga.sh to G5 and implement copying of the gt-dir.
- Create a parallel corpus of the New Testaments.
- Routine for adding new languages to the propernoun xml-structure.
- Move to Bugzilla: the analyzer needs to be optimized.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permalink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- set up anon. read-only cvs with Børre
- corpus repo access
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- fix bugs!
Thomas
- add incoming Lule Sámi words
- include the SGL decisions in our normativity document
- include normativity decisions made by Magga and Sammalahti in our normativity
- follow-up on the Sámi place names in Finland
- work on North Sámi compounding and derivation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- fix bugs!
Trond
- Bring Lule Sámi up to 90 %
- Write Lule Report to NFR
- Apply for strategy funds
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara
- Ask for a Quark test file from his sister
- News group discussion followup.
- clean all texts from gt/smX/corp - they need to be in our corpus repo before
- fix bugs!
12. Next meeting, closing
13.03.2006 09:30
Closed at 12:07