Meeting_2006-03-06
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 06.03.2006 
- Time: 09.30 Norw. time 
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review 
- Reviewing the task list from two weeks ago 
- Documentation - divvun.no 
- Corpus gathering 
- Corpus infrastructure 
- Infrastructure 
- Linguistics 
- name lexicon infrastructure 
- Spellers 
- Other issues 
- Summary, task lists 
- Closing
1. Opening, agenda review, participants
Opened at 10:04.
Present: Maaren, Sjur, Thomas, Tomi
Absent: Børre, Saara, Trond
Main secretary: all
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
On Winter holiday.
Maaren
- work with risten.no
  - not done. I have worked with the top-ten list lately. Almost finished.
- discuss with relevant people regarding seminar on proofing tools, normativity
  - Laila wants us to send the list to the SGL members (as we have done before).
- get the normativity decisions from the December SGL meeting
  - I have sent the decisions to Thomas and asked Thomas to phone Laila.
Saara
- continue discussion on the new lexicon format 
- Fix the preprocess script and optimize it. 
- Try to add numeral treatment as part of the analyzer.
- Move gt2ga.sh to G5 and implement copying of the gt-dir. 
- Create a parallel corpus of the New Testaments.
- Routine for adding new languages to the propernoun xml-structure. 
- Move to Bugzilla: the analyzer needs to be optimized. 
- Implement validation of xml corpus against the dtd. 
- Create a group for corpus users. 
- Finish corpus dtd documentation, dtd location and permalink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml 
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
  - not done
- Lule Sámi twol problems, with Thomas and Trond
  - not done
- project planning with Trond, continued
  - not done
- Follow up on place names from Norge Digitalt
  - not done
- Evaluate SFST as speller (and analyzer) lexicon
  - not done
- write a background document on the corpus contracts
  - not done
- public tender:
  - review draft tender document from Finnut
    - done and published
    - the public tender would benefit a lot from anonymous cvs access (read-only)
- smj G3 issue with Thomas and Trond
  - not done
- sme G3 issue with Thomas and Trond
  - not done
- call EDD/Christian Emil Ore about national place name lexicon
  - not done
- risten.no/proper noun lexicon development:
  - refactor code
    - not done
  - implement inheritance/collection overriding for xsl/css/xquery using sitemaps
    - tried to work out how to match parts of a request parameter; will continue
  - data synchronisation between risten.no and the cvs repository
    - not done
- fix bugs!
Thomas
- work with Lule Sámi 90% goal
  - worked and reached the goal: 92%
- work on North Sámi compounding and derivation
  - worked
- smj G3 issue with Sjur and Trond
  - not had time to
- sme G3 issue with Sjur and Trond
  - not had time to
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
  - not done
- corpus infrastructure:
  - dtd location (both public and internal)
    - not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
  - not done
- new proper name lexicon
  - discuss the new lexicon format and other issues in the newsgroup
    - not done
  - Look into data synchronisation of proper nouns between risten.no and CVS
    - not done
  - XQuery refactoring and code development for our proper noun editor
    - not done
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
    - not done
- fix bugs!
Trond
On Winter holiday.
3. Documentation
Changes and updates because of the Divvun public tender
- document anonymous, read-only access to our cvs repo 
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo 
- we need to finish the corpus dtd documentation
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for  Sæth to discuss with colleagues about how to implement the 
TODO: 
- call Sæth (Børre)
Olavi Korhonen's Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
KIO Grafisk and the Iđut books
We need a test file in order to find out whether file conversion from Quark  
TODO:
- Trond to ask for a Quark test file from his sister
- Børre to ask KIO to send a more elaborate test file, representative for
- Børre will send letters to the authors.
Bible texts
TODO: 
- review paratext2xml converter (Børre) 
- convert smj NT to paratext (Børre) 
- ask to get fin and swe NT and OT in paratext format. (Trond)
  - Still not done. Trond will contact Bibelselskapet for a new sme
5. Corpus infrastructure
We need more "version control" in the corpus work - we don't know which version
Transferring the old gt/sme/corp files to the new corpus repo: 
- for the biggest top ten (or so) the orig. should be located and copied to the
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
TODO: 
- Access control to corpus repo resolved through Unix groups: one group for
- Saara will ask Roy to do what should be done.
Further discussion about corpus analysis and computer use:
The new G5 is tremendously faster than cochise, thus we want to use it. But
- the corpus/gt/ dir will be synchronised with the G5
  - TODO: set up copying script (Saara)
- we need to develop strong enough security routines for the G5 to fulfill our
  - TODO: Børre to move this to bugzilla
- we are still using only one processor when analysing - making some simple
  - script implemented, but not tested due to copying not in place yet (see
New tasks:
- corpus dtd documentation:
  - structure, content/model and location of the dtd (location =
    - the link isn't working
- finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
- add xml validation against our dtd to the corpus conversion process
Changes and updates because of the Divvun public tender
- routines for setting up new users of our corpus
  - create the final text corpus license: the regular end user computer license
  - Børre will be the recipient of the SD end user license (approving corpus
  - who will do the actual account setup? What type of account is needed?
- an automatic build of the content of our corpus repo:
  - for each text:
    - license attached to each text
    - length (words, characters, sentences, paragraphs)
    - (source) language and other properties of the text
    - for freely available texts, a download link?
  - an overview document with statistics for the whole corpus
  - the automatic build should generate one (or several) Forrest XML document(s)
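The length figures wanted for each text could be computed roughly as follows. This is only a sketch for discussion: it assumes plain-text input, blank-line paragraph breaks and a naive sentence split, and the names TextStats, text_stats and corpus_overview are hypothetical, not part of our existing tools.

```python
import re
from dataclasses import dataclass


@dataclass
class TextStats:
    paragraphs: int
    sentences: int
    words: int
    characters: int


def text_stats(text):
    """Count paragraphs, sentences, words and characters in a plain text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p) if s]
    # Words are whitespace-separated tokens; characters include whitespace.
    return TextStats(len(paragraphs), len(sentences),
                     len(text.split()), len(text))


def corpus_overview(all_stats):
    """Sum per-text statistics into totals for the whole corpus."""
    return TextStats(
        sum(s.paragraphs for s in all_stats),
        sum(s.sentences for s in all_stats),
        sum(s.words for s in all_stats),
        sum(s.characters for s in all_stats),
    )
```

The per-text results would go into the entry for each text, and corpus_overview would feed the overview document. A real build would probably count on the converted xml rather than on raw text.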
6. Infrastructure
We need to set up anonymous, read-only access to our cvs repo as outlined by our 
We might even consider setting up patching routines, such that people with
Our open-source policy now demands that we really take the step, and not only 
Should we set up view-cvs (web interface to cvs)?
Problems: 
- IP problems (Do we have the complete rights to the content, or does the work
  - Sámi place names - we need to ask our sources:
    - Statens kartverk (Trond) - the material is free, cf. Stortinget!
    - Finnish same (Thomas) - we don't have it, so this is no issue.
  - Are there any parts in smj, sms, etc. that shouldn't be publicly available?
    - What about the inc-*.txt files in smj/src/? Are they the Anders K source
    - do we have any corpus files in smX/corp/ with copyright/unchecked license?
  - No other problems? Pekka is fine with this.
    - = No other problems.
- Do we dare to show our work? - yes
Howto/who: 
- what do we need?
  - web interface? maybe
  - command line check-out? yes (Roy Dragseth / Børre)
  - need to be able to restrict anon. cvs to only specific modules
7. Linguistics
North Sámi
We want to get the decisions from the SGL meeting in December. 
- Laila has sent those to Thomas by email ...a long time ago...
- Thomas will include the responses in our normativity document.
SGL do 
We have to make a list and send the list to SGL. They 
TODO: 
- Find the book by O. H. Magga on normativity decisions made in the 80s, and
  - Dieđut nr 2 1985: GIELLA, dutkan, dikšun ja oahpaheapmi. Sámi instituhtta. s.
  - documentation on later decisions from Laila, Maaren will get copies
- make sure we only send them issues that are really open
- new issues need probably to be better explained from the point of view of
TODO: 
- get the decisions (Maaren) - done
 
Lule Sámi
Goal for March 1st: Reach the 90 percent lexical recall limit.
TODO: 
- Trond and Thomas: work on the 90 % limit
  - we reached 92%!
- Trond: Write a report to NFR. - done
8. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
  - the resulting file format should be identical to our present prop-name
TODO: 
- Move this issue to bugzilla (Børre)
XML format
TODO on eXist as editor: 
- refactor and prepare risten.no for multiple collections:
  - develop the Cocoon sitemap to delegate requests to the proper folder level,
  - refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi) 
- data synchronisation between risten.no and the cvs repo (Tomi) 
- test whether eXist as editor is actually working well (linguists)
Data synchronisation task list/specification:
- the xml file needs to be stored/updated in cvs
- there should be no diffs on whitespace and sorting order (to ensure we get
- the prop name update cycle should be something like:
  - dump the xml from eXist (in proper sorting order)
  - check whether there are diffs against cvs; continue only if there are
  - update from cvs
  - error check: are there conflicts? if yes, send report to <somebody>
  - are there still diffs? if yes, continue:
  - check in/commit w. generated comment
  - error check: is the document valid and conformant xml? if no, stop and send a
  - reimport the xml file into eXist
- question: do we need to lock the file in eXist through this update cycle?
- the update cycle should be a nightly cron job
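The update cycle above can be sketched as one function, with the eXist and cvs operations passed in from outside. Everything named here is an assumption for illustration: update_cycle, the working-copy methods (write, read, has_diffs, update, commit) and FakeWorkingCopy are hypothetical stand-ins, not real eXist or cvs calls. The check only covers well-formedness, since validation against our dtd needs more than the standard library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML (no dtd validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def update_cycle(dump_from_exist, working_copy, reimport_into_exist, send_report):
    """Run one cycle in the order of the task list; return a status string."""
    working_copy.write(dump_from_exist())      # dump the xml from eXist
    if not working_copy.has_diffs():           # diffs against cvs?
        return "unchanged"
    if working_copy.update():                  # update; True means conflicts
        send_report("conflicts after cvs update")
        return "conflict"
    if not working_copy.has_diffs():           # still diffs after the update?
        return "unchanged"
    working_copy.commit("Automatic proper noun sync from eXist")
    merged = working_copy.read()
    if not is_well_formed(merged):             # error check before reimport
        send_report("merged file is not well-formed XML")
        return "invalid"
    reimport_into_exist(merged)                # reimport the file into eXist
    return "synced"


class FakeWorkingCopy:
    """In-memory stand-in for a cvs working copy, for illustration only."""

    def __init__(self, committed):
        self.committed = committed
        self.local = committed

    def write(self, text):
        self.local = text

    def read(self):
        return self.local

    def has_diffs(self):
        return self.local != self.committed

    def update(self):
        return False  # this fake never produces conflicts

    def commit(self, message):
        self.committed = self.local
```

A nightly cron job would call update_cycle once with real implementations of the four arguments. Whether the file must be locked in eXist during the cycle is left open here, as in the question above.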
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
SGL Seminar
SGL don't want a seminar - at least not yet.
Bug fixing
35 open Divvun/Disamb bugs, and 24 risten.no bugs
11. Summary, task list
Børre
- send out contracts with accompanying letter 
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml 
- convert nob and nno bible texts to be used as part of a parallel corpus 
- review the paratext2xml converter 
- convert smj NT to paratext 
- Call Ove Sæth 
- Move complex name lexicon issue to bugzilla 
- Ask KIO Grafisk to make a test Quark document based on a Word document from us 
- Send out letters to the Iđut authors 
- Add corpus security re G5 syncing as an issue to Bugzilla 
- fix bugs!
Maaren
- work with the top-ten list 
- send copies of normativity decisions from 1985 to 1992 (2003?) to Thomas
Saara
- continue discussion on the new lexicon format 
- Fix the preprocess script and optimize it. 
- Try to add numeral treatment as part of the analyzer.
- Move gt2ga.sh to G5 and implement copying of the gt-dir. 
- Create a parallel corpus of the New Testaments.
- Routine for adding new languages to the propernoun xml-structure. 
- Move to Bugzilla: the analyzer needs to be optimized. 
- Implement validation of xml corpus against the dtd. 
- Create a group for corpus users. 
- Finish corpus dtd documentation, dtd location and permalink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml 
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts 
- Lule Sámi twol problems, with  Thomas and  Trond 
- project planning with  Trond, continued 
- Follow up on place names from Norge Digitalt 
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts 
- public tender:
  - answer requests/questions
  - set up anon. read-only cvs with Børre
  - corpus repo access
- smj G3 issue with  Thomas and  Trond 
- sme G3 issue with  Thomas and  Trond 
- call EDD/ Christian Emil Ore about national place name lexicon 
- risten.no/proper noun lexicon development:
  - refactor code
  - implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- fix bugs!
Thomas
- add incoming Lule Sámi words
- include the SGL decisions in our normativity document
- include normativity decisions made by Magga and Sammalahti in our normativity
- follow-up on the Sámi place names in Finland
- work on North Sámi compounding and derivation 
- smj G3 issue with  Sjur and  Trond 
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla 
- corpus infrastructure:
  - dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
  - discuss the new lexicon format and other issues in the newsgroup
  - Look into data synchronisation of proper nouns between risten.no and CVS
  - XQuery refactoring and code development for our proper noun editor
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
- fix bugs!
Trond
- Bring Lule Sámi up to 90 % 
- Write Lule Report to NFR 
- Apply for strategy funds 
- Work on corpus texts with Børre. 
- Contact the Finnish and Swedish Bible societies to get Bible texts. 
- Look at ga/ directory issue with  Saara 
- Ask for a Quark test file from his sister 
- News group discussion followup. 
- clean all texts from gt/smX/corp - they need to be in our corpus repo before
- fix bugs!
12. Next meeting, closing
13.03.2006 09:30
Closed at 12:07

