Meeting_2006-02-27
Meeting setup
- Date: 27.02.2006
- Time: 09:30 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:36.
Present: Børre, Saara, Sjur, Thomas, Trond
Absent: Maaren, Tomi
Main secretary: Trond
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- Some new texts added to the corpus, from the Sámediggi
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not done
- Call Ove Sæth and Olavi Korhonen
- Called Korhonen, he was very positive about sharing his Lule Sámi
- Correct Forrest integration for new projects and project ideas
- Done with Sjur
- Move complex name lexicon issue to bugzilla
- Not done
- fix bugs!
Maaren
- work with risten.no
- Added words. Also added words from a frequency-based missing list.
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- not done.
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzer.
- not done
- Look at crontab ga/ directory issue with Trond.
- postponed.
- Create a parallel corpus of the New Testaments.
- In progress.
- Routine for adding new languages to the propernoun xml-structure (Lule Sámi
- In progress.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- almost finished - will finish today.
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
- other:
- was on Winter Holiday
Thomas
- work on North Sámi compounding and derivation
- nothing; has been working with Lule Sámi name lexica and incoming words
- review corpus usage documentation
- nothing; has been working with Lule Sámi name lexica and incoming words
- smj G3 issue with Sjur and Trond
- nothing; has been working with Lule Sámi name lexica and incoming words
- sme G3 issue with Sjur and Trond
- nothing; has been working with Lule Sámi name lexica and incoming words
Tomi
On sick leave.
Trond
- Mostly worked on Lule Sámi this week.
- Work on corpus texts with Børre.
- Discussed issues.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Not done.
- Look at ga/ directory issue with Saara.
- Postponed. We discussed it, though...
- News group discussion followup.
- Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
- Done.
- Ask the IT guys for an e-mail address for the corpus upload script
- Other people have looked into this.
- fix bugs!
3. Documentation
TODO:
- Integrating the future project plans and ideas in our Forrest documentation
- Done by Sjur and Børre.
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
TODO:
- call Sæth (Børre)
Olavi Korhonen's Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
KIO Grafisk and the Iđut books
We need a test file in order to find out whether file conversion from Quark
TODO:
- Trond to ask for a Quark test file from his sister
- Børre to ask KIO to send a more elaborate test file, representative for
- Børre will send letters to the authors.
Bible texts
TODO:
- review paratext2xml converter (Børre)
- convert smj NT to paratext (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Still not done. Trond will contact Bibelselskapet for a new sme version, and
5. Corpus infrastructure
We need more "version control" in the corpus work - we don't know which version
Transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
TODO:
- Access control to corpus repo resolved through Unix groups: one group for corpus
- Saara will ask Roy to do what should be done.
Further discussion about corpus analysis and computer use:
The new G5 is tremendously faster than cochise, thus we want to use it. But
- the corpus/gt/ dir will be synchronised with the G5
- TODO: set up copying script (Saara)
- we need to develop strong enough security routines for the G5 to fulfill our
- TODO: Børre to move this to bugzilla
- we are still using only one processor when analysing - making some simple
- script implemented, but not tested due to copying not in place yet (see
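A minimal sketch of how the analysis work could be divided so that each processor gets its own share of the corpus files. This is not the script mentioned above; the function name `split_into_batches` and the file names are made up for illustration, and the memo does not say how the real script divides the work.

```python
def split_into_batches(paths, n_batches):
    """Deal the paths round-robin into n_batches lists (some may be empty)."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches = [[] for _ in range(n_batches)]
    for index, path in enumerate(paths):
        batches[index % n_batches].append(path)
    return batches


# Hypothetical file names; a machine with two processors would use 2 batches.
files = ["a.xml", "b.xml", "c.xml", "d.xml", "e.xml"]
for number, batch in enumerate(split_into_batches(files, 2)):
    print(number, batch)
```

Each batch could then be handed to its own analyser process. Round-robin dealing keeps the batch sizes within one file of each other, although it does not balance by file size.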
Secure copying:
To get cvs ssh working without password prompting:
    ssh-keygen -t rsa      # just press Enter at every question
    chmod 0644 .ssh/id_rsa.pub
    scp .ssh/id_rsa.pub <user>@cochise.uit.no:.ssh/authorized_keys2
New tasks:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location =
- the link isn't working
- finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
- add xml validation against our dtd to the corpus conversion process (Saara)
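One possible shape for the validation step, sketched under the assumption that `xmllint` (from libxml2) is installed on the machine doing the conversion. The memo does not name a validator, and the function names and paths below are hypothetical.

```python
import subprocess


def xmllint_command(dtd_path, xml_path):
    """Build an xmllint call that validates xml_path against dtd_path."""
    # --noout suppresses the serialised document; --dtdvalid names the DTD.
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def is_valid(dtd_path, xml_path):
    """Return True if xmllint exits with status 0 for this file."""
    result = subprocess.run(
        xmllint_command(dtd_path, xml_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Hypothetical paths, for illustration only.
print(" ".join(xmllint_command("corpus.dtd", "converted/doc.xml")))
```

The conversion process could call `is_valid` on each freshly converted file and refuse to add files that fail, so that invalid xml never reaches the corpus repo.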
6. Linguistics
North Sámi
We want to get the decisions from the SGL meeting in December.
TODO:
- get the decisions (Maaren)
Lule Sámi
NFR wrote:
[The programme board] requests (...) that an alternative plan for the rest of the project period be drawn up immediately. This plan must be based on the changed premise that we do not have access to the Lule Sámi dictionary. The programme board wants us to switch to the alternative plan from 1 March, unless access is genuinely in place by then.
The criterion for continuing with Lule Sámi that we set up in our plan was the
the criterion for whether we commit to Lule Sámi is whether we have a beta version of the Lule Sámi transducer, with integrated lexicon, up and running by 1 March.
Points for our March 1st report:
- We have received the dictionary, and integrated most of it (27.2:
- Our Lule Sámi disambiguator is up and running as an alpha version
- The analyser has a recall of 80.0 % on tokens and 67.4 % on types, on the New Testament
- The analyser has 17773 lexicon lines
- We have 3367 lines of non-allocated Kintel words
- The integration of proper nouns will be done in parallel with Northern Sámi, and
Goal for March 1st: Reach the 90 percent lexical recall limit.
Conclusion: we have already met the basic requirements for continuing with Lule
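For reference, a small sketch of how token recall and type recall differ. It assumes the usual definitions (share of running words recognised versus share of distinct word forms recognised); the memo does not state how the figures above were computed, and the toy word list and lexicon are invented.

```python
from collections import Counter


def lexical_recall(tokens, is_recognised):
    """Return (token_recall, type_recall) in percent for a list of word forms."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    known_types = [word for word in counts if is_recognised(word)]
    known_tokens = sum(counts[word] for word in known_types)
    token_recall = 100.0 * known_tokens / sum(counts.values())
    type_recall = 100.0 * len(known_types) / len(counts)
    return token_recall, type_recall


# Invented toy data: three known word forms, two unknown ones.
lexicon = {"aa", "bb", "cc"}
corpus = ["aa", "aa", "aa", "bb", "bb", "cc", "aa", "xx", "yy", "bb"]
token_recall, type_recall = lexical_recall(corpus, lexicon.__contains__)
print(f"token recall {token_recall:.1f} %, type recall {type_recall:.1f} %")
```

Token recall is normally the higher of the two because the frequent word forms tend to be covered first, while each rare unknown form counts fully as a type.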
TODO:
- Trond and Thomas: work on the 90 % limit
- Trond: Write a report to NFR.
7. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
TODO:
- Move this issue to bugzilla (Børre)
XML format
TODO:
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repository (Sjur, Tomi)
- test whether eXist as editor is actually working well (Thomas, others)
8. Spellers
Nothing while Tomi is on sick leave, and until the new proper name lexicon is in
9. Other
Upcoming Strategy money application deadline at Samisk senter
Possible grant proposal themes include:
- Southern Sámi:
- Grammar research
- corpus collection
- normativity seminar (future speller)
- Lule Sámi:
- Disambiguator
- Northern Sámi:
- Text-to-speech
- Semantic annotation
- Programming infrastructure:
- Setting up a structure for inflecting dictionaries
SGL Seminar
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date:
- As early as possible, end of March?
- place? Maaren will investigate
Bug fixing
33 open bugs (and 24 risten.no bugs)
10. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Ask KIO Grafisk to make a test Quark document based on a Word document from us
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
- get the normativity decisions from the December SGL meeting
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- Try to add numeral treatment as part of the analyzer.
- Move gt2ga.sh to G5 and implement copying of the gt-dir.
- Create a parallel corpus of the New Testaments.
- Routine for adding new languages to the propernoun xml-structure.
- Move to Bugzilla: the analyzer needs to be optimized.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permlink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- data synchronisation between risten.no and the cvs repository
- fix bugs!
Thomas
- work on the Lule Sámi 90 % goal
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- fix bugs!
Trond
- Bring Lule Sámi up to 90 %
- Write Lule Report to NFR
- Apply for strategy funds
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara
- Ask for a Quark test file from his sister
- News group discussion followup.
- fix bugs!
11. Next meeting, closing
06.03.2006 09:30
Børre, Saara, Trond will be away.
Closed at 12:00