Meeting_2006-02-13
Meeting setup
- Date: 13.02.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:57.
Present: Børre, Saara, Sjur, Trond
Absent: Maaren, Thomas, Tomi
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Min Áigi
- Gather public texts, preferably also parallel ones
- Gathered some, sitting on a computer at the Sámediggi.
- Continue converting text from input format to our xml
- Converted existing texts using the upload form. Nice experience :-)
- review code and documentation for corpus xsl files under version control
- Not done
- convert nob and nno bible texts to be used as part of a parallel corpus, and
- Not done
- convert smj NT to paratext
- Not done
- close bug 211 as WONTFIX
- DONE :-)
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- not done
- Finish the review of the hyphenation detection.
- not done
- Review the handling of xsl-files in corpus infrastructure, including version
- done
- Fix the preprocess script and optimize it.
- not done
- finalize an improved working version of the CGI and command line scripts for
- done
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzer.
- not done
- Look at crontab ga/ directory issue with Trond.
- done, but there is a bug.. which should be fixed now.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- not done
- Lule Sámi twol problems, with Thomas and Trond
- delayed till Thomas is back
- project planning with Trond, continued
- not done
- Follow up on place names from Norge Digitalt
- not done
- Evaluate SFST as speller (and analyzer) lexicon
- not done
- write a background document on the corpus contracts
- not done
- public tender:
- review draft tender document from Finnut
- done, feedback and changes returned
- smj G3 issue with Thomas and Trond
- delayed till Thomas is back
- sme G3 issue with Thomas and Trond
- delayed till Thomas is back
- call EDD/ Christian Emil Ore about national place name lexicon
- not done
- risten.no/proper noun lexicon development: fix bugs, continue development
- wrote a draft specification of filename conventions, that at the same time
- some coding as well (don't remember the details any more)
- fix bugs!
- closed 217
- other:
- monthly report for January
- report for 2005 to Nordplus Sprog
Thomas
On sick-leave.
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell and corpus infrastructure
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct:
- remove last part of complex names not used as simplex names
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Done, but more to do.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Not done.
- Look at ga/ directory issue with Saara.
- Done. She has made a script, and I have posted things to the newsgroup.
- News group discussion followup.
- Some done.
- Do a bug report (if not done) on commandline behaviour in the Xerox tools.
- Hmm, not done, after all.
- Ask for e-mail address for corpus upload script
- Don't remember this one.
- fix bugs!
- This one was forgotten.
3. Documentation
Reviews
XSLT processing part of the corpus infra review is finished. The code is
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Since last meeting:
- Min Áigi
- called Anders Kintel - will sign it as soon as he gets the contract; will also
- Swedish Sámi Parliament: Grundström will be finished by summer time, now
Next:
- calling Olavi Korhonen, his dictionary is now in for printing
- then continuing on the list of orgs/persons to contact
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
TODO:
- call Sæth (Børre)
Bible texts
TODO:
- review paratext2xml converter (Børre)
- convert smj NT to paratext. (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
5. Corpus infrastructure
We need more "version control" in the corpus work - we don't know which version
Transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
Task list:
- Include the xsl files under version control
- RCS version control is almost finished, but an issue with access control is
- Access control resolved through Unix groups: one group for corpus
- Improve Finnish language detection as part of the corpus processing
- Move to Bugzilla (Saara)
- Review automatic hyphen:
- Acceptable results: 90% of all real hyphens correctly tagged.
- Move to Bugzilla (Saara)
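A minimal sketch of how the 90% criterion for the hyphen review could be checked. It assumes (this is not decided in the meeting) that real hyphens and tagged hyphens are each given as a set of positions; the function names and the position format are hypothetical.

```python
def hyphen_recall(real_hyphens, tagged_hyphens):
    """Share of the real hyphens that the detection also tagged."""
    if not real_hyphens:
        raise ValueError("need at least one real hyphen to evaluate")
    # Only hyphens that are both real and tagged count as correct.
    correctly_tagged = real_hyphens & tagged_hyphens
    return len(correctly_tagged) / len(real_hyphens)


def is_acceptable(real_hyphens, tagged_hyphens, threshold=0.90):
    """True if at least `threshold` of all real hyphens are tagged."""
    return hyphen_recall(real_hyphens, tagged_hyphens) >= threshold
```

Note that this only measures how many real hyphens are found; wrongly tagged hyphens would need a separate count.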
Further discussion about corpus analysis and computer use:
- the new G5 is tremendously faster than cochise, thus we want to use it
- cochise will continue to be our main corpus repo
- the corpus/gt/ dir will be synchronised with the G5
- corpus analysis and usage will happen on the G5
- we need to develop strong enough security routines for the G5 to fulfill our
- we are still using only one processor when analysing - making some simple
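One possible way to use more than one processor, as a rough sketch only: run the external analyser as several parallel processes, one per text. The command used here is a stand-in (it just uppercases its input), not one of our tools, and the function names are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real analyser: reads stdin, writes it back uppercased.
STAND_IN_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_analyser(command, text):
    """Send one text to an external command on stdin, return its stdout."""
    result = subprocess.run(
        command, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


def analyse_all(command, texts, workers=2):
    """Analyse texts with up to `workers` external processes at a time."""
    # Each thread only waits for its own child process, so the child
    # processes themselves can run on separate processors.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda text: run_analyser(command, text), texts))
```

The results come back in the same order as the input texts, so they can be matched to the corpus files afterwards.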
6. Linguistics
Anything? Nothing.
7. Name lexicon infrastructure
Complex names
TODO:
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
- Saara has added the analyzer as part
Move these issues to bugzilla (Børre)
Preprocessor optimization
To optimize one could build a targeted transducer only containing the relevant
- (Punctuation)
- (Abbreviation)
- (Acronym)
- (Adposition)
- (Negativeverb)
- (Copula)
- (VerbRoot)
- (AdjectiveRoot)
- (At)
- (NounSecond)
- (ALIT)
- (NAMAT)
- (SAS)
Perhaps picking the
hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt
earret% eará adv ;
dan% dihte adv ;
...
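The grep above finds entries with an escaped space (`% `), i.e. multiword entries. A small sketch of the same search that also returns the word forms in readable shape; it assumes entries look like the two lines in the grep output (form, continuation lexicon, `;`), and the function name is hypothetical.

```python
def multiword_forms(lines):
    """Return the word forms of lexc entries containing an escaped space."""
    forms = []
    for line in lines:
        if "% " not in line:
            continue  # same filter as: grep '% '
        # Protect escaped spaces so they survive whitespace splitting.
        protected = line.strip().replace("% ", "\0")
        first_field = protected.split()[0]
        # Restore the escaped spaces as plain spaces.
        forms.append(first_field.replace("\0", " "))
    return forms
```

Usage, assuming the file is read as UTF-8: `multiword_forms(open("adv-sme-lex.txt", encoding="utf-8"))`.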
TODO:
- make a lexc Root lexicon (first 40 lines of sme-lex.txt)
- extract the relevant parts of the relevant lexica from the main transducer
- built from the union of a and b.
Discussion will continue on the newsgroup.
XML format
TODO:
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface
- data synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
SGL Seminar
SGL has now been elected, with the following members:
- Rolf Olsen (Else Turi)
- Tor Magne Berg (Marit Breie Henriksen)
- Elle Marja Vars (-)
- Lena Kappfjell (Albert Jåma)
- Heidi Andersen (-)
SGL/normativity seminar:
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate
Infra for new projects and ideas:
- make Forrest integration work as expected (Børre)
Bug fixing
30 open bugs (and 24 risten.no bugs)
- Add bug report for the Xerox backspace error (Trond)
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth and Olavi Korhonen
- Correct Forrest integration for new projects and project ideas
- Move complex name lexicon issue to bugzilla
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Move the issue "Refine language detection for Finnish" to Bugzilla
- Move the issue "Finish the review of the hyphenation detection" to Bugzilla
- Add version information of the tools to part of the corpus infra.
- Fix the preprocess script and optimize it.
- finalize an improved working version of the CGI and command line scripts for
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzer.
- Look at crontab ga/ directory issue with Trond.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct:
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- News group discussion followup.
- Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
- Ask IT guys for an e-mail address for corpus upload script:
- fix bugs!
10. Next meeting, closing
20.02.2006 09:30
Closed at 11:33