Meeting_2006-02-13

Meeting setup

Date: 13.02.2006
Time: 09:30 Norwegian time
Place: Wherever we are :-)
Tools: iChat, SubEthaEdit

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
name lexicon infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:57.

Present: Børre, Saara, Sjur, Trond

Absent: Maaren, Thomas, Tomi

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

send out contracts with accompanying letter
- Min Áigi
Gather public texts, preferably also parallel ones
- Gathered some, sitting on a computer at the Sámediggi.
Continue converting text from input format to our xml
- Converted existing texts using the upload form. Nice experience :-)
review code and documentation for corpus xsl files under version control
- Not done
convert nob and nno bible texts to be used as part of a parallel corpus, and review the paratext2xml converter as part of the conversion
- Not done
convert smj NT to paratext
- Not done
close bug 211 as WONTFIX
- DONE :-)
fix bugs!

Maaren

work with risten.no
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

continue discussion on the new lexicon format
Refine language detection for Finnish
- not done
Finish the review of the hyphenation detection.
- not done
Review the handling of xsl-files in corpus infrastructure, including version control
- done
Fix the preprocess script and optimize it.
- not done
finalize an improved working version of the CGI and command line scripts for corpus additions
- done
update conversion from lexc to xml (proper names) with the latest refinements
Try to add numeral treatment as part of the analyzer.
- not done
Look at crontab ga/ directory issue with Trond.
- done, but there was a bug, which should be fixed now.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
- not done
Lule Sámi twol problems, with Thomas and Trond
- delayed till Thomas is back
project planning with Trond, continued
- not done
Follow up on place names from Norge Digitalt
- not done
Evaluate SFST as speller (and analyzer) lexicon
- not done
write a background document on the corpus contracts
- not done
public tender:
- review draft tender document from Finnut
  - done, feedback and changes returned
smj G3 issue with Thomas and Trond
- delayed till Thomas is back
sme G3 issue with Thomas and Trond
- delayed till Thomas is back
call EDD/ Christian Emil Ore about national place name lexicon
- not done
risten.no/proper noun lexicon development: fix bugs, continue development
- wrote a draft specification of filename conventions, which also repeats what has been said earlier regarding the directory structure.
- some coding as well (don't remember the details any more)
fix bugs!
- closed 217
other:
- monthly report for January
- report for 2005 to Nordplus Sprog

Thomas

On sick-leave.

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
corpus infrastructure:
- dtd location (both public and internal)
Document aspell and corpus infrastructure
new proper name lexicon
- remove last part of complex names not used as simplex names
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
comment review template made by Saara
fix bugs!

Trond

Work on corpus texts with Børre.
- Done, but more to do.
Contact the Finnish and Swedish Bible societies to get Bible texts.
- Not done.
Look at ga/ directory issue with Saara.
- Done. She has made a script, and I have posted things to the newsgroup.
News group discussion followup.
- Some done.
Do a bug report (if not done) on command-line behaviour in the Xerox tools.
- Hmm, not done, after all.
Ask for e-mail address for corpus upload script
- Don't remember this one.
fix bugs!
- This one was forgotten.

3. Documentation

Reviews

XSLT processing part of the corpus infra review is finished. The code is finished and ready to use, but language identification still needs improvements.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

Since last meeting:

Min Áigi
called Anders Kintel - will sign it as soon as he gets the contract; will also burn on a CD
Swedish Sámi Parliament: Grundström will be finished by summer time, now proofreading

calling Olavi Korhonen, his dictionary is now in for printing
then continuing on the list of orgs/persons to contact

Odin

Waiting for Sæth to discuss with colleagues how to implement the cooperation, and to get back to us.

TODO:

call Sæth (Børre)

Bible texts

TODO:

review paratext2xml converter (Børre)
convert smj NT to paratext. (Børre)
ask to get fin and swe NT and OT in paratext format. (Trond)

5. Corpus infrastructure

We need more "version control" in the corpus work: at present we don't know which version of the XSL script was used for a given conversion, only roughly whether it was used at all. We need to document within the XSL file which version of each tool was used, including the version of the XSL file (common section/template) itself.
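A minimal sketch of how such version information could be recorded, assuming it is kept as a small XML element. The element and attribute names (`version`, `tool`, `xsl`, `sha1`) and the function name are hypothetical, chosen only for illustration; they are not part of our present corpus format.

```python
import hashlib
import xml.etree.ElementTree as ET


def version_element(tool_versions, xsl_text):
    """Build a <version> element listing tool versions and an XSL checksum.

    All element and attribute names here are hypothetical.
    """
    elem = ET.Element("version")
    # One <tool> child per tool, sorted by name so the output is stable.
    for name, ver in sorted(tool_versions.items()):
        ET.SubElement(elem, "tool", {"name": name, "version": ver})
    # A checksum identifies exactly which XSL text was used, even when
    # no revision number is available.
    digest = hashlib.sha1(xsl_text.encode("utf-8")).hexdigest()
    ET.SubElement(elem, "xsl", {"sha1": digest})
    return elem
```

The resulting element can be serialised with `ET.tostring(elem)` and stored next to the converted text, so that a later reader can see which tool versions produced it.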

Transferring the old gt/sme/corp files to the new corpus repo:

for the biggest top ten (or so) the orig. should be located and copied to the new corpus repo
then these files should be removed from gt/sme/corp/
all small files could just be forgotten/ignored
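The ten (or so) biggest files can be found with a few lines of script. The following is a sketch only; the function name is made up for illustration, and the directory to scan is passed in by the caller.

```python
from pathlib import Path


def largest_files(root, count=10):
    """Return the `count` largest regular files under `root`, biggest first."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)[:count]
```

Calling `largest_files("gt/sme/corp")` would then give the candidates whose originals should be located and copied to the new corpus repo.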

Task list:

Include the xsl files under version control
1. RCS version control is almost finished, but an issue with access control is still open.
  1. Access control resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access.
Improve Finnish language detection as part of the corpus processing
1. Move to Bugzilla (Saara)
Review automatic hyphen:
1. Acceptable results: 90% of all real hyphens correctly tagged.
  1. Move to Bugzilla (Saara)
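The 90% criterion for the hyphen review can be read as recall over the real hyphens. A small sketch of that check, assuming hyphens are identified by their position in the text (the position-based representation and the function names are assumptions made for this example):

```python
def hyphen_recall(real_hyphens, tagged_hyphens):
    """Share of real hyphen positions that the tagger also marked."""
    real = set(real_hyphens)
    if not real:
        return 1.0  # nothing to find, so nothing was missed
    return len(real & set(tagged_hyphens)) / len(real)


def acceptable(real_hyphens, tagged_hyphens, threshold=0.90):
    """True when at least `threshold` of the real hyphens were tagged."""
    return hyphen_recall(real_hyphens, tagged_hyphens) >= threshold
```

Note that this measures only how many real hyphens were found; it says nothing about hyphens tagged by mistake.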

Further discussion about corpus analysis and computer use:

the new G5 is tremendously faster than cochise, thus we want to use it
cochise will continue to be our main corpus repo
the corpus/gt/ dir will be synchronised with the G5
corpus analysis and usage will happen on the G5
we need to develop strong enough security routines for the G5 to fulfill our obligations towards the text licensers
we are still using only one processor when analysing - making some simple multiprocessor analysis script will increase the speed of the analysis a lot
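A simple multiprocessor script could look roughly like the sketch below, which spreads the files over a pool of worker processes. The per-file step here only counts tokens and stands in for the real analysis; all function names are assumptions made for this example.

```python
import sys
from multiprocessing import Pool


def analyse_file(path):
    """Placeholder analysis: count whitespace-separated tokens in one file."""
    with open(path, encoding="utf-8") as handle:
        return path, len(handle.read().split())


def analyse_corpus(paths, processes=2):
    """Analyse the files in parallel, one file per worker at a time."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(analyse_file, paths))


if __name__ == "__main__":
    # Usage: python analyse.py file1.xml file2.xml ...
    if sys.argv[1:]:
        for path, count in analyse_corpus(sys.argv[1:]).items():
            print(path, count)
```

Since each file is analysed independently, the speed-up should be close to the number of processors as long as the files are of similar size.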

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names

TODO:

make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
Saara has added the analyzer as part of the preprocess, but it is slow, and needs to be optimized.

Move these issues to bugzilla (Børre)

Preprocessor optimization

To optimize one could build a targeted transducer only containing the relevant lexicons for preprocessing. But presently that leads to the following lexicons being reported as referenced but undefined:

(Punctuation)
(Abbreviation)
(Acronym)
(Adposition)
(Negativeverb)
(Copula)
(VerbRoot)
(AdjectiveRoot)
(At)
(NounSecond)
(ALIT)
(NAMAT)
(SAS)
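Such a list can also be produced before compiling, by comparing the lexicons a lexc source defines with the continuation classes it refers to. The sketch below is a simplified reading of the lexc format, not a full parser: it assumes the continuation class is the last token before each `;`, drops quoted info strings, and treats every `!` as the start of a comment (so an escaped `%!` would be misread). The function name is made up for illustration.

```python
import re

LEXICON_RE = re.compile(r"^\s*LEXICON\s+(\S+)")
# Continuation class: the last bare token before a terminating semicolon.
ENTRY_RE = re.compile(r"(\S+)\s*;")
QUOTED_RE = re.compile(r'"[^"]*"')


def undefined_lexicons(lexc_text):
    """Return sorted names of lexicons that are referenced but never defined."""
    defined, referenced = set(), set()
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0]      # strip comments (simplification)
        line = QUOTED_RE.sub("", line)   # strip info strings such as "gloss"
        match = LEXICON_RE.match(line)
        if match:
            defined.add(match.group(1))
            continue
        for cont in ENTRY_RE.findall(line):
            if cont != "#":              # '#' marks end of word, not a lexicon
                referenced.add(cont)
    return sorted(referenced - defined)
```

Run on the concatenated source files of the targeted transducer, this shows which lexicons still need to be pulled in.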

Perhaps picking the

hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt 
earret% eará adv ;
dan% dihte adv ;
...

TODO:

make a lexc Root lexicon (first 40 lines of sme-lex.txt)
extract the relevant parts of the relevant lexica from the main transducer
build the transducer from the union of the two items above (the Root lexicon and the extracted lexica)

Discussion will continue on the newsgroup.

XML format

TODO:

testing of conversion
eXist as editor:
1. develop the needed XQueries and interface
2. data synchronisation between risten.no and
3. test whether eXist as editor is actually working well

8. Other

SGL Seminar

SGL has now been elected, with the following members:

Rolf Olsen (Else Turi)
Tor Magne Berg (Marit Breie Henriksen)
Elle Marja Vars (-)
Lena Kappfjell (Albert Jåma)
Heidi Andersen (-)

SGL/normativity seminar:

all members = potentially/likely all languages
- not all languages, only North Sámi
date? As early as possible, end of February/beginning of March
place? Maaren will investigate

Infra for new projects and ideas:

make Forrest integration work as expected (Børre)

Bug fixing

30 open bugs (and 24 risten.no bugs)

Add bug report for the Xerox backspace error (Trond)

9. Summary, task list

Børre

send out contracts with accompanying letter
Gather public texts, preferably also parallel ones
Continue converting text from input format to our xml
convert nob and nno bible texts to be used as part of a parallel corpus
review the paratext2xml converter
convert smj NT to paratext
Call Ove Sæth and Olavi Korhonen
Correct Forrest integration for new projects and project ideas
Move complex name lexicon issue to bugzilla
fix bugs!

Maaren

work with risten.no
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

continue discussion on the new lexicon format
Move the issue "Refine language detection for Finnish" to Bugzilla
Move the issue "Finish the review of the hyphenation detection" to Bugzilla
Add version information of the tools to part of the corpus infra.
Fix the preprocess script and optimize it.
finalize an improved working version of the CGI and command line scripts for corpus additions
update conversion from lexc to xml (proper names) with the latest refinements
Try to add numeral treatment as part of the analyzer.
Look at crontab ga/ directory issue with Trond.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
Lule Sámi twol problems, with Thomas and Trond
project planning with Trond, continued
Follow up on place names from Norge Digitalt
Evaluate SFST as speller (and analyzer) lexicon
write a background document on the corpus contracts
public tender:
- review draft tender document from Finnut
smj G3 issue with Thomas and Trond
sme G3 issue with Thomas and Trond
call EDD/ Christian Emil Ore about national place name lexicon
risten.no/proper noun lexicon development: fix bugs, continue development
fix bugs!

Thomas

work on North Sámi compounding and derivation
review corpus usage documentation
smj G3 issue with Sjur and Trond
sme G3 issue with Sjur and Trond

Tomi

move aspell UTF-8 suffix bug to Bugzilla
corpus infrastructure:
- dtd location (both public and internal)
Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
fix bugs!

Trond

Work on corpus texts with Børre.
Contact the Finnish and Swedish Bible societies to get Bible texts.
Look at ga/ directory issue with Saara.
News group discussion followup.
Do a bug report (if not done) on command-line (mis)behaviour in the Xerox tools
Ask IT guys for an e-mail address for corpus upload script: corpus@giellatekno.uit.no (forwarded to Børre)
fix bugs!

10. Next meeting, closing

20.02.2006 09:30

Closed at 11:33