Meeting_2006-04-03

Contents:

Meeting setup
Agenda
1. Opening, agenda review, participants
2. Reviewing the task list from the last meeting
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
3. Documentation
4. Corpus gathering
5. Corpus infrastructure
6. Infrastructure
7. Linguistics
8. Name lexicon infrastructure
9. Spellers
10. Other
11. Summary, task list
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
12. Next meeting, closing

Meeting setup

Date: 03.04.2006
Time: 09.30 Norw. time
Place: Wherever we are : -)
Tools: iChat, SubEthaEdit

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09: 43.

Present: Børre, Saara, Sjur, Thomas, Tomi, Trond

Absent: Maaren (sick leave)

Main secretary: Børre

Agenda accepted with additions under "Other".

2. Reviewing the task list from the last meeting

Børre

send out contracts with accompanying letter
- Not done
Gather public texts, preferrably also parallel ones
- Not done
Continue converting text from input format to our xml
- Some done
convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
review the paratext2xml converter
- Not done
convert smj NT to paratext
- Not done
Move complex name lexicon issue to bugzilla
- Done
Send out letters to the Iđut authors
- Waiting for address list from Åge Persen
Add corpus security re G5 syncing as an issue to Bugzilla
- Not done
write docu for how to apply for a corpus user account (forms, recipients, etc)
- Not done
remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- Not done
integrate generated corpus repository summaries in the Forrest site
- Not done
Ask for email-address: corpus@giellatekno.uit.no
- Done. It works.
install and test Gobby, install new version of SEE (also for Thomas)
- Haven't installed SEE. Tried to compile Gobby, but it failed when linking.
fix bugs!
Other
- Contacted various publishers
  - Davvi Girji: will look at the contracts and give an answer after easter.
  - Báhko (Lule Sámi): Bård Eriksen will discuss the contract with some other in Árran, and give his answer after that.
  - DAT is fine with the contract, will compile an address list and send it to us.
  - NSI: Audhild Schanche will discuss our request with her co-workers, but presumes there would be no problems.

Maaren

will be on sick leave throughout April

Saara

Create a parallel corpora of the new testaments.
change the name of gt/ to gtbound/ and add a symbolic link.
- done. the scripts are not yet updated.
fix the email address for corpus upload.
- done.
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
add utf-8 check to xml-validation of the corpus files.
- not done.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
- nothing
Follow up on place names from Norge Digitalt
- nothing
Evaluate SFST as speller (and analyzer) lexicon
- nothing
write a background document on the corpus contracts
- nothing
public tender:
- answer requests/questions
  - wrote a lengthy e-mail formulated as a FAQ
- corpus repo access to free texts (with Børre)
  - not done yet
conversion of corpus repo summary xml to Forrest xml
- nothing last week
call EDD/ Christian Emil Ore about national place name lexicon
- not done
risten.no/proper noun lexicon development:
- refactor code
  - done some
- implement inheritance/collection overriding for css using sitemaps
  - done also for CSS now
- code design for XQueries needed for dict/term editing
  - some initial discussions in the newsgroup, based on the file naming schemes and inheritance/override system
fix bugs!

Thomas

add incoming Lule Sámi words
- added a few
work on North Sámi compounding and derivation
- nothing
smj G3 issue
- nothing
sme G3 issue
- nothing
translate stopword list into smj (aligner; list from Trond)
- finished
assist Trond and Linda with the smj disamb work
- assisted

Tomi

move aspell UTF-8 suffix bug to Bugzilla
- not done
Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
- not done
new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
  - done some
- XQuery refactoring and code development for our proper noun editor
  - started
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  - not done
read aligner docu, install, provide feedback
- not done
translate stopword list into sme (aligner; list from Trond)
- done
install and test Gobby, install new version of SEE
- not done
fix bugs!

Trond

Translate anchor list into nno, work on sme, fin.
- Not done for nno (which I think should be conflated with nob), but worked quite substantially on sme, fin, and looked through smj. The lists are now
- good enough to be used for development, they only need correction.
Add the anchor list translations to cvs
- Done (the naming is not consistent: anchor.txt contains eng-nob-sme-fin, and anchor-eng-nob-smj.txt contains what it is called.
remove deleted files from the CVS repository (in the Attic)
- Forgot that.
grammatical searchability in the graphical corpus interface: revise taglist
- Not done.
better smj NT text
- Not done.
Prepare a list of good candicates for first inclusion into the corpus.
- Not done
translate Northern Sámi lists and sets to Lule Sámi
- What was this?
work on semantically based sets (sme, smj)
- Minor work done.
start and lead discussion and work on semantic features for disamb
- Not done.
Install Gobby with support programs, see, etc.
- Installed all the prerequisites, plan to ask Børre to help me out on this one
get a key for Maaren in May
- Not done.
install aligner, test it and give feedback
- Not done.
fix bugs!.
- Not done.

3. Documentation

TODO:

documentation on how to apply for a user account for the corpus repo ( Børre)
- Not done

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Sæth replied by e-mail, hasn't had time to follow-up, but will try to include us in their plans.

Olavi Korhonen's Lule Sámi dictionary.

No news this week

KIO Grafisk and the Iđut books

TODO:

send letters to the authors (Børre)
wait for the discussions with Davvi Girji

Bible texts

We will get text from Finland, but still haven't received any. We have got the Swedish text from Sweden. As for the last html versions from Norway, Trond has not contacted them last week.

TODO:

convert smj NT to paratext (Børre)
get fin and swe NT and OT in paratext format. (Trond)

Min Áigi

Had a meeting last week, nothing heard of it yet.

Kåfjord

Contacted us last week, they would like to give us texts. Excellent initiative (we didn't contact them)! They were told to use the web interface, and contact us again if there are problems.

5. Corpus infrastructure

TODO:

remove deleted files from the CVS repository (Trond, still not done.)
we need to develop strong enough security routines for the G5 to fulfill our obligations towards the text licensers
- move this to bugzilla (Børre, not done yet.)
add UTF-8 check as part of the validation (Saara, not finished yet)
- Will file a bug report to Bz and continue the work.

Changes and updates because of the Divvun public tender

User account admin and infra: see previous memo.

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see previous memo.

TODO:

convert from that xml to Forrest document format (Sjur)
- nothing last week
integrate the final Forrest documents into Forrest, and make sure it gets published (Børre)
- waiting for the above
make free texts available
- zip up the xml files in gtfree/ and put it up on the Divvun download area http://www.divvun.no/downloads/ ( Saara, Børre, DEADLINE: Thursday 6.4.)
- also provide an xml-free version? I.e. only paragraphs, whatever, as given by ccat
- done weekly by a cron script (but only if there are new files) (Saara)
- e-mail Finnut about the availability of the free corpus, and the download link (Sjur, Børre if Sjur goes on paternal leave)

Free and non-free texts

More info in a previous meeting memo.

TODO:

update scripts to handle this dichotomy. (Saara)
- done
gt/ vs gtbound/: change to gtbound/, add symbolic link from gt/ to gtbound/ ( Saara)
- done
we need to rerun the conversion, and add/check copyright/license status
- add license info (Børre, DEADLINE: Tuesday 4.4.)
- rerun conversion (Saara, DEADLINE: Wednesday 5.4.)

Linking parallel files

How do we know that two (or more) files are parallel language versions of each other? Suggestions:

One option:

samefilename.sme.doc.xml
samefilename.nob.doc.xml

nno/facta/samefilename.nno.html.xml
sme/facta/samefilename.sme.html.xml <== parallel file

sme/facta/somefilename.html.xml <== file in one lg only

The other option: to store the parallel files as links in the meta info/header Then we can keep the original filename.

Should we allow for more than one file at a time when uploading? Use cases: parallel texts, chapters in a book, many documents from the same author. Saara will think about it, and discuss/propose something in the newsgroup.

DECISION: We'll keep the original filename, and store linking info in the header (has to be added manually). Saara will develop the web interface for uploading to make it easier to add several documents in one go (as a serial process).

More texts to the graphical corpus interface:

TODO:

We would like to have more than the NT in the graphical interface (Saara)
We would like to have grammatical searchability, not only POS. (Saara, Trond
This presupposes a discussion with Oslo. (Trond to start discussion and Trond and Saara to continue
For Lule Sámi: We would like to have a parallel corpus interface with NT (text only). This presupposes better quality texts (Trond, Børre)
- Better Lule NT text still not made.
preparations: gather more texts (we are doing this)
Review the tag list and have it ready for inclusion (gt/cwb/korpustags.txt) ( Trond, Linda)
Prepare a list of good candicates for first inclusion into the corpus. ( Trond, Linda)

Top-two priorities:

Linda and Trond to go through the taglist
Saara and Trond to contact Anders in 0slo

Text upload

TODO:

Ask for email-address: corpus@giellatekno.uit.no (Børre)
- done
Make a setup for this email address so that it goes to Børre, and then turn on the procedure (Saara)
- done, but not tested on the real server
  - now also tested, and working

Language recognition

As a work-around before Finnish recognition is reliable, treat all "Finnish" sections as North Sámi (we don't yet have any Finnish texts(?)). We need to be able to recognise the other languages, to remove noise, to identify parallel texts in teh same docu, etc.

TODO:

turn on language recognition, skipping Finnish (Saara)

6. Infrastructure

Aligner

Today, we have two anchor files in addition to the original one.

TODO:

Read documentation and try out, give feedback to Bergen. (Trond, Saara, Tomi)
- Trond to send relevant documents to Tomi.
Translate the anchor list anchor-eng-nor.txt into sme (and fin?) ( Tomi), and into smj (Thomas or Anders Urheim) (and nno? Trond). Note that the format is "lang1 / lang2", and that the number of lines should be kept in order to make it possible to move from one language to the next.
- done for all but nno
- nno should be conflated with nob into 'nor' (Trond)
Saara to install the aligner, everyone to read the documentation
Add the anchor list translations to cvs (Trond)
- add to cvs location: gt/common/src/anchor.txt
  - done
- "eng / nob / sme / smj / fin". word / ord* / sátni*, sáni* / sana*, sanoj*
Move smj to the anchor.txt file, so that we get eng/nor/sme/smj/fin. ( Trond)

Hyphenator

TODO:

make target for the hyphenator(s)
- done
add a web interface to the hyphenator
- done
correct hyphenation of word boundaries and exceptions (Sjur, Trond)
add a possibility to upload whole documents for hyphenation (and also analysis, generation, etc) (Saara)
we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)

7. Linguistics

General - hyphenation

We need to add word boundaries in our lexicons. All compounds need explicit word boundary markers, which will go to 0 in the regular transducer. It will go to - or # in a specially made hyphenation transducer, which will include the rules made by Trond (see: gt/sme/src/hyph-sme.txt).

It is not clear how this will be done, but Sjur has ideas.

Problematic word boundaries:

CVCV#CVCV  OK CV-CV-CV-CV  need no fix
CVCVC#CVCV OK CV-CVC-CV-CV need no fix
CVCVC#VCV  !! *CV-CV-C#V-CV -> CV-CVC#V-CV <= manually fix only these

Exceptions:

geografiija     ge-og-ra-fii-ja
Voionmaa        Voion-maa  => Voi-on-maa (oi no diphth)
 tak-si-eaig-gi :-)

These needs to be marked in the lexicons in each case, probably something like:

geo^grafiija
Voi^on#maa      Voi^on#maa
geo^grafiija    geo^gra-fii-ja
torne^träsk     tor-ne^träsk
  -- or --
torne#träsk     tor-ne#träsk

We need to introduce one new symbol: ^: 0

Goal: Analyse divvunáhkus as div-vun#áh-kus and not as div-vu-náh-kus. That is, to preserve the word boundary from the lexicon in the output, but keep the output as word form, not as stem/baseform+analysis. The following is a suggestion on how to process the input (bottom) to arrive at the wanted output. Start reading from below, and upwards.

output level
twh upper divvun#áhkus <= first analysis  how do you get this transducer?
twh lower divvunáhkus  <= text input

hyp upper div-vun#áh-kus => hyph-sme.fst    ok, we know this one
twh lower divvun#áhkus

twl lower divvun#áh0ku0s  => twolhash-sme.bin      #:# ^:^, not #:0 ^:0
twl upper divvun#áhkkuX4s                       \
                                                   smehash.fst
lex lower divvun#áhkkuX4s                       /
lex upper divvun#áhkku+N+Sg+Loc => sme.save

----- mirror: below, regular order, above mirrored order

lex upper divvun#áhkku+N+Sg+Loc => sme.save
lex lower divvun#áhkkuX4s                      \
                                                   sme.fst
twl upper divvun#áhkkuX4s  =>   twol-sme.bin    /
twl lower divvun0áh0ku0s   
input level

TODO:

add all word boundaries and exception hyphenation marks (Thomas)
- Done: The noun file a-h (only word boundaries?)
Set up the mechanism for the hash transducer package. (Sjur, Tomi, Trond)

OPEN questions:

what about compound names?
- The compound boundaries should be added to the propernoun lexicon
loan words breaking Sámi syllable structure Voionmaa, Bur^ås Bu-rås, Stein-vik-sholm *Steinviks#holm Stein-viks%holm
- Also here, boundaries should be added. The transducer should be equipped with at least some foreign phonotactic rules. We restrict the markup to ^ for now, as word boundaries of foreign words are normally not visible to the Sámi grammar.
do we want to differentiate between degrees of (mental) lexicalisation? It isn't needed for hyphenation, but can be useful when decomposing compounds for IR and other purposes.

DECISION:

use # and ^ for all compounds and hyphenation points in the lexicon
later find a way to generate without having to spaecify # and ^ (this is also a problem today for a subset of the words)

North Sámi

Semantic feature system

Further discussion and details in the previous meeting memos memos

Lule Sámi

TODO:

add the rest of the inc- words (Thomas)
- done, still some more
translate Northern Sámi lists and sets to Lule Sámi ( Linda, Trond, will be postponed till we have done more on smehash. )

8. Name lexicon infrastructure

TODO:

refactor and prepare risten.no for multiple collections:
1. develop the Cocoon sitemap to delegate requests to the proper folder level, such that the most specific code is always used (Sjur)
  1. Done, now also for CSS, thus complete
2. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
develop the needed XQueries and interface (Sjur, Tomi)
data synchronisation between risten.no and the cvs repo (Tomi)
1. commiting is moving forward
test and review when ready

9. Spellers

Nothing until the new proper noun lexicon is in place. We don't have enough people to do both.

10. Other

Easter vacation/absenses

Who?	When?
Børre	from the 10th to the 12th of April
Saara	at work normally
Sjur	no vacation, possibly paternal leave
Thomas	from the 10th to the 12th of April, 3 days
Tomi	from the 10th to the 12th of April, might be at work offline
Trond	don't know yet

No meeting during easter.

Gobby

TODO:

install and test it, to prepare for cooperation with non-Mac users (use case: Lars Nygård in Oslo) (Børre, Tomi, Trond, if it works ok then also the others)

SubEthaEdit update

TODO:

upgrade SEE ( Børre, Tomi, Thomas)
install jspwiki mode from Sjur ( all interested)

Bug fixing

35 open Divvun/Disamb bugs, and 25 risten.no bugs

Min Áigi letters

There are four texts on language correction, two interesting to us:

genitive and spelling of "puma"
why are there so few "ordinary" words in risten.no? Risten/SD should answer.

Key to the G5 room

All Tromsø people need access to Børre's office, to be able to initiate group video conferences, but only Børre and Trond has it.

TODO:

give/upgrade keys to Tomi and Thomas as well (Trond)

11. Summary, task list

Børre

send out contracts with accompanying letter
Gather public texts, preferrably also parallel ones
Continue converting text from input format to our xml
convert nob and nno bible texts to be used as part of a parallel corpus
review the paratext2xml converter
convert smj NT to paratext
Send out letters to the Iđut authors
Add corpus security re G5 syncing as an issue to Bugzilla
write docu for how to apply for a corpus user account (forms, recipients, etc)
remove old corpus files from gt/sme/corp/ after Trond has cleaned it
integrate generated corpus repository summaries in the Forrest site
install and test Gobby, install new version of SEE
make free texts available
- add license info DEADLINE: Tuesday 4.4.)
- zip up the xml files in gtfree/ and put it up on the Divvun download area http://www.divvun.no/downloads/ (DEADLINE: Thursday 6.4.)
- also provide an xml-free version?
- possibly e-mail Finnut about the resource (if Sjur goes on paternal leave)
fix bugs!

Maaren

on sick leave throughout April

Saara

Create a parallel corpora of the new testaments.
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
file a bug report of utf-8 check in xml-validation of the corpus files.
make free texts available
- rerun conversion (DEADLINE: Wednesday 5.4.)
- zip up the xml files in gtfree/ and put it up on the Divvun download area http://www.divvun.no/downloads/ (DEADLINE: Thursday 6.4.)
- also provide an xml-free version?
- set up a weekly cron script (but only if there are new files)
Discuss: allow for more than one file at a time when uploading a file.
Implement links to parallel files in corpus header.
Turn on language recognition, skipping Finnish
add a possibility to upload whole documents for hyphenation (and also analysis, generation, etc)
add a log of every word/text uploaded/hyphenated/analyzed etc.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
Follow up on place names from Norge Digitalt
Evaluate SFST as speller (and analyzer) lexicon
write a background document on the corpus contracts
public tender:
- answer requests/questions
- corpus repo access to free texts (with Børre)
e-mail Finnut about the availability of the free corpus and the download link
conversion of corpus repo summary xml to Forrest xml
call EDD/ Christian Emil Ore about national place name lexicon
risten.no/proper noun lexicon development:
- refactor code
- code design for XQueries needed for dict/term editing
correct hyphenation of word boundaries and exceptions
Set up the mechanism for the hash-mark transducer package
fix bugs!

Thomas

add incoming Lule Sámi words
work on North Sámi compounding and derivation
smj G3 issue
sme G3 issue
add all word boundaries and exception hyphenation marks
SubEthaEdit update

Tomi

move aspell UTF-8 suffix bug to Bugzilla
Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
read aligner docu, install, provide feedback
install and test Gobby, install new version of SEE
Set up the mechanism for the hash-mark transducer package
fix bugs!

Trond

Unify anchor lists, and conflate nno and nob into nor
remove deleted files from the CVS repository (in the Attic)
grammatical searchability in the graphical corpus interface: revise taglist
better smj NT text, get fin and swe NT texts
Prepare a list of good candicates for first inclusion into the corpus.
start and lead discussion and work on semantic features for disamb
Install Gobby
get a key for Maaren in May
install aligner, test it and give feedback
correct hyphenation of word boundaries and exceptions
Set up the mechanism for the hash-mark transducer package
get/upgrade keys for Børre's room for Tomi and Thomas
fix bugs!.

12. Next meeting, closing

18.04.2006 09: 30

Sjur is on paternal leave.

Closed at 12: 47