Meeting_2006-05-15

Contents:

Meeting setup
Agenda
1. Opening, agenda review, participants
2. Reviewing the task list from the last meeting
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
3. Documentation
4. Corpus gathering
5. Corpus infrastructure
6. Infrastructure
7. Linguistics
8. Name lexicon infrastructure
9. Spellers
10. Public tender
11. Other
12. Summary, task list
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
13. Next meeting, closing

Meeting setup

Date: 15.05.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09: 54.

Present: Børre, Saara, Sjur, Thomas, Trond, Tomi

Absent: Maaren

Main secretary: Tomi/Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

send out contracts with accompanying letter
- Some done
Gather public texts, preferrably also parallel ones
- Not done
Continue converting text from input format to our xml
- Done
convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
review the paratext2xml converter
- Not done
convert smj NT to paratext
- Not done
Send out letters to the Iđut authors
- Some done
write docu for how to apply for a corpus user account (forms, recipients, etc)
remove old corpus files from gt/sme/corp/ after Trond has cleaned it
mirror Odin URL (create cron task to do it automatically?)
- Not done
read & evaluate received offers
- Done
telephone meeting Wednesday with Finnut
- What happened with this? Sjur had a short meeting alone, as we are not ready for the concluding meeting yet.
Check the status & license of the corpus texts
- Not done
fix bugs!

Maaren

Working on missing list
- Done.

Saara

Create a parallel corpora of the new testaments.
grammatical searchability in the graphical corpus interface
move to bugzilla:
- set up a weekly cron script to make free texts available
- add a possibility to upload whole documents for hyphenation (and also analysis, generation, etc)
- add a log of every word/text uploaded/hyphenated/analyzed etc.
  - done
Implement links to parallel files in corpus header.
- not done
Implement parallel corpus upload in web upload script
- not done
Implement turning off the language recognition in the xsl-file (and corpus.dtd).
- not done
Refine language recognition
- not done
Investigate the decomposed Unicode characters in file names -problem.
- solved
Correct decomposed Unicode in Min Áigi file names
- done
Check the status & license of the corpus texts
- not finished yet, some discussion: bug 284
Rerun corpus conversion
- done extensively
fix bugs!

Sjur

read & evaluate received offers
- done, we need more info from the tenderers - questions sent, with a deadline on May 17.
telephone meeting Wednesday with Finnut
- had a short meeting (without Børre), clarifying what's left regarding technical issues. Real meeting postponed till we have all the needed answers, and can complete the technical evaluation
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- nothing
name lexicon:
- implement editing functions
  - started
- write down the most common editing scenarios, to guide development
  - done, quite helpful

Thomas

correct hyphenation of exceptions
- not done
correct hyphenation of smj -st-
- not done
work on compounding and derivation
- worked
smj G3 issue
- not done
sme G3 issue
- not done

Tomi

move to Bugzilla:
- aspell UTF-8 suffix bug
  - not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
  - not done
new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
  - not done
- XQuery refactoring and code development for our proper noun editor
  - not done
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  - not done
- write down the most common editing scenarios (to be used as guides for making the editing interface) (adding / changing )
  - done
read aligner docu, install, provide feedback
- not done
install and test Gobby, install new version of SEE
- not done
Set up the mechanism for the hash-mark transducer package
- not done
read & evaluate received offers
- done
fix bugs!

Trond

better smj NT text, get fin NT texts
- Not done.
get a key for Maaren in May
- Well, done, sort of.
install aligner, test it and give feedback
- Not done.
correct hyphenation of word boundaries and exceptions with Thomas
- Not done (but Thomas has worked on this).
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- Not done.
get/upgrade keys for Børre's room for Tomi and Thomas
- Not done.
fix bugs!.
- Done some, but suggesting a collective effort wrt. the bugs.
Rethink the doubletagging procedure for names, consider grammatically motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org)
- Not done.
write down the most common editing scenarios (to be used as guides for making the editing interface) (adding / changing )
- Done, wasn't it?
read & evaluate received offers
- Work in progress.
Check the status & license of the corpus texts
- Must check whether all is done.

3. Documentation

TODO:

documentation on how to apply for a user account for the corpus repo ( Børre)
- The item will be moved to Bugzilla

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

New contacts:

Elle Márjá Vars (sme, 38 titles in Bibsys, novels, plays, both iđut and Davvi Girji, one translated into smj and sma, several into nob)

Odin

Børre will set up a weekly cron job to mirror the URL mentioned in Sæths e-mail, and add missing files to the corpus.

Olavi Korhonen's Lule Sámi dictionary.

No news this week

TODO: Børre to contact Olavi Korhonen and Kuhmunen

KIO Grafisk and the Iđut books

TODO:

send letters to the authors (Børre)
- one contacted, the rest this week

Bible texts

We will get text from Finland, but still haven't received any. We have got the Swedish text from Sweden. As for the last html versions from Norway, Trond has not contacted them last week.

Swedish html has arrived, no paratext. Norsk bibelselskap has not sent corrected New Testament versions for sme, and not paratext for nno/nob.

TODO:

convert smj NT to paratext (Børre)
get fin, swe, nob and nno NT and OT in paratext format. (Trond)

Davvi Girji

A talk with Brita Kåven, revealed that they would have a look at the contracts after easter. Has been away, Børre will call her again today.

Min Áigi

Agreement: they will send us updates each month. Standard license.

We have problems with Unicode characters in filenames. This was solved once before, and we need to look at this again. The old Bugzilla issue should be reopened. The bug was reopened: http: //giellatekno.uit.no/bugzilla/show_bug.cgi?id=76

The Min Áigi format should be dealt with: \@ingress etc should be dealt with for the .txt, but business as usal for the .doc files.

Done so far: Trond and Saara? Trond made a list of tags, and Saara will make the xsl conversion routine for the typographic tags.

Saara is implementing the tag format.

TODO:

reopen Bugzilla issue, and study the previous discussion and solution regarding Unicode characters in filenames (Saara)
- done
add filename extensions to files not having one
- done
Space in file names should be changed to underscore (and not to hyphen!).
- done
send bug report to Apple (typing filenames in Terminal does not match, moving filenames across plattforms via tarballs creates problems - the names are not the same!) (Sjur)
make xsl conversion routine for the typographic tags (Saara)

Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

TODO: Børre will contact them.

Sámi Instituhtta

Børre contacted Richard Valkeapää, the IT-consult at NSI. He put it on his todo list, as he would have to contact the person who has worked with the newspaper texts anyway. He said this would be done in the near future (within a month).

5. Corpus infrastructure

https: //giellalt.uit.no/lang/corp/corpus-summary.html

TODO:

Changes and updates because of the Divvun public tender

User account admin and infra: see previous memo.

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see previous memo.

TODO:

make free texts available
- done weekly by a cron script (but only if there are new files) (Saara)
  - move to bugzilla: done
Make a link, easily available, to these texts.
- done as a downloadable tar package.

Name change again? Trond has come up with some new suggestion:

gt -> gtbound/
gtbound -> bound/
gtfree -> free/

NB! keep symlinks from the old names to the new names for now.

Free and non-free texts

More info in a previous meeting memo.

TODO:

Check the status of the texts, again. (Børre, Trond, Saara)
Rerun the conversion afterwards (Saara is the one with the magic spell)
check against bugs 273 and 272
- these are now fixed.
update the script(s) to only copy texts to the free/ dir which are explicitly marked as free (Saara)

More texts to the graphical corpus interface:

TODO:

We would like to have more than the NT in the graphical interface ( Saara)
- We add the largest texts first.
We would like to have grammatical searchability, not only POS. (Saara, Trond This presupposes a discussion with Oslo. (Trond and Saara to continue this discussion)
For Lule Sámi: We would like to have a parallel corpus interface with NT (text only). This presupposes better quality texts (Trond, Børre)
- Better Lule NT text still not made.
The list of good candicates: The longest (admin) texts.
- We need a new option in ccat for analysing text while still keeping the xml tags and structure (Tomi)
- xml texts number <p>, preprocess finds <.> <?> <!> and ccat numbers them as <s>.
- Then the aligner aligns...

Top-two priorities:

Trond to discuss more with Lars on tag unification, and unify them.
Tomi to change ccat to be able to create the right input for the corpus analysis.
Lars to add text to the server.

Language recognition

TODO:

refine language recognition (Saara)
- in progress, continue discussion in 249
create a short word list to help the trigram heuristics
- Trond has made such lists for all lgs except smj.
send Saara smj files (Trond)
Add some flag to write into the xsl file (Saara):
- method: do not run lg recognition
- method: Choose between these 2: nob, sme, etc.

6. Infrastructure

Paradigm generation

Greenland's language secretariat has a paradigm generator based upon Xerox tools, we have asked for their source code, and will get it (site in Greeenlandic, English and Danish).

Try e.g. illu "house", aput "snow" (or any word in st/kal/src/noun-kal-lex.txt)

TODO:

Put Saara and Tero in contact with each other (Trond)
The paradigm generator should also have an xml-out option (for use in e.g. electronic dictionaries) - on hold till we know more about the generator

Aligner

TODO:

Read documentation and try out, give feedback to Bergen. (Trond, Saara)
Saara to install the aligner, everyone to read the documentation
- done, waiting for the test files from Bergen.
Trond and Saara will continue this issue.

Hyphenator

Trond and Thomas have been updating the propernoun file with ^ tags. We need the tag in front of compound parts beginning in a vowel or in two or more consonants. Compound parts beginning with one consonant are handled correctly.

TODO:

correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
- Still not done.
add a possibility to upload whole documents for hyphenation (and also analysis, generation, etc) (Saara)
- moved to Bugzilla
we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
  - moved to Bugzilla

7. Linguistics

General - hyphenation

See discussion, open questions and decission in the previous meeting memo.

We did a sme overkill: V^CV, when the rulebased hyphenator gave the right result. For sme: Find the default, only tag the exceptions. (issue: VCsCV, VsCV)

TODO:

Set up the mechanism for the hash transducer package - fst gymnastics, see above.
add exceptions marks to the smj lexicon (boks^távva)
- Status? We discussed with Kåre Tjihkkom, and the principles are clear, even behaviour around -st-, but it is not implemented in the lexicon

North Sámi

There are some heavy bugs (11 sme bugs all in all):

he(a)ddjiid is one
compounding cleanup - no shortening when normative, still shortening when analysing corpus texts?
vowel-shortening when compounding (we need the input from Pekka!)

We should have some linguistic workshops while Maaren is here.

TODO:

set up a linguistic workshop while Maaren is in Tromsø (and remember that she is on sick leave all the time) (Trond, Thomas)

Lule Sámi

TODO:

50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. These become more visible as we get real texts.
- names
- compounds? Shortening here as well, but not in written language (some lexicalised exceptions, as in sme)
- loanwords?

8. Name lexicon infrastructure

TODO:

refactor and prepare risten.no for multiple collections:
1. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
  1. done for editors, some open issues for regular searches
write down the most common editing scenarios (to be used as guides for making the editing interface) (adding / changing ) (Trond, Tomi)
1. Done.
develop the needed XQueries and interface (Sjur, Tomi)
1. developing
data synchronisation between risten.no and the cvs repo (Tomi)
1. nothing this week
test and review when ready
Rethink the doubletagging procedure for names, consider grammatically motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org, or the Lyndi England issue) (Trond)

9. Spellers

Nothing until the new proper noun lexicon is in place. We don't have enough people to do both. Here's our most important targets for spellers in the near future:

aspell
hunspell

10. Public tender

TODO:

read the offers (Børre, Sjur, Trond, Tomi)
- done
meet on Tuesday 13 to sum up the findings (Børre, Sjur, Tomi, Trond)
- done, need more info - questions sent
telephone meeting next Wednesday with Finnut (Børre, Sjur)
- postponed till we have more info.
new meeting on Thursday 18.5. at 10.00 AM (Børre, Sjur, Tomi Trond)

11. Other

Summer vacation

Who	When
Børre	?
Linda	?
Maaren	?
Saara	July
Sjur	?
Thomas	3.7 - 7.8
Trond	July
Tomi	8.7 - 16.7, more?

Bug fixing

45 open Divvun/Disamb bugs, and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help... Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising. ... and then a new one after the name lexicon is fixed.

Move to victorio

Trond's victorio problem:

sme:
FirstTag...1, ProperNoun...10000...
*** Warning:  Ignoring info strings.
20000...30000...flex scanner jammed
make: *** [sme/bin/sme.save] Error 2

smj:
Reading from 'smj/src/adv-smj-lex.txt'
adv...1, Adverb...1064

Reading from 'smj/src/noun-smj-lex.txt'
NounRoot...flex scanner jammed
make: *** [smj/bin/smj.save] Error 2

sma, kal compile without jamming.cd

Saara compiles sme without problems, has problems with smj. Conclusion: it is a source code problem. Tomi and Børre are compiling smj just fine.

Gobby

0.3 is working fine on Mac, Linux and Windows. Should be installed on all computers c.f. http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide (our preinstalled Xcode veriosn is 2.0, must be 2.1):

Børre - todo
Maaren - Børre to do it
Saara - todo
Sjur - ok
Thomas - Børre to do it
Tomi - todo
Trond - ok

Trond should ask Lars Nygård, Per Langgård and Tero Avellan to install Gobby as well.

12. Summary, task list

Børre

corpus work:
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the rest of the Iđut authors
- call Brita Kåven
- contact Kåfjord
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Check the status & license of the corpus texts
- contact Korhonen & Kuhmunen
public tender:
- continue public tender offer evaluation
- meeting on Thursday 18.5. at 10.00 AM with Sjur, Tomi, Trond
- telephone meeting with Finnut
install latest SEE
install Gobby using Darwin Ports (also for Thomas and Maaren)
move to Bugzilla:
- write docu for how to apply for a corpus user account
fix bugs!

Maaren

On sick leave

Saara

Create a parallel corpora of the new testaments.
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
Implement links to parallel files in corpus header.
Implement parallel corpus upload in web upload script
Check the status & license of the corpus texts and change the default of the availability to "license" instead of "free"
Rerun corpus conversion
Install Gobby
make xsl conversion routine for the typographic tags
update the corpus script(s) to only copy texts to the free/ dir which are explicitly marked as free
Add some language recognition flags to write into the xsl file
fix bugs!

Sjur

public tender:
- read & evaluate received offers
- meeting on Thursday 18.5. at 10.00 AM with Børre, Tomi, Trond
- telephone meeting Friday with Finnut
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
name lexicon:
- implement editing functions
- write down the most common editing scenarios, to guide development
- finalise refactoring for multiple collections (regular search interface)
update corpus-summary processing to adhere to the new structure
send bug report to Apple re filename matching and accented characters in Terminal
fix bugs!

Thomas

correct hyphenation of exceptions
correct hyphenation of smj -st-
work on compounding and derivation
smj G3 issue
sme G3 issue
set up a linguistic workshop while Maaren is in Tromsø

Tomi

move to Bugzilla:
- aspell UTF-8 suffix bug
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
read aligner docu, install, provide feedback
install and test Gobby
Set up the mechanism for the hash-mark transducer package
meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Trond
add ccat option to analyse text while keeping the xml tags and structure
fix bugs!

Trond

better smj NT text
get fin, swe, nob and nno NT and OT in paratext format
install aligner, test it and give feedback
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
get/upgrade keys for Børre's room for Tomi and Thomas
Rethink the doubletagging procedure for names, consider grammatically motivated semtag conversion routines ("Helsinki" from Plc to Obj to Org)
Check the status & license of the corpus texts
Work on the graphical corpus tag list
send Saara smj files for language recognition
Put Saara and Tero in contact with each other
meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Tomi
ask Lars Nygård, Per Langgård and Tero Avellan to install Gobby 0.3
set up a linguistic workshop while Maaren is in Tromsø
fix bugs!.

13. Next meeting, closing

22.05.2006 09: 30

Closed at 11: 30