Meeting_2006-12-04

Meeting setup

Date: 4.12.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09: 48.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact authors who have already received the corpus licensing contract
consider a script for automatic testing of the spell checker in Word
- Began work. Some AppleScript done
meeting Wednesday at 9.30 with Sjur to plan testing
- Done
sma discussions with SD (with Sjur, Trond)
- Not done
add a simple password protection to risten.no in the G5
- Not done
consider infra for testing feedback
- Some plans made during the meeting with Sjur
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- Not done
update all forrest installations, including local patches
- Not done
fix bugs!
- Done some work on bug 348

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator

Saara

finalize server of the Xerox tools.
help Trond with some shell commands
re-analyze parallel files
consider implementing some new features to the corpus files
add closed POSes to the paradigm gen, if needed.
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
  - finally got multiuser editing safe against data corruption - after more than a year!
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
investigate unrecognised word forms in the hyphenator
- nothing done, but this has become much easier with the debug output from Tomi's PLX conversion.
decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
  - had a meeting, but didn't finish, and the second part disappeared in other issues
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
- discussed with Børre, pseudocode in place
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- make sure we get one/more licenses in Alta
publish corpus contracts and project infra on NoDaLi-sta
- not yet
ask SD/Sig-Britt Persson about some of the South Sámi bible texts
check whether lookup can be used to generate paradigms for closed POSes
- done, doesn't look like it
meeting Wednesday at 9.30 with Børre to plan testing
- done, memo available soon (not checked in) in the gt/doc/proof/ dir
fix bugs!
- done some for risten.no, gleaned through others

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
redo the derived words work for smj
- done
decide how to specify compounding behaviour info in the lexicon
- begun
- meeting Tuesday 12.30
  - done
check why some SUB-marked entries got included in the normative transducer
- not done
investigate unrecognised word forms in hyphenator
- not done
test editing of the xml files
- not done
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary)
- not done
write paradigm grammar for the closed POSes
- done
fix bugs!
- just added new bugs

Tomi

make PLX lexicon for all open POSes
- not done - how much is done, what is left?
- there is something fishy in the server, Saara has got it working but I haven't, we have same compilation
add derivations to the PLX generation
- not done
add compound stems to the PLX generation
- added noun
add closed POS and clitics to PLX generation
- not done
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts
- Still waiting
investigate unrecognised word forms in the hyphenator
- Not done
decide how to specify compounding behaviour info in the lexicon
- Not done
meeting Tuesday 12.30
- Meeting kept, but more to discuss.
sma discussions with SD (with Børre, Sjur)
- Not done.
redo the derived words work for smj
- Not done. fix bugs!.

3. Documentation

TODO:

update all forrest installations, including local patches (Børre)
- not yet, will do it in Alta

4. Corpus gathering

Nothing new during last week.

TODO:

get sma Bible / NT texts (Trond)
Discussions with the Sámi Parliament about sma ( Børre, Sjur, Trond)
ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)

5. Corpus infrastructure

More texts to the graphical corpus interface

Trond has talked with Lars, who is writing documentation for the users.

TODO:

Consider whether to implement the <p> only policy or not (Saara)
add text to the server (Lars)
Discuss with Lars (Trond)

Aligner

No more texts yet, Saara has included the aligner in the relevant perl script.

TODO:

gather more parallel texts (Trond, Børre)
re-analyze parallel files using the command-line version (Saara)

6. Infrastructure

Xerox tools wrapped as servers

The server hasn't been working for Tomi, but is now working again. The paradigm generator now only generates 16-17 word forms - far too few. It seems all possessives have disappeared:

sajin   NIR # !SUB
sa^jis  NIR
sa^jiin NIR
sa^ji   NIR
sad^jái NIR
sad^ji  NIR
sad^je  NIRL - this is done by client
sa^je   NIRL
sa^ji   NIRL
sa^je   NIRL
sajiis  NIR # !SUB
sa^jiin NIR
sa^jii^guin     NIR
sa^jiid NIR
sa^jii^de       NIR
sa^jit  NIR
sa^jiid NIRL

sajin   NIR #N+Sg+Loc
sa^jis  NIR #N+Sg+Loc
sa^jiin NIR #N+Sg+Com
sa^ji   NIR #N+Sg+Acc
sad^jái NIR #N+Sg+Ill
sad^ji  NIR
sad^je  NIRL #N+Sg+Nom
sa^je   NIRL #N+Sg+Gen
sa^ji   NIRL #N+Sg+Gen
sa^je   NIRL #N+Sg+Gen
sajiis  NIR #N+Pl+Loc
sa^jiin NIR #N+Pl+Loc
sa^jii^guin     NIR #N+Pl+Com
sa^jiid NIR #N+Pl+Acc
sa^jii^de       NIR #N+Pl+Ill
sa^jit  NIR #N+Pl+Nom
sa^jiid NIRL #N+Pl+Gen

Using fsts:
/opt/smi/sme/bin/isme.fst
/opt/smi/sme/bin/hyph-sme.fst

It should have been using: ifst-norm:inverse-norm.fst. The file is available to the server, cf /opt/smi/sme/bin/:

-rwxrwxr-x  1 root  cvs    2257 jun 21 10:13 allcaps.fst
-rwxrwxr-x  1 root  cvs      92 jun 21 10:16 cap-sme
-rwxrwxr-x  1 root  cvs 6995574 des  4 00:38 hyph-sme.fst
-rwxrwxr-x  1 root  cvs 1206092 des  4 00:38 isme.fst
-rwxrwxr-x  1 root  cvs 3106957 des  4 00:38 isme-norm.fst
-rwxrwxr-x  1 root  cvs  674609 des  4 00:38 sme-dis.rle
-rwxrwxr-x  1 root  cvs 1251450 des  4 00:38 sme.fst

TODO:

remove clitics from the paradigm generator (Saara)
- done
investigate why possessives have disappeared from the paradigm generator (Number, also a facultative (?) category, has not disappeared) ( Saara, Tomi)
make sure the normative generator is used when generating paradigms (Tomi)

Hyphenator

TODO:

investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- possibly the main cause was found in the meeting (non-normative forms included in the generated output - these are not recognised by the hyphenator, which is strictly normative)

7. Linguistics

Names and multilinguality

TODO:

finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils ( linguists)

Derivation and spellers like Aspell

TODO:

redo the work for smj, including discussion regarding Actio ( Thomas, Sjur, Trond)
- done by Thomas

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun -JUVVON
accompagnerejun accompagneret+V+TV+Der1+Der/j+Der/Pass+PrfPrc
ábuhuvvože -ETNE
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne -ETNE
áccohallagođežedne      áccohallat+V+TV+Der3+Der/goahti+Pot+Prs+Du1

In sme-lex.txt:
 +Der2+Der/Pass:uvvo DOHPPEINCH ;
 +Der/Pass+PrfPrc:un K ;  !SUB
 +Du1:e K ;  !SUB
 +Du1:edne K ;   !SUB
 +Du1:etne K ;

These are generated by make wordlist TARGET=sme, which uses nonrec-sme.fst ( print lower).

The last version of the wordlist does not include the errouneous words anymore. They seem to have disappeared as part of other changes.

TODO:

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio ( Maaren, Thomas)
check why some SUB-marked entries got included in the normative transducer ( Thomas, Sjur)
- done in the meeting

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
hire new linguist (Sjur)

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in the meeting memo.

TODO:

develop the needed XQueries and UI (Sjur, Tomi)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Polderland data generation

All other open POSes now included in the paradigm generator, below is a verb example:

galggaže        VIR #V+Pot+Prs+Du1
galggažedne     VIR #V+Pot+Prs+Du1
galg^ga^žet^ne  VIR #V+Pot+Prs+Du1
galg^ga^žeh^pet VIR #V+Pot+Prs+Pl2
galg^gaš        VIR #V+Pot+Prs+ConNeg
galg^ga^žat     VIR #V+Pot+Prs+Pl1
galg^ga^žit     VIR #V+Pot+Prs+Pl1
galg^ga^žan     VIR #V+Pot+Prs+Sg1
galg^ga^žea^ba  VIR #V+Pot+Prs+Du3
galg^ga^žat     VIR #V+Pot+Prs+Sg2
galg^ga^žeahp^pi        VIR #V+Pot+Prs+Du2
galg^ga^žit     VIR #V+Pot+Prs+Pl3
galg^ga^ža      VIR #V+Pot+Prs+Sg3
galg^ga^leim^me VIR #V+Cond+Prs+Du1
galg^ga^šeim^me VIR #V+Cond+Prs+Du1
galg^ga^leid^det        VIR #V+Cond+Prs+Pl2
galg^ga^šeid^det        VIR #V+Cond+Prs+Pl2
galg^ga^le      VIR #V+Cond+Prs+ConNeg
galg^ga^še      VIR #V+Cond+Prs+ConNeg
galg^ga^leim^met        VIR #V+Cond+Prs+Pl1
galg^ga^šeim^met        VIR #V+Cond+Prs+Pl1
...

TODO:

write paradigm grammar for the closed POSes (Thomas)
- done
check whether lookup can be used to generate paradigms for closed POSes ( Sjur)
- couldn't find anything
decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
- meeting Tuesday at 12.30
  - done, needs to be followed up
make inflection PLX lexicon for all open POSes (Tomi)
- done
add closed POS and clitics to PLX generation (Tomi)
add derivations to the PLX generation (Tomi)
add compound stems to the PLX generation (Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

When the PLX-based speller is ready: use the generated word list as test input: all should be accepted (coverage self-testing). Pick random 1% and randomly change them with edit distance 1, run through speller = testing false positives

We need a meeting to plan testing. We'll do it shortly this week, and perhaps a longer meeting in Alta.

TODO:

meeting Wednesday at 9.30 (Sjur, Børre)
- done
consider a script (shell+AppleScript?) for automatic testing of MS Word ( Sjur, Børre)
- pseudocode written, some AppleScript
get an Intel Mac for testing Windows spellers; get a WinXP license from SD ( Børre, Sjur)

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- not yet

Bug fixing

56 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

update Maaren's Forrest installation (Børre)

11. Next meeting, closing

The next meeting is 11.12.2006, 09: 30 Norwegian time.

The meeting was closed at 11: 22.

Appendix - task lists for the next week

Boerre

contact authors who have already received the corpus licensing contract
continue work on script for automatic testing of the spell checker in Word
sma discussions with SD (with Sjur, Trond)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
update all forrest installations, including local patches
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

finalize server of the Xerox tools.
help Trond with some shell commands
re-analyze parallel files
consider implementing some new features to the corpus files
add closed POSes to the paradigm gen, if needed.
investigate why possessives have disappeared from the paradigm generator
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Trond)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
publish corpus contracts and project infra on NoDaLi-sta
ask SD/Sig-Britt Persson about some of the South Sámi bible texts
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
decide how to specify compounding behaviour info in the lexicon
fix bugs!

Tomi

add closed POS and clitics to PLX generation
add derivations to the PLX generation
add compound stems to the PLX generation
make sure the normative generator is used when generating paradigms
investigate why possessives have disappeared from the paradigm generator
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Sjur) fix bugs!.