Meeting_2007-01-15

Meeting setup

Date: 15.01.2007
Time: 09:00 Norwegian time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
Name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 9:44.

Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact authors who have already received the corpus licensing contract
- not done
continue work on script for automatic testing of the spell checker in Word
- not done
fix sme texts in corpus this month
- not done
find missing nob parallel texts in corpus
- not done
translate Windows installer to sme
- some done, helped Thomas
work on the Polderland data generation (PLX format conversion)
- some work done, not finished
go through other directories, fix parallelity information for other documents
- not done
fix bugs!
- not done

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
- not done

Saara

fix sme texts in corpus this month
- in progress
send aligned, xml nob texts to Lars
add correction markup to the xml files (string-to-correction markup)
- done, but see newsgroup message
first new version of xml2lexc in Perl
- done
fix bugs!
- fixed a couple of bugs

Sjur

name lexicon:
- rewrite the integration with forrest to get a more flexible integration with proper i18n, solve some problems with the previous solution, and lay a foundation for better search and editing interfaces.
  - search interface finished, editor half-way; still needs some javascript and css tweaks to be really well-behaved, but can b
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
- finally done!
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix stuorra-oslolaš lower case o
fix bugs!

Steinar

conversion error screening
- not done
missing lists
- done some work
report conversion errors to Saara
- not done
Go through the Num bugs
- not done
Look at the actio compound issue when adding from missing lists
- added words
fix bugs!
- not done
worked with cg-sets
- done some

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
decide how to specify compounding behaviour info in the lexicon
- decided
translate Windows installer to sme and smj
- ready soon
Actio compounds: The disamb crew is satisfied. Now it is up to the divvun folks to see whether it is too hard to lexicalise
- nothing this week
Lack of lowering before hyphen: Twol rewrite.
- nothing this week
include numbers in the non-recursive transducers
- not done
Go through the Num bugs
- not done
Write diphthong hyphenation pseudocode
- done
fix stuorra-oslolaš lower case o
- not done
fix bugs!
- worked

Tomi

add closed POS and clitics to PLX generation
- done with help from Børre
add derivations to the PLX generation
- not done
add compound stems to the PLX generation
- not done
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- No smj last week.
decide how to specify compounding behaviour info in the lexicon
- Decided
Set up work on missing and conversion screening with Steinar and Ilona.
- Done.
fix sme texts in corpus this month
- Continuously working on this one.
find missing nob parallel texts in corpus, go through Saara's list
report conversion errors to Saara
- Saara has been leading this work...
Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Not done
Go through the Num bugs
- Not done
Make numeral testbed
- Not done.
Rewrite hyphenation-code (pseudocode from Thomas) sme, smj
- Done
Get input on sma hyphenations
- Not done.
fix stuorra-oslolaš lower case o
- This one I would like to pass over to Tomi.
include numbers in the non-recursive transducers for sme, smj
- Started work on this one. Split the closed-smX-lex.txt file with Børre.
fix bugs!

3. Documentation

Nothing this week.

4. Corpus gathering

Trond finally got the sma texts from Snåsa, quite a lot of text, but not all. Børre will add it to the corpus repository.

The relevant persons have worked on the tasks below.

TODO:

sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
missing nob parallel texts should be added if such holes are found (Børre, Trond)
Go through the list of missing or erroneous nob texts, based upon Saara's perfect list (Børre, Trond)
add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Lars Nygård has left UiO. Anders Nøklestad is back in his old position. For us, this means that Anders will be the person to contact for technical matters, and Kristin Hagen the one for parsing of the nob parallel texts.

Alignment

TODO:

go through other directories, fix parallelity information for other documents (Børre)
- Still to be done.
re-analyze parallel files using the command-line version (Saara)
- done all existing files
when aligned, send aligned, xml nob texts to Kristin (Saara)
- not yet done

Conversion issues

TODO:

add correction markup to the xml files (string-to-correction markup) (Saara)
- see news discussion - we will and should allow text corrections concerning character encoding problems.
report conversion errors to Saara (Trond, Steinar)
- Not done.

6. Infrastructure

Nothing this week.

7. Linguistics

North Sámi

TODO:

lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Thomas, Maaren, Steinar)
Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)

Numbers:

One problem is correctly identifying the base forms of numerals. In the following analysis, the base form of 16 is given as 6:

guhttanuppelohkái
guhttanuppelohkái       guhtta+Num+Sg+Nom
guhttanuppelohkái       guhtta+Num+Sg+Acc
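
A check of this kind could be part of the numeral test bed. The following is only a minimal sketch, not the project's actual test code: the function name is hypothetical, the input is assumed to be the tab/space-separated analyzer output shown above, and it relies on the assumption that for a +Num+Sg+Nom reading the base form should be identical to the word form.

```python
def suspicious_base_forms(analyzer_output: str) -> list[tuple[str, str]]:
    """Return (wordform, lemma) pairs whose Num+Sg+Nom reading has a
    lemma that differs from the word form (assumed to be an error)."""
    suspicious = []
    for line in analyzer_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip the bare input line and blank lines
        wordform, analysis = parts
        # Everything before the first "+" is the lemma, the rest are tags.
        lemma, _, tags = analysis.partition("+")
        if tags == "Num+Sg+Nom" and lemma != wordform:
            suspicious.append((wordform, lemma))
    return suspicious


sample = """guhttanuppelohkái
guhttanuppelohkái       guhtta+Num+Sg+Nom
guhttanuppelohkái       guhtta+Num+Sg+Acc
"""
print(suspicious_base_forms(sample))
```

Only the nominative singular reading is checked, since other cases are expected to differ from the base form anyway.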

TODO:

discontinuous case inflection (but only for maximally three-part compound numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain) (Thomas, Trond)
produce correct base forms in the analyzer (Thomas, Trond)
include numbers in the non-recursive transducers (i.e. split the recursive and the non-recursive part of the numerals) (Trond, Thomas)
Set up test bed for numerals, test and revise (who?)
Make a test bed make num-paradigm (Trond)
Go through the Num bugs (Trond, Thomas, Steinar)
Preprocessing of ordinals at the end of sentences - reported as bug #368. (Trond)

Hyphenation problem

TODO:

write diphthong hyphenation pseudocode (Thomas)
- done for both sme and smj
rewrite hyphenation code (Trond)
- done for both sme and smj
ask Ove Lorentz to report on our sma hyphenator (Trond)
- Not done.

Lule Sámi

It could actually be that the smj numerals are not recursive. They were made differently from the sme ones, since Spiik reported them as written separately.

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
include numbers in the non-recursive transducers
Set up a test bed for numerals, test and revise (who?)

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in the meeting memo.

Postponed:

data synchronisation between risten.no and the cvs repo

TODO:

try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
- done
restructure interface code for easier maintenance, coding and use
- well under way, still some work
finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

There is now a decision on compound parts, and compounding can now be included in the PLX generation. Compounding is a sine qua non (a must) for the beta version. The specification is found in this document.

We have a UTF-8 problem with the paradigm server: in some cases, some characters are returned as Latin1. When the server runs on the G5, everything works fine, but when it runs on victorio, some conversion errors turn up. According to some net sources, the problem may be Java-related; it may also be related to the perl settings on victorio, following the change in the perl setup.

Suggestion: Just use the G5, and not victorio, since there is no time to fix the setup in victorio (the real error).
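
Until the victorio setup is fixed, a client could at least detect which replies arrive in the wrong encoding. The following is a minimal sketch under stated assumptions: the function name is hypothetical, nothing is known here about the actual paradigm server interface, and it simply assumes that a reply which is not valid UTF-8 was sent as Latin1.

```python
def decode_reply(raw: bytes) -> tuple[str, str]:
    """Decode a server reply and report which encoding was used.

    Strict UTF-8 decoding is tried first; a reply that fails is
    assumed (not verified) to be Latin1.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1"), "latin-1"


good = "lohkái".encode("utf-8")
bad = "lohkái".encode("latin-1")
print(decode_reply(good))
print(decode_reply(bad))
```

Logging the second element of the result for each request would show how often the Latin1 fallback is hit. Replies containing only ASCII characters decode as UTF-8 either way, so they cannot be used to tell the two setups apart.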

TODO:

decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
- Done!
add closed POS and clitics to PLX generation (Børre, Tomi)
- Progressing.
add compound stems to the PLX generation (Børre, Tomi)
add derivations to the PLX generation (Børre, Tomi)
Include numerals in the speller (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for testing Windows spellers (Børre, Sjur)
- nothing yet

Localisation

TODO:

translate Windows installer text to sme and smj (Børre, Thomas)
- progressing (smj is mostly done, lots lacking in sme)

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- not done

Bug fixing

56 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 22.1.2007, 09:30 Norwegian time.

The meeting was closed at 10:44.

Appendix - task lists for the next week

Børre

continue work on script for automatic testing of the spell checker in Word
fix sme texts in corpus this month
find missing nob parallel texts in corpus
translate Windows installer text to sme and smj
work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
go through other directories, fix parallelity information for other documents
add sma texts to the corpus repository
fix bugs!

Maaren

tasks according to Thomas

Saara

fix sme texts in corpus this month
send aligned, xml nob texts to Kristin
fix problems with xml2lexc if needed
fix bugs!

Sjur

name lexicon:
- restructure interface code for easier maintenance, coding and use
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix stuorra-oslolaš lower case o
fix bugs!

Steinar

Complete the semantic sets in sme-dis.rle
missing lists
report conversion errors to Saara
Look at the actio compound issue when adding from missing lists
Go through the Num bugs
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
work with compounding
translate Windows installer to sme and smj
lexicalise actio compounds
Lack of lowering before hyphen: Twol rewrite.
Go through the Num bugs
fix stuorra-oslolaš lower case o
include basic numbers in the non-recursive transducers
implement discontinuous case inflection for numbers
produce correct base forms in the analyzer
fix bugs!

Tomi

add compound stems to the PLX generation
add closed POS and clitics to PLX generation
add derivations to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt (not this week)
fix sme texts in corpus this month
find missing nob parallel texts in corpus, go through Saara's list
report conversion errors to Saara
Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
Go through the Num bugs
Make numeral testbed
Get input on sma hyphenations
include numbers in the non-recursive transducers for sme, smj
implement discontinuous case inflection for numbers
produce correct base forms in the analyzer
fix bugs!