Meeting_2007-01-15

Meeting setup

  • Date: 15.01.2007
  • Time: 09.00 Norw. time
  • Place: Where we are
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 9: 44.

Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • contact authors who have already received the corpus licensing contract
    • not done
  • continue work on script for automatic testing of the spell checker in Word
    • not done
  • fix sme texts in corpus this month
    • not done
  • find missing nob parallel texts in corpus
    • not done
  • translate Windows installer to sme
    • some done, helped Thomas
  • work on the Polderland data generation (PLX format conversion)
    • done, not finished.
  • go through other directories, fix parallelity information for other documents
    • not done
  • fix bugs!
    • not done

Maaren

  • investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
    • not done

Saara

  • fix sme texts in corpus this month
    • in progress
  • send aligned, xml nob texts to Lars
  • add correction markup to the xml files (string-to-correction markup)
    • done, but see newsgroup message
  • first new version of xml2lexc in Perl
    • done
  • fix bugs!
    • fixed couple of bugs

Sjur

  • name lexicon:
    • rewrite the integration with forrest, to get a more flexible integration with proper i18n, solving some problems with the previous solution, and make a foundation for better search and editing interfaces.
      • search interface finished, editor half-way; still needs some javascript and css tweaks to be really well-behaved, but can b
    • refactor SD-terms editor code
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
  • hire linguist and programmer
  • decide how to specify compounding behaviour info in the lexicon
    • finally done!
  • get an Intel Mac for testing Windows spellers
  • publish corpus contracts and project infra on NoDaLi-sta
  • fix stuorra-oslolaš lower case o
  • fix bugs!

Steinar

  • conversion error screening
    • not done
  • missing lists
    • done some work
  • report conversion errors to Saara
    • not done
  • Go through the Num bugs
    • not done
  • Look at the actio compound issue when adding from missing lists
    • added words
  • fix bugs!
    • not done
  • worked with cg-sets
    • done some

Thomas

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
    • nothing this week
  • decide how to specify compounding behaviour info in the lexicon
    • decided
  • translate Windows installer to sme and smj
    • ready soon
  • Actio compounds: The disamb crew is satisfied. Now it is up to the divvun folks to see whether it is too hard to lexicalise
    • nothing this week
  • Lack of lowering before hyphen: Twol rewrite.
    • nothing this week
  • include numbers in the non-recursive transducers
    • not done
  • Go through the Num bugs
    • not done
  • Write diphthong hyphenation pseudocode
    • done
  • fix stuorra-oslolaš lower case o
    • not done
  • fix bugs!
    • worked

Tomi

  • add closed POS and clitics to PLX generation
    • done with help from Børre
  • add derivations to the PLX generation
    • not done
  • add compound stems to the PLX generation
    • not done
  • fix bugs!

Trond

  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
    • No smj last week.
  • decide how to specify compounding behaviour info in the lexicon
    • Decided
  • Set up work on missing and conversion screening with Steinar and Ilona.
    • Done.
  • fix sme texts in corpus this month
    • Continuously working on this one.
  • find missing nob parallel texts in corpus, go through Saara's list
  • report conversion errors to Saara
    • Saara has been leading this work...
  • Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
    • Not done
  • Go through the Num bugs
    • Not done
  • Make numeral testbed
    • Not done.
  • Rewrite hyphenation-code (pseudocode from Thomas) sme, smj
    • Done
  • Get input on sma hyphenations
    • Not done.
  • fix stuorra-oslolaš lower case o
    • This one I would like to pass over to Tomi.
  • include numbers in the non-recursive transducers for sme, smj
    • Started work on this one. Split the closed-smX-lex.txt file with Børre.
  • fix bugs!.

3. Documentation

Nothing this week.

4. Corpus gathering

Trond finally got the sma texts from Snåsa, quite a lot of text, but not all. Børre will add it to the corpus repository.

The relevant persons have worked on the tasks below.

TODO:

  • sme texts: no new additions, fix corpus errors during this month ( Børre, Trond, Saara)
  • missing nob parallel texts should be added if such wholes are found ( Børre, Trond)
  • Go through the list of missing or errouneous nob texts, based upon Saaras perfect list (Børre, Trond)
  • add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Lars Nygård has left UiO. Anders Nøklestad is back in his old position. For us, this means that Anders will be the person to contact for technical matters, and Kristin Hagen the one for parsing of the nob parallel texts.

Alignment

TODO:

  • go through other directories, fix parallelity information for other documents ( Børre)
    • Still to be done.
  • re-analyze parallel files using the command-line version (Saara)
    • done all existing files
  • when aligned, send aligned, xml nob texts to Kristin ( Saara)
    • not yet done

Conversion issues

TODO:

  • add correction markup to the xml files (string-to-correction markup) ( Saara)
    • see news discussion - we will and should allow text corrections concerning character encoding problems.
  • report conversion errors to Saara ( Trond, Steinar)
    • Not done.

6. Infrastructure

Nothing this week.

7. Linguistics

North Sámi

TODO:

  • lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji ( Thomas, Maaren, Steinar)
  • Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
  • fix stuorra-oslolaš lower case o ( Sjur, Thomas, Trond)

Numbers:

One problem we have is to correctly identify base forms of numerals, cf: (the baseform of 16 is given as 6)

guhttanuppelohkái
guhttanuppelohkái       guhtta+Num+Sg+Nom
guhttanuppelohkái       guhtta+Num+Sg+Acc

TODO:

  • discontinous case inflection (but only for maximally three-part compound numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain) ( Thomas, Trond)
  • produce correct base forms in the analyzer (Thomas, Trond)
  • include numbers in the non-recursive transducers (i.e. split the recursive and the non-recursive part of the numerals) (Trond, Thomas)
  • Set up test bed for numerals, test and revise (who?)
  • Make a test bed make num-paradigm ( Trond)
  • Go through the Num bugs (Trond, Thomas, Steinar)
  • Preprocessing of ordinals at the end of sentences - reported as bug #368. ( Trond)

Hyphenation problem

TODO:

  • write diphthong hyphenation pseudocode (Thomas)
    • done for both sme and smj
  • rewrite hyphenation code (Trond)
    • done for both sme and smj
  • ask Ove Lorentz to report on our sma hyphenator (Trond)
    • Not done.

Lule Sámi

It could actually be that the smj numerals are not recursive. They were made differently from the sme ones, since Spiik reported them as written sepa- rately.

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
  • Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
  • include numbers in the non-recursive transducers
  • Set up a test bed for numerals, test and revise (who?)

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in the meeting memo.

Postponed:

  • data synchronisation between risten.no and the cvs repo

TODO:

  1. try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
    1. done
  2. restructure interface code for easier maintenance, coding and use
    1. well under way, still some work
  3. finish first version of the editing (Sjur)
  4. test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
  5. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  6. convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils ( linguists)

9. Spellers

Polderland data generation

There is now a decision on compound parts, and compounding can now be included in the PLX generation. Compounding is a sine qua non (a must) for the beta version. The specification is found in this document.

We have a UTF-8 problem with the paradigm server in some cases, some characters are returned as Latin1. When the server runs on G5, everything works fine. But when it is run on victorio, some conversion errors turn up. The problem may be Java-related, according to some net sources, and also with the perl settings in victorio, related to the change in perl setup.

Suggestion: Just use the G5, and not victorio, since there is no time to fix the setup in victorio (the real error).

TODO:

  1. decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
    1. Done!
  2. add closed POS and clitics to PLX generation (Børre, Tomi)
    1. Progressing.
  3. add compound stems to the PLX generation (Børre, Tomi)
  4. add derivations to the PLX generation (Børre, Tomi)
  5. Include numerals in the speller (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

  • add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
  • study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

  • get an Intel Mac for testing Windows spellers (Børre, Sjur)
    • nothing yet

Localisation

TODO:

  • translate Windows installer text to sme and smj ( Børre, Thomas)
    • progressing (smj is mostly done, lots lacking in sme)

10. Other

Corpus contracts

TODO:

  • publish corpus contracts and project infra on NoDaLi-sta (Sjur)
    • not done

Bug fixing

56 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 22.1.2007, 09: 30 Norwegian time.

The meeting was closed at 10: 44.

Appendix - task lists for the next week

Boerre

  • continue work on script for automatic testing of the spell checker in Word
  • fix sme texts in corpus this month
  • find missing nob parallel texts in corpus
  • translate Windows installer text to sme and smj
  • work on the Polderland data generation (PLX format conversion)
    • Concentrate on compounding
  • go through other directories, fix parallelity information for other documents
  • add sma texts to the corpus repository
  • fix bugs!

Maaren

  • tasks according to Thomas

Saara

  • fix sme texts in corpus this month
  • send aligned, xml nob texts to Kristen
  • fix problems with xml2lexc if needed
  • fix bugs!

Sjur

  • name lexicon:
    • restructure interface code for easier maintenance, coding and use
    • refactor the rest of the SD-terms editor code
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
  • hire linguist and programmer
  • get an Intel Mac for testing Windows spellers
  • publish corpus contracts and project infra on NoDaLi-sta
  • fix stuorra-oslolaš lower case o
  • fix bugs!

Steinar

  • Complete the semantic sets in sme-dis.rle
  • missing lists
  • report conversion errors to Saara
  • Look at the actio compound issue when adding from missing lists
  • Go through the Num bugs
  • fix bugs!

Thomas

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  • work with compounding
  • translate Windows installer to sme and smj
  • lexicalise actio compounds
  • Lack of lowering before hyphen: Twol rewrite.
  • Go through the Num bugs
  • fix stuorra-oslolaš lower case o
  • include basic numbers in the non-recursive transducers
  • implement discontinous case inflection for numbers
  • produce correct base forms in the analyzer
  • fix bugs!

Tomi

  • add compound stems to the PLX generation
  • add closed POS and clitics to PLX generation
  • add derivations to the PLX generation
  • fix bugs!

Trond

  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt (not this week)
  • fix sme texts in corpus this month
  • find missing nob parallel texts in corpus, go through Saara's list
  • report conversion errors to Saara
  • Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
  • Go through the Num bugs
  • Make numeral testbed
  • Get input on sma hyphenations
  • include numbers in the non-recursive transducers for sme, smj
  • implement discontinous case inflection for numbers
  • produce correct base forms in the analyzer
  • fix bugs!.