150107

Samest meeting, January 7th 2014

Present: Heiki, Heli, Jaak, Kadri, Sjur, Tiina

Agenda

  • fst
  • Things to be done before the physical meeting (19.-21.01.)

fst

Compounding

Lexicalised compounds

Jaak has included the compounds needed by Oahpa into the fst. Generation of forms of these lexicalised compounds works fine for Oahpa now.

Problem: if we want to use the same lexicon for hyphenation, we need the hash mark (#) as the separator between the parts of a compound.

Example of a lexicalised compound in other languages:

  word1word2:word1#word2 CONTLEX ;

Exceptional hyphenation points:

  lemma:ste^m CONTLEX ;
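To illustrate why the separator matters for hyphenation, here is a minimal Python sketch that reads the lower side of such an entry and recovers the boundary positions. The function name `boundary_positions` and its parameters are hypothetical, and the sketch only assumes that `#` marks a compound boundary and `^` an exceptional hyphenation point, as in the examples above.

```python
def boundary_positions(lower, compound_mark="#", hyph_mark="^"):
    """Split a lower-side string into its surface form and boundary offsets.

    Returns (surface, compound_breaks, hyphen_breaks), where each break is
    the character offset in the surface form at which the mark occurred.
    """
    surface = []
    compound_breaks = []
    hyphen_breaks = []
    for ch in lower:
        if ch == compound_mark:
            compound_breaks.append(len(surface))
        elif ch == hyph_mark:
            hyphen_breaks.append(len(surface))
        else:
            surface.append(ch)
    return "".join(surface), compound_breaks, hyphen_breaks


print(boundary_positions("word1#word2"))  # ('word1word2', [5], [])
print(boundary_positions("ste^m"))        # ('stem', [], [3])
```

Without the `#` in the lexicon entry, a hyphenator working from the same lexicon would have no way of knowing where the first part of a lexicalised compound ends.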

Dynamic compounds

In Estonian, dynamic compounds are produced by fst concatenation, with filters that weed out overgeneration. The system of filters is complex.
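The idea of "concatenate first, filter afterwards" can be sketched in plain Python, without any fst library. Everything here is a toy assumption: the stems `aa`, `bb`, `cc`, the tags `NoFirst` and `NoLast`, and the two filter functions are invented for illustration and do not reflect the actual Estonian filters.

```python
from itertools import product

# Hypothetical toy entries: (surface form, set of tags). Not real lexicon data.
STEMS = [
    ("aa", {"N"}),
    ("bb", {"N", "NoFirst"}),  # assumed tag: may not start a compound
    ("cc", {"N", "NoLast"}),   # assumed tag: may not end a compound
]


def concatenate(stems):
    """Overgenerate: pair every stem with every stem, joined by '#'."""
    for (left, ltags), (right, rtags) in product(stems, repeat=2):
        yield left + "#" + right, ltags, rtags


def no_banned_first(candidate):
    """Reject candidates whose first part carries the NoFirst tag."""
    _, ltags, _ = candidate
    return "NoFirst" not in ltags


def no_banned_last(candidate):
    """Reject candidates whose last part carries the NoLast tag."""
    _, _, rtags = candidate
    return "NoLast" not in rtags


FILTERS = [no_banned_first, no_banned_last]


def dynamic_compounds(stems, filters=FILTERS):
    """Keep only the concatenations that pass every filter."""
    return [
        candidate[0]
        for candidate in concatenate(stems)
        if all(keep(candidate) for keep in filters)
    ]


compounds = dynamic_compounds(STEMS)
print(compounds)  # ['aa#aa', 'aa#bb', 'cc#aa', 'cc#bb']
```

Nine candidates are generated and five are filtered out. Each new restriction adds one more filter to the list, which hints at how such a system grows complex.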

Problem with some illative forms

Some illative forms do not generate, e.g.

mees+N+Sg+Ill	+? 

This is a side effect of suppressing some short illative forms (previously tagged as additive), e.g. mees:*mehhe. Now mehesse does not generate either.
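Missing forms like this could be collected automatically from the generator output. The sketch below is only an assumption about the workflow: it takes tab-separated lookup output in which a failed generation is marked by a field ending in `+?`, as in the example above. The function name `failed_generations` and the second sample line (a genitive analysis with an assumed `+Gen` tag) are illustrative only.

```python
def failed_generations(lookup_output):
    """Return the input analyses for which the generator gave no form.

    Assumes one tab-separated result per line, with the analysis in the
    first field and a later field ending in '+?' when generation failed.
    """
    failed = []
    for line in lookup_output.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        if any(field.endswith("+?") for field in fields[1:]):
            failed.append(fields[0])
    return failed


# Illustrative sample; the second line uses an assumed tag and is not from the memo.
sample = "mees+N+Sg+Ill\t+? \nmees+N+Sg+Gen\tmehe\n"
print(failed_generations(sample))  # ['mees+N+Sg+Ill']
```

Running such a check over a list of expected illative analyses would show which forms were lost when the short illatives were suppressed.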

CG

Tiina has added the Estonian-specific scripts that include lists of pronouns, intransitive and partitive verbs etc.

To do before the meeting on 19.-21.01.

Heiki: describe the results of the comparison of Estonian and Finnish tags. Article on a linguistically motivated tag system. Multilingual, broader view? Discuss with Tommi Pirinen and "the Saami people" in Tromsø.

Jaak: check the illative forms (some are missing now) and the ja- and mine-forms.

Heli: Make Oahpa Morfa-S with full coverage of the "E nagu Eesti" dictionary available online (generate a new database).

Tiina: take further steps towards an Estonian CG that works in the gt infrastructure.

Heiki, Heli, Trond: put together an agenda for the physical meeting and e-mail it to all the participants.

One topic for discussion: preprocessing and detection of sentence boundaries (and clause boundaries) for Estonian. Compare EstNLTK (Python) with the Giellatekno tools.