160223

SamEst meeting 23.02.2016

Participants: Heiki-Jaan, Heli, Jaak, Kadri, Kaili, Sjur, Sulev, Tiina, Trond

Agenda:

Subgroups
FST
Oahpa
MT
Samest report due March 1
Long-term goals
Next meeting

Subgroups

Estonian FST: Jaak, Heiki-Jaan, Heli
- Three meetings so far: 5 Jan (Tartu), 1 Feb (Skype audio), 9 Feb (Skype chat)
Estonian Oahpa: Heli, Tiina, Kadri
- No meetings so far
Võro Oahpa and FST: Sulev, Jack, Heli
- No meetings this year yet.
CG: Tiina, Trond, Lene, Fran to discuss
- Nothing so far.
MT: fin2X, X2fin: Heiki, Trond, Fran, Tiina
- Nothing so far.
Infra: Sjur, Fran, ...

FST

Estonian fst

Work on est.fst has improved the Oahpa database generation: the remaining errors are now mostly linked to words that have marginal parallel forms.

Next task: mask marginal forms as +Use/NG. Are there any other standard tags we could use?

+Use/NG: do not generate this form
+Ill+Short, +Ill+Long, and then choose Short and Long according to rule?

maja -> majja, majasse

(Use: for the Oahpa MySQL database, generate majasse and say "ok, this is the long one, but use the short one"; for MT, do not generate majasse, only majja.)
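
The per-application behaviour described above could be sketched roughly as follows. This is only an illustration: the (wordform, analysis) pair layout, the tag string maja+N+Sg+Ill and the function names are assumptions made for the example, not part of the existing infrastructure.

```python
# Hypothetical generator output: (wordform, analysis) pairs.
# The analysis strings are invented for illustration only.
GENERATED = [
    ("majja", "maja+N+Sg+Ill"),
    ("majasse", "maja+N+Sg+Ill+Use/NG"),
]

NO_GEN_TAG = "+Use/NG"


def forms_for_mt(pairs):
    """MT: drop every form whose analysis carries +Use/NG."""
    return [form for form, analysis in pairs if NO_GEN_TAG not in analysis]


def forms_for_oahpa(pairs):
    """Oahpa: keep all forms, but flag the marginal ones so that
    feedback can point the learner to the preferred form."""
    return [(form, NO_GEN_TAG in analysis) for form, analysis in pairs]
```

With this layout, MT would only ever produce majja, while Oahpa would still know about majasse and could accept it with a comment.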

Heiki-Jaan has worked on the Estonian fst in parallel, in apertium/branches, in order to inspire Jaak.

Heiki's suggestion:

õukonnasse: õukond<n><sg><ill>+Use/Hyp <-- norm OK, never used
tõugusid: tõug<n><pl><par>+Use/Rare <-- norm OK, seldom used
pöidlatesse: pöial<n><pl><ill>+Use/NotNorm <-- not norm, and seldom used
peeneid: peen<n><pl><par>+Use/CommonNotNorm <-- not norm, and frequently used

Is there a standard way to declare difference between normative and descriptive fsts? Yes: all tags beginning with +Err/ will be treated as non-normative, and a normative filter removing strings containing such tags is automatically generated and applied on the normative fst.
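
As a rough illustration of what the normative filter does, here is a minimal sketch that works on a list of analysis strings rather than on the fst itself. The real filter is generated and applied at fst level; the function names and the example analysis strings below are hypothetical.

```python
ERR_PREFIX = "+Err/"


def is_normative(analysis: str) -> bool:
    """An analysis counts as normative if it contains no +Err/ tag."""
    return ERR_PREFIX not in analysis


def normative_filter(analyses):
    """Keep only the analyses without any +Err/ tag."""
    return [a for a in analyses if is_normative(a)]


# Hypothetical analysis strings, for illustration only.
EXAMPLE = [
    "lemma+N+Sg+Nom",
    "lemma+N+Sg+Nom+Err/Orth",
]
```

Here normative_filter(EXAMPLE) keeps only the first string; the descriptive fst would keep both.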

Once Heiki-Jaan reaches the stage where he wants to convert the lexicon to his fst and try the resulting system on real texts, he will do this on startup-languages and will ask Heli or Jaak to help.

Use Bugzilla for reporting Estonian FST bugs!

Võro fst

The fst is currently working much better on nouns than on verbs.
In recent months Sulev has expanded and corrected the yaml tests for verbs. Most of the tests are done; Sulev will continue with this.
At the end of last year we tried to find errors in the fst using a table of generated noun forms produced by Jack. Sulev corrected some noun types and gave feedback to Jack. This work stopped because Jack was too busy; we need to resume it soon.

Yaml status

Verbs, gt-norm fst(s): PASSES: 2156 / FAILS: 1206 / TOTAL: 3362
Nouns, gt-norm fst(s): PASSES: 5470 / FAILS: 2099 / TOTAL: 7569
Adjectives, gt-norm fst(s): PASSES: 328 / FAILS: 8 / TOTAL: 336
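
To make the three results easier to compare, the pass rates can be computed from the figures above. This is a small helper sketch; the dictionary layout and the function name are assumptions made for the example.

```python
# (passes, fails, total) per word class, copied from the yaml status above.
RESULTS = {
    "verbs": (2156, 1206, 3362),
    "nouns": (5470, 2099, 7569),
    "adjectives": (328, 8, 336),
}


def pass_rate(passes: int, fails: int, total: int) -> float:
    """Return the share of passing tests in percent.

    Raises ValueError if the counts do not add up, which would
    point to a copy error in the status figures.
    """
    if passes + fails != total:
        raise ValueError("passes + fails does not equal total")
    return 100.0 * passes / total


for word_class, counts in RESULTS.items():
    print(f"{word_class}: {pass_rate(*counts):.1f} % passing")
```

Rounded to one decimal, this gives roughly 64.1 % for verbs, 72.3 % for nouns and 97.6 % for adjectives, which matches the observation that nouns currently work better than verbs.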

Oahpa

Võro Oahpa

testing.oahpa.no/voro

Võro Oahpa database has been regenerated this morning, using the newest fst and the newest Oahpa lexicon.
Sulev has received feedback on Võro Oahpa from some Tarto/Võro colleagues: the numbers and lexicon parts are very good, but the morphology part still makes too many mistakes (the fst is still weak). One idea: add temporary warnings that the Morfa parts do not work well enough yet, so that users are not surprised by the rubbish generated there. The Leksa and Numra games, however, are already brilliant!
Sulev has completed the word lists for some semantic classes (food and drink) in order to get more variation in the Morfa-C exercises. Heli has added these words to the Oahpa lexicon together with translations into 7 languages. The words are already included in the online version, although not all of them are available in Morfa-C yet (those that are not in lexc).

Estonian Oahpa

testing.oahpa.no/eesti

Things fixed:

Generation of forms of the words that have homonyms in singular nominative (e.g. sokk, koor, kokk, tikk).
Removed repetitions from the same exercise set in Morfa-S/C.
Leksa: Added more translations to Swedish, now e.g. for all verbs.

No contact with teachers during the last few months.

MT

fin-est

Compounding:

Works for H-J; for Tiina it works for two-part but not three-part compounds; for Trond it does not work at all.

Input (fin): Maria haluaa näyttelijäksi tai oopperalaulajaksi.
Output (est): Marit tahab näitlejaks või ooper lauljaks.

Correct: Marit tahab saada näitlejaks või ooperilauljaks.

Issues:

Wrong case of the first part (nominative vs genitive): should be ooperi lauljaks. Handle with transfer rules? In some cases some letters are also dropped, as in 'teisendamisreeglid'.
Single word: ooperlauljaks (incorrect)
tahab _saada_ (kelleks)?

echo sopimusluostari | apertium -d . fin-est

fin disambiguation

Worked on morphological disambiguation of Finnish based on a textbook; recall is currently 1/3 % short of 100 %. Would like to continue tuning on the converted Turku treebank. There was a plan to make a translation of it; is that progressing? It would be very useful, also as training data for MT.

New parallel resource ready: 1700 sentences from Turku Dependency Treebank translated to Estonian. We should also annotate them for morphology and dependency structure but haven't done anything yet. No need for annotating the Estonian part.

Tiina has been working on disambiguation of textbook text, and things are progressing. This does not go into langs/fin/src/syntax, but into a separate place (?).

sme-fin and fin-sme

fin-sme

Implementation not done, awaiting better fin results.

sme-fin

Work has been done.

Forward: subgroup meetings CG, MT next week.

Long-term goals

The project runs until April 2017, so there is a bit more than one year to go. We need to compare the current status to the expected outcome of the project.

Heiki will send around the documents:

the initial project application
the annual report of 2014
the annual report of 2015

Samest report due March 1

Submitted by Heiki-Jaan! Credits and thanks.

Next meeting

Friday 4 March 9.00 Norwegian time (10.00 Estonian time)