Samest meeting 13.03.2017

Participants: Heiki-Jaan, Heli, Jaak, Jack, Kadri, Trond, Sjur


  • Oahpa
    • status
    • papers
  • FSTs
    • status
    • papers
  • MT
    • status
    • papers
  • Next meeting



The Võro Oahpa transducers have been rebuilt and the database updated. The hfst transducers are used instead of the xfst transducers because the xfst transducers fail to compile.

The xfst compiler reports a syntax error on the line: %0%0: nullʼ% nullʼ END ; whereas the hfst compiler is able to compile it. Here ʼ is the modifier letter apostrophe (the palatalisation mark in Võro), not the regular apostrophe.


A paper about Võro Oahpa for the NoDaLiDa NLP4CALL workshop is in progress.

The Estonian Oahpa is not quite there yet.



vro sh test/yaml-check.sh

SUMMARY for the gt-norm fst(s): PASSES: 13226 / FAILS: 56 / TOTAL: 13282

langs/est -- it appears that the tests and the fst have drifted apart: there are errors in the fst, and the tests also cover things that do not yet exist as such in the fst. Some of the errors have been corrected since the last meeting, but there is more to do. Also, some of the apertium tests have not been adjusted to Estonian.


TODO for the next meeting: bring the components that have drifted apart closer together.


Open for now.



Starting point: fin2X, X2fin

incubator/apertium...
$ cat ~/big/langs/fin/corp/aho_rautatie.txt|apertium -d. fin-est > aho_rautatie.mt.est

nursary/...
$ cat ~/big/langs/fin/corp/aho_rautatie.txt|apertium -d. fin-sme > aho_rautatie.mt.sme

Test on 23500 words, Juhani Aho "Rautatie":

$ cat aho_rautatie.mt.sme |tr ' ' '\n'|grep '^\*'|wc -l
3314
$ cat aho_rautatie.mt.est |tr ' ' '\n'|grep '^\*'|wc -l
4378

fin-sme  produce/understand  coverage 86 %
fin-est  produce/understand  coverage 82 %, but a 'better than G' impression
sme-fin  produce/understand  not that bad (impressionistic)
est-fin  produce/understand  ??
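The coverage figures can be reproduced from the unknown-word counts. The following is a minimal Python sketch, assuming that coverage means the share of words not marked with a leading '*' in the MT output and that the total is the approximate 23500 words mentioned for the test text. The function names are hypothetical and not part of any apertium tool.

```python
def unknown_count(mt_output: str) -> int:
    """Count tokens that carry the leading '*' used for unknown words."""
    # Splits on any whitespace, which is close to (not identical with)
    # the tr ' ' '\n' | grep '^\*' | wc -l pipeline used in the notes.
    return sum(1 for token in mt_output.split() if token.startswith("*"))


def coverage(unknown: int, total: int) -> float:
    """Percentage of words that were not marked as unknown."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (1 - unknown / total)


# Counts taken from the notes; 23500 is an approximate total.
TOTAL_WORDS = 23500
print(f"fin-sme: {coverage(3314, TOTAL_WORDS):.1f} %")
print(f"fin-est: {coverage(4378, TOTAL_WORDS):.1f} %")
```

With the approximate total this gives roughly 86 % for fin-sme and 81 % for fin-est. The small difference from the 82 % noted above may simply come from the rounded word total.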



46 prosentissa kirjeistä kirjoittaja myönsi suoraan ja avoimesti, että kaupungin suunnitelmat tökkivät, koska ne asettuvat vastakkain hänen oman etunsa kanssa.


46 protsendis kirjadest kirjutaja nõustus otse ja varjamatult, et linna plaanid tökkivät, kuna need asetuvad vastakkain ta oma eelise koos.

The problem here is:

One rule looks at the adjective + noun sequence and finds "oman etunsa". Another rule looks at "etunsa kanssa", but it cannot fire because "etunsa" has already been consumed by the first rule.

Move the "etunsa kanssa" rule earlier than the A N rule in the t1x file.

My current proposal for a solution: add more interchunk transfer stages, i.e. rule files, so that different rules look at different patterns one after another and do not care whether these overlap.

  • chunker
  • interchunk1
  • interchunk2
  • postchunk
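The consumption problem and the staged proposal can be illustrated with a toy matcher. This is only a sketch in Python, assuming a left-to-right scan in which a fired rule consumes the words it matched. It is not the actual apertium transfer code, and the tag names (Prn, A, N, Post) and rule names are hypothetical.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]            # (word form, part-of-speech tag)
Rule = Tuple[str, Sequence[str]]   # (rule name, tag pattern)


def apply_stage(tokens: List[Token], rules: List[Rule]) -> List[Tuple[str, str]]:
    """Scan left to right; at each position fire the longest matching rule
    and skip past the words it consumed."""
    fired = []
    i = 0
    while i < len(tokens):
        best = None
        for name, pattern in rules:
            tags = [tag for _, tag in tokens[i:i + len(pattern)]]
            if tags == list(pattern) and (best is None or len(pattern) > len(best[1])):
                best = (name, pattern)
        if best is None:
            i += 1
            continue
        name, pattern = best
        words = " ".join(word for word, _ in tokens[i:i + len(pattern)])
        fired.append((name, words))
        i += len(pattern)  # consumed words are invisible to the other rules
    return fired


phrase = [("hänen", "Prn"), ("oman", "A"), ("etunsa", "N"), ("kanssa", "Post")]
adj_noun = ("A N", ("A", "N"))
noun_post = ("N Post", ("N", "Post"))

# One stage holding both rules: "A N" consumes "etunsa", "N Post" never fires.
single = apply_stage(phrase, [adj_noun, noun_post])

# Two stages, each scanning the whole phrase with its own rule file.
staged = apply_stage(phrase, [adj_noun]) + apply_stage(phrase, [noun_post])

print(single)  # [('A N', 'oman etunsa')]
print(staged)  # [('A N', 'oman etunsa'), ('N Post', 'etunsa kanssa')]
```

In the single-stage run only the adjective + noun rule fires. In the staged run each rule file sees the full phrase, so the overlap on "etunsa" no longer blocks the second rule.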


Finnish text in Estonian:


The same Finnish text in North Saami



  1. Main problem grammatical (transfer)


  1. Main problem technical (tags in the target)
  2. Then bad vocabulary coverage
  3. It is too early to tell how bad the transfer is


Papers should investigate the production/understanding difference.

We should be able to conclude that with a decent lexicon and L1 syntax and morphology, we will get good understanding even with a bad transfer component (since there are free rides).

We should pursue the good fin2est results, and understand them.

Understanding would be easier to measure for est2fin vs. sme2fin (we would have two Finnish texts to look at)


  1. Spring
    1. Fix technical things (Trond, Heiki-Jaan)
    2. Look at lexical coverage
    3. Improve coverage
  2. Evaluation: Look at ways to evaluate understanding
  3. Analyse and write a paper before August

On evaluation:

  • Cf. work on forthcoming sme-nob article
  • SMT uses BLEU (how close the output is to a predefined reference)
  • RBMT likes WER (how much you must change the output before you are satisfied)
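As a reminder of what WER measures, here is a minimal Python sketch. It assumes the usual definition: the word-level edit distance (substitutions, insertions, deletions) between the MT output and a reference, divided by the reference length. In the post-editing reading above, the reference would be the corrected output. The example sentences are invented for illustration and are not evaluation data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = cur[j - 1] + 1   # extra word in output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Invented example: two substitutions and one insertion over five words.
print(word_error_rate("the city plans are stuck",
                      "the city plan is stuck now"))  # 0.6
```

A WER of 0 means no changes were needed. Values above 1 are possible when the output is much longer than the reference.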

Next meeting

Next meeting: 27.3.2017 at 11 am Norwegian time.