170313

Contents:

Samest meeting 13.03.2017
ICALL
FSTs
MT
Next meeting

Samest meeting 13.03.2017

Participants: Heiki-Jaan, Heli, Jaak, Jack, Kadri, Trond, Sjur

Agenda

ICALL
- status
- papers
FSTs
- status
- papers
MT
- status
- papers
Next meeting

ICALL

Status

The Võro Oahpa transducers have been rebuilt and the database updated. The hfst transducers are used instead of the xfst ones, because the xfst transducers fail to compile.

The xfst compiler reports a syntax error on the line: %0%0: nullʼ% nullʼ END ; whereas the hfst compiler is able to compile it. Here ʼ is the modifier letter apostrophe (U+02BC, the palatalisation mark in Võro), not the regular apostrophe.

Papers

A paper about Võro Oahpa for the Nodalida NLP4CALL workshop is in progress.

The Estonian Oahpa is not quite there yet.

FSTs

Status

vro: sh test/yaml-check.sh

SUMMARY for the gt-norm fst(s): PASSES: 13226 / FAILS: 56 / TOTAL: 13282

langs/est -- the tests and the fst appear to have drifted apart: there are errors in the fst, and the tests cover things that do not exist as such in the fst yet. Some of the errors have been corrected since the last meeting, but there is more to do. Also, some of the apertium tests have not been adjusted to Estonian.

langs/est/test/tools/mt/apertium/apertiumtests_mt-gt-desc.ana.yaml

TODO for the next meeting: bring the components that have drifted apart closer together.

Papers

Open for now.

MT

Status

Starting point: fin2X, X2fin

incubator/apertium...
cat ~/big/langs/fin/corp/aho_rautatie.txt|apertium -d. fin-est > aho_rautatie.mt.est

nursery/...
cat ~/big/langs/fin/corp/aho_rautatie.txt|apertium -d. fin-sme > aho_rautatie.mt.sme

Test on 23500 words, Juhani Aho, "Rautatie":

$ cat aho_rautatie.mt.sme |tr ' ' '\n'|grep '^\*'|wc -l
3314
$ cat aho_rautatie.mt.est |tr ' ' '\n'|grep '^\*'|wc -l
4378

                         
fin-sme produce/understand coverage 86 %
fin-est produce/understand coverage 82 % but a 'better than G' impression
sme-fin produce/understand Not that bad (impressionistic)
est-fin produce/understand ??
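
The coverage figures can be reproduced from the unknown-word counts with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it assumes that unknown words are the tokens starting with `*`, as in the grep commands in the notes. With the rounded total of 23500 words it gives about 85.9 % for fin-sme and 81.4 % for fin-est, close to the figures in the table; the exact token total behind the table is not recorded here, so small differences are expected.

```python
def unknown_count(text):
    """Count tokens starting with '*', like: tr ' ' '\\n' | grep '^\\*' | wc -l"""
    lines = text.replace(" ", "\n").splitlines()
    return sum(1 for line in lines if line.startswith("*"))


def coverage(total_words, unknown_words):
    """Percentage of words that are not marked as unknown."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (1.0 - unknown_words / total_words)


# Counts taken from the notes; 23500 is the rounded size of the test text.
for pair, unknown in [("fin-sme", 3314), ("fin-est", 4378)]:
    print(pair, round(coverage(23500, unknown), 1))
```

To count directly from an output file, pass its contents to `unknown_count`, e.g. `unknown_count(open("aho_rautatie.mt.sme", encoding="utf-8").read())`.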

fin-est

fin:

46 prosentissa kirjeistä kirjoittaja myönsi suoraan ja avoimesti, että kaupungin suunnitelmat tökkivät, koska ne asettuvat vastakkain hänen oman etunsa kanssa.

est:

46 protsendis kirjadest kirjutaja nõustus otse ja varjamatult, et linna plaanid tökkivät, kuna need asetuvad vastakkain ta oma eelise koos.

The problem here is:

One rule looks at the adjective + noun sequence and finds "oman etunsa". Another rule looks at "etunsa kanssa", but it cannot fire because "etunsa" has already been consumed.

Move the "etunsa kanssa" rule earlier than the A N rule in the t1x file.

My current proposal for a solution: add more interchunk transfer stages, i.e. rule files, so that different rules look at different patterns one after another, regardless of whether these overlap.

chunker
interchunk1
interchunk2
postchunk
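
The consumption problem and the staged solution can be illustrated with a toy matcher. This is only a sketch and not Apertium's actual matching algorithm: the function `apply_stage`, the tags A, N, Post and the chunk names NP and PP are assumptions made for the example. In a single stage the A N rule consumes "etunsa", so the N Post rule never sees it. With two stages, the second stage works on the chunks produced by the first and can combine the NP chunk with "kanssa".

```python
def apply_stage(tokens, rules):
    """Scan left to right; at each position fire the first rule whose tag
    pattern matches, and consume the matched tokens as one chunk."""
    out = []
    i = 0
    while i < len(tokens):
        for name, pattern in rules:
            window = tokens[i:i + len(pattern)]
            if [tag for _, tag in window] == pattern:
                words = " ".join(word for word, _ in window)
                out.append((words, name))  # the chunk is tagged with the rule name
                i += len(pattern)
                break
        else:
            out.append(tokens[i])  # no rule matched: pass the token through
            i += 1
    return out


tokens = [("oman", "A"), ("etunsa", "N"), ("kanssa", "Post")]

# One stage: "etunsa" is consumed by the A N rule, so N Post cannot fire.
single = apply_stage(tokens, [("NP", ["A", "N"]), ("PP", ["N", "Post"])])

# Two stages: the second stage sees the NP chunk and attaches the postposition.
stage1 = apply_stage(tokens, [("NP", ["A", "N"])])
staged = apply_stage(stage1, [("PP", ["NP", "Post"])])

print(single)
print(staged)
```

In the staged version each rule file only needs to describe its own pattern, at the cost of one more pass over the text.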

fin2X

Finnish text in Estonian:

http://gtweb.uit.no/tolkimine/index.eng.html?dir=fin-est&qP=http%3A%2F%2Fyle.fi%2Fuutiset%2Fosasto%2Fsapmi%2Fnrk_uusi_tenon_kalastussopimus_on_toisen_voitto_ja_toisen_tappio%2F9502974#webpageTranslation

The same Finnish text in North Saami

http://gtweb.uit.no/mt-testing/index.sme.html?dir=fin-sme&qP=http%3A%2F%2Fyle.fi%2Fuutiset%2Fosasto%2Fsapmi%2Fnrk_uusi_tenon_kalastussopimus_on_toisen_voitto_ja_toisen_tappio%2F9502974#webpageTranslation

fin2est:

Main problem grammatical (transfer)

fin2sme:

Main problem technical (tags in the target)
Then bad vocabulary coverage
It is too early to tell how bad the transfer is

Papers

Papers should investigate the production/understanding difference.

We should be able to conclude that with a decent lexicon and L1 syntax and morphology, we will get good understanding even with a bad transfer component (since there are free rides).

We should pursue the good fin2est results, and understand them.

Understanding would be easier to measure for est2fin vs. sme2fin (we would have two Finnish texts to look at)

fin2X

Spring
1. Fix technical things (Trond, Heiki-Jaan)
2. Look at lexical coverage
3. Improve coverage
Evaluation: Look at ways to evaluate understanding
Analyse and write a paper before August

On evaluation:

Cf. work on forthcoming sme-nob article
SMT uses BLEU (how close the output is to a predefined reference)
RBMT likes WER (how much of the output you must change before you are satisfied)
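
For reference, WER in its common form is the word-level edit distance between the MT output and a post-edited version, divided by the length of the post-edited version. The sketch below assumes that definition; the function name `wer` and the sample strings are placeholders, not data from our tests.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Placeholder strings: one substitution and one deletion over four reference words.
print(wer("a b c d", "a x c"))
```

Here the reference is the post-edited text, so the score reads as the share of words that had to be changed.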

Next meeting

Next meeting: 27.3.2017 at 11 am Norwegian time.