160105

Samest meeting Jan 5 2016

Participants: Heiki-Jaan, Heli, Jaak, Tiina, Trond

Last meeting: /lang/est/meetings/151215.html

Agenda

  • FST
  • Oahpa
  • MT
  • Subgroups
  • other business: bachelor's thesis on saami-est MT, samest report due March 1

FST

The priority union that was problematic in hfst was fixed in Helsinki (Hki). This gives correct analyses for the problematic nägema, etc. The same priority union worked in xfst earlier, but xfst has other problems of its own.

Everything in the overriding exceptions file is probably fixed now.

Next step: Jaak to fix the issues in Heli's letter.

  • Vokaalmitmus (vowel plural) -- there seems to be overgeneration (when does a word have it?)
  • Short illative -- HJK has said that he has classes where it is possible and classes where it is preferred(?). Oahpa would like to have them distinctly tagged when both are possible (or even when the pupil uses a form that is derived "regularly" but is not actually used, e.g. poesse)
  • Lexicalized compounds -- some twolc rules assume that # is present; should we add it? (töökojas)

Oahpa

Recently, Heli has worked on Morfa-C for both vro and est. The new solution for avoiding repeated exercises should be implemented for the other languages as well, but it is still not general enough. Heli is working on this.

Heli will regenerate the lexicon (after the priority-union fix) and have a look. Next step is to generalise Morfa-C and work on Vasta-S.

MT

Heiki-Jaan has been using the old hfst compiler, the one without the priority-union problem. The bug only appeared this year.

echo 'ole' | hfst-lookup tools/mt/apertium/analyser-mt-apertium-desc.est.hfstol
> ole        olla<vblex><actv><imp><p2><sg>        0,000000
ole        olla<vblex> <imp><prs><conneg><p2><sg>        0,000000
ole        olla<vblex>  <ind><prs><conneg>        0,000000

Try 'ollut' and 'olleet': they have no ConNeg reading, which is needed for 'ei ollut':
"olla" V Act Ind Prt ConNeg Sg - is missing
"olla" V Act Ind Prt ConNeg Pl - is missing

incubator/apertium-fin-est/gt2apertium.cg3r

The ConNeg analysis is in place, both for the fin and the est CG:

"<Hän>"
        "hän" Pron Pers Sg Nom
"<ei>"
        "ei" V Neg Act Sg3
"<tullut>"
        "tulla" V Act Ind Prt ConNeg Sg
"<.>"
        "." Punct

"<tema>"
        "tema" Pron Emph Sg Nom
"<ei>"
        "ei" V Neg
"<tulnud>"
        "tulema" V Pers Prt Ind Neg <==

echo 'Hän ei tullut.' | apertium -d . fin-est
Ta ei saanud.

echo 'Hän ei ostanut.' | apertium -d . fin-est
Ta ei ostnud.

echo 'Hän ei koskaan ostanut.' | apertium -d . fin-est
Ta ei ostnud kunagi.

Task: Add the V Act Ind Prt ConNeg Sg reading to olla.

    <e><p><l>tulla<s n="vblex"/></l><r>tulema<s n="vblex"/></r></p></e>
    <e><p><l>tulla<s n="vblex"/></l><r>saama<s n="vblex"/></r></p></e>
    <e><p><l>tulla<s n="vblex"/></l><r>sattuma<s n="vblex"/></r></p></e>
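
The three entries above give tulla three Estonian equivalents, which is presumably why 'Hän ei tullut.' came out with saama rather than tulema. A minimal Python sketch for listing such one-to-many entries follows; the `<section>` wrapper and the function names are assumptions made for illustration, not part of the apertium-fin-est setup.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The three bidix entries quoted in the notes, wrapped in a <section> root
# (an assumption) so that the fragment parses as a single XML document.
BIDIX_FRAGMENT = """
<section>
  <e><p><l>tulla<s n="vblex"/></l><r>tulema<s n="vblex"/></r></p></e>
  <e><p><l>tulla<s n="vblex"/></l><r>saama<s n="vblex"/></r></p></e>
  <e><p><l>tulla<s n="vblex"/></l><r>sattuma<s n="vblex"/></r></p></e>
</section>
"""


def translations(xml_text):
    """Map each left-side lemma to its right-side lemmas, in file order."""
    table = defaultdict(list)
    for pair in ET.fromstring(xml_text).iter("p"):
        left, right = pair.find("l"), pair.find("r")
        # .text is the lemma, i.e. the text before the first <s/> child.
        table[left.text].append(right.text)
    return dict(table)


def ambiguous(table):
    """Keep only the lemmas that have more than one translation."""
    return {lemma: targets for lemma, targets in table.items() if len(targets) > 1}


if __name__ == "__main__":
    print(ambiguous(translations(BIDIX_FRAGMENT)))
```

Every lemma that this reports needs either a default translation or a lexical selection rule.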
        
tf-hsl-m0016:~ ttr000$ ufin
ole
ole        olla+V+Act+Imprt+Sg2
ole        olla+V+Act+Imprt+Prs+ConNeg+Sg2
ole        olla+V+Act+Ind+Prs+ConNeg

ollut
ollut        olla+V+Pss+PrfPrc+Pl+Nom
ollut        olla+V+Act+PrfPrc+Sg+Nom
ollut        olla+V+Pss+PrfPrc+Pl+Nom
ollut        olla+V+Act+Ind+Prt+ConNeg+Sg  <=== add this
olleet      olla+V+Act+Ind+Prt+ConNeg+Pl  <=== add this

tullut
tullut        tulla+V+Pss+PrfPrc+Pl+Nom
tullut        tulla+V+Act+PrfPrc+Sg+Nom
tullut        tulla+V+Act+Ind+Prt+ConNeg+Sg
tullut        tulla+V+Pss+PrfPrc+Pl+Nom
tullut        tulla+V+Act+PrfPrc+Sg+Nom
tullut        tulla+V+Act+Ind+Prt+ConNeg+Sg
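
The comparison of ollut and tullut can be checked mechanically. The Python sketch below parses lookup output copied from these notes (without the two lines marked "add this", which are wishes and not analyser output), then reports whether a form has a Prt+ConNeg reading and which analyses are listed more than once. The function names are made up for illustration.

```python
from collections import Counter

# Lookup output as pasted in the notes: one "form analysis" pair per line.
LOOKUP = """\
ollut   olla+V+Pss+PrfPrc+Pl+Nom
ollut   olla+V+Act+PrfPrc+Sg+Nom
ollut   olla+V+Pss+PrfPrc+Pl+Nom
tullut  tulla+V+Pss+PrfPrc+Pl+Nom
tullut  tulla+V+Act+PrfPrc+Sg+Nom
tullut  tulla+V+Act+Ind+Prt+ConNeg+Sg
tullut  tulla+V+Pss+PrfPrc+Pl+Nom
tullut  tulla+V+Act+PrfPrc+Sg+Nom
tullut  tulla+V+Act+Ind+Prt+ConNeg+Sg
"""


def readings(text):
    """Collect the analyses of each word form, keeping duplicates."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        form, analysis = parts
        table.setdefault(form, []).append(analysis)
    return table


def has_prt_conneg(analyses):
    """True if at least one analysis is a past connegative."""
    return any("+Prt+ConNeg" in a for a in analyses)


def duplicates(analyses):
    """Analyses that occur more than once, sorted alphabetically."""
    return sorted(a for a, n in Counter(analyses).items() if n > 1)


if __name__ == "__main__":
    for form, analyses in readings(LOOKUP).items():
        print(form, has_prt_conneg(analyses), duplicates(analyses))
```

On this data the check flags ollut as lacking the connegative reading. It also shows repeated analyses for both forms, which may be worth a separate look.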

tf-hsl-m0016:~ ttr000$ usme
leat
leat        leat+V+IV+Ind+Prs+Pl1
leat        leat+V+IV+Ind+Prs+Pl3
leat        leat+V+IV+Ind+Prs+Sg2
leat        leat+V+IV+Ind+Prs+ConNeg
leat        leat+V+IV+Inf

lean
lean        leat+V+IV+Ind+Prs+Sg1
lean        leat+V+IV+Ind+Prt+ConNeg

boahtán
boahtán        boahtit+V+IV+PrfPrc
boahtán        boahtit+V+IV+Ind+Prt+ConNeg
ii boahtán = ei tullut

Summing up tags

  • Conversion
    • There is a wiki.apertium.org page for documenting tag conversion
    • There are scripts both on the giella and the apertium side to govern the conversion
    • Goal: Harmonising tags across languages when possible
  • Linguistics:
    • The Finnish verb olla should be brought into line with the other verbs (ollut = tullut)
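
To make the conversion point concrete, here is a minimal Python sketch of a giella-to-apertium tag mapping. The table holds only the pairs that can be read off the two 'ole' outputs earlier in these notes. It is not the project's conversion script, and the function name `convert` is made up for illustration. Tags without an attested counterpart are reported rather than guessed.

```python
# Tag pairs read off the two 'ole' outputs in these notes. This is only an
# illustration, NOT the conversion table used by the giella/apertium scripts.
GIELLA_TO_APERTIUM = {
    "V": ["vblex"],
    "Act": ["actv"],
    "Imprt": ["imp"],
    "Ind": ["ind"],
    "Prs": ["prs"],
    "ConNeg": ["conneg"],
    "Sg2": ["p2", "sg"],  # one giella tag becomes two apertium tags
}


def convert(analysis):
    """Convert 'lemma+Tag+Tag' to 'lemma<tag><tag>'.

    Returns the converted string and the list of tags that had no entry
    in the table, so that gaps are visible instead of silently dropped.
    """
    lemma, *tags = analysis.split("+")
    converted, unmapped = [], []
    for tag in tags:
        if tag in GIELLA_TO_APERTIUM:
            converted.extend(GIELLA_TO_APERTIUM[tag])
        else:
            unmapped.append(tag)
    return lemma + "".join(f"<{t}>" for t in converted), unmapped


if __name__ == "__main__":
    print(convert("olla+V+Act+Imprt+Sg2"))
    print(convert("olla+V+Act+Ind+Prt+ConNeg+Sg"))
```

The first call reproduces the analysis string of the first hfst-lookup line for 'ole'. The second reports Prt and Sg as unmapped, since these notes do not show their apertium counterparts.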

fin2X

sme2fin fin2sme

Needed: a gold corpus. One is forthcoming, based on a Finnish textbook.

  • Qnt
  • Qu

The documentation says:

+Qu !!= * {{@CODE@}}: Quantor
+Qnt !!= * {{@CODE@}}: Quantor?????
  1. Find out why there are two tags
    1. If no good reason: Choose one of them
    2. If a good reason: Make transfer rules

Started building a kind of gold text from a Finnish textbook. Some minor issues came up:

echo -e "ettei\neikä" | hfst-lookup -q -p $GTHOME/langs/fin/src/analyser-gt-desc.hfst | cut -f1-2 | cg-conv -f 
"<ettei>"
    "ettei" Pcle CS
    "ei" CS                                                 preferred, wrong form
        "että"                                               missing POS
    "että" CSei Neg Act Sg3
"<eikä>"
    "ei" V Neg Act Sg3 Foc_kA
    "ei" V Neg Act Sg3 Foc/ka @synfuction <== preferred
    "eikä" CC  <== problem with enkä
    "eikä" Pcle CC
    "ja" CCei Neg Act Sg3                            wrong
  • 46: Hän ei puhu viroa eikä venäjää. <-- eikä = ega = ja mitte
  • 209: Olutkin on edullista eikä niin kallista kuin muualla. <-- eikä = ja/aga mitte
  • 331: Sanna on juuri nyt vaikeassa iässä eikä halua puhua äidin kanssa. <-- eikä = ega
  • olen norjalainen enkä virolainen <-- eikä = aga mitte

echo 'olen norjalainen enkä virolainen' | apertium -d . fin-est
Olen *norjalainen ja ei eestlane

Foc_kA is plainly wrong; it should be Foc/ka. CC vs. Pcle CC: when is a word Pcle CC and not CC?

Cliticized forms:

etten
etten        etten+Pcle+CS
etten        että+CSei+Neg+Act+Sg1
etten        että#en+CS

ettet
ettet        ettet+Pcle+CS
ettet        että+CSei+Neg+Act+Sg2
ettet        että#et+CS

eikä
eikä        ei+V+Neg+Act+Sg3+Foc_kA
eikä        eikä+CC
eikä        eikä+Pcle+CC
eikä        ei+V+Neg+Act+Sg3+Foc/ka
eikä        ja+CCei+Neg+Act+Sg3

Double forms for translating into Finnish:

Added +Use/NG among the parallel forms (omenien, omenoiden, omenoitten, omenain, ...), so that only one form is generated: "Omenien hinta on korkea" (not: "Omenien/omenoiden hinta on korkea").

There may be some double forms left.

Example: 'pääteos' gets 45 analyses from hfst-lookup.

Bachelor's thesis on saami-est MT

This will be the sixth active MT pair for sme2X, X = nob, sma, smj, smn, fin, est.

TODO

  • Look at the tag conversion issue, both the technical side and the documentation
  • Discuss in the MT group

Subgroups

  • Oahpa: Heli, Tiina, Kadri
  • FST: Jaak, Heiki-Jaan, Heli
  • Võru Oahpa and FST: Sulev, Jack, Heli
  • CG: Tiina, Trond, Lene, Fran to discuss
  • MT: fin2X, X2fin: Heiki, Trond, Fran, Tiina
  • Infra: Sjur, Fran, ...

  • Trond to send timeslots for CG/MT
  • Heli to send timeslots for Oahpa, FST

samest report due March 1

Unni Norum@uit.no + trond.trosterud@uit.no

Next meeting

Tuesday, Feb 9th, 0900 (Trond perhaps to ask for change of time)