SamEst meeting 8.11.2016

Participants: Fran, Heiki, Heli, Jaak, Kadri, Trond


  • Estonian FST(s)
  • Estonian Oahpa
  • Võro Oahpa
  • MT and CG
  • Next meetings

Estonian FST(s)

Heiki's experimental FST is in experimental-langs/est.

yaml tests pass, meaning that the Filosoft lexicon has been successfully converted to cover simplex word inflection.

still missing: derivation, compounding, and massive tests on real corpora

However, after today's svn up the documentation no longer generates (the FSTs still build).

Parallel forms

$ echo 'siile' | hfst-lookup analyser-gt-desc.hfstol
> siile        siil+N+Pl+Par        0,000000

$ echo 'siilisid' | hfst-lookup analyser-gt-desc.hfstol 
> siilisid        siil+N+Pl+Par+Use/Rare        0,000000

The tags are described in root.lexc as follows:

! paralleelvormide erinevat kasutussagedust iseloomustavad
! usage info for parallel forms (either correct according to the norm, or incorrect)

+Use/Rare       ! norm, but rare: puusid:puu+Pl+Par+Use/Rare
+Use/Hyp        ! norm, but so rare that norm is probably wrong: tiivasse:tiib+Sg+Ill+Use/Hyp
+Use/NotNorm    ! not norm, but sometimes used: pöidlates:pöial+Pl+Ine+Use/NotNorm
+Use/CommonNotNorm ! not norm, and used more than norm: peeneid:peen+Pl+Par+Use/CommonNotNorm
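A minimal sketch of how the usage tag could be pulled out of analyser output, for instance when deciding which parallel forms to show to learners. It assumes whitespace-separated lookup lines of the shape shown above (optionally prefixed with ">"); the function name is hypothetical and not part of any existing tool.

```python
import re

# Matches a usage tag such as +Use/Rare or +Use/CommonNotNorm.
USE_TAG = re.compile(r"\+Use/(\w+)")


def parse_lookup_line(line):
    """Return (wordform, analysis, usage) for one lookup line, or None.

    usage is the part after "+Use/" (e.g. "Rare"), or None when the
    analysis carries no usage tag.
    """
    fields = line.split()
    # Drop the ">" prefix used in the notes above, if present.
    if fields and fields[0] == ">":
        fields = fields[1:]
    if len(fields) < 2:
        return None
    wordform, analysis = fields[0], fields[1]
    match = USE_TAG.search(analysis)
    return (wordform, analysis, match.group(1) if match else None)


if __name__ == "__main__":
    for raw in (
        "> siile        siil+N+Pl+Par        0,000000",
        "> siilisid        siil+N+Pl+Par+Use/Rare        0,000000",
    ):
        print(parse_lookup_line(raw))
```

Forms whose usage value is None are the unmarked ones; the others can be filtered or ranked by their tag.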

Use/Hyp is used in continuation lexicons


+Sg+Ill%+Use%/Hyp:i%{WF%}%+sse GI ;    ! Fiatisse

+Sg+Ill%+Use%/Hyp:%{W%}%+sse GI ;          ! kotisse

To find out how many wordforms in the lexicon contain +Use/Hyp:

echo '$["+Use/Hyp"]' | hfst-regexp2fst |  hfst-compose -F -2 src/generator-gt-desc.hfst | hfst-fst2strings  | wc -l

To count the tokens in a corpus and the tokens the analyser does not recognise:

$ cat ~/big/langs/est/corp/vaalit2012.txt | preprocess | wc -l
$ cat ~/big/langs/est/corp/vaalit2012.txt | preprocess | lookup -q src/analyser-gt-desc.xfst | grep '?' | wc -l
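The two counts above can be combined into a missing rate. The sketch below mirrors the grep '?' step by counting lookup lines that contain a question mark; the function names are hypothetical, and no figures from the actual corpus run are implied.

```python
def count_unknown(lookup_lines):
    """Count lines containing '?', as `grep '?' | wc -l` does."""
    return sum(1 for line in lookup_lines if "?" in line)


def missing_rate(total_tokens, unknown_tokens):
    """Share of tokens without an analysis (0.0 for an empty corpus)."""
    if total_tokens == 0:
        return 0.0
    return unknown_tokens / total_tokens


if __name__ == "__main__":
    # Invented sample lines for illustration only.
    sample = [
        "siile\tsiil+N+Pl+Par",
        "",
        "xyzzy\txyzzy\t+?",
    ]
    unknown = count_unknown(sample)
    print(f"{unknown} unknown, rate {missing_rate(2, unknown):.2f}")
```

Like the grep, this also counts a literal question mark token in the text as unknown, so the rate is a slight overestimate on running text.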

How to do compounding?

Suggestion from Trond:

What tools are in our toolbox? ... and are these Filosoft questions as well?

  1. Where may or must I be in the compound string? (+CmpNP/First, etc.)
  2. What demands do I put on my neighbours (+CmpN/SgNomLeft) <== I demand my last part is +Hum
  3. If participating as the non-final part of a compound, what shape may I take (+CmpN/SgG)

Encoding error: since CmpNP/Last demands bájáš to be last, tags governing its behaviour as first part will not be put into use

bájáš +CmpN/SgN +CmpN/SgG +CmpN/PlG +CmpNP/Last +Sem/Plc
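Encoding errors of this kind could be caught mechanically. The following is a sketch only: it assumes entries are written as a lemma followed by space-separated tags, as in the line above, and it extends the check from +CmpNP/Last to +CmpNP/Suff and +CmpNP/None on the assumption that these also rule out use as a non-final part. The function name is hypothetical.

```python
# Tags describing the shape a word takes as a non-final compound part.
NON_FINAL_SHAPE_TAGS = {"+CmpN/SgN", "+CmpN/SgG", "+CmpN/PlG"}

# Position tags that (by assumption) exclude the non-final position.
NEVER_NON_FINAL = {"+CmpNP/Last", "+CmpNP/Suff", "+CmpNP/None"}


def unused_shape_tags(entry):
    """Return the shape tags in `entry` that can never take effect."""
    tags = set(entry.split()[1:])  # the first field is the lemma
    if tags & NEVER_NON_FINAL:
        return sorted(tags & NON_FINAL_SHAPE_TAGS)
    return []


if __name__ == "__main__":
    entry = "bájáš +CmpN/SgN +CmpN/SgG +CmpN/PlG +CmpNP/Last +Sem/Plc"
    print(unused_shape_tags(entry))
```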


This entry / word should be in the following position(s):

+CmpNP/All - ... in all positions, default, this tag does not have to be written

+CmpNP/First - ... only be first part in a compound or alone

+CmpNP/Pref - ... only first part in a compound, NEVER alone

+CmpNP/Last - ... only be last part in a compound or alone

+CmpNP/Suff - ... only last part in a compound, NEVER alone

+CmpNP/None - ... does not take part in compounds

+CmpNP/Only - ... only be part of a compound, i.e. can never be used alone, but can appear in any position
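The position tags above can be read as a table of allowed positions per tag. The sketch below encodes that reading and checks a sequence of compound parts against it. The position names ("alone", "first", "middle", "last") and the function names are illustrative assumptions; in the FSTs themselves the restrictions are enforced with flag diacritics, not with code like this.

```python
# Allowed positions per tag, read off the descriptions above.
ALLOWED = {
    "+CmpNP/All": {"alone", "first", "middle", "last"},
    "+CmpNP/First": {"alone", "first"},
    "+CmpNP/Pref": {"first"},
    "+CmpNP/Last": {"alone", "last"},
    "+CmpNP/Suff": {"last"},
    "+CmpNP/None": {"alone"},
    "+CmpNP/Only": {"first", "middle", "last"},
}
DEFAULT_TAG = "+CmpNP/All"  # the default tag need not be written


def positions(n_parts):
    """Position label for each part of a word with n_parts parts."""
    if n_parts == 1:
        return ["alone"]
    return ["first"] + ["middle"] * (n_parts - 2) + ["last"]


def compound_ok(part_tags):
    """part_tags: one position tag (or None for the default) per part."""
    if not part_tags:
        return False
    return all(
        pos in ALLOWED[tag or DEFAULT_TAG]
        for tag, pos in zip(part_tags, positions(len(part_tags)))
    )


if __name__ == "__main__":
    print(compound_ok(["+CmpNP/Pref", None, "+CmpNP/Suff"]))
    print(compound_ok(["+CmpNP/Last", None]))
```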

The tagged part of the compound should make a compound using:

+CmpN/SgN Singular Nominative

+CmpN/SgG Singular Genitive

+CmpN/PlG Plural Genitive

The second part of the compound may require that the previous (left part) is:

+CmpN/SgNomLeft Singular Nominative

+CmpN/SgGenLeft Singular Genitive

+CmpN/PlGenLeft Plural Genitive

We convert tags to flags in src/filters/convert_to_flags-CmpNP-tags.regex, e.g. like this:

  "@U.CmpFrst.TRUE@" <- "+CmpNP/First",
  "@U.CmpPref.TRUE@" <- "+CmpNP/Pref" ,
  "@P.CmpLast.TRUE@" <- "+CmpNP/Last" ,

Then we remove strings not fulfilling the demands

! Convert normative tags to positive reset flags:
 "@P.CmpN.SgN@" <- "+CmpN/SgN" ,
 "@P.CmpN.SgG@" <- "+CmpN/SgG" ,
 "@P.CmpN.PlG@" <- "+CmpN/PlG" ,
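To illustrate the effect of these replace rules on a single analysis string, here is a small sketch that applies only the mappings quoted above. It is a string-level imitation under stated assumptions: the real conversion is done by the regex files when the FST is compiled, and the function name is hypothetical.

```python
import re

# Only the tag-to-flag pairs quoted in the notes above.
TAG_TO_FLAG = {
    "+CmpNP/First": "@U.CmpFrst.TRUE@",
    "+CmpNP/Pref": "@U.CmpPref.TRUE@",
    "+CmpNP/Last": "@P.CmpLast.TRUE@",
    "+CmpN/SgN": "@P.CmpN.SgN@",
    "+CmpN/SgG": "@P.CmpN.SgG@",
    "+CmpN/PlG": "@P.CmpN.PlG@",
}

# Match whole tags of the form +Xxx/Yyy, so that +CmpN/SgNomLeft
# is not mistaken for +CmpN/SgN followed by other material.
TAG = re.compile(r"\+[A-Za-z]+/[A-Za-z]+")


def tags_to_flags(analysis):
    """Replace known compounding tags by flags; leave the rest as is."""
    return TAG.sub(lambda m: TAG_TO_FLAG.get(m.group(0), m.group(0)), analysis)


if __name__ == "__main__":
    print(tags_to_flags("bájáš+CmpN/SgN+CmpNP/Last+Sem/Plc"))
```

Matching whole tags rather than substrings keeps longer tags with a shared prefix intact.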

Note: In sme FST many compounds are lexicalised.


Start working on Estonian compounding.

Estonian Oahpa

    Heli has tested the generation of forms with Jaak's FST where the tag +Use/NG is used for marking the marginal parallel forms. This approach is usable for Oahpa.

    Next: to also test the generation of Oahpa database with Heiki's FST.

    A student in Tartu is working on creating more question-answer templates for Morfa-C.

Võro Oahpa

    Sulev and Heli gave a presentation on Võro Oahpa at Läänemeresoome sügiskonverents in Võru on 28 October. We got positive feedback.

    Sulev and Jack had a poster on Võro FST at the same conference.

    Audio has been added to

    • Leksa (pronunciations of Võro words that are in the Võro-Estonian dictionary synaq.org). See http://oahpa.no/voro/leksa
    • Morfa-C (the questions are read aloud by means of speech synthesis developed at the Institute of Estonian Language). At the moment it works correctly in Safari but not in all versions of Firefox. http://oahpa.no/voro/morfac

MT and CG

    OmegaT support

    is on its way, working, but a bit unreliable.


    Estonian pairs have been plugged in OmegaT.

    Webpage translation support

    Paste in text from http://avvir.no and get it in 5 languages, or upload a document. http://gtweb.uit.no/jorgal/

    Looking at inverse programs

    • So far: sme2smX as a production system
    • now thinking of smX2sme as a gisting system


    We are happy with that and think it could be reproduced with est2fin. The main showstopper would then be the syntax tags.

    This could also be replicated fairly easily for sme2est (at least to similar results as sme2fin).

    Workshop IWCLUL3

    January 23rd, 2017, St. Petersburg

    One extra week to make a contribution. Works in progress gratefully received!

Next meetings

    SamEst MT: 17th November, 11:00 Tromsø

    SamEst: 22nd November, 11:30 Tromsø