Transducer infrastructure

$ echo 'siile' | hfst-lookup analyser-gt-desc.hfstol
> siile        siil+N+Pl+Par        0,000000

$ echo 'siilisid' | hfst-lookup analyser-gt-desc.hfstol 
> siilisid        siil+N+Pl+Par+Use/Rare        0,000000

The tags are described in root.lexc as follows:

! paralleelvormide erinevat kasutussagedust iseloomustavad
! usage info for parallel forms (either correct according to the norm, or incorrect)

+Use/Rare       ! norm, but rare: puusid:puu+Pl+Par+Use/Rare
+Use/Hyp        ! norm, but so rare that norm is probably wrong: tiivasse:tiib+Sg+Ill+Use/Hyp
+Use/NotNorm    ! not norm, but sometimes used: pöidlates:pöial+Pl+Ine+Use/NotNorm
+Use/CommonNotNorm ! not norm, and used more than norm: peeneid:peen+Pl+Par+Use/CommonNotNorm
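
A minimal sketch of how a script could read lookup lines like the two above and report the usage tag. The helper names (parse_analysis, usage_note) are hypothetical, the descriptions are paraphrased from the root.lexc comments, and the three-column layout with a decimal comma is assumed from the sample output only.

```python
# Hypothetical helpers; the column layout is assumed from the sample output above.
USAGE_TAGS = {
    "Use/Rare": "norm, but rare",
    "Use/Hyp": "norm, but so rare that the norm is probably wrong",
    "Use/NotNorm": "not norm, but sometimes used",
    "Use/CommonNotNorm": "not norm, and used more than the norm",
}


def parse_analysis(line):
    """Split one lookup line into surface form, lemma, tags and weight."""
    # Drop the leading "> " that the notes put in front of output lines.
    fields = line.lstrip("> ").split()
    surface, analysis, weight = fields[0], fields[1], fields[2]
    lemma, *tags = analysis.split("+")
    # The sample output uses a decimal comma, so normalise it before float().
    return surface, lemma, tags, float(weight.replace(",", "."))


def usage_note(tags):
    """Return the description of the first +Use/... tag, or None."""
    for tag in tags:
        if tag in USAGE_TAGS:
            return USAGE_TAGS[tag]
    return None


surface, lemma, tags, weight = parse_analysis(
    "> siilisid        siil+N+Pl+Par+Use/Rare        0,000000"
)
print(surface, lemma, usage_note(tags))
```

Lemmas that themselves contain a plus sign would need a more careful split; the sketch ignores that case.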

The +Use/Hyp tag is used in continuation lexicons, as in these two entries.

regular_declinations.lexc:

+Sg+Ill%+Use%/Hyp:i%{WF%}%+sse GI ;    ! Fiatisse

+Sg+Ill%+Use%/Hyp:%{W%}%+sse GI ;          ! kotisse

To find out how many word forms in the lexicon carry the +Use/Hyp tag:

$ echo '$["+Use/Hyp"]' | hfst-regexp2fst | hfst-compose -F -2 src/generator-gt-desc.hfst | hfst-fst2strings | wc -l
35,404

tf-hsl-m0016:est ttr000$ cat ~/big/langs/est/corp/vaalit2012.txt | preprocess | wc -l
     976
tf-hsl-m0016:est ttr000$ cat ~/big/langs/est/corp/vaalit2012.txt | preprocess | lookup -q src/analyser-gt-desc.xfst | grep '?' | wc -l
     305
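
Read together, the two counts above give a rough coverage figure for the analyser on this text. The sketch below does the arithmetic. It assumes that preprocess emits one token per line and that every line matching '?' is an unanalysed token. Both are assumptions (a token that is a literal question mark would also match), and the function name is hypothetical.

```python
def coverage(total_tokens, unanalysed):
    """Return the share of tokens that received at least one analysis."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return (total_tokens - unanalysed) / total_tokens


# Counts taken from the two wc -l results above.
share = coverage(976, 305)
print(f"{share:.2%} analysed, {1 - share:.2%} unanalysed")
```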

How to do compounding?

Suggestion from Trond:

What tools are in our toolbox? ... and are these Filosoft questions as well?

Where may or must I be in the compound string? (+CmpNP/First, etc.)
What demands do I put on my neighbours? (+CmpN/SgNomLeft; e.g. I demand that my last part is +Hum)
If I participate as a non-final part of a compound, what shape may I take? (+CmpN/SgG)

Encoding error: since +CmpNP/Last requires bájáš to be last, the tags governing its behaviour as a first part (+CmpN/SgN, +CmpN/SgG, +CmpN/PlG) never take effect:

bájáš +CmpN/SgN +CmpN/SgG +CmpN/PlG +CmpNP/Last +Sem/Plc
----------------------
bájáž LEXDIMINC ;

/lang/sme/root-morphology.html

This entry / word should be in the following position(s):

+CmpNP/All - ... in all positions, default, this tag does not have to be written

+CmpNP/First - ... only be first part in a compound or alone

+CmpNP/Pref - ... only first part in a compound, NEVER alone

+CmpNP/Last - ... only be last part in a compound or alone

+CmpNP/Suff - ... only last part in a compound, NEVER alone

+CmpNP/None - ... does not take part in compounds

+CmpNP/Only - ... only be part of a compound, i.e. can never be used alone, but can appear in any position
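
The position tags listed above can be read as sets of allowed positions. The following is a minimal sketch of that reading, not the actual FST implementation: the table, the function names and the "middle" position (for compounds of three or more parts) are assumptions made for illustration.

```python
# Allowed positions per tag, following the descriptions above.
# "middle" is an assumption for compounds with three or more parts.
ALLOWED_POSITIONS = {
    "+CmpNP/All": {"alone", "first", "middle", "last"},
    "+CmpNP/First": {"alone", "first"},
    "+CmpNP/Pref": {"first"},
    "+CmpNP/Last": {"alone", "last"},
    "+CmpNP/Suff": {"last"},
    "+CmpNP/None": {"alone"},
    "+CmpNP/Only": {"first", "middle", "last"},
}


def position_of(index, length):
    """Name the position of part number `index` in a string of `length` parts."""
    if length == 1:
        return "alone"
    if index == 0:
        return "first"
    if index == length - 1:
        return "last"
    return "middle"


def compound_ok(position_tags):
    """Check a sequence of position tags, one per part, left to right."""
    n = len(position_tags)
    return all(
        position_of(i, n) in ALLOWED_POSITIONS[tag]
        for i, tag in enumerate(position_tags)
    )


print(compound_ok(["+CmpNP/Pref", "+CmpNP/Suff"]))  # a prefix-only part followed by a suffix-only part
print(compound_ok(["+CmpNP/Last", "+CmpNP/All"]))   # a last-only part placed first
```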

The tagged part of the compound should make a compound using:

+CmpN/SgN Singular Nominative

+CmpN/SgG Singular Genitive

+CmpN/PlG Plural Genitive

The second part of the compound may require that the previous (left part) is:

+CmpN/SgNomLeft Singular Nominative

+CmpN/SgGenLeft Singular Genitive

+CmpN/PlGenLeft Plural Genitive
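
One way to picture how the left-demand tags interact with the form tags is a check at each compound boundary. This is only a sketch: the pairing of each ...Left tag with a form tag is my reading of the two lists above, treating several demands on the same right part as alternatives is an assumption, and the names are hypothetical.

```python
# Assumed pairing of each left-demand tag with the form tag it asks for.
LEFT_DEMAND = {
    "+CmpN/SgNomLeft": "+CmpN/SgN",
    "+CmpN/SgGenLeft": "+CmpN/SgG",
    "+CmpN/PlGenLeft": "+CmpN/PlG",
}


def junction_ok(left_form_tag, right_tags):
    """Check one compound boundary.

    left_form_tag: the form the left part actually appears in.
    right_tags: all tags on the right part; only left demands are inspected.
    """
    demands = {LEFT_DEMAND[t] for t in right_tags if t in LEFT_DEMAND}
    # No demand means any left form is accepted (assumption).
    return not demands or left_form_tag in demands


print(junction_ok("+CmpN/SgG", ["+CmpN/SgGenLeft"]))
print(junction_ok("+CmpN/SgN", ["+CmpN/SgGenLeft"]))
```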

We convert tags to flags in src/filters/convert_to_flags-CmpNP-tags.regex, e.g. like this:

  "@U.CmpFrst.TRUE@" <- "+CmpNP/First",
  "@U.CmpPref.TRUE@" <- "+CmpNP/Pref" ,
  "@P.CmpLast.TRUE@" <- "+CmpNP/Last" ,

Then we remove the strings that do not fulfil the demands.

! Convert normative tags to positive reset flags:
 "@P.CmpN.SgN@" <- "+CmpN/SgN" ,
 "@P.CmpN.SgG@" <- "+CmpN/SgG" ,
 "@P.CmpN.PlG@" <- "+CmpN/PlG" ,

Note: in the sme FST, many compounds are lexicalised.
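
To make the tag-to-flag conversions above more concrete, here is a simplified simulation of how flag diacritics filter paths. It covers only the P (positive set), U (unify), R (require), D (disallow) and C (clear) operations in their commonly documented form and leaves out N and negative values. The notes show only U and P flags, so the R and D flags and the FALSE value in the examples are hypothetical, as is the function name.

```python
import re

# Matches symbols such as "@P.CmpN.SgN@" or "@D.CmpLast@".
FLAG = re.compile(r"@([PURDC])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Apply a simplified subset of flag-diacritic semantics to one path."""
    state = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            continue  # ordinary symbol, no effect on the flag state
        op, feat, val = m.groups()
        cur = state.get(feat)
        if op == "P":
            state[feat] = val  # set, overwriting any earlier value
        elif op == "U":
            if cur is not None and cur != val:
                return False  # conflicts with an earlier value
            state[feat] = val
        elif op == "R":
            # With a value: must equal it. Without: must be set at all.
            if (cur is None) if val is None else (cur != val):
                return False
        elif op == "D":
            # With a value: must not equal it. Without: must be unset.
            if (cur is not None) if val is None else (cur == val):
                return False
        elif op == "C":
            state.pop(feat, None)
    return True


# Hypothetical paths: the R and D checks are not taken from the notes.
print(path_is_valid(["@P.CmpN.SgN@", "a", "@R.CmpN.SgN@"]))
print(path_is_valid(["@U.CmpFrst.TRUE@", "@U.CmpFrst.FALSE@"]))
```

The point of the sketch is that a flag set early in a path can block or allow a continuation later in the same path, which is how the position and form demands are enforced without multiplying lexicons.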

TODO

Start working on Estonian compounding.

Estonian Oahpa

Heli has tested the generation of forms with Jaak's FST where the tag +Use/NG is used for marking the marginal parallel forms. This approach is usable for Oahpa.

Next: also test generating the Oahpa database with Heiki's FST.

A student in Tartu is working on creating more question-answer templates for Morfa-C.

Võro Oahpa

Sulev and Heli gave a presentation on Võro Oahpa at Läänemeresoome sügiskonverents in Võru on 28 October. We got positive feedback.

Sulev and Jack presented a poster on the Võro FST at the same conference.

Audio has been added to:

Leksa: pronunciations of the Võro words that are in the Võro-Estonian dictionary synaq.org. See http://oahpa.no/voro/leksa
Morfa-C: the questions are read aloud by means of speech synthesis developed at the Institute of the Estonian Language. At the moment it works correctly in Safari but not in all versions of Firefox. See http://oahpa.no/voro/morfac

MT and CG

OmegaT support

OmegaT support is on its way: it works, but is still a bit unreliable.

http://wiki.apertium.org/wiki/Apertium_OmegaT_Native

The Estonian pairs have been plugged into OmegaT.

Webpage translation support

Paste in text from http://avvir.no and get it in 5 languages, or upload a document. http://gtweb.uit.no/jorgal/

Looking at inverse programs

So far: sme2smX as a production system.
Now: thinking of smX2sme as a gisting system.

sme2fin

We are happy with sme2fin and think it could be reproduced with est2fin. The main showstopper would then be syntax tags.

This could also be replicated fairly easily for sme2est (at least with results similar to those of sme2fin).

Workshop IWCLUL3

January 23rd, 2017, St. Petersburg

One extra week to make a contribution. Works in progress gratefully received!

Next meetings

SamEst MT: 17th November, 11:00 Tromsø

SamEst: 22nd November, 11:30 Tromsø