161108
Samest meeting 8.11.2016
Participants: Fran, Heiki, Heli, Jaak, Kadri, Trond
Agenda
- Estonian FST(s)
- Estonian Oahpa
- Võro Oahpa
- MT and CG
- IWCLUL
- Next meeting
Estonian FST(s)
Heiki's experimental FST is in experimental-langs/est.
The yaml tests pass, meaning that the Filosoft lexicon has been successfully converted to cover simplex word inflection.
Still missing: derivation, compounding, and large-scale tests on real corpora.
However, after today's svn up the documentation does not generate (although the FSTs do).
Parallel forms
$ echo 'siile' | hfst-lookup analyser-gt-desc.hfstol
> siile siil+N+Pl+Par 0,000000
$ echo 'siilisid' | hfst-lookup analyser-gt-desc.hfstol
> siilisid siil+N+Pl+Par+Use/Rare 0,000000
The tags are described in root.lexc as follows:
! paralleelvormide erinevat kasutussagedust iseloomustavad
! usage info for parallel forms (either correct according to the norm, or incorrect)
+Use/Rare ! norm, but rare: puusid:puu+Pl+Par+Use/Rare
+Use/Hyp ! norm, but so rare that norm is probably wrong: tiivasse:tiib+Sg+Ill+Use/Hyp
+Use/NotNorm ! not norm, but sometimes used: pöidlates:pöial+Pl+Ine+Use/NotNorm
+Use/CommonNotNorm ! not norm, and used more than norm: peeneid:peen+Pl+Par+Use/CommonNotNorm
Use/Hyp is used in continuation lexicons
regular_declinations.lexc:
+Sg+Ill%+Use%/Hyp:i%{WF%}%+sse GI ; ! Fiatisse
+Sg+Ill%+Use%/Hyp:%{W%}%+sse GI ; ! kotisse
To find out how many wordforms contain this /Hyp in the lexicon:
echo '$["+Use/Hyp"]' | hfst-regexp2fst | hfst-compose -F -2 src/generator-gt-desc.hfst | hfst-fst2strings | wc -l
35,404
tf-hsl-m0016:est ttr000$ cat ~/big/langs/est/corp/vaalit2012.txt|preprocess |wc -l
976
tf-hsl-m0016:est ttr000$ cat ~/big/langs/est/corp/vaalit2012.txt|preprocess |lookup -q src/analyser-gt-desc.xfst|grep '?'|wc -l
305
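The two corpus counts above give a rough coverage figure for the analyser on vaalit2012.txt. A minimal Python sketch of the arithmetic, assuming that the first count (976) is the number of tokens and the second (305) the number of tokens without an analysis; `coverage` is a hypothetical helper name, and since `grep '?'` also matches literal question marks in the text, the unknown count may be slightly overestimated:

```python
def coverage(total_tokens: int, unknown_tokens: int) -> float:
    """Return the share of tokens that received at least one analysis."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= unknown_tokens <= total_tokens:
        raise ValueError("unknown_tokens must lie between 0 and total_tokens")
    return (total_tokens - unknown_tokens) / total_tokens


if __name__ == "__main__":
    # Counts taken from the terminal session in the notes.
    print(f"{coverage(976, 305):.2%}")  # prints 68.75%
```

Read this way, about two thirds of the tokens are analysed, which fits the remark that derivation and compounding are still missing.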
How to do compounding?
Suggestion from Trond:
What tools are in our toolbox? ... and are these Filosoft questions as well?
- Where may or must I be in the compound string? (+CmpNP/First, etc.)
- What demands do I put on my neighbours (+CmpN/SgNomLeft) <== I demand my last part is +Hum
- If participating as the non-final part of a compound, what shape may I take (+CmpN/SgG)
Encoding error: since +CmpNP/Last requires bájáš to be the last part, the tags governing its behaviour as a first part will never be put into use.
bájáš +CmpN/SgN +CmpN/SgG +CmpN/PlG +CmpNP/Last +Sem/Plc ---------------------- bájáž LEXDIMINC ;
/lang/sme/root-morphology.html
This entry / word should be in the following position(s):
+CmpNP/All - ... in all positions, default, this tag does not have to be written
+CmpNP/First - ... only be first part in a compound or alone
+CmpNP/Pref - ... only first part in a compound, NEVER alone
+CmpNP/Last - ... only be last part in a compound or alone
+CmpNP/Suff - ... only last part in a compound, NEVER alone
+CmpNP/None - ... does not take part in compounds
+CmpNP/Only - ... only be part of a compound, i.e. can never be used alone, but can appear in any position
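The position tags above can be read as a small decision table. A minimal Python sketch of that reading, assuming a compound is described only by the number of its parts and the index of the part in question; `position_allowed` and its parameters are hypothetical names for illustration and are not part of the sme source:

```python
def position_allowed(tag: str, index: int, n_parts: int) -> bool:
    """Check whether a part with the given +CmpNP tag may stand at `index`.

    `index` is 0-based; a word used alone has n_parts == 1.
    """
    if not 0 <= index < n_parts:
        raise ValueError("index must be within the compound")
    is_alone = n_parts == 1
    is_first = index == 0
    is_last = index == n_parts - 1
    rules = {
        "+CmpNP/All": True,                       # any position (default)
        "+CmpNP/First": is_first,                 # first part, or alone
        "+CmpNP/Pref": is_first and not is_alone,  # first part, never alone
        "+CmpNP/Last": is_last,                   # last part, or alone
        "+CmpNP/Suff": is_last and not is_alone,   # last part, never alone
        "+CmpNP/None": is_alone,                  # never in compounds
        "+CmpNP/Only": not is_alone,              # only in compounds
    }
    return rules[tag]


if __name__ == "__main__":
    # An entry tagged +CmpNP/Last cannot be the first part of a two-part compound.
    print(position_allowed("+CmpNP/Last", 0, 2))
    print(position_allowed("+CmpNP/Last", 1, 2))
```

Under this reading, an entry tagged +CmpNP/Last is rejected as a non-final part, which is why the shape tags on bájáš in the encoding-error example never come into play.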
The tagged part of the compound should make a compound using:
+CmpN/SgN Singular Nominative
+CmpN/SgG Singular Genitive
+CmpN/PlG Plural Genitive
The second part of the compound may require that the previous (left part) is:
+CmpN/SgNomLeft Singular Nominative
+CmpN/SgGenLeft Singular Genitive
+CmpN/PlGenLeft Plural Genitive
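The interplay between the shapes a left part offers and the shape a right part demands can be sketched as a set intersection. This is only an illustration in Python under two assumptions that the notes do not state: a right part without any ...Left tag accepts every shape the left part offers, and a left part without shape tags offers nothing. The names `SHAPE_TAGS`, `LEFT_DEMANDS` and `allowed_left_forms` are hypothetical:

```python
from typing import Iterable, Set

# Shapes a non-final part may take, keyed by its own tags.
SHAPE_TAGS = {
    "+CmpN/SgN": "SgNom",
    "+CmpN/SgG": "SgGen",
    "+CmpN/PlG": "PlGen",
}

# Shapes a part demands from its left neighbour.
LEFT_DEMANDS = {
    "+CmpN/SgNomLeft": "SgNom",
    "+CmpN/SgGenLeft": "SgGen",
    "+CmpN/PlGenLeft": "PlGen",
}


def allowed_left_forms(left_tags: Iterable[str], right_tags: Iterable[str]) -> Set[str]:
    """Return the shapes of the left part that both parts agree on."""
    offered = {SHAPE_TAGS[t] for t in left_tags if t in SHAPE_TAGS}
    demanded = {LEFT_DEMANDS[t] for t in right_tags if t in LEFT_DEMANDS}
    # Assumption: no demand on the right means anything offered is acceptable.
    return offered & demanded if demanded else offered


if __name__ == "__main__":
    left = ["+CmpN/SgN", "+CmpN/SgG"]
    right = ["+CmpN/SgGenLeft"]
    print(allowed_left_forms(left, right))
```

An empty result means the two entries cannot be combined under these assumptions.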
We convert tags to flags
"@U.CmpFrst.TRUE@" <- "+CmpNP/First",
"@U.CmpPref.TRUE@" <- "+CmpNP/Pref" ,
"@P.CmpLast.TRUE@" <- "+CmpNP/Last" ,
Then we remove strings not fulfilling the demands
! Convert normative tags to positive reset flags:
"@P.CmpN.SgN@" <- "+CmpN/SgN" ,
"@P.CmpN.SgG@" <- "+CmpN/SgG" ,
"@P.CmpN.PlG@" <- "+CmpN/PlG" ,
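To see how such flags can remove strings, it may help to simulate flag diacritics along a single path. The Python sketch below implements the usual semantics of the P (positive set), U (unify), R (require), D (disallow) and C (clear) operators; N flags are left out. The function name `path_survives` is hypothetical, and the R flags and the FALSE value in the examples are assumptions for illustration: the notes only show the P and U flags.

```python
import re
from typing import Dict, Iterable, Optional

# Matches e.g. "@P.CmpN.SgN@" or "@R.CmpN@": operator, feature, optional value.
FLAG = re.compile(r"^@([PURDC])\.([^.@]+)(?:\.([^.@]+))?@$")


def path_survives(symbols: Iterable[str]) -> bool:
    """Return True if the flag diacritics along one path are compatible."""
    state: Dict[str, Optional[str]] = {}
    for symbol in symbols:
        match = FLAG.match(symbol)
        if match is None:
            continue  # ordinary symbol, no effect on the flag state
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":    # positive (re)set: always succeeds, overwrites
            state[feature] = value
        elif op == "U":  # unify: set if unset, otherwise values must match
            if current is None:
                state[feature] = value
            elif current != value:
                return False
        elif op == "R":  # require: feature must be set (to value, if given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "D":  # disallow: feature must not be set (to value, if given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "C":  # clear the feature
            state.pop(feature, None)
    return True


if __name__ == "__main__":
    # A later P flag overwrites an earlier one, so the R check passes.
    print(path_survives(["@P.CmpN.SgN@", "@P.CmpN.SgG@", "@R.CmpN.SgG@"]))
    # Conflicting U flags on the same feature block the path.
    print(path_survives(["@U.CmpFrst.TRUE@", "@U.CmpFrst.FALSE@"]))
```

The difference between P and U matters here: a P flag always resets the feature, while a U flag fails when the feature already holds another value.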
Note: In sme FST many compounds are lexicalised.
TODO
Start working on Estonian compounding.
Estonian Oahpa
Heli has tested the generation of forms with Jaak's FST where the tag +Use/NG is used for marking the marginal parallel forms. This approach is usable for Oahpa.
Next: also test generating the Oahpa database with Heiki's FST.
A student in Tartu is working on creating more question-answer templates for Morfa-C.
Võro Oahpa
Sulev and Heli gave a presentation on Võro Oahpa at the Läänemeresoome sügiskonverents in Võru on 28 October. We got positive feedback.
Sulev and Jack presented a poster on the Võro FST at the same conference.
Audio has been added to
- Leksa (pronunciations of the Võro words that are in the Võro-Estonian dictionary synaq.org). See http://oahpa.no/voro/leksa
- Morfa-C (the questions are read aloud by means of speech synthesis developed at the Institute of the Estonian Language). At the moment it works correctly in Safari but not in all versions of Firefox. See http://oahpa.no/voro/morfac
MT and CG
OmegaT support
Support is on its way: it works, but is still a bit unreliable.
http://wiki.apertium.org/wiki/Apertium_OmegaT_Native
Estonian pairs have been plugged in OmegaT.
Webpage translation support
Paste in text from http://avvir.no and get it in 5 languages, or upload a document.
Looking at inverse programs
- So far: sme2smX as a production system
- now thinking of smX2sme as a gisting system
sme2fin
We are happy with sme2fin and think it could be reproduced with est2fin. The main showstopper would then be syntax tags.
This could also be replicated fairly easily for sme2est (at least to similar results as sme2fin).
Workshop IWCLUL3
January 23rd, 2017, St. Petersburg
One extra week to make a contribution. Works in progress gratefully received!
Next meetings
SamEst MT: 17th November, 11:00 Tromsø
SamEst: 22nd November, 11:30 Tromsø