151215
SamEst meeting 15.12.2015
Participants: Heli, Jaska, Jaak, Heiki, Tiina, Trond, Sjur (late due to machine issues)
Agenda
- Status Estonian FST
- Status Finnish FST
- Status MT
- Status Oahpa
- Articles
- Establishing subgroups
Status Estonian FST
Capital letters
In Heli's last e-mail(s) there were specific problems to address.
Priority union
Jaak is looking at HFST priority union, a feature that is needed.
The options are either to rewrite the morphology or to fix the bug. The bug fix may be delegated.
Steps forward
There is a class of other errors:
Testing:
- Lexical coverage (on running text, on frequency list)
- Appropriate analysis for any given form
- Distinguish between non-standard forms, standard parallel forms and preferred unique form
The way to solve this: the tags +Use/NG and +Err/Orth.
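The lexical coverage test listed above could be sketched roughly as follows. This is only an illustration, not an existing script: the function name `coverage`, the toy token list and the `known` set are all assumptions. In practice `known` would be the set of word forms that the FST actually analyses. Token coverage corresponds roughly to coverage on running text, type coverage to coverage on a frequency list.

```python
from collections import Counter


def coverage(tokens, known):
    """Return (token_coverage, type_coverage) as fractions between 0 and 1.

    tokens: list of word forms from running text
    known:  set of word forms that receive at least one analysis
            (hypothetical; in practice derived from FST lookup results)
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0
    covered_tokens = sum(n for form, n in counts.items() if form in known)
    covered_types = sum(1 for form in counts if form in known)
    return covered_tokens / total_tokens, covered_types / len(counts)


# Toy data, invented for the illustration.
tokens = "ma olen siin ja ma olen seal".split()
known = {"ma", "olen", "ja"}
token_cov, type_cov = coverage(tokens, known)
print(f"token coverage: {token_cov:.2f}, type coverage: {type_cov:.2f}")
```

Reporting both numbers separately helps, since a lexicon can cover most running text while still missing many low-frequency types.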
TODO:
- Discuss the bug with the hfst team, and get it solved. https://sourceforge.net/p/hfst/bugs/321/
Status Finnish FST
Script to find double forms: langs/est/devtools/find_parallels.sh
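What "finding double forms" might involve can be sketched as below. This is not the content of find_parallels.sh, which is not reproduced in these notes. It is a hypothetical illustration that assumes the input is a list of (analysis, surface form) pairs, for example from generated paradigms. The tag strings in the toy data are also assumptions.

```python
from collections import defaultdict


def find_parallels(pairs):
    """Group surface forms by analysis string and keep the analyses
    that are realised by more than one distinct surface form."""
    by_analysis = defaultdict(set)
    for analysis, form in pairs:
        by_analysis[analysis].add(form)
    return {
        analysis: sorted(forms)
        for analysis, forms in by_analysis.items()
        if len(forms) > 1
    }


# Toy data; the analysis strings are illustrative only.
pairs = [
    ("maa+N+Sg+Nom", "maa"),
    ("maa+N+Pl+Gen", "maiden"),
    ("maa+N+Pl+Gen", "maitten"),
]
parallels = find_parallels(pairs)
print(parallels)
```

The resulting list could then be reviewed by hand to decide which parallel forms are standard and which form is preferred.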
Status MT
FST issues
- sme2fin - double forms in FST (maiden/maitten)
- fin2sme, fin2est - bad CG
- missing disambiguation
- too many double analyses (multiple POS tags) for the same form in the fin (and est) FST
Open questions
- Two entries with different POS tags in the FST, but only one in the bidix pair: what to do?
Example: aina Adv, aina Pcle
Translate verb categories correctly and systematically.
CG issues
Three fsts:
- GT/D fst (lookup2cg)
- Apertium fst, not pruned by bidix (cg-conv)
- Apertium fst, pruned by bidix (cg-conv)
echo tietokonealalla | hufin | cg-conv
I wanted to approach the disambiguation of Finnish more systematically and use a translated textbook for that, but I don't know what a good solution is for compounds, which are regular rather than exceptional (this also matters when choosing analyses for the gold standard):
"<hyväntuulinen>"
    "tuulinen" A Sg Nom <W:0>
        "hyvä" N Sg Gen Use/Hyphen <W:0>
    "tuulinen" A Sg Nom <W:0>
        "hyvä" N Sg Gen Use/NoHyphens <W:0>
    "hyväntuulinen" A Sg Nom <W:0>
"<tietokonealalla>"
    "ala" N Sg Ade <W:0>
        "kone" N Sg Nom Use/Hyphen <W:0>
            "tieto" N Sg Nom Use/Hyphen <W:0>
    "ala" N Sg Ade <W:0>
        "kone" N Sg Nom Use/NoHyphens <W:0>
            "tieto" N Sg Nom Use/Hyphen <W:0>
    "ala" N Sg Ade <W:0>
        "kone" N Sg Nom Use/NoHyphens <W:0>
            "tieto" N Sg Nom Use/NoHyphens <W:0>
    "ala" N Sg Ade <W:0>
        "tietokone" N Sg Nom Use/Hyphen <W:0>
    "ala" N Sg Ade <W:0>
        "tietokone" N Sg Nom Use/NoHyphens <W:0>
Trond:
echo tietokonealalla | $HLOOKUP $GTHOME/langs/fin/src/analyser-gt-desc.hfstol | cg-conv
"<tietokonealalla>"
    "tietokonealalla"
"<tieto+N+Sg+Nom+Use/Hyphen#kone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade>"
    "tieto+n+sg+nom+use/hyphen#kone+n+sg+nom+use/hyphen#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
"<tietokonealalla>"
    "tietokonealalla"
"<tieto+N+Sg+Nom+Use/Hyphen#kone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade>"
    "tieto+n+sg+nom+use/hyphen#kone+n+sg+nom+use/nohyphens#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
"<tietokonealalla>"
    "tietokonealalla"
"<tieto+N+Sg+Nom+Use/NoHyphens#kone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade>"
    "tieto+n+sg+nom+use/nohyphens#kone+n+sg+nom+use/hyphen#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
"<tietokonealalla>"
    "tietokonealalla"
"<tieto+N+Sg+Nom+Use/NoHyphens#kone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade>"
    "tieto+n+sg+nom+use/nohyphens#kone+n+sg+nom+use/nohyphens#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
"<tietokonealalla>"
    "tietokonealalla"
"<tietokone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade>"
    "tietokone+n+sg+nom+use/hyphen#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
"<tietokonealalla>"
    "tietokonealalla"
"<tietokone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade>"
    "tietokone+n+sg+nom+use/nohyphens#ala+n+sg+ade" <mixed-upper>
"<0,000000>"
    "0,000000"
All of these are analyses of the same form, but it is not possible to predict which of them would be better for MT.
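To get a feeling for how much of this ambiguity comes from the usage tags alone, the six analysis strings from Trond's output above can be compared after stripping the Use/Hyphen and Use/NoHyphens tags. This is only an exploratory sketch and not a solution agreed at the meeting. The function name `collapse` and the idea of ignoring these tags are assumptions made for the illustration.

```python
import re

# Matches the two usage tags that distinguish the otherwise identical analyses.
USAGE_TAG = re.compile(r"\+Use/(?:Hyphen|NoHyphens)")


def collapse(analyses):
    """Strip the usage tags and return the distinct analyses in first-seen order."""
    seen = []
    for analysis in analyses:
        stripped = USAGE_TAG.sub("", analysis)
        if stripped not in seen:
            seen.append(stripped)
    return seen


# The six analyses of "tietokonealalla" copied from the lookup output above.
analyses = [
    "tieto+N+Sg+Nom+Use/Hyphen#kone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade",
    "tieto+N+Sg+Nom+Use/Hyphen#kone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade",
    "tieto+N+Sg+Nom+Use/NoHyphens#kone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade",
    "tieto+N+Sg+Nom+Use/NoHyphens#kone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade",
    "tietokone+N+Sg+Nom+Use/Hyphen#ala+N+Sg+Ade",
    "tietokone+N+Sg+Nom+Use/NoHyphens#ala+N+Sg+Ade",
]
distinct = collapse(analyses)
for analysis in distinct:
    print(analysis)
```

With the usage tags ignored, the six analyses reduce to two segmentations (tieto#kone#ala and tietokone#ala). The choice between those two still has to be made by other means.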
Conclusion for Trond: for MT purposes I have to use as CG input text that has already been analysed with the Apertium FST, otherwise there will be more choices in the input. But the bidix will be constantly changing, and with it the Apertium FST! Luckily, most bidix changes are additions.
Conclusion: two gold standards!
Gold corpus so far:
TODO:
- Look at tag/format issues for the gold corpus
- Two gold corpora (?)
Status Oahpa
Improvement in Morfa-C
- no more repeating exercises within the same set
A point to adopt in all Oahpas:
Student modeling: keep track of the students' input in Leksa.
Improvements in the user interface
- localisation to Estonian, Finnish, Russian (not ready yet)
- book chapters added under the "book" choice in Leksa and Morfa-S
TODO:
- Heli to write to the Oahpa teams about the no-more-repeating improvement
- Heli to write a sketch of a plan for student modeling for Leksa
Võru Oahpa
- Work on semantic sets underway.
- Work on completing the FST.
Articles
(to be looked at during forthcoming meetings)
- Monitor the "before" (implementation) articles
- Move article topics from "after" to "before"
The "before" articles
Contrastive/comparative tag/grammar article
Heiki-Jaan, Trond, Lene, Fran, ...
Heiki-Jaan has rewritten the tag conversion fin-est.
TODO:
- Comment the transfer rules:
- incubator/apertium-fin-est.fin-est.t1x
- apertium-sme-fin.fin-sme.t1x (...)
- apertium-sme-fin.sme-fin.t1x (...)
- Start drafting an article
Look at this, and discuss it with each other by e-mail.
echo 'olisin' | apertium -d . fin-est
Oleksin
An article before the spellchecker is out?
Relevant for Võru
TODO:
- Look at this idea (Note: the deadline of the forthcoming speller)
The "after" articles
The usual MT articles
finest: Comparing SMT (Google), SMT (Europarl), RBMT (Apertium), GF?
Time frame: Do this when these things are solved:
- technical problems related to tags
- CG harmonising
- Some MT work:
- A reasonable bidix
- some work on lexical selection (.lrx) and transfer (.t?x) files
Goal: Look for a deadline before summer.
Then, for the article:
Looking not at BLEU/WER, but at robustness:
- Source (fin): Marilla on kissa, hänellä ei ole koiraa. ("Mari has a cat, she does not have a dog.")
- Google: Marilla on kass , ta ei ole koera .
- Apertium (old): Maril on kass, tal ei ole koera.
Establishing subgroups
- Oahpa: As today
- Võru: As today
- CG: Tiina, Trond, Lene, Fran to discuss
- MT: fin2X, X2fin: Heiki, Trond, Fran, Tiina,
- Infra: Sjur, Fran, ...
Next meeting
5.1.2016 10.00 Estonian time / 9.00 Norwegian time