130812
sme-sma-mt meeting 12.8.2013
Francis, Lene, Trond.
Agenda
- Evaluation
- Plan, overall principles
- Analysis
- Linguistic transfer issues
- Px
- Inflected forms
- Numerals
- Lexical selection
- Px
- ...
- Px
- Generation
Evaluation
The abstract and hence the plan:
- Present sme2sma as a pilot, to show that it is feasible.
Evaluation procedure
- Send text pairs to sma translators: sme2sma and nob.
- Which is quicker: editing sma MT vs translating from scratch
- Method: giving two texts: one to translate and one to edit
- Questions:
- Time the task
- Answer question: How did you like the smaMT text?
- hypothesis: smaMT has a less Norwegian syntax, and this can
There is a similar study evaluating es2pt, giving pt translators
"Using the Apertium Spanish-Brazilian Portuguese machine translation system for localization".
Content:
- 2 articles, each one or two pages
- 3 translators
Plan, overall principles
Content:
- sme: Improve the analysis (syntactic functions...)
- sme-sma texts: pick words, add words
- sme-sma mt-tests: improve the syntax, morphosyntax
- sma: Improve the generation (double forms, ...)
- Worst-case-fix: word1/word2 => word1
- sma and sme: add missing words to fst
- CG-rules for lexical selection
- Improve/finish sme/src/smi-syn.rle (the file is temporarily in sme/src/)
Online:
- https://gtweb.uit.no/mt/
- Update:
- gtweb: /opt/mt/README
Apertium Wiki:
Deadlines:
- Find texts
- Find translators
- 30.8. Send texts to the translators
- 15.9. Receive evaluation from the translators
- 26.9. Conference
Analysis
sme-dis.rle vs. Old-sme-dis.rle
Some syntactic tags are missing. Linda used
Lene will spend a day or two on that.
We do not use dependency.
Evaluate Francis' tag conversion: Analyse the same
Francis to look into that and report differences.
Linguistic issues
Inflected forms
Two ways of translating positive adjectives in the attributive:
- to adjective (attr -> attr)
- to a noun in the genitive
Here are the cases:
1) <e><p><l>adjsme<s n="a"/></l><r>adjsma<s n="a"/></r></p></e>
2) <e><p><l>adjsme<s n="a"/><s n="attr"/></l><r>nsma<s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
3) <e><p><l>advsme<s n="adv"/></l><r>nsma<s n="n"/><s n="pl"/><s n="ine"/></r></p></e>
Numerals
guoktečuođigolbmalogi guokte#čuođi#golbma#logi+Num+Sg+Nom <= change the "#" to "+"?
guoktečuođigolbmalogi+Num+Sg+Nom
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalogi
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
göökte#tjuetie#golme#luhkie+Num+Sg+Nom
^göökte+tjuetie+golme+luhkie<num><sg><nom>$
^göökte$ ^tjuetie$ ^golme$ ^luhkie<num><sg><nom>$
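The last three lines above can be read as a two-step rewrite: first the analyser string is turned into one lexical unit with "+" between the parts, then the unit is split so that only the last part carries the tags. A minimal Python sketch of that reading follows. The function names are hypothetical and the tag lowercasing is an assumption based on the examples in these notes; this is not how the pipeline itself does it.

```python
import re


def giella_to_apertium(analysis: str) -> str:
    """'göökte#tjuetie#golme#luhkie+Num+Sg+Nom' -> '^göökte+...<num><sg><nom>$'.

    Assumption: the lemma is everything before the first '+', and every
    following '+Tag' becomes a lowercased '<tag>'.
    """
    lemma, *tags = analysis.split("+")
    lemma = lemma.replace("#", "+")  # compound boundary: '#' -> '+'
    tag_string = "".join(f"<{tag.lower()}>" for tag in tags)
    return f"^{lemma}{tag_string}$"


def split_numeral(unit: str) -> str:
    """Split '^a+b+c<tags>$' into '^a$ ^b$ ^c<tags>$' (tags stay on the last part)."""
    body = unit[1:-1]  # drop the surrounding '^' and '$'
    match = re.match(r"([^<]+)(.*)", body)
    lemma, tags = match.group(1), match.group(2)
    parts = lemma.split("+")
    units = [f"^{part}$" for part in parts[:-1]]
    units.append(f"^{parts[-1]}{tags}$")
    return " ".join(units)


if __name__ == "__main__":
    joined = giella_to_apertium("göökte#tjuetie#golme#luhkie+Num+Sg+Nom")
    print(joined)
    print(split_numeral(joined))
```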
Lexical selection
.dix:
<e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
<e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>
The default pair is listed in the file:
apertium-sme-sma.sme-sma.lrx:
transfer/bidix
<pron><indef>< <pron><indef><attr>
- Tag differences for the whole paradigm: bidix
- Tag differences for parts of the paradigm: t1x-files
input:
^lávet<n>$ -> ^lávet<n>/aaa<n>/bbb<n>$
^lávet<v>$ -> ^lávet<v>/xxx<v>/yyy<v>$
sed 's/lávet/aaa/g'
sed 's/lávet/yyy/g'
vs.
sed 's/lávet<n>/aaa/g'
sed 's/lávet<v>/yyy/g'
rules:
1. select aaa for lávet ;
2. select yyy for lávet ;
l: á: v: e: t: :select(aaa)
l: á: v: e: t: :select(yyy)
vs.
l: á: v: e: t: <n>: :select(aaa)
l: á: v: e: t: <v>: :select(yyy)
result:
input: ^lávet<v>/xxx<v>/yyy<v>$ ; rules-matched: 1, 2
input: ^lávet<n>/aaa<n>/bbb<n>$ ; rules-matched: 1, 2
which rule is chosen ? 1 or 2 ?
<rule comment="...">
  <match lemma="lávet" tags="n.*"><select lemma="tsietsehthmuerjie" tags="n.*"/></match>
</rule>
<rule>
  <match lemma="lávet" tags="v.tv.*"><select lemma="provhkedh" tags="v.*"/></match>
</rule>
<rule>
  <match lemma="sáhttit" tags="v.*"><select lemma="maehtedh" tags="v.*"/></match>
  <match lemma="leat" tags="v.*"/>
</rule>
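The lemma-only versus lemma-plus-tag comparison above can be made concrete with a toy matcher. The Python sketch below is only a model of the question "which rules match?". It is not the lexical selection module's own algorithm, and the Rule type and function names are hypothetical. With lemma-only rules both rules match both readings of lávet; once the first tag is part of the pattern, each reading matches exactly one rule.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    lemma: str
    first_tag: Optional[str]  # None means: match on the lemma only
    select: str


def parse_unit(unit: str):
    """Split '^lemma<tags>/t1<tags>/t2<tags>$' into (lemma, source tags, targets)."""
    source, *targets = unit.strip("^$").split("/")
    lemma = source.split("<", 1)[0]
    tags = re.findall(r"<([^>]+)>", source)
    return lemma, tags, targets


def matching_rules(unit: str, rules: List[Rule]) -> List[int]:
    """Return the 1-based numbers of all rules whose pattern matches the unit."""
    lemma, tags, _ = parse_unit(unit)
    hits = []
    for number, rule in enumerate(rules, start=1):
        if rule.lemma != lemma:
            continue
        if rule.first_tag is not None and (not tags or tags[0] != rule.first_tag):
            continue
        hits.append(number)
    return hits


# Rules 1 and 2 from the notes, without and with the part-of-speech tag.
LEMMA_ONLY = [Rule("lávet", None, "aaa"), Rule("lávet", None, "yyy")]
WITH_TAGS = [Rule("lávet", "n", "aaa"), Rule("lávet", "v", "yyy")]

if __name__ == "__main__":
    for unit in ("^lávet<v>/xxx<v>/yyy<v>$", "^lávet<n>/aaa<n>/bbb<n>$"):
        print(unit, matching_rules(unit, LEMMA_ONLY), matching_rules(unit, WITH_TAGS))
```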
Compounds
The compound symbol is not the correct one.
Ovttasbargu<n><sgnomcmp><cmp>#šiehtadus<n><sg><nom>
@Ovttasbargu#šiehtadus\<n\>\<sg\>\<nom\><n><sgnomcmp><cmp><n><sg><nom>
We thus want # -> +
Ovttasbargu<n><sgnomcmp><cmp>+šiehtadus<n><sg><nom>
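The wanted mapping can be stated as a small string rewrite. The Python sketch below only illustrates the intended input and output; the function name is hypothetical, and it assumes that a compound boundary is a "#" directly after a closing ">" of a tag. A "#" at the start of a form (the generation-failure marker used elsewhere in these notes) is left alone.

```python
import re


def fix_compound_boundary(analysis: str) -> str:
    """Replace '#' with '+' where it directly follows a tag's closing '>'."""
    return re.sub(r"(?<=>)#", "+", analysis)


if __name__ == "__main__":
    print(fix_compound_boundary("Ovttasbargu<n><sgnomcmp><cmp>#šiehtadus<n><sg><nom>"))
```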
fran@eki:~/source/giellatekno-sma/src$ cat tagsets/Makefile.am | grep echo
echo -e "#\t+" >> $@
ovttasbargošiehtadus
ovttasbargošiehtadus ovttasbargu+N+SgNomCmp+Cmp#šiehtadus+N+Sg+Nom
ovttasbargošiehtadus ovttasbargošiehtadus+N+Sg+Nom
echo "Mis lea ovttasbargošiehtadus." | apertium -d . sme-sma
Mijjeste lea @ovttasbargu#šiehtadus\.
^Ovttasbargu<n><sgnomcmp><cmp>/Ektiebarkoe<n><sgnomcmp><cmp>$ ^šiehtadus<n><sg><nom>/latjkoe<n><sg><nom>$
usma:
ektiebarkoe ektiebarkoe+N+Sg+Nom
ektiebarkoelatjkoe ektiebarkoe+N+SgNomCmp+Cmp#latjkoe+N+Sg+Nom
$ hfst-lookup sme-sma.autogen.hfst
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom> Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>+? inf
Px
px on gen/ill
#gyhtjelasse<n><sg><ill><pxsg3>
#tjidtjie<n><sg><gen><pxsg3>
px on relatives/reflexive
This we want.
px on nouns -> poss. pron?
Trond, Lene
Generation
#sïebredahke<n><sg><ela> TYPO => siebriedahke
#ealjoeh<a><der_lakaan><adv>
#Seamma tïjjen<a>
#sïjhtedh jiehtedh<v><tv><ind><prs><sg3>
buaratjåbpoe/buerebe/bööretjåbpoe
#vedrørende<po> NOB
#onne<a><superl><der_lakaan><adv>
#mubpie<a><ord><attr>
#eejtegh<n><sg><nom>
#jïjtje<pron><refl><gen><pxsg3>
#learohke<n><nomag><pl><ela>
See also #akte<num><sg><gen><der_lágan><a><attr>
A systematic test:
Extract all lemmas from the bidix and check whether they generate.
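The extraction half of this test could look roughly like the Python sketch below. It assumes bidix entries of the `<e><p><l>...</l><r>...</r></p></e>` shape shown in these notes, wrapped here in a made-up `<section>` element so that the sample is one well-formed document; `<i>` entries and other bidix constructs are not handled. The function names are hypothetical. The output is one target-side string per entry, which would still need the full inflection tags added before it can be sent to the generator.

```python
import xml.etree.ElementTree as ET

# Sample entries copied from these notes; the <section> wrapper is an assumption.
SAMPLE_BIDIX = """<section>
<e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
<e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>
<e><p><l>iešguđet<b/>ládje</l><r>joekehtslaakan</r></p></e>
</section>"""


def side_to_unit(side: ET.Element) -> str:
    """Turn an <l> or <r> element into 'lemma<tag1><tag2>'; <b/> becomes a space."""
    lemma = side.text or ""
    tags = ""
    for child in side:
        if child.tag == "b":
            lemma += " " + (child.tail or "")  # text after <b/> continues the lemma
        elif child.tag == "s":
            tags += f"<{child.get('n')}>"
    return lemma + tags


def target_units(bidix_xml: str) -> list:
    """Return the target-side (right-hand) lemma plus tags of every entry."""
    root = ET.fromstring(bidix_xml)
    return [side_to_unit(right) for right in root.iter("r")]


if __name__ == "__main__":
    for unit in target_units(SAMPLE_BIDIX):
        print(unit)
```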
MWE
iešguđet ládje Adv Adv joekehtslaakan
<l>iešguđet<b/>ládje</l>...<r>joekehtslaakan</r>
<l>New<b/>York</l> ... corresponds to space in the sme analyser.