130812

Contents:

Agenda
Evaluation
Plan, overall principles
Analysis
Linguistic issues
Generation

sme-sma-mt meeting 12.8.2013

Francis, Lene, Trond.

Agenda

Evaluation
Plan, overall principles
Analysis
Linguistic transfer issues
- Px
- Inflected forms
- Numerals
- Lexical selection
- ...
Generation

Evaluation

The abstract and hence the plan:

Present sme2sma as a pilot that shows the approach is feasible. Choose a narrow domain.

Evaluation procedure

Send text pairs to sma translators: the sme2sma MT output and the nob text (the sme2sma text perhaps enriched with missing words).
1. Research question: which is quicker, editing sma MT output or translating from scratch?
2. Method: give each translator two texts, one to translate and one to edit. Two texts go to three translators: one nob-only, one nob + smaMT.
Questions:
1. Time the task.
2. Answer the question: How did you like the smaMT text?
3. Hypothesis: the smaMT text has a less Norwegian syntax, and this can be seen as an asset (?)

There is a similar study evaluating es2pt, in which pt translators were given an en original and an es2pt MT text. Here is the paper:

"Using the Apertium Spanish-Brazilian Portuguese machine translation system for localization". François Masselot, Petra Ribiczey (both Autodesk) and Gema Ramírez-Sánchez (Prompsit) Annual Conference of the European Association for Machine Translation in 2010.

Content:

2 articles, each one or two pages
3 translators

Plan, overall principles

Content:

sme: Improve the analysis (syntactic functions...)
sme-sma texts: pick words, add words
sme-sma mt-tests: improve the syntax, morphosyntax
sma: Improve the generation (double forms, ...)
1. Worst-case-fix: word1/word2 => word1
sma and sme: add missing words to fst
CG-rules for lexical selection
Improve/finish sme/src/smi-syn.rle (the file is temporarily in sme/src/)
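
The worst-case fix for double forms (word1/word2 => word1) could be sketched in Python as below. This is only an illustration, assuming the fix is applied to plain generated text where "/" occurs only between variant forms; the function names are hypothetical.

```python
def keep_first_variant(token):
    """Reduce 'word1/word2/...' to 'word1'; tokens without '/' pass through."""
    return token.split("/", 1)[0]


def fix_double_forms(text):
    """Apply the worst-case fix to every whitespace-separated token.

    Note: runs of whitespace are collapsed to single spaces.
    """
    return " ".join(keep_first_variant(token) for token in text.split())


print(fix_double_forms("buaratjåbpoe/buerebe/bööretjåbpoe"))
```

Keeping the first variant is an arbitrary choice; it only guarantees that the output contains a single form.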

Online:

https://gtweb.uit.no/mt/
Update:
- gtweb: /opt/mt/README

Apertium Wiki:

http://wiki.apertium.org/wiki/Talk:North_S%C3%A1mi_and_South_S%C3%A1mi

Deadlines:

Find texts
Find translators
30.8. Send texts to the translators
15.9. Receive evaluation from the translators
26.9. Conference

Analysis

sme-dis.rle vs. Old-sme-dis.rle

Some syntactic tags are missing. Linda used syntactic functions in her rules.

Lene will spend a day or two on that.

We do not use dependency.

Evaluate Francis' tag conversion: analyse the same sme text twice, with identical morphology and identical disambiguation (dis), once with gt tags and once with Francis' converted Apertium tags.

Francis to look into that and report differences.
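
Reporting the differences could be scripted roughly as follows. This is a minimal sketch, assuming the output of each run is saved one sentence per line; `report_differences` and the sample lines are hypothetical and not part of the existing setup.

```python
from itertools import zip_longest


def report_differences(gt_lines, apertium_lines):
    """Return (line_number, gt_line, apertium_line) for every line that differs."""
    differences = []
    # zip_longest pads the shorter run with "" so extra lines are reported too.
    pairs = zip_longest(gt_lines, apertium_lines, fillvalue="")
    for number, (gt_line, ap_line) in enumerate(pairs, start=1):
        if gt_line.strip() != ap_line.strip():
            differences.append((number, gt_line.strip(), ap_line.strip()))
    return differences


# Placeholder lines standing in for the two runs.
gt_run = ["same output", "variant A"]
apertium_run = ["same output", "variant B"]
for number, gt_line, ap_line in report_differences(gt_run, apertium_run):
    print(f"{number}: {gt_line!r} != {ap_line!r}")
```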

Linguistic issues

Inflected forms

Two ways of translating positive adjectives in attributive position:

to an adjective (attr -> attr)
to a noun in the genitive

Here are the cases:

1) <e><p><l>adjsme<s n="a"/></l><r>adjsma<s n="a"/></r></p></e>

2) <e><p><l>adjsme<s n="a"/><s n="attr"/></l><r>nsma<s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
   Examples: estehtalaš vs. estetihken; fágalaš vs. faagen

3) <e><p><l>advsme<s n="adv"/></l><r>nsma<s n="n"/><s n="pl"/><s n="ine"/></r></p></e>
   Example: báikkálaččat vs. byjreskisnie => fst? MT

Numerals

guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom <= change the "#" to "+"?

guoktečuođigolbmalogi+Num+Sg+Nom
guoktečuođigolbmalogi+Num+Sg+Nom    guoktečuođigolbmalogi
guoktečuođigolbmalogi+Num+Sg+Nom    guoktečuođigolbmalohki

guokte#čuođi#golbma#logi+Num+Sg+Nom
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalohki

guokte#čuođi#golbma#logi+Num+Sg+Nom
göökte#tjuetie#golme#luhkie+Num+Sg+Nom

^göökte+tjuetie+golme+luhkie<num><sg><nom>$
^göökte$ ^tjuetie$ ^golme$ ^luhkie<num><sg><nom>$
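
The last step above, splitting one numeral unit into separate lexical units with the tags kept on the final part, can be sketched in Python. This is an illustration of the string transformation only, not of how the transfer rules do it; `split_numeral_unit` is a hypothetical name.

```python
import re

# One lexical unit in the Apertium stream format: ^ ... $
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def split_numeral_unit(unit):
    """Split '^a+b+c<tags>$' into '^a$ ^b$ ^c<tags>$'."""
    match = LEXICAL_UNIT.fullmatch(unit)
    if match is None:
        raise ValueError(f"not a lexical unit: {unit!r}")
    body = match.group(1)
    tag_start = body.find("<")
    if tag_start == -1:
        lemma, tags = body, ""
    else:
        lemma, tags = body[:tag_start], body[tag_start:]
    parts = lemma.split("+")
    parts[-1] += tags  # only the last part carries the tags
    return " ".join(f"^{part}$" for part in parts)


print(split_numeral_unit("^göökte+tjuetie+golme+luhkie<num><sg><nom>$"))
```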

Lexical selection

.dix:

    <e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
    <e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>

The default pair is listed in the file:

apertium-sme-sma.sme-sma.lrx:

transfer/bidix

<pron><indef><
<pron><indef><attr>

Tag differences for the whole paradigm: bidix
Tag differences for parts of the paradigm: t1x-files

input:

^lávet<n>$ -> ^lávet<n>/aaa<n>/bbb<n>$
^lávet<v>$ -> ^lávet<v>/xxx<v>/yyy<v>$ 

sed 's/lávet/aaa/g'
sed 's/lávet/yyy/g'

vs.

sed 's/lávet<n>/aaa/g'
sed 's/lávet<v>/yyy/g'

rules:

  1. select aaa for lávet ;
  2. select yyy for lávet ;
  
  l: á: v: e: t: :select(aaa)
  l: á: v: e: t: :select(yyy)
  
  vs. 
  
  l: á: v: e: t: <n>: :select(aaa)
  l: á: v: e: t: <v>: :select(yyy)
  
result:

 input: ^lávet<v>/xxx<v>/yyy<v>$ ; rules-matched: 1, 2
 input: ^lávet<n>/aaa<n>/bbb<n>$ ; rules-matched: 1, 2
 
 Which rule is chosen: 1 or 2?

  <rule comment="...">
    <match lemma="lávet" tags="n.*"><select lemma="tsietsehthmuerjie" tags="n.*"/></match>
  </rule>

  <rule>
    <match lemma="lávet" tags="v.tv.*"><select lemma="provhkedh" tags="v.*"/></match>
  </rule>

  <rule>
    <match lemma="sáhttit" tags="v.*"><select lemma="maehtedh" tags="v.*"/></match>
    <match lemma="leat" tags="v.*"/>
  </rule>
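
The ambiguity discussed above disappears once a rule is keyed on the tag as well as the lemma, as the .lrx rules are. The Python sketch below only illustrates that matching logic; it is not how the lexical selection module is implemented, and `matching_rules` and the rule tuples are hypothetical.

```python
def matching_rules(lemma, tags, rules):
    """Return the numbers of all rules whose pattern fits the source reading."""
    matched = []
    for number, (rule_lemma, rule_tag, _target) in enumerate(rules, start=1):
        if rule_lemma != lemma:
            continue
        # rule_tag None means the rule looks at the lemma only.
        if rule_tag is None or (tags and tags[0] == rule_tag):
            matched.append(number)
    return matched


# (source lemma, first tag or None, selected translation)
lemma_only = [("lávet", None, "aaa"), ("lávet", None, "yyy")]
lemma_and_tag = [("lávet", "n", "aaa"), ("lávet", "v", "yyy")]

print(matching_rules("lávet", ["v"], lemma_only))     # [1, 2]
print(matching_rules("lávet", ["v"], lemma_and_tag))  # [2]
print(matching_rules("lávet", ["n"], lemma_and_tag))  # [1]
```

With lemma-only patterns both rules match either reading, so the choice between them is undefined; with the tag in the pattern exactly one rule matches each reading.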

Compounds

The compound boundary symbol in the analysis is not the one Apertium expects:

Ovttasbargu<n><sgnomcmp><cmp>#šiehtadus<n><sg><nom> 
@Ovttasbargu#šiehtadus\<n\>\<sg\>\<nom\><n><sgnomcmp><cmp><n><sg><nom>

We thus want # -> +

Ovttasbargu<n><sgnomcmp><cmp>+šiehtadus<n><sg><nom>
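
The target string can be illustrated with a small Python sketch. It assumes that gt tags map to Apertium tags simply by lowercasing and wrapping in angle brackets, which holds for the examples in these notes but may not hold for every tag; `gt_to_apertium` is a hypothetical helper, not part of the tagsets Makefile.

```python
def gt_to_apertium(analysis):
    """Convert 'lemma+N+Sg+Nom' to 'lemma<n><sg><nom>'.

    Compound parts separated by '#' are converted one by one and
    joined with '+', the boundary symbol wanted on the Apertium side.
    """
    converted_parts = []
    for part in analysis.split("#"):
        lemma, _, tag_string = part.partition("+")
        # Assumption: every gt tag is lowercased and wrapped in <...>.
        tags = "".join(f"<{tag.lower()}>" for tag in tag_string.split("+") if tag)
        converted_parts.append(lemma + tags)
    return "+".join(converted_parts)


print(gt_to_apertium("ovttasbargu+N+SgNomCmp+Cmp#šiehtadus+N+Sg+Nom"))
```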

fran@eki:~/source/giellatekno-sma/src$ cat tagsets/Makefile.am | grep echo
    echo -e "#\t+" >> $@

ovttasbargošiehtadus
ovttasbargošiehtadus    ovttasbargu+N+SgNomCmp+Cmp#šiehtadus+N+Sg+Nom
ovttasbargošiehtadus    ovttasbargošiehtadus+N+Sg+Nom

echo "Mis lea ovttasbargošiehtadus." | apertium -d . sme-sma
Mijjeste lea @ovttasbargu#šiehtadus\.

^Ovttasbargu<n><sgnomcmp><cmp>/Ektiebarkoe<n><sgnomcmp><cmp>$ ^šiehtadus<n><sg><nom>/latjkoe<n><sg><nom>$

usma:
ektiebarkoe ektiebarkoe+N+Sg+Nom
ektiebarkoelatjkoe  ektiebarkoe+N+SgNomCmp+Cmp#latjkoe+N+Sg+Nom


$ hfst-lookup sme-sma.autogen.hfst 
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>   Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>+? inf

Px

px on gen/ill

 #gyhtjelasse<n><sg><ill><pxsg3>
 #tjidtjie<n><sg><gen><pxsg3>

px on relatives/reflexive

This we want.

px on nouns -> poss. pron?

Trond, Lene

Generation

 #sïebredahke<n><sg><ela> TYPO  => siebriedahke
 #ealjoeh<a><der_lakaan><adv>
 #Seamma tïjjen<a>
 #sïjhtedh jiehtedh<v><tv><ind><prs><sg3>
 buaratjåbpoe/buerebe/bööretjåbpoe
 #vedrørende<po>  NOB
 #onne<a><superl><der_lakaan><adv> 
 #mubpie<a><ord><attr> 
 #eejtegh<n><sg><nom>
 #jïjtje<pron><refl><gen><pxsg3>
 #learohke<n><nomag><pl><ela>

See also #akte<num><sg><gen><der_lágan><a><attr>

A systematic test:

Extract all lemmas from the bidix and check whether they generate. Make a frequency list of the words that do not generate.
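
The two halves of such a test could look roughly like this in Python. It is a sketch under assumptions: the bidix entries have the <e><p><l>...</l><r>...</r></p></e> shape shown under Lexical selection, and forms that fail to generate are marked with a leading "#" followed by tags, as in the list above. The function names and the sample strings are illustrative only.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def right_side_lemmas(bidix_xml):
    """Collect the target-side lemma text of every <r> element."""
    root = ET.fromstring(bidix_xml)
    lemmas = []
    for right in root.iter("r"):
        # .text is the text before the first child element (<s .../>);
        # lemmas containing child elements inside the text are cut short.
        lemmas.append((right.text or "").strip())
    return lemmas


# '#lemma<tag><tag>...' : a form the generator could not produce.
UNGENERATED = re.compile(r"#([^<#\n]+)((?:<[^>]+>)+)")


def ungenerated_frequencies(mt_output):
    """Count each '#lemma<tags>' item found in the MT output."""
    return Counter(lemma + tags for lemma, tags in UNGENERATED.findall(mt_output))


sample_bidix = """<section>
  <e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
  <e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>
</section>"""
print(right_side_lemmas(sample_bidix))

sample_output = "#eejtegh<n><sg><nom> #mubpie<a><ord><attr> #eejtegh<n><sg><nom>"
print(ungenerated_frequencies(sample_output).most_common())
```

Feeding each extracted lemma through the generator with its tags is left out here, since that step depends on the local setup.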

MWE

iešguđet ládje Adv Adv joekehtslaakan

<l>iešguđetládje</l>...<r>joekehtslaakan</r>

<l>NewYork</l> ... corresponds to space in the sme analyser.