160304

Contents:

Samest meeting 4.3. 2016
Participants
Agenda
Short updates
Project status vs project plan
- The goal of the project
- FST
- MT
- ICALL
Next meeting

Samest meeting 4.3. 2016

Participants

Heiki, Heli, Jaak, Jack, Sjur, Trond

Agenda

Short updates
- est FST
- fin - est MT
- CG group
Project status vs project plan

Short updates

est FST

Tiina has been active: she has added words and changed categories. These changes are a problem for MT. The issue has been discussed, but no conclusion has been reached so far.

The new categories are more giella-like.

The tagging of negation is different, e.g.

+Neg -> ConNeg
0 -> ConNeg
+Adp -> +Pr, +Po

ei: +Neg
ole: +V > +V+ConNeg
oleks = Pers Prs Cond
oleks: +Neg > +ConNeg

current tags:

ole = Pers Prs Imprt Sg2
ole = Pers Prs Ind Neg
oleks = Pers Prs Cond
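
Because tag changes in the Estonian FST affect MT, it may help to list the tags that occur in FST output but are not yet known downstream. The following is a minimal Python sketch under stated assumptions: analyses are taken to be strings of the form lemma+Tag+Tag, and the function name, the sample analyses and the set of known tags are hypothetical, not taken from the actual tag conversion program.

```python
def unknown_tags(analyses, known_tags):
    """Return the sorted tags found in analyses that are not in known_tags.

    Assumes each analysis looks like "lemma+Tag+Tag" (hypothetical format).
    """
    seen = set()
    for analysis in analyses:
        # The first field is taken to be the lemma; the rest are tags.
        _lemma, *tags = analysis.split("+")
        seen.update(tags)
    return sorted(seen - set(known_tags))


# Hypothetical sample: the MT side knows Neg but not yet ConNeg.
sample = ["ei+V+Neg", "ole+V+ConNeg"]
print(unknown_tags(sample, {"V", "Neg"}))
```

Running such a check after each FST update would show which new tags need a decision before they reach the MT pipeline.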

Tiina to join the FST group, and discussions.

Fin-est MT

echo "Minä en ollut tullut."|apertium -d . fin-est
Ma ei olnud saanud@.@

Troubleshooting check list for the MT process:

compilers

hfst
vislcg3 (cg-proc)
apertium
lttoolbox

Giella content:

fin fst
fin cg
est fst

Giella to apertium process

tag conversion program

Apertium content:

bidix .dix
lexical selection .lrx
transfer .t?x

translation process

fin
morphological analysis
disambiguation

MT process

lexical transfer & selection
structural transfer
generation

Pitfalls:

new versions of all the compilers? The latest versions of all required tools are available as nightly prebuilt binaries by running the shell script http://apertium.projectjj.com/osx/install-nightly.sh (OS X only)
new versions of the tags?
new routines for the tag conversion giella - apertium?
new linguistic content in any of the source files?
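
The first item of the check list (are the compilers installed at all?) can be sketched in Python as follows. The executable names are an assumption about which commands represent hfst, vislcg3, apertium and lttoolbox on a given machine, and the sketch only tests presence on PATH, not whether the versions are recent.

```python
import shutil

# Executable names assumed to represent the tools in the check list.
REQUIRED_TOOLS = ["hfst-proc", "vislcg3", "cg-proc", "apertium", "lt-proc"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so that the lookup can be replaced in tests; in normal use the default is enough.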

Examples

Ehdokkahien/Ehdokkahitten/Ehdokkaiden/Ehdokkaitten *koondnimekiri #laatia #27.9.2012.

(Showing the Finnish FST generation problem)
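
A minimal Python sketch for spotting this kind of problem in translated text, assuming that slash-separated alternatives in the output (as in the example above) mean that the generator returned more than one form. The function name is hypothetical.

```python
def ambiguous_tokens(text):
    """Return the alternatives of every token that contains a slash."""
    result = []
    for token in text.split():
        # Drop sentence punctuation attached to the token.
        token = token.strip(".,;:!?")
        if "/" in token:
            result.append(token.split("/"))
    return result


line = ("Ehdokkahien/Ehdokkahitten/Ehdokkaiden/Ehdokkaitten "
        "*koondnimekiri #laatia #27.9.2012.")
for alternatives in ambiguous_tokens(line):
    print(len(alternatives), alternatives)
```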

Kui su võimed liikuda või toimida on piirnenud selles/ses #kogus, et ei *pääse ilma õigustamatuid raskuseid/raskusi *äänestyspaikalle, saad *äänestää kodus *ennakkoäänestysaikana.

(and the same for Estonian)

Jos kykysi liikkua tai toimia on rajoittunut siinä määrin, että et pääse ilman kohtuuttomia vaikeuksia äänestyspaikalle, voit äänestää kotona ennakkoäänestysaikana.
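
The marked words in the examples can be collected automatically. By the usual Apertium convention, * marks a word the analyser did not recognise, @ a lemma missing from the bilingual dictionary, and # a form that could not be generated; that the output above follows this convention exactly is an assumption. The Python sketch below only groups words by their leading marker, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# A marker counts only at the start of a token (after whitespace or at line start).
MARKED = re.compile(r"(?<!\S)([*#@])(\w+)")


def marked_tokens(text):
    """Group the words in text by their leading marker (*, # or @)."""
    found = defaultdict(list)
    for marker, word in MARKED.findall(text):
        found[marker].append(word)
    return dict(found)


print(marked_tokens("saad *äänestää kodus #kogus, et"))
```

Trailing markers, as in the earlier output saanud@.@, are not covered by this pattern.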

Neg tags and lacunas in bidix:

    <e><p><l>päästää<s n="vblex"/></l><r>vabastama<s n="vblex"/></r></p></e>
    <e><p><l>päästää<s n="vblex"/></l><r>laskma<s n="vblex"/></r></p></e>

Problem: The verb päästä is missing from the bidix; only päästää is there.
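
A lacuna of this kind can be checked mechanically. The Python sketch below parses the two entries above and tests whether a lemma occurs on the left-hand side. The wrapping section element is added here only to make the fragment well-formed, and the function name is hypothetical; a real bidix also has entries with further markup inside the l element, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

BIDIX_FRAGMENT = """<section>
  <e><p><l>päästää<s n="vblex"/></l><r>vabastama<s n="vblex"/></r></p></e>
  <e><p><l>päästää<s n="vblex"/></l><r>laskma<s n="vblex"/></r></p></e>
</section>"""


def left_lemmas(xml_text):
    """Return the set of left-hand-side lemmas in a bidix fragment."""
    root = ET.fromstring(xml_text)
    # l.text is the text before the first child element, i.e. the lemma.
    return {l.text for l in root.iter("l") if l.text}


lemmas = left_lemmas(BIDIX_FRAGMENT)
for verb in ("päästä", "päästää"):
    print(verb, "found" if verb in lemmas else "MISSING")
```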

Reference text in the svn repository: biggies/trunk/langs/LANG/corp/vaalit2012.txt

(or google Finnish government vaalit 2012 brochure)

The #hfst IRC channel has a lot of relevant people; asking there might give good first pointers. The same goes for #apertium. Use https://webchat.freenode.net/ unless you have a desktop client.

CG group

Meeting still forthcoming.

Project status vs project plan

The goal of the project

From the plan:

The goal of the project is twofold: We will provide computational models for Estonian and Võru, and we will put them into use in two types of applications: Machine translation (MT) and interactive computer-assisted language learning (iCALL).

For Estonian, the computational models will take into account some recent findings about its regularities (described in (Kaalep 2012)), in addition to the previous knowledge (Uibo 2005).

For Võru, the models will be based on a comprehensive description (Iva 2007) and a text corpus (see “Related projects” section).

For MT we intend to make two modular systems, Finnish-North Saami and Finnish-Estonian, and for iCALL we will implement a system for Estonian for Russian speakers and one for Võru for Estonian speakers.

FST

From the plan:

Finite state transducers are available for both North Saami and Finnish, and partly so for Estonian and Võru. For Estonian, the availability of the full FST is still an open question; in the worst case we will have to add missing components in order to achieve an openly available morphological transducer. For Võru, work is underway, and there is a theoretical outline of a transducer (Iva 2012?); completing it will be an important part of this project.

Goals:

To improve the quality of the Estonian FST according to the requirements of the applications, MT and ICALL.
(maximal) To bring the Võru FST to a quality that is good enough for use in Oahpa.

MT

From the plan:

For comparison and evaluation purposes, we will also set up a statistical phrase-based Estonian-Finnish machine translation system, based on the open-source software Moses. We will rely on the experience of the University of Tartu on similar systems (Estonian - English http://masintolge.ut.ee and Estonian - French http://masintolge.ut.ee/fr) and on the large parallel corpora freely available at the moment (e.g. Europarl, OPUS, JRC DGT). We prefer this in-house system to Google Translate because we can look at its inner workings (e.g. the phrase translation table) and thus make more meaningful comparisons.

From the report:

Finnish-Estonian statistical MT was set up in 2014 already (http://masintolge.ut.ee/et-fi/).

From the plan:

For the MT part of the project, we will build upon an existing Finnish FST (Pirinen 2011). We will also need a good constraint grammar. Here, our starting point is the grammar presented in Karlsson 1990; we have converted it to a more recent CG compiler format, and it already forms part of an alpha version of a Finnish - North Saami MT system. More work is needed in order to fine-tune it to the Finnish FST (it was originally written for another FST), but it provides a good basis for the analysis part of the MT systems. As for lexical and structural transfer, we will base our work upon the MT framework provided by the Apertium platform (Forçada et al 2012). For the transfer lexica we will for Finnish-Estonian partly use open bilingual resources, and partly extract translation pairs from available parallel corpora. For Finnish-North Saami we have a basic transfer lexicon already; this too will have to be enriched with translation pairs from parallel corpora.

Evaluation

The Finnish FST was less MT-optimal than we were aware of; more work is needed to ensure that it generates a single output form.
The Finnish CG is not yet fine-tuned to the Finnish FST; this work needs attention.
Lexical and structural transfer: More will be revealed here as more basic problems are solved.

Intermediate goals:

Finnish FST generates only one form for each lemma+analysis
Estonian FST generates only one form for each lemma+analysis
Finnish CG is tuned to the Finnish FST
Goal for WER value: ?? 20%?
Finnish - North Saami tag match
Finnish - North Saami bidix
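
For the WER goal, a minimal Python sketch of the standard definition may be useful: the word-level edit distance between the MT output and a reference translation, divided by the number of words in the reference. The function name and the placeholder sentences are hypothetical; the project's actual evaluation setup is not described in these notes.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences: one wrong word out of five.
print(wer("w1 w2 w3 w4 w5", "w1 w2 w3 x w5"))
```

One substituted word in a five-word reference gives a WER of 20%, the order of magnitude suggested as a goal above.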

Publication goals:

As soon as we have some results...

ICALL

From the plan:

As a part of this project we intend to develop a basic version of an Estonian Oahpa (including the games Leksa, Numra, Morfa-S and Morfa-C) for learning Estonian as L2, with the focus on the learners whose mother tongue is Russian, as this is the biggest group among the learners of Estonian. We also plan to develop an experimental version of Võru Oahpa.

Evaluation and comments

Vasta and Sahka were not mentioned in the plan but we have implemented the first demo of Vasta. We are behind the plan in the following points:

Numra
Morfa-S adjectives, pronouns, numerals; improve the quality of N and V so that Oahpa can be used in practice.
Morfa-C verbs, adjectives
Logging of user data is needed if we want to study the learning process (and write papers about it). Extracting log data is documented, but should be followed up.

Goal:

To make Estonian Oahpa practically usable by the autumn term 2016, so that practical experiments can be carried out and research based on them.

Next meeting

Friday 1. April 9:00 (Norwegian time)