SamEst meeting 30.09.14 in Tartu

Present: Fran, Heiki, Heli, Inari, Jaak, Kadri, Sjur, Tiina, Trond


Finnish analyser

  1. Implement weighted FST (also for Estonian)
  2. Clean up the Finnish CG. The Finnish CG-1 rules we got from Fred have been updated but do not quite work.
  3. Look at the rules never used on a big corpus
  4. Look at the most frequently used rules
  5. Look at the largest remaining homonymies
  6. Fix the tagset ... get rid of inline sets.
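
Points 3 and 4 could be approached by counting rule hits in a trace of the CG run over the corpus. What follows is only a minimal sketch in Python. It assumes the trace marks each applied rule on the reading as TYPE:LINE (e.g. SELECT:120 or REMOVE:87:name), which is how CG-3 tracing looks to us; the older CG-1 rules may need another pattern. The sample lines, rule numbers and the all_rules list are made up for illustration.

```python
import re
from collections import Counter

# Assumption: applied rules show up in the trace as TYPE:LINE or TYPE:LINE:NAME.
RULE_TAG = re.compile(r"\b(SELECT|REMOVE|MAP|ADD|REPLACE|SUBSTITUTE|IFF):(\d+)")

def count_rule_hits(trace_lines):
    """Count how often each rule, keyed by (type, line number), was applied."""
    hits = Counter()
    for line in trace_lines:
        for rule_type, line_no in RULE_TAG.findall(line):
            hits[(rule_type, int(line_no))] += 1
    return hits

def unused_rules(all_rules, hits):
    """Return the rules from all_rules that never occur in the trace."""
    return sorted(rule for rule in all_rules if rule not in hits)

# Hypothetical trace fragment; the analyses and rule numbers are invented.
sample_trace = [
    '"<on>"',
    '\t"olla" V Prs Sg3 SELECT:120',
    ';\t"on" N Sg Nom REMOVE:87:noNoun',
    '"<hauska>"',
    '\t"hauska" A Sg Nom SELECT:120',
]
hits = count_rule_hits(sample_trace)

# Hypothetical inventory of the rules in the grammar file.
all_rules = [("REMOVE", 87), ("SELECT", 120), ("SELECT", 300)]

print("Most used:", hits.most_common(1))
print("Never used:", unused_rules(all_rules, hits))
```

The inventory of all rules would have to be extracted from the grammar file itself; that step is not shown here.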

Tag unification

  1. Of est / fin morphological analysers.
    1. Document differences. This is a possible topic for a scientific article. Heiki-Jaan to look at this, also Tommi, Trond, Sjur, Fran. We would also like to involve grammarians doing contrastive est-fin grammar.

Bilingual lexicon

  1. Fran: Create a list by composing fin-eng and est-eng
    1. Fran to improve the script and provide list
  2. Proofread (40,000 stems?)
    1. Workload: 10000 words in a couple of weeks (Tartu)
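
The composition step can be pictured as follows. This is a minimal sketch in Python, not Fran's script: it assumes both dictionaries are available as mappings from a word to a set of English glosses, and it pairs a Finnish and an Estonian word whenever they share a gloss. The function name and the toy entries (taken from the examples in these notes) are only for illustration.

```python
from collections import defaultdict

def compose_via_pivot(fin_eng, est_eng):
    """Pair Finnish and Estonian words that share at least one English gloss."""
    # Invert est-eng so that we can look up Estonian words by English gloss.
    eng_to_est = defaultdict(set)
    for est, glosses in est_eng.items():
        for eng in glosses:
            eng_to_est[eng].add(est)

    # (fin, est) -> set of English glosses that link the two words.
    pairs = defaultdict(set)
    for fin, glosses in fin_eng.items():
        for eng in glosses:
            for est in eng_to_est.get(eng, ()):
                pairs[(fin, est)].add(eng)
    return pairs

# Toy input, assumed format: word -> set of English glosses.
fin_eng = {"maito": {"milk"}, "nautaeläin": {"bovine"}}
est_eng = {"piim": {"milk"}, "veis": {"bovine"}, "sibulakoor": {"onionskin"}}

pairs = compose_via_pivot(fin_eng, est_eng)
for (fin, est), pivots in sorted(pairs.items()):
    print("\t".join([",".join(sorted(pivots)), fin, est]))
```

Every pair produced this way is only a candidate: a polysemous English gloss will link unrelated words, which is why the proofreading step is needed.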

The next step will then be to take a Finnish frequency list and fill in holes. Relevant domains:

  • Eurospeak, policy statements
  • Regulations
  • Work or travel related
  • Subtitles
  • Software localisation

Transfer component

  1. First stage: based on tagset difference
  2. Second stage: TBD.

Practical stuff

Make sure everyone is set up with the Giellatekno + Apertium infrastructure.


Article http://ec.europa.eu/transparency/regcomitology/index.cfm?do=search.documentdetail&Yq9XhlnCmsmjEbl3DTIoXrCju19xIogcRSPVjQMmGtE= http://ec.europa.eu/transparency/regcomitology/index.cfm?do=search.documentdetail&wRMf8QUM3tZdTALxMadrddO+1LhYfhR2hoY/G4AJtVgxdbQ+AI/X9VTTMRqv00VG

x    bovine    nautaeläin<n>    veis
~    Botaurus    kaulushaikarat<n>    hüüp
x    gerund    gerundi<n>    kesksõna
.    onionskin    läpilyöntipaperi<n>    sibulakoor


What did we promise? What have we done?

Plan for the SamEst project, 2014 - 2016

  • Estonian FST - 2013 - 2015
    • Revise the plamk fst or integrate it in the gt infra
      • Degree of adjustment of fst
      • Revision -- Tag adjustment
    • Goals: Ability to generate Oahpa, MT, Dictionaries
  • Võru FST - 2013 - 2015
    • Oahpa quality: generating the pedagogical lexicon
  • MT

Work underway

Finnish - Saami MT

There is a dictionary of approx. 7,000 words.

piim  milk mait+o
==> piim maito

Do all nodes in the net cluster around the same meaning, e.g. mielki and piim? Yes, if there are many links between them via the family: maito, milk, mjölk, melk.
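
One simple way to make "many links" concrete is to count the words that are directly linked to both candidates. The sketch below is in Python and is only an illustration: it is a plain count of shared neighbours, not the probabilistic inference of the papers listed further down, and the edges are hypothetical dictionary entries built from the words in the example above.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected translation graph from (word_a, word_b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def shared_pivots(graph, source, target):
    """Return the nodes that are directly linked to both source and target."""
    return graph.get(source, set()) & graph.get(target, set())

# Nodes are (language, word) pairs; the edges are assumed, not real entries.
edges = [
    (("est", "piim"), ("eng", "milk")),
    (("est", "piim"), ("nob", "melk")),
    (("fin", "maito"), ("eng", "milk")),
    (("fin", "maito"), ("nob", "melk")),
    (("fin", "maito"), ("swe", "mjölk")),
    (("sme", "mielki"), ("nob", "melk")),
]
graph = build_graph(edges)

est_fin = shared_pivots(graph, ("est", "piim"), ("fin", "maito"))
sme_est = shared_pivots(graph, ("sme", "mielki"), ("est", "piim"))
print(len(est_fin), sorted(est_fin))
print(len(sme_est), sorted(sme_est))
```

A pair supported by several independent pivots (here piim and maito) is a safer candidate than one supported by a single pivot (here mielki and piim).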

sme-nob    77,000
nob-est    20,000
nob-fin   perhaps
est-fin    40,000
est-eng   100,000
eng-fin    80,000
fin-sme   ???,???
sme-eng     8,000 ?!

Mausam, Soderland, S., Etzioni, O., Weld, D. S., Skinner, M., Bilmes, J. (2009) "Compiling a Massive, Multilingual Dictionary via Probabilistic Inference".
Wushouer, M., Lin, D., Ishida, T., Hirayama, K. (2014) "Bilingual Dictionary Induction as an Optimization Problem".

Finnish-Estonian MT

Task order:

  1. start with alignment
  2. start with fst adjustment
  3. start with getting some dictionary, e.g. nob-est, est-nob (additional nicety)

Apertium usage

Install, compile, etc. as described here and in the INSTALL file of apertium-lang1-lang2. Then go to the apertium-fin-est folder (the dot in the command refers to the current folder) and write:

echo "Tämä on hauska" | apertium -d . fin-est


The application said

  • Evaluation of Oahpa by teachers and students - 2015-2016
  • Publication of results at conferences - 2014 - 2016

Setting up Estonian and Võro Oahpa - 2014-1

  • Numra est, vro 2014-1
    • Numra est - does not exist
    • Numra vro - exists, needs proofreading
  • Leksa - 2014-1,2
    • est - synonyms missing, otherwise a good start
    • vro - Sulev has done the groundwork, it should be included in Oahpa
  • Morfa-S - 2014-1,2
    • nouns - initial setup for est, vro, more menus to add
    • todo - the rest of the word classes
  • Morfa-C - 2015
    • Not yet.
  • We will also give Vasta and Sahka a go

The plan for vro oahpa was:

  1. Set up: 2014-1
    Prototypes of Morfa-S, Leksa, Numra: 2014-2
    Morfa-S: fst to generate forms of vocabulary
    Leksa: vocabulary + semantic markup
  2. Use in courses, feedback, adjustment: 2014-3,4 <==

Here we are behind schedule. Later: Morfa-C, further work


  • est: Heli Noor, Piret Toomet, Katrin Jänese
  • vro: Sulev is the teacher + colleagues in Võro Institute + TÜ

Time schedule:

  • 2013-4 Estonian fst (discussion, adjustment), Võro fst: approaching Oahpa quality
  • 2014-1 Set up Oahpa for est, vro
  • 2014-2 Work with Oahpa for est, vro; work with fsts
  • 2014-3 Oahpa in courses: vro, est <==
  • 2014-4 Oahpa in courses: vro, est
  • 2015 It will be planned later
  • 2016 It will be planned later

Resources: people, tasks, time allocated

Revised schedule (time in months)

Task             2014  2015  2016  Persons
Est + fin FST       3     2     2  Jaak, Heli, Heiki, Trond, etc.
Võro FST            3     3     3  Sulev
CG                  0     2     2  Tiina, Kadri
Oahpa L, N, M       4     3     1  Heli, teachers
Oahpa V, S          0     2     3  Heli, teachers
MT fin-est          0     2     2  -
f-e lex. proof.     1     1     0  TBD
MT fin-sme          1     1     1  Same team as fin-est
MT fin CG           0     3     0  -
Total              12    19    14  45+


Publication venues

Journal, conference, workshop... which one(s)?

Research questions / Development descriptions




  • Comparative fin-est tags & grammar


  • Publication presenting the system
    • Publication in Estonian pedagogical publication
    • Keel ja kirjandus, Rakenduslingvistika aastaraamat
  • Publication on usage / learning effect


  • Evaluation paper RBMT vs SMT
  • Usage cases


  • Estonian - Norwegian - Estonian
    • Trond
  • More?

Estonian Oahpa

Sources of input:

  • FST - generating forms for Morfa-S (continuous work by Jaak)
  • tags and paradigms (= complete tag strings) - extract from yaml tests
    grep 'elama+V' "$GTHOME/langs/est/test/src/gt-norm-yamls/V-elama_gt-norm.yaml" | awk '{ print $1 }' | sed -e 's/^.*+V//' -e 's/:$//' | sort -u
  • synonyms for Leksa - electronic dictionaries of synonyms (WordNets)
  • sentence frames for Morfa-C - ideas from the teachers -> students of CL (supervised by Kadri) formalise the ideas
  • user interface, exercise types - teachers
  • adding the word class to the words in the Oahpa lexicon automatically - Heli
  • programming, configuration, etc. - Heli

Plan for the next 6 months:

  • full-scale Morfa-S
  • Morfa-C


Problem: compound analysis

Add weights or classifying tags to the components / non-components of compound words based on a corpus.

positions: first, middle, last

Adverbs in compounding:

  • lexicalise
  • use lists (e.g. -ohtu, -karva, -värvi as final components)
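
As an illustration of the corpus-based weights mentioned above, the sketch below (Python) counts how often each component occurs in first, middle or last position and turns the relative frequency into a negative log weight, so that frequent components get low weights. It assumes the corpus compounds are already segmented into components; the three toy compounds and the function name are only for illustration, and getting the weights into the fst is a separate step that is not shown.

```python
import math
from collections import Counter

def position_weights(segmented_compounds):
    """Map (component, position) to -log of its relative frequency in that position."""
    counts = Counter()
    for parts in segmented_compounds:
        if len(parts) < 2:
            continue  # not a compound, skip
        for i, part in enumerate(parts):
            if i == 0:
                position = "first"
            elif i == len(parts) - 1:
                position = "last"
            else:
                position = "middle"
            counts[(part, position)] += 1

    # Total number of component tokens seen in each position.
    totals = Counter()
    for (_, position), n in counts.items():
        totals[position] += n

    return {key: -math.log(n / totals[key[1]]) for key, n in counts.items()}

# Toy segmented compounds (assumed input format: list of components).
corpus = [["hiire", "karva"], ["kulla", "karva"], ["kulla", "värvi"]]
weights = position_weights(corpus)
for (part, position), weight in sorted(weights.items()):
    print(part, position, round(weight, 3))
```

With real corpus counts some smoothing would be needed for components that are never seen in a given position; here they simply get no weight at all.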

Next meetings

In Tromsø: perhaps in January 2015, before the First International Workshop on Computational Linguistics for Uralic Languages. ??

On Skype: Tuesday, 7th October, 12.00 Norwegian time