140930
SamEst meeting 30.09.14 in Tartu
MT
Finnish analyser
- Implement weighted FST (also for Estonian)
- Clean up Finnish CG. The Finnish CG-1 rules we got from Fred have been updated but do not quite work.
- Look at the rules never used on a big corpus
- Look at the most frequently used rules
- Look at the largest remaining homonymies
- Fix the tagset ... get rid of inline sets.
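The rule-usage checks in the list above could be approached by counting rule identifiers in a traced disambiguation run. The following is a minimal sketch, assuming trace output in the style of `vislcg3 --trace`, where each applied rule leaves a tag such as `SELECT:120` or `REMOVE:45:name` on the reading. The sample text, the rule line numbers, the `all_rules` set and the function name `count_rule_hits` are invented for illustration.

```python
import re
from collections import Counter

# Rule type followed by the rule's line number, e.g. "SELECT:120" or "REMOVE:45:name".
RULE_TAG = re.compile(r"\b(SELECT|REMOVE|IFF|MAP|ADD|REPLACE|SUBSTITUTE):(\d+)")

def count_rule_hits(trace_text):
    """Count how often each (rule type, line number) occurs in traced output."""
    return Counter((kind, int(line)) for kind, line in RULE_TAG.findall(trace_text))

# Invented sample in the assumed trace format; ';' marks a removed reading.
sample = '''"<on>"
\t"olla" V Prs Sg3 SELECT:120
;\t"on" Adv REMOVE:45
"<hauska>"
\t"hauska" A Sg Nom SELECT:120
'''

hits = count_rule_hits(sample)

# Hypothetical: line numbers of every rule in the grammar file.
all_rules = {45, 120, 300}
used = {line for _, line in hits}
never_used = sorted(all_rules - used)

print(hits.most_common())   # most frequently used rules first
print(never_used)           # rules that never fired on this sample
```

On a big corpus the same counts would give both ends of the list: the rules that never fire and the ones that do most of the work.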
Tag unification
- Of est / fin morphological analysers.
- Document differences
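Documenting the differences could start from a plain set comparison of the tags each analyser emits. A minimal sketch, assuming the tag inventories are available as lists; the tags shown are illustrative only and are not the real est / fin inventories, and `tag_differences` is a hypothetical helper name.

```python
def tag_differences(fin_tags, est_tags):
    """Return tags shared by both analysers and tags unique to each."""
    fin, est = set(fin_tags), set(est_tags)
    return {
        "shared": sorted(fin & est),
        "fin_only": sorted(fin - est),
        "est_only": sorted(est - fin),
    }

# Illustrative tag lists only; the real ones would be extracted from the analysers.
fin = ["+N", "+V", "+Sg", "+Pl", "+Nom", "+Ins"]
est = ["+N", "+V", "+Sg", "+Pl", "+Nom", "+Ter"]

diff = tag_differences(fin, est)
for key, tags in diff.items():
    print(key, " ".join(tags))
```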
Bilingual lexicon
- Fran: Create list from composition of fin-eng est-eng
- Fran to improve the script and provide list
- Proofread (40,000 stems?)
- Workload: 10000 words in a couple of weeks (Tartu)
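The composition of fin-eng and est-eng mentioned above can be pictured as a join on the shared English side. This is a minimal sketch of that idea, not Fran's script: the dictionaries are assumed to be loaded as mappings from a word to a set of English translations, the entries are toy examples, and `compose_via_pivot` is a hypothetical name.

```python
from collections import defaultdict

def compose_via_pivot(src_to_pivot, tgt_to_pivot):
    """Pair source and target words that share at least one pivot translation."""
    # Invert the target dictionary: pivot word -> set of target words.
    pivot_to_tgt = defaultdict(set)
    for tgt, pivots in tgt_to_pivot.items():
        for pivot in pivots:
            pivot_to_tgt[pivot].add(tgt)

    pairs = defaultdict(set)
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                pairs[src].add(tgt)
    return {src: sorted(tgts) for src, tgts in pairs.items()}

# Toy entries for illustration.
fin_eng = {"maito": {"milk"}, "kieli": {"language", "tongue"}}
est_eng = {"piim": {"milk"}, "keel": {"language", "tongue", "string"}}

candidates = compose_via_pivot(fin_eng, est_eng)
print(candidates)
```

Polysemous English words produce spurious pairs in such a join, which is why the composed list still needs the proofreading step.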
The next step will then be to take a Finnish frequency list and fill in holes.
- Eurospeak, policy statements
- Regulations
- Work or travel related
- Subtitles
- Software localisation
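Filling in holes from a Finnish frequency list amounts to checking the most frequent lemmas against the bilingual lexicon. A minimal sketch, assuming the frequency list is a sequence of (lemma, count) pairs and the lexicon is a mapping keyed by Finnish lemma; the data and the name `missing_from_lexicon` are made up for illustration.

```python
def missing_from_lexicon(freq_list, lexicon, top_n):
    """Among the top_n most frequent lemmas, return those with no lexicon entry."""
    ranked = sorted(freq_list, key=lambda item: item[1], reverse=True)
    return [lemma for lemma, _ in ranked[:top_n] if lemma not in lexicon]

# Invented counts and a two-entry lexicon.
freq = [("olla", 1000), ("ja", 900), ("talo", 50), ("nautaeläin", 2)]
lexicon = {"olla": "olema", "talo": "maja"}

holes = missing_from_lexicon(freq, lexicon, top_n=3)
print(holes)
```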
Transfer component
- First stage: based on tagset difference
- Second stage: TBD.
Practical stuff
Make sure everyone is set up with the Giellatekno + Apertium infrastructure.
Evaluation
x bovine nautaeläin<n> veis
~ Botaurus kaulushaikarat<n> hüüp
x gerund gerundi<n> kesksõna
. onionskin läpilyöntipaperi<n> sibulakoor
Plans
What did we promise? What have we done?
- Estonian FST - 2013 - 2015
- Revise the plamk fst or integrate it in the gt infra
- Degree of adjustment of fst
- Revision -- Tag adjustment
Goals: Ability to generate Oahpa, MT, Dictionaries
- Võro FST - 2013 - 2015
- Oahpa quality: generating the pedagogical lexicon
- MT
Work underway
Finnish - Saami MT
There is a dictionary of approx. 7,000 words.
piim milk mait+o ==> piim maito
if all nodes in the net cluster around the same
- sme-nob 77,000
- nob-est 20,000
- nob-fin perhaps
- est-fin 40,000
- est-eng 100,000
- eng-fin 80,000
- fin-sme ???,???
- sme-eng 8,000 ?!
Mausam, Soderland, S., Etzioni, O., Weld, D. S., Skinner, M., Bilmes, J. (2009). "Compiling a Massive, Multilingual Dictionary via Probabilistic Inference".
Wushouer, M., Lin, D., Ishida, T., Hirayama, K. (2014). "Bilingual Dictionary Induction as an Optimization Problem".
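With several bilingual dictionaries available, one simple heuristic is to count how many different pivot languages connect a candidate pair. The sketch below illustrates only that heuristic; it is not the method of either paper cited above. The entries are toy examples, and `pivot_support` is a hypothetical name.

```python
from collections import defaultdict

def pivot_support(src_dicts, tgt_dicts):
    """Map each (source, target) candidate to the set of pivot languages linking them.

    src_dicts / tgt_dicts: {pivot_language: {word: set of pivot translations}}
    """
    support = defaultdict(set)
    for lang in src_dicts.keys() & tgt_dicts.keys():
        # Quadratic comparison; fine for a sketch, too slow for full dictionaries.
        for src, src_piv in src_dicts[lang].items():
            for tgt, tgt_piv in tgt_dicts[lang].items():
                if src_piv & tgt_piv:
                    support[(src, tgt)].add(lang)
    return support

# Toy entries, not taken from the dictionaries listed above.
fin = {
    "eng": {"maito": {"milk"}, "kieli": {"language", "tongue", "string"}},
    "nob": {"maito": {"melk"}, "kieli": {"språk", "tunge"}},
}
est = {
    "eng": {"piim": {"milk"}, "keel": {"language", "tongue"}, "nöör": {"string"}},
    "nob": {"piim": {"melk"}, "keel": {"språk", "tunge"}},
}

support = pivot_support(fin, est)
# Pairs backed by more pivot languages come first.
ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for pair, langs in ranked:
    print(pair, sorted(langs))
```

Pairs supported through only one pivot language, such as kieli / nöör here, would be the first candidates for manual checking.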
Finnish-Estonian MT
Task order:
- start with alignment
- start with fst adjustment
- start with getting some dictionary, e.g. nob-est, est-nob (additional nicety)
Apertium usage
Install, compile, etc. as described
echo "Tämä on hauska" | apertium -d . fin-est
Oahpa
The application said
- Evaluation of Oahpa by teachers and students - 2015-2016
- Publication of results at conferences - 2014 - 2016
Setting up Estonian and Võro Oahpa - 2014-1
- Numra est, vro 2014-1
- Numra est - does not exist
- Numra vro - exists, needs proofreading
- Leksa - 2014-1,2
- est - synonyms missing, otherwise a good start
- vro - Sulev has made a ground work, it should be included in Oahpa
- Morfa-S - 2014-1,2
- nouns - initial setup for est, vro, more menus to add
- todo - the rest of the word classes
- Morfa-C - 2015
- Not yet.
- We will also give Vasta and Sahka a go
The plan for vro oahpa was:
- set up 2014-1,
- Use in courses, feedback, adjustment: 2014-3,4 <==
Here we are behind schedule.
Teachers
- est: Heli Noor, Piret Toomet, Katrin Jänese
- vro: Sulev is the teacher + colleagues in Võro Institute + TÜ
Time schedule:
- 2013-4 Estonian fst (discussion, adjustment), Võro fst: approaching Oahpa quality
- 2014-1 Set up Oahpa for est, vro
- 2014-2 Work with Oahpa for est, vro; work with fsts
- 2014-3 Oahpa in courses: vro, est <==
- 2014-4 Oahpa in courses: vro, est
- 2015 It will be planned later
- 2016 It will be planned later
Resources: People, tasks, time allocated
Revised time schedule (months)
Task | 2014 | 2015 | 2016 | Persons |
---|---|---|---|---|
Est + fin FST | 3 | 2 | 2 | Jaak, Heli, Heiki, Trond, etc. |
Võro FST | 3 | 3 | 3 | Sulev |
CG | 0 | 2 | 2 | Tiina, Kadri |
Oahpa L, N, M | 4 | 3 | 1 | Heli, teachers |
Oahpa V, S | 0 | 2 | 3 | Heli, teachers |
MT fin-est | 0 | 2 | 2 | - |
f-e lex. proof. | 1 | 1 | 0 | TBD |
MT fin-sme | 1 | 1 | 1 | Same team as fin-est |
MT fin CG | 0 | 3 | 0 | - |
Total | 12 | 19 | 14 | 45+ |
Publications
Publication venues
- NoDaLiDa 2015
- Virsu symposium, Oulu FU Congress August 2015
Research questions / Development descriptions
Topics
FST+CG(?)
- Comparative fin-est tags & grammar
Oahpa
- Publication presenting the system
- Publication in Estonian pedagogical publication
- Keel ja kirjandus, Rakenduslingvistika aastaraamat
- Publication on usage / learning effect
MT
- Evaluation paper RBMT vs SMT
- Usage cases
E-dictionary
- Estonian - Norwegian - Estonian
- Trond
- More?
Estonian Oahpa
Sources of input:
- FST - generating forms for Morfa-S (continuous work by Jaak)
- tags and paradigms (= complete tag strings) - extract from yaml tests
grep elama+V $GTHOME/langs/est/test/src/gt-norm-yamls/V-elama_gt-norm.yaml | awk '{ print $1 } ' | sed -e 's/^.*+V//' -e 's/:$//' | sort -u
- synonyms for Leksa - electronic dictionaries of synonyms (WordNets)
- user interface, exercise types - teachers
- adding the word class to the words in the Oahpa lexicon automatically - Heli
- programming, configuration, etc. - Heli
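The grep / awk / sed pipeline for extracting tag strings from the yaml tests can also be written in Python, which may be easier to reuse when building the Morfa-S input. A minimal sketch that mirrors the pipeline step by step; the sample lines are invented in the style of the yaml tests, and `tag_strings` is a hypothetical name.

```python
def tag_strings(lines, lemma="elama", pos="+V"):
    """Collect the unique tag strings that follow `pos` in yaml test lines for `lemma`."""
    found = set()
    for line in lines:
        if lemma + pos not in line:
            continue                      # same filter as: grep elama+V
        first = line.split()[0]           # same as: awk '{ print $1 }'
        cut = first.rfind(pos)            # sed's greedy ^.*+V cuts at the last "+V"
        if cut != -1:
            first = first[cut + len(pos):]
        if first.endswith(":"):
            first = first[:-1]            # same as: sed -e 's/:$//'
        found.add(first)
    return sorted(found)                  # same as: sort -u (code point order)

# Invented sample lines, not copied from V-elama_gt-norm.yaml.
sample = [
    "     elama+V+Pers+Prs+Ind+Sg1+Aff: elan",
    "     elama+V+Pers+Prs+Ind+Sg2+Aff: elad",
    "     elama+V+Pers+Prs+Ind+Sg1+Aff: elan",
    "     olema+V+Pers+Prs+Ind+Sg3+Aff: on",
]

paradigm = tag_strings(sample)
print("\n".join(paradigm))
```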
Plan for the next 6 months:
- full-scale Morfa-S
- Morfa-C
FST
positions: first, middle, last
Adverbs in compounding:
- lexicalise
- use lists (e.g. -ohtu, -karva, -värvi as final components)
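The list-based option can be illustrated with plain string matching. This is only a sketch of the idea outside the FST, where the same list would presumably be encoded in the lexicon; the test word and the name `split_final_component` are assumptions.

```python
FINAL_COMPONENTS = ("ohtu", "karva", "värvi")   # the examples from the notes

def split_final_component(word, finals=FINAL_COMPONENTS):
    """Return (first_part, final_component) if word ends in a listed component, else None."""
    for final in finals:
        # Require a non-empty first part so the bare component is not split.
        if word.endswith(final) and len(word) > len(final):
            return word[: -len(final)], final
    return None

print(split_final_component("kuldkarva"))
print(split_final_component("karva"))
```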
Next meetings
In Tromsø: perhaps in January 2015, before the First International Workshop on Computational Linguistics for Uralic Languages. ??
On Skype: Tuesday, 7th October, 12.00 Norwegian time