131211

Contents:

Presentation
Presentation of plamk
Presentation of integration alternatives
Discussion
Tags
Several newinfras:
Conversion plan
Milestones in near future
Next meeting

Meeting on est fst code

Present: Heiki, Heli, Jaak, Neeme, Sjur, Trond.

Agenda:

Presentation
Presentation of plamk
Presentation of integration alternatives
Discussion

Presentation

Of ourselves: done.

Presentation of plamk

Files

Source code:

Lexicon from Eesti Keele Instituut
Two-level rules to handle mostly phonology, plus some orthographic rules.

Separate rules for compounds:

File size:

Here are the 20 or so largest files of the compiled plamk:

2 179 802 saami descriptive analyser (for reference)
177 980 878 eesti.fst
45 534 357 lihtsonad.fst
44 674 879 full-compound.fst
2 954 677 lex_full.txt
2 913 885 lex_tyved.txt
2 107 529 tyvebaas.txt
403 494 lex-av.fst
138 373 lex.fst
76 897 lex_exc.txt
75 864 lex_override_gen.txt
47 355 rules.fst
46 282 form.exc
35 147 COPYING
26 041 lex_main.txt
19 897 eki2lex.pl
16 236 tyvebaas-lisa.txt
15 624 liitsonamask.fst
14 344 lex_extra.txt
11 383 morftrtabel.txt
10 875 rul.txt
7 748 liitsona_full.txt

lex_tyved.txt
 aPla 29; ! (29_V -> ) at: apla, an: abla
 aa+I:aa GI; ! (41_I -> +I) 
 aabe+S:aaPe 06; ! (06_S -> +S) an: aabe, at: aape
 aabits+S:aabits 02_A; ! (02_S -> +S) a0: aabits, b0: aabitsa, b0r: 0
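
A minimal Python sketch of how the sample lines above could be read, assuming every entry has the shape "[upper:]lower continuation ; ! comment" seen in the four lines shown. The names Entry and parse_entry are made up for illustration and are not part of plamk or the GT tools; escaped colons and multi-line entries are not handled.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    upper: str         # analysis side (lemma plus tag), e.g. "aabe+S"
    lower: str         # surface/stem side, e.g. "aaPe"
    continuation: str  # continuation lexicon, e.g. "06"
    comment: str       # text after "!", kept verbatim


# One entry per line: form pair, continuation class, ";", optional "!" comment.
ENTRY_RE = re.compile(
    r"^\s*(?P<pair>\S+)\s+(?P<cont>\S+?)\s*;\s*(?:!\s*(?P<comment>.*))?$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one lexicon line; return None if it is not an entry."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    upper, sep, lower = m.group("pair").partition(":")
    if not sep:
        # No colon: the same string serves as both sides.
        lower = upper
    comment = (m.group("comment") or "").strip()
    return Entry(upper, lower, m.group("cont"), comment)
```

For the third sample line this gives upper "aabe+S", lower "aaPe" and continuation "06"; for the first line, which has no colon, both sides are "aPla".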

Discussion

Code different from the giellatekno code, for sure.

lexc_main, and most of the lexc files, are a way of expressing regular expressions in order to filter out irregularities.

Differences:

stem lexicon is generated from EKI database (?)
some lexc and xfst files stored in github
regular lexicon and exception lexicon separate
if a word is in both, then the exception lexicon overrides the regular one
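
The override behaviour in the last point can be sketched in Python as below. This is only an illustration: it assumes both lexicons have already been read into dictionaries keyed by word, and merge_lexicons is a hypothetical name, not a function in plamk.

```python
def merge_lexicons(regular, exceptions):
    """Combine two word -> entry mappings; exception entries win on conflict."""
    merged = dict(regular)     # copy, so the regular lexicon is left untouched
    merged.update(exceptions)  # entries present in both are overridden here
    return merged
```

A word found only in the regular lexicon keeps its regular entry; a word found in both ends up with the entry from the exception lexicon.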

Presentation of integration alternatives

Full integration and rewrite, with updates done to the GT code. Cut the link to plamk code
Keep different codes, but build conversion scripts. Update in plamk, convert when needed
A hybrid solution, in itself with many nuances, one of which is to have full integration and rewrite of morphology files, but conversion routines for lexicon files.
Encapsulate plamk in the GT infrastructure, so that "this folder is different".
Gradually adapt Plamk to the GT infrastructure (play it safe, that is)

Nicknames:

One-time integration
Continuous integration
Hybrid
No integration
Gradual integration (a safer version of "one-time integration")

Discussion

One-time integration

pro

it's "done" -- no dependencies either way
it will be maintained within GT fully?
Heiki: total and thorough rewrite is both needed and the best solution (it is a dream)
It will give us a common language, a common understanding
It will give bystanders an alternative to the plamk infrastructure

con

development happens in two different places? need to do extra work to synchronize? can be solved using version control systems
takes more time and effort in the beginning, probably harder than making the integration step-by-step
risky: we don't know the consequences of jumping

Continuous integration

pro

just one "master copy"

con

does not fit into GT's infra that well
svn vs git? updating needs to be thought through

Hybrid

This would be a conversion light. The idea is that the core analysis is changed once and for all, and ...

morphophonology is not changed, because it doesn't need to be
the lexicon is what needs to be updated on a regular basis

The lexicon formats for plamk and gt are similar:

akustik+S:akustik 02_U; ! (02_S -> +S) a0: akustik, b0: akustiku, b0r: 0
akustika+S:akustika 01; ! (01_S -> +S) a0: akustika, a0r: 0
akustiline+A:akustili 12_NE-SE-S; ! (12_A -> +A) a0: akustiline, b0: akustilise, c0: akustilis, b0v: akustilisi
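
Since the formats are this close, a lexicon conversion routine could be little more than a tag rewrite on the analysis side. The Python sketch below illustrates that idea. The mapping from +S to +N is an assumed example, not an agreed tag set, and convert_tags is a hypothetical helper name.

```python
import re

# Assumed example mapping only; the real tag correspondences are still to be discussed.
TAG_MAP = {"+S": "+N"}

TAG_RE = re.compile(r"\+[A-Za-z0-9]+")


def convert_tags(line, tag_map=TAG_MAP):
    """Rewrite tags on the upper (analysis) side of one entry line."""
    # Split off the "!" comment first so tags mentioned there are not touched.
    body, bang, comment = line.partition("!")
    upper, colon, rest = body.partition(":")
    if not colon:
        return line  # no upper:lower pair, nothing to rewrite
    upper = TAG_RE.sub(lambda m: tag_map.get(m.group(0), m.group(0)), upper)
    return upper + colon + rest + bang + comment
```

Tags missing from the mapping, such as +A in the third line above, are passed through unchanged, so the mapping can be filled in step by step.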

One hybrid sketch:

LEXICON allstems
ihana adjlex ;  ! these as separate 
talo nounlex ;
nuori hybridlex ;

LEXICON hybridlex
adjlex ;
nounlex ;

LEXICON adjlex
+Comp: ... ;
nominalcase ;

LEXICON nounlex
pxlex ;
nominalcase ;
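
To make the effect of hybridlex concrete, the sketch can be modelled as a small graph in Python and the continuation chains listed per stem. This is an illustration only: the "+Comp: ... ;" entry is elided in the notes and is left out here, the function names are made up, and the code assumes the lexicons contain no cycles.

```python
# Continuation classes copied from the sketch: (form, next lexicon) pairs.
LEXICONS = {
    "allstems": [("ihana", "adjlex"), ("talo", "nounlex"), ("nuori", "hybridlex")],
    "hybridlex": [("", "adjlex"), ("", "nounlex")],
    "adjlex": [("", "nominalcase")],  # "+Comp: ... ;" omitted (elided in the notes)
    "nounlex": [("", "pxlex"), ("", "nominalcase")],
}


def leaf_paths(lexicon, lexicons=LEXICONS, prefix=()):
    """Yield every chain of lexicon names from `lexicon` down to an undefined one."""
    path = prefix + (lexicon,)
    if lexicon not in lexicons:
        yield path  # not defined in the sketch, so treated as an end point
        return
    for _form, target in lexicons[lexicon]:
        yield from leaf_paths(target, lexicons, path)


def routes_for_stem(stem):
    """Yield the continuation chains available to one stem in allstems."""
    for form, target in LEXICONS["allstems"]:
        if form == stem:
            yield from leaf_paths(target)
```

With this model "ihana" gets only the adjective chain, "talo" only the two noun chains, and "nuori" gets all three through hybridlex, without being listed twice in the stem lexicon.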

pro

less work now, maybe also later

con

linguist has to somehow deal with two different systems in parallel

No integration

pro

plamk development will continue; we do not risk losing insights in conversion

con

we miss the integration with end-user tools from gt
we will always have to come up with different solutions for Estonian
The plamk code will remain a dark continent to others than Jaak and Heli (?)
It is unclear whether the resulting fst could be integrated in end-user tools

Gradual integration

pro

we don't have to decide to break the ties before we know the consequences better
allow both systems to develop
safest way
in the end we have a system that is GT-style

con

we don't know if it is possible to rewrite the system so that it is fully GT-style (but maybe we don't need to - it is hard to tell now)

Several newinfras:

langs (all languages with at least one application, or with decent coverage)
startup-langs (our incubator)
experiment-langs (alternative setups, for pedagogical or experimental reasons)
closed-langs (as langs, but with a closed license, not visible online)

Conversion plan

Tag wrapper -- needed anyway (will need discussions)
1. src/scripts/ (if conversion of source files)
2. src/tagsets/ (if done on compiled fst's, like conversion to apertium tags)
phonology
1. est-phon.twolc
morphology
1. tags: root.lex
2. stems: Populate nouns.lexc, verbs.lexc, etc. or: nouns, verbs, hybrids, ...
3. affixes: Populate nouns.lexc, verbs.lexc, etc.

Workflow

Try to adjust filenames of plamk and maybe the build system so that the conversion would be more natural (more obvious?)

Make a folder: est/src/import/ containing the export snapshot of the plamk source files.

Then move relevant parts of it to their places in the gt tree, so that what is left is more and more empty files. The empty files then serve as (part of) the documentation for what has been integrated and what has not.
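
If the empty files are to serve as documentation, a small status script could list them. The Python sketch below is one possible way, assuming that "integrated" simply means the file has zero bytes; import_status is a hypothetical helper, not an existing GT script.

```python
from pathlib import Path


def import_status(import_dir):
    """Split files under import_dir into integrated (empty) and pending (non-empty)."""
    integrated, pending = [], []
    for path in sorted(Path(import_dir).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        rel = path.relative_to(import_dir).as_posix()
        if path.stat().st_size == 0:
            integrated.append(rel)
        else:
            pending.append(rel)
    return integrated, pending
```

Calling import_status("est/src/import") from the root of the tree would return two sorted lists of relative paths that could be pasted into the documentation page.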

Keep up documentation in the est/doc folder, linked from est/doc/EstonianDocumentation.jspwiki.

Illustrations from the gt tree

An alternative: Greenlandic

Stems:

abbreviations.lexc
acronyms.lexc
nouns.lexc
numerals.lexc
particles.lexc
pronouns.lexc
propernouns.lexc
punctuation.lexc
verbs.lexc

Affixes:

derivations-inflections.lexc
numerals.lexc
propernouns.lexc

Estonian as inverted Greenlandic:

Stems:

hybrids.lexc
pronouns.lexc

Affixes:

nouns.lexc
verbs.lexc

Milestones in near future

Move Neeme's est (done)
Make new dummy est (done)
Set up Documentation page (Trond)
Look at filenames for plamk (Jaak)
Thereafter export plamk from git to est/src/import/ (Jaak)
Look at and write in the documentation (all)
Set up Bugzilla with more components (mphon, morph, lex, import) (Trond)

Next meeting

Dec 20th, 9:30 Estonian time

Topics, preparations:

Look at the import folder
Look at tag differences
Try to compile stuff (e.g. another language) beforehand