131211

Meeting on est fst code

Present: Heiki, Heli, Jaak, Neeme, Sjur, Trond.

Agenda:

  • Presentation
  • Presentation of plamk
  • Presentation of integration alternatives
  • Discussion

Presentation

Of ourselves: done.

Presentation of plamk

Files

Source code:

  • Lexicon from Eesti Keele Instituut
  • Two-level rules to handle mostly phonology, plus some orthographic rules.

Separate rules for compounds.

File size:

Here are the largest files of the compiled plamk (sizes in bytes):

  • 2 179 802 saami descriptive analyser (for reference)
  • 177 980 878 eesti.fst
  • 45 534 357 lihtsonad.fst
  • 44 674 879 full-compound.fst
  • 2 954 677 lex_full.txt
  • 2 913 885 lex_tyved.txt
  • 2 107 529 tyvebaas.txt
  • 403 494 lex-av.fst
  • 138 373 lex.fst
  • 76 897 lex_exc.txt
  • 75 864 lex_override_gen.txt
  • 47 355 rules.fst
  • 46 282 form.exc
  • 35 147 COPYING
  • 26 041 lex_main.txt
  • 19 897 eki2lex.pl
  • 16 236 tyvebaas-lisa.txt
  • 15 624 liitsonamask.fst
  • 14 344 lex_extra.txt
  • 11 383 morftrtabel.txt
  • 10 875 rul.txt
  • 7 748 liitsona_full.txt

Sample entries from lex_tyved.txt:

 aPla 29; ! (29_V -> ) at: apla, an: abla
 aa+I:aa GI; ! (41_I -> +I) 
 aabe+S:aaPe 06; ! (06_S -> +S) an: aabe, at: aape
 aabits+S:aabits 02_A; ! (02_S -> +S) a0: aabits, b0: aabitsa, b0r: 0
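
The entries above follow the usual lexc shape of upper:lower, continuation lexicon, and an optional comment after "!". As a rough sketch only (the function name parse_entry and the returned field names are made up for illustration, not part of plamk or GT), such a line could be split up like this:

```python
import re

# Hypothetical pattern for the entry shape seen in lex_tyved.txt:
#   upper:lower CONTINUATION; ! comment
ENTRY_RE = re.compile(
    r"""^\s*
        (?P<upper>[^\s:;!]+)          # analysis side, e.g. aabe+S
        (?::(?P<lower>[^\s;!]+))?     # optional surface stem after the colon
        \s+(?P<cont>[^\s;!]+)\s*;     # continuation lexicon, e.g. 06
        \s*(?:!\s*(?P<comment>.*))?$  # optional comment after "!"
    """,
    re.VERBOSE,
)


def parse_entry(line):
    """Return the parts of one entry line, or None if the line does not fit."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    upper = m.group("upper")
    return {
        "upper": upper,
        # Without a colon, lexc treats upper and lower side as identical.
        "lower": m.group("lower") or upper,
        "continuation": m.group("cont"),
        "comment": (m.group("comment") or "").strip(),
    }


if __name__ == "__main__":
    print(parse_entry(" aabe+S:aaPe 06; ! (06_S -> +S) an: aabe, at: aape"))
```

A parser of this kind would be one possible starting point for the conversion scripts discussed below; the real files may contain entry shapes that this pattern does not cover.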

Discussion

The code is certainly different from the giellatekno code.

lexc_main, like most of the lexc files, is a way of expressing regular expressions in order to filter out irregularities.

Differences:

  • stem lexicon is generated from EKI database (?)
  • some lexc and xfst files stored in GitHub
  • regular lexicon and exception lexicon separate
  • if a word is in both, the entry in the exception lexicon overrides the regular one
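
The override behaviour in the last point can be pictured as a dictionary merge. This is only a sketch of the idea, not how plamk implements it: the function merge_lexicons, the choice of the lemma as key, and the exception entry in the example are all assumptions made for illustration.

```python
def merge_lexicons(regular, exceptions):
    """Combine two lemma -> entry mappings; exception entries win on clashes."""
    merged = dict(regular)     # start from the regular lexicon
    merged.update(exceptions)  # entries from the exception lexicon override
    return merged


if __name__ == "__main__":
    regular = {
        "aabe": "aabe+S:aaPe 06;",
        "aabits": "aabits+S:aabits 02_A;",
    }
    # Hypothetical exception entry, invented for this example only.
    exceptions = {"aabits": "aabits+S:aabits EXC;"}
    for lemma, entry in sorted(merge_lexicons(regular, exceptions).items()):
        print(lemma, "->", entry)
```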

Presentation of integration alternatives

  1. Full integration and rewrite, with updates done to the GT code. Cut the link to plamk code
  2. Keep different codes, but build conversion scripts. Update in plamk, convert when needed
  3. A hybrid solution with many possible nuances, for example full integration and rewrite of the morphology files, but conversion routines for the lexicon files.
  4. Encapsulate plamk in the GT infrastructure, so that "this folder is different".
  5. Gradually adapt plamk to the GT infrastructure (play it safe, that is)

Nicknames:

  1. One-time integration
  2. Continuous integration
  3. Hybrid
  4. No integration
  5. Gradual integration (a safer version of "one-time integration")

Discussion

One-time integration

pro

  • it's "done" -- no dependencies either way
  • it will be maintained within GT fully?
  • Heiki: total and thorough rewrite is both needed and the best solution (it is a dream)
  • It will give us a common language, a common understanding
  • It will give bystanders an alternative to the plamk infrastructure

con

  • development might then happen in two different places, with extra work needed to synchronize; this can be solved using version control systems
  • takes more time and effort in the beginning, probably harder than making the integration step-by-step
  • risky: we don't know the consequences of jumping

Continuous integration

pro

  • just one "master copy"

con

  • does not fit into GT's infra that well
  • svn vs git? updating needs to be thought through

Hybrid

This would be a light conversion. The idea is that the core analysis is changed once and for all, and ...

  • morphophonology is not changed, because it does not need to be
  • the lexicon is what needs to be updated on a regular basis

The lexicon formats of plamk and gt are similar:

akustik+S:akustik 02_U; ! (02_S -> +S) a0: akustik, b0: akustiku, b0r: 0
akustika+S:akustika 01; ! (01_S -> +S) a0: akustika, a0r: 0
akustiline+A:akustili 12_NE-SE-S; ! (12_A -> +A) a0: akustiline, b0: akustilise, c0: akustilis, b0v: akustilisi

One hybrid sketch:

LEXICON allstems
ihana adjlex ;  ! these as separate 
talo nounlex ;
nuori hybridlex ;

LEXICON hybridlex
adjlex ;
nounlex ;

LEXICON adjlex
+Comp: ... ;
nominalcase ;

LEXICON nounlex
pxlex ;
nominalcase ;

pro

  • less work now, maybe also later

con

  • the linguist has to somehow deal with two different systems in parallel

No integration

pro

  • plamk development will continue
  • we do not risk losing insights in conversion

con

  • we miss the integration with end-user tools from gt
  • we will always have to come up with different solutions for Estonian
  • The plamk code will remain a dark continent to others than Jaak and Heli (?)
  • It is unclear whether the resulting fst could be integrated in end-user tools

Gradual integration

pro

  • we don't have to decide to break the ties before we know the consequences better
  • allow both systems to develop
  • safest way
  • in the end we have a system that is GT-style

con

  • we don't know if it is possible to rewrite the system so that it is fully GT-style (but maybe we don't need to - it is hard to tell now)

Tags

Many issues are trivial

  • plamk-style
    • +nom +gen +part +ill +adit +in +el +all +ad +abl +tr +term +es +abes +kom
  • gt-style
    • +Nom +Gen +Par +Ill +Adi +Ine +Ela +All +Ade +Abl +Tra +Trm +Ess +Abe +Com
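
Since the two case tag lists above have the same length and order, the trivial part of the tag wrapper can be sketched as a lookup table. This assumes the tags correspond position by position, as the lists suggest; the function convert_tags is a made-up name, and tags that are not in the table are passed through unchanged.

```python
# The two lists are copied from the notes above, in the same order.
PLAMK_CASES = ("+nom +gen +part +ill +adit +in +el +all +ad +abl "
               "+tr +term +es +abes +kom").split()
GT_CASES = ("+Nom +Gen +Par +Ill +Adi +Ine +Ela +All +Ade +Abl "
            "+Tra +Trm +Ess +Abe +Com").split()

# Assumption: tags correspond by position in the two lists.
CASE_MAP = dict(zip(PLAMK_CASES, GT_CASES))


def convert_tags(tags):
    """Replace known plamk case tags with gt tags; leave all others as they are."""
    return [CASE_MAP.get(tag, tag) for tag in tags]


if __name__ == "__main__":
    print(convert_tags(["+S", "+part"]))
```

The substantial tag issues (for example person and number on verbs) need discussion and are not covered by a one-to-one table like this.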

Tag system principles:

  • Verbs:
    • gt: +Sg1
    • others: +1Sg, +1 +Sg
    • plamk: +ps1 +sg

Nouns:

Other issues are substantial

Both lexc and twolc code are in compatible formats:

LEXICON 22_A  !jalg, pikk, sepp
An_SgN;
:a$ TP_22bn;
:a TP_22bt;

Several newinfras:

  1. langs (all languages with at least one application, or with decent coverage)
  2. startup-langs (our incubator)
  3. experiment-langs (alternative setups, for pedagogical or experimental reasons)
  4. closed-langs (as langs, but with a closed license, not visible online)

Conversion plan

  1. Tag wrapper -- needed anyway (will need discussions)
    1. src/scripts/ (if conversion of source files)
    2. src/tagsets/ (if done on compiled fst's, like conversion to apertium tags)
  2. phonology
    1. est-phon.twolc
  3. morphology
    1. tags: root.lex
    2. stems: Populate nouns.lexc, verbs.lexc, etc. or: nouns, verbs, hybrids, ...
    3. affixes: Populate nouns.lexc, verbs.lexc, etc.

Workflow

Try to adjust the filenames of plamk, and maybe the build system, so that the conversion becomes more natural (more obvious?).

Make a folder, est/src/import/, containing an export snapshot of the plamk source files.

Then move relevant parts of it to their places in the gt tree, so that the files left behind become more and more empty. The empty files then serve as (part of) the documentation for what has been integrated and what has not.
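
The status of the import folder could be checked with a small script. This is only a sketch: the function import_status and the rule "a file containing nothing but whitespace counts as integrated" are assumptions for illustration, not an agreed convention.

```python
from pathlib import Path


def import_status(import_dir):
    """Sort the files under import_dir into integrated (empty) and remaining."""
    status = {"integrated": [], "remaining": []}
    root = Path(import_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Assumption: a file with only whitespace left has been fully moved.
        key = "remaining" if text.strip() else "integrated"
        status[key].append(path.relative_to(root).as_posix())
    return status


if __name__ == "__main__":
    result = import_status("est/src/import")
    for key in ("integrated", "remaining"):
        print(key + ":", ", ".join(result[key]) or "(none)")
```

Run from the directory above est/, this prints one line per category; a folder that does not exist yet simply gives two empty lists.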

Keep up documentation in the est/doc folder, linked from est/doc/EstonianDocumentation.jspwiki.

Illustrations from the gt tree

An alternative: Greenlandic

Stems:

  • abbreviations.lexc
  • acronyms.lexc
  • nouns.lexc
  • numerals.lexc
  • particles.lexc
  • pronouns.lexc
  • propernouns.lexc
  • punctuation.lexc
  • verbs.lexc

Affixes:

  • derivations-inflections.lexc
  • numerals.lexc
  • propernouns.lexc

Estonian as inverted Greenlandic:

Stems:

  • hybrids.lexc
  • pronouns.lexc

Affixes:

  • nouns.lexc
  • verbs.lexc

Milestones in near future

  • Move Neeme's est (done)
  • Make new dummy est (done)
  • Set up Documentation page (Trond)
  • Look at filenames for plamk (Jaak)
  • Thereafter export plamk from git to est/src/import/ (Jaak)
  • Look at and write in the documentation (all)
  • Set up Bugzilla with more components (mphon, morph, lex, import) (Trond)

Next meeting

  • Dec 20th, 9:30 Estonian time

Topics, preparations:

  • Look at the import folder
  • Look at tag differences
  • Try to compile something (e.g. another language) beforehand