131211
Meeting on est fst code
Present: Heiki, Heli, Jaak, Neeme, Sjur, Trond.
Agenda:
- Presentation
- Presentation of plamk
- Presentation of integration alternatives
- Discussion
Presentation
Presentation of plamk
Files
- Lexicon from Eesti Keele Instituut
- Two-level rules that handle mostly phonology, plus some orthographic rules.
Separate rules for compounds:
File size:
Here are the 10 or so largest files of the compiled plamk:
- 2 179 802 saami descriptive analyser (for reference)
- 177 980 878 eesti.fst
- 45 534 357 lihtsonad.fst
- 44 674 879 full-compound.fst
- 2 954 677 lex_full.txt
- 2 913 885 lex_tyved.txt
- 2 107 529 tyvebaas.txt
- 403 494 lex-av.fst
- 138 373 lex.fst
- 76 897 lex_exc.txt
- 75 864 lex_override_gen.txt
- 47 355 rules.fst
- 46 282 form.exc
- 35 147 COPYING
- 26 041 lex_main.txt
- 19 897 eki2lex.pl
- 16 236 tyvebaas-lisa.txt
- 15 624 liitsonamask.fst
- 14 344 lex_extra.txt
- 11 383 morftrtabel.txt
- 10 875 rul.txt
- 7 748 liitsona_full.txt
lex_tyved.txt:
  aPla 29; ! (29_V -> ) at: apla, an: abla
  aa+I:aa GI; ! (41_I -> +I)
  aabe+S:aaPe 06; ! (06_S -> +S) an: aabe, at: aape
  aabits+S:aabits 02_A; ! (02_S -> +S) a0: aabits, b0: aabitsa, b0r: 0
Discussion
lexc_main, like most lexc files, is a way of expressing regular expressions
Differences:
- stem lexicon is generated from EKI database (?)
- some lexc and xfst files stored in github
- regular lexicon and exception lexicon separate
- if a word is in both, the entry in the exception lexicon overrides the regular one
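The override rule above can be sketched as a plain dictionary merge. This is only an illustration, not the plamk build code: the function name `merge_lexicons`, the keying by lemma, and the continuation class `EXC` are all hypothetical.

```python
def merge_lexicons(regular, exceptions):
    """Return one lexicon in which exception entries replace regular ones.

    Both arguments map a lemma to its lexc entry line (an assumed layout).
    """
    merged = dict(regular)      # start from the regular lexicon
    merged.update(exceptions)   # exception entries win on conflict
    return merged


regular = {
    "aabe": "aabe+S:aaPe 06;",
    "aabits": "aabits+S:aabits 02_A;",
}
# Hypothetical exception entry; EXC is a made-up continuation class.
exceptions = {"aabits": "aabits+S:aabits EXC;"}

merged = merge_lexicons(regular, exceptions)
```

Words found only in the regular lexicon pass through unchanged, so the exception file only needs to list the deviating entries.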
Presentation of integration alternatives
- Full integration and rewrite, with updates done to the GT code. Cut the link
- Keep different codes, but build conversion scripts. Update in plamk, convert
- A hybrid solution, in itself with many nuances, one of which is to have full
- Encapsulate plamk in the GT infrastructure, so that "this folder is different".
- Gradually adapt Plamk to the GT infrastructure (play it safe, that is)
Nicknames:
- One-time integration
- Continuous integration
- Hybrid
- No integration
- Gradual integration (a safer version of "one-time integration")
Discussion
One-time integration
pro
- it's "done" -- no dependencies either way
- it will be maintained within GT fully?
- Heiki: total and thorough rewrite is both needed and the best solution (it is a dream)
- It will give us a common language, a common understanding
- It will give bystanders an alternative to the plamk infrastructure
con
- development happens in two different places? need to do extra work to synchronize? can be solved using version control systems
- takes more time and effort in the beginning, probably harder than making the integration step-by-step
- risky: we don't know the consequences of jumping
Continuous integration
pro
- just one "master copy"
con
- does not fit into GT's infra that well
- svn vs git? updating needs to be thought through
Hybrid
This would be a conversion light. The idea is that
- morphophonology not changed, because it doesn't need to
- the lexicon is what needs to be updated on a regular basis
The lexicon formats of plamk and gt are similar:
  akustik+S:akustik 02_U; ! (02_S -> +S) a0: akustik, b0: akustiku, b0r: 0
  akustika+S:akustika 01; ! (01_S -> +S) a0: akustika, a0r: 0
  akustiline+A:akustili 12_NE-SE-S; ! (12_A -> +A) a0: akustiline, b0: akustilise, c0: akustilis, b0v: akustilisi
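Because the entries follow the usual lexc shape `upper:lower continuation ; ! comment`, a conversion script could split them with a single regular expression. The sketch below is a simplification under stated assumptions: it ignores lexc escapes (`%`) and entries with an empty upper side, and the names `ENTRY_RE` and `parse_entry` are hypothetical.

```python
import re

# upper[:lower] continuation ; [! comment]   (simplified lexc entry)
ENTRY_RE = re.compile(
    r"^\s*(?P<upper>[^\s:;!]+)"      # upper side, e.g. akustik+S
    r"(?::(?P<lower>[^\s;!]+))?"     # optional lower side, e.g. akustik
    r"\s+(?P<cont>[^\s;!]+)\s*;"     # continuation lexicon, e.g. 02_U
    r"\s*(?:!\s*(?P<comment>.*))?$"  # optional trailing comment
)


def parse_entry(line):
    """Split one entry line into its parts, or return None if it does not fit."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    upper = match.group("upper")
    lower = match.group("lower")
    lemma, *tags = upper.split("+")
    return {
        "lemma": lemma,
        "tags": ["+" + tag for tag in tags],
        "lower": lower if lower is not None else upper,  # no colon: same on both sides
        "continuation": match.group("cont"),
        "comment": (match.group("comment") or "").strip(),
    }


entry = parse_entry("akustiline+A:akustili 12_NE-SE-S; ! (12_A -> +A) a0: akustiline")
plain = parse_entry("aPla 29; ! (29_V -> ) at: apla, an: abla")
```

Lines that do not fit the pattern, such as `LEXICON` headers, come back as `None`, so a script can pass them through untouched.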
One hybrid sketch:
  LEXICON allstems
  ihana adjlex ; ! these as separate
  talo nounlex ;
  nuori hybridlex ;

  LEXICON hybridlex
  adjlex ;
  nounlex ;

  LEXICON adjlex
  +Comp: ... ;
  nominalcase ;

  LEXICON nounlex
  pxlex ;
  nominalcase ;
pro
- less work now, maybe also later
con
- the linguist somehow has to deal with two different systems in parallel
No integration
pro
con
- we miss the integration with end-user tools from gt
- we will always have to come up with different solutions for Estonian
- The plamk code will remain a dark continent to others than Jaak and Heli (?)
- It is unclear whether the resulting fst could be integrated in end-user tools
Gradual integration
pro
- we don't have to decide to break the ties before we know the consequences better
- allow both systems to develop
- safest way
- in the end we have a system that is GT-style
con
- we don't know if it is possible to rewrite the system so that it is fully GT-style (but maybe we don't need to - it is hard to tell now)
Tags
Many issues are trivial
- plamk-style
- +nom +gen +part +ill +adit +in +el +all +ad +abl +tr +term +es +abes +kom
- gt-style
- +Nom +Gen +Par +Ill +Adi +Ine +Ela +All +Ade +Abl +Tra +Trm +Ess +Abe +Com
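The two case-tag lists above appear to line up position by position, so the trivial part of the tag conversion could be a lookup table. The sketch assumes that alignment and assumes analyses are strings of `+`-prefixed tags; `CASE_MAP`, `convert_case_tags` and the sample analysis are hypothetical.

```python
import re

PLAMK_CASES = "+nom +gen +part +ill +adit +in +el +all +ad +abl +tr +term +es +abes +kom".split()
GT_CASES = "+Nom +Gen +Par +Ill +Adi +Ine +Ela +All +Ade +Abl +Tra +Trm +Ess +Abe +Com".split()

# Assumption: the two lists from the notes correspond item by item.
CASE_MAP = dict(zip(PLAMK_CASES, GT_CASES))

# Match whole tags only, so that +in never rewrites part of a longer tag.
TAG_RE = re.compile(r"\+[A-Za-z0-9]+")


def convert_case_tags(analysis):
    """Replace known plamk case tags with their gt counterparts; keep the rest."""
    return TAG_RE.sub(lambda m: CASE_MAP.get(m.group(0), m.group(0)), analysis)


converted = convert_case_tags("maja+S+in")  # hypothetical analysis string
```

Unknown tags are left as they are, which makes it easy to spot what the table does not yet cover.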
Tag system principles:
- Verbs:
- gt: +Sg1
- others: +1Sg, +1 +Sg
- plamk: +ps1 +sg
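The verb tags differ in structure, not only in spelling: plamk uses two tags where gt uses one. A possible rewrite rule is sketched below. Only the pair `+ps1 +sg` / `+Sg1` comes from the notes; extending it to persons 2 and 3 and to a `+pl` tag is an assumption, and `merge_person_number` is a hypothetical name.

```python
import re

# Assumed plamk shape: +ps<person> followed by +sg or +pl.
PERSON_NUMBER_RE = re.compile(r"\+ps([123])\s*\+(sg|pl)")


def merge_person_number(tags):
    """Fold a plamk person tag and number tag into one gt-style tag."""
    return PERSON_NUMBER_RE.sub(
        lambda m: "+" + m.group(2).capitalize() + m.group(1), tags
    )


first_singular = merge_person_number("+ps1 +sg")
```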
Nouns:
Other issues are substantial
Both lexc and twolc code are in compatible formats:
  LEXICON 22_A !jalg, pikk, sepp
  An_SgN;
  :a$ TP_22bn;
  :a TP_22bt;
Several newinfras:
- langs (all languages with at least one application, or with decent coverage)
- startup-langs (our incubator)
- experiment-langs (alternative setups, for pedagogical or experimental reasons)
- closed-langs (as langs, but with a closed license, not visible online)
Conversion plan
- Tag wrapper -- needed anyway (will need discussions)
- src/scripts/ (if conversion of source files)
- src/tagsets/ (if done on compiled fst's, like conversion to apertium tags)
- phonology
- est-phon.twolc
- morphology
- tags: root.lex
- stems: Populate nouns.lexc, verbs.lexc, etc. or: nouns, verbs, hybrids, ...
- affixes: Populate nouns.lexc, verbs.lexc, etc.
Workflow
Try to adjust the filenames of plamk, and maybe the build system, so that the conversion becomes more natural (more obvious?)
Make a folder:
Then move relevant parts of it to their places in the gt tree,
Keep up documentation in the est/doc folder, linked from
Illustrations from the gt tree
An alternative: Greenlandic
Stems:
- abbreviations.lexc
- acronyms.lexc
- nouns.lexc
- numerals.lexc
- particles.lexc
- pronouns.lexc
- propernouns.lexc
- punctuation.lexc
- verbs.lexc
Affixes:
- derivations-inflections.lexc
- numerals.lexc
- propernouns.lexc
Estonian as inverted Greenlandic:
Stems:
- hybrids.lexc
- pronouns.lexc
Affixes
- nouns.lexc
- verbs.lexc
Milestones in near future
- Move Neeme's est (done)
- Make new dummy est (done)
- Set up Documentation page (Trond)
- Look at filenames for plamk (Jaak)
- Thereafter export plamk from git to est/src/import/ (Jaak)
- Look at and write in the documentation (all)
- Set up Bugzilla with more components (mphon, morph, lex, import) (Trond)
Next meeting
- Dec 20th 9:30 Estonian time
Topics, preparations:
- Look at the import folder
- Look at tag differences
- Try to compile stuff (e.g. another language) beforehand