140326

SamEst meeting March 26th

Present: Fran, Heiki-Jaan, Heli, Jaak, Sjur, Trond

Agenda

  • Estonian fst
    • Status quo
    • Linguistic issues
    • Compounds
    • Jussive mood
    • Distinguishing homonyms
    • Parallel forms
  • Estonian CG
  • Plan forward
  • Next meeting

Estonian fst

Status quo

Jaak took plamk, split it, and put it into the gtd infra. That was the good part.

The not so good part: nothing related to derivations has been ported from plamk, since they are built in an incompatible way. The same goes for compounds.

Even for simple words, the fst is not that good.

For verbs, we have the inverted twol rules that we use. In gtd, this is not possible for the moment.

Some symbols in the lemma should be transformed to ordinary letters.

ex: luGema -> loetud

$ echo "olema" | hfst-proc src/analyser-mt-apertium-desc.und.ohfst 
  ^olema/ole$ma<v><sup><ill>$

hfst-compose-intersect has an -I / --invert option

perhaps this could be just another filter ?

This WORKS:

$ hfst-compose-intersect -1 analyser-gt-desc.HFST -2 est-phon.HFST -o new.HFST
$ echo olema | hfst-lookup new.HFST 
> olema olema+V+sup+ill 0,000000
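
The cleanup wanted on the lemma side can be pictured with a small Python sketch. It is not how the fst does it (there the phonological rules are composed onto the lemma side); it only illustrates the mapping. It assumes that a special symbol is a capital letter optionally followed by a digit (as in luGema, piD1ama, ahK1i) and that `$` is a stem boundary. The function name is hypothetical, and the pattern would also lowercase capitals in proper nouns, so it is not safe as a real filter.

```python
import re

# Assumption: a special stem symbol is one capital letter, optionally
# followed by one digit (G, D1, K1); "$" marks a stem boundary.
_SPECIAL = re.compile(r"([A-Z])\d?")


def normalize_lemma(raw: str) -> str:
    """Map special stem symbols to ordinary letters and drop '$'."""
    without_boundary = raw.replace("$", "")
    return _SPECIAL.sub(lambda m: m.group(1).lower(), without_boundary)


if __name__ == "__main__":
    # Stems quoted in these notes.
    for raw in ["luGema", "piD1ama", "ahK1i", "ole$ma"]:
        print(raw, "->", normalize_lemma(raw))
```

With this mapping, piD1ama comes out as pidama and ole$ma as olema, which is what the lemma side should look like after the composition.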

Questions: Why are these symbols there?

Jaak's explanation, summing up:

  • simple words other than verbs work
  • We have ideas for how to deal with verbs
    • The simplest way would be to run the twolc rules twice, so that the upper side is taken care of as well and we do not need to change anything else.

Present lexc:
ahK1i 28; ! (28_V -> ) at: ahki, an: ahi                                                                                        

Why not:
ahki:ahK1i 28; ! (28_V -> ) at: ahki, an: ahi                                                                                        

The reason is that during derivation, we build new lemmata, and they are then bu

Linguistic issues

Compounds

Deadline for Filosoft making compound code accessible: May 1st. Generation, analysis, additional marks which mark pronunciation, Grade III, stressed syllable, etc. Since some of them had no users, we did not check them properly, and now they must be debugged.

We should not wait for May 1st, there are things to do.

We should reinvent the compounding as we see fit for the present framework.

Jussive mood

Jussive mood (möönev kõneviis?): Ülle Viks does not have one, EKKR09 does. Is there one, or do we really use the 3rd person imperative instead?

Heiki has looked at it (http://kodu.ut.ee/~hkaalep/arvutimorf_12/loeng2.htm). Some forms of the jussive mood are found in the imperative as well.

The vote out there says +Imp+Pl3, and anyone may change this to +Juss ad lib.

Distinguishing homonyms

Example: the form pairs palk/palgi (log) vs palk/palga (salary). At the moment there is no way to distinguish such words by their lexicon form, although the nominative (and thus the lexicon form) is almost the only case where they are homonymous.

It is usual for GT infra to mark different stems with the same lemma with tags like +Hom1, +Hom2 etc. <=== this should be the solution, to stay consistent with the rest of the languages

e.g. (palatalisation on l for palk-palgi)

  palk+Hom1:palk N1 ;
  palk+Hom2:palk N2 ;
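
As an illustration of how such tags could be assigned mechanically, here is a minimal Python sketch. It assumes the lexicon is available as (lemma, stem, continuation lexicon) triples; the function name is hypothetical, and N1 / N2 are the placeholder lexicon names from the example above, not real lexica. Lemmas that occur only once are left untagged.

```python
from collections import defaultdict


def add_hom_tags(entries):
    """Build lexc lines from (lemma, stem, continuation_lexicon) triples.

    A lemma that occurs more than once gets +Hom1, +Hom2, ... in input
    order; a lemma that occurs once is written without a +Hom tag.
    """
    counts = defaultdict(int)
    for lemma, _stem, _contlex in entries:
        counts[lemma] += 1

    seen = defaultdict(int)
    lines = []
    for lemma, stem, contlex in entries:
        if counts[lemma] > 1:
            seen[lemma] += 1
            upper = f"{lemma}+Hom{seen[lemma]}"
        else:
            upper = lemma
        lines.append(f"{upper}:{stem} {contlex} ;")
    return lines


if __name__ == "__main__":
    entries = [
        ("palk", "palk", "N1"),   # palk/palgi, placeholder lexicon name
        ("palk", "palk", "N2"),   # palk/palga, placeholder lexicon name
        ("ahki", "ahK1i", "28"),  # not a homonym, stays untagged
    ]
    print("\n".join(add_hom_tags(entries)))
```

Numbering by input order is arbitrary; mnemonic tags would need an extra field per entry.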

Here, we may consider mnemonic tags like +HomA, +HomI, Hom/a, Hom/i or something similar.

e.g. (no difference in palatalisation either)

  siil-siili
  siil-siilu

But how does this go together with the double twol intersection?

  • It's fine, because they are symbols. (everything is a symbol...)
    • Maybe not if you have like Ole+Hom1$ma ... but does this exist?
      • could it be that the tag will interfere with the twol rules? these are symbols that the twol rules have not taken into consideration

This should be kept in mind.

  • pidama - pidin (had to)
  • pidama - pidasin (held)

This output from PLAMK:

> pidasin   pidama+V+indic+impf+ps1+sg+ps+af    0,000000
> pidin     pidama+V+indic+impf+ps1+sg+ps+af    0,000000

This output from GT:

$ echo "pidasin" | hfst-lookup analyser-gt-desc.hfst 
> pidasin piD1ama+V+indic+impf+ps1+sg+ps+af
> pidin pidin+? inf

$ echo "pidasin" | hfst-proc analyser-mt-apertium-desc.und.ohfst
^pidasin/piD1ama<v><indic><impf><ps1><sg><ps><af>$

$ echo "pidin" | hfst-proc analyser-mt-apertium-desc.und.ohfst
^pidin/*pidin$ <-- No result even for the vanilla analyser

Fran would like the +Hom tag to go next to the lemma.

Parallel forms

There are words with possible parallel forms, like short and long illative, multiple forms for plural etc. Francis Tyers argued that it would be handy if the generator fst generated just one (preferred) surface form. And if there is no single preferred form, then MT solutions (and probably everything else that generates surface forms) still need some formalized rules to choose between forms.

(do we need a new lexicon format instead of EKI's stem database / morphological database? One possible way would be to use lexc as format but I'm not sure if that is the best idea)

Question of normativity and stylistics.

The gt infra solution:

  • Analyser
    • analyser-gt-desc.xfst
    • analyser-gt-norm.xfst
  • Generator:
    • one generator generating all and only the "correct forms"
    • a set of generators (possibly one), generating only one form for each morphosyntactic combination. We fix this trivially with tags.

If the words end in -line, these-and-these forms may be discarded. Now, look at words ending in -ne, and delete the other parallel forms from these words.

(delete -> mark as 'no generation')
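
The tag-based split between the two kinds of generator can be sketched in Python as follows. This is only a sketch under assumptions: the tag name `+Use/NG` for 'no generation' is assumed here, the analysis string `maja+N+sg+ill` is made up in the style of the tags above, and which of the two illative forms counts as preferred is chosen arbitrarily for illustration.

```python
from collections import defaultdict

NO_GEN = "+Use/NG"  # assumed name of the 'no generation' tag


def build_generators(entries):
    """Split (analysis, surface form) pairs into two generator tables.

    all_forms: every correct form, with the NO_GEN tag stripped.
    one_form:  only the forms whose analysis is not marked NO_GEN.
    """
    all_forms = defaultdict(list)
    one_form = defaultdict(list)
    for analysis, surface in entries:
        plain = analysis.replace(NO_GEN, "")
        all_forms[plain].append(surface)
        if NO_GEN not in analysis:
            one_form[plain].append(surface)
    return dict(all_forms), dict(one_form)


if __name__ == "__main__":
    # Illustrative data: short and long illative of 'maja'.
    entries = [
        ("maja+N+sg+ill", "majja"),
        ("maja+N+sg+ill" + NO_GEN, "majasse"),
    ]
    all_forms, one_form = build_generators(entries)
    print(all_forms)
    print(one_form)
```

A rule such as the one for words in -line or -ne would then amount to adding the tag to the relevant entries, not deleting them.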

Estonian CG

Postponed to next meeting.

Plan forward

More work on fst

TODO

  • Sjur to set up certain things
  • Jaak, Heli (?) to work on making the fst behave as well as plamk
  • Make tests
    • Look at the $LANG/test/src/...
    • This test battery may then be used as e.g. regression testing
      • twolc test: true (and false?) test pairs
      • lexc yaml tests:
      • lexc lemma check test
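
A regression test of this kind could look roughly like the Python sketch below. It does not use the actual test scripts under $LANG/test/src/; the function names are hypothetical, and the analyser is replaced by a lookup table holding the GT output quoted earlier in these notes. The expected analyses are the PLAMK ones.

```python
def run_regression(analyse, expected):
    """Compare analyser output with the expected analyses.

    analyse:  callable, word -> set of analysis strings
    expected: dict, word -> set of analyses that must be present
    Returns a list of (word, missing analyses) for every failing word.
    """
    failures = []
    for word, wanted in sorted(expected.items()):
        missing = wanted - analyse(word)
        if missing:
            failures.append((word, sorted(missing)))
    return failures


# Stand-in for the current GT analyser (output quoted in these notes).
GT_NOW = {"pidasin": {"piD1ama+V+indic+impf+ps1+sg+ps+af"}}


def gt_analyse(word):
    """Return the recorded analyses, or an empty set for unknown words."""
    return GT_NOW.get(word, set())


# Expected analyses, taken from the PLAMK output.
EXPECTED = {
    "pidasin": {"pidama+V+indic+impf+ps1+sg+ps+af"},
    "pidin": {"pidama+V+indic+impf+ps1+sg+ps+af"},
}

if __name__ == "__main__":
    for word, missing in run_regression(gt_analyse, EXPECTED):
        print("FAIL", word, "missing:", ", ".join(missing))
```

Both words fail against the recorded GT output: pidasin because the lemma still contains D1, pidin because it gets no analysis at all.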

Next meeting

9th April, 13h00 UTC+1

Via the telephone.