140326
SamEst meeting March 26th
Present: Fran, Heiki-Jaan, Heli, Jaak, Sjur, Trond
Agenda
- Estonian fst
- Status quo
- Linguistic issues
- Compounds
- Jussive mood
- Distinguishing homonyms
- Parallel forms
- Estonian CG
- Plan forward
- Next meeting
Estonian fst
Status quo
Jaak took plamk, split it, and put it into the gtd infra.
The not so good part:
Even for simple words, the fst is not that good.
For verbs, we have the inverted twol rules that we use.
Some symbols in the lemma should be transformed to ordinary letters.
ex: luGema -> loetud
$ echo "olema" | hfst-proc src/analyser-mt-apertium-desc.und.ohfst
^olema/ole$ma<v><sup><ill>$
hfst-compose-intersect has an -I / --invert option
perhaps this could be just another filter ?
This WORKS:
$ hfst-compose-intersect -1 analyser-gt-desc.HFST -2 est-phon.HFST -o new.HFST
$ echo olema | hfst-lookup new.HFST
> olema olema+V+sup+ill 0,000000
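As an illustration of what the composition with est-phon does to the lemma side, the following Python sketch mimics the effect on the examples in these notes. It is only a toy stand-in, not the twolc rules themselves: it assumes that morphophonemes show up as uppercase letters, digits and `$` inside an otherwise lowercase lemma, and the function names `clean_lemma` and `clean_analysis` are invented for this sketch.

```python
import re

# Assumption: digits and '$' are the only non-letter marker symbols in a lemma
# (as in piD1ama, ahK1i, ole$ma); uppercase letters are morphophonemes.
_MARKERS = re.compile(r"[0-9$]")


def clean_lemma(lemma):
    """Drop digits and '$', then lowercase what is left (toy approximation)."""
    return _MARKERS.sub("", lemma).lower()


def clean_analysis(analysis):
    """Clean only the lemma part of an analysis such as 'piD1ama+V+indic'."""
    lemma, sep, tags = analysis.partition("+")
    return clean_lemma(lemma) + sep + tags


if __name__ == "__main__":
    for raw in ["luGema", "ole$ma", "ahK1i", "piD1ama+V+indic+impf+ps1+sg+ps+af"]:
        print(raw, "->", clean_analysis(raw))
```

Capitalised lemmas such as proper nouns would be lowercased by this sketch, which is one reason the real clean-up belongs in the fst rather than in a post-processing script.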
Questions: Why are these symbols there?
Jaak's explanation, summing up:
- simple words other than verbs work
- We have ideas for how to deal with verbs
- The simplest way would be to run the twolc rules twice,
Present lexc:
ahK1i 28; ! (28_V -> ) at: ahki, an: ahi
Why not:
ahki:ahK1i 28; ! (28_V -> ) at: ahki, an: ahi
The reason is that during derivation, we build new lemmata,
Linguistic issues
Compounds
Deadline for Filosoft making compound code accessible: May 1st.
We should not wait for May 1st, there are things to do.
We should reinvent the compounding as we see fit for the
Jussive mood
Jussive mood (möönev kõneviis?): Ülle Viks does not have one, but EKKR09 does. Is there one, or do we really use the 3rd person imperative instead?
Heiki has looked at it.
The vote out there says +Imp+Pl3, and anyone may change this to +Juss ad lib.
Distinguishing homonyms
Example: the form pairs palk/palgi (log) vs palk/palga (salary). At the moment there is no way to distinguish some words in their lexicon form.
It is usual for GT infra to mark different
e.g. (palatalisation on l for palk-palgi)
palk+Hom1:palk N1 ;
palk+Hom2:palk N2 ;
Here, we may consider mnemonic tags like +HomA, +HomI, Hom/a, Hom/i
e.g. (no difference in palatalisation either)
siil-siili
siil-siilu
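A minimal Python sketch of how numbered homonym tags could be added mechanically when lexc entries are generated from a stem list. The input format (a list of lemma and continuation lexicon pairs) and the function name `lexc_entries` are assumptions made for illustration; only the palk lines and the ahK1i entry are taken from these notes.

```python
from collections import defaultdict


def lexc_entries(stems):
    """Build lexc lines from (lemma, continuation_lexicon) pairs.

    Lemmas that occur more than once get +Hom1, +Hom2, ... on the
    analysis side; unique lemmas are left untouched.
    """
    by_lemma = defaultdict(list)
    for lemma, cont in stems:
        by_lemma[lemma].append(cont)

    lines = []
    for lemma, conts in by_lemma.items():
        if len(conts) == 1:
            lines.append(f"{lemma} {conts[0]} ;")
        else:
            for i, cont in enumerate(conts, start=1):
                lines.append(f"{lemma}+Hom{i}:{lemma} {cont} ;")
    return lines


if __name__ == "__main__":
    for line in lexc_entries([("palk", "N1"), ("palk", "N2"), ("ahK1i", "28")]):
        print(line)
```

Mnemonic tags such as +HomA or +HomI could not be generated this way without extra information per stem, so the numbered variant is the one that is cheap to automate.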
But how does this go together with the double twol intersection?
- It's fine, because they are symbols. (Everything is a symbol...)
- Maybe not if you have like Ole+Hom1$ma ... but does this exist?
- could it be that the tag will interfere with the twol rules?
This should be kept in mind.
- pidama - pidin (had to)
- pidama - pidasin (held)
This output from PLAMK:
> pidasin pidama+V+indic+impf+ps1+sg+ps+af 0,000000
> pidin pidama+V+indic+impf+ps1+sg+ps+af 0,000000
This output from GT:
$ echo "pidasin" | hfst-lookup analyser-gt-desc.hfst
> pidasin piD1ama+V+indic+impf+ps1+sg+ps+af
> pidin pidin+? pidin+? inf
$ echo "pidasin" | hfst-proc analyser-mt-apertium-desc.und.ohfst
^pidasin/piD1ama<v><indic><impf><ps1><sg><ps><af>$
$ echo "pidin" | hfst-proc analyser-mt-apertium-desc.und.ohfst
^pidin/*pidin$ <-- No result even for the vanilla analyser
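The hfst-lookup and hfst-proc outputs differ only in notation. The following Python sketch makes the correspondence explicit by rewriting hfst-lookup style analyses into the Apertium stream format. The function name `to_apertium` is invented here, and lowercasing every tag is an assumption based on the examples in these notes.

```python
def to_apertium(form, analyses):
    """Render a surface form and its analyses as one Apertium lexical unit.

    `analyses` is a list of strings like 'piD1ama+V+indic'; an empty list
    means the form is unknown and is rendered as '^form/*form$'.
    """
    if not analyses:
        return f"^{form}/*{form}$"
    units = []
    for analysis in analyses:
        lemma, *tags = analysis.split("+")
        # Assumption: tags are simply lowercased and wrapped in <...>.
        units.append(lemma + "".join(f"<{tag.lower()}>" for tag in tags))
    return f"^{form}/" + "/".join(units) + "$"


if __name__ == "__main__":
    print(to_apertium("pidasin", ["piD1ama+V+indic+impf+ps1+sg+ps+af"]))
    print(to_apertium("pidin", []))
```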
Fran would like the +Hom tag to go next to the lemma.
Parallel forms
There are words with possible parallel forms like short and long
(do we need a new lexicon format instead of EKI's stem database / morphological database? One possible way would be to use
Question of normativity and stylistics.
The gt infra solution:
- Analyser
- analyser-gt-desc.xfst
- analyser-gt-norm.xfst
- Generator:
- one generator generating all and only the "correct forms"
- a set of generators (possibly one), generating only one form for each
If the words end in -line, these-and-these forms may be discarded.
(delete -> mark as 'no generation')
Estonian CG
Postponed to next meeting.
Plan forward
More work on fst
TODO
- Sjur to set up certain things
- Jaak, Heli (?) to work on making the fst behave as well as plamk
- Make tests
- Look at the $LANG/test/src/...
- This test battery may then be used as e.g. regression testing
- twolc test: true (and false?) test pairs
- lexc yaml tests:
- lexc lemma check test
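A regression test of the kind listed above could be sketched in Python as follows: parse the output of both analysers and report every analysis that PLAMK gives and the fst does not. The sketch assumes tab-separated hfst-lookup style output (form, analysis, optional weight) and treats the PLAMK output as the expected result. The function names are invented, and the sample strings are retyped from the pidama example in these notes.

```python
def parse_lookup(output):
    """Map each input form to its set of analyses; '+?' lines count as none."""
    analyses = {}
    for line in output.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        form, analysis = fields[0], fields[1]
        found = analyses.setdefault(form, set())
        if not analysis.endswith("+?"):
            found.add(analysis)
    return analyses


def missing_analyses(expected, actual):
    """Return {form: analyses present in expected but absent from actual}."""
    report = {}
    for form, wanted in expected.items():
        lacking = wanted - actual.get(form, set())
        if lacking:
            report[form] = lacking
    return report


PLAMK = (
    "pidasin\tpidama+V+indic+impf+ps1+sg+ps+af\t0,000000\n"
    "pidin\tpidama+V+indic+impf+ps1+sg+ps+af\t0,000000\n"
)
GT = (
    "pidasin\tpiD1ama+V+indic+impf+ps1+sg+ps+af\n"
    "pidin\tpidin+?\tinf\n"
)

if __name__ == "__main__":
    report = missing_analyses(parse_lookup(PLAMK), parse_lookup(GT))
    for form in sorted(report):
        print(form, "is missing", sorted(report[form]))
```

On this sample both forms are reported: pidin because the fst has no analysis at all, and pidasin because the lemma still carries the D1 symbol.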
Next meeting
9th April, 13h00 UTC+1
Via the telephone.