141111
Present: Fran, Heiki, Heli, Jaak, Trond, Kadri, Tiina
Issues/topics:
- FST
- tags
- CG
- MT
- talk for mutiliser
- tags
- bilingual dictionary
- Oahpa (postponed to next meeting)
FST
tags
The tags for the fst are ok now
Yaml tests. missing lemmata:
There are a lot of tests. Those that do not work can be fixed either on the FST side or on the test side, and we have a fairly clear road ahead.
The answer to that is to test the same things only once.
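One way to check that the same thing is tested only once is to look for analyses that occur in more than one test file. The sketch below is only an illustration: it assumes the tests have already been loaded into plain dictionaries (file name to analysis/surface pairs), and the file names and test pairs are made up.

```python
from collections import defaultdict


def find_duplicate_tests(test_files):
    """Return analyses that are tested in more than one file.

    test_files: {file name: {analysis string: expected surface form}}
    (hypothetical in-memory representation of the yaml tests)
    """
    seen = defaultdict(list)
    for file_name, tests in test_files.items():
        for analysis in tests:
            seen[analysis].append(file_name)
    # Keep only the analyses that appear in two or more files.
    return {a: files for a, files in seen.items() if len(files) > 1}


# Made-up example data, not taken from the real test files.
example = {
    "a.yaml": {"kala+N+Sg+Nom": "kala", "kala+N+Sg+Gen": "kala"},
    "b.yaml": {"kala+N+Sg+Nom": "kala"},
}
duplicates = find_duplicate_tests(example)
```

Each analysis reported this way can then be kept in one file and deleted from the others.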
xfst problem
The lookup utility cannot analyse/generate some words (although apply up/down in xfst can). Could we use some wrapper around xfst to emulate lookup?
Probably, yes. You lookup in xfst by the command "up", so a script for that would do. Cf. the Makefile in $GTHOME/gt/Makefile
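A wrapper of this kind could be sketched roughly as follows. It builds a list of xfst commands (load stack, apply up/down, quit) and feeds them to xfst on standard input. Assumptions: xfst is on the PATH and accepts commands on stdin when run non-interactively; the transducer file name is hypothetical; the raw xfst output is returned as-is and would still need reshaping into lookup's tab-separated format.

```python
import subprocess


def build_xfst_script(fst_path, words, direction="up"):
    """Build xfst commands that apply the transducer to each word."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    lines = [f"load stack {fst_path}"]
    for word in words:
        lines.append(f"apply {direction} {word}")
    lines.append("quit")
    return "\n".join(lines) + "\n"


def run_xfst(fst_path, words, direction="up"):
    """Run xfst with the generated commands and return its raw output.

    Assumption: xfst reads commands from stdin when it is piped input.
    """
    script = build_xfst_script(fst_path, words, direction)
    result = subprocess.run(
        ["xfst"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout


# "est.fst" is a placeholder file name for this example.
script_text = build_xfst_script("est.fst", ["dušš"], "down")
```

Only build_xfst_script is exercised here; run_xfst needs a real xfst installation and a compiled transducer.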
- xfst: lots of lemmas not generated: afišš, apašš, bjeff, bluff, briošš, böff, depešš, dispašš, dušš, flešš, giljošš, guašš, guljašš, ingušš, kartušš, kašš, kenaff, klošš, lavašš, nišš, pastišš, pekešš, pilaff, proff, retušš, riff, skiff, talõšš, tartüff, tuff, tušš, tšuvašš, šeff, žiraff
- hfst: only one lemma not generated: loe
CG
Status
Tiina: I have eliminated the inline sets from the disambiguation rules, except for inline sets consisting of lemmas. But the tags have not been converted to the FST ones, so the grammar cannot disambiguate FST output.
Currently I have technical problems running my rules with the latest vislcg3 version: I get a "Segmentation fault" error, and I have asked Tino for help.
This should make it easier to convert! Hmm... there are some inline sets in SUBSTITUTE rules. I got an error in the SUBSTITUTE rules and then tried to change the inline sets to lists or sets.
Hmm... also SUBSTITUTE CLBtag ( ): I think this shouldn't work. It worked in earlier versions. I have to see how to convert them, maybe *)
line 5518: (V "joud") is an inline set of the kind that remains (those with lemmas), because jõud can also be a noun. It can be fixed later.
line 1186: REMOVE (K <el>)
Tag naming harmonisation
There are three ways to do tag naming harmonisation:
- change to the fst tags everywhere in the cg file
- Pro: Do it once, it is done
- Con: Doing it once is a big job. The new tags in CG would be new to Tiina
- change to the fst tags on the right side of the LIST definition
- Pro: no script. Minimal change. It is simpler for me handling different versions of grammars. Rules remain the same, only definitions change. Backwards compatible.
- a script between fst and cg changing fst tags to cg tags
- Pro: Easiest for maintaining CG rules. Con: this implies another script to change from the CG tags back to FST tags, because other applications (e.g. MT) will use FST tags. This two-stage conversion would have to be lossless, and e.g. SUBSTITUTE rules will probably mess things up.
We go for alternative 2
TODO
A script for this: Fran
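As an illustration of alternative 2, a script could rewrite only the right-hand side of the LIST definitions and leave the rules alone. The sketch below is not the script Fran will write; the tag mapping is an assumption for the example (old-style S, sg, pl, nom mapped to FST-style N, Sg, Pl, Nom), and the regular expressions only handle simple one-line LIST definitions without a semicolon inside quotes.

```python
import re

# Hypothetical mapping from the current CG tags to FST tags.
TAG_MAP = {"S": "N", "sg": "Sg", "pl": "Pl", "nom": "Nom"}

# One-line definitions of the form: LIST NAME = tags ... ;
LIST_RE = re.compile(r"^(\s*LIST\s+\S+\s*=\s*)(.*?)(\s*;.*)$")
# Tokens: quoted lemmas (left untouched), parentheses, bare tags.
TOKEN_RE = re.compile(r'"[^"]*"[a-z]*|[()]|[^\s()]+')


def convert_list_line(line, tag_map=TAG_MAP):
    """Replace tags on the right side of a LIST definition."""
    match = LIST_RE.match(line)
    if not match:
        return line  # rules and other lines stay exactly as they are
    head, body, tail = match.groups()

    def swap(token_match):
        token = token_match.group(0)
        return tag_map.get(token, token)

    return head + TOKEN_RE.sub(swap, body) + tail


converted = convert_list_line("LIST NOM-SG = (S sg nom) ;")
```

Because only definitions change, the same rule file can be kept for both tag sets, which is the backwards compatibility mentioned under alternative 2.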
Script for getting from fst to cg
The current Giellatekno script for doing this is found at
+ additional script for Estonian
+ maybe add some information directly to fst dictionaries
echo "Mina olen tulnud." | preprocess | uest
Mina	mi+N+Sg+Gen#na+Adv
Mina	mi+N+Sg+Ess
Mina	mi+N+Sg+Nom#na+Adv
Mina	mina+Pron+Sg+Nom
Mina	mina+N+Sg+Gen
Mina	mina+N+Sg+Nom
Mina	mina+N+Sg+Par
olen	olema+V+Pers+Prs+Ind+Sg1+Aff
tulnud	tulnud+A
tulnud	tulma+V+Der/nu+N+Pl+Nom
tulnud	tulema+V+Pers+Past+Prc
tulnud	tulema+V+Pers+Past+Imprt
tulnud	tulema+V+Pers+Past+Ind+Neg
tulnud	tulnu+N+Pl+Nom
tulnud	tulnu+A+Pl+Nom
.	.	+?

tf-hsl-m0016:~ ttr000$ echo "Mina olen tulnud." | preprocess | uest | lookup2cg
"<Mina>"
	"mina" Pron Sg Nom
	"mina" N Sg Par
	"mi" N Sg Ess
	"mina" N Sg Gen
	"mina" N Sg Nom
"<olen>"
	"olema" V Pers Prs Ind Sg1 Aff
"<tulnud>"
	"tulema" V Pers Past Imprt
	"tulnu" A Pl Nom
	"tulnud" A
	"tulema" V Pers Past Ind Neg
	"tulma" V* Der/nu N Pl Nom
	"tulema" V Pers Past Prc
	"tulnu" N Pl Nom
"<.>"
	"." ?
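The core of the fst-to-cg conversion can be sketched in a few lines. This is a simplified illustration, not the Giellatekno lookup2cg script: it assumes lookup-style input (word form, tab, analysis, with a blank line between tokens) and only splits the analysis on "+". The handling of compounds ("#"), derivation marks such as V* and any reordering of readings seen in the real output is left out.

```python
def lookup_to_cg(lines):
    """Convert lookup-style lines to CG-style cohort lines."""
    out = []
    in_cohort = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            in_cohort = False  # blank line: the next token starts a new cohort
            continue
        fields = line.split("\t")
        wordform = fields[0]
        analysis = fields[1] if len(fields) > 1 else wordform
        if not in_cohort:
            out.append(f'"<{wordform}>"')
            in_cohort = True
        if len(fields) > 2 and fields[2] == "+?":
            # Unknown word: keep the form as lemma and mark it with "?".
            out.append(f'\t"{analysis}" ?')
        else:
            lemma, *tags = analysis.split("+")
            out.append("\t" + " ".join([f'"{lemma}"'] + tags))
    return out


sample = ["olen\tolema+V+Pers+Prs+Ind+Sg1+Aff\n", "\n", ".\t.\t+?\n"]
cg_lines = lookup_to_cg(sample)
```

Comparing the output of a sketch like this with that of the two existing scripts could help when looking at where they differ.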
TODO
- Get the Estonian lookup2cg on the table and look at the two.
- (the Perl artists): To be decided.
- This is an issue for the next meeting.
- Whoever checks in the Estonian lookup2cg scripts should indicate in svn what they are
(could it be that the scripts are in http://math.ut.ee/~kaili/grammatika/estmorfcg.tar.gz?)
jjpp: tiina: src/import has "plamk stuff", you could add the scripts under there in another subdirectory, perhaps
jjpp: langs/est/src/import, that is
talk for mutiliser
A Finnish translation company has invited Fran to talk about fin-est MT.
tags (e.g. harmonisation with Finnish)
bilingual dictionary
This has gone exactly according to plan. We now have 13000 lemma pairs.
Next step: Look at frequency lists (in both directions) and start translating from them.
- For Finnish, there is a top 2500 newspaper lemma list; we cover half of it.
- For Estonian? http://www.cl.ut.ee/ressursid/sagedused1/
TODO:
- Joonas to continue on that.
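Working from the frequency lists amounts to checking which of the top lemmas the bilingual dictionary already has and translating the missing ones in frequency order. A minimal sketch, assuming the frequency list is available as a list of lemmas ordered from most to least frequent and the dictionary side as a set of lemmas (the example words are placeholders, not real data):

```python
def coverage(frequency_lemmas, dictionary_lemmas, top_n=2500):
    """Return (share of top_n lemmas covered, missing lemmas in frequency order)."""
    top = frequency_lemmas[:top_n]
    known = set(dictionary_lemmas)
    missing = [lemma for lemma in top if lemma not in known]
    share = (len(top) - len(missing)) / len(top) if top else 0.0
    return share, missing


# Placeholder data for illustration only.
share, todo = coverage(["olla", "ja", "ei", "se"], {"olla", "ei"}, top_n=4)
```

The list of missing lemmas is then the translation to-do list, most frequent first.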
Oahpa
Heli has analysed the ca. 1500 words in the textbook dictionary with the FST and, after some work, got POS tags for all the words.
Some problems:
- How to put the pluralia tantum words into the Oahpa lexicon, e.g. käärid, püksid, andmed, jõulud, vastlad, kilekaaned, tangud, vanemad, kirjatarbed, teksad?
- What is the lemma?
- estmorf: käärid, andmed, püksid, teksad, vastlad, jõulud
- plamk: käär, anne, püks, teksa, vastel, jõul
(we could define plural forms as lemmas in plamk as well?)
käärid käär+N+Pl+Nom
In Leksa I need the citation form (käärid), but for generating forms for Morfa-S I need the lemma (käär).
Solution 1: two records for these words:
word POS
käärid X - for Leksa
käär N - out of this all the (only plural!) forms will be generated
Solution 2: one record with an additional (optional) field 'lemma': citation form (used for Leksa), lemma (used for Morfa-S form generation)
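Solution 2 could look roughly like this as a data structure. It is only a sketch of the idea, not Oahpa's actual lexicon format; the class and field names are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    citation: str                # form shown in Leksa, e.g. "käärid"
    pos: str
    lemma: Optional[str] = None  # only set when it differs from the citation form

    def generation_lemma(self):
        """Lemma to hand to the FST when generating forms for Morfa-S."""
        return self.lemma if self.lemma is not None else self.citation


scissors = LexiconEntry("käärid", "N", lemma="käär")
fish = LexiconEntry("kala", "N")
```

Ordinary words need no extra field, so existing records would stay unchanged; only the pluralia tantum entries get a lemma.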
For some words we need to give the inflection type (continuation lexicon) in the Oahpa lexicon.
e.g. kurk - kurgi and kurk - kurgu (both are in the textbook's dictionary but in different chapters)
We have +Hom1 and +Hom2 for at least piim+N. We should probably use a lot more of those, but what happens to Oahpa's dictionary then?
- Generation of forms of non-lexicalised compounds?
merisiga meri+S+sg+nom&siga+S+sg+nom
välissuhted väline+A+prefix&suhe+S+pl+nom
Oahpa was postponed to the next meeting
Next meeting
Tuesday, 25 Nov, 09:00.