141111

Contents:

FST
CG
Oahpa
Next meeting

Present: Fran, Heiki, Heli, Jaak, Trond, Kadri, Tiina

Issues/topics:

FST
tags
CG
MT
talk for mutiliser
tags
bilingual dictionary
Oahpa (postponed to next meeting)

FST

xfst problem

The lookup utility cannot analyse/generate some words (although apply up/down in xfst can). Could we use some wrapper around xfst to emulate lookup?

Probably, yes. You look words up in xfst with the command "up", so a script for that would do. Cf. the Makefile in $GTHOME/gt/Makefile
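
A minimal sketch of such a wrapper, in Python. It writes the xfst commands "load stack" and "apply up"/"apply down" to the standard input of xfst and returns whatever xfst prints. The function names, the file name est.fst and the default command ("xfst" with no options) are assumptions for illustration; the raw output still contains xfst's own messages and would need filtering before it looks like lookup output.

```python
import subprocess


def build_xfst_script(fst_path, words, direction="up"):
    """Return the xfst commands that load a network and apply it to each word."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    commands = ["load stack {}".format(fst_path)]
    commands += ["apply {} {}".format(direction, word) for word in words]
    commands.append("quit")
    return "\n".join(commands) + "\n"


def run_xfst(fst_path, words, direction="up", xfst_cmd=("xfst",)):
    """Feed the command script to xfst on stdin and return its raw output.

    xfst_cmd is a parameter so that local options (e.g. for character
    encoding) can be added; which options are needed is an assumption
    left to the caller.
    """
    script = build_xfst_script(fst_path, words, direction)
    result = subprocess.run(
        list(xfst_cmd), input=script, capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_xfst("est.fst", ["dušš", "riff"], direction="down") would try to generate from two of the lemmas that lookup currently fails on.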

xfst: lots of lemmas not generated: afišš, apašš, bjeff, bluff, briošš, böff, depešš, dispašš, dušš, flešš, giljošš, guašš, guljašš, ingušš, kartušš, kašš, kenaff, klošš, lavašš, nišš, pastišš, pekešš, pilaff, proff, retušš, riff, skiff, talõšš, tartüff, tuff, tušš, tšuvašš, šeff, žiraff
hfst: only one lemma not generated: loe

CG

Status

Tiina: I have eliminated the inline sets from the disambiguation rules except inline sets consisting of lemmas. But tags are not converted to FST ones, so it cannot disambiguate FST output.

Currently I have technical problems running my rules with the latest vislcg3 version: I get a "Segmentation fault" error and have asked Tino for help.

This should make it easier to convert! Hmm... there are some inline sets in SUBSTITUTE rules. In the SUBSTITUTE rules I got an error and then tried to change them to lists or sets.

Hmm... also SUBSTITUTE CLBtag ( ): I think this shouldn't work. It worked in earlier versions. I have to see how to convert them, maybe *) Do you want to remove the tag? If so, I think it is (*). What version of vislcg3 do you have? I tried to update to the latest version but get a "Segmentation fault" error. Tino suggested it is because of different versions being installed, but I couldn't fix it so far.

line 5518: (V "joud") is an inline set. Only inline sets with lemmas remain, because jõud can also be a noun. It can be fixed later. Could it be rewritten as ("joud") + V? OK.

line 1186: REMOVE (K <el>)

Tag naming harmonisation

There are three ways to do tag naming harmonisation:

1. Change to the FST tags everywhere in the CG file
   - Pro: Do it once, and it is done.
   - Con: Doing it all at once is a big job. The new tags in CG will be new to Tiina.
2. Change to the FST tags on the right-hand side of the LIST definitions
   - Pro: No script. Minimal change. It is simpler for me handling different versions of grammars. Rules remain the same, only definitions change. Backwards compatible.
3. A script between FST and CG changing FST tags to CG tags
   - Con: This implies another script to change from the CG tags back to FST tags, because other applications (e.g. MT) will use FST tags. This two-stage conversion should be lossless, but e.g. SUBSTITUTE rules will probably mess things up.
   - Pro: Easiest for maintaining CG rules.

We go for alternative 2

TODO

A script for this: Fran
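
As a rough illustration of alternative 2, the sketch below rewrites only the right-hand side of single-line LIST definitions and leaves rules, set names and quoted lemmas untouched. The tag mapping (S to N, sg to Sg, nom to Nom) is a hypothetical fragment, not the agreed mapping, and the function name is made up; multi-line LIST definitions and lemmas containing ";" are not handled.

```python
import re

# Hypothetical fragment of an old-CG-tag to FST-tag mapping (an assumption).
TAG_MAP = {"S": "N", "sg": "Sg", "nom": "Nom"}

# Prefix "LIST Name = ", then the tag body, then the closing ";" and anything after it.
LIST_RE = re.compile(r'^(\s*LIST\s+\S+\s*=\s*)(.*?)(\s*;.*)$')
# A token is either a quoted lemma or a run of characters without
# whitespace, parentheses or quotes.
TOKEN_RE = re.compile(r'"[^"]*"|[^\s()"]+')


def convert_list_line(line, tag_map):
    """Map tags on the right-hand side of a one-line LIST definition."""
    match = LIST_RE.match(line)
    if match is None:
        return line  # not a LIST definition: rules stay as they are

    def replace(token_match):
        token = token_match.group(0)
        if token.startswith('"'):
            return token  # quoted lemmas are never remapped
        return tag_map.get(token, token)

    head, body, tail = match.groups()
    return head + TOKEN_RE.sub(replace, body) + tail
```

Applied line by line to a grammar file, this changes 'LIST NomSg = (S sg nom) ;' into 'LIST NomSg = (N Sg Nom) ;' while a rule such as 'REMOVE (S sg) ;' passes through unchanged.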

Script for getting from fst to cg

The current Giellatekno script for doing this is found at $GTHOME/gt/script/lookup2cg

+ additional script for Estonian

+ maybe add some information directly to fst dictionaries

echo "Mina olen tulnud." | preprocess | uest
Mina        mi+N+Sg+Gen#na+Adv
Mina        mi+N+Sg+Ess
Mina        mi+N+Sg+Nom#na+Adv
Mina        mina+Pron+Sg+Nom
Mina        mina+N+Sg+Gen
Mina        mina+N+Sg+Nom
Mina        mina+N+Sg+Par

olen        olema+V+Pers+Prs+Ind+Sg1+Aff

tulnud        tulnud+A
tulnud        tulma+V+Der/nu+N+Pl+Nom
tulnud        tulema+V+Pers+Past+Prc
tulnud        tulema+V+Pers+Past+Imprt
tulnud        tulema+V+Pers+Past+Ind+Neg
tulnud        tulnu+N+Pl+Nom
tulnud        tulnu+A+Pl+Nom

.        .        +?

echo "Mina olen tulnud." | preprocess | uest | lookup2cg
"<Mina>"
         "mina" Pron Sg Nom
         "mina" N Sg Par
         "mi" N Sg Ess
         "mina" N Sg Gen
         "mina" N Sg Nom
"<olen>"
         "olema" V Pers Prs Ind Sg1 Aff
"<tulnud>"
         "tulema" V Pers Past Imprt
         "tulnu" A Pl Nom
         "tulnud" A
         "tulema" V Pers Past Ind Neg
         "tulma" V* Der/nu N Pl Nom
         "tulema" V Pers Past Prc
         "tulnu" N Pl Nom
"<.>"
         "." ?

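The core of the conversion shown above can be sketched in a few lines of Python. This is a simplified illustration, not the Giellatekno lookup2cg script: it assumes whitespace-separated lookup output with a blank line between word forms, does not sort or filter readings, and does not reproduce the special handling of compounds (#) or derivations (the V* in the output above). The function name is made up.

```python
def lookup_to_cg(lookup_output):
    """Convert lookup-style 'form analysis' lines into CG cohorts."""
    lines_out = []
    current_form = None
    for line in lookup_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            current_form = None  # a blank line ends the current cohort
            continue
        form = fields[0]
        # Unknown words come out as 'form form +?'; joining the remaining
        # fields gives 'form+?', which splits like any other analysis.
        analysis = "".join(fields[1:])
        if form != current_form:
            lines_out.append('"<{}>"'.format(form))
            current_form = form
        lemma, *tags = analysis.split("+")
        lines_out.append('\t"{}" {}'.format(lemma, " ".join(tags)))
    return "\n".join(lines_out)
```

Because cohorts are closed on blank lines, the same word form occurring twice in a text yields two separate cohorts.
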
TODO

Get the Estonian lookup2cg on the table and compare the two scripts.
- (the Perl artists): To be decided.
- This is an issue for the next meeting.
- Whoever checks in the Estonian lookup2cg scripts: indicate in svn what they are

(could it be that the scripts are in http://math.ut.ee/~kaili/grammatika/estmorfcg.tar.gz?)

jjpp: tiina: src/import has "plamk stuff", you could add the scripts under there in another subdirectory, perhaps
jjpp: langs/est/src/import, that is

talk for mutiliser

A Finnish translation company has invited Fran to talk about Finnish-Estonian (fin-est) MT. At this point, the feeling in the meeting is that it is too early.

tags (e.g. harmonisation with Finnish)

bilingual dictionary

This has gone exactly according to plan. We now have 13000 lemma pairs.

Next step: Look at frequency lists (in both directions) and start translating from them.

For Finnish, there is a top 2500 newspaper lemma list; we cover half of it.
For Estonian? http://www.cl.ut.ee/ressursid/sagedused1/

TODO:

Joonas to continue on that.

Oahpa

Heli has analysed the ca. 1500 words in the textbook dictionary with the FST and, after some work, got POS tags for all the words.

Some problems:

How to put the pluralia tantum words into the Oahpa lexicon, e.g. käärid, püksid, andmed, jõulud, vastlad, kilekaaned, tangud, vanemad, kirjatarbed, teksad?
What is the lemma?
1. estmorf: käärid, andmed, püksid, teksad, vastlad, jõulud
2. plamk: käär, anne, püks, teksa, vastel, jõul

(we could define plural forms as lemmas in plamk as well?)

käärid    käär+N+Pl+Nom 
In Leksa I need the citation form (käärid), but for generating forms for Morfa-S I need the lemma (käär).
Solution 1: two records for these words:
word    POS
käärid     X  - for Leksa 
käär        N  - out of this all the (only plural!) forms will be generated
Solution 2: one record with an additional (optional) field 'lemma'
citation form (used for Leksa), lemma (used for Morfa-S form generation)
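
Solution 2 could look roughly like the record below: one entry per word, with the lemma field filled in only when it differs from the citation form. The class and field names are hypothetical and do not reflect the actual Oahpa lexicon format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    citation: str                 # form shown in Leksa, e.g. "käärid"
    pos: str                      # part of speech, e.g. "N"
    lemma: Optional[str] = None   # FST lemma for Morfa-S, only if different

    def generation_lemma(self):
        """Lemma to hand to the FST when generating forms for Morfa-S."""
        return self.lemma if self.lemma is not None else self.citation


# A plurale tantum word needs both fields; an ordinary word needs only one.
scissors = LexiconEntry(citation="käärid", pos="N", lemma="käär")
milk = LexiconEntry(citation="piim", pos="N")
```

With this layout Leksa reads only the citation field, and Morfa-S calls generation_lemma(); restricting generation to plural forms would still need a separate marker.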

For some words we need to give the inflection type (continuation lexicon) in the Oahpa lexicon.

e.g. kurk - kurgi and kurk - kurgu (both are in the textbook's dictionary but in different chapters); kaste - kastme, juhe - juhtme, anne - andme, and not kaste - kaste, juhe - juhte, anne - ande

We have +Hom1 and +Hom2 for at least piim+N. We probably should use a lot more of those, but what happens to Oahpa's dictionary then?

Generation of forms of non-lexicalised compounds?

merisiga    meri+S+sg+nom&siga+S+sg+nom
välissuhted    väline+A+prefix&suhe+S+pl+nom

Oahpa was postponed to the next meeting

Next meeting

Tuesday, 25 Nov, 09:00.