141209
Samest meeting, December 9th 2014
Present: Fran, Heiki, Heli, Jaak, Kadri, Sjur, Trond, (Tiina)
Agenda
- fst
- Tag comparison
- Oahpa
- CG (problem status)
- Next physical meeting
fst
Not much has happened. Question on the lexicalisation of compounds: there are some that should be lexicalised for the good of Oahpa. They probably will be.
[69/70][FAIL] olema+V+Inf => Missing results: olla
[69/70][FAIL] olema+V+Inf => Unexpected results: olema+V+Inf+?
[70/70][FAIL] olema+V+Ger => Missing results: olles
[70/70][FAIL] olema+V+Ger => Unexpected results: olema+V+Ger+?
[55/70][FAIL] olema+V+Impers+Prs+Ind+Aff => Missing results: ollakse
[55/70][FAIL] olema+V+Impers+Prs+Ind+Aff => Unexpected results: olema+V+Impers+Prs+Ind+Aff+?
Total passes: 142, Total fails: 6, Total: 148
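To see at a glance which analyses fail and how, the FAIL records can be grouped per analysis string. The following is a minimal sketch, assuming one record per line in the format shown above; the function name `group_failures` and the sample `LOG` are made up for illustration and are not part of the test tooling.

```python
import re
from collections import defaultdict

# Matches records such as
# "[69/70][FAIL] olema+V+Inf => Missing results: olla"
FAIL_LINE = re.compile(
    r"\[(\d+)/(\d+)\]\[FAIL\]\s+(\S+)\s+=>\s+(Missing|Unexpected) results:\s+(.*)"
)


def group_failures(log_text):
    """Group FAIL records by the tested analysis string (hypothetical helper)."""
    grouped = defaultdict(dict)
    for line in log_text.splitlines():
        match = FAIL_LINE.match(line.strip())
        if match:
            _, _, analysis, kind, forms = match.groups()
            grouped[analysis][kind.lower()] = forms.strip()
    return dict(grouped)


# Two records copied from the test run above.
LOG = """\
[69/70][FAIL] olema+V+Inf => Missing results: olla
[69/70][FAIL] olema+V+Inf => Unexpected results: olema+V+Inf+?
"""

failures = group_failures(LOG)
for analysis, info in failures.items():
    print(analysis, info)
```

Each analysis then maps to its missing form and its unexpected output, which makes it easier to see that all listed failures concern the irregular forms of olema.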
Next to be done: Jaak will make the pronoun tests pass.
Tag comparison
est/tools/mt/apertium/tagsets
Tein sen antamalla.AgPrc tavalla - Tegin seda antud moel.
Hurmasit minut antamalla.MaInf minulle kirven. - Võlusid mind andes mulle kirve.
So, on our to-do list: do the same for North Saami.
Consider involving Jussi in this.
TODO:
- Put the finsme tag comparison on the agenda here in Tromsø (Trond)
- Comment the estfin ones (Heiki-Jaan, Tommi)
Oahpa
Oahpa lexicon (the lexicon of the textbook "E nagu Eesti") has been provided with POS tags. Now working on the files of tags and paradigms (Oahpa source files).
As soon as the lexicalised compounds have been added to the fst, Heli will make the next attempt to generate the word forms for Morfa-S and test Morfa-S.
CG (problem status)
The segmentation fault is solved in the latest vislcg version (it was reproducible on Tino's side).
Preprocessing
Current conversion scripts (Filosoft to vislcg3):
defining sentence boundaries (before morphological analysis) - "." is not very reliable, at least in Estonian
In CG you define e.g. <.> <?> <!> as sentence boundaries
I saw George V. I saw V. Putin. I saw George V. He was ...
In Giellatekno, the abbr.txt is a language-specific file in
The philosophy behind the lexica is documented here:
The abbreviations themselves are written in the abbreviations.lexc
Can the following rule be added to the general philosophy: all ordinal numbers (e.g. "7.", if not written with an ending like 7-ndal) end with a full stop? - Um... you have similar numbers in numbered lists, and there they are really not ordinal numbers...
We have it already.
I saw George 7. I saw 2. Putin.
Yes
I saw George 7 . I saw 2 . Putin
Giellatekno abbr.txt:
I saw George V. I saw V.AMBIG Putin . I saw George V. He was
Apertium:
George V V. V#. I saw
In Estonian, besides abbreviations and initials, we also have ordinal numbers:
George 7 . I saw the 2. tsar
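The ambiguity discussed here can be illustrated with a small sketch. It is not the Giellatekno preprocess script or the abbr.txt mechanism; it is a hypothetical heuristic, assuming whitespace tokenisation, that labels each token ending in "." as a sentence boundary, as sentence-internal, or as ambiguous. The function name `classify_full_stops` and the three labels are invented for this example.

```python
import re


def classify_full_stops(text):
    """Label every token ending in '.' (hypothetical heuristic, not the real preprocessor)."""
    tokens = text.split()
    labels = []
    for i, token in enumerate(tokens):
        if not token.endswith("."):
            continue
        stem = token[:-1]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # "Risky" stems: a single capital letter (initial or Roman numeral)
        # or plain digits (a possible ordinal number such as "2.").
        risky = bool(re.fullmatch(r"[A-Z]|\d+", stem))
        if nxt is None or not risky:
            label = "boundary"
        elif nxt[0].islower():
            # A lowercase continuation suggests the full stop belongs to the token.
            label = "internal"
        else:
            # "V. Putin" and "George V. He" look the same at this level.
            label = "ambiguous"
        labels.append((token, label))
    return labels


print(classify_full_stops("I saw George V. He was happy."))
print(classify_full_stops("I saw the 2. tsar."))
```

The "ambiguous" label is the point of the sketch: with a capitalised word following, the tokeniser alone cannot tell "V. Putin" from "George V. He", so the decision has to be left to a later step or to a language-specific list.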
The CG is dependent upon the $GTHOME/gt/script/preprocess file
TODO
- Fix the abbr.txt generation problem (Sjur, Jaak)
conversion from Filosoft to vislcg3
Not needed: it will be replaced with FST -> vislcg3, but may need some additions to the standard scripts.
Transitiveness and object cases
Adding information on the transitivity and object cases of verbs, and on the kinds of pronouns and adpositions: will this remain in scripts, or can it be moved to some more general lexicons?
This is added by scripts at the moment.
For sme, we add such sets at the preamble of the CG file. Huge
We have transitivity tags and semantic tags in lexc. In general:
- we consider lexc tagging to be faster
- but CG sets to be more flexible
We thus do both.
LIST LOC-V = "ávkkástallat" "ballat" "beassat" "beroštit" "biehttalit" "bihtit" "ceavzit" "dinet" "dolkat" "eastadallat" "eastadit" "fitnat" "fuolahit" "fuollat" "garvit" "gažadit" "geargat" "heaitit" "hehttet" "ilbmat" "jearrat" "jearralit" "luohpat" "máinnašit" "nohkkot" "oassálastit" "spiehkastit" "váibat" "váruhit" "vástidit" ;
LIST ILL-V = "áibbašit" "álgit" "ásaiduvvat" "báitit" "bahkket" "beassat" "čohkkedit" "čujuhit" "čuohcit" "deaivat" "doaškut" "dorvvastit" "došket" "duhtat" "gullat" "guoskat" "gustot" "hárjánit" "heivet" "irgidit" "irggástallat" "jáhkkit" "liikot" "luohttit" "mannat" "máhccat" "mieđihit" "miehtat" "njiedjat" "oahpásmuvvat" "oahpásnuvvat" "ollet" "oskut" "riepmat" "ráhkkanit" "soahpat" "searvat" "suhttat" "váikkuhit" "vástidit" "vuolgit" ;
LIST VAHKKU-DUR = "álgojahki" "árrageassi" "beaivi" "jándor" "bodda" "čakča" "čakčageassi" "čakčaseavdnjat" "čuohtejahki" "dálvi" "diibmu" "eahketbodda" "geassi" "giđđa" "idja" "iđitbodda" "jahki" "jahkebealle" "jahkečuohti" "kaleanddarjahki" "loahppajahki" ("[0-9]*-#lohku"r) "maŋŋe#giđđa" "mánnu" "minuhtta" "minukta" "njealjádasjahki" "skuvlajahki" "tiibma" "vahkkoloahppa" "vahkku" ; # these are periodes and can be Acc
Transitivity we do in lexc:
berehit:bereh MUITAL_TV ;
berostit:berost ALIST_IV ;
berostuvvat:berostuvva RAIMMAHALLA_IV ;
beroštit:berošt ALIST_IV ;
beroštuvvat:beroštuvva RAIMMAHALLA_IV ;
berret:berre GILLE_IV ;
besset:besse DOHPPE_TV ;
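Since both lexc tagging and CG sets are in use, one could in principle generate a CG set from the lexc entries. The sketch below is only an illustration: it assumes that the `_TV` / `_IV` suffix of the continuation class encodes transitivity, as in the entries above, and the set names `TV-V` and `IV-V` as well as the function `transitivity_lists` are hypothetical, not existing names in the sme grammar.

```python
import re
from collections import defaultdict

# lemma:stem CONTCLASS_TV ;   or   lemma:stem CONTCLASS_IV ;
ENTRY = re.compile(r"(\S+):\S+\s+\w+_(TV|IV)\s*;")


def transitivity_lists(lexc_text):
    """Build CG LIST lines from lexc entries (hypothetical set names TV-V / IV-V)."""
    lemmas = defaultdict(list)
    for lemma, suffix in ENTRY.findall(lexc_text):
        lemmas[suffix].append(lemma)
    return [
        "LIST {}-V = {} ;".format(
            suffix, " ".join('"{}"'.format(lemma) for lemma in lemmas[suffix])
        )
        for suffix in sorted(lemmas)
    ]


# Three of the entries quoted above.
SAMPLE = """\
berehit:bereh MUITAL_TV ;
berostit:berost ALIST_IV ;
besset:besse DOHPPE_TV ;
"""

cg_lists = transitivity_lists(SAMPLE)
for line in cg_lists:
    print(line)
```

A generated list like this would be long and clumsy to read, but it would not need manual maintenance; hand-written sets such as LOC-V would still be needed for information that is not encoded in lexc.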
We also have some sets in the CG rules, but very general tags are added by scripts so far, as follows:
text
preprocess | fst | postproces | scripts* | cg |
So it will remain in scripts and will not be put in the lexicon? In this case it is really more flexible. Yes, flexibility is the key word, and needed. But whether
- Fran: clumsy doesn't matter so much if they are automatically generated.
- Trond: this is irrelevant: we need a list, period.
- Tiina: rule file is not generated automatically
Tiina, do you have those scripts checked in?
TODO Tiina: add scripts to src/import.
Nominalization of verbs
Additional information on the nominalizations of verbs like past and present participles, "-mine", "-ja", "-mata"-forms and others - it would be preferable to get this from the morphological analysis as otherwise it is a hack.
Trond: I definitely agree.
dokaaminen
dokaaminen	dokata+V+Der/minen+Sg+Nom
dokata
dokata	dokata+V+Pss+Ind+Prs+Pe4+ConNeg
puhuminen
puhuminen	puhua+V+Der/minen+Sg+Nom
puhuminen	puhuminen+N+Sg+Nom   <===== missing for dokaaminen
Estonian:
jooksmine	jooksmine +N+Sg+Nom
jooksmine	jooks +N+Sg+Nom#mini+N+Pl+Par
jooksmine	jooks +N+Sg+Nom#minema+V+Pers+Prs+Imprt+Sg2
jooksmine	jooks +N+Sg+Nom#minema+V+Pers+Prs+Imprt+Sg2+Aff
jooksmine	jooks +N+Sg+Nom#minema+V+Pers+Prs+Imprt+Sg2+Neg
jooksmine jooksmine +N+Sg+Nom - there is no link to jooksma; #mini and #minema are irrelevant
andja
andja	andja+N+Sg+Gen	0.000000
andja	andja+N+Sg+Nom	0.000000
programmeerimine
programmeerimine	programmeerimine+N+Sg+Nom	0.000000
Generation:
programmeerimine+N+Sg+Gen
programmeerimine+N+Sg+Gen	programmeerimise
programmeerimine is a backward hack. sme did the same in
So, it should rather be programmeerima+V+Der/mine+... ?
Fixing this would not be hard. Jaak did it for some derivations
Warning: This could deprive us of lexemes.
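To make the proposed change concrete, here is a naive sketch that rewrites a lexicalised -mine noun analysis into a derivational one. It assumes that the tag layout should mirror the Finnish puhua+V+Der/minen+Sg+Nom analysis quoted above and that the verb lemma is obtained by replacing -mine with -ma; both are assumptions for illustration, and the function `as_derivation` is hypothetical. The real fix would be made in the fst source, not by string rewriting.

```python
def as_derivation(analysis, suffix="mine", infinitive_ending="ma"):
    """Rewrite e.g. 'jooksmine+N+Sg+Nom' as 'jooksma+V+Der/mine+Sg+Nom'.

    Naive assumption: every noun lemma ending in the suffix is deverbal and
    its verb lemma is formed by swapping the suffix for the infinitive ending.
    """
    lemma, _, tags = analysis.partition("+")
    tag_list = tags.split("+") if tags else []
    if not lemma.endswith(suffix) or not tag_list or tag_list[0] != "N":
        # Not a noun with the expected suffix: leave the analysis untouched.
        return analysis
    verb = lemma[: -len(suffix)] + infinitive_ending
    return "+".join([verb, "V", "Der/" + suffix] + tag_list[1:])


for example in ("programmeerimine+N+Sg+Gen", "jooksmine+N+Sg+Nom", "andja+N+Sg+Nom"):
    print(example, "->", as_derivation(example))
```

The sketch also shows why the warning applies: once the noun lemma is replaced by the verb lemma, the lexicalised noun is no longer available as a lexeme of its own unless it is kept in parallel.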
lexc or xml for lexc lexica?
This will be a nice coffee-table topic with Jack in Tromsø in January.
TODO
Next meeting
Short virtual meeting
Wednesday 7th at 13:00 Norwegian time.
Next physical meeting in Tromsø
Sjur would prefer to not stay over the weekend, and would like to
The long Samest meeting is then 19.-21. More details to be decided.