150115
First Tromsø meeting January 15th, 2015
Participants: Fran, Heiki-Jaan, Heli, Jaak, Kadri, Sjur, Tiina, Trond
Agenda
- Status
- Flag diacritics
- Infrastructure issues
- Topics...
- CG work
- Weekend
Status
Christmas holiday sort of hit us.
Last meeting's todo list.
Tag issue:
Illative issue:
E nagu Eesti coverage in Oahpa
Make further CG steps
CG
Estonian CG
pre/post-processing for the CG
cat <file> | preprocess | uest | lookup2cg | <preconverters> | vislcg3 .. | <postconverter>
The files are in est/tools/preprocess/:
tools/preprocess/tagger15.c
tools/preprocess/addlex.lx
tools/preprocess/preprocess.txt
tools/preprocess/pron.pl
Their functions: Lexical information added to
The files:
- tagger15.c + addlex.lx =
  - transitiveness of verbs and cases of objects,
  - types of adpositions (pre, post) and cases of nouns.
- preprocess.txt = brief instructions
- pron.pl = classification of pronouns
compiling:
gcc -o tagger tagger15.c
Usage:
tagger addlex.lx infile outfile
pron.pl < infile > outfile
Let's analyse the sentence "Kass elas taadi juures."
cat <file> | preprocess | uest | \
lookup2cg | \
tools/preprocess/tagger tools/preprocess/addlex.lx stdin stdout | \
perl tools/preprocess/pron.pl | \
vislcg3 -g src/syntax/disambiguation.cg3 | \
<postconverter>
For removing second lemma just add a .pl file as a postconverter with:
A functional example:
echo "Kass elas taadi juures." | preprocess | hfst-lookup -q -p src/analyser-gt-desc.hfst | cut -f1-2 \
| lookup2cg | tools/preprocess/tagger tools/preprocess/addlex.lx stdin stdout | perl tools/preprocess/pron.pl \
| vislcg3 -g src/syntax/disambiguation.cg3 | vislcg3 -g src/syntax/functions.cg3
Expected output:
"<Kass>"
	"kass" N Sg Nom @SUBJ>
"<elas>"
	"elama" V Pers Prt Ind Sg3 Aff <Intr> <In> <Ad> "ela"
"<taadi>"
	"taadi" N Sg Gen @>N
	"taat" N Sg Gen @>N
"<juures>"
	"juures" Adv @<ADVL
	"juur" N Sg Ine
	"juures" Adp Post <gen>
"<.>"
	"." ? Z Fst
They are placed in tools/preprocess for the time being
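In the expected output for "elas", the reading ends in a second quoted lemma ("ela"). A postconverter that drops it could look roughly like the sketch below. It is written in Python rather than as the .pl file mentioned in the notes, the function name strip_second_lemma is made up for illustration, and it assumes the second lemma is always the last quoted string on an indented reading line.

```python
import re

# Assumption: a reading line is indented, starts with a quoted lemma, and the
# unwanted second lemma is a quoted string at the very end of that line.
TRAILING_LEMMA = re.compile(r'^(\s+"[^"]*"(?:\s.*?)?)\s+"[^"<>]*"\s*$')


def strip_second_lemma(line):
    """Return the line without a trailing quoted lemma, if it has one."""
    match = TRAILING_LEMMA.match(line)
    if match:
        return match.group(1)
    # Wordform lines and readings with a single lemma pass through unchanged.
    return line.rstrip("\n")


# Small demonstration on two lines taken from the expected output.
sample = [
    '"<elas>"',
    '\t"elama" V Pers Prt Ind Sg3 Aff <Intr> <In> <Ad> "ela"',
]
for line in sample:
    print(strip_second_lemma(line))
```

In the pipeline this would read the vislcg3 output line by line and sit in the <postconverter> slot.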
Tag conversion
Lemma format
hõbeta 27; ! (27_V -> ) a0: hõbeta
hõigata 27; ! (27_V -> ) a0: hõigata
hõiK1 30; ! (30_V -> ) at: hõikle, (bt): hõikel, bn: hõigel
hõiK1a 29; ! (29_V -> ) at: hõika, an: hõiga

hõbetama:hõbeta 27; ! (27_V -> ) a0: hõbeta
hõigatama:hõigata 27; ! (27_V -> ) a0: hõigata
hõikma:hõiK1 30; ! (30_V -> ) at: hõikle, (bt): hõikel, bn: hõigel
hõikama:hõiK1a 29; ! (29_V -> ) at: hõika, an: hõiga
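The second set of entries differs from the first only by the added "lemma:" prefix. A rough sketch of how that prefix could be generated is given below. The rule in it is only a guess read off these four verbs: an upper-case letter plus digit (such as K1) is replaced by its lower-case letter and -ma is appended. It is not taken from the actual lexicon scripts, and add_lemma is a made-up name.

```python
import re

# Assumption (from the four examples only): an upper-case letter followed by
# a digit, e.g. "K1", is a stem-internal marker that surfaces in the lemma as
# the plain lower-case letter.
MARKER = re.compile(r"([A-ZÕÄÖÜ])\d")


def add_lemma(entry):
    """Turn 'stem CLASS; ! comment' into 'lemma:stem CLASS; ! comment'."""
    stem, rest = entry.split(" ", 1)
    # Assumption: the verb lemma is the cleaned stem plus the -ma ending.
    lemma = MARKER.sub(lambda m: m.group(1).lower(), stem) + "ma"
    return f"{lemma}:{stem} {rest}"


for entry in [
    "hõbeta 27; ! (27_V -> ) a0: hõbeta",
    "hõiK1a 29; ! (29_V -> ) at: hõika, an: hõiga",
]:
    print(add_lemma(entry))
```

Other inflection classes may well need different rules, so the output would have to be checked by hand.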
Preprocessor for the fst
License issue
Finnish CG
Tag issues
Gold standard
"<Monta>"
	"moni" Pron Qu Sg Par @→N
"<vuotta>"
	"vuosi" N Sg Par @Num<
"<kerrotaan>"
	"kertoa" V Pss Ind Prs Pe4 @X
"<sinun>"
	"sinä" Pron Pers Sg Gen @←OBJ
"<pitäneen>"
	"pitää" V Act PrfPrc Sg Gen @NES
"<Johannaa>"
	"Johanna" N Prop Sg Par @←OBJ
"<silmällä>"
	"silmä" N Sg Ade @X
"<,>"
	"," Punct CLB
"<mutta>"
	"mutta" CC CLB @X
"<minä>"
	"minä" Pron Pers Sg Nom @SUBJ→
"<tulin>"
	"tulla" V Act Ind Prt Sg1 @+FMAINV
"<ja>"
	"ja" CC CLB @CNP
"<sieppasin>"
	"siepata" V Act Ind Prt Sg1 @+FMAINV
"<yks>"
	"yks" Pron Qu Indef Sg Nom @X
"<'>"
	"'" Punct CLB
"<kaks>"
	"kaks" ? @X
"<'>"
	"'" Punct CLB
"<tytön>"
	"tyttö" N Sg Gen @←OBJ
"<sinulta>"
	"sinä" Pron Pers Sg Abl @X
"<.>"
	"." Punct CLB
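One possible use of such a gold standard is to compare it cohort by cohort with the analyser output. The sketch below is only an illustration of that idea. It assumes the usual CG stream layout (a wordform line starting with "<, followed by indented reading lines) and that both files have the same tokenisation. The names parse_cg and agreement are made up.

```python
def parse_cg(text):
    """Parse a CG stream into a list of (wordform, [reading, ...]) pairs."""
    cohorts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith('"<'):
            # A new cohort starts with the wordform line.
            cohorts.append((line.strip(), []))
        elif line[0].isspace() and cohorts:
            # Indented lines are readings; split on whitespace, which
            # assumes lemmas do not contain spaces.
            cohorts[-1][1].append(line.split())
    return cohorts


def agreement(gold, test):
    """Return (identical cohorts, total cohorts) for two aligned streams."""
    gold_cohorts, test_cohorts = parse_cg(gold), parse_cg(test)
    if [w for w, _ in gold_cohorts] != [w for w, _ in test_cohorts]:
        raise ValueError("token sequences differ")
    same = sum(
        1
        for (_, g), (_, t) in zip(gold_cohorts, test_cohorts)
        if g == t
    )
    return same, len(gold_cohorts)


gold = '"<Monta>"\n\t"moni" Pron Qu Sg Par @→N\n"<vuotta>"\n\t"vuosi" N Sg Par @Num<\n'
test = '"<Monta>"\n\t"moni" Pron Qu Sg Par @→N\n"<vuotta>"\n\t"vuosi" N Sg Par @X\n'
print(agreement(gold, test))
```

Comparing whole cohorts is the strictest measure; comparing only the syntactic tag or only the morphological tags would be small variations of the same loop.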
Flag diacritics
To regulate:
- compound restriction ( *N+V vs N+V+Der/N )
- restricting inflectional patterns for certain lemmas where one paradigm
- handle lowercasing of derived names
- also handle Genitive Attribute (genitive of place names written in lower
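As a reminder of how flag diacritics filter paths, here is a small simulation of the common operators (P = set, R = require, D = disallow, C = clear, U = unify) as they work in xfst/HFST. It is a simplified sketch: the N (negative set) operator is left out, and the flag name CmpN and the path symbols are invented for the example, not taken from the Estonian sources.

```python
import re

# @OP.FEATURE.VALUE@ or @OP.FEATURE@ (value-less form for R, D and C)
FLAG = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def path_is_valid(symbols):
    """Check whether the flag diacritics along one path are consistent."""
    state = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, not a flag
        op, feature, value = match.groups()
        current = state.get(feature)
        if op == "P":
            state[feature] = value
        elif op == "C":
            state.pop(feature, None)
        elif op == "R":
            # Fails if unset, or set to a different value than required.
            if current is None or (value is not None and current != value):
                return False
        elif op == "D":
            # Fails if set at all (no value given) or set to this value.
            if current is not None and (value is None or current == value):
                return False
        elif op == "U":
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical compound restriction: a noun first part sets CmpN, and the
# verbal inflection disallows it, while the Der/N continuation does not.
print(path_is_valid(["@P.CmpN.ON@", "noun", "verb", "@D.CmpN.ON@", "+V"]))
print(path_is_valid(["@P.CmpN.ON@", "noun", "verb", "+V", "+Der/N"]))
print(path_is_valid(["verb", "@D.CmpN.ON@", "+V"]))
```

With this setup the N+V path is rejected, while N+V+Der/N and the plain verb are accepted.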
The proper nouns as genitive attributes are tagged like this right now (no special +G or +Attr or whatever tags):
echo "Taani" | lookup analyser-gt-norm.xfst
Taani taani +N+Sg+Gen
the same for the lowercase:
echo "taani" | lookup analyser-gt-norm.xfst
Taani taani +N+Sg+Gen
In the tag conversion table (lang/est/doc/TagList.jspwiki) we have:
+S +N ... +G +Attr
TODO:
- add flags for lowercasing derived names (Jaak)
- check that lowercasing is actually working with the present code
- extend the proper noun case handling to add Genitive Attr, and only when lowercased (Jaak)
Infrastructure issues
- preprocess: $LANG/tools/preprocess/ - add pre-CG tag insertion here
- "is there a way to speed up yaml tests?"
Weekend