150115

First Tromsø meeting January 15th, 2015

Participants: Fran, Heiki-Jaan, Heli, Jaak, Kadri, Sjur, Tiina, Trond

Agenda

  • Status
  • Flag diacritics
  • Infrastructure issues
  • Topics...
  • CG work
  • Weekend

Status

The Christmas holiday slowed us down. The earlier fst problems are still unresolved. The short illative used to work with the way we handled exceptions earlier; now it does not.

Last meeting's todo list.

Tag issue:

Topic here in Tromsø next (?) week

Illative issue:

Outside this meeting

E nagu Eesti coverage in Oahpa

Awaiting the short illative solution

Make further CG steps

Steps are being made, see next point.

CG

Estonian CG

pre/post-processing for the CG

cat <file> | preprocess | uest | lookup2cg | <preconverters> | vislcg3 .. | <postconverter> 

The files are in est/tools/preprocess/:

tools/preprocess/tagger15.c
tools/preprocess/addlex.lx
tools/preprocess/preprocess.txt
tools/preprocess/pron.pl

Their function is to add lexical information to the morphologically analysed text before CG analysis.

The files:

  • tagger15.c + addlex.lx =
    • transitiveness of verbs and cases of objects,
    • types of adpositions (pre, post) and cases of nouns.
  • preprocess.txt = brief instructions
  • pron.pl = classification of pronouns

compiling:

gcc -o tagger tagger15.c

Usage:

tagger addlex.lx  infile  outfile
pron.pl  < infile  > outfile

Let's analyse the sentence "Kass elas taadi juures."

cat <file> | preprocess | uest | \
lookup2cg | \
tools/preprocess/tagger tools/preprocess/addlex.lx stdin stdout | \
perl tools/preprocess/pron.pl | \
vislcg3 -g src/syntax/disambiguation.cg3 | \
<postconverter>

For removing the second lemma, just add a .pl file as a postconverter with:

s: \"(.*)ma\" V (.*) (\".*\" )\n$: \"$1ma\" V $2\n: g;

SELECT ("kiri") + Adit (-1 ("mine") OR ("pane") + V);

LIST DaVerb = ("või" V) "saa" "ole" "mine" "tege" "valmista" "tekita" "sünnita" "soovi" "pakku" ("too" V) ("jää" V) "kujune" "muutu" "saa" "tekki" "aita" <InfP>;
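
A minimal sketch of such a postconverter written as a shell filter. It assumes that readings are indented lines in the CG stream format and that the second lemma is the last quoted string on a verb reading. The function name strip_second_lemma is hypothetical, and the expression is a simplified variant of the Perl substitution, not a replacement for it.

```shell
# Hypothetical postconverter: drop a trailing second lemma from verb readings.
# Assumes CG stream format: reading lines are indented and start with "lemma".
strip_second_lemma() {
  # Match: indent, a quoted lemma ending in -ma, the tag V, any further tags,
  # then one final quoted string (the second lemma), which is removed.
  # Lines that do not match (cohort lines, non-verb readings) pass unchanged.
  sed -E 's/^([[:space:]]+"[^"]*ma" V( .*)?) "[^"]*"[[:space:]]*$/\1/'
}
```

It could be appended as the last step of the pipeline, for example `... | vislcg3 -g src/syntax/disambiguation.cg3 | strip_second_lemma`.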

A functional example:

echo "Kass elas taadi juures." | preprocess | hfst-lookup -q -p src/analyser-gt-desc.hfst | cut -f1-2 \
| lookup2cg | tools/preprocess/tagger tools/preprocess/addlex.lx stdin stdout | perl tools/preprocess/pron.pl \
| vislcg3 -g src/syntax/disambiguation.cg3 | vislcg3 -g src/syntax/functions.cg3

Expected output:

"<Kass>"
        "kass" N Sg Nom @SUBJ>
"<elas>"
        "elama" V Pers Prt Ind Sg3 Aff <Intr> <In> <Ad> "ela"
"<taadi>"
        "taadi" N Sg Gen @>N
        "taat" N Sg Gen @>N
"<juures>"
        "juures" Adv @<ADVL
        "juur" N Sg Ine
        "juures" Adp Post <gen>
"<.>"
        "." ? Z Fst

The preprocessing tools are kept in tools/preprocess for the time being.
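
To see at a glance which word forms remain ambiguous in output like the expected output above, a small awk filter can count the readings per cohort. This is only a sketch: the function name ambiguous_cohorts is hypothetical, and it assumes the indented CG stream format shown in these notes.

```shell
# Hypothetical helper: list word forms that still have more than one reading.
# Output: the cohort line, a tab, and the number of readings.
ambiguous_cohorts() {
  awk '
    /^"<.*>"/ {                  # cohort line: a new word form starts
      if (n > 1) print form "\t" n
      form = $0; n = 0; next
    }
    /^[ \t]+"/ { n++ }           # indented reading line
    END { if (n > 1) print form "\t" n }
  '
}
```

For the example sentence it would report "<taadi>" with 2 readings and "<juures>" with 3.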

Tag conversion

Forthcoming, Kadri

Lemma format

 hõbeta 27; ! (27_V -> ) a0: hõbeta
 hõigata 27; ! (27_V -> ) a0: hõigata
 hõiK1 30; ! (30_V -> ) at: hõikle, (bt): hõikel, bn: hõigel
 hõiK1a 29; ! (29_V -> ) at: hõika, an: hõiga
 hõbetama:hõbeta 27; ! (27_V -> ) a0: hõbeta
 hõigatama:hõigata 27; ! (27_V -> ) a0: hõigata
 hõikma:hõiK1 30; ! (30_V -> ) at: hõikle, (bt): hõikel, bn: hõigel
 hõikama:hõiK1a 29; ! (29_V -> ) at: hõika, an: hõiga
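
The entries above appear to show the same verbs first as stem-only entries and then as lemma:stem entries. As a rough sketch for checking entries in either form, the filter below splits each line into lemma, stem and continuation class. The function name split_lexc_entry is hypothetical, and it assumes one entry per line with "!" only used to start a comment.

```shell
# Hypothetical helper: print lemma, stem and continuation class, tab-separated.
# Stem-only entries get "-" in the lemma column.
split_lexc_entry() {
  awk '
    {
      sub(/!.*$/, "")               # drop the lexc comment (it may contain colons)
      if (NF < 2) next              # skip blank or incomplete lines
      cls = $2; sub(/;$/, "", cls)  # continuation class without the semicolon
      n = split($1, part, ":")
      if (n == 2) print part[1] "\t" part[2] "\t" cls
      else        print "-" "\t" $1 "\t" cls
    }
  '
}
```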

Preprocessor for the fst

EstNLTK is a free Estonian natural language toolkit, written in Python at the University of Tartu. It may include preprocessing that we could reuse; we should compare it with our existing preprocessor and reuse ideas where they fit: http://tpetmanson.github.io/estnltk/

License issue

proceeding, plan for next step.

Finnish CG

$GTHOME/kt/fin/src/fin-dis.cg1
https://gtsvn.uit.no/langtech/trunk/kt/fin/src/fin-dis.cg1

$GTHOME/langs/fin/src/syntax/disambiguation.cg3
https://gtsvn.uit.no/langtech/trunk/langs/fin/src/syntax/disambiguation.cg3

Tag issues

There are still unresolved tag issues in the Finnish cg1/cg3 conversion. This is a topic for the Tromsø week.

Gold standard

/lang/sme/j-sme.html

"<Monta>"
        "moni" Pron Qu Sg Par @→N
"<vuotta>"
        "vuosi" N Sg Par @Num<
"<kerrotaan>"
        "kertoa" V Pss Ind Prs Pe4 @X
"<sinun>"
        "sinä" Pron Pers Sg Gen @←OBJ
"<pitäneen>"
        "pitää" V Act PrfPrc Sg Gen @NES
"<Johannaa>"
        "Johanna" N Prop Sg Par @←OBJ
"<silmällä>"
        "silmä" N Sg Ade @X
"<,>"
        "," Punct CLB
"<mutta>"
        "mutta" CC CLB @X
"<minä>"
        "minä" Pron Pers Sg Nom @SUBJ→
"<tulin>"
        "tulla" V Act Ind Prt Sg1 @+FMAINV
"<ja>"
        "ja" CC CLB @CNP
"<sieppasin>"
        "siepata" V Act Ind Prt Sg1 @+FMAINV
"<yks>"
        "yks" Pron Qu Indef Sg Nom @X
"<'>"
        "'" Punct CLB
"<kaks>"
        "kaks" ? @X
"<'>"
        "'" Punct CLB
"<tytön>"
        "tyttö" N Sg Gen @←OBJ
"<sinulta>"
        "sinä" Pron Pers Sg Abl @X
"<.>"
        "." Punct CLB

Flag diacritics

To regulate:

  • compound restriction ( *N+V vs N+V+Der/N )
  • restricting inflectional patterns for certain lemmas where one paradigm belongs to a frequent sense, and another paradigm belongs to a low-frequency sense of the same lemma
  • handle lowercasing of derived names

The proper nouns as genitive attributes are tagged like this right now (no special +G or +Attr or whatever tags):

echo "Taani" | lookup analyser-gt-norm.xfst
Taani    taani    +N+Sg+Gen

the same for the lowercase:

echo "taani" | lookup analyser-gt-norm.xfst
Taani    taani    +N+Sg+Gen

In the tag conversion table (lang/est/doc/TagList.jspwiki) we have:

+S        +N
...
+G        +Attr
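
A minimal sketch of how the two rows shown above could be applied to analyser output with sed. It assumes the left column is the old tag and the right column the new one, and it covers only these two rows; the function name convert_tags and the input strings in the usage note are hypothetical.

```shell
# Hypothetical tag converter for the two table rows shown: +S -> +N, +G -> +Attr.
# A tag is only rewritten when followed by another "+" or the end of the line,
# so that +Sg and +Gen are left untouched.
convert_tags() {
  sed -E -e 's/\+S(\+|$)/+N\1/g' -e 's/\+G(\+|$)/+Attr\1/g'
}
```

For instance, `echo "taani+S+Sg+Gen" | convert_tags` would give `taani+N+Sg+Gen`.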

TODO:

  • add flags for lowercasing derived names (Jaak)
    • check that lowercasing is actually working with the present code
  • extend the proper noun case handling to add Genitive Attr, and only when lowercased (Jaak)

Infrastructure issues

  • preprocess: $LANG/tools/preprocess/ - add pre-CG tag insertion here
  • "Is there a way to speed up yaml tests?" The original Python source code for the yaml test bench can be found here: https://github.com/bbqsrc/morph-test

Weekend

will be discussed around the dinner table.