How To Analyse With HFST

Requirements

At the moment, you need to have both HFST and Apertium installed.
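
A quick way to confirm that the two tools used in the pipeline are available is sketched here. The function name missing_tools is only an illustration; the sketch assumes that the binaries are called hfst-proc and apertium-destxt, as in the commands on this page, and that they should be found on your PATH.

```shell
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two binaries used in the pipeline on this page.
missing=$(missing_tools hfst-proc apertium-destxt)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi
```

If nothing is printed, both commands were found.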

Usage

The command pipeline is as follows:

ccat -l sme -a -r $GTFREE/prestable/converted/sme/ | apertium-destxt | \
    hfst-proc -C -e -q -w sme/bin/sme.hfstol | l

The pipeline produces output such as the following:

"<Sámedikkis>"
        "Sámediggi"     N Prop Org Sg Loc
        "sámediggi"     Org Build N Sg Acc PxSg3
        "sámediggi"     Org Build N Sg Gen PxSg3
        "sámediggi"     Org Build N Sg Loc
"<lea>"
        "leat"  V IV Ind Prs Sg3
"<dasa lassin>"
        "dasa lassin"   Adv
"<váldi>"
        "váldi" Hum Org N Actor Sg Acc
        "váldi" Hum Org N Actor Sg Gen
        "váldi" Hum Org N Actor Sg Nom
        "váldi" Org N Sg Nom
        "váldit"        V TV Imprt Du2
        "váldit"        V TV PrsPrc
        "váldit"        V TV ‡ Actor N Sg Acc
        "váldit"        V TV ‡ Actor N Sg Gen
        "váldit"        V TV ‡ Actor N Sg Nom
"<ja>"
        "ja"    CC
"<váikkuhanfápmu>"
        "váikkuhanfápmu"        N Sg Nom
"<go>"
        "go"    CS
        "go"    Qst Pcle
"<oassálastá>"
        "oassálastit"   V IV Ind Prs Sg3
"<ja>"
        "ja"    CC
"<go>"
        "go"    CS
        "go"    Qst Pcle
"<ovddastuvvo>"
        "ovddastit"     V TV ‡ Der/PassL V Imprt ConNeg
        "ovddastit"     V TV ‡ Der/PassL V Imprt Sg2
        "ovddastit"     V TV ‡ Der/PassL V Ind Prs ConNeg
        "ovddastit"     V TV ‡ Der/PassL V Ind Prs Sg3
        "ovddastit"     V TV ‡ Der/PassL V VGen
"<iešguđet>"
        "iešguhtet"     Pron Indef Acc
        "iešguhtet"     Pron Indef Gen
        "iešguđet"      Pron Indef
"<ge>"
        "ge"    Pcle
"<lávdegottiin>"
        "lávdegoddi"    Org N Pl Loc
        "lávdegoddi"    Org N Sg Com
"<,>"
        ","     CLB
"<stivrrain>"
        "stivra"        Org N Pl Loc
        "stivra"        Org N Sg Com
"<ja>"
        "ja"    CC
"<ráđiin>"
        "ráđđi" Org N Pl Loc
        "ráđđi" Org N Sg Com
"<.>"
        "."     CLB
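
In the output above, each "<wordform>" line is followed by one indented line per analysis. To get an impression of how ambiguous the analysis is, the readings per word form can be counted with a small awk filter. This is only a sketch: the function name count_readings is made up for the illustration, and it assumes that reading lines are always indented and start with a quoted lemma, as in the sample.

```shell
# Read CG-style output on stdin and print, for each word form,
# the number of readings, a tab, and the word form line.
count_readings() {
    awk '
        /^"</ {                       # word form line: "<wordform>"
            if (form != "") print n "\t" form
            form = $0; n = 0
            next
        }
        /^[ \t]+"/ { n++ }            # reading line: indented "lemma" plus tags
        END { if (form != "") print n "\t" form }
    '
}
```

Appending `| count_readings | sort -rn` to the analysis pipeline, in place of the final pager, would list the most ambiguous word forms first.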

The commands explained

ccat -l sme -a -r $GTFREE/prestable/converted/sme/

Extracts all North Saami texts from our open corpus repository, returning them as newline-separated paragraphs.

apertium-destxt

Deformats plain-text input. This is needed because the following tool (hfst-proc) is presently unable to handle certain characters that are reserved in the Apertium system. Most importantly, strings containing a forward slash ("/") will not break the analysis as long as we use this Apertium tool.

hfst-proc -C -e -q --weight-classes 1 sme/bin/sme.hfstol

Tokenises and analyses the input text, and outputs it in a Constraint Grammar-friendly format. It outputs only the best analyses (those sharing the same weight), and outputs the simplest compound analysis available (determined by counting # symbols).
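
The idea of choosing the simplest compound by counting # symbols can be illustrated outside hfst-proc with a small filter. This is a sketch of the principle only, not hfst-proc's actual implementation; the function name fewest_boundaries and the one-analysis-per-line input format are assumptions made for the example.

```shell
# Read one analysis string per line on stdin and print only those
# with the smallest number of "#" compound boundary symbols.
fewest_boundaries() {
    awk '
        {
            line[NR] = $0
            # gsub returns the number of replacements, i.e. the count of "#"
            count[NR] = gsub("#", "#")
            if (NR == 1 || count[NR] < min) min = count[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                if (count[i] == min) print line[i]
        }
    '
}
```

Given several candidate analyses of the same word form, only those with the fewest compound boundaries are kept; ties are all printed, in input order.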

Benefits

Since tokenisation and analysis are done in one step using the same transducer, multiword expressions are correctly tokenised and analysed, including fully inflected forms. This generally gives a better and linguistically more coherent analysis than using preprocess.pl, as we are required to do with the Xerox tools.

Issues

There are still a couple of open issues with the HFST tools and transducers that we need to deal with:

  • the transducers need to start using weights to avoid superfluous derivation analyses in the output
  • the tokenisation is sometimes not what you would expect, because the lookup tool hfst-proc gives some special characters an Apertium-specific treatment

See also

  • HowToTokeniseWithHfst - using hfst-tokenise instead of hfst-proc, giving e.g. better multiword handling