How To Analyse With HFST
Contents:
Requirements
Usage
The command pipeline is as follows:
ccat -l sme -a -r $GTFREE/prestable/converted/sme/ | apertium-destxt | \
hfst-proc -C -e -q -w sme/bin/sme.hfstol | l
The result of that pipeline is e.g. the following:
"<Sámedikkis>"
"Sámediggi" N Prop Org Sg Loc
"sámediggi" Org Build N Sg Acc PxSg3
"sámediggi" Org Build N Sg Gen PxSg3
"sámediggi" Org Build N Sg Loc
"<lea>"
"leat" V IV Ind Prs Sg3
"<dasa lassin>"
"dasa lassin" Adv
"<váldi>"
"váldi" Hum Org N Actor Sg Acc
"váldi" Hum Org N Actor Sg Gen
"váldi" Hum Org N Actor Sg Nom
"váldi" Org N Sg Nom
"váldit" V TV Imprt Du2
"váldit" V TV PrsPrc
"váldit" V TV ‡ Actor N Sg Acc
"váldit" V TV ‡ Actor N Sg Gen
"váldit" V TV ‡ Actor N Sg Nom
"<ja>"
"ja" CC
"<váikkuhanfápmu>"
"váikkuhanfápmu" N Sg Nom
"<go>"
"go" CS
"go" Qst Pcle
"<oassálastá>"
"oassálastit" V IV Ind Prs Sg3
"<ja>"
"ja" CC
"<go>"
"go" CS
"go" Qst Pcle
"<ovddastuvvo>"
"ovddastit" V TV ‡ Der/PassL V Imprt ConNeg
"ovddastit" V TV ‡ Der/PassL V Imprt Sg2
"ovddastit" V TV ‡ Der/PassL V Ind Prs ConNeg
"ovddastit" V TV ‡ Der/PassL V Ind Prs Sg3
"ovddastit" V TV ‡ Der/PassL V VGen
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Gen
"iešguđet" Pron Indef
"<ge>"
"ge" Pcle
"<lávdegottiin>"
"lávdegoddi" Org N Pl Loc
"lávdegoddi" Org N Sg Com
"<,>"
"," CLB
"<stivrrain>"
"stivra" Org N Pl Loc
"stivra" Org N Sg Com
"<ja>"
"ja" CC
"<ráđiin>"
"ráđđi" Org N Pl Loc
"ráđđi" Org N Sg Com
"<.>"
"." CLB
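Because the output is line-based, it is easy to post-process with standard tools. The following is a minimal sketch, not part of the pipeline itself, that counts the readings of each cohort so that ambiguous word forms stand out. The function name count_readings is hypothetical, and the sketch assumes that every cohort line starts with "< and that every other line starting with a double quote is one reading.

```shell
# Hypothetical helper: count readings per cohort in CG-style output on stdin.
# A line starting with "< opens a new cohort; every other line starting
# with a double quote is counted as one reading of the current cohort.
count_readings() {
  awk '
    /^[ \t]*"</ {
      if (form != "") print form "\t" n   # flush the previous cohort
      form = $0
      sub(/^[ \t]*/, "", form)            # drop any leading indentation
      n = 0
      next
    }
    /^[ \t]*"/ { n++ }
    END { if (form != "") print form "\t" n }
  '
}

# Usage with a short excerpt of the output above:
count_readings <<'EOF'
"<lea>"
"leat" V IV Ind Prs Sg3
"<go>"
"go" CS
"go" Qst Pcle
EOF
```

For this excerpt the sketch prints one tab-separated line per cohort: "<lea>" with 1 reading and "<go>" with 2. In the real pipeline the function would simply be appended after hfst-proc.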
The commands explained
ccat -l sme -a -r $GTFREE/prestable/converted/sme/
Extracts all North Saami texts from our open corpus repository, returning it as newline-separated paragraphs.
apertium-destxt
Deformats plain-text input. This is needed because the next tool in the pipeline (hfst-proc) is presently unable to handle certain characters that are reserved in the apertium system. Most importantly, strings containing a forward slash ("/") will not break the analysis as long as this apertium tool is used.
hfst-proc -C -e -q --weight-classes 1 sme/bin/sme.hfstol
Tokenises and analyses the input text, and outputs the result in a Constraint Grammar-friendly format. It outputs only the best analyses, i.e. those sharing the best weight, and for compounds it outputs the simplest analysis available (determined by counting # symbols).
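The idea of picking the simplest compound by counting # symbols can be illustrated outside hfst-proc. The sketch below is only an illustration of that selection rule, not hfst-proc's actual implementation. The function name fewest_boundaries and the sample readings are invented for the example, and it assumes that all input lines are readings of the same word form.

```shell
# Hypothetical illustration: among the readings on stdin, keep only those
# with the fewest "#" compound boundaries. Not how hfst-proc is implemented.
fewest_boundaries() {
  awk '
    {
      line[NR] = $0
      c = gsub(/#/, "&")        # gsub returns the number of "#" on the line
      cnt[NR] = c
      if (NR == 1 || c < min) min = c
    }
    END {
      for (i = 1; i <= NR; i++)
        if (cnt[i] == min) print line[i]
    }
  '
}

# Usage with made-up readings (placeholders, not real analyser output):
fewest_boundaries <<'EOF'
"a#b#c" N Sg Nom
"ab#c" N Sg Nom
"ab#c" N Sg Gen
EOF
```

Here the two readings with a single boundary are kept and the reading with two boundaries is dropped.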
Benefits
Since tokenisation and analysis are done in one step using the same transducer, multiword expressions are correctly tokenised and analysed, including fully inflected forms. This generally gives a better and linguistically more coherent analysis than using preprocess.pl, as we are required to do with the Xerox tools.
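To see which multiword expressions were actually tokenised as a single unit, the cohort lines can be filtered for word forms containing a space. This is a small sketch under the assumption that cohort lines have the form "<...>" as in the example output. The function name multiword_cohorts is hypothetical.

```shell
# Hypothetical helper: print only cohort lines whose word form contains a space,
# i.e. multiword expressions that were tokenised as one unit.
multiword_cohorts() {
  grep -E '^[[:space:]]*"<[^>]* [^>]*>"'
}

# Usage with an excerpt of the example output:
multiword_cohorts <<'EOF'
"<lea>"
"leat" V IV Ind Prs Sg3
"<dasa lassin>"
"dasa lassin" Adv
EOF
```

For this excerpt only the line "<dasa lassin>" is printed. Piping the result through sort | uniq -c would give a frequency list of multiword expressions for a whole corpus run.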
Issues
There are still a couple of open issues with the HFST tools and transducers that we need to deal with:
- the transducers need to start using weights to avoid superfluous derivation analyses in the output
- the tokenisation is sometimes not what you would expect, because the lookup tool hfst-proc gives some special characters an apertium-specific treatment
See also
- HowToTokeniseWithHfst - using hfst-tokenise instead of hfst-proc, giving e.g. better multiword handling

