How To Analyse With HFST
Contents:
Requirements
Usage
The command pipeline is as follows:
ccat -l sme -a -r $GTFREE/prestable/converted/sme/ | apertium-destxt | \
hfst-proc -C -e -q -w sme/bin/sme.hfstol | l
The result of that pipeline is e.g. the following:
"<Sámedikkis>"
    "Sámediggi" N Prop Org Sg Loc
    "sámediggi" Org Build N Sg Acc PxSg3
    "sámediggi" Org Build N Sg Gen PxSg3
    "sámediggi" Org Build N Sg Loc
"<lea>"
    "leat" V IV Ind Prs Sg3
"<dasa lassin>"
    "dasa lassin" Adv
"<váldi>"
    "váldi" Hum Org N Actor Sg Acc
    "váldi" Hum Org N Actor Sg Gen
    "váldi" Hum Org N Actor Sg Nom
    "váldi" Org N Sg Nom
    "váldit" V TV Imprt Du2
    "váldit" V TV PrsPrc
    "váldit" V TV ‡ Actor N Sg Acc
    "váldit" V TV ‡ Actor N Sg Gen
    "váldit" V TV ‡ Actor N Sg Nom
"<ja>"
    "ja" CC
"<váikkuhanfápmu>"
    "váikkuhanfápmu" N Sg Nom
"<go>"
    "go" CS
    "go" Qst Pcle
"<oassálastá>"
    "oassálastit" V IV Ind Prs Sg3
"<ja>"
    "ja" CC
"<go>"
    "go" CS
    "go" Qst Pcle
"<ovddastuvvo>"
    "ovddastit" V TV ‡ Der/PassL V Imprt ConNeg
    "ovddastit" V TV ‡ Der/PassL V Imprt Sg2
    "ovddastit" V TV ‡ Der/PassL V Ind Prs ConNeg
    "ovddastit" V TV ‡ Der/PassL V Ind Prs Sg3
    "ovddastit" V TV ‡ Der/PassL V VGen
"<iešguđet>"
    "iešguhtet" Pron Indef Acc
    "iešguhtet" Pron Indef Gen
    "iešguđet" Pron Indef
"<ge>"
    "ge" Pcle
"<lávdegottiin>"
    "lávdegoddi" Org N Pl Loc
    "lávdegoddi" Org N Sg Com
"<,>"
    "," CLB
"<stivrrain>"
    "stivra" Org N Pl Loc
    "stivra" Org N Sg Com
"<ja>"
    "ja" CC
"<ráđiin>"
    "ráđđi" Org N Pl Loc
    "ráđđi" Org N Sg Com
"<.>"
    "." CLB
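To get a quick overview of how ambiguous each token is, the output can be summarised per cohort. The following is a minimal sketch, assuming the usual Constraint Grammar layout in which each word form ("<...>") starts at the beginning of a line and each reading is indented below it. The function name count_readings is hypothetical and not part of the HFST tools.

```shell
#!/bin/sh
# count_readings (hypothetical helper, not an HFST tool):
# read CG-formatted text on stdin and print
# "<number of readings><TAB><word form>" for every cohort.
count_readings() {
    awk '
        /^"<.*>"$/ {                  # a new cohort starts here
            if (form != "") print n "\t" form
            form = $0; n = 0; next
        }
        /^[ \t]+"/ { n++ }            # an indented reading of the current cohort
        END { if (form != "") print n "\t" form }
    '
}

# Usage with a small inline sample taken from the output above.
printf '"<lea>"\n\t"leat" V IV Ind Prs Sg3\n"<go>"\n\t"go" CS\n\t"go" Qst Pcle\n' |
    count_readings
```

In the real pipeline the function would simply be appended as the last step, after hfst-proc.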
The commands explained
ccat -l sme -a -r $GTFREE/prestable/converted/sme/
Extracts all North Saami texts from our open corpus repository and returns them as newline-separated paragraphs.
apertium-destxt
Deformats plain-text input. This is needed because the following tool (hfst-proc) is presently unable to handle certain characters that are reserved in the Apertium system. Most importantly, strings containing a forward slash ("/") will not break the analysis as long as we use this Apertium tool.
hfst-proc -C -e -q --weight-classes 1 sme/bin/sme.hfstol
Tokenises and analyses the input text and outputs the result in a Constraint Grammar-friendly format. It outputs only the best analyses, i.e. those that share the best weight, and only the simplest compound analysis available (determined by counting # symbols).
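The idea of picking the simplest compound by counting # symbols can be pictured as follows. This is only an illustrative sketch, not the actual hfst-proc implementation. The function name fewest_boundaries and the analysis strings in the usage example are invented placeholders.

```shell
#!/bin/sh
# fewest_boundaries (hypothetical helper): read one analysis string per line
# and keep only those with the smallest number of "#" compound boundaries.
fewest_boundaries() {
    awk '
        {
            line[NR] = $0
            count[NR] = gsub(/#/, "#")    # gsub returns how many "#" were found
            if (NR == 1 || count[NR] < min) min = count[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                if (count[i] == min) print line[i]
        }
    '
}

# Usage with made-up analysis strings: the two-boundary analysis is dropped,
# both one-boundary analyses are kept.
printf '%s\n' 'foo#bar#baz+N+Sg+Nom' 'foobar#baz+N+Sg+Nom' 'foo#barbaz+N+Sg+Nom' |
    fewest_boundaries
```

Analyses that tie on the number of boundaries are all kept, in input order.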
Benefits
Since tokenisation and analysis are done in one step using the same transducer, multiword expressions are tokenised and analysed correctly, including fully inflected forms. This generally gives a better and linguistically more coherent analysis than using preprocess.pl, as we are required to do with the Xerox tools.
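To check which multiword expressions were actually recognised as single tokens, the cohort lines whose word form contains a space can be filtered out of the analysed text. A minimal sketch, assuming the CG layout shown in the example output; the function name multiword_forms is hypothetical.

```shell
#!/bin/sh
# multiword_forms (hypothetical helper): read CG-formatted text on stdin and
# print only the cohort lines whose word form contains at least one space.
multiword_forms() {
    grep '^"<.* .*>"$'
}

# Usage with a small inline sample taken from the example output.
printf '"<lea>"\n\t"leat" V IV Ind Prs Sg3\n"<dasa lassin>"\n\t"dasa lassin" Adv\n' |
    multiword_forms
```

Piping the result through sort and uniq -c would give a frequency list of the multiword tokens found in a corpus.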
Issues
There are still a couple of open issues with the HFST tools and transducers that we need to deal with:
- the transducers need to start using weights to avoid superfluous derivation analyses in the output
- the tokenisation is sometimes not what you would expect, because the lookup tool hfst-proc gives some special characters an Apertium-specific treatment
See also
- HowToTokeniseWithHfst - using hfst-tokenise instead of hfst-proc, giving e.g. better multiword handling