How To Tokenise With HFST
Contents:
How to use the hfst-tokenise pipeline to tokenise-as-you-analyse, using
Prerequisites for Mac
First off, update your HFST+vislcg3 by running
wget http://apertium.projectjj.com/osx/install-nightly.sh
sudo bash install-nightly.sh
This should give you the most recent SVN versions (as of last night) of HFST
(Packages exist for pretty much all Unix operating systems; the Prerequisites
For now, you'll also need to get the program cg-mwesplit (which will
export CXX=clang++
export CC=clang
git clone https://github.com/unhammer/cg-mwesplit
cd cg-mwesplit
./autogen.sh
./configure
make
sudo make install
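Before moving on to the build, it can help to confirm that the three programs used later on this page are actually on your PATH. The following is a minimal sketch, not part of the official setup; the helper name check_tools is made up for illustration.

```shell
# check_tools is a hypothetical helper, not part of HFST or vislcg3.
# It prints one line per command that cannot be found on PATH and
# returns non-zero if at least one is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# The three programs used in the pipelines on this page.
check_tools hfst-tokenise vislcg3 cg-mwesplit \
  || echo "install the missing tools before continuing"
```

If nothing is printed, all three commands were found.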
Build sme
Now, svn up in giella-core and langs/sme, and run ./configure in
./configure --enable-tokenisers --enable-syntax
(If you use Apertium, you'd also want --enable-apertium --with-hfst,
Finally, run "make" (currently, this requires >8GB of RAM).
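Since the build takes a while, you may want to check that the files read by the test commands on this page are in place before trying them. This is only a sketch: the function names check_files and check_sme_build are invented here, the paths are copied from the test commands, and GTHOME is assumed to be set.

```shell
# check_files is a hypothetical helper: it prints every argument that is
# not an existing regular file and returns non-zero if any is missing.
check_files() {
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "not found: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Check the tokeniser and grammar files used by the test commands.
# Assumes GTHOME points at your Giella checkout.
check_sme_build() {
    check_files \
        "$GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst" \
        "$GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3" \
        "$GTHOME/langs/sme/src/syntax/disambiguation.cg3" \
        "$GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3"
}
```

Run check_sme_build after make has finished. If it lists the .pmhfst file as not found, the configure flags are the first thing to re-check.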
Test
To run just the raw tokenisation+morphological analysis:
echo 'sánit, jna. Leago' \
  | hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc
To include disambiguation of ambiguous multiwords:
echo 'sánit, jna. Leago' \
  | hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
  | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3
To include splitting disambiguated multiwords into their own cohorts:
echo 'sánit, jna. Leago' \
  | hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
  | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
  | cg-mwesplit
To include regular morphological disambiguation:
echo 'sánit, jna. Leago' \
  | hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
  | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
  | cg-mwesplit \
  | vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3
To include regular syntax tagging:
echo 'sánit, jna. Leago' \
  | hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst \
  | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 \
  | cg-mwesplit \
  | vislcg3 -g $GTHOME/langs/sme/src/syntax/disambiguation.cg3 \
  | vislcg3 -g $GTHOME/giella-core/giella-shared/smi/src/syntax/functions.cg3
etc.
If you use these steps often, you'll probably want to make an alias. Open
alias hsme='hfst-tokenise --giella-cg $GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst'
alias hsmemwe='hsme | vislcg3 -g $GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3 --trace'
alias hsmesplit='hsmemwe | cg-mwesplit'
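If you also want these shortcuts inside shell scripts, shell functions may be a better fit, since bash only expands aliases in interactive shells by default. The sketch below mirrors the three aliases as functions under the same names; it is one possible arrangement rather than a recommended setup, and it assumes GTHOME is set as in the commands above.

```shell
# Function equivalents of the hsme, hsmemwe and hsmesplit aliases.
# Each one reads text on stdin and writes CG-format output on stdout.

# Tokenise and analyse.
hsme() {
    hfst-tokenise --giella-cg \
        "$GTHOME/langs/sme/tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst"
}

# Tokenise, analyse and disambiguate multiwords (with rule tracing).
hsmemwe() {
    hsme | vislcg3 -g "$GTHOME/langs/sme/tools/tokenisers/mwe-dis.cg3" --trace
}

# As hsmemwe, then split disambiguated multiwords into their own cohorts.
hsmesplit() {
    hsmemwe | cg-mwesplit
}
```

Usage is the same as with the aliases, for example: echo 'sánit, jna. Leago' | hsmesplit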