Text is preprocessed and split into words and sentences. In order to do
Here we look at how to compile and use the preprocessor that deals
Abbreviation handling with hfst
This is the recommended approach. Compile and test with the following
./configure --with-hfst --enable-tokenisers
make
echo "dr. Watson." | hfst-tokenise $GTHOME/langs/sme/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
The result should treat the first period as part of the abbreviation
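The check on the tokeniser output can also be scripted. The following is only a sketch: the function name abbr_kept_whole is hypothetical, and it assumes the tokeniser prints one token per line, which is an assumption about the output format and not something stated here. Adjust the match if your build emits another format.

```shell
# Hypothetical helper: succeed if the given token appears as a whole line
# on standard input. ASSUMPTION: the tokeniser output is one token per line.
abbr_kept_whole() {
    token="$1"
    # -x: match the entire line, -F: treat the token as a fixed string,
    # -q: print nothing, only set the exit status.
    grep -qxF -- "$token"
}
```

It could be used by piping the output of the hfst-tokenise command above into abbr_kept_whole 'dr.' and inspecting the exit status: 0 means the abbreviation was kept together with its period.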
Abbreviation handling with xfst
This method is not actively maintained, but it is documented here in case you have not installed hfst.
From the directory $GTHOME/langs/$LANG, check whether you have a file abbr.txt in the
echo "dr. Watson." | preprocess --abbr=tools/tokenisers/abbr.txt
The result should be as above.
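The check for the abbreviation file can be sketched as a small shell function. This is a minimal sketch, not part of the toolchain: the function name abbr_file_status is hypothetical, and it takes the language directory as an argument rather than reading it from a variable.

```shell
# Hypothetical helper: report whether abbr.txt exists under the
# tools/tokenisers subdirectory of the given language directory.
abbr_file_status() {
    lang_dir="$1"
    if [ -f "$lang_dir/tools/tokenisers/abbr.txt" ]; then
        echo "present"
    else
        echo "missing"
    fi
}
```

Calling abbr_file_status with the path to your language directory prints "present" if the file is there and "missing" if it still has to be compiled.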
If you don't have this file, you may compile it as follows:
In the $LANG directory (the directory of your language), give the
./configure --enable-abbr
cd tools/tokenisers
make abbr
The result should be a file abbr.txt in tools/tokenisers, and