How To Configure And Optimise Spellers

Contents:

Speller configuration
Fst optimisations
Error model optimisations
Fine tuning the suggestion order
Time-stamping the spellers
Easter egg trigger
Testing spellers

A number of different spellers are supported (or on the way to being supported) in our infrastructure:

fst-based spellers:
- zhfst files
- extensions for LibreOffice (oxt-files) based on LibreOffice-voikko
- foma spellers
list-based spellers (support under development):
- PLX spellers (Sámi spellers for MS Word using closed-source technology)
- Hunspell files

Speller configuration

The basic configuration for building spellers is:

./configure --with-hfst --enable-spellers

There is one optimisation flag that is turned on by default: --enable-minimised-spellers. For some languages this optimisation is counterproductive, causing the speller to become very slow and unresponsive. If this is the case, disable this optimisation as follows:

./configure --with-hfst --enable-spellers --disable-minimised-spellers

You should also experiment with the next configuration option, and see which combination of optimisations yields the best performance.

Fst optimisations

Some languages, notably Greenlandic (kal), compile into a very large net. Hfst supports something called hyper-minimisation, in which paths are replaced with automatically generated flag diacritics, such that otherwise similar paths can be collapsed without changing the semantics of the language model. This type of minimisation has a profound effect on some languages and a minimal effect on others. In some cases it has even increased the size of the resulting fst. For Greenlandic the effect is stunning: from being a more or less unusable behemoth at 160 MB and more, the acceptor for the Greenlandic speller (when combined with minimised spellers as described above) is reduced to a mere 6.3 MB. To turn on this type of fst size optimisation, configure as follows:

./configure --with-hfst --enable-spellers --enable-hyperminimisation

Whether this option helps or not must be tested for each language, and preferably documented. You can see how this and the previous option affect the speller file sizes for three languages (fin, kal, sme) here.
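To test the two size optimisations systematically, it can help to list every combination first. The following is a minimal sketch: the function name print_speller_configs is made up for this example, and --disable-hyperminimisation is assumed to work as the usual autoconf negation of --enable-hyperminimisation. The function only prints the commands; it does not run them.

```shell
# Print one ./configure command per combination of the two size optimisations.
# Nothing is executed here; copy the line you want to try.
# print_speller_configs is a hypothetical helper, not part of the infrastructure.
print_speller_configs() {
    for min in enable disable; do
        for hyper in enable disable; do
            printf './configure --with-hfst --enable-spellers --%s-minimised-spellers --%s-hyperminimisation\n' \
                "$min" "$hyper"
        done
    done
}

print_speller_configs
```

Rebuild the spellers after each configuration and note the resulting file sizes and suggestion speed for your language.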

Error model optimisations

The default error model has two important properties:

alphabet size
transition weights

Further details about the error model and its parts and build configuration can be found on a separate page.

Alphabet size

The alphabet size has a huge impact on the size of the final error model fst, and with that, also the speed of creating suggestions. The smaller the alphabet the smaller and speedier the fst. To ensure you have as small an alphabet as possible, add as many characters as possible to the exclusion list in the following file:

tools/spellcheckers/fstbased/hfst/editdist.default.txt

All other characters will be used to create a simple edit distance 1 error model (this model is concatenated with itself to enable corrections of edit distance 2).

Tip: use the terminal output of make in tools/spellcheckers/fstbased/hfst/ (following the text ... and base alphabet size NN) as a starting point. Remove all regular alphabetic symbols, and what is left should be excluded by adding them to the file mentioned above.

Transition weights

The default error model created above is quite rough, as all transitions are equally possible. To improve this, you can specify weights for specific transition pairs (in the same file as above):

ø	ö	0.5

The default weight is 1.0, and the above line says that replacing ø with ö should only have a weight of 0.5, and thus be more likely than the default. The columns are TAB separated.
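Since many editors silently replace TAB characters with spaces, one way to be sure the columns really are TAB separated is to generate the line with printf. This is only a sketch; pair_weight_line is a hypothetical helper name, and where in the file the line should go is left to you.

```shell
# Print one weighted transition pair with real TAB characters between columns.
# pair_weight_line is a hypothetical helper: FROM, TO, WEIGHT.
pair_weight_line() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Example: produce the line shown above, ready to paste into the file.
pair_weight_line ø ö 0.5
```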

Using this system, it is possible to tune the default error model to improve the order of the suggestions by using general single-letter rules.

To enable the error model to correct longer sequences of letter combinations, one should edit the file tools/spellcheckers/fstbased/hfst/strings.default.txt. Its structure is similar, but not identical, to that of the previous file:

øø:öö	0.2
ää:ææ	0.2

It is also possible to add whole word replacements to the error model by editing the file tools/spellcheckers/fstbased/hfst/words.default.txt. Whole-word replacements are typically given the weight "0.0", to ensure they are on the top of the suggestion list:

jih:jïh	0.0

In the future it will be possible to use a file of collected typos and their corrections as the basis for whole-word corrections.
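Until then, such lines can be generated by hand from a list of typos. The sketch below assumes a hypothetical input format with one typo and its correction per line, separated by a TAB; this is an assumption for illustration, not a documented format. The function name typos_to_word_rules is also made up. Entries containing a colon are skipped, since the colon separates the two words in the output.

```shell
# Convert "typo<TAB>correction" lines into "typo:correction<TAB>0.0" lines.
# Assumed input format (hypothetical): two TAB-separated columns per line.
# Comment lines, malformed lines and entries containing ":" are skipped.
typos_to_word_rules() {
    awk -F '\t' '
        !/^#/ && NF == 2 && $1 != "" && $2 != "" && index($0, ":") == 0 {
            printf "%s:%s\t0.0\n", $1, $2
        }
    ' "$@"
}

# Example: printf 'jih\tjïh\n' | typos_to_word_rules
```

Review the generated lines before adding them to words.default.txt, as every entry will end up at the top of the suggestion list.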

Fine tuning the suggestion order

In the previous section we looked at how we could fine-tune the suggestions based on the error - what type of changes we had to do to arrive at a correct word. This is good in itself, but it does not differentiate between two suggestions with the same weighting where one is a frequent word and the other is not, or where one word is a compound and the other is not. Neither does it move rare word forms down on the suggestion list. To add such behavior, we need to add weights to the fst that will end up as the acceptor.

Morphology-based weighting

Morphology-based weighting is done by adding weights to the morphological or morphosyntactic tags in the analyser. You do this by modifying the file tools/spellcheckers/fstbased/desktop/weighting/tags.reweight. The file contains TAB separated values, two columns:

the tag itself
the weight that should be given to the tag

Comments can be added as lines starting with #.

Below is an example of how this can be done, taken from sme:

+Cmp	+2
+Der	+1
+Der1	+1
+Der2	+1
+Der3	+1
+Der4	+1
+Der5	+1
+PxSg1	+3
+PxSg2	+3
+PxSg3	+3
+PxPl1	+3
+PxPl2	+3
+PxPl3	+3
+Use/SpellNoSugg	+10000
+Cmp/Hyph	+10000
+Cmp/SplitR	+10000
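A quick format check can catch lines where the TAB was lost or the weight is malformed. This is a sketch only: check_reweight_file is a hypothetical helper, and it assumes that blank lines are harmless and that weights are plain numbers with an optional sign, as in the example above.

```shell
# Report lines that are not "TAG<TAB>WEIGHT"; comment and blank lines are ignored.
# check_reweight_file is a hypothetical helper; it returns non-zero on bad lines.
check_reweight_file() {
    awk -F '\t' '
        /^#/ || /^$/ { next }
        NF != 2 || $2 !~ /^[+-]?[0-9]+(\.[0-9]+)?$/ {
            printf "%s:%d: expected TAG<TAB>WEIGHT\n", FILENAME, NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
# check_reweight_file tools/spellcheckers/fstbased/desktop/weighting/tags.reweight
```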

The weights are added to the other weights given to a word form, and should be chosen to align with the rest of the weights being used. Corpus weights are typically between 6 and 12 (but will vary depending on the size of the corpus), and the default weight for editing distance operations is 10. Very high weights will cause a word form not to be suggested at all, or very rarely.
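To get a feeling for how the numbers interact, the weights for a candidate can simply be summed. The figures below are illustrative only (a corpus weight of 8 and one edit operation at 10), total_weight is a hypothetical helper, and plain addition is assumed as described above.

```shell
# Sum the weights that apply to one suggestion candidate.
# total_weight is a hypothetical helper; all numbers below are made up.
total_weight() {
    awk 'BEGIN { s = 0; for (i = 1; i < ARGC; i++) s += ARGV[i]; print s }' "$@"
}

total_weight 8 10          # plain word form: corpus weight + one edit
total_weight 8 10 2        # same, but analysed with +Cmp (+2)
total_weight 8 10 10000    # same, but carrying +Use/SpellNoSugg (+10000)
```

The three calls print 18, 20 and 10018. Lower totals rank higher, so the plain form comes before the compound, and the form tagged +Use/SpellNoSugg is pushed so far down that it is in practice never suggested.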

Corpus-based weighting

You turn on frequency-based weighting by doing two things:

Create a speller corpus
Enable the use of the speller corpus

Creating a speller corpus

This is very simple: just store a large amount of text in the file tools/spellcheckers/fstbased/desktop/weighting/spellercorpus.raw.txt. The content does not have to be sorted, split or clean in any way - basic cleaning and sorting is done automatically, and all incorrect words will be filtered out automatically.

If you are using texts that are copyrighted, you can use the following Perl one-liner to shuffle the lines of the text, so that the original text cannot be reconstructed:

perl -MList::Util=shuffle -e 'print shuffle(<>);' < myfile.txt \
> tools/spellcheckers/fstbased/desktop/weighting/spellercorpus.raw.txt

After this, the text is fine for inclusion in the corpus.

Use a lot of text, so that also the not-so-frequent word forms are covered - that will help a lot in improving the suggestion quality.

Enabling the use of the speller corpus

Having a text corpus (which provides us with frequency data) is not enough; you also need to enable the use of it. This is done by editing tools/spellcheckers/fstbased/desktop/Makefile.am, so that it contains the following line (the line should already be there, but with the value no):

ENABLE_CORPUS_WEIGHTS=yes

You can temporarily disable the use of frequency data, e.g. for evaluation and development purposes, by changing yes to no.
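If you switch back and forth often, the edit can be scripted. This is a sketch under the assumption that the line has exactly the form shown above, with no spaces around the equals sign; set_corpus_weights is a hypothetical helper name.

```shell
# Set ENABLE_CORPUS_WEIGHTS to yes or no in the given Makefile.am.
# set_corpus_weights is a hypothetical helper: FILE, then yes|no.
set_corpus_weights() {
    file=$1
    value=$2
    case $value in
        yes|no) ;;
        *) echo "value must be yes or no" >&2; return 1 ;;
    esac
    # Write to a temporary file first, to stay portable across sed versions.
    sed "s/^ENABLE_CORPUS_WEIGHTS=.*/ENABLE_CORPUS_WEIGHTS=$value/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Example:
# set_corpus_weights tools/spellcheckers/fstbased/desktop/Makefile.am no
```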

Both

It can also be quite helpful to combine the use of frequency (corpus) weights and tag-based (morphology) weights. You need to experiment and test a bit to arrive at the best configuration for a given language.

Time-stamping the spellers

All spellers get an easter egg with build date and version info, but this information is not updated automatically. To ensure you have a correct timestamp in your easter egg, do:

cd tools/spellcheckers/
make clean
make

You should cd into tools/spellcheckers/ first so that you don't have to rebuild everything, just the spellers and the easter egg.

Easter egg trigger

The trigger string is nuvviDspeller. Copy and paste this word into any speller we have made or echo it into a speller on the command line, and the suggestions should contain the version information.

Testing spellers

The speller may be tested on data from test/data/typos.txt. In order to do this, you need Text/Brew.pm (a Perl module; it should be installed if you follow the default setup procedure). To test, go to the $LANG directory (langs/sme, etc.) and run:

sh devtools/test_voikkospell_suggestions.sh 
open -a Safari devtools/speller_result_typos.vk.xml
