Suggestion Weighting
The files for weighting suggestions, and thus for deciding their priority, are in
LANG/tools/spellcheckers/fstbased/desktop/hfst and LANG/tools/spellcheckers/fstbased/desktop/weighting.
Levenshtein transitions and adjustments
This is the basis against which the operations we specify will
Each Levenshtein operation is 10 points (this value is system-specific,
Error model A: Adjusting Levenshtein
Levenshtein may be adjusted in two ways. The adjustments are single letters or
At the moment, what weight to put to any given pair is open. As for
Single letters
The file is hfst/editdist.default.txt.
In the beginning of the file, all letters that participate
In the suggestion there is a mapping from each letter to each other letter.
When the transition pair operation is listed in editdist.default.txt,
ç č -9
a á -6
á a -6
a â -6
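As an illustration of the format only, the following sketch reads letter-pair lines of this shape into a lookup table. It assumes whitespace-separated fields with one pair per line; the function name parse_letter_pairs is hypothetical and not part of the actual tooling, and the sketch does not show how the build applies the weights.

```python
def parse_letter_pairs(text):
    """Map (first letter, second letter) to the weight listed for that pair."""
    pairs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            # Skip blank lines and lines that are not 'letter letter weight',
            # such as the letter inventory at the beginning of the file.
            continue
        first, second, weight = fields
        pairs[(first, second)] = float(weight)
    return pairs


# Sample lines copied from the text above.
sample = "ç č -9\na á -6\ná a -6\na â -6\n"
weights = parse_letter_pairs(sample)
print(weights[("ç", "č")])
```

Note that a á and á a are separate entries, so the two directions can carry different weights.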
Strings
The file is hfst/strings.default.txt.
The format is
c:cc -2
cc:c -2
d:dd -2
g:gg -2
This weight also comes instead of the basic Levenshtein weight.
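The string-pair format can be read with a few lines of code. The sketch below is a minimal reader under stated assumptions: each entry is a left:right pair followed by whitespace and a weight, and the strings themselves contain no colon. The name parse_string_pairs is hypothetical, and the sketch does not say which side is the misspelling and which the correction.

```python
def parse_string_pairs(text):
    """Return a list of (left, right, weight) tuples from 'left:right weight' lines."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or ":" not in fields[0]:
            continue  # skip blank lines and anything not shaped like a rule
        # Assumption: the first colon separates the two sides of the pair.
        left, _, right = fields[0].partition(":")
        rules.append((left, right, float(fields[1])))
    return rules


# Sample lines copied from the text above.
sample = "c:cc -2\ncc:c -2\nd:dd -2\ng:gg -2\n"
rules = parse_string_pairs(sample)
print(rules[0])
```

The sample lines for final strings, initial letters and word pairs further down have the same pair-plus-weight shape, so the same reader would accept them.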
String pairs are used as follows:
We build a Levenshtein 1 model, i.e. a set of all
Final strings
The file is hfst/final_strings.default.txt.
The format is
esnie:esne -5
ese:asse -5
htasse:htse -5
These weights come in addition to the aggregated Error model A,
Initial letters
The file is hfst/initial_letters.default.txt.
The format is
l:l 0.0
m:m 0.0
n:n 0.0
o:o 0.0
Using this may give a very large error model, and it is thus
Error model B: Swapping words
The file is words.default.txt.
The format is:
jih:jïh 0.0
These full word pairs will get a weight.
Frequency and tag weight
Frequency weighting
The file is spellercorpus.raw.txt (or possibly a .clean. file).
A corpus may be used as a frequency weighting mechanism.
You may even take a specialised speller for learners, tuning
Tag weighting
The file is weighting/tags.reweight.
File format:
+Pot +1
+Cond +1
+Actio -1
+Ess +1
+Par +1
+PxSg1 +3
Interaction between frequency and tag weighting
Logarithmic frequency values and tag weights are added together to get the aggregated grammatical/frequency weight.
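The addition can be sketched as follows. This is an illustration under assumptions, not the actual build: it takes the logarithmic frequency value to be the negative natural logarithm of the relative corpus frequency, and it sums the weight of every listed tag found in a +-separated analysis string. The function names, the analysis string and the counts are all hypothetical.

```python
import math


def frequency_weight(count, total):
    # Assumption: negative log of relative frequency,
    # so a more frequent word gets a lower weight.
    return -math.log(count / total)


def tag_weight(analysis, tag_weights):
    # Sum the weights of the listed tags that occur in the analysis;
    # tags without an entry contribute nothing.
    tags = ["+" + part for part in analysis.split("+")[1:]]
    return sum(tag_weights.get(tag, 0.0) for tag in tags)


def lexical_weight(count, total, analysis, tag_weights):
    # The two components are simply added together.
    return frequency_weight(count, total) + tag_weight(analysis, tag_weights)


# Tag weights taken from the sample above; counts and analysis are made up.
tag_weights = {"+Pot": 1.0, "+Cond": 1.0, "+Actio": -1.0, "+PxSg1": 3.0}
w = lexical_weight(50, 100000, "lemma+V+Pot+PxSg1", tag_weights)
print(round(w, 2))
```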
Putting it all together
Text frequency and tag weight come on top of the error model.
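A minimal sketch of what "on top of" could mean in practice: each candidate has an error-model weight and a frequency/tag weight, and the two are summed before ranking. It assumes that a lower total means a better suggestion; the candidate names and all numbers are invented for illustration, and rank_suggestions is a hypothetical helper.

```python
def rank_suggestions(candidates):
    """Order candidates by error-model weight plus frequency/tag weight.

    candidates: list of (word, error_weight, lexical_weight) tuples.
    Assumption: the lowest total weight is ranked first.
    """
    scored = [(error + lexical, word) for word, error, lexical in candidates]
    return [word for _, word in sorted(scored)]


# Hypothetical candidates with invented weights.
candidates = [
    ("cand_a", 10.0, 9.5),  # total 19.5
    ("cand_b", 4.0, 7.0),   # total 11.0
    ("cand_c", 20.0, 3.0),  # total 23.0
]
print(rank_suggestions(candidates))
```

With these numbers, cand_c has the best frequency/tag weight but still ranks last because its error-model weight is large, which is the kind of trade-off the tuning has to balance.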
Testing
At the end of the day, tuning edit distance, letter pairs and string pairs against word frequency and against each other is a linguistic and empirical question.
In order to find the ideal balance, a speller testbench is needed.