Unicode Normalisation
(or: how to fix decomposed Sami letters)
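In Unicode, many glyphs (letter symbols) may be represented either as a single precomposed character or as a base letter followed by a combining diacritic.
Two of the Unicode normalisation forms convert between these representations: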
- NFKD = Normalization Form Compatibility Decomposition
- NFKC = Normalization Form Compatibility Composition
The first, NFKD, decomposes the characters (á as two characters); the second, NFKC, composes them (á as one character).
Our North Sami analysers use the composed representation.
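The difference between the two forms can be checked with a small Python sketch. It is only an illustration and not part of the analyser setup; it assumes Python 3 and its standard unicodedata module, and the helper name describe is made up for this example.

```python
import unicodedata

composed = "\u00e1"      # á as one precomposed character
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

# NFKD splits the precomposed letter into base letter + combining mark.
assert unicodedata.normalize("NFKD", composed) == decomposed
# NFKC joins the pair back into a single character.
assert unicodedata.normalize("NFKC", decomposed) == composed


def describe(text):
    """Return the Unicode character names of the code points in text."""
    return [unicodedata.name(ch) for ch in text]


print(len(composed), describe(composed))
print(len(decomposed), describe(decomposed))
```

Both strings look identical on screen, but the first has length 1 and the second has length 2, which is why an analyser that expects composed letters does not recognise the decomposed one.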
If you get text with decomposed letters (UnicodeChecker will tell you that č is two characters), you must compose them with the following command:
cat infile.txt | uconv -f utf8 -t utf8 -x Any-NFKC > outfile.txt
See also man uconv.
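If uconv is not at hand, the same composition step can be approximated in Python. This is a minimal sketch under assumptions, not a replacement for the command above: it assumes Python 3 and UTF-8 input, the function names compose_text and compose_file are hypothetical, and its output has not been compared against uconv for every input.

```python
import unicodedata


def compose_text(text):
    """Return text in NFKC (composed) form."""
    return unicodedata.normalize("NFKC", text)


def compose_file(in_path, out_path):
    """Read a UTF-8 file and write an NFKC-normalised copy of it."""
    # newline="" keeps the original line endings unchanged.
    with open(in_path, encoding="utf-8", newline="") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(compose_text(line))
```

A call such as compose_file("infile.txt", "outfile.txt") would then play the role of the pipeline above. Normalising line by line is enough here, because a combining mark never composes across a line break.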
The uconv program should be installed on your machine as part of