Preprocessing the input
Introduction
Tokenizing
Obsolete: The tokenizer file tok.txt
Just as for North Sámi, the Lule Saami preprocessing was earlier done with the Xerox tokenize tool and the language-specific file tok.txt. The code itself is written as a set of regular expressions, and the source file (tok.txt) was compiled by xfst. As explained for the sme preprocessing, this approach was replaced by a preprocessor script, written in perl, gt/script/preprocess.
The current preprocessor
Preprocessing is done by the perl script gt/script/preprocess, which is language-independent. The script is documented here. The language dependent part of the script shall be done via the file smj/bin/abbr.txt
Handling abbreviations
Lule Saami abbreviations are handled as for North Saami.
Spell relaxation of æ/ä, ø/ö
This is a feature common to Lule and South Sami, not to be found in North Sami. The letter æ/ä and ø/ö are used interchangeably in Norway and Sweden. The parser accepts any version of them.
The xfst file to handle this is the language-independent spellrelax.regex. It contains rules like:
ń (->) ñ, ŋ (->) ñ, æ (->) ä, ø (->) ö ;
The line says that æ may optionally be replaced by ä and that ø may optionally be replaced with ö, and the same for the different ways of writing ŋ.
We plan to make parts of the spellrelax file language dependent.
Initial capitalization
There is a language independent inituppercase.regex file. Cf. the documentation for initial capitalization written for North Saami.
Capitalization of whole words
This has not yet been implemented.