Preprocessing the input

Introduction

Tokenizing

Obsolete: The tokenizer file tok.txt

Just as for North Saami, Lule Saami preprocessing was earlier done with the Xerox tokenize tool and the language-specific file tok.txt. The tokenizer was written as a set of regular expressions, and the source file (tok.txt) was compiled with xfst. As explained for the sme preprocessing, this approach has been replaced by a preprocessor script written in Perl, gt/script/preprocess.

The current preprocessor

Preprocessing is done by the Perl script gt/script/preprocess, which is language-independent. The script is documented here. The language-dependent part is to be handled through the file smj/bin/abbr.txt.

Handling abbreviations

Lule Saami abbreviations are handled as for North Saami.

Spell relaxation of æ/ä, ø/ö

This feature is common to Lule and South Saami and is not found in North Saami. The letter pairs æ/ä and ø/ö are used interchangeably in Norway and Sweden, and the parser accepts either member of each pair.

The xfst file to handle this is the language-independent spellrelax.regex. It contains rules like:

ń (->) ñ, ŋ (->) ñ, æ (->) ä, ø (->) ö ;

The rule says that æ may optionally be replaced by ä and ø by ö, and likewise for the different ways of writing ŋ.
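
The effect of such an optional replacement rule can be illustrated outside xfst. The following is a minimal Python sketch, not the project's implementation: the names RELAX, relaxed_variants and accepts are hypothetical, the replacement table is copied from the rule quoted above, and the test string "bæddø" is made up for illustration and is not claimed to be a real word. Each listed letter may either stay as it is or be replaced, so one lexicon form yields several accepted spellings.

```python
from itertools import product

# Optional replacements copied from the quoted rule: each key MAY be
# rewritten as its value, or left unchanged. (Hypothetical table name.)
RELAX = {"ń": "ñ", "ŋ": "ñ", "æ": "ä", "ø": "ö"}


def relaxed_variants(word):
    """Return every spelling obtained by optionally replacing each letter."""
    # For each character, offer either just itself or itself plus its
    # relaxed counterpart, then take all combinations of those choices.
    choices = [(ch, RELAX[ch]) if ch in RELAX else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


def accepts(lexicon_form, typed_form):
    """True if typed_form is one of the relaxed spellings of lexicon_form."""
    return typed_form in relaxed_variants(lexicon_form)
```

In this sketch the replacement only runs in the direction the rule states (from æ to ä, not back), so accepts("bæddø", "bäddö") holds while accepts("bäddö", "bæddø") does not.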

We plan to make parts of the spellrelax file language-dependent.

Initial capitalization

There is a language-independent inituppercase.regex file. Cf. the documentation for initial capitalization written for North Saami.
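
As a rough illustration only, and assuming that the purpose of initial-capitalization handling is to let a word be recognized when its first letter is written in upper case (for instance at the start of a sentence), the idea can be sketched in Python. The function name lookup_candidates is hypothetical and the sketch does not reproduce the contents of inituppercase.regex.

```python
def lookup_candidates(token):
    """Return the forms to try against the lexicon for one token.

    The token is always tried as typed; if it starts with an upper-case
    letter, the same token with that letter lower-cased is tried as well.
    (Assumed behaviour for illustration, not taken from the xfst source.)
    """
    if token and token[0].isupper():
        return [token, token[0].lower() + token[1:]]
    return [token]
```

Keeping the form as typed first means that proper nouns, which are capitalized in the lexicon itself, are still matched directly.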

Capitalization of whole words

This has not yet been implemented.