Preprocessor Specification
The preprocessor (or tokeniser) described here is meant to have general
There should be two versions of the tokeniser:
- a command-line version for use in a unix pipe when processing text (e.g. for
testing the grammar checker or another tool on the command line)
- a library version to be embedded/linked in the run-time version of the
grammar checker
The output from the tokeniser could potentially differ between the two versions,
Expected input
Text. The text is a continuous stream of characters. Binary data will not be
Formatting
Some uses of the tokeniser will have to deal with formatted text. The formatting
Classes of input tokens
There are up to three types of tokens in the input text stream, and the task is
The three token types are:
- word-like tokens
- whitespace tokens
- formatting tokens
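As a rough illustration of the three token classes, the sketch below splits a string into word-like, whitespace and formatting tokens using Python's `re` module. It is not how hfst-pmatch does it. The assumption that formatting markup looks like XML-style tags is hypothetical, as are the names `TOKEN_RE` and `classify`.

```python
import re

# Assumption (not from the specification): formatting markup is XML-like tags.
TOKEN_RE = re.compile(
    r"(?P<formatting><[^<>\s][^<>]*>)"  # e.g. <b>, </b>
    r"|(?P<whitespace>\s+)"             # any run of Unicode whitespace
    r"|(?P<word>[^\s<]+|<)"             # everything else; a stray '<' stands alone
)


def classify(text):
    """Yield (token_type, token_text) pairs that together cover the whole input."""
    for match in TOKEN_RE.finditer(text):
        # Only one named group takes part in each match, so lastgroup names it.
        yield match.lastgroup, match.group()


if __name__ == "__main__":
    for token_type, token_text in classify("<b>Hello</b> world\n"):
        print(token_type, repr(token_text))
```

Concatenating the token texts in order reproduces the input, so nothing in the stream is lost. A real tokeniser would presumably need the same guarantee.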
Formatting tokens will only be found in some types of input data (i.e. it is
Basic elements of the pmatch fst
- pmatch is in principle a collection of regexes for identifying different types
of tokens
- one of the regexes should be a descriptive analyser, i.e.
src/analyser-gt-desc.hfst
- this regex will identify all tokens known to the language model, i.e. all
analysable word-like tokens
- there should be additional regexes to handle unknown words (non-whitespace
strings), whitespace and formatting markup.
Unknown tokens and characters
Handling of unknown word-like tokens should be done in two steps:
- by a guesser looking at initial and final characters (e.g. if something looks
like a case ending, give it a tag for that case)
- by a last-resort Unicode-aware regex that just lumps together everything that
is not (Unicode) whitespace.
These two steps should preferably be written as regular, weighted regexes,
The details of how to accomplish this must be discussed with the hfst team or
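The two-step fallback for unknown word-like tokens can be sketched in Python as follows. This is only an illustration of the control flow, not the pmatch implementation. The suffix table, the tag names and the weight values are invented for the example, and it assumes that a lower weight means a preferred analysis, as is usual for weighted transducers.

```python
import re

# Hypothetical suffix-to-tag table; real entries would come from the language model.
SUFFIX_TAGS = [("ta", "CaseA"), ("ko", "CaseB")]

# Assumed weights: lower is better, so a guess outranks the last resort.
GUESSER_WEIGHT = 10.0
FALLBACK_WEIGHT = 100.0

# For str patterns, \S matches any character that is not Unicode whitespace.
NON_WHITESPACE = re.compile(r"\S+")


def analyse_unknown(token):
    """Return (tag, weight) for a token the analyser did not recognise, else None."""
    # Step 1: guesser based on the final characters of the token.
    for suffix, tag in SUFFIX_TAGS:
        if len(token) > len(suffix) and token.endswith(suffix):
            return f"Guess+{tag}", GUESSER_WEIGHT
    # Step 2: last resort, accept any run of non-whitespace characters.
    if NON_WHITESPACE.fullmatch(token):
        return "Unknown", FALLBACK_WEIGHT
    return None


if __name__ == "__main__":
    for candidate in ["blorkta", "zzz", "two words"]:
        print(repr(candidate), analyse_unknown(candidate))
```

Giving the two steps different weights lets a downstream component pick the guesser's analysis when one exists. The catch-all still makes sure every non-whitespace string receives some analysis.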
Expected output
There are presently two known user groups of the tokeniser:
- Divvun/Giellatekno
- Apertium
The two groups use different formats for their processing pipelines, although the
Known issues or things to look out for
Multiple tokenisations
Presently hfst-pmatch only does a left-to-right, longest match (LRLM)
Potentially this is both a problem and a feature: it might be a problem because
Francis will look into how big a problem this could really be, and if it turns
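To make the LRLM behaviour concrete, here is a minimal Python sketch of greedy left-to-right, longest-match tokenisation over a toy lexicon. It only mimics the strategy and does not use hfst-pmatch. The function name `lrlm_tokenise` and the lexicon are hypothetical.

```python
def lrlm_tokenise(text, lexicon):
    """Greedy left-to-right longest match; unmatched characters become single tokens."""
    tokens = []
    pos = 0
    max_len = max(map(len, lexicon), default=1)
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in lexicon:
                break
        else:
            # Nothing in the lexicon starts here: emit one character.
            end = pos + 1
        tokens.append(text[pos:end])
        pos = end
    return tokens


if __name__ == "__main__":
    toy_lexicon = {"ab", "abc", "cd", "d"}
    print(lrlm_tokenise("abcd", toy_lexicon))  # ['abc', 'd']
```

With this lexicon the input "abcd" could also be split as "ab" + "cd", but a pure LRLM strategy commits to "abc" first and never produces that alternative. This is the kind of lost tokenisation this section is concerned with.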
Proper handling of the full Unicode range
We need the tokeniser to be able to handle any text input from the full Unicode

