Language recognition using text_cat
The home page of the package TextCat is found at several locations.
- The original page at the University of Groningen, with the source code. The package is developed by Gertjan van Noord and licensed under a GPL license (see the home page for details).
- The source code is also available in the Giellatekno repository.
The Groningen home page also includes links to a background article, a list of the languages supported out of the box, and a list of competing tools. It also links to a demo page, which gives the author's e-mail address.
The tool text_cat itself is installed in gt/scripts/, and basic usage is explained by:
Typical usage will be something like:
text_cat -l "What language is this"
In both cases text_cat will return one or more strings with the name of the language(s) the script believes the text to be in.
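To check many short texts at once, the call above can be wrapped in a small loop. This is only a sketch: guess_lines is a hypothetical helper name, and the only thing it assumes about text_cat is the `-l "text"` form shown above and that text_cat is on the PATH.

```shell
# Sketch: label every non-empty line of a file with text_cat's guess.
# guess_lines is a hypothetical helper, not part of the TextCat package.
guess_lines() {
    # Read line by line; the extra test keeps a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue                 # skip blank lines
        # stdin is redirected so text_cat cannot consume the input file.
        guess=$(text_cat -l "$line" < /dev/null)   # may name several languages
        printf '%s\t%s\n' "$guess" "$line"
    done < "$1"
}
```

Calling `guess_lines some-text-file` prints one guess per input line, followed by a tab and the line itself.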
Adding a new recognizable language
The text_cat reference files are stored in $GTHOME/tools/lang-guesser.
Adding a new language to be recognized requires building a suitable training corpus. This is most easily done with the accompanying tool random_lines:
>$ random_lines < some-text-file > ShortTexts/language-name.txt
This command extracts random lines of text from the input file and stores them in the output file, cleaning the text a bit along the way. The resulting file is then used to build a language model like this:
>$ text_cat -n < ShortTexts/language-name.txt > LM/language-name.lm
After this, text_cat is ready to recognize the new language, and can be used as shown in the previous section.
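The two steps can be combined into one small shell function. This is a sketch rather than a script shipped with the package: add_language is a hypothetical name, and it assumes that random_lines and text_cat are on the PATH and that the current directory contains the ShortTexts/ and LM/ directories used above.

```shell
# Sketch: build a text_cat language model for one new language.
# add_language is a hypothetical helper, not part of the package.
# Assumes the current directory holds ShortTexts/ and LM/.
add_language() {
    input=$1      # raw text in the new language
    lang=$2       # name used for the .txt and .lm files

    if [ ! -r "$input" ]; then
        echo "cannot read input file: $input" >&2
        return 1
    fi

    # Step 1: sample random lines into the training corpus.
    random_lines < "$input" > "ShortTexts/$lang.txt" || return 1

    # Step 2: build the language model from the corpus.
    text_cat -n < "ShortTexts/$lang.txt" > "LM/$lang.lm" || return 1

    # Fail if the model came out empty (for example, an empty corpus).
    [ -s "LM/$lang.lm" ]
}
```

Run from the reference directory, a call would look like `cd $GTHOME/tools/lang-guesser && add_language some-text-file language-name`.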
by Sjur N. Moshagen