Language recognition using pytextcat

To be able to identify sections within a document not in the main language, we need automatic language reqognition. We have installed an open-source package that performs such a task, and this page documents its usage and origin (cf. the Source section at the bottom of the page for background info)..

Usage

The primary use of this tool is via the corpus conversion tool convert2xml. When you use convert2xml in order to turn original corpus documents into text for processing, the language recogniser is automatically put into use. You may also use it as a standalone program. See the help text by writing pytextcat proc -h

Typical usage as a standalone program will be something like:

pytextcat proc $GTHOME/tools/CorpusTools/corpustools/lm < testfile.txt

pytextcat will return the name (the ISO code, to be exact) of the language(s) the script believes the text to be in.

Adding a new recognizable language

The pytextcat reference files are stored in $GTHOME/tools/CorpusTools/corpustools/.

Adding a new language to be recognized requires a suitable training corpus to be built. This is most easily done with the accompanying tool random_lines:

random_lines < some-text-file > ShortTexts/language-name.txt

This commando extracts random lines of text from the input file, and stores them in the output file. It also cleans the file a bit. The file created is used to build a language model like this (assuming you stand in $GTHOME/tools/CorpusTools/corpustools/):

cat someinput | pytextcat complm > lm/language-iso-code.lm
cat someinput | pytextcat compwm > lm/language-iso-code.wm

After this, the language recognition tool pytextcat is ready for use with another language as shown in the previous section.

Source

The home page of the original perl-based package TextCat is found at several locations.

. The Groningen home page also includes links to a background article, a list of supported languages coming with the tools, and also a list of competitors. Here's also another link to a demo page, with e-mail address of the author.

The python implementation pytextcat we use here was written by Kevin Unhammer.

by Sjur N. Moshagen