Language recognition using pytextcat

To identify sections of a document that are not in its main language, we need automatic language recognition. We have installed an open-source package that performs this task, and this page documents its usage and origin (see the Source section at the bottom of the page for background information).

Contents:

Usage
Adding a new recognizable language
Source

Usage

The primary use of this tool is via the corpus conversion tool convert2xml. When you use convert2xml to turn original corpus documents into text for processing, the language recogniser is applied automatically. You can also use it as a standalone program; run pytextcat proc -h to see the help text.

Typical usage as a standalone program will be something like:

pytextcat proc $GTHOME/tools/CorpusTools/corpustools/lm < testfile.txt

pytextcat will return the name (the ISO code, to be exact) of the language(s) the script believes the text to be in.
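
For repeated use, the standalone call above can be wrapped in a small shell function. The following is only a sketch and is not part of CorpusTools: the function name detect_language and the variables PYTEXTCAT and LM_DIR are made up for illustration, and the model directory defaults to the path used in the example above.

```sh
# Hypothetical helper, not part of CorpusTools.
# PYTEXTCAT: command to run (can be overridden, e.g. for testing).
# LM_DIR:    directory holding the language models (assumed default below).
: "${PYTEXTCAT:=pytextcat}"

detect_language() (
    # $1: text file to classify
    # $2: (optional) model directory
    file=$1
    lm_dir=${2:-${LM_DIR:-$GTHOME/tools/CorpusTools/corpustools/lm}}

    if [ ! -r "$file" ]; then
        echo "detect_language: cannot read $file" >&2
        exit 1
    fi

    # Same invocation as above: model directory as argument, text on stdin.
    "$PYTEXTCAT" proc "$lm_dir" < "$file"
)
```

With this function loaded into the shell, detect_language testfile.txt runs the same command as the example above and prints whatever pytextcat prints.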

Adding a new recognizable language

The pytextcat reference files are stored in $GTHOME/tools/CorpusTools/corpustools/.

To add a new language to be recognized, you first need to build a suitable training corpus. This is most easily done with the accompanying tool random_lines:

random_lines < some-text-file > ShortTexts/language-name.txt

This command extracts random lines of text from the input file and stores them in the output file. It also cleans the text a bit. The resulting file is then used as input to build a language model like this (assuming you are in $GTHOME/tools/CorpusTools/corpustools/):

cat someinput | pytextcat complm > lm/language-iso-code.lm

cat someinput | pytextcat compwm > lm/language-iso-code.wm

After this, the language recognition tool pytextcat is ready for use with the new language, as shown in the previous section.
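
The two model-building commands above can be combined into one step. The following is only a sketch and is not part of CorpusTools: the function name build_models and the PYTEXTCAT variable are made up for illustration, and it assumes that the training file is the one produced by random_lines and that the models go into an lm directory, as in the commands above.

```sh
# Hypothetical helper, not part of CorpusTools.
# PYTEXTCAT: command to run (can be overridden, e.g. for testing).
: "${PYTEXTCAT:=pytextcat}"

build_models() (
    # $1: ISO code of the new language
    # $2: training text, e.g. a file produced by random_lines
    # $3: (optional) output directory, defaults to ./lm
    iso=$1
    corpus=$2
    out_dir=${3:-lm}

    # Refuse to run without a language code or with an empty/missing corpus.
    if [ -z "$iso" ] || [ ! -s "$corpus" ]; then
        echo "usage: build_models ISO-CODE TRAINING-FILE [OUTPUT-DIR]" >&2
        exit 2
    fi

    mkdir -p "$out_dir" || exit 1

    # complm and compwm are the two subcommands shown above;
    # each reads the training text on stdin and writes a model to stdout.
    "$PYTEXTCAT" complm < "$corpus" > "$out_dir/$iso.lm" || exit 1
    "$PYTEXTCAT" compwm < "$corpus" > "$out_dir/$iso.wm" || exit 1
)
```

For example, build_models language-iso-code ShortTexts/language-name.txt would write lm/language-iso-code.lm and lm/language-iso-code.wm, stopping at the first command that fails.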

Source

The original Perl-based package TextCat can be found at several locations:

The original page at the University of Groningen, with the source code. The package is licensed under a GPL license (see the home page for details) and was developed by Gertjan van Noord.
The source code is also available in the Giellatekno repository (TODO: Fix)

The Groningen home page also includes links to a background article, a list of the supported languages that come with the tool, and a list of competitors. There is also a link to a demo page, with the e-mail address of the author.

The Python implementation used here, pytextcat, was written by Kevin Unhammer.

by Sjur N. Moshagen
