Sentence alignment

We look at alternatives to our tca2 aligner.

The Europarl aligner

The sentence aligner used to align the Europarl parallel corpus is a Perl script based on the Gale and Church algorithm. See its README file for details.
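To make the idea behind a Gale and Church style aligner concrete, here is a minimal Python sketch of length-based alignment by dynamic programming. It is not the Europarl script and may differ from it in many details. The bead priors, the length ratio, and the variance are the values usually quoted for the method and should be treated as assumptions. The names bead_cost and align_lengths are made up for this illustration.

```python
import math

# Prior probability of each bead type (source sentences, target sentences).
# These and the two constants below are the values usually quoted for the
# Gale and Church method; they are assumptions, not read from the script.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}
LENGTH_RATIO = 1.0   # assumed target characters per source character
VARIANCE = 6.8       # assumed variance of the length difference


def bead_cost(src_len, tgt_len, prior):
    """Return -log(prior * P(length mismatch)) for one candidate bead."""
    mean = (src_len + tgt_len / LENGTH_RATIO) / 2.0
    if mean == 0:
        z = 0.0
    else:
        z = abs(LENGTH_RATIO * src_len - tgt_len) / math.sqrt(VARIANCE * mean)
    # Two-sided tail of a standard normal, floored to avoid log(0).
    tail = max(math.erfc(z / math.sqrt(2.0)), 1e-300)
    return -math.log(prior * tail)


def align_lengths(src_lens, tgt_lens):
    """Align two lists of sentence lengths; return beads as index lists."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                candidate = cost[i - di][j - dj] + bead_cost(
                    sum(src_lens[i - di:i]), sum(tgt_lens[j - dj:j]), prior)
                if candidate < cost[i][j]:
                    cost[i][j] = candidate
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

The input is the character length of each sentence on either side, and each returned bead pairs a list of source sentence indices with a list of target sentence indices. Only sentence lengths are used, which is why this kind of aligner needs no dictionary.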

The script can be downloaded from the Europarl site. At Giellatekno, it is placed under $GTHOME/tools/alignment-tools/europarl/.

To run it, first add se and no abbreviation files to the nonbreaking_prefixes directory (this has already been done). Then put the files to be aligned in the directories se and no. The filenames in se and no must be identical. The command is then:

./sentence-align-corpus.perl se no
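Because the filenames in se and no must be identical, it may be worth checking this before running the aligner. The following Python sketch is one way to do so. The function names unmatched_names and unmatched_files are hypothetical helpers, not part of the Europarl tools, and the sketch assumes the files lie directly in the two directories.

```python
from pathlib import Path


def unmatched_names(names_a, names_b):
    """Return (only_in_a, only_in_b) as sorted lists of names."""
    set_a, set_b = set(names_a), set(names_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


def unmatched_files(dir_a="se", dir_b="no"):
    """Compare the plain files found directly in two directories."""
    names_a = [p.name for p in Path(dir_a).iterdir() if p.is_file()]
    names_b = [p.name for p in Path(dir_b).iterdir() if p.is_file()]
    return unmatched_names(names_a, names_b)
```

Calling unmatched_files() from the directory that contains se and no returns two lists: the filenames found only in se and those found only in no. Both lists should be empty before the aligner is started.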

This might not work. For some hints, see:


  • Run a test on the Europarl aligner, and compare to tca2.

Other aligners?

Feel free to add.