Corpus meeting 7.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond


  • Algorithm for dealing with scanning errors
  • Sentence alignment
  • Word alignment

Algorithm for dealing with scanning errors

Finding the files

Analyse the main language text morphologically. Then, for each file:

converted/$lang/catalogue/file.xml -- analyse $lang nodes
Count the missing ones -- … | usme | grep '?'
For each file: register the missing/total ratio.
List the files according to the ratio, and pick the worst files.
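
The ranking step above could be sketched roughly as follows. This is only an illustration, not the script used at the meeting: the function name rank_by_error_ratio and the input format (one line per file with name, total count and missing count) are assumptions, and file names containing spaces are not handled.

```shell
# Hypothetical helper: rank files by their missing/total ratio.
# Input (stdin): one line per file, "name total missing", whitespace-separated.
# Output: "ratio<TAB>name", highest ratio (worst file) first.
rank_by_error_ratio() {
    # Files with a total of 0 are skipped to avoid dividing by zero.
    LC_ALL=C awk '$2 > 0 { printf "%.4f\t%s\n", $3 / $2, $1 }' |
        LC_ALL=C sort -k1,1nr
}
```

Fed with the per-file counts, for example printf 'a.xml 1535 87\nb.xml 1703 53\n' | rank_by_error_ratio, it lists the file with the largest share of unanalysed lines first; piping the result through head then gives the worst files.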

Priority: converted/sme/admin/

Tomi has tried this with one file. Command:

# $1 is the file to check
linecount=`ccat -l sme "$1" | preprocess | wc -l`
errors=`ccat -l sme "$1" | preprocess | /opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/lookup -flags mbTT -utf8 ~/langtech/gt/sme/bin/sme.fst | grep '?' | wc -l`

echo "lines: $linecount / $errors"

lines: 1535 / 87
lines: 1703 / 53
for i in $ANALYSED_DIR/$SMILANG*.ccat.txt
time cat $i | $PREPROCESS 2> /dev/null | lookup -q -flags mbTT $GTHOME/gt/$SMILANG/bin/$SMILANG.fst |

Outcome of this: A list of files


  • File list as described, due this week. Tomi.

Finding a cure for improving the files

  • What has caused the high error rates?
  • How can it be fixed?

Status quo for boundcorpus and freecorpus

  • 1700 out of 52000 files in boundcorpus still cause problems.
  • 84 files out of 9276 in freecorpus still cause problems. Of these, 18 are in $lang/admin.
  • Also, look at errors in nob/admin
  • Look at errors in files with parallel versions.

List of files which are not converted:

freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" "

Command for counting the files in the list:

freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | wc -l

The error in the eng files is trivial. The focus is now on the nob-sme pairs under admin.


  • Fix the (very few) remaining non-converted files (Børre)

Sentence alignment

TCA2 version update

TCA2 used to work. At some point while the code was left untouched, it stopped working because of the Java upgrade to 1.6. Børre tried to fix it this autumn.

  • Oct 5th 2006 -- GUI does not work for Børre, Trond
  • Sep 29th 2006 -- GUI works for Børre, Trond


  • Get a newer version (Trond).

TCA2 installing for the rest of us

Postponed until the version question has been clarified.

Anchor list

Trond made an sme-anchor.fst

{biegga} ?* |
{biekka} ?* |
{lássa} ?* |
{lása} ?* |
{viidni} ?* |
{viinni} ?* |
{vuitti} |

Run the corpus through the anchor fst, and spot holes. Fill them.

Split the anchor list in two:

  1. a domain-independent one
  2. a domain-dependent one


  • Trond and Berit Merete to look at this.

Sentence length parameter

We need the ratio.


Dice coefficient


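Wikipedia: The coefficient may be defined as twice the shared information (intersection) over the combined set (union). Note that the denominator is the sum of the sizes of the two sets, |X| + |Y|, not the size of their set-theoretic union.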

Hofland and Johansson:

For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. For other languages, the acceptable value for the coefficient can be less. The cognate parameter is also read by the program.
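
To get a feeling for the values involved, the coefficient 2|X ∩ Y| / (|X| + |Y|) can be computed for a single word pair with a small sketch like the one below. It assumes that the compared units are character bigrams, which these notes do not confirm for TCA2; the function name dice is hypothetical. For portability it works on bytes, so non-ASCII letters such as á count as two units.

```shell
# Hypothetical helper: Dice coefficient of two words over character bigrams.
# Assumption: bigrams are the compared units; repeated bigrams are matched
# one-to-one (multiset intersection). Works on bytes because of LC_ALL=C.
dice() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        na = length(a) - 1          # number of bigrams in the first word
        nb = length(b) - 1          # number of bigrams in the second word
        if (na < 1 || nb < 1) { print "0.0000"; exit }
        for (i = 1; i <= na; i++) seen[substr(a, i, 2)]++
        shared = 0
        for (i = 1; i <= nb; i++) {
            g = substr(b, i, 2)
            if (seen[g] > 0) { seen[g]--; shared++ }
        }
        printf "%.4f\n", 2 * shared / (na + nb)
    }'
}
```

For example, dice night nacht prints 0.2500: the two words share one bigram (ht) out of 4 + 4. Running such a function over candidate sme:nob word pairs would be one way to see where a reasonable threshold might lie.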

Question: Is there a parameter to be set here?


  • Discuss with Bergen (Trond)
  • Follow-up (all)
    • Find the sme:nob Dice coefficient for cognates
    • Find the minimum length for words to be considered


Probably an important candidate.


  • Ciprian to look into it.

Alternatives to TCA2?

  • Maligna
  • Maca
  • Others?

Work ahead

  • All: Read the documentation and get a grasp of the overall picture.
  • Tomi: Find the bad files and look into them + report (this week)
  • Trond: Talk to Bergen (this week) + thereafter we install
  • Berit, Trond: Anchor list
  • Børre: Conversion
  • Ciprian: various counting tasks

Next (short) corpus meeting:

  • Monday afternoon, April 11th.