Algorithm for dealing with OCR errors

Finding these errors

  • Problem: Some document-specific conversion errors produce real letters rather than obvious garbage, so they can only be found by linguistic means.
  • Solution: identify the problematic files via fst-based error detection


The Error Detection Algorithm runs as follows:

  1. For each file:
    1. Analyse the main-language text morphologically
    2. Count the word forms that get no analysis
    3. Register the missing/total ratio, and pick the worst files
  2. Look at the worst files, and figure out how to mend them, or move them, e.g. to an OCR gold standard
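The per-file detection loop above can be sketched in Python. Here a plain set of known word forms stands in for the morphological fst analyser; the word list is a toy example, not the project's actual lexicon:

```python
import re

def missing_ratio(text, recognises):
    """Return (missing, total, ratio) for the word forms in `text`.

    `recognises` stands in for the morphological analyser: it returns
    True if the word form gets an analysis.
    """
    words = re.findall(r"\w+", text)
    total = len(words)
    missing = sum(1 for w in words if not recognises(w))
    return missing, total, missing / total if total else 0.0

# Toy lexicon standing in for the North Sami fst.
LEXICON = {"mun", "don", "son", "leat", "ja"}

# 1 of 5 word forms is unrecognised
miss, tot, ratio = missing_ratio("mun ja don leat xqz", LEXICON.__contains__)
```

Files can then be ranked by `ratio`, and the worst ones inspected by hand.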

Results of finding errors for North Sami

Here is a list of errors per file in each folder in the admin directory. For each file we list the error/total ratio, the total number of words, the number of unrecognised words, and the filename; the file list is sorted by error/total ratio:
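A minimal way to produce such a sorted listing; the file names and counts below are invented for illustration:

```python
# (filename, total, missing) tuples as produced by the detection
# step; the numbers are invented examples.
results = [
    ("fileA.txt", 1000, 12),
    ("fileB.txt", 500, 90),
    ("fileC.txt", 200, 4),
]

report = sorted(
    ((missing / total, total, missing, name) for name, total, missing in results),
    reverse=True,  # worst files first
)
for ratio, total, missing, name in report:
    print(f"{ratio:.3f} - {total} - {missing} - {name}")
```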

A list of error analyses can be found in the corpus error analysis.

Error typology (summarising the corpus error analysis):

  • Conversion errors
    • ==> Improve conversion
  • Typing errors
    • ==> Add to typos.txt; possibly move to the typos gold corpus
  • Linguistic spelling errors
    • ==> Add to typos.txt; possibly move to the typos gold corpus
  • Scanning errors
    • ==> Analyse the scanning errors and add search-replace pairs to the xsl file
  • Language recognition errors
    • ==> Check whether the xsl file lists the relevant languages
    • ==> Improve the language recognition module
  • Numbers not recognised
    • ==> Improve fst
  • Unknown words (bad fst)
    • ==> Improve fst
  • Corrupted original
    • ==> Consider removing it


  • Improve the conversion according to error type, as sketched above
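For the scanning-error case, the search-replace pairs added to the xsl file can be prototyped as plain substitutions. The confusion pairs below are hypothetical examples, not the project's actual list:

```python
# Hypothetical OCR confusion pairs; the real pairs live in the
# per-document xsl files.
REPLACEMENTS = [
    ("rn", "m"),   # 'rn' misread for 'm'
    ("vv", "w"),   # 'vv' misread for 'w'
]

def apply_replacements(text, pairs):
    """Apply each search-replace pair to the text in order."""
    for wrong, right in pairs:
        text = text.replace(wrong, right)
    return text
```

Once a pair has been verified against the worst files, it can be moved into the corresponding xsl file.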

Results of finding errors for South Sami


  • Improve on the test results above for sma (South Sami)

Finding catalogue errors

List all files in the langX catalogue with more non-langX content than langX content.


  • Still not done.
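The catalogue check could be sketched by running both the langX analyser and the analysers of likely contact languages over each file and comparing recognition counts. The word sets below are toy stand-ins for the real fsts:

```python
def recognised_count(words, lexicon):
    """Count word forms the given (toy) analyser recognises."""
    return sum(1 for w in words if w in lexicon)

def misfiled(words, lang_lexicon, other_lexicons):
    """True if some other language's analyser recognises more of the
    file than the analyser of the catalogue's own language."""
    own = recognised_count(words, lang_lexicon)
    return any(recognised_count(words, lex) > own
               for lex in other_lexicons.values())

# Toy lexicons; a real check would use the morphological fsts.
SME = {"mun", "don", "leat"}
NOB = {"jeg", "du", "er", "og"}

# A Norwegian sentence sitting in the sme catalogue gets flagged.
flagged = misfiled("jeg er her og du er der".split(), SME, {"nob": NOB})
```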

Correcting OCR errors

Develop algorithms for automatic correction of OCR errors. This work must be done separately for each language.
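One common starting point (a sketch, not the project's decided method) is to map each unrecognised word form to the closest recognised form in that language; here `difflib` does the fuzzy matching and a toy word list stands in for the analyser:

```python
import difflib

# Toy lexicon standing in for the language's analyser.
LEXICON = ["boahtit", "mannat", "leat", "ja"]

def suggest(word, lexicon):
    """Return the closest known word form, or None if nothing is close."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=0.75)
    return matches[0] if matches else None

# e.g. the OCR form 'bcahtit' maps back to 'boahtit'
```

A production version would have to rank candidates with language-specific confusion pairs and frequency data, which is why the work must be done separately per language.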