corpus_ocr_may11
Algorithm for dealing with OCR errors
Finding these errors
- Problem: There are document-specific conversion errors that result
- Solution: identify the problematic files via error detection with fst
TODO:
The Error Detection Algorithm runs as follows:
- For each file:
- Analyse the main language text morphologically
- Count the words that receive no analysis (the missing ones)
- Register the missing/total ratio, and pick the worst files
- Look at the worst files, and figure out how to mend them, or move them,
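The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the real pipeline would ask the fst whether a word gets an analysis, whereas here `toy_is_known` and `TOY_LEXICON` are hypothetical stand-ins, words are split on whitespace, and the file names are made up.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def missing_ratio(
    words: Iterable[str], is_known: Callable[[str], bool]
) -> Tuple[float, int, int]:
    """Return (missing/total ratio, total words, missing words) for one file."""
    total = 0
    missing = 0
    for word in words:
        total += 1
        if not is_known(word):
            missing += 1
    # An empty file gets ratio 0.0 rather than a division error.
    ratio = missing / total if total else 0.0
    return ratio, total, missing


def worst_files(
    texts: Dict[str, str], is_known: Callable[[str], bool], top: int = 10
) -> List[Tuple[float, int, int, str]]:
    """Rank files by missing/total ratio, worst first, and keep the top ones."""
    rows = []
    for name, text in texts.items():
        # Assumption: whitespace tokenisation is good enough for a sketch.
        ratio, total, missing = missing_ratio(text.split(), is_known)
        rows.append((ratio, total, missing, name))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]


# Hypothetical stand-in for the morphological analyser (not the real fst).
TOY_LEXICON = {"mun", "lean", "dáppe"}


def toy_is_known(word: str) -> bool:
    return word.lower() in TOY_LEXICON


if __name__ == "__main__":
    texts = {"good.txt": "mun lean dáppe", "bad.txt": "mnn 1ean dáppe"}
    for row in worst_files(texts, toy_is_known):
        print(row)
```

Keeping the analyser behind a plain callable means the ranking logic stays the same whichever analyser is plugged in.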
Results of finding errors for North Sami
Here is a list of errors per file in each folder in the admin directory. For each file we list error/total ratio - total number of words - words not recognized - filename, and we sort the file list according to error/total ratio:
- admin/depts/others
- admin/guovda
- admin/others
- admin/regjering
- admin/sd/others
- admin/sd/samediggi
- sma corpus errors
- sma corpus error analysis
- sme corpus errors analysis
- sme corpus errors admin/
- sme corpus errors analysis
- sme corpus errors guovda/
- sme corpus errors regjering/
- smj corpus errors
- smj corpus error analysis
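As an illustration of the per-file listing format described above (error/total ratio - total number of words - words not recognized - filename), here is a minimal sketch. The sort direction (worst ratio first), the three-decimal ratio, and the file names are assumptions, not taken from the actual reports.

```python
from typing import List, Tuple


def format_report(rows: List[Tuple[str, int, int]]) -> List[str]:
    """Turn (filename, total words, unrecognised words) into report lines.

    Each line reads: ratio - total - unrecognised - filename.
    Lines are sorted by ratio, highest (worst) first; this order is an assumption.
    """
    scored = []
    for name, total, missing in rows:
        ratio = missing / total if total else 0.0
        scored.append((ratio, total, missing, name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        f"{ratio:.3f} - {total} - {missing} - {name}"
        for ratio, total, missing, name in scored
    ]


if __name__ == "__main__":
    # Hypothetical counts for three hypothetical files.
    for line in format_report([("a.xml", 200, 10), ("b.xml", 50, 25), ("c.xml", 0, 0)]):
        print(line)
```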
A list of error analyses can be found under corpus error analysis.
Error typology (summarising the corpus error analysis):
- Conversion errors
- ==> Improve conversion
- Typing errors
- ==> Add to typos.txt, possibly move to typos gold corpus
- Linguistic spelling errors
- ==> Add to typos.txt, possibly move to typos gold corpus
- Scanning errors
- ==> Analyse the scanning errors and add search-replace to xsl file
- Language recognition errors
- ==> Check whether the xsl file lists the relevant languages
- ==> Improve the language recognition module
- Numbers not recognised
- ==> Improve fst
- Unknown words (bad fst)
- ==> Improve fst
- Corrupted original
- ==> Consider removing it
TODO:
- Improve conversion according to error type, as sketched above
Results of finding errors for South Sami
TODO:
- Improve the sma (South Sami) results from the tests above
Finding catalogue errors:
List all files in langX-catalogue with more non-langX content than
TODO:
- Still not done.
Correcting OCR errors
Develop algorithms for automatic correction of OCR errors. This