corpus_conversionerrors_may11
Errors addressed so far (may 2011):
- dårlege originalfiler - gjev ugyldig xml
- desse blir fanga opp i dag
- kodefeil - desse gjev gyldig xml, men meiningslause bokstavar
- utf-som-macroman
- utf-som-latin1
- utf-som-html-hex
- utf-som-html-entitet
- skannefeil/ocr-feil - desse gjev meiningsfulle bokstavar, men meiningslaus tekst
- bad sentence-delimitation: one real sentence is one fragment in one language,
3 fragments in the other -> alignment goes bunk
- files freecorpus/converted/sme/admin/others/
-
STM200420050011000SE_PDFS.pdf.xml
STM200420050044000SE_PDFS.pdf.xml
have encoding errors that đ is represented as and the document is full of
's; thus these files should be deleted
- file OTP200620070025000SE_PDFS.pdf.xml
has paragraphs with content '--------' so it should be deleted.
- file STM200320040010000SE_PDFA.pdf.xml
has so many errors, it should be rescanned
-
uito-ohpenplana.txt.xml
the original file is corrupted