Error markup

We want to extend (some of) the corpus files with markup for spelling and other errors, to use them as gold standards for testing our spellers (and in the future other tools as well). The markup is done manually, and needs to follow certain rules.

Language-specific markup

Markup TYPES

We differentiate between different types of errors that people make, depending on the type of analysis needed to detect and correct the error. We also use the annotation for errors in learner texts.

Unclassified errors - wrong§correct

Errors of unknown type. By default such errors will be treated as spelling errors (see below). In the resulting xml, the name of the element will be <error>.
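
As a rough illustration of what the conversion does with this simplest mark, the following Python sketch turns wrong§correct into an <error> element. It is not the actual conversion script: the function name and the regular expression are hypothetical, and the attribute name "correct" is taken from the <error> example further down on this page. It assumes the error is a single token and the correction carries no classification.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches wrong§correct or wrong§(multi word correction).
# Assumption: the error itself is one whitespace-free token.
SIMPLE_MARK = re.compile(r"([^\s()§]+)§(?:\(([^()|]*)\)|([^\s()]+))")


def convert_unclassified(text: str) -> str:
    """Replace simple wrong§correct marks with <error> elements."""

    def to_xml(match: re.Match) -> str:
        wrong = match.group(1)
        # Group 2 is a parenthesised correction, group 3 a bare one.
        if match.group(2) is not None:
            correct = match.group(2)
        else:
            correct = match.group(3)
        return f"<error correct={quoteattr(correct)}>{escape(wrong)}</error>"

    return SIMPLE_MARK.sub(to_xml, text)


print(convert_unclassified("Norggabealde§(Norgga bealde)"))
```

Marks whose correction contains a classification (a "|" inside the parentheses) are left untouched by this sketch.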

Orthographic errors, non-words - wrong$(error classification|correct)

Traditional misspellings confined to single (error) strings, that is, errors that can be detected and corrected without an analysis of the surrounding words. In the resulting xml, the element is named <errorort>. These errors always result in non-words, so a speller should be able to detect them.

Old annotation: Attributes:
pos
( noun | verb | adj | adv | num | interj | pp | cc | cs | pers | refl | dem | resip | indef | pcle | prop | typo | mix | x )
errtype
The value of this attribute is language specific. For details, see below.

Orthographic errors, real-words - wrong¢(error classification|correct), same as for non-words, see above

Misspellings confined to single words that still need an analysis of the surrounding words to be detected and corrected. In the resulting xml, the element is named <errorortreal>. These errors, although orthographic in nature, result in other, real words, so a traditional speller is unable to detect them.

Old annotation: Attributes:
pos ( noun | verb | adj | adv | num | interj | pp | cc | cs | pers | refl | dem | resip | indef | pcle | prop | typo | x )
errtype ( a | á | conc | svow | vowc | vow | con | mono | diph | lime | meta | suf | cmp | ascii | typo | cap | min | mix)
errcase ( gen )
corrcase ( nom )
Some explanations:
a = should be a instead of á
á = should be á instead of a
conc = consonant centre
vowc = vowel centre
svow = soggevokála
vow = some other wrong vowel
con = some other wrong consonant
mono = diphthong, but should be monophthong
diph = wrong diphthong
lime = consonant between two unstressed syllables, vuovddášis vs vuovddážis
meta = metathesis
suf = wrong suffixing, e.g. gievkani vs. gievkkanii
cmp = incorrect compounding, e.g. beavdiguorra vs beavdeguorra
ascii = Sámi letter not used
typo = typical typo
cap = should be capital letter
min = should be minuscule
mix = more than one error type in the same word

Morphosyntactic errors - (context wrong)£(pos,gf,cat,orig,errtype|context correct)

Errors that require an analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errormorphsyn>.

Examples of the new annotation
  • Wrong case used:
    • Simple annotation: Mun liikon dien girjji£girjái .
    • Detailed annotation: Mun (liikon dien girjji)£(n,case,acc-ill|liikon dien girjái) .
  • Analytic inflection instead of synthetic inflection:
    • Simple annotation: Dat lea (eanemus dábálaš)£dábálaččamus váriin ja duoddariin.
    • Detailed annotation: Dat lea (eanemus dábálaš)£(adj,superl,analyt-synt|dábálaččamus) váriin ja duoddariin.
Old annotation: Attributes:
pos ( noun | verb | adj | adv | num | interj | pp | cc | cs | pcle | prop |pers | refl | dem | resip | indef | x )
gf ( subj | obj | advl | fin | infin | spred | opred | pcle | interj | app | conj | pph | x | attr )
cat ( nomsg | nompl | gensg | genpl | illsg | illpl | locsg | locpl | comsg | compl | ess | sg1prt | sg2prt | sg3prt | du1prt | du2prt | du3prt | pl1prt | pl2prt | pl3prt | sg1prs | sg2prs | sg3prs | du1prs | du2prs | du3prs | pl1prs | pl2prs | pl3prs | attr | pred | word | comp | superl | cmp | imprt | pot | infinite | cond | conneg | ger | vgen | x )
orig ( nomsg | nompl | gensg | genpl | illsg | illpl | locsg | locpl | comsg | compl | ess | sg1prt | sg2prt | sg3prt | du1prt | du2prt | du3prt | pl1prt | pl2prt | pl3prt | sg1prs | sg2prs | sg3prs | du1prs | du2prs | du3prs | pl1prs | pl2prs | pl3prs | attr | pred | word | comp | superl | imprt | pot | infinite | cond | conneg | ger | vgen | x )
errtype ( agr | case | tense | mode | number | mix | x )
Some explanations:
gf = grammatical function
subj = subject
fin = finite verb
infin = non-finite verb
obj = object
spred = subject predicative
opred = object predicative
advl = adverbial, e.g. Mun boađán 'sotnabeaivi' vs. Mun boađán 'sotnabeaivve'
pph = pp phrase, e.g. sullo guovdu vs. guovdu sullo
conj = conjunction/subjunction
pcle = particle
interj = interjection
app = apposition
attr = attribute
x = unknown
nump = numeral phrase
gensg = acc/gen sg
genpl = acc/gen pl
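
To make the field order of the old, detailed annotation concrete, here is a minimal Python sketch that splits the part after £ into named values. The function name and the returned key "correct" are hypothetical; the field names follow the heading of this section (pos, gf, cat, orig, errtype). It handles only the five-field old annotation and the simple annotation without attributes, not the shorter attribute lists of the new annotation.

```python
# Field order of the old, detailed morphosyntactic annotation.
MORPHSYN_FIELDS = ("pos", "gf", "cat", "orig", "errtype")


def parse_morphsyn(annotation: str) -> dict:
    """Split 'pos,gf,cat,orig,errtype|correction' into named parts."""
    attributes, separator, correction = annotation.partition("|")
    if not separator:
        # No '|' present: the whole string is the correction (simple annotation).
        return {"correct": annotation}
    values = [value.strip() for value in attributes.split(",")]
    if len(values) != len(MORPHSYN_FIELDS):
        raise ValueError(
            f"expected {len(MORPHSYN_FIELDS)} fields, got {len(values)}"
        )
    result = dict(zip(MORPHSYN_FIELDS, values))
    result["correct"] = correction
    return result


print(parse_morphsyn("a,spred,nompl,nomsg,agr|Nieiddat leat nuorat"))
```

Rejecting attribute lists of the wrong length is a design choice of this sketch; it makes a forgotten field visible instead of silently shifting the remaining values.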

Syntactic errors - redundantword¥0 or word¥(word missingword) or word¥(missingword word)

These errors also require a partial or full analysis of (parts of) the sentence or surrounding words to be detected and corrected. In the resulting xml, the element is named <errorsyn>.

Examples of the new annotation
  • Redundant word, simple annotation:
    • SNF doaibmá dál juo dego¥0 resursaguovddážin.
    • Gaup searvvai mielde¥0 guoimmuhanprográmmii
  • Missing word, simple annotation:
    • Fápmudusskovvi galgá leat stivrras ovdal¥(ovdal go) riikačoahkkin álgá.
    • Dat lei vuosttaš girji sámiid birra máid sápmelaš čállán¥(lea čállán) .
Old annotation: Attributes:
pos ( noun | verb | adj | adv | num | interj | pp | cc | cs | pcle | prop | pers | refl | dem | det | resip | indef | punct | x )
errtype ( wo | pph | redun | missing | cmp | x )
Some explanations:
wo = word order
pph = pp-phrase
redun = redundant word
dupl = duplicate
missing = missing word, or punctuation when it is crucial for the interpretation
cmp = should be compound
x = unknown
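
The effect of the syntactic marks on the text can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name is hypothetical, the error must be a single word (parenthesised error spans are not handled), and a correction of 0 or an empty correction is read as "delete the word".

```python
import re

# word¥correction or word¥(multi word correction); single-word errors only.
SYN_MARK = re.compile(r"(\S+)¥(?:\(([^()]*)\)|(\S+))")


def apply_syntactic_corrections(text: str) -> str:
    """Return the text with simple ¥ marks replaced by their corrections."""

    def fix(match: re.Match) -> str:
        if match.group(2) is not None:
            correction = match.group(2)
        else:
            correction = match.group(3)
        # Drop an optional 'pos,errtype|' prefix from the old annotation.
        correction = correction.rpartition("|")[2].strip()
        # '0' (new annotation) or nothing (old annotation) removes the word.
        return "" if correction in ("0", "") else correction

    corrected = SYN_MARK.sub(fix, text)
    # Removing a word leaves a double space behind; collapse it.
    return re.sub(r" {2,}", " ", corrected).strip()


print(apply_syntactic_corrections("SNF doaibmá dál juo dego¥0 resursaguovddážin."))
```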

Lexical errors - wrong€(wrongPoS,correctPoS|correct)

Errors where the only problem is the choice of word, that is, another word would be better or correct. Detecting and correcting such errors requires, in addition to syntactic analysis, a dictionary component with sufficiently rich syntactic and semantic markup of the entries, as well as syntactic and semantic disambiguation. Automatic detection and correction of this type of error is probably not possible in the near future, but the need to mark up texts for these errors is real now. In the resulting xml, the element is named <errorlex>.

Examples of the new annotation
  • Adjective used instead of adverb:
    • ovddimus€(adv-adj|ovddimusat)
  • Wrong verb used:
    • Go su ráđđehus eretgeassádii€(verb|geassádii) , juolludii Stuorradiggi Nygaardsvoldii gudnebálkká.
  • Another language used instead of Sámi:
    • og€(foreign|ja)
    • august€(foreign|borgemánnu)
Old annotation: Attributes:
pos ( noun | verb | adj | adv | num | interj | pp | cc | cs | pcle | prop | pers | refl | dem | resip | indef | x )
orig ( noun | verb | adj | adv | num | interj | ppan | cc | cs | pcle | prop | pers | refl | dem | resip | indef | x )
errtype ( der | w | foreign | x )
Some explanations:
der = wrong derivation
w = wrong word
foreign = foreign word
x = unknown

Nesting

The error types can be nested, with the spelling error being the innermost one. That is, the following nesting is allowed: syntactic > morphosyntactic > lexical > spelling.

Parentheses are used to identify the range of the error. When nesting error markup, parentheses are required. Parentheses are also required when the error is followed by punctuation that is not part of the error or correction - the parenthesis will make sure the punctuation stays outside the error correction markup.
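
The nesting order can be checked mechanically. The Python sketch below is a hypothetical helper, not part of the conversion tools. It assumes that §, $ and ¢ all count as "spelling" for this purpose, and that an error may contain another error of the same type, as in the nested morphosyntactic example on this page.

```python
# Rank of each markup symbol, from outermost (0) to innermost (3).
# Assumption: §, $ and ¢ are all treated as spelling errors here.
NESTING_RANK = {"¥": 0, "£": 1, "€": 2, "$": 3, "¢": 3, "§": 3}


def nesting_is_allowed(symbols_outer_to_inner: list[str]) -> bool:
    """Check a chain of nested marks, listed from outermost to innermost."""
    ranks = [NESTING_RANK[symbol] for symbol in symbols_outer_to_inner]
    # Every inner error must be of the same or a more local type.
    return all(outer <= inner for outer, inner in zip(ranks, ranks[1:]))


print(nesting_is_allowed(["£", "$"]))  # spelling error inside a morphosyntactic one
print(nesting_is_allowed(["$", "£"]))  # morphosyntactic error inside a spelling one
```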

Markup EXAMPLES

Here are some examples of error/correction markup and how they are converted to xml:

nourra$(a,meta|nuorra)

<errorort pos="n" errtype="meta" corr="nuorra">nourra</errorort>

(Nieiddat leat nuorra)£(a,spred,nompl,nomsg,agr|Nieiddat leat nuorat).

<errormorphsyn cat="nompl" const="spred" correct="Nieiddat leat nuorat" errtype="agr" orig="nomsg" pos="adj">Nieiddat leat \
      <errorort correct="nuorra" errtype="meta" pos="adj">nourra</errorort></errormorphsyn>.

Mun (riŋgen nieidda lusa)¥(x,pph|riŋgen niidii) ihttin.

Mun <errorsyn pos="x" errtype="pph" corr="riŋgen niidii">riŋgen nieidda lusa</errorsyn> ihttin.

Son lei ovtta¥(num,redun| ) viesus.

Son lei <errorsyn pos="num" errtype="redun" corr="">ovtta</errorsyn> viesus.

Mun barggan nu dábálaš€(adv,adj,der|dábálaččat).

Mun barggan nu <errorlex pos="adv" origpos="adj" errtype="der" corr="dábálaččat">dábálaš</errorlex>.

Nesting:

(Nieiddat leat nourra$(adj,meta|nuorra))£(adj,spred,nompl,nomsg,agr|Nieiddat leat nuorat).

<errormorphsyn pos="adj" const="spred" cat="nompl" orig="nomsg" errtype="agr" corr="Nieiddat leat nuorat">
Nieiddat leat <errorort pos="adj" errtype="meta" corr="nuorra">nourra</errorort></errormorphsyn>.

Mus leat (guokte ganddat§(n,á|gánddat))£(n,nump,gensg,nompl,case|guokte gándda).

Mus leat <errormorphsyn cat="gensg" const="nump" correct="guokte gándda" errtype="case" orig="nompl" pos="n">
guokte <error correct="gánddat">ganddat</error></errormorphsyn>.

Mus (leat (okta máná)£(n,spred,nomsg,gensg,case|okta mánná))£(v,v,sg3prs,pl3prs,agr|lea okta mánná).

Mus <errormorphsyn cat="sg3prs" const="v" correct="lea okta mánná" errtype="agr" orig="pl3prs" pos="v">
leat <errormorphsyn cat="nomsg" const="spred" correct="okta mánná" errtype="case" orig="gensg" pos="n">
okta máná</errormorphsyn></errormorphsyn>.
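
The role of the parentheses in the examples above can be sketched in Python: given the position of an error mark, the helper below finds the error span to its left (a balanced parenthesised group, or the preceding word) and the correction body to its right. The function name is hypothetical and this is not how the conversion script is known to work. It assumes that the correction itself contains no parentheses.

```python
MARKS = "§$¢£¥€"


def split_at_mark(text: str, mark_index: int) -> tuple[str, str]:
    """Return (error span, correction body) for the mark at mark_index."""
    if text[mark_index] not in MARKS:
        raise ValueError("no error mark at this position")

    if mark_index > 0 and text[mark_index - 1] == ")":
        # Parenthesised error: walk left to the matching opening parenthesis.
        depth = 0
        for i in range(mark_index - 1, -1, -1):
            if text[i] == ")":
                depth += 1
            elif text[i] == "(":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced parentheses before the mark")
        error = text[i + 1 : mark_index - 1]
    else:
        # Bare error: the word directly before the mark.
        start = mark_index
        while (
            start > 0
            and not text[start - 1].isspace()
            and text[start - 1] != "("
        ):
            start -= 1
        error = text[start:mark_index]

    after = mark_index + 1
    if after < len(text) and text[after] == "(":
        # Parenthesised correction (assumed to contain no parentheses).
        close = text.find(")", after)
        if close == -1:
            raise ValueError("unclosed correction")
        correction = text[after + 1 : close]
    else:
        # Bare correction: runs to the next space or parenthesis.
        end = after
        while end < len(text) and not text[end].isspace() and text[end] not in "()":
            end += 1
        correction = text[after:end]
    return error, correction


sample = "(Nieiddat leat nourra$(adj,meta|nuorra))£(adj,spred,nompl,nomsg,agr|Nieiddat leat nuorat)."
print(split_at_mark(sample, sample.index("£")))
print(split_at_mark(sample, sample.index("$")))
```

Applied to the outer mark, the error span still contains the inner spelling markup, so the same helper can be applied again to the span to reach the innermost error.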

Markup RULES

The following rules should be followed when marking up texts:

  1. The correction is always done in the original format - never in the xml file! That is, make a copy of the original doc, txt or html file, and name it corr.doc, corr.txt, or corr.html, and add the correction markup in this new file. This will create a "new" original, which is identical to the "real" original, except for the additional correction markup. The "new" original will be converted to xml by the script convert2xml.pl, which is run automatically every night. Corrections done to the converted xml files will be lost upon next conversion.
  2. $ is the spelling correction mark - use it directly after the wrongly spelled word, followed by the correction, as in error$correction. Example: volvo$Volvo. NB! there should be NO space on either side of the correction mark $.
  3. skip foreign text - we assume that text in other languages is properly detected, or manually marked in the xsl file. That is: DON'T add spelling error markup to passages in Norwegian - instead, try to enforce or add xml markup designating the passage as being in Norwegian. Single words used as part of a Sámi sentence (in situ loans) should NOT be marked either, since we can't know what the correction should be (and in principle the word isn't a misspelling if it is correctly spelled Norwegian).
  4. enclose multiword corrections in parentheses - since the conversion to xml needs a way of knowing where the correction ends, we need to tell it if it is not at the end of the first word after the correction symbol. Example: Norggabealde§(Norgga bealde)
  5. separate punctuation that is not part of the correction with a space, or use parentheses around the correction. Example: "buolasta§(buolašta)." or "buolasta§buolašta ." (the example text is the text within the quotes, including the punctuation).
  6. remember the case - the correction should have the same case pattern as the spelling error. Example: Mannjá§Maŋŋá, NOT Mannjá§maŋŋá (note the case of the initial letter). The exception is of course when the error is missing capitalisation, as in names spelled lower case, etc.
  7. always provide a correction! The markup is useless if it isn't complete.
  8. Both the untouched original and the corrected "original" should be stored in $CORPUSHOME/prooftest/orig/$LANG/$GENRE/. The converted xml file(s) will be found in $CORPUSHOME/prooftest/$CONTRACT/$LANG/$GENRE/. It is important that the untouched original is also stored in the prooftest/ hierarchy, otherwise it can easily be included when making new missing lists, which means that the coverage testing will become misleading without us noticing it.
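
Some of these rules lend themselves to an automatic check before the file is handed over to conversion. The Python sketch below is a hypothetical helper covering only rules 2, 6 and 7 for simple, unparenthesised § and $ marks. It cannot tell a missing correction from a stray space after the mark, so both are reported the same way.

```python
import re


def check_marks(text: str) -> list[str]:
    """Report likely violations of rules 2, 6 and 7 for simple § and $ marks."""
    problems = []
    for match in re.finditer(r"[§$]", text):
        i = match.start()
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isspace():
            # Rule 2: no space between the error and the mark.
            problems.append(f"position {i}: no error text directly before the mark")
        if after.isspace():
            # Rules 2 and 7: the correction must follow the mark immediately.
            problems.append(f"position {i}: no correction directly after the mark")

    # Rule 6: a capitalised error should get a capitalised correction.
    # The reverse (volvo$Volvo) is the allowed exception and is not flagged.
    simple = re.compile(r"([^\s()§$]+)([§$])([^\s()§$]+)")
    for wrong, mark, correct in simple.findall(text):
        if wrong[0].isupper() and correct[0].islower():
            problems.append(
                f"{wrong}{mark}{correct}: correction lost the initial capital"
            )
    return problems


for line in ["volvo$Volvo", "Mannjá§maŋŋá", "buolasta $buolašta"]:
    print(line, "->", check_marks(line))
```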

Summary + new error types

(xml element name after conversion to xml is specified after the symbol used for the actual markup)

§ - <error> - unclassified
Unclassified errors, never used, kept for backwards compatibility
$ - <errorort> - orthographic/non-word errors
Traditional typos resulting in non-words, typically targeted by spelling checkers
¢ - <errorortreal> - orthographic/real-word errors
Spelling errors resulting in another, real word, impossible to target by traditional spelling checkers
£ - <errormorphsyn> - morphosyntactic errors
Errors involving the morphosyntax of the language
¥ - <errorsyn> - syntactic errors
Purely syntactic errors
€ - <errorlex> - lexical errors
Errors due to wrong or bad lexeme chosen
∞ - <errorlang> - foreign language
Text written in a foreign or technical language, irrelevant for testing (text marked up as this will be ignored during testing)
‰ - <errorformat> - formatting errors
Errors due to wrong or bad formatting: extra spaces, wrong quote marks, etc.
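
The summary above amounts to a lookup table from markup symbol to xml element name. A minimal Python sketch of that table, with a hypothetical helper that counts the marks in a marked-up text, could look like this. It counts every occurrence of a symbol, so a literal $ or € in running text (a price, for instance) would be miscounted.

```python
from collections import Counter

# Markup symbol -> xml element name, as listed in the summary.
ELEMENT_FOR_MARK = {
    "§": "error",
    "$": "errorort",
    "¢": "errorortreal",
    "£": "errormorphsyn",
    "¥": "errorsyn",
    "€": "errorlex",
    "∞": "errorlang",
    "‰": "errorformat",
}


def count_error_types(text: str) -> Counter:
    """Count marks per element name; every symbol occurrence is counted."""
    return Counter(ELEMENT_FOR_MARK[ch] for ch in text if ch in ELEMENT_FOR_MARK)


print(count_error_types("nourra$(adj,meta|nuorra) og€(foreign|ja)"))
```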

If these guidelines are followed, the resulting files should be readily usable for (speller) testing as soon as they are converted to xml.