Compound Tags
Contents:
Compounding tags in the lexicon
See also the separate pages on morphological, semantic, syntactic and dependency tags.
Goal
- sensible tags
- sensible defaults to reduce writing to a minimum
Positional tags
Suggested tags:
+CmpNP/All   = all positions, _default_, this tag does not have to be written
+CmpNP/First = first-only or alone, in PLX format it means non-last
+CmpNP/Pref  = only first, never alone
+CmpNP/Last  = only last or alone
+CmpNP/Suff  = only last, never alone
+CmpNP/Only  = stems that only appear in compounds, not in isolation
+CmpNP/None  = can not make compounds
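The suggested tags can be read as a table of allowed positions. The following is a minimal sketch of that reading, not part of the analyzer: the position names "first", "last" and "alone" and the helper functions are hypothetical, and the middle position is deliberately left out because the text treats it as an open question.

```python
# Positions a stem may occupy, keyed by positional tag (one reading of the list above).
# "alone" means the stem is used as an independent word, outside any compound.
ALLOWED_POSITIONS = {
    "+CmpNP/All": {"first", "last", "alone"},
    "+CmpNP/First": {"first", "alone"},
    "+CmpNP/Pref": {"first"},
    "+CmpNP/Last": {"last", "alone"},
    "+CmpNP/Suff": {"last"},
    "+CmpNP/Only": {"first", "last"},
    "+CmpNP/None": {"alone"},
}

# The default tag is never written, so a missing tag means +CmpNP/All.
DEFAULT_TAG = "+CmpNP/All"


def allowed_positions(tag=None):
    """Return the set of positions allowed for a stem with the given tag."""
    return ALLOWED_POSITIONS[tag or DEFAULT_TAG]


def may_occupy(position, tag=None):
    """True if a stem with this tag may appear in the given position."""
    return position in allowed_positions(tag)
```

For example, `may_occupy("alone", "+CmpNP/Pref")` is false, while an untagged stem is accepted in every modelled position.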
Questions
- Do we need a tag +CmpNP/Middle?
Claim: there are no such words in
Short words in compounds are problematic because they can lead to spurious
Investigation of the middle part of 3-part compounds
1. analyze the corpus with the non-circular analyzer (will, among other things, leave
gt$ ccat -l sme -r /usr/local/share/corp/bound/sme/news/ | \
    preprocess --abbr=sme/bin/abbr.txt --corr=sme/src/typos.txt | \
    lookup -flags mbTT -utf8 sme/bin/nonrec-sme.fst > nonrec-corp.txt &
2. grep all non-recognised words, analyze them with the normal, circular,
gt$ grep '\?' nonrec-corp.txt | cut -f1 | \
    lookup -flags mbTT -utf8 sme/bin/sme.fst | \
    egrep '#.{1,3}\+.*#' > 3-part-shortcomp-descr.txt
3. analyze these compounds with the normative (and circular) analyzer:
gt$ cut -f1 3-part-shortcomp-descr.txt | sort -u | \
    lookup -flags mbTT -utf8 sme/bin/sme-norm.fst | \
    egrep '#.{1,3}\+.*#' > 3-part-shortcomp-norm.txt
4. check the result for real spelling errors vs true compounds. This will
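The egrep filter used in steps 2 and 3 keeps analyses in which a compound boundary `#` is followed by a part of one to three characters, then a tag, then a further `#`. A small Python sketch of the same test may make the pattern easier to read. The sample analysis strings are schematic placeholders, not real analyzer output, and the tag names in them are assumptions.

```python
import re

# Same pattern as in the egrep calls: '#', 1-3 characters, '+', anything, another '#'.
SHORT_MIDDLE = re.compile(r"#.{1,3}\+.*#")


def has_short_middle_part(analysis):
    """True if the analysis contains a short (1-3 character) non-final part after a '#'."""
    return SHORT_MIDDLE.search(analysis) is not None


# Schematic analyses (hypothetical lemmas and tags, for illustration only).
three_part_short = "first+N+Cmp#ab+N+Cmp#last+N+Sg+Nom"
three_part_long = "first+N+Cmp#middle+N+Cmp#last+N+Sg+Nom"
two_part = "first+N+Cmp#ab+N+Sg+Nom"

print(has_short_middle_part(three_part_short))  # True
print(has_short_middle_part(three_part_long))   # False
print(has_short_middle_part(two_part))          # False
```

Note that the pattern requires a `#` before the short part, so a short first part is not caught by this filter.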
The following list of short words was identified:
Saami stems:
  ađa - ok as last
  al - SUB, ok
  sáh - SUB, ok
  váh - SUB, ok
  vár - SUB, ok
  aji
Loan words:
  cup - cannot see that this one creates noise?
  duo - +CmpN/last for this one? popduo, trombonduo, duo-? cf. duomuge, duogáša, which disappear with duo+CmpN/Last
  kro - cannot see that this one creates noise? Maybe
  pop - cannot see that this one creates noise?
  rap - cannot see that this one creates noise?
Names:
  Alm Eng New Vu -Vu
The names are clearly an error in our normative analyser, they should not be
The following mid-parts are now SUB marked, and won't cause problems for the spellers:
joh sis gas
TODO:
- enforce hyphen on both sides of names when making compounds (at least in
Conclusion
With a few lexical adjustments, and corrections in the analyzer, there are no
The positional tags will be added to lexical entries where needed. They will be
Compound-stem tags
Suggested tags (these are all the same as presently used in the analyzer):
+CmpN/SgCmp  (=sealg-) in principle kultur- kultuvra
+CmpN/SgNCmp (alternation between full and reduced final vowel is coded in the lexicon / twolc)
+CmpN/SgGCmp (alternation between full and reduced final vowel is free)
+CmpN/PlGCmp
The variation between full and reduced stem vowel (such as sápmi vs
In addition, it is useful to have a shortcut tag for all variants applied to one
+CmpN/AllCmp (= all four above)
The tags should be entered as comments after each lexical entry that needs it,
How should we specify deviations from the default? Example:
jávri:jáv'ri GOAHTI ; !+CmpN/SgGCmp +CmpN/SgNCmp jávre- (nom) jávrre- (gen)
There are two alternatives. Either to specify all wanted possibilities, or to
1) all variants:
   mánná GOAHTI ; !+CmpN/SgGCmp +CmpN/SgNCmp +CmpN/PlGCmp
2) only additions:
   mánná GOAHTI ; !+CmpN/SgGCmp
   mánná GOAHTI ; !+CmpN/SgGCmp -SgNCmp (see comment below about negation)
The second alternative implies that we need negation, as we need to be able to
We then end up with the following possible tag combinations:
mánná GOAHTI ; !+CmpN/AllCmp (= +CmpN/SgGCmp +CmpN/SgNCmp +CmpN/PlGCmp +CmpN/SgCmp) mánná GOAHTI ; !+CmpN/SgGCmp +CmpN/SgNCmp +CmpN/PlGCmp mánná GOAHTI ; !+CmpN/SgGCmp +CmpN/SgNCmp mánná GOAHTI ; !+CmpN/SgGCmp +CmpN/PlGCmp mánná GOAHTI ; !+CmpN/PlGCmp +CmpN/SgNCmp mánná GOAHTI ; !+CmpN/PlGCmp +CmpN/SgGCmp mánná GOAHTI ; !+CmpN/SgGCmp mánná GOAHTI ; !+CmpN/PlGCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/SgGCmp +CmpN/SgNCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/SgGCmp +CmpN/PlGCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/PlGCmp +CmpN/SgNCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/PlGCmp +CmpN/SgGCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/SgGCmp mánná GOAHTI ; !+CmpN/SgCmp +CmpN/PlGCmp mánná GOAHTI ; !+CmpN/SgCmp ---- mánná GOAHTI ; !+CmpN/SgNCmp <==== Default
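To illustrate how such entries could be read by a script, here is a minimal sketch that collects the compound-stem tags written in the comment of a lexical entry and expands the +CmpN/AllCmp shortcut. It assumes the comment starts at the first `!` on the line. The function name is hypothetical, defaults are not applied, and the negation notation (such as -SgNCmp) is not handled.

```python
import re

ALL_CMP = "+CmpN/AllCmp"
# The four compound-stem tags that +CmpN/AllCmp stands for.
STEM_TAGS = ("+CmpN/SgNCmp", "+CmpN/SgGCmp", "+CmpN/PlGCmp", "+CmpN/SgCmp")
TAG_PATTERN = re.compile(r"\+CmpN/[A-Za-z]+")


def written_stem_tags(entry):
    """Return the compound-stem tags written in the comment of a lexical entry."""
    # Everything after the first '!' is treated as the comment (an assumption).
    _, _, comment = entry.partition("!")
    found = TAG_PATTERN.findall(comment)
    # Keep only the stem tags and the shortcut; other +CmpN/ tags are ignored here.
    tags = {tag for tag in found if tag in STEM_TAGS}
    if ALL_CMP in found:
        tags.update(STEM_TAGS)
    return tags


print(sorted(written_stem_tags("mánná GOAHTI ; !+CmpN/SgGCmp +CmpN/PlGCmp")))
```

An entry with no tag comment yields an empty set, which the caller would then interpret as "use the default".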
To ease the work with marking up the lexicon, we should make an
Search ; Replace with:

EITHER:
ENTER = ;
G = ; ! +CmpN/SgGCmp
P = ; ! +CmpN/PlGCmp
B = ; ! +CmpN/SgGCmp +CmpN/PlGCmp
C = ; ! +CmpN/SgGCmp +CmpN/PlGCmp +CmpN/SgNCmp
A = ; ! +CmpN/AllCmp
... for all combinations

OR: One or more of: NGPSA
N = ; ! +CmpN/SgNCmp
G = ; ! +CmpN/SgGCmp
P = ; ! +CmpN/PlGCmp
S = ; ! +CmpN/SgCmp
A = ; ! +CmpN/AllCmp
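The second alternative (one or more of the keys NGPSA) could be sketched as follows. This is only an illustration of the key-to-tag mapping given above; the function name and the handling of repeated or lower-case keys are assumptions, and no actual editor integration is implied.

```python
# Key-to-tag mapping from the "one or more of NGPSA" alternative.
KEY_TO_TAG = {
    "N": "+CmpN/SgNCmp",
    "G": "+CmpN/SgGCmp",
    "P": "+CmpN/PlGCmp",
    "S": "+CmpN/SgCmp",
    "A": "+CmpN/AllCmp",
}


def replacement_for(keys):
    """Build the text that replaces ';' for a sequence of shortcut keys."""
    if not keys:
        # No key pressed (plain ENTER): the entry keeps its bare ';'.
        return ";"
    tags = []
    for key in keys.upper():
        tag = KEY_TO_TAG[key]  # raises KeyError for an unknown key
        if tag not in tags:    # ignore a key typed twice
            tags.append(tag)
    return "; ! " + " ".join(tags)


print(replacement_for("GP"))  # ; ! +CmpN/SgGCmp +CmpN/PlGCmp
```

Compared with the first alternative, this needs only five keys instead of one key per combination, at the cost of typing several keys for some entries.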
Defaults
The following was decided as defaults for compound stems:
- +CmpN/SgNCmp
- +CmpN/SgGCmp (using the PLX class Ga)
- +CmpN/PlGCmp (using the PLX class Gp)
The defaults are never written. This is important, to be able to override the
Please note that the +CmpN/PlGCmp form is always identical to the regular
Conclusion
The compound stem tags are the ones suggested above, and will be added to
Tags for the required form of the left-part
Some nouns require the preceding part of a compound to be in Genitive case,
There are cases where the left part of a compound overrides the specifications
Suggested tags:
+CmpN/SgNomLeft (default, usually not written =0 PLX class N)
+CmpN/SgGenLeft (implies +CmpN/SgNomLeft = PLX class Na)
+CmpN/PlGenLeft (excludes the other alternatives unless explicitly overridden = PLX class Np)
Thus, by default all compound forms of a word is
Conflicts between specified compound form and required left-part form
There are cases where a word as the left part of a compound uses other
Left-part-tag   <=> Right-part-tag
when used as     |  when used as
the left part    |  the right part
----------------+----------------
+CmpN/SgNCmp    <=> +CmpN/SgNomLeft
+CmpN/SgGCmp    <=> +CmpN/SgGenLeft
+CmpN/PlGCmp    <=> +CmpN/PlGenLeft
Or to put it in other words: the default is to let the last part govern.
One can let the first (i.e. left) part govern by explicitly adding left-part compounding tags to the lexical entry. An example:
nuorra NUORRA; +CmpN/AllCmp [=left-part tag] +CmpN/SgGenLeft [=right-part tag]
This will let nuorra form compounds with other words in all forms (as in
Summary
The tags for the left/first part of a compound can then be split in two groups:
Implicit defaults, never specified:
+CmpN/SgNCmp
+CmpN/SgGCmp (default as PLX Ga, can only be combined with words requiring GenSg)
+CmpN/PlGCmp (default as PLX Gp, can only be combined with words requiring GenPl)

Explicit overrides:
+CmpN/SgGCmp - can combine with both words requiring GenSg and other words
+CmpN/PlGCmp - can combine with both words requiring GenPl and other words
Second part:
+CmpN/SgNomLeft (default, implicit)
+CmpN/SgGenLeft (Ga, requires GenSg as first/left part)
+CmpN/PlGenLeft (Gp, requires GenPl as first/left part)
Thus, explicit tags for the compound-as-first-part form override the default compounding behaviour.
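The interplay between the two groups of tags can be sketched as a small compatibility check. This is a simplified model under stated assumptions, not the analyzer's actual logic: the function name and the `explicit` flag are hypothetical, and "other words" is read here as "any right part", which the text does not state outright.

```python
# Which left-part requirement each compound-stem form satisfies by default.
MATCHING_REQUIREMENT = {
    "+CmpN/SgNCmp": "+CmpN/SgNomLeft",
    "+CmpN/SgGCmp": "+CmpN/SgGenLeft",
    "+CmpN/PlGCmp": "+CmpN/PlGenLeft",
}

# The requirement of a right part that carries no left-part tag.
DEFAULT_REQUIREMENT = "+CmpN/SgNomLeft"


def can_combine(left_form, right_requirement=DEFAULT_REQUIREMENT, explicit=False):
    """Check whether a left part in the given stem form may precede a right part.

    explicit=True means the stem-form tag is written on the left part's
    lexical entry, so the left part governs instead of the right part.
    """
    if MATCHING_REQUIREMENT[left_form] == right_requirement:
        # Default behaviour: the last part governs and the forms agree.
        return True
    # Assumption: an explicitly tagged left part combines with any right part.
    return explicit


print(can_combine("+CmpN/SgGCmp"))                 # False: right part wants nominative
print(can_combine("+CmpN/SgGCmp", explicit=True))  # True: left part governs
```

The sketch does not model the note that +CmpN/SgGenLeft implies +CmpN/SgNomLeft, nor the +CmpN/DefCmp tag.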
Overriding overrides
In the following example we need open compounding in GenSg, but default
nuorra NUORRA ; +CmpN/SgGCmp +CmpN/DefCmp
We can resolve that by adding a special tag +CmpN/DefCmp that enforces default
The following tags are presently NOT used, but might be put to use if
+CmpN/DefSgGCmp
+CmpN/DefPlGCmp
These tags would give default compounding behaviour for the specific cases.
What kind of words get compound-tags?
Adjectives denoting:
- People
Nouns denoting:
- Living creatures, people, animals etc
- Growths
- Organisations (like Gielda, Guovlu, Riika, Goahti, Dállu etc)
- Topography (like Johka, Mearra, but not Várri, Jávri)
- People-groups (like Sápmi, Duiska etc)
- Weather and state of the ground etc (like Dálki, Siivu, Čáhci, Dulvi, Muohta etc)
- Time (Áigi, Idja, Beaivi etc)
- and nouns on -vuohta (like ráhkisvuohta)
- plural nouns
What kind of words get +CmpN/Left compound-tags?
- A very few specific words where the meaning of the compound changes with the case of the first part (for example Ahki, Dilli, Heahti, Duohki, Vuolli, Geahči).
- In North Sami: Deverbal nouns like actios and actors stemming from transitive verbs (for example Sálbmalávlun vs. Sálmmalávlun, Sarvabivdi vs. Sarvvabivdi)