Tags and root lexicon for Komi
Analysis symbols
The parts-of-speech tags are:
- +A
- adjective кывберд прилагательное
- +Adp
- adposition (preposition, postposition)
- +Adv
- adverb урчитан наречие
- +CS
- subordinating conjunction XX подчинительный союз
- +CC
- coordinating conjunction XX сочинительный союз
- conjunction word XX союзное слово (to be determined: which of the two tags above applies here)
- +Det
- determiner XX XX
- +Interj
- interjection междометтьӧ междометие
- +N
- noun эмакыв - существительное
- +Num
- numeral лыдакыв числительное
- +Pcle
- particle кывтор частица
- +Po
- postposition кывбӧр послелог
- +Pr
- preposition XX предлог
- +Pron
- pronoun нимвежтас местоимение
- +Qnt
- Quantifier ХХ XX
- +V
- verb кадакыв глагол
The parts of speech are further split up into:
+Adv-Ideoph These are ideophonic descriptors used to modify the verb
+AdA Degree
+Manner with reference to type of adverb
+Spat spatial
+Temp temporal
- +Parenthetic parenthetical phrase
+Prop proper
+CollN collective nouns; used with paired nouns
- +Relat relational noun: выв, ув
- +Dem
- demonstrative
- +Indef
- indefinite
- +Interr
- interrogative
- +Pers
- personal
- +Recipr
- reciprocal
- +Refl
- reflexive
- +Rel
- relative
- +Poss
- possessive
Quantifiers (numerals)
- +Num
- numeral лыдакыв
- +Appr
- Approximative numeral кавто-колмо, колмошка two or three
- +AssocColl
- -ne- ; avide-
- +Assoc
- +мезть
- +Card
- cardinal + NCard
- +Coll
- collective
- +Distr
- Distributive
- +Iter
- Iterative form expressing number of consecutive times; kpv: кыкысь
- +Mult
- Multiplicative adverbs, number of times; kpv: кык пӧв
- +Ord
- ordinal + NOrd
- +Coord
- Coordinates, i.e. 65˚36′8,30″ in numerals.lexc
Nominals are inflected for Number and Case
+Sg singular
- +Pl plural
The following cases are distinguished in Komi:
+Acc accusative ZERO керан
+Acc1 accusative -ӧс керан
+Acc3 accusative -сӧ керан
+Abl ablative case -лысь босьтан
+Apr approximative -лань матыстчан
+AprEgr approximative egressive -ланьсянь матысь ылыстчан
+AprEla approximative elative -ланьысь матысь петан
+AprIll approximative illative -ланьӧ матӧ матыстчан
+AprIne approximative inessive -ланьын матыс ина
+AprPrl approximative prolative -ланьӧд маті вуджан
+AprTer approximative terminative -ланьӧдз матіӧдз воан
+AprTra approximative translative -ланьті маті вуджан
+Car caritive -тӧг торйӧдан
+Cns consecutive -ла могман
+Com Comitative -кӧд ӧтвывтан
+Cmpr Comparative case form -ся ӧткодялан
+Cmpl Postposition complement
+Dat dative case -лы сетан
+Egr egressive -сянь ылыстчан
+Ela elative -ысь петан
+Gen genitive case -лӧн асалан
+Ill illative -ӧ пыран
+Ine inessive -ын ина
+Ins instrumental -ӧн керанторъя
+Nom nominative case нимтан
+Prl prolative -ӧд вуджан
+Tra translative -ті вуджан
+Ter Terminative -ӧдз матыстчан
+Voc Vocative ??
- +Abs Absolute = +Sg+Nom
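As a rough illustration of how these tags combine, the sketch below splits an analysis string into its lemma and tags, then expands `+Abs` into `+Sg+Nom` as stated in the list above. The function names and the sample string `керка+N+Abs` are assumptions made for illustration and are not taken from the analyser's actual output. The split also assumes that the lemma itself contains no `+`.

```python
def split_analysis(analysis: str) -> tuple[str, list[str]]:
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+Tag' strings.

    Assumes the lemma itself contains no '+' character.
    """
    lemma, *tags = analysis.split("+")
    return lemma, ["+" + t for t in tags]


def expand_abs(tags: list[str]) -> list[str]:
    """Replace +Abs by +Sg +Nom; leave every other tag untouched."""
    out = []
    for tag in tags:
        if tag == "+Abs":
            out.extend(["+Sg", "+Nom"])
        else:
            out.append(tag)
    return out


# Hypothetical analysis string, used only to show the call pattern.
lemma, tags = split_analysis("керка+N+Abs")
print(lemma, expand_abs(tags))
```

Keeping the expansion separate from the split means a consumer can decide whether it wants the compact `+Abs` or the explicit number and case tags.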
Possession is marked as follows:
+PxSg1 +PxSg2 +PxSg3 +PxPl1 +PxPl2 +PxPl3
- +Px1 +Px2 +Px3
The comparative forms are:
+Comp +Superl
+Attr +Card
+Coll Collective
+Distr Distributive
- +Iter Iterative form expressing number of times
Verb moods are:
Other verb forms are
+VAbess тӧм Participle
+VCar тӧг Gerund
- +VTer тӧдз Gerund
- +Symbol = independent symbols in the text stream, like £, €, ©
- +IV
Special multiword units are analysed with:
- +Guess
Question and Focus particles:
+Clt/И This comes at the end of a word -и or after vowels (some authors use -й)
- +Clt/сӧ
Tags distinguishing different versions of the same lemma (before POS)
- +v1
- +v2
- +v3
- +v4
- +v5
- +v6
- +v7
- +v8
- +v9
- +v10
- +v11
- +v12
- +v13
- +v14
- +v15
- +v16
- +v17
- +v18
- +v19
- +v20
- +v21
- +v22
- +v23
- +v24
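Since the version tags all follow one pattern, a tool that consumes analyses can probably match them with a single expression. The sketch below is one way to do that; the helper name `split_version` and the example strings are hypothetical, and the regular expression only encodes the range `+v1` to `+v24` listed above.

```python
import re
from typing import Optional

# Matches +v1 ... +v24 only; the lookahead keeps +v25 or +v100 from matching.
VERSION_TAG = re.compile(r"\+v([1-9]|1[0-9]|2[0-4])(?=\+|$)")


def split_version(analysis: str) -> tuple[str, Optional[int]]:
    """Return the analysis without its version tag, plus the version number.

    The number is None when the analysis carries no version tag.
    """
    match = VERSION_TAG.search(analysis)
    if match is None:
        return analysis, None
    stripped = analysis[:match.start()] + analysis[match.end():]
    return stripped, int(match.group(1))


# Hypothetical analysis string with a version tag before the POS tag.
print(split_version("пу+v2+N+Sg+Nom"))
```

The match is case-sensitive, so the verb tag `+V` is never mistaken for a version tag.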
Usage restrictions are marked with the following tags:
+Err/Dial e.g. тэг instead of тӧг
+Err/Lex substandard, not in normative fst, no normative lemma помсьыны
- +Use/SpellNoSugg
+Use/PMatch means that the following is only used in the analyser feeding the disambiguator
- +Use/-PMatch Do not include in fst's made for hfst-pmatch
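One plausible use of these tags downstream is to filter analyses, for example to keep only normative readings. The following is a minimal sketch under that assumption; the helper names and the sample analyses are invented for illustration and do not describe how the build system itself applies the tags.

```python
def is_normative(analysis: str) -> bool:
    """True if the analysis carries no +Err/... tag."""
    return "+Err/" not in analysis


def use_tags(analysis: str) -> list[str]:
    """Collect the +Use/... tags of an analysis, in their original order."""
    return ["+" + t for t in analysis.split("+")[1:] if t.startswith("Use/")]


# Hypothetical analyses: one normative, one marked as dialectal.
analyses = ["керка+N+Sg+Nom", "керка+N+Err/Dial+Sg+Nom"]
print([a for a in analyses if is_normative(a)])
```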
Dialect features
Source tags, marking where a word comes from:
+Src/F foreign source apparently 2015-09-08
- +Dim diminutive
- +NonHum look at this and place somewhere
+Sem/Act Activity
+Sem/Amount Amount
+Sem/Ani Animate
+Sem/Aniprod Animal Product
+Sem/Body Bodypart
+Sem/Body-abstr siellu, vuoigŋa, jierbmi
+Sem/Build Building
+Sem/Build-part Part of Building, like the closet
+Sem/Cat Category
+Sem/Clth Clothes
+Sem/Clth-jewl Jewellery
+Sem/Clth-part part of clothes, boallu, sávdnji...
+Sem/Ctain Container
+Sem/Ctain-abstr Abstract container like bank account
+Sem/Curr Currency like dollár, Not Money
+Sem/Dance Dance
+Sem/Dir Direction like GPS-kursa
+Sem/Domain Domain like politics, reindeerherding (a system of actions)
+Sem/Drink Drink
+Sem/Dummytag Dummytag
+Sem/Edu Educational event
+Sem/Event Event
+Sem/Feat Feature, like Árvu
+Sem/Feat-phys Physiological feature, ivdni, fárda
+Sem/Feat-psych Psychological feature
+Sem/Feat-measr Measurable feature
+Sem/Fem Female name
+Sem/Food Food
+Sem/Food-med Medicine
+Sem/Furn Furniture
+Sem/Game Game
+Sem/Geom Geometrical object
+Sem/Group Animal or Human Group
+Sem/Hum Human
+Sem/Hum-abstr Human abstract
+Sem/Ideol Ideology
+Sem/Lang Language
+Sem/Mal Male name
+Sem/Mat Material for producing things
+Sem/Measr Measure
+Sem/Money Has to do with money, like wages, not Curr(ency)
+Sem/Obj Object
+Sem/Obj-clo Cloth
+Sem/Obj-cogn Cloth
+Sem/Obj-el (Electrical) machine or apparatus
+Sem/Obj-ling Object with something written on it
+Sem/Obj-rope flexible ropelike object
+Sem/Obj-surfc Surface object
+Sem/Org Organisation
+Sem/Part Feature, oassi, bealli
+Sem/Perc-cogn Cognitive perception
+Sem/Perc-emo Emotional perception
+Sem/Perc-phys Physical perception
+Sem/Perc-psych Psychological perception
+Sem/Plant Plant
+Sem/Plant-part Plant part
+Sem/Plc Place
+Sem/Plc-abstr Abstract place
+Sem/Plc-elevate Place
+Sem/Plc-line Place
+Sem/Plc-water Place
+Sem/Pos Position (as in social position job)
+Sem/Process Process
+Sem/Prod Product
+Sem/Prod-audio Audio product
+Sem/Prod-cogn Cognition product
+Sem/Prod-ling Linguistic product
+Sem/Prod-vis Visual product
+Sem/Rel Relation
+Sem/Route Name of a Route
+Sem/Rule Rule or convention
+Sem/Semcon Semantic concept
+Sem/Sign Sign (e.g. numbers, punctuation)
+Sem/Sport Sport
+Sem/State-sick Illness
+Sem/Substnc Substance, like Air and Water
+Sem/Sur Surname
+Sem/Symbol Symbol
+Sem/Time Time
+Sem/Tool Prototypical tool for repairing things
+Sem/Tool-catch Tool used for catching (e.g. fish)
+Sem/Tool-clean Tool used for cleaning
+Sem/Tool-it Tool used in IT
+Sem/Tool-measr Tool used for measuring
+Sem/Tool-music Music instrument
+Sem/Tool-write Writing tool
+Sem/Txt Text (girji, lávlla...)
+Sem/Veh Vehicle
+Sem/Wpn Weapon
+Sem/Wthr The Weather or the state of ground
- +Sem/Year
+Sem/Sur_Fem Surname female
+Sem/Sur_Mal Surname male
+Sem/Ant Anthroponym
+Sem/Ant_Fem Anthroponym female
+Sem/Ant_Mal Anthroponym male
+Sem/Patr Patronym
+Sem/Patr_Fem Patronym female
- +Sem/Patr_Mal Patronym male
+Sem/Event_Plc сёянін
- +Sem/Hum_Prof profession, capacity doctor, tractor driver
Semantics are classified with
Derivations are classified under the morphophonetic form of the suffix, the
+Der In front of every derivation to make it
+Der/Ан Process Participle +AN
+Der/Ана Process Participle +ANA
+Der/Анаа adverb derived from participle (+ANA) +ANAA
- +Der/чӧж +CHOZH
+Der/NomAct +Event
- +Duration
- +Der/иг
+Der/Ан Participle
+Der/Ана Gerund or participle according to context (with...)
- +PastPtc
+Der/кості +KOSTI
+Der/коста +KOSTA
- +Der/кежлӧ +KEZHLO
+Der/мысь +MYS
- +Der/мысьт +MYST
+MAbe abessive modifier -тӧм
+MLoc locative modifier са -
+MHab habeo modifier а -
- +MTmp temporal modifier ся -
+LocMod IneMod Быд во шедӧдӧны бур успеваемость Воркута да Инта каръясса, Прилузскӧй да Княжпогостскӧй районъясса школаяс.
+Der/тӧм used with nouns and followed by +AbeMod
+PrivMod AbeMod джуджыд анализъястӧм да обобщениеястӧм статьяяс.
+ProprietiveMod HabObjMod Весиг киясыс тӧдсаӧсь, найӧ мугов рӧмаӧсь, кузь чорыд чуньясаӧсь.
- +Der/TempMod TempMod Der/ся но и Ф. В. Плесовскийлысь квайтымынӧд вояссяяссӧ * позьӧ аддзыны сӧмын библиотекаясысь.
2012-09-11 Perhaps this is only syntactic
- +Der/N Noun derived with conversion from noun, conversion but not ZERO
- +Der/A Adjective derived from Noun or Verb
- +Der/Adv Adverb derived from Adjective
Tags for marking etymological origin. These were initially used with proper nouns
Sentence markers
- +Cop Copula кадакыв, коді шуӧ: вӧлі либӧ ӧнія кадся Связка
To represent phonologic variations in word forms we use the following
- {aä}
- Vowel alternating symbol
- {oö}
- Vowel alternating symbol
- {uü}
- Vowel alternating symbol
к2 л2 м2 т2 ь2 К2 Л2 М2 Т2 Ь2 И2
- %> suffix border
- %{иі%}
- for soft and hard
- %{ая%}
- for soft and hard
And following triggers to control variation
- {front}
- Vowel change triggers
- {back}
- Vowel change triggers
- %^Close Close syllable, this triggers final consonant drop, seen in
Valency tags, i.e. tags assigned to verbs to denote their arguments
- +%<acc%> accusative
- +%<ela%> elative -ысь
- +%<ins%> instrumental -ӧн
- +%<inf_ны%> infinitive in -ны
- +%<po_вылӧ%> postposition вылӧ
- +%<sub_мый%> subordinate clause in мый/that
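In lexc, `%` escapes the character that follows it, so `+%<acc%>` in the source stands for the tag `+<acc>` in the analysis. The sketch below shows that mapping in a few lines; `unescape_lexc` is a hypothetical helper written for this illustration, not part of any HFST or GiellaLT tool.

```python
def unescape_lexc(symbol: str) -> str:
    """Drop lexc '%' escapes: the character after each '%' is kept literally."""
    out = []
    chars = iter(symbol)
    for ch in chars:
        if ch == "%":
            # Take the escaped character as is; a trailing '%' yields nothing.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


for tag in ["+%<acc%>", "+%<inf_ны%>", "+%<po_вылӧ%>"]:
    print(tag, "->", unescape_lexc(tag))
```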
Symbols that need to be escaped on the lower side (towards twolc):
- »
- «
- > (written with square brackets, see the root.lexc file)
- < (written with square brackets, see the root.lexc file)
Flag diacritics
We have manually optimised the structure of our lexicon using the following
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
Two flags copied from sme
@P.Pmatch.Loc@ | Used on multi-token analyses; tell hfst-tokenise/pmatch where in the form/analysis the token should be split. |
@P.Pmatch.Backtrack@ | Used on single-token analyses; tell hfst-tokenise/pmatch to backtrack by reanalysing the substrings before and after this point in the form (to find combinations of shorter analyses that would otherwise be missed) |
For languages that allow compounding, the following flag diacritics are needed
handled automatically if combined with +CmpN/xxx tags. If not used, they will
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
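To make the effect of the flags in the tables above easier to follow, here is a simplified model of how flag diacritics are evaluated along a path. It assumes the usual Xerox/HFST semantics for the `P` (set), `D` (disallow), `C` (clear), `U` (unify) and `R` (require) operators and leaves out `N` (negative set) entirely. The function `path_allowed` is a hypothetical name and this is not the HFST implementation.

```python
import re

# @OP.FEATURE.VALUE@ or @OP.FEATURE@ (the value is optional for C, D and R).
FLAG = re.compile(r"@([PDCUR])\.([^.@]+)(?:\.([^.@]+))?@")


def path_allowed(path: str) -> bool:
    """Walk the flags in `path` left to right; return False once one fails."""
    state: dict[str, str] = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":        # positive set: always succeeds
            state[feature] = value
        elif op == "C":      # clear: forget the feature
            state.pop(feature, None)
        elif op == "D":      # disallow: fail if the feature has this value
            if current is not None and (value == "" or current == value):
                return False
        elif op == "R":      # require: fail unless the feature has this value
            if current is None or (value != "" and current != value):
                return False
        elif op == "U":      # unify: set if unset, else values must agree
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


print(path_allowed("@P.NeedNoun.ON@@D.NeedNoun.ON@"))
print(path_allowed("@P.NeedNoun.ON@@C.NeedNoun@@D.NeedNoun.ON@"))
```

Under this model a path that sets `NeedNoun` to `ON` is blocked by `@D.NeedNoun.ON@` unless `@C.NeedNoun@` clears the feature in between, which matches the description of the three `NeedNoun` flags.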
Use the following flag diacritics to control downcasing of derived proper
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
- +Cmp
+Cmp/Serial used with serial verbs |
Lexicon Root
The word forms in the Komi (Zyrian) language start from the lexeme roots of basic
Testing 2015-09-06
пу керка
Lexicon ENDLEX
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ # ;
The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged