root-morphology
Faroese morphological analyser
Definitions for Multichar_Symbols
Tags for POS
- +N +V +A +Adv +Prop +Num : Open POS's
- +CC +CS +Interj +Pr +Pron +IM : Closed POS's
- +Pers +Det +Refl +Recipr +Poss +Dem : Pron types
- +Nom +Acc +Gen +Dat : Case
- +Msc +Fem +Neu : Gender
- +Sg +Pl : Number
- +Def +Indef : Definiteness
- +Cmp +Superl : Comparison
- +Prs +Prt : Tense
- +1Sg : Person-Number
- +2Sg : Person-Number
- +3Sg : Person-Number
- +Inf +PrfPrc +PrsPrc +Sup +Imp +Sbj : Verb forms
- +Cmpnd : Compound
- +Abbr +ACR : Abbreviations, acronyms
- +CLB +PUNCT +LEFT +RIGHT : Punctuation, parentheses
- +Symbol : independent symbols in the text stream, like £, €, ©
- +CLBfinal : Sentence-final abbreviated expression ending in a full stop, so that the full stop is ambiguous
- +Sg3 : This is inherited from common files, should be changed to +3Sg.
- +ABBR sub-pos
- +Arab sub-pos
- +Attr sub-pos
- +Coll sub-pos
- +Com Saami case, to be removed
- +Dyn Saami case, to be removed
- +Ela Saami case, to be removed
- +Ess Saami case, to be removed
- +Ill Saami case, to be removed
- +Ine Saami case, to be removed
- +MWE multiword expression
- +Pos check these XXX
- +Rom check these XXX
- +Der/heit Derivation with -heit
- +Ind +Pass +Interr +Ord
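As a rough illustration of how the tags above appear in an analysis string, the following Python sketch splits a `lemma+Tag+Tag...` string into its lemma and tag symbols. The function name `split_analysis` and the example string (including its tag order) are assumptions made for illustration; they are not taken from the analyser itself.

```python
def split_analysis(analysis: str) -> tuple[str, list[str]]:
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+Tag' symbols."""
    lemma, *tags = analysis.split("+")
    # Restore the leading '+' so each tag matches its multichar symbol form.
    return lemma, ["+" + tag for tag in tags]


# Hypothetical analysis string; the tag order is an assumption.
lemma, tags = split_analysis("hestur+N+Msc+Sg+Nom+Indef")
print(lemma, tags)
```

Tags containing a slash, such as `+Sem/Year`, pass through unchanged because only `+` is treated as a separator.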
Semantic tags
- +Sem/Sur
- +Sem/Mal
- +Sem/Fem
- +Sem/Plc
- +Sem/Org
- +Sem/Veh
- +Sem/Year - year (i.e. 1000 - 2999), used only for numerals
- +Sem/Amount
- +Sem/Build
- +Sem/Build-room
- +Sem/Cat
- +Sem/Curr
- +Sem/Date
- +Sem/Domain
- +Sem/Domain_Hum
- +Sem/Dummytag
- +Sem/Edu_Hum
- +Sem/Event
- +Sem/Food-med
- +Sem/Group_Hum
- +Sem/Hum
- +Sem/ID
- +Sem/Lang
- +Sem/Mat
- +Sem/Measr
- +Sem/Money
- +Sem/Obj
- +Sem/Obj-el
- +Sem/Obj-ling
- +Sem/Org_Prod-audio
- +Sem/Org_Prod-vis
- +Sem/Part
- +Sem/Prod-vis
- +Sem/Route
- +Sem/Rule
- +Sem/Sign
- +Sem/State
- +Sem/State-sick
- +Sem/Substnc
- +Sem/Time
- +Sem/Time-clock
- +Sem/Tool-it
- +Sem/Txt
Non-changing letters
- a2 This is for a special a Umlaut case
- g2 i2 j2 t2 v2
- +v1 +v2 : different paradigms
Triggers for Morphophonology
- %^UUML %^IUML %^eIUML %^ØUML : Umlaut types
- %^W %^JI : Consonant changes
- %^EPH %^OEA : Epenthesis
- %^GDEL %^GGDEL %^GVDEL %^VDEL %^JDEL %^RDEL : Consonant deletion triggers
- %^EIO %^OA %^WVV %^EDH %^VSH : TODO
- %^AB1 %^AB2 %^AB3 %^AB4 %^AB5 %^AB6 %^AB7 : Ablaut series
- %^aAB %^uAB : More Ablaut
- %^NGKK : NG to KK
- %^PASS : TODO
- %> : Suffix boundary
- +v1 - Paradigm identifier (e.g. gera+v1 = ger)
- +v2 - Paradigm identifier (e.g. gera+v2 = gerar)
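Declaring a string as a multichar symbol means it is treated as one unit rather than as a sequence of letters. The sketch below imitates that with a greedy longest-match tokeniser in Python. It is a simplified model, not the lexc compiler's actual code; the symbol set and the lower-side string `barn^UUML>um` are hypothetical examples. Note that `%` is the lexc escape character, so `%^UUML` declares the symbol `^UUML` and `%>` declares `>`.

```python
def tokenize(text: str, symbols: set[str]) -> list[str]:
    """Split text into symbols, preferring the longest declared multichar symbol."""
    ordered = sorted(symbols, key=len, reverse=True)  # try longest symbols first
    tokens = []
    i = 0
    while i < len(text):
        for sym in ordered:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multichar symbol starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical subset of the declared symbols, written without the % escape.
SYMBOLS = {"^UUML", "^IUML", "^eIUML", ">", "a2"}
print(tokenize("barn^UUML>um", SYMBOLS))
```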
Language tags
- +OLang/ENG
- +OLang/FIN
- +OLang/NNO
- +OLang/NOB
- +OLang/RUS
- +OLang/SMA
- +OLang/SME
- +OLang/SWE
- +OLang/UND
Non-ascii letters, perhaps needed as multichar symbols
- æ ø å
- á é í ó ú ý Á É Í Ó Ý
- ä ö ü Ä Ö Ü
Compounding tags
The tags are of the following form:
- +CmpNP/xxx - Normative (N), Position (P), i.e. the tag describes what
- +CmpN/xxx - Normative (N) form, i.e. the tag describes what
- +Cmp/xxx - Descriptive compounding tags, i.e. tags that describe
This entry / word should be in the following position(s):
- +CmpNP/All - ... in all positions, default, this tag does not have to be written
- +CmpNP/First - ... only be first part in a compound or alone
- +CmpNP/Pref - ... only first part in a compound, NEVER alone
- +CmpNP/Last - ... only be last part in a compound or alone
- +CmpNP/Suff - ... only last part in a compound, NEVER alone
- +CmpNP/None - ... does not take part in compounds
- +CmpNP/Only - ... only be part of a compound, i.e. can never
Usage tags
- +Use/Disamb = Use only in disambiguator/tokeniser analyser
- +Use/Circ = for compound restrictions
- +Use/-PMatch
- +Use/-Spell
- +Use/NG
- +Use/NGA
- +Use/SpellNoSugg
- +Err/Guess : Tag for Name Guesser component
- +Err/Orth : Marking forms that are orthographical errors
Symbols that need to be escaped on the lower side (towards twolc):
- »7 : Literal »
- «7 : Literal «
- %[%>%] : Literal >
- %[%<%] : Literal <
Flag diacritics
We have manually optimised the structure of our lexicon using the following flag diacritics:
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
Flags for speller suggestions
@D.ErrOrth.ON@ |
@C.ErrOrth@ |
@P.ErrOrth.ON@ |
Flag for case harmony in compounds
Set flag for compounds
@P.Case.MscNom@ | fyrstiflokkur |
@P.Case.MscObl@ | fyrstaflokk |
@P.Case.FemNom@ | lítlasystir |
@P.Case.FemObl@ | lítlusystur |
@P.Case.Neu@ | breiðaskarð |
@P.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
@R.Case.MscNom@ | fyrstiflokkur |
@R.Case.MscObl@ | fyrstaflokk |
@R.Case.FemNom@ | lítlasystir |
@R.Case.FemObl@ | lítlusystur |
@R.Case.Neu@ | breiðaskarð |
@R.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
@U.Case.MscNom@ | fyrstiflokkur |
@U.Case.MscObl@ | fyrstaflokk |
@U.Case.FemNom@ | lítlasystir |
@U.Case.FemObl@ | lítlusystur |
@U.Case.Neu@ | breiðaskarð |
@P.Pmatch.Loc@ | Location in string used or parsed by hfst-pmatch |
@P.Pmatch.Backtrack@ | Also for hfst-pmatch |
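The flags listed so far use five operators: P (set), R (require), D (disallow), C (clear) and U (unify). The Python sketch below models their standard semantics on a single path, to show how a `@P.Case...@` flag on the first part and an `@R.Case...@` flag on the second enforce case harmony. Where exactly the lexicon places these flags is an assumption made for this example, and `accepts` is a hypothetical helper, not part of any HFST tool.

```python
import re

# Matches @OP.Feature@ and @OP.Feature.Value@ for the operators P, R, D, C, U.
FLAG = re.compile(r"@([PRDCU])\.([^.@]+)(?:\.([^.@]+))?@")


def accepts(path: str) -> bool:
    """Return True if the flag diacritics along `path` are mutually consistent."""
    state: dict[str, str] = {}
    for op, feature, value in FLAG.findall(path):
        current = state.get(feature)
        if op == "P":                      # positive set
            state[feature] = value
        elif op == "C":                    # clear
            state.pop(feature, None)
        elif op == "R":                    # require (a value, or any value)
            if current is None or (value and current != value):
                return False
        elif op == "D":                    # disallow (a value, or any value)
            if current is not None and (not value or current == value):
                return False
        elif op == "U":                    # unify: set if unset, else must match
            if current is None:
                state[feature] = value
            elif current != value:
                return False
    return True


# Hypothetical paths: the flag placement is assumed for illustration.
print(accepts("@P.Case.MscNom@fyrsti@R.Case.MscNom@flokkur"))  # harmony holds
print(accepts("@P.Case.MscNom@fyrsti@R.Case.MscObl@flokk"))    # case mismatch
```

The same function covers the NeedNoun and ErrOrth flags: a `@D.NeedNoun.ON@` fails after `@P.NeedNoun.ON@` unless a `@C.NeedNoun@` intervenes.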
Flags for compound restriction
For languages that allow compounding, the following flag diacritics are needed
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
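Several rows in this table describe blocking a word from reaching ENDLEX. The ENDLEX entry in this file is guarded by three disallow flags (`@D.CmpOnly.FALSE@`, `@D.CmpPref.TRUE@`, `@D.NeedNoun.ON@`), and the Python sketch below models that gate as a check on the current flag state. Representing the state as a dictionary, and the names `ENDLEX_BLOCKS` and `may_enter_endlex`, are assumptions for illustration; how each word sets its flags earlier in the path is not shown here.

```python
# Flag values that the three @D...@ diacritics on the ENDLEX entry reject.
ENDLEX_BLOCKS = {"CmpOnly": "FALSE", "CmpPref": "TRUE", "NeedNoun": "ON"}


def may_enter_endlex(flags: dict[str, str]) -> bool:
    """Return True if no current flag value is disallowed at ENDLEX.

    An unset flag never triggers a D (disallow) check.
    """
    return all(flags.get(name) != bad for name, bad in ENDLEX_BLOCKS.items())


print(may_enter_endlex({}))                    # no flags set: allowed
print(may_enter_endlex({"CmpOnly": "FALSE"}))  # compound-only word from root: blocked
print(may_enter_endlex({"CmpOnly": "TRUE"}))   # flag reset after compounding: allowed
```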
Use the following flag diacritics to control downcasing of derived proper
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Lexicon Root
- Nouns ;
- Shortnouns ; 1- and 2-letter nouns excluded from compounding
- Propernouns ;
- Adjectives ;
- Verbs ;
- Adverb ;
- Conjunction ;
- Subjunction ;
- Interjection ;
- Numeral ;
- Determiner ;
- Pronoun ;
- Preposition ;
- Punctuation ;
- Symbols ;
- Abbreviation ;
- Acronyms ;
Lexicon Acronyms is split in two:
- Acronym-fao ; for fao acronyms
- Acronym-smi ; for language-independent acronyms
Lexicon ENDLEX
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
The @D.CmpOnly.FALSE@ flag diacritic is used to disallow words tagged