root-morphology

Faroese morphological analyser

Definitions for Multichar_Symbols

Tags for POS

  • +N +V +A +Adv +Prop +Num - Open POS's
  • +CC +CS +Interj +Pr +Pron +IM - Closed POS's
  • +Pers +Det +Refl +Recipr +Poss +Dem - Pron types
  • +Nom +Acc +Gen +Dat - Case
  • +Msc +Fem +Neu - Gender
  • +Sg +Pl - Number
  • +Def +Indef - Definiteness
  • +Cmp +Superl - Comparison
  • +Prs +Prt - Tense
  • +1Sg +2Sg +3Sg - Person-Number
  • +Inf +PrfPrc +PrsPrc +Sup +Imp +Sbj - Verb forms
  • +Cmpnd - Compound
  • +Abbr +ACR - Abbreviations, acronyms
  • +CLB +PUNCT +LEFT +RIGHT - Punctuation, parentheses
  • +Symbol - independent symbols in the text stream, like £, €, ©
  • +Err/Guess - Tag for Name Guesser component
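Several of the tags above share prefixes (+Pr, +Prs, +PrsPrc, +Pron, +Prop), which is why they must be declared as multicharacter symbols and matched longest-first. The following is a minimal Python sketch of that longest-match tokenisation, not the actual LexC compiler; the function name `tokenize`, the reduced symbol list, and the sample string `hestur+N+Msc+Sg+Nom+Indef` (including its tag order) are assumptions made for illustration.

```python
# A subset of the multichar symbols listed above (illustrative, not complete).
MULTICHAR_SYMBOLS = [
    "+N", "+V", "+Pr", "+Pron", "+Prop", "+Prs", "+Prt",
    "+PrsPrc", "+PrfPrc", "+Msc", "+Sg", "+Nom", "+Indef",
]


def tokenize(text, symbols=MULTICHAR_SYMBOLS):
    """Split text into symbols, preferring the longest multichar symbol
    at each position and falling back to single characters."""
    # Try longer symbols first so that "+PrsPrc" wins over "+Prs" and "+Pr".
    ordered = sorted(symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in ordered:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No multichar symbol starts here: emit one ordinary character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("+PrsPrc"))
print(tokenize("hestur+N+Msc+Sg+Nom+Indef"))
```

With a shortest-first match, `+PrsPrc` would be split into `+Pr` plus stray letters, so the ordering step is the essential part of the sketch.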

Derivation with -heit

Semantic tags

  • +Sem/Year - year (i.e. 1000 - 2999), used only for numerals

Non-changing letters

  • +v1 +v2 - different paradigms

Triggers for Morphophonology

  • %^UUML %^IUML %^eIUML - Umlaut types
  • %^W %^JI - Cns changes
  • %^EPH %^OEA - Epenthesis
  • %^GDEL %^GGDEL %^GVDEL %^VDEL %^JDEL %^RDEL - Cns deletion triggers
  • %^EIO %^OA %^WVV %^EDH %^VSH - TODO
  • %^AB1 %^AB2 %^AB3 %^AB4 %^AB5 %^AB6 %^AB7 - Ablaut series
  • %^aAB %^uAB - More Ablaut
  • %^NGKK - NG to KK
  • %^PASS - todo
  • %> - Suffix boundary
  • +v1 - Paradigm identifier (e.g. gera+v1 = ger)
  • +v2 - Paradigm identifier (e.g. gera+v2 = gerar)

Non-ascii letters, perhaps needed as multichar symbols

æ ø å á é í ó ú ý Á É Í Ó Ý ä ö ü Ä Ö Ö

Compounding tags

The tags are of the following form:

  • +CmpNP/xxx - Normative (N), Position (P), i.e. the tag describes which position the tagged word can occupy in a compound
  • +CmpN/xxx - Normative (N) form, i.e. the tag describes which form the tagged word should use when making compounds
  • +Cmp/xxx - Descriptive compounding tags, i.e. tags that describe which form a word actually uses in a compound

This entry / word should be in the following position(s):

  • +CmpNP/All - ... in all positions, default, this tag does not have to be written
  • +CmpNP/First - ... only be first part in a compound or alone
  • +CmpNP/Pref - ... only first part in a compound, NEVER alone
  • +CmpNP/Last - ... only be last part in a compound or alone
  • +CmpNP/Suff - ... only last part in a compound, NEVER alone
  • +CmpNP/None - ... does not take part in compounds
  • +CmpNP/Only - ... only be part of a compound, i.e. can never be used alone, but can appear in any position
  • +Use/Disamb - use only in the disambiguator/tokeniser analyser
  • +Use/Circ - for compound restrictions
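The position restrictions listed above can be read as a small lookup table. The Python sketch below is one possible reading, not part of the analyser: the position names (`alone`, `first`, `middle`, `last`), the treatment of middle positions, and the function names are assumptions introduced here for illustration.

```python
# Positions a word may occupy, per +CmpNP/xxx tag (interpretation of the list above).
# "middle" is an assumed label for any non-initial, non-final compound part.
ALLOWED_POSITIONS = {
    "All": {"alone", "first", "middle", "last"},
    "First": {"alone", "first"},
    "Pref": {"first"},
    "Last": {"alone", "last"},
    "Suff": {"last"},
    "None": {"alone"},
    "Only": {"first", "middle", "last"},
}

PREFIX = "+CmpNP/"


def allowed_positions(tags):
    """Return the positions allowed by the first +CmpNP/xxx tag found.
    Words without such a tag default to +CmpNP/All."""
    for tag in tags:
        if tag.startswith(PREFIX):
            return ALLOWED_POSITIONS[tag[len(PREFIX):]]
    return ALLOWED_POSITIONS["All"]


def may_occur(tags, position):
    """True if a word carrying these tags may appear in the given position."""
    return position in allowed_positions(tags)


print(may_occur(["+N", "+CmpNP/Pref"], "alone"))  # False: never alone
print(may_occur(["+N", "+CmpNP/Pref"], "first"))  # True
print(may_occur(["+N"], "last"))                  # True: defaults to All
```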

Symbols that need to be escaped on the lower side (towards twolc):

  »7  - Literal »
  «7  - Literal «
  %[%>%]  - Literal >
  %[%<%]  - Literal <

Flag diacritics

We have manually optimised the structure of our lexicon using the following flag diacritics to restrict morphological combinatorics. Compounds with verbs are allowed only if the verb is further derived into a noun again:

@P.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@D.NeedNoun.ON@ (Dis)allow compounds with verbs unless nominalised
@C.NeedNoun@ (Dis)allow compounds with verbs unless nominalised

Set flag for compounds

@P.Case.MscNom@ fyrstiflokkur
@P.Case.MscObl@ fyrstaflokk
@P.Case.FemNom@ lítlasystir
@P.Case.FemObl@ lítluusystur
@P.Case.Neu@ breiðaskarð
@P.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds (require)

@R.Case.MscNom@ fyrstiflokkur
@R.Case.MscObl@ fyrstaflokk
@R.Case.FemNom@ lítlasystir
@R.Case.FemObl@ lítluusystur
@R.Case.Neu@ breiðaskarð
@R.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð

Control flag values for compounds (unify)

@U.Case.MscNom@ fyrstiflokkur
@U.Case.MscObl@ fyrstaflokk
@U.Case.FemNom@ lítlasystir
@U.Case.FemObl@ lítluusystur
@U.Case.Neu@ breiðaskarð
@U.Case.Pl@ fyrstuflokkar, lítlusystrar, breiðuskørð
@P.Pmatch.Loc@ Location in string used or parsed by hfst-pmatch
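The P, R, D, U and C flags listed so far follow the usual flag-diacritic semantics: P sets a value, R requires it, D disallows it, U unifies with it, and C clears the feature. The Python sketch below simulates those checks along a single path to show how the Case flags enforce agreement. It is a simplified model written for this documentation, not the HFST or Xerox implementation: the function name `accepts` is hypothetical, and the N operator and negative values are left out.

```python
import re

# Matches @OP.Feature@ or @OP.Feature.Value@ for the operators used in this file.
FLAG_RE = re.compile(r"@([PRDUC])\.([^.@]+)(?:\.([^.@]+))?@")


def accepts(path):
    """Return True if the flag diacritics on this path are mutually consistent."""
    features = {}
    for flag in path:
        match = FLAG_RE.fullmatch(flag)
        if match is None:
            raise ValueError(f"not a supported flag diacritic: {flag}")
        op, feature, value = match.groups()
        if op in "PU" and value is None:
            raise ValueError(f"{op} flags need a value: {flag}")
        current = features.get(feature)
        if op == "P":            # set the feature to the value
            features[feature] = value
        elif op == "C":          # clear the feature
            features.pop(feature, None)
        elif op == "R":          # require the value (or any value if none given)
            if current is None or (value is not None and current != value):
                return False
        elif op == "D":          # disallow the value (or any value if none given)
            if current is not None and (value is None or current == value):
                return False
        elif op == "U":          # unify: set if unset, otherwise values must match
            if current is None:
                features[feature] = value
            elif current != value:
                return False
    return True


print(accepts(["@P.Case.MscNom@", "@R.Case.MscNom@"]))  # True
print(accepts(["@P.Case.MscNom@", "@R.Case.FemNom@"]))  # False
print(accepts(["@U.Case.Pl@", "@U.Case.Neu@"]))         # False
```

In this model, a first part that sets `Case` to `MscNom` can only continue with a second part that requires or unifies with `MscNom`, which matches the pairing suggested by examples such as fyrstiflokkur.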

For languages that allow compounding, the following flag diacritics are needed to control position-based compounding restrictions for nominals. Their use is handled automatically if combined with +CmpN/xxx tags. If not used, they will do no harm.

@P.CmpFrst.FALSE@ Require that words tagged as such only appear first
@D.CmpPref.TRUE@ Block such words from entering ENDLEX
@P.CmpPref.FALSE@ Block these words from making further compounds
@D.CmpLast.TRUE@ Block such words from entering R
@D.CmpNone.TRUE@ Combines with the next tag to prohibit compounding
@U.CmpNone.FALSE@ Combines with the prev tag to prohibit compounding
@P.CmpOnly.TRUE@ Sets a flag to indicate that the word has passed R
@D.CmpOnly.FALSE@ Disallow words coming directly from root.

Use the following flag diacritics to control downcasing of derived proper nouns (e.g. Finnish Pariisi -> pariisilainen). See e.g. North Sámi for how to use these flags. There exists a ready-made regex that will do the actual down-casing given the proper use of these flags.

@U.Cap.Obl@ Allowing downcasing of derived names: deatnulasj.
@U.Cap.Opt@ Allowing downcasing of derived names: deatnulasj.

Lexicon Root

This is the beginning of everything. The Root lexicon is reserved in the LexC language, and must be the first lexicon defined.

Lexicon ENDLEX

And this is the ENDLEX of everything:

 @D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;

The @D.CmpOnly.FALSE@ flag diacritic is used to prevent words tagged with +CmpNP/Only from ending here. The @D.NeedNoun.ON@ flag diacritic is used to block illegal compounds.
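Since a D flag fails exactly when the feature currently holds the named value, the ENDLEX entry acts as a final filter on the flag state a word has accumulated. The Python sketch below models only that filter; the function name `endlex_allows` and the dictionary representation of the flag state are assumptions for illustration, not part of the lexicon.

```python
# The three D flags on the ENDLEX entry, as (feature, disallowed value) pairs.
ENDLEX_BLOCKERS = [
    ("CmpOnly", "FALSE"),   # @D.CmpOnly.FALSE@
    ("CmpPref", "TRUE"),    # @D.CmpPref.TRUE@
    ("NeedNoun", "ON"),     # @D.NeedNoun.ON@
]


def endlex_allows(features):
    """True if none of the features holds a value that ENDLEX disallows.
    `features` maps feature names to their current values; unset features
    are simply absent and never block."""
    return all(features.get(name) != value for name, value in ENDLEX_BLOCKERS)


print(endlex_allows({}))                      # True: nothing set, nothing blocked
print(endlex_allows({"CmpOnly": "FALSE"}))    # False: blocked by @D.CmpOnly.FALSE@
print(endlex_allows({"CmpOnly": "TRUE"}))     # True: value differs from FALSE
print(endlex_allows({"NeedNoun": "ON"}))      # False: verb part not yet nominalised
```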