fao
Contents:
Free and Open source Faroese analyser giella-fao
- Authors
- Divvun and Giellatekno teams, community members
- Software version
- 2013
- Documentation license
- GNU GFDL
- SVN Revision
- $Revision
: 68217 $ - SVN Date
- $Date
: 2013-01-16 11: 31: 33 +0200 (Wed, 16 Jan 2013) $
giella-fao
This is free and open source Faroese morphology.
Faroese morphological analyser
Definitions for Multichar_Symbols
Tags for POS
- +N +V +A +Adv +Prop +Num : Open POS's
- +CC +CS +Interj +Pr +Pron +IM : Closed POS's
- +Pers +Det +Refl +Recipr +Poss +Dem : Pron types
- +Nom +Acc +Gen +Dat : Case
- +Msc +Fem +Neu : Gender
- +Sg +Pl : Number
- +Def +Indef : Definiteness
- +Cmp +Superl : Comparison
- +Prs +Prt : Tense
- +1Sg +2Sg +3Sg : Person-Number
- +Inf +PrfPrc +PrsPrc +Sup +Imp +Sbj : Verb forms
- +Cmpnd : Compound
- +Abbr +ACR : Abbreviations, acronyms ,
- +CLB +PUNCT +LEFT +RIGHT : Punctuation, parentheses
- +Symbol : independent symbols in the text stream, like £, €, ©
- +Err/Guess : Tag for Name Guesser component
- +Der/heit Derivation with -heit
- +Ind +Pass +Interr +Ord
Semantic tags
- +Sem/Sur
- +Sem/Mal
- +Sem/Fem
- +Sem/Plc
- +Sem/Org
- +Sem/Veh
- +Sem/Fem
- +Sem/Year - year (i.e. 1000 - 2999), used only for numerals
Non-changing letters
- a2 This is for a special a Umlaut case
- g2 i2 j2 t2 v2
- +v1 +v2 : different paradigms ,
Triggers for Morphophonology
- %^UUML %^IUML %^eIUML : Umlaut types ,
- %^W %^JI : Cns changes ,
- %^EPH %^OEA : Epenthesis, ,
- %^GDEL %^GGDEL %^GVDEL %^VDEL %^JDEL %^RDEL : Cns deletion triggers,
- %^EIO %^OA %^WVV %^EDH %^VSH : TODO ,
- %^AB1 %^AB2 %^AB3 %^AB4 %^AB5 %^AB6 %^AB7 : Ablaut series ,
- %^aAB %^uAB : More Ablaut ,
- %^NGKK : NG to KK ,
- %^PASS : todo ,
- %> : Suffix boundary ,
-
+v1 - Paradigm identifier (e.g. gera+v1 = ger)
- +v2 - Paradigm identifier (e.g. gera+v2 = gerar)
Non-ascii letters, perhaps needed as multichar symbols
- æ ø å
- á é í ó ú ý Á É Í Ó Ý
- ä ö ü Ä Ö Ö
Compounding tags
The tags are of the following form:
-
+CmpNP/xxx - Normative (N), Position (P), ie the tag describes what
-
+CmpN/xxx - Normative (N) form ie the tag describes what
-
+Cmp/xxx - Descriptive compounding tags, ie tags that describes
This entry / word should be in the following position(s):
-
+CmpNP/All - ... in all positions, default, this tag does not have to be written
-
+CmpNP/First - ... only be first part in a compound or alone
-
+CmpNP/Pref - ... only first part in a compound, NEVER alone
-
+CmpNP/Last - ... only be last part in a compound or alone
-
+CmpNP/Suff - ... only last part in a compound, NEVER alone
-
+CmpNP/None - ... does not take part in compounds
-
+CmpNP/Only - ... only be part of a compound, i.e. can never
- +Use/Disamb = Use only in disambiguator/tokeniser analyser
- +Use/Circ = for compound restrictions
Symbols that need to be escaped on the lower side (towards twolc):
- »7 : Literal »
- «7 : Literal «
%[%>%] - Literal > %[%<%] - Literal <
Flag diacritics
We have manually optimised the structure of our lexicon using following
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
Flag for case harmony in compounds
Set flag for compounds
@P.Case.MscNom@ | fyrstiflokkur |
@P.Case.MscObl@ | fyrstaflokk |
@P.Case.FemNom@ | lítlasystir |
@P.Case.FemObl@ | lítluusystur |
@P.Case.Neu@ | breiðaskarð |
@P.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
@R.Case.MscNom@ | fyrstiflokkur |
@R.Case.MscObl@ | fyrstaflokk |
@R.Case.FemNom@ | lítlasystir |
@R.Case.FemObl@ | lítluusystur |
@R.Case.Neu@ | breiðaskarð |
@R.Case.Pl@ | fyrstuflokkar, lítlusystrar, breiðuskørð |
Control flag values for compounds
@U.Case.MscNom@ | fyrstiflokkur |
@U.Case.MscObl@ | fyrstaflokk |
@U.Case.FemNom@ | lítlasystir |
@U.Case.FemObl@ | lítluusystur |
@U.Case.Neu@ | breiðaskarð |
@P.Pmatch.Loc@ | Location in string used or parsed by hfst-pmatch |
@P.Pmatch.Backtrack@ | Also for hfst-pmatch |
Flags for compound restriction
For languages that allow compounding, the following flag diacritics are needed
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
Lexicon Root
- Nouns ;
- Shortnouns ; 1- and 2-letter nouns excluded from compounding
- Propernouns ;
- Adjectives ;
- Verbs ;
- Adverb ;
- Conjunction ;
- Subjunction ;
- Interjection ;
- Numeral ;
- Determiner ;
- Pronoun ;
- Preposition ;
- Punctuation ;
- Symbols ;
- Abbreviation ;
- Acronyms ;
Lexicon Acronyms is split in two:
- Acronym-fao ; for fao acronyms
- Acronym-smi ; for language independent acronums
Lexicon ENDLEX
@D.CmpOnly.FALSE@@D.CmpPref.TRUE@@D.NeedNoun.ON@ ENDLEX2 ;
The @D.CmpOnly.FALSE@ flag diacritic is ued to disallow words tagged
The Faroese morphophonological file
Alphabet
- a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
- á é ó ú í à è ò ù ì ä ë ö ü ï â ê ô û î ã ý þ ñ ð ß ç
- A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Æ Ø Å
- Á É Ó Ú Í À È Ò Ù Ì Ä Ë Ö Ü Ï Â Ê Ô Û Î Ã Ý þ Ñ Ð
- a2: a for da2n -> dønum
- g2: g for invariant g
- i2: i for invariant i
- j2: j for invariant j
- t2: t for invariant, non-deleted t, dráttri pro *drátri
- v2: v for invariant v
- %^UUML: 0 %^IUML: 0 %^eIUML: 0 : Umlaut types ,
- %^W: 0 %^JI: 0 : Cns changes ,
- %^EPH: 0 : Epenthesis, ,
- %^OEA: 0 :
- %^GDEL: 0 %^GGDEL: 0 %^GVDEL: 0 %^VDEL: 0 %^JDEL: 0 %^RDEL: 0 : Cns deletion triggers,
- %^AB1: 0 %^AB2: 0 %^AB3: 0 %^AB4: 0 %^AB5: 0 %^AB6: 0 %^AB7: 0 : Ablaut series ,
- %^aAB: 0 %^uAB: 0 : Ablaut series subcases
- %>: 0 : Suffix border
- « » : hmm, in use?
Sets
- Vow = a e i o u y æ ø å
- Cns = b c d f g h j k l m n p q r s t v w x z ð þ ;
- Nas = m n ;
- NonNas = b c d f g h j k l p q r s t v w x z ð þ ;
- Dummy = %^UUML %^IUML %^eIUML %^W %^EPH %^JI %^OEA
- %^EDH %^VSH %^GDEL %^GGDEL %^GVDEL %^VDEL %^JDEL
- Special = %^UUML %^IUML %^W %^EPH %^JI %^OEA %^GDEL %^GGDEL
Rules
Verschärfung
Deleting g
- Deleting the gv Verschärfung 1
- Deleting gg in ggj Genitive I
- Deleting gg in ggj Genitive II
Deleting v in gv sequences
Verschärfung tests
-
bógv^IUML>i
- bøg000i
-
flúgv^IUML^VDEL
- flýg000
-
flúgv^VSH^VDEL>u
- flug0000u
-
búgv^GVDEL>s
- bú0000s
-
bógv^VDEL>s
- bóg000s
-
skógv^GVDEL>m
- skó0000m
-
skýggj^GGDEL>s
- ský00000s
-
kríggj^GDEL>s
- kríg0000s
-
sjógv^GDEL>ar
- sjó0v00ar
Deleting r in Genitive of ur stems
-
brúður^EPH^RDEL>ar
- brúð00000ar
Deleting m in um%>num
-
ris>um>num
- ris00u000num
-
skógv>m>num
- skó0000000num
Deleting Double Consonant in Front of Consonant
The preceeding rule is fishy - the test cases below don't fit the context
-
all>t
- al00t
-
hjall>s
- hjal00s
-
rygg>s
- ryg00s
-
hjall>ar
- hjall0ar
Verbal Sandhi rules
Geminate Assimilation in Past Tense d
Geminate Assimilation in Past Tense t
-
send>di
- sen00di
-
hirð>di
- hir00di
-
sett>ti
- set00ti
ð Assimilation in Front of Dental Past Suffix -d(i)
-
leið>di
- leid0di
Deleting Double Consonant in Front of Epenthesis mark
-
summar^EPH>i
- sum00r00i
-
himmal^EPH^UUML>um#
- him00l000um0
Deleting stem-final s in s genitive
-
primus>s
- primus00
-
primus>s
- primus00
-
grís>s
- grís00
Double ð Deletion
ð Assimilation in Front of Supine Suffix -t
-
leið>t
- leit0t
Adjusting Dental Past Suffix -d(i)
-
keyp>di
- keyp0ti
Adjectival sandhi rules
Adjective neuter after nlr 1
Adjective neuter after nlr 2
-
mikil^EPH>t
- miki000ð
-
gamal^EPH>t
- gamal00t
t Deletion in Neuter
ng to kk Part 1
ng to kk Part 2
j rules
Deleting j
-
kríggj^GDEL>num
- kríg0000num
-
beiggj^JI>i
- beigg000i
-
verkj^JDEL>ur
- verk000ur
-
heyggj>i
- heygg00i
Realising j in front of vowels
-
hylj2>ar
- hylj0ar
Vowel rules
Realising i2 as i
Epenthetic vowel rules
Epenthetic deletion
-
økur^EPH^UUML>um
- øk0r000um
-
lykil^EPH>an
- lyk0l00an
-
aftan^EPH>
- aftan00
-
vakin^EPH>ir
- vak0n00ir
U-umlaut of Epenthetic vowel
-
gamal^EPH^UUML
- gomul00
-
gamal^EPH^UUML>u
- goml000u
Umlaut rules
U-umlaut in Front of Nasal
-
tank^UUML
- tonk0
-
band^UUML
- bond00
-
hamar^EPH^UUML>um
- hom0r000um
General U-umlaut
-
dag^UUML>um
- døg00um
-
sag^UUML>a
- søg00a
-
all^UUML>
- øll00
U-umlaut for akur
-
akur^EPH^UUML>um
- øk0r000um
I-umlaut
-
dag^IUML>i
- deg00i
-
son^IUML>i
- syn00i
-
bógv^IUML>i
- bøg000i
-
ung^IUMLr>i
- yng0r0i
-
fjørð^IUML>i
- f0irð00i
eI-umlaut for o: e, á: e, i: e
I-umlaut for bróðir
Inverted U-umlaut from ø
-
fløtt^ØAa
- flatt0a
Inverted U-umlaut from o
-
fonn^OA>ar
- fann00ar
o/ei-Umlaut I
o/ei-Umlaut II
-
dreing^EIO>i
- dro0ng00i
Vowel deletion rules
Vowel deletion in front of na
u Deletion in unum after Stressed Vowel
-
bý>unum
- by00num
Verbal vowel alternation rules
Stem vowel change in Weak Verbs
-
flek^WVV>t
- flak00t
-
flek^WVV>t
- flak00t
-
vel^WVV>di
- val00di
Stem Vowel Shortening in Supine and Participle
Past tense singular diphthongs I
Past tense singular diphthongs II
Past tense singular monophthongs
-
gev^AB3
- gav0
Past tense plural monophthongs
Past tense plural monophthongs to a
Supine u
Supine o
Supine i
Present tense ý
Adjectival Sandhi rule
Vowel shortening in Neuter
-
góð>t
- got0t
-
skjót>t
- skjót0t
Other rules
Morphological passive rules
u in ur Deletion in front of Pass
r Deletion in front of Pass
ð Deletion in front of Pass
Faroese Noun morphology
Basic noun lexica
Taken from the dictionary
The nominal morphology is added in three layers.
Lexicons still to be allocated
Irregular nouns
k0 for januar etc.
kv0 for ommudidd
irregular_nouns just gives the tags for the indeclinables
Lexica for words belonging to two paradigms.
The ordinary lexica
Weak masculines.
k1, risi, is the basic Msc lexicon, split in sg and pl
k1e for sg
k_flt1 for pl
- @NO CODE@ felagi
k2 beiggi
k3 for hagi
k3e for sg
k_flt3 for pl
k4 for tanki, just pointing to k3 (identical)
k5 for bóndi
Strong masculines
k6_null for antikrist
k6e_null for sg
k6 for úlvur
k6e for sg
k_flt6 for pl
k7 for sandur
k7e for sg
k_flt7 for pl
k_flt8 for pl, pointing to k_flt7
k8e for sg, pointing to k7e
k8 for garður, pointing to k7, but has a different u-umlaut
k_flt9 for pl
k9e for sg, pointing to k6e
k9 with double consonant deletion in front of s, but pointing to k6
k10 splitting in sg/pl
k10e for sg
k_flt10 for pl
k11 for ísur
k11e for sg
k_flt11 for pl
k12 for vinur
k12_boe
k12e for sg
k_flt12 for pl
k13e for sg, giving extra NULL dative then pointing to k12e
k13 for vegur
k14 for staður
k14e for sg
k_flt14 for pl
k15 for gestur
k15e for sg
k_flt15 for pl
k16 having double Cns but pointing to k15
k_flt17 giving UUML PLDAT and pointing to k_flt15
k17 giving UUML Dat and pointing to k15
k18 for dansur
k18e for sg
k_flt18 for pl
k19 for meldur
k19e for sg
k_flt19 for pl
k20 for akur
k20e for sg
k_flt20 for pl
k_flt21 pointing to k_flt19
k21 for stuðul
k22 for himmal
k23 for róður
k23e for sg
k_flt23 for pl
k24 for fløttur
k25 for vøllur
k25e for sg
k_flt25 for pl
k26 for táttur
k26e for sg
k_flt26 for pl
k27 for vøkstur
k28 for dráttur
k28e for sg
k_flt28 for pl
k29 for tráður
k30 for fótur
k30e for sg
k_flt30 for pl
k31 for veggur
k31e for sg
k_flt31 for pl
k32 for ryggur, using k31e
k33 for hylur
k34 for drongur
k34e for sg
k_flt34 for pl
k36 for heyggjur
k37 for skógvur
k37e for sg
k_flt37 for pl
k38 for bógvur
k38e for sg
k_flt38 for pl
k39 for sjógvur
k39e for sg
k_flt39 for pl
k40 for hógvur
k40e for sg
k_flt40 for pl
k41 for maður
k41e for sg
k41_obl for oblique, hmm, needed?
k_flt41 for pl
k42 for dagur
k42e for sg
k_flt42 for pl
k43 for faðir
k43e for sg
k_flt43 for pl
k44 for bróðir, stem is ZERO
k_flt44 for pl
k45 for spónur
k45e for sg
k_flt45 for pl
k46 for fjørðu
k46e for sg
k_flt46 for pl
k47 for sonur
k47e for sg
k_flt47 for pl
k48 for hamar
k48e for sg
k_flt48 for pl
k49 for verkur
k49e for sg
k_flt49 for pl
k50 for skjøldur (non_poetic)
k51 for luður
k52 for primus
k52e for sg
k_flt52 for pl
k53 for aðal
Feminines
Case inflection
This is the second layer. Here we do indefinite
Masculine forms
Weak case suffixes.
Singular
W_M_SGNOM for weak masculines, pointing to definites
W_M_SGACC etc for risan
W_M_SGDAT for
W_M_SGDAT_mixed for felagnum
W_M_SGGEN for
Plural
W_M_PLNOM for -ar-
W_M_PLNOM_UR for -ur-
W_M_PLACC for -ar-
W_M_PLACC_UR for -ur-
W_M_PLDAT for -u-
W_M_PLGEN for -a-
Strong case suffixes
Nominative Sg
Accusative Sg
Dative Sg
Plural forms
Nominative
Definite inflection
This is the third layer. Here we do the indefinite and definite forms.
Masculine forms
Masc def sg
DF_N_SGm for
DF_N_SGm_indef for
DF_N_SGm_def for
DF_A_SGm for
DF_A_SGm_indef for
DF_A_SGm_def for
DF_D_SGm for
DF_G_SGm for
Masc def pl
DF_N_PLm for
DF_N_PLm_indef for
DF_N_PLm_def for
DF_A_PLm for
DF_A_PLm_indef for
DF_A_PLm_def for
Feminine forms
Fem Sg
DF_N_SGf_S for
DF_A_SGf_W for
DF_A_SGf_S for
DF_D_SGf_W for
DF_D_SGf_S for
DF_G_SGf_W for
DF_G_SGf_S for
Feminine plural forms
DF_NA_PLf for
Neuter forms
Neuter sg
This concludes the nominal morphology.
The rest of the file contains flags, that govern
Compound flags
MscNom_Flag for
MscObl_Flag for
FemNom_Flag for
FemObl_Flag for
Neu_Flag for
Pl_Flag for
Faroese noun stem file
The lexicon names are taken from
Note that in some cases, the lexicon names and stems here
Short lexica
Shortnouns for 1- and 2-letter nouns excluded from compounding
These are now always excluded from lastpart compound
The main list of nouns
Her kjem alle substantiva. Dei er baklengssortert.
Nouns
Fila inneheld i underkant av 50000 lemma.