som
Free and Open source Somali analyser giella-som
- Authors
- Divvun and Giellatekno teams, community members
- Software version
- 2012
- Documentation license
- GNU GFDL
- SVN Revision
- $Revision
: 68217 $ - SVN Date
- $Date
: 2013-01-16 11: 31: 33 +0200 (Wed, 16 Jan 2013) $
giella-som
This is free and open source Somali morphology.
Somali morphological analyser
INTRODUCTION TO THE MORPHOLOGICAL ANALYSER OF SOMALI.
Multichar_Symbols definitions
Analysis symbols
The morphological analyses of Somali wordforms are presented
The parts-of-speech are:
- +N
- +V
- +A
- +Adp
- +Q
- +Pr
- +Adv
- +CC
- +CS
- +Interj
- +Pron
- +Num
Fusional adpositions
- +Adp/u Fusional ú
- +Adp/ka Fusional ká
- +Adp/la Fusional lá
- +Adp/ku Fusional kú
Object pronouns in adpositional+pronoun
- +1SgObj/i E.x.: la + i + lá -> laylá
- +2SgObj/ku E.x.: la + ku + lá -> lagulá
Focus
- +Foc/L Focus markers baa and ayaa
- +Foc/R Focus marker waxaa
The parts of speech are further split up into:
- +Interr
- +Interr/ma
- +Attr
- +Short
- +Cmp
- +Der/sho
- +Pfx
- +PP
- +Sep
- +Com
- +Appos
- +Impers
- +Inch
- +Recit
- +Restr
- +Pers
- +Dem
- +Coll
- +Mass
- +Acr
- +Abstr
- +Abbr
- +Prop
Verb and noun declensions for the analysers that want to know about that
- +Decl/1
- +Decl/2
- +Decl/2A
- +Decl/2B
- +Decl/3 +Decl/3A +Decl/3B
- +Decl/4
- +Decl/5
- +Decl/6
- +Decl/7
The Usage extents are marked using the following tags:
- +Err/Orth
- +Use/-Spell
- +Use/-Spell
- +Use/Circ
- +Use/CircN
- +Err/Lex
- +Use/Marg
- +Use/NG
- +Use/Ped
- +Use/SpellNoSugg+Prog
- +Err/Orth
The nominals are inflected in the following case, number
- +Sg
- +Pl
- +Nom
- +Abs
- +Gen
- +Indef
- +Def
Nominals also are inflected for gender
- +Masc
- +Fem
Nominal marked for gender undergo gender polarity changes in plural.
- +M→M
- +M→F
- +F→M
- +F→F
Nominals also have affixed demonstratives
- +Prox -0
- +Dist -ii
- +Near -aas / -aasi
- +Far -eer / -eeri
- +Farther -oo / -ooyi
- +Close -an / -anu / -ani
Are these in use?
- +Adc
- +Apr
- +Prl
- +Apr
- +Cns
- +Ord
The possession is marked as such:
- +PxSg1
- +PxSg2
- +PxSg3F
- +PxSg3M
- +PxPl1
- +PxPl1Incl
- +PxPl1Excl
- +PxPl2
- +PxPl3
The comparative forms are:
- +Comp
- +Superl
Numerals are classified under:
- +Attr
- +Card
- +Ord
Verb moods are:
- +Ind
- +Opt
- +Imprt
- +Neg
- +Imper
Verb tenses
- +Past
- +Pres
Verb aspects are:
- +Prog
Verb personal forms are (NB: no inclusive/exclusive):
- +1Sg
- +2Sg
- +3Sg
- +3SgM
- +3SgF
- +1Pl
- +2Pl
- +3Pl
Verbs also mark some non-agreement syntactic information
- +Red occurs often with subjects that are focused
- +Rel the verb is within a relative clause, and is also case marked.
Other verb forms are
- +Inf
- +Ger
- +ConNeg
- +ConNegII
- +Neg
- +ImprtII
- +PrsPrc
- +PrfPrc
- +Sup
- +VGen
- +VAbess
Abbreviated words are classified with:
- +ABBR
- +Symbol = independent symbols in the text stream, like £, €, ©
- +ACR
Special symbols are classified with:
- +CLB
- +PUNCT
- +LEFT
- +RIGHT
The verbs are syntactically split according to transitivity:
- +TV
- +IV
- +DV
Special multiword units are analysed with:
- +Multi
Non-dictionary words can be recognised with:
- +Guess
- ^GUESSNOUNROOT
Question and Focus particles:
- +Qst
- +Foc
Semantics are classified with
- +Sem/Plc
Derivations are classified under the morphophonetic form of the suffix, the
- +V→N
- +V→V
- +V→A
- +Der/xxx
- +Incl
- +Excl
Syntaxy stuff, don't want to use +Acc, because this isn't relevant in nouns
- +Subj
- +Sem/Obj
Nominal MSP
- +Rel
Derivation
- +Der/A
- +Der/V
- +Der/N
Clitics
- +Clit/ba
- +Clit/se
- +Clit/na
- +Clit/oo
- +Clit/CS
- +Clit/Without
Style
- +Use/NG
- +Sty
- +Sty/TODO
- +Sty/i
- +Sty/D
- +Sty/R
- +TODO
Morphophonology
To represent phonologic variations in word forms we use the following
- {N} For tagging certain twolc rules as nominal-only
Going to try to replace these with flag diacritics if possible.
And following triggers to control variation
- {#} # -
TODO: no need for , but needs to be removed in all files
- {m} in nouns: for marking m~n alternations
- {mm} in nouns: rare instance of mm ~ n
- {C2} in nouns: consonant reduplication in noun declension 4. (yaab ~ yaabab)
- {X} in nouns: insertion of some kind in noun definiteness. TODO: twolc rule no longer exists?
- {ae} in verbs: umlaut of a~i in some verb stems (seems restricted to specific lexemes, not productive)
- {e} in nouns: -e- variation in declension 7 (waraabe ~ waraabaha), not 100% predictable
- {-e} in nouns: delete final -e, often used in conjunction with {a}, possible room for cleaning up.
- {a} in verbs: Mostly V3B: has alternation between o ~ a. (sigo ~ sigaday)
- {-V} in verbs: deletion of specific vowel, used only in affixes, to make stems prettier? room for cleaning
- {-I} in verbs: -i- deletions in V3A and -san adjectives
- {-a} used specifically in -sho derivations. TODO: change to rule with » ?
- {E} part of cliticized ee (CS+Appos)
- {y} in verbs: -y- deletion in certain parts of V2
Tone
- ´´
Symbols that need to be escaped on the lower side (towards twolc):
- »7
- Literal »
- «7
- Literal «
%[%>%] - Literal > %[%<%] - Literal <
Flag diacritics
@P.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@D.NeedNoun.ON@ | (Dis)allow compounds with verbs unless nominalised |
@C.NeedNoun@ | (Dis)allow compounds with verbs unless nominalised |
For languages that allow compounding, the following flag diacritics are needed
@P.CmpFrst.FALSE@ | Require that words tagged as such only appear first |
@D.CmpPref.TRUE@ | Block such words from entering ENDLEX |
@P.CmpPref.FALSE@ | Block these words from making further compounds |
@D.CmpLast.TRUE@ | Block such words from entering R |
@D.CmpNone.TRUE@ | Combines with the next tag to prohibit compounding |
@U.CmpNone.FALSE@ | Combines with the prev tag to prohibit compounding |
@P.CmpOnly.TRUE@ | Sets a flag to indicate that the word has passed R |
@D.CmpOnly.FALSE@ | Disallow words coming directly from root. |
Use the following flag diacritics to control downcasing of derived proper
@U.Cap.Obl@ | Allowing downcasing of derived names: deatnulasj. |
@U.Cap.Opt@ | Allowing downcasing of derived names: deatnulasj. |
- @P.VCLASS.V1@ @R.VCLASS.V1@
- @P.VCLASS.V1ow@ @R.VCLASS.V1ow@
- @P.VCLASS.V1ow2@ @R.VCLASS.V1ow2@
- @P.VCLASS.V2A@ @R.VCLASS.V2A@
- @P.VCLASS.V2B@ @R.VCLASS.V2B@
- @P.VCLASS.V3A@ @R.VCLASS.V3A@
- @P.VCLASS.V3B@ @R.VCLASS.V3B@
- @P.VCLASS.V3B_ADel@ @R.VCLASS.V3B_ADel@
- @P.VCLASS.V3B_ADelPart@ @R.VCLASS.V3B_ADelPart@
- @P.VCLASS.PREFIXING@
- @U.VCLASS.V1@
- @U.VCLASS.V2A@
- @U.VCLASS.V2B@
- @U.VCLASS.V3A@
- @U.VCLASS.V3B@
- @U.VCLASS.UREFIXING@
- @P.ATR.True@
- @R.ATR.True@
Person flags
- @U.Pers.1Sg@
- @U.Pers.2Sg@
- @U.Pers.3SgM@
- @U.Pers.3SgF@
- @U.Pers.1Pl@
- @U.Pers.2Pl@
- @U.Pers.3Pl@
- @P.Pers.1Sg@
- @P.Pers.2Sg@
- @P.Pers.3SgM@
- @P.Pers.3SgF@
- @P.Pers.1Pl@
- @P.Pers.2Pl@
- @P.Pers.3Pl@
- @R.Pers.1Sg@
- @R.Pers.2Sg@
- @R.Pers.3SgM@
- @R.Pers.3SgF@
- @R.Pers.1Pl@
- @R.Pers.2Pl@
- @R.Pers.3Pl@
- @R.Gender.Masc@
- @P.Gender.Masc@
- @R.Gender.Fem@
- @P.Gender.Fem@
The continuation lexica
The word forms in Somali start from the lexeme roots of basic
- LEXICON Root
- Abbreviations ;
- Nouns ;
- ProperNouns ;
- Numerals ;
- Pronouns ;
- Verbs ;
- IrregularVerbs ;
- VerbalPrefixes ; Certain VP elements often get combined with the verbs in writing.
- Adjectives ; Some have verb morphology, and some view them to just be a 4th declension of verbs.
- Adverbs ;
- Conjunctions ;
- Subjunctions ;
- Adpositions ;
- Determiners ;
- Interjections ;
- Punctuation ;
- Symbols ;
The following are coming from som-lex.txt
- IrregularAdjective ;
- Prefixes ;
- LEXICON FINAL_NG just adds the +Use/NG tag to lower ##
- LEXICON FINAL just adds lower ##
These lexica are dummy lexical to make the source compile, they contain only #.
- LEXICON Proper
- LEXICON Unknown_Declensions
- LEXICON Obj_Pron
- LEXICON SemiReducedPerson
Nouns
Nouns in Somali have separate paradigms depending on
Note that items containing ATR should mark the ATR using <¨> before the vowel
doog:¨doog NOUN1_M/SgOnly ;
Irregular nouns
- LEXICON IrregNouns
il # Irregular tests examples:
-
il: il+N+Fem+Sg+Indef+Abs
-
isha: il+N+Fem+Sg+Def+Abs+Prox
-
indho: il+N+Masc+Pl+Indef+Abs
- indhaha: il+N+Masc+Pl+Def+Abs+Prox
il # Irregular tests examples:
-
si: si+N+Fem+Sg+Indef+Abs
-
sida: si+N+Fem+Sg+Def+Abs+Prox
-
siyaabo: si+N+Masc+Pl+Indef+Abs
- siyaabaha: si+N+Masc+Pl+Def+Abs+Prox
Declension 1: F→M
TODO: write quick overview of morphosyntax, morphophon
Good amount of nouns with -ad, Fem derivational suffix.
aalad # aalad sample paradigm. examples:
-
aalad: aalad+N+Decl/1+Fem+Sg+Indef+Abs
-
aaladda: aalad+N+Decl/1+Fem+Sg+Def+Abs+Prox
-
aalado: aalad+N+Decl/1+Masc+Pl+Indef+Abs
- aaladaha: aalad+N+Decl/1+Masc+Pl+Def+Abs+Prox
Declension 1: M, sg. only
Declension 1: M→M, M→F
Declension 1: Masc. Pl. Only
Declension 1: Fem. Sg. Only
A fair amount of abstract things, and some collective things that probably need
Declension 2
Declension 2: Collective
Groups of things, -ley is a common suffix. Taged with +Coll, but available
Declension 2: M→F
Declension 2: M→F
Some consonant doubling in plurals with -o, some with -yo, no doubling.
Declension 2: M→M - Arabic words with Somali plurals.
Declension 2: F→F
Declension 2: M→F - collectives
TODO: these are collectives, but not marked as such and perhaps should be.
Declension 2: M→F - Mass
TODO: these are mass nouns, but not marked as such and perhaps should be.
Declension 3: M→M
Declension 3: F→M
gabadh # dh + d -> dh; vowel deletion examples:
-
gabadh: gabadh+N+Fem+Sg+Indef+Abs
-
gabadha: gabadh+N+Fem+Sg+Def+Abs+Prox
-
gabdho: gabadh+N+Masc+Pl+Indef+Abs
- gabdhaha: gabadh+N+Masc+Pl+Def+Abs+Prox
Declension 3: M→M
xadhig # g ~ k; vowel deletion examples:
-
xadhig: xadhig+N+Masc+Sg+Indef+Abs
-
xadhigga: xadhig+N+Masc+Sg+Def+Abs+Prox
-
xadhko: xadhig+N+Masc+Pl+Indef+Abs
- xadhkaha: xadhig+N+Masc+Pl+Def+Abs+Prox
Arabic loan plural forms
Ex.)
- amar -> awaamiir
- axmaq -> axmaqiin
- banki -> bunuug
- LEXICON ArabicLoans
guri # Odd-syllable test examples:
- guri: guri+N+Masc+Sg+Indef+Nom
The Somali morphophonological/twolc rules file
Phonological Processes in Somali
Spreading processes
Lenition
ilig ‘tooth' ~ iligga ‘tooth (Def.)' ~ ilko ‘teeth (Indef.)’ arag ‘see' ~ aragtaa ‘2Sg/3SgF sees' ~ arkaa ‘1Sg/3SgM sees’
Voicing assimilation
Stops assimilate for voicing (or lenis/fortis), particularly across morpheme boundaries,
aragtaa ‘2Sg/3SgF sees’ wararka ‘the news’ buugga ‘the book’ naagta ‘the woman’ jaamacadda ‘the university’
This also follows for the retroflex segment <dh>, however the sequence <dhdh> is shortened
gabadh ‘(a) girl’ gabadha ‘the girl’
Vowel ablaut
waxaan ‘foc+1Sg' magac rah wuxuu ‘foc+3SgM' magucu / magacu ruhu magicii / magicii
Full ablaut appears to be optional in some words.
tag ‘go' tegi ‘to go' tegeen ‘they went’ bax ‘leave' bexi ‘to leave' bexeen ‘they left’
<k> deletion
<k> is deleted when it follows a back consonant (which is not <k> itself):
wax + ka => waxa magac + ka => magaca rah + ka => raha
<l> + <t> -> <sh>
Vowel harmony
Tonal Processes
Reduplication
Somali has two kinds of reduplication: partial and full. Reduplication is typically
af ‘mouth, language' => afaf ‘... +Pl’ qoys ‘family' => qoysas ‘... +Pl’
Note: There may be some partial reduplication that also copies the vowel.
qof ‘person' => qofqof cad ‘part' => cadcad
Adjectives:
quruxsán ‘beautiful' => qurquruxsán ‘ ... +Pl’ fudúd ‘easy, light' => fudfudúd adág ‘difficult' => ad'adág yár ‘a little' => yaryára gaabán ‘short' => gaaggaabán ‘ ... +Pl’ laabán ‘folded' => laallaabán jabán ‘broken' => ‘jajabán’
Irregular adjectives:
dhéer ‘long, tall (sg)' => dhaadhéer wéyn ‘big' => waawéyn
Note: Other possibilities might be found greping: with ‘^\(.\)aa\1e', ‘^\(.\)aa\1\1aa',
Person morphemes
The realization of personal suffixes on verbs is a little complex and depends mostly on
Realization of these suffixes is currently all handled by twolc:
sg pl 1 {đ} {ñ} ==> 2 {ŧ} 0 3 Masc {đ} {đ} 3 Fem {ŧ} Declension 1 sg pl 1 keenaa keennaa 1 dhacay dhacnay 2 keentaa keenaan 2 dhacday dheceen 3 Masc keenaa keentaan 3 Masc dhacay dhecdeen 3 Fem keentaa 3 Fem dhacday Declension 2 sg pl sg 1 sameeyay sameynay 1 kariyay karinay 2 sameysay sameyseen 2 karisay kariseen 3 Masc sameeyay sameeyeen 3 Masc kariyay kariyeen 3 Fem sameysay 3 Fem karisay Declension 3 sg pl 1 furay furannay 1 watay wadannay 2 furtay furateen 2 wadatay wadateen 3 Masc furay fureen 3 Masc watay wateen 3 Fem furtay 3 Fem wadatay
A short summary:
%{đ%} - 1Sg, 3SgM and 3Pl; realized as 0/y/d in classes %{ŧ%} - 2Sg, 3SgF, 2Pl; realized as t/s in classes %{ñ%} - 1Pl; realized as n in all classes ‘special' here just in case.
This process could easily be removed from twolc and done in just flag diacritics, why not?
More processes to describe:
- wataa / wadatay dissimilation
- a/e alternation in -e nouns (noun type is a bit more complex too, as corpus reveals)
- vowel deletion ilig, ilko
- e/y alternation in samee- verbs
- w/b alternations
- vowel lengthening magaalo / magaalooyin
Notes:
- If forms disappear but the rule should work, make sure
Alphabet
For ease of reduplication and treating sh as separate from l + t => sh
TODO: write tests for gacmahii / gacmihii, magacii / magicii, magucu / magacu
TODO: tests from these
ilig+N+Masc+Pl+Def+PxSg3M+Abs+Prox il^ik´%>%{N%}a%>híisa# -- > ilkihiisa walaal+N+Masc+Pl+Def+PxSg3M+Abs+Prox walaal´´%>%{N%}a%>híisa# -- > walaalihiisa magac+N+Masc+Sg+Def+Abs+Dist magac´´%>%{N%}kii# -- > magicii magac+N+Masc+Sg+Def+Nom+Prox magac´´%>%{N%}ku# -- > magucu magac+N+Masc+Sg+Def+PxSg3F+Abs+Prox magac´´%>%{N%}kíisa# -- > magiciisa waxa+CS+Foc/R+Sg3M wax%>uu# -- > wuxuu wax+N+Masc+Sg+Def+PxSg3M+Nom+Prox wax´´%>%{N%}kíisu# -- > wixiisu / waxiisu
List of rules:
Ablaut around back fricatives at morpheme boundaries
Ablaut in declension 7 with final vowel
LT -> SH part 1
LT -> SH part 2
LT -> SH part 2 for V3B
LT -> SH in reduplicative plurals, pt 1
V2 {y} deletion
Definite suffix -k to -g next to preceding voiced
t:0 before d h
t/d alternation - definite suffix, V1
TODO: dukaammo
Declension 2:-C doubling in plural
TODO: This rule is currently conflicting with the next one.
Declension 2:-y insertion after back sounds and fricatives; and after -i
Declension 4:Plural Partial Reduplication
Declension 4:Plural Partial Reduplication %{m}
Declension 4:C2->n in m contexts
-y- insertion in plurals
Disallow yy
Doesn't work with wuxu, where k has been deleted.
Declension 6:final vowel
Declension V1:magacow / magacaabay / magacawday, shorten a
Delete preceding a
AllBoundaries \?: WordChars
Declension 7:{-e} delete
Declension 7:final vowel
Declension 7:final vowel in final position
Verbal Rules
V2A -> V3B derivations, delete -i
V2A -> V3B derivations, depalatization
V1:a/e umlaut before i/e
V1:a/e umlaut before a
V1, D3:Vowel deletion
V1:Simplification of dhd to dh (t)
V1:n/m when deleted vowel results in C0n
V2A
tbw.
V2B
V2B:y-insertion
V2B:ey before consonants
V2B:ee before y
V2B:Remove e
V3A
V3B
Following rules are for the funky things that happen with 1Sg, 3SgM, 3Pl
wada>d>aa# wada>nn>aa#
wataa wadannaa
fura>d>aa# fura>nn>aa#
furtaa furannaa
Also another type: TODO: merge
sigo+V+IV+1Sg+Ind+Past sig{a}>d>ay#
sigtay
But: obd / cd / hd / qd / xd / yd
V3B:%{o%}:o finally
V3B:%{o%}:0 in d _ d
V3B:d dissimilation
V3B:d dissimilation w/ cor
V3B:d dissimilation 2
V3B:root {a} deletion before suffixes with -d
V3B:otherwise
General Rules
Vowel reduction VVV -> VV
m->n in codas
m->m in codas
Optional m doubling
Velar deletion after Back, but avoiding digraphs
Final lenition.
-ee changes to -ye after other vowels
-ee changes to -e after y
-ee clitic deletes preceding -o
progressive nasal assimilation across morphological boundaries
-
primus%>s
- primus00
examples:
examples:
examples:
examples: