Documenting the North Saami lexicon file
File structure
The file format is documented in the Xerox manuals, especially in Karttunen 1993, Finite-State Lexicon Compiler, but see also the Beesley and Karttunen book. The file itself consists of a section defining Multichar_Symbols and of a large number of lexica, 183 according to the present count (19.10.01). The file sme-lex.txt contains, among other things, the continuation lexica for nouns, verbs and adjectives, whereas the bulk of the lexicon is divided into different files, as indicated below.
In the sme-lex.txt file, the Multichar_Symbols section contains all grammatical tags, and all multicharacter members of the alphabet (the latter set is taken from the grammar file).
The Root lexicon points to the lexica of the different parts of speech. For each sublexicon, a comment points to the file that contains it:
LEXICON Root
 NounRoot ;       ! -> noun-sme-lex.txt
 ProperNoun ;     ! -> the file sme-lex.txt itself
 AdjectiveRoot ;  ! -> adj-sme-lex.txt
 VerbRoot ;       ! -> verb-sme-lex.txt
 Pronoun ;        ! -> closed-sme.lex.txt
 Adverb ;         ! -> adv-sme-lex.txt
 Particles ;      ! -> closed-sme.lex.txt
 Subjunction ;    ! -> closed-sme.lex.txt
 Conjunction ;    ! -> closed-sme.lex.txt
 Adposition ;     ! -> pp-sme.lex.txt
 Postposition ;   ! -> pp-sme.lex.txt
 Preposition ;    ! -> pp-sme.lex.txt
 Interjection ;   ! -> closed-sme.lex.txt
The different part of speech lexica are documented here, in the order just given.
The NounRoot lexicon
The structure of the noun-sme.txt file
The file contains the following sections:
- LEXICON GuessNoun
- LEXICON NounRoot
- Stray forms
- Compounds
- Multiword nouns
- The noun stems themselves, sorted alphabetically (the alphabetical sorting is according to the computer algorithm, and not according to the Saami one)
The GuessNoun lexicon belongs to the guess-sme binary file. The idea is to have a binary file that guesses the grammatical form of nouns on the basis of their suffix. A hypothetical form like plineraiguin could, for example, be hypothesised to be the Comitative Plural of plinera. The guesser has not been used, and it is unclear whether it works at all. TODO: Check this.
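The guessing idea can be illustrated with a minimal Python sketch. It is not the guess-sme transducer itself: the function name, the suffix table and the tag spelling are assumptions made for illustration, and only the -iguin example (plineraiguin, Comitative Plural of plinera) comes from the text above.

```python
# Hypothetical suffix table. Only the "iguin" entry is taken from the text
# (plineraiguin -> plinera, Comitative Plural); the tag spelling is assumed.
SUFFIX_TABLE = [
    ("iguin", "+N+Pl+Com"),
]

def guess_noun(wordform, table=SUFFIX_TABLE):
    """Return (stem, tags) guesses for a word form, longest suffix first."""
    guesses = []
    for suffix, tags in sorted(table, key=lambda entry: -len(entry[0])):
        # Require a non-empty stem to remain after stripping the suffix.
        if wordform.endswith(suffix) and len(wordform) > len(suffix):
            guesses.append((wordform[: -len(suffix)], tags))
    return guesses

print(guess_noun("plineraiguin"))
```

A real guesser would need one entry per case and number suffix, and would return several competing guesses whenever more than one suffix matches.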
The lexicon NounRoot first has some temporary sections. There is a set "compounds, awaiting a solution on the shorting question", such as sámegiel, oar, beal, etc., i.e. reduced first-part and middle-part compound forms. The long-term solution is to build a syllable shortener for these compounds; for the time being the lexicalised ones are listed here. This section also contains some loanwords that typically have a shorter, non-Saami form in compounds and a longer, Saami form when used as an independent word. An illustrative example is the word sosial. In isolation we find sosiála, but in compounds we find sosialdepartemeantta, etc. Thus, these words are listed here, pointing directly to the compound lexicon R.
Finally, the noun stem section itself is the large one. It contains about 26500 nouns, divided into different noun stem classes, which are documented in the next section.
The noun stem classes
The sublexica, ordered by inflectional type
Bisyllabic nouns:
GOAHTI !Bisyll. V-Nouns. Short nom-compound-forms goahte-, long/short gen
nom-compound-forms, long/short gen
long nom-compound-forms, long/short gen
ALBMI !Bisyll. V-Nouns. Short nom-compound-forms, long gen.
ALBMILONG !Bisyll. V-Nouns. Long nom-compound-forms, long gen.
V-Nouns. Long/SHORT nom-compound-forms, long gen.
AIGI !Bisyll. V-Nouns. Short nom-compound-forms, short gen
ACTOR !Long compound-forms
ACTORLONGSHORT !Sometimes long
ACTORTV !deverbal nouns from transitive
ACTORLONGSHORTTV !deverbal nouns from transitive
STAHTA !Bisyll. Non-Gradating a-Nouns; i-Illative
!Bisyll. Non-Gradating a-Nouns; a-Illative
LUONDU !this word (+vuohta) because of behavior in
RUOKTU !only this word because of its behavior in
MANNI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen. ILL:mánnii/mánnái
!Bisyll. V-Nouns. Short nom-compound-forms, short gen., long
EADNI !Bisyll. V-Nouns. Short nom-compound-forms, long gen. short caritive
RAFI !Bisyll. V-Nouns. Long nom-compound-forms, long gen. short
DUOHKI !only this word so far for disamb.
BOAHTALADDAN ! No compound-forms
coming from transitive verbs, No compound-forms
No compound-forms, plural
EAPMITV !words stemming from transitives
MUITTASJEAPMITV !words stemming from
Without loc and ill sg:
OLGU !Bisyll. V-Nouns. Short nom-compound-forms, short
MIEHTI !Bisyll. V-Nouns. Short nom-compound-forms, long
LULLI !Bisyll. V-Nouns. Long/SHORT nom-compound-forms, long gen
With comparatives:
GADDI !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, long gen.
GADDILONG !long compound
GADDISHORT !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, SHORT gen.
With comparatives, without loc and ill sg:
GADDILONGSHORT !NB! No SgIll and SgLoc because davvi is the only word this far Bisyll. V-Nouns with Comp. Forms, long-short nomcmp, long gencmp
OARJI !Bisyll. V-Nouns with Comparative Forms. Short nom-compound-forms, long gen.
LULLILONG !long compound
BORALMASAT !like JOHTOLAT but plural only
Trisyllabic nouns:
Ending with consonant, gradating:
MATTAR !Tris. Anim. Gradating C-Nouns
MALIS !Tris. Inanim. Gradating C-Nouns, Short compound-forms
Inanim. Gradating C-Nouns. Long compound-forms
!Tris. Inanim. Gradating C-Nouns. Long and short
OVCCIS ! Collective numerals
DAIVVAS !Tris. Gradating C-Nouns, The Troms declension: dáivvaš:dáivaha, bearaš:bearraha,
Ending with vowel, gradating:
BEANA !Trisyll. Anim. Gradating 0-Nouns
SEAMU !Short compound-forms. Trisyll. Inanim. Gradating 0-Nouns
!Long compound-forms. Trisyll. Inanim. Gradating 0-Nouns
Ending with consonant, non-gradating:
GAHPIR !Short compound-forms. Trisyll. Non-Gradating
GAHPIRLONGSHORT !Long and short compound-forms. Trisyll. Non-Gradating C-Nouns
GAHPIRLONG !Long compound-forms. Trisyll. Non-Gradating C-Nouns
LEXDIMINC !Diminutives
BOAHTINLONGSHORTTV !!Words coming from transitive verbs Deverbal nouns
BOAHTINTV !!Words coming from transitive verbs Long compound-forms
coming from transitive verbs Short compound-forms
!Actio plurals
BADJOSAT !Pl. bajus:badjosat
ALIMAT !Pl. alin:alimat
seamu but plural only
Pl. vuoiŋŋaš:vuoigŋahat
Contracted nouns:
BOAZU !Anim. Contracted 0-Nouns
SUOLU !Short compound-forms. Inanim. Contracted 0-Nouns
SUOLULONG !Long compound-forms. Inanim. Contracted 0-Nouns
GISTTA !The Noun gistta, gist -
FALIS !Contracted Anim. C-Nouns
LASIS !Contracted Inanim.
GUOVTTIS ! Collective numerals
GUOVTTU !Here because other lexicons don't fit for it
DURVAT !like LASIS but pl. only
pl, only
Miscellaneous noun types:
GENTLEMAN ! cns-final bisyllabic loanwords (stem mana-
!Another peculiar word that deserves its own lex
MASAI !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
special word
GARGIA ! light fin syll, bisyll. on -o- that doesn't have change o:u in front of j (i): Kino
BUFFALO !heavy fin syll !gen = -a ill = -ii !parallel to NYSTØ
KULTUR !Recent loanwords on -vra with short cmp-form
!Recent loanwords on -iidna with short cmp-form
loanwords on -vdna with short cmp-form
SOSIAL !Recent loanwords on -ála with both short and long cmp-form
!Exceptional vuohta-Nouns
Nominal sublexica
To be written. Here we document the nominal section of the sme-lex.txt file.
The ProperNoun lexicon
The proper nouns are stored in gt/sme/src/propernoun-sme-lex.txt.
The file structure
Propernoun is one file, with the inflectional lexica first, followed by the proper noun lexicon of over 36000 items. This lexicon points to several sublexica. They are shown below, ordered according to the phonological and semantic properties of the stem. After the table comes a list of lexica not yet integrated in the table.
Type                StemCoda  CG   IllChange  Loc        Lexicon
--------------------------------------------------------
Monosyllabic stems
                    HeavyVow  no   no         -as        NYSTØ
                    HeavyCns  no   no         -as        BERN
Bisyllabic stems
                    LightVow  no   yes        -s         ACCRA
                    LightVow  yes  yes        -s         MARJA
                    LightVow  yes  yes        -s         SUOPMA
                    Light e   no   no         -s         SIJTE
                    LightCns  no                         HEANDARAT
                    LightVow  no   no         -s         NIKOSIIJA
                    LightVow  yes  yes        -s         HEIM
                    LightVow  yes  yes        -s         SUND
Trisyllabic stems
                    LightCns  no   no         -is        LONDON
                    LightCns  no   no         -is        ANAR
                    LightCns  yes  yes        -is        DUORTNUS
                    LightVow  yes                        GUOLBBA
                    LightCns  no   no         -is        PLACE-DIM
                    LightCns  yes  yes        -is        RANES
                    LightCns  yes  yes        -is        CAVKKUS
                    LightCns  no   no         -is        BALAK
                    LightCns  no   no         -is        SARAK
Contracted stems
                                                         DAVVISUOLU !Inanim. Contracted 0-Names - from SUOLU
                                                         GEAVNNIS
Mixed stems
                    -nen      no   no         -as/nenis  no  C-FI-NEN
                                                         HAWAII
                                                         SKOHTERMADII
Plural stems
  ALEUHTAT is for bisyllabic, would-have-been vowel final stems
  VARGGAT is as aleuhtat, but with Sg Gen and Loc substandard forms
  EATNAMAT is for trisyllabic plural stems
  ADJAGAT is for trisyllabic plural stems
  SULLOT is for contracted plural stems
  LASSAT is for contracted plural stems
  SKANIK is for plurals on -k
---------------------------------------------------------
Notes: Comp means "has comparative forms"
Looking at the distribution, and ignoring the semantic subtags, we have a (slightly outdated) distribution of names over lexicon types as follows:
Saami geographical names
Saami names from Norway were added in 2002. Since then, Statens Kartverk has converted most (all?) Saami names from the old Bergsland/Ruong orthography to the new 1979 orthography. The remaining Saami names should now be added to the list, in the following way:
- All North Saami names should be extracted from the map base and run against the transducer
- The names that do not get a +Prop tag should be extracted, and then added to the base.
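The two steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the transducer output is available as tab-separated lines with the surface form in the first field and the analysis in the remainder, and the function name and the sample lines (including the made-up name Xyzvárri and the tag strings) are hypothetical.

```python
def names_without_prop(lookup_lines):
    """Return surface forms for which no analysis carries the +Prop tag."""
    has_prop = {}
    for line in lookup_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate the analyses of different words
        surface, _, analysis = line.partition("\t")
        # Remember the form; mark it once any analysis contains +Prop.
        has_prop[surface] = has_prop.get(surface, False) or "+Prop" in analysis
    return [name for name, found in has_prop.items() if not found]

# Hypothetical lookup output: one analysed name and one unknown name.
sample = [
    "Guovdageaidnu\tGuovdageaidnu+N+Prop+Sg+Nom",
    "",
    "Xyzvárri\tXyzvárri\t+?",
]
print(names_without_prop(sample))
```

The returned list is the set of candidates to add to the proper noun lexicon; each one still has to be assigned to a suitable sublexicon by hand.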
Saami names from Finland and Sweden have not been added, but they are underway.
Saami names from outside Sápmi, as listed in Sámi Atlas, have been added. Other Saami names from outside Sápmi should also be added, but we don't know of good sources for such names (language councils are one possibility).
Names in other Saami languages are increasingly being used in North Saami texts, e.g. when referring to South Saami or Lule Saami institutions. These ones should be added and organised in lexica.
The file adj-sme-lex.txt contains 4270 entries.
The adjective sublexica
In the lexicon file adj-sme-lex.txt, the sublexica are distributed in the following way (30.06.05), ordered by frequency and then by declension type:
899  DABALAS      aad
477  JEAGOHEAPMI  bae
358  BOAKKAS      c
329  At           attributes
280  BEAKKAN      aab
218  GAPPUS       aab
205  JOHTIL       babaa
191  NUORRA       aaa
165  AKTIIVA
157  JIEDNAI      bad
150  GARAS        bbb
139  MEAHTTUS     aab
114  LAIKI        baa
106  DEARVVASLAS  aad
 71  ISSORAS      aad
 59  VIELG        babab
 38  CAHKK        babaa babab
 31  BUOREMUS     aae
 30  JUHKKIS      aaa
 26  GUOHCA       aab
 26  EATTAS       babba
 19  LODJI        baa bac
 14  GEARGGUS     aab
 13  SEARRA       baa
 13  DILDDAS      babba
  8  VUDDJII      bad
  8  CIENAL       babbb
  7  LINIS        bbe
  7  DEAHTIS      bbe
  7  BIEKKUS      babba
  7  ASEHAS       baf
  6  OVDDIT       aae
  6  NUOLUS       aab
  5  VATTIS       aab
  5  NJUORAS      bbb
  5  HEAHKAS      babba
  4  VIISSIS      aab aac
  4  LIEKKUS      aab
  4  FINJU-
  3  UHCC         bba
  3  SUVRRIS      bbe
  3  JALGAT       bbc babab
  3  BUORRE       ab
  3  BU/MUS
  3  ATTR
  3  ALLAT        bbc
  2  NJALGGAT     bbe
  2  NAMAT

Making linguistic sense of the system (Sammallahti's codes aaa etc.):

a.................No morphologically distinct attr. form
  aa..............attr. is not inflected
    aaa...........Bisyll.
                  NUORRA
                  JUHKKIS      Non-gradating
    aab...........Trisyll.
                  BEAKKAN      Gradating
                  GAPPUS       Gradating
                  MEAHTTUS     Compounded non-gradating
                  GUOHCA       Gradating
                  GEARGGUS     Gradating
                  NUOLUS       Gradating
                  VATTIS       Gradating
                  VIISSIS      Gradating
                  LIEKKUS      Gradating
                  IPMAHA       Gradating
    aac...........Contracted
                  VIISSIS
    aad...........Quadrisyll.
                  DABALAS
                  DEARVVASLAS
                  ISSORAS
    aae...........Comparative forms
                  BUOREMUS
                  OVDDIT
  ab..............Attr. in partial congruence with the noun
                  BUORRE
b.................Morphologically distinct attr. form
  ba..............Attr. form ends with -s
    baa...........Bisyll.
                  LAIKI        Attr. form -es in WeG
                  LODJI        Attr. form -es and -is in WeG
                  SEARRA       Attr. form -s in WeG
    bab...........Trisyll.
      baba........Non-gradating
        babaa.....No contraction in attr. form
                  JOHTIL       is-Attr.
                  CAHKK        is-Attr.
        babab.....Contraction (and ensuing strengthening) in attr. form
                  VIELG        es-Attr.
                  CAHKK        es-Attr.
                  JALGAT       es-Attr.
      babb........Grade alternation in stem
        babba.....s-stems with WeG attr-form
                  EATTAS       is-Attr.
                  DILDDAS      is-Attr.
                  BIEKKUS      is-Attr
                  HEAHKAS      is-Attr
        babbb.....Attr. form in StG
                  CIENAL       is-Attr.
    bac...........Vocalic contracted attr. form
                  LODJI        -es and -is Attr. in WeG
    bad...........Contracted stems
                  JIEDNAI      Non-gradating
                  VUDDJII      Non-gradating
    bae...........Caritives with attr- -his.
                  JEAGOHEAPMI
    baf...........Quadrisyll. Adj., with Attr.-his
                  ASEHAS
  bb..............Attr. form ends with -a
    bba...........Bisyll. ending in -i (or -a sometimes)
                  UHCC         StG attr.
    bbb...........Trisyll. ending -as
                  GARAS        Gradating with StG attr.
                  NJUORAS      Gradating with StG attr.
    bbc...........Trisyll. ending -at
                  JALGAT       Non-gradating with StG attr.
                  ALLAT        Non-gradating with StG attr.
    bbe...........Trisyll. ending -is
                  LINIS        Gradating with StG Attr.
                  DEAHTIS      Contracted with StG Attr.
                  SUVRRIS      Gradating with WeG Attr.
                  NJALGGAT     Gradating with StG Attr.
c.................No attr.
                  BOAKKAS      Trisyll.
The VerbRoot lexicon
The lexicon is stored in the verb-sme.txt file.
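VerbRoot contains 49 sublexica divided into three stem types (bisyllabic, contracted and trisyllabic verbs). Within each stem type there are the following kinds of lexica: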
- lexicon for impersonal verbs
- lexicon for verbs with personal passives, Transitives
- lexicon for verbs without personal passives, Intransitives
- lexicon without Personal Passive but with Acc obj
- lexicon for inherent passives
Bisyllabic verbs:
ARVI arvit sataa !Bisyllabic Impersonal
ARVALADDAT arváladdat sataa !Already derived bisyllabic Impersonal Verbs
DIEHTI diehtit tietaa !Bisyllabic i-Verbs with Personal Passive
BORRA !Bisyllabic a- and u-verbs with Personal
DEAKCU !as BORRA for u-verbs with dim -astit, and a-verbs with dim -istit that are hardcoded
DIEHTISHORT !Short actio compound-form
DIEHTILONGSHORT !Long and short actio compound-form
DIEHTALADDAT diehtáladdat tietaa !Already derived bisyllabic Verbs with Personal Passive
HAHTTIT !Four-syll kausatives on
DAHTU dáhtut ! As diehti, but -ut verbs, thus without short passive
BOLTU ! As dáhtu but with dim -astit that are
ALLU ! -ut verbs, thus without short passive
BOAHTI boahtit tulla !Bisyllabic i-Verbs without Personal
BOAHTILONGSHORT !!Long and short actio compound-form
DIEVVA ! Bisyllabic a- and u-verbs without Personal Passive but with Actor
BOAZZU !as DIEVVA for u-verbs with dim -astit, that are hardcoded
BINDU !as DIEVVA (but without short passive) for u-verbs with dim -astit, that are hardcoded
BOAHTALADDA boahtáladda tulla !Already derived bisyllabic Verbs without Personal
RAIMMAHALLA !passives on -hallat and INCHOATIVES on
Personal Passive but with Acc obj:
MAHTI máhttiit ! Bisyllabic Verbs without Personal Passive but with Acc obj.
máhtáladdat ! Already derived bisyllabic Verbs without Personal Passive but with Acc obj.
Inherent passives:
RAIMMAHALLA !passives on -hallat and INCHOATIVES on -stuvvat
UVVA !passives -uvvat
Contracted verbs:
BORGE borget tehda pyry !Contracted Impersonal
DOHPPE dohppet tarttua !Contracted Verbs with Personal Passive
MUITA ! Inchoatives and translatives on -á, -o, -e with Personal Passive
GILLE gillet viitsia !Contracted Verbs without Personal Passive
CIRRO ! Inchoatives and translatives on -á, -o, -e without Personal Passive
Personal Passive but with Acc obj:
MAHTA máhttát !Contracted Verbs without Personal Passive but with Acc obj.
Inherent passives:
BASSO bassot ! Bisyllabic, inherently passive -ot verbs
Trisyllabic verbs:
COASKKIT čoaskkidiit !Trisyllabic impersonal
ARVVASJ arvvašit !Trisyllabic impersonal verbs ending -šit, -skit, smit, -idit, -ldit, -git and 5-syllables
ARVIL arvilit !Trisyllabic impersonal verbs ending -lit
MUITAL muitalit !Trisyllabic Verbs with Personal
MUITTASJ !Words ending -šit, -skit, -ldit -
  !directed here as well:
  !Reciprocals on -dit
  !Momentatives on -dit, -ádit, -ihit, -e7hit
  !Frequentatives on -(u)hit
  !Continuatives on -nit
  !Inchoatives in -nit
HALIID !Words ending -smit, -idit, -git -
BONJAT !!Cont/Freq on -dit, Continuatives on -(u)hit, Reciprocals, momentatives and frequentatives ending -alit
VUORDIL !Trisyllabic Verbs ending -lit, -rit with Personal Passive
ALIST alistit !Trisyllabic Verbs without Personal Passive
BEAGASJ !Words ending -šit, -skit -ldit, transitives on -hit
  !directed here as well:
  !Reciprocals on -dit
  !Momentatives on -dit, -ádit, -ihit, -e7hit
  !Frequentatives on -(u)hit
  !Continuatives on -nit
  !Inchoatives in
JORGGIID !Words ending -smit, -idit, -git
BALAT !!Cont/Freq on -dit, Continuatives on -(u)hit, Reciprocals, momentatives and frequentatives ending -alit
SUOTNJAL suotnjalit !Trisyllabic Verbs without Personal Passive ending -lit
BOTNJAS botnjasit !Trisyllabic Verbs without Personal Passive ending -nit and
LASSAN !Trisyllabic Verbs ending -nit without Personal
Personal Passive but with Acc obj:
GEAGAT ! Trisyllabic Verbs without Personal Passive but with Acc obj.
BUOVVAL buovvalit ! Trisyllabic Verbs without Personal Passive but with Acc obj ending
The stems are distributed numerically as follows (the -it class includes both even-syllable and odd-syllable verbs):
even-syll    -at     3722
             -it     1035
             -ut      794
             total   5551

3-syllabic   -it     5376

             -át      297
             -et     2310
             -ot      111
             total   2718
Verbal sublexica
Verbal derivation
Here we document the main even-syllable lexica; the others are similar. DIEHTI is transitive, BOAHTI is intransitive.
DIEHTI ->
  +V: DIEHTIStem ;
  +V: DeverbalVerbsDIEHTI ;
BOAHTI ->
  +V: BOAHTIStem ;
  +V: DeverbalVerbsBOAHTI ;
DIEHTIStem ->
  :Y7j PASSIVE ;
  BOAHTIINCH ;
BOAHTIStem ->
  SG3PASSV ;
  BOAHTIINCH ;
BOAHTIINCH ->
  DeverbalNounsV ;
  +goah0ti:goah'ti BOAHTICnj ;
  BOAHTICnj ;
BOAHTICnj ->
  +Ind+Prs: PrsV ;
  +Ind+Prt: PrtV ;
  +Pot+Prs:Q7z1 PrsC ;
  +Cond: CondV1 ;
  +Imprt: ImprtVA ;
  NominalFormsV ;
NominalFormsV ->
  :X1 NominalFormsV1 ;
  :X4 NominalFormsV2 ;
  :Q6 NominalFormsV3 ;
  :X2 NominalFormsV4 ;
  :Q3 NominalFormsV5 ;
  :Y1 NominalFormsV6 ;
PASSIVE ->
  +Pass:uvvo DOHPPEINCH ;
  +Pass+meahttun+A:uvvomeahttum MEAHTTUN ;
  +Pass+PrfPrc:un K ;
  +Pass+eaddji+N+Actor:uvvojeaddji¤ DEVNVCASE ;
  +Pass+upmi+N:upmi DEVNVCASE ;
DeverbalVerbsDIEHTI ->
  +st:X8st MUITALStem ;
  +st+alla:X6stalla DIEHTIStem ;
  +st+adda:X6stadda DIEHTIStem ;
  +l:l MUITALStem ;
  +l+adda:X2ladda DIEHTIStem ;
  +l+ahtti:lahtti DIEHTIStem ;
  +l+asti:las'ti DIEHTIStem ;
  +h:X4h MUITALStem ;
  +h+alla:X6halla DIEHTIStem ;
  +h+adda:X6hadda DIEHTIStem ;
  +h+asti:X4has'ti DIEHTIStem ;
  +stuvva:X8stuvva SG3PASSV ;
  +d:Q8d MUITALStem ;
DeverbalVerbsBOAHTI ->
  +st:X8st ALISTStem ;
  +st+alla:X6stalla BOAHTIStem ;
  +st+adda:X6stadda BOAHTIStem ;
  +l:l ALISTStem ;
  +l+adda:X2ladda BOAHTIStem ;
  +l+ahtti:lahtti BOAHTIStem ;
  +l+asti:las'ti BOAHTIStem ;
  +h:X4h MUITALStem ;
  +h+alla:X6halla DIEHTIStem ;
  +h+adda:X6hadda DIEHTIStem ;
  +h+asti:X4has'ti DIEHTIStem ;
  +stuvva:X8stuvva SG3PASSV ;
  +d:Q8d ALISTStem ;
Comments to the verb sublexica
Within each of the main groups there are five types: impersonal verbs, verbs with personal passives, verbs without personal passives, verbs without Personal Passive but with Acc obj (plus two more lexica, see The VerbRoot lexicon above), and inherent passives. The difference between i/a/u and e/á/o verbs is handled in the rules file, not in the lexicon file.
The with/without Personal Passive distinction shows up in one sublexicon: DOHPPE has PASSIVE where GILLE has SG3PASS. This is (probably) a transitivity difference, cf. also diehtit vs. boahtit. The distinction thus seems to be one of valence: 0, 1 and 2.
At present, the file verb-sme-lex.txt contains all the verbs. At the beginning of the file, all sublexica are exemplified. Then follows the bulk of the verbs: two-syllable even, many-syllable even, odd and contracted verbs.
The tag system follows the outline in Nickel.
All Pronouns have the initial lexicon path Root -> Pronoun -> ...
Personal pronouns
Lexicon path:
Personal
  firstperspron
    firstperspronsg -> wordforms -> K
    firstpersprondu -> wordforms -> K
    perspronpl -> wordforms -> K
  nonfirstperspron
    nonfirsperspronsg -> wordforms -> K
    nonfirstpersrondu -> wordforms -> K
    perspronpl -> wordforms -> K
Note that the plural lexicon (perspronpl) is shared by all three persons. Not all forms differ between the sg and du lexica, but the lexica were split for consistency.
Interrogative pronouns
Mii, Gii, Guhte, Guhtemuš, Makkár, Man láhkái. The sublexicon Interrogative contains one entry for Sg Nom, and points the rest to the case paradigms.
Interrogative
  +Sg+Nom -> K   (one entry for gii and one for mii)
  oblintercas    (one entry for gii and one for mii)
  demcas
Demonstrative pronouns
The lexicon path:
Demonstrative
  demcas (one entry for each stem)
    demcassg
      nomdemcassg -> wordforms -> K
      obldemcassg -> wordforms -> K
    demcaspl
      nomdemcaspl -> wordforms -> K
      obldemcaspl -> wordforms -> K
Reflexive pronouns
The nominative forms are just listed. The oblique ones are directed to the sublexicon reflobl, and from there via different case stems to the appropriate Px sublexica. These sublexica are the same as the ones for nouns; they are found in the sme-lex.txt file. The only exceptions are some sublexica that are used only for plural forms; these are duplicated here from the sme-lex.txt file, in order not to revise the main lexicon.
Reciprocal pronouns
The section on reciprocal pronouns consists of three parts. The first 6 entries handle the first element of the reciprocal. The next 12 handle the second part of the non-Px reciprocal. Finally, the members of the third section point to special Px sublexica, designed for the reciprocal pronouns and found in the same section.
Relative pronouns
Formally, the relative and interrogative pronouns are identical. In this parser, we skip the separate chapter on relative pronouns, and instead we use the interrogative pronouns.
Indefinite pronouns
We divide the indefinite pronouns into three groups, with a fourth group of leftovers waiting for a better destiny:
- Declinable indefinite pronouns with case + clitic (mihkkege, giige, guhtemušge)
- Declinable indefinites with normal case paradigms (eará, eanas, muhtin)
- Indeclinable indefinites (buot, eatnat, guhtet)
- TODO: A set of idiosyncratic cases
Declinable indefinite pronouns with case + clitic
These pronouns have two stems, one nominative, and one oblique, and the clitic -ge attached to the case ending. The initial lexicon splits them in two, one hard-coded nominative (e.g. giige), and one oblique stem (e.g. gea-). Then, the case + clitic sequence is treated as a single suffix (e.g. locative -sge, etc.). Since the clitic slot has already been filled, they are directed to # rather than to K.
Declinable indefinites with normal case paradigms
This section hosts a seemingly complicated system of tailored sublexica. It contains three sections: first a section where the pronouns themselves are split into different continuation lexica, then a section with intermediate lexica, and finally a section with the case suffixes themselves. The lexica are partly modelled upon nominal lexica.
Naming convention for the sublexica:
- -c, -v
- Consonant stem, vowel stem
- -n, -ne
- nominative, nominative and essive
Indeclinable indefinites
There is first a list of multiword indefinites. These are picked out by the preprocessor, copied into a file abbr.txt and put in the bin/ directory. In the closed-sme-lex.txt file they behave like the other indeclinable indefinites.
The idiosyncratic rest category
Indefinite pronouns are complex, and the grammars are not always explicit enough, so this section hosts a set of pronouns, partly with a hard-coded tag, partly just commented out. They are awaiting a principled linguistic solution, but in order to find one, we need more information than we can get from the reference grammars.
TODO: Have a linguist/native speaker look at this section.
Overview of the lexicon structure
The numeral lexica are formed as a generator, generating all possible numerals. The basic lexicon is Numeral, and it looks like this:
LEXICON Numeral
 MILJON ;         ! a noun of its own
 UNDERDUHAT ;     ! for generator under 1000
 JUSTDUHAT ;      ! going via 1000
 OVERDUHAT ;      ! for generator over 1000
 OLD ;            ! for "thirteen hundred, etc.
 !num-basic ;     ! replaced by the 5 lexica above
 !num-derived ;   ! still unimplemented
 num-imprecise ;  ! still almost unimplemented
 ARABIC ;         ! for the arabic numerals
 ROMAN ;          ! for the roman numerals
MILJON is a noun. OLD is the old way of counting. The num-ordinal lexica act like adjectives; they are not finished yet. ARABIC and ROMAN contain number generators.
So, what is the reason for the three different lexica around 1000?
The reason is that the numeral system turns at the thousand mark. Numbers above it and numbers below it behave in the same way, thus we have both twenty-four and twenty-four thousand, etc.
The path is OVERDUHAT -> JUSTDUHAT -> UNDERDUHAT. OVERDUHAT generates the part of the numeral that is over 1000, and all these lexica then point to JUSTDUHAT. That lexicon has an optional "(one) thousand" before it leads either to DUHAT and via the relevant case paradigm to K, or to UNDERDUHAT. UNDERDUHAT contains the numerals 1-999. UNDERDUHAT starts with the lexicon for one, and gives each group of numerals its own lexicon.
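As an illustration of this routing, the following Python sketch returns the sequence of lexica a cardinal number would pass through. It is one reading of the description above, not a rendering of the actual lexc source: the function name is hypothetical, the range is limited to numbers below one million (MILJON is handled separately as a noun), and the case paradigm and K steps after DUHAT are left out.

```python
def numeral_path(n):
    """Return the sequence of lexica a cardinal 1..999999 would pass through."""
    if not 1 <= n <= 999_999:
        raise ValueError("sketch covers 1..999999 only")
    thousands, rest = divmod(n, 1000)
    if thousands == 0:
        return ["UNDERDUHAT"]          # plain 1-999
    path = []
    if thousands > 1:
        path.append("OVERDUHAT")       # the multiplier, e.g. the 24 in 24000
    path.append("JUSTDUHAT")           # "(one) thousand"
    # After the thousand mark: finish via DUHAT, or continue below 1000.
    path.append("UNDERDUHAT" if rest else "DUHAT")
    return path

for number in (24, 1000, 24000, 24024):
    print(number, numeral_path(number))
```

The point of the three-way split is visible in the last two calls: the multiplier part and the remainder part are built by separate lexica, so the same 1-999 machinery is not duplicated as a flat list above the thousand mark.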
Cardinals and ordinals
The cardinal and ordinal numbers are split at the final lexica, the OKTAF and 2TO9F lexica. This generates ordinals such as second as well as fifty-second.
Indeclinable words
All the lexica for indeclinable words are made the same way:
LEXICON Root
 Adverb ;

LEXICON Adverb
 áđamusat adv ;

LEXICON adv
 +Adv:0 K ;
The Root lexicon points to the POS lexica (Adverb etc.). Each of the POS lexica lists the entries, with a pointer to an arbitrarily named sublexicon (here "adv"). This sublexicon contains the grammatical tag for the POS in question (the tag has no surface form, hence ":0"), and possibly a pointer to the clitic lexicon K. Adverbs can have clitics added, hence K, whereas subjunctions cannot, hence no K.
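How such a chain of continuation lexica yields an analysis and a surface form can be mimicked in a few lines of Python. This is a toy model, not the lexc compiler: the dictionary layout and the function name are assumptions, and K is given only an empty "no clitic" entry, which is also an assumption.

```python
# Toy model of the three lexica in the example. Each entry is
# (upper, lower, continuation); "#" ends the word.
LEXICA = {
    "Root":   [("", "", "Adverb")],
    "Adverb": [("áđamusat", "áđamusat", "adv")],
    "adv":    [("+Adv", "", "K")],   # +Adv:0, i.e. the tag has no surface form
    # K is the clitic lexicon; this sketch gives it only the empty
    # "no clitic" entry, which is an assumption.
    "K":      [("", "", "#")],
}

def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICA[lexicon]:
        yield from expand(continuation, upper + up, lower + low)

print(list(expand()))  # [('áđamusat+Adv', 'áđamusat')]
```

Adding clitic entries to K in this model would multiply the output for every word class that points to K, which is why classes that take no clitics bypass it and go straight to #.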
[XXX At the moment particles are not directed to K, perhaps they should be. TODO: Check with corpus and native speakers.]
They are explained in the intro to the section "Indeclinable words" above.
These are in the closed-sme-lex.txt file. Their tag is +Pcle and their lexicon path is:
Root -> Particle -> pcle -> #
Subjunctions are ahte, juos, etc. These are in the closed-sme-lex.txt file. Their lexicon path is:
Root -> Subjunction -> -> #
Conjunctions are ja, dahjege, etc. These are in the closed-sme-lex.txt file. Their tag is +CC and their lexicon path is:
Root -> Conjunction -> Cc -> #
There are three different classes here: postpositions, occurring after their complement; prepositions, occurring before it; and adpositions, occurring both before and after. This could have been done the Lingsoft way as well: having +Adp as a common tag for all, and possibly +Prep and +Postp as subtags, where no subtag would indicate both positions (or both subtags could be used). At the moment, they are left as three distinct groups. The classification is based upon Nickel; adpositions found only in Sammallahti's dictionary and not in Nickel were put in the Adposition group. Empirical studies will probably lead to a rearrangement of the present division; this should be looked into in connection with the morphological disambiguator (cg grammar).
Adpositions are bajil, birra, gaskal, etc. These are in the pp-sme-lex.txt file. Their tag is +Adp and their lexicon path is:
Root -> Adposition -> Pp -> #
Postpositions are bokte, lusa, etc. These are in the pp-sme-lex.txt file. Their tag is +Po and their lexicon path is:
Root -> Postposition -> Postp -> #
Prepositions are aisttan, earet, etc. These are in the pp-sme-lex.txt file. Their tag is +Pr and their lexicon path is:
Root -> Preposition -> Prep -> #
Interjections are hoi, huh, kyš-kyš, etc. These are in the closed-sme-lex.txt file. Their tag is +Interj and their lexicon path is:
Root -> Interjection -> Ij -> #
There is a file called abbr-sme-lex.txt.