Aspell Documentation
This is a document for the technical documentation of our spellers based on open source, by example from Debian this presently includes Aspell, iSpell and MySpell.
It is hastily written, and will be split up in sensible subdocuments as soon as we have time to do so. Just now it is here as a survival kit for aspell newcomers.
File Overview
The files are in the gt/sme/aspell directory. Cf. the list with explanation:
- Makefile.pre
- This file is used when generating Makefile in distribution package
- configure
- Used for generating the Makefile in distribution package
- info
- The info file is the main file which contains most of the information.
- sme.cmap
- File contains the mappings of the characters into aspell internal format used in our wordlist.
- sme.cset
- File contains the characterset of the characters used in aspell internal format.
- sme.cwl
- This is compressed form of the wordlist.
- sme.dat
- Data file of the language.
- sme.multi
- Listing of the dictionary files
- sme.wl
- Wordlist
- sme_affix.dat
- Contains the affixes of the language.
- sme_phonet.dat ???
- It is possible to define some substitution sequences as more likely than others.
The format of the files
The way it works
The fullform wordlist sme.wl is created from transducer with Xerox finite-state tools.
The Makefile
Here we need some words on what it does and especially what it doesn't.
We postpone the makefile documentation (how it does it).
sme_affix
This file is a manually created affix file for North Sami.
The format of the affix file
Affix is either a prefix or a suffix that can be attached to root words to make other words. Here is an example of how to define a suffix.
SFX I Y 5 SFX I a ii [^ij]a SFX I 0 i [ij]a SFX I e ii e SFX I 0 i [iá] SFX I i ái i
The first line has 4 fields.
- SFX indicates this is a suffix
- I is the name of the character which represents this suffix
- Y indicates it can be combined with other affix entries
- 5 indicates it has 5 entries stored in this prefix
The remaining lines describe the information for the 5 entries.
- SFX indicates this is a suffix
- I is the name of the character which represents this suffix
- a the string of chars to strip off before adding affix
- ii the prefix string to add (a 0 indicates the NULL string)
- [^ij]a the conditions which must be met before prefix can be added
Suffix lexica
The suffix lexica are
- PFX A
- North-, South-, prefixes to proper nouns
- PFX B
- ii-, the negative prefix
- SFX E
- the suffixes for the Prop lexica ACCRA (V-final) and LONDON (odd-syll C-final)
- SFX F
- suffixes for the Prop lexica NYSTØ (heavy vow), C-FI-NEN and CNAME (C-final)
- SFX G
- suffixes for the Prop lexicon BERN (even-syll C-final)
- SFX H
- suffixes for GOAHTI etc, even-syll nouns
- SFX I
- suffixes for STAHTA, IIJA, even-syll nouns without illative vow alt
- SFX J
- suffixes for SEAMU
- SFX K
- suffixes for MALIS, BU/MUS
- SFX L
- suffixes for DIMINC
- SFX N
- suffixes for verbs.
At present, the status is
PFX A Y 9 PFX B Y 1 SFX E Y 31 SFX F Y 69 SFX G Y 18 SFX H Y 156 SFX I Y 121 SFX J Y 214 SFX K Y 70 SFX N Y 149
Status quo:
- Implemented
- Proper nouns, nouns, even-syllabic verbs, contracted verbs (??)
- Missing
- Adjectives are missing completely. Three-syllabic nouns and three-syllabic verbs are missing.
Methods for adding them (how the suffixes have been added so far).
A nice strategy would be to generate suffixes automatically, like the way it was done for the Finnish ispell file. Let us look into that possibility. While waiting, let us consider adding the following chunk to sme_affix.dat.
SFX M 0 SFX M it a it SFX M it an it SFX M it at it SFX M 0 e t SFX M 0 eaba t SFX M 0 eaddji t SFX M 0 eadnot t SFX M 0 eadnu t SFX M 0 eahkket t SFX M 0 eahkki t SFX M 0 eahkkot t SFX M 0 eahkku t SFX M 0 eahppi t SFX M 0 eačča t SFX M 0 eaččan t SFX M 0 eaččat t SFX M 0 eaš t SFX M 0 eažžaba t SFX M 0 eažžabeahtti t SFX M 0 eažžabehtet t SFX M 0 eažžat t SFX M 0 eažžá t SFX M 0 edje t SFX M 0 edne t SFX M 0 ehket t SFX M 0 ehkon t SFX M 0 ehkos t SFX M 0 ehkoset t SFX M 0 ehkoska t SFX M 0 ehkot t SFX M 0 ehpet t SFX M 0 eidde t SFX M 0 eiddet t SFX M 0 eigga t SFX M 0 eimme t SFX M 0 eimmet t SFX M 0 etne t SFX M 0 etnot t SFX M 0 ettiin t SFX M 0 eš t SFX M 0 ežže t SFX M 0 ežžet t SFX M 0 haga t SFX M 0 ii t SFX M 0 in t SFX M 0 it t SFX M 0 ivčče t SFX M 0 ivččen t SFX M 0 ivččet t SFX M 0 ivččii t SFX M 0 ivččiide t SFX M 0 ivččiidet t SFX M 0 ivččiiga t SFX M 0 ivččiime t SFX M 0 ivččiimet t SFX M 0 keahttá t
Initially, we inserted flags in the fsm's source code. Now, the tags A, B, C, etc. are generated via aspell's munch command, intended-to-be made part of the Makefile routine.
The wordlist sme.wl
The list sme.wl contains 1000 lines (why?). For the words A-vitamiidna and some other vitamins, it has full paradigms, whereas for place names it has only stems. Why?
Because sme.wl is only a dummy file?
Because the proper nouns have been enrolled in the sme_suffix.dat file, and therefore are represented only by their stems, whereas the vitamins are unenrolled, and therefore represented by fullforms?
Adding flags to wordlist
Flags are inserted to the wordlist with munch-list aspell command. Munch-list takes the wordlist sme.wl and affix information file sme_affix.dat and generates a munched wordlist with wordroots that have appropriate affix tags.
The munching must be done in small bits, because it consumes RAM very much. The plan is to use 1 million words at a time, and catenate the output into one uniform wordlist.
After the wordlist has been munched, it can be compressed and the output is a compressed wordlist sme.cwl.
The file sme_phonet.dat
In aspell common misspellings are given more weight than other substitutions. This can be a part of sma_affix.dat file (it will then be MySpell compatible), but in aspell it is done as a separate file sme_phonet.dat
Format: The list is a tab-separated file, wshere each line has the form sourceletter(s) whitespace(s) targetletter(s). We would like to add the following type of substitutions:
- Long á vs. short á
- The Sámi letters vs. the nonmodified Latin letters
- t versus its non-reduced consonants
- consonant gradation pairs
- Probably something around loan words:
a á á a c č tt t tt dd ...
by Sjur N. Moshagen, Tomi Pieski, Trond Trosterud