This document describes the Neahttadigisánit configuration
Individual dictionary services are configured via a
- Main application settings: localisations, application subdomain/hostname, etc.
- FST path and format definitions (compounding, tag structure, etc.)
- Languages available
- XML dictionary paths
- Reader configuration
Further configuration beyond this, formatting of lexicon entries, is covered
For examples of configuration files, refer to the configs/ directory.
Main application settings (ApplicationSettings)
The ApplicationSettings key contains the following configuration settings:
app_name - defines the name displayed in the menu bar, Neahttadigisánit, Nettidigisanat, etc.
short_name - a short name for the project, usually corresponding to the subdomain. This must be unique.
default_locale - the default locale to display of those available, when any other locale cannot be detected from the browser.*
default_pair - the default dictionary language pair to display
mobile_default_pair - the default dictionary to display when a mobile browser is detected
locales_available - internationalisations available*. It is important that these are defined as strings with quotes (YAML allows for strings to be defined without quotes as well), otherwise problems occur for Norwegian ("no"). Without the quotes, YAML will process this as a boolean value, but with quotes it is unambiguously a string.
meta_description, meta_keywords - these values will be inserted into the HTML <meta />tags in the header of all pages, and are important for search engines.
admins_to_email - A list of email addresses to send server errors to.
app_meta_title, meta_description, meta_keywrods - Fields for
grouped_nav - For projects with many dictionary pairs, this allows
Development features in ApplicationSettings:
These features may not be entirely finished, so use with care.
new_style_templates - This enables the template
new_mobile_nav - To be used with grouped_nav: this enables a new
Note on locales
When defining locales for localization, it is important to use the
When defining language codes for dictionaries and morphological tools, use the
ApplicationSettings: app_name: "Nettidigisanat" short_name: "sanat" default_locale: "ru" default_pair: ["olo", "fin"] locales_available: - "fi" - "lv" - "ru" - "no" meta_description: > Free, mobile-friendly dictionaries for lots of languages. meta_keywords: > list, of, keywords app_meta_title: > Balto-finnic dictionaries admins_to_email: - "firstname.lastname@example.org" - "email@example.com"
FST path and format definitions (Morphology)
The Morphology key contains a list of languages by ISO 639-2 code, but
tool - path to the morphological tool
file - path to the morphological analysis file
inverse_file - path to the morphological generation file
format - format name ('xfst' currently only supported, but this value would probably also cover hfst)
- options - defined below
The options setting may contain the following keys:
compoundBoundary - the part of a morphological analysis tag that marks the compound boundary, i.e.,: lemma+Tag+Tag+CompoundTag+lemma2+Tag+Tag. This will be used to split a compound word into multiple lemmas.
derivationMarker - the part of a morphological analysis tag that marks a derivation. This is used in sme particularly, to only display non-derived analyses when one exists.
tagsep - the character that separates tags and lemmas
- inverse_tagsep - the same, but for generation
Morphology: liv: tool: '/usr/bin/lookup' file: '/opt/smi/liv/bin/analyser-dict-gt-desc-mobile.xfst' inverse_file: '/opt/smi/liv/bin/generator-dict-gt-norm.xfst' format: 'xfst' options: compoundBoundary: "+Use/Circ#" derivationMarker: "+Der" tagsep: '+' inverse_tagsep: '+'
Make note of the name you use for this key (i.e. liv), because this will be
If you look at existing configuration files, you'll see YAML references used
Tools: xfst_lookup: &LOOKUP '/usr/bin/lookup' opt: &OPT '/opt/smi/' Morphology: olo: tool: *LOOKUP file: [*OPT, '/olo/bin/analyser-dict-gt-desc-mobile.xfst'] inverse_file: [*OPT, '/olo/bin/generator-dict-gt-norm.xfst']
Note how string concatenation is handled in YAML.
Languages covered by the system (Languages)
A list of language ISO codes covered by the system. This may be going away at some point, as its original purpose was language name translations, but for that it turned out better to use Python-Babel and gettext.
Languages: - iso: olo - iso: fin - iso: liv - iso: fkv - iso: izh - iso: nob - iso: est - iso: lav
Optional arguments for each item:
minority_lang:true: this helps sort by minority and majority languages,
XML dictionary paths (Dictionaries)
The dictionaries in the system. For now there are two different types of
Dictionaries is a list of dictionaries, each dictionary defining the following keys:
source - source language ISO (or other short code, i.e., spellrelax variant)
target - target language ISO
Following are some minimal examples:
Dictionaries: - source: olo target: fin path: 'dicts/olo-fin.xml' - source: liv target: fin path: 'dicts/liv.all.xml' - source: liv target: est path: 'dicts/liv.all.xml'
Additional lexicon settings
Each dictionary may specify that paradigms are to be generated asynchronously.
- source: lang_iso target: lang_iso asynchronous_paradigms: true path: 'dicts/dictionary.file.xml'
This causes the page to load, and the paradigm to be requested via a separate
Some languages have optional spell-relax FSTs, either for converting from
These may be marked as 'mobile' too, so that they appear by default when a
The "special" types are thus: mobile and standard. Anything that is
- source: sme target: fin path: 'dicts/sme-fin.all.xml' input_variants: - type: "standard" description: "Standard (<em>áčđŋšŧž</em>) short_name: "sme" - type: "mobile" description: "Social media (with <em>acdnstz</em>)" short_name: "SoMe"
Here, each item in the input_variants key has a type, description
The description string may be marked for translation with the !gettext
- type: "mobile" description: !gettext "Social media (with <em>acdnstz</em>)" short_name: "SoMe"
On-screen keyboard/key palette
The project maintainer may define an on-screen key palette to allow users to
This is configured on a variant-to-variant basis, to reflect that each
- source: sms target: fin input_variants: &spell_relax - type: "standard" description: !gettext "Standard" example: "(ǩ)" onscreen_keyboard: &SMS_KEYS - "â" - "č" - "ʒ" - "ǯ" # etc ...
Note: Skolt Saami has lots of characters in the keyboard, so this example
Each item in the dictionary list may specify keys to include korp search links.
What is essential is that the URL patterns included specify variables that
show_korp_search: True # use http://meyerweb.com/eric/tools/dencoder/ if things are # unreadable or do not work # # Here, whatever the user input is will be replaced into the # following string, marked by USER_INPUT wordform_search_url: "http://gtweb.uit.no/korp/#search=word%7CUSER_INPUT&page=0" # # Here, whatever the input lemma is will be replaced into the # following string, marked by INPUT_LEMMA # # cqp|[lemma = "INPUT_LEMMA"] lemma_search_url: "http://gtweb.uit.no/korp/#page=0&search-tab=2&search=cqp%7C%5Blemma%20%3D%20%22INPUT_LEMMA%22%5D" # Specify a word delimiter for when there are many. # "] [word = " lemma_multiword_delimiter: &korp_lemma_delim "%22%5D%20%5Bword%20%3D%20%22"
Wordform generation and analysis details (Paradigms)
The Paradigms section contains a part of speech (in all caps) and tag forms (minus
If forms will be displayed, but pregenerated by some other rule, there must
Paradigms: olo: PRON: - "Pregenerate" N: - "Sg+Par" - "Sg+Apr" - "Sg+Gen" - "Pl+Par" V: - "Ind+Prs+ScSg1" - "Ind+Prs+ScSg3" - "Ind+Prs+ScPl3" - "Ind+Prt+ScSg1" liv: PRON: - "Pregenerate" N: - "Sg+Nom" - "Sg+Gen" - "Sg+Dat" - ... etc.
In the above example: "Pregenerate" is completely arbitrary and serves no programmatic function, however "PRON" being set is important.
Tag definitions (TagSets, TagTransforms)
Unfortunately it is not yet easy to use the babel and gettext translation
TagTransforms is a dictionary of language pairs, each of which contains
Each language pair is defined as the source language of the dictionary or
TagTransforms: (olo, rus): "V": "v." "N": "s." "A": "adj." (liv, rus): "V": "v." "N": "s." "A": "adj."
NOTE: parentheses, comma, and space are important in the language pair
TagSets aren't particularly relevant within the configuration file, but are
TagSets: sme: pos: ["N", "V", "A", "Pr", "Po", "Num"] type: ["NomAg", "G3", "aktor"]
Reader Settings (ReaderConfig)
This is another top-level configuration. Within this is one key for each
multiword_range (string) - this setting specifies how many words before
word_regex_opts (string) - any regular expression options. Most likely
An example of the word regular expression, which contains most characters
ReaderConfig myv: multiword_lookups: false word_regex: | [\u00C0-\u1FFF\u2C00-\uD7FF\w\-']+ word_regex_opts: "g"
The reader may be configured to allow multiword environments, so, each click