NDS Linguistic Settings
Neahttadigisánit linguistic settings
The documentation here concerns the directory and subdirectories in
- tagsets
- user_friendly_tags
- paradigms
- Contexts
- paradigm layouts
If you update these files, be sure to run the test procedure and restart the
Tagsets
Tagsets are necessary for constructing certain types of rules for manipulating
The pos tagset is also particularly important, because it helps match up
Tagsets are file based because this makes it easier to duplicate them for
Symlinks in this directory are also permitted, if two language variants (i.e.
Tagset files
Each language has its own set of tagsets, and these are defined in a file in:
configs/language_specific_rules/tagsets/
The filename must be ISO.tagset, where ISO is a variable for the 3-character
The file format is YAML, and all that is permitted here is key-value settings,
Example
Here's an example of some tagsets from sme:
    pos:
      - "N"
      - "V"
      - "A"
      - "Pr"
      - "Po"
      - "Num"
      - "CS"
      - "CC"
      - "pron."
      - "subst."
      - "verb"
      - "adj."
      - "konj."
    type:
      - "NomAg"
      - "G3"
      - "aktor"
      - "res."
      - "Prop"
      - "prop."
    number: ["Sg", "Pl"]
Note that YAML allows you to define lists in multiple ways, and strings may be
The above example also shows the two alternate list formats, one with brackets,
Note that comments are also allowed (marked with #), and it may be useful
See YAML documentation for more info.
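To illustrate how a tagset can be used to sort the pieces of an analysis into categories, here is a minimal Python sketch. It is not the NDS implementation: the helper name `classify_tag`, the in-memory `TAGSETS` dictionary (a subset of the sme example above), and the assumption that analyses are split on `+` are all illustrative.

```python
# A subset of the sme tagsets above, written as plain Python data
# (hypothetical stand-in for the parsed YAML file).
TAGSETS = {
    "pos": ["N", "V", "A", "Pr", "Po", "Num", "CS", "CC"],
    "type": ["NomAg", "G3", "Prop"],
    "number": ["Sg", "Pl"],
}


def classify_tag(analysis, tagsets):
    """Map each tagset name to the first piece of the analysis found in it."""
    pieces = analysis.split("+")
    found = {}
    for name, members in tagsets.items():
        for piece in pieces:
            if piece in members:
                found[name] = piece
                break
    return found


print(classify_tag("N+Prop+Sg+Gen", TAGSETS))
# {'pos': 'N', 'type': 'Prop', 'number': 'Sg'}
```

Pieces that belong to no tagset (here `Gen`) are simply left out of the result.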
User-friendly tags
configs/language_specific_rules/user_friendly_tags/*.relabel
Each file is named with a suffix .relabel, but the name may be
Consider that you may have to repeat some tagsets, so maybe using YAML
File structure
The file structure is quite simple, and at most it must contain a list
- source_morphology - The morphology name, usually an ISO, but
- target_ui_language - The language the user is browsing in; must
- tags - A dictionary of tags.
Example
    Relabel:
      - source_morphology: 'kpv'
        target_ui_language: 'eng'
        tags: &some_alias_name
          V: "v."
          N: "n."
          A: "adj."
      - source_morphology: 'kpv'
        target_ui_language: 'fin'
        tags: &another_alias
          V: "v."
          N: "s."
          A: "adj."
          DO_NOT_SHOW: ""
      - source_morphology: "zzz"
        target_ui_language: "www"
        tags:
          <<: *some_alias_name
          <<: *another_alias
The last item in the list shows an example of inheriting from two
    V: "v."
    N: "s."
    A: "adj."
    DO_NOT_SHOW: ""
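The merged result can be reproduced with plain Python dictionaries, which may help when checking what a combined tag set will contain. This is only a sketch of the outcome described above, not of how the YAML loader works. The `relabel` helper, the `+` splitting, and the idea that an empty label hides a tag are assumptions made for illustration.

```python
some_alias_name = {"V": "v.", "N": "n.", "A": "adj."}
another_alias = {"V": "v.", "N": "s.", "A": "adj.", "DO_NOT_SHOW": ""}

# Later dictionaries win on duplicate keys, which reproduces the merged
# result shown above (N comes out as "s.").
merged = {**some_alias_name, **another_alias}


def relabel(analysis, tags):
    """Replace each '+'-separated piece with its user-friendly label."""
    labels = [tags.get(piece, piece) for piece in analysis.split("+")]
    # Assumed convention: an empty label hides the piece entirely.
    return " ".join(label for label in labels if label)


print(merged)
# {'V': 'v.', 'N': 's.', 'A': 'adj.', 'DO_NOT_SHOW': ''}
print(relabel("N+DO_NOT_SHOW+Sg", merged))
# s. Sg
```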
You can even set tags in another location, outside of the Relabel
    Aliases:
      tag_set_one: &some_alias_name
        V: "v."
        N: "n."
        A: "adj."
    Relabel:
      - source_morphology: 'kpv'
        target_ui_language: 'eng'
        tags:
          <<: *some_alias_name
Paradigm generation and paradigm design
The dictionary paradigms are managed by a file and directory structure based
The paradigm folder structure
    paradigms/sme/common_nouns.paradigm
    paradigms/sme/proper_nouns.paradigm
    paradigms/sme/paradigm_group/foo.paradigm
    paradigms/sme/paradigm_group/bar.paradigm
Paradigm files can be ordered in any way you like within the language
Currently, there is no explicit setting for ordering the generation rules, and
Symlinks in this directory are also tolerated, so if multiple language variants
For some more advanced examples, see the rules for sme (particularly,
Paradigm file format (.paradigm)
.paradigm files concern only which forms will be generated. If you wish to
Paradigm files are structured in the following way: one part is YAML, and the
The rules may be very simple, but here is one that combines morphology and
    name: "Proper noun paradigm"
    description: |
      Generate the proper noun if the entry contains sem_type="Prop" or "prop"
    morphology:
      pos: "N"
    lexicon:
      XPATH:
        sem_type: ".//l/@type"
      sem_type:
        - "Prop"
        - "prop"
    --
    {{ lemma }}+N+Prop+Sem/Plc+Sg+Gen
    {{ lemma }}+N+Prop+Sem/Plc+Sg+Ill
    {{ lemma }}+N+Prop+Sem/Plc+Sg+Loc
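The two-part structure can be made concrete with a short Python sketch that separates the YAML part from the template part at the `--` line. The function name `split_paradigm` and the shortened sample text are hypothetical; the real loader may treat whitespace or multiple separators differently.

```python
PARADIGM_TEXT = """\
name: "Proper noun paradigm"
morphology:
  pos: "N"
--
{{ lemma }}+N+Prop+Sem/Plc+Sg+Gen
{{ lemma }}+N+Prop+Sem/Plc+Sg+Ill
"""


def split_paradigm(text):
    """Return (yaml_part, template_lines), split at the first '--' line."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "--":
            header = "\n".join(lines[:index])
            # Keep only non-empty template lines.
            body = [item for item in lines[index + 1:] if item.strip()]
            return header, body
    raise ValueError("no '--' separator found")


header, body = split_paradigm(PARADIGM_TEXT)
print(body)
```

The YAML part would then be handed to a YAML parser and the template lines to the template engine.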
YAML settings:
- name - A short name to display when the service is loading (required)
- description (optional) - More words for other developers
- morphology, lexicon - one of these must be present, but both may be
Conditions together
Operating together, what the conditions essentially say is that for any
Morphology conditions
Conditions that are possible to match on are set up in a variety of ways.
In the following example, the condition applies if the PoS is V, and if
    morphology:
      pos: "V"
      infinitive: true
Above we see that either a string value "V" may be specified, or boolean
    morphology:
      pos: "V"
      infinitive:
        - "Inf1"
        - "Inf2"
The morphology condition also supports matching of whole tags, using the
    morphology:
      tag:
        - "V+Inf1"
        - "V+Inf2"
One additional keyword is lemma, available to both morphology and
    morphology:
      lemma: "diehtit"
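The three value types seen in the examples above (string, list, boolean) can be compared against an analysis roughly as follows. This is a sketch under stated assumptions, not the NDS matcher: it assumes that all conditions must hold together, that a list means "any of these", and that `true` means "some member of that tagset is present". The function names and the `analysis_values` dictionary are hypothetical.

```python
def condition_matches(expected, actual):
    """Check one condition value against the value found in an analysis.

    expected: str, list of str, or bool (as in the YAML examples above).
    actual:   the matching tag piece, or None if the analysis has none.
    """
    if isinstance(expected, bool):
        # Assumed meaning: true = some member of the tagset is present.
        return (actual is not None) == expected
    if isinstance(expected, list):
        return actual in expected
    return actual == expected


def morphology_matches(conditions, analysis_values):
    """Assumed rule: every condition must hold for the paradigm to apply."""
    return all(
        condition_matches(expected, analysis_values.get(key))
        for key, expected in conditions.items()
    )


rule = {"pos": "V", "infinitive": ["Inf1", "Inf2"]}
print(morphology_matches(rule, {"pos": "V", "infinitive": "Inf1"}))  # True
print(morphology_matches(rule, {"pos": "N"}))                        # False
```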
NB: if there are problems matching a tag set, make sure that it is defined in
Lexicon conditions
The lexicon is also usable for providing conditions for a particular
For example, assuming we have some place-name lexicon entries like the
    <e>
      <lg>
        <l sem_type="Plc">Minneapolis</l>
      </lg>
      ... etc ...
    </e>
A rule for the above might look like the following:
    lexicon:
      XPATH:
        sem_type: ".//l/@sem_type"
      sem_type: "Plc"
Note that you may also specify lists, as with the above:
    lexicon:
      XPATH:
        sem_type: ".//l/@sem_type"
      sem_type:
        - "Plc"
        - "Something"
Paradigm definition
Paradigm definition is mostly plaintext, but since it is a template, it
    {{ lemma }}+N+Sg+Nom
Certain variables are available by default:
- lemma
Additional variables are available as they are defined by the conditions, and
    lexicon:
      XPATH:
        some_attribute: ".//l/@some_attribute"
      some_attribute:
        - "Foo"
        - "Bar"
    --
    {{ lemma }}+Adj+{{ some_attribute }}
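To show what substituting variables into a template line amounts to, here is a deliberately tiny Python stand-in that only fills `{{ name }}` placeholders. It is not the template engine itself and does not handle `{% if %}` blocks; the `render` function and the lemma value `example` are placeholders for illustration.

```python
import re


def render(template, variables):
    """Fill {{ name }} placeholders; a tiny stand-in for a template engine."""
    def replace(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


line = "{{ lemma }}+Adj+{{ some_attribute }}"
print(render(line, {"lemma": "example", "some_attribute": "Foo"}))
# example+Adj+Foo
```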
It is also possible to specify additional variables that are not used in the
    lexicon:
      XPATH:
        some_attribute: ".//l/@some_attribute"
        another: ".//l/@another_attribute"
      some_attribute:
        - "Foo"
        - "Bar"
    --
    {% if another %}
    {{ lemma }}+Adj+{{ some_attribute }}+{{ another }}
    {% else %}
    {{ lemma }}+Adj+{{ some_attribute }}
    {% endif %}
Paradigm layouts and presentation
Paradigm layouts are defined in a similar way as paradigm generation: the file
--
. As in the YAML section, spacing
First we will look at an example, and then following sections will describe all
An example, and overview:
TODO: actual working example definition from itwêwina, as well as screenshots
Most of the following example should look familiar based on the above
    name: "basic"
    layout:
      type: "basic"
    morphology:
      pos: V
      animacy:
        - AI
        - TI
    --
    | "#"  | "Sg"    | "Pl"     |
    | "1p" | Prs+1Sg | Prs+1Pl  |
    | "2p" | Prs+2Sg | Prs+12Pl |
    |      |         | Prs+2pl  |
    | "3p" | Prs+3Sg | Prs+3Pl  |
    | "4"  | Prs+4Sg |          |
In the example above, the first half shows that the paradigm is applied when
Some additional information about the layout is also defined, the name, and
Next is the actual layout:
- spacing is important
- columns must match up
- columns are marked with the pipe character |.
- leave one space between the pipe character and any content
- each row must begin with and end with |
- the first row should not include any cells spanning multiple columns
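The rules above can be checked mechanically. The following Python sketch splits each row into cells and rejects rows that do not begin and end with a pipe. It ignores cell spanning and alignment markers, and the name `parse_row` is hypothetical rather than part of NDS.

```python
def parse_row(row):
    """Split one layout row into stripped cell strings."""
    row = row.rstrip()
    if not (row.startswith("|") and row.endswith("|")):
        raise ValueError("each row must begin and end with '|'")
    return [cell.strip() for cell in row[1:-1].split("|")]


LAYOUT = [
    '| "#"  | "Sg"    | "Pl"     |',
    '| "1p" | Prs+1Sg | Prs+1Pl  |',
    '| "4"  | Prs+4Sg |          |',
]
table = [parse_row(row) for row in LAYOUT]
print(table[2])
# ['"4"', 'Prs+4Sg', '']
```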
Associating the layout with a generated paradigm
There are two ways to target the layout to a paradigm, the first is the exact
The second is to associate the .layout file with a .paradigm file in
    name: "verb paradigm"
    paradigm: "some-paradigm-file.paradigm"
    layout:
      type: "basic"
Layout options (YAML)
Name is mostly used to render the startup log message as settings
description may optionally be set. This will be displayed to users
    name: "transitive"
    description: "This is the transitive conjugation."
The following shows multiple languages, note that if one translation does not
    name: "transitive"
    description:
      eng: "This is the transitive conjugation."
      fra: "C'est ne pas une pipe."
YAML has several conventions for specifying strings: YAML strings.
Optional settings within layout
The following settings do not need to be defined at all, but help determine the
- type - (string) specify the type of the layout and thus its title in the tab menu if multiple layouts are matched.
- no_form - (string) If no form results from paradigm generation, whatever is in the cell passes through by default. Otherwise, set what will be shown instead, e.g. a space " " for nothing, "-" for a dash, etc.
- value_separator - the default is an HTML line break (<br />); other ideas: a comma, etc.
Layout specification, features, options
Consider the table in the following example .paradigm file and .layout
verbs.paradigm contains:
    name: "basic"
    morphology:
      pos: V
    --
    {{ lemma }}+V+Prs+1Sg
    {{ lemma }}+V+Prs+2Sg
    {{ lemma }}+V+Prs+1Pl
    {{ lemma }}+V+Prs+2Pl
    {{ lemma }}+V+Prs+3P
    {{ lemma }}+V+Prt+1Sg
    {{ lemma }}+V+Prt+2Sg
    {{ lemma }}+V+Prt+1Pl
    {{ lemma }}+V+Prt+2Pl
    {{ lemma }}+V+Prt+3P
verbs.layout contains:
    name: "basic"
    morphology:
      pos: V
    --
    |      | "Sg"    | "Pl"    |
    | "1p" | Prs+1Sg | Prs+1Pl |
    | "2p" | Prs+2Sg | Prs+2Pl |
    | "3p" | Prs+3P            |
After the .paradigm file is sent off to generation, two things occur here
- Some values (quoted) are treated as strings, and rendered directly
- Tags are matched against the generated paradigm, and inserted into the layout. Multiple forms will be inserted if multiple forms match.
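The two behaviours just listed, together with the no_form and value_separator settings described earlier, can be sketched in Python as below. This is an illustration, not the actual renderer: `fill_cell` is a hypothetical name, the generated paradigm is assumed to be a list of (tag, wordform) pairs, and the wordforms are placeholders rather than real generator output.

```python
# Placeholder output of paradigm generation: (tag, wordform) pairs.
GENERATED = [
    ("V+Prs+1Sg", "form-a"),
    ("V+Prs+1Sg", "form-b"),
    ("V+Prs+2Sg", "form-c"),
]


def fill_cell(cell, generated, no_form=None, value_separator="<br />"):
    """Resolve one layout cell against generated (tag, wordform) pairs."""
    if not cell:
        return ""                    # empty cells stay empty
    if len(cell) >= 2 and cell.startswith('"') and cell.endswith('"'):
        return cell[1:-1]            # quoted values are rendered directly
    forms = [form for tag, form in generated if cell in tag]
    if forms:
        return value_separator.join(forms)
    # With no matching form, the cell text passes through unless no_form is set.
    return cell if no_form is None else no_form


print(fill_cell("Prs+1Sg", GENERATED))              # form-a<br />form-b
print(fill_cell('"Sg"', GENERATED))                 # Sg
print(fill_cell("Prs+3P", GENERATED, no_form="-"))  # -
```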
Matching wordforms
The default behavior is to match the value in the cell against all tags, as a
    | "1p" | V+Prs+1Sg |
    | "2p" | V+Prs+2Sg |
Two features borrowed from regular expression land are available: ^match
    | "1p" | Prs+1Sg$ |
    | "2p" | Prs+2Sg$ |
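A rough Python sketch of how such matching could behave is given below. It assumes that an unmarked cell value matches anywhere in the tag, that a leading `^` anchors the match to the start of the tag, and that a trailing `$` anchors it to the end, as in regular expressions. The function name `cell_matches` and the `+Extra` tag are hypothetical.

```python
def cell_matches(cell, tag):
    """Substring match by default; ^ and $ anchor the start and end."""
    at_start = cell.startswith("^")
    at_end = cell.endswith("$")
    pattern = cell[1:] if at_start else cell
    pattern = pattern[:-1] if at_end else pattern
    if at_start and at_end:
        return tag == pattern
    if at_start:
        return tag.startswith(pattern)
    if at_end:
        return tag.endswith(pattern)
    return pattern in tag


print(cell_matches("Prs+1Sg", "V+Prs+1Sg+Extra"))   # True
print(cell_matches("Prs+1Sg$", "V+Prs+1Sg+Extra"))  # False
print(cell_matches("Prs+1Sg$", "V+Prs+1Sg"))        # True
```

Anchoring with `$` is useful when a shorter tag would otherwise also match longer tags that merely contain it.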
TODO: examples from myv
Heading values, and heading internationalization
"quoted" values will be passed through as headings. You can also access
Cell spanning
Cell spanning is accomplished by leaving out the pipe character.
    | "Label" | "Label"   | "Label"         |
    | "Label" | +Some+Tag | +Some+Other+Tag |
    | "Label" | +Some+Tag                   |
    | "Label" | +Some+Tag | +Some+Other+Tag |
This also depends on a clearly-defined column layout. Cell-spanning is not
As long as the pipe is missing, the value may be anywhere within.
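One way to picture this is to treat the pipe positions of the first row as the column grid and count how many grid columns each later cell covers. The Python sketch below does exactly that. It is an assumption about how spans could be detected, not the NDS parser, and the names `pipe_columns` and `cells_with_spans` are hypothetical.

```python
def pipe_columns(row):
    """Character positions of every pipe in a row."""
    return [i for i, ch in enumerate(row) if ch == "|"]


def cells_with_spans(first_row, row):
    """Return (text, colspan) pairs, using first_row's pipes as the grid."""
    grid = pipe_columns(first_row)
    # Grid positions where this row actually has a pipe.
    present = [p for p in grid if p < len(row) and row[p] == "|"]
    cells = []
    for left, right in zip(present, present[1:]):
        colspan = grid.index(right) - grid.index(left)
        cells.append((row[left + 1:right].strip(), colspan))
    return cells


HEAD = '| "Label" | "Label"   | "Label"         |'
WIDE = '| "Label" | +Some+Tag                   |'
print(cells_with_spans(HEAD, WIDE))
# [('"Label"', 1), ('+Some+Tag', 2)]
```

This is also why the first row should not contain spanning cells: it is the only place the full set of column boundaries is visible.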
Cell text alignment
Aligning text or values within the cell is a matter
    |: "Label" | +Tag     :| +Some+Other+Tag |
    |: "Label" |: +Tag                      :|
    |: "Label" | +Tag      | +Some+Other+Tag |
In most cases you will not need these, because the default style should
Contexts
Contexts are for applying additional helpful information to a generated
Contexts are defined within .context files in the corresponding language
File structure
Context files are simply a YAML list, and each item is a dictionary
- entry_context - (string) matches the @context attribute on each <l />
- tag_context - (string) matches the tag used in generation. Must be
- template - Jinja-format string, which accepts certain variables:
Template variables allowed:
- word_form - inserts the wordform
- context - inserts the context (usually not necessary)
Some examples:
    - entry_context: "sii"
      tag_context: "V+Ind+Prs+Pl3"
      template: "(odne sii) {{ word_form }}"
The above would thus generate:
(odne sii) deaivvadit
Example without entry_context:
    - entry_context: None
      tag_context: "V+Ind+Prs+Sg1"
      template: "(daan biejjien manne) {{ word_form }}"
Note the lack of quotes around "None".
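Putting the pieces together, applying a context amounts to finding the item whose entry_context and tag_context both match and filling its template. The Python sketch below mirrors the two examples above with plain dictionaries. It is illustrative only: `apply_context` is a hypothetical name, a missing entry_context is represented as Python `None`, and the template is filled with a simple string replacement instead of a template engine.

```python
CONTEXTS = [
    {"entry_context": "sii", "tag_context": "V+Ind+Prs+Pl3",
     "template": "(odne sii) {{ word_form }}"},
    {"entry_context": None, "tag_context": "V+Ind+Prs+Sg1",
     "template": "(daan biejjien manne) {{ word_form }}"},
]


def apply_context(word_form, tag, entry_context=None):
    """Wrap a generated form in the first context matching entry and tag."""
    for ctx in CONTEXTS:
        if ctx["entry_context"] == entry_context and ctx["tag_context"] == tag:
            return ctx["template"].replace("{{ word_form }}", word_form)
    return word_form  # no context applies


print(apply_context("deaivvadit", "V+Ind+Prs+Pl3", "sii"))
# (odne sii) deaivvadit
```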
Otherwise, see the checked in files for more examples.