Starting New Language Pairs
This document explains how to start new Neahttadigisánit projects.
Starting a new project
Commands here may assume that you have already configured the virtual
Also this assumes you have determined a project short name already. These can
Create the configuration file
- In the terminal, move to $GTHOME/apps/dicts/nds/neahtta/
- Copy configs/sample.config.yaml.in to configs/PROJNAME.yaml.in
- Add the new file to SVN.
- Open the file in a text editor, and read through the settings. There are
numerous comments to guide you.
- When you are done, check in the changes.
Adding language names
- Open the file configs/language_names.py
- For each language in the project, check the following (there are plenty of comments to guide you):
- the NAMES variable contains the ISO codes, and then a string marked for
localization with the language names in English
- the LOCALISATION_NAMES_BY_LANGUAGE contains ISO codes, and the language's
own name / endonym
- the ISO_TRANSFORMS contains any potential pairs of two-character and
three-character ISO codes
Create additional directories and files
TODO:
Fabfile
- Search the file for instances of sample and follow the instructions there.
- DO NOT check this in yet.
Makefile
- Copy the sample section to a new location in the file, uncomment it, and follow the instructions there.
- Be sure to replace instances of sample in your new section with the PROJNAME.
TODO: this is a slightly more complex part, which I wish to do away with by
Test the configuration
- In the terminal, move to $GTHOME/apps/dicts/nds/neahtta/
- Activate the virtualenv
- Run fab PROJNAME compile, and wait until the process completes.
- Run fab PROJNAME test_configuration, and wait until the process
completes. Check FST path names and ensure that the build process moved all the files to the proper location.
- Run fab PROJNAME runserver. If this completes, navigate to the
address that you see at the end of the output in your browser.
- Does everything seem to work as intended? If so...
Check in the configurations
Check in the following config files:
- fabfile.py
- dicts/Makefile
- configs/language_names.py
- configs/PROJNAME.config.yaml.in
Create additional files
TODO: confirm that there isn't anything required for the base configuration to
Server-side configuration
Adding opt directories for FST deployment
If, while editing the Makefile, you are creating new languages in the opt
- create /opt/smi/LANG/bin
- check the permissions on /opt/smi/LANG and /opt/smi/LANG/bin: both directories should be owned by the group neahtta, and writable by that group
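The permission check can be scripted. The following is only a minimal sketch: the function names are hypothetical and not part of the project, and it assumes a POSIX system where the group database is readable.

```python
import grp
import os
import stat


def group_can_write(path, group_name):
    """Return True if path belongs to group_name and is group-writable."""
    info = os.stat(path)
    try:
        owning_group = grp.getgrgid(info.st_gid).gr_name
    except KeyError:
        # The gid has no entry in the group database.
        return False
    return owning_group == group_name and bool(info.st_mode & stat.S_IWGRP)


def check_opt_dirs(lang, group_name="neahtta", opt_root="/opt/smi"):
    """Map each expected directory to True/False, or None if it is missing."""
    lang_dir = os.path.join(opt_root, lang)
    results = {}
    for path in (lang_dir, os.path.join(lang_dir, "bin")):
        if os.path.isdir(path):
            results[path] = group_can_write(path, group_name)
        else:
            results[path] = None
    return results


if __name__ == "__main__":
    labels = {True: "ok", False: "wrong group or not group-writable", None: "missing"}
    for path, status in check_opt_dirs("sma").items():
        print(path, labels[status])
```

A result of None means the directory still has to be created; False means the group or the mode needs to be changed.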
Configuring nginx
TODO:
Installing an init.d script
TODO:
Added polish
Now that we have a running instance, it's time for some extra configuration.
Flags
For languages that have a translation available in the interface, a flag is
- Find the flag .svg page on Wikipedia, e.g. by browsing to the language
page or region page, and click on the flag: http://en.wikipedia.org/wiki/File:Flag_of_Nenets_Autonomous_District.svg
- Look for the link just below: "This image rendered as PNG in other sizes:
200px, 500px, 1000px, 2000px."
- Click any size, preferably the smallest, and alter the URL path to change
the width of the flag to 20px:
http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/200px-Flag_of_Nenets_Autonomous_District.svg.png
->
http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/20px-Flag_of_Nenets_Autonomous_District.svg.png
Save the file to static/img/flags/, and match the path name so that it is
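The URL rewrite shown above can also be done programmatically. This is a small sketch; thumbnail_url is a hypothetical helper, and it assumes the thumbnail URL ends in a file name of the form NNNpx-NAME, as in the example.

```python
import re


def thumbnail_url(url, width=20):
    """Replace the 'NNNpx-' prefix of the final path segment with a new width."""
    return re.sub(r"/\d+px-([^/]+)$", r"/%dpx-\1" % width, url)


if __name__ == "__main__":
    original = (
        "http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/"
        "Flag_of_Nenets_Autonomous_District.svg/"
        "200px-Flag_of_Nenets_Autonomous_District.svg.png"
    )
    print(thumbnail_url(original))
```

Only the last path segment is touched, so other digits elsewhere in the URL are left alone.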
Linguistic configuration (paradigms, etc.)
Configuring a new pair in an existing instance
So far the process is a little complex, but there are things that can be
This following process assumes that there is already a service existing
1.) Establish a build process for the FSTs and lexicon.
Intended: Programmers
FST build process in dicts/Makefile
This is mainly meant as a convenience during development.
Assuming that the language uses the langs/ infrastructure, adding
.PHONY: baakoeh-install
baakoeh-install: GT_COMPILE_LANGS := sma nob
baakoeh-install: install_langs_fsts
.PHONY: baakoeh
baakoeh: GT_COMPILE_LANGS := sma nob
baakoeh: baakoeh-lexica compile_langs_fsts
[... snip ...]
These targets will build analysers as usual, but the *-install targets
In any case, the targets that these will write to are
/opt/smi/LANG/bin/dict-LANG.fst
/opt/smi/LANG/bin/dict-iLANG-norm.fst
/opt/smi/LANG/bin/some-LANG.fst
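To confirm that an install target wrote everything, the three paths above can be checked per language. A minimal sketch, assuming the file name patterns are exactly those listed; the helper names are hypothetical.

```python
import os


def expected_fst_paths(lang, opt_root="/opt/smi"):
    """Build the three install paths listed above for one language code."""
    bin_dir = os.path.join(opt_root, lang, "bin")
    return [
        os.path.join(bin_dir, "dict-%s.fst" % lang),
        os.path.join(bin_dir, "dict-i%s-norm.fst" % lang),
        os.path.join(bin_dir, "some-%s.fst" % lang),
    ]


def missing_fsts(langs, opt_root="/opt/smi"):
    """Return every expected FST path that does not exist as a file."""
    return [
        path
        for lang in langs
        for path in expected_fst_paths(lang, opt_root)
        if not os.path.isfile(path)
    ]


if __name__ == "__main__":
    # The languages from the baakoeh example targets.
    for path in missing_fsts(["sma", "nob"]):
        print("missing:", path)
```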
Troubleshooting
If you do not succeed in getting these make targets to work with a new
Lexicon in the Makefile
Editing the Makefile is a little tricky. You will need to add a target
Lexica are compiled using a Saxon process, and the Makefile contains
# Build the combined ZZZ lexicon from all stem files.
# Any existing output is backed up first; a failing backup is ignored.
ZZZ-all.xml: $(GTHOME)/langs/ZZZ/src/morphology/stems/*.xml
	@echo "**************************"
	@echo "** Building ZZZ lexicon **"
	@echo "**************************"
	@echo "** Backup made (.bak) **"
	@echo "**************************"
	@echo ""
	-@cp $@ $@.$(shell date +%s).bak
	mkdir -p ZZZ
	cp $^ ZZZ/
	$(SAXON) inDir=$(CURDIR)/ZZZ/ > ZZZ-all.xml
	rm -rf ZZZ/
The above makes a copy of the XML files, and then uses the Saxon process
This process will be the same if the lexica are in main/words/dicts/,
Make note of the filename that you intend to output this to, and add it
2.) Edit the .yaml file for new FSTs and Dictionaries
Intended: Programmers, linguists
Realistically anyone can do this as long as the build process is
Once you're done, save the file and attempt to restart the service.
If everything seems to be working, do not check in the config file
Morphology section
This needs to have the paths to the new analysers, for each language
In any case, the morphology section should contain a new entry like the
YYY:
  tool: *LOOKUP
  file: [*OPT, '/YYY/bin/dict-YYY.fst']
  inverse_file: [*OPT, '/YYY/bin/dict-iYYY-norm.fst']
  format: 'xfst'
  options:
    compoundBoundary: "+Use/Circ#"
    derivationMarker: "+Der"
    tagsep: '+'
    inverse_tagsep: '+'
Where YYY is the language ISO path. Note the weird way that forming
Languages section
Add a new entry for the language's ISO code to this list.
Dictionaries section
Here, add a new item to the list of dictionaries, relative to the
Dictionaries:
  # [... snip ...]
  - source: udm
    target: hun
    path: 'dicts/udm-all.xml'
For more information on all the settings for this chunk, see the page on YAML
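Before restarting the service, a new Dictionaries entry can be sanity-checked once the YAML has been loaded into a plain dict. This is a sketch under assumptions: the three keys are taken from the example above, whether further keys are required is not covered here, and check_dictionary_entry and base_dir are hypothetical names.

```python
import os

# Keys that appear in the example entry above.
REQUIRED_KEYS = ("source", "target", "path")


def check_dictionary_entry(entry, base_dir="."):
    """Return a list of problems found in one Dictionaries entry."""
    problems = ["missing key: %s" % key for key in REQUIRED_KEYS if key not in entry]
    path = entry.get("path")
    if path and not os.path.isfile(os.path.join(base_dir, path)):
        problems.append("file not found: %s" % path)
    return problems


if __name__ == "__main__":
    entry = {"source": "udm", "target": "hun", "path": "dicts/udm-all.xml"}
    for problem in check_dictionary_entry(entry):
        print(problem)
```

An empty list means the entry has all three keys and its lexicon file exists relative to base_dir.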
3.) Define language names and translation strings
Intended: Linguists
Open the file configs/language_names.py. Here you will need to add the
NAMES
Here we define the name in English, so that it will be available for
('sme', _(u"North Sámi")),
The easiest way is to copy one existing line, and replace the contents
The first value should be the language ISO, **or** the language variant
LOCALISATION_NAMES_BY_LANGUAGE
Here we have the ISO and the language's name in the language.
('sme', u"Davvisámegiella"),
Again, copy and paste a line, and only edit the strings.
ISO_TRANSFORMS
If the language has a two-character ISO as well as a three-character
('se', 'sme'),
('no', 'nob'),
('fi', 'fin'),
('en', 'eng'),
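The three variables are easy to let drift apart, so a quick consistency check may help. The sketch below uses stand-in data copied from the examples above; the list-of-tuples shape, the identity function standing in for the localization marker, and both helper functions are assumptions, not the contents of the real language_names.py.

```python
def _(text):
    """Stand-in for the localization marker; returns the string unchanged."""
    return text


# Stand-in data copied from the examples above.
NAMES = [
    ('sme', _(u"North Sámi")),
]
LOCALISATION_NAMES_BY_LANGUAGE = [
    ('sme', u"Davvisámegiella"),
]
ISO_TRANSFORMS = [
    ('se', 'sme'),
    ('no', 'nob'),
    ('fi', 'fin'),
    ('en', 'eng'),
]


def normalize_iso(code, transforms=ISO_TRANSFORMS):
    """Map a two-character code to its three-character code, if one is listed."""
    return dict(transforms).get(code, code)


def missing_endonyms(names, endonyms):
    """Return ISO codes that have an English name but no endonym entry."""
    known = {iso for iso, _name in endonyms}
    return [iso for iso, _name in names if iso not in known]


if __name__ == "__main__":
    print(normalize_iso('se'))
    print(missing_endonyms(NAMES, LOCALISATION_NAMES_BY_LANGUAGE))
```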
4.) Define tagsets, paradigms, and user-friendly tag relabels
Intended: Linguists
If you wish to have paradigms visible in the language, you will need two
- Tagsets files: NDS Linguistic Settings
- .paradigm files: NDS Linguistic Settings
- .context files: NDS Linguistic Settings
- .relabel files: NDS Linguistic Settings
The easiest means of course is to look at existing languages and copy
When done with these steps, be sure to add the new files and directories
Server config
Things to consider:
- nginx configuration file
- init.d script: make sure to pick an unused port, change the config file, and
also the path to the pid file, otherwise bad things happen
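One way to check whether a candidate port is unused on the server is to try binding to it. This is a minimal sketch; port_is_free is a hypothetical helper, the port number in the usage line is only a placeholder, and a free port at check time can still be taken later by another service.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    candidate = 2323  # placeholder value, not a port used by any real instance
    print(candidate, "free" if port_is_free(candidate) else "in use")
```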
Linguistic requirements
Intended: Linguists
- The XML format of the lexicon should match sme-nob or sma-nob format as
closely as possible (words/dicts/smanob/src)
- morphological analysers (FSTs, described below)
- lists of tag pairs, mapping the tags in the FST to what users will see
- lists of paradigms for parts of speech
- they can be either as detailed as one paradigm per part of speech, or several
paradigms for parts of speech and varying sub-types. These will have to be marked in the lexicon in some way, for example, plural-only proper noun paradigms for North Saami
- description of attributes in the XML that need to be displayed to the user
Collecting the materials...
FSTs
Lookup FST
Lookup FST tags may be in any format, as these can be relabeled for users at a
vuovdi
vuovdi vuovdit+V+TV+PrsPrc
vuovdi vuovdit+V+TV+Imprt+Du2
vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Nom
vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Gen
vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Acc
vuovdi vuovdi+Sem/Plc+N+Sg+Nom
vuovdi vuovdi+A+Sg+Nom
vuovdi vuovdi+A+Sg+Gen
vuovdi vuovdi+A+Sg+Acc
vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Nom
vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Gen
vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Acc
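To illustrate how output like the above breaks down, here is a small sketch that splits one analysis line into wordform, lemma, and tags. It assumes the wordform and the analysis are separated by whitespace and that tags are joined with "+"; parse_analysis is a hypothetical helper, not the project's own parser.

```python
def parse_analysis(line, tagsep="+"):
    """Split 'wordform lemma+Tag+Tag' into (wordform, lemma, [tags])."""
    wordform, analysis = line.split(None, 1)
    lemma, *tags = analysis.strip().split(tagsep)
    return wordform, lemma, tags


if __name__ == "__main__":
    for line in (
        "vuovdi vuovdit+V+TV+PrsPrc",
        "vuovdi vuovdi+Sem/Plc+N+Sg+Nom",
    ):
        print(parse_analysis(line))
```

The lemma is what has to be matched against the lexicon, which is why the same wordform can lead to several different entries.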
It is best to use the analysers compiled with
Spell relax FST
Spell relaxing FSTs should follow the exact same format as the lookup FST, but
- Normalizing non-standard spellings
- Compensating for keyboards without certain characters
- Switching orthographies, e.g., accepting Latin for Cyrillic, or vice versa.
Whatever the use case, the analyzed lemma must match up with the lexicon
Spellrelax behaviour is governed in $GTHOME/langs/$LANG/src/orthography.
- analyser-dict-gt-desc.xfst
- analyser-dict-gt-desc-mobile.xfst
Put variation within the norm, and variation invisible to the user,
Compound marking
Must have a defined (and consistent) method for marking compounds:
vuovdedoaibma
vuovdedoaibma vuovdi subst. + #doaibma subst. ent. nom.
vuovdedoaibma vuovdi adj. + #doaibma subst. ent. nom.
vuovdedoaibma vuovdedoaibma subst. ent. nom.
Compound marker:
" + #"
This setting can be configured in the YAML settings file.
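A consistent marker matters because the analysis string can then be split mechanically into its parts. The sketch below shows the idea with the marker from this example; both functions are hypothetical, and it assumes each part starts with its lemma followed by a space.

```python
def split_compound(analysis, marker=" + #"):
    """Split one analysis string into its compound parts."""
    return analysis.split(marker)


def compound_lemmas(analysis, marker=" + #"):
    """Return the first token (the lemma) of each compound part."""
    return [part.split()[0] for part in split_compound(analysis, marker)]


if __name__ == "__main__":
    print(compound_lemmas("vuovdi subst. + #doaibma subst. ent. nom."))
    print(compound_lemmas("vuovdedoaibma subst. ent. nom."))
```

An analysis without the marker comes back as a single part, so lexicalized compounds are still looked up as one lemma.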
Derivation marking
Must also be able to specify how all derivational suffixes are marked, because
oahpásmahttit
oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb inf.
oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb imp. 2.p.flt.
oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb ind. pres. 1.p.flt.
oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb nom.Ag. subst. fl. nom.
Derivation marker:
"suff."
This setting can be configured in the YAML settings file.
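As with compounds, a fixed derivation marker lets the analysis be cut at a predictable point. The following sketch separates the base part from the derived part at the first marker; split_derivation is a hypothetical helper, and how the application itself uses the marker is not shown here.

```python
def split_derivation(analysis, marker="suff."):
    """Split an analysis at the first derivation marker.

    Returns (base, derived); derived is None when no marker is present.
    """
    base, found, derived = analysis.partition(marker)
    if not found:
        return analysis, None
    return base.strip(), derived.strip()


if __name__ == "__main__":
    print(split_derivation("oahpásmuvvat verb avl.suff.-ahtti verb inf."))
```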
Generation FST
For generation, currently sme and sma use the typical tag format for GT, but,
Lexicon
It may be that the analysis FST does not match absolutely with the lexicon, and
For instance, the above examples show "noun" and "verb" as

