Starting New Language Pairs
This document explains how to start new Neahttadigisánit projects.
Starting a new project
Commands here may assume that you have already configured the virtualenv.
This also assumes that you have already determined a project short name (PROJNAME below).
Create the configuration file
- In the terminal, move to $GTHOME/apps/dicts/nds/neahtta/
- Copy configs/sample.config.yaml.in to configs/PROJNAME.yaml.in
- Add the new file to SVN.
- Open the file in a text editor, and read through the settings. There are comments to guide you.
- When you are done, check in the changes.
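The steps above might look like the following in the terminal (the editor variable and commit message are placeholders, not prescribed by this document):

```shell
# Sketch of the config-file steps; adjust paths and names as needed.
cd $GTHOME/apps/dicts/nds/neahtta
cp configs/sample.config.yaml.in configs/PROJNAME.yaml.in
svn add configs/PROJNAME.yaml.in
$EDITOR configs/PROJNAME.yaml.in   # read through and adjust the settings
svn commit -m "Add config for PROJNAME" configs/PROJNAME.yaml.in
```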
Adding language names
- Open the file configs/language_names.yaml
- For each language in the project, check the following (there are plenty of comments to guide you):
- the NAMES variable contains the ISO codes, each paired with a string marked for translation
- the LOCALISATION_NAMES_BY_LANGUAGE variable contains the ISO codes, each paired with the language's name in that language
- the ISO_TRANSFORMS variable contains any relevant pairs of two-character and three-character ISO codes
Create additional directories and files
TODO:
Fabfile
- Search the file for instances of sample and follow the instructions there.
- DO NOT check this in yet.
Makefile
- Copy the sample section to a new location, uncomment it, and follow the instructions there.
- Be sure to replace instances of sample in your new section with the PROJNAME.
TODO: this is a slightly more complex part, which should eventually be simplified away.
Test the configuration
- In the terminal, move to $GTHOME/apps/dicts/nds/neahtta/
- Activate the virtualenv
- Run fab PROJNAME compile, and wait until the process completes.
- Run fab PROJNAME test_configuration, and wait until the process completes.
- Run fab PROJNAME runserver. If this completes, navigate to the address it reports in a browser.
- Does everything seem to work as intended? If so...
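Assuming the virtualenv lives inside the project directory (an assumption; adjust the path to wherever yours was created), the test sequence above is:

```shell
# Sketch of the test sequence; venv path is an assumption.
cd $GTHOME/apps/dicts/nds/neahtta
source venv/bin/activate
fab PROJNAME compile
fab PROJNAME test_configuration
fab PROJNAME runserver   # then open the reported address in a browser
```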
Check in the configurations
Check in the following config files
- fabfile.py
- dicts/Makefile
- config/language_names.py
- config/PROJNAME.config.yaml.in
Create additional files
TODO: confirm that there isn't anything else required for the base configuration to work.
Server-side configuration
Adding opt directories for FST deployment
If, while editing the Makefile, you are adding new languages under the opt tree, you must also:
- create /opt/smi/LANG/bin
- check the permissions on the directories /opt/smi/LANG/bin and /opt/smi/LANG: they should be owned by the group neahtta, and writeable by that group
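A sketch of the directory setup, assuming root access (or sudo) and the group name mentioned above:

```shell
# Create the opt directories and give the neahtta group write access.
mkdir -p /opt/smi/LANG/bin
chgrp -R neahtta /opt/smi/LANG
chmod -R g+w /opt/smi/LANG
```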
Configuring nginx
TODO:
Installing an init.d script
TODO:
Adding polish
Now that we have a running instance, it's time for some extra configuration.
Flags
For languages that have a translation available in the interface, a flag is displayed. To add one:
- Find the flag .svg page on Wikipedia, e.g. by browsing to the language's article.
- Look for the link just below the image: "This image rendered as PNG in other sizes".
- Click any size, preferably the smallest, and alter the URL path to change the width:
    http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/200px-Flag_of_Nenets_Autonomous_District.svg.png
    ->
    http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/20px-Flag_of_Nenets_Autonomous_District.svg.png
Save the file to static/img/flags/, matching the naming convention of the existing flag images there.
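The width change can be scripted. This sketch (not part of NDS) rewrites the NNNpx- segment of a Wikimedia thumbnail URL, which encodes the rendered width:

```shell
# Rewrite the thumbnail width in a Wikimedia URL: the /NNNpx- path
# segment before the filename controls the rendered pixel width.
url="http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/200px-Flag_of_Nenets_Autonomous_District.svg.png"
echo "$url" | sed 's|/[0-9]*px-|/20px-|'
# prints the same URL with /20px- in place of /200px-
```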
Linguistic configuration (paradigms, etc.)
Configuring a new pair in an existing instance
So far the process is a little complex, but there are things that can be simplified over time.
The following process assumes that a service already exists for at least one other language pair.
1.) Establish a build process for the FSTs and lexicon.
Intended: Programmers
FST build process in dicts/Makefile
This is mainly meant as a convenience for easy developing.
Assuming that the language uses the langs/ infrastructure, adding new build targets is straightforward:
    .PHONY: baakoeh-install
    baakoeh-install: GT_COMPILE_LANGS := sma nob
    baakoeh-install: install_langs_fsts

    .PHONY: baakoeh
    baakoeh: GT_COMPILE_LANGS := sma nob
    baakoeh: baakoeh-lexica compile_langs_fsts

    [... snip ...]
These targets will build analysers as usual, but the *-install targets will also copy the resulting FSTs into place.
In any case, the paths that these targets write to are:

    /opt/smi/LANG/bin/dict-LANG.fst
    /opt/smi/LANG/bin/dict-iLANG-norm.fst
    /opt/smi/LANG/bin/some-LANG.fst
Troubleshooting
If you do not succeed in getting these make targets to work with a new language, compare against the targets for an existing, similar language.
Lexicon in the Makefile
Editing the Makefile is a little tricky. You will need to add a target for building the new lexicon.
Lexica are compiled using a Saxon process, and the Makefile contains existing rules to copy from, such as:
    ZZZ-all.xml: $(GTHOME)/langs/ZZZ/src/morphology/stems/*.xml
    	@echo "**************************"
    	@echo "** Building ZZZ lexicon **"
    	@echo "**************************"
    	@echo "** Backup made (.bak)   **"
    	@echo "**************************"
    	@echo ""
    	-@cp $@ $@.$(shell date +%s).bak
    	mkdir ZZZ
    	cp $^ ZZZ/
    	$(SAXON) inDir=$(pwd)/ZZZ/ > ZZZ-all.xml
    	rm -rf ZZZ/
The above makes a backup copy, then copies the source XML files into a temporary directory and uses the Saxon process to merge them into one file.
This process will be the same if the lexica are in main/words/dicts/ instead; only the source path changes.
Make note of the filename that you intend to output this to, and add it to the project's .yaml configuration in the next step.
2.) Edit the .yaml file for new FSTs and Dictionaries
Intended: Programmers, linguists
Realistically anyone can do this, as long as the build process is already in place.
Once you're done, save the file and attempt to restart the service.
If everything seems to be working, do not check in the config file yet.
''Morphology'' section
This needs to have the paths to the new analysers, for each language in the pair.
In any case, the morphology section should contain a new entry like the
    YYY:
      tool: *LOOKUP
      file: [*OPT, '/YYY/bin/dict-YYY.fst']
      inverse_file: [*OPT, '/YYY/bin/dict-iYYY-norm.fst']
      format: 'xfst'
      options:
        compoundBoundary: "+Use/Circ#"
        derivationMarker: "+Der"
        tagsep: '+'
        inverse_tagsep: '+'
Where YYY is the language ISO. Note the slightly unusual way that paths are formed: the *OPT anchor is joined with the remainder of the path.
''Languages'' section
Add a new entry for the language iso to this list.
''Dictionaries'' section
Here, add a new item to the list of dictionaries, with the path given relative to the application directory:
    Dictionaries:
      # [... snip ...]
      - source: udm
        target: hun
        path: 'dicts/udm-all.xml'
For more information on all the settings for this chunk, see the page on YAML configuration.
3.) Define language names and translation strings
Intended: Linguists
Open the file configs/language_names.py. Here you will need to add entries to the following variables.
NAMES
Here we define the name in English, so that it will be available for translation in the interface:
('sme', _(u"North Sámi")),
The easiest way is to copy an existing line and replace its contents.
The first value should be the language ISO, **or** the language variant identifier.
LOCALISATION_NAMES_BY_LANGUAGE
Here we have the ISO and the language's name in the language.
('sme', u"Davvisámegiella"),
Again, copy and paste a line, and only edit the strings.
ISO_TRANSFORMS
If the language has a two-character ISO as well as a three-character one, add the pair here:
    ('se', 'sme'),
    ('no', 'nob'),
    ('fi', 'fin'),
    ('en', 'eng'),
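As an illustration of how these pairs are used (a sketch, not the actual NDS code), incoming two-character codes can be normalized to the internal three-character codes:

```shell
# Sketch: map two-character codes to three-character codes, mirroring
# the ISO_TRANSFORMS pairs above; unknown codes pass through unchanged.
normalize_iso() {
    case "$1" in
        se) echo sme ;;
        no) echo nob ;;
        fi) echo fin ;;
        en) echo eng ;;
        *)  echo "$1" ;;
    esac
}

normalize_iso se    # prints: sme
normalize_iso sme   # prints: sme (already three characters)
```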
4.) Define tagsets, paradigms, and user-friendly tag relabels
Intended: Linguists
If you wish to have paradigms visible for the language, you will need the following files:
- Tagsets files: NDS Linguistic Settings
- .paradigm files: NDS Linguistic Settings
- .context files: NDS Linguistic Settings
- .relabel files: NDS Linguistic Settings
The easiest approach, of course, is to look at existing languages and copy their files as a starting point.
When done with these steps, be sure to add the new files and directories to SVN.
Server config
Things to consider:
- nginx configuration file
- init.d script: make sure to pick an unused port, change the config file, and install and start the script.
Linguistic requirements
Intended: Linguists
- The XML format of the lexicon should match the sme-nob or sma-nob format as closely as possible.
- morphological analysers (FSTs, described below)
- lists of tag pairs, what is in FST to convert to what users will see
- lists of paradigms for parts of speech
- they can be as detailed as one paradigm per part of speech, or several more specific ones
- description of attributes in the XML that need to be displayed to the user
Collecting the materials...
FSTs
Lookup FST
Lookup FST tags may be in any format, as these can be relabeled for users at a later stage.
    vuovdi
    vuovdi	vuovdit+V+TV+PrsPrc
    vuovdi	vuovdit+V+TV+Imprt+Du2
    vuovdi	vuovdit+V+TV+Der/NomAg+N+Sg+Nom
    vuovdi	vuovdit+V+TV+Der/NomAg+N+Sg+Gen
    vuovdi	vuovdit+V+TV+Der/NomAg+N+Sg+Acc
    vuovdi	vuovdi+Sem/Plc+N+Sg+Nom
    vuovdi	vuovdi+A+Sg+Nom
    vuovdi	vuovdi+A+Sg+Gen
    vuovdi	vuovdi+A+Sg+Acc
    vuovdi	vuovdi+Sem/Hum+N+NomAg+Sg+Nom
    vuovdi	vuovdi+Sem/Hum+N+NomAg+Sg+Gen
    vuovdi	vuovdi+Sem/Hum+N+NomAg+Sg+Acc
It is best to use the analysers compiled with the dict-* build targets described above.
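Given the tagsep '+' from the configuration, a single analysis line can be split into a lemma and its tags. This is an illustrative sketch, not NDS code:

```shell
# Split one analysis on the tagsep '+': the part before the first '+'
# is the lemma, the rest is the tag string.
analysis="vuovdit+V+TV+PrsPrc"
lemma=${analysis%%+*}   # strip from the first '+' onward
tags=${analysis#*+}     # strip up to and including the first '+'
echo "$lemma"           # prints: vuovdit
echo "$tags"            # prints: V+TV+PrsPrc
```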
Spell relax FST
Spell-relaxing FSTs should follow the exact same format as the lookup FST, but accept additional input variation, for example:
- Normalizing non-standard spellings
- Compensating for keyboards without certain characters
- Switching orthographies, e.g., accepting latin for cyrillic, or vice versa.
Whatever the use case, the analyzed lemma must match the lemma form in the lexicon.
Spellrelax behaviour is governed in $GTHOME/langs/$LANG/src/orthography.
- analyser-dict-gt-desc.xfst
- analyser-dict-gt-desc-mobile.xfst
Put variation within the norm, and variation that should be invisible to the user, in the regular analyser; broader spell-relaxing belongs in the mobile variant.
Compound marking
Must have a defined (and consistent) method for marking compounds:
    vuovdedoaibma
    vuovdedoaibma	vuovdi subst. + #doaibma subst. ent. nom.
    vuovdedoaibma	vuovdi adj. + #doaibma subst. ent. nom.
    vuovdedoaibma	vuovdedoaibma subst. ent. nom.
Compound marker:
" + #"
This setting can be configured in the YAML settings file.
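As a sketch of how the marker is used (illustrative, not the NDS implementation), an analysis can be split into its component parts on the compound marker:

```shell
# Split a compound analysis on the " + #" marker (taken from the
# example above) to recover the component analyses, one per line.
analysis="vuovdi subst. + #doaibma subst. ent. nom."
echo "$analysis" | awk -F ' \\+ #' '{ for (i = 1; i <= NF; i++) print $i }'
# prints:
# vuovdi subst.
# doaibma subst. ent. nom.
```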
Derivation marking
Must also be able to specify how all derivational suffixes are marked, because the interface needs to identify them consistently.
    oahpásmahttit
    oahpásmahttit	oahpásmuvvat verb avl.suff.-ahtti verb inf.
    oahpásmahttit	oahpásmuvvat verb avl.suff.-ahtti verb imp. 2.p.flt.
    oahpásmahttit	oahpásmuvvat verb avl.suff.-ahtti verb ind. pres. 1.p.flt.
    oahpásmahttit	oahpásmuvvat verb avl.suff.-ahtti verb nom.Ag. subst. fl. nom.
Derivation marker:
"suff."
This setting can be configured in the YAML settings file.
Generation FST
For generation, currently sme and sma use the typical tag format for GT, but this may differ for other languages.
Lexicon
It may be that the analysis FST does not match the lexicon exactly, and some mapping will be needed.
For instance, the above examples show "noun" and "verb" as part-of-speech markers, which may not match the tags used in the lexicon.