This document explains how to start new Neahttadigisánit projects. !!! Starting a new project Commands here may assume that you have already configured the virtual environment. If you are not sure, you probably have not done so. See [Developing in NDS and virtualenv|NDSDeveloping.html]. Also this assumes you have determined a project short name already. These can be changed at a later time, but with some amount of find/replace work, and moving directories. Below, PROJNAME will stand in for this. Replace it with your project name. !! Create the configuration file # In the terminal, move to {{$GTHOME/apps/dicts/nds/neahtta/}} # Copy {{configs/sample.config.yaml.in}} to {{configs/PROJNAME.yaml.in}} # Add the new file to SVN. # Open the file in a text editor, and read through the settings. There are numerous comments to guide you. # When you are done, check in the changes. !! Adding language names # Open the file {{configs/language_names.yaml}} # For each language in the project, check the following (there are plenty of comments to guide you): ## the {{NAMES}} variable contains the ISO codes, and then a string marked for localization with the language names in English ## the {{LOCALISATION_NAMES_BY_LANGUAGE}} contains ISO codes, and the language's own name / endonym ## the {{ISO_TRANSFORMS}} contains any potential pairs of two-character and three-character ISO codes !! Create additional directories and files TODO: !! Fabfile # Search the file for instances of ''sample'' and follow the instructions there. # DO NOT check this in yet. !! Makefile # copy sample to a new location, uncomment it, and follow the instructions there. # Be sure to replace instances of ''sample'' in your new section with the PROJNAME. __TODO:__ this is a slightly more complex part, which I wish to do away with by generalizing the makefile settings into the .yaml.in config, interpreted by 'fabric'. Make will still be used, but everything will be configured by environment variables instead. This way we can ensure that configuration is an easier process, and build information is more visible. !! Test the configuration # In the terminal, move to {{$GTHOME/apps/dicts/nds/neahtta/}} # Activate the virtualenv # Run {{fab PROJNAME compile}}, and wait until the process completes. # Run {{fab PROJNAME test_configuration}}, and wait until the process completes. Check FST path names and ensure that the build process moved all the files to the proper location. # Run {{fab PROJNAME runserver}}. If this completes, navigate to the address that you see at the end of the output in your browser. # Does everything seem to work as intended? If so... !! Check in the configurations Check in the following config files # fabfile.py # dicts/Makefile # config/language_names.py # config/PROJNAME.config.yaml.in !! Create additional files TODO: confirm that there isn't anything required for the base configuration to work (maybe user friendly tag file?) !!! Server-side configuration !! Adding opt directories for FST deployment If, while editing the Makefile, you are creating new languages in the ''opt'' directory for deployment, there are three things to do: # create ''/opt/smi/LANG/bin'' # check permissions on directories ''/opt/smi/LANG/bin'' and ''/opt/smi/LANG'', if it is owned by the group ''neahtta'', and writeable by that group !! Configuring nginx TODO: !! Installing an init.d script TODO: !!! Added polish Now that we have a running instance, it's time for some extra configuration. !! Flags For languages that have a translation available in the interface, a flag is necessary for the menu. Wikipedia provides pretty much all flags in SVG format, and automatically converts to PNG. To get a roughly 20 x 15 px flag, use the following steps: * Find the flag {{.svg}} page in wikipedia, e.g. by browsing to the language page or region page, and click on the flag: [http://en.wikipedia.org/wiki/File:Flag_of_Nenets_Autonomous_District.svg] * Look for the link just below: "This image rendered as PNG in other sizes: 200px, 500px, 1000px, 2000px." * Click any size, preferrably the smallest, and alter the url path, to change the width of the flag to 20px: {{{ http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/200px-Flag_of_Nenets_Autonomous_District.svg.png -> http://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Flag_of_Nenets_Autonomous_District.svg/20px-Flag_of_Nenets_Autonomous_District.svg.png }}} Save the file to static/img/flags/, and match the path name so that it is {{LOCALE_20x15.png}}. !! Linguistic configuration (paradigms, etc.) See [NDS Linguistic Settings|NDSLinguisticSettings.html]. !!! Configuring a new pair in an existing instance So far the process is a little complex, but there are things that can be done mostly by linguists once the basic structure is in place. In each following section, I'll mark who the role is best suited for, thus it's clearer where work can be shared. This following process assumes that there is already a service existing to which a new language pair is being added. !! 1.) Establish a build process for the FSTs and lexicon. __Intended__: Programmers ! FST build process in dicts/Makefile This is mainly meant as a convenience for easy developing. Assuming that the language uses the ''langs/'' infrastructure, adding another to a dictionary set's build process is easy. Find the targets for the dictionary set, for example, ''kyv'' and ''kyv-install'', and add the language ISO to the variable ''GT_COMPILE_LANGS'' for these targets. {{{ .PHONY: baakoeh-install baakoeh-install: GT_COMPILE_LANGS := sma nob baakoeh-install: install_langs_fsts .PHONY: baakoeh baakoeh: GT_COMPILE_LANGS := sma nob baakoeh: baakoeh-lexica compile_langs_fsts [... snip ...] }}} These targets will build analysers as usual, but the ''*-install'' targets are there as a convenience for when overwriting the analysers in ''/opt/smi/'' is allowed. __Be careful__ with this though, because with language sets like ''sánit'' and ''baakoeh'' which are very much in production mode now, there may be some unintended consequences. In any case, the targets that these will write to are dictionary-specific, and will not overwrite analysers for other projects. {{{ /opt/smi/LANG/bin/dict-LANG.fst /opt/smi/LANG/bin/dict-iLANG-norm.fst /opt/smi/LANG/bin/some-LANG.fst }}} ! Troubleshooting If you do not succeed in getting these make targets to work with a new language, run the process manually. It might be that ''make distclean'' needs to be run once within the language directory, and then things will work. ! Lexicon in the Makefile Editing the Makefile is a little tricky. You will need to add a target for the lexicon file or files. Lexica are compiled using a ''Saxon'' process, and the Makefile contains some variables that can be used as shortcuts. For languages using ''langs/'' infrastructure for the lexicon, the best option is the following: {{{ ZZZ-all.xml: $(GTHOME)/langs/ZZZ/src/morphology/stems/*.xml @echo "**************************" @echo "** Building ZZZ lexicon **" @echo "**************************" @echo "** Backup made (.bak) **" @echo "**************************" @echo "" -@cp $@ $@.$(shell date +%s).bak mkdir ZZZ cp $^ ZZZ/ $(SAXON) inDir=$(pwd)/ZZZ/ > ZZZ-all.xml rm -rf ZZZ/ }}} The above makes a copy of the XML files, and then uses the Saxon process to compile them all into one file, with no additional processing. This process will be the same if the lexica are in {{main/words/dicts/}}, however some languages there have multiple subdirectories that need to be copied before the Saxon process is run. Make note of the filename that you intend to output this to, and add it to the language installation’s lexicon target, for example, __kyv-lexica__, __muter-lexica__, and also the remove target (such as __rm-kyv-lexica__ etc.). !! 2.) Edit the .yaml file for new FSTs and Dictionaries __Intended__: Programmers, linguists Realistically anyone can do this as long as the build process is working, since most of this should be a cut-and-paste job. Once you're done, save the file and attempt to restart the service. If everything seems to be working, do not check in the config file itself, but copy the values to ''INSTANCE.config.yaml.in'', and check that in. This is simply so that no incoming updates to config files will destroy existing production configs. ! ''Morphology''  section This needs to have the paths to the new analysers, for each language ISO. Follow one of the existing languages and adjust the values as necessary. If any language variants (mobile spellrelax) need to be included, a good idea is to use the language ISO as the key, but with one letter appended, i.e., ''udm'' for mobile would be ''udmM''. In any case, the morphology section should contain a new entry like the following: {{{ YYY: tool: *LOOKUP file: [*OPT, '/YYY/bin/dict-YYY.fst'] inverse_file: [*OPT, '/YYY/bin/dict-iYYY-norm.fst'] format: 'xfst' options: compoundBoundary: "+Use/Circ#" derivationMarker: "+Der" tagsep: '+' inverse_tagsep: '+' }}} Where {{YYY}} is the language ISO path. Note the weird way that forming paths with aliases is handled here in YAML, they may be strings or lists, and if they are lists, they will be automatically concatinated into strings. This must be done because YAML does not allow string concatenation with aliases/variables. ! ''Languages''  section Add a new entry for the language iso to this list. ! ''Dictionaries''  section Here, add a new item to the list of dictionaries, relative to the {{neahtta}} path, i.e., {{dicts/file-name.xml}}. {{{ Dictionaries: # [... snip ...] - source: udm target: hun path: 'dicts/udm-all.xml' }}} For more information on all the settings for this chunk, see the page on YAML configuration: [The Neahttadigisánit Configuration|FilesForConfiguratingNDS.html] !! 3.) Define language names and translation strings __Intended__: Linguists Open the file ''configs/language_names.py''. Here you will need to add the language ISO to several variables. Save when done, and be sure to check in in SVN. ! NAMES Here we define the name in English, so that it will be available for translation to any interface languages. {{{ ('sme', _(u"North Sámi")), }}} The most easy way is to copy one existing line, and replace the contents of the strings. If you're unfamiliar with Python, be careful not to remove any underscores around the strings, and only edit the contents. The first value should be the language ISO, **or** the language variant (''SoMe'', ''udmM'', ''kpvS'', etc.) ! LOCALISATION_NAMES_BY_LANGUAGE Here we have the ISO and the language's name in the language. {{{ ('sme', u"Davvisámegiella"), }}} Again, copy and paste a line, and only edit the strings. ! ISO_TRANSFORMS If the language has a two-character ISO as well as a three-character ISO, we must have these defined here. {{{ ('se', 'sme'), ('no', 'nob'), ('fi', 'fin'), ('en', 'eng'), }}} !! 4.) Define tagsets, and paradigms, user-friendly tag relabels __Intended__: Linguists If you wish to have paradigms visible in the language, you will need two things: * ''Tagsets'' files: [NDS Linguistic Settings|NDSLinguisticSettings.html] * ''.paradigm'' files: [NDS Linguistic Settings|NDSLinguisticSettings.html] * ''.context'' files: [NDS Linguistic Settings|NDSLinguisticSettings.html] * ''.relabel'' files: [NDS Linguistic Settings|NDSLinguisticSettings.html] The easiest means of course is to look at existing languages and copy what they do. When done with these steps, be sure to add the new files and directories to svn and check them in. !!! Server config Things to consider: * nginx configuration file * init.d script: make sure to pick an unused port, change the config file, and also the path to the pid file, otherwise bad things happen !!!Linguistic requirements __Intended__: Linguists * The xml format of the lexicon should match sme-nob or sma-nob format as closely as possible ({{words/dicts/smanob/src}}) * morphological analysers (FSTs, described below) * lists of tag pairs, what is in FST to convert to what users will see * lists of paradigms for parts of speech ** they can be either as detailed as one paradigm per part of speech, or several paradigms for parts of speech and varying sub-types. These will have to be marked in the lexicon in some way, for exampl, plural-only proper noun paradigms for North Saami * description of attributes in the XML that need to be displayed to the user !!! Collecting the materials... !! FSTs ! Lookup FST Lookup FST tags may be in any format, as these can be relabeled for users at a later stage. {{{ vuovdi vuovdi vuovdit+V+TV+PrsPrc vuovdi vuovdit+V+TV+Imprt+Du2 vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Nom vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Gen vuovdi vuovdit+V+TV+Der/NomAg+N+Sg+Acc vuovdi vuovdi+Sem/Plc+N+Sg+Nom vuovdi vuovdi+A+Sg+Nom vuovdi vuovdi+A+Sg+Gen vuovdi vuovdi+A+Sg+Acc vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Nom vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Gen vuovdi vuovdi+Sem/Hum+N+NomAg+Sg+Acc }}} It is best to use the analysers compiled with ''--enable-dicts'', because this will strip extraneous tags. !Spell relax FST Spell relaxing FSTs should follow the exact same format as the lookup FST, but naturally point to the normative lemmas, or whatever lemmas will be used in lookups in the lexicon. These could be of three types: # Normalizing non-standard spellings # Compensating for keyboards without certain characters # Switching orthographies, e.g., accepting latin for cyrillic, or vice versa. Whatever the use case, the analyzed lemma must match up with the lexicon entry. Spellrelax behaviour is governed in {{$GTHOME/langs/$LANG/src/orhtography}}. Here, there are two scripts: ''spellrelax.regex'' and ''spellrelax-mobile-keyboard.regex''. These will result in two analysers, the first containing the spellrelax.regex rules, and the latter will contain the rules for both regex files. The latter is intended for the SoMe option (see e.g. [sanit.oahpa.no|http://sanit.oahpa.no]). * analyser-dict-gt-desc.xfst * analyser-dict-gt-desc-mobile.xfst Put variation __within__ the norm, and variation invisible to the user, in the former file, and ad hoc / dirty hack variation in the latter. ! Compound marking Must have a defined (and consistent) method for marking compounds: {{{ vuovdedoaibma vuovdedoaibma vuovdi subst. + #doaibma subst. ent. nom. vuovdedoaibma vuovdi adj. + #doaibma subst. ent. nom. vuovdedoaibma vuovdedoaibma subst. ent. nom. }}} __Compound marker__: {{{ " + #" }}} This setting can be configured in the YAML settings file. ! Derivation marking Must also be able to specify how all derivational suffixes are marked, because this affects how lexicalized words are displayed when a derivation has a lexicalized form. {{{ oahpásmahttit oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb inf. oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb imp. 2.p.flt. oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb ind. pres. 1.p.flt. oahpásmahttit oahpásmuvvat verb avl.suff.-ahtti verb nom.Ag. subst. fl. nom. }}} __Derivation marker__: {{{ "suff." }}} This setting can be configured in the YAML settings file. ! Generation FST For generation, currently sme and sma use the typical tag format for GT, but, we have a special generator FST that removes non-standard forms. Also, sma has some special entries that are marked with a hid attribute (göövledh+1, göövledh+2) were generation depends on meaning. These must be included in the generation FST. !! Lexicon It may be that the analysis FST does not match absolutely with the lexicon, and this is okay. But, if this is so, it is important to know where the differences are. For instance, the above examples show "noun" and "verb" as the part of speech in the analyzer, but the lexicon only knows "N" and "V". Thus, in order to line up these tools, the programmers will need to have a list of all of these things to formulate rules.