Infra Remake
Outline of major goals
- divide and conquer - split dirs and source files to contain one and only one type of content
- e.g. $GTLANG/src/ now contains both lexc, twolc, cg3 and other source files
- some regex filters contain filtering for both normative use and generation
- etc.
- use a commonly found build system - autotools or CMake
- an important part of this is to ensure that we do proper checking of dependencies, both within our source files, as well as against external tools and software packages
- merge all languages in gt/, kt/ and st/ into one dir named gtlangs/
- move script/ out of any language dirs
- try to build an Apertium-style system where one only needs to check out one core component, and then only the languages one wants to work with
- another inspiration is Omorfi which has a dir structure and build system along the lines of what we want
More detailed plans and progress
There are some details further down, but the meat of the plan is found on a separate page. The same goes for the progress.
Other goals
- VPATH builds (out-of-source builds) - would be nice to plan for this from the very beginning
- parallel builds: both Xerox and HFST should be supported; autoconf should detect which tool set is available
- Sjur: I'm not sure what the best behaviour should be if both are found, probably to give a build/makefile option to specify which one to use.
- Tommi: I think this is a good approach. For defaults we can go anyway, perhaps even set it language by language.
- Good idea. One global default (xerox for now), but possibly other defaults for specific languages.
- Sjur: Also, we probably need to maintain separate build commands for each tool set, as they are too different to be able to use the same build commands (perhaps with an exception for foma vs xfst).
- Tommi: Yeah, unfortunately. One of the aims would be to have exactly same commands as xerox tool set in HFST but it still requires a lot of work.
- Yes. The targets should be the same (ie 'make' will make both hfst and xerox), but for some targets there is no parallel in the other (e.g. hfst can make a speller, but there is no xerox counterpart for a speller transducer).
- Tommi: Anyways, since the builds will be separate it would even be possible to build both variants if both tool sets are found.
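The tool detection and backend choice discussed above could be sketched in configure.ac roughly as follows; the macros are standard autoconf, but the variable names and the --with-backend option are only an assumed design, not a settled one:

```m4
# Sketch only: detect both tool sets, let the user pick a default backend.
AC_PATH_PROG([XFST],       [xfst],       [no])
AC_PATH_PROG([TWOLC],      [twolc],      [no])
AC_PATH_PROG([HFST_LEXC],  [hfst-lexc],  [no])
AC_PATH_PROG([HFST_TWOLC], [hfst-twolc], [no])

AC_ARG_WITH([backend],
  [AS_HELP_STRING([--with-backend],
    [default transducer backend: xerox or hfst @<:@xerox@:>@])],
  [], [with_backend=xerox])

AM_CONDITIONAL([WANT_XFST], [test "x$XFST" != "xno"])
AM_CONDITIONAL([WANT_HFST], [test "x$HFST_LEXC" != "xno"])
```

With something like this in place, building both variants when both tool sets are found (as Tommi suggests) is just a matter of enabling both conditionals.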
- Another design goal is that once you are within $GTHOME/gtlangs/, you
- Another question is how to handle the growing size of $GTMAIN. We are
- The problem with the above solution is that it will break the
- I think autotools can do a great deal to help with the script
- Another thing I would like to have further along the road is to add a
- Almost always this is implemented by copying some templates and filling
- The template files should be the same as the ones used for checking if anything in the build environment has been updated. But I will try to elaborate on this a bit later on.
- Finally, we should support unit testing and inline documentation (á
- That is one thing I (and I think Anssi Yli-Jyrä as well) have been
- further discussion on inline documentation is found on
Some details
Dir structure
The basic dir structure could be something like this:
```
$GTHOME/
    gtcore/
        scripts/    # the old gt/script/ dir
        mk-files/   # shared core mk-files
        templates/  # src file templates and dir structure
        shared/     # old common/ - shared linguistic src files
    gtlangs/
        sme/
        smj/
        sma/
        fao/
        kom/
    langgroups/
        smi/
```
Comments: the dirs in $GTHOME/gtcore/ are intended to be used as follows:
- mk-files/ - the bulk of the build system is found here. In each language dir there is only a very simple (auto)make file that imports or includes the core makefile at the same relative position, as well as a local (auto)make file for local overrides. This will make it easy to improve or change the build for all languages, and at the same time provide enough flexibility to do language-specific experiments.
- templates/ - contains a complete directory structure and source file templates for all tools and purposes for our languages. This dir tree is used for two purposes: to populate a new language dir tree, and to add new files to existing languages when new source file templates are added. That is, when a new functionality is added, with its corresponding source files in this directory, then all languages will automatically be updated to get source file templates for this new functionality.
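The include mechanism for mk-files/ could look something like the following; the file and variable names here are illustrative assumptions, not decided ones:

```make
# $GTHOME/gtlangs/fao/src/Makefile (sketch)
# Pull in the shared rules from gtcore, then allow local overrides.
include $(GTCORE)/mk-files/src-dir.mk   # the bulk of the build logic
-include local.mk                       # optional language-specific overrides
```

The `-include` makes the local override file optional, so a language with no special needs carries no extra build code at all.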
Longer term, one can consider the following additions:
```
$GTHOME/
    gtcore/        # as above
    gtlangs/       # as above
    gtlangpairs/   # language pairs, typically dictionaries and MT
    gtlanggroups/  # multilingual resources, typically terminology
                   # collections and shared name resources
```
The idea is to gather resources that are specific to the given language pairs within these directories. They should also serve as the starting point for CS's Dream (Cip's and Sjur's Dream), where all monolingual information is stored in gtlangs/, and all multilingual information is stored in one of the two dirs indicated above. Language pair names are directional, indicating the source and target languages.
In this scenario, resources for an MT application would then probably be divided among three dir trees: gtlangs/ for the monolingual resources, gtlangpairs/ for the transfer dictionaries, and gtlanggroups/ for terminology resources.
- Cip: As a matter of fact, Cip's dream is about how to model humans' language knowledge, i.e. humans don't have a list of Saami words,
Filenames and extensions
Filenames need to be standardised, as well as the use of filename extensions. The extension should reflect the content type. A possible list of extensions could be:
- .lexc - LexC source files
- .xfscript - xfst script file
- .regex - xfst regex file
- .twolc - twolc source file
- .xfst - compiled transducer, Xerox type
- .hfst - compiled transducer, HFST type
- .hfstol - compiled transducer, optimised HFST type
- .cg - generic constraint grammar (CG) file
- .cg2 - CG2 source file (probably not used anymore, but listed to ensure we can differentiate between major versions of the CG formalism)
- .cg3 - VISL CG3 constraint grammar source file
- .cg3bin - binary/compiled VISL CG3 constraint grammar file
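With standardised extensions, generic pattern rules can drive the builds for all languages; a minimal GNU make sketch, assuming the stock HFST and VISL CG3 command-line tools:

```make
# Sketch: one generic rule per source type, shared by all languages.
%.hfst: %.lexc
	hfst-lexc $< -o $@          # LexC source -> HFST transducer

%.hfstol: %.hfst
	hfst-fst2fst --optimized-lookup-unweighted $< -o $@

%.cg3bin: %.cg3
	cg-comp $< $@               # compile a VISL CG3 grammar
```

Equivalent rules for the Xerox tools would live alongside these, selected by the backend choice discussed above.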
There are probably other file types we need to handle; add more extensions here as needed.
Language codes
So far we have used ISO 639-2 codes for all languages, applying them both to dir names and as part of file names. We should probably move to (the relevant subparts of) proper locale codes, following the standards used by the rest of the world. This means changing all sme strings to se, nob to nb, etc.
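As a trivial illustration of the renaming involved, a shell helper could map the old codes to locale-style codes; the function name is hypothetical, and a real mapping would of course cover all our languages, not just these two:

```shell
#!/bin/sh
# Hypothetical helper: map our current ISO 639 codes to locale-style codes.
iso_to_locale() {
    case "$1" in
        sme) echo "se" ;;   # North Saami
        nob) echo "nb" ;;   # Norwegian Bokmål
        *)   echo "$1" ;;   # no mapping known: keep the old code
    esac
}

iso_to_locale sme   # prints "se"
iso_to_locale nob   # prints "nb"
```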
Execution plan
- start small - only one language. First language is fao, which is reasonably big but still not too complex. Create the basic dir tree, and use svn copy to copy over the fao sources, so that the old fao dir remains intact and usable all the time (only when everything is working ok, the old dir will be removed).
- get all the basic infrastructure and build functionality to work for the needs of fao
- add another language, probably a small one, to test the multilingual behaviour as well as templating system (the small language will most likely not have all features that fao has)
- add a third language - a big one this time, e.g. kal (which uses xfst instead of twolc and thus provides a slightly new use case). Make sure all build targets work as they should, and extend the build system, template files, etc. as needed. kal probably has more requirements than fao.
- then add one language at a time, all the time ensuring that everything works for all languages, and that the small languages automatically pick up new functionality from the big ones as the template dir is expanded to follow the big languages being added
- gradually remove the old language dirs as the new location and build infrastructure becomes stable, also forcing the whole group to start using the new infrastructure. This is important to get feedback and correct bugs.
Testing the remake
We need to ensure that nothing changes in terms of the output of the transducers as part of the remake - unless there are some intended changes (e.g. unifying tags across languages). It is probably best to first do the infra remake, and then later do such tag unifying in the output. So what we need for each language is:
- a reference corpus
- analysed using the different transducers built with the old build system
- analysed using the same transducers from the new build system
The testing then amounts to ensuring that the output is the same from both the old and new transducers. This should guarantee stability in the output, and thus reliability from a linguistic point of view.
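A regression check along these lines could be wired into the build; the file names and the `check` target below are assumptions, but the principle is simply a byte-for-byte diff of the two analyses:

```make
# Sketch of a 'make check' regression target (file names are illustrative).
check: corpus.txt
	hfst-lookup old/analyser.hfstol < corpus.txt > old.analysed
	hfst-lookup new/analyser.hfstol < corpus.txt > new.analysed
	diff -u old.analysed new.analysed   # fails if the output has changed
```

Since diff exits non-zero on any difference, the target fails loudly as soon as the new build diverges from the old one.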
There might be problems with this testing scenario in cases where we want to change tags as part of the infrastructure remake. One example could be that we want to standardise some of the compounding tags, to ensure that a compound filter works the same for all languages. Or that some tags that are visible now will be removed in the output of the new transducers, since they really should not be part of the output even in the old transducers (e.g. the +Der1 tags).
Open questions
Applications - top-level or bottom-level?
Should we build end-user applications in a separate dir tree, one tree for each application, or should the applications be included in the regular language dirs? As long as the application basically only involves one technology and a few files, it would probably seem easiest to build the application as part of the other builds for that language. One such example is spell checkers, which basically are an application of normative transducers.
But as soon as an application builds on multiple technologies, requires several installation packages for different platforms, and involves a multitude of files, it might get more complicated. In that case it might be easier to maintain a separate directory tree for each application. Take Oahpa as an example, which uses several transducers, disambiguators, SQL data, and user interface files.
There is no easy answer to this; we probably have to try both and see how things develop.