Development tools

The project manipulates text in many ways, organized in lexicons.

Editors

To edit our source files we need a text editor that supports UTF-8 and can save the result as plain text. You may use emacs and its modes. On a Mac you may use e.g. SubEthaEdit, for which we have also made modes for the relevant programming tools.

Documentation tools

We publish our documentation with Forrest.

Morphological analysis

The project uses a set of morphological compilers that exist in two versions: the Xerox tools and the hfst tools. The Xerox tools are the original ones. They are robust, well documented and freely available for research, but they are not open source. The hfst tools are open source with no restrictions. Both compilers compile the same source files, and at Giellatekno and Divvun we use them interchangeably. We compile files for practical programs with hfst, and several extensions are available in hfst only, but in daily work the Xerox tools compile somewhat faster.

A third compiler, foma, is also able to compile source files written for xfst and lexc.

The Xerox compilers

The Xerox tools are: twolc (for morphophonology), lexc (for morphology), xfst (for compiling the final transducer), and lookup (for analysis and generation). Hfst has the same tools (called hfst-twolc, hfst-xfst, etc.) as well as a long list of other tools.

The Xerox tools can be found at fsmbook.com. They are documented in the book referred to on that page (Beesley and Karttunen). We strongly recommend that anyone working on morphological transducers, whether with Xerox or hfst, buy the book.

Note
There is a bug in the latest xfst, causing forms like oslolaččat (derived from Oslo) not to work. If this is important to you, download xfst 2.13, change the name to xfst and put it in e.g. $HOME/bin.
  1. twolc, for phonological and morphophonological rules (cf. a shorter and a longer documentation).
  2. lexc, for representing the Saami stems and the affix lexica.
  3. xfst, the finite-state transducer tool, for integrating the different parts of the program and for compiling the preprocessor.
  4. tokenize, for tokenization and processing (cf. documentation). Note that at the moment we use perl, not tokenize, for preprocessing.
  5. lookup, an interface to the morphological analyser (documentation, cf. also our lookup notes).

The programs are started by typing e.g. lexc and then pressing the enter key. The tools are documented in Beesley and Karttunen, Finite-State Morphology: Xerox Tools and Techniques. The tools may also be installed on your own machine, be it Mac OS X, Linux or Windows. One version of the software is found on the CD accompanying the book; for the latest version, ask Trond for reference.
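As a rough illustration of running lookup from a script instead of typing at the prompt, the Python sketch below pipes a list of word forms to the analyser and returns its raw output. It assumes that the tool reads one word form per line on standard input and takes the compiled transducer file as its argument. The file name sme.fst and the helper names are hypothetical. Passing tool="hfst-lookup" should give the corresponding hfst call, under the same assumptions.

```python
import subprocess


def build_lookup_command(transducer_path, tool="lookup"):
    """Return the argument list for the analyser.

    Assumption: the tool is called as `<tool> <transducer file>` and reads
    word forms from standard input.
    """
    return [tool, transducer_path]


def analyse(words, transducer_path, tool="lookup"):
    """Send one word form per line to the analyser and return its raw output."""
    completed = subprocess.run(
        build_lookup_command(transducer_path, tool),
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",  # the source files and word forms are UTF-8
        check=True,        # raise an error if the tool exits with a failure
    )
    return completed.stdout


# Example call (requires the tool on PATH and a compiled transducer;
# "sme.fst" is a hypothetical file name):
# print(analyse(["oslolaččat"], "sme.fst"))
```

Keeping the command construction in its own function makes it easy to switch between the Xerox and hfst tools without touching the rest of the script.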

The hfst compilers

The hfst tools can be found at the hfst download page. Documentation is found at the hfst wiki. For installation, see also our hfst3 installation page. Note that the documentation is mainly technical; for a pedagogical introduction, we still recommend the Beesley and Karttunen book.

The foma compiler

Måns Huldén's foma may be downloaded at bitbucket.org/mhulden/foma. See our Foma documentation.

Disambiguation tools

  1. Morphological disambiguation
  2. lookup2cg, a script to transform Xerox output to CG input
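The idea behind the lookup-to-CG conversion can be sketched in Python as follows. This is not the actual lookup2cg script, only a minimal illustration under two assumptions: that the analyser prints one reading per line as wordform, tab, lemma+Tag+Tag, with a blank line between word forms, and that the disambiguator expects each word form as "<wordform>" followed by indented readings of the form "lemma" Tag Tag. The function name lookup_to_cg is hypothetical, and unknown words, compounds and lemmas containing a plus sign are not treated specially.

```python
def lookup_to_cg(lookup_output):
    """Convert lookup-style analyses to CG-style cohorts (simplified sketch)."""
    cohorts = []
    # Each blank-line-separated block holds all readings of one word form.
    for block in lookup_output.strip().split("\n\n"):
        wordform = None
        readings = []
        for line in block.splitlines():
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # skip lines that are not wordform<TAB>analysis
            wordform = fields[0]
            # The first "+"-separated part is taken as the lemma, the rest as tags.
            lemma, *tags = fields[1].split("+")
            readings.append('\t"{}" {}'.format(lemma, " ".join(tags)).rstrip())
        if wordform is not None:
            cohorts.append('"<{}>"\n{}'.format(wordform, "\n".join(readings)))
    return "\n".join(cohorts)


sample = "viessu\tviessu+N+Sg+Nom\n\nja\tja+CC\n"
print(lookup_to_cg(sample))
```

Each word form becomes one cohort, so an ambiguous word form with several analysis lines ends up with several indented readings for the disambiguation rules to choose between.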

Analysis and testing

The easiest and most effective way to analyse and test (although a little scary at first) is to use command-line tools. We have made a short introduction in English and a longer document in Norwegian on this topic. The introduction on how to use our parser is also an excellent introduction on how to combine the individual tools.

Our home-made tools, and adjustments of public tools

  1. The cgi-bin setup for making the parsers accessible on the web
  2. How the generated paradigms should be presented on the web
  3. The web interface to our web demo
  4. Conversion scripts
  5. Testing tools
  6. Emacs for lexicon expansion
  7. Special emacs modes
  8. Autshumato CAT platform

Other tools

  1. tca2, the corpus alignment program.
  2. Evaluating other sentence alignment programs.
  3. Obsolete documentation on UTF-8 for older operating systems: setup