The Giellatekno & Divvun infrastructure

Presentation at the BAULT seminar, HU 5.6.2014

Sjur Nørstebø Moshagen, Divvun, UiT Norgga árktalaš universitehta

Overview

Main points for the design of the infrastructure:

Developed by Tommi Pirinen and Sjur Moshagen.

A schematic overview of the main components of the infrastructure:

[Figure: schematic overview of the main components of the infrastructure]

The Core

Available at:

svn co https://gtsvn.uit.no/langtech/trunk/giella-core

Templates For Data And Build Files

The template contains placeholder data in place of the actual linguistic content. The idea is that when initialising a new language, you should get a working setup with a working set of tools (analysers, generators, spell checkers, etc.) for a toy language, and from there on you add your own, real content. The reality isn't quite there yet, but it isn't far away either.

The template also contains configuration and build instructions, covering both language-independent parts and support for language-specific build steps. The language-independent parts are kept in a separate directory, to avoid accidental changes and to make the general build instructions easy to reuse.

Shared Data

The shared data comes in two flavours:

Shared linguistic data is typically shared only within a subgroup of languages, such as smi (the Sámi languages) and urj-Cyrl (Uralic languages written in Cyrillic script).

The regular expressions are made to remove tags or tagged strings of classes typically found in all languages:

There are presently 52 such regexes.
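The two effects mentioned above (removing a tag from strings, and removing whole tagged strings) can be illustrated with a small sketch. This is only an approximation on plain strings: the shared filters themselves are finite-state regular expressions, not Python code, and the tag names below are hypothetical, chosen purely for illustration.

```python
import re

# Hypothetical tag names, used only for illustration.
TAG_TO_STRIP = "+Use/Hyph"   # tag that is deleted, the string is kept
TAG_TO_REJECT = "+Err/Orth"  # tag whose strings are discarded entirely


def strip_tag(analyses, tag):
    """Delete every occurrence of `tag`, keeping the rest of each string."""
    pattern = re.compile(re.escape(tag))
    return [pattern.sub("", analysis) for analysis in analyses]


def drop_tagged(analyses, tag):
    """Discard whole strings that contain `tag`."""
    return [analysis for analysis in analyses if tag not in analysis]


analyses = [
    "gielak+A+Sg+Nom",
    "gielak+A+Sg+Nom+Err/Orth",
    "gielak+A+Use/Hyph+Attr",
]

cleaned = strip_tag(drop_tagged(analyses, TAG_TO_REJECT), TAG_TO_STRIP)
print(cleaned)
```

Keeping the two operations as separate functions mirrors the idea of having one small filter per tag class, so that each tool can apply only the filters it needs.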

Other shared items

Templates And Merging

Template Content

Briefly described earlier, this is what you will find:

... or in a picture:

Template content layout

[Figures: template content layout]

Template content layout as ASCII art
.
├── am-shared
├── doc
├── src
│   ├── filters
│   ├── hyphenation
│   ├── morphology
│   │   ├── affixes
│   │   └── stems
│   ├── orthography
│   ├── phonetics
│   ├── phonology
│   ├── syntax
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   │   ├── dict-gt-yamls
│   │   ├── gt-desc-yamls
│   │   ├── gt-norm-yamls
│   │   ├── morphology
│   │   ├── phonology
│   │   └── syntax
│   └── tools
│       ├── mt
│       │   └── apertium
│       └── spellcheckers
└── tools
    ├── grammarcheckers
    ├── mt
    │   └── apertium
    │       ├── filters
    │       └── tagsets
    ├── preprocess
    ├── shellscripts
    └── spellcheckers
        ├── fstbased
        │   ├── foma
        │   └── hfst
        └── listbased
            └── hunspell

Template Content (II)

This also shows the use of the Autotools basic structure, with such items as:

Merging The Template

One of the main features of the infrastructure is the relative ease with which one can update any or all languages with new features or bug fixes in the build instructions. This is done by merging recent changes in the template into the corresponding files in each language.

Languages

We have split the languages into four groups, depending on the type of work done on them and on their license:

langs
These are the languages being actively developed - 43 languages
startup-langs
These are languages that someone has an interest in, but that are not actively being developed and where the linguistic content is thin - 11 languages
experiment-langs
The name says it all - this is the playground, and the languages here are used for teaching, among other things - 3 languages
closed-langs
These are languages with a closed license, only ISL and DAN at the moment

Available at:

svn co https://gtsvn.uit.no/langtech/trunk/langs/ISO639-3-CODE/

(replace ISO639-3-CODE with the actual ISO code)

And more languages

We still have a number of languages located in an older infrastructure system; these will be moved to the new infrastructure as time permits.

Linguistic data

Formats:

Standardised tag sets:

Build Structure

Support for:

Build tools

Language Specific Adaptions In The Build Process

Language-specific adaptation works by first building a *.tmp file and then using a fall-back target that just copies the *.tmp file to the final target. By overriding the copy step, one can do whatever is needed for a specific target after the default, language-independent processing is done.

The language-specific build steps are specified in a (mostly) clearly marked section in the Makefile.am files.
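The control flow of this two-step scheme can be sketched as follows. In the infrastructure itself the logic is expressed as make targets in the Makefile.am files; the Python below is only a stand-in for that flow, and all function and file names in it are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def build(target, make_tmp, finish=None):
    """Build `target` in two steps: generic *.tmp build, then a finishing step."""
    tmp = target.with_name(target.name + ".tmp")
    make_tmp(tmp)                     # default, language-independent processing
    if finish is None:
        shutil.copyfile(tmp, target)  # fall-back: plain copy of the *.tmp file
    else:
        finish(tmp, target)           # language-specific override of the copy step
    return target


def generic_step(tmp):
    """Stand-in for the shared build step (hypothetical content)."""
    tmp.write_text("linnja+N+Sg+Nom\n", encoding="utf-8")


def uppercase_override(tmp, target):
    """Hypothetical language-specific step applied instead of the plain copy."""
    target.write_text(tmp.read_text(encoding="utf-8").upper(), encoding="utf-8")


with tempfile.TemporaryDirectory() as workdir:
    default_out = build(Path(workdir) / "default.txt", generic_step).read_text(encoding="utf-8")
    custom_out = build(
        Path(workdir) / "custom.txt", generic_step, uppercase_override
    ).read_text(encoding="utf-8")

print(default_out, custom_out)
```

The point of the design is that a language only has to supply the finishing step; the generic step stays identical across all languages and can be updated from the template.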

Testing

Testing is done with the command make check, as in all Autotools-based build systems. There is built-in support for two types of tests:

In addition, there is the general support for testing in Autotools (or more specifically in automake), meaning that it is possible to add test scripts for whatever you like.

Test directory

Most test scripts are located within the test/ directory, within which there is a mirror copy of the language's directory tree, to keep the tests for different parts of the grammar separate from each other:

[Figure: layout of the test/ directory]

Yaml tests
  Adjective - gielak: # AGAdj
    gielak+A+Attr: gielak 
    gielak+A+Sg+Nom: gielak 
    gielak+A+Sg+Acc: gielagav 
    gielak+A+Comp+Sg+Nom: []
    gielak+A+Superl+Sg+Nom: []

Each yaml file tests both generation (exact match) and analysis (ignoring homonymy) against the fst named in the yaml file's name. The test runner loops over all matching yaml files and runs all tests in each file; if a file fails, it prints a command that you can copy and paste to rerun the test with all details visible.

Example (line wraps added for readability):

YAML test 37: analyser-gt-norm.xfst + gt-norm-yamls/N-aambaz_gt-norm.yaml - 30/0/30 PASS
YAML test 38: analyser-gt-norm.xfst + gt-norm-yamls/N-aandam_gt-norm.yaml - 26/2/28 FAIL

To rerun with more details, please triple-click, copy and paste the following:

pushd $GTHOME/langs/liv/build/xerox/test/src; \
/opt/local/bin/python3.2 \
$GTHOME/giella-core/scripts/morph-test.py -c -i -v -S xerox \
--app /Users/smo036/bin/lookup \
--morph ././../../src/analyser-gt-norm.xfst \
--gen ././../../src/generator-gt-norm.xfst \
../../../../test/src/gt-norm-yamls/N-aandam_gt-norm.yaml; popd
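The difference between the two checks can be made concrete with a minimal sketch. It does not reproduce morph-test.py: the transducers are replaced by plain dictionaries with toy data, and the reading of "absolute match" (the generated forms must equal the expected forms exactly) and "ignoring homonymy" (the tested analysis must be among those returned, other analyses are not counted as errors) is an assumption, as is the pass/fail/total reading of the summary line.

```python
# Toy stand-ins for the generator and analyser fst's (assumed data).
GENERATOR = {
    "gielak+A+Attr": {"gielak"},
    "gielak+A+Sg+Nom": {"gielak"},
    "gielak+A+Sg+Acc": {"gielagav"},
}
ANALYSER = {
    "gielak": {"gielak+A+Attr", "gielak+A+Sg+Nom"},  # homonymous form
    "gielagav": {"gielak+A+Sg+Acc"},
}

# Test data in the shape of the yaml example: analysis -> expected forms.
TESTS = {
    "gielak+A+Attr": ["gielak"],
    "gielak+A+Sg+Nom": ["gielak"],
    "gielak+A+Sg+Acc": ["gielagav"],
    "gielak+A+Comp+Sg+Nom": [],  # nothing should be generated
}


def run_tests(tests, generator, analyser):
    """Return (passes, fails) over one generation and one analysis check per entry."""
    passes = fails = 0
    for analysis, expected in tests.items():
        # Generation: the set of generated forms must match exactly.
        gen_ok = generator.get(analysis, set()) == set(expected)
        # Analysis: each form must yield the tested analysis; extra analyses are ignored.
        ana_ok = all(analysis in analyser.get(form, set()) for form in expected)
        for ok in (gen_ok, ana_ok):
            if ok:
                passes += 1
            else:
                fails += 1
    return passes, fails


passes, fails = run_tests(TESTS, GENERATOR, ANALYSER)
summary = f"{passes}/{fails}/{passes + fails} {'PASS' if fails == 0 else 'FAIL'}"
print(summary)
```

Note that gielak has two analyses in the toy analyser, yet neither test on that form fails: this is what ignoring homonymy amounts to in this sketch.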

Lexc tests

The in-source lexc tests are actually a variant of the yaml tests, although the format is slightly different:

!!€gt-norm: linja # Even-syllable test
!!€ linnja                linnja+N+Sg+Nom
!!€ linjajn               linnja+N+Sg+Com
!!€ linjav                linnja+N+Sg+Acc

The first line specifies the fst to run the test against and the name of the test; the remaining lines are the actual test data. Positive test lines start with "!!€ ", whereas negative test lines start with "!!$ ".

All lexc files are looped over, and if test cases are found, they are extracted and run against the specified fst. The feedback to the developer is the same as for the yaml tests, including the command to repeat the test in case of failures.
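The extraction step can be sketched roughly as below. This is not the infrastructure's own extraction code: the parsing rules are inferred from the example above, the two-column layout of negative lines is an assumption, and the negative test line in the sample is invented for illustration.

```python
import re

# Header line: "!!€<fst>: <test name>", with no space after the euro sign.
HEADER = re.compile(r"^!!€(?P<fst>[^\s:]+):\s*(?P<name>.*)$")


def extract_tests(lexc_text):
    """Collect in-source tests as a list of dicts, one per header line."""
    tests = []
    current = None
    for line in lexc_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {
                "fst": header["fst"],
                "name": header["name"].strip(),
                "positive": [],
                "negative": [],
            }
            tests.append(current)
        elif current is not None and line.startswith(("!!€ ", "!!$ ")):
            parts = line[4:].split()
            if len(parts) == 2:  # surface form, analysis
                key = "positive" if line.startswith("!!€ ") else "negative"
                current[key].append((parts[0], parts[1]))
    return tests


SAMPLE = """\
LEXICON EvenNouns
!!€gt-norm: linja # Even-syllable test
!!€ linnja                linnja+N+Sg+Nom
!!€ linjajn               linnja+N+Sg+Com
!!€ linjav                linnja+N+Sg+Acc
!!$ linnjav               linnja+N+Sg+Acc
"""

tests = extract_tests(SAMPLE)
print(tests[0]["fst"], len(tests[0]["positive"]), len(tests[0]["negative"]))
```

Ordinary lexc lines, such as the LEXICON line in the sample, are simply skipped, so the test data can live next to the entries it exercises.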

Twolc tests

To test specific two-level rules, one can add test pairs to the twolc files. Support for this type of testing has only recently been added, and it works only with the Xerox tools (the pair-testing facilities in HFST use a different format, and converting from one to the other is non-trivial without knowledge of the alphabet).

The test data looks like the following:

!!€ roavggoX4j
!!€ roavggu0j
!!$ roavggoX4j
!!$ r0åvggu0j

The yaml and lexc tests will also de facto test the correctness of the two-level rules.

Documentation

The infrastructure supports extraction of in-source documentation written as comments in a specific format. The exact format is specified on a separate page; the extracted documentation is ultimately rendered as html pages.

The basic idea is that documentation that is part of the actual source code is more likely to be kept up to date than external documentation.

The format supports the use of a couple of variables to extract such things as lexicon names, a line of code, etc. The extracted documentation must follow the jspwiki syntax.

Documentation example
! ======================
!! !!!Sublexica for Noun
! ======================

!! !!Even-syllable stems
! -------------------

!! !2syll stems
! - - - - - -

LEXICON MUORRA !!= @CODE@ Standard even stems here. 2syll stem with cg (note Q1)

This will produce the jspwiki code:

!!!Sublexica for Noun

!!Even-syllable stems

!2syll stems

 LEXICON MUORRA  Standard even stems here. 2syll stem with cg (note Q1)

which can be seen rendered online as html here: /lang/smj/nouns-affixes.html.
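A minimal sketch of the extraction, reverse-engineered from this single example and not taken from the actual extraction script: lines starting with "!! " contribute their remainder, a "!!=" comment after code contributes its text with @CODE@ replaced by the code on that line, and ordinary "!" comments are skipped. The real format is defined on the separate specification page and supports more than this.

```python
def extract_docs(source):
    """Return the documentation lines found in lexc-style source text."""
    out = []
    for line in source.splitlines():
        if line.startswith("!! "):
            # Pure documentation line: keep everything after the marker.
            out.append(line[3:])
        elif "!!=" in line:
            # Code line with trailing documentation: @CODE@ stands for the code.
            code, doc = line.split("!!=", 1)
            out.append(" " + doc.strip().replace("@CODE@", code))
    return out


SOURCE = """\
! ======================
!! !!!Sublexica for Noun
! ======================

!! !!Even-syllable stems
! -------------------

!! !2syll stems
! - - - - - -

LEXICON MUORRA !!= @CODE@ Standard even stems here. 2syll stem with cg (note Q1)
"""

docs = extract_docs(SOURCE)
print("\n\n".join(docs))
```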

The Targets, Tools And Packages Produced By The Infrastructure

The list is constantly growing and contains roughly the following at present:

Coming and future tools and packages

Projects being worked on right now that will lead to new tools for all languages in the future:

On a slightly longer time scale there are plans for:

Concluding remarks

More info at https://giellalt.uit.no/infra/GettingStarted.html
and https://giellalt.uit.no/infra/infraremake/index.html

Thank you! Kiitos!

Thanks for your attention!