Edmonton Infrastructure Presentation 2015

Edmonton presentation

University of Alberta, Edmonton, June 8th & 13th 2015

Sjur Moshagen, UiT The Arctic University of Norway

Overview of the presentation

  • background and goals
  • bird's eye view
  • closer view of selected parts:
    • documentation
    • testing
    • from source to final tool

Background and goals

  • Background
  • Goals

Background

  • need for simpler maintenance
  • scalability - for languages and tools, as well as for linguists and other developers
  • developing NLP resources is a lot of work, and languages are complex - we need a tool and an infrastructure to handle the complexity in a manageable way
  • keep technical details out of the way
  • make the daily work as simple as possible
  • division of labour
  • Recognition: know the basic setup of one language - know the setup of them all

Goals

  • easy support for many languages
  • easy support for many tools
  • keep language independent and language specific code apart
  • easily upgradable
  • the resources in our infrastructure should live on for decades or more

General principles

  1. Be explicit (use non-cryptic directory and file names)
  2. Be clear (files should be found in non-surprising locations)
  3. Be consistent (identical conventions in all languages as far as possible)
  4. Be modular
  5. Divide language-dependent and language-independent code
  6. Reuse resources
  7. Build all tools for all languages
  8. ... but only as much as you want (parametrised build process)

Bird's Eye View and Down

  • the house
  • organisation - directory structure
  • technologies (Xerox, Hfst, Foma + Cg)
  • templated build structure and source files
  • configuration of builds

The House

../images/hus_eng_2015.png

The House and the Infra

../images/hus_eng_2015_with_infra.png

  • *Machine translation: fst's built by the infra, the rest handled by Apertium
  • *Speech synthesis is not (yet) built by the infra, though the conversion to IPA is part of the infrastructure
  • Supported: fst's and syntactic parsers used are built by the infrastructure

$GTHOME - directory structure

Some less relevant dirs removed for clarity:

$GTHOME/                     # root directory, can be named whatever
├── experiment-langs         # language dirs used for experimentation
├── giella-core              # $GTCORE - core utilities
├── giella-shared            # shared linguistic resources
├── giella-templates         # templates for maintaining the infrastructure
├── keyboards                # keyboard apps organised roughly as the language dirs
├── langs                    # The languages being actively developed, such as:
│   ├─[...]                  #
│   ├── crk                  # Plains Cree
│   ├── est                  # Estonian
│   ├── evn                  # Evenki
│   ├── fao                  # Faroese
│   ├── fin                  # Finnish
│   ├── fkv                  # Kven
│   ├── hdn                  # Northern Haida
│   └─[...]                  #
├── ped                      # Oahpa etc.
├── prooftools               # Libraries and installers for spellers and the like
├── startup-langs            # Directory for languages in their start-up phase
├── techdoc                  # technical documentation
├── words                    # dictionary sources
└── xtdoc                    # external (user) documentation & web pages
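
Each language directory under langs/ is a self-contained Autotools project; building one of them looks roughly like this (crk is used as the example, as in the tree above):

cd $GTHOME/langs/crk
./autogen.sh      # regenerate the Autotools setup (first checkout / after template updates)
./configure
make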

Organisation - Dir Structure

.
├── src                  = source files
│   ├── filters          = adjust fst's for special purposes
│   ├── hyphenation      = nikîpakwâtik >  ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        = lexical entries
│   ├── orthography      = latin -> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or vice versa
├── test                 =
│   ├── data             = test data
│   └── src              = tests for the fst's in the src/ dir
└── tools                =
    ├── grammarcheckers  = prototype work, only SME for now
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    └── spellcheckers    = spell checkers are built here

Technologies

  • All technologies are rule-based, as opposed to statistical and similar approaches.
  • This allows us to write grammars that are precise descriptions of the languages - reference grammars in a way
  • Goal: The documentation for your grammar - with suitable examples etc - could be the next published grammar for your language (we'll return to that shortly)

Technology for morphological analysis

We presently use three different technologies:

  • Xerox - closed source, not properly maintained, fast, no weights
  • Hfst - open source, actively maintained, used in our proofing tools
  • Foma - open source, actively maintained, fast (newly added, not available for all fst's yet)
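
The same lexc source is compiled with whichever backends are enabled; a rough sketch with two of them (the file names are illustrative, and the real build first concatenates the lexc files and adds further filtering and composition steps):

# HFST:
hfst-lexc src/morphology/stems/nouns.lexc -o nouns.hfst

# Foma:
foma -e "read lexc src/morphology/stems/nouns.lexc" -e "save stack nouns.foma" -s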

Technology for syntactic parsing

  • Cg - Constraint Grammar (VISLCG3, from the University of Southern Denmark)
  • used for syntactic parsing
  • also for grammar checking
  • Basic idea: remove unwanted readings or select wanted ones based on the morphosyntactic context (= output of the morphological analysis)
  • Example:
# We like finite verbs:
SELECT:Vfin VFIN ;
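
To show where such rules fit, here is a sketch of a typical analysis pipeline; preprocess and lookup2cg are helper scripts shipped with the infrastructure (assumed to be on PATH), lookup is the Xerox lookup tool, and the file names are illustrative:

echo "an input sentence" \
  | preprocess \
  | lookup src/analyser-gt-desc.xfst \
  | lookup2cg \
  | vislcg3 -g src/syntax/disambiguation.cg3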

Templated Build Structure And Source Files

  • Common resources in $GTHOME/giella-core/ and $GTHOME/giella-shared/
  • Template for new languages, including build instructions
  • The template is merged (using svn merge) with each language when updated
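
Conceptually, such an update is an ordinary svn merge from the template into the language directory; a sketch (the template path shown is an assumption, and in practice a helper script in giella-core wraps this step):

cd $GTHOME/langs/crk
svn merge $GTHOME/giella-templates/langs-templates/und .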

../images/newinfra.png

Configurable builds

We support a lot of different tools and targets, but in most cases one only wants a handful of them. When running ./configure, you get a summary of the things that are turned on and off at the end:

$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:

  -- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
  -- Xerox is default on, the others off unless they are the only one present --
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no

  -- basic packages (on by default): --
  * analysers enabled: yes
  * generators enabled: yes
  * transcriptors enabled: yes
  * syntactic tools enabled: yes
  * yaml tests enabled: yes
  * generated documentation enabled: yes

  -- proofing tools (off by default): --
  * spellers enabled: no
    * hfst speller fst's enabled: no
    * foma speller enabled: no
    * hunspell generation enabled: no
  * fst hyphenator enabled: no
  * grammar checker enabled: no

  -- specialised fst's (off by default): --
  * phonetic/IPA conversion enabled: no
  * dictionary fst's enabled: no
  * Oahpa transducers enabled: no
    * L2 analyser: no
    * downcase error analyser: no
  * Apertium transducers enabled: no
  * Generate abbr.txt: no

For more ./configure options, run ./configure --help
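
A typical build cycle for a language with proofing tools enabled then looks like this (--with-hfst is taken from the example above; the speller flag name is an assumption, so check ./configure --help for the exact spelling):

./configure --with-hfst --enable-spellers
make
make check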

The build - schematic

../images/new_infra_build_overview.png

Closer View Of Selected Parts:

  • Documentation
  • Testing
  • From Source To Final Tool:
    • Relation Between Lexicon, Build And Speller

Closer View: Documentation

  • Background
  • Implementation

Background

  • Documentation is always out-of-date
  • It tends to be even more out of date when it is kept far away from the thing it documents, and vice versa
  • How to improve: make it possible to write documentation within the source code
  • This is similar to JavaDoc, Doxygen and many other such systems
  • Ultimate goal:
    • Document the source code so that it can be published as the next reference grammar!

Implementation

  • The infrastructure will automatically extract comments of a certain type, and convert them into html
  • One can cite portions of the source code, as well as test data.
  • The syntax of the comments must follow the jspwiki syntax

Example cases:

Documentation:

Closer View: Testing

  • testing framework
  • yaml tests
  • in-source tests
  • other tests

Testing Framework

All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.

All tests are run with a single command:

make check

Autotools gives a PASS or FAIL on each test as it finishes:

../images/make_check_output.png
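
Because this is the standard Automake test harness, the usual conveniences apply; for example, the run can be limited to one directory or, via the TESTS variable, to individual tests (the test name below is a placeholder):

# run only the tests for the fst's (directory layout as shown earlier)
cd test/src && make check

# restrict the run to a single test
make check TESTS=hypothetical-noun-tests.yaml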

Yaml Tests

These are the most used tests, and are named after the syntax of the test files. The core syntax is:

  • a header
  • test sets:
    • test name
    • test data
  • syntax requirements: indents using spaces, multiple choices as lists within brackets, colons after everything except the word forms
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
     mihko+N+IN+Sg: mihko
     mihko+N+IN+Sg+Px1Sg: nimihkom
     mihko+N+IN+Sg+Px2Sg: kimihkom
     mihko+N+IN+Sg+Px1Pl: nimihkominân
     mihko+N+IN+Sg+Px12Pl: kimihkominaw
     mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
     mihko+N+IN+Sg+Px3Sg: omihkom
     mihko+N+IN+Sg+Px3Pl: omihkomiwâw
     mihko+N+IN+Sg+Px4Pl: omihkomiyiw
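
When a test fails, the fst's referenced in the Config section can also be queried directly; hfst-lookup is a standard HFST tool, and according to the test data above this should produce the surface form nimihkom:

echo 'mihko+N+IN+Sg+Px1Sg' | hfst-lookup ../../../src/generator-gt-norm.hfst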

Yaml test output

../images/make_check_output.png

  • each yaml test file has its own line of output with PASS / FAIL / TOTAL
  • at the end of each yaml test run (= all yaml files for the same fst) there is a summary of the total results for that yaml test run
  • ... followed by the Automake PASS / FAIL message

In-Source Tests

  • LexC tests
  • Twolc tests

LexC tests

As an alternative to the yaml tests, one can specify similar test data within the source files:

LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon. 
 +N:   MUORRAInfl ;
 +N:%> MUORRACmp  ;

!!€gt-norm: kárta # Even-syllable test
!!€ kártta:         kártta+N+Sg+Nom
!!€ kártajn:        kártta+N+Sg+Com

Such tests are very useful as checks that an inflectional lexicon behaves as it should.

The syntax is slightly different from the yaml files:

  • word form first
  • multiple alternative word forms on separate lines

Twolc tests

The twolc tests look like the following:

!!€ iemed9#
!!€ iemet#

!!€ gål'leX7tj#
!!€ gål0lå0sj#

Each pair gives an underlying form (with morphophonemes and boundary symbols) and the expected surface form; the point is to ensure that the two-level rules map the one to the other as they should.

Other Tests

You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
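
A custom test only needs to follow the Autotools convention of exiting with 0 for PASS and non-zero for FAIL; a minimal sketch (the word form and fst path are placeholders):

#!/bin/sh
# fail if a known good word form gets no analysis;
# hfst-lookup marks unanalysed input with '+?'
if echo "nimihkom" | hfst-lookup src/analyser-gt-norm.hfst | grep -q '+?'; then
    exit 1    # no analysis: FAIL
fi
exit 0        # analysed: PASS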

Closer View: From Source To Final Tool:

  • Relation Between Lexicon, Build And Speller
  • Fst's And Dictionaries

Relation Between Lexicon, Build And Speller

  • tag conventions
  • automatically generated filters
  • spellers and different writing systems / alternative orthographies

Tag Conventions

We use certain tag conventions in the infrastructure:

  • +Err/... (+Err/Orth, +Err/Cmp)
  • +Sem/...
  • and more...

Automatically Generated Filters

  • Many of these clusters of tags are used for specific purposes, and are removed from other fst's.
  • tags using a common prefix (like +Err/ or +Sem/) automatically get filters for different purposes
  • there are filters for:
    • removing the tags themselves
    • removing strings / words containing the tags
  • by adhering to these conventions, you get a lot of functionality for free
  • this system is used when...

Dealing with descriptive vs normative grammars

  • the normative is a subset of the descriptive
  • tag the non-normative forms using +Err/... tags
  • write your grammar as descriptive
  • remove the +Err/... strings
  • => normative fst!
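
The infrastructure generates the needed filters automatically, but the core idea can be sketched with the HFST command line tools (the regex and file names are illustrative, and the real filters cover all the +Err/ subtags):

# a filter that rejects every string containing the +Err/Orth tag
echo '~$[ "+Err/Orth" ]' | hfst-regexp2fst > no-err-strings.hfst

# compose it onto the analysis side of the descriptive fst => normative fst
hfst-compose -1 no-err-strings.hfst -2 lexicon-desc.hfst -o lexicon-norm.hfst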

Summary

  • scalability
  • division of labour
  • language independence
  • ... but still flexible wrt the needs of each language

Giitu

  • Thank you!