Edmonton presentation

University of Alberta, Edmonton, June 8th & 13th 2015

Sjur Moshagen, UiT The Arctic University of Norway

Overview of the presentation

Background and goals

Background

Goals

General principles

Bird's Eye View and Down

The House

The House and the Infra

$GTHOME - directory structure

Some less relevant dirs removed for clarity:

$GTHOME/                     # root directory, can have any name
├── experiment-langs         # language dirs used for experimentation
├── giella-core              # $GTCORE - core utilities
├── giella-shared            # shared linguistic resources
├── giella-templates         # templates for maintaining the infrastructure
├── keyboards                # keyboard apps organised roughly as the language dirs
├── langs                    # The languages being actively developed, such as:
│   ├─[...]                  #
│   ├── crk                  # Plains Cree
│   ├── est                  # Estonian
│   ├── evn                  # Evenki
│   ├── fao                  # Faroese
│   ├── fin                  # Finnish
│   ├── fkv                  # Kven
│   ├── hdn                  # Northern Haida
│   └─[...]                  #
├── ped                      # Oahpa etc.
├── prooftools               # Libraries and installers for spellers and the like
├── startup-langs            # Directory for languages in their start-up phase
├── techdoc                  # technical documentation
├── words                    # dictionary sources
└── xtdoc                    # external (user) documentation & web pages

Organisation - Dir Structure

.
├── src                  = source files
│   ├── filters          = adjust fst's for special purposes
│   ├── hyphenation      = nikîpakwâtik >  ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        = lexical entries
│   ├── orthography      = latin -> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or v.v.
├── test                 =
│   ├── data             = test data
│   └── src              = tests for the fst's in the src/ dir
└── tools                =
    ├── grammarcheckers  = prototype work, only SME for now
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    └── spellcheckers    = spell checkers are built here

Technologies

Technology for morphological analysis

We presently use three different technologies: Xerox, HFST and Foma.

Technology for syntactic parsing

# We like finite verbs:
SELECT:Vfin VFIN ;
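
As a rough illustration of what such a rule does, here is a minimal Python sketch of an unconditional SELECT applied to one word form. It is not how the parser is implemented; the tag names and the contents of the VFIN set are assumptions made for this example only.

```python
# Sketch of what an unconditional SELECT rule does to one cohort
# (one word form with all its readings).
# The tags and the contents of VFIN are assumptions for illustration.

VFIN = {"Ind", "Imp", "Cond"}  # hypothetical set of finite-verb tags


def select(readings, target):
    """Keep only the readings that carry a tag from `target`.

    If no reading matches, the cohort is returned unchanged, so the
    rule can never remove the last remaining reading.
    """
    matching = [r for r in readings if target & set(r)]
    return matching if matching else readings


cohort = [
    ["V", "Ind", "Prs", "Sg3"],  # finite verb reading
    ["V", "Inf"],                # infinitive reading
    ["N", "Sg", "Nom"],          # noun reading
]

print(select(cohort, VFIN))
```

Only the finite verb reading survives; a cohort without any finite reading is left as it is.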

Templated Build Structure And Source Files

Configurable builds

We support a lot of different tools and targets, but in most cases one only wants a handful of them. When you run ./configure, it ends with a summary of what is turned on and off:

$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:

  -- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
  -- Xerox is default on, the others off unless they are the only one present --
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no

  -- basic packages (on by default): --
  * analysers enabled: yes
  * generators enabled: yes
  * transcriptors enabled: yes
  * syntactic tools enabled: yes
  * yaml tests enabled: yes
  * generated documentation enabled: yes

  -- proofing tools (off by default): --
  * spellers enabled: no
    * hfst speller fst's enabled: no
    * foma speller enabled: no
    * hunspell generation enabled: no
  * fst hyphenator enabled: no
  * grammar checker enabled: no

  -- specialised fst's (off by default): --
  * phonetic/IPA conversion enabled: no
  * dictionary fst's enabled: no
  * Oahpa transducers enabled: no
    * L2 analyser: no
    * downcase error analyser: no
  * Apertium transducers enabled: no
  * Generate abbr.txt: no

For more ./configure options, run ./configure --help
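
If the summary needs to be inspected from a script, a small parser is enough. The sketch below is not part of the infrastructure; it only assumes that each option is printed as "* name: yes" or "* name: no", as in the output above.

```python
import re

# A few lines copied from the summary above.
SUMMARY = """\
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no
  * spellers enabled: no
    * hfst speller fst's enabled: no
"""


def parse_summary(text):
    """Map each '* name: yes/no' line to a boolean."""
    options = {}
    for line in text.splitlines():
        m = re.match(r"\s*\*\s+(.*?):\s+(yes|no)\s*$", line)
        if m:
            options[m.group(1)] = (m.group(2) == "yes")
    return options


# Names of everything that is turned on, in the order printed.
enabled = [name for name, on in parse_summary(SUMMARY).items() if on]
print(enabled)
```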

The build - schematic

Closer View Of Selected Parts:

Closer View: Documentation

Background

Implementation

Example cases:

Documentation:

Closer View: Testing

Testing Framework

All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.

All tests are run with a single command:

make check

Autotools reports PASS or FAIL for each test as it finishes.

Yaml Tests

These are the most frequently used tests. They are named after the YAML syntax of the test files. The core syntax is:

Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
     mihko+N+IN+Sg: mihko
     mihko+N+IN+Sg+Px1Sg: nimihkom
     mihko+N+IN+Sg+Px2Sg: kimihkom
     mihko+N+IN+Sg+Px1Pl: nimihkominân
     mihko+N+IN+Sg+Px12Pl: kimihkominaw
     mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
     mihko+N+IN+Sg+Px3Sg: omihkom
     mihko+N+IN+Sg+Px3Pl: omihkomiwâw
     mihko+N+IN+Sg+Px4Pl: omihkomiyiw
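
To show the idea behind such a test, here is a minimal Python sketch that checks each analysis against the form it should generate. It is not the actual test runner: the test data is written as a plain dict instead of being loaded from the yaml file, and FAKE_GENERATOR is a hypothetical stand-in for the generator fst, with one deliberately wrong entry.

```python
# Test data as it could look after loading the yaml file
# (a subset of the entries above).
TESTS = {
    "Noun - mihkw - ok": {
        "mihko+N+IN+Sg": "mihko",
        "mihko+N+IN+Sg+Px1Sg": "nimihkom",
        "mihko+N+IN+Sg+Px2Sg": "kimihkom",
    }
}

# Hypothetical stand-in for the generator fst: analysis -> surface forms.
# The last entry is deliberately wrong to show a FAIL.
FAKE_GENERATOR = {
    "mihko+N+IN+Sg": {"mihko"},
    "mihko+N+IN+Sg+Px1Sg": {"nimihkom"},
    "mihko+N+IN+Sg+Px2Sg": {"kimihko"},
}


def run_generation_tests(tests, generate):
    """Return (test name, analysis, expected, passed) for every test line."""
    results = []
    for name, pairs in tests.items():
        for analysis, expected in pairs.items():
            passed = expected in generate(analysis)
            results.append((name, analysis, expected, passed))
    return results


results = run_generation_tests(TESTS, lambda a: FAKE_GENERATOR.get(a, set()))
for name, analysis, expected, passed in results:
    print("PASS" if passed else "FAIL", analysis, "->", expected)
```

The same pairs can be checked in the other direction by feeding the word form to the analyser and looking for the expected analysis.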

Yaml test output

In-Source Tests

LexC tests

As an alternative to the yaml tests, one can specify similar test data within the source files:

LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon. 
 +N:   MUORRAInfl ;
 +N:%> MUORRACmp  ;

!!€gt-norm: kárta # Even-syllable test
!!€ kártta:         kártta+N+Sg+Nom
!!€ kártajn:        kártta+N+Sg+Com

Such tests are useful for checking that an inflectional lexicon behaves as it should.
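
The following Python sketch shows one way such test lines could be pulled out of a source file. It is an illustration only, not the extraction script used in the infrastructure. It assumes the layout seen in the example above: a header line "!!€name: test name # comment", followed by data lines with the word form first and the analysis second (the reverse of the order in the yaml files). Reading "gt-norm" as the name of the fst to test against is also an assumption.

```python
import re

# Shortened copy of the example above.
LEXC_SOURCE = """\
LEXICON MUORRA
 +N:   MUORRAInfl ;
 +N:%> MUORRACmp  ;

!!€gt-norm: kárta # Even-syllable test
!!€ kártta:         kártta+N+Sg+Nom
!!€ kártajn:        kártta+N+Sg+Com
"""

# Header: no space after the marker. Data line: space after the marker.
HEADER = re.compile(r"^!!€(\S+):\s*(.*?)\s*(?:#.*)?$")
DATA = re.compile(r"^!!€\s+(\S+):\s+(\S+)\s*$")


def extract_tests(source):
    """Collect (word form, analysis) pairs under each test header."""
    tests = {}
    current = None
    for line in source.splitlines():
        data = DATA.match(line)
        header = HEADER.match(line)
        if data and current is not None:
            tests[current].append((data.group(1), data.group(2)))
        elif header:
            current = (header.group(1), header.group(2))
            tests[current] = []
    return tests


for (fst, name), pairs in extract_tests(LEXC_SOURCE).items():
    print(fst, name, pairs)
```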

The syntax is slightly different from the yaml files:

Twolc tests

The twolc tests look like the following:

!!€ iemed9#
!!€ iemet#

!!€ gål'leX7tj#
!!€ gål0lå0sj#

The point is to ensure that the rules behave as they should.

Other Tests

You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
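
What makes this possible is that the Automake test harness only looks at the exit status of a test program: 0 is PASS, 77 is SKIP, 99 is a hard error, and any other value is FAIL. Below is a minimal sketch of the logic of such a test in Python. The file path is hypothetical, and the check itself (does the file exist and is it non-empty) is only a placeholder for a real test.

```python
import os

# Exit statuses understood by the Automake test harness.
PASS, FAIL, SKIP, HARD_ERROR = 0, 1, 77, 99

# Hypothetical path; a real test would point at a file built earlier.
SPELLER = "tools/spellcheckers/example-speller.zhfst"


def check_speller(path):
    """Return the exit status for a simple 'was the file built?' test."""
    if not os.path.exists(path):
        return SKIP   # nothing to test, e.g. spellers are not enabled
    if os.path.getsize(path) == 0:
        return FAIL   # the file exists but is empty
    return PASS


# A real test script would pass this value to sys.exit() so that
# make check can report it; here it is only printed.
print(check_speller(SPELLER))
```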

Closer View: From Source To Final Tool:

Relation Between Lexicon, Build And Speller

Tag Conventions

We use certain tag conventions in the infrastructure:

Automatically Generated Filters

Dealing with descriptive vs normative grammars

Summary

Giitu (Thank you)