Presentation of the Divvun and Giellatekno infrastructure

University of Alberta, Edmonton, June 19th

Sjur Moshagen & Trond Trosterud, UiT The Arctic University of Norway

Content

Background

The problem

The plan

To create an infrastructure that:

The solution

Details in the rest of the presentation.

Introduction

Developed by Tommi Pirinen and Sjur Moshagen.

A schematic overview of the main components of the infrastructure:

[Figure: schematic overview of the main components of the infrastructure]

General principles

What is the infrastructure?

For this to work for many languages in parallel, we need:

Conventions

We need conventions for:

E.g., your source files are located in src/:

Directory structure

In detail:

.
├── am-shared
├── doc
├── misc
├── src
│   ├── filters
│   ├── hyphenation
│   ├── morphology
│   │   ├── affixes
│   │   └── stems
│   ├── orthography
│   ├── phonetics
│   ├── phonology
│   ├── syntax
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   └── tools
└── tools
    ├── grammarcheckers
    ├── mt
    │   └── apertium
    ├── preprocess
    ├── shellscripts
    └── spellcheckers

Explaining the directory structure

.
├── src                  = source files
│   ├── filters          = adjust FSTs for special purposes
│   ├── hyphenation      = nikîpakwâtik > ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        =
│   ├── orthography      = latin <-> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, syntactic functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or v.v.
├── test                 =
│   ├── data             = test data
│   ├── src              = tests for the FSTs in the src/ dir
└── tools                =
    ├── grammarcheckers  =
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    ├── shellscripts     = shell scripts to use the modules we create
    └── spellcheckers    = spell checkers are built here
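
Because every language follows the same layout, a script can check a language directory against it. The following is a minimal sketch, not part of the infrastructure itself: the function name check_layout is hypothetical, and the list of required directories is just a small selection from the tree above.

```shell
#!/bin/sh
# Hypothetical helper: check that a language directory contains a few of
# the conventional subdirectories shown in the tree above.
check_layout() {
    # $1: path to a language directory
    missing=0
    for d in src/morphology/affixes src/morphology/stems \
             src/phonology test/src tools/spellcheckers; do
        if [ ! -d "$1/$d" ]; then
            echo "missing: $d"
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration on a throwaway directory that is deliberately incomplete.
demo=$(mktemp -d)
mkdir -p "$demo/src/morphology/affixes" "$demo/src/morphology/stems" "$demo/test/src"
check_layout "$demo" || echo "layout incomplete"
rm -rf "$demo"
```

The function returns 0 only when every listed directory exists, so it can be used directly in a condition.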

The core

The core is a separate folder outside the language-specific ones. It contains:

Shared resources

The shared resources come in two flavours:

Shared linguistic data is typically shared only within a subgroup of languages, such as smi (Sámi) and urj-Cyrl (Uralic languages written in Cyrillic), and potentially also alg (Algonquian) and ath (Athabaskan).

The FST manipulations remove tags or tagged strings of classes typically found in all languages:

Languages

We have split the languages into four groups, depending on the type of work done on them and on their license:

langs
Languages under active development - 43 languages
startup-langs
Languages that someone has an interest in but that are not actively being developed, and whose linguistic content is thin - 11 languages
experiment-langs
The playground, as the name suggests; these languages are used for teaching, among other things - 3 languages
closed-langs
Languages with a closed license, currently only ISL (Icelandic) and DAN (Danish)

Available at:

svn co https://gtsvn.uit.no/langtech/trunk/langs/ISO639-3-CODE/

(replace ISO639-3-CODE with the actual ISO code)
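
As a concrete illustration of the substitution, the sketch below builds the checkout command for one language. The base URL is the one given above; the helper name lang_url is hypothetical, and crk (Plains Cree) is used only as an example code, on the assumption that it is among the languages in langs.

```shell
#!/bin/sh
# Base URL as given above.
langs_base="https://gtsvn.uit.no/langtech/trunk/langs"

# Hypothetical helper: print the repository URL for one ISO 639-3 code.
lang_url() {
    # $1: ISO 639-3 code, e.g. crk
    printf '%s/%s/\n' "$langs_base" "$1"
}

# Print the checkout command instead of running it, so that it can be
# inspected before any network access takes place.
echo "svn co $(lang_url crk)"
```

Running the printed command checks out the sources for that language into a directory named after the code.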

Build Structure

Support for:

Testing

Testing is done with the command make check. There is built-in support for two types of tests:

In addition, the general testing support in Autotools (more specifically, in Automake) is available, so you can add test scripts for whatever you like.
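
A custom test script of this kind could look roughly like the sketch below. It relies on the standard Automake convention for scripts listed in the TESTS variable: exit status 0 means pass, 77 means skipped, and other non-zero statuses count as failures (99 is reserved for hard errors). The function name run_case and the test case are hypothetical, and cat merely stands in for a real tool such as a hyphenator.

```shell
#!/bin/sh
# Sketch of a custom test script for "make check".
# Assumes the usual Automake convention: exit 0 = PASS, 77 = SKIP,
# other non-zero statuses = FAIL.

# run_case TOOL INPUT EXPECTED
# Feeds INPUT to TOOL on stdin and compares its output with EXPECTED.
run_case() {
    tool=$1
    input=$2
    expected=$3
    # Report SKIP rather than FAIL when the tool is not available.
    command -v "$tool" >/dev/null 2>&1 || return 77
    actual=$(printf '%s\n' "$input" | "$tool")
    if [ "$actual" != "$expected" ]; then
        echo "FAIL: $input => $actual (expected $expected)" >&2
        return 1
    fi
    return 0
}

# 'cat' is only a stand-in for a real tool; it echoes its input unchanged.
status=0
run_case cat "nikîpakwâtik" "nikîpakwâtik" || status=$?
echo "test status: $status"
# A real test script would finish with: exit "$status"
```

Returning 77 when the tool is missing keeps make check usable on machines where only part of the toolchain is installed.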

Documentation

The infrastructure supports extraction of in-source documentation written as comments in a specific format, and ultimately produces HTML pages from it.

Documentation written in the actual source code is more likely to be kept up-to-date than external documentation.

The format supports the use of a couple of variables to extract such things as lexicon names, a line of code, etc.
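
The extraction step can be pictured with the following minimal sketch. It is an illustration only: the marker "!! " for documentation lines is an assumption made here, not a confirmed description of the actual format, and the sample lexicon content is hypothetical. The sketch covers neither the variables nor the conversion to HTML.

```shell
#!/bin/sh
# ASSUMPTION: documentation lines are comments starting with "!! ".
# This marker is a placeholder for illustration only.

# Print the documentation lines of the given files (or of stdin),
# with the marker stripped.
extract_docs() {
    sed -n 's/^!! //p' "$@"
}

# Small hypothetical lexc-style source file to demonstrate on.
demo=$(mktemp)
cat > "$demo" <<'EOF'
!! Noun stems
!! All nouns start in this lexicon.
LEXICON Nouns
atim:atim N_AN ;
EOF

extract_docs "$demo"
rm -f "$demo"
```

Only the two marked lines are printed; the lexicon declaration and the entry are left out.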

The tools

The pipeline for analysis

The pipeline for grammar checking

Two startup scenarios

In the latter case, it may be possible, and even preferable, to script the conversion from the original format to the lexc format, so that the data can be reimported or updated later.
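
Such a conversion script might look like the sketch below. It makes several assumptions: the original data is a tab-separated list of lemma and word class, the word class can be used directly as the continuation lexicon, and the lexicon name Stems is hypothetical. It also does not escape characters that are special in lexc, which a real converter would need to do.

```shell
#!/bin/sh
# ASSUMPTION: input lines have the form "lemma<TAB>class".
# The lexicon name and continuation-class names are hypothetical.

# Read the word list on stdin and print a lexc lexicon on stdout.
tsv_to_lexc() {
    awk -F '\t' '
        BEGIN { print "LEXICON Stems" }
        # Skip empty lines and lines starting with "#";
        # emit every other entry as "lemma:stem CONTLEX ;".
        NF >= 2 && $1 !~ /^#/ { printf "%s:%s %s ;\n", $1, $1, $2 }
    '
}

# Two example entries.
printf 'atim\tN_AN\nmaskwa\tN_AN\n' | tsv_to_lexc
```

Because the conversion is a script rather than a manual step, it can be rerun whenever the original word list changes.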

Summary
