Copyright © 2004-2019 UiT Norgga árktalaš universitehta

feedback@divvun.no

Using the IPA Generating Pipeline

Contents:

Requirements
Command pipeline
Further work

Requirements

HFST (at least svn r2160)
VISLCG3 (a recent svn version)
Apertium (one tool only)

The pipeline is not yet fully functional. This document serves both as a guide to what remains to be done and as documentation of the present status and planned functionality.

Command pipeline

Here is a test command illustrating the whole processing pipeline, from plain text input to IPA output. Not all components are in place yet; the missing ones are substituted with alternatives to get something running:

$ echo "Iđđes dii. 9 mun doapmalan čoaggit alitnásttiid álbmotmeahcis." | \
apertium-destxt | \
hfst-proc -C -w -e -q -r sme/bin/sme.hfstol | \
vislcg3 -g sme/src/sme-dis.rle | \
grep -v '^"' | cut -d '"' -f3 | cut -d ' ' -f2 | \
hfst-optimized-lookup -q sme/bin/isme.hfstol | \
cut -f2 | grep -v '^$'

The output produced with the above pipeline is:

Iđđes+Adv+?
dii.
9
+?
doapmalan
čoaggit
alitnásttiid
álbmotmeahcin
álbmotmeahcis
..
+?

The target is to produce IPA, one output token for each input token.

The text output option illustrated above can be used to ensure 1:1 round-trip correctness for disambiguation and generation: we should be able to produce the same output as we put into the pipeline.
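One way to check the round trip is to compare the input tokens with the generated forms line by line. The following is only a minimal sketch and not part of the pipeline itself: the function name roundtrip_mismatches is hypothetical, both arguments are assumed to hold one token per line, and the sample values are copied from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print every line where the expected token and the
# generated token differ, as "line<TAB>expected<TAB>generated".
# $1 = expected tokens, $2 = generated tokens, one token per line each.
roundtrip_mismatches() {
    expected_file=$(mktemp)
    generated_file=$(mktemp)
    printf '%s\n' "$1" > "$expected_file"
    printf '%s\n' "$2" > "$generated_file"
    # paste joins the two lists line by line; awk keeps the lines that differ.
    # Appending "" forces a string comparison even for numeric tokens like 9.
    paste "$expected_file" "$generated_file" |
        awk -F '\t' '($1 "") != ($2 "") { printf "%d\t%s\t%s\n", NR, $1, $2 }'
    rm -f "$expected_file" "$generated_file"
}

# Three tokens from the example: the first does not survive the round trip.
expected='Iđđes
doapmalan
čoaggit'
generated='Iđđes+Adv+?
doapmalan
čoaggit'

mismatches=$(roundtrip_mismatches "$expected" "$generated")
printf '%s\n' "$mismatches"
```

An empty result would mean that every generated form equals its input token. The sketch assumes both lists have the same number of lines, so tokens that produce more than one generated form need to be handled before the comparison.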

Commented commands

Each command is commented below:

echo "Iđđes dii. 9 mun doapmalan čoaggit alitnásttiid álbmotmeahcis." - the input data piped to the next step
apertium-destxt - hfst-proc requires certain characters to be escaped, and this tool does the escaping
hfst-proc -C -w -e -q -r sme/bin/sme.hfstol - tokenise and analyse the text, removing superfluous compound analyses (-e), producing VISLCG3-formatted output (-C) and adding the raw analysis string as a subreading (-r); the lemma is returned in dictionary case (-w), which is needed for generation to work
vislcg3 -g sme/src/sme-dis.rle - disambiguate
grep + cut + cut - temporary manual postprocessing to get only the disambiguated raw analysis string (hopefully to be replaced with some VISLCG3 output option), which is then given to:
hfst-optimized-lookup -q sme/bin/isme.hfstol - generate IPA strings from the input (the exemplified transducer generates regular orthographic forms for now)
cut + grep - do some simple postprocessing to only get the generated wordforms (this should be added as an output option to hfst-lookup)
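The final cut + grep step can be tried in isolation on canned data, without any transducer installed. The sketch below assumes that the lookup tool prints one tab-separated line per result, with the input string in the first field and the generated form in the second, followed by an empty line; the analysis strings are hypothetical placeholders and are not taken from the real sme transducer.

```shell
#!/bin/sh
# Canned stand-in for the lookup output (assumed format: input<TAB>output,
# then an empty separator line). The tag strings are hypothetical.
lookup_output=$(printf 'čoaggit+V+Inf\tčoaggit\n\ndoapmalan+V+PrfPrc\tdoapmalan\n\n')

# Keep the second field (the generated form) and drop the empty separator lines.
wordforms=$(printf '%s\n' "$lookup_output" | cut -f2 | grep -v '^$')
printf '%s\n' "$wordforms"
```

Lines without a tab pass through cut unchanged, which is why the empty separator lines survive the first command and have to be removed by grep.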

Further work

replace the analysing transducer with a tailored speech synthesis transducer; the most important difference from the regular transducer is that most (all?) tags are included, to ensure round-trip stability
tune the disambiguation
replace the generating transducer with a real IPA transducer