Copyright © 2004-2019 UiT Norgga árktalaš universitehta

giellalt@uit.no

Preprocessing the input

Introduction

Tokenizing

Obsolete: The tokenizer file tok.txt

Just as for North Sámi, Lule Saami preprocessing was formerly done with the Xerox tokenize tool and the language-specific file tok.txt. The tokenizer is written as a set of regular expressions, and the source file (tok.txt) was compiled with xfst. As explained for the sme preprocessing, this approach has been replaced by a preprocessor script written in Perl, gt/script/preprocess.

The current preprocessor

Preprocessing is done by the Perl script gt/script/preprocess, which is language-independent. The script is documented separately. The language-dependent part of the preprocessing is handled via the file smj/bin/abbr.txt.

Handling abbreviations

Lule Saami abbreviations are handled as for North Saami.

Spell relaxation of æ/ä, ø/ö

This feature is common to Lule and South Sami and is not found in North Sami. The letter pairs æ/ä and ø/ö are used interchangeably in Norway and Sweden. The parser accepts either variant of each pair.

The xfst file to handle this is the language-independent spellrelax.regex. It contains rules like:

ń (->) ñ, ŋ (->) ñ, æ (->) ä, ø (->) ö ;

The line says that æ may optionally be replaced by ä and that ø may optionally be replaced with ö, and the same for the different ways of writing ŋ.
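The effect of such optional replacements can be illustrated outside xfst. The following is a minimal Python sketch, not the project's implementation: the rule table copies the four pairs from the line quoted above, while the function name relaxed_variants and the sample strings are made up for illustration. It lists every spelling that the optional rules would allow for a given form.

```python
from itertools import product

# Rule table copied from the quoted spellrelax line:
# the key may optionally be written as the value.
RELAX_RULES = {"ń": "ñ", "ŋ": "ñ", "æ": "ä", "ø": "ö"}


def relaxed_variants(word, rules=RELAX_RULES):
    """Return every spelling of `word` allowed by the optional replacements.

    Each character covered by a rule may either stay as it is or be
    replaced, independently of the other characters (hypothetical helper).
    """
    options = [(ch, rules[ch]) if ch in rules else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    # Made-up sample string, not a real word.
    for variant in sorted(relaxed_variants("æxø")):
        print(variant)
```

Because each replacement is optional and independent, a form with n relaxable letters yields 2^n accepted spellings. The sketch assumes precomposed Unicode characters in the input.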

We plan to make parts of the spellrelax file language-dependent.

Initial capitalization

There is a language-independent inituppercase.regex file. Cf. the documentation for initial capitalization written for North Saami.
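As a rough illustration only, and assuming that the purpose of inituppercase.regex is to let a lower-case lexicon form also be accepted with a capital first letter (the file's actual rules are not shown here), the idea can be sketched in Python. The function name initial_case_variants is hypothetical.

```python
def initial_case_variants(word):
    """Return the form itself plus its initial-capital version.

    Assumption: initial capitalization is an optional change of the first
    letter only; the rest of the word is left untouched.
    """
    if not word:
        return {word}
    return {word, word[0].upper() + word[1:]}


if __name__ == "__main__":
    # Made-up sample string, not a real word.
    print(sorted(initial_case_variants("áxx")))
```

A form that already starts with a capital letter yields a single variant, since upper-casing its first letter changes nothing.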

Capitalization of whole words

This has not yet been implemented.