Digital infrastructure for indigenous languages in North West Russia

The goal of this project is to provide three indigenous languages of Northwestern Russia (Komi, Nenets and Kildin Sami) with a digital infrastructure that makes it possible to continue using these languages in administration, schools and society at large in the modern Russian Federation.

By digital infrastructure we mean:

  • Letters and keyboard layouts available on modern computers
  • Grammatical and lexical analysis programs
  • Spellcheckers for the most important computer programs
  • Multi-language dictionaries in electronic form

All these languages have traditionally been used in schools and partly also in cultural life. Without a digital infrastructure in place, they will not make the transition to the ICT community, and will thus fall out of use in public life.

Komi is the official language of the Republic of Komi. Nenets is the official language of the Nenets and Yamalo-Nenets autonomous areas, and it is also spoken in the western part of Taimyr. Kildin Sami is primarily spoken in Lujávr (Lovozero) on the Kola Peninsula.

These three languages are different, and the specific goals of each language subproject are to some extent different from the others. The languages nevertheless face the same challenges. All three have letters that are not part of the regular Cyrillic alphabet, so they need their own keyboard layouts and their own procedures for internet publication.
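The keyboard and publication problem starts with knowing which letters fall outside the Russian alphabet. The following is a minimal sketch, not project code: it lists such letters in a text sample together with their Unicode code points. The function names are hypothetical, and the two sample letters (U+04E7 and U+0456) are assumed here to be typical Komi letters.

```python
import unicodedata

# The 33 letters of the modern Russian alphabet (lower case).
RUSSIAN_ALPHABET = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")

def extra_cyrillic_letters(text):
    """Return the letters in `text` outside the Russian alphabet, sorted by code point."""
    found = {ch for ch in text.lower() if ch.isalpha() and ch not in RUSSIAN_ALPHABET}
    return sorted(found)

def describe(ch):
    """Format a character as 'U+XXXX NAME', e.g. for a keyboard layout checklist."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Sample written with escapes so the code points are unambiguous.
# Assumption: U+04E7 and U+0456 are letters a Komi layout must provide.
sample = "\u04e7шинь \u0456"
for ch in extra_cyrillic_letters(sample):
    print(describe(ch))
```

Running such a check over a text collection for each language would give a first inventory of the characters that keyboard layouts and fonts have to cover.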

None of the languages have access to language technology resources such as spellcheckers, grammar checkers, parser programs, or multi-lingual resources. The difference between the respective language communities lies partly in what linguistic data resources are available, and partly in the extent to which each community has resolved internal normative questions.

The present project will, to the extent that the task has not been solved already, create keyboard layouts and spelling and grammatical analysis programs for all three languages. For each language, a team of one linguist, one or more philologists, and one (part-time) programmer will be appointed. In addition, there will be a common team for the language-independent infrastructure.

Seen in a typological light, all circumpolar languages have a lot in common. Infrastructure and grammatical work carried out within this project will thus be relevant for other circumpolar languages. The project will follow the principles of open source, and the results may be reused to the benefit of other language communities.

Background: Digital Infrastructure for Indigenous languages in North-West Russia

Sjur N. Moshagen (Divvun) and Trond Trosterud (UiT).

The philosophy behind the project

The present project is based upon our earlier projects at the University of Tromsø and at the Sami parliament, creating spellcheckers for North and Lule Sami and for Greenlandic. We have also carried out pilot projects for the languages in Russia.

The project will include language technology knowledge transfer to research institutions for Kildin, Komi and Nenets in Russia, and it will also increase the knowledge of these languages. We are therefore convinced that we will be able to carry out this project in a way that has a clear and lasting effect on the relevant language communities.

Digital Infrastructure

The term digital infrastructure of a language is defined above. In short, it is the technology that makes it possible to use a language in today's computerized society, e.g. in the administrative context, or in modern publishing activities.

Without this infrastructure in place, all declarations of support for and appreciation of a minority language remain empty words. As long as we are not able to use computers to write letters in the language, or to correct text or find the correct terminology, the language cannot be put into use in the administration of modern societies.

One can divide the level of the available infrastructure into three different generations, or phases:

  1. First generation: keyboard layout, fonts, date formats, sorting
  2. Second generation: spellcheckers, automatic hyphenation, electronic dictionaries, automatic word analysis
  3. Third generation: wordnets, thesauri and thesaurus dictionaries, automatic sentence analysis, machine translation, speech technology
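Sorting, listed under the first generation, illustrates why even basic infrastructure is language-specific. The sketch below assumes that the Komi alphabet is the Russian alphabet with U+0456 placed after "и" and U+04E7 after "о"; the names `KOMI_ALPHABET` and `komi_sort_key` are hypothetical. A real system would more likely rely on locale data supplied to the operating system than on a hand-written key.

```python
# Assumed Komi alphabet order: Russian plus U+0456 after "и" and U+04E7 after "о".
KOMI_ALPHABET = "абвгдеёжзи\u0456йклмно\u04e7прстуфхцчшщъыьэюя"
RANK = {letter: index for index, letter in enumerate(KOMI_ALPHABET)}

def komi_sort_key(word):
    """Map each character to its alphabet position.

    Characters outside the alphabet sort after all letters, by code point.
    """
    return [(RANK.get(ch, len(RANK)), ch) for ch in word.lower()]

# Sample strings; the second one starts with U+04E7.
words = ["пон", "\u04e7шинь", "ош"]

by_code_point = sorted(words)                    # naive order by Unicode code point
by_alphabet = sorted(words, key=komi_sort_key)   # order by the assumed alphabet

print(by_code_point)
print(by_alphabet)
```

With code-point order, the word beginning with U+04E7 ends up after every word beginning with a Russian letter; with the alphabet-based key it is placed between the words beginning with "о" and "п".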

For this project the goal is to create second-generation infrastructure. That is roughly what is now available for North Sami, Lule Sami and Norwegian.

Permanent infrastructure

By permanent or sustainable infrastructure we mean partly that the project is conducted as open source (no third party owns part of the source and is thus able to block similar efforts in the future), partly that we document everything we do (making our own knowledge explicit), and partly that we build up knowledge and academic environments within each language community.

A permanent infrastructure as defined here is important to prevent the work we are doing now from being wasted in the long term. It should be possible, with reasonable effort, to pick up the thread after this project and build on the work done within it.

An important means of achieving this is to structure the project and its work as modularly as possible.

The three languages are briefly presented in the basic document for the project.

Kildin Sami

For Kildin an active revitalization effort has started. Language camps have been held in Lovozero two summers in a row. A turning point for the Kildin Sami community was the centralisation policy of the 1960s, when most of the Sami were moved from around the Kola peninsula into Lovozero. Unlike other minority languages in the Soviet Union, the Kildin literary language was not reintroduced as the language of instruction after the Second World War, but only in the 1970s. This has led to a situation where the generation that speaks the language best is the one that did not learn to read and write it at school. In practice, only a small group of Sami write in Sami; the others deliver text to this small group of writers for proofreading, or even for translation from Russian.

The aim of the Kildin proofing program is to allow more than the small group of "writing experts" to write and publish their own texts. Kildin digital dictionaries will make this possible.

The work will be done in cooperation with the Sami language center in Lovozero, a project in which both the University of Tromsø and the Sami College are involved, in addition to the language workers in Lovozero.

Komi

Komi is the largest of the three languages. In this project it will probably be represented by the Department of Philology at Komi State University or by the Department of Languages, Literature and History at the Komi Scientific Centre (IJaLI at KomNTs).

At IJaLI they have focused on lexicography (among other things, they have published a Komi-Russian dictionary of 31,000 words, which could be an important resource for the Komi analyzer). A key resource is also the Department of Komi Philology at the Komi State University. Unlike Kildin, Komi has had an unbroken tradition of language teaching and use throughout the Soviet period. It has a national literature going back to the 1800s (it even boasts the oldest Uralic literary tradition), a publishing sector with about 20 literary titles a year, and a couple of magazines and newspapers. Streets and official institutions are to some extent named in two languages.

Nenets

Nenets has approximately one and a half times as many speakers as North Sámi. The language has few dialect differences and a relatively well-defined written standard, but it is spoken over a large area and has official status in three different autonomous regions (Nenets, Yamalo-Nenets and Taymyr). Of the three areas, most of the language work has been done in Yamalo-Nenets, where there is also an educational institute.

Work on Nenets will be based on a machine-readable version of N.M. Tereshchenko's standard Nenets-Russian dictionary, and on a morphological dictionary of Nenets (T. Salminen). Nenets differs from both Sami and Komi, but the solid grammatical groundwork just mentioned gives a good starting point for the work.

Language programs we will develop

For each of the languages we will develop these programs:

  • Automatic word analysis
  • Spellchecking
  • Automatic hyphenation
  • Electronic dictionaries

This is essentially the same set of programs that we have developed for North and Lule Sami, and that we are about to develop for South Sami as well.

These programs can be regarded as the most basic language programs for any language. The technology and resources behind the programs are also building blocks for next-generation language technology.

The underlying technology

The Sami projects use proprietary compilers from Xerox and Polderland. We will preferably use transducer technology developed at Helsinki University, an open and free alternative to the Xerox compilers, for both word analysis and proofing programs. If this does not give results as good as we expect, we can fall back on Xerox and Polderland.

For next generation tools (sentence analysis, grammar, etc.), we use Constraint Grammar (CG3), a technology from Odense. This technology is already open source.

Future use of this technology
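To give an idea of what transducer-based word analysis does, here is a toy sketch in plain Python. It does not use the Helsinki or Xerox tools, and it is not how those compilers are implemented. It only illustrates the principle: a finite-state network reads a word form and emits the base form plus grammatical tags. The two-word Finnish lexicon, the tag names (`+N`, `+Pl`, ...) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

START, STEM_END, FINAL = 0, 1, 2

def build_transducer(stems, suffixes):
    """Build arcs[state] = [(input_char, output_string, next_state), ...]."""
    arcs = defaultdict(list)
    next_free = 3  # states 0-2 are reserved above

    def add_path(source, text, target, tail, echo):
        # Read `text` character by character; emit `tail` on the last arc.
        nonlocal next_free
        if not text:  # empty suffix: an epsilon arc that reads nothing
            arcs[source].append(("", tail, target))
            return
        state = source
        for i, ch in enumerate(text):
            last = i == len(text) - 1
            nxt = target if last else next_free
            if not last:
                next_free += 1
            emitted = (ch if echo else "") + (tail if last else "")
            arcs[state].append((ch, emitted, nxt))
            state = nxt

    for stem, tag in stems.items():
        add_path(START, stem, STEM_END, tag, echo=True)      # copy stem, add its tag
    for suffix, tags in suffixes.items():
        add_path(STEM_END, suffix, FINAL, tags, echo=False)  # replace suffix by tags
    return arcs, {FINAL}

def analyse(word, arcs, finals, state=START, pos=0, out=""):
    """Return every analysis string the network produces for `word`."""
    results = []
    if pos == len(word) and state in finals:
        results.append(out)
    for symbol, emitted, target in arcs.get(state, []):
        if symbol == "":
            results += analyse(word, arcs, finals, target, pos, out + emitted)
        elif pos < len(word) and word[pos] == symbol:
            results += analyse(word, arcs, finals, target, pos + 1, out + emitted)
    return results

# Hypothetical toy lexicon (Finnish "kala" and "talo"), not project data.
STEMS = {"kala": "+N", "talo": "+N"}
SUFFIXES = {"": "+Sg+Nom", "t": "+Pl+Nom", "ssa": "+Sg+Ine"}
ARCS, FINALS = build_transducer(STEMS, SUFFIXES)

print(analyse("kalat", ARCS, FINALS))
print(analyse("talossa", ARCS, FINALS))
```

A spellchecker can be built on the same network: a word form is accepted if it receives at least one analysis, and rejected otherwise.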

The developed technology will provide a good foundation for many different tools and programs in the future. For Northern Sami, we have developed different kinds of grammar teaching programs, both for native writers and pupils, and for second language students. Such educational tools are also not far from a grammar checker.

Other possible tools that build on resources from the project are:

  • speech synthesis
  • sentence analysis
  • tagged corpus (important resource for both research and lexicology and terminology work)
  • indexing and searching
  • text summarization
  • machine translation

Structure

The project will be a collaboration between the existing language technology environment at the University of Tromsø and the Sami Parliament, and relevant scientific institutions in Russia and Finland. The language technology centre at the University of Tromsø will be the coordinator.

Potential problem areas in the project

Although we have carried out similar projects for several languages, it is clear that each new language and every new language community brings with it new challenges. To the extent that we have not worked in Russia before, we will have to learn how to cooperate there. Our good contacts in the relevant language communities will be an important aid.

The written languages also differ in how far their standardization has progressed, and unresolved normative questions can easily make it difficult to complete a spellchecker. Experience shows that a spellchecker project will force more explicit language standardization, which is very positive for the writing culture.