Language technology for low-resource languages

Jack Rueter and Sjur Moshagen

Rule-based language technology

Our rule-based approach:

Both technologies were developed here at HU :-)

Actual implementations:

Open source is important: it avoids being locked in, or having to spend much more money redoing things.

Word-level technology

Both formalisms should be easily recognisable by linguists.

Example — twolc

Alphabet m p ;

Rules

"N:m rule"
N:m <=>  _ p: ;

"p:m rule"
p:m <=> :m _  ;

Target change (lexical form, then surface form):

kaNpat
kammat
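
One way to read the two rules is as constraints on an aligned lexical/surface pair, all checked at the same time. The Python sketch below is only an illustration of that reading, not how twolc compiles rules. The function name `satisfies_rules` is made up here, and it assumes the two strings are already aligned symbol by symbol.

```python
def satisfies_rules(lexical, surface):
    """Check an aligned lexical:surface pair against both rules at once.

    Assumption of this sketch: both strings have the same length and are
    aligned one symbol to one symbol.
    """
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must be aligned")
    pairs = list(zip(lexical, surface))
    for i, (lex, surf) in enumerate(pairs):
        next_lex = pairs[i + 1][0] if i + 1 < len(pairs) else None
        prev_surf = pairs[i - 1][1] if i > 0 else None

        # "N:m rule":  N:m <=> _ p: ;
        # N surfaces as m if and only if the next lexical symbol is p.
        if lex == "N" and (surf == "m") != (next_lex == "p"):
            return False

        # "p:m rule":  p:m <=> :m _ ;
        # p surfaces as m if and only if the previous surface symbol is m.
        if lex == "p" and (surf == "m") != (prev_surf == "m"):
            return False
    return True


print(satisfies_rules("kaNpat", "kammat"))
print(satisfies_rules("kaNpat", "kampat"))
```

The first pair is accepted and the second is rejected, because a lexical p after a surface m must itself surface as m. Neither rule sees the output of the other; both look at the same pair.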

Example — rewrite rules

The same changes written using rewrite rules:

[ N -> m || _ p ]
.o.
[ p -> m || m _ ];

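Rewrite rules are ordered: they are composed with .o. and each rule applies to the output of the previous one. Twolc rules are not ordered; they all apply in parallel.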
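
To see why the order matters, the two rewrite rules can be imitated with regular-expression substitutions. This is a rough Python stand-in for the composed rules, not xfst itself, and the helper names (`n_to_m`, `p_to_m`, `compose`) are invented for the illustration.

```python
import re


def n_to_m(form):
    """[ N -> m || _ p ]: rewrite N as m when a p follows."""
    return re.sub(r"N(?=p)", "m", form)


def p_to_m(form):
    """[ p -> m || m _ ]: rewrite p as m when an m precedes."""
    return re.sub(r"(?<=m)p", "m", form)


def compose(*rules):
    """Apply the rules in the given order, like composition with .o."""
    def apply(form):
        for rule in rules:
            form = rule(form)
        return form
    return apply


in_order = compose(n_to_m, p_to_m)
reversed_order = compose(p_to_m, n_to_m)

print(in_order("kaNpat"))        # kaNpat -> kampat -> kammat
print(reversed_order("kaNpat"))  # kaNpat -> kaNpat -> kampat
```

In the reversed order the second rule finds no m in front of the p when it runs, so the form stops at kampat.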

Example — lexc

A very simplistic English lexicon:

Multichar_Symbols
+V +Pres +3Sg +PresPtc +Past

! Lexicon containing lexical stems:
LEXICON Root
 walk V ;
 talk V ;
 pack V ;

! Lexicon containing POS tag only:
LEXICON V
 +V: V-suff ;

! Lexicon containing inflectional suffixes and corresponding tags:
LEXICON V-suff
 +Pres+3Sg:s   # ;
     +Past:ed  # ;
  +PresPtc:ing # ;
     +Pres:    # ;
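
The lexicon above can be read as a small network of continuation classes, where every path from Root to # pairs an analysis with a word form. The Python sketch below walks those paths by brute force. It is only meant to show the idea: a real lexc compiler builds a finite-state transducer instead, and the names `LEXICON`, `paths`, `analyse` and `generate` are chosen here for the illustration.

```python
# Each entry is (analysis side, surface side, continuation lexicon).
LEXICON = {
    "Root": [("walk", "walk", "V"), ("talk", "talk", "V"), ("pack", "pack", "V")],
    "V": [("+V", "", "V-suff")],
    "V-suff": [
        ("+Pres+3Sg", "s", "#"),
        ("+Past", "ed", "#"),
        ("+PresPtc", "ing", "#"),
        ("+Pres", "", "#"),
    ],
}


def paths(name="Root", upper="", lower=""):
    """Yield every (analysis, word form) pair reachable from `name`."""
    if name == "#":  # end of word
        yield upper, lower
        return
    for up, low, continuation in LEXICON[name]:
        yield from paths(continuation, upper + up, lower + low)


def analyse(word_form):
    """Word form -> list of analyses."""
    return [upper for upper, lower in paths() if lower == word_form]


def generate(analysis):
    """Analysis -> list of word forms."""
    return [lower for upper, lower in paths() if upper == analysis]


print(analyse("walks"))
print(generate("talk+V+Past"))
```

The same set of pairs serves both directions, which is the point of modelling the lexicon as a transducer.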

Summary

End result: computer model of morphology and morphophonology. This model can analyse and generate word forms.

Sentence level technology

VISLCG3 is an extended version of the original Constraint Grammar syntax developed by Karlsson et al. It is maintained and developed in Odense, Denmark, and it is open source.

Constraint grammars work "backwards": instead of imposing a structure on a sentence, they select valid readings or remove invalid ones in a given context. By eliminating readings over and over again, one should be left with only the correct readings. From the CG-3 tutorial:

(a) REMOVE VFIN IF (-1 ART) ;
(b) REMOVE (N) IF (-1 (PERS NOM)) ;

(a) will remove finite verb readings (the target) from a cohort, if the one immediately to the left (-1) contains an article tag, while (b) will remove noun readings in the presence of an immediately preceding personal pronoun in the nominative, thus disambiguating nominal-verbal ambiguities like "hit" in "the hit/they hit".

Note that the target VFIN is a defined set (e.g. consisting of tense or mode tags), while the target (N) is a simple tag, declared as a set on-the-fly by using parentheses.
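
The effect of the two REMOVE rules can be imitated in a few lines of Python. This is a toy model under several assumptions, not VISLCG3: a cohort is a dict holding a word form and a list of readings, a reading is a set of tags, VFIN is treated as a plain tag rather than a defined set, and the names `remove` and `disambiguate` are invented here. As a simplification, a rule is skipped when it would delete every reading of a cohort.

```python
def remove(cohorts, target, offset, context):
    """REMOVE target IF (offset context), applied to every cohort in turn."""
    for i, cohort in enumerate(cohorts):
        j = i + offset
        if not 0 <= j < len(cohorts):
            continue  # the context position falls outside the sentence
        # The context holds if some reading at position j has all context tags.
        if not any(context <= reading for reading in cohorts[j]["readings"]):
            continue
        kept = [r for r in cohort["readings"] if not target <= r]
        if kept:  # never delete the last remaining reading
            cohort["readings"] = kept
    return cohorts


def disambiguate(cohorts):
    remove(cohorts, {"VFIN"}, -1, {"ART"})       # (a)
    remove(cohorts, {"N"}, -1, {"PERS", "NOM"})  # (b)
    return cohorts


the_hit = [
    {"form": "the", "readings": [{"ART"}]},
    {"form": "hit", "readings": [{"N"}, {"V", "VFIN"}]},
]
they_hit = [
    {"form": "they", "readings": [{"PERS", "NOM"}]},
    {"form": "hit", "readings": [{"N"}, {"V", "VFIN"}]},
]

print(disambiguate(the_hit)[1]["readings"])   # only the noun reading is left
print(disambiguate(they_hit)[1]["readings"])  # only the finite verb reading is left
```

Nothing is built up here: the rules only discard readings, and what survives is the analysis.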

Constraint grammars require a lot of manual work, but will generally achieve substantially higher scores than similar tools using other techniques (given enough work).

Tools for linguistic research

Explicit Grammars

Analysers

Generators

Use Model To Process (Analyse) Text

=> Korp (searchable corpus of analysed texts)

Tools for speakers of minority languages

The listed tools are all supported by the Giella infrastructure.

Keyboards

Without keyboards, writing texts in any language can become almost impossible. The first step of building language technology for a language is thus to make a keyboard.

Using the keyboard part of the Giella infrastructure, the amount of work needed to produce installable keyboards is minimal. The effort should instead go into the design of the keyboard, so that it works optimally for the language community. Actually building the installation packages is a matter of minutes.

Layout definition

The layout is defined in YAML syntax, and is just the following:

modes:
  mobile-default: |
    á š e r t y u i o p â
    a s d f g h j k l ä đ
      ž z č c v b n m ŋ
  mobile-shift: |
    Á Š E R T Y U I O P Â
    A S D F G H J K L Ä Đ
      Ž Z Č C V B N M Ŋ

longpress:
  Á: Q
  A: Å   Á À Â Ã Ạ
  á: q
  a: å   á à â ã ạ
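
Because each mode is just rows of keys separated by whitespace, a layout like this is easy to sanity-check before building a keyboard. The Python sketch below is a hypothetical helper, not part of the Giella tooling: it takes the two mode blocks as plain strings (so no YAML library is needed) and reports rows where the default and shift modes do not have the same number of keys.

```python
DEFAULT = """\
á š e r t y u i o p â
a s d f g h j k l ä đ
  ž z č c v b n m ŋ
"""

SHIFT = """\
Á Š E R T Y U I O P Â
A S D F G H J K L Ä Đ
  Ž Z Č C V B N M Ŋ
"""


def parse_rows(block):
    """Split a mode block into rows of keys, ignoring indentation."""
    return [line.split() for line in block.splitlines() if line.strip()]


def check_modes(default_block, shift_block):
    """Return a list of problems; an empty list means the modes line up."""
    default_rows = parse_rows(default_block)
    shift_rows = parse_rows(shift_block)
    problems = []
    if len(default_rows) != len(shift_rows):
        problems.append(
            f"row count differs: {len(default_rows)} vs {len(shift_rows)}"
        )
    for number, (low, up) in enumerate(zip(default_rows, shift_rows), start=1):
        if len(low) != len(up):
            problems.append(f"row {number}: {len(low)} keys vs {len(up)} keys")
    return problems


print([len(row) for row in parse_rows(DEFAULT)])
print(check_modes(DEFAULT, SHIFT))
```

For the layout above the rows hold 11, 11 and 9 keys in both modes, so the list of problems is empty.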

(Demo: Skolt Sámi macOS keyboard)

Spell Checkers

(Demo: Gaelic)

Morphologically Aware Hyphenators

Morphologically Aware Dictionaries

Grammar Checkers

To handle wrongly split compounds, we convert the word boundaries inside compounds into spaces, and try to analyse the result. If we get an analysis, it is treated as an error candidate, subject to further disambiguation and error-detection rules.
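
The idea can be sketched in Python with a toy word list standing in for the analyser. Everything named here is an assumption made for the illustration: the `KNOWN_COMPOUNDS` entries, the use of # as the compound boundary marker, and the function `split_compound_candidates`. The real grammar checker works on the finite-state analyser and passes its candidates on to Constraint Grammar rules.

```python
# Hypothetical compound entries, with # marking the compound boundary.
KNOWN_COMPOUNDS = {"book#shelf", "tooth#brush"}


def split_compound_candidates(tokens):
    """Find token sequences that match a compound written with spaces.

    Returns (position, text as written, suggested joined form) tuples.
    These are only candidates: later rules decide whether they are errors.
    """
    candidates = []
    for entry in KNOWN_COMPOUNDS:
        parts = entry.split("#")  # boundary turned into a token break
        joined = "".join(parts)
        width = len(parts)
        for i in range(len(tokens) - width + 1):
            if tokens[i:i + width] == parts:
                candidates.append((i, " ".join(parts), joined))
    return sorted(candidates)


print(split_compound_candidates("a new tooth brush on the book shelf".split()))
```

Each hit is reported with its position so that later rules can look at the surrounding words before an error is raised.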

(Demo: grammar checker)

Machine translation

(Demo using Ávvir)

ICALL

Text-To-Speech

(Demo using Ávvir)

Conclusions

Language coverage

Hands-on on Thursday
