Copyright © 2004-2019 UiT Norgga árktalaš universitehta

giellalt@uit.no

130814

FAD meeting, 14 August 2013

Present:

BM, Cip, Marja, Trond.

Agenda

the presentation in Inari (Enare)
the work
next meeting

The presentation in Inari (Enare)

Points from the abstract:

We report on the ongoing work
evaluation by native speakers
comparing to the dictionary
We have a dictionary
We get a domain-specific list
What does it give us?

Points for us:

what have we done
does a Saami specialised terminology exist
are we able to find it
is this useful

The work

Disambiguate

src_gt-fad_merged>grep 'src="fad"' _out_/* | cut -d ':' -f1 | sort | uniq -c | sort -nr 
1974 _out_/N_nobsme.xml
 682 _out_/V_nobsme.xml
 319 _out_/A_nobsme.xml
 ==> around 3000 pure fad t-elements

Status

src_fad-only>grep '<e' * | grep 'mg_c' | sort | uniq -c | sort -nr  
 151 N_nobsme.xml:   <e src="fad" mg_c="2">
 120 N_nobsme.xml:   <e src="fad" mg_c="3">
 103 N_nobsme.xml:   <e src="fad" mg_c="4">
  37 N_nobsme.xml:   <e src="fad" mg_c="5">
  17 N_nobsme.xml:   <e src="fad" mg_c="6">
   9 N_nobsme.xml:   <e src="fad" mg_c="7">
   4 N_nobsme.xml:   <e src="fad" mg_c="8">
   1 N_nobsme.xml:   <e src="fad" mg_c="9">
   1 N_nobsme.xml:   <e src="fad" mg_c="10">
   
   src_fad-only>grep '<e' * | grep 'mg_c' | wc -l 
     443
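
The same distribution can be computed without grep. The following is a minimal sketch, assuming each dictionary file wraps its <e> entries in a single root element (called `r` here, which is a hypothetical name) and that `mg_c` is an attribute on <e>, as in the output above. The sample data is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the root element name "r" is an assumption.
SAMPLE = """<r>
   <e src="fad" mg_c="2"/>
   <e src="fad" mg_c="3"/>
   <e src="fad" mg_c="2"/>
   <e src="fad"/>
   <e src="other" mg_c="2"/>
</r>"""


def count_mg_c(xml_text, src="fad"):
    """Count entries per mg_c value, restricted to one src."""
    root = ET.fromstring(xml_text)
    return Counter(
        e.get("mg_c")
        for e in root.iter("e")
        if e.get("src") == src and e.get("mg_c") is not None
    )


counts = count_mg_c(SAMPLE)
# Print in the same "count value" order as `uniq -c | sort -nr`.
for value, n in counts.most_common():
    print(n, value)
```

Summing `counts.values()` corresponds to the `wc -l` total.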

For lemma and translation:

absolute frequency of the word in the whole domain
relative frequency of the word in the whole domain = gfL, gfT
absolute frequency of the word in the specialised domain
relative frequency of the word in the specialised domain = ffL, ffT
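
As a small sketch of how the relative values follow from the absolute ones, assuming the usual definition (absolute count divided by the number of tokens in the corpus). The function name and the counts below are hypothetical and are not taken from nowac or fad.

```python
def relative_frequency(abs_count, corpus_size):
    """Relative frequency = absolute count / total number of tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return abs_count / corpus_size


# Hypothetical counts, for illustration only.
gfL = relative_frequency(5, 1000)   # lemma in the whole domain
ffL = relative_frequency(3, 100)    # lemma in the specialised domain
print(gfL, ffL)
```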

What can we do with these numbers?

Scenarios:

common in the specialised domain / rare in the whole domain
What is the threshold for finding specialised terms?
Do we find specialised terms at all?
Do we find domains?

The files:

   <e>
      <lg>
         <l pos="N" gf="0.0000000623088" ff="0">topptekst</l>
      </lg>
      <mg>
         <tg xml:lang="sme">
            <t pos="N" usage="vd" gf="0" ff="0">badjeteaksta</t>
         </tg>
      </mg>
   </e>

   <e>
      <lg>
         <l pos="N" gf="0.0000001142327" ff="0">bunntekst</l>
      </lg>
      <mg>
         <tg xml:lang="sme">
            <t pos="N" usage="vd" gf="0.0000001120293" ff="0">vuolleteaksta</t>
         </tg>
      </mg>
   </e>

For <l> and <t>:

gf = global relative frequency (nowac)
ff = specialised-domain frequency (fad)
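
A minimal sketch of reading these attributes from one entry, using the second sample entry from the notes. The function name `read_frequencies` and the dictionary keys are illustrative choices, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

ENTRY = """<e>
   <lg>
      <l pos="N" gf="0.0000001142327" ff="0">bunntekst</l>
   </lg>
   <mg>
      <tg xml:lang="sme">
         <t pos="N" usage="vd" gf="0.0000001120293" ff="0">vuolleteaksta</t>
      </tg>
   </mg>
</e>"""


def read_frequencies(entry_xml):
    """Return one row per <t>, pairing it with the entry's <l>."""
    entry = ET.fromstring(entry_xml)
    lemma = entry.find("lg/l")
    rows = []
    for t in entry.iter("t"):
        rows.append({
            "lemma": lemma.text,
            "gfL": float(lemma.get("gf")),
            "ffL": float(lemma.get("ff")),
            "translation": t.text,
            "gfT": float(t.get("gf")),
            "ffT": float(t.get("ff")),
        })
    return rows


rows = read_frequencies(ENTRY)
print(rows[0])
```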

We must consider what such numbers mean (t = attested, 0 = not attested):

gfL, ffL, gfT, ffT
tttt ... relative differences here
tt00
t0t0
ttt0
0000
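
The four-letter patterns can be derived mechanically from the four frequencies. A minimal sketch, assuming "attested" simply means a frequency greater than zero; the function name is hypothetical. The two calls use the values from the sample entries above.

```python
def attestation_pattern(gfL, ffL, gfT, ffT):
    """Encode the four frequencies as t (attested) or 0 (not attested)."""
    return "".join("t" if value > 0 else "0" for value in (gfL, ffL, gfT, ffT))


# topptekst / badjeteaksta
print(attestation_pattern(0.0000000623088, 0, 0, 0))
# bunntekst / vuolleteaksta
print(attestation_pattern(0.0000001142327, 0, 0.0000001120293, 0))
```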

Is L more common in fad than in the general corpus?

ffL - gfL = positive ==> specialised term (more common in the domain)
ffL - gfL = 0 ==> general word (equally common), where 0 means anything within ± 0.05
ffL - gfL = negative ==> not a specialised term (less common in the domain)
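
The three cases can be sketched as a small function. The notes do not say on which scale the ± 0.05 band is measured (the frequencies in the sample files are around 1e-7), so the tolerance is left as a parameter here; the function name and the example values are hypothetical.

```python
def classify(ffL, gfL, tolerance=0.05):
    """Classify a lemma by the difference ffL - gfL."""
    d = ffL - gfL
    if abs(d) <= tolerance:
        # Differences inside the tolerance band count as zero.
        return "general word"
    return "specialised term" if d > 0 else "not a specialised term"


print(classify(0.30, 0.10))
print(classify(0.12, 0.10))
print(classify(0.10, 0.30))
```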

Output of the difference:

List the word pairs ordered by d(ffL,gfL) (at the top, the word pair that is "most specialised")
Look at the list and draw a threshold

One answer: with threshold X, Y% of what lies above the threshold are specialised terms
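
A minimal sketch of the ranking and of the "threshold X gives Y%" figure, assuming that d(ffL,gfL) is the plain difference ffL - gfL and that a manual judgement of each pair is available (the `is_term` field). All names and values below are invented for illustration.

```python
def d(pair):
    """Difference between domain frequency and global frequency."""
    return pair["ffL"] - pair["gfL"]


def rank_pairs(pairs):
    """Order pairs so that the most domain-specific one comes first."""
    return sorted(pairs, key=d, reverse=True)


def share_above(pairs, threshold):
    """Percentage of manually confirmed terms among pairs with d > threshold."""
    above = [p for p in pairs if d(p) > threshold]
    if not above:
        return None  # nothing above the threshold, no percentage to report
    return 100 * sum(p["is_term"] for p in above) / len(above)


# Invented data; "is_term" stands for a manual judgement.
pairs = [
    {"pair": "b", "ffL": 0.3, "gfL": 0.1, "is_term": False},
    {"pair": "a", "ffL": 0.5, "gfL": 0.1, "is_term": True},
    {"pair": "c", "ffL": 0.1, "gfL": 0.3, "is_term": False},
]
ranked = rank_pairs(pairs)
print([p["pair"] for p in ranked])
print(share_above(pairs, 0.1))
```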

Can we find:

If, for lemma L, we find that:
d(ffL,gfL) ≠ d(ffT1,gfT1) is positive
d(ffL,gfL) ≠ d(ffT2,gfT2) is zero or negative

then L => T1 is the specialised-term translation and L => T2 is the general translation.
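
One possible reading of this test, as a sketch only: a translation whose own difference ff - gf is positive is taken as the specialised translation, and one whose difference is zero or negative as the general translation. The notes leave the exact comparison open, so this interpretation, the function name, and the values are all assumptions.

```python
def label_translations(translations, tolerance=0.0):
    """Label each translation by the sign of its difference ff - gf.

    `translations` maps a translation to a (ff, gf) tuple.
    """
    labels = {}
    for word, (ff, gf) in translations.items():
        labels[word] = "specialised" if ff - gf > tolerance else "general"
    return labels


# Invented values for two translations of one lemma.
labels = label_translations({"T1": (0.4, 0.1), "T2": (0.1, 0.1)})
print(labels)
```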

Work ahead:

finish the base data for fad (unification) (bm, trond, marja)
frequencies for word pairs from fad-merge (cip)
differences as above (cip)
new meeting, evaluation, presentation (all) <--

Next meeting

Tuesday 20 August at 10:00