Greenland2017

Contents:

Presentation Of All Tools
Introduction To The Infrastructure
Unix Crash Course
Dependencies On External Tools
In-Source Documentation
Tag conventions in the Giella infrastructure
Morpheme boundaries
Clean code
Testing
Debugging
Developer Tools
Refactoring the code
Excercises and practical work

This document contains an overview of the topics and training given in Nuuk in June, 2017. It contains links to the relevant parts of the existing documentation. The idea is that this page can be used as an overview document also in other cases.

Basic organisation: presentation of topic, followed by excercises. Roughly one topic + excercise before lunch and one topic after.

Topics (with a roughly schedule with time left at the end - to be adjusted as needed):

Day 1:
- presentation of all tools
- introduction to the infrastructure
- unix crash course
Day 2:
- dependencies on external tools
- in-source documentation
- tag conventions in the Giella infrastructure
- morpheme boundaries
Day 3:
- testing
- debugging
- developer tools
Day 4:
- refactoring the code

Presentation Of All Tools

analysers & generators
Korp
- infra component: Korpus processing pipeline (converson & analysis)
dictionaries:
- NDS
- satni.org
- dictionary content:
  - $GTHOME/words/dicts/
  - Termwikien
proofing tools
- spellers here and here and here
- hyphenators
- grammar checker
MT
keyboards:
- mobile phone keyboards
- computer system keyboards
- the keyboards are built automatically from very simple text files
speech synthesis
language learning:

Introduction To The Infrastructure

Overall goals

language independent, while still adaptable to the needs of each language
separation of concerns - build structure vs linguistic work
scalable:
- add as many languages as you want
- add as many tools as you want
- with a new language, you get whatever all the other languages have
- with a new tool, all languages get a basic version of the tool
predictable - same thing is called the same in different languages
understandable - names should be understandable as is
modular
technology neutral (but rule-based by default - the only thing that works with the languages we work with)
- it supports both Xerox, Hfst and Foma
- it supports Apertium, but could also support other (rule-based) MT systems
should give strong support to reuse of resources
- shared lexical data
- one fst starting point for every tool

Means for achieving the goals

shared directory structure

Unix Crash Course

See Unix for Linguists

Dependencies On External Tools

Language technology tools

Hfst or Foma or Xerox
VislCG3

Optional:

Apertium (MT)

Infrastructure support tools

Autotools
Forrest (documentation verification and publishing)
Python
Subversion
a number of other tools listed on a separate page

External dependencies for final products

This varies from product to product. In most cases (like spellers, keyboards and more) all these dependencies are precompiled by programmers and added as library resources to be included automatically so that the linguists do not have to have everything installed and correctly configured for building the end user products (s)he wants.

The exact details for each product is listed separately in each case, on individual pages.

In-Source Documentation

See a separate document.

Tips for in-surce documentation:

open two terminal widows
run forrest in one
run make in the other

To run forrest:

forrest run -Dforrest.jvmargs="-Dfile.encoding=utf-8 -Djava.awt.headless=true"

To debug, edit the generated jspwiki file till the error is found, then correct it in lexc.

Tag conventions in the Giella infrastructure

Language Independent Tags In The Giella Infra

Morpheme boundaries

Morfeme border markup

Clean code

Guidelines for clean code:

all forms should have an analysis
all analyses should be consistent - same tags for the same feaure, in the same order
consistent use of whitespace for increased readability

Testing

See this document

Debugging

Issues in KAL

Developer Tools

Short intro to the tools in the devtools/ directory:

check_analysis_regressions.sh
generate-*-wordforms.sh
test_ospell-office_suggestions.sh

Refactoring the code

When the Yaml files are covering the relevant parts of lexc, one can rewrite the lexc code and at the same time ensure that the analysis doesn't change. One can also use one of the devtools/ tools to help in that process.

This can also be seen the other way: when you know what area of the grammar you want to change, you write yaml tests to specify the intended output, and work on the grammar until you get there.

Excercises and practical work

correct bugs identified
search for new bugs
change tags
run tests

Presentations