Greenland2017
Contents:
This document contains an overview of the topics and training given in Nuuk in
Basic organisation: presentation of topic, followed by exercises. Roughly one
Topics (with a rough schedule with time left at the end - to be adjusted as
- Day 1:
- presentation of all tools
- introduction to the infrastructure
- unix crash course
- Day 2:
- dependencies on external tools
- in-source documentation
- tag conventions in the Giella infrastructure
- morpheme boundaries
- Day 3:
- testing
- debugging
- developer tools
- Day 4:
- refactoring the code
Presentation Of All Tools
- analysers & generators
- Korp
- infra component: corpus processing pipeline (conversion & analysis)
- dictionaries:
- proofing tools
- spellers here and
- hyphenators
- grammar checker
- MT
- keyboards:
- mobile phone keyboards
- computer system keyboards
- the keyboards are built automatically from very simple text files
- speech synthesis
- language learning:
Introduction To The Infrastructure
Overall goals
- language independent, while still adaptable to the needs of each language
- separation of concerns - build structure vs linguistic work
- scalable:
- add as many languages as you want
- add as many tools as you want
- with a new language, you get whatever all the other languages have
- with a new tool, all languages get a basic version of the tool
- predictable - the same thing has the same name across languages
- understandable - names should be understandable as is
- modular
- technology neutral (but rule-based by default - the only thing that works with
- it supports Xerox, Hfst and Foma
- it supports Apertium, but could also support other (rule-based) MT systems
- should give strong support to reuse of resources
- shared lexical data
- one fst starting point for every tool
Means for achieving the goals
Unix Crash Course
Dependencies On External Tools
Language technology tools
- Hfst or Foma or Xerox
- VislCG3
Optional:
- Apertium (MT)
Infrastructure support tools
- Autotools
- Forrest (documentation verification and publishing)
- Python
- Subversion
- a number of other tools
External dependencies for final products
This varies from product to product. In most cases (like spellers, keyboards and
The exact details for each product are listed separately in each case, on
In-Source Documentation
See a separate document.
Tips for in-source documentation:
- open two terminal windows
- run forrest in one
- run make in the other
To run forrest:
forrest run -Dforrest.jvmargs="-Dfile.encoding=utf-8 -Djava.awt.headless=true"
To debug, edit the generated jspwiki file till the error is found, then correct
Tag conventions in the Giella infrastructure
Morpheme boundaries
Clean code
Guidelines for clean code:
- all forms should have an analysis
- all analyses should be consistent - same tags for the same feature, in the same order
- consistent use of whitespace for increased readability
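The first two guidelines can be checked mechanically on analyser output. The following is a minimal sketch, not part of the infrastructure: it assumes tab-separated lookup output of the kind hfst-lookup prints (word form, analysis, weight), where a form without analysis gets "+?". The function names are hypothetical, and so are any tags used to try it out.

```shell
#!/bin/sh
# Sketch only. Assumed input: one analysis per line, tab-separated:
#   wordform<TAB>lemma+Tag+Tag<TAB>weight
# A form without analysis is assumed to be marked with "+?".

# Print every input form that did not get an analysis.
unanalysed_forms() {
    awk -F '\t' '$2 ~ /\+\?$/ || $3 == "+?" { print $1 }'
}

# Print each distinct tag sequence once, so that the same features
# written in different orders end up next to each other.
tag_sequences() {
    awk -F '\t' 'NF >= 2 && $2 !~ /\+\?$/ {
        i = index($2, "+")             # position of the first tag
        if (i > 0) print substr($2, i) # keep the tags, drop the lemma
    }' | LC_ALL=C sort -u
}
```

A possible use is to pipe the output of the lookup tool into either function, for instance hfst-lookup with the analyser file of your language reading a word list. If tag_sequences prints both +N+Abs+Sg and +N+Sg+Abs, the tag order is inconsistent somewhere in the lexc source.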
Testing
See this document
Debugging
Developer Tools
Short intro to the tools in the devtools/ directory:
- check_analysis_regressions.sh
- generate-*-wordforms.sh
- test_ospell-office_suggestions.sh
Refactoring the code
When the YAML files cover the relevant parts of lexc, one can rewrite the
This can also be seen the other way: when you know what area of the grammar you
Exercises and practical work
- correct bugs identified
- search for new bugs
- change tags
- run tests