August 2006 gathering

  • Place: Tromsø
  • Dates: 22-24?


Børre, Maaren, Sjur, Thomas, Tomi, Lene, Trond, Saara


  • Presentation, status quo and plans for divvun and disamb
  • Linguistic issues
    • Normativity
    • Lexical coverage
    • 3-part compounds
  • Lang-tech issues:
    • finalise Xerox hyphenation (Trond + Sjur 2-3h)
    • G3 issue for sme
  • Technical issues
    • The munching deadlock
    • M4 macros for hyph/spell/dis versions of TWOL
    • our own wiki? for (technical) CG documentation,
  • Speller plans
  • Polderland work
    • Online meeting with them? Yes. Memo
  • Proper noun status and action plan
    • action plan:
    • technical evaluation: how do we make proper noun editing work in Emacs?
    • how much more

Commented topics

  • linguistic issues
    • General linguistic discussion (morphology (and syntax?)):
      • What is on top of the priority list?
      • What are our most serious problems?
      • What can we learn from each other?
      • How should we work together across the projects in the remaining months?
    • Lexical coverage
    • 3-part compounds
  • lang-tech issues:
    • finalise Xerox hyphenation (Trond + Sjur 2-3 hours)
    • G3 issue for sme
  • technical issues
    • The munching deadlock
    • M4 macros for hyph/spell/dis versions of TWOL
    • our own wiki? For (technical) CG documentation, (open source) fst technology and Saami language technology (basically the parts of our current documentation that aren't project-specific). Goal: to invite and engage a larger community (CG and fst/transducer users, i.e., the grammatically-based bottom-up parsing community) to improve and create documentation for the above topics. Potential partners are Helsinki, Oslo, Odense, Stuttgart.
  • speller plans - infra to generate word form lists for Polderland/Aspell/HunSpell type spellers. We need to settle on plans for this infrastructure now.
  • proper noun status and action plan
    • action plan: we need to systematically go through the most typical/useful tasks (problem now: Sjur and Tomi do not edit lexicons enough to really know what is most important)
    • technical evaluation: how do we make proper noun editing work in Emacs, and who will do it? Are the coding (as in XML tags/structure) solutions good? (speed is not an issue, at least not on my new Mac :-), and shouldn't be on the server either)

Worst-case scenario: the name project fails, takes too long, etc., and we will have to build separate name lexica in traditional lexc format for each language. In the best case we will get it up and running in a reasonable amount of time.

  • corpus collection:
    • how much more (in terms of effort and invested time)? Børre needs to start looking at testing pretty soon - we should have tools for people to test this fall, and we need a working feedback infra to make sure we get the response we need
  • divvun plans ahead

Time table

Time        | Tuesday                  | Wednesday               | Thursday
8:30-10:00  | A Presentation (9:00->)  | T lexc2Xspell           | Machine update + planning
            |                          | L Consequences of eval  | Aligner (Trond, Saara)
10:00-10:30 | A Reports, plans         | Coffee                  | A Coffee
10:30-12:00 | a Polderland             | T lexc2Xspell           | -
            | a Polderland             | L Consequences of eval  | -
12:00-13:00 | A Lunch                  | A Lunch                 | A Lunch
13:00-14:00 | A G3/Howto/m4/Wiki       | A Name lexicon 1        | (exit Saara, Maaren)
            | 13:45: Coffee            | (what shall we store)   | -
14:00-14:30 | T Video with PL (1h)     | A Coffee                | A Coffee
            | L Evaluate               |                         | -
14:30-16:00 | A *G3/Howto/m4/Wiki      | T Name lexicon 2 (how)  | -
            | -                        | L (3part)               | -
... | preprocess --abbr=bin/abbr.txt | | ..
... | preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup ...
A = all
a = all - Lene
T = Saara, Sjur, Trond, Tomi, Børre
L = Lene, Maaren, Thomas

Tuesday afternoon 1

A Presentation with Polderland
T Video with PL (Saara, Sjur, Trond, Tomi, Børre, Maaren, Thomas)
L Evaluate our linguistic analysers (Lene, Maaren, Thomas)
  How good are the tools? (M, T explaining L what input she can expect from M, T)
  What does it take to make them better?
  Do we need tools for measurement?
  Or: Office as usual, working

Tuesday afternoon 2

  • G3: Trond, Sjur, Thomas
  • Howto: Maaren explains practical linguistic work to Lene <= M&L find out how to cooperate...
  • m4: Saara, Tomi
  • Wiki: Børre (todo: write a spec)

Thursday morning

  • Plenary machine update (all(?)) <==
  • Aligner (Trond, Saara)

Machine updates:

  • Forrest issues / installations / updates
  • readline / bash

Thursday evening

  • G3 (Sjur, Trond, Thomas) <==
  • Hyphenation (Sjur, Trond) <==
  • Bible format discussion
  • Corpus health care
  • Saami chars in pdf
    • Solved! Almost :-) - path needs to be generalised, then checked in to CVS.
  • i18n finalisation: language selection menu (using dispatcher)

Rule for the id field in common.xml:
default: you are your own id
overriding: your id is different from yourself, and points to another lemma

 Kautokeino .. nob, nno, eng ... id=Guovdageaidnu
 Guovdageaidnu ... sme, sma, smj ... id=Guovdageaidnu

sme.xml (id info is inherited, perhaps)
sme.lexc (pure lexc file, generated)
 Kautokeino contlex-i ;
 Guovdageaidnu contlex-j ;
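
The id rule above (default: an entry is its own id; override: the id points to another lemma) can be sketched as follows. This is only an illustration under assumptions: the `NameEntry` class, its field names and the `group_by_id` helper are hypothetical and do not reflect the actual structure of common.xml.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NameEntry:
    """Hypothetical in-memory view of one name entry from common.xml."""
    lemma: str
    languages: List[str]
    id: Optional[str] = None  # None means "you are your own id"


def resolve_id(entry: NameEntry) -> str:
    """Return the explicit id if one is given, otherwise the lemma itself."""
    return entry.id if entry.id is not None else entry.lemma


def group_by_id(entries: List[NameEntry]) -> Dict[str, List[str]]:
    """Collect all lemmas that share the same resolved id."""
    groups: Dict[str, List[str]] = {}
    for entry in entries:
        groups.setdefault(resolve_id(entry), []).append(entry.lemma)
    return groups


# The two entries from the notes: Kautokeino overrides its id,
# Guovdageaidnu falls back to the default.
entries = [
    NameEntry("Kautokeino", ["nob", "nno", "eng"], id="Guovdageaidnu"),
    NameEntry("Guovdageaidnu", ["sme", "sma", "smj"]),
]
groups = group_by_id(entries)
print(groups)
```

With this grouping, both name forms end up under the single id Guovdageaidnu, which is the link the per-language files would inherit.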

M4 flags:


  • exclude or include the hyphenation marks when making a hyphenator transducer ("#", "-" and "^")


  • Permitting truncated two- and three-part compounds or not


  • Exclude/include twol ruleset to allow diphthong simplification.
norm: oahpaheddjiid 
also: oahpaheaddjiid and oahpaheddjiid
==> include/exclude the G3-sensitivity of the diphthong simplification rules

non-m4-variation (variation in the lexicon):

  • Circular transducer or not? Today:
grep -v '^[CN]^'
  • Including SUB entries or not?
  • EAST, WEST, ALLDIALECTS Differentiating between eastern and western dialectal forms, tolerating both
  • SOUTH Tolerating clearly non-standard forms (like Locative Sg -n)
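
As a rough illustration of the `grep -v '^[CN]^'` filter listed under "Circular transducer or not?", here is a minimal Python sketch. It assumes that the second `^` is a literal caret (as in POSIX basic regular expressions, where `^` only anchors at the start of the pattern), so the filter drops every line that begins with `C^` or `N^`. The sample lines are made up for the example and are not taken from the actual lexicon files.

```python
import re

# Matches lines that start with "C^" or "N^"; the caret after the
# bracket expression is escaped because it is meant literally.
MARKED_LINE = re.compile(r"^[CN]\^")


def drop_marked_lines(lines):
    """Keep only lines that do not start with C^ or N^ (like grep -v)."""
    return [line for line in lines if not MARKED_LINE.search(line)]


# Hypothetical input lines, only to show which ones are removed.
sample = [
    "C^foo LEX ;",
    "N^bar LEX ;",
    "baz LEX ;",
    "xC^qux LEX ;",
]
kept = drop_marked_lines(sample)
print(kept)
```

Only lines where the marker is at the very start are removed; a `C^` later in the line is left alone.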