Meeting setup

  • Date: 02.05.2005
  • Time: 10.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: Phone, iChat, SubEthaEdit


  1. Opening, agenda review
  2. Reviewing the task list from a week ago
  3. Documentation -
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Term db
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10.10. Agenda accepted as is.

Present: Maaren, Sjur, Thomas, Tomi, Trond, Børre

Main secretary: Trond (Sjur)

2. Reviewing the task list from the last meeting

  • All: report vacation plans
  • Trond, Børre, Sjur and Tomi
    • A session on the corpus format.
      • We had several sessions on the corpus format, and actually came up with a working model, at /usr/local/share/corp/. We made a dtd proposal, a directory structure, in essence a home for our corpora.
  • Børre
    • Contact Kimmo Koskenniemi and Ruth Vatvedt Fjeld about contracts
      • Sent an e-mail, no respons so far
      • Trond: will call around
    • Fix the link problems in the documentation.
      • Only a handful of link problems left
    • Further discussion with web server administrator about the site.
      • Monday/Tuesday next week he will do the nexcessary work
    • Some minor things with termdb.
      • Done some of it, some left
    • Review site suggestion with Trond.
      • Did some, found problems with utf-8 .ihtml files.
  • Tomi
    • continue corpus discussion
      • Did so, and updated the conversion script from docbook to our internal format
      • corpus location and corpus processing
  • Maaren: tries to work with the missing list. Else?
    • Maaren has had courses and seminars at the Sámi parliament the whole week.
  • Thomas: work with verb transitivity
    • Has been working as usual, with verb transitivity (which is much work).
  • Sjur:
    • ... has been working with corpus format and processing.
    • The editing tool of the terminology db is almost completed. The technical problems are solved as well.
    • No work on the text license contract or on, awaiting further progress here.

3. Documentation -

The humfak server will be set up for this week, so that the pentalingual welcome pages should be ready by next week.

Todo: Translate the current welcome page into 5 lgs, and add some text as well.

We will add more information on the plans and the progress of the project. The setup incl. sync with CVS

Official opening by Språkstyret medio June. Todo: Ask Anne Britt to put it on the agenda (21.-22.6. in Tysfjord).

North Sámi as default language. There should be a note explaining that English is the internal language for documentation so that people won't complain about the techdoc not existing in other lgs.

Giellatekno pages will be changed from parallel bilingual text on the same pages into two parallel sets of pages.

4. Corpus gathering

There is a problem that we have got no answer from Oslo and Helsinki. We need the documents soon, Sjur and Trond to talk to relevant people.

Northern Sámi texts: Further issues for corpus gathering: Talk with the Bible translators, the writers association, We have written letters to the Departments in Oslo, the Sámi parliament, the municipalities, etc. Maja Hætta at the Guovdageaidnu suohkan is going to collect text for us. Sámi Allaskuvla has not responded, Sámi Instituhta are awaiting a license suggestion for their taped texts, but we will get the newspaper texts right away.

Lule Sámi texts: We should talk to the Tysfjord branch of the Parliament on gathering of Lule Sámi text. There is some text translated to Lule Sámi within the Sámi parliament system, the challenge is just to find them. There should be Lule Sámi text in Sweden as well, especially the NT, here, Susanna Kuoljok-Angeus should be contacted.

We need a web crawler for Sámi text. TODO: Contact Knut Hofland in Bergen on this issue. The Skolelinux project has set up a web-crawler for bokmål and nynorsk, we could probably also use the same one. The two crawlers should be compared.

5. Corpus infrastructure

We have arrived, or rather are arriving, at the following:

   /orig/<donor>/<year>/file.doc         <- dump, __must__ be write protected
   /int/as-in-orig/file.db.xml           <- docbook format       
   /int/as-in-orig/file.xsl              <- file-specific scripts, under version control
   /gt/publisher OR author/year/file.xml <- xmlpreprocess
    ... | lookup ...

We should put some effort into the donor directory. Donor could be ordered according to person, or according to institution.

We need: The texts We need: A script for going from file.xml to .. -> preprocess eventually modifying preprocess as well (but still making preprocess able to take raw text as input, or we may make an xmlpreprocess along with the (txt)preprocess).

6. Linguistics

Gone through 5200+1500 verbs, more than 13 000 all in all.

Maren tries to work with the missing list (again)


  • General issue: How should we prioritise between work on names, n-v-a cluster, closed classes
  • Detailed view:
    • Lexical coverage
    • Sámi place names (later, when Hønefoss have been transposing more names)
    • Names:
      • The LONDON/BERN (-is/-as) issue.
      • Distinguishing different name types? (Person, place, ...).
      • Group parallel names (Guovdageaidnu / Kautokeino(s) / Koutokeino = Koutokeinossa, Maaren / Maren, Máret / Marit)
    • 3-part compounds: sáme giel oahpahus. Needed: Linguistic description + lexc programming skills.
    • east-west differences
    • indefinite pronouns (closed-sme-lex.txt file)
    • Compounding

A priority policy is to first look at the closed classes. For the other classes So far, we have done: inflectional morphology of the n-v-a cluster. Derivational morphology and compounding not been systematically checked.

Linguistic priority list:

  1. finish the verb transitivity
  2. closed POS
  3. compounds
  4. derivation
  5. completing the lexicon
  6. names

7. Term db

Issues left:

  • i18n
  • l10n
  • styling
  • graphics
  • translations
  • more...

New deadline for internal opening: May 13th New server: June 1st Official opening: June 17th

8. Other issues

University project:

Will need 1-2 new linguists, for at least one and a half year.

Friday May 6:

  • Sjur: taking day off
  • Maren: me too

9. Summary, task list


  • Thomas, Maaren, Børre: Translate divvun front page
  • Trond, Tomi, Børre, Sjur: Continue with corpus infrastructure
  • Trond:
    • Will find out who can help us with corpus agreement contracts in Oslo when Ruth Vadtvedt Fjeld is away.
    • Work with Børre on the giellatekno
  • Sjur:
    • Call Kimmo about license text & e-mail from Børre
    • More termdb work
    • Contact Anne Britt about and the Språkstyremøte.
  • Børre:
    • Ask Kåre Tjikkom, Harriet Aira, Karin Tuolja, Susanna Kuoljok-Angeus and Samuel Gælok about Lule Sámi text.
    • Contact Leif Åge adding Trond to divvun e-mail alias.
    • Prepare the war file for Tomcat deployment next week
    • Work with Trond on the giellatekno pages
    • Finish the work on the Termdb
    • Contact Skolelinux about their webcrawler, also Knut Hofland (ask Trond/Sjur)
  • Tomi:
    • Decide the directory structure for corpus originals
    • Script for antiword and UTF-8 fix -processing
    • Template for xsl manual conversion script
    • Script for processing the corpus for preprocessor
  • Thomas: continue with verbs and ask Teemu Leskinen for finnish parallel names.
  • Maren:
    • Have a look at the closed classes, especially the indefinite pronouns.
    • Work more on lexical issues.

10. Next meeting, closing

09.05.2005 10.00

Closed at 12.23.