Meeting_2005-09-05

Meeting setup

  • Date: 05.09.2005
  • Time: 10.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Summing up last week (Meeting + conference)
  4. Documentation - divvun.no
  5. Corpus gathering
  6. Corpus infrastructure
  7. Linguistics
  8. Speller infrastructure
  9. Other issues
  10. Summary, task lists
  11. Closing

1. Opening, agenda review, participants

Opened at 10: 12.

Present: Børre, Saara (half an hour), Sjur, Thomas, Tomi, Trond

Absent: Maaren

Main secretary: Børre

Agenda accepted with new point 3.

2. Reviewing the task list from the last meeting

Børre

  • Add crontab specification for the cvs update/export script Tomi made
    • Adjust it so it will work with the new server
      • Working on it, almost done
  • reopen the jspwiki + UTF-8 issue
    • Not done
  • Add issue to forrest issue tracker about utf-8 ihtml documents.
    • Not done
  • Contact Svenska bibelsällskapet
    • Not done
  • discuss with Anders Kintel about possible cooperation
    • Not done, been in Helsinki
  • Continue writing a contract draft, together with Trond and Sjur. Coordinate with Kimmo Koskenniemi
    • Got new documents
  • Follow up on CVS mailing:
    • check that Thomas and Maaren receive forwarded mail from cochice
    • set up Thomas and Maaren
      • Thomas is set up, not sure about Maaren
  • Investigate compatibility between MySpell and Aspell source files
    • Debian has common source files for these.
  • Finish giellatekno forrest transition.
    • Done, deployed.
  • Other tasks:
    • Has looked at sfst, begun reading "THE book" (about the Xerox tools)

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology
    • Done?
  • Go through the missing list from risten.no
    • Partially done
  • remember headphones, 2*money and batteries for Helsinki
    • Done

Sjur

  • risten.no bugs and fixes
    • Nothing during the last two weeks
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working to Sámediggi
      • This won't work without a new or upgraded Firewall
    • Maaren has problems with SubEthaEdit (can't connect)
      • Maaren's computer updated and fixed, the problem might be solved, but I haven't had a chance to test it.
  • To the board:
    • write proposal for permanent maintenance organisation
    • write draft specification for the outsourced tasks
    • write half-yearly project report with progress and bugdet status
    • write agenda
    • Deadline for the board tasks: 3 weeks ahead of meeting (meeting is October 4th)
  • add Divvun to the UNESCO Hall-of-Fame list
    • Not done
  • Plan the Helsinki meeting
    • Done
  • meeting about corpus contract
    • Done
  • project planning with Trond
    • Not done

Thomas

  • work on Lule Sami compounding and derivation, check closed POSes
    • checked closed POSes, started with derivation

Tomi

  • Aspell: Continue working on the affix file
    • Work has been done there
  • investigate Aspell/MySpell/OOo issues
    • MySpell uses (almost) the same kind of wordlist & affix files as aspell
    • OOo is turning over to hunspell, which has a much more powerfull and linguistically adequate formalism,
  • three-part compounding
    • Not done
  • decapitalisation of proper nouns when (if?) compounded, and when derived (the oslolaš issue)
    • This is working, but should be added to makefile and CVS
  • corpus infrastructure: dtd location (both public and internal)
    • Discussed in Helsinki
  • corpus infrastructure: file and dir organisation
    • Discussed in Helsinki

Trond

  • Work on the bug list (Lule Sámi).
    • No work on Lule Sámi last week)
  • Work on compounds (three-part + oslolaš, with Tomi)
    • The oslolaš issue was solved in the Helsinki conference
    • Three-part compounds still open
  • Work on the corpus interface (with Lars)
    • Not done.
  • Corpus infrastructure: dtd location
    • Made a principled decision in Helsinki, not implemented.
  • Get the new version of the New Testament
    • Got an html version in Helsinki from a collegue(!), still haven't got hold of Helge Almås in Bibelselskapet.
  • Check Hans-Ragnars names.
    • Not done.
  • Look at the Sámi names issue
    • Not done. But the CD is on my desk, so we will look into it here in Tromsø.
  • Plan the Helsinki meeting
    • Done
  • New coworker
    • Formalities start now.
  • meeting about corpus contract
    • We had a meeting, the work continues.
  • check the new giellatekno site
    • Some checking done.
  • project planning with Sjur
    • Not done.

3. Summing up last week (Meeting + conference)

  • Meetings: Our meetings went well.
  • Demo: We should have had a prepared text with collected authenthical spelling errors, and learned how to use KWord... Otherwise our demo performance was as good as we had capacity to.
  • The conference was too advanced from time to time (especially the mathematical parts), but the twol day (and some of the lectures) was ok.
  • We met the community and discussed, that was very positive. E.g. discussion with the person behind sfst, the Xerox persons, etc.

4. Documentation

The giellatekno.uit.no site is deployed. Børre has read and fixed some of the technical documents.

The standing invitation remains: Document what you do, when you do it.

Especially document Aspell. The Makefile, how to build, what changes has been added to the lexicon, and so on.

5. Corpus gathering

Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good as it is. There are some points still to be discussed. We'll continue this week.

Paths forward: We have a contract suggestion. Sjur and Børre should start the negotiation with the authors and/or publishers to get text. We should make a priority list for authors, and we should invite ourselves to their meeting to discuss. We should carry on the discussions both with Iđut and with Davvi Girji.

How to proceed:

  1. Get the contract suggestion ready
    1. Translated part 1 ok, part 2 and 3 missing. Done this week
    2. Get part 4 from Kimmo, and translate it
    3. Contact our lawyers, at SD and UIT (today, tomorrow).
    4. When the Norwegian version of the contracts are ready, make sure they're corresponding to the Finnish ones in their legal interpretation (as far as possible); then publish both versions for others to reuse, with background documentation (in cooperation with Kimmo Koskenniemi)
  2. Approach the text owners (see ordered list below)

Independent of the contract work

  1. Bible: The new testament (Trond)
  2. Bureaucratic text:
    1. Sámi Parliament (Børre)
    2. Sámi Oahpahusráđđi (Børre)
    3. KRD (Børre, check whether we miss texts (discuss with Trond))
    4. the Sámi municipalities (Børre)
  3. Textbooks
    1. To the extent that text can be got directly from SO.

After the contracts are ready

Sjur and Børre should probably take a Tour-de-Sápmi, and meet with the most important persons and institutions. Børre as the responsible for collecting, Sjur as responsible for the project, and representative of SD

The tour should be planned, not in this meeting, but before the contracts are ready, i.e., it should be planned next week.

  1. Commercially published texts
    1. Author organisations' meetings
    2. Key authors one by one
      1. (list of author names) Kerttu Vuolab, Kirsi Paltto,
    3. Iđut and key authors there (Børre)
    4. Davvi Girji and key authors there
  2. Newspaper text:
    1. Sámi Instituhtta's (for the old archive of Min Áigi and Áššu)
    2. Áššu has been making a CD since the end of may, there should be a pile there. Børre suggests that they send us the CDs they have, so that we may look at them, and ensure that the routines work, and that we are able to utilize their format.
    3. Min Áigi

List of texts with lower priority (to be gathered when the above list is more or less fixed)

  • the Sámi municipalities,
  • Authors with smaller production
  • Textbooks

6. Corpus infrastructure

Do documentation.

Naming conventions and directory structure

We have a decision from Helsinki:

  • have the same directory structure in all three levels, and we also decided upon the overall structure.
  • Path forward: Tomi and Trond to implement the directory structure

7. Linguistics

North Sámi

  • New place names received, should be added to our lexicons
  • three-part compounds issue still open
  • Johnny Andersen has written a letter to us on the treatment of Sámi place names, we need a contract with "Norge digitalt", via UFD. We will have to get such an agreement. This is a task for Thomas and Trond.

Lule Sámi

We do not know when we get the lexicon from Anders Kintel. We need a meeting with him and with Árran in order to coordinate the work.

Status quo on our parser:

  • All the major POSes have been covered
  • closed POSes have been checked
  • compounding and derivation
    • Thomas has begun with the deverbals

8. Speller infrastructure

aSpell

Write documentation here as well.

Munch-list is working, and the affix file is improving. See previous meeting memo.

Issues:

  1. The phonetic file should be systematically looked into.
    1. Check that it works
    2. Add more correspondences on an impressionistic basis
  2. Start work on collecting systematic spelling errors:
    1. Our in-house file typos.txt
    2. The soon-to-arrive error texts from newspapers
  3. The holes in the affix list should be mended
  4. We should, at some point, evaluate whether this is The Correct Approach to aspell-type speller building
  5. Affix file UTF-8 problem should be checked and reported.
  6. Then there is the UTF-8 root (or whatever) problem: The work-around using Latin 4 is no solution because of the following latin 4 bug: It treats đ as ð, and thus gives error on "ođđa" (wants "oðða", this is since in Latin 4 đ and ð are unified (Latin 4). Latin 6 (= ISO/IEC 8859-10) or ISO-IR 197 (the old Dihtorlávdegoddi code page) could be a fix to this.
  7. The clitics issue: Today we have a manually created affix file in order to meet the muncher. Strategy: Do the munching without the clitics, but then enrich the manually-created affix file via a genfisuffix -program in order to get an automatically created suffix + clitic file to add the compiled lexicon to. We will have genfisuff taking affix + clitic and makes it into affix'. Then we use affix to munch but affix' to spellcheck.
  8. We must create subcomponents under the Speller

MS Office spellers

Nothing new, see previous meeting memo.

OpenOffice.org

From last meeting:

The conversion from aspell to myspell will work trivially as soon as the myspell list becomes smaller.

Issue left open.

Hunspell

Hunspell is presently already working with OOo, and is a much better speller engine, linguistically speaking (can handle compounds much better than Aspell, as well as complex inflection and derivation). For pointers, see the previous meeting memo.

Issue left open.

Other engines

Børre and Sjur had a long discussion with the author of the SFST library/tool set. Next on his priority list is a feature for handling spelling errors in running text analysis. This is principally the same as we want, thus there should be a good opportunity for making SFST into the spelling engine. Sjur has a suggestion on how to implement this feature that may be mailed to the author of SFST.

9. Other

Technical issues

  • Fixing the machine for the new coworker
  • The mac os / perl bug (at least Trond and Sjur has it):
    • utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
    • Sjur has a non-solved Backspace + UTF-8 issue
  • 29 open bugs - too much! Have a look at what you can fix.

10. Summary, task list

Børre

  • Finish crontab specification for the cvs update/export script Tomi made
  • reopen the jspwiki + UTF-8 issue
  • Add issue to forrest issue tracker about utf-8 ihtml documents.
  • Contact Svenska bibelsällskapet
  • discuss with Anders Kintel about possible cooperation
  • Follow up on CVS mailing:
    • set up Maaren
  • Meet up with Trond about directory structure
  • Contact oahpahusossodat and the rest of the SD about texts
  • Fixing the machine for the new coworker

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology
  • Go through the missing list from risten.no
  • Start working on grammatical issues with Thomas and Trond.

Saara

  • Get aquainted with the project status quo
  • Look at the corpus infrastructure issue
  • Look at the corpus interface issue with Lars

Sjur

  • risten.no bugs and fixes
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working to Sámediggi
    • Maaren has problems with SubEthaEdit (can't connect)
  • To the board:
    • write proposal for permanent maintenance organisation
    • write draft specification for the outsourced tasks
    • write half-yearly project report with progress and bugdet status
    • write agenda
    • Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is October 4th, deadline for the documents is September 13th)
  • project planning with Trond

Thomas

  • work on Lule Sami compounding and derivation
  • Look at Linguistic bugs with Trond.
  • Work on the name agreement with "Norge digitalt" with Trond

Tomi

  • Aspell: Continue working on the affix file
  • three-part compounding
  • Add downcasing to makefile and CVS
  • corpus infrastructure: dtd location (both public and internal)
  • corpus infrastructure: file and dir organisation

Trond

  • Work on the bug list (Lule Sámi).
  • Work on compounds (three-part, with Tomi)
  • Work on the corpus interface (with Lars)
  • Corpus infrastructure: dtd location
  • Work on the name agreement with "Norge digitalt" with Thomas
  • Look at the linguistic aspects of the speller clitics, with
  • Get the new version of the New Testament
  • Check Hans-Ragnars names.
  • New coworker
  • translate contract
  • check the new giellatekno site
  • project planning with Sjur

10. Next meeting, closing

12.09.2005 10: 00

Closed at 12: 56