Meeting_2005-04-18

Meeting setup

  • Date: 18.04.2005
  • Time: 11.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: Phone, iChat, SubEthaEdit

Agenda

  1. Opening, agenda review, participants
  2. Reviewing the task list from a week ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Term db
  8. Other issues
    1. Mouse problems
    2. Job positions
    3. Upgrade to 10.4
    4. Physical meeting?
    5. Vacation
  9. Summary, task lists
  10. Closing

Task list since last meeting:

  • Tomi: Move the corpus discussion from newsgroup to forrest
    • has been doing XSLT for docbook to our xml format (docbook2sme.xsl)
    • awaiting final XML format
  • Thomas: continue work with verb transitivity. Contact Olavi about his dictionary.
  • Maaren: Will work with this project, starting tomorrow.
    • I`ll take contact with Trond and Thomas (telephone meeting Tuesday 12.4.)
  • All: discuss our own corpus format in the newsgroup - to be continued.
    • deadlline: 25.4.
  • Tomi and Børre: identify the input encodings antiword can handle
    • Tomi makes a perl-script, Børre helps out on that one.
  • Børre: divvun.no:
    • to set up the documentation and integration with CVS This must be integrated with the systems at the UIT, and Børre will contact the IT persons at UIT
    • Coordinate with Sjur as well
    • Still missing a project e-mail address. Børre contacts Leif Åge.
  • Sjur, Børre: terminology database
    • Still working with it...
  • Børre: should make the How-To tab appear as intended:
    • still to be done
  • Børre: wiki - will follow up on the UTF-8 problem
    • temporary workaround: use 8-bit encoding for the memos (=MacRoman works correctly)
    • convert all memos to MacRoman
  • Trond, Børre, all: fix the links
    • Børre will check if there are still more link problems
    • Lot's of links to .txt files are broken. Trond and Børre takes a look at that issue.
  • Børre: contact Min Áigi, discussing possible practical arrangements for cooperation.

1. Opening, agenda review, participants

Opened at 11.15 (more than 1 hour late due to a miss by Sjur). Agenda accepted as is.

Present: Maaren, Sjur, Thomas, Tomi, Trond, Børre

Main secretary: Trond

2. Reviewing the task list from the last meeting

  • Tomi: Move the corpus discussion from newsgroup to forrest
    • awaiting final XML format
    • cont. disc in newsgroup: Tomi and Sjur has discussed in the thread "Corpus format", but the discussion is still inconclusive, awaiting further feedback.
  • Thomas: continue work with verb transitivity. Contact Olavi about his dictionary.
    • has worked on the verb transitivity all week, and contacted Olavi about Lule Sámi. Thomas explained that the Lule dictionary does not need to have complete translations in order to be useful for us. Olavi informed that the Lule-Swedish web dictionary will be done in the autumn.
  • Maaren: Will work with this project, starting tomorrow. Managed to work every day(!), on the missing list.
    • I`ll take contact with Trond and Thomas (telephone meeting Tuesday 12.4.) The meeting was held.
  • All: discuss our own corpus format in the newsgroup - to be continued.
    • deadlline: 25.4. So far only Tomi and Sjur have discussed, this must get high priority this week. Marit should also have her say.
  • Tomi and Børre: identify the input encodings antiword can handle
    • Tomi makes a perl-script, Børre helps out on that one.
    • Done
  • Børre: divvun.no:
    • to set up the documentation and integration with CVS This must be integrated with the systems at the UIT, and Børre will contact the IT persons at UIT
    • Contacted Thor Øyvind Johansen. Told about forrest, cochise. Discussed the possibilities (cvs, forrest, ssh). He will answer later this week.
    • Coordinate with Sjur as well
    • Still missing a project e-mail address. Børre contacts Leif Åge. Fixed. Sent to us all. Add Trond to the list.
  • Sjur, Børre: terminology database
    • Still working with it... The base has been further developed, problems are better identified than before, so there is progress.
  • Børre: should make the How-To tab appear as intended:
    • still to be done.
    • It seems to be a fundamental problem for forrest, we have to return to this.
  • Børre: wiki - will follow up on the UTF-8 problem
    • temporary workaround: use 8-bit encoding for the memos (=MacRoman works correctly)
    • convert all memos to MacRoman.
    • Sjur did this.
  • Trond, Børre, all: fix the links
    • Børre will check if there are still more link problems
    • Lot's of links to .txt files are broken. Trond and Børre takes a look at that issue.
    • Nothing has been done on this issue.
  • Børre: contact Min Áigi, discussing possible practical arrangements for cooperation.
    • We have received a very positive letter from Min Áigi. But, to receive texts we need a contract. I contacted Anne-Britt today, who will look for a "template" we can use.
    • Metsähallitus (Finland) is waiting for a contract as well.
    • Thomas has forwarded a contract from Lantmäteriverket (Maanmittauslaitos) in Finland.
    • Trond has sent Thomas the contract between Statens Kartverk and UIT:
    • Thomas has forwarded the contract to Maanmittauslaitos and they have written a licence for us. When we have signed it we will receive the sami names in Suomi. Trond have to sign it.

3. Documentation - divvun.no

We discussed to set up forrest as a standalone on the web server, or to make forrest convert the xml. As long as there is not much load on divvun.no, it is perhaps easiest to just put of forrest. Another possibility is to type "forrest war", which gives .. that can be saved as ... If we install a war-file, we can get away with only updating testfiles. If you, on the other hand, install forrest, you must redo the whole installation on the server when you update. Using war is an easier solution, since we can isolate different parts of the whole site. We should run the default Tomcat, and a forrest file manualy. Textual updates should be done automatically, as a cronjob. The issue should be discussed more in detail.

The task is to set up the server and to integrate it with the cvs repository.

Deadline for when we want to have the site up and running is the end of April.

4. Corpus gathering

Contract prototype! We have two models: Textlaboratoriet in Oslo, and the Helsinki model, presented by Kimmo Koskenniemi, and run by CSC in Finland, as a 3-part solution: The author, the licenser (publisher) and the maintainer (in that case: CSC). The Oslo model is simpler in the outset, but perhaps harder to maintain. The Helsinki model is more complicated in the outset, but perhaps(?) easier to maintain, since it is clear in a more principled way. We should have the UIT and Sámediggi lawyers look at this as well.

Børre: Contact Kimmo K &% Ruth V F

Deadline: The middle of May(?).

We should contact the Writers' organisations, and suggest for them to recommend to their members that they in turn use the contract we have suggested to them for use of their texts.

5. Corpus infrastructure

Location of charset conv perl script

Tomi has made a perl script to convert problematic charsets to utf8. Where do we store it? Presently, we have cvs files in gt/, and we plan corp files outside the cvs. The question is then where to have cvs-included corpus files, in gt/ with the other cvs files, or in the corpus catalogue with the other corpus files. The discussion will go on in the newsgroups.

6. Linguistics

Thomas: Nothing more than already reported. Work continues on verb transitivity.

Maaren: working missing list. There are very many misspellings in the missing list. Spelling errors are very valuable for us, and in general, they are hard to get at. The spelling errors in the missing list should thus be gathered, e.g. in this format:

/misspelledform/correctspelledform/frequency_of_misspelledform?/

There is a thread for this on the newsgroup, "Misspellings in the corpus material ". Two strategies: 1. Add the common misspelled words to the parser, with a !SUB tag. 2. make separate files with /misspelledform/correctspelledform/, as suggested above. The problem with (2) is that it does not contain any morphology, and the problem with (1) is that it does not offer any correction. The solution would be to link departementa to departemeantta, in some way. Central in the discussion is the linguistic aspect: What do we want to link to what.

One frequent error type is the a - á errors. We could perhaps have a special prechecking a>á before the usual "editing distance 1".

Compilation error

see BUG #69. The parser did not compile on the 17th of April. Many of the errors have general interest, and the cvs comment will be posted to the news group.

7. Term db

Deadline is end of April for the internal beta. Work is in progress, but not done.

8. Other issues

Job positions.

  • Sámi language technology position vacant in Tromsø (Nordlys today)
  • Marit's position soon available for others to apply for
  • There will in practice be money for one more position for large parts of the disamb project lifespan. Please suggest candidates if you know someone.

Mouse problems

Mainly Sjur has problems, Tomi sometimes

Upgrade to 10.4

Leif Åge has ordered Tiger, and he will distribute the installation packages from Guovdageaidnu. Tiger will be released on the 29th of April.

Things that we have modified may get broken as we move to 10.4. We should be aware of those issues. The information from Apple is incomplete.

Physical meeting?

Week 21 (23.5.->), place to be decided opn nest meeting

Vacation

To be discussed and decided in the next meeting.

9. Summary, task lists

TODO:

  • All:
    • Follow up the corpus format discussion.
    • Decide upon vacation time
    • Views on physical meeting place (May meeting)
  • Tomi:
    • Waiting for the corpus xml-format verification, to complete xml-conversion script(s) after that
  • Børre:
    • Contact Kimmo Koskenniemi and Ruth Vatvedt Fjeld about contract
    • Follow up on the divvun.no issue, suggest tomcat and .war files
    • Work on the termdb
    • Find a solution for wiki and utf-8
    • Link problems in forrest
    • Contact Min Áigi, discuss technical details
  • Maaren: tries to work with the missing list. Else?
  • Thomas: work with verb transitivity
  • Sjur:
    • terminology db
    • text license contract
    • divvun.no

10. Next meeting, closing

25.04.2005 09.30

Closed at 13.00