Meeting_2005-09-26

Meeting setup

  • Date: 26.09.2005
  • Time: 10.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10: 10.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • Finish crontab specification for the cvs update/export script Tomi made
    • Finished the divvun2web script and set up the cronjob
    • It seems to work, but needs a better reporting features.
  • Contact Svenska bibelsällskapet / Olavi Korhonen
    • Kjell Hognesius had the files (only pdf). They were converted to .rtf format and returned so they can do their own finetuning on them.
  • discuss with Anders Kintel about possible cooperation
    • Not done
  • Follow up on CVS mailing:
    • Have a look at why Maaren and Thomas get two copies of every samicvs mail.
      • Maaren had setup both a pop and exchange account to the samediggi.
  • Contact oahpahusossodat and the rest of the SD about texts
    • Sent an internal note to the main departments inside the SD.
  • Document the corpus infrastructure
    • Not done
  • Read through the Helsinki contracts (new translations)
    • Done, sent some comments to Sjur and Trond.
  • Reorganise the directory structure
    • Some work done, sitting on my machine
  • Continue converting text from input format to our xml
    • Not done
  • Other things done
    • Learned how to use WebSak.
    • Talked to Ellen Näkkäläjärvi about getting texts from the Finnish sámi parliament. She agreed on finding and sending texts to us.
    • Gathered texts from the Guovdageaidnu municipality

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology. do it next week
    • shall do it next week
  • shall get mainly through the missing list from risten.no this week
    • working with risten.no this week also
  • Start working on grammatical issues with Thomas and Trond
    • shall do it this week or next week?
  • Work on the name project with Trond and Sjur
    • okei okei
  • Start looking at normativity issues
    • shall do it this week
  • Work on the numerals project with Trond
    • shall contact Trond

Saara

  • Look at the corpus infrastructure issue
    • Done
  • Look at the corpus interface issue with Lars
    • Done, the development of the interface is going on well. The interface to the grammatical data (tags, etc.) is controlled by us. The aim of the metadata interface is to be suitable to a wide range of corpora, and it seems that it suits us too. I was not able to get any specification of the metadata interface, but the metadata specification Lars proposed corresponds to our dtd.
  • Convert texts from .doc to .xml, to get a grasp of our corpus format
    • I did not convert any documents, but I am now very familiar with the format.
  • Have a look at the pdf-to-xml issue
    • The metadata is extracted when it is available in the pdf-document. The structural information that can be extracted is headers and paragraphs but only afterwards by perl. Working with a Perl module to handle the character coding issues.

Sjur

  • risten.no bugs and fixes
    • nothing
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working to Sámediggi
      • Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
  • To the board:
    • write draft specification for the outsourced tasks
      • Redefined as:
      • qualification criterias - done
      • selection criterias - done
      • deliverables - partly done, required for the board?
      • requirements specification - not done, and not required for the board
    • write half-yearly project report with progress and bugdet status
      • done
    • Added: write project sketch/proposal for South Sámi spellers, with budget
      • started
  • project planning with Trond
    • not done
  • Work on the name project with Trond and Maaren
    • not done
  • Prepare for a Lule Sámi meeting with Árran
    • not done
  • Follow up on place names from Norge Digitalt
    • waiting for the material
  • Read through the Helsinki contracts (new translations)
    • done - several remarks and questions discussed with Trond
  • Talk to Bitte about the Lule Sámi lexicon
    • done. She will read the contract
  • Evaluate SFST as speller (and analyzer) lexicon
    • still postponed.
  • prepare for the Guovdageaidnu meeting:
    • name lexicon
    • three-part compounds
    • not done
  • e-mail Kimmo Koskenniemi about contract issues
    • done
  • other tasks:
    • discussed twol testing with Trond
    • started to work on a South Sámi speller project description, to be presented to the board

Thomas

  • work on Lule Sami compounding and derivation
    • worked and still working
  • Look at Linguistic bugs with Trond.
    • not worked with bugs last week
  • Prepare for a Lule Sámi meeting with Árran
    • not done

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Trying to solve the UTF-8
    • Got e-mail from Roy Dragseth, that he had to terminate the aspell processes I had left running on cochise after work, because they consumed all the memory.
    • Contact aspell author (UTF-8 thing)
      • There's also a aspell-dev list, perhaps that's the right forum. (Comment from Børre)
      • I have subscribed to aspell-user list
      • Not done
  • three-part compounding
    • Not done
  • corpus infrastructure: dtd location (both public and internal)
    • Not done
  • corpus infrastructure: file and dir organisation
    • Not done
  • Document aspell and corpus infrastructure
    • This has been done to some extent
  • Add html-to-xml conversion to corpus infra
    • Waiting the pdf-to-xml results
      • Sjur: There's no reason to wait except priorities, is there?
  • Other issues:
    • Cleaned up the aspell structure
    • Implemented a new testing routine for twol-files

Trond

  • Work on the bug list.
    • The last bug fix -- from any of us -- is from Sept 14th : -(
  • Work on compounds (three-part, with Tomi)
    • Not done.
  • Work on the corpus interface (with Lars and Saara)
    • Have discussed with Saara, but not with Lars.
  • Work on the name agreement with "Norge digitalt" with Thomas
    • Not done.
      • This one is now on Sjur's table
  • Look at the linguistic aspects of the speller clitics
    • Done. They are trivial.
  • Get the new version of the New Testament
    • Still open.
  • Introduce the new coworker to the work routines
    • The new coworker withdrew his application, and this issue has thus taken most of the week. It seems we have been able to get a stand-in for him.
  • project planning with Sjur
    • Not done.
  • Work on the name project with Maaren and Sjur
    • Not done.
  • Prepare for a Lule Sámi meeting with Árran
    • Not done.
  • Work on the numerals project with Maaren
  • Prepare for three-part compounds meeting in Guovdageaidnu
    • Not done.
  • Other issues:
    • Made a new testing routine for our twol-files, made a test suite for sme, and debugged it. Tomi has made an interface for it.
    • Discussed legal issues.
    • Worked on the lexicon.

3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general ("To be done by the ones making the corpora": Børre, Tomi, Trond, Saara).
  2. Now we have 4 documents:
    1. Correct corpus (disamb usage)
    2. Corpus plan (for the disamb corpus cwb)
    3. Corpus conversion
    4. catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target groups:

  1. For the users/linguists:
    1. What corpus are found, how do I use them (this info is now scattered)
  2. For the collectors:
    1. How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc)
  3. For the programmer
    1. What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

  1. add/update Aspell documentation (Tomi)
  2. finish divvun2web script (Børre)
    1. the cronjob is up an working. It needs a better error reporting mechanism, though.
  3. as always: document what you're doing: -) (all)

Crontab:

Do we need validation upon cvs check-in? What about forrestbot? An xml validation check (acc to our dtd) is a good thing. We would like that to be implemented.

We need better error reporting, and errors should preferably be caught before cvs commit.

Børre:

  • add cvs commit xml validation
  • look into how divvun2web can provide (better) error messages, or look at forrestbot

4. Corpus gathering

See notes from the 12.9. meeting for details about the steps forward.

Contracts

Tasks:

  • read through Trond's translations (Børre, Sjur)
    • Done, and commented
  • e-mail Kimmo Koskenniemi about the missing fourth contract, and about unclear paragraphs in the versions received (Sjur).
    • Done
  • contact lawyers (find suitable lawyers, Trond will start with the university)
  • send the license text to lawyers
    • These should be sent to the university lawyers, as the sámediggi ones are overloaded with work.
  • add a background document explaining the model

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

We will send the contracts as is to the lawyers, in parallel with waiting for comments from Kimmo. The SD lawyers will be glad to not get this task, as they're overloaded as is.

Lule Sámi New Testament

Børre has converted the translation (which was only available in pdf) to rtf, and sent the files back to Bibelsällskapet for corrections.

5. Corpus infrastructure

Naming conventions and directory structure

See notes from the 12.9. meeting for details about the decision and implementation, as well as a list of tasks.

Børre has done some work, but it is only locally on his machine. Some more discussions with Trond must be done before these are copied to cochise.

Corpus conversion

Pdf to XML

Extraction priority list

  1. retain correct Sámi characters: ok
  2. retain word and sentence order: ok
  3. retain paragraph order: ok
  4. retain structure
    1. paragraphs: ok, by perl
    2. titles, headers: ok, by perl
    3. metadata (author, year, etc.): ok, when it is present in the document
    4. lists: no
    5. tables: no

A Perl module for character conversions

  • Saara proposes a Perl module that handles the character coding issues. The pdf conversion faces the same problems with e.g. Latin-6 character codes are present in the doc to xml conversion as well, so a module would be a good idea. (I got the impression from the word2xml.pl script, that some problems with the characters still remain.) I try to make the module so general that it suits to html character conversions later on. What do you think?

Problems found so far using open-source tools:

  • paragraphs are correctly ordered, but not separated (i.e. one long paragraph): solved
  • no structure: parsed afterwards

Tasks:

  • Børre and Saara to discuss what perhaps is two different approaches to this,
  • and evaluate differences and strengths.

HTML to XML

  • we already have some tools according to Saara
  • this is anyway easy, as HTML provides us with the structure we need
  • what is needed is a transformation to our XML, + adding the metadata as usual
  • it can wait at least a week or two (after pdf conversion is mostly done)

6. Linguistics

Name lexicon

See notes from the 12.9. meeting

Place name summary

Sjur needs a short resumé of the present status wrt the parallel place names. The resumé will be given to the project board for information. This is the situation:

  • Finland: we have received all names, in North Sámi and Finnish. What about Inari and Skolt Sámi place names? Børre will find out what we received, and whether it contained Inari and Skolt Sámi. Today: -)
  • Sweden: the transcription (from old to new orthogr.) is finished, now the names are identified as Sámi ones. Have we made it clear that we need parallel names (i.e. the Swedish (Finnish/Meänkieli) parallel as well? No. Can they deliver parallel names? We don't know (yet). Thomas will ask.
  • Norway: we have received all Sámi names, but the CD is unreadable (we have earlier received a partial list of all names in all Sámi languages, reflecting the status quo of the orthography conversion 3 years ago).
  • We can not get any parallel listing directly from Statens kartverk, but Øystein Johannessen has promised to help us through Norge Digitalt. As of today, nothing new has been heard.

Conclusion: it has been much easier to get place names from Finland and Sweden than anticipated, and so far without any costs on the project. So far, the place names have been the single most important contribution from Finland and Sweden for this project.

North Sámi

  • three-part compounds issue still open, as are the name and number projects.
    • Trond, Maaren, Sjur will look into this in Guovdageaidnu
  • Johnny Andersen has written a letter to us on the treatment of Sámi place names, we need a contract with "Norge digitalt", via UFD. We will have to get such an agreement. This is a task for Thomas and Trond.
    • Sjur has written an e-mail to the UFD contact person, Øystein Johannessen, who will look into it soon.
  • normativity issues:
    • the Giellalávdegoddi meeting is in October sometime

Lule Sámi

  • we need a lexicon
  • compounding and derivation
    • Thomas has finished with deverbals, now working with denominals
    • most likely the same three-part compound problem in Lule Sámi as well (the middle stem shortens to a stem only found in such compounds)
    • it is possible that even the first stem shortens the same way
  • Suffix boundary symbol has not been added, we are not sure whether we should do that.

Numerals

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

7. Speller infrastructure

Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See 15.8. meeting memo for more.

Got an e-mail from Roy Dragseth, that he had to terminate the aspell processes Tomi had left running on cochise after work, because they consumed all the memory.

See 12.9. meeting memo for a list of open issues.

8. Other

Technical issues

  • The mac os / perl bug (at least Trond and Sjur has it):
    • utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
      • Another example:
      • : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33.
  • CVS mailing to Maaren:
    • it is working, but she receives two copies of every message. I don't know why. It is a minor annoyance, but needs to be looked into. Thomas has the same problem. Børre will look into it (get rid of the double copies).
      • is Thomas also set up with a double set of e-mail accounts?
      • Thomas has two mail accounts. Not sure if that's the problem.

Bug fixing

  • 13 open bugs (and 24 risten.no bugs) - it seems Sjur can need some help with bug fixing in risten.no: -) (note: some of them are feature requests more than real bugs)
  • some bugs are not real, but just not closed yet even though they are fixed. Please have a look at all the bugs you are assigned to. Do we need to set up an automatic reminder system? Something like sending an e-mail if the bug has been open and untouched for more than X weeks?
    • Børre has not checked his bugs last week: -)
    • Thomas never has bugs, they go to Trond.
    • Tomi has - good!
    • Sjur has not - bad: -(
  • Børre to ask Thor Øyvind to configure Bugzilla to send e-mail reminders for open bugs not touched in a month.

Buying

  • new external screens for all Divvun workers
  • rugsacks for all? Yes.

9. Summary, task list

Børre

  • discuss with Anders Kintel about possible cooperation
  • Follow up on CVS mailing:
    • Have a look at why Maaren and Thomas get two copies of every samicvs mail.
  • Contact oahpahusossodat and the rest of the SD about texts
    • Contact these to clarify details on how to gather the texts
  • Document the corpus infrastructure
  • Reorganise the directory structure
  • Continue converting text from input format to our xml
  • Contact Saara about pdf conversions.
  • Have a look at the placenames files.
  • Børre to ask Thor Øyvind to configure Bugzilla to send e-mail reminders for open bugs not touched in a month.
  • add cvs commit xml validation
  • look into how divvun2web can provide (better) error messages, or look at forrestbot
  • Børre and Saara to discuss what perhaps is two different approaches to corpus conversion, and evaluate differences and strengths.

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology.
    • shall do it next week
  • shall get mainly through the missing list from risten.no this week
    • working with risten.no this week also
  • Start working on grammatical issues with Thomas and Trond
    • shall do it this week or next week?
  • Work on the name project with Trond and Sjur
    • okei okei
  • Start looking at normativity issues
    • shall do it this week
  • Work on the numerals project with Trond
    • shall contact Trond

Saara

  • Look at the corpus infrastructure issue
  • Look at the corpus interface issue with Lars
  • Convert texts from .doc to .xml, to get a grasp of our corpus format
  • Have a look at the pdf-to-xml issue
    • use the priority list earlier in the memo for a guidance
  • Børre and Saara to discuss what perhaps is two different approaches to corpus conversion, and evaluate differences and strengths.

Sjur

  • risten.no bugs and fixes
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working to Sámediggi
      • Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
  • To the board:
    • place name status
    • south sami project draft
    • deliverables
    • check the memo from the last meeting
  • project planning with Trond
  • Work on the name project with Trond and Maaren
  • Prepare for a Lule Sámi meeting with Árran
  • Follow up on place names from Norge Digitalt
  • Evaluate SFST as speller (and analyzer) lexicon
  • prepare for the Guovdageaidnu meeting:
    • name lexicon
    • three-part compounds

Thomas

  • work on Lule Sami compounding and derivation
  • Look at Linguistic bugs with Trond.
  • Prepare for a Lule Sámi meeting with Árran

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
  • three-part compounding
  • corpus infrastructure: dtd location (both public and internal)
  • corpus infrastructure: file and dir organisation
  • Document aspell and corpus infrastructure
  • Add html-to-xml conversion to corpus infra

Trond

  • Work on the bug list (11 open).
  • Work on compounds (three-part, with Tomi)
  • Work on the corpus interface (with Lars and Saara)
  • Work on the name agreement with "Norge digitalt" with Thomas
  • Get the new version of the New Testament
  • Introduce the new coworker to the work routines
  • project planning with Sjur
  • Work on the name project with Maaren and Sjur
  • Prepare for a Lule Sámi meeting with Árran
  • Work on the numerals project with Maaren
  • Prepare for three-part compounds meeting in Guovdageaidnu
  • Contact the University lawyers for comments on the contract
  • Set up testing procedures for the disambiguator.

10. Next meeting, closing

03.10.2005 10: 00

Closed at 11: 53