Meeting_2005-10-03

Meeting setup

Date: 03.10.2005
Time: 10.00 Norw. time
Place: Wherever we are : -)
Tools: iChat, SubEthaEdit

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
Speller infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10: 35.

Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond

Absent: Saara

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

discuss with Anders Kintel about possible cooperation
- Would like to hear what the Sámediggi finds out about the contract that Anders Kintel has with them. Berit Karin Paulsen is working on this.
Follow up on CVS mailing:
- Have a look at why Maaren and Thomas get two copies of every samicvs mail.
  - This one is ok.
Contact oahpahusossodat and the rest of the SD about texts
- Contact these to clarify details on how to gather the texts
  - Not done
Document the corpus infrastructure
- Has documented the word2xml.pl and corpus2dir.pl scripts.
Reorganise the directory structure
- Done it on my own machine, discussed with Trond and Tomi on how to arrange the directory structure.
Continue converting text from input format to our xml
- Done
Contact Saara about pdf conversions.
- Not done
Have a look at the placenames files.
- Not done
Børre to ask Thor Øyvind to configure Bugzilla to send e-mail reminders for open bugs not touched in a month.
- Sent an e-mail, no answer.
add cvs commit xml validation
- Not done
look into how divvun2web can provide (better) error messages, or look at forrestbot
- divvun2web seems to work satisfactory
Børre and Saara to discuss what perhaps is two different approaches to corpus conversion, and evaluate differences and strengths.
- Belongs to the other point

Maaren

The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology.
- not done
shall get mainly through the missing list from risten.no this week
- have worked with risten.no only on thursday
Start working on grammatical issues with Thomas and Trond
- not done
Work on the name project with Trond and Sjur
- shall start to work with this issue this week
Start looking at normativity issues
- shall start looking at this issue this week
Work on the numerals project with Trond
- waiting for Trond

Saara

Look at the corpus infrastructure issue
- Has looked at it.
Look at the corpus interface issue with Lars
- We will have to wait until Lars has a working demo ready, and then evaluate it.
Convert texts from .doc to .xml, to get a grasp of our corpus format
- Got a grasp without converting.
Have a look at the pdf-to-xml issue
- Has looked at it, is now working on the conversion issue.
- use the priority list earlier in the memo for a guidance
Børre and Saara to discuss what perhaps is two different approaches to corpus conversion, and evaluate differences and strengths.

Sjur

risten.no bugs and fixes
- brief discussions with Pia, Risten about moving forward with the main issue (concurrent editing)
complete the action summary after our half-year evaluation
- finally done!
follow up on:
- voice group-chat not working to Sámediggi
  - Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
To the board:
- place name status
  - done
- south sami project draft
  - done
- deliverables
  - done
- check the memo from the last meeting
  - not yet done
project planning with Trond
- postponed till Kautokeino this week
- looked more into software, tools for this, preferably coupled with requirements specification and testing
Work on the name project with Trond and Maaren
- coming up in Kautokeino
Prepare for a Lule Sámi meeting with Árran
- nothing done
Follow up on place names from Norge Digitalt
- waiting for an answer
Evaluate SFST as speller (and analyzer) lexicon
- still on the postponed list
prepare for the Guovdageaidnu meeting:
- name lexicon
  - not done
- three-part compounds
  - not done
others:
- tried looking at the utf-8 error reported by perl when running scripts/preprocess - under the assumption that the input file contains bad chars - but nothing found, and nothing new revealed (other than MS Words files often contains spurious FormFeed chars (x000c, octal 14, decimal 12).

Thomas

work on Lule Sami compounding and derivation
- worked and still working
Look at Linguistic bugs with Trond.
- solved some
Prepare for a Lule Sámi meeting with Árran
- not done

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
  - Not done yet
three-part compounding
- Not done
corpus infrastructure: dtd location (both public and internal)
- Not done
corpus infrastructure: file and dir organisation
- Worked on this one with Børre
Document aspell and corpus infrastructure
Add html-to-xml conversion to corpus infra
- This is on the way...
- The trick is to convert the html to well formed xhtml with Tidy and then convert it with our xslt-file
Other tasks:
- Did cgi-script for uploading corpus files to cochise

Trond

Work on the bug list (11 open).
- Worked a little.
Work on compounds (three-part, with Tomi)
- Not done.
Work on the corpus interface (with Lars and Saara)
- Discussed plans with them.
Work on the name agreement with "Norge digitalt" with Thomas
- Not done.
Get the new version of the New Testament
- Talked to Bibelselskapt, now awaiting their decision.
Introduce the new coworker to the work routines
- This has taken most of my time.
project planning with Sjur
- Not done
Work on the name project with Maaren and Sjur
- Have prepaired for the Guovdageaidnu meeting, by discussing with Kari Pitkänen, a Finnish collegue. Will bring his comments to G.
Prepare for a Lule Sámi meeting with Árran
- Not done
Work on the numerals project with Maaren
- Not done
Prepare for three-part compounds meeting in Guovdageaidnu
- Still to be done, not in G. yet.
Contact the University lawyers for comments on the contract
- Done, but the lawyer I talked to is on leave this week and the beginning of next, we thus had bad luck there.

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general ("To be done by the ones making the corpora": Børre, Tomi, Trond, Saara).
Now we have 4 documents:
1. Correct corpus (disamb usage)
2. Corpus plan (for the disamb corpus cwb)
3. Corpus conversion, two versions, in infra and in ling. Tomi and Børre have done parallell work ;-(
4. catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target groups:

For the users/linguists:
1. What corpus are found, how do I use them (this info is now scattered)
For the collectors:
1. How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc)
For the programmer
1. What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

add/update Aspell documentation (Tomi)
finish divvun2web script (Børre)
1. the cronjob is up an working. It needs a better error reporting mechanism, though.
as always: document what you're doing: -) (all)

Crontab:

Do we need validation upon cvs check-in? What about forrestbot? An xml validation check (acc to our dtd) is a good thing. We would like that to be implemented.

We need better error reporting, and errors should preferably be caught before cvs commit.

Børre:

add cvs commit xml validation
look into how divvun2web can provide (better) error messages, or look at forrestbot

4. Corpus gathering

See notes from the 12.9. meeting for details about the steps forward.

Contracts

Tasks:

read through Trond's translations (Børre, Sjur)
- Done, and commented
e-mail Kimmo Koskenniemi about the missing fourth contract, and about unclear paragraphs in the versions received (Sjur).
- Done
contact lawyers (find suitable lawyers, Trond will start with the university)
send the license text to lawyers
- These should be sent to the university lawyers, as the sámediggi ones are overloaded with work.
add a background document explaining the model

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

We will send the contracts as is to the lawyers, in parallel with waiting for comments from Kimmo. The SD lawyers will be glad to not get this task, as they're overloaded as is. Trond will follow up on the issue.

North Sámi New Testament

Trond has been in contact with Bibelselskapet, and sent the version he received from a collegue, gathered from the web. Now we await a reaction from Bibelselskapet.

Lule Sámi New Testament

Børre has converted the translation (which was only available in pdf) to rtf, and sent the files back to Bibelsällskapet for corrections.

Update (week 39): Olavi Korhonen had some problems with fonts in the document, Børre helped him with that. We seem to have the final version now.

5. Corpus infrastructure

Naming conventions and directory structure

See notes from the 12.9. meeting for details about the decision and implementation, as well as a list of tasks.

Børre has done some work, but it is only locally on his machine. Some more discussions with Trond must be done before these are copied to cochise.

Børre:
- The three directory structure is too cumbersome, the only thing we need to have control of is the original file, the meta-info.xsl file and the version of the script making the final file.
- We will have to have some build system (make, scons) to do the grunt work.
Sjur
- If everything but the orig directory is automatically created and rene wed, then everything should be clean and fine, and the directories will only be implementation details.
- This also requires some sort of build system, which Sjur had assumed

Corpus conversion

Pdf to XML

Extraction priority list

retain correct Sámi characters: ok
retain word and sentence order: ok
retain paragraph order: ok
retain structure
1. paragraphs: ok, by perl
2. titles, headers: ok, by perl
3. metadata (author, year, etc.): ok, when it is present in the document
4. lists: no
5. tables: no

A Perl module for character conversions

Saara proposes a Perl module that handles the character coding issues. The pdf conversion faces the same problems with e.g. Latin-6 character codes are present in the doc to xml conversion as well, so a module would be a good idea. (I got the impression from the word2xml.pl script, that some problems with the characters still remain.) I try to make the module so general that it suits to html character conversions later on. What do you think?

Problems found so far using open-source tools:

paragraphs are correctly ordered, but not separated (i.e. one long paragraph): solved
no structure: parsed afterwards

Tasks:

Børre and Saara to discuss what perhaps is two different approaches to this,
and evaluate differences and strengths.

HTML to XML

we already have some tools according to Saara
this is anyway easy, as HTML provides us with the structure we need
what is needed is a transformation to our XML, + adding the metadata as usual
it can wait at least a week or two (after pdf conversion is mostly done)

6. Linguistics

Name lexicon

See notes from the 12.9. meeting

Place name summary

Sjur needs a short resumé of the present status wrt the parallel place names. The resumé will be given to the project board for information. This is the situation:

Finland: we have received all names, in North Sámi and Finnish. What about Inari and Skolt Sámi place names? Børre will find out what we received, and whether it contained Inari and Skolt Sámi. Today: -)
Sweden: the transcription (from old to new orthogr.) is finished, now the names are identified as Sámi ones. Have we made it clear that we need parallel names (i.e. the Swedish (Finnish/Meänkieli) parallel as well? No. Can they deliver parallel names? We don't know (yet). Thomas will ask.
- Lantmäteriet are now manually linking the parallel names: Due to lack of funds it takes time. Anette Torensjö doesn´t believe it´s finished until the beginning of next year.
Norway: we have received all Sámi names, but the CD is unreadable (we have earlier received a partial list of all names in all Sámi languages, reflecting the status quo of the orthography conversion 3 years ago).
We can not get any parallel listing directly from Statens kartverk, but Øystein Johannessen has promised to help us through Norge Digitalt. As of today, nothing new has been heard.

Conclusion: it has been much easier to get place names from Finland and Sweden than anticipated, and so far without any costs on the project. So far, the place names have been the single most important contribution from Finland and Sweden for this project.

Twol SETS definition issue

Trond and Thomas tried to define Lule Sámi G1, G2, G3 sequences in the SETS section of the twol file. It did not succeed, it turned out we had done it in the xfst spirit. We would like to have input on this issue.

North Sámi

three-part compounds issue still open, as are the name and number projects.
- Trond, Maaren, Sjur will look into this in Guovdageaidnu
Johnny Andersen has written a letter to us on the treatment of Sámi place names, we need a contract with "Norge digitalt", via UFD. We will have to get such an agreement. This is a task for Thomas and Trond.
- Sjur has written an e-mail to the UFD contact person, Øystein Johannessen, who will look into it soon.
normativity issues:
- the Giellalávdegoddi meeting is in October sometime

Lule Sámi

we need a lexicon
compounding and derivation
- Thomas has finished with deverbals, now working with denominals
- most likely the same three-part compound problem in Lule Sámi as well (the middle stem shortens to a stem only found in such compounds)
- it is possible that even the first stem shortens the same way
Suffix boundary symbol has not been added, we are not sure whether we should do that.

Numerals

An empirical overview
1. Numeral generation
2. Numeral inflection
3. Numerals as parts of compounds
A clear concept of how we want to treat them
1. Tagging
A treatment

7. Speller infrastructure

Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See 15.8. meeting memo for more.

Got an e-mail from Roy Dragseth, that he had to terminate the aspell processes Tomi had left running on cochise after work, because they consumed all the memory.

See 12.9. meeting memo for a list of open issues.

8. Other

Technical issues

The mac os / perl bug (at least Trond and Sjur has it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
  - Another example:
  - : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33.

Bug fixing

13 open bugs (and 24 risten.no bugs) - it seems Sjur can need some help with bug fixing in risten.no: -) (note: some of them are feature requests more than real bugs)

Buying

new external screens for all Divvun workers
rugsacks for all? Yes.

9. Summary, task list