Meeting_2005-10-03
Meeting setup
- Date: 03.10.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:35.
Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond
Absent: Saara
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- discuss with Anders Kintel about possible cooperation
- Would like to hear what the Sámediggi finds out about the contract that
- Follow up on CVS mailing:
- Have a look at why Maaren and Thomas get two copies of every samicvs
- This one is ok.
- Contact oahpahusossodat and the rest of the SD about texts
- Contact these to clarify details on how to gather the texts
- Not done
- Document the corpus infrastructure
- Has documented the word2xml.pl and corpus2dir.pl scripts.
- Reorganise the directory structure
- Done it on my own machine, discussed with Trond and Tomi on how to arrange
- Continue converting text from input format to our xml
- Done
- Contact Saara about pdf conversions.
- Not done
- Have a look at the placenames files.
- Not done
- Børre to ask Thor Øyvind to configure Bugzilla to send e-mail
- Sent an e-mail, no answer.
- add cvs commit xml validation
- Not done
- look into how divvun2web can provide (better) error messages, or look at
- divvun2web seems to work satisfactorily
- Børre and Saara to discuss what are perhaps two different approaches
- Belongs to the other point
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- not done
- shall get mainly through the missing list from risten.no this week
- have worked with risten.no only on Thursday
- Start working on grammatical issues with Thomas and Trond
- not done
- Work on the name project with Trond and Sjur
- shall start to work with this issue this week
- Start looking at normativity issues
- shall start looking at this issue this week
- Work on the numerals project with Trond
- waiting for Trond
Saara
- Look at the corpus infrastructure issue
- Has looked at it.
- Look at the corpus interface issue with Lars
- We will have to wait until Lars has a working demo ready, and then evaluate
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- Got a grasp without converting.
- Have a look at the pdf-to-xml issue
- Has looked at it, is now working on the conversion issue.
- use the priority list earlier in the memo for guidance
- Børre and Saara to discuss what are perhaps two different approaches
Sjur
- risten.no bugs and fixes
- brief discussions with Pia, Risten about moving forward with the main issue
- complete the action summary after our half-year evaluation
- finally done!
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- To the board:
- place name status
- done
- south sami project draft
- done
- deliverables
- done
- check the memo from the last meeting
- not yet done
- project planning with Trond
- postponed till Kautokeino this week
- looked more into software, tools for this, preferably coupled with
- Work on the name project with Trond and Maaren
- coming up in Kautokeino
- Prepare for a Lule Sámi meeting with Árran
- nothing done
- Follow up on place names from Norge Digitalt
- waiting for an answer
- Evaluate SFST as speller (and analyzer) lexicon
- still on the postponed list
- prepare for the Guovdageaidnu meeting:
- name lexicon
- not done
- three-part compounds
- not done
- others:
- tried looking at the utf-8 error reported by perl when running
Thomas
- work on Lule Sami compounding and derivation
- worked and still working
- Look at Linguistic bugs with Trond.
- solved some
- Prepare for a Lule Sámi meeting with Árran
- not done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done yet
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- corpus infrastructure: file and dir organisation
- Worked on this one with Børre
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
- This is on the way...
- The trick is to convert the html to well formed xhtml with Tidy and then
- Other tasks:
- Wrote a CGI script for uploading corpus files to cochise
Trond
- Work on the bug list (11 open).
- Worked a little.
- Work on compounds (three-part, with Tomi)
- Not done.
- Work on the corpus interface (with Lars and Saara)
- Discussed plans with them.
- Work on the name agreement with "Norge digitalt" with Thomas
- Not done.
- Get the new version of the New Testament
- Talked to Bibelselskapet, now awaiting their decision.
- Introduce the new coworker to the work routines
- This has taken most of my time.
- project planning with Sjur
- Not done
- Work on the name project with Maaren and Sjur
- Have prepared for the Guovdageaidnu meeting, by discussing with Kari Pitkänen,
- Prepare for a Lule Sámi meeting with Árran
- Not done
- Work on the numerals project with Maaren
- Not done
- Prepare for three-part compounds meeting in Guovdageaidnu
- Still to be done, not in G. yet.
- Contact the University lawyers for comments on the contract
- Done, but the lawyer I talked to is on leave this week and the beginning of next,
3. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
- Now we have 4 documents:
- Correct corpus (disamb usage)
- Corpus plan (for the disamb corpus cwb)
- Corpus conversion, two versions, in infra and in ling. Tomi and Børre have
- catxml
For the basic corpora, we need 3 types of documentation, or doc for 3 target
- For the users/linguists:
- Which corpora exist, and how do I use them (this info is now scattered)
- For the collectors:
- How do I add texts, where do I add them, how do I convert them (this is the
- For the programmer
- What did I actually do? (this is partly the catxml doc)
For the work on the graphical user interface, we need documentation as well, in
- add/update Aspell documentation (Tomi)
- finish divvun2web script (Børre)
- the cronjob is up and working. It needs a better error reporting mechanism,
- as always: document what you're doing :-) (all)
Crontab:
Do we need validation upon cvs check-in? What about forrestbot?
We need better error reporting, and errors should preferably be caught before
Børre:
- add cvs commit xml validation
- look into how divvun2web can provide (better) error messages, or look at
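As an illustration of what a check at commit time could report, here is a minimal sketch in Python. It only tests well-formedness, not validity against a DTD, and it is not the existing divvun2web or CVS setup: the function names and sample file names are hypothetical, and hooking it into CVS is left out.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a short error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple for the first error found.
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


def report(files):
    """Map each file name to its error message, skipping clean files.

    `files` is a dict of {name: xml text}; the names are hypothetical.
    """
    errors = {}
    for name, text in files.items():
        message = check_well_formed(text)
        if message is not None:
            errors[name] = message
    return errors


if __name__ == "__main__":
    sample = {
        "good.xml": "<document><p>ok</p></document>",
        "bad.xml": "<document><p>unclosed</document>",
    }
    for name, message in report(sample).items():
        print(f"{name}: {message}")
```

Reporting the file name together with line and column is the kind of message that would let the committer fix the error before the nightly build fails.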
4. Corpus gathering
See notes from the 12.9. meeting
Contracts
Tasks:
- read through Trond's translations (Børre, Sjur)
- Done, and commented
- e-mail Kimmo Koskenniemi about the missing fourth contract, and about
- Done
- contact lawyers (find suitable lawyers, Trond will start with the
- send the license text to lawyers
- These should be sent to the university lawyers, as the Sámediggi ones
- add a background document explaining the model
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
We will send the contracts as is to the lawyers, in parallel with waiting for
North Sámi New Testament
Trond has been in contact with Bibelselskapet, and sent the version he
Lule Sámi New Testament
Børre has converted the translation (which was only available in pdf) to
Update (week 39): Olavi Korhonen had some problems with fonts in the document,
5. Corpus infrastructure
Naming conventions and directory structure
See notes from the 12.9. meeting
Børre has done some work, but it is only locally on his machine. Some more
- Børre:
- The three directory structure is too cumbersome, the only thing we need to
- We will have to have some build system (make, scons) to do the grunt work.
- Sjur
- If everything but the orig directory is automatically created and renewed,
- This also requires some sort of build system, which Sjur had assumed
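To make the build-system idea concrete, here is a minimal sketch in Python of the timestamp check that make or scons would do for us. It is an illustration only: the directory names, the `.xml` suffix and the function names are assumptions, and the actual conversion step is not shown.

```python
import os
import tempfile


def needs_rebuild(source, target):
    """True if target is missing or older than source (make-style check)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def stale_targets(orig_dir, out_dir, suffix=".xml"):
    """List (source, target) pairs whose target must be regenerated.

    Every file under orig_dir maps to out_dir/<relative path><suffix>.
    """
    pairs = []
    for root, _dirs, files in os.walk(orig_dir):
        for name in sorted(files):
            source = os.path.join(root, name)
            relative = os.path.relpath(source, orig_dir)
            target = os.path.join(out_dir, relative + suffix)
            if needs_rebuild(source, target):
                pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    # Hypothetical layout: only orig/ is hand-maintained.
    base = tempfile.mkdtemp()
    orig_dir = os.path.join(base, "orig")
    out_dir = os.path.join(base, "converted")
    os.makedirs(orig_dir)
    os.makedirs(out_dir)
    open(os.path.join(orig_dir, "example.doc"), "w").close()
    for source, target in stale_targets(orig_dir, out_dir):
        print(f"would convert {source} -> {target}")
```

With a check like this, everything outside orig can be deleted and regenerated at any time, which is the property Sjur describes.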
Corpus conversion
Pdf to XML
Extraction priority list
- retain correct Sámi characters: ok
- retain word and sentence order: ok
- retain paragraph order: ok
- retain structure
- paragraphs: ok, by perl
- titles, headers: ok, by perl
- metadata (author, year, etc.): ok, when it is present in the document
- lists: no
- tables: no
A Perl module for character conversions
- Saara proposes a Perl module that handles the character coding
Problems found so far using open-source tools:
- paragraphs are correctly ordered, but not separated (i.e. one long
- no structure: parsed afterwards
Tasks:
- Børre and Saara to discuss what are perhaps two different approaches, and evaluate differences and strengths.
HTML to XML
- we already have some tools according to Saara
- this is anyway easy, as HTML provides us with the structure we need
- what is needed is a transformation to our XML, + adding the metadata as usual
- it can wait at least a week or two (after pdf conversion is mostly done)
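As a rough illustration of the transformation step, here is a minimal sketch in Python. It assumes the HTML has already been cleaned into well-formed XHTML without named HTML entities. The target element names (`document`, `header`, `body`, `p`) and the metadata keys are hypothetical and are not taken from our DTD.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace prefix such as '{http://...}p' down to 'p'."""
    return tag.rsplit("}", 1)[-1]


def xhtml_to_corpus_xml(xhtml_text, metadata):
    """Convert well-formed XHTML into a simple, hypothetical corpus layout."""
    source = ET.fromstring(xhtml_text)
    document = ET.Element("document")

    # Metadata is added by hand, as for the other input formats.
    header = ET.SubElement(document, "header")
    for key, value in metadata.items():
        ET.SubElement(header, key).text = value

    body = ET.SubElement(document, "body")
    for element in source.iter():
        name = local_name(element.tag)
        # Flatten inline markup and normalise whitespace.
        text = " ".join("".join(element.itertext()).split())
        if not text:
            continue
        if name in ("h1", "h2", "h3"):
            ET.SubElement(body, "p", type="title").text = text
        elif name == "p":
            ET.SubElement(body, "p").text = text
    return ET.tostring(document, encoding="unicode")


if __name__ == "__main__":
    page = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        "<h1>Heading</h1><p>Some <b>Sámi</b> text</p>"
        "</body></html>"
    )
    print(xhtml_to_corpus_xml(page, {"author": "unknown", "year": "2005"}))
```

Lists and tables are ignored here. Nested paragraphs inside other handled elements would need extra care in a real converter.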
6. Linguistics
Name lexicon
See notes from the 12.9. meeting
Place name summary
Sjur needs a short resumé of the present status wrt the parallel place
- Finland: we have received all names, in North Sámi and Finnish. What
- Sweden: the transcription (from old to new orthogr.) is finished, now the
- Lantmäteriet are now manually linking the parallel names: Due to lack of
- Norway: we have received all Sámi names, but the CD is unreadable (we have
- We can not get any parallel listing directly from Statens kartverk,
Conclusion: it has been much easier to get place names from Finland and Sweden
Twol SETS definition issue
Trond and Thomas tried to define Lule Sámi G1, G2, G3 sequences in the SETS
North Sámi
- three-part compounds issue still open, as are the name and number projects.
- Trond, Maaren, Sjur will look into this in Guovdageaidnu
- Johnny Andersen has written a letter to us on the treatment of Sámi place
- Sjur has written an e-mail to the UFD contact person,
- normativity issues:
- the Giellalávdegoddi meeting is in October sometime
Lule Sámi
- we need a lexicon
- compounding and derivation
- Thomas has finished with deverbals, now working with denominals
- most likely the same three-part compound problem in Lule Sámi as well
- it is possible that even the first stem shortens the same way
- Suffix boundary symbol has not been added, we are not sure whether we should
Numerals
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
7. Speller infrastructure
Aspell
Write documentation here as well.
The munch-list is working, and the affix file is improving. See 15.8. meeting memo for more.
Got an e-mail from Roy Dragseth, that he had to terminate the aspell
See 12.9. meeting memo for
8. Other
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
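The cause of the bug has not been established. One possibility worth ruling out is input that is not UTF-8 at all: a lone byte 0xC4 is not a valid UTF-8 sequence, but it is "Ä" in Latin-1. The sketch below, in Python rather than Perl, only illustrates that difference. The function name and the Latin-1 fallback are assumptions, not a diagnosis of the preprocess script.

```python
def decode_with_fallback(raw, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to another encoding on failure.

    Returns (text, encoding actually used).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


if __name__ == "__main__":
    samples = [
        b"\xc4",      # lone byte: invalid as UTF-8, a letter in Latin-1
        b"\xc3\x84",  # the same letter encoded as two UTF-8 bytes
    ]
    for raw in samples:
        text, used = decode_with_fallback(raw)
        print(repr(raw), "decoded as", used, "->", text)
```

If the failing input turns out to be Latin-1 (or another legacy encoding), converting it once at the start of the pipeline would be cleaner than a fallback like this.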
Bug fixing
- 13 open bugs (and 24 risten.no bugs) - it seems Sjur could use some help
Buying
- new external screens for all Divvun workers
- rucksacks for all? Yes.
9. Summary, task list
Børre
- discuss with Anders Kintel about possible cooperation
- Contact oahpahusossodat and the rest of the SD about texts
- Reorganise the directory structure
- Continue converting text from input format to our xml
- Contact Saara about pdf conversions.
- Have a look at the placenames files.
- Ask Thor-Øivind to move bugzilla to our new webserver.
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- shall do it next week
- shall get mainly through the missing list from risten.no this week
- working with risten.no this week also
- Start working on grammatical issues with Thomas and Trond
- shall do it this week or next week?
- Work on the name project with Trond and Sjur
- okei okei
- Start looking at normativity issues
- shall do it this week
- Work on the numerals project with Trond
- shall contact Trond
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- Have a look at the pdf-to-xml issue
- use the priority list earlier in the memo for guidance
- Børre and Saara to discuss what are perhaps two different approaches
Sjur
- Lule Sámi twol problems, have a look at the sets definition
- risten.no bugs and fixes
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- For the board meeting:
- check the memo from the last meeting
- project planning with Trond
- Work on the name project with Trond and Maaren
- Prepare for a Lule Sámi meeting with Árran
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- prepare for the Guovdageaidnu meeting:
- name lexicon
- three-part compounds
Thomas
- Post a summary on the Lule Sámi issue to the news group
- work on Lule Sami compounding and derivation
- Look at Linguistic bugs with Trond
- Prepare for a Lule Sámi meeting with Árran
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
- Cgi-script for uploading documents to corpus base
Trond
- Work on the bug list (11 open).
- Work on the name agreement with "Norge digitalt" with Thomas
- Get the new version of the New Testament
- project planning with Sjur
- Work on the name project with Maaren and Sjur
- Prepare for a Lule Sámi meeting with Árran
- Work on the numerals project with Maaren
- Prepare for three-part compounds meeting in Guovdageaidnu
- Contact the University lawyers for comments on the contract
10. Next meeting, closing
10.10.2005 10:00
Closed at 11:20