Meeting_2005-11-21
Meeting setup
- Date: 21.11.2005
- Time: 10.00 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- The Árran journey
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:12.
Present: Børre, Saara, Sjur, Thomas, Trond
Absent: Maaren, Tomi
Main secretary: Sjur
Agenda accepted with Árran as an additional point.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Done some, will have to automate it.
- Document the corpus directory structure
- Done some.
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
- Not done
- mark-up names
- Not done
- divvun.no and giellatekno.uit.no
- Binary files download area
- Make the conversion to static site, using our own script.
- Not done
- hyphenation in corpus docs
- Not done
- meet with Anders Kintel in Árran
- Done
- corpus xsl files under version control
- Not done
- Other
- Doctored Maaren's computer on Tuesday. See, emacs and XXE should work as
Maaren
- shall work with Sámi place names only
- update the last issue in the North Sámi normativity issues document
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- xml template for namelex format in gt/common/src/proper-nouns.xml
- Convert the name lexicon from present format to xml
- waiting for the format
- document corpus infrastructure, your own parts
- done
- Look at the hyphenation issue
- not done
- Update the corpus.dtd
- done
- corpus xsl files under version control
- not done
- make preprocess and lookup2cg faster
- work in progress
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- No time last week
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- still no working link
- project planning with Trond, continued
- also look at the development processes - specification and testing
- nothing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- nothing
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- nothing
- write a background document on the corpus contracts
- not yet
- discuss kvensk project support with Trond
- proper name integration with risten.no
- discuss risten.no work with Tomi
- some further discussions done
- write public tender documents
- updates to the deliverables doc
- buy:
- new computer (project server)?
- hyphenation in corpus docs
- Børre and I discussed this topic in the car, regarding some of the texts we
- meet with Anders Kintel in Árran
- Done, as well as with Bård Eriksen from a publisher housed in Árran.
- Other:
- ordered AppleCare extended warranty to all Divvun computers.
- presentation of the Divvun project at the Árran conference
Thomas
- work on North Sámi compounding and derivation
- worked on compounding and still working
- Look at Linguistic bugs with Trond
- nothing done this week
- Continue meeting with Sjur and Trond to work on the definition of G1, G2, G3
- not met this week, written suggestion to the last problem in Unison newsgroup
- update the Lule Sámi normativity issues document
- done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- discuss xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- discuss risten.no work with Sjur
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- add automatic language detection to the corpus processing
- corpus processing problem (convert2xml.pl at line 91)
Trond
- Send the contract to the university lawyer
- Done, and discussed with her. Next step is to integrate the (minor) changes
- Look into the document hyphenation issue
- Not done.
- Look at the three-part compound issue
- Had a look at Thomas' rule set, that's all.
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- Had a long look at them. Except the notoriously vexing #77, all of them are
- project planning with Sjur, continued
- Bought the program, at least, but at that point Sjur was off to Drag.
- also look at the development processes - specification and testing
- The name project
- Work on the name project, mark up names (exactly 100 names left)
- Most work on this issue done by others this week
- Extract complex names from version 1.126 and save them as a separate file in
- Checked in this moment...
- discuss kvensk project support with Sjur
- Sporadically mentioned.
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Discussed a bit with Thomas, otherwise awaiting Thomas' Lule Sámi cleanup.
- Worked mostly on disamb issues, including corpus issues.
3. Árran trip
The fifth Sámi Conference
Killer Whale Safari
A wonderful (put in your favourite travel noun here)!
Meeting with Anders Kintel
He is using Filemaker Pro, with two fields in his database. Sámi word in one
We have asked for the first field only, and we will put it in the corpus
Meeting with Bård Eriksen, publisher from Báhko
Very positive, Børre will return to him around the middle of December.
Presentation
Sjur held a 15-20 min presentation of the Divvun project, and a short
4. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- The directory structure is now settled (as of last meeting), and should be
For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
- For the users/linguists: Which corpora exist, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
Divvun.no down again
Tomcat runs out of memory intermittently. Børre will look into changing
Update: Only one small change needed in our own script. Binary download section
5. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Børre has gathered files from the Sámediggi
- http://troms.kulturnett.no/bibliotek/samisk/samisk_materiale.htm
- Sámi legal text on the net
- We need all these texts
- We need to survey the site in the future
- We need the Norwegian versions as well
- Lule Sámi: see the
Contracts
Update: All SD versions now synchronised with the templates. Trond met with the
6. Corpus infrastructure
Updated task list:
- Include the xsl files under version control (Børre, Tomi, Saara)
- Incorporate language detection as part of the corpus processing (Tomi)
- we need a way to deal with hyphenated documents (documents with (manually) inserted
- Discuss details in the newsgroup
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
- This is true for examples like "eala-<CR>hus", they
- In cases of truncated compounds like "ealahus-<CR> ja ...", we want the
- There are sporadically text books with explicit hyphenation points, like:
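The two line-end cases named in the list above can be sketched as follows. This is only an illustration in Python, not the project's corpus scripts: the function name dehyphenate and the CONJUNCTIONS set are hypothetical, and only "ja" is taken from the example in these minutes. Text books with explicit hyphenation points and the parser robustness tests are not covered.

```python
import re

# Hypothetical set of words that mark a truncated compound when they follow
# a line-final hyphen. Only "ja" comes from the minutes; extend as needed.
CONJUNCTIONS = {"ja"}

# A word, a hyphen at the line end, optional indentation on the next line,
# and (as a lookahead only) the first word of that next line.
_LINE_END_HYPHEN = re.compile(r"(\w+)-\n[ \t]*(?=(\w+))")


def dehyphenate(text: str) -> str:
    """Join words split at line ends, but keep truncated compounds."""

    def repl(match: re.Match) -> str:
        first, following = match.group(1), match.group(2)
        if following.lower() in CONJUNCTIONS:
            # Truncated compound ("ealahus-" + "ja ..."): keep the hyphen
            # and turn the line break into a single space.
            return first + "- "
        # Ordinary line-end hyphenation ("eala-" + "hus"): drop the hyphen.
        return first

    return _LINE_END_HYPHEN.sub(repl, text)
```

Because the following word is only inspected through a lookahead, a word that is itself split again on the next line is still handled in the same pass.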
7. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- name markup
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
I updated the file gt/common/src/proper-nouns.xml with different formats for printing
When these two tasks are done (at some point in the future), the conversion will
Status quo on the two lines of work:
The mark up of the remaining 400 entries until conversion starts (People
31 BERN 19 LONDON 16 NIILLAS 15 MARJA 11 ACCRA 4 HEANDARAT 3 ANAR 1 ALEUHTAT
The technical issues are specified in earlier memos. Conducted by:
A very short example is found at common/src/proper-nouns.xml.
Complex names
Task list for this issue:
- find any unique second-parts (B-parts of names that do not exist in
- remove these B-parts from the ordinary name file (Tomi)
- the resulting file format should be identical to our present prop-name
- make sure xml2lexc can handle complex names in ways compatible with our
The file proper-complex.xml has been added to gt/common/src.
The details of the new XML format need to be further discussed in the newsgroup
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is working on it
- the exact rules for when shortening happens should be documented (Maaren
- descriptive facts from our corpus (Trond, Thomas)
- linguistic analysis/discussion to continue in the newsgroup
- number project still open
- diphthong simplification/G3 issue should be carried over from Lule Sámi
Lule Sámi
Sjur, Thomas and Trond will continue working on the Lule Sámi issues.
Tasks:
- update the normativity issues document:
- Px issue
- G3 open issues (S2; Sx = Spiik, consonant series)
- Great progress has been made on the G3 issue, just some minor points remain.
- Thomas, Trond and Sjur meet later this week to solve the rest
Numerals
The issue awaits closure of the propernames project, and is postponed to next week.
8. Speller infrastructure
Nothing this week either.
9. Other
Technical issues
- The Mac OS / perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Trond has filed a bug report on this (#211), and discussed with Thor-Øivind, there
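As background to the error message quoted above: messages of this kind typically mean that a byte which cannot stand alone in UTF-8 was read as UTF-8. The byte 0xC4 is the lead byte of a two-byte sequence, so it is only valid when followed by a continuation byte. The Python sketch below illustrates that property only; it is not a diagnosis of bug #211, and the helper name decodes_as_utf8 is hypothetical.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "č" is encoded in UTF-8 as the two bytes C4 8D, which is well formed.
complete = "č".encode("utf-8")

# A lone C4 followed by an ASCII space lacks its continuation byte.
broken = b"\xc4 "

# In Latin-1 the same single byte is a complete character, "Ä".
as_latin1 = b"\xc4".decode("latin-1")
```

One common cause of such input is a Latin-1 file being read as if it were UTF-8, so checking the encoding of the input text is a reasonable first step.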
XXE updates
Who has the latest XXE (3.0) and the latest Forrest config?
- Børre - ok
- Trond - ok
- Maaren - ok
- Tomi - no
- Thomas - no
- Saara - ok
- Sjur - ok
- Ilona - ok?
- Linda - no
Børre is updating the ones not yet up to speed.
Video conferencing across firewalls
The problem we've had with the SD firewall persists, and there doesn't seem to
Bug fixing
24 open bugs (and 24 risten.no bugs)
Bugzilla update
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
- infrastructure for proper names in place by end of November, if everything
AppleCare extended warranty
All Divvun computers (PowerBook G4s) have received an extended warranty to the
9. Summary, task list
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
- mark-up names
- divvun.no and giellatekno.uit.no
- Binary files download area
- Make the conversion to static site, using our own script.
- hyphenation in corpus docs
- corpus xsl files under version control
- register AppleCare
Maaren
- shall work with Sámi place names only
- update the last issue in the North Sámi normativity issues document
- register AppleCare
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Convert the name lexicon from present format to xml
- document corpus infrastructure, your own parts
- Look at the hyphenation issue
- Update the corpus.dtd
- corpus xsl files under version control
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support with Trond
- proper name integration with risten.no
- discuss risten.no work with Tomi
- write public tender documents
- hyphenation in corpus docs
- buy:
- new computer (project server)?
- register AppleCare
Thomas
- work on North Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Continue meeting with Sjur and Trond to work on the definition of G1, G2, G3
- update the Lule Sámi normativity issues document about incorporation of loan words
- register AppleCare
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- discuss xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- discuss risten.no work with Sjur
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- add automatic language detection to the corpus processing
- register AppleCare
Trond
- update the contracts with changes from the university lawyer
- Look into the document hyphenation issue
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- The name project
- Work on the name project, mark up names (100 names left)
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
10. Next meeting, closing
21.11.2005 09:30
Closed at 12:31