Meeting_2005-11-14
Meeting setup
- Date: 14.11.2005
- Time: 10:00 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:08.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat about texts
- Not done
- Gather public texts
- Some gathered
- Continue converting text from input format to our xml
- convert2xml.pl doesn't work
- Document the corpus directory structure
- Done to some extent
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- Haven't heard anything
- install new XXE and the new XXE Forrest config for all (or check that it is
- Not done
- mark-up names
- Not done
- move existing corpus docs from gt/ to new corpus repository
- Done
- divvun.no and giellatekno.uit.no
- Binary files download area
- Not done
- Moving to static site, using forrestbot or something else.
- Investigated, will continue with internal script, converting to site.
Maaren
- shall work with Sámi place names only
- done a little bit
- update the last issue in the North Sámi normativity issues document
- done
Saara
- Look at the corpus infrastructure issue
- scripts for transferring the corpus metadata to the mysql database
- start looking at conversion of the name lexicon from present format to xml
- namelex2xml.pl is ready
- document corpus infrastructure, your own parts
- the scripts are documented, some updates needed
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- continued, still some issues open.
- risten.no bugs and fixes
- done some work on the new eXist version
- discuss risten.no work with Tomi
- Meeting with Risten and Tomi, some work on eXist
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- URL received, but it was only internal, and when I tried it, the license had
- project planning with Trond, continued
- also look at the development processes - specification and testing
- nothing done
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- nothing
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- nothing
- Discuss the contract issue with Kimmo, return the new version to the lawyer
- had a meeting, updated versions checked with Trond and sent to lawyer
- Follow up on meeting with Anders Kintel
- Meeting confirmed 17.11., Arran, afternoon.
- discuss kvensk project support with Trond
- not with Trond, but with Risten and Tomi, as part of the risten.no updates
- write public tender documents
- nothing
- buy:
- new computer (project server)?
Thomas
- work on North Sámi compounding and derivation
- done some work on three-part compounding
- Look at Linguistic bugs with Trond
- worked with bug 193
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- met and made great progress, G2>G3 issue left
- update the Lule Sámi normativity issues document
- not done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- Specification for new catxml in C++
- this includes also placing the source and binary
- Not done
- discuss risten.no work with Sjur
- We had a meeting
- discuss xml-processing with Saara
- Not done
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Didn't we discuss this already?
- start looking at conversion of the name lexicon from present format to xml
- Not done
- Look into synchronisation of proper names with risten.no
- Done some
- Other
- Have been looking into internal structure of risten.no
Trond
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- Not worked on other bugs than 193.
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Not done.
- Work on the name project, mark up names
- Very much done, approximately 500 names left. There will be need for revision
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas
- Made substantial progress here.
- Also worked on disambiguation with Linda
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- The directory structure is now settled (as of last meeting), and should be
For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
- For the users/linguists: Which corpora exist, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
Divvun.no down again
Tomcat runs out of memory from time to time. Børre will look into changing
Update: Only one small change needed in our own script. Binary download section
4. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Børre has gathered files from the Sámediggi
Contracts
Sjur had a meeting with Kimmo Koskenniemi, resolving all the issues that he
5. Corpus infrastructure
- Problems with convert2xml.pl?
- Barfs up at line 91
- Add issue to Bugzilla (always when you find problems!)
Quoting from the convert2xml.pl file:
26 my $xsl_file = '';
27 my $dir = '';
28 my $log_dir = '';
90 my $log_file = $log_dir . "/" . $file . ".log";
91 open STDERR, '>>', "$log_file" or die "Can't redirect STDERR: $!";
Problem analysed and will be corrected (Tomi)
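The quoted lines suggest one plausible cause. This is an assumption, as the memo does not record Tomi's actual analysis: with $log_dir left as the empty string, line 90 builds a path such as /name.log in the file-system root, which an ordinary user normally cannot open for appending. A minimal Python sketch of the same path construction with a guard follows. The function names and the fallback to the current directory are hypothetical, not taken from convert2xml.pl.

```python
import os


def naive_log_path(log_dir: str, file_name: str) -> str:
    """Mirror the plain concatenation used on line 90 of the quoted script."""
    return log_dir + "/" + file_name + ".log"


def build_log_path(log_dir: str, file_name: str) -> str:
    """Build the log path, never letting an empty log_dir point at '/'."""
    # Assumption: an unset log directory should fall back to the current
    # directory instead of silently targeting the file-system root.
    base = log_dir if log_dir else os.curdir
    # Only the file's base name is used, so input paths with directories
    # do not create unexpected sub-paths under the log directory.
    return os.path.join(base, os.path.basename(file_name) + ".log")
```

With an empty log directory the naive version yields a root-level path, while the guarded version stays inside the working directory.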
Updated task list:
- Make a system for file and directory permission (today: we all belong to the
- Done
- Include the xsl files under version control (Børre, Tomi, Saara)
- Incorporate language detection as part of the corpus processing (Tomi)
- we need a way to deal with hyphenated documents (documents with (manually) inserted
- Discuss details in the newsgroup
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
Corpus dtd issue
To summarize (taken from Saara's newsgroup posting of Fri, 11 Nov 2005):
- Change the person name to firstname and lastname. Agreed/decided.
- Add an element collection:
<!ELEMENT collection (#PCDATA) >
- Agreed.
- Leave the element translator as it is now (easier to read that way and
- Add an element concerning the completeness of the metadata. I guess the
<!ELEMENT metadata (complete|incomplete)>
<!ELEMENT complete EMPTY>
<!ELEMENT incomplete EMPTY>
- Element for the word count (this is used when counting e.g. the frequencies
<!ELEMENT wordcount (#PCDATA)>
- Add an element for the license type. See the bug:
- license type bug
Saara's suggestion:
<!ELEMENT availability (free|license)>
<!ELEMENT free EMPTY>
<!ELEMENT license EMPTY>
<!ATTLIST license type (type1|type2|..) #REQUIRED >
Saara will update the dtd.
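To make the agreed additions concrete, here is a minimal Python sketch that builds a metadata fragment using the new elements (collection, metadata, wordcount, availability). The enclosing element name "header", the function name and the sample values are hypothetical; the actual position of these elements in corpus.dtd is for Saara's update to settle.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_header(collection: str, complete: bool, wordcount: int,
                 license_type: Optional[str] = None) -> ET.Element:
    """Build a metadata fragment with the elements agreed at the meeting."""
    header = ET.Element("header")  # hypothetical container element
    ET.SubElement(header, "collection").text = collection
    # <metadata> holds exactly one empty child: <complete/> or <incomplete/>.
    metadata = ET.SubElement(header, "metadata")
    ET.SubElement(metadata, "complete" if complete else "incomplete")
    ET.SubElement(header, "wordcount").text = str(wordcount)
    # <availability> holds either <free/> or <license type="..."/>.
    availability = ET.SubElement(header, "availability")
    if license_type is None:
        ET.SubElement(availability, "free")
    else:
        ET.SubElement(availability, "license", {"type": license_type})
    return header


if __name__ == "__main__":
    fragment = build_header("sample collection", False, 1234, "type1")
    print(ET.tostring(fragment, encoding="unicode"))
```

Modelling complete/incomplete and free/license as empty child elements follows the DTD snippets above; the license type is an attribute because the ATTLIST declares it as #REQUIRED on license.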
6. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- name markup
- testing of conversion
When these two tasks are done (at some point in the future), the conversion will
Status quo on the two lines of work:
The mark up of the remaining 400 entries until conversion starts (People
323 NYSTØ
32 BERN
20 LONDON
18 MARJA
17 NIILLAS
12 ACCRA
5 HEANDARAT
4 ANAR
2 ALEUHTAT
The technical issues (specified in earlier memos: Conducted by:
A very short example is found at common/src/proper-nouns.xml.
Complex names
In the present lexicon, complex names are treated as a class of first parts (see
With the new XML lexicon format, the complex names should be restored. The
Also, integration with risten.no and the kvensk project (and through that, also
There are ~100 first parts of complex names, the name lexicon contained 739
First-part tags are now listed separately:
LEXICON ProperNounFirstPart
El% Baradej BERN-sur ;
FirstTag ;
Badje FirstTag ;
Bajimus FirstTag ;
Bajit FirstTag ;
Bassi FirstTag ;
The format we left a year ago looked like this:
Aleksander% I%:a% suo0lu:Aleksander% I%:a% suollu SUOLU ;
Amerihká% Ovttastuvvan% Stáhtat:Amerihká% Ovttastuvvan% Stáhta ALEUHTAT ;
Amery% jiekn1arav0da:Amery% jiekn1arav'da DEATNU ;
Amundsena-Scotta% stas1uvdna DEATNU ;
Austrália% Álppat:Austrália% Ál'pa ALEUHTAT ;
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Badje% Stuorjoh0ka:Badje% Stuorjoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
Bajimus% Molles1jáv0ri:Bajimus% Molles1jáv'ri DEATNU ;
They were broken up with the following argumentation:
revision 1.127
date: 2004/10/14 09:38:17; author: trond; state: Exp; lines: +4653 -5153
This is the great % removal revision. The background was that our pre-composed multiword names, such as Davimus Borsejoh0ka, etc. did not work. They passed the preprocessor only in the nominative, and not in other cases. In the worst case, their parts were not recognised as such, and the result would be a missing analysis. Now, the first part has been assigned to a separate lexicon, ProperNounFirstPart, that get the tag +N+Prop+Attr only. This lexicon contains entries like Davimus, Guhkes, Helse, magnehtalas1, and other first parts of complex names. These should be disambiguated in sme-dis.rle, leaving the tag only when there is a N Prop following it. As a result of this, the file bin/attr.txt is drastically reduced.
Task list for this issue:
- restore complex names from old cvs; c/sh-ould be stored in a separate file
- cvs up -r 1.126 sme/src/propernoun-sme-lex.txt (Trond)
- grep % propernoun-sme-lex.txt | iconv -f L1 -t UTF-8 | less (Trond)
- find eventual unique second-parts (B-parts of names that do not exist in
- remove these B-parts from the ordinary name file (Tomi)
- the resulting file format should be identical to our present prop-name
- make sure xml2lexc can handle complex names in ways compatible with our
Example of how the old lexicon can be used to identify complex name last parts
$ grep Riebejo gt/sme/src/propernoun-sme-lex.txt
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;
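The grep example can be generalised into a small script that lists the last parts of complex names that have no entry of their own. The following is a minimal Python sketch under stated assumptions: entries follow the old format shown above ("% " is an escaped space, "%:" an escaped colon, and the last token before ";" is the continuation lexicon), and all function names are hypothetical.

```python
import re


def upper_side(entry: str) -> str:
    """Return the upper (lemma) side of one lexc-style entry line."""
    body = entry.strip().rstrip(";").strip()
    # Drop the continuation lexicon, i.e. the token after the last space.
    body = body.rsplit(None, 1)[0]
    # Split on the first colon that is not escaped with '%'.
    return re.split(r"(?<!%):", body, maxsplit=1)[0]


def split_entries(lines):
    """Separate simple names from the last parts of complex names."""
    simple, last_parts = set(), {}
    for line in lines:
        line = line.strip()
        # Skip comments, LEXICON headers and anything else without ';'.
        if line.startswith("!") or not line.endswith(";"):
            continue
        upper = upper_side(line)
        parts = upper.split("% ")  # '% ' is an escaped space inside a name
        if len(parts) > 1:
            last_parts.setdefault(parts[-1], []).append(upper)
        else:
            simple.add(upper)
    return simple, last_parts


def unique_last_parts(lines):
    """Last parts of complex names that have no simple entry of their own."""
    simple, last_parts = split_entries(lines)
    return sorted(part for part in last_parts if part not in simple)
```

For real use the lines would be read from revision 1.126 of propernoun-sme-lex.txt; the iconv call in the task list suggests that file is Latin-1, so opening it with encoding="latin-1" would be a reasonable starting point.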
The details of the new XML format need to be further discussed in the newsgroup
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is working on it
- the exact rules for when shortening happens should be documented (Maaren
- descriptive facts from our corpus (Trond, Thomas)
- linguistic analysis/discussion to continue in the newsgroup
- number project still open
- diphthong simplification/G3 issue should be carried over from Lule Sámi
Lule Sámi
Sjur, Thomas and Trond will continue with the Lule Sámi issues.
Tasks:
- update the normativity issues document:
- Px issue
- G3 open issues (S2, some S3, S5, S6 and S7; Sx = Spiik, consonant series)
- Great progress has been made on the G3 issue, just some minor points remain.
Numerals
The issue awaits closure of the propernames project, and is postponed to next week.
Árran meeting
Børre, Anne Britt and Sjur go to Árran on Wednesday, for meetings on
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
- One way to "resolve" this is to redirect the error messages to /dev/null:
... | preprocess 2> /dev/null | lookup ...
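Redirecting to /dev/null only hides the messages. They are consistent with input that contains bytes which are not valid UTF-8 (for instance Latin-1 text), although that diagnosis is an assumption and not something established at the meeting. A minimal Python sketch for locating such lines before they reach preprocess; the function name is hypothetical:

```python
def find_non_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # Keep the raw bytes so the offending text can be inspected
            # and, if appropriate, re-encoded from its real encoding.
            bad.append((number, raw))
    return bad
```

Running this over the bytes of an input file gives the line numbers to inspect, which is more informative than discarding the warnings.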
XXE updates
Who has the latest XXE (3.0) and the latest forrest config?
- Børre - ok
- Trond - ok
- Maaren - XXE, but no config
- Tomi - no
- Thomas - no
- Saara - ok
- Sjur - ok
- Ilona - ok?
- Linda - no
Børre is updating the ones not yet up to speed.
Video conferencing across firewalls
The problem we've had with the SD firewall persists, and there doesn't seem to
Bug fixing
24 open bugs (and 23 risten.no bugs)
Bugzilla update
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
- infrastructure for proper names in place by end of November, if everything
9. Summary, task list
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
- mark-up names
- divvun.no and giellatekno.uit.no
- Binary files download area
- Make the conversion to static site, using our own script.
- hyphenation in corpus docs
- meet with Anders Kintel in Árran
- corpus xsl files under version control
Maaren
- shall work with Sámi place names only
- update the last issue in the North Sámi normativity issues document
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Convert the name lexicon from present format to xml
- document corpus infrastructure, your own parts
- Look at the hyphenation issue
- Update the corpus.dtd
- corpus xsl files under version control
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support with Trond
- proper name integration with risten.no
- discuss risten.no work with Tomi
- write public tender documents
- buy:
- new computer (project server)?
- hyphenation in corpus docs
- meet with Anders Kintel in Árran
Thomas
- work on North Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- update the Lule Sámi normativity issues document
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- discuss xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- discuss risten.no work with Sjur
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- add automatic language detection to the corpus processing
Trond
- Send the contract to the university lawyer
- Look into the document hyphenation issue
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- The name project
- Work on the name project, mark up names (400 names left)
- Extract complex names from version 1.126 and save them as a separate file in common.
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
10. Next meeting, closing
21.11.2005 10:00
Closed at 11:47