Meeting_2005-11-14

Meeting setup

Date: 14.11.2005
Time: 10.00 Norw. time
Place: Wherever we are : -)
Tools: iChat, SubEthaEdit, phone

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
Speller infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10: 08.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Contact oahpahusossodat about texts
- Not done
Gather public texts
- Some gathered
Continue converting text from input format to our xml
- convert2xml.pl doesn't work
Document the corpus directory structure
- Done to some extent
Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
  - Haven't heard anything
install new XXE and the new XXE Forrest config for all (or check that it is installed and working)
- Not done
mark-up names
- Not done
move existing corpus docs from gt/ to new corpus repository
- Done
divvun.no and giellatekno.uit.no
- Binary files download area
  - Not done
- Moving to static site, using forrestbot or something else.
  - Investigated, will continue with internal script, converting to site.

Maaren

shall work with Sámi place names only
- done a little bit
update the last issue in the North Sámi normativity issues document
- done

Saara

Look at the corpus infrastructure issue
- scripts for transferring the corpus metadata to the mysql database are ready (but waiting for changes in corpus.dtd and test documents)
start looking at conversion of the name lexicon from present format to xml
- namelex2xml.pl is ready
document corpus infrastructure, your own parts
- the scripts are documented, some updates needed

Sjur

Lule Sámi twol problems, look again at the sets definition with Thomas and Trond, Wednesday
- continued, still some issues open.
risten.no bugs and fixes
- done some work on the new eXist version
discuss risten.no work with Tomi
- Meeting with Risten and Tomi, some work on eXist
follow up on voice group-chat not working to Sámediggi
- Test Marratech
  - URL received, but it was only internal, and when I tried it, the license had expired: -( They seem to have installed a new license now, but the page does not work.
project planning with Trond, continued
- also look at the development processes - specification and testing
  - nothing done
Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
  - nothing
Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
write a background document on the corpus contracts
- nothing
Discuss the contract issue with Kimmo, return the new version to the lawyer
- had a meeting, updated versions checked with Trond and sent to lawyer
Follow up on meeting with Anders Kintel
- Meeting confirmed 17.11., Arran, afternoon.
discuss kvensk project support with Trond
- not with Trond, but with Risten and Tomi, as part of the risten.no updates
write public tender documents
- nothing
buy:
- new computer (project server)?

Thomas

work on North sámi compounding and derivation
- done some work on three-part compounding
Look at Linguistic bugs with Trond
- worked with bug 193
Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- met and made great progress, G2>G3 issue left
update the lule sámi normativity issues document
- not done

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
  - Not done
three-part compounding
- Not done
corpus infrastructure: dtd location (both public and internal)
- Not done
Specification for new catxml in C++
- this includes also placing the source and binary
  - Not done
discuss risten.no work with Sjur
- We had meeting
discuss about xml-processing with Saara
- Not done
look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Didn't we discussed about this?
start looking at conversion of the name lexicon from present format to xml
- Not done
Look into synchronisation of proper names with risten.no
- Done some
Other
- Have been looking into internal structure of risten.no

Trond

Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed.
- Not worked on other bugs than 193.
project planning with Sjur, continued
- also look at the development processes - specification and testing
- Not done.
Work on the name project, mark up names
- Very much done, approximately 500 names left. There will be need for revision of the marking that has been undertaken, but all in all the name lexicon should now be ready for translation to xml format.
discuss kvensk project support with Sjur
Work on the G3 bug issue with Sjur and Thomas
- Done substantial progress here.
Also worked on disambiguation with Linda

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general ( Børre, Tomi, Trond, Saara):

The directory structure is now settled (as of last meeting), and should be documented.

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

For the users/linguists: What corpus are found, how do I use them (this info is now scattered) (Part of the HOTWO USE is documented in the catxml docu The what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated)
For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)

test:

add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
as always: document what you're doing: -) (all)

Divvun.no down again

Tomcat is running out of memory in between. Børre will look into changing to Forrest generating static html pages (forrest site), and serve those off of the standard Apache server. He will also look at utilizing Forrestbot as the tool to update the site, instead of our homegrown script.

Update: Only one small change needed in our own script. Binary download section should be included.

4. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Børre has gathered files from the Sámediggi Will go on gathering files from Odin.

Contracts

Sjur had a meeting with Kimmo Koskenniemi, resolving all the issues that he had with it. Trond and Sjur has also discussed these, changed them a little bit. The contracts are ready to be sent to the lawyer (who sadly is ill).

5. Corpus infrastructure

Problems with convert2xml.pl?
- Barfs up at line 91
- Add issue to Bugzilla (always when you find problems!)

Quoting from the convert2xml.pl file:
    26  my $xsl_file = ''; 
    27  my $dir = '';
    28  my $log_dir = ''; 
    90  my $log_file = $log_dir . "/" . $file . ".log";
    91  open STDERR, '>>', "$log_file" or die "Can't redirect STDERR: $!";

Problem analysed and will be corrected (Tomi)

Updated task list:

Make a system for file and directory permission (today: we all belong to the cvs group), to only allow people with root user privileges write access to the corpus repository, at least regarding original files (Børre)
1. Done
Include the xsl files under version control (Børre, Tomi, Saara)
Incorporate language detection as part of the corpus processing (Tomi)
we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. ( Tomi, Børre, (discussion in the newsgroup: ) Sjur, Trond, Saara)
1. Discuss details in the newsgroup
2. in normal cases hyphenation points should be removed
3. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained

Corpus dtd issue

To summarize (taken from Saara's newsgroup posting of Fri, 11 Nov 2005:

Change the person name to firstname and lastname. Agreed/decided.
Add an element collection:

<!ELEMENT collection (#PCDATA) >

Agreed.
Leave the element translator as it is now (easier to read that way and avoid the changes).
Add element conserning the completeness of the metadata. I guess the uncompleteness here means that some information that should be added is missing. (This info would be for our internal use.)

<!ELEMENT metadata (complete|incomplete)>
<!ELEMENT complete EMPTY>
<!ELEMENT incomplete EMPTY>

Element for the word count (this is used when counting e.g. the frequencies of some form etc.). Filling this slot should be part of the standard xsl, then.

<!ELEMENT wordcount #PCDATA>

Add an element for the license type. See the bug:
license type bug

Saara's suggestion:

<!ELEMENT availability (free|license)
<!ELEMENT free EMPTY>
<!ELEMENT license EMPTY>
<!ATTLIST license
	type (type1|type2|..) #REQUIRED
 >

Saara will update the dtd.

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

name markup
testing of conversion

When these two tasks are done (at some point in the future), the conversion will be done.

Status quo on the two lines of work:

The mark up of the remaining 400 entries until conversion starts (People allocated look at the rest: Maaren, Ilona, Trond, Børre). This week's status quo is as follows (some 400 names not assigned):

 323 NYSTØ
  32 BERN
  20 LONDON
  18 MARJA
  17 NIILLAS
  12 ACCRA
   5 HEANDARAT
   4 ANAR
   2 ALEUHTAT

The technical issues (specified in earlier memos: Conducted by: Tomi, Saara, Sjur. Sjur and Tomi will tomorrow Tuesday report back on a plan for using risten.no as editor for our name lexicon

A very short example is found at common/src/proper-nouns.xml. Saara has made a conversion script which is ready to use. More discussions on the layout of the resulting xml file is needed.

Complex names

In the present lexicon, complex names are treated as a class of first parts (see below), and the last part is stored as a regular simplex name.

With the new XML lexicon format, the complex names should be restored. The present lexicon format can easily be reconstructed (by splitting at the space character), and the list of complex names can also be used for other purposes in the future.

Also, integration with risten.no and the kvensk project (and through that, also Kartverket in one way or another), presupposes that we can store the complete and "true" name.

There are ~100 first parts of complex names, the name lexicon contained 739 such complex names before they were broken up in 2004/10/14.

First-part tags are now listed separately:

LEXICON ProperNounFirstPart
El% Baradej BERN-sur ;
 FirstTag ;
Badje FirstTag ;
Bajimus FirstTag ;
Bajit FirstTag ;
Bassi FirstTag ;

The format we left a year ago looked like this:

Aleksander% I%:a% suo0lu:Aleksander% I%:a% suollu SUOLU ;
Amerihká% Ovttastuvvan% Stáhtat:Amerihká% Ovttastuvvan% Stáhta ALEUHTAT ;
Amery% jiekn1arav0da:Amery% jiekn1arav'da DEATNU ;
Amundsena-Scotta% stas1uvdna DEATNU ;
Austrália% Álppat:Austrália% Ál'pa ALEUHTAT ;
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Badje% Stuorjoh0ka:Badje% Stuorjoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
Bajimus% Molles1jáv0ri:Bajimus% Molles1jáv'ri DEATNU ;

They were broken up with the following argumentation:

revision 1.127
date: 2004/10/14 09:38:17;  author: trond;  state: Exp;  lines: +4653 -5153
This is the great % removal revision. The background was that our
pre-composed multiword names, such as Davimus Borsejoh0ka, etc. did
not work. They passed the preprocessor only in the nominative, and not
in other cases. In the worst case, their parts were not recognised as
such, and the result would be a missing analysis. Now, the first part
has been assigned to a separate lexicon, ProperNounFirstPart, that get
the tag +N+Prop+Attr only. This lexicon contains entries like Davimus,
Guhkes, Helse, magnehtalas1, and other first parts of complex
names. These should be disambiguated in sme-dis.rle, leaving the tag
only when there is a N Prop following it. As a result of this, the
file bin/attr.txt is drastically reduced.

Task list for this issue:

restore complex names from old cvs; c/sh-ould be stored in a separate file
- cvs up -r 1.126 sme/src/propernoun-sme-lex.txt (Trond)
- grep % propernoun-sme-lex.txt

iconv -f L1 -t UTF-8

less (Trond)

find eventual unique second-parts (B-parts of names that do not exist in isolation)
remove these B-parts from the ordinary name file (Tomi)
the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names
make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now)

Example of how the old lexicon can be used to identify complex name last parts that are also used as simple names:

$grep Riebejo gt/sme/src/propernoun-sme-lex.txt 
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;

The details of the new XML format needs to be further discussed in the newsgroup and integrated with the rest of the XML work and discussion.

North Sámi

three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is working on it
- the exact rules for when shortening happens should be documented (Maaren will send the normativity decisions to Thomas)
- descriptive facts from our corpus (Trond, Thomas)
- linguistic analysis/discussion to continue in the newsgroup
number project still open

diphthong simplification/G3 issue should be carried over from Lule Sámi

Lule Sámi

Sjur, Thomas and Trond will cont. Lule Sámi issues.

Tasks:

update the normativity issues document:
- Px issue
G3 open issues (S2, some S3, S5, S6 and S7; Sx = Spiik, consonant series)
- Great progress has been made on the G3 issue, just some minor points remain.

Numerals

The issue awaits closure of the propernames project, and is postponed to next week.

Árran meeting

Børre, Anne Britt and Sjur go to Árran on Wednesday, for meetings on Thursday. Main meeting is with Anders Kintel about using his Lule Sámi dictionary in our projects.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

The mac os / perl bug (at least Trond and Sjur has it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
  - Another example of the same bug:
  - : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33.
  - One way to "resolve" this is to redirect the error messages to /dev/null:

... | preprocess 2> /dev/null | lookup ...

XXE updates

Who has the latest XXE (3.0) and the latest forrest config?

Børre - ok
Trond - ok
Maaren - XXE, but no config
Tomi - no
Thomas - no
Saara - ok
Sjur - ok
Ilona - ok?
Linda - no

Børre is updating the ones not yet up to speed.

Video conferencing across firewalls

The problem we've had with the SD firewall persists, and there doesn't seem to be any resources available to help us. Geir Kaaby instead suggested we look at the Marratech package, and try it out. So please download the MacOS X client (or get it from me), and I'll send you the URL to the meeting room as soon as I get it.

Bug fixing

24 open bugs (and 23 risten.no bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

risten.no

Organisation: could Tomi be used, in exchange for more linguistic work by (old) GIO members? Yes, it is ok, but how much still needs to be evaluated
it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all proper names integrated with risten.no, df above
- needs further development of risten.no to allow for multiple XML bases to be presented and maintained in parallel. This is to be further worked on by Tomi and Sjur
infrastructure for proper names in place by end of November, if everything goes well (or according to a tentative plan)

9. Summary, task list