Meeting_2005-11-14

Meeting setup

  • Date: 14.11.2005
  • Time: 10.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit, phone

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10: 08.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • Contact oahpahusossodat about texts
    • Not done
  • Gather public texts
    • Some gathered
  • Continue converting text from input format to our xml
    • convert2xml.pl doesn't work
  • Document the corpus directory structure
    • Done to some extent
  • Ask Thor-Øivind to move bugzilla to our new webserver
    • ... and update Bugzilla at the same time
      • Haven't heard anything
  • install new XXE and the new XXE Forrest config for all (or check that it is installed and working)
    • Not done
  • mark-up names
    • Not done
  • move existing corpus docs from gt/ to new corpus repository
    • Done
  • divvun.no and giellatekno.uit.no
    • Binary files download area
      • Not done
    • Moving to static site, using forrestbot or something else.
      • Investigated, will continue with internal script, converting to site.

Maaren

  • shall work with Sámi place names only
    • done a little bit
  • update the last issue in the North Sámi normativity issues document
    • done

Saara

  • Look at the corpus infrastructure issue
    • scripts for transferring the corpus metadata to the mysql database are ready (but waiting for changes in corpus.dtd and test documents)
  • start looking at conversion of the name lexicon from present format to xml
    • namelex2xml.pl is ready
  • document corpus infrastructure, your own parts
    • the scripts are documented, some updates needed

Sjur

  • Lule Sámi twol problems, look again at the sets definition with Thomas and Trond, Wednesday
    • continued, still some issues open.
  • risten.no bugs and fixes
    • done some work on the new eXist version
  • discuss risten.no work with Tomi
    • Meeting with Risten and Tomi, some work on eXist
  • follow up on voice group-chat not working to Sámediggi
    • Test Marratech
      • URL received, but it was only internal, and when I tried it, the license had expired: -( They seem to have installed a new license now, but the page does not work.
  • project planning with Trond, continued
    • also look at the development processes - specification and testing
      • nothing done
  • Follow up on place names from Norge Digitalt
    • write an e-mail to or call Bjørn Olav Megard
      • nothing
  • Evaluate SFST as speller (and analyzer) lexicon
    • more thorough analysis than was possible in Guovdageaidnu
  • write a background document on the corpus contracts
    • nothing
  • Discuss the contract issue with Kimmo, return the new version to the lawyer
    • had a meeting, updated versions checked with Trond and sent to lawyer
  • Follow up on meeting with Anders Kintel
    • Meeting confirmed 17.11., Arran, afternoon.
  • discuss kvensk project support with Trond
    • not with Trond, but with Risten and Tomi, as part of the risten.no updates
  • write public tender documents
    • nothing
  • buy:
    • new computer (project server)?

Thomas

  • work on North sámi compounding and derivation
    • done some work on three-part compounding
  • Look at Linguistic bugs with Trond
    • worked with bug 193
  • Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
    • met and made great progress, G2>G3 issue left
  • update the lule sámi normativity issues document
    • not done

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
      • Not done
  • three-part compounding
    • Not done
  • corpus infrastructure: dtd location (both public and internal)
    • Not done
  • Specification for new catxml in C++
    • this includes also placing the source and binary
      • Not done
  • discuss risten.no work with Sjur
    • We had meeting
  • discuss about xml-processing with Saara
    • Not done
  • look into efficient editing of the xml proper name lexicon (tools, modes, etc)
    • Didn't we discussed about this?
  • start looking at conversion of the name lexicon from present format to xml
    • Not done
  • Look into synchronisation of proper names with risten.no
    • Done some
  • Other
    • Have been looking into internal structure of risten.no

Trond

  • Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed.
    • Not worked on other bugs than 193.
  • project planning with Sjur, continued
    • also look at the development processes - specification and testing
    • Not done.
  • Work on the name project, mark up names
    • Very much done, approximately 500 names left. There will be need for revision of the marking that has been undertaken, but all in all the name lexicon should now be ready for translation to xml format.
  • discuss kvensk project support with Sjur
  • Work on the G3 bug issue with Sjur and Thomas
    • Done substantial progress here.
  • Also worked on disambiguation with Linda

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general ( Børre, Tomi, Trond, Saara):

  • The directory structure is now settled (as of last meeting), and should be documented.

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: What corpus are found, how do I use them (this info is now scattered) (Part of the HOTWO USE is documented in the catxml docu The what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated)
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)

test:

  • add/update Aspell documentation (Tomi)
    • Some documentation has been written, but there still is work to be done.
  • as always: document what you're doing: -) (all)

Divvun.no down again

Tomcat is running out of memory in between. Børre will look into changing to Forrest generating static html pages (forrest site), and serve those off of the standard Apache server. He will also look at utilizing Forrestbot as the tool to update the site, instead of our homegrown script.

Update: Only one small change needed in our own script. Binary download section should be included.

4. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Børre has gathered files from the Sámediggi Will go on gathering files from Odin.

Contracts

Sjur had a meeting with Kimmo Koskenniemi, resolving all the issues that he had with it. Trond and Sjur has also discussed these, changed them a little bit. The contracts are ready to be sent to the lawyer (who sadly is ill).

5. Corpus infrastructure

  • Problems with convert2xml.pl?
    • Barfs up at line 91
    • Add issue to Bugzilla (always when you find problems!)
Quoting from the convert2xml.pl file:
    26  my $xsl_file = ''; 
    27  my $dir = '';
    28  my $log_dir = ''; 
    90  my $log_file = $log_dir . "/" . $file . ".log";
    91  open STDERR, '>>', "$log_file" or die "Can't redirect STDERR: $!";

Problem analysed and will be corrected (Tomi)

Updated task list:

  1. Make a system for file and directory permission (today: we all belong to the cvs group), to only allow people with root user privileges write access to the corpus repository, at least regarding original files (Børre)
    1. Done
  2. Include the xsl files under version control (Børre, Tomi, Saara)
  3. Incorporate language detection as part of the corpus processing (Tomi)
  4. we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. ( Tomi, Børre, (discussion in the newsgroup: ) Sjur, Trond, Saara)
    1. Discuss details in the newsgroup
    2. in normal cases hyphenation points should be removed
    3. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained

Corpus dtd issue

To summarize (taken from Saara's newsgroup posting of Fri, 11 Nov 2005:

  • Change the person name to firstname and lastname. Agreed/decided.
  • Add an element collection:
<!ELEMENT collection (#PCDATA) >
  • Agreed.
  • Leave the element translator as it is now (easier to read that way and avoid the changes).
  • Add element conserning the completeness of the metadata. I guess the uncompleteness here means that some information that should be added is missing. (This info would be for our internal use.)
<!ELEMENT metadata (complete|incomplete)>
<!ELEMENT complete EMPTY>
<!ELEMENT incomplete EMPTY>
  • Element for the word count (this is used when counting e.g. the frequencies of some form etc.). Filling this slot should be part of the standard xsl, then.

<!ELEMENT wordcount #PCDATA>

Saara's suggestion:

<!ELEMENT availability (free|license)
<!ELEMENT free EMPTY>
<!ELEMENT license EMPTY>
<!ATTLIST license
	type (type1|type2|..) #REQUIRED
 >

Saara will update the dtd.

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
  2. testing of conversion

When these two tasks are done (at some point in the future), the conversion will be done.

Status quo on the two lines of work:

The mark up of the remaining 400 entries until conversion starts (People allocated look at the rest: Maaren, Ilona, Trond, Børre). This week's status quo is as follows (some 400 names not assigned):

 323 NYSTØ
  32 BERN
  20 LONDON
  18 MARJA
  17 NIILLAS
  12 ACCRA
   5 HEANDARAT
   4 ANAR
   2 ALEUHTAT

The technical issues (specified in earlier memos: Conducted by: Tomi, Saara, Sjur. Sjur and Tomi will tomorrow Tuesday report back on a plan for using risten.no as editor for our name lexicon

A very short example is found at common/src/proper-nouns.xml. Saara has made a conversion script which is ready to use. More discussions on the layout of the resulting xml file is needed.

Complex names

In the present lexicon, complex names are treated as a class of first parts (see below), and the last part is stored as a regular simplex name.

With the new XML lexicon format, the complex names should be restored. The present lexicon format can easily be reconstructed (by splitting at the space character), and the list of complex names can also be used for other purposes in the future.

Also, integration with risten.no and the kvensk project (and through that, also Kartverket in one way or another), presupposes that we can store the complete and "true" name.

There are ~100 first parts of complex names, the name lexicon contained 739 such complex names before they were broken up in 2004/10/14.

First-part tags are now listed separately:

LEXICON ProperNounFirstPart
El% Baradej BERN-sur ;
 FirstTag ;
Badje FirstTag ;
Bajimus FirstTag ;
Bajit FirstTag ;
Bassi FirstTag ;

The format we left a year ago looked like this:

Aleksander% I%:a% suo0lu:Aleksander% I%:a% suollu SUOLU ;
Amerihká% Ovttastuvvan% Stáhtat:Amerihká% Ovttastuvvan% Stáhta ALEUHTAT ;
Amery% jiekn1arav0da:Amery% jiekn1arav'da DEATNU ;
Amundsena-Scotta% stas1uvdna DEATNU ;
Austrália% Álppat:Austrália% Ál'pa ALEUHTAT ;
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Badje% Stuorjoh0ka:Badje% Stuorjoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
Bajimus% Molles1jáv0ri:Bajimus% Molles1jáv'ri DEATNU ;

They were broken up with the following argumentation:

revision 1.127
date: 2004/10/14 09:38:17;  author: trond;  state: Exp;  lines: +4653 -5153
This is the great % removal revision. The background was that our
pre-composed multiword names, such as Davimus Borsejoh0ka, etc. did
not work. They passed the preprocessor only in the nominative, and not
in other cases. In the worst case, their parts were not recognised as
such, and the result would be a missing analysis. Now, the first part
has been assigned to a separate lexicon, ProperNounFirstPart, that get
the tag +N+Prop+Attr only. This lexicon contains entries like Davimus,
Guhkes, Helse, magnehtalas1, and other first parts of complex
names. These should be disambiguated in sme-dis.rle, leaving the tag
only when there is a N Prop following it. As a result of this, the
file bin/attr.txt is drastically reduced.

Task list for this issue:

  • restore complex names from old cvs; c/sh-ould be stored in a separate file
    • cvs up -r 1.126 sme/src/propernoun-sme-lex.txt (Trond)
    • grep % propernoun-sme-lex.txt
iconv -f L1 -t UTF-8 less (Trond)
  • find eventual unique second-parts (B-parts of names that do not exist in isolation)
  • remove these B-parts from the ordinary name file (Tomi)
  • the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names
  • make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now)

Example of how the old lexicon can be used to identify complex name last parts that are also used as simple names:

$grep Riebejo gt/sme/src/propernoun-sme-lex.txt 
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;

The details of the new XML format needs to be further discussed in the newsgroup and integrated with the rest of the XML work and discussion.

North Sámi

  • three-part compounds issue still open
    • look at Lule Sámi, but apply it to second-parts only
    • Thomas is working on it
    • the exact rules for when shortening happens should be documented (Maaren will send the normativity decisions to Thomas)
    • descriptive facts from our corpus (Trond, Thomas)
    • linguistic analysis/discussion to continue in the newsgroup
  • number project still open
  • diphthong simplification/G3 issue should be carried over from Lule Sámi

Lule Sámi

Sjur, Thomas and Trond will cont. Lule Sámi issues.

Tasks:

  • update the normativity issues document:
    • Px issue
  • G3 open issues (S2, some S3, S5, S6 and S7; Sx = Spiik, consonant series)
    • Great progress has been made on the G3 issue, just some minor points remain.

Numerals

The issue awaits closure of the propernames project, and is postponed to next week.

Árran meeting

Børre, Anne Britt and Sjur go to Árran on Wednesday, for meetings on Thursday. Main meeting is with Anders Kintel about using his Lule Sámi dictionary in our projects.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

  • The mac os / perl bug (at least Trond and Sjur has it):
    • utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
      • Another example of the same bug:
      • : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33.
      • One way to "resolve" this is to redirect the error messages to /dev/null:
... | preprocess 2> /dev/null | lookup ...

XXE updates

Who has the latest XXE (3.0) and the latest forrest config?

  • Børre - ok
  • Trond - ok
  • Maaren - XXE, but no config
  • Tomi - no
  • Thomas - no
  • Saara - ok
  • Sjur - ok
  • Ilona - ok?
  • Linda - no

Børre is updating the ones not yet up to speed.

Video conferencing across firewalls

The problem we've had with the SD firewall persists, and there doesn't seem to be any resources available to help us. Geir Kaaby instead suggested we look at the Marratech package, and try it out. So please download the MacOS X client (or get it from me), and I'll send you the URL to the meeting room as soon as I get it.

Bug fixing

24 open bugs (and 23 risten.no bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

risten.no

  • Organisation: could Tomi be used, in exchange for more linguistic work by (old) GIO members? Yes, it is ok, but how much still needs to be evaluated
  • it is ok to integrate "kvensk" placenames with risten.no
    • this should be integrated with the general proper name work - we want all proper names integrated with risten.no, df above
    • needs further development of risten.no to allow for multiple XML bases to be presented and maintained in parallel. This is to be further worked on by Tomi and Sjur
  • infrastructure for proper names in place by end of November, if everything goes well (or according to a tentative plan)

9. Summary, task list

Børre

  • Contact oahpahusossodat about texts
  • Gather public texts
  • Continue converting text from input format to our xml
  • Document the corpus directory structure
  • Ask Thor-Øivind to move bugzilla to our new webserver
    • ... and update Bugzilla at the same time
  • install new XXE and the new XXE Forrest config for all (or check that it is installed and working)
  • mark-up names
  • divvun.no and giellatekno.uit.no
    • Binary files download area
    • Make the conversion to static site, using our own script.
  • hyphenation in corpus docs
  • meet with Anders Kintel in Árran
  • corpus xsl files under version control

Maaren

  • shall work with Sámi place names only
  • update the last issue in the North Sámi normativity issues document

Saara

  • Look at the corpus infrastructure issue
  • Look at the corpus interface issue with Lars
  • look into efficient editing of the xml proper name lexicon (tools, modes, etc)
  • Convert the name lexicon from present format to xml
  • document corpus infrastructure, your own parts
  • Look at the hyphenation issue
  • Update the corpus.dtd
  • corpus xsl files under version control

Sjur

  • Lule Sámi twol problems, look again at the sets definition with Thomas and Trond, Wednesday
  • risten.no bugs and fixes
  • follow up on voice group-chat not working to Sámediggi
    • Test Marratech
  • project planning with Trond, continued
    • also look at the development processes - specification and testing
  • Follow up on place names from Norge Digitalt
    • write an e-mail to or call Bjørn Olav Megard
  • Evaluate SFST as speller (and analyzer) lexicon
    • more thorough analysis than was possible in Guovdageaidnu
  • write a background document on the corpus contracts
  • discuss kvensk project support with Trond
  • proper name integration with risten.no
  • discuss risten.no work with Tomi
  • write public tender documents
  • buy:
    • new computer (project server)?
  • hyphenation in corpus docs
  • meet with Anders Kintel in Árran

Thomas

  • work on North sámi compounding and derivation
  • Look at Linguistic bugs with Trond
  • Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
  • update the lule sámi normativity issues document

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
  • three-part compounding
  • corpus infrastructure: dtd location (both public and internal)
  • Document aspell and corpus infrastructure
  • Specification for new catxml in C++
    • this includes also placing the source and binary
  • discuss about xml-processing with Saara
  • look into efficient editing of the xml proper name lexicon (tools, modes, etc)
  • start looking at conversion of the name lexicon from present format to xml
  • discuss risten.no work with Sjur
  • Look into synchronisation of proper names with risten.no
  • hyphenation in corpus docs
  • corpus xsl files under version control
  • add automatic language detection to the corpus processing

Trond

  • Send the contract to the university lawyer
  • Look into the document hyphenation issue
  • Look at the three-part compound issue
  • Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed.
  • project planning with Sjur, continued
    • also look at the development processes - specification and testing
  • The name project
    • Work on the name project, mark up names (400 names left)
    • Extract complex names from version 1.126 and save them as a separate file in common.
  • discuss kvensk project support with Sjur
  • Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.

10. Next meeting, closing

21.11.2005 10: 00

Closed at 11: 47