Meeting_2005-12-19

Meeting setup

Date: 19.12.2005
Time: 09.30 Norw. time
Place: Wherever we are : -)
Tools: iChat, SubEthaEdit, phone

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
name lexicon infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10: 20.

Present: Børre, Maaren, Saara, Sjur, Thomas

Absent: Maaren, Saara, Tomi, Trond

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Gather public texts
- Not done
Continue converting text from input format to our xml
- Done. Reported bugs in convert2xml.pl which Saara fixed.
Document the corpus infrastructure
- Not done
Ask Thor-Øivind to help us move bugzilla to our new project server
- ... and make the URL http://giellatekno.uit.no/bugzilla point to it (and even to http://project.divvun.no/bugzilla).
  - He began this work on friday afternoon.
install new XXE and the new XXE Forrest config for Ilona
- not necessary, she doesn't use it
divvun.no and giellatekno.uit.no:
- Binary files download area
  - Done
- Continue the conversion to static site, using our own script.
  - Done
review code and documentation for corpus xsl files under version control
- Not done
review the convert2xml.pl script, what works well and what doesn´t.
- Done, see above
comment review template made by Saara
- Not done
fix bugs!

Maaren

work with risten.no
review corpus usage documentation, and the usage of the corpus
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February
SGL: have had their meeting - all of our issues have been handled, we're waiting for their answer (from Maaren: 'SGL har i sitt møte på tirsdag "godkjent" lista vår med noen små endringer')

Saara

Look at the corpus infrastructure issue
Look at the corpus interface issue with Lars
Convert the name lexicon from present format to xml
discuss the new lexicon format in the newsgroup
document corpus infrastructure, your own parts
- done
catxml review: make a template, and post it in the newsgroup
- done
Language detection for the corpus files
- not done
Implement addition of <hyph> tags to the converted corpus files
- done
Do some testing for bug #211
update the convert2xml script according to the comments
- done
review corpus upload user interface
- done
add version control of the corpus xsl files to the (upload) processing
- preliminary commands added (still in comments) and documentation written. some things with write access of RCS files have to solved first.
xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry
update preprocess to handle inflected forms of complex names
- done, but needs to be optimized
fix bugs!

Sjur

Lule Sámi twol problems, with Thomas and Trond
follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
  - waiting for the new server
project planning with Trond, continued
- also look at the development processes - specification and testing
  - done some
Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
write a background document on the corpus contracts
- not done
discuss kvensk project support
- done
proper names:
- discuss lexicon format
  - some discussion with Trond
- discuss implementation with Tomi
  - nope
- post a draft XML format for the name lexicon, based on the discussion above
  - not yet
write public tender documents
- worked with the selection criterias with Trond
update the SD contract version
- done
send SD corpus contract to lawyer
- done
review corpus upload user interface
- done
fix bugs!
Other:
- made a test version of the Komi lexicon in eXist/risten.no (my local installation) - a very nice proof of concept for dictionaries and language technology from the same source: it is (a copy of) the very same lexicon as is used to compile the transducer (using xml2lexc)! Have a look: http://84.231.62.157:8080/exist/risten/index.html, search for e.g. 'кыввор*' (translations displayed are only in Finnish, but the English translations can easily be displayed as well). This is the future - this is what we want to have!
- Tomi and I tried hard to debug the parallel editing bug in risten.no/eXist, to no avail. But a new e-mail to the eXist list was sent, with some new tests from Wolfgang as a suggested next step.

Thomas

work on North Sámi compounding and derivation
- worked and still working
review corpus usage documentation
- didn't quite understand it
translate the normativity-issue on second-syllable deletion when compounding and send it to Maaren
- done
smj G3 issue with Sjur and Trond
- nothing done this week

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
corpus infrastructure: dtd location (both public and internal)
Document aspell and corpus infrastructure
Specification for new catxml in C++
- this includes also placing the source and binary
  - New catxml sources and Makefile are in cvs - gt/script/samiXMLParser/
new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into ynchronisation of proper names with risten.no
hyphenation in corpus docs
comment review template made by Saara
fix bugs!
pick up backpacks after Xmas

Trond

project planning with Sjur, continued
- Substantial progress on work with the planning tools during my Helsinki stay.
Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Not done.
document corpus infrastructure, your part
- Not done.
update the SD contract version
- Done.
send SD corpus contract to lawyer
- Done.
review corpus upload user interface
- Had a look at it, but not filed anything yet.
review corpus usage documentation
- Not given feedback.
comment review template made by Saara
- Have read it.
discuss the new lexicon format in the newsgroup
- Some done.
fix bugs!
- Here my activity has been more subversive, as I have made Linda add more bug reports...
Other:
- Worked with disambiguation, with Komi and with Greenlandic. The outcome of the two latter ones turn out to be of relevance to us as well, the former on the dictionary-xml-lexc issue (here Saara had done a great job!) and the latter on spellchecker issues. The good news about Komi is that the xml dictionary (kt/kom/src/kom-lex.xml) now is made into a set of lexc files via make, and simultaneously into a searchable web dictionary via exist.

3. Documentation

Reviews

Add documentation on our corpus infrastructure and our corpus work in general ( Børre, Tomi, Trond, Saara). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

For the users/linguists: What corpus are found, how do I use them (this info is now scattered) (Part of the HOWTO USE is documented in the catxml docu, what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated)
1. catxml done, which is what is needed mostly. Do we need more?
2. Review: Thomas, Maaren, Linda, Ilona, Trond
For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)
1. we need a review of the web interface for corpus uploading - what is still missing?
  1. Review: Sjur, Saara, Trond, Thomas

Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation.

Deadline for reviews: by next meeting.

Corpus uploading review: closed.

Catxml review: this should be based on the new tool made by Tomi (C++ version), called ccat. Tomi will install it, and announce it in news when ready for review.

Tomcat->static HTML

Done. Cronscript doesn't work, requiring manual updates for now. Will be fixed this week.

4. Corpus gathering

Contracts

Next step:

Make our versions of the updated Helsinki contracts, and make sure they are according to our intention. (Sjur and Trond)
1. done
send them to the SD lawyer and to the University lawyer through formal channels. (Sjur and Trond)
1. done
possibly update contracts with remarks from lawyers
start using them!

Contract 1 should have the main priority (contract 2 for Trond).

Deadline: This was finished last week.

Collecting

Børre have downloaded quite a few web pages and even sites. How and where to store them? Sites stored as directory structures. The Odin pages we will get as source files directly from the editor, cf Trond's report from last meeting.

5. Corpus infrastructure

Updated task list:

Include the xsl files under version control
1. RCS to be used, Saara to include it in the (upload) processing of new corpus files, as well as documentation (Børre to review)
  1. Saara has made a first draft, but the code is commented out - some bugs need to be fixed
Incorporate language detection as part of the corpus processing (Saara)
1. The tool needs better training material than was used initially.
we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess.
1. Saara has made a hyphen detection script that tries to discriminate between real hyphenation and other uses of hyphens (recall that only real hyphenation should be tagged <hyph/>).
we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
1. Acceptable results: 90% of all real hyphens correctly tagged.
CGI-admin script to add xsl-file to a corpus file that doesn't have one ( Tomi)

6. Linguistics

North Sámi

three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
  - Thomas is finished with three-part compounds for North Sámi. On the negative side, we note a compilation time of 10m for the sme parser as a whole.
- compound shortening issue is waiting for the SGL to make a normative decision
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in this discussion, but how? We need a common meeting / normativity seminar / seminar on proofing issues and on language technology in a broader sense with them. Then we also need to open the linguistic issues newsgroup we talked about earlier, and set up their computers and teach them to use the newsgroup. This seminar needs to include the Lule Sámi section of SGL (others are welcome too). Timetable:
  - The Sámediggiráđđi is going to appoint new members before Christmas, the meeting is 13.12. (Johan Mikkel Saara is asking people)
  - Contact the SGL when it is elected, in December, and ask them to arrange a normativity seminar, with invited keypersons.
  - seminar in February?
- TODO list:
  - Maaren to discuss with relevant persons on this issue.
number project still open (see below)
diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
  - Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
  - Bug 77 - clearify whether háliid is acceptable - it is found, thus we need to be able to analyse it, but only as !SUB:

Lule Sámi

Open tasks:

Derived G3 that looks like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

Three-part compounds
Diphthong simplification
Derivation

These issues are recently done in Lule Sámi, and it is more efficient to complete them in North Sámi directly thereafter instead of beginning a new topic

Numeral treatment is on different level in the existing sme and smj parsers, but the issue itself is common to the two langauges, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Summary: see the newsgroup

Complex names

Task list for this issue:

find eventual unique second-parts (B-parts of names that do not exist in isolation): Take the 1.126 version, remove the non-last parts, and compare the B-parts of proper-complex.xml with the simplex forms in 1.126. Non-hits should then be removed from the current propernoun-sme-lex.txt.
- Postponed till next week
remove these B-parts from the ordinary name file (Tomi)
- New York => York, Los Angeles =/=> *Angeles
make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Saara)
- Tomi is working on a C++ version of the xml2lexc tool. He will put it into CVS soon (it is actually catxml, can easily be adopted to the xml2lexc requirements)
the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
The core issue: The preprocess script can handle uninflected (A + B) compounds, but not (A + inflected B). Saara will add the analyzer as part of the preprocess, to be able to handle inflected cases as well.

XML format

Basis for the progress: separate documents according to project:

Containing one common concept or name id field, plus Divvun/Disamb fields only (based on our old proposal)
Containing one common concept or name id field, plus kvensk fields only (based on the field list from Irene)
common document containing one common id field, plus fields common to several bases

We need a meeting to move forward. Suggested time: Wednesday 21, 9: 30. Participants: Børre, Tomi, Sjur, Trond, Saara?

Tasks:

testing of conversion
continue the discussion of the name lexicon format ( Saara, Tomi, Sjur, Trond)
implement a prototype in eXist
eXist as editor:
1. develop the needed XQueries and interface
2. synchronisation between risten.no and
3. test whether eXist as editor is actually working well

8. Other

Xmas

Planned vacations and working days (two days of work, 4 hours each):

Børre: will work on 27th and 28th, away 29th and 30th.
Maaren: ?
Sjur: no vacation, don't know what days working
Thomas: no vacation, don't know what days working
Tomi: no vacation, working in Finland offline at romjula and online after new year

Technical issues

The mac os / perl bug (at least Trond and Sjur has it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Saara, Tomi)
- Test: the result of the last line should indicate whehter this is a problem in cat or a Perl/OS mismatch.

   preprocess file_name.txt OK
   cat file_name.txt | preprocess !!
   catxml file_name.xml | preprocess ??

Bug fixing

28 open bugs (and 25 risten.no bugs)

Move Bugzilla

Move Bugzilla to our new server when it arrives, and make it work at both the old URL http://giellatekno.uit.no/bugzilla/ as well as a similar one based on the name of the new server (e.g. something like http://project.divvun.no/bugzilla, where the 'project' part is still open for discussion) (Thor Øivind and Børre).

9. Summary, task list