Meeting_2007-11-19

Meeting setup

  • Date: 19.11.2007
  • Time: 09.30 Norw. time
  • Place: Internet
  • Tools: SubEthaEdit, iChat/Skype

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10: 15.

Present: Børre, Ilona, Per-Eric, Sjur, Thomas, Tomi

Absent: Risten, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • move Steinar's error markup in the xml files to (a copy of) the original
    • not done
  • fix Unicode bug in Hunspell conversion java code
    • don't know the reason for this one
  • fix bug 550
    • not done
  • move Bugzilla to the G5.
    • done
  • fix Windows CD installation bug
    • not done
  • discuss more parallel texts
    • nothing con
  • fix bugs!

Ilona

  • lexicalise smj missing words
    • Done.. at least most of it.
  • other smj tasks, ask Thomas

Maaren

  • lexicalise actio compounds

Per-Eric

  • lexicalise words from the Olavi missing list
    • Worked and still working. It will be ready this week, some strange words left. We have to make a new missng list to see which words are left.
  • fix bugs!
    • Nothing to fix

Risten

  • finish the design/text for the CD and the cover
    • done
  • set up CD-printing printer
  • try to burn a CD at SD
    • done
  • get price and schedule for printed CD cover

Saara

  • add new XSL/XML headers for proofing test docs
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
  • add nested error markup to xml conversion
    • almost finished, just some testing left
  • discuss more parallel texts

Sjur

  • work on the XML name editor/risten.no integration
    • nothing - this will have to wait till Divvun2
  • set up risten.no on the G5 again
    • really tried last week, to help Trond with some work, but failed (eXist couldn't reliably restore files from a local backup, leaving the dictionary and term collections incomplete and useless; and Forrest rejected to change port despite explicit requests for it, and port-crashed with the existing divvun.no site running off the same computer, this made the whole portal dysfunctional)
  • test new and nested error markup
    • Saara tried getting it in place, but is not finished with her work, thus nothing to test yest
  • get command line hyphenator for automated testing of the hyph-lexicons
    • done
  • add hyphenation testing
    • first rough version done
  • improve paradigm testing
    • done
  • fix bug 550
    • not yet
  • follow-up support for the sami languages in OpenOffice.org
    • done, they will be included in the next OOo release - 2.4
  • fix Windows CD installation bug
    • not yet done
  • fix circularity issue in nonrec transducers
    • Tomi did this on his own - great!
  • fix bugs!

Thomas

  • sme->smj lexicon conversion to build bilingual lexicon resources
    • not this week
  • check for bad hyphenation
    • not this week
  • look at test cases still not behaving properly
    • worked
  • paradigm testing
    • worked a lot
  • fix bugs!
    • worked

Tomi

  • Hunspell lexicon conversion
    • not done
  • fix circularity issue in nonrec transducers
    • fixed
  • fix bugs!
  • other
    • installed Leopard

Trond

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • fix hyphenation of derivations, inflections
  • telephone meeting with Sjur and the faroese group re faroese speller
  • fix circularity issue in nonrec transducers
  • discuss more parallel texts
  • fix bugs!.

3. Documentation

Nothing new.

TODO:

4. Corpus infrastructure

Nothing.

5. Infrastructure

Bugzilla is up and running again.

TODO:

  • add Jabber account in iChat (all)

6. Linguistics

North Sámi

TODO:

  • fix hyphenation of derivations (Thomas, Tomi, Sjur, Trond)
    • now tested, and much improved, but still needs improvements and further investigation
  • fix circularity issue (Sjur, Tomi, Trond)
    • Tomi fixed it

Lule Sámi

TODO:

  • lexicalise words from the Olavi missing list, but check against the pdf original where in doubt (Per-Eric)
    • almost finished - only parts of the letter r still missing
  • sme->smj lexicon conversion to build bilingual lexicon resources, and increase smj coverage (Trond, Thomas, Svenne)
  • look at missing baseforms (Thomas)
    • done

7. Name lexicon infrastructure

This sub-project needs to get up and running soon. Mainly Sjur's task.

Decisions made in Tromsø can be found in this meeting memo.

TODO:

  1. set up Tomcat and risten.no on the G5 again (Sjur, Børre)
    1. install risten.no
      1. really tried, but got problems
  2. fix bugs in lexc2xml; add comments to the log element (Saara)
  3. finish first version of the editing (Sjur)
  4. test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
  5. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  6. convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
  7. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
  8. start to use the xml file as source file
  9. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  10. merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  11. publish the name lexicon on risten.no (Sjur)
  12. add missing parallel names for placenames (linguists)
  13. add informative links between first names like Niillas and Nils ( linguists)

8. Proofing tools

Hunspell

TODO:

  1. Follow-up support for the sami languages in OpenOffice.org (Børre, Sjur)
    1. done, scheduled for version 2.4
  2. Hunspell lexicon conversion (Tomi, Børre)
    1. improved, nouns compile much improved, adjectives also converts, but with errors, verbs are quite rough

Testing

Spelling Error Markup

This will wait till after the release.

TODO:

  • Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Saara)
  • move Steinar's error markup in the xml files to (a copy of) the original ( Børre, Kimme)
  • add nested error markup to xml conversion (Saara)
  • test new and nested error markup (Sjur)

Automated testing

TODO:

  • add hyphenation testing (Sjur)
    • rough version added
  • improve paradigm testing report (Sjur)
    • done

MS Office

An important aspect of this testing is to document in the user guide anything that could be a problem for users.

TODO:

  • test the proofing tools with all MS Office applications (Børre, Thomas)

Lexicon conversion to the PLX format

PLX conversion update: we will soon get an updated speller engine with flags for marking word-initial, word-internal and word-final positions. That will make it possible for us to resolve some outstanding issues in the PLX conversion.

Open issues based on test results:

smj

482 - still problematic (prefix), 484 - double hyphens suggested; 552 - now fixed!, Svierigadárogielan - still rejected (prefix)

sme

397 - double hyphens (name+name), 419, 425 - roman number, 431 - does not accept the correct string, but DO suggest the same; also hyphen final forms are accepted, but not the same form when part of a compound, 452 - miel is a prefix, 461 - ovda accepted, almost 50 % (17) gets correct suggestion, 489, 522, 524, Guovdageainnu-láđđi not accepted.

Guovdageaidnu-láđđi  nom-
Guovdageainnu-láđđi  gen-

It should be Guovdageaidnu-láđđi OR Guovdageainnu láđđi. The first one is suggested + Guovdageainnutláđđi

Harstad-biila (nom) is ok, whereas gen. Harstada-biila is not, ie the same pattern.

ovda-
ovda-	ovda+Cmpnd

TODO:

  • look at test cases still not behaving properly (Thomas, Tomi)

InDesign tools

TODO:

  • add hyphenation testing (Sjur)
    • done but not finished

Hyphenators

We should look into the possibility of generating pattern-based hyphenation for OOo. It shouldn't be too hard, or require too much work, but needs investigation. => Divvun2.

TODO:

  • get command line hyphenator (Sjur)
    • done

Release version

Schedule and tasks for the remaining weeks:

TODO:

  • try to burn a CD at SD (Risten, Leif-Åge)
    • done, it is working exactly as one burned on the Mac
  • finish text to go on the CD cover (Risten)
    • done
  • set up CD-printing printer (Risten)
    • in the works
  • fix Windows CD installation bug (Sjur, Børre)
    • not yet done
  • get price and schedule for printed CD cover (Risten)
    • not received
  • fix remaining bugs - golden master by next Monday (all)
  • finalise InDesign hyphenator (Sjur, Børre, Thomas)
    • testing
    • documentation
    • installation
  • update usage and installation documentation (Børre, Thomas, Sjur)
  • translate all new documentation (all)
  • QA all documentation (all)
  • do as much hunsopell as possible (Børre, Tomi)

Actual release

December 11-13, one of these days.

Hotel rooms received for all except Ilona, will be received for her as well.

There will be a release party in the afternoon.

9. Other

Corpus contracts

Delayed till after final release.

TODO:

  • publish corpus contracts and project infra as open-source on NoDaLi-sta ( Sjur)

Faroese

Speller for fao using our infrastructure and the knowledge we have.

TODO:

  • set up a telephone meeting with them and Sjur ( Trond)

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, such that for each bug, we know exactly when it should have been fixed, in what file(s) and what version.

69 open Divvun/Disamb bugs (35 of these 56 are speller-related bugs, 34 are other bugs), and 23 risten.no bugs

Dictionaries

TODO:

  • eXist on G5 - done
  • risten.no on G5 - homepage done, but needs port tweaking/changing
  • risten.no XQuery framework - done
  • XInclude conversion script - done
  • homepage on the documetation pages - draft done

Parallel corpora

TODO:

  • discuss more parallel texts (Børre, Saara, Trond)

SD yearly personell seminar

6.-7. December. Sjur has discussed it with Julia, and we won't go there.

Software updates

  • Leopard, 10.5
    • Ilona - Doesn't have a proper DVD yet.
    • Trond
    • Per-Eric

10. Next meeting, closing

The next meeting is 26.11.2007, 09: 30 Norwegian time.

The meeting was closed at 11: 34.

Appendix - task lists for the next week

Boerre

  • move Steinar's error markup in the xml files to (a copy of) the original
  • fix bug 550
  • fix Windows CD installation bug
  • discuss more parallel texts
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • fix bugs!

Ilona

  • lexicalise smj missing words
  • other smj tasks, ask Thomas
  • Buy the DVD for Leopard

Maaren

  • lexicalise actio compounds

Per-Eric

  • lexicalise words from the Olavi missing list
  • derivations tests
  • fix bugs!

Risten

  • set up CD-printing printer
  • get price and schedule for printed CD cover

Saara

  • add new XSL/XML headers for proofing test docs
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
  • add nested error markup to xml conversion
  • discuss more parallel texts

Sjur

  • work on the XML name editor/risten.no integration
  • set up risten.no on the G5 again
  • test new and nested error markup
  • improve hyphenation testing
  • fix bug 550
  • fix Windows CD installation bug
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • fix bugs!

Thomas

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • check for bad hyphenation
  • look at test cases still not behaving properly
  • paradigm testing
  • test the proofing tools with all MS Office applications
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • fix bugs!

Tomi

Trond

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • telephone meeting with Sjur and the faroese group re faroese speller
  • discuss more parallel texts
  • fix bugs!.