Meeting_2006-02-06

Meeting setup

  • Date: 06.02.2006
  • Time: 09.30 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09: 38.

Present: Børre, Saara, Sjur, Tomi, Trond

Absent: Maaren, Thomas

Main secretary: Trond

Agenda accepted as is, we'll try to finish by 10.55, to allow for joining the celebration of the Sámi national day.

2. Reviewing the task list from the last meeting

Børre

  • send out contracts with accompanying letter
    • Sent to Iđut and Kåfjord municipality
  • Gather public texts, preferrably also parallel ones
    • Not done
  • Continue converting text from input format to our xml
    • Not done
  • review code and documentation for corpus xsl files under version control
    • Not done
  • fix bugs!
    • Not done
  • Other
    • The server didn't get an IP-address using DHCP. It turned out that if the «Gateway Assistant» was used, then the network card connected to «the outside world» didn't get an IP-address using DHCP. Even if all the services that was started using the «Gateway Assistant» were turned off, this behaviour went on. The «solution» was to reinstall Mac OS X Server, and not use the «GA».

Maaren

  • work with risten.no
  • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

  • continue discussion on the new lexicon format
  • Refine language detection for Finnish
  • Finnish the review of the hyphenation detection.
  • Review the handling of xsl-files in corpus infrastructure, including version control
    • almost done, I'll need some help with the xsl-processing of the main text.
  • Fix the preprocess script and optimize it by building an analyzator for the multi-part expressions.
    • it seems that building a preprocessor-specific analyzator is not possible.
  • finalize an improved working version of the CGI and command line scripts for corpus additions
    • almost done.
  • update conversion from lexc to xml (proper names) with the latest refinements
  • Try to add numeral treatment as part of the analyzator.
    • not done
  • Change character coding detection to paragraph-based.
    • done, use convert2xml.pl with option --multi-coding. This introduces some errors (due to the small size of some paragraphs), so the default is still one coding per file.
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
    • not done
  • Lule Sámi twol problems, with Thomas and Trond
    • not done
  • follow up on voice group-chat not working to Sámediggi
    • Test Marratech when the new Marratech server is in place
      • not done
  • project planning with Trond, continued
    • also look at the development processes - specification and testing
      • not done
  • Follow up on place names from Norge Digitalt
    • write an e-mail to or call Bjørn Olav Megard
      • not done
  • Evaluate SFST as speller (and analyzer) lexicon
    • more thorough analysis than was possible in Guovdageaidnu
      • not done
  • write a background document on the corpus contracts
    • not done
  • continue proper name lexicon work and discussion
    • did a lot to upgrade the risten.no infrastructure to be multi-collection aware
    • discussions in the newsgroup
    • added the test lexicons Saara created to my own instance of risten.no
  • public tender:
    • waited for and received a draft public tender document from Finnut
  • smj G3 issue with Thomas and Trond
    • not done
  • sme G3 issue with Thomas and Trond
    • not done
  • call EDD/ Christian Emil Ore about national place name lexicon
    • not done
  • fix bugs!

Thomas

On sick leave.

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
      • Not done
  • corpus infrastructure:
    • dtd location (both public and internal)
      • Not done
    • cgi-admin script for adding xsl-files
      • Not done
  • Document aspell and corpus infrastructure
  • ccat: add a -v option - it should return the version of the tool
    • Done
  • new proper name lexicon
    • remove last part of complex names not used as simplex names
      • Not done
    • start looking at conversion of the name lexicon from present format to xml
    • discuss the new lexicon format in the newsgroup
    • Look into synchronisation of proper names with risten.no
      • Some progress
    • new version of xml2lexc (based on catxml, now ccat)
      • Not done
      • xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry
  • comment review template made by Saara
  • fix bugs!

Trond

  • Work on corpus texts with Børre.
    • Done some progres wrt. processing of texts.
  • 3-part compounds with Sjur and Thomas.
    • Had a look at the rule set myself, but awaiting Thomas.
  • smj G3 issue with Sjur and Thomas.
    • Not done.
  • sme G3 issue with Sjur and Thomas.
    • Not done.
  • fix bugs!
    • Not done.
  • Worked mostly on disambiguation.

3. Documentation

Reviews

Anything? Nothing.

Other docu

Anything? Documentation has been added for disambiguation.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

Sent letter to Iđut and Kåfjord.

TODO: Still a lot for Børre!

Odin

Waiting for Sæth to discuss with colleagues about how to implement the cooperation, and return to us.

Nothing heard.

Bible texts

TODO:

  • write a paratext2xml converter
    • Tomi has already done it! Excellent!
    • files requiring this converter should have the filename extension .ptx
      • Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX 19PSANBST.u8.PTX
    • Børre will review the converter as part of adding the Norwegian texts to our corpus
  • convert smj NT to paratext. (Børre)
  • ask to get fin and swe NT and OT in paratext format. (Trond)

5. Corpus infrastructure

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We'll continue the discussion in the newsgroup.
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. Almost finished. Needs improved Finnish language model - presently it isn't able to distinguish Finnish from Sámi (proving the family bonds: -)
  3. we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
    1. Acceptable results: 90% of all real hyphens correctly tagged.
  4. CGI-admin script to add xsl-file to a corpus file that doesn't have one ( Saara)

Things are moving forward, but still more work to do. The list is left as is.

E-mail address in case of upload errors:

corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.

/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest) http: //localhost: 8888/upload/upload_corpus_file.html (Forrest)

One option is to ask the cochise team, that would be royd or steinar and the address cc.uit.no.

  • Problems with greek letter in Word documents. With font Sam Times Uni(versal)
    • ( Børre) Can't we just manually change the letters and fonts in the few documents affected?

We forget about these texts for the time being, they'll be put in a dir. for broken texts. Such texts can be looked upon later , if wanted/needed.

Suggestion for Script for text analysis.

We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue, with one file for each of the five directories. A way of getting this is to ach night (afternoon!): Make a crontab job, run the following command, for each directory admin, bible, facta, ficti, laws, news:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg 
 | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt 

For example:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle
  > /usr/local/share/corp/ga/sme/admin.txt 
  • Today: /usr/local/share/corp/gt/sme/DIR(/*)/*xml
  • Addition: /usr/local/share/corp/ga/sme/dir.txt

TODO:

  • Look at the suggestion from Trond ( Saara, discuss with Trond if unclear)
  • ask for e-mail adress as specified above (Trond)

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names

TODO:

  • make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
    • the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
  • Saara has added the analyzer as part of the preprocess, but it is slow, and needs to be optimized.

XML format

TODO:

  1. update conversion from lexc to xml to reflect new xml format (Saara)
    1. mostly done, some open questions left
  2. testing of conversion
  3. eXist as editor:
    1. develop the needed XQueries and interface
    2. data synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

More TODO:

  • read and comment in the news group (all)
  • decide upon and set up infra for new projects and project ideas

Definitions/terminology:

  • synchronisation in our context is data synchronisation, that is, to bring the two repositories (CVS and risten.no) in synch regarding the shared lexicons (propnouns at least).
  • code refactoring is the process of reorganising the code by moving general functions and snippets out of specific functions and into general libraries for shared access, easier maintenance, and a better organisation.

8. Other

SGL Seminar

  • SGL/normativity seminar
    • all members = potentially/likely all languages
      • not all languages, only North Sámi
    • date? As early as possible, end of February/beginning of March
    • place? Maaren will investigate

Technical issues

  • The mac os / perl bug (at least Trond and Sjur has it, Bugzilla #211):
    • utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Saara, Tomi)
      • 10.4 introduced support for locales in the shell (10.3 and earlier didn't know about locales)
    • Test: the result of the last line should indicate whether this is a problem in cat or a Perl/OS mismatch.
    • Is this a problem with ccat?
      • It doesn't seem so (3 min and still counting)
      • In the end, the bug turned up with ccat as well. I gave the command:
      • zcorp/gt/sme/*/*xml
preprocess --abbr=bin/abbr.txt lookup -flags

mbTT -utf8 bin/sme.fst

lookup2cg vislcg --grammar=src/sme-dis.rle

--minimal

sort less
    1729 constraint rules
    utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
    

    To ccat's defence I must say that cat, in a similar situation, would have given far more error messages (hold on, testing still under way).

       preprocess file_name.txt - OK
       cat file_name.txt | preprocess - bug!!
       catxml file_name.xml | preprocess - ??
       ccat filename | preprocess - bug !!
    

    This bug isn't a high priority any more, because ccat behaves differently than cat, and because there is the possibility of avoiding cat when working locally.

    BUG: close as Won't fix. (Børre)

    Bug fixing

    32 open bugs (and 24 risten.no bugs)

    • Add bug report for the Xerox backspace error (Trond)

    9. Summary, task list

    Børre

    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Continue converting text from input format to our xml
    • review code and documentation for corpus xsl files under version control
    • convert nob and nno bible texts to be used as part of a parallel corpus, and review the paratext2xml converter as part of the conversion
    • convert smj NT to paratext
    • close bug 211 as WONTFIX
      • DONE : -)

    Maaren

    • work with risten.no
    • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

    Saara

    • continue discussion on the new lexicon format
    • Refine language detection for Finnish
    • Finnish the review of the hyphenation detection.
    • Review the handling of xsl-files in corpus infrastructure, including version control
    • Fix the preprocess script and optimize it by building an analyzator for the multi-part expressions.
    • finalize an improved working version of the CGI and command line scripts for corpus additions
    • update conversion from lexc to xml (proper names) with the latest refinements
    • Try to add numeral treatment as part of the analyzator.
    • Look at crontab ga/ directory issue with Trond.
    • fix bugs!

    Sjur

    • Follow up the lawyer treatment of the contracts
    • Lule Sámi twol problems, with Thomas and Trond
    • project planning with Trond, continued
    • Follow up on place names from Norge Digitalt
    • Evaluate SFST as speller (and analyzer) lexicon
    • write a background document on the corpus contracts
    • public tender:
      • review draft tender document from Finnut
    • smj G3 issue with Thomas and Trond
    • sme G3 issue with Thomas and Trond
    • call EDD/ Christian Emil Ore about national place name lexicon
    • risten.no/proper noun lexicon development: fix bugs, continue development
    • fix bugs!

    Thomas

    • work on North Sámi compounding and derivation
    • review corpus usage documentation
    • smj G3 issue with Sjur and Trond
    • sme G3 issue with Sjur and Trond

    Tomi

    • Aspell: Continue working on the affix file & aspell
      • Contact aspell author (UTF-8 thing)
    • corpus infrastructure:
      • dtd location (both public and internal)
    • Document aspell and corpus infrastructure
    • new proper name lexicon
      • remove last part of complex names not used as simplex names
      • discuss the new lexicon format and other issues in the newsgroup
      • Look into data synchronisation of proper nouns between risten.no and CVS
      • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
    • comment review template made by Saara
    • fix bugs!

    Trond

    • Work on corpus texts with Børre.
    • Contact the Finnish and Swedish Bible societies to get Bible texts.
    • Look at ga/ directory issue with Saara.
    • News group discussion followup.
    • Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
    • Ask for e-mail adress for corpus upload script
    • fix bugs!.

    10. Next meeting, closing

    13.02.2006 09: 30

    Closed at 10: 37