Meeting_2006-02-06

Meeting setup

Date: 06.02.2006
Time: 09.30 Norw. time
Place: Wherever we are : -)
Tools: iChat, SubEthaEdit

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
name lexicon infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09: 38.

Present: Børre, Saara, Sjur, Tomi, Trond

Absent: Maaren, Thomas

Main secretary: Trond

Agenda accepted as is, we'll try to finish by 10.55, to allow for joining the celebration of the Sámi national day.

2. Reviewing the task list from the last meeting

Børre

send out contracts with accompanying letter
- Sent to Iđut and Kåfjord municipality
Gather public texts, preferrably also parallel ones
- Not done
Continue converting text from input format to our xml
- Not done
review code and documentation for corpus xsl files under version control
- Not done
fix bugs!
- Not done
Other
- The server didn't get an IP-address using DHCP. It turned out that if the «Gateway Assistant» was used, then the network card connected to «the outside world» didn't get an IP-address using DHCP. Even if all the services that was started using the «Gateway Assistant» were turned off, this behaviour went on. The «solution» was to reinstall Mac OS X Server, and not use the «GA».

Maaren

work with risten.no
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

continue discussion on the new lexicon format
Refine language detection for Finnish
Finnish the review of the hyphenation detection.
Review the handling of xsl-files in corpus infrastructure, including version control
- almost done, I'll need some help with the xsl-processing of the main text.
Fix the preprocess script and optimize it by building an analyzator for the multi-part expressions.
- it seems that building a preprocessor-specific analyzator is not possible.
finalize an improved working version of the CGI and command line scripts for corpus additions
- almost done.
update conversion from lexc to xml (proper names) with the latest refinements
Try to add numeral treatment as part of the analyzator.
- not done
Change character coding detection to paragraph-based.
- done, use convert2xml.pl with option --multi-coding. This introduces some errors (due to the small size of some paragraphs), so the default is still one coding per file.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
- not done
Lule Sámi twol problems, with Thomas and Trond
- not done
follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
  - not done
project planning with Trond, continued
- also look at the development processes - specification and testing
  - not done
Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
  - not done
Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
  - not done
write a background document on the corpus contracts
- not done
continue proper name lexicon work and discussion
- did a lot to upgrade the risten.no infrastructure to be multi-collection aware
- discussions in the newsgroup
- added the test lexicons Saara created to my own instance of risten.no
public tender:
- waited for and received a draft public tender document from Finnut
smj G3 issue with Thomas and Trond
- not done
sme G3 issue with Thomas and Trond
- not done
call EDD/ Christian Emil Ore about national place name lexicon
- not done
fix bugs!
- closed bug #217

Thomas

On sick leave.

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
  - Not done
corpus infrastructure:
- dtd location (both public and internal)
  - Not done
- cgi-admin script for adding xsl-files
  - Not done
Document aspell and corpus infrastructure
ccat: add a -v option - it should return the version of the tool
- Done
new proper name lexicon
- remove last part of complex names not used as simplex names
  - Not done
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
  - Some progress
- new version of xml2lexc (based on catxml, now ccat)
  - Not done
  - xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry
comment review template made by Saara
fix bugs!

Trond

Work on corpus texts with Børre.
- Done some progres wrt. processing of texts.
3-part compounds with Sjur and Thomas.
- Had a look at the rule set myself, but awaiting Thomas.
smj G3 issue with Sjur and Thomas.
- Not done.
sme G3 issue with Sjur and Thomas.
- Not done.
fix bugs!
- Not done.
Worked mostly on disambiguation.

3. Documentation

Reviews

Anything? Nothing.

Other docu

Anything? Documentation has been added for disambiguation.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

Sent letter to Iđut and Kåfjord.

TODO: Still a lot for Børre!

Odin

Waiting for Sæth to discuss with colleagues about how to implement the cooperation, and return to us.

Nothing heard.

Bible texts

TODO:

write a paratext2xml converter
- Tomi has already done it! Excellent!
- files requiring this converter should have the filename extension .ptx
  - Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX 19PSANBST.u8.PTX
- Børre will review the converter as part of adding the Norwegian texts to our corpus
convert smj NT to paratext. (Børre)
ask to get fin and swe NT and OT in paratext format. (Trond)

5. Corpus infrastructure

Task list:

Include the xsl files under version control
1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We'll continue the discussion in the newsgroup.
Incorporate language detection as part of the corpus processing (Saara)
1. Almost finished. Needs improved Finnish language model - presently it isn't able to distinguish Finnish from Sámi (proving the family bonds: -)
we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
1. Acceptable results: 90% of all real hyphens correctly tagged.
CGI-admin script to add xsl-file to a corpus file that doesn't have one ( Saara)

Things are moving forward, but still more work to do. The list is left as is.

E-mail address in case of upload errors:

corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.

/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest) http: //localhost: 8888/upload/upload_corpus_file.html (Forrest)

One option is to ask the cochise team, that would be royd or steinar and the address cc.uit.no.

Problems with greek letter in Word documents. With font Sam Times Uni(versal)
- ( Børre) Can't we just manually change the letters and fonts in the few documents affected?

We forget about these texts for the time being, they'll be put in a dir. for broken texts. Such texts can be looked upon later , if wanted/needed.

Suggestion for Script for text analysis.

We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue, with one file for each of the five directories. A way of getting this is to ach night (afternoon!): Make a crontab job, run the following command, for each directory admin, bible, facta, ficti, laws, news:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg 
 | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt 

For example:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
 | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle
  > /usr/local/share/corp/ga/sme/admin.txt

Today: /usr/local/share/corp/gt/sme/DIR(/*)/*xml
Addition: /usr/local/share/corp/ga/sme/dir.txt

TODO:

Look at the suggestion from Trond ( Saara, discuss with Trond if unclear)
ask for e-mail adress as specified above (Trond)

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names

TODO:

make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
Saara has added the analyzer as part of the preprocess, but it is slow, and needs to be optimized.

XML format

TODO:

update conversion from lexc to xml to reflect new xml format (Saara)
1. mostly done, some open questions left
testing of conversion
eXist as editor:
1. develop the needed XQueries and interface
2. data synchronisation between risten.no and
3. test whether eXist as editor is actually working well

More TODO:

read and comment in the news group (all)
decide upon and set up infra for new projects and project ideas

Definitions/terminology:

synchronisation in our context is data synchronisation, that is, to bring the two repositories (CVS and risten.no) in synch regarding the shared lexicons (propnouns at least).
code refactoring is the process of reorganising the code by moving general functions and snippets out of specific functions and into general libraries for shared access, easier maintenance, and a better organisation.

8. Other

SGL Seminar

SGL/normativity seminar
- all members = potentially/likely all languages
  - not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate

Technical issues

The mac os / perl bug (at least Trond and Sjur has it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Saara, Tomi)
  - 10.4 introduced support for locales in the shell (10.3 and earlier didn't know about locales)
- Test: the result of the last line should indicate whether this is a problem in cat or a Perl/OS mismatch.
- Is this a problem with ccat?
  - It doesn't seem so (3 min and still counting)
  - In the end, the bug turned up with ccat as well. I gave the command:
  - zcorp/gt/sme/*/*xml

preprocess --abbr=bin/abbr.txt

lookup -flags

mbTT -utf8 bin/sme.fst

lookup2cg

vislcg --grammar=src/sme-dis.rle

--minimal

sort

less

1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.

To ccat's defence I must say that cat, in a similar situation, would have given far more error messages (hold on, testing still under way).

   preprocess file_name.txt - OK
   cat file_name.txt | preprocess - bug!!
   catxml file_name.xml | preprocess - ??
   ccat filename | preprocess - bug !!

This bug isn't a high priority any more, because ccat behaves differently than cat, and because there is the possibility of avoiding cat when working locally.

BUG: close as Won't fix. (Børre)

Bug fixing

32 open bugs (and 24 risten.no bugs)

Add bug report for the Xerox backspace error (Trond)

9. Summary, task list

Børre

send out contracts with accompanying letter
Gather public texts, preferrably also parallel ones
Continue converting text from input format to our xml
review code and documentation for corpus xsl files under version control
convert nob and nno bible texts to be used as part of a parallel corpus, and review the paratext2xml converter as part of the conversion
convert smj NT to paratext
close bug 211 as WONTFIX
- DONE : -)

fix bugs!

Maaren

work with risten.no
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

continue discussion on the new lexicon format
Refine language detection for Finnish
Finnish the review of the hyphenation detection.
Review the handling of xsl-files in corpus infrastructure, including version control
Fix the preprocess script and optimize it by building an analyzator for the multi-part expressions.
finalize an improved working version of the CGI and command line scripts for corpus additions
update conversion from lexc to xml (proper names) with the latest refinements
Try to add numeral treatment as part of the analyzator.
Look at crontab ga/ directory issue with Trond.
fix bugs!

Sjur

Follow up the lawyer treatment of the contracts
Lule Sámi twol problems, with Thomas and Trond
project planning with Trond, continued
Follow up on place names from Norge Digitalt
Evaluate SFST as speller (and analyzer) lexicon
write a background document on the corpus contracts
public tender:
- review draft tender document from Finnut
smj G3 issue with Thomas and Trond
sme G3 issue with Thomas and Trond
call EDD/ Christian Emil Ore about national place name lexicon
risten.no/proper noun lexicon development: fix bugs, continue development
fix bugs!

Thomas

work on North Sámi compounding and derivation
review corpus usage documentation
smj G3 issue with Sjur and Trond
sme G3 issue with Sjur and Trond

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
corpus infrastructure:
- dtd location (both public and internal)
Document aspell and corpus infrastructure
new proper name lexicon
- remove last part of complex names not used as simplex names
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
comment review template made by Saara
fix bugs!

Trond

Work on corpus texts with Børre.
Contact the Finnish and Swedish Bible societies to get Bible texts.
Look at ga/ directory issue with Saara.
News group discussion followup.
Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
Ask for e-mail adress for corpus upload script
fix bugs!.

10. Next meeting, closing

13.02.2006 09: 30

Closed at 10: 37