Meeting_2006-02-06
Meeting setup
- Date: 06.02.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:38.
Present: Børre, Saara, Sjur, Tomi, Trond
Absent: Maaren, Thomas
Main secretary: Trond
Agenda accepted as is, we'll try to finish by 10.55, to allow for joining
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Sent to Iđut and Kåfjord municipality
- Gather public texts, preferably also parallel ones
- Not done
- Continue converting text from input format to our xml
- Not done
- review code and documentation for corpus xsl files under version control
- Not done
- fix bugs!
- Not done
- Other
- The server didn't get an IP-address using DHCP. It turned out that if the
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
- almost done, I'll need some help with the xsl-processing of the
- Fix the preprocess script and optimize it by building an analyzer
- it seems that building a preprocessor-specific analyzer is not possible.
- finalize an improved working version of the CGI and command line scripts for
- almost done.
- update conversion from lexc to xml (proper names) with the latest
- Try to add numeral treatment as part of the analyzer.
- not done
- Change character coding detection to paragraph-based.
- done, use convert2xml.pl with option --multi-coding. This
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- not done
- Lule Sámi twol problems, with Thomas and Trond
- not done
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- not done
- project planning with Trond, continued
- also look at the development processes - specification and testing
- not done
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- not done
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- not done
- write a background document on the corpus contracts
- not done
- continue proper name lexicon work and discussion
- did a lot to upgrade the risten.no infrastructure to be multi-collection
- discussions in the newsgroup
- added the test lexicons Saara created to my own instance of risten.no
- public tender:
- waited for and received a draft public tender document from Finnut
- smj G3 issue with Thomas and Trond
- not done
- sme G3 issue with Thomas and Trond
- not done
- call EDD/ Christian Emil Ore about national place name lexicon
- not done
- fix bugs!
- closed bug #217
Thomas
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- corpus infrastructure:
- dtd location (both public and internal)
- Not done
- cgi-admin script for adding xsl-files
- Not done
- Document aspell and corpus infrastructure
- ccat: add a -v option - it should return the version of the tool
- Done
- new proper name lexicon
- remove last part of complex names not used as simplex names
- Not done
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- Some progress
- new version of xml2lexc (based on catxml, now ccat)
- Not done
- xml2lexc update to handle complex names: construct entries like we have now
- Not done
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Made some progress wrt. processing of texts.
- 3-part compounds with Sjur and Thomas.
- Had a look at the rule set myself, but awaiting Thomas.
- smj G3 issue with Sjur and Thomas.
- Not done.
- sme G3 issue with Sjur and Thomas.
- Not done.
- fix bugs!
- Not done.
- Worked mostly on disambiguation.
3. Documentation
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
Sent letter to Iđut and Kåfjord.
TODO: Still a lot for Børre!
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
Nothing heard.
Bible texts
TODO:
- write a paratext2xml converter
- Tomi has already done it! Excellent!
- files requiring this converter should have the filename extension .ptx
- Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX
- Børre will review the converter as part of adding the Norwegian texts
- convert smj NT to paratext. (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
5. Corpus infrastructure
Task list:
- Include the xsl files under version control
- RCS version control is almost finished, but an issue with access control is
- Incorporate language detection as part of the corpus processing (Saara)
- Almost finished. Needs improved Finnish language model - presently it isn't
- we need to review whether only automatic hyphen detection is good enough, or
- Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn't have one
Things are moving forward, but still more work to do. The list is left as is.
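The acceptance criterion for hyphen detection in the list above (90% of all real hyphens correctly tagged) amounts to a recall threshold. A minimal sketch of how it could be checked against a hand-annotated sample follows; the function name and the offsets are made up for illustration and are not part of the corpus tools.

```python
def hyphen_recall(gold, tagged):
    """Share of real (gold) hyphen positions that were also tagged."""
    gold = set(gold)
    if not gold:
        return 1.0  # nothing to find, so nothing was missed
    return len(gold & set(tagged)) / len(gold)


# Hypothetical character offsets of hyphens in one hand-checked document.
gold_positions = [12, 40, 77, 103, 150, 181, 220, 256, 300, 342]
# Hypothetical output of the automatic detection: one real hyphen missed,
# one false hit at offset 999 (false hits do not affect recall).
tagged_positions = [12, 40, 77, 103, 150, 181, 220, 256, 300, 999]

recall = hyphen_recall(gold_positions, tagged_positions)
print(f"recall = {recall:.0%}, acceptable = {recall >= 0.90}")
```

Note that this only measures missed hyphens. Whether false hits also need a limit is not covered by the criterion as stated.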
E-mail address in case of upload errors:
corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.
/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest)
One option is to ask the cochise team, that would be royd or steinar and the
- Problems with Greek letters in Word documents. With font Sam Times Uni(versal)
- (Børre) Can't we just manually change
We forget about these texts for the time being, they'll be put in a dir. for
Suggestion for Script for text analysis.
We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue,
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt
For example:
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/admin.txt
- Today: /usr/local/share/corp/gt/sme/DIR(/*)/*xml
- Addition: /usr/local/share/corp/ga/sme/dir.txt
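The gt/ to ga/ mapping suggested above could be generated rather than typed by hand. The sketch below only builds the command line for one directory and does not run anything. It assumes one output file per top-level directory, named after that directory, and it assumes ccat is pointed at the individual subdirectory, which the example in the notes does not show explicitly. The function names are illustrative.

```python
from pathlib import PurePosixPath
import shlex

GT_ROOT = PurePosixPath("/usr/local/share/corp/gt")
GA_ROOT = PurePosixPath("/usr/local/share/corp/ga")

# Pipeline stages as given in the notes; only the input directory varies.
STAGES = [
    ["preprocess", "--abbr=bin/abbr.txt"],
    ["lookup", "-flags", "mbTT", "-utf8", "bin/sme.fst"],
    ["lookup2cg"],
    ["vislcg", "--grammar", "src/sme-dis.rle"],
]


def shadow_path(lang, dirname):
    """Path of the analysed file in ga/ for one directory in gt/."""
    return GA_ROOT / lang / f"{dirname}.txt"


def analysis_command(lang, dirname):
    """Build the shell pipeline for one corpus directory (assumed layout)."""
    source = GT_ROOT / lang / dirname
    stages = [["ccat", "-a", "-r", str(source)]] + STAGES
    pipeline = " | ".join(shlex.join(stage) for stage in stages)
    return f"{pipeline} > {shlex.quote(str(shadow_path(lang, dirname)))}"


print(analysis_command("sme", "admin"))
```

Printing the command instead of executing it keeps the sketch safe to try; the language-specific file names (bin/sme.fst, src/sme-dis.rle) are left hard-coded as in the notes.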
TODO:
- Look at the suggestion from Trond (Saara, discuss with Trond)
- ask for e-mail address as specified above (Trond)
6. Linguistics
Anything? Nothing.
7. Name lexicon infrastructure
Complex names
TODO:
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
- Saara has added the analyzer as part
XML format
TODO:
- update conversion from lexc to xml to reflect new xml format (Saara)
- mostly done, some open questions left
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface
- data synchronisation between risten.no and
- test whether eXist as editor is actually working well
More TODO:
- read and comment in the news group (all)
- decide upon and set up infra for new projects and project ideas
Definitions/terminology:
- synchronisation in our context is data synchronisation, that is, to
- code refactoring is the process of reorganising the code by moving general
8. Other
SGL Seminar
- SGL/normativity seminar
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- 10.4 introduced support for locales in the shell (10.3 and earlier didn't
- Test: the result of the last line should indicate whether this is a problem
- Is this a problem with ccat?
- It doesn't seem so (3 min and still counting)
- In the end, the bug turned up with ccat as well. I gave the command:
- zcorp/gt/sme/*/*xml
preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle --minimal | sort | less
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
To ccat's defence I must say that cat, in a similar situation, would have given far
preprocess file_name.txt - OK
cat file_name.txt | preprocess - bug!!
catxml file_name.xml | preprocess - ??
ccat filename | preprocess - bug!!
This bug isn't a high priority any more, because ccat behaves differently than
BUG: close as Won't fix. (Børre)
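The notes do not establish the root cause of the error. As an illustration only of what the message itself means: 0xC4 is the first byte of a two-byte UTF-8 sequence (for example "đ", C4 91) and 0xA1 is a continuation byte (for example the second byte of "á", C3 A1), so neither can be decoded on its own. One way such lone bytes can arise, assumed here and not confirmed for preprocess, is a piped byte stream being cut in the middle of a character. The sketch shows that failure and the usual remedy, an incremental decoder.

```python
import codecs

text = "Iđut"
data = text.encode("utf-8")  # "đ" (U+0111) becomes the two bytes C4 91

# Cut the byte stream in the middle of "đ", as a fixed-size read might.
cut = data.index(b"\xc4") + 1
chunks = [data[:cut], data[cut:]]


def decode_chunks_naively(chunks):
    """Decode every chunk on its own; fails on a split character."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_chunks_incrementally(chunks):
    """Carry incomplete byte sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


try:
    decode_chunks_naively(chunks)
except UnicodeDecodeError as err:
    print("naive decoding failed:", err.reason)

print(decode_chunks_incrementally(chunks))
```

This would be consistent with the observation that reading a file directly works while reading from a pipe does not, but that link remains an assumption.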
Bug fixing
32 open bugs (and 24 risten.no bugs)
- Add bug report for the Xerox backspace error (Trond)
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- convert nob and nno bible texts to be used as part of a parallel corpus, and
- convert smj NT to paratext
- close bug 211 as WONTFIX
- DONE :-)
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
- Fix the preprocess script and optimize it by building an analyzer
- finalize an improved working version of the CGI and command line scripts for
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzer.
- Look at crontab ga/ directory issue with Trond.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell and corpus infrastructure
- new proper name lexicon
- remove last part of complex names not used as simplex names
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- News group discussion followup.
- Do a bug report (if not done) on command-line behaviour in the Xerox tools.
- Ask for e-mail address for corpus upload script
- fix bugs!
10. Next meeting, closing
13.02.2006 09:30
Closed at 10:37