Meeting_2006-01-30
Meeting setup
- Date: 30.01.2006
- Time: 09:30 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:48.
Present: Børre, Sjur, Tomi, Trond
Absent: Maaren, Saara, Thomas
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
  - Not done
- Gather public texts, preferably also parallel ones
  - Not done
- Contact Odin editor (Ove Sæth) to ask for source (and parallel) documents
  - Done
- Continue converting text from input format to our xml
  - Not done
- review code and documentation for corpus xsl files under version control
  - Not done
- fix bugs!
  - Not done
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Convert the name lexicon from present format to xml for testing; final
  - done
- Refine language detection for Finnish
  - not done
- Finish the review of the hyphenation detection.
  - not done
- Review the handling of xsl-files in corpus infrastructure, including version
  - in progress
- Do some testing for bug
  - not done
- optimize the preprocess script
  - not done
- Write/update user documentation for the corpus usage in preparation for the
  - done
- finalize an improved working version of the CGI and command line scripts for
  - in progress
- xml2lexc update to handle complex names: construct entries like we have now
  - Tomi will do the xml2lexc script(?)
- update conversion from lexc to xml (proper names) with the latest
  - not done
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
  - not done
- Project seminar
  - plan and make schedule with Trond
    - done
  - check which hotels SD has an agreement with
    - done
  - plan XQuery/XSLT training session
    - done
  - the whole seminar is done
- Lule Sámi twol problems, with Thomas and Trond
  - not done
- follow up on voice group-chat not working to Sámediggi
  - Test Marratech when the new Marratech server is in place
    - not done
- project planning with Trond, continued
  - also look at the development processes - specification and testing
    - not done
- Follow up on place names from Norge Digitalt
  - write an e-mail to or call Bjørn Olav Megard
    - not done
- Evaluate SFST as speller (and analyzer) lexicon
  - more thorough analysis than was possible in Guovdageaidnu
    - not done
- write a background document on the corpus contracts
  - not done
- continue proper name lexicon work and discussion
  - done a lot at the seminar
- public tender:
  - review offer from Finnut Consult AS
    - done, as well as asked for two other offers, then picked one (Finnut), and
- smj G3 issue with Thomas and Trond
  - not done
- sme G3 issue with Thomas and Trond
  - not done
- call EDD/ Christian Emil Ore about national place name lexicon
  - not done
- risten.no/name lexicon development: fix bugs, continue development
  - done some, backed up and restarted the server; the backup isn't completely
- fix bugs!
Thomas
Tomi
- Aspell: Continue working on the affix file & aspell
  - Contact aspell author (UTF-8 thing)
    - Not done
- corpus infrastructure:
  - dtd location (both public and internal)
    - Not done
  - cgi-admin script for adding xsl-files
    - Not done
- Document aspell and corpus infrastructure
  - Not done
- Specification for new catxml in C++
  - install and announce new ccat tool
    - Done
- new proper name lexicon
  - remove last part of complex names not used as simplex names
    - Not done
  - start looking at conversion of the name lexicon from present format to xml
    - Not done
  - discuss the new lexicon format in the newsgroup
    - Not done
  - Look into synchronisation of proper names with risten.no
    - Not done
  - meeting to arrive at final xml format
    - Participated in the meeting
  - new version of xml2lexc (based on catxml, now ccat)
    - Not done
- hyphenation in corpus docs
  - Not done, but has Saara done it? Sjur: yes.
- comment review template made by Saara
  - Not done
- fix bugs!
  - Not done
- pick up backpacks after Xmas
  - Done
Trond
- Contact Odin editor (Ove Sæth) immediately to reopen contacts
  - Done.
- Project seminar
  - plan and make schedule with Sjur
    - Done.
  - check with Linda and Ilona whether we can start on Monday after lunch
    - Done.
- sign contract with Bibelselskapet for Norwegian parallel texts
  - Done.
- document corpus infrastructure, your part
- review corpus usage documentation (ccat)
  - Done.
- discuss the new lexicon format in the newsgroup
  - Done, but not extensively.
- smj G3 issue with Sjur and Thomas
  - Not done
- sme G3 issue with Sjur and Thomas
  - Not done
- fix bugs!
  - I think we made some progress on some of the number bugs, but we haven't
3. Documentation
Reviews
ccat review
Saara, Linda, Ilona, Trond
Conducted the review as part of the seminar, although Thomas and Maaren weren't
It works mostly as documented, a few glitches were found and corrected. The
  -a Print all text elements.
  -p Print plain paragraphs. (default)
  -T Print paragraphs with title type.
  -L Print paragraphs with list type.
  -t Print paragraphs with table type.
  -r <dir> Recursively process directory dir and subdirs encountered.
  -h Print this help message.
- (Trond:) I thought that -T printed titles, but it gave the same output as -a.
- (Tomi:) No, it gave you all regular paragraphs + titles, but no tables or
- (Sjur:) I would like a -v option, so we are able to identify which version to
The [xml-based documentation|/ling/catxml.html] was not completely
Other documentation
- Børre: Informed about the Forrest documentation: the documentation tree
4. Corpus gathering
Discussed briefly how the formalities should be implemented: signature, Websak,
Collecting
See the previous meeting memo for what's to be done.
TODO: Still a lot for Børre!
Odin
DONE: Trond, and then Børre to call Ove Sæth to re-establish
Sæth to discuss with colleagues about how to implement the cooperation.
Bible texts
ccat -t zcorp/gt/sme/bible/ot/1Mos_09-01.doc.xml | less
This gives everything.
- The first column should be suppressed
- The second column should be marked number or something
- The third column should be marked header if the typographic code in the first
sme$ ls zcorp/orig/sme/bible/ot/
1Mos_09-01.doc  Salmmat-_garvasat_0203.doc
There are two new books in the paratext format waiting in the nob orig and nno
TODO:
- write a paratext2xml converter
- convert smj NT to paratext
- ask to get fin and swe NT and OT in paratext format
We already have an embryonic converter: gt/script/testament.xsl
5. Corpus infrastructure
Task list:
- Include the xsl files under version control
  - RCS version control is almost finished, but an issue with access control is
- Incorporate language detection as part of the corpus processing (Saara)
  - Almost finished. Needs improved Finnish language model - presently it isn't
- we need to review whether only automatic hyphen detection is good enough, or
  - Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn't have one
Things are moving forward, but still more work to do. The list is left as is.
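The acceptance criterion for hyphen detection in the task list above (90% of all real hyphens correctly tagged) is a recall measure. The following is a minimal sketch of how such a check could be computed, assuming hyphen positions are available as sets of character offsets; the function names and the sample offsets are hypothetical and not part of the corpus tools.

```python
def hyphen_recall(real_hyphens, tagged_hyphens):
    """Return the fraction of real hyphens that were also tagged by the detector."""
    real = set(real_hyphens)
    if not real:
        # No real hyphens to find: treat as trivially complete.
        return 1.0
    return len(real & set(tagged_hyphens)) / len(real)


def is_acceptable(real_hyphens, tagged_hyphens, threshold=0.90):
    """Check the recall against the threshold from the memo (90%)."""
    return hyphen_recall(real_hyphens, tagged_hyphens) >= threshold


# Hypothetical data: offsets of real hyphens (manually reviewed) vs. tagged ones.
gold = {3, 17, 42, 58, 71, 90, 104, 120, 133, 150}
detected = {3, 17, 42, 58, 71, 90, 104, 120, 133, 999}

recall = hyphen_recall(gold, detected)
print(f"recall = {recall:.2f}, acceptable = {is_acceptable(gold, detected)}")
```

Note that this only measures missed hyphens; falsely tagged hyphens (offset 999 in the sample) would need a separate precision check.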
6. Linguistics
Nothing today, our linguists are on sick leave or not participating. For the
7. Name lexicon infrastructure
Complex names
Task list for this issue:
- make sure xml2lexc can handle complex names in ways compatible with our
  - the resulting file format should be identical to our present prop-name
- Saara has added the analyzer as part
XML format
Tasks:
- make a test lexicon for evaluating the format, set up the editing, and test it
  - Done
- update conversion from lexc to xml to reflect new xml format (Saara)
  - mostly done, some open questions left
- testing of conversion
- eXist as editor:
  - develop the needed XQueries and interface
  - synchronisation between risten.no and
  - test whether eXist as editor is actually working well
8. Other
SGL Seminar
- SGL/normativity seminar
  - all members = potentially/likely all languages
    - not all languages, only North Sámi
  - date? As early as possible, end of February/beginning of March
  - place? Maaren will investigate
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
  - utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
  - 10.4 introduced support for locales in the shell (10.3 and earlier didn't
  - Test: the result of the last line should indicate whether this is a problem
  - Is this a problem with ccat?
    - It doesn't seem so (3 min and still counting)
    - In the end, the bug turned up with ccat as well. I gave the command:
    - zcorp/gt/sme/*/*xml
preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle --minimal | sort | less
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
To ccat's defence I must say that cat, in a similar situation, would have given far
preprocess file_name.txt - OK
cat file_name.txt | preprocess - bug!!
catxml file_name.xml | preprocess - ??
ccat filename | preprocess - bug!!
This bug isn't a high priority any more, because ccat behaves differently than
BUG: close as Won't fix.
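For reference when the bug is revisited: a message like `utf8 "\xC4" does not map to Unicode` is typically what a strict UTF-8 decoder reports when it meets a byte sequence that is not valid UTF-8, for example a lone 0xC4, which is "Ä" in Latin-1. The memo does not establish that this is the cause here (the locale handling in 10.4 is also mentioned), so the following is only a diagnostic sketch in Python, not the preprocess script itself; the function name and the sample data are hypothetical.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line_number, offset, byte) for each line that fails strict UTF-8 decoding."""
    for lineno, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte within the line.
            yield lineno, err.start, line[err.start]


# Hypothetical input: one valid UTF-8 line followed by a Latin-1 encoded "Ä" (byte 0xC4).
sample = "Guovdageaidnu\n".encode("utf-8") + "Ä\n".encode("latin-1")

for lineno, offset, byte in find_invalid_utf8(sample):
    print(f"line {lineno}, offset {offset}: byte 0x{byte:02X} is not valid UTF-8")
```

Running a check like this on the output of each stage of the pipeline would show whether the offending bytes are already in the corpus file or are introduced when the text passes through a pipe.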
Bug fixing
30 open bugs (and 25 risten.no bugs)
Norwegian ispell press release
The i18n section of Skolelinux plans a press release including a paragraph
- This is the only source of proofing tools for Norwegian independent of Microsoft,
- A separate project at Sametinget (the Sámi Parliament) is under way to develop Sámi
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
- Do some testing for bug
- Fix the preprocess script and optimize it by building an analyzer
- finalize an improved working version of the CGI and command line scripts for
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzer.
- Change character coding detection to paragraph-based.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
  - Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
  - also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
  - write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
  - more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- continue proper name lexicon work and discussion
- public tender:
  - review offer from Finnut Consult AS
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/name lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
  - Contact aspell author (UTF-8 thing)
- corpus infrastructure:
  - dtd location (both public and internal)
  - cgi-admin script for adding xsl-files
- Document aspell and corpus infrastructure
- ccat: add a -v option - it should return the version of the tool
- new proper name lexicon
  - remove last part of complex names not used as simplex names
  - start looking at conversion of the name lexicon from present format to xml
  - discuss the new lexicon format in the newsgroup
  - Look into synchronisation of proper names with risten.no
  - new version of xml2lexc (based on catxml, now ccat)
    - xml2lexc update to handle complex names: construct entries like we have now
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- 3-part compounds with Sjur and Thomas.
- smj G3 issue with Sjur and Thomas.
- sme G3 issue with Sjur and Thomas.
- fix bugs!
10. Next meeting, closing
06.02.2006 09:30
Closed at 12:03