Meeting_2005-12-19
Meeting setup
- Date: 19.12.2005
- Time: 09:30 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:20.
Present: Børre, Maaren, Saara, Sjur, Thomas
Absent: Maaren, Saara, Tomi, Trond
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Gather public texts
- Not done
- Continue converting text from input format to our xml
- Done. Reported bugs in convert2xml.pl which Saara fixed.
- Document the corpus infrastructure
- Not done
- Ask Thor-Øivind to help us move bugzilla to our new project server
- ... and make the URL http://giellatekno.uit.no/bugzilla point to it (and
- He began this work on Friday afternoon.
- install new XXE and the new XXE Forrest config for Ilona
- not necessary, she doesn't use it
- divvun.no and giellatekno.uit.no:
- Binary files download area
- Done
- Continue the conversion to static site, using our own script.
- Done
- review code and documentation for corpus xsl files under version control
- Not done
- review the convert2xml.pl script, what works well and what doesn't.
- Done, see above
- comment review template made by Saara
- Not done
- fix bugs!
Maaren
- work with risten.no
- review corpus usage documentation, and the usage of the corpus
- discuss with relevant people regarding seminar on proofing tools, normativity
- SGL: have had their meeting - all of our issues have been handled, we're
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- document corpus infrastructure, your own parts
- done
- catxml review: make a template, and post it in the newsgroup
- done
- Language detection for the corpus files
- not done
- Implement addition of <hyph> tags to the converted corpus files
- done
- Do some testing for bug
- update the convert2xml script according to the comments
- done
- review corpus upload user interface
- done
- add version control of the corpus xsl files to the (upload)
- preliminary commands added (still in comments) and documentation
- xml2lexc update to handle complex names: construct entries like we have now
- update preprocess to handle inflected forms of complex names
- done, but needs to be optimized
- fix bugs!
Sjur
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- waiting for the new server
- project planning with Trond, continued
- also look at the development processes - specification and testing
- done some
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- not done
- discuss kvensk project support
- done
- proper names:
- discuss lexicon format
- some discussion with Trond
- discuss implementation with Tomi
- nope
- post a draft XML format for the name lexicon, based on the discussion above
- not yet
- write public tender documents
- worked on the selection criteria with Trond
- update the SD contract version
- done
- send SD corpus contract to lawyer
- done
- review corpus upload user interface
- done
- fix bugs!
- Other:
- made a test version of the Komi lexicon in eXist/risten.no (my local
- Tomi and I tried hard to debug the parallel editing bug in risten.no/eXist,
Thomas
- work on North Sámi compounding and derivation
- worked and still working
- review corpus usage documentation
- didn't quite understand it
- translate the normativity-issue on second-syllable deletion when compounding
- done
- smj G3 issue with Sjur and Trond
- nothing done this week
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- New catxml sources and Makefile are in cvs - gt/script/samiXMLParser/
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- comment review template made by Saara
- fix bugs!
- pick up backpacks after Xmas
Trond
- project planning with Sjur, continued
- Substantial progress on work with the planning tools during my Helsinki stay.
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Not done.
- document corpus infrastructure, your part
- Not done.
- update the SD contract version
- Done.
- send SD corpus contract to lawyer
- Done.
- review corpus upload user interface
- Had a look at it, but not filed anything yet.
- review corpus usage documentation
- Not given feedback.
- comment review template made by Saara
- Have read it.
- discuss the new lexicon format in the newsgroup
- Some done.
- fix bugs!
- Here my activity has been more subversive, as I have made Linda add more bug
- Other:
- Worked with disambiguation, with Komi and with Greenlandic. The outcome of
3. Documentation
Reviews
Add documentation on our corpus infrastructure and our corpus work in general
- For the users/linguists: What corpus are found, how do I use them (this
- catxml done, which is what is needed mostly. Do we need more?
- Review: Thomas, Maaren, Linda, Ilona, Trond
- For the collectors: How do I add texts, where do I add them, how do I
- we need a review of the web interface for corpus uploading - what is still
- Review: Sjur, Saara, Trond, Thomas
Review setup and reporting to be posted to the newsgroup, possibly a summary in
Deadline for reviews: by next meeting.
Corpus uploading review: closed.
Catxml review: this should be based on the new tool made by Tomi (C++
Tomcat->static HTML
Done. Cronscript doesn't work, requiring manual updates for now. Will be fixed
4. Corpus gathering
Contracts
Next step:
- Make our versions of the updated Helsinki contracts, and make sure they
- done
- send them to the SD lawyer and to the University lawyer through formal
- done
- possibly update contracts with remarks from lawyers
- start using them!
Contract 1 should have the main priority (contract 2 for Trond).
Deadline: This was finished last week.
Collecting
Børre has downloaded quite a few web pages and even sites. How and where to
5. Corpus infrastructure
Updated task list:
- Include the xsl files under version control
- RCS to be used, Saara to include it in the (upload) processing of new
- Saara has made a first draft, but the code is commented out - some bugs
- Incorporate language detection as part of the corpus processing (Saara)
- The tool needs better training material than was used initially.
- we need a way to deal with hyphenated documents (documents with (manually)
- Saara has made a hyphen detection script that tries to discriminate
- we need to review whether only automatic hyphen detection is good enough, or
- Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn't have one
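The acceptance criterion in the list above (90% of all real hyphens correctly tagged) is in effect a recall figure over a hand-checked sample. A minimal sketch of how it could be measured, assuming each hyphen in a sample document has been labelled manually as either "real" or "linebreak"; the function names, the label values and the sample data are hypothetical and are not taken from Saara's script:

```python
from typing import Sequence


def real_hyphen_recall(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of hand-labelled "real" hyphens that the detector also tagged "real"."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    real_total = sum(1 for g in gold if g == "real")
    if real_total == 0:
        raise ValueError("sample contains no real hyphens")
    real_hits = sum(
        1 for g, p in zip(gold, predicted) if g == "real" and p == "real"
    )
    return real_hits / real_total


def meets_threshold(gold: Sequence[str], predicted: Sequence[str],
                    threshold: float = 0.90) -> bool:
    """True if recall on real hyphens reaches the agreed threshold."""
    return real_hyphen_recall(gold, predicted) >= threshold


# Hypothetical sample: ten hyphens labelled by hand, and the detector's guesses.
gold = ["real"] * 8 + ["linebreak"] * 2
predicted = ["real"] * 7 + ["linebreak"] + ["linebreak", "real"]

print(real_hyphen_recall(gold, predicted))  # 0.875
print(meets_threshold(gold, predicted))     # False
```

Measuring only recall on real hyphens follows the wording of the criterion; whether false positives should also be bounded is left open here.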
6. Linguistics
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is finished with three-part compounds for North Sámi. On the negative
- compound shortening issue is waiting for the SGL to make a normative decision
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
- The Sámediggiráđđi is going to appoint new members before Christmas, the
- Contact the SGL when it is elected, in December, and ask them to arrange a
- seminar in February?
- TODO list:
- Maaren to discuss with relevant persons on this issue.
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
- Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
- Bug 77 - clarify whether háliid is acceptable - it is found, thus we
Lule Sámi
Open tasks:
- Derived G3 forms that look like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues are recently done in Lule Sámi, and it is more efficient to
Numeral treatment is at different levels in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
Summary: see the newsgroup
Complex names
Task list for this issue:
- find any unique second-parts (B-parts of names that do not exist in
- Postponed till next week
- remove these B-parts from the ordinary name file (Tomi)
- New York => York, Los Angeles =/=> *Angeles
- make sure xml2lexc can handle complex names in ways compatible with our
- Tomi is working on a C++ version of the xml2lexc tool. He will put it
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle uninflected (A + B)
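The first two tasks in the list above (finding B-parts that are not names in their own right, then removing them from the ordinary name file) can be sketched as a set difference. This is only an illustration under stated assumptions: complex names are taken to be space-separated strings, the B-part is taken to be the last token, and `attested_simplex` is a hypothetical, separately maintained set of names known to occur on their own. None of these names or data structures come from the actual name lexicon or from xml2lexc.

```python
from typing import Iterable, List, Set


def unique_b_parts(complex_names: Iterable[str],
                   attested_simplex: Iterable[str]) -> Set[str]:
    """Last parts of complex names that are not attested as names on their own."""
    b_parts = set()
    for name in complex_names:
        tokens = name.split()
        if len(tokens) > 1:  # only multi-part names have a B-part
            b_parts.add(tokens[-1])
    return b_parts - set(attested_simplex)


def prune_name_file(entries: Iterable[str],
                    complex_names: Iterable[str],
                    attested_simplex: Iterable[str]) -> List[str]:
    """Drop entries that exist only as the B-part of some complex name."""
    to_remove = unique_b_parts(complex_names, attested_simplex)
    return [entry for entry in entries if entry not in to_remove]


# Hypothetical data mirroring the example in the minutes.
complex_names = ["New York", "Los Angeles"]
attested_simplex = {"York", "Oslo"}
entries = ["Angeles", "Oslo", "York"]

print(sorted(unique_b_parts(complex_names, attested_simplex)))
# ['Angeles']
print(prune_name_file(entries, complex_names, attested_simplex))
# ['Oslo', 'York']
```

With this data York stays in the name file because it is attested as a simplex name, while Angeles is removed, matching the New York / Los Angeles example.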
XML format
Basis for the progress: separate documents according to project:
- Containing one common concept or name id field, plus Divvun/Disamb fields only
- Containing one common concept or name id field, plus kvensk fields only
- common document containing one common id field, plus fields common to several
We need a meeting to move forward. Suggested time: Wednesday 21, 9:30.
Tasks:
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
Xmas
Planned vacations and working days (two days of work, 4 hours each):
- Børre: will work on 27th and 28th, away 29th and 30th.
- Maaren: ?
- Sjur: no vacation, don't know what days working
- Thomas: no vacation, don't know what days working
- Tomi: no vacation, working in Finland offline during romjula (the days between Christmas and New Year) and online after
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Test: the result of the last line should indicate whether this is a problem
preprocess file_name.txt             OK
cat file_name.txt | preprocess       !!
catxml file_name.xml | preprocess    ??
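The error message in this bug names the byte 0xC4. As background only, and as an assumption about the possible cause rather than a diagnosis: 0xC4 is the lead byte of the two-byte UTF-8 encodings of č (U+010D) and đ (U+0111), so a reader that meets it without its continuation byte, or that receives Latin-1 input where 0xC4 is Ä, cannot decode it as UTF-8. The Python sketch below illustrates just that byte-level behaviour; it does not reproduce the Perl setup in preprocess, and the helper name is hypothetical.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


c_caron = "\u010d".encode("utf-8")   # c with caron, common in North Sámi text
print(c_caron)                       # b'\xc4\x8d'
print(decodes_as_utf8(c_caron))      # True: complete two-byte sequence

# The same character cut off after its lead byte, e.g. at a buffer boundary.
print(decodes_as_utf8(c_caron[:1]))  # False

# A-umlaut stored as Latin-1 is the single byte 0xC4, also invalid as UTF-8.
print(decodes_as_utf8("\u00c4".encode("latin-1")))  # False
```

If one of the three test commands fails only when input arrives through a pipe, checking how standard input is decoded in that case would be one place to look.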
Bug fixing
28 open bugs (and 25 risten.no bugs)
Move Bugzilla
Move Bugzilla to our new server when it arrives, and make it work at both the
9. Summary, task list
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus infrastructure
- review code and documentation for corpus xsl files under version control
- review the convert2xml.pl script, what works well and what doesn't.
- comment review template made by Saara
- proper names meeting to arrive at final xml format
- fix bugs!
Maaren
- work with risten.no
- find decisions/documents regarding syllable shortening in compounds in the
- review corpus usage documentation, and the usage of the corpus
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- document corpus infrastructure, your own parts
- catxml review: make a template, and post it in the newsgroup
- Language detection for the corpus files
- Implement addition of <hyph> tags to the converted corpus files
- Do some testing for bug
- update the convert2xml script according to the comments
- review corpus upload user interface
- add version control of the corpus xsl files to the (upload) processing
- xml2lexc update to handle complex names: construct entries like we have now
- update preprocess to handle inflected forms of complex names
- fix bugs!
Sjur
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- proper names meeting to arrive at final xml format
- write public tender documents
- review corpus upload user interface
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- cgi-admin script for adding xsl-files
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- install and announce new ccat tool
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- meeting to arrive at final xml format
- new version of xml2lexc (based on catxml, now ccat)
- hyphenation in corpus docs
- comment review template made by Saara
- fix bugs!
- pick up backpacks after Xmas
Trond
- project planning with Sjur, continued
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- document corpus infrastructure, your part
- review corpus upload user interface
- review corpus usage documentation
- comment review template made by Saara
- discuss the new lexicon format in the newsgroup
- proper names meeting to arrive at final xml format
- fix bugs!
10. Next meeting, closing
02.01.2006 09:30
Closed at 11:18