Meeting_2006-01-02
Meeting setup
- Date: 02.01.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:55.
Present: Børre, Maaren, Saara, Sjur, Thomas, Trond
Absent: Tomi (sick leave until Jan. 17th)
Main secretary: Sjur
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Gather public texts
  - Not done
- Continue converting text from input format to our xml
  - Some done
- Document the corpus infrastructure
  - Not done
- review code and documentation for corpus xsl files under version control
  - Not done
- review the convert2xml.pl script, what works well and what doesn't.
  - Done
- comment review template made by Saara
  - Not done
- proper names meeting to arrive at final xml format
  - Done
- fix bugs!
Maaren
- work with risten.no
  - not done, I have been very, very lazy
- review corpus usage documentation, and the usage of the corpus
  - not done
- discuss with relevant people regarding seminar on proofing tools, normativity
  - discussed with Laila (SGL) about seminar
Saara
- Look at the corpus infrastructure issue
- Convert the name lexicon from present format to xml
  - not done
- discuss the new lexicon format in the newsgroup
- Language detection for the corpus files
  - almost done
- Implement addition of <hyph> tags to the converted corpus files
  - done
- Do some testing for bug
  - not done
- update the convert2xml script according to the comments
  - some xsl-handling still needed
- add version control of the corpus xsl files to the (upload) processing
  - RCS access things have to be solved first.
- xml2lexc update to handle complex names: construct entries like we have now
  - not done
- update preprocess to handle inflected forms of complex names
  - not yet optimal
- fix bugs!
Sjur
- Lule Sámi twol problems, with Thomas and Trond
  - nothing
- follow up on voice group-chat not working to Sámediggi
  - Test Marratech when the new Marratech server is in place
    - waiting for the server
- Follow up on place names from Norge Digitalt
  - write an e-mail to or call Bjørn Olav Megard
    - last info: he has not forgotten it, but has not succeeded in bringing it up
- Evaluate SFST as speller (and analyzer) lexicon
  - more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
  - not done
- proper names meeting to arrive at final xml format
  - conducted the meeting, important progress made, and results posted to the
- write public tender documents
  - Selection criterion document updated and sent to the project board - almost
- fix bugs!
- other:
  - worked more with risten.no/eXist and the parallel editing bug - still no
Thomas
- work on North Sámi compounding and derivation
  - worked and still working
- review corpus usage documentation
  - begun
- smj G3 issue with Sjur and Trond
  - not anything this week
Tomi
On sick leave.
Trond
- project planning with Sjur, continued
  - Not after last meeting
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
  - Not done.
- document corpus infrastructure, your part
  - documented tags only.
- review corpus usage documentation
  - Read things
- comment review template made by Saara
  - Not done
- discuss the new lexicon format in the newsgroup
  - Done.
- proper names meeting to arrive at final xml format
  - Participated.
- fix bugs!
3. Documentation
Reviews
Add documentation on our corpus infrastructure and our corpus work in general
- For the users/linguists: What corpora are found, how do I use them (this
  - catxml done, which is what is needed mostly. Do we need more?
  - Review: Thomas, Maaren, Linda, Ilona, Trond
Review setup and reporting to be posted to the newsgroup, possibly a summary in
Deadline for reviews: by next meeting.
Catxml review: this should be based on the new tool made by Tomi (C++
Update: Tomi is on sick leave, and Saara will make the tool available in
4. Corpus gathering
Contracts
Next step:
- wait for comments from the lawyers - remind them of the task?
- possibly update contracts with remarks from lawyers
- start using them!
Collecting
Nothing new, now hampered by the lawyers checking the final version of the
We want both parallel and erroneous (unproofed) text files. Børre to
5. Corpus infrastructure
Updated task list:
- Include the xsl files under version control
  - RCS version control is almost finished, but an issue with access control is
- Incorporate language detection as part of the corpus processing (Saara)
  - Almost finished. Some heuristics regarding other Sámi languages in the same
- we need a way to deal with hyphenated documents (documents with (manually)
  - done, needs review (Saara)
- we need to review whether only automatic hyphen detection is good enough, or
  - Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn't have one
  - Saara will review the existing code, consult Tomi, and try to make
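The acceptance criterion for hyphen detection (90% of all real hyphens correctly tagged) can be read as recall over the real hyphens. A minimal sketch in Python, assuming the real hyphens come from a manually checked sample and that both sets are given as character offsets. The function names and the sample offsets are hypothetical and are not part of the corpus tools.

```python
def hyphen_recall(real_hyphens, tagged_hyphens):
    """Return the share of real hyphens that the tagger also marked."""
    real = set(real_hyphens)
    if not real:
        raise ValueError("no real hyphens in the sample")
    return len(real & set(tagged_hyphens)) / len(real)


def is_acceptable(real_hyphens, tagged_hyphens, threshold=0.90):
    """Check the recall against the threshold (0.90 per the criterion above)."""
    return hyphen_recall(real_hyphens, tagged_hyphens) >= threshold


# Hypothetical offsets: ten real hyphens, nine found, one false hit.
real = [12, 40, 77, 103, 150, 198, 240, 301, 355, 402]
tagged = [12, 40, 77, 103, 150, 198, 240, 301, 355, 999]

print(hyphen_recall(real, tagged))
print(is_acceptable(real, tagged))
```

Note that this measure ignores false hits (the offset 999 above); whether those should count against the tagger is a separate decision.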
6. Linguistics
North Sámi
- three-part compounds issue still open
  - Thomas is finished with three-part compounds for North Sámi.
    - compilation time has increased to 10 minutes for the sme parser
    - it overgenerates wildly. This is bad for several of our applications.
  - compound shortening issue is waiting for the SGL to make a normative decision
    - we have to wait until SGL's meeting protocol (???) is ready
  - linguistic analysis/discussion to continue in the newsgroup
  - we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
    - Laila has informed the members of the SGL about this. This is very okay -
  - Timetable:
    - The Sámediggiráđđi is going to appoint new members in January
    - Contact the SGL when it is elected, and ask them to arrange a
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
  - TODO:
    - Trond, Thomas and Sjur to discuss this this month.
    - Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
    - Bug 77 - clarify whether háliid is acceptable - it is found, thus we
Lule Sámi
Open tasks:
- Derived G3 forms that look like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues were recently done in Lule Sámi, and it is more efficient to
Numeral treatment is on a different level in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
Summary: see the newsgroup
Complex names
Task list for this issue:
- find any unique second-parts (B-parts of names that do not exist in
  - No unique B-parts found, nothing removed. Task closed.
- make sure xml2lexc can handle complex names in ways compatible with our
  - Tomi is working on a C++ version of the xml2lexc tool. He will put it
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle uninflected (A + B)
XML format
We had our meeting, and the result was pretty close to the structure of the
Tasks:
- update conversion from lexc to xml to reflect new xml format
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
  - develop the needed XQueries and interface
  - synchronisation between risten.no and
  - test whether eXist as editor is actually working well
8. Other
Seminars
- SGL/normativity seminar
  - all members = potentially/likely all languages
  - date? As early as possible, but not likely before the end of February
  - place? Maaren will investigate
- project meeting
  - date: soon, (third)/last week of January? Last week best for Maaren
  - place: Tromsø?
- we'll return to these next week, and make decisions about our own meeting
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
  - utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
  - Test: the result of the last line should indicate whether this is a problem

    preprocess file_name.txt                OK
    cat file_name.txt | preprocess          !!
    catxml file_name.xml | preprocess       ??
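The error message is consistent with a lone byte 0xC4 reaching a UTF-8 decoder. In UTF-8, 0xC4 is the lead byte of a two-byte sequence (č, for instance, is C4 8D), so on its own it maps to no character. Whether this is the actual cause of Bugzilla #211 is an assumption. The Python sketch below only illustrates the decoding rule and is not the preprocess script; the helper name try_decode is hypothetical.

```python
def try_decode(raw: bytes) -> str:
    """Decode bytes as UTF-8, reporting a failure instead of raising."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"decode error at byte {err.start}: {err.reason}"


complete = "č".encode("utf-8")  # two bytes: 0xC4 0x8D
truncated = complete[:1]        # lone lead byte 0xC4

print(try_decode(complete))     # decodes to the original letter
print(try_decode(truncated))    # reports a decode error

# The same single byte is a valid character in Latin-1 (Ä), so input in
# the wrong encoding is another possible source of such a message.
print(truncated.decode("latin-1"))
```

If the piped commands split multi-byte characters across reads, or if the data is not UTF-8 at all, a decoder could see exactly this kind of isolated lead byte.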
Bug fixing
28 open bugs (and 25 risten.no bugs)
Move Bugzilla
Bugzilla now works at the old URL http://giellatekno.uit.no/bugzilla/.
9. Summary, task list
Børre
- Gather public texts, preferably also parallel ones
- Contact Odin editor to ask for source (and parallel) documents
- Continue converting text from input format to our xml
- Document the corpus infrastructure
- review code and documentation for corpus xsl files under version control
- review the convert2xml.pl script, what works well and what doesn't.
- comment review template made by Saara
- proper names meeting to arrive at final xml format
- fix bugs!
Maaren
- work with risten.no
- review corpus usage documentation, and the usage of the corpus
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Look at the corpus infrastructure issue
- Convert the name lexicon from present format to xml and test it
- discuss the new lexicon format in the newsgroup
- Language detection for the corpus files
- Review the hyphenation detection.
- Catxml review: look at the ccat tool
- Review the handling of xsl-files in corpus infrastructure
- Do some testing for bug
- add version control of the corpus xsl files to the (upload) processing
- xml2lexc update to handle complex names: construct entries like we have now
- update preprocess to handle inflected forms of complex names
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
  - Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
  - also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
  - write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
  - more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- continue proper name lexicon work and discussion
- write public tender documents
- review corpus upload user interface
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
  - Contact aspell author (UTF-8 thing)
- corpus infrastructure:
  - dtd location (both public and internal)
  - cgi-admin script for adding xsl-files
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
  - install and announce new ccat tool
- new proper name lexicon
  - remove last part of complex names not used as simplex names
  - start looking at conversion of the name lexicon from present format to xml
  - discuss the new lexicon format in the newsgroup
  - Look into synchronisation of proper names with risten.no
  - meeting to arrive at final xml format
  - new version of xml2lexc (based on catxml, now ccat)
- hyphenation in corpus docs
- comment review template made by Saara
- fix bugs!
- pick up backpacks after Xmas
Trond
- Follow up the lawyer treatment of the contracts
- project planning with Sjur, continued
- smj G3 issue with Sjur and Thomas
- sme G3 issue with Sjur and Thomas
- document corpus infrastructure, your part
- review corpus usage documentation
- discuss the new lexicon format in the newsgroup
- fix bugs!
10. Next meeting, closing
09.01.2006 09:30
Closed at 11:43