Meeting_2006-01-16
Meeting setup
- Date: 16.01.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:45.
Present: Børre, Maaren, Saara, Sjur, Trond
Absent: Thomas, Tomi (sick leave until Jan. 17th)
Main secretary: Trond
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Gather public texts, preferably also parallel ones
- Contact Odin editor (Ove Sæth) to ask for source (and parallel) documents
- Waiting for Trond
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- make an XML test lexicon for our new name lexicon; format is based on the
- Some work done
- fix bugs!
Maaren
- work with risten.no
- done
- review corpus usage documentation, and the usage of the corpus
- not done, do not know where to find it. Can't read my emails... sorry...
- discuss with relevant people regarding seminar on proofing tools, normativity
- done. Possible in February or at the beginning of March
Saara
- Convert the name lexicon from present format to xml for testing; final
- Script is ready, the test conversion will be done today.
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- not done
- Finish the review of the hyphenation detection.
- not done
- Review the handling of xsl-files in corpus infrastructure, including version
- work in progress. the updating xsl-files and cgi-infrastructure will be
- Do some testing for bug
- not done
- xml2lexc update to handle complex names: construct entries like we have now
- not done
- update preprocess to handle inflected forms of complex names
- done but needs to be optimized.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- done
- Plan the forthcoming seminar
- nothing but the dates and place
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- continue proper name lexicon work and discussion
- public tender:
- received offer for handling the whole process from Finnut Consult AS
- follow up new server - it's not delivered yet
- it was delivered Friday afternoon
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- tried calling him several times, but he has been ill, I've started on an
- risten.no/name lexicon development: fix bugs, continue development
- fix bugs!
- various:
- commented a letter from Hallgeir Varsi to SD, regarding the amount of
- wrote monthly reports for November, December
Thomas
Tomi
Trond
- Contact Odin editor (Ove Sæth) to reopen contacts
- Forgot it.
- Plan the forthcoming seminar
- Private thoughts, yes, must now flesh them out.
- sign contract with Bibelselskapet for Norwegian parallel texts
- Not done.
- project planning with Sjur, continued
- Not done.
- smj G3 issue with Sjur and Thomas
- Not done.
- sme G3 issue with Sjur and Thomas
- Not done.
- document corpus infrastructure, your part
- Not done.
- review corpus usage documentation (ccat)
- Not done.
- discuss the new lexicon format in the newsgroup
- Not done.
- fix bugs!
- Not done.
- This revision was not too impressive; instead I have worked on disambiguation.
3. Documentation
Reviews
Add documentation on our corpus infrastructure and our corpus work in general
- For the users/linguists: What corpora are found, how do I use them (this
- catxml done, which is what is needed mostly. Do we need more?
- Review: Thomas, Maaren, Linda, Ilona, Trond
Saara will update the user documentation, and add new if necessary. We will
Findings so far:
4. Corpus gathering
Contracts
Ready - start using them!
Collecting
List of people/organisations/companies to contact to be found in an old meeting
memo. Based
- Anders Kintel (Børre)
- Newspaper text:
- Sámi Instituhtta's (for the old archive of Min Áigi and Áššu) (Børre)
- Áššu has been making a CD since the end of May, there should be a pile
- Min Áigi (Børre)
- Commercially published texts
- Iđut and key authors there (Børre)
- Davvi Girji and key authors there (Børre)
- Author organisations' meetings (Børre)
- Key authors one by one
- (list of author names) Kerttu Vuolab, Kirsi Paltto, ...
List of texts with lower priority (to be gathered when the above list is
- the Sámi municipalities,
- Authors with smaller production
- Textbooks
TODO: a lot for Børre!
Odin
We want both parallel and erroneous (unproofed) text files. What we need is a
TODO: Trond, and then Børre to call Ove Sæth to re-establish
Bible texts
We have received Norwegian texts and a contract draft from Bibelselskapet, in
TODO: Trond will accept the contract as is, and then negotiate a separate
5. Corpus infrastructure
Task list:
- Include the xsl files under version control
- RCS version control is almost finished, but an issue with access control is
- Incorporate language detection as part of the corpus processing (Saara)
- Almost finished. Needs improved Finnish language model - presently it isn't
- we need to review whether only automatic hyphen detection is good enough, or
- Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn't have one
- Saara will review the existing code, consult Tomi, and try to make
We will have a major review of all these things next week.
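The acceptance criterion for hyphen detection in the task list above (90% of all real hyphens correctly tagged) is a recall figure. A minimal sketch of how it could be checked against a hand-annotated sample follows. The function names, the representation of hyphens as sets of character offsets, and the sample numbers are assumptions made for illustration; none of this is part of the existing corpus tools.

```python
def hyphen_recall(gold, tagged):
    """Share of real (gold) hyphen positions that the detector also tagged."""
    if not gold:
        raise ValueError("gold set is empty; recall is undefined")
    return len(gold & tagged) / len(gold)


def is_acceptable(gold, tagged, threshold=0.90):
    """True when recall reaches the threshold noted in the memo (90%)."""
    return hyphen_recall(gold, tagged) >= threshold


if __name__ == "__main__":
    # Hypothetical character offsets of real hyphens in a test text.
    gold = {12, 40, 77, 103, 150, 188, 230, 251, 300, 342}
    # Hypothetical detector output: one real hyphen missed, one false hit.
    tagged = {12, 40, 77, 103, 150, 188, 230, 251, 300, 999}
    print(f"recall = {hyphen_recall(gold, tagged):.2f}")
    print("acceptable" if is_acceptable(gold, tagged) else "not acceptable")
```

Note that recall alone does not penalise false hits (offset 999 in the sample), so a precision figure may be worth reporting alongside it.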
6. Linguistics
North Sámi
- three-part compounds issue still open
- Thomas is finished with three-part compounds for North Sámi.
- compilation time has increased to 10m for the sme parser
- it overgenerates wildly. This is bad for several of our applications.
- compound shortening issue is waiting for the SGL to make a normative decision
- we have to wait until SGL's meeting protocol (???) is ready
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
- Laila has informed the members of the SGL about this. This is very okay -
- Timetable:
- The Sámediggiráđđi is going to appoint new members in January
- not done yet, perhaps this week
- Contact the SGL when it is elected, and ask them to arrange a
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
- Trond, Thomas and Sjur to discuss this during this month.
- Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
- Bug 77 - clarify whether háliid is acceptable - it is found, thus we
Lule Sámi
Open tasks:
- Derived G3 forms that look like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues were recently done in Lule Sámi, and it is more efficient to
Numeral treatment is on a different level in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
Complex names
Task list for this issue:
- make sure xml2lexc can handle complex names in ways compatible with our
- Tomi is working on a C++ version of the xml2lexc tool. He will put it
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle uninflected (A + B)
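For readers unfamiliar with the complex-name problem, the sketch below shows one simple way a tokenizer can keep a known multi-part name together as a single token, by longest match against a list of names. It is an illustration under our own assumptions: the function, the name list and the example sentence are hypothetical, it does not reproduce the actual preprocess script, and it deliberately does not address inflected forms.

```python
def tokenize_with_names(text, names, max_parts=3):
    """Split on whitespace, but keep known multi-word names as one token."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so "A B C" wins over "A B".
        for n in range(min(max_parts, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in names:
                tokens.append(candidate)
                i += n
                break
        else:
            # No multi-word name starts here: emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    # Hypothetical name list; the entries are names that occur in this memo.
    names = {"Davvi Girji", "Min Áigi"}
    print(tokenize_with_names("Davvi Girji ja Min Áigi", names))
```

The sketch matches surface strings only and ignores punctuation, so an inflected second part or a trailing comma would not match.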
XML format
We had our meeting, and the result was pretty close to the structure of the
Tasks:
- make a test lexicon for evaluating the format, set up the editing, and test it
- update conversion from lexc to xml to reflect new xml format (Saara)
- mostly done, some open questions left
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
SGL Seminar
- SGL/normativity seminar
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate
Divvun/Disamb Seminar in Tromsø
- project meeting
- date: 23. (after lunch) - 27. (Friday is travelling day) January.
- place: Tromsø
- still too many open questions regarding place+date, to be determined in a
Maaren is able to attend Monday morning and Tuesday (all day)
Practical arrangements:
- room(s): One big room for all, with internet access, and at least one
- lunch & coffee breaks, think of how to arrange this.
- hotel = Grand Hotel/Polar Hotel, Grønnegata, Scandic is close to the
- Sjur will check which ones are acceptable to SD
- these need rooms at the hotel: Ilona, Saara, Linda, Sjur, Maaren
Suggested content for project meeting:
- Common
- Presentation, kick-off
- Project updates
- Project milestones
- Cooperation evaluation (practical/daily cooperation)
- project management
- programmer work
- linguistic work
- anything else?
- Divvun
- Project milestones
- Evaluation/feedback
- public tender meeting
- Disamb
- Project milestones
- Tagsets (Evaluating and fixing (?) the tagset for the disambiguator)
- Linguists
- G3 smj/sme
- Compilation time and twol ruleset
- Numeral project
- spelling errors (esp. in risten.no)
- Routines for addition to lexicon, cooperation wrt. new corpus texts coming in
- Programmers
- Proper names:
- xml
- eXist/editing
- Work distribution (in bug fixing, documentation, maintenance etc.)
- Parallel corpora and the corpus infrastructure
- xml2lexc implementation plan
Teaching sessions (list of free thoughts and personal frustrations).
- How to use the xml corpora (all? more like a group review session - 1h)
- Use of Xerox tools (t: Trond; p: RTFW)
- Twig (t: Saara; p: Trond)
- XQuery/XSL (t: Sjur/Tomi(?); p: Trond, Børre, Saara)
- xml webapp development, security and session integrity (group work: p: Sjur,
- file-specific xsl scripts for corpus conversions, the what, how and who issues
Final schedule to be worked out by Trond and Sjur (Tuesday 10 AM?).
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- 10.4 introduced support for locales in the shell (10.3 and earlier didn't
- Test: the result of the last line should indicate whether this is a problem
preprocess file_name.txt                OK
cat file_name.txt | preprocess          !!
catxml file_name.xml | preprocess       ??
After some heavy investigation, we got no further. There is no difference
One new insight though: both cat and print give the same errors, thus
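As background to the error message quoted above: a strict UTF-8 decoder rejects the byte 0xC4 when it is not followed by a valid continuation byte. The same byte is the letter Ä in ISO-8859-1 (Latin-1), and it is also the lead byte of the UTF-8 encoding of Sámi letters such as č (C4 8D) and đ (C4 91). The Python sketch below only illustrates that general mechanism. It is not a diagnosis of Bugzilla #211, the helper name is hypothetical, and whether Latin-1 data or a split byte sequence is involved in our case is an open assumption.

```python
def decode_report(raw: bytes) -> str:
    """Describe how a byte string behaves under strict UTF-8 decoding."""
    try:
        return "utf-8 ok: " + raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Latin-1 maps every byte to a character, so this fallback never fails.
        return (f"utf-8 failed at byte {err.start}; "
                f"latin-1 reads: {raw.decode('latin-1')}")


if __name__ == "__main__":
    print(decode_report("č".encode("utf-8")))  # complete sequence C4 8D
    print(decode_report(b"\xc4"))              # lead byte with nothing after it
    print(decode_report(b"\xc4 "))             # lead byte followed by a space
```

If the input handed to preprocess through a pipe were cut between the two bytes of such a letter, or were Latin-1 rather than UTF-8, a decoder would fail in this way; checking which of the two applies is left to the bug investigation.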
Bug fixing
29 open bugs (and 25 risten.no bugs)
C implementation of preprocess.pl
Do we want to have a C/C++ implementation for speed reasons? Is it going to be
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Contact Odin editor (Ove Sæth) to ask for source (and parallel) documents
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Convert the name lexicon from present format to xml for testing; final
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
- Do some testing for bug
- optimize the preprocess script
- Write/update user documentation for the corpus usage in preparation for the
- finalize an improved working version of the CGI and command line scripts for
- xml2lexc update to handle complex names: construct entries like we have now
- update conversion from lexc to xml (proper names) with the latest refinements
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Project seminar
- plan and make schedule with Trond
- check which hotels SD has an agreement with
- plan XQuery/XSLT training session
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- check for SD feedback on the last two contracts (2 & 3)
- continue proper name lexicon work and discussion
- public tender:
- review offer from Finnut Consult AS
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/name lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- cgi-admin script for adding xsl-files
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- install and announce new ccat tool
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- meeting to arrive at final xml format
- new version of xml2lexc (based on catxml, now ccat)
- hyphenation in corpus docs
- comment review template made by Saara
- fix bugs!
- pick up backpacks after Xmas
Trond
- Contact Odin editor (Ove Sæth) immediately to reopen contacts
- Project seminar
- plan and make schedule with Sjur
- check with Linda and Ilona whether we can start on Monday after lunch
- sign contract with Bibelselskapet for Norwegian parallel texts
- document corpus infrastructure, your part
- review corpus usage documentation (ccat)
- discuss the new lexicon format in the newsgroup
- smj G3 issue with Sjur and Thomas
- sme G3 issue with Sjur and Thomas
- fix bugs!
10. Next meeting, closing
30.01.2006 09:30
Closed at 11:43