Meeting_2005-12-12
Meeting setup
- Date: 12.12.2005
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:47.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... or make the URL http://giellatekno.uit.no/bugzilla point to the present
- He is back from a journey today, I'll contact him
- install new XXE and the new XXE Forrest config for Ilona
- Haven't seen her online
- divvun.no and giellatekno.uit.no:
- Binary files download area
- Done
- Continue the conversion to static site, using our own script.
- The script is done, has to be put into cron and Thor-Øivind will have to
- corpus xsl files under version control
- Not done
- review the convert2xml.pl script, what works well and what doesn't.
- Some done
- comment review template made by Saara
- Not done, but it was excellent!
- fix bugs!
- Fixed bug 227
- Cleaned up the Lule Sámi New Testament, added it to the corpus and sent it to
Maaren
- work with risten.no
- some done
- find decisions/documents regarding syllable shortening in compounds in the
- not done (didn't find any documents)
- review corpus usage documentation
- ?? More today
- discuss with relevant people regarding seminar on proofing tools, normativity
- discussed with SGL (Laila)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert the name lexicon from present format to xml
- pending
- discuss the new lexicon format in the newsgroup
- not done
- document corpus infrastructure, your own parts
- not done
- corpus infrastr. user docu review: make a template, and post it in the
- done for the web interface, other parts missing
- Language detection for the corpus files
- work in progress
- Implement addition of <hyph> tags to the converted corpus files
- the script is ready after some testing with documents that really contain
- Do some testing for bug
- not done
- update the convert2xml script according to the comments
- done, needs to be documented
- review corpus upload user interface
- done some, I think Tomi has to do some necessary fixes before the real review.
- corpus xsl files under version control
- not done
- xml2lexc update to handle complex names: construct entries like we have now
- not done
- update preprocess to handle inflected forms of complex names
- Not done
- fix bugs!
- Done some
- Make preprocess faster
- Is now almost as fast as can be with Perl. Should be fast enough, though.
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- nothing
- follow up on voice group-chat not working to Sámediggi
- Marratech: Sámediggi has bought a new server for this, but it is very unclean
- project planning with Trond, continued
- also look at the development processes - specification and testing
- nothing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Finally received an answer - he had forgotten about it
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- nothing
- write a background document on the corpus contracts
- nothing
- discuss kvensk project support
- nothing
- proper names:
- discuss lexicon format
- nothing
- discuss implementation with Tomi
- nothing
- write public tender documents
- nothing
- hyphenation in corpus docs
- discussed in news group
- buy:
- new computer
- done (more further down)
- Investigate the never-arriving backpacks
- done, they're all in Guovdageaidnu
- update the SD contract version
- not done
- send SD corpus contract to lawyer
- not done
- review corpus upload user interface
- this week
- comment review template made by Saara
- commented Saara's proposal
- fix bugs!
- commented some
Thomas
- work on North Sámi compounding and derivation
- worked and still working
- review corpus upload user interface
- done
- review corpus usage documentation
- häh?! More on this today.
- fix bugs!
- presented a solution to Bug 77
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- Document aspell and corpus infrastructure
- Not done
- Specification for new catxml in C++
- this includes also placing the source and binary
- Not done
- new proper name lexicon
- remove last part of complex names not used as simplex names
- Started
- start looking at conversion of the name lexicon from present format to xml
- Nothing
- discuss the new lexicon format in the newsgroup
- Nothing
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- Nothing
- corpus xsl files under version control
- not done
- comment review template made by Saara
- Haven't really got anything to say about it, other than that it is good :)
- fix bugs!
- Fixed some
Trond
- Look at the three-part compound issue
- Not done
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- Not done
- project planning with Sjur, continued
- No serious planning.
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Not done
- document corpus infrastructure, your part
- Not written documentation
- update the SD contract version
- Not done
- send SD corpus contract to lawyer
- Not done
- review corpus upload user interface
- Not done
- review corpus usage documentation
- Not done
- comment review template made by Saara
- Not done
- discuss the new lexicon format in the newsgroup
- fix bugs!
- Other:
- KUNSTI meeting, disambiguation, corpus format. webadr.fst.
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- For the users/linguists: Which corpora exist, how do I use them (this
- catxml done, which is what is needed mostly. Do we need more?
- we need a review: Thomas, Maaren, Linda, Ilona, Trond
- For the collectors: How do I add texts, where do I add them, how do I
- we need a review of the web interface for corpus uploading - what is still
- Review: Sjur, Saara, Trond, Thomas
Review setup and reporting to be posted to the newsgroup, possibly a summary in
Saara has made a review template, final version ready today.
Deadline for reviews: by next meeting.
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
Tomcat->static HTML progress
Almost done. Thor-Øivind is back now, and will help with the last URL fixes.
Deadline: in operation by next meeting.
4. Corpus gathering
Contracts
Next step:
- Make our versions of the updated Helsinki contracts, and make sure they
- send them to the SD lawyer and to the University lawyer through formal
Contract 1 should have the main priority (contract 2 for Trond).
Deadline: This should be finished this week.
Collecting
Børre has completed the HTML version of the Lule Sámi New Testament.
Trond discussed with the editor of Odin, we will get a direct contact with
5. Corpus infrastructure
Updated task list:
- Include the xsl files under version control
- RCS to be used, Saara to include it in the (upload) processing of new
- Incorporate language detection as part of the corpus processing (Saara)
- The tool needs better training material.
- we need a way to deal with hyphenated documents (documents with (manually)
- Saara has made a hyphen detection script that tries to discriminate
Examples of false positives (hard cases) - these should not be converted:
  teknihkalaš<hyph>luonddudieđalaš
  Norplus<hyph>prográmma
  Mjøs<hyph>lávdegotti
  norgga<hyph>ruoŧa
  dánska<hyph>norgalaš
Precision, Recall, a repetition
- False positives: Hyphens that should be kept as is
- False negatives: Soft hyphens not recognised.
tp = true positives, fp = false positives
tn = true negatives, fn = false negatives
P = tp / (tp + fp) and R = tp / (tp + fn)
P = (number of real hyphens detected) / (number of hyphens found)
R = (number of real hyphens detected) / (number of real hyphens in the text)
If we pick only hyph at line end, then the number of false positives will drop.
- we need to review whether only automatic hyphen detection is good enough, or
- Acceptable results: 90% of all real hyphens correctly tagged.
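The precision and recall definitions above can be checked with a few lines of Python. This is only a sketch: the function name `precision_recall` and the counts are made up for illustration and are not results from the corpus, and reading the 90% target as a threshold on recall is an assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from hyphen tagging counts.

    tp: real soft hyphens that were tagged
    fp: hyphens that were tagged but should have been kept as is
    fn: real soft hyphens that were not recognised
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=90, fp=5, fn=10)
print(f"P = {p:.3f}, R = {r:.3f}")
# Assumption: "90% of all real hyphens correctly tagged" means R >= 0.90.
print("meets the 90% target:", r >= 0.90)
```

Note that tn (hyphens correctly left alone) does not enter either measure.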
6. Linguistics
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is finished with three-part compounds for North Sámi. On the negative
- the exact rules for when shortening happens should be documented (Maaren
- none were found
- Nickel has something, but our corpus, an article by Pekka Sammallahti
- meeting in the SGL this week (old members) - Thomas to send the compound
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
- The Sámediggiráddi is going to appoint new members before Christmas, the
- Contact the SGL when it is elected, in December, and ask them to arrange a
- seminar in February?
- TODO list:
- Maaren to discuss with relevant persons on this issue.
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
- Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
- Bug 77 - clarify whether háliid is acceptable - it is found, thus we
(negative of háliidit and gen/acc of mii etc.)
the "in háliit"/"in ??" and maid/*mait and guliid/*guliit
d>t only for lexical stems, not for suffixes and closed class words. Since we
do not have any suffix boundary symbol %>, this is difficult.
  hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liid ' corp/*
  h.liid
  hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liit ' corp/*
  h.liit
  ielddaválggat, ja NSR:ii ges Sámediggi. Utsi ii háliit dasa dadjat maide, go dál
  ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
  ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
  aid maid Suodjalus loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
  loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
  ččii boahtteáiggi sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
  sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
  Dál ii háliit šat joatkit dan birra ság
  Dál ii háliit šat joatkit dan birra ság
Status quo after Thomas' last bug fix:
  háliit   háliidit+V+TV+Ind+Prs+ConNeg
  háliid   háliidit+V+TV+Ind+Prs+ConNeg   <==== should this one be allowed?
  maid     maid+Interj
  mait     mait +?
Lule Sámi
Open tasks:
- Derived G3 forms that look like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues are recently done in Lule Sámi, and it is more efficient to
Numeral treatment is on a different level in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
Summary: see the newsgroup
Complex names
Task list for this issue:
- find any unique second-parts (B-parts of names that do not exist in
- Postponed till next week
- remove these B-parts from the ordinary name file (Tomi)
- New York => York, Los Angeles =/=> *Angeles
- make sure xml2lexc can handle complex names in ways compatible with our
- Tomi is working on a C++ version of the xml2lexc tool. He will put it
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle uninflected (A + B)
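The B-part check in the task list above (New York => York, Los Angeles =/=> *Angeles) could be sketched roughly as follows. The function name `unattested_second_parts` and both name lists are hypothetical, and the sketch works on plain strings rather than on the actual name lexicon files, whose format is not handled here.

```python
def unattested_second_parts(complex_names, simplex_names):
    """Return B-parts of two-part names that are not names on their own."""
    simplex = set(simplex_names)
    missing = set()
    for name in complex_names:
        parts = name.split()
        # Only plain A + B names are considered in this sketch.
        if len(parts) == 2 and parts[1] not in simplex:
            missing.add(parts[1])
    return sorted(missing)


# Hypothetical input: "York" is attested as a simplex name, "Angeles" is not.
complex_names = ["New York", "Los Angeles"]
simplex_names = ["York", "Oslo"]
print(unattested_second_parts(complex_names, simplex_names))
```

The returned B-parts are the candidates for removal from the ordinary name file. Deciding which simplex names count as attested is left to the caller.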
XML format
Basis for the progress: separate documents according to project:
- Containing one common concept or name id field, plus Divvun/Disamb fields only
- Containing one common concept or name id field, plus kvensk fields only
- common document containing one common id field, plus fields common to several
Discussion to continue in the newsgroup. Sjur will post a draft XML
Tasks:
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
New server
Ordered a new computer last Friday: Quad G5 (PowerMac), with 30" screen.
- video conferencing
- lexicon compilation
- other heavy processing
- possibly our own Bugzilla there? Yes, we will try that, it is frustrating to
- other server services?
Code conventions
Suggestions from Sjur:
- 80 char linelength (source code, meeting memos)
- twol-smj.txt has 140 - we need to allow some exceptions
- *never* change whitespace as part of other changes, whitespace cleanup
- This is nice
- inform all others before, to allow all to check in all changes. Then clean,
- lexicon sorting is a BAD THING for cvs and change tracking/diffing
- a. sort or arbitrary order
- This is bad for cvs, but good for the linguistic work, and should be
- Shall we have shell sort or emacs sort-lines? Note: We do not sort the whole
sort | rev command, all other sorted |
lexica are sorted alphabetically, with the sort-lines command in emacs.
- XXE is fine for larger documents, but imposes its own XML serializing
- Convention: when making smaller adjustments to a text, use a text editor
These conventions were decided upon, and should be used from now on.
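For the lexicon sorting point, the text mentions a "sort | rev" command whose beginning is cut off. Assuming it is the common `rev | sort | rev` idiom (the memo does not confirm this), the same ordering can be sketched in Python. The function name `reverse_sort` and the word list are illustrative only.

```python
def reverse_sort(entries):
    """Sort entries by their reversed spelling, like `rev | sort | rev`."""
    return sorted(entries, key=lambda entry: entry[::-1])


# Illustrative entries, not taken from any lexicon file.
words = ["guolli", "beana", "viessu", "mánná"]
for word in reverse_sort(words):
    print(word)
```

Python compares by code point here, so the order of non-ASCII letters such as á may differ from what a locale-aware shell `sort` or the emacs sort-lines command gives.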
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Test: the result of the last line should indicate whether this is a problem
  preprocess file_name.txt              OK
  cat file_name.txt | preprocess        !!
  catxml file_name.xml | preprocess     ??
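As background for the error message above: the byte 0xC4 is the first byte of a two-byte UTF-8 sequence, so on its own it is not a complete character. The Python sketch below only illustrates that fact. It does not reproduce the Perl bug, and whether an incomplete sequence is really what preprocess sees is an assumption, not something established here.

```python
import codecs

lead_byte = b"\xc4"        # first byte of a two-byte UTF-8 sequence
full_char = b"\xc4\x91"    # complete sequence for U+0111 (đ)

# Decoding the lone lead byte fails; decoding the full sequence works.
try:
    lead_byte.decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
decoded = full_char.decode("utf-8")

# An incremental decoder keeps the lead byte until the rest arrives, which is
# one way to read input that comes in chunks (for example through a pipe).
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xc4")
second = decoder.decode(b"\x91")

print(lone_byte_ok, decoded, repr(first), second)
```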
Video conferencing across firewalls
A new server has been bought, but it is still open when it will be installed and usable. It
Bug fixing
28 open bugs (and 25 risten.no bugs)
Move Bugzilla
Move Bugzilla to our new server when it arrives, and make it work at both the
risten.no
It is back on the air since last Wednesday, with a small correction on Thursday.
Rucksacks
They are all in Guovdageaidnu. Tomi can pick them up after Christmas on his way
9. Summary, task list
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus infrastructure
- Ask Thor-Øivind to help us move bugzilla to our new project server
- ... and make the URL http://giellatekno.uit.no/bugzilla point to it (and
- install new XXE and the new XXE Forrest config for Ilona
- not necessary, she doesn't use it
- divvun.no and giellatekno.uit.no:
- Binary files download area
- Continue the conversion to static site, using our own script.
- review code and documentation for corpus xsl files under version control
- review the convert2xml.pl script, what works well and what doesn't.
- comment review template made by Saara
- fix bugs!
Maaren
- work with risten.no
- find decisions/documents regarding syllable shortening in compounds in the
- review corpus usage documentation, and the usage of the corpus
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- document corpus infrastructure, your own parts
- catxml review: make a template, and post it in the newsgroup
- Language detection for the corpus files
- Implement addition of <hyph> tags to the converted corpus files
- Do some testing for bug
- update the convert2xml script according to the comments
- review corpus upload user interface
- add version control of the corpus xsl files to the (upload) processing
- xml2lexc update to handle complex names: construct entries like we have now
- update preprocess to handle inflected forms of complex names
- fix bugs!
Sjur
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support
- proper names:
- discuss lexicon format
- discuss implementation with Tomi
- post a draft XML format for the name lexicon, based on the discussion above
- write public tender documents
- update the SD contract version
- send SD corpus contract to lawyer
- review corpus upload user interface
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- translate the normativity-issue on second-syllable deletion when compounding
- smj G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- comment review template made by Saara
- fix bugs!
- pick up backpacks after Xmas
Trond
- project planning with Sjur, continued
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- document corpus infrastructure, your part
- update the SD contract version
- send SD corpus contract to lawyer
- review corpus upload user interface
- review corpus usage documentation
- comment review template made by Saara
- discuss the new lexicon format in the newsgroup
- fix bugs!
10. Next meeting, closing
19.12.2005 09:30
Closed at 12:02