Meeting_2005-12-05
Meeting setup
- Date: 05.12.2005
- Time: 09:30 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:48.
Present: Børre (after 15 min), Saara, Sjur, Thomas (only a few minutes), Tomi
Absent: Maaren, Trond
Main secretary: Sjur
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor Øivind to move bugzilla to our new webserver
- ... or make the URL http://giellatekno.uit.no/bugzilla point to the present
- install new XXE and the new XXE Forrest config for Ilona
- divvun.no and giellatekno.uit.no
- Binary files download area
- Continue the conversion to static site, using our own script.
- corpus xsl files under version control
Maaren
- working with risten.no
- register AppleCare
- Børre did it.
- find decisions/documents regarding syllable shortening in compounds in the
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Convert the name lexicon from present format to xml
- not done
- Look at the hyphenation issue
- implemented in preprocess, see documentation
- corpus xsl files under version control
- discussion going on
- Make preprocess faster
- preprocess is now as fast as possible with Perl implementation. Now I am
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- nothing
- risten.no bugs and fixes
- recovered the lost data after the crash, prepared a new install, but were
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- no URL yet
- project planning with Trond, continued
- also look at the development processes - specification and testing
- nothing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- he had forgotten about it.
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- nothing
- write a background document on the corpus contracts
- nothing
- discuss kvensk project support with Trond
- nothing
- proper name integration with risten.no
- nothing
- discuss risten.no work with Tomi
- continued
- write public tender documents
- nothing
- hyphenation in corpus docs
- nothing
- buy:
- new computer (project server)?
- nothing
- Work on the Speech Application (Dec. 1)
- done
- Work on the proofing article (Dec. 5.)
- done
- Investigate the never-arriving backpacks
- done
Thomas
- work on North Sámi compounding and derivation
- worked and still working
- Settle the empirical facts for sme diphthong simplification
- done
- add descriptive facts about shortened forms in compounding from our corpus
- done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- three-part compounding
- Not done -> transferred to Thomas, Trond and Sjur
- corpus infrastructure: dtd location (both public and internal)
- Not done
- Document aspell and corpus infrastructure
- Nothing done
- Specification for new catxml in C++
- this includes also placing the source and binary
- Not done
- discuss about xml-processing with Saara
- start looking at conversion of the name lexicon from present format to xml
- Looked a bit
- Look into synchronisation of proper names with risten.no
- Not done
- hyphenation in corpus docs
- Not done
- corpus xsl files under version control
- Not done
- add automatic language detection to the corpus processing
- Looked at the language detection documentation
- Work on the proofing article (Dec. 5.)
- Done
- remove last part of complex names not used as simplex names
- Not done
Trond
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Working on the Speech Application (Dec. 1)
- Working on the proofing article (Dec. 5.)
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- For the users/linguists: Which corpora are available, how do I use them (this
- catxml done, which is what is needed mostly. Do we need more?
- we need a review: Thomas, Maaren, Linda, Ilona, Trond
- For the collectors: How do I add texts, where do I add them, how do I
- we need a review of the web interface for corpus uploading - what is still
- Review: Sjur, Saara, Trond, Thomas
Review setup and reporting to be posted to the newsgroup, possibly a summary in
Deadline for comments and final template: by next meeting.
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
Tomcat->static HTML progress
Now, all pages are generated directly from XML by Forrest within Tomcat. We'll
Deadline: Finished by this week.
4. Corpus gathering
The Lule Sámi New Testament is ready for inclusion in our repository
Contracts
Next step:
- Make our versions of the updated Helsinki contracts, and make sure they
- send them to the SD lawyer and to the University lawyer through formal
Contract 1 should have the main priority (contract 2 for Trond).
5. Corpus infrastructure
Updated task list:
- Include the xsl files under version control (Børre, Tomi, Saara)
- Saara has started a discussion in the newsgroup - please follow up!
- we can start using RCS right away, and we do so. The main users should be
- Incorporate language detection as part of the corpus processing (Saara)
- we need a way to deal with hyphenated documents (documents with (manually)
- What needs to be identified now are the conditions for the difference between
- identify all cases of the first type, and replace all hyphens NOT in
- we need to review whether only automatic hyphen detection is good enough, or
- Acceptable results: 90% of all real hyphens correctly tagged.
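The acceptance criterion above (90% of all real hyphens correctly tagged) can be checked with a small script. What follows is only a minimal sketch in Python, assuming that a manually checked list of real-hyphen positions exists for a test document. The function names, the representation of hyphens as character offsets and the sample data are all hypothetical and not part of our tools.

```python
# Sketch only: names and data layout are assumptions, not existing project code.

ACCEPTANCE_THRESHOLD = 0.90  # from the minutes: 90% of all real hyphens


def hyphen_recall(gold_positions, tagged_positions):
    """Return the share of real (gold) hyphens that were also tagged."""
    gold = set(gold_positions)
    if not gold:
        raise ValueError("the gold standard contains no real hyphens")
    return len(gold & set(tagged_positions)) / len(gold)


def is_acceptable(gold_positions, tagged_positions):
    """True if the tagging meets the 90% criterion."""
    return hyphen_recall(gold_positions, tagged_positions) >= ACCEPTANCE_THRESHOLD


if __name__ == "__main__":
    # Hypothetical character offsets of hyphens in one test document.
    gold = [12, 40, 77, 103]
    tagged = [12, 40, 77, 500]
    print(hyphen_recall(gold, tagged), is_acceptable(gold, tagged))
```

This measures only how many of the real hyphens are found. Hyphens that are tagged wrongly would need a separate count, which matters for the question of whether automatic detection alone is good enough.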
6. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- name markup
- Done! There are errors in the markup, people are urged to correct them as they
Complex names
Task list for this issue:
- Restoring two-part names
- done, common/src/proper-complex.xml
- find any unique second-parts (B-parts of names that do not exist in
- Postponed till next week
- remove these B-parts from the ordinary name file (Tomi)
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle (A + B) compounds, but not
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is finished with three-part compounds for North Sámi. On the negative
- the exact rules for when shortening happens should be documented (Maaren
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
- The Sámediggiráddi is going to appoint new members before Christmas
- Contact the SGL when it is elected, in December, and ask them to arrange a
- seminar in February?
- TODO list:
- Maaren to discuss this issue with the relevant persons.
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
- Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond,
Lule Sámi
Great progress has been made on the G3 issue, just some minor points remain.
Open tasks:
- Derived G3 forms that look like G2 are still open.
- Thomas, Trond and Sjur meet shortly after Dec. 5th to finish the rest.
Today's compilation time:
real 5m17.157s
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues were recently resolved for Lule Sámi, and it is more efficient to
Numeral treatment is at a different level in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
Present proposal:
- name-oriented, single document:
- one name with many uses stored only once
Present risten.no:
- Concept-oriented center, contains:
- ID, links to each language entry
- each language as a separate document, with links to the concept/entity in the
Possible new proposal 1: as risten.no
Possible new proposal 2: separate documents:
- Containing one common concept or name id field, plus Divvun/Disamb fields only
- Containing one common concept or name id field, plus kvensk fields only
- common document containing one common id field, plus fields common to several
Example: Porsanger is both a person name and a place name.
5 languages give 10 Trosterud entries and 5 Timbuktu entries; it would be better to have 2 Trosterud
Discussion to continue in the newsgroup.
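The Trosterud/Timbuktu figures can be reproduced with a small counting sketch. It is written in Python and rests on an assumption about how the numbers arise: Trosterud has two uses (person and place), Timbuktu has one (place), and in the per-language layout every language document repeats every use. The record type, the function names and the list of five language codes are hypothetical and do not describe the actual lexicon format.

```python
from collections import Counter, namedtuple

# Hypothetical record: one use of a name, e.g. as a person or a place name.
NameUse = namedtuple("NameUse", ["name", "use"])

USES = [
    NameUse("Trosterud", "person"),
    NameUse("Trosterud", "place"),
    NameUse("Timbuktu", "place"),
]

# Any five languages will do; these codes are for illustration only.
LANGUAGES = ["sme", "smj", "sma", "fin", "nob"]


def per_language_entries(uses, languages):
    """Count entries when each language document repeats every name use."""
    counts = Counter()
    for _language in languages:
        for use in uses:
            counts[use.name] += 1
    return counts


def shared_entries(uses):
    """Count entries when each name use is stored once for all languages."""
    return Counter(use.name for use in uses)


if __name__ == "__main__":
    print(per_language_entries(USES, LANGUAGES))
    print(shared_entries(USES))
```

Under these assumptions the per-language layout yields 10 Trosterud and 5 Timbuktu entries, while the shared layout yields 2 and 1. The language-specific data would then hang off each shared entry as fields or links.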
Tasks:
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it, Bugzilla #211):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Test: the result of the last line should indicate whether this is a problem
preprocess file_name.txt               OK
cat file_name.txt | preprocess         !!
catxml file_name.xml | preprocess      ??
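The Perl message says that the byte 0xC4 could not be decoded as UTF-8. One possible cause is that Latin-1 text reaches preprocess on a stream that is read as UTF-8, but that is an assumption and was not established in this meeting. The Python sketch below illustrates only the byte-level fact behind the message; it does not reproduce the Perl bug itself, and the helper name is hypothetical.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same letter in two encodings.
latin1_bytes = "Ä".encode("latin-1")  # the single byte 0xC4
utf8_bytes = "Ä".encode("utf-8")      # the two bytes 0xC3 0x84

if __name__ == "__main__":
    # A lone 0xC4 starts a two-byte UTF-8 sequence but has no continuation byte.
    print(latin1_bytes, decodes_as_utf8(latin1_bytes))
    print(utf8_bytes, decodes_as_utf8(utf8_bytes))
```

If this is indeed the cause, comparing the encoding of the input file with the encoding expected on standard input in the piped case would be a natural next test.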
Video conferencing across firewalls
We're still waiting for a working URL (working from outside SD, that is).
Bug fixing
28 open bugs (and 2 risten.no bugs)
Move Bugzilla
Move Bugzilla to the same server as the other ones (or make it work at the
TODO, TODO. Thor Øivind.
risten.no
The risten.no data has been rescued, and a new version of eXist is ready for
Tomi will continue the proper name work.
Rucksacks
Were delivered on Nov 25. They have disappeared at UiTø, but that is now being
9. Summary, task list
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... or make the URL http://giellatekno.uit.no/bugzilla point to the present
- install new XXE and the new XXE Forrest config for Ilona
- divvun.no and giellatekno.uit.no:
- Binary files download area
- Continue the conversion to static site, using our own script.
- corpus xsl files under version control
- review the convert2xml.pl script, what works well and what doesn't.
- comment review template made by Saara
- fix bugs!
Maaren
- work with risten.no
- find decisions/documents regarding syllable shortening in compounds in the
- review corpus usage documentation
- discuss with relevant people regarding seminar on proofing tools, normativity
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- document corpus infrastructure, your own parts
- corpus infrastr. user docu review: make a template, and post it in the
- Language detection for the corpus files
- Implement addition of <hyph> tags to the converted corpus files
- Do some testing for bug
- update the convert2xml script according to the comments
- review corpus upload user interface
- corpus xsl files under version control
- xml2lexc update to handle complex names: construct entries like we have now
- update preprocess to handle inflected forms of complex names
- fix bugs!
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support
- proper names:
- discuss lexicon format
- discuss implementation with Tomi
- write public tender documents
- hyphenation in corpus docs
- buy:
- new computer (project server)?
- Work on the Speech Application (Dec. 1)
- Work on the proofing article (Dec. 5.)
- Investigate the never-arriving backpacks
- update the SD contract version
- send SD corpus contract to lawyer
- review corpus upload user interface
- comment review template made by Saara
- update SD contracts
- send SD contracts to SD lawyer
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus upload user interface
- review corpus usage documentation
- fix bugs!
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- comment review template made by Saara
- fix bugs!
Trond
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- document corpus infrastructure, your part
- update the SD contract version
- send SD corpus contract to lawyer
- review corpus upload user interface
- review corpus usage documentation
- comment review template made by Saara
- discuss the new lexicon format in the newsgroup
- fix bugs!
10. Next meeting, closing
12.12.2005 09:30
Closed at 11:12