Meeting_2005-11-28
Meeting setup
- Date: 28.11.2005
- Time: 09.30 Norw. time
- Place: Wherever we are : -)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- speech synthesis
- Things happening within SD
- the UiT NFR application
- Things happening within SD
- Proofing article: deadline Dec. 5.
- speech synthesis
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 56.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Sjur
Agenda accepted with some additions, and replaced speller infra with proper name
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat about texts
- Not done
- Not done
- Gather public texts
- Gathered laws and some websites
- Gathered laws and some websites
- Continue converting text from input format to our xml
- Done, found some bugs in convert2xml.pl
- Done, found some bugs in convert2xml.pl
- Document the corpus directory structure
- Some done
- Some done
- Ask Thor-Øivind to move bugzilla to our new webserver
- He is on vacation untill the 28th.
- He is on vacation untill the 28th.
- install new XXE and the new XXE Forrest config for all (or check that it is
- Done, except Ilona.
- Done, except Ilona.
- mark-up names
- Not done
- Not done
- divvun.no and giellatekno.uit.no
- Binary files download area
- Not done
- Not done
- Make the conversion to static site, using our own script.
- Done some work on that, working on the script
- Done some work on that, working on the script
- Binary files download area
- hyphenation in corpus docs
- Not done
- Not done
- corpus xsl files under version control
- Not done
- Not done
- register AppleCare
- Done for Børre, Duommá and Tomi
Maaren
- shall work with Sámi place names only
- done
- done
- update the last issue in the North Sámi normativity issues document
- done
- done
- register AppleCare
- not done
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools,
- not done (the kvensk place name issue)
- not done (the kvensk place name issue)
- Convert the name lexicon from present format to xml
- not done
- not done
- document corpus infrastructure, your own parts
- done
- done
- Look at the hyphenation issue
- updating preprocess script according to the spec in previous meeting
- updating preprocess script according to the spec in previous meeting
- Update the corpus.dtd
- done
- done
- corpus xsl files under version control
- started discussion in the newsgroups
- started discussion in the newsgroups
- make preprocess and lookup2cg faster
- lookup2cg ready
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- nothing
- nothing
- risten.no bugs and fixes
- risten.no has crashed. Took most of Thursday and Friday, still not back.
- risten.no has crashed. Took most of Thursday and Friday, still not back.
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- internal URL is now working (got an updated URL), I'm waiting for the
- internal URL is now working (got an updated URL), I'm waiting for the
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- nothing
- nothing
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- wrote to him, nothing more due to the risten.no crash
- wrote to him, nothing more due to the risten.no crash
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- nothing
- nothing
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- nothing
- nothing
- discuss kvensk project support with Trond
- not with Trond, but spoke with Hallgeir Varsi, and wrote to the
- not with Trond, but spoke with Hallgeir Varsi, and wrote to the
- proper name integration with risten.no
- part of general work with risten.no
- part of general work with risten.no
- discuss risten.no work with Tomi
- done
- done
- write public tender documents
- nothing
- nothing
- hyphenation in corpus docs
- nothing
- nothing
- buy:
- new computer (project server)?
- not done yet. If we do it, the invoice should be at SD before Dec. 15.
- not done yet. If we do it, the invoice should be at SD before Dec. 15.
- new computer (project server)?
- register AppleCare
- done.
- done.
- other:
- bureaucrazy: travel invoices, other invoices.
Thomas
- work on North sámi compounding and derivation
- finished with second-syllable deletion in compounding
- finished with second-syllable deletion in compounding
- Look at Linguistic bugs with Trond
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- no meeting or bug-fixing this week
- no meeting or bug-fixing this week
- update the lule sámi normativity issues document about incorporation of loan words
- done
- done
- register AppleCare
- asked Børre to do it for me
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- Not done
- Contact aspell author (UTF-8 thing)
- three-part compounding
- Not done.
- Not done.
- corpus infrastructure: dtd location (both public and internal)
- Not done
- Not done
- Document aspell and corpus infrastructure
- Not done this week
- Not done this week
- Specification for new catxml in C++
- this includes also placing the source and binary
- Not done
- Not done
- this includes also placing the source and binary
- start looking at conversion of the name lexicon from present format to xml
- Not done
- Not done
- Look into synchronisation of proper names with risten.no
- Modified risten.no to accept multiple collections
- so far, this is only done for searching, not editing, true? Editing is not
- so far, this is only done for searching, not editing, true? Editing is not
- Modified risten.no to accept multiple collections
- hyphenation in corpus docs
- Not done
- Not done
- corpus xsl files under version control
- Not done
- Not done
- add automatic language detection to the corpus processing
- Not done
- Not done
- register AppleCare
- Børre did? Jipps: -). Done.
- Børre did? Jipps: -). Done.
- Other:
- Done some work with risten.no
Trond
- update the contracts with changes from the university lawyer
- Done
- Done
- Look into the document hyphenation issue
- Not done.
- Not done.
- Look at the three-part compound issue
- Worked on this issue with Thomas
- Worked on this issue with Thomas
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- Still 7 open
- Still 7 open
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- also look at the development processes - specification and testing
- The name project
- Work on the name project, mark up names (100 names left)
- All names now marked
- All names now marked
- Work on the name project, mark up names (100 names left)
- discuss kvensk project support with Sjur
- Not done.
- Not done.
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Progress made at smj, still not sme.
- Progress made at smj, still not sme.
- Other:
- Disamb with Linda
- Corpus conversion & format with Børre.
- Disamb with Linda
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- The directory structure is now settled (as of last meeting), and should be
For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
- For the users/linguists: What corpus are found, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing: -) (all)
Tomcat->static HTML progress
Now, all pages are generated directly from XML by Forrest within Tomcat. We'll
4. Corpus gathering
All the available law texts are downloaded from Odin, with (most of?) their
All available texts from NSI's web site have been downloaded.
The texts are now in our corpus repository, but not converted to the corpus
-
http://troms.kulturnett.no/bibliotek/samisk/samisk_materiale.htm
-
Sámi legal text on the net
- We need all these texts
- done
- done
- We need to survey the site in the future
- new translations under way
- new translations under way
- We need the Norwegian versions as well
- done
- We need all these texts
Contracts
Update: All SD versions now synchronised with the templates. Trond met with the
Next step:
- Make our versions of the updated Helsinki contracts, and make sure they
- send them to the SD lawyer and to the University lawyer through formal
Contract 1 should have the main priority.
5. Corpus infrastructure
Updated task list:
- Include the xsl files under version control (Børre, Tomi, Saara)
- Incorporate language detection as part of the corpus processing (Tomi)
-
Tomi will start, but possibly hand it over to Saara if it takes too
-
Tomi will start, but possibly hand it over to Saara if it takes too
- we need a way to deal with hyphenated documents (documents with (manually) inserted
- Discuss details in the newsgroup
- What needs to be discussed now is the conditions for the difference between
- What needs to be discussed now is the conditions for the difference between
- in normal cases hyphenation points should be removed
- Done by Saara in preprocess. It is possible to turn it back to a regular
- Done by Saara in preprocess. It is possible to turn it back to a regular
- when testing the robustness of our parsers, as well as when testing the
- This is true for examples like "eala-<CR>hus", they
- In cases of truncated compounds like "ealahus-<CR> ja ...", we want the
- There are sporadically text books with explicit hyphenation points, like:
- This is true for examples like "eala-<CR>hus", they
- Discuss details in the newsgroup
6. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- name markup
- Done! There are errors in the markup, people are urged to correct them as they
- Done! There are errors in the markup, people are urged to correct them as they
Complex names
Task list for this issue:
- Restoring two-part names
- done, common/src/proper-complex.xml
- done, common/src/proper-complex.xml
- find eventual unique second-parts (B-parts of names that do not exist in
- Postponed till next week
- Postponed till next week
- remove these B-parts from the ordinary name file (Tomi)
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
- The core issue: The preprocess script can handle (A + B) compounds, but not
North Sámi
- three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
- Thomas is finished with three-part compounds for North Sámi. On the negative
- Thomas is finished with three-part compounds for North Sámi. On the negative
- the exact rules for when shortening happens should be documented (Maaren
- descriptive facts from our corpus (Trond, Thomas) added to the normative
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in
- The Sámediggiráddi is going to appoint new members before Christmas
- Contact the SGL when it is elected, in December, and ask them to arrange a
- seminar in February?
- The Sámediggiráddi is going to appoint new members before Christmas
- TODO list:
-
Maaren to discuss with relevant persons on this issue.
-
Maaren to discuss with relevant persons on this issue.
- look at Lule Sámi, but apply it to second-parts only
- number project still open (see below)
- diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
- Settling the empirical facts for sme (Thomas)
- Writing the rules (copy from smj, adjust)(Sjur, Thomas and Trond,
- Settling the empirical facts for sme (Thomas)
- TODO:
Lule Sámi
Great progress has been made on the G3 issue, just some minor points remain.
Open tasks:
- Derived G3 that looks like G2 are still open.
- Thomas, Trond and Sjur meet shortly after Dec. 5th to finish the rest.
Today's compilation time:
real 5m17.157s
Numerals
The following North Sámi linguistic issues should be settled before going into
- Three-part compounds
- Diphthong simplification
- Derivation
These issues are recently done in Lule Sámi, and it is more efficient to
Numeral treatment is on different level in the existing sme and smj parsers, but
Numerals in North Sámi:
Numerals in Lule Sámi:
7. Name lexicon infrastructure
The kvensk place name lexicon have their own info needs:
Linguistic info 1. oppslagsform, headword 2. bøyningsform, inflection 3. kode for bøyningsform 4. evt. grenser mellom navneledd og –element, boundaries of the word component and name element(s) 15. kvensk(e) navnevariant(er) 16. samisk parallellnavn 17. norsk parallellnavn 18. beslekta navn, relating toponyms, vertailunimi(-nimet) 24. etymologi Geographical/location info: 5. type sted, type of place, paikanlaji 6. kode for navntype, iflg. SSR 7. gnr., bnr. + 10. kommune 11. fylke 14. koordinat Legal info: 8. status etter Lov om stadnamn 9. vedtaksmyndighet etter Lov om stadnamn Source info: 19. informantforklaringer 20. informant(er) 21. innsamler(e), årstall 22. arkiv 23. litteratur, kilder Unclassified: 12. kartprodukt 13. kartblad 25. pilhenvisning, nuoliviite, til annen artikkel 26. lydfil 27. bilde(r), illustrasjone(r) 28. andre kommentarer, “sekkepost”
Present proposal:
- name-oriented, single document:
- one name with many uses stored only once
Present risten.no:
- Concept-oriented center, contains:
- ID, links to each language entry
- ID, links to each language entry
- each language as a separate document, with links to the concept/entity in the
Possible new propsal 1: as risten.no
Possible new proposal 2: separate documents:
- Containing one common concept or name id field, plus Divvun/Disamb fields only
- Containing one common concept or name id field, plus kvensk fields only
- common document containing one common id field, plus fields common to several
Porsanger both person and place
5 lgs give 10 Trosterud, 5 Timbuktu, it would be better to have 2 Trosterud
Discussion to continue in the newsgroup.
Tasks:
- testing of conversion
- continue the discussion of the name lexicon format
- implement a prototype in eXist
- eXist as editor:
- develop the needed XQueries and interface
- synchronisation between risten.no and
- test whether eXist as editor is actually working well
- develop the needed XQueries and interface
8. Other
speech synthesis
Things happening within SD
- The Davvi girji plan: Have a person read in tapes of the whole dictionary
- The ODIN edt. wants to add speech synthesis to their web pages (for disabled
the UiT NFR application
Application for international cooperation to NFR, Deadline Dec. 1.
Both Divvun and Disamb milieus see this as an important development path, and we
Proofing article:deadline Dec. 5.
We go for it - 2 pages.
"The official versions should be prepared according to the technical
( Trond, Sjur and Tomi)
Technical issues
- The mac os / perl bug (at least Trond and Sjur has it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Trond has filed a bug report on this (#211), and discussed with Thor-Øivind, there
- Trond has filed a bug report on this (#211), and discussed with Thor-Øivind, there
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
Video conferencing across firewalls
We're still waiting for a working URL (working from outside SD, that is).
Bug fixing
24 open bugs (and 24 risten.no bugs)
Move Bugzilla
Move Bugzilla to the same server as the other ones (or make it work at the
TODO, TODO. Thor Øivind.
risten.no
Risten.no crashed badly last week, with no traces of what happened. Tomi
Tomi will then continue the proper name work.
AppleCare extended warranty
Only Maaren still to register it, will do it today.
Rugsacks
Not delivered yet! Sjur will investigate.
9. Summary, task list
Børre
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... or make the URL http://giellatekno.uit.no/bugzilla point to the present
- ... or make the URL http://giellatekno.uit.no/bugzilla point to the present
- install new XXE and the new XXE Forrest config for Ilona
- divvun.no and giellatekno.uit.no
- Binary files download area
- Continue the conversion to static site, using our own script.
- Binary files download area
- corpus xsl files under version control
Maaren
- working with risten.no
- register AppleCare
- find decisisons/documents regarding syllable shortening in compounds in the
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Convert the name lexicon from present format to xml
- document corpus infrastructure, your own parts
- Look at the hyphenation issue
- Update the corpus.dtd
- corpus xsl files under version control
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support with Trond
- proper name integration with risten.no
- discuss risten.no work with Tomi
- write public tender documents
- hyphenation in corpus docs
- buy:
- new computer (project server)?
- new computer (project server)?
- Work on the Speech Application (Dec. 1)
- Work on the proofing article (Dec. 5.)
- Investigate the never-arriving backpacks
Thomas
- work on North sámi compounding and derivation
- Settle the empirical facts for sme diphthong simpl.
- add descriptive facts about shortened forms in compounding from our corpus
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- this includes also placing the source and binary
- discuss about xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- add automatic language detection to the corpus processing
- Work on the proofing article (Dec. 5.)
- remove last part of complex names not used as simplex names
Trond
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Working on the Speech Application (Dec. 1)
- Working on the proofing article (Dec. 5.)
10. Next meeting, closing
05.12.2005 09: 30
Closed at 12: 53