Meeting_2005-10-31
Meeting setup
- Date: 31.10.2005
- Time: 10:00 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:05.
Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond
Absent: Saara
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Get help from the Tromsø department of Sámediggi to dig in WebSak
- Done, and picked up some texts
- Gather public texts
- From the Sámediggi
- Reorganise the directory structure
- Put all corpus texts into one place
- Done
- Continue converting text from input format to our xml
- Not done
- Ask Thor-Øivind to move bugzilla to our new webserver.
- He has been very busy, and since bugzilla seems to work ok, he has
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- Not done
- continue working with the missing list from risten.no
- Not done
- Start working on Sámi place names
- Not done
- Start working on normativity issues (numeral issues with Trond?)
- Not done, sorry
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- make an emacs mode for the name project (cf. specs in the memo
- Done
- Plan a conversion script for the name lexicon.
- Not done; discussed with Tomi the possibility of using his C++ code in the xml2lexc script.
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- Not done
- risten.no bugs and fixes
- installed the latest eXist snapshot, and tested it
- corrected several conformity bugs that had been accepted by earlier snapshots
- discuss risten.no work with Tomi
- Not done
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- He is reminded, but no response so far. Needs more action.
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Not done
- Discuss the contract issue with Trond, return the new version to the lawyer
- Call Kimmo Koskenniemi for comments
- no answer so far
- write to the Giellalávdegoddi once more, emphasizing timetable and response
- wrote draft letter and sent it to Maaren for QA and translation
- discuss kvensk project support with Trond
- Not done
- write public tender documents
- Nothing last week
- other tasks:
- finally corrected the Forrest config for XXE
- The new config is available from me
- you should all update to XXE 3.0 :-)
- buy:
- rucksacks
- new computer (project server)?
- project management software
- OmniOutline (upgrade)
- OmniGraffle (upgrade)
- ISDN card for Maaren (Maaren will order herself)
Thomas
- work on Lule Sámi compounding and derivation
- finished (this calls for at least a virtual celebration !!)
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
- not done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- corpus infrastructure: file and dir organisation
- Done
- Document aspell and corpus infrastructure
- Partially done
- Cgi-script for uploading documents to corpus base
- Done, but needs modifications?
- Specification for new catxml in C++
- this includes also placing the source and binary
- Not done
- clean the script/ catalogue with Trond
- Not done
- Common makefile issues
- Not done
- discuss risten.no work with Sjur
- Not done
Trond
- Work on the bug list (7 open).
- Still 7 open, as the number project hasn't started.
- Started looking at the G3 definition issue for sme, it is the key to some of
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Not done.
- Work on the name project:
- Introduce the +Mal, +Fem, ... tags to the parser
- Done. Now the tags are there, and in the multichar list. We may thus start
- and discuss the work with Maaren and Børre.
- Not done. Hopefully
- clean the script/ dir
- Down from 76 to 51 entities.
- discuss kvensk project support with Sjur
- Hmm, did we discuss this issue?
- Otherwise, the week was dominated by going to a conference on minority
3. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
- Now we have 4 documents:
- Correct corpus (disamb usage)
- Corpus plan (for the disamb corpus cwb)
- catxml
For the basic corpora, we need 3 types of documentation, or doc for 3 target
- For the users/linguists: Which corpora exist, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
- For the programmer: What did I actually do? (this is partly the catxml doc)
For the work on the graphical user interface, we need documentation as well, in
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
4. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Tasks:
- move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
- There are approx. 10 non-broken pdf documents in gt/sme/corp/original/
- Collect public (pdf and html) files (Børre)
- Done some test downloading, will have to look at tools to do this
Contracts
Tasks:
- Follow-up on the lawyers' comments (Trond has started with the university)
- Trond and Sjur finished the next revision of the contracts, and are
- Update: No comments from Kimmo yet
- add a background document explaining the model (Sjur)
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
North Sámi New Testament
Our inhouse sme nt is as new as the one they have at Bibelselskapet, and we
Lule Sámi New Testament
Svenska Bibelsällskapet is putting their finishing touches to the Lule Sámi
- Haven't heard anything from Olavi Korhonen. The only problem with our version
Lule Sámi Dictionary
Nothing new about the meeting with Anders Kintel.
5. Corpus infrastructure
Updated task list:
- Make a system for file and directory permission (today: we all belong to the
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
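A minimal sketch of the hyphenation handling listed above. It assumes (this is not stated in the minutes) that hyphenation points appear either as soft hyphens (U+00AD) or as a hyphen at the end of a line. The function name strip_hyphenation and the keep switch are hypothetical and not part of catxml or preprocess; keep=True would leave the text untouched, e.g. for robustness testing.

```python
import re

SOFT_HYPHEN = "\u00ad"

# A word character, a hyphen, a line break and optional indentation.
# The lookahead captures the first character of the continuation
# without consuming it, so consecutive broken lines are all handled.
_BROKEN_WORD = re.compile(r"(\w)-\n[ \t]*(?=(\w))")


def _join(match):
    left, right = match.group(1), match.group(2)
    # Lowercase continuation: treat the hyphen as a hyphenation point.
    # Otherwise (capital letter, digit): keep the hyphen, drop the break.
    return left if right.islower() else left + "-"


def strip_hyphenation(text, keep=False):
    """Remove hyphenation points unless keep is True."""
    if keep:
        return text
    text = text.replace(SOFT_HYPHEN, "")
    return _BROKEN_WORD.sub(_join, text)
```

The lowercase test is only a heuristic: a genuine compound hyphen that happens to fall at a line end before a lowercase part would be removed as well.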
6. Linguistics
Name lexicon
Summary: see the newsgroup
Unclassified: 6090 entries
- 3203 LONDON
- 1788 BERN
- 468 NYSTØ
- 330 ACCRA
- 80 MARJA
- 59 NIILLAS
- 43 ANAR
- 43 ALEUHTAT
- 20 GIEDDI
- 17 HEANDARAT
- 16 DUORTNUS
- 7 SULLOT
- 4 VARGGAT
- 4 GEAVNNIS
- 3 EATNAMAT
- 2 NYOBL
- 2 GUOLBBA
- 1 PIERA
Motivation:
- Divvun: We want to cross-link different versions of the same locations
- Common: We do not want to enter the same names twice. We want a
- Disamb: Having a richer tag set makes it easier to disambiguate
- Future: Richer analysis makes new applications possible, within
Needed: A plan for this project:
- do the main markup in the present propernoun file
- make a script for converting it to xml (to be done one time)
- make a script for xml2lexc (to be done by the makefile)
- There is a sample file for the xml file format in gt/common/src/proper-nouns.xml
- There is a working xml2lexc for Komi, written by Saara
- make the tags etc. in the parser
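The xml2lexc step in the plan above could be sketched roughly as follows. This is an illustration only: the element and attribute names (entry, stem, contlex) and the lexicon name ProperNoun are made up and are not taken from gt/common/src/proper-nouns.xml or from Saara's Komi script. The set of characters escaped with % is illustrative and may not be complete.

```python
import xml.etree.ElementTree as ET

# Characters given a % escape in this sketch (illustrative set).
LEXC_SPECIAL = set(" :;%!<>0")


def lexc_escape(form):
    """Prefix characters that are special in lexc with %."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in form)


def xml_to_lexc(xml_text, lexicon_name="ProperNoun"):
    """Turn hypothetical <entry contlex="..."><stem>...</stem></entry>
    elements into lexc lines of the form 'stem CONTLEX ;'."""
    root = ET.fromstring(xml_text)
    lines = ["LEXICON " + lexicon_name]
    for entry in root.iter("entry"):
        stem = entry.findtext("stem")
        contlex = entry.get("contlex")
        if not stem or not contlex:
            continue  # skip incomplete entries rather than guess
        lines.append(lexc_escape(stem.strip()) + " " + contlex + " ;")
    return "\n".join(lines) + "\n"
```

Since the plan is to run this step from the makefile, the output would be regenerated on every build and the xml file would remain the only place edited by hand.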
Conversion:
- Mark up the remaining 6090 entries until conversion starts (Maaren to do
- Entries still to be done: see above
- This means we would need a seventh option, the unspecified name.
- Then split propernoun-sme-lex.txt into two, one with the sami name being
- Look into efficient editing of the XML lexicon (Tomi, Saara)
- Then convert to xml (Tomi, Saara)
- Look into efficient editing of the XML lexicon again (Tomi, Saara)
- Look into synchronisation issues with risten.no - we want the names there
- Consider automatic sorting on commit
Twol SETS definition issue
The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
Update: it is still not working, see bug 193
SUGGESTION (Trond): Thomas, Trond and Sjur didn't meet last week
North Sámi
- three-part compounds issue still open
- number project still open
- The treatment of Sámi place names, we need a contract with "Norge digitalt",
- Sjur has written an e-mail to the UFD contact person,
- normativity issues:
- the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
- Actions: Sjur will bring this to the Divvun board, write a new letter to
- The document with the list of open issues
Lule Sámi
Sjur, Thomas and Trond will continue working on the Lule Sámi issues.
Numerals
- The issue is postponed to next week.
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
We will return to this issue after the name conversion.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
- One way to "resolve" this is to redirect the error messages to /dev/null:
... | preprocess 2> /dev/null | lookup ...
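As an alternative to discarding the messages, the offending input could be located first. The sketch below is in Python rather than Perl and is independent of the preprocess script; the function name is hypothetical. It only reports which lines are not valid UTF-8 and at which byte offset, and it does not explain why Perl on Mac OS raises the error.

```python
def find_invalid_utf8_lines(data):
    """Return (line number, byte offset, offending byte) for every
    line of the bytes object `data` that is not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte
            bad.append((lineno, err.start, line[err.start:err.start + 1]))
    return bad
```

Run over a corpus file read in binary mode, this would point at input such as a stray "\xC4" byte before the text reaches preprocess and lookup.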
Video conferencing across firewalls
The problem we've had with the SD firewall persists, and there doesn't seem to
Bug fixing
19 open bugs (and 24 risten.no bugs)
Bugzilla update
Buying
- rucksacks for the whole Divvun team
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
Project planning and development processes
Trond is using his project as a test case for an IT guy, Geir Tore Voktor,
Conference report from Trond
- Themes at the conference relevant to us:
- Terminology work
- Dictionary work between minority languages
- Repositories for minority language resources
- Disambiguation work, for South African languages
- Also, our work is relevant to other projects. This
- There is Welsh work on terminology:
- (Dewi Jones, Delyth Prys, U Wales, Bangor, I couldn't find).
9. Summary, task list
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- install Marratech client to Maaren's computer
- install new XXE and the new XXE Forrest config for all (or check that it is
- mark-up names
- move existing corpus docs from gt/ to new corpus repository
Maaren
- continue working with the missing list from risten.no
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- make an emacs mode for the name project (cf. specs in the memo above)
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- discuss risten.no work with Tomi
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- install Marratech client to Maaren's computer
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Trond, return the new version to the lawyer
- Call Kimmo Koskenniemi for comments, perhaps arrange a meeting with him
- Follow up on meeting with Anders Kintel
- discuss kvensk project support with Trond
- write public tender documents
- buy:
- new computer (project server)?
- project management software
- OmniOutline (upgrade)
- OmniGraffle (upgrade)
Thomas
- do main markup in the present propernoun file
- work on North Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- clean the script/ catalogue with Trond
- Common makefile issues
- discuss risten.no work with Sjur
- discuss xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- Look into synchronisation of proper names with risten.no
Trond
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Work on the name project:
- Discuss the conversion with Maaren and Børre
- mark up names
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas
10. Next meeting, closing
7.11.2005 10:00
Closed at 11:06