Meeting_2005-11-07
Meeting setup
- Date: 07.11.2005
- Time: 10.00 Norw. time
- Place: Wherever we are : -)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10: 07.
Present: Børre, Maaren, Saara (online), Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Trond
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat about texts
- Not done
- Not done
- Gather public texts
- Lot of texts from the Sámediggi
- Lot of texts from the Sámediggi
- Continue converting text from input format to our xml
- Texts haven´t been added to the corpus
- Texts haven´t been added to the corpus
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- ... and update Bugzilla at the same time
- install Marratech client to Maaren's computer
- Done
- Done
- install new XXE and the new XXE Forrest config for all (or check that it is
- Maaren has the new XXE
- Maaren has the new XXE
- mark-up names
- Some done
- Some done
- move existing corpus docs from gt/ to new corpus repository
- Not done
- Not done
- Other
- Have been programming a gui-program for maintaining the corpus,
- Sorted out Maaren's problems with emacs (edited the .emacs file).
- Have been programming a gui-program for maintaining the corpus,
Maaren
- continue working with the missing list from risten.no
- not done
- not done
- Start working on Sámi place names
- worked a little bit
- worked a little bit
- Start working at normativity issues (numeral issues with Trond?)
- not done
Saara
- Look at the corpus infrastructure issue
- set up a mysql database for corpus metainformation (for testing). started
- set up a mysql database for corpus metainformation (for testing). started
- Look at the corpus interface issue with Lars
- make an emacs mode for the name project (cf. specs in the memo above)
- done
- done
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- started working with this
- started working with this
- start looking at conversion of the name lexicon from present format to xml
- script namelex2xml.pl can be used for testing.
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- Done, major progress
- Done, major progress
- risten.no bugs and fixes
- helped Pia get her editor back up and running. She can now finally edit the
- xml-cleaned several of the sma grammar files
- updated risten.no with the corrected sma grammar files (and an updated
- helped Pia get her editor back up and running. She can now finally edit the
- discuss risten.no work with Tomi
- not done
- not done
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- install Marratech client to Maaren's computer
- called Pål Hivand, he has been trying it
- now also e-mailed him, still no meeting room URL
- install Marratech client to Maaren's computer
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- settled for Merlin as the planning tool
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- not done
- not done
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- not done
- not done
- Discuss the contract issue with Trond, return the new version to the lawyer
- Call Kimmo Koskenniemi for comments, perhaps arrange a meeting with him
- finally got hold of him, meeting next Tuesday
- finally got hold of him, meeting next Tuesday
- Call Kimmo Koskenniemi for comments, perhaps arrange a meeting with him
- Follow up on meeting with Anders Kintel
-
Berit Karen will talk to Bitte, who will call him (or I will)
-
Berit Karen will talk to Bitte, who will call him (or I will)
- discuss kvensk project support with Trond
- they called us before we got to discuss it. Anyway, it fits pretty well with
- they called us before we got to discuss it. Anyway, it fits pretty well with
- write public tender documents
- nothing more this week
- nothing more this week
- buy:
- new computer (project server)?
- project management software
- bought (Merlin)
- bought (Merlin)
- OmniOutline (upgrade)
- upgraded for the whole Divvun team
- upgraded for the whole Divvun team
- OmniGraffle (upgrade)
- upgraded for the whole Divvun team
- new computer (project server)?
Thomas
- do main markup in the present propernoun file
- done what I can do
- done what I can do
- work on North sámi compounding and derivation
- started to look at three-part compounds
- started to look at three-part compounds
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
- Met and managed to define some consonant cluster-series
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Not done
- Not done
- Document aspell and corpus infrastructure
- Partially done
- Partially done
- Specification for new catxml in C++
- Did some modifications and testing with new and old catxml
- this includes also placing the source and binary
- Not done
- Not done
- Did some modifications and testing with new and old catxml
- clean the script/ catalogue with Trond
- Discussed about in newsgroup
- Discussed about in newsgroup
- Common makefile issues
- Discussed about in newsgroup
- Discussed about in newsgroup
- discuss risten.no work with Sjur
- Scheduled to be done on Tuesday 8.11
- Scheduled to be done on Tuesday 8.11
- discuss about xml-processing with Saara
- Not done
- Not done
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Emacs seems to be most efficient : )
- XML tool should be able to open files in small bits, I know only one that is
- Emacs seems to be most efficient : )
- start looking at conversion of the name lexicon from present format to xml
- Not done
- Not done
- Look into synchronisation of proper names with risten.no
- Not done
- Not done
- Other:
- Added multiple authors to corpus interface
Trond
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- I have not been in a postion to do that, have done more basic work, with
- I have not been in a postion to do that, have done more basic work, with
- project planning with Sjur, continued
- Had some discussion, schedules still open.
- also look at the development processes - specification and testing
- Not done.
- Not done.
- Had some discussion, schedules still open.
- Work on the name project:
- Discuss the conversion with Maaren and Børre
- Done.
- Done.
- mark up names
- Much work done last week, now down to 4100 unassigned entries (they get harder)
- Discuss the conversion with Maaren and Børre
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas
- Done substantial progress on this front, cf. cvs log. Next issue is get the rest of
- Otherwise, I have been working on disambiguation.
- Done substantial progress on this front, cf. cvs log. Next issue is get the rest of
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
- The directory structure is now settled (as of last meeting), and should be
For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
- For the users/linguists: What corpus are found, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing: -) (all)
Divvun.no down again
Tomcat is running out of memory in between. Børre will look into changing
4. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Tasks:
- move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
- There are appr. 10 non-broken pdf documents in gt/sme/corp/original/
- There are appr. 10 non-broken pdf documents in gt/sme/corp/original/
- Collect public (pdf and html) files (Børre)
- Done some test downloading, will have to look at tools to do this
- Done some test downloading, will have to look at tools to do this
Contracts
Tasks:
- Follow-up on the lawyers' comments (Trond has started with the university)
-
Trond and Sjur finished the next revision of the contracts, and are
- Update: Sjur will meet with Kimmo on Tuesday this week (tomorrow)
- Update: Sjur will meet with Kimmo on Tuesday this week (tomorrow)
-
Trond and Sjur finished the next revision of the contracts, and are
- add a background document explaining the model (Sjur)
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
5. Corpus infrastructure
Updated task list:
- Make a system for file and directory permission (today: we all belong to the
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
- in normal cases hyphenation points should be removed
Catxml performance testing
Testing done with xml files we have now in corpus base, there are seven (7)
Perl catxml:
sme$time catxml --all --lang=no -i=. real 0m18.509s user 0m17.753s sys 0m0.426s sme$time catxml --all --lang=no -i=. > /dev/null real 0m17.033s user 0m16.934s sys 0m0.095s
C++ catxml:
sme$time /home/tomi/tagparser2/catxml -r * real 0m0.827s user 0m0.176s sys 0m0.083s sme$time /home/tomi/tagparser2/catxml -r * > /dev/null real 0m0.183s user 0m0.174s sys 0m0.009s
Decision: C++ it is!
6. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- mane markup
- testing of conversion
When these two tasks are done (at some point in the future), the conversion will
Status quo on the two lines of work:
The mark up of the remaining 3900 entries until conversion starts (People
2170 LONDON 1275 BERN 314 NYSTØ 45 ACCRA 42 ALEUHTAT 19 MARJA 17 NIILLAS 14 HEANDARAT 3 VARGGAT 3 ANAR 2 GIEDDI
The technical issues (specified in earlier memos: Conducted by:
North Sámi
- three-part compounds issue still open
- number project still open
- normativity issues:
- the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
-
Sjur, with the help of Maaren has written to the Divvun board and to
-
Sjur, with the help of Maaren has written to the Divvun board and to
- The document with the list of open
issues has been updated by Thomas.
- the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
Lule Sámi
Sjur, Thomas and Trond will cont. Lule Sámi issues.
Tasks:
- update the normativity issues document:
- Px issue
- Px issue
- G3 open issues (S2, some S3, S5, S6 and S7; Sx = Spiik, consonant series)
Numerals
The issue is postponed to next week.
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- Numeral generation
- A clear concept of how we want to treat them
- Tagging
- Tagging
- A treatment
8 8 8+Num 8 8+Num+Acc 8 8+Num+Gen 8 8+Num+Nom
We will return to this issue after the name conversion.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The mac os / perl bug (at least Trond and Sjur has it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
- One way to "resolve" this is to redirect the error messages to /dev/null:
- Another example of the same bug:
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
... | preprocess 2> /dev/null | lookup ...
Video conferencing across firewalls
The problem we've had with the SD firewall persists, and there doesn't seem to
Bug fixing
24 open bugs (and 23 risten.no bugs)
Bugzilla update
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
- this should be integrated with the general proper name work - we want all
Binary files for downloading
As we move forward in the Divvun project, we need a place to store downloadable
As I wrote in my task status in the previous meeting, I have fixed bugs and
Thus, I need a place to store a zip file of the updated config, to make it
Børre: have a look.
Meeting memo improvements?
Look at this:
It gave me the idea that we could skip the task lists in the end altogether, and
But such automatic postprocessing does require some coding. Can we afford it, or
Decision: nothing now, we can go back later if updating the task lists get too
Reimbursement of expenses
Should be filed before mid November. Whatever needs to be bought, should be
9. Summary, task list
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- ... and update Bugzilla at the same time
- ... and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
- mark-up names
- move existing corpus docs from gt/ to new corpus repository
- divvun.no and giellatekno.uit.no
- Binary files download area
- Moving to static site, using forrestbot or something else.
- Binary files download area
Maaren
- shall work with Sámi place names only
- update the last issue in the North Sámi normativity issues document
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- document corpus infrastructure, your own parts
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- discuss risten.no work with Tomi
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Kimmo, return the new version to the lawyer
- Follow up on meeting with Anders Kintel
- discuss kvensk project support with Trond
- write public tender documents
- buy:
- new computer (project server)?
- project management software
- OmniOutline (upgrade)
- OmniGraffle (upgrade)
- new computer (project server)?
Thomas
- work on North sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- update the lule sámi normativity issues document
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- this includes also placing the source and binary
- discuss risten.no work with Sjur
- discuss about xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- Look into synchronisation of proper names with risten.no
Trond
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- also look at the development processes - specification and testing
- Work on the name project, mark up names
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas
10. Next meeting, closing
14.11.2005 10: 00
Closed at 11: 37