Meeting_2005-10-10
Meeting setup
- Date: 10.10.2005
- Time: 10.00 Norwegian time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Board meeting summary
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:15.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Trond
Agenda accepted with revisions.
2. Reviewing the task list from the last meeting
Børre
- Discuss possible cooperation with Anders Kintel
- Contacted him on Friday and proposed a cooperation between us. He said
he would like to know more about our project and to have the legal issues cleared before entering into a cooperation. Apart from that, he was positive about the proposal.
- Contact oahpahusossodat and the rest of the SD about texts
- Not done
- Reorganise the directory structure
- They are still only on my machine, not quite done :-(
- Continue converting text from input format to our xml
- This one goes well
- Contact Saara about pdf conversions.
- Not done
- Have a look at the placenames files.
- Not done
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Sent an e-mail last week, no answer. Will have to contact him personally.
- Other
- Been to Guovdageaidnu. Demonstrated aspell using Linux and Mac OS X.
- Worked out a suggestion for a name lexicon together with Trond, Sjur and
Maaren.
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
file-for-file review, in order to get different terminology.
- Not done
- will get through most of the missing list from risten.no this week
- worked with risten.no and have 1065 words to work with (50% are typos)
- Start working on grammatical issues with Thomas and Trond
- Not done
- Work on the name project with Trond and Sjur
- worked on this with Sjur, Trond and Børre
- Start looking at normativity issues
- Not done
- Work on the numerals project with Trond
- Not done
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- I now have access to the database in omilia, but haven't looked at it yet.
- Have a look at the pdf-to-xml issue
- Almost ready. Also the character conversion package is almost ready,
some testing still needed.
Sjur
- Lule Sámi twol problems, have a look at the sets definition
- nothing done
- risten.no bugs and fixes
- discussed the future organisation with Risten and Bitte, cf. Other
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- Still haven't received anything, but the reorganisation at Sámediggi
has made the question relevant to the rest of SD as well. Hopefully something will happen soon.
- For the board meeting:
- check the memo from the last meeting
- done
- the board meeting was held last Tuesday, cf. separate topic today
- project planning with Trond
- done in Guovdageaidnu, still a lot more to be done
- Work on the name project with Trond and Maaren
- done in Guovdageaidnu, also including Børre - see more under
- Prepare for a Lule Sámi meeting with Árran
- Discussed with Bitte, suggestion: around Nov. 17 (there's a conference
in Tysfjord then, about Tysfjord becoming part of the Sámi adm. area).
- Follow up on place names from Norge Digitalt
- Done as part of the board meeting
- Evaluate SFST as speller (and analyzer) lexicon
- Trond and Børre and I had a short overview in Guovdageaidnu
- prepare for the Guovdageaidnu meeting:
- name lexicon
- done
- three-part compounds
- nothing done
Thomas
- Post a summary on the Lule Sámi issue to the news group
- Done
- work on Lule Sami compounding and derivation
- Still working, but starting to see the bottom of the pot,
maybe a few more weeks' work
- Look at Linguistic bugs with Trond
- The Lule Sámi bugs that can be solved at this point have been solved
- Prepare for a Lule Sámi meeting with Árran
- Not done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done.
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- corpus infrastructure: file and dir organisation
- Still discussing this one with Børre
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
- This is on the way
- Cgi-script for uploading documents to corpus base
- Done. The file gets uploaded and converted to xml, and you can modify the
file-specific xsl-template from the browser.
Trond
- Work on the bug list (11 open).
- Still open
- Work on the name agreement with "Norge digitalt" with Thomas
- CD was empty.
- This one can be dropped, as it is transferred to Sjur and the board
- Get the new version of the New Testament
- Still only promises from Bibelselskapet.
- project planning with Sjur
- Done some work, but backlash from Merlin.
- Work on the name project with Maaren and Sjur
- Much done, cf. main agenda.
- Prepare for a Lule Sámi meeting with Árran
- Not done.
- Work on the numerals project with Maaren
- Not done.
- Prepare for three-part compounds meeting in Guovdageaidnu
- Not done.
- Contact the University lawyers for comments on the contract
- Still waiting for reaction from the lawyer at the research dept./UiT
3. Board meeting summary
The participants were satisfied with the progress of the Divvun project,
The meeting went through the criteria for participating as well as for selecting
The suggestion for a permanent maintenance organisation was accepted, and will
The proposed South Sámi project was accepted as well, and work for finding
4. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
("To be done by the ones making the corpora": Børre, Tomi, Trond, Saara).
- Now we have 4 documents:
- Correct corpus (disamb usage)
- Corpus plan (for the disamb corpus cwb)
- Corpus conversion, two versions, in infra and in ling. Tomi and Børre
have done parallel work ;-(
- catxml
For the basic corpora, we need 3 types of documentation, or doc for 3 target
- For the users/linguists:
- Which corpora exist, and how do I use them (this info is now scattered)
- For the collectors:
- How do I add texts, where do I add them, how do I convert them (this is the
Corpus conversion doc)
- For the programmer
- What did I actually do? (this is partly the catxml doc)
For the work on the graphical user interface, we need documentation as well, in
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
5. Corpus gathering
- Governmental documents (earlier in pdf, now in html)
Tasks:
- move existing gov. documents (pdf) from gt/ to our corpus repository
- Collect public (pdf and html) files.
Contracts
Tasks:
- Follow-up on the lawyers' comments (Trond has started with the university)
- add a background document explaining the model (Sjur)
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
North Sámi New Testament
If we don't hear anything from Bibelselskapet, we will have to use the version
Lule Sámi Dictionary
We will invite Anders Kintel to a meeting in Tysfjord on Nov 17th, where we
Bitte and Børre will participate, as well as Sjur, given that
6. Corpus infrastructure
Naming conventions and directory structure
- The original file should be protected using file and directory permissions.
- The meta information (i.e., the xsl translation files) should be under version
control
- Given that our language detection works well, the intermediate files don't need
to be under version control (the language identification tool is under gt/script, and it needs to be made part of the corpus processing)
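As an illustration of protecting original files through permissions, here is a minimal Python sketch. It only clears the write bits on a single file; the function name and the temporary demo file are assumptions made for this example, and restricting write access to root would in addition require suitable file ownership, which the sketch does not set up.

```python
import os
import stat
import tempfile


def protect_original(path):
    """Remove all write permission bits from an original corpus file.

    Hypothetical helper: ownership (e.g. root) is assumed to be handled elsewhere.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)


# Demonstration on a throwaway file, not on the real corpus repository.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.doc")
    with open(original, "w") as handle:
        handle.write("placeholder")
    protect_original(original)
    demo_mode = stat.S_IMODE(os.stat(original).st_mode)
    print(oct(demo_mode))
```

The same idea would apply to the directories holding the originals, where the write bit controls who may add or delete files.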
Tasks:
- Make a system for file and directory permissions (today: we all belong to the
cvs group), to only allow people with root user privileges write access to the corpus repository, at least regarding original files
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- the dir structure is:
- one dir for orig, containing also the meta-info and interm. files
- another dir for our ready-to-use xml files after conversion
- dir structure for web-posted corpus files:
- subdivision according to week or month, we start out with month till we see
the amount of traffic (yyyy-mm)
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
hyphenator, the hyphenation points should be retained
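The two hyphenation modes described above could be sketched roughly as follows. This is a Python illustration only, not the catxml/preprocess code. It assumes that a hyphenation point shows up as a hyphen directly before a line break inside a word; the memo does not say how hyphenation points are actually encoded, and the function name is made up.

```python
import re

# Assumption: a hyphenation point is a hyphen between two word characters
# with a line break right after the hyphen.
HYPHENATION_POINT = re.compile(r"(?<=\w)-\n(?=\w)")


def handle_hyphenation(text, keep=False):
    """Remove hyphenation points, or retain them when keep is True.

    keep=True corresponds to robustness testing of the parsers and
    testing of the hyphenator; keep=False is the normal case.
    """
    if keep:
        return text
    return HYPHENATION_POINT.sub("", text)


print(handle_hyphenation("giella-\ntekno"))
print(repr(handle_hyphenation("giella-\ntekno", keep=True)))
```

Note that this simple rule would also join a genuine hyphenated compound that happens to be broken at the hyphen, so a real solution would need more context.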
Corpus conversion
Pdf to XML
Saara has made a new conversion module, it is almost finished. We'll return
Task: Saara to prepare for this.
HTML to XML
Tomi has been looking at this, and is making an xsl script for it. The web form
The URL posting needs to check whether the same URL has been posted before, and
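A rough Python sketch of such a check, combined with the month-based (yyyy-mm) subdivision for web-posted files noted under the directory structure. The normalisation rule, the in-memory set and all function names are assumptions for illustration; the actual web form may store and compare URLs differently.

```python
from datetime import date
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Lower-case scheme and host and drop the fragment (assumed rule)."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def month_directory(posted):
    """Name of the yyyy-mm subdirectory for a posting date."""
    return posted.strftime("%Y-%m")


def register_url(url, seen):
    """Return True if the URL is new, False if it has been posted before."""
    key = normalise_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True


seen_urls = set()
print(register_url("http://Example.org/doc.html#top", seen_urls))
print(register_url("http://example.org/doc.html", seen_urls))
print(month_directory(date(2005, 10, 10)))
```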
XHTML to XML
Tomi has been looking at this as well.
Task: Tomi and Saara to present status quo and suggest routines, merger,
7. Linguistics
Name lexicon
Summary: see the newsgroup
Motivation:
- Divvun: We want to cross-link different versions of the same locations
in different languages
- Common: We do not want to enter the same names twice. We want a
language-independent name lexicon
- Disamb: Having a richer tag set makes it easier to disambiguate
- Future: Richer analysis makes new applications possible, within
information retrieval, grammar checking, machine translation etc.
Needed: A plan for this project:
a. do the main markup in the present propernoun file
Conversion:
- This week
- clean up the present infl. lexicons (merge BLIND and BERN, VUOLAB and LONDON)
- Trond
- Make an emacs mode for markup (Saara). Options: fem, mal, sur, plc, org,
obj, none. Combinations: surplc
- (end of this week and) Next week:
- Mark up as much as possible within a week or so (Maaren to do the Sámi
names, and to split CNAME into BERN and LONDON, Trond and Børre to look at the rest)
- Then convert to xml
- Then mark up the rest with correct semantic tags
- This means we would need a seventh option, the unspecified name.
- Look into efficient editing of the XML lexicon
- Look into synchronisation issues with risten.no - we want the names there
as well
Status quo:
Needed tools: An emacs mode doing this (Saara):
Possible refinement: Encode for combined options (both plc and sur, e.g.)
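To make the markup options concrete, here is a small Python sketch of how a tag such as surplc could be checked and split into its parts. It assumes that combined tags are plain concatenations of the three-letter options and that none stands alone. The function name is hypothetical, and the planned emacs mode may encode combinations differently.

```python
# The three-letter options from the plan above; "none" is handled separately.
BASE_TAGS = ("fem", "mal", "sur", "plc", "org", "obj")


def split_name_tag(tag):
    """Split a (possibly combined) name tag into its base options.

    Raises ValueError for anything that is not built from the known options.
    """
    if tag == "none":
        return []
    if not tag or len(tag) % 3 != 0:
        raise ValueError("not a valid name tag: %r" % tag)
    parts = [tag[i:i + 3] for i in range(0, len(tag), 3)]
    unknown = [part for part in parts if part not in BASE_TAGS]
    if unknown or len(set(parts)) != len(parts):
        raise ValueError("not a valid name tag: %r" % tag)
    return parts


print(split_name_tag("surplc"))
print(split_name_tag("plc"))
```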
Twol SETS definition issue
The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
SUGGESTION (Trond): We have a separate meeting, e.g. Thomas, Trond,
North Sámi
- three-part compounds issue still open, as is the number project.
- The treatment of Sámi place names: we need a contract with "Norge digitalt",
via UFD.
- Sjur has written an e-mail to the UFD contact person,
Øystein Johannessen, who will look into it soon. He has not responded beyond saying he will return to it. Sjur brought this up in the board meeting, and Bjørn Olav Megard will remind Øystein Johannessen about this issue. Sjur will follow up on this one.
- normativity issues:
- the Giellalávdegoddi meeting is sometime in October, maybe next week.
Lule Sámi
Lule Sámi issues will be discussed at the Tuesday meeting between Sjur,
Numerals
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
We will return to this issue after the name conversion.
8. Speller infrastructure
Nothing this week.
9. Other
Technical issues
- The Mac OS / perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
82. This message did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
33.
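As a possible aid when narrowing this down, the following Python sketch lists the lines of an input that are not valid UTF-8. A lone byte 0xC4 is one such case, since it is a UTF-8 lead byte without a continuation byte. This is only a diagnostic illustration in Python, not a fix for the perl script; it does not confirm the suspected perl - OS mismatch, and the function name and sample data are made up for the example.

```python
def find_invalid_utf8(data):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad_lines = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad_lines.append((number, line))
    return bad_lines


# Sample input: a UTF-8 line, a Latin-1 encoded "Ä" (byte 0xC4), and an ASCII line.
sample = "Sámi\n".encode("utf-8") + "Ä\n".encode("latin-1") + b"ok\n"
print(find_invalid_utf8(sample))
```

Running something like this on the file being preprocessed would show whether the input itself contains non-UTF-8 bytes, or whether the message arises from how the script reads otherwise valid input.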
Bug fixing
13 open bugs (and 24 risten.no bugs)
Buying
- rucksacks for all
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
(old) GIO members? Yes, it is ok, but how much still needs to be evaluated
- it is ok to integrate "kvensk" (Kven) place names with risten.no
- this should be integrated with the general proper name work - we want all
proper names integrated with risten.no, cf. above
- needs further development of risten.no to allow for multiple XML bases to
be presented and maintained in parallel. This is to be further worked on by Tomi and Sjur
10. Summary, task list
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Reorganise the directory structure
- Continue converting text from input format to our xml
- Have a look at the placenames files.
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Gather public texts
- Work on the name lexicon
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
file-for-file review, in order to get different terminology.
- continue working with the missing list from risten.no
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- Have a look at the pdf-to-xml issue
- use the priority list earlier in the memo as guidance
- make an emacs mode for the name project (cf. specs in the memo above)
- prepare for a presentation of the pdf etc. conversion together with Tomi
for the next meeting.
Sjur
- Lule Sámi twol problems, have a look at the sets definition
- risten.no bugs and fixes
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- project planning with Trond, continued
- Prepare for a Lule Sámi meeting with Anders Kintel 17th of November
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
Thomas
- work on Lule Sami compounding and derivation
- Meet with Sjur and Trond about the definition of G1, G2, G3 in Lule Sámi
- Look at Linguistic bugs with Trond
- Prepare for a Lule Sámi meeting with Árran
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
- Cgi-script for uploading documents to corpus base
- Add URL uploading
- Contact Saara about xml conversion
- prepare for a presentation of the pdf etc. conversion together with Saara
for the next meeting.
Trond
- Work on the bug list (11 open).
- Get the new version of the New Testament
- project planning with Sjur, continued
- Follow up with the University lawyers for comments on the contract
- Work on the name project: Clean up the lexicon file, discuss the emacs mode with
Saara and the work with Maaren and Børre.
- Add documentation on the corpus infrastructure
11. Next meeting, closing
17.10.2005 10:00
Closed at 12:25

