Meeting_2005-10-17
Meeting setup
- Date: 18.10.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:07.
Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond
Absent: Saara
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Contacted the archiving department, where they told me how to search for
- Reorganise the directory structure
- Not done
- Continue converting text from input format to our xml
- Not done
- Have a look at the placenames files.
- Not done
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Not done
- Gather public texts
- Some work done
- Work on the name lexicon
- Not done
- Other work
- Fixed the divvun2web script to skip the doc/admin/Projects directory
- Discussed with Tomi on how to implement the new corpus structure
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- Not done
- continue working with the missing list from risten.no
- done a little bit
- Start working on Sámi place names
- not done
- Start working on normativity issues (numeral issues with Trond?)
- not done
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Done some work on that issue.
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- Got a grasp of it.
- Have a look at the pdf-to-xml issue
- Done, but not documented.
- use the priority list earlier in the memo for guidance
- Sjur: I assume this one can be closed?
- make an emacs mode for the name project (cf. specs in the memo above)
- Not done.
- prepare for a presentation of the pdf etc. conversion together with Tomi
- Not done.
Sjur
- Lule Sámi twol problems, have a look at the sets definition
- Done, but more work is needed
- risten.no bugs and fixes
- Nothing
- follow up on:
- voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- Nothing
- project planning with Trond, continued
- More evaluation of tools, due to some of the limitations of Merlin
- Prepare for a Lule Sámi meeting with Anders Kintel 17th of November
- Berit Karen Paulsen will invite him; Børre, Sjur and Bitte will participate
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- Nothing
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- Nothing
- write a background document on the corpus contracts
- Nothing
- Other:
- went through the comments from the lawyer with Trond, re 1. contract
- checked all open Divvun bugs in Bugzilla, updated and closed some
- wrote a lengthy e-mail to the makers of Merlin, with several requests for
- 2 days off in Trondheim
Thomas
- work on Lule Sámi compounding and derivation
- worked and still working
- Meet with Sjur and Trond about the definition of G1, G2, G3 in Lule Sámi
- we had our meeting
- Look at Linguistic bugs with Trond
- looked at some
- Prepare for a Lule Sámi meeting with Árran
- not done
- Sjur: This can be removed, it is now in the hands of Sjur/Bitte/Berit Karen
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done.
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- corpus infrastructure: file and dir organisation
- Still working
- Document aspell and corpus infrastructure
- Add html-to-xml conversion to corpus infra
- Done
- Cgi-script for uploading documents to corpus base
- Add URL uploading
- Functionality not implemented yet
- Contact Saara about xml conversion
- prepare for a presentation of the pdf etc. conversion together with Saara
- Not done
- Other tasks:
- Wrote a new XML-processing tool in C++ to replace the catxml script.
Trond
- Work on the bug list (11 open).
- 7 open.
- Get the new version of the New Testament
- Still no answer in the second round.
- project planning with Sjur, continued
- Done some.
- Follow up with the University lawyers for comments on the contract
- Discussed with lawyer, gone through 1 of 3 with Sjur.
- Work on the name project: Clean up the lexicon file, discuss the emacs mode with
- Still awaiting Saara's emacs mode, while waiting I have converted some thousand names.
- Add docu on the corpus infrastructure
- Don't remember this discussion, not done.
- Other:
- Meeting with Kvensk revitalisation project
- Discussions with Linda on work tasks, preparing things.
- Done work on the Lule Sámi rule set with Sjur and Thomas.
3. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
- Now we have 4 documents:
- Correct corpus (disamb usage)
- Corpus plan (for the disamb corpus cwb)
- Corpus conversion, two versions, in infra and in ling. Tomi and Børre
- catxml
For the basic corpora, we need 3 types of documentation, or doc for 3 target
- For the users/linguists:
- Which corpora exist, and how do I use them (this info is now scattered)
- For the collectors:
- How do I add texts, where do I add them, how do I convert them (this is the
- For the programmer
- What did I actually do? (this is partly the catxml doc)
For the work on the graphical user interface, we need documentation as well, in
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
4. Corpus gathering
- Governmental documents (earlier in pdf, now in html)
Tasks:
- move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
- Collect public (pdf and html) files (Børre)
Contracts
Tasks:
- Follow-up on the lawyers' comments (Trond has started with the university)
- Still more work to be done.
- add a background document explaining the model (Sjur)
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
North Sámi New Testament
- If we don't hear anything from Bibelselskapet, we will have to use the version
- Still nothing. Trond will inform them that we will use what we have.
Lule Sámi Dictionary
We will invite Anders Kintel to a meeting in Tysfjord on Nov 17th, where we
Bitte and Børre will participate, as well as Sjur, given that
- Sjur will check whether Berit Karen has contacted Anders Kintel
5. Corpus infrastructure
- Wrote a new XML-processing tool in C++ to replace the catxml script.
- where to put it:
- binary:
- source:
- specification:
- command-line interface
- what we have already
- the directory reorganisation discussion between Børre and Tomi
- Directory structure should be defined according to the xml-file metainformation
<!-- scheme="dewey" code="44444" -->
<!-- scheme should be dewey or uit or whatever -->
<!-- UiT: bible, news, fiction, facts, adm, ... -->
<!ELEMENT genre EMPTY >
<!ATTLIST genre
    scheme CDATA #REQUIRED
    code   CDATA #REQUIRED >

<genre scheme="uit" code="news" />
<genre scheme="dewey" code="444" />
For reference: This is what we decided in Helsinki:
admin/
    depts/   (governmental departments)
    guovda/  (Guovdageaidnu municipality)
    karas/   (Kárášjohka municipality)
    sd/      (Sámi parliament)
    others/  (everything else)
bible/
    ot/
    nt/
facta/
ficti/
laws/
news/
    MinAigi/
    Assu/
    NRK/
    YLE/
    other/
- Reprocess the old (from new dir.) corpus files
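The idea of deriving the directory from the xml-file metainformation can be sketched roughly as follows. This is only a minimal Python sketch, assuming a `genre` element with `scheme` and `code` attributes as in the fragment above. The function name `genre_to_dir` and the pairing of UiT codes with the Helsinki directory names are hypothetical, not something decided in the meeting.

```python
import xml.etree.ElementTree as ET

# Hypothetical pairing of UiT genre codes with the top-level directories
# from the Helsinki list; the pairing itself is an assumption.
UIT_GENRE_DIRS = {
    "adm": "admin",
    "bible": "bible",
    "facts": "facta",
    "fiction": "ficti",
    "news": "news",
}


def genre_to_dir(xml_text):
    """Return the top-level corpus directory for a document, or None.

    Only genre elements with scheme="uit" are considered; other schemes
    (for example dewey) are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    # iter() also yields the root element itself when its tag matches.
    for genre in root.iter("genre"):
        if genre.get("scheme") == "uit":
            return UIT_GENRE_DIRS.get(genre.get("code"))
    return None
```

A document without a `uit` genre yields `None`, so the caller has to decide where such files go (for instance a holding directory until the metadata is filled in).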
Naming conventions and directory structure
- The original file should be protected using file and directory permission.
- The meta information (i.e., the xsl translation files) should be under version
- Given that our language detection works well, the intermediate files don't need
Tasks:
- Make a system for file and directory permission (today: we all belong to the
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- the dir structure is:
- one dir for orig, containing also the meta-info and interm. files
- another dir for our ready-to-use xml files after conversion
- dir structure for web-posted corpus files:
- subdivision according to week or month, we start out with month till we see
- Done
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
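The two hyphenation modes mentioned above could look something like this. It is a minimal Python sketch, not the catxml/preprocess implementation. It assumes that a hyphenation point is encoded as a soft hyphen (U+00AD), which the memo does not state, and the function name `strip_hyphenation` is hypothetical.

```python
# Assumption: hyphenation points are stored as soft hyphens (U+00AD).
SOFT_HYPHEN = "\u00ad"


def strip_hyphenation(text, keep=False):
    """Remove hyphenation points from text.

    With keep=True the text is returned unchanged, which is the mode
    one would use when testing with hyphenated input.
    """
    if keep:
        return text
    return text.replace(SOFT_HYPHEN, "")
```

Making removal the default matches the "normal case" in the list above, while the `keep` switch leaves the test scenario available without a second code path.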
Corpus conversion
Pdf to XML
Saara has made a new conversion module; it is almost finished. We'll return
Task: Saara to prepare for this presentation, and to make documentation.
perldoc gt/script/samiChar/Decode.pm
(X)HTML to XML
Tomi has been looking at this, and is making an xsl script for it. The web form
The URL posting needs to check whether the same URL has been posted before, and
Task: Tomi and Saara to present status quo and suggest routines, merger,
The documentation for corpus conversion should be added to the
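The check for previously posted URLs in the (X)HTML section could be sketched as below. This is a minimal Python sketch under stated assumptions: the names `normalise_url` and `UrlRegistry` are hypothetical, the registry lives in memory only, and the normalisation rules (lower-case scheme and host, fragment dropped) are one possible choice rather than an agreed routine.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Lower-case scheme and host, keep path and query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlRegistry:
    """In-memory record of the URLs that have been posted so far."""

    def __init__(self):
        self._seen = set()

    def register(self, url):
        """Return True if the URL is new, False if it was posted before."""
        key = normalise_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A real upload script would have to persist the registry between requests (a file or a database table), and would probably also want to compare the fetched content, since the same document can be reachable under several URLs.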
6. Linguistics
Name lexicon
Summary: see the newsgroup
Motivation:
- Divvun: We want to cross-link different versions of the same locations
- Common: We do not want to enter the same names twice. We want a
- Disamb: Having a richer tag set makes it easier to disambiguate
- Future: Richer analysis makes new applications possible, within
Needed: A plan for this project:
a. do the main markup in the present propernoun file
Conversion:
- This week
- clean up the present infl. lexicons (merge BLIND and BERN, VUOLAB and LONDON)
- Make an emacs mode for markup (Saara). Options: fem, mal, sur, plc, org,
- (end of this week and) Next week:
- Mark up as much as possible within a week or so (Maaren to do the Sámi
- Then convert to xml
- Then mark up the rest with correct semantic tags
- This means we would need a seventh option, the unspecified name.
- Look into efficient editing of the XML lexicon
- Look into synchronisation issues with risten.no - we want the names there
Updated status quo:
- Entries: 20000
- Converted: 13500
- Time used: 10 h
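As a rough worked estimate from the figures above: 13500 names in 10 hours is 1350 names per hour, so the remaining 6500 entries would take roughly 4.8 hours at the same pace. The Python sketch below does the same arithmetic; the function name is hypothetical, and the constant-rate assumption is optimistic if the remaining names are the harder ones.

```python
def remaining_hours(total, converted, hours_used):
    """Estimate the hours left, assuming the conversion rate stays constant."""
    rate = converted / hours_used          # entries per hour so far
    return (total - converted) / rate      # hours for the remaining entries


estimate = remaining_hours(20000, 13500, 10)
print(round(estimate, 1))
```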
Needed tools: An emacs mode doing this (Saara):
Possible refinement: Encode for combined options (both plc and sur, e.g.)
Waiting for emacs mode.
Twol SETS definition issue
The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
Update: it is still not working, see bug 193
SUGGESTION (Trond): Thomas, Trond and Sjur have a new meeting on
North Sámi
- three-part compounds issue still open, as is the number project.
- The treatment of Sámi place names, we need a contract with "Norge digitalt",
- Sjur has written an e-mail to the UFD contact person,
- normativity issues:
- the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
- Actions: Sjur will bring this to the Divvun board, write a new letter to
Lule Sámi
Sjur, Thomas and Trond will continue with the Lule Sámi issues.
Numerals
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
We will return to this issue after the name conversion.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The Mac OS / Perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
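One possible reading of the first message is that the byte 0xC4 is being decoded as UTF-8 without a following continuation byte, for example because the input is not UTF-8 at all or because a two-byte sequence was split. The Python sketch below only illustrates that behaviour; it does not diagnose the actual problem in the Perl preprocess script, and the helper name `decode_utf8` is hypothetical.

```python
def decode_utf8(data):
    """Decode bytes as UTF-8; return (text, None) or (None, error)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


# 0xC4 on its own is the first half of a two-byte sequence: not decodable.
text, error = decode_utf8(b"\xc4")
print(text, type(error).__name__)

# Followed by a continuation byte it is a complete character (U+0111).
text, error = decode_utf8(b"\xc4\x91")
print(text, error)
```

If the failing input turns out to be in a legacy 8-bit encoding, the fix belongs in how the file is opened rather than in the text processing itself.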
Bug fixing
10 open bugs (and 24 risten.no bugs)
Buying
- rucksacks for all
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
Meeting with Kvensk revitalisation project
Grammar, dictionary, placename lexicon for Kvensk. They want similar
9. Summary, task list
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Doing some digging into WebSak
- Reorganise the directory structure
- Put all corpus texts into one place
- Continue converting text from input format to our xml
- Have a look at the placenames files.
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Gather public texts
- Work on the name lexicon
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- continue working with the missing list from risten.no
- working with the missing list from risten.no this week (today)
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- make an emacs mode for the name project (cf. specs in the memo above)
- prepare for a presentation of the pdf etc. conversion together with Tomi
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on the meeting with Anders Kintel 17th of November -> ask
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Trond, return the new version to the lawyer
- write to the board about the lack of progress with the Giellalávdegoddi, and
- write to the Giellalávdegoddi once more, emphasizing timetable and response
- discuss kvensk project support with Trond
- write public tender documents
Thomas
- work on Lule Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Cgi-script for uploading documents to corpus base
- Specification for new catxml in C++
- this includes also placing the source and binary
- clean the script/ catalogue with Trond
- Common makefile issues
Trond
- Work on the bug list (7 open).
- Get the new version of the New Testament
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Discuss the contract issue with Sjur, return the new version to the lawyer.
- Work on the name project: Clean up the lexicon file, discuss the emacs mode with
- Add docu on the corpus infrastructure
- clean the script/ dir
- discuss kvensk project support with Sjur
10. Next meeting, closing
24.10.2005 10:00
Closed at 12:11