Meeting_2005-10-24
Meeting setup
- Date: 24.10.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:10.
Present: Børre, Saara, Sjur, Tomi, Trond
Absent: Thomas, Maaren
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Doing some digging into WebSak
- Will contact the Tromsø sámediggi department to get help on this.
- Reorganise the directory structure
- Done once; new decisions on Friday mean that all that work has to
- Put all corpus texts into one place
- Not done
- Continue converting text from input format to our xml
- Not done
- Have a look at the placenames files.
- Not done
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Not done
- Gather public texts
- Have done a test download of governmental html-texts
- Work on the name lexicon
- Not done
- Other, not scheduled
- Helping out Svenska Bibelsällskapet with making a current Lule Sámi
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- continue working with the missing list from risten.no
- working with the missing list from risten.no this week (today)
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- done, can we remove this? Yes, indeed.
- make an emacs mode for the name project (cf. specs in the memo above)
- done
- prepare for a presentation of the pdf etc. conversion together with Tomi
- done some
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- nothing done last week
- risten.no bugs and fixes
- nothing done, but I have received a lot of feedback and requests. This one
- follow up on voice group-chat not working to Sámediggi
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- Nothing done by the IT guys, they're too few and have too much to do.
- project planning with Trond, continued
- also look at the development processes - specification and testing
- looked a bit more into project management tools, but still not finished
- Follow up on the meeting with Anders Kintel 17th of November -> ask
- done
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- done
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- nothing more yet
- write a background document on the corpus contracts
- nope
- Discuss the contract issue with Trond, return the new version to the lawyer
- done, the contracts are now off for comments from Kimmo Koskenniemi
- write to the board about the lack of progress with the Giellalávdegoddi, and
- done
- write to the Giellalávdegoddi once more, emphasizing the timetable
- not done yet
- discuss kvensk project support with Trond
- nothing
- write public tender documents
- nothing done except adding this to my task list
- other:
- finally looked into several requests regarding Sámi speech synthesis,
- continued to work on open bugs
Thomas
- work on Lule Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- Not done
- three-part compounding
- Not done
- corpus infrastructure: dtd location (both public and internal)
- Not done
- corpus infrastructure: file and dir organisation
- Almost done, with Børre
- Document aspell and corpus infrastructure
- Documenting
- Cgi-script for uploading documents to corpus base
- Almost ready
- Specification for new catxml in C++
- this includes also placing the source and binary
- clean the script/ catalogue with Trond
- Not done
- Common makefile issues
- Done some
Trond
- Work on the bug list (7 open).
- Still 7 open bugs.
- Get the new version of the New Testament
- Not done.
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Done some work on the issue, albeit not with Sjur.
- Discuss the contract issue with Sjur, return the new version to the lawyer.
- Made a new version with Sjur, it is now in Hki for comments.
- Work on the name project: Clean up the lexicon file, discuss the emacs mode
- Done substantial work here: CNAME gone, unclassified names down from 35k to
- Add docu on the corpus infrastructure
- Hmm, don't remember this one. Not done.
- clean the script/ dir
- Not done.
- discuss kvensk project support with Sjur
- Not done.
3. Documentation
Documentation tasks:
- Add documentation on our corpus infrastructure and our corpus work in general
- Now we have 4 documents:
- Correct corpus (disamb usage)
- Corpus plan (for the disamb corpus cwb)
- catxml
For the basic corpora, we need 3 types of documentation, or doc for 3 target
- For the users/linguists: What corpora are found, how do I use them (this
- For the collectors: How do I add texts, where do I add them, how do I
- For the programmer: What did I actually do? (this is partly the catxml doc)
For the work on the graphical user interface, we need documentation as well, in
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you're doing :-) (all)
4. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Tasks:
- move existing gov. documents (pdf) from gt/ to our corpus repository (Børre)
- There are appr. 10 non-broken pdf documents in gt/sme/corp/original/
- Collect public (pdf and html) files (Børre)
- Done some test downloading, will have to look at tools to do this
Contracts
Tasks:
- Follow-up on the lawyers' comments (Trond has started with the university)
- Trond and Sjur finished the next revision of the contracts, and are
- add a background document explaining the model (Sjur)
The most problematic issue:
Who has the copyright of extracted material, like single words, collections of
North Sámi New Testament
- If we don't hear anything from Bibelselskapet, we will have to use the version
- Still not anything. Trond will inform them that we will use what we have.
Lule Sámi New Testament
Svenska Bibelsällskapet is putting their finishing touches to the Lule Sámi
Lule Sámi Dictionary
Sjur will check whether Berit Karen has contacted Anders Kintel.
5. Corpus infrastructure
Naming conventions and directory structure
New suggestions last Friday, with a proposal from Børre and Tomi:
orig/yyyy-mm/filename.doc /filename.doc.xsl /filename.doc.xml /samefilename.doc => samefilename.doc /samefilename.doc => samefilename-1.doc /This\ is\ a\ very\ cumbersome\ and\ long\ filename.doc => /This_is_a_very_cumbersome_and_long_filename.doc
Reasoning:
- What do we have to do manually, and what can be done automatically?
- If we name the docs manually, we need to document the original file name
- We can solve original filename from searching the title name from
- In the xsl file.
- Principle: All things manually go into the xsl file
- Principle: the gt catalogue is fully generated
- Principle: Use original file names in orig/, but replace SPACE with underscore
- Principle for naming .xml files:
- use orig file name if possible
- Use title when the orig filename is
- if none of the above leads to a unique filename, find a short and
If input document is filename.(doc|pdf|html|txt|whatever), it has a title
- What we want to know: when the doc arrived, parallel language docs, plus
- Could be implemented as empty field on the first conversion. The above
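The naming principles above (keep the original file name, replace SPACE with underscore, and give a clashing name a numeric suffix, as in samefilename.doc => samefilename-1.doc) could be sketched roughly as follows. This is only a minimal Python illustration, not the project's script: the function name `normalise_filename` and the in-memory set of already used names are assumptions made for the example.

```python
import os


def normalise_filename(name, taken):
    """Return a corpus file name for `name`, unique with respect to `taken`.

    Spaces become underscores. If the result is already in `taken`,
    a numeric suffix (-1, -2, ...) is inserted before the extension.
    The chosen name is added to `taken` (a hypothetical registry).
    """
    candidate = name.replace(" ", "_")
    stem, ext = os.path.splitext(candidate)
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{stem}-{counter}{ext}"
    taken.add(candidate)
    return candidate


taken_names = set()
first = normalise_filename("samefilename.doc", taken_names)
second = normalise_filename("samefilename.doc", taken_names)
long_name = normalise_filename(
    "This is a very cumbersome and long filename.doc", taken_names
)
print(first, second, long_name)
```

In a real setup the set of taken names would presumably come from the contents of the target directory rather than from memory.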
After a long discussion, we decided on the following:
orig/sme/news/thelongandstupidnameswegetasinputwithunderscore_for_space.doc
             /thelongandstupidnameswegetasinputwithunderscore_for_space.xsl
     sma
     smj
     nob
     fin
     swe
        /news/title2.xml
        /laws/title.xml
        /fict/title.xml   ! oops same name as cousin in laws/
        /fact
        /bibl
        /admi
gt/sme/news/thenewshortandsmartnameweinventedifneeded.xml (cf. lines 258-263, for smartness directions)
   sma
   smj
   nob
   fin
   swe
      /news/title2.xml
      /laws/title.xml
      /fict/title.xml   ! oops same name as cousin in laws/
      /fact
      /bibl
      /admi
   parallel.xml
What parallel.xml could look like:
<paradocs>
  <entry id="1">
    <file lang="sme" orig="yes">sme-file.xml</file>
    <file lang="nob">nob-file.xml</file>
  </entry>
  ...
  <entry id="1234">
    <file lang="sme" orig="yes">sme-OTHERfile.xml</file>
    <file lang="nob">nob-OTHERfile.xml</file>
  </entry>
</paradocs>
This decision is final!
Further discussion is directed to the news group.
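To illustrate how the parallel.xml layout proposed above might be used, here is a minimal Python sketch that looks up the parallel documents of a given file. It assumes that the attribute values are quoted, so that the file is well-formed XML; the function name `parallel_files` and the inline sample data are made up for the example and are not part of any existing script.

```python
import xml.etree.ElementTree as ET

# Sample data following the proposed layout, with quoted attributes.
SAMPLE = """\
<paradocs>
  <entry id="1">
    <file lang="sme" orig="yes">sme-file.xml</file>
    <file lang="nob">nob-file.xml</file>
  </entry>
  <entry id="1234">
    <file lang="sme" orig="yes">sme-OTHERfile.xml</file>
    <file lang="nob">nob-OTHERfile.xml</file>
  </entry>
</paradocs>
"""


def parallel_files(xml_text, filename):
    """Return {lang: file} for the other documents in the entry holding `filename`.

    Returns an empty dict if no entry mentions `filename`.
    """
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        files = {
            f.get("lang"): (f.text or "").strip() for f in entry.findall("file")
        }
        if filename in files.values():
            return {lang: name for lang, name in files.items() if name != filename}
    return {}


result = parallel_files(SAMPLE, "sme-file.xml")
print(result)
```

Keeping the mapping in one file like this means a document's parallels can be found without opening the documents themselves.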
The old task list is repeated for convenience:
- Make a system for file and directory permission (today: we all belong to the
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- the dir structure is:
- one dir for orig, containing also the meta-info and interm. files
- another dir for our ready-to-use xml files after conversion
- dir structure for web-posted corpus files:
- subdivision according to week or month, we start out with month till we see
- Done
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
Corpus conversion
All conversions (doc, pdf, html) are now integrated into one script.
Encoding conversion
perldoc gt/script/samiChar/Decode.pm
gt/script/convert2xml.pl
  --dir=dir_name    # The directory where the files are searched
  --use-decode      # Use the character decoding (for testing)
  --xsl=file_name   # The name of the xsl file. I am going to change this.
Tasks:
- testing
- add move to target directory
This is Documentation
Pdf to XML
Saara has made a new conversion module, it is almost finished.
Task: Saara to prepare for this presentation, and to make documentation.
(X)HTML to XML
This is implemented by Tomi, under gt/script/xhtml2corpus.xsl. Usage:
tidy --quote-nbsp no --add-xml-decl yes --enclose-block-text yes -asxml -utf8 -language sme file.html | xsltproc $HOME/gt/script/xhtml2corpus.xsl - > file.xml
Documentation
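For batch conversion, the tidy and xsltproc pipeline shown above could be wrapped in a small script. The following is a hedged Python sketch, not an existing tool in gt/script: the function names `build_commands` and `convert` are assumptions, while the tidy and xsltproc options are copied from the usage line above. Only xsltproc's exit status is returned, on the assumption that tidy's own status may be non-zero for mere warnings.

```python
import os
import subprocess


def build_commands(html_path, xsl_path):
    """Return the argument lists for the tidy step and the xsltproc step."""
    tidy_cmd = [
        "tidy", "--quote-nbsp", "no", "--add-xml-decl", "yes",
        "--enclose-block-text", "yes", "-asxml", "-utf8",
        "-language", "sme", html_path,
    ]
    # "-" makes xsltproc read the document from standard input.
    xslt_cmd = ["xsltproc", xsl_path, "-"]
    return tidy_cmd, xslt_cmd


def convert(html_path, xml_path, xsl_path):
    """Pipe tidy's output into xsltproc and write the result to xml_path."""
    tidy_cmd, xslt_cmd = build_commands(html_path, xsl_path)
    with open(xml_path, "wb") as out:
        tidy = subprocess.Popen(tidy_cmd, stdout=subprocess.PIPE)
        xslt = subprocess.Popen(xslt_cmd, stdin=tidy.stdout, stdout=out)
        tidy.stdout.close()  # xsltproc now holds the only read end of the pipe
        xslt.wait()
        tidy.wait()
    return xslt.returncode


# Building the commands does not require tidy or xsltproc to be installed.
tidy_cmd, xslt_cmd = build_commands(
    "file.html", os.path.expanduser("~/gt/script/xhtml2corpus.xsl")
)
print(" ".join(tidy_cmd), "|", " ".join(xslt_cmd))
```

Calling `convert("file.html", "file.xml", xsl_path)` would then correspond to the one-line shell command, provided both programs are on the PATH.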
6. Linguistics
Name lexicon
Summary: see the newsgroup
Motivation:
- Divvun: We want to cross-link different versions of the same locations
- Common: We do not want to enter the same names twice. We want a
- Disamb: Having a richer tag set makes it easier to disambiguate
- Future: Richer analysis makes new applications possible, within
Needed: A plan for this project:
- do the main markup in the present propernoun file
- make a script for converting it to xml (to be done one time)
- make a script for xml2lexc (to be done by the makefile)
- There is a sample file for the xml file format in gt/common/src/proper-nouns.xml
- There is a working xml2lexc for Komi, written by Saara
- make the tags etc. in the parser
Conversion:
- This week
- (end of this week and) Next week:
- Then add the +Plc, +Mal, etc. tags in the parser
- Mark up as much as possible within a week or so (Maaren to do the Sámi
- Still to be done:
7985 DEATNU
3836 LONDON
1939 BERN
1388 C-FI-NEN
692 ACCRA
471 NYSTØ
134 MARJA
118 DUORTNUS
59 NIILLAS
45 ALEUHTAT
43 ANAR
29 SULLOT
20 GIEDDI
17 HEANDARAT
8 GUOLBBA
4 VARGGAT
4 GEAVNNIS
4 EATNAMAT
1 ROMSA
- list continued:
- Then mark up the rest with correct semantic tags
- This means we would need a seventh option, the unspecified name.
- Then split propernoun-sme-lex.txt into two, one with the sami name being
- Look into efficient editing of the XML lexicon
- Then convert to xml
- Look into efficient editing of the XML lexicon again
- Look into synchronisation issues with risten.no - we want the names there
Updated status quo:
- Converted: 19400
- Still left: 15000 (8000 of which are pretty straightforward, the DEATNU case)
- Time used: 20 h
Twol SETS definition issue
The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
Update: it is still not working, see bug 193
SUGGESTION (Trond): Thomas, Trond and Sjur didn't meet last week
North Sámi
- three-part compounds issue still open
- number project still open
- The treatment of Sámi place names, we need a contract with "Norge digitalt",
- Sjur has written an e-mail to the UFD contact person,
- normativity issues:
- the Giellalávdegoddi meeting was last Friday, they will have a new meeting in
- Actions: Sjur will bring this to the Divvun board, write a new letter to
- The document with the list of open issues
Lule Sámi
Sjur, Thomas and Trond will cont. Lule Sámi issues.
Numerals
- The issue is postponed to next week.
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
We will return to this issue after the name conversion.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The mac os / perl bug (at least Trond and Sjur have it):
- utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line
- Another example of the same bug:
- : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk
- One way to "resolve" this is to redirect the error messages to /dev/null:
... | preprocess 2> /dev/null | lookup ...
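Rather than discarding the messages, one could try to locate the input lines that trigger them. The sketch below rests on an assumption that was not established at the meeting, namely that the messages are caused by input bytes that are not valid UTF-8 (for example text still in Latin-1). The function name `invalid_utf8_lines` is made up for this illustration, and the code is Python rather than a patch to the Perl preprocess script.

```python
def invalid_utf8_lines(data):
    """Return the 1-based numbers of the lines in `data` (bytes) that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Line 1 is UTF-8, line 2 is Latin-1 (assumed to be the problematic kind),
# line 3 is plain ASCII.
sample = "Sámi\n".encode("utf-8") + "Sámi\n".encode("latin-1") + b"plain ascii\n"
bad_lines = invalid_utf8_lines(sample)
print(bad_lines)
```

Running such a check on a corpus file before preprocessing would show whether the warnings point to real encoding problems in the input.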
Video conferencing across firewalls
The problem we've had with the SD firewall persists, and there doesn't seem to
Bug fixing
17 open bugs (and 24 risten.no bugs)
Bugzilla:
37 nor P2 Mac thor.oivind.johansen@hum.ui... ASSI Bugzilla is not able to handle the Sámi characters.
197 nor P2 Mac boerre@skolelinux.no NEW Links to Bugzilla must be checked and corrected for new s...
UTF-8:
61 nor P2 Mac boerre@skolelinux.no ASSI mpage barfs on utf-8 input
196 nor P2 All boerre.gaup@samediggi.no NEW UTF-8 encoded html gets garbled
Corpus:
160 nor P2 Mac tomi.pieski@hum.uit.no NEW Hyphen not recognised in Genesis
187 nor P2 All tomi.pieski@hum.uit.no ASSI catxml is undocumented
188 nor P2 All tomi.pieski@hum.uit.no ASSI catxml crashes if XML/Twig.pm is not installed
198 nor P2 Mac tomi.pieski@hum.uit.no NEW xsl script for Bible files does not single out chapter he...
Hard to solve:
77 nor P2 Mac trond.trosterud@hum.uit.no ASSI consonantchange in the end of verbstem
háliidit d > t in final position -ijd is spelled iid and should be spelled -iit. We should have had ''in háliit'' but do have ''in háliid''
Present situation:
háliit háliit +? #wrong
háliid háliidit+V+TV+Ind+Prs+ConNeg #wrong
maid maid+Interj #ok, but not if háliit is corrected
maid maid+Adv #ok, but not if háliit is corrected
guliid guolli+N+Pl+Gen #ok, but not if háliit is corrected
maid mii+Pron+Interr+Pl+Acc #ok, but not if háliit is corrected
G3 definition issue:
50 nor P2 Mac Maren.Palismaa@Samediggi.no NEW LEXICON-GEARGGUS and others
56 nor P2 Mac trond.trosterud@hum.uit.no ASSI -headdjiid and -heddjiid
186 nor P2 Mac trond.trosterud@hum.uit.no ASSI No dipht. simpl in actor nouns before uj
193 nor P2 Mac trond.trosterud@hum.uit.no NEW oa->å dipht. simpl. in actor nouns
Numeral project:
6 nor P2 All tomi.pieski@hum.uit.no NEW Num tag is needed in compounds, but stripped in lookup2cg
158 nor P2 Mac trond.trosterud@hum.uit.no ASSI Num+Sg+Gen+logi
169 nor P2 Mac trond.trosterud@hum.uit.no NEW golbmalohkása
176 nor P2 Mac trond.trosterud@hum.uit.no NEW beal+Ord
Bugzilla update
Buying
- rucksacks for the whole Divvun team
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
- it is ok to integrate "kvensk" placenames with risten.no
- this should be integrated with the general proper name work - we want all
- needs further development of risten.no to allow for multiple XML bases to
Project planning and development processes
Trond is using his project as a test case for an IT guy, Geir Tore Voktor,
9. Summary, task list
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Get help from the Tromsø department of Sámediggi to dig in WebSak
- Gather public texts
- Reorganise the directory structure
- Put all corpus texts into one place
- Continue converting text from input format to our xml
- Ask Thor-Øivind to move bugzilla to our new webserver.
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
- continue working with the missing list from risten.no
- working with the missing list from risten.no this week (today)
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- make an emacs mode for the name project (cf. specs in the memo above)
- prepare for a presentation of the pdf etc. conversion together with Tomi
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
- risten.no bugs and fixes
- discuss risten.no work with Tomi
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Trond, return the new version to the lawyer
- Call Kimmo Koskenniemi for comments
- write to the Giellalávdegoddi once more, emphasizing timetable and response
- discuss kvensk project support with Trond
- write public tender documents
Thomas
- work on Lule Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Cgi-script for uploading documents to corpus base
- Specification for new catxml in C++
- this includes also placing the source and binary
- clean the script/ catalogue with Trond
- Common makefile issues
- discuss risten.no work with Sjur
Trond
- Work on the bug list (7 open).
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Work on the name project:
- Introduce the +Mal, +Fem, ... tags to the parser
- clean the script/ dir
- discuss kvensk project support with Sjur
10. Next meeting, closing
31.10.2005 10:00
Closed at 12:36