Meeting_2006-03-20
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Summary, task list
- 12. Next meeting, closing
Meeting setup
- Date: 13.03.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:40.
Present: Børre, Linda (from topic 7 onwards), Maaren, Saara, Sjur, Thomas,
Absent: none
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- http://girji.info/skolehist done …
- Continue converting text from input format to our xml
- Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Waiting for paratext versions
- review the paratext2xml converter
- Waiting for paratext versions of nob and nno
- convert smj NT to paratext
- Waiting for paratext versions of nob and nno
- Call Ove Sæth
- Not done
- Move complex name lexicon issue to bugzilla
- Not done
- Ask KIO Grafisk to make a test Quark document based on a Word document from us
- Iđut and KIO Grafisk won't give access to their Quark files, so they don't
- Send out letters to the Iđut authors
- Åge Persen will collect an address list.
- Add corpus security re G5 syncing as an issue to Bugzilla
- Not done
- set up anon. read-only cvs with Sjur
- done
- write docu for how to apply for a corpus user account (forms, recipients,
- Not done
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- Not done
- integrate generated corpus repository documents in the Forrest site
- Not done
- give Saara the needed details regarding corpus dtd location on our public
- Not done
- fix bugs!
Maaren
- work with the top-ten list
- done
- translate stopword list from Norw. to smj, fin (aligner, stopword list from
- ?????? (ask Trond)
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
- Done.
- Create a parallel corpus of the New Testaments.
- Implement validation of xml corpus against the dtd.
- Done, not installed.
- Create a group for corpus users.
- learned how to create groups and added one: corpus.
- Finish corpus dtd documentation, dtd location and permlink reference
- The dtd-location still open.
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- Implemented, some questions left before installation.
- add more texts to the graphical corpus interface.
- Not done. emailed Lars some questions.
- grammatical searchability in the graphical corpus interface
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- not done
- Lule Sámi twol problems, with Thomas and Trond
- not done
- project planning with Trond, continued
- not done
- Follow up on place names from Norge Digitalt
- not done
- Evaluate SFST as speller (and analyzer) lexicon
- not done
- write a background document on the corpus contracts
- not done
- public tender:
- answer requests/questions
- no questions this week
- set up anon. read-only cvs with Børre
- done (well, I really did nothing)
- corpus repo access
- still open, but we'll let them have only the free part for now; further
- conversion of corpus repo summary xml to Forrest xml
- not done
- smj G3 issue with Thomas and Trond
- not done
- sme G3 issue with Thomas and Trond
- not done
- call EDD/ Christian Emil Ore about national place name lexicon
- not done
- risten.no/proper noun lexicon development:
- refactor code
- did some small adjustments - most of the work waiting for the task below
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- first version of this system implemented and working for the initial query
- fix bugs!
- no bugs fixed last week :-(
Thomas
- add incoming Lule Sámi words
- added and still adding
- include the SGL decisions in our normativity document
- done
- include normativity decisions made by Magga and Sammalahti in our normativity
- done
- work on North Sámi compounding and derivation
- nothing
- smj G3 issue with Sjur and Trond
- nothing
- sme G3 issue with Sjur and Trond
- nothing
- translate stopword list into smj (aligner; list from Trond)
- not done
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- not done
- corpus infrastructure:
- dtd location (both public and internal)
- not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- not done
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- implement data synchronisation of proper nouns between risten.no and CVS
- looked briefly
- XQuery refactoring and code development for our proper noun editor
- helped Sjur with this one
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner docu, install, provide feedback
- not done
- implement oslolaš issue for smj
- done
- fix bugs!
- not done
Trond
- Clean up the old corp/ directory.
- Done
- Work on corpus texts with Børre for parallel NT texts
- Not much, still awaiting response from Bible societies.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Asked my contacts in Norway to get addresses, still no response.
- read aligner docu, install, provide feedback
- Not done.
- Ask for a Quark test file from his sister
- Done, received, also got some from Michael Everson.
- translate stopword list into nno?
- Not done anything with the aligner issue.
- fix bugs!
- Not done.
3. Documentation
Changes and updates because of the Divvun public tender
TODO:
- document anonymous, read-only access to our cvs repo (Børre)
- done
- review: Sjur, Saara, by Wednesday morning
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo
- Not done
- we need to finish the corpus dtd documentation (Saara)
Permlink location for all our dtd's (filename will vary, of course):
http://giellatekno.uit.no/dtd/corpus.dtd
This corresponds to the dir ~/gt/public_html/dtd/ on our public web server.
TODO:
- copy updated DTD's to the permlink location (Børre or Saara)
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
TODO:
- call Sæth (Børre)
- Not done.
Olavi Korhonen's Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
KIO Grafisk and the Iđut books
Iđut and KIO Grafisk won't give access to their Quark files, due to copyright
Citations from one of the discussions we have had with Quark experts:
- Trond: Can you confirm that I can get a quark file from you WITHOUT at the
- Michael: Of course. Quark files do not embed fonts. They are in the Fonts
- Trond: what about pictures?
- Michael: However, if I send you a Quark file and you don't have the fonts, it
TODO:
- Børre will send letters to the authors.
- send an e-mail explaining the issues as we see them, to Iđut and KIO Grafisk,
Bible texts
TODO:
- review paratext2xml converter (Saara)
- convert smj NT to paratext (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Still not done. Trond has contacted Bibelselskapet for a new sme
5. Corpus infrastructure
TODO in transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
- done
- then these files should be removed from gt/sme/corp/ (Trond/Børre)
- done
- all small files could just be forgotten/ignored
- make sure there's nothing left with a copyright attached to it (Trond)
- Trond will go a second round
TODO for access control:
- Access control to corpus repo resolved through Unix groups: one group for
- Saara has asked for a *nix group - use it when created
- done
Further discussion about corpus analysis and computer use:
- we need to develop strong enough security routines for the G5 to fulfill our
- TODO: Børre to move this to bugzilla
TODO dtd usage and documentation:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location =
- see above under documentation for details.
- add xml validation against our dtd to the corpus conversion process
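A minimal sketch of what the validation step could look like, assuming xmllint (from libxml2) is used as the validator and that a local copy of the corpus DTD is available. The memo does not name a tool, so this is only one possible approach; the function names and file paths are hypothetical.

```python
import subprocess


def validation_command(xml_path, dtd_path):
    """Build an xmllint call that validates one converted file against the DTD.

    --noout suppresses the re-serialised document; --dtdvalid names the DTD
    to validate against, independent of any DOCTYPE in the file.
    """
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def is_valid(xml_path, dtd_path):
    """Return True when xmllint exits with status 0, i.e. the file is valid."""
    result = subprocess.run(
        validation_command(xml_path, dtd_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

In the conversion process such a check would probably run right after each file is written, for example is_valid("converted/file.xml", "dtd/corpus.dtd") (both paths hypothetical), so that invalid output is reported instead of silently entering the repo.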
HTML conversion problem
We need to extract only the table from input like below, since our DTD does not
<p> <table> ... </table> </p>
The solution is a simple XSL template that will only match the relevant
<xsl:template match="p[table]"> <xsl:apply-templates select="./table"/> </xsl:template>
Correction tags?
There are many scenarios where information about spelling and other errors is
... this is <error correct="text">tekst</error> with an error...
No problem in adding it to the DTD, together with corresponding info in the
TODO:
- update the DTD (Saara)
OPEN ISSUES:
- since this is manual editing, we break the automatic regeneration/reconversion
- the proposed markup is too simplistic for describing more complex error
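A minimal sketch of how the proposed correction markup could be consumed, assuming the simple form shown above: the element content is the erroneous form and the correct attribute holds the correction. The wrapping p element and the function names are assumptions made for illustration; the sketch does not address the more complex error types mentioned under the open issues.

```python
import xml.etree.ElementTree as ET


def list_corrections(xml_text):
    """Return (erroneous_form, correction) pairs for every <error> element."""
    root = ET.fromstring(xml_text)
    return [(el.text or "", el.get("correct", "")) for el in root.iter("error")]


def corrected_text(xml_text):
    """Return the plain text with each error replaced by its correction."""
    root = ET.fromstring(xml_text)
    for el in root.iter("error"):
        # Fall back to the original text when no correction is given.
        el.text = el.get("correct", el.text)
    return "".join(root.itertext())


sample = '<p>this is <error correct="text">tekst</error> with an error</p>'
print(list_corrections(sample))  # [('tekst', 'text')]
print(corrected_text(sample))    # this is text with an error
```

Keeping both forms in the file means the same corpus document can feed speller testing (the error list) and analysis of clean text (the corrected version).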
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- extract meta info into a compact xml document, the xml should be stored in the
- done
- discuss and decide upon the structure of the generated xml above (Sjur and
- done
- convert from that xml to Forrest document format (Sjur)
- looked at the generated file, nothing more yet
- integrate the final Forrest documents into Forrest, and make sure it gets
- waiting for the above
Free and non-free texts
More info in the previous meeting memo.
TODO:
- update scripts to handle this dichotomy. (Saara)
- almost finished
More texts to the graphical corpus interface:
- We would like to have more than the NT in the graphical interface (Saara)
- We would like to have grammatical searchability, not only POS. (Saara,
- This presupposes a discussion with Oslo. (Trond to start discussion
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
6. Infrastructure
We need to set up anonymous, read-only access to our cvs repo as outlined by our
Howto/who:
- what do we need?
- web interface? maybe, not required
- command line check-out? yes (Roy Dragseth / Børre)
- need to be able to restrict anonymous cvs to only specific modules
- done
- testing needed: (Saara, Sjur)
Aligner
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
- Trond to send relevant documents to Tomi.
- Translate the stopword list anchor-eng-nor.txt into sme (and fin?)
- Saara to install the aligner, everyone to read the documentation on
Language recogniser
We don't have enough Finnish text. We will look at the Helsinki corpus
- This is documented in the bug database.
7. Linguistics
North Sámi
TODO:
- document all past decisions in our normativity document (Thomas)
- done
- decide on a semantic feature system for nouns (Linda).
[Semantic feature tree, flattened in extraction. Binary (+/-) features, in order of appearance: Concrete, Animate, Verbal Content, Human, Moving, Control, Mass, Movable, Perfective, Count; leaf classes shown: humans, vehicles.]
TODO:
- Work with semantically based sets (Trond, Linda)
- Return to the infrastructure issue (Trond)
- A full semantic encoding of the lexicon is a future project, outside the
Semantic tags we already have:
Place names:
Now: Tags Plc Sur and combinations (London, Trosterud).
Problem:
- Solution A:
- Heavily retag the lexicon: London also as Sur (Jack)
- Solution B: Do not use (new..) double tags.
- Plc being the default tag for Plc/Sur ("if it can be Plc, it is Plc")
- Sur being the tag of things that cannot be places (Andersen)
- Then cg rules turning Plc into &Plc and &Sur, and Sur into &Sur.
- Then rules for interpreting London, Trosterud, etc. as &Sur.
- Then a final rule for removing ambiguous ones (remove Plc &Sur strings).
- Solution C compromise:
- Real placenames Menešjávri
- Convertible placenames (today's double) England, Bonn, (default)
- Real surnames Andersen, Johansson
TODO (Trond):
- Discussion testing
- infrastructure
- semiautomatic retagging
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- still working on it, should finish this week
- name morphology (Thomas)
- handed Tomi list
- oslolaš for smj (Tomi)
- done
- translate Northern Sámi lists and sets to Lule Sámi
- Linda, Trond, with help from mother tongue speakers (Thomas, others).
Trond will go to Drag tomorrow. Issues for the trip? No unobvious ones.
8. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
- the resulting file format should be identical to our present prop-name
TODO:
- Move this issue to bugzilla (Børre)
XML format
TODO on eXist as editor:
- refactor and prepare risten.no for multiple collections:
- develop the Cocoon sitemap to delegate requests to the proper folder level,
- Progressing well
- refactor the code into more and more specific components according to our
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- done some, but it didn't work out, will need to start on a different trail
- test whether eXist as editor is actually working well (linguists)
Data synchronisation task list/specification:
- the xml file needs to be stored/updated in cvs
- there should be no diffs on whitespace and sorting order (to ensure we get
- the prop name update cycle should be something like:
- dump the xml from eXist (in proper sorting order)
- check whether there are diffs against cvs; continue only if there are
- update from cvs
- error check: are there conflicts? if yes, send report to <somebody>
- are there still diffs? if yes, continue:
- check in/commit w. generated comment
- error check: is the document valid and conformant xml? if no, stop and send a
- reimport the xml file into eXist
- question: do we need to lock the file in eXist through this update cycle?
- the update cycle should be a nightly cron job
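The first two steps of the cycle above (dump in a stable order, then check for diffs against cvs) depend on a serialisation that is insensitive to whitespace and entry order. The following is a minimal sketch of that idea only, assuming a flat lexicon whose root holds entry elements with a lemma child; these element names and the function names are hypothetical and not the actual proper noun format.

```python
import xml.etree.ElementTree as ET


def _lemma(entry):
    """Default sort key: the text of the (hypothetical) <lemma> child."""
    return entry.findtext("lemma", "")


def canonical_dump(xml_text, sort_key=_lemma):
    """Serialise the lexicon with whitespace-only text removed and entries sorted."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Indentation and line breaks between elements must not cause diffs.
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    # Fixed sorting order, so reordering in the editor alone is not a change.
    root[:] = sorted(root, key=sort_key)
    return ET.tostring(root, encoding="unicode")


def needs_commit(exist_dump, cvs_copy):
    """True only when the two versions differ in content, not just in layout."""
    return canonical_dump(exist_dump) != canonical_dump(cvs_copy)
```

With a check like needs_commit, the nightly job could stop early when nothing has really changed, and the later steps (cvs update, conflict check, commit, validation, reimport) would only run on real content diffs.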
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
Gobby
A cross-platform alternative to SubEthaEdit, Gobby, is now available for OS X
Requirements for easy install:
- DarwinPorts (http://darwinports.opendarwin.org/)
- Port Authority (GUI for DarwinPorts)
Install and run the above as admin user. Then find and install Gobby (hint: use
Bug fixing
35 open Divvun/Disamb bugs, and 25 risten.no bugs
SPR language policy decision
Last week's SPR meeting decided upon a language policy. Their decision was the
«En fungerende språklig infrastruktur er av avgjørende betydning for at de
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- copy updated DTD's to the permlink location, or help Saara do it
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- fix bugs!
Maaren
- work with new missing lists
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
- Create a parallel corpus of the New Testaments.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permlink reference
- update the corpus dtd with option for correction tags
- copy updated dtd's to permanent external location
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- review paratext2xml converter.
- install sentence aligner.
- test anonymous cvs access and review documentation.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- test anon. read-only cvs, review docu, and send link to Finnut
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- call EDD/ Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- code design for XQueries needed for dict/term editing
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- add manual editing of corpus files as an issue to Bugzilla (error tags)
- fix bugs!
Thomas
- add incoming Lule Sámi words
- work on North Sámi compounding and derivation
- smj G3 issue
- sme G3 issue
- translate stopword list into smj (aligner; list from Trond)
- assist Trond and Linda with the smj disamb work
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- fix bugs!
Trond
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- translate stopword list into nno?
- double check all remaining docs in gt/sme/corp/ for copyright issues
- grammatical searchability in the graphical corpus interface
- better smj NT text
- work on semantically based sets (sme, smj)
- start and lead discussion and work on semantic features for disamb
- fix bugs!
12. Next meeting, closing
27.03.2006 09:30
Maaren will be away the next four weeks, starting next week. After that she will
Closed at 11:48