Meeting_2006-06-19

Meeting setup

  • Date: 06.06.2006
  • Time: 09.30 Norw. time
  • Place: Where we are
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09: 59.

Present: Sjur, Thomas, Trond, Børre, Tomi

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • corpus collection:
    • send out contracts with accompanying letter
      • Synnøve Persen
    • Gather public texts, preferrably also parallel ones
      • http: //finnmarksloven.no gathered, fully parallel
    • Send out letters to the rest of the Iđut authors
    • send contract to Kurt Tore Andersen
      • Done
    • call Brita Kåven again
      • Not done
    • contact Ája (Kåfjord)
      • No answer
    • Send renaming scripts to R. Valkeapää
      • Done
    • call Bård Eriksen
      • Not done
    • send contracts to Čálliid Lágádus
      • Not done
  • corpus conversion:
    • Continue converting text from input format to our xml
      • Done
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
    • convert smj NT to paratext
      • Not done
    • complete Min Áigi metadata
      • Some done
  • corpus access:
    • meeting 9.6. (postponed to 15(?).6) t 9.30: discuss and decide upon the exact access policy we want
      • Done
    • set upp the unix group structure for corpus users (also Saara, Trond)
  • set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
    • Not done
  • test Gobby (with others)
    • done
  • document use of Gobby within our project if above test is ok
    • Not done
  • fix bugs!

Maaren

  • On sick leave

Saara

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • Implement parallel corpus upload in web upload script
    • not done
  • Install Gobby
    • not done
  • set upp the unix group structure for corpus users (also Børre, Trond)
  • Test the aligners once again
    • not done
  • refine the xml output of the xml-tagged analyses
    • waiting Tomi's ccat implementation
  • convert or adapt the received PHP for paradigm generation to our needs
    • done some planning.
  • remove headers and footers from antiword documents, other improvements
    • done, now improving handling of pdf-documents
  • discuss parallel corpus markup
    • done
  • extend the DTD with parallel markup
    • do we want that? In the newsgroup, I proposed to keep the marking separate from corpus db. Sjur: I think I meant the markup in the header to link parallel files, which you have done now.
  • Implement cleaning the corpus text in the file-specific xsl file
    • not done
  • fix bugs!

Sjur

  • public tender:
    • collect the last clarifying answers, and make a final proposal to the board
      • still no answer from PL. If it doesn't arrive today, we'll make our proposal to the board based on the information at hand.
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
    • started some work on integrating M4 processing of the twol code - needed to adapt the twol to different scenarios
  • name lexicon:
    • implement editing functions
    • finalise refactoring for multiple collections (regular search interface)
      • investigation ready, and the concept phase is over, only implementation left
  • change corpus summary processing to generate smaller pages
    • nothing yet
  • meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
    • done
  • test Gobby (with others)
    • not done
  • discuss parallel corpus markup
    • not really
  • move the SEE 2.5 extension list to Bugzilla
    • not done
  • fix bugs!
  • other:
    • monthly reports for April and May
    • fixed a bug in the Forrest JSPWiki input plugin that produced invalid HTML for nested lists.

Thomas

  • investigate Actio compounding, first sme, later smj
    • done North-sámi three-syll
  • discuss findings with Maaren, and later all of us
    • not done, Maaren is on sickleave
  • add proper numeral analysis/treatment to smj
    • not done
  • add loanwords (e.g. latin -ere verbs) to smj
    • not done
  • work on compounding
    • soon finished
  • lule sámi incoming words
    • finished
  • bug-fixing
    • done some
  • sme G3 issue
    • not done

Tomi

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
      • not done
    • XQuery refactoring and code development for our proper noun editor
      • not done
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
      • not done
  • read aligner docu, install, provide feedback
    • not done
  • Set up the mechanism for the hash-mark transducer package
    • not done
  • refine the xml output of the xml-tagged analyses
    • done, but is it ready?
  • convert or adapt the received PHP for paradigm generation to our needs
    • not done
  • test Gobby (with others)
    • not done
  • fix bugs!

Trond

  • better smj NT text
    • Not done, must finish Swedish work with Saara and Børre first.
  • get fin, swe, nob and nno NT and OT in paratext format
    • Still waiting for Oslo
  • install aligner, test it and give feedback
    • Not done
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
    • Not done
  • meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
    • Done
  • set upp the unix group structure for corpus users (also Børre, Saara)
    • Done
  • test Gobby (with others)
    • Done on a regular basis, also last week.
  • discuss parallel corpus markup
    • Discussed with Lars and Saara, made good progress, and we will send texts this week.
  • fix bugs!.

3. Documentation

TODO:

  • documentation on how to apply for a user account for the corpus repo ( Børre)
    • we will administer the corpus user accounts ourselves
    • We first have to discuss and decide what we want before Børre can write documentation (see further down)
      • done
  • Write both user and admin documentation (Børre, review: Sjur, Thomas)
    • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
    • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO:

  • Send out the rest of the letters (Børre)

New contracts:

  • none last 2 weeks

Olavi Korhonen's Lule Sámi dictionary.

TODO:

  • set up user account/corpus access for Olavi (Børre)
    • we need the infrastructure for corpus access in place first
      • decided, but still not implemented
  • Contact Olavi Korhonen, to actually get the dictionary

KIO Grafisk and the Iđut books

TODO:

  • send letter to Kurt Tore Andersen (Børre)
    • done
  • send letters to the other authors (Børre)
    • sent to Synnøve Persen
    • Erling Persen has lost his files, only source is Iđut

Bible texts

TODO:

  • get nob and nno NT and OT in paratext format. (Trond)
    • waiting for Oslo
  • convert smj NT to paratext (Børre)
    • waiting for an NT in paratext format (whatever language will do)
  • convert fin, swe to paratext or directly to our XML (Børre)

Davvi Girji

TODO:

  • call Brita Kåven again towards the end of the week (Børre)
    • not done
  • call the authors (Børre)
    • not more

Min Áigi

TODO:

  • complete metainformation (Børre)
    • continued, not finished

Kåfjord

TODO:

  • contact Ája (Børre)
    • no answer, will try again

Sámi Instituhtta

TODO:

  • send renaming scripts to R. Valkeapää (Børre)
    • done

Čálliid Lágádus

http://www.calliidlagadus.org/

TODO:

  • send contracts (Børre)
    • not done

Árran

TODO:

  • call Bård Eriksen (Børre)
    • Done

5. Corpus infrastructure

General

Errors in the Antiword conversions found when parsing the xml corpus.

There are also problems with the PDF conversion. Børre has found a tool which will produce xml output, he sent the link to Saara.

TODO:

  • remove headers and footers from antiword documents, other improvements as needed (Saara)
    • in the works, also improvements to the PDF conversion

User accounts and access

TODO:

  • discuss and decide upon the exact access policy we want to give corpus users; meeting on Tuesday May 23, at 9.30 (Børre, Sjur, Trond)

Conclusion: we have the following classes of users:

  • Users that don't need a unix shell
    • linguists doing research on singleton examples
    • historians and other people interested in content, not in form
  • Users that do need a unix shell
    • linguists doing research on texts as a whole
    • linguists with separate analysis tools
    • language technology developers

Shell access

External users will get their own user account, belonging to the groups myself and bound, and will be able to install their own tools and programs for corpus processing, analysis, etc.

To let the bound group members be able to analyse, we need to do some minor adjustments - as other they automatically have full access to the Xerox tools, and the compiled fst's are available in /opt/smi/sme/bin/sme-num.fst etc. The Xerox tools and vislcg are available in /opt/Xerox/bin. A couple of tools are missing right now, and need to be added to /opt/ by a crontab.

TODO:

  • make a group bound for our external corpus users, which: ( Børre)
    • gives access to read our bound texts
    • gives access to execute/run the tools in /opt
  • export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Tomi)
    • ccat (and some perl scripts?)
    • other tools?
  • make shell script wrappers for the most common commands for user friendlyness ( Trond)
  • write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
  • possibly deploy the user account form as an HTML form (Børre)
  • write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
  • write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (Børre)
  • make our own guidelines for the user application processing (Børre)
  • make a test user (Børre)
  • test corpus access as test user (Trond)

Web browser access

Users of only the free corpus won't need anything but a browser.

Users of the bound corpus will need a username and password to the Oslo computer (until the base is moved to Tromsø). These usernames and passwords will be created and administered by the Oslo people (modulo discussion), later by ourselves.

TODO:

  • discuss with Oslo
  • delay other tasks until we are ready to go public?
  • user management for access to bound texts

More texts to the graphical corpus interface:

TODO:

  1. refine xml-tagged output (Saara and Tomi)
    1. done, but still open if it is finished
  2. add text to the server (Lars)

Aligner

Trond and Saara will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which documents belong together. Discussion to continue in the newsgroup (Saara has started it - please respond!).

TODO:

  • discuss parallel corpus markup (Saara, Trond, Sjur, others)
    • done
  • extend the DTD (Saara)
    • done

Language recognition

Still waiting for more smj text to improve it.

Free and non-free texts

Anything? Final check with Børre and Saara - waiting for them to return. Nothing more according to Børre. Only texts which are explicitly marked as free are now included in the free/ directory.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info on our bugzilla

TODO:

  • trim generated corpus summary pages (Sjur)
    • suggestion: lump together files with content less than X paragraphs (X < 5?)
    • nothing done yet

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:

  • convert or adapt the received PHP to our needs (Tomi or Saara)
    • Saara has started to look at it

Hyphenator

TODO:

  • correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
    • Sjur has started some work with M4 integration
  • Update the sme and sma hyphenator rule sets with the insights gained from smj updates
    • Thomas has updated sme
    • update (some of) sma (Trond during summer vacation)

Automatic Bugzilla reminder for untouched bugs

TODO:

  • set up Bugzilla to send automatic reminders for bugs not touched in a given timeframe; ask Thor-Øivind for help if needed (Børre)

JSPWiki update

Sjur has corrected and improved the jspwiki parsing in Forrest, and found that mixed lists should not be supported. We should check whether we have any such lists anywhere, and correct them, otherwise we risk that they are not rendered in HTML, and as such impose an information loss. Sjur has sent in a patch to Forrest that will correct nested list behaviour, but it will at the same time make mixed lists invisible except for the first level. It seems that mixed lists aren't part of the wiki format.

Mixed list example:

# something numbered
** some sub-thing with a bullet

* something bulleted
## some sub-thing with a number

Sjur tried to grep, but multiline pattern matching is beyond him. Tomi: Here's the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *

TODO:

  • grep for all occurences of ^* followed by a line ^## and vice versa (the patterns above) - send the list of identified documents to Sjur ( Tomi)
    • done during the meeting

7. Linguistics

Name double-tagging

Conclusion, in a principled fashion:

  1. hardcoded sem-tags win
  2. There is a sem-tag conversion procedure: according to a hierarchy of sem-tags: Any Plc can be interpreted as Sur, etc. (to be spelled out)

TODO:

  • Make a section under gt/doc/lang/smi/, add a chapter Common linguistic resources under the Languages section in site-frag.xml, and add the document below to that section (Børre)
  • write guidelines for annotators wrt. to name tagging and put them under gt/doc/lang/smi. A short version of the guidelines goes as follows:
    Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types) should not be doubletagged, but given primary tags (plc, org) only. Doubletagging should be confined to exceptional cases (Eira (Fem/Sur/Plc), Janne (Fem/Mal), such things. These guidelines should be added to our documentation file. (Trond)
  • when adding new names, only use one sem-tag unless there are known objects which belong to different sem classes(?) (all linguists)
  • write disamb rules to implement the system above (Trond, Linda)

North Sámi

TODO:

  • investigate Actio compounding (Thomas)
    • done North-sámi three-syll
  • discuss findings with Maaren, and later all of us (Thomas)
    • not done, Maaren is on sickleave

Actio compounding

It is definitely productive. Whether this is a problem for our speller(s), we don't know yet, but if there's a lot of overlap or parallel forms with the same verbal stem producing compounds both with and without Actio, it is most likely a problem, unless we correct to both forms (but then we risk correcting to an impossible form, which is also bad).

Whether Actio is used or not in compounding a verbal stem follows from the semantic properties of the Actio. We need to try to identify this property, and formalise it in one way or another, otherwise we will overgenerate. And overgeneration is a speller's worst enemy : -)

TODO:

  • investigate and identify under which conditions Actio compounding is possible ( Thomas)

Lule Sámi

TODO:

  • 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks ( Thomas)
    • done except for abbr. and numerals
      • add abbr to a new abbr lexicon file
      • numerals are covered by the next task
  • add proper numeral analysis/treatment (Thomas)
    • not done
  • add loanwords (e.g. latin -ere verbs) (Thomas)
    • not done

8. Name lexicon infrastructure

TODO:

  1. finish refactoring for multiple collections in the search interfarce ( Sjur)
    1. investigation done, still not implemented
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. nothing this week
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again ( Sjur)
      1. not yet done

9. Public tender

Nothing received from PL yet. They have an extended deadline today. After that we will decide upon the information we have.

TODO:

  • final evaluation based on the answers to the clarification letters
    • waiting for the PL answer
  • present the final report to the board, and bring their decision into action

10. Other

Summer vacation

From Bitte: "I følge fjorårets liste tok iallefall Børre ut 10 dager av årets ferie og Tomi tok 8 dager, dvs at Børre kan ta 3 uker med lønn og Tomi 3 uker og 2 dager. De kan selvsagt låne av neste års ferie dersom de ønsker det."

She would like to receive a final vacation plan soon.

Who When
Børre 24.7 - 20.8
Linda ?
Maaren on sick leave
Saara July
Sjur at least 2 weeks in July, but still open
Thomas 3.7 - 7.8
Trond 3.7 - 14.8 (last two weeks off at summer school)
Tomi 8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help... Saara will contact Roy on this issue.

Gobby

TODO:

  • install Gobby (Saara)
  • test it (Tomi, Trond, Sjur, Børre, Lars?)
    • done
  • if successful, document its use within our project (Børre)

SEE 2.5 extensions

Future extensions and whish modes:

  • write a perl script to extract all TODO items and sort them according to responsible person; wrap the script in a AppleScript available from the menu
  • twol mode
  • xfst mode
  • lexc mode
  • vislcg mode

TODO:

  • move the above list to Bugzilla (Sjur)

Task lists as iCal entries

With the latest corrections to the Wiki parsing, and with the tasks at the end of the document, we should be able to use the intermediate XML format to extract the info needed to create iCal entries for all tasks.

iCal entries look like the following:

BEGIN:VTODO
DTSTAMP:20060619T090920Z
ORGANIZER;CN=Børre Gaup:MAILTO:boerre@skolelinux.no
CREATED:20050621T171425Z
UID:libkcal-939001838.216
SEQUENCE:1
LAST-MODIFIED:20050622T050540Z
SUMMARY:Ordne lenker
CLASS:PUBLIC
PRIORITY:5
DUE:20050824T073000Z
COMPLETED:20050622T050540Z
PERCENT-COMPLETE:100
END:VTODO

A reference can be found at Wikipedia

References should be of the type:

/doc/admin/weekly/2006/Tasks_2006-06-19_Sjur.ics

TODO:

  • create iCal entries from our meeting memos (Sjur or Børre)

11. Next meeting, closing

26.06.2006 09: 30

Sjur might be away, will inform you later.

Closed at 11: 20.

Appendix - task lists for the next week

Boerre

  • corpus collection:
    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Send out letters to the rest of the Iđut authors
    • call Brita Kåven again
    • contact Ája (Kåfjord)
    • call Bård Eriksen
    • send contracts to Čálliid Lágádus
    • Contact Olavi Korhonen, to actually get the dictionary
  • corpus conversion:
    • Continue converting text from input format to our xml
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
    • convert smj NT to paratext
    • complete Min Áigi metadata
  • corpus access:
    • make a group bound for our external corpus users
    • possibly deploy the user account form as an HTML form
    • make a test user
    • Write both user and admin documentation
  • set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
  • document use of Gobby within our project
  • create document & document entry for name double-tagging
  • create iCal entries from our meeting memos (or Sjur)
  • fix bugs!

Maaren

  • On sick leave

Saara

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
  • Implement parallel corpus upload in web upload script
  • Install Gobby
  • Test the aligners once again
  • refine the xml output of the xml-tagged analyses
  • convert or adapt the received PHP for paradigm generation to our needs
  • remove headers and footers from antiword documents, other improvements
  • fix bugs!

Sjur

  • public tender:
    • collect the last clarifying answers, and make a final proposal to the board
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • name lexicon:
    • implement editing functions
    • finalise refactoring for multiple collections (regular search interface)
  • change corpus summary processing to generate smaller pages, see bug report
  • move the SEE 2.5 extension list to Bugzilla
  • create iCal entries from our meeting memos (or Børre)
  • review user and admin documentation for corpus access
  • write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
  • fix bugs!

Thomas

  • investigate productivity of Actio compounding in smj
  • investigate and identify under which conditions Actio compounding is possible
  • discuss findings with the rest of us
  • add proper numeral analysis/treatment to smj
  • add loanwords (e.g. latin -ere verbs) to smj
  • work on compounding
  • sme G3 issue
  • review user documentation for corpus access
  • create smj abbr file
  • fix bugs!

Tomi

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • read aligner docu, install, provide feedback
  • Set up the mechanism for the hash-mark transducer package
  • test the new xml output of the xml-tagged analyses
  • export corpus tools to /opt (with cron)
  • fix bugs!

Trond

  • better smj NT text
  • get fin, swe, nob and nno NT and OT in paratext format
  • install aligner, test it and give feedback
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • make shell script wrappers for the most common commands for user friendlyness
  • write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
  • write documentation for our bound users, with pointers to the ordinary documentation
  • write documentation on double-tagging names
  • discuss web-only user access management with Oslo
  • fix bugs!.