Meeting_2006-06-26

Meeting setup

  • Date: 06.06.2006
  • Time: 09.30 Norw. time
  • Place: Where we are
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

The meeting was delayed due to the project board having a telephone meeting at our regular meeting time.

Opened at 13: 17.

Present: Sjur, Thomas, Børre, Tomi

Absent: Maaren, Saara, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • corpus collection:
    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Send out letters to the rest of the Iđut authors
    • call Brita Kåven again
      • Done
    • contact Ája (Kåfjord)
      • Not done
    • call Bård Eriksen
      • Done
    • send contracts to Čálliid Lágádus
      • Not done
    • Contact Olavi Korhonen, to actually get the dictionary
      • Done
  • corpus conversion:
    • Continue converting text from input format to our xml
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
    • convert smj NT to paratext
      • None of the bible things are done
    • complete Min Áigi metadata
      • Done
  • corpus access:
    • make a group bound for our external corpus users
      • Done
    • possibly deploy the user account form as an HTML form
      • Not done
    • make a test user
      • Not done
    • Write both user and admin documentation
      • Some done
  • set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
    • Built-in feature of Bugzilla, tried to turn it on.
  • document use of Gobby within our project
    • Done
  • create document & document entry for name double-tagging
    • Not done
  • create iCal entries from our meeting memos (or Sjur)
    • Not done
  • fix bugs!

Maaren

  • On sick leave

Saara

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
  • Implement parallel corpus upload in web upload script
  • Install Gobby
  • Test the aligners once again
  • refine the xml output of the xml-tagged analyses
  • convert or adapt the received PHP for paradigm generation to our needs
  • remove headers and footers from antiword documents, other improvements
  • fix bugs!

Sjur

  • public tender:
    • collect the last clarifying answers, and make a final proposal to the board
      • done. Decision by the board: Polderland
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
    • nothing since last attempt
  • name lexicon:
    • implement editing functions
    • finalise refactoring for multiple collections (regular search interface)
  • change corpus-summary processing to generate smaller pages, see bug report
    • done, now waiting for feedback from Børre
  • move the SEE 2.5 extension list to Bugzilla
    • done
  • create iCal entries from our meeting memos (or Børre)
    • done
  • review user and admin documentation for corpus access
  • write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
  • fix bugs!
  • other tasks:
    • administrative tasks

Thomas

  • investigate productivity of Actio compounding in smj
    • had a look at the three-syllables
  • investigate and identify under which conditions Actio compounding is possible
    • have identified some things
  • discuss findings with the rest of us
    • will do
  • add proper numeral analysis/treatment to smj
    • not done
  • add loanwords (e.g. latin -ere verbs) to smj
    • not done
  • work on compounding
    • finished in practice, some things still that I need to discuss with Maaren
  • sme G3 issue
    • not done
  • review user documentation for corpus access
    • not done
  • create smj abbr file
    • not done
  • fix bugs!
    • not done

Tomi

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
      • not done
    • XQuery refactoring and code development for our proper noun editor
      • doing
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
      • not done
  • read aligner docu, install, provide feedback
    • not done
  • Set up the mechanism for the hash-mark transducer package
    • not done
  • test the new xml output of the xml-tagged analyses
    • not done
  • export corpus tools to /opt (with cron)
    • not done
  • fix bugs!

Trond

  • better smj NT text
  • get fin, swe, nob and nno NT and OT in paratext format
  • install aligner, test it and give feedback
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • make shell script wrappers for the most common commands for user friendlyness
  • write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
  • write documentation for our bound users, with pointers to the ordinary documentation
  • write documentation on double-tagging names
  • discuss web-only user access management with Oslo
  • fix bugs!.

3. Documentation

TODO:

  • Write both user and admin documentation (Børre, review: Sjur, Thomas)
    • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
    • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
      • not done yet

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO:

  • Send out the rest of the letters (Børre)

New contracts:

  • none last 2 weeks

Olavi Korhonen's Lule Sámi dictionary.

Talked with him, sent him the contract.

TODO:

  • Contact Olavi Korhonen, to actually get the dictionary
    • done

KIO Grafisk and the Iđut books

TODO:

  • send letters to the other authors (Børre)

Bible texts

TODO:

  • get nob and nno NT and OT in paratext format. (Trond)
    • waiting for Oslo
  • convert smj NT to paratext (Børre)
    • waiting for an NT in paratext format (whatever language will do)
  • convert fin, swe to paratext or directly to our XML (Børre)

Davvi Girji

Called Brita Kåven. She can't say anything about letting us have the texts untill she has talked with the employees, to get a work estimate on their part. Nothing was decided in their previous meeting. Davvi Girji seems generally very positive, but has problems finding the best way of actually delivering the texts to us.

TODO:

  • call Brita Kåven again towards the end of the week (Børre)
    • done
  • call the authors (Børre)
    • not more

Min Áigi

TODO:

  • complete metainformation (Børre)
    • done as far as possible: author and language info
    • some smj files moved to the smj tree
    • still some Norwegian files to be checked

Kåfjord

TODO:

  • contact Ája (Børre)
    • forgotten

Sámi Instituhtta

When will we get the corpus? We don't know, Børre will contact him again.

TODO:

  • contact NSI again (Børre)

Čálliid Lágádus

http://www.calliidlagadus.org/

TODO:

  • send contracts (Børre)
    • nothing

Árran

Talked to Bård Eriksen, he needs to discuss more with his coworkers.

TODO:

  • continue discussion (Børre)
    • done, not finished

5. Corpus infrastructure

General

TODO:

  • remove headers and footers from antiword documents, other improvements as needed, including PDF conversion (Saara)
    • in the works, also improvements to the PDF conversion
      • problems with running the tools, they need some updated libraries in Victorio

User accounts and access

For details, see the previous meeting memo, as well as the memo from a dedicated meeting.

Shell access

TODO:

  • make a group bound for our external corpus users, which: ( Børre)
    • gives access to read our bound texts
    • gives access to execute/run the tools in /opt
      • done
  • export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Tomi)
    • ccat (and some perl scripts?)
    • other tools?
      • nothing yet
  • make shell script wrappers for the most common commands for user friendlyness ( Trond)
  • write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
    • short discussion with Trond about what we need
  • possibly deploy the user account form as an HTML form (Børre)
    • waiting for the form to be written
  • write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
    • nothing done
  • write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (Børre)
    • nothing
  • make our own guidelines for the user application processing (Børre)
    • nothing
  • make a test user (Børre)
    • nothing
  • test corpus access as test user (Trond)
    • nothing yet

Web browser access

TODO:

  • discuss with Oslo (Trond)
  • delay other tasks until we are ready to go public?
  • user management for access to bound texts

More texts to the graphical corpus interface:

TODO:

  1. refine xml-tagged output (Saara and Tomi)
    1. done, but still open if it is finished
  2. add text to the server (Lars)

Aligner

More to be said about this? (certainly, but right now?)

Language recognition

Still waiting for more smj text to improve it.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info on our bugzilla

It is now implemented, but a test to skip this summary option for smaller collections did not work. Thus all genres in all languages are treated the same.

Improvement suggestion: instead of summarize files under a certain limit, one could also split Min Áigi into years, and display it year-wise. That will require some more info in the generated source file corpus-content.xml.

TODO:

  • trim generated corpus summary pages (Sjur)
    • suggestion: lump together files with content less than X paragraphs (X < 5?)
      • done - Forrest seems to work fine now
  • add improvement suggestion to Bugzilla (Børre)
    • done

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

Saara has given a report on the PHP code in News. Please read.

TODO:

  • convert or adapt the received PHP to our needs (Saara)
    • code evaluated, evaluation reported

Hyphenator

TODO:

  • correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
    • nothing since last meeting
  • Update the sma hyphenator rule set with the insights gained from smj updates ( Trond during summer vacation)

Automatic Bugzilla reminder for untouched bugs

TODO:

  • set up Bugzilla to send automatic reminders for bugs not touched in a given timeframe; ask Thor-Øivind for help if needed (Børre)
    • turned on the feature, but it doesn't seem to work - needs some more investigation

JSPWiki update

Here's the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *

TODO:

  • grep for all occurences of ^* followed by a line ^## and vice versa (the patterns above) (Sjur)
    • Tomi tried it, but nothing found
  • correct the documents found to have consistent lists (Sjur)
    • not done

7. Linguistics

Derivation and spellers like Aspell

It is impossible to create a model that can dynamically generate new derivations of verbs, and then let them nominalise for compounding. What we need is to extract all derived verbs not in the lexicon, and lexicalise most of them. We could potentially let the 5(?) most productive verbal derivations be unlexicalised, and generate them from our lexicon directly when creating entries. A tentive, first guess on which derivational suffixes this could be is (for sme):

+st, +l, +h, -goahtit, (o)juvvot (passive)

Later we might consider lexicalising all other derivations found in our corpus.

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN. Presently sme is only using the NNN part as a tag, where NNN represents the derivational suffix. That is, there is no single pattern to match against for sme. Example:

"<laktigohtet>"
        "laktit" V TV goahti Ind Prs Pl3 @+FMAINV

Only the tag goahti is identifying that the word form is a derivation. It should be extended to Der/goahti as in smj. Then it is easy to grep for the pattern Der/ to get all derived verbs.

TODO:

  • change tagging of derived stems in the disamb output, to facilitate much easier extraction of these verbs (Trond)
    • not done, the other tasks depend on this one
  • find and study all derived verbs in our corpus (Thomas)
  • suggest which derivations could be generated (Thomas)
    • see source code above, but also consider overgeneration problems, as well as input from the statistical results in the previous point
  • lexicalise the rest (Thomas)

Name double-tagging

TODO:

  • Make a section under gt/doc/lang/smi/, add a chapter Common linguistic resources under the Languages section in site-frag.xml, and add the document below to that section (Børre)
  • write guidelines for annotators wrt. to name tagging and put them under gt/doc/lang/smi. A short version of the guidelines goes as follows:
    Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types) should not be doubletagged, but given primary tags (plc, org) only. Doubletagging should be confined to exceptional cases (Eira (Fem/Sur/Plc), Janne (Fem/Mal), such things. These guidelines should be added to our documentation file. (Trond)
  • when adding new names, only use one sem-tag unless there are known objects which belong to different sem classes(?) (all linguists)
  • write disamb rules to implement the system above (Trond, Linda)

North Sámi

TODO:

  • discuss findings with all of us (Thomas)
  • investigate and identify under which conditions Actio compounding is possible ( Thomas)

Following already derived verbs are not happy with further derivation. It seems like the most of them do not appear as Actio forms in first part of compounding either. The following holds for both sme and smj:

LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 
5-syllables, formerly directed to 
MUITAL
 +V+TV: MUITALStem ;
!SHOULD be directed here as well:
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit 
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit

Lule Sámi

TODO:

  • add inc abbr to a new abbr lexicon file (Thomas)
  • add proper numeral analysis/treatment (Thomas)
  • add loanwords (e.g. latin -ere verbs) (Thomas)
    • nothing done to any of these

8. Name lexicon infrastructure

TODO:

  • finish refactoring for multiple collections in the search interfarce ( Sjur)
    • still not implemented
  • develop the needed XQueries and interface (Sjur, Tomi)
    • done the inflection menu, but it doesn't work...
  • data synchronisation between risten.no and the cvs repo (Tomi)
    • discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again ( Sjur)
      • nothing done

9. Public tender

We finally recieved their answer as well.

TODO:

  • final evaluation based on the answers to the clarification letters
    • done
  • present the final report to the board, and bring their decision into action
    • done. Decision: Polderland.
  • write a contract (mostly done by Finnut, review by Sjur)
  • get it signed (Finnut, Lennart Mikkelsen)

10. Other

Summer vacation

Who When
Børre 24.7 - 20.8
Linda ?
Maaren on sick leave
Saara July
Sjur 3.7 - 23.7 + single days at other times
Thomas 3.7 - 7.8
Trond 3.7 - 14.8 (last two weeks off at summer school)
Tomi 8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help... Saara will contact Roy on this issue.

Gobby

TODO:

  • install Gobby (Saara)
  • document its use within our project (Børre)
    • done
  • review the document ( Thomas)

SEE 2.5 extensions

Future extensions and wish modes:

  • write a perl script to extract all TODO items and sort them according to responsible person; wrap the script in a AppleScript available from the menu
  • twol mode
  • xfst mode
  • lexc mode
  • vislcg mode

TODO:

  • move the above list to Bugzilla (Sjur)
    • done

Task lists as iCal entries

This feature requires that the patch Sjur sent in to Forrest regarding parsing of nested lists within wiki documents is applied. It was applied to Forrest last Friday, thus all who is interested in using this feature from their local Forrest should update their installation. It is easiest done using Subversion (svn) to update your local copy.

TODO:

  • create iCal entries from our meeting memos (Sjur or Børre)
    • done (Sjur)
  • update all forrest installations to latest svn HEAD (Børre)

Project meeting in Tromsø in august?

The project board has decided upon a meeting in Tromsø in august. We'll discuss the date later this week.

11. Next meeting, closing

Next meeting is undefined due to summer vacation.

Closed at 14: 36.

Appendix - task lists for the next week

Boerre

  • corpus collection:
    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Send out letters to the rest of the Iđut authors
    • contact Ája (Kåfjord)
    • send contracts to Čálliid Lágádus
    • Contact Richard Valkepää at NSI about older Min Áigi and Ássu files.
  • corpus conversion:
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
    • Move norwegian documents in Min Áigi from sme to nob
  • corpus access:
    • possibly deploy the user account form as an HTML form
    • make a test user
    • Write both user and admin documentation (Børre, review: Sjur, Thomas)
      • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
      • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
  • set up Bugzilla automatic reminders for open issues
  • create document & document entry for name double-tagging
  • Update forrests to latest svn version
  • fix bugs!

Maaren

  • On sick leave

Saara

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
  • Implement parallel corpus upload in web upload script
  • Install Gobby
  • Test the aligners once again
  • refine the xml output of the xml-tagged analyses
  • convert or adapt the received PHP for paradigm generation to our needs
  • remove headers and footers from antiword documents, other improvements
  • fix bugs!

Sjur

  • public tender:
    • review letters to tenderers, contract for subcontractor
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • name lexicon:
    • implement editing functions
    • finalise refactoring for multiple collections (regular search interface)
  • review user and admin documentation for corpus access
  • write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
  • correct jspwiki docs with mixed lists
  • fix bugs!

Thomas

  • investigate productivity of even-syllable Actio compounding
  • investigate and identify under which conditions even-syllable Actio compounding is possible
  • discuss findings with the rest of us
  • add proper numeral analysis/treatment to smj
  • add loanwords (e.g. latin -ere verbs) to smj
  • sme G3 issue
  • review user documentation for corpus access
  • create smj abbr file
  • review the document
  • Redirected following three syllable verbs and prevent them from being first-part Actios in compounds:
    • Reflexives on -dit
    • Reciprocals on -dit, -(a)lit
    • Momentatives on -dit, -(a)lit, -ádit, -ihit
    • Frequentatives on -(a)lit, -(u)hit, -dit
    • Continuatives on -dit, -(u)hit, -nit
    • Inchoatives in -nit
    • Translatives on -dit
    • Essives on -dit and -stit
    • Causatives on -dit, -stit
  • find and study all derived verbs in our corpus (depends on Trond)
  • suggest which derivations could be generated (depends on Trond)

Tomi

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • read aligner docu, install, provide feedback
  • Set up the mechanism for the hash-mark transducer package
  • test the new xml output of the xml-tagged analyses
  • export corpus tools to /opt (with cron)
  • fix bugs!

Trond

  • better smj NT text
  • get fin, swe, nob and nno NT and OT in paratext format
  • install aligner, test it and give feedback
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • make shell script wrappers for the most common commands for user friendlyness
  • write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
  • write documentation for our bound users, with pointers to the ordinary documentation
  • write documentation on double-tagging names
  • discuss web-only user access management with Oslo
  • change tagging of derived stems in the disamb output, to facilitate much easier extraction of non-lexicalised derivations
  • fix bugs!.