Meeting_2006-06-26

Contents:

Meeting setup
Agenda
1. Opening, agenda review, participants
2. Updated task status since last meeting
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
3. Documentation
4. Corpus gathering
5. Corpus infrastructure
6. Infrastructure
7. Linguistics
8. Name lexicon infrastructure
9. Public tender
10. Other
11. Next meeting, closing
Appendix - task lists for the next week
- Boerre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond

Meeting setup

Date: 06.06.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

The meeting was delayed due to the project board having a telephone meeting at our regular meeting time.

Opened at 13: 17.

Present: Sjur, Thomas, Børre, Tomi

Absent: Maaren, Saara, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again
  - Done
- contact Ája (Kåfjord)
  - Not done
- call Bård Eriksen
  - Done
- send contracts to Čálliid Lágádus
  - Not done
- Contact Olavi Korhonen, to actually get the dictionary
  - Done
corpus conversion:
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
  - None of the bible things are done
- complete Min Áigi metadata
  - Done
corpus access:
- make a group bound for our external corpus users
  - Done
- possibly deploy the user account form as an HTML form
  - Not done
- make a test user
  - Not done
- Write both user and admin documentation
  - Some done
set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- Built-in feature of Bugzilla, tried to turn it on.
document use of Gobby within our project
- Done
create document & document entry for name double-tagging
- Not done
create iCal entries from our meeting memos (or Sjur)
- Not done
fix bugs!

Maaren

On sick leave

Saara

Create a parallel corpora of the new testaments
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
Implement parallel corpus upload in web upload script
Install Gobby
Test the aligners once again
refine the xml output of the xml-tagged analyses
convert or adapt the received PHP for paradigm generation to our needs
remove headers and footers from antiword documents, other improvements
fix bugs!

Sjur

public tender:
- collect the last clarifying answers, and make a final proposal to the board
  - done. Decision by the board: Polderland
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- nothing since last attempt
name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
change corpus-summary processing to generate smaller pages, see bug report
- done, now waiting for feedback from Børre
move the SEE 2.5 extension list to Bugzilla
- done
create iCal entries from our meeting memos (or Børre)
- done
review user and admin documentation for corpus access
write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
fix bugs!
other tasks:
- administrative tasks

Thomas

investigate productivity of Actio compounding in smj
- had a look at the three-syllables
investigate and identify under which conditions Actio compounding is possible
- have identified some things
discuss findings with the rest of us
- will do
add proper numeral analysis/treatment to smj
- not done
add loanwords (e.g. latin -ere verbs) to smj
- not done
work on compounding
- finished in practice, some things still that I need to discuss with Maaren
sme G3 issue
- not done
review user documentation for corpus access
- not done
create smj abbr file
- not done
fix bugs!
- not done

Tomi

new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
  - not done
- XQuery refactoring and code development for our proper noun editor
  - doing
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  - not done
read aligner docu, install, provide feedback
- not done
Set up the mechanism for the hash-mark transducer package
- not done
test the new xml output of the xml-tagged analyses
- not done
export corpus tools to /opt (with cron)
- not done
fix bugs!

Trond

better smj NT text
get fin, swe, nob and nno NT and OT in paratext format
install aligner, test it and give feedback
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
make shell script wrappers for the most common commands for user friendlyness
write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
write documentation for our bound users, with pointers to the ordinary documentation
write documentation on double-tagging names
discuss web-only user access management with Oslo
fix bugs!.

3. Documentation

TODO:

Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
  - not done yet

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO:

Send out the rest of the letters (Børre)

New contracts:

none last 2 weeks

Olavi Korhonen's Lule Sámi dictionary.

Talked with him, sent him the contract.

TODO:

Contact Olavi Korhonen, to actually get the dictionary
- done

KIO Grafisk and the Iđut books

TODO:

send letters to the other authors (Børre)

Bible texts

TODO:

get nob and nno NT and OT in paratext format. (Trond)
- waiting for Oslo
convert smj NT to paratext (Børre)
- waiting for an NT in paratext format (whatever language will do)
convert fin, swe to paratext or directly to our XML (Børre)

Davvi Girji

Called Brita Kåven. She can't say anything about letting us have the texts untill she has talked with the employees, to get a work estimate on their part. Nothing was decided in their previous meeting. Davvi Girji seems generally very positive, but has problems finding the best way of actually delivering the texts to us.

TODO:

call Brita Kåven again towards the end of the week (Børre)
- done
call the authors (Børre)
- not more

Min Áigi

TODO:

complete metainformation (Børre)
- done as far as possible: author and language info
- some smj files moved to the smj tree
- still some Norwegian files to be checked

Kåfjord

TODO:

contact Ája (Børre)
- forgotten

Sámi Instituhtta

When will we get the corpus? We don't know, Børre will contact him again.

TODO:

contact NSI again (Børre)

Čálliid Lágádus

http://www.calliidlagadus.org/

TODO:

send contracts (Børre)
- nothing

Árran

Talked to Bård Eriksen, he needs to discuss more with his coworkers.

TODO:

continue discussion (Børre)
- done, not finished

5. Corpus infrastructure

General

TODO:

remove headers and footers from antiword documents, other improvements as needed, including PDF conversion (Saara)
- in the works, also improvements to the PDF conversion
  - problems with running the tools, they need some updated libraries in Victorio

User accounts and access

For details, see the previous meeting memo, as well as the memo from a dedicated meeting.

Shell access

TODO:

make a group bound for our external corpus users, which: ( Børre)
- gives access to read our bound texts
- gives access to execute/run the tools in /opt
  - done
export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Tomi)
- ccat (and some perl scripts?)
- other tools?
  - nothing yet
make shell script wrappers for the most common commands for user friendlyness ( Trond)
write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
- short discussion with Trond about what we need
possibly deploy the user account form as an HTML form (Børre)
- waiting for the form to be written
write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
- nothing done
write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (Børre)
- nothing
make our own guidelines for the user application processing (Børre)
- nothing
make a test user (Børre)
- nothing
test corpus access as test user (Trond)
- nothing yet

Web browser access

TODO:

discuss with Oslo (Trond)
delay other tasks until we are ready to go public?
user management for access to bound texts

More texts to the graphical corpus interface:

TODO:

refine xml-tagged output (Saara and Tomi)
1. done, but still open if it is finished
add text to the server (Lars)

Aligner

More to be said about this? (certainly, but right now?)

Language recognition

Still waiting for more smj text to improve it.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info on our bugzilla

It is now implemented, but a test to skip this summary option for smaller collections did not work. Thus all genres in all languages are treated the same.

Improvement suggestion: instead of summarize files under a certain limit, one could also split Min Áigi into years, and display it year-wise. That will require some more info in the generated source file corpus-content.xml.

TODO:

trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
  - done - Forrest seems to work fine now
add improvement suggestion to Bugzilla (Børre)
- done

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

Saara has given a report on the PHP code in News. Please read.

TODO:

convert or adapt the received PHP to our needs (Saara)
- code evaluated, evaluation reported

Hyphenator

TODO:

correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
- nothing since last meeting
Update the sma hyphenator rule set with the insights gained from smj updates ( Trond during summer vacation)

Automatic Bugzilla reminder for untouched bugs

TODO:

set up Bugzilla to send automatic reminders for bugs not touched in a given timeframe; ask Thor-Øivind for help if needed (Børre)
- turned on the feature, but it doesn't seem to work - needs some more investigation

JSPWiki update

Here's the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *

TODO:

grep for all occurences of ^* followed by a line ^## and vice versa (the patterns above) (Sjur)
- Tomi tried it, but nothing found
correct the documents found to have consistent lists (Sjur)
- not done

7. Linguistics

Derivation and spellers like Aspell

It is impossible to create a model that can dynamically generate new derivations of verbs, and then let them nominalise for compounding. What we need is to extract all derived verbs not in the lexicon, and lexicalise most of them. We could potentially let the 5(?) most productive verbal derivations be unlexicalised, and generate them from our lexicon directly when creating entries. A tentive, first guess on which derivational suffixes this could be is (for sme):

+st, +l, +h, -goahtit, (o)juvvot (passive)

Later we might consider lexicalising all other derivations found in our corpus.

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN. Presently sme is only using the NNN part as a tag, where NNN represents the derivational suffix. That is, there is no single pattern to match against for sme. Example:

"<laktigohtet>"
        "laktit" V TV goahti Ind Prs Pl3 @+FMAINV

Only the tag goahti is identifying that the word form is a derivation. It should be extended to Der/goahti as in smj. Then it is easy to grep for the pattern Der/ to get all derived verbs.

TODO:

change tagging of derived stems in the disamb output, to facilitate much easier extraction of these verbs (Trond)
- not done, the other tasks depend on this one
find and study all derived verbs in our corpus (Thomas)
suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as input from the statistical results in the previous point
lexicalise the rest (Thomas)

Name double-tagging

TODO:

Make a section under gt/doc/lang/smi/, add a chapter Common linguistic resources under the Languages section in site-frag.xml, and add the document below to that section (Børre)
write guidelines for annotators wrt. to name tagging and put them under gt/doc/lang/smi. A short version of the guidelines goes as follows:
Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types) should not be doubletagged, but given primary tags (plc, org) only. Doubletagging should be confined to exceptional cases (Eira (Fem/Sur/Plc), Janne (Fem/Mal), such things. These guidelines should be added to our documentation file. (Trond)
when adding new names, only use one sem-tag unless there are known objects which belong to different sem classes(?) (all linguists)
write disamb rules to implement the system above (Trond, Linda)

North Sámi

TODO:

discuss findings with all of us (Thomas)
investigate and identify under which conditions Actio compounding is possible ( Thomas)

Following already derived verbs are not happy with further derivation. It seems like the most of them do not appear as Actio forms in first part of compounding either. The following holds for both sme and smj:

LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 
5-syllables, formerly directed to 
MUITAL
 +V+TV: MUITALStem ;
!SHOULD be directed here as well:
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit 
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit

Lule Sámi

TODO:

add inc abbr to a new abbr lexicon file (Thomas)
add proper numeral analysis/treatment (Thomas)
add loanwords (e.g. latin -ere verbs) (Thomas)
- nothing done to any of these

8. Name lexicon infrastructure

TODO:

finish refactoring for multiple collections in the search interfarce ( Sjur)
- still not implemented
develop the needed XQueries and interface (Sjur, Tomi)
- done the inflection menu, but it doesn't work...
data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again ( Sjur)
  - nothing done

9. Public tender

We finally recieved their answer as well.

TODO:

final evaluation based on the answers to the clarification letters
- done
present the final report to the board, and bring their decision into action
- done. Decision: Polderland.
write a contract (mostly done by Finnut, review by Sjur)
get it signed (Finnut, Lennart Mikkelsen)

10. Other

Summer vacation

Who	When
Børre	24.7 - 20.8
Linda	?
Maaren	on sick leave
Saara	July
Sjur	3.7 - 23.7 + single days at other times
Thomas	3.7 - 7.8
Trond	3.7 - 14.8 (last two weeks off at summer school)
Tomi	8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help... Saara will contact Roy on this issue.

Gobby

TODO:

install Gobby (Saara)
document its use within our project (Børre)
- done
review the document ( Thomas)

SEE 2.5 extensions

Future extensions and wish modes:

write a perl script to extract all TODO items and sort them according to responsible person; wrap the script in a AppleScript available from the menu
twol mode
xfst mode
lexc mode
vislcg mode

TODO:

move the above list to Bugzilla (Sjur)
- done

Task lists as iCal entries

This feature requires that the patch Sjur sent in to Forrest regarding parsing of nested lists within wiki documents is applied. It was applied to Forrest last Friday, thus all who is interested in using this feature from their local Forrest should update their installation. It is easiest done using Subversion (svn) to update your local copy.

TODO:

create iCal entries from our meeting memos (Sjur or Børre)
- done (Sjur)
update all forrest installations to latest svn HEAD (Børre)

Project meeting in Tromsø in august?

The project board has decided upon a meeting in Tromsø in august. We'll discuss the date later this week.

11. Next meeting, closing

Next meeting is undefined due to summer vacation.

Closed at 14: 36.

Appendix - task lists for the next week

Boerre

corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord)
- send contracts to Čálliid Lágádus
- Contact Richard Valkepää at NSI about older Min Áigi and Ássu files.
corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move norwegian documents in Min Áigi from sme to nob
corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
  - User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
  - Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
set up Bugzilla automatic reminders for open issues
create document & document entry for name double-tagging
Update forrests to latest svn version
fix bugs!

Maaren

On sick leave

Saara

Create a parallel corpora of the new testaments
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
Implement parallel corpus upload in web upload script
Install Gobby
Test the aligners once again
refine the xml output of the xml-tagged analyses
convert or adapt the received PHP for paradigm generation to our needs
remove headers and footers from antiword documents, other improvements
fix bugs!

Sjur

public tender:
- review letters to tenderers, contract for subcontractor
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
review user and admin documentation for corpus access
write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
correct jspwiki docs with mixed lists
fix bugs!

Thomas

investigate productivity of even-syllable Actio compounding
investigate and identify under which conditions even-syllable Actio compounding is possible
discuss findings with the rest of us
add proper numeral analysis/treatment to smj
add loanwords (e.g. latin -ere verbs) to smj
sme G3 issue
review user documentation for corpus access
create smj abbr file
review the document
Redirected following three syllable verbs and prevent them from being first-part Actios in compounds:
- Reflexives on -dit
- Reciprocals on -dit, -(a)lit
- Momentatives on -dit, -(a)lit, -ádit, -ihit
- Frequentatives on -(a)lit, -(u)hit, -dit
- Continuatives on -dit, -(u)hit, -nit
- Inchoatives in -nit
- Translatives on -dit
- Essives on -dit and -stit
- Causatives on -dit, -stit
find and study all derived verbs in our corpus (depends on Trond)
suggest which derivations could be generated (depends on Trond)

Tomi

new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
read aligner docu, install, provide feedback
Set up the mechanism for the hash-mark transducer package
test the new xml output of the xml-tagged analyses
export corpus tools to /opt (with cron)
fix bugs!

Trond

better smj NT text
get fin, swe, nob and nno NT and OT in paratext format
install aligner, test it and give feedback
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
make shell script wrappers for the most common commands for user friendlyness
write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
write documentation for our bound users, with pointers to the ordinary documentation
write documentation on double-tagging names
discuss web-only user access management with Oslo
change tagging of derived stems in the disamb output, to facilitate much easier extraction of non-lexicalised derivations
fix bugs!.