Meeting_2006-06-26
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Public tender
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 26.06.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
The meeting was delayed due to the project board having a telephone meeting at
Opened at 13:17.
Present: Sjur, Thomas, Børre, Tomi
Absent: Maaren, Saara, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
  - send out contracts with accompanying letter
  - Gather public texts, preferably also parallel ones
  - Send out letters to the rest of the Iđut authors
  - call Brita Kåven again
    - Done
  - contact Ája (Kåfjord)
    - Not done
  - call Bård Eriksen
    - Done
  - send contracts to Čálliid Lágádus
    - Not done
  - Contact Olavi Korhonen, to actually get the dictionary
    - Done
- corpus conversion:
  - Continue converting text from input format to our xml
  - convert nob and nno bible texts to be used as part of a parallel corpus
  - convert fin, swe to paratext or directly to our XML
  - review the paratext2xml converter
  - convert smj NT to paratext
    - None of the bible things are done
  - complete Min Áigi metadata
    - Done
- corpus access:
  - make a group bound for our external corpus users
    - Done
  - possibly deploy the user account form as an HTML form
    - Not done
  - make a test user
    - Not done
  - Write both user and admin documentation
    - Some done
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
  - Built-in feature of Bugzilla, tried to turn it on.
- document use of Gobby within our project
  - Done
- create document & document entry for name double-tagging
  - Not done
- create iCal entries from our meeting memos (or Sjur)
  - Not done
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- remove headers and footers from antiword documents, other improvements
- fix bugs!
Sjur
- public tender:
  - collect the last clarifying answers, and make a final proposal to the board
    - done. Decision by the board: Polderland
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
  - nothing since last attempt
- name lexicon:
  - implement editing functions
  - finalise refactoring for multiple collections (regular search interface)
- change corpus-summary processing to generate smaller pages, see
  - done, now waiting for feedback from Børre
- move the SEE 2.5 extension list to Bugzilla
  - done
- create iCal entries from our meeting memos (or Børre)
  - done
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- fix bugs!
- other tasks:
  - administrative tasks
Thomas
- investigate productivity of Actio compounding in smj
  - had a look at the three-syllables
- investigate and identify under which conditions Actio compounding is possible
  - have identified some things
- discuss findings with the rest of us
  - will do
- add proper numeral analysis/treatment to smj
  - not done
- add loanwords (e.g. Latin -ere verbs) to smj
  - not done
- work on compounding
  - finished in practice; some things remain that I need to discuss with Maaren
- sme G3 issue
  - not done
- review user documentation for corpus access
  - not done
- create smj abbr file
  - not done
- fix bugs!
  - not done
Tomi
- new proper name lexicon
  - data synchronisation of proper nouns between risten.no and CVS
    - not done
  - XQuery refactoring and code development for our proper noun editor
    - doing
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
    - not done
- read aligner docu, install, provide feedback
  - not done
- Set up the mechanism for the hash-mark transducer package
  - not done
- test the new xml output of the xml-tagged analyses
  - not done
- export corpus tools to /opt (with cron)
  - not done
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- fix bugs!
3. Documentation
TODO:
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
  - User documentation probably in several languages. This covers how to apply
  - Admin documentation, telling how we set the permissions to the corpus files,
  - not done yet
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
- none in the last two weeks
Olavi Korhonen's Lule Sámi dictionary.
Talked with him, sent him the contract.
TODO:
- Contact Olavi Korhonen, to actually get the dictionary
  - done
KIO Grafisk and the Iđut books
TODO:
- send letters to the other authors (Børre)
Bible texts
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
  - waiting for Oslo
- convert smj NT to paratext (Børre)
  - waiting for an NT in paratext format (whatever language will do)
- convert fin, swe to paratext or directly to our XML (Børre)
Davvi Girji
Called Brita Kåven. She can't say anything about letting us have the texts
TODO:
- call Brita Kåven again towards the end of the week (Børre)
  - done
- call the authors (Børre)
  - not more
Min Áigi
TODO:
- complete metainformation (Børre)
  - done as far as possible: author and language info
  - some smj files moved to the smj tree
  - still some Norwegian files to be checked
Kåfjord
TODO:
- contact Ája (Børre)
  - forgotten
Sámi Instituhtta
When will we get the corpus? We don't know yet; Børre will contact NSI again.
TODO:
- contact NSI again (Børre)
Čálliid Lágádus
http://www.calliidlagadus.org/
TODO:
- send contracts (Børre)
  - nothing
Árran
Talked to Bård Eriksen; he needs to discuss this further with his coworkers.
TODO:
- continue discussion (Børre)
  - done, not finished
5. Corpus infrastructure
General
TODO:
- remove headers and footers from antiword documents, other improvements
  - in the works, also improvements to the PDF conversion
    - problems with running the tools, they need some updated libraries in
User accounts and access
For details, see the previous meeting memo, as well as the
Shell access
TODO:
- make a group bound for our external corpus users, which: (Børre)
  - gives access to read our bound texts
  - gives access to execute/run the tools in /opt
    - done
- export to /opt (with cron) tools that the project team members find in
  - ccat (and some perl scripts?)
  - other tools?
    - nothing yet
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
  - short discussion with Trond about what we need
- possibly deploy the user account form as an HTML form (Børre)
  - waiting for the form to be written
- write documentation for our bound users, with pointers to the ordinary
  - nothing done
- write documentation for how to apply for a user account (where's the form, to
  - nothing
- make our own guidelines for the user application processing (Børre)
  - nothing
- make a test user (Børre)
  - nothing
- test corpus access as test user (Trond)
  - nothing yet
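The access rule for the bound group (members may read the bound texts, everyone else may not) can be checked mechanically. The following is only a minimal sketch, assuming the restricted files should be group-readable and closed to other users; the function name and the example modes are illustrative assumptions, not our actual setup.

```python
import stat


def is_group_only_readable(mode: int) -> bool:
    """Return True if the mode grants group read access and nothing to others."""
    group_can_read = bool(mode & stat.S_IRGRP)
    others_have_access = bool(mode & stat.S_IRWXO)
    return group_can_read and not others_have_access


# Hypothetical example modes: 0o640 would suit a bound text,
# 0o644 would leave it readable by everyone.
print(is_group_only_readable(0o640))
print(is_group_only_readable(0o644))
```

For a real file the mode could be taken from `os.stat(path).st_mode`, which would make this usable when testing corpus access as the test user.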
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
More texts to the graphical corpus interface:
TODO:
- refine xml-tagged output (Saara and Tomi)
  - done, but it is still unclear whether it is finished
- add text to the server (Lars)
Aligner
More to be said about this? (certainly, but right now?)
Language recognition
Still waiting for more smj text to improve it.
Corpus summary
Forrest goes into an endless loop when processing these files. It happens when
It is now implemented, but a test to skip this summary option for smaller
Improvement suggestion: instead of summarizing files under a certain limit,
TODO:
- trim generated corpus summary pages (Sjur)
  - suggestion: lump together files with content less than X paragraphs (X < 5?)
    - done - Forrest seems to work fine now
- add improvement suggestion to Bugzilla (Børre)
  - done
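The suggestion to lump small files together can be illustrated as follows. This is a sketch only, assuming we already have a paragraph count per file; the function name, the file names and the default limit of 5 are assumptions for illustration and do not describe the actual summary processing.

```python
from typing import Dict, List, Tuple


def split_for_summary(
    paragraph_counts: Dict[str, int], limit: int = 5
) -> Tuple[List[str], List[str]]:
    """Split files into those that get their own summary page and those
    that are lumped together because they have fewer than `limit` paragraphs."""
    own_page = sorted(name for name, n in paragraph_counts.items() if n >= limit)
    lumped = sorted(name for name, n in paragraph_counts.items() if n < limit)
    return own_page, lumped


# Hypothetical paragraph counts per corpus file.
counts = {"a.xml": 12, "b.xml": 2, "c.xml": 5, "d.xml": 0}
own_page, lumped = split_for_summary(counts)
print(own_page)
print(lumped)
```

Keeping the limit as a parameter makes it easy to try different values of X before settling on one.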
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
Saara has given a report on the PHP code in News. Please read.
TODO:
- convert or adapt the received PHP to our needs (Saara)
  - code evaluated, evaluation reported
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
  - nothing since last meeting
- Update the sma hyphenator rule set with the insights gained from smj updates
Automatic Bugzilla reminder for untouched bugs
TODO:
- set up Bugzilla to send automatic reminders for bugs not touched in a given
  - turned on the feature, but it doesn't seem to work - needs some more
JSPWiki update
Here's the pattern to use:
egrep -C 3 -R "^\*.*[$]*{1,16}\#" *
TODO:
- grep for all occurrences of ^* followed by a line ^## and vice versa (the
  - Tomi tried it, but nothing found
- correct the documents found to have consistent lists (Sjur)
  - not done
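Since grep matches one line at a time by default, a check that spans two neighbouring lines may be easier to express in a small script. The following is a sketch under the assumption that a "mixed list" means a line starting with `*` directly next to a line starting with `#`; the function name is hypothetical.

```python
from typing import List


def find_mixed_list_lines(text: str) -> List[int]:
    """Return 1-based numbers of lines where a '*' list line is directly
    followed by a '#' list line, or the other way round."""
    lines = text.splitlines()
    hits = []
    for i in range(len(lines) - 1):
        first, second = lines[i][:1], lines[i + 1][:1]
        if {first, second} == {"*", "#"}:
            hits.append(i + 1)
    return hits


# Hypothetical wiki fragment with two places where the list type switches.
sample = "* a\n## b\n# c\n# d\n* e\ntext\n# f"
print(find_mixed_list_lines(sample))
```

Running it over each wiki document and printing the file name together with the reported line numbers would give the list of documents to correct.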
7. Linguistics
Derivation and spellers like Aspell
It is impossible to create a model that can dynamically generate new derivations
+st, +l, +h, -goahtit, (o)juvvot (passive)
Later we might consider lexicalising all other derivations found in our corpus.
To make it easier to extract all derived stems, we should enhance the tags used
"<laktigohtet>" "laktit" V TV goahti Ind Prs Pl3 @+FMAINV
Only the tag goahti identifies the word form as a derivation. It
TODO:
- change tagging of derived stems in the disamb output, to facilitate much
  - not done, the other tasks depend on this one
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
  - see source code above, but also consider overgeneration problems, as well as
- lexicalise the rest (Thomas)
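To show why a dedicated derivation tag would help, here is a sketch of how derived forms have to be picked out of the analysed output today. It assumes each reading is available as a single line laid out like the laktigohtet example above, and that lemmas contain no spaces. The tag set holds only goahti, taken from that example; a real list would have to name every derivation tag explicitly, and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: only "goahti" comes from the memo; other derivation tags are unknown here.
DERIVATION_TAGS = {"goahti"}


def parse_reading(line: str) -> Tuple[str, str, List[str]]:
    """Split one reading into word form, lemma and the remaining tags."""
    parts = line.split()
    word_form = parts[0].strip('"<>')
    lemma = parts[1].strip('"')
    return word_form, lemma, parts[2:]


def derivation_tags(line: str) -> List[str]:
    """Return the tags of a reading that mark it as a derivation."""
    _, _, tags = parse_reading(line)
    return [tag for tag in tags if tag in DERIVATION_TAGS]


reading = '"<laktigohtet>" "laktit" V TV goahti Ind Prs Pl3 @+FMAINV'
print(parse_reading(reading)[:2])
print(derivation_tags(reading))
```

With a common marker on all derivation tags, the explicit set could be replaced by a simple test on the tag itself, which is the point of the tagging change.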
Name double-tagging
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
- write guidelines for annotators with regard to name tagging and put them under
Systematic - when adding new names, only use one sem-tag unless there are known objects
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
TODO:
- discuss findings with all of us (Thomas)
- investigate and identify under which conditions Actio compounding is possible
The following already-derived verbs do not accept further derivation. It seems
LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 5-syllables, formerly directed to MUITAL
 +V+TV: MUITALStem ;
!SHOULD be directed here as well:
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit
Lule Sámi
TODO:
- add inc abbr to a new abbr lexicon file (Thomas)
- add proper numeral analysis/treatment (Thomas)
- add loanwords (e.g. Latin -ere verbs) (Thomas)
- nothing done to any of these
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
  - still not implemented
- develop the needed XQueries and interface (Sjur, Tomi)
  - done the inflection menu, but it doesn't work...
- data synchronisation between risten.no and the CVS repo (Tomi)
  - discussion started on eXist-list, nothing useful came up. We need to
  - nothing done
9. Public tender
We finally received their answer as well.
TODO:
- final evaluation based on the answers to the clarification letters
  - done
- present the final report to the board, and bring their decision into action
  - done. Decision: Polderland.
- write a contract (mostly done by Finnut, review by Sjur)
- get it signed (Finnut, Lennart Mikkelsen)
10. Other
Summer vacation
Who | When |
---|---|
Børre | 24.7 - 20.8 |
Linda | ? |
Maaren | on sick leave |
Saara | July |
Sjur | 3.7 - 23.7 + single days at other times |
Thomas | 3.7 - 7.8 |
Trond | 3.7 - 14.8 (last two weeks off at summer school) |
Tomi | 8.7 - 16.7, 2 more weeks in July and/or August |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
Gobby
TODO:
- install Gobby (Saara)
- document its use within our project (Børre)
  - done
- review the document (Thomas)
SEE 2.5 extensions
Future extensions and wish modes:
- write a perl script to extract all TODO items and sort them according to
- twol mode
- xfst mode
- lexc mode
- vislcg mode
TODO:
- move the above list to Bugzilla (Sjur)
  - done
Task lists as iCal entries
This feature requires that the patch Sjur sent in to Forrest regarding parsing
TODO:
- create iCal entries from our meeting memos (Sjur or Børre)
  - done (Sjur)
- update all forrest installations to latest svn HEAD (Børre)
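A first step toward turning a memo into calendar entries is pulling the task lines and their owners out of the text. The sketch below assumes that a task is a list line ending in the owner names in parentheses, as in the TODO lists of this memo. The regular expression, the status word list and the function name are illustrative assumptions, and this is not the Forrest-based processing we actually use.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: a task line looks like "- task text (Owner)" or "- task text (Owner, Owner)".
TASK_RE = re.compile(r"^-\s+(?P<task>.+?)\s*\(\s*(?P<owners>[^()]+?)\s*\)\s*$")
# Status lines such as "- done (Sjur)" carry a name but are not tasks themselves.
STATUS_WORDS = {"done", "not done", "nothing"}


def tasks_by_owner(memo_text: str) -> Dict[str, List[str]]:
    """Group task descriptions by the owner named at the end of each task line."""
    grouped = defaultdict(list)
    for line in memo_text.splitlines():
        match = TASK_RE.match(line.strip())
        if not match or match.group("task").lower() in STATUS_WORDS:
            continue
        for owner in re.split(r",\s*|\s+or\s+|\s+and\s+", match.group("owners")):
            grouped[owner.strip()].append(match.group("task"))
    return dict(grouped)


sample = """TODO:
- send contracts (Børre)
- done (Sjur)
- develop the needed XQueries and interface (Sjur, Tomi)
- create iCal entries from our meeting memos (Sjur or Børre)"""
for owner, tasks in tasks_by_owner(sample).items():
    print(owner, tasks)
```

The sketch has known limits: a parenthesis that names a place rather than a person, or an owner list containing extra words such as "review:", would need additional handling.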
Project meeting in Tromsø in August?
The project board has decided upon a meeting in Tromsø in August. We'll discuss
11. Next meeting, closing
Next meeting is undefined due to summer vacation.
Closed at 14:36.
Appendix - task lists for the next week
Børre
- corpus collection:
  - send out contracts with accompanying letter
  - Gather public texts, preferably also parallel ones
  - Send out letters to the rest of the Iđut authors
  - contact Ája (Kåfjord)
  - send contracts to Čálliid Lágádus
  - Contact Richard Valkepää at NSI about older Min Áigi and Ássu files.
- corpus conversion:
  - convert nob and nno bible texts to be used as part of a parallel corpus
  - convert fin, swe to paratext or directly to our XML
  - review the paratext2xml converter
  - Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
  - possibly deploy the user account form as an HTML form
  - make a test user
  - Write both user and admin documentation (Børre, review: Sjur, Thomas)
    - User documentation probably in several languages. This covers how to apply
    - Admin documentation, telling how we set the permissions to the corpus files,
- set up Bugzilla automatic reminders for open issues
- create document & document entry for name double-tagging
- Update Forrest installations to the latest svn version
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- remove headers and footers from antiword documents, other improvements
- fix bugs!
Sjur
- public tender:
  - review letters to tenderers, contract for subcontractor
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
  - implement editing functions
  - finalise refactoring for multiple collections (regular search interface)
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- correct JSPWiki docs with mixed lists
- fix bugs!
Thomas
- investigate productivity of even-syllable Actio compounding
- investigate and identify under which conditions even-syllable Actio
- discuss findings with the rest of us
- add proper numeral analysis/treatment to smj
- add loanwords (e.g. Latin -ere verbs) to smj
- sme G3 issue
- review user documentation for corpus access
- create smj abbr file
- review the document
- Redirect the following three-syllable verbs and prevent them from being
  - Reflexives on -dit
  - Reciprocals on -dit, -(a)lit
  - Momentatives on -dit, -(a)lit, -ádit, -ihit
  - Frequentatives on -(a)lit, -(u)hit, -dit
  - Continuatives on -dit, -(u)hit, -nit
  - Inchoatives in -nit
  - Translatives on -dit
  - Essives on -dit and -stit
  - Causatives on -dit, -stit
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
  - data synchronisation of proper nouns between risten.no and CVS
  - XQuery refactoring and code development for our proper noun editor
  - new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- change tagging of derived stems in the disamb output, to facilitate much
- fix bugs!