Meeting_2006-06-19
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Public tender
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 19.06.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:59.
Present: Sjur, Thomas, Trond, Børre, Tomi
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Synnøve Persen
- Gather public texts, preferably also parallel ones
- http://finnmarksloven.no gathered, fully parallel
- Send out letters to the rest of the Iđut authors
- send contract to Kurt Tore Andersen
- Done
- call Brita Kåven again
- Not done
- contact Ája (Kåfjord)
- No answer
- Send renaming scripts to R. Valkeapää
- Done
- call Bård Eriksen
- Not done
- send contracts to Čálliid Lágádus
- Not done
- corpus conversion:
- Continue converting text from input format to our xml
- Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
- Not done
- complete Min Áigi metadata
- Some done
- corpus access:
- meeting 9.6. (postponed to 15(?).6) at 9.30: discuss and decide upon the exact
- Done
- set up the unix group structure for corpus users (also Saara, Trond)
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- Not done
- test Gobby (with others)
- done
- document use of Gobby within our project if above test is ok
- Not done
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- Implement parallel corpus upload in web upload script
- not done
- Install Gobby
- not done
- set up the unix group structure for corpus users (also Børre, Trond)
- Test the aligners once again
- not done
- refine the xml output of the xml-tagged analyses
- waiting for Tomi's ccat implementation
- convert or adapt the received PHP for paradigm generation to our needs
- done some planning.
- remove headers and footers from antiword documents, other improvements
- done, now improving handling of pdf-documents
- discuss parallel corpus markup
- done
- extend the DTD with parallel markup
- do we want that? In the newsgroup, I proposed to keep the marking
- Implement cleaning the corpus text in the file-specific xsl file
- not done
- fix bugs!
Sjur
- public tender:
- collect the last clarifying answers, and make a final proposal to the board
- still no answer from PL. If it doesn't arrive today, we'll make our proposal
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- started some work on integrating M4 processing of the twol code - needed to
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- investigation ready, and the concept phase is over, only implementation left
- change corpus summary processing to generate smaller pages
- nothing yet
- meeting 23.5. at 9.30: discuss and decide upon the exact access policy we want
- done
- test Gobby (with others)
- not done
- discuss parallel corpus markup
- not really
- move the SEE 2.5 extension list to Bugzilla
- not done
- fix bugs!
- other:
- monthly reports for April and May
- fixed a bug in the Forrest JSPWiki input plugin that produced invalid HTML
Thomas
- investigate Actio compounding, first sme, later smj
- done North-sámi three-syll
- discuss findings with Maaren, and later all of us
- not done, Maaren is on sick leave
- add proper numeral analysis/treatment to smj
- not done
- add loanwords (e.g. Latin -ere verbs) to smj
- not done
- work on compounding
- soon finished
- Lule Sámi incoming words
- finished
- bug-fixing
- done some
- sme G3 issue
- not done
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner documentation, install, provide feedback
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- refine the xml output of the xml-tagged analyses
- done, but is it ready?
- convert or adapt the received PHP for paradigm generation to our needs
- not done
- test Gobby (with others)
- not done
- fix bugs!
Trond
- better smj NT text
- Not done, must finish Swedish work with Saara and Børre first.
- get fin, swe, nob and nno NT and OT in paratext format
- Still waiting for Oslo
- install aligner, test it and give feedback
- Not done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done
- meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- Done
- set up the unix group structure for corpus users (also Børre, Saara)
- Done
- test Gobby (with others)
- Done on a regular basis, also last week.
- discuss parallel corpus markup
- Discussed with Lars and Saara, made good progress, and we will send texts
- fix bugs!
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write
- done
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
- Admin documentation, telling how we set the permissions to the corpus files,
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
- none last 2 weeks
Olavi Korhonen's Lule Sámi dictionary.
TODO:
- set up user account/corpus access for Olavi (Børre)
- we need the infrastructure for corpus access in place first
- decided, but still not implemented
- Contact Olavi Korhonen, to actually get the dictionary
KIO Grafisk and the Iđut books
TODO:
- send letter to Kurt Tore Andersen (Børre)
- done
- send letters to the other authors (Børre)
- sent to Synnøve Persen
- Erling Persen has lost his files, only source is Iđut
Bible texts
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
- waiting for Oslo
- convert smj NT to paratext (Børre)
- waiting for an NT in paratext format (whatever language will do)
- convert fin, swe to paratext or directly to our XML (Børre)
Davvi Girji
TODO:
- call Brita Kåven again towards the end of the week (Børre)
- not done
- call the authors (Børre)
- not more
Min Áigi
TODO:
- complete metainformation (Børre)
- continued, not finished
Kåfjord
TODO:
- contact Ája (Børre)
- no answer, will try again
Sámi Instituhtta
TODO:
- send renaming scripts to R. Valkeapää (Børre)
- done
Čálliid Lágádus
http://www.calliidlagadus.org/
TODO:
- send contracts (Børre)
- not done
Árran
TODO:
- call Bård Eriksen (Børre)
- Done
5. Corpus infrastructure
General
Errors in the Antiword conversions found when parsing the xml corpus.
There are also problems with the PDF conversion. Børre has found a
TODO:
- remove headers and footers from antiword documents, other improvements
- in the works, also improvements to the PDF conversion
User accounts and access
TODO:
- discuss and decide upon the exact access policy we want to give corpus users;
- meeting done, full memo available.
Conclusion: we have the following classes of users:
- Users that don't need a unix shell
- linguists doing research on singleton examples
- historians and other people interested in content, not in form
- Users that do need a unix shell
- linguists doing research on texts as a whole
- linguists with separate analysis tools
- language technology developers
Shell access
External users will get their own user account, belonging to the groups
To let the bound group members be able to analyse, we need to do some minor
TODO:
- make a group bound for our external corpus users, which: (Børre)
- gives access to read our bound texts
- gives access to execute/run the tools in /opt
- export to /opt (with cron) tools that the project team members find in
- ccat (and some perl scripts?)
- other tools?
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
- write documentation for how to apply for a user account (where's the form, to
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
Users of only the free corpus won't need anything but a browser.
Users of the bound corpus will need a username and password to the Oslo computer
TODO:
- discuss with Oslo
- delay other tasks until we are ready to go public?
- user management for access to bound texts
More texts to the graphical corpus interface:
TODO:
- refine xml-tagged output (Saara and Tomi)
- done, but still open if it is finished
- add text to the server (Lars)
Aligner
Trond and Saara will continue this issue.
We need markup of parallelism in the corpus DTD, at least an indication of which
TODO:
- discuss parallel corpus markup (Saara, Trond, Sjur, others)
- done
- extend the DTD (Saara)
- done
Language recognition
Still waiting for more smj text to improve it.
Free and non-free texts
Anything? Final check with Børre and Saara - waiting for them to return.
Corpus summary
Forrest goes into an endless loop when processing these files. It happens when
TODO:
- trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
- nothing done yet
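The lumping suggestion above could be sketched roughly as follows. This is only an illustration in Python, not the Forrest processing itself: the function name, the input mapping and the default threshold of 4 (one reading of "X < 5?") are assumptions.

```python
def plan_summary_pages(paragraph_counts, threshold=4):
    """Decide which files get their own summary page and which are lumped.

    paragraph_counts: mapping of file name -> number of paragraphs
                      (hypothetical input, e.g. counted from the corpus xml).
    threshold:        files with fewer paragraphs than this are lumped together;
                      the value 4 is an assumption, not a project decision.
    Returns (own_page, lumped), two alphabetically sorted lists of file names.
    """
    own_page = []
    lumped = []
    for name in sorted(paragraph_counts):
        if paragraph_counts[name] < threshold:
            lumped.append(name)
        else:
            own_page.append(name)
    return own_page, lumped


if __name__ == "__main__":
    # Made-up file names and counts, for illustration only.
    counts = {"a.xml": 2, "b.xml": 12, "c.xml": 4, "d.xml": 3}
    print(plan_summary_pages(counts))
```

The lumped list would then be rendered as one combined page, which should keep the number of generated pages down.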
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
TODO:
- convert or adapt the received PHP to our needs (Tomi or Saara)
- Saara has started to look at it
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Sjur has started some work with M4 integration
- Update the sme and sma hyphenator rule sets with the insights gained from smj
- Thomas has updated sme
- update (some of) sma (Trond during summer vacation)
Automatic Bugzilla reminder for untouched bugs
TODO:
- set up Bugzilla to send automatic reminders for bugs not touched in a given
- in the meantime: Bugzilla as RSS feed
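Until Bugzilla itself sends the reminders, the selection step could be sketched like this. It is a minimal Python illustration that assumes the bug numbers and last-changed timestamps have already been exported somehow (for instance from the RSS feed); it does not call any Bugzilla interface, and the 14-day limit is an assumption.

```python
from datetime import datetime, timedelta


def untouched_bugs(bugs, now, max_idle_days=14):
    """Return the ids of bugs not changed within the last max_idle_days.

    bugs: iterable of (bug_id, last_changed) pairs, last_changed a datetime.
    now:  the reference time; passed in so the function is easy to test.
    """
    limit = now - timedelta(days=max_idle_days)
    return sorted(bug_id for bug_id, last_changed in bugs if last_changed < limit)


if __name__ == "__main__":
    # Hypothetical bug numbers and dates, for illustration only.
    sample = [
        (101, datetime(2006, 5, 1)),
        (102, datetime(2006, 6, 15)),
        (103, datetime(2006, 4, 2)),
    ]
    print(untouched_bugs(sample, now=datetime(2006, 6, 19)))
```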
JSPWiki update
Sjur has corrected and improved the jspwiki parsing in Forrest, and
Mixed list example:
# something numbered
** some sub-thing with a bullet
* something bulleted
## some sub-thing with a number
Sjur tried to grep, but multiline pattern matching is beyond him. Tomi:
egrep -C 3 -R "^\*.*[$]*{1,16}\#" *
TODO:
- grep for all occurrences of ^* followed by a line ^## and vice versa (the
- done during the meeting
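Since the search spans two lines, a small script may be easier than grep. The following Python sketch assumes that "vice versa" means a line starting with a single # followed by a line starting with **; the sample text and the function name are made up for illustration.

```python
import re

# Made-up JSPWiki snippet; the first four lines mirror the mixed list example.
SAMPLE = """\
# something numbered
** some sub-thing with a bullet
* something bulleted
## some sub-thing with a number
* plain bullet
** nested bullet
"""

# Either a line starting with a single "*" directly followed by a line starting
# with "##", or a line starting with a single "#" directly followed by "**".
MIXED = re.compile(
    r"^(?:\*(?!\*)[^\n]*\n##|#(?!#)[^\n]*\n\*\*)",
    re.MULTILINE,
)


def find_mixed_lists(text):
    """Return the 1-based line numbers where a mixed list pair starts."""
    return [text.count("\n", 0, match.start()) + 1 for match in MIXED.finditer(text)]


if __name__ == "__main__":
    print(find_mixed_lists(SAMPLE))
```

To check a whole documentation tree, the function could be applied to the contents of each wiki source file in turn.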
7. Linguistics
Name double-tagging
Conclusion, in a principled fashion:
- hardcoded sem-tags win
- There is a sem-tag conversion procedure: according to a hierarchy of sem-tags:
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
- write guidelines for annotators with regard to name tagging and put them under
Systematic - when adding new names, only use one sem-tag unless there are known objects
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
TODO:
- investigate Actio compounding (Thomas)
- done North-sámi three-syll
- discuss findings with Maaren, and later all of us (Thomas)
- not done, Maaren is on sick leave
Actio compounding
It is definitely productive. Whether this is a problem for our speller(s), we
Whether Actio is used or not in compounding a verbal stem follows from the
TODO:
- investigate and identify under which conditions Actio compounding is possible
Lule Sámi
TODO:
- 50 unknown words left + 2 abbr. + moaddi etc. (numerals) need more checks
- done except for abbr. and numerals
- add abbr to a new abbr lexicon file
- numerals are covered by the next task
- add proper numeral analysis/treatment (Thomas)
- not done
- add loanwords (e.g. Latin -ere verbs) (Thomas)
- not done
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
- investigation done, still not implemented
- develop the needed XQueries and interface (Sjur, Tomi)
- nothing this week
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- not yet done
9. Public tender
Nothing received from PL yet. They have an extended deadline today. After that
TODO:
- final evaluation based on the answers to the clarification letters
- waiting for the PL answer
- present the final report to the board, and bring their decision into action
10. Other
Summer vacation
From Bitte: "According to last year's list, Børre at least took out 10 days of
She would like to receive a final vacation plan soon.
Who | When |
---|---|
Børre | 24.7 - 20.8 |
Linda | ? |
Maaren | on sick leave |
Saara | July |
Sjur | at least 2 weeks in July, but still open |
Thomas | 3.7 - 7.8 |
Trond | 3.7 - 14.8 (last two weeks off at summer school) |
Tomi | 8.7 - 16.7, 2 more weeks in July and/or August |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
Gobby
TODO:
- install Gobby (Saara)
- test it (Tomi, Trond, Sjur, Børre, Lars?)
- done
- if successful, document its use within our project (Børre)
SEE 2.5 extensions
Future extensions and wish modes:
- write a perl script to extract all TODO items and sort them according to
- twol mode
- xfst mode
- lexc mode
- vislcg mode
TODO:
- move the above list to Bugzilla (Sjur)
Task lists as iCal entries
With the latest corrections to the Wiki parsing, and with the tasks at the end
iCal entries look like the following:
BEGIN:VTODO
DTSTAMP:20060619T090920Z
ORGANIZER;CN=Børre Gaup:MAILTO:boerre@skolelinux.no
CREATED:20050621T171425Z
UID:libkcal-939001838.216
SEQUENCE:1
LAST-MODIFIED:20050622T050540Z
SUMMARY:Ordne lenker
CLASS:PUBLIC
PRIORITY:5
DUE:20050824T073000Z
COMPLETED:20050622T050540Z
PERCENT-COMPLETE:100
END:VTODO
A reference can be found at Wikipedia
References should be of the type:
/doc/admin/weekly/2006/Tasks_2006-06-19_Sjur.ics
TODO:
- create iCal entries from our meeting memos (Sjur or Børre)
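As a starting point for this task, here is a rough Python sketch that turns one person's task list into VTODO entries. The function names, the UID scheme and the PRODID value are assumptions, and the sketch leaves out details such as folding of long lines, so it is not a finished converter.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape backslash, semicolon, comma and newline in an iCalendar TEXT value."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def tasks_to_ical(owner, tasks, stamp, uid_prefix="tasks-2006-06-19"):
    """Build a VCALENDAR string with one VTODO per task.

    owner:      name used in the UID (hypothetical scheme).
    tasks:      list of task summaries taken from the memo.
    stamp:      datetime in UTC, written as DTSTAMP.
    uid_prefix: assumed prefix; each UID must only be unique per entry.
    """
    stamp_text = stamp.strftime("%Y%m%dT%H%M%SZ")
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//memo-tasks//EN"]
    for number, task in enumerate(tasks, start=1):
        lines += [
            "BEGIN:VTODO",
            f"DTSTAMP:{stamp_text}",
            f"UID:{uid_prefix}-{owner}-{number}",
            f"SUMMARY:{escape_text(task)}",
            "CLASS:PUBLIC",
            "PERCENT-COMPLETE:0",
            "END:VTODO",
        ]
    lines.append("END:VCALENDAR")
    # iCalendar content lines end with CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    when = datetime(2006, 6, 19, 9, 9, 20, tzinfo=timezone.utc)
    print(tasks_to_ical("Sjur", ["move the SEE 2.5 extension list to Bugzilla"], when))
```

The output could be saved under a file name of the type given above, one file per person and meeting.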
11. Next meeting, closing
26.06.2006 09:30
Sjur might be away, will inform you later.
Closed at 11:20.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again
- contact Ája (Kåfjord)
- call Bård Eriksen
- send contracts to Čálliid Lágádus
- Contact Olavi Korhonen, to actually get the dictionary
- corpus conversion:
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
- complete Min Áigi metadata
- corpus access:
- make a group bound for our external corpus users
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- document use of Gobby within our project
- create document & document entry for name double-tagging
- create iCal entries from our meeting memos (or Sjur)
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- remove headers and footers from antiword documents, other improvements
- fix bugs!
Sjur
- public tender:
- collect the last clarifying answers, and make a final proposal to the board
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- change corpus summary processing to generate smaller pages, see
- move the SEE 2.5 extension list to Bugzilla
- create iCal entries from our meeting memos (or Børre)
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
- fix bugs!
Thomas
- investigate productivity of Actio compounding in smj
- investigate and identify under which conditions Actio compounding is possible
- discuss findings with the rest of us
- add proper numeral analysis/treatment to smj
- add loanwords (e.g. latin -ere verbs) to smj
- work on compounding
- sme G3 issue
- review user documentation for corpus access
- create smj abbr file
- fix bugs!
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner documentation, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
- write documentation for our bound users, with pointers to the ordinary
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- fix bugs!