Meeting_2006-06-12
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Public tender
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 12.06.2006
- Time: 09.30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:59.
Present: Sjur, Thomas, Trond, Tomi
Absent: Børre, Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Maaren has signed a contract, which is sent to Davvi Girji
- Gather public texts, preferably also parallel ones
- Not done
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not possible before we have another paratext NT.
- Send out letters to the rest of the Iđut authors
- Not done
- send contract to Kurt Tore Andersen
- Not done
- call Brita Kåven again
- Haven't got in touch with her
- call Harald Gaski
- Met with him. He has to send contracts to his writers union, to check
- contact Ája (Kåfjord)
- Not done
- Send renaming scripts to R. Valkeapää
- Working on them
- complete Min Áigi metadata
- Halfway done
- call Bård Eriksen
- Not done
- corpus access:
- meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- Delayed once more
- set up the unix group structure for corpus users (also Saara, Trond)
- Waiting for outcome of above meeting
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- Not done
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header
- done
- Implement parallel corpus upload in web upload script
- Install Gobby
- set up the unix group structure for corpus users (also Børre, Trond)
- Test the aligners once again
- make an analyser that retains the xml structure for Oslo
- in progress (with Tomi). The exact structure of the output still open
- discuss parallel text markup in newsgroup
- done
- fix bugs!
Sjur
- public tender:
- write e-mail to PL and ask for more clarification
- done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- not done
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- some work on researching new ways of doing all-collection searches, and at
- change corpus-summary processing to generate smaller pages
- meeting 23.5. at 9.30: discuss and decide upon the exact access policy we want
- postponed
- move to Bugzilla:
- proofed/unproofed parallel text issue
- done
- xml output of paradigm generator
- done
- fix bugs!
Thomas
- hyphenation-rule-set
- a few bugs, otherwise finished
- work on compounding and derivation
- finished with derivation
- lule sámi incoming words
- added a few
- sme G3 issue
- nothing this week
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner documentation, install, provide feedback
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- add ccat option to analyse text while keeping the xml tags and structure
- done some
- fix bugs!
Trond
- better smj NT text
- Not done
- get fin, swe, nob and nno NT and OT in paratext format
- Discussed with fin and swe societies, the Swedes are not quite satisfied with
- install aligner, test it and give feedback
- Not done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done
- Put Saara and Tero in contact with each other
- Per will send the PHP code. Details about Tero will be given at the meeting.
- meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- Postponed.
- set up the unix group structure for corpus users (also Børre, Saara)
- Postponed.
- fix bugs!
- Worked on some bugs.
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
- none in the last 2 weeks
Olavi Korhonen's Lule Sámi dictionary.
TODO:
- set up user account/corpus access for Olavi (Børre)
- we need the infrastructure for corpus access in place first
KIO Grafisk and the Iđut books
TODO:
- send letter to Kurt Tore Andersen (Børre)
- send letters to the other authors (Børre)
Bible texts
The Swedish Bible Society is reluctant to give us the paratext version, as it is
Børre and Trond suggest that we give the MS Word version a second try,
We have asked for any language version in paratext format from the Norwegian
We are still waiting for the Norwegian versions, but Finnish and Swedish are
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
- convert smj NT to paratext (Børre)
- convert fin, swe to paratext or directly to our XML (Børre)
Davvi Girji
Talked to Harald Gaski, he will send the contracts to the writers' organisation,
TODO:
- call Brita Kåven again towards the end of the week (Børre)
- call the authors (Børre)
- Harald Gaski
- done
Min Áigi
TODO:
- complete metainformation (Børre)
Kåfjord
Promised to send us texts. Some texts have arrived, but nothing from Ája.
TODO:
- contact Ája (Børre)
- Not done
Sámi Instituhtta
TODO:
- send renaming scripts to R. Valkeapää (Børre)
Čálliid Lágádus
Børre talked to them; they are positive and will give us print-ready PDFs.
Árran
The negotiations are underway, they discuss it in a meeting today, and we have
TODO:
- call Bård Eriksen (Børre)
5. Corpus infrastructure
General
Errors in the Antiword conversions found when parsing the xml corpus. Main
TODO (all of these in priority order, the third option is really a last resort):
- fine-tune the initial conversion in antiword (Børre or Saara)
- make file-specific fixes in the file-specific xsl file (by having our local
- Manually fix the resulting files before sending off to analysis (??, this
User accounts and access
TODO:
- discuss and decide upon the exact access policy we want to give corpus users;
- still not done, new date: Wednesday 14.6.2006, at 9.30
- set up the unix group structure to open for a new category of users:
More texts to the graphical corpus interface:
We need to get the infrastructure complete to be able to do this, then it
Saara has made corpus-analyze.pl, a script to analyse text while keeping the
TODO:
- refine xml-tagged output (Saara and Tomi)
- add text to the server (Lars)
Aligner
Trond and Saara will continue this issue.
We need markup of parallelism in the corpus DTD, at least an indication of which
Language recognition
Still waiting for more smj text to improve it.
Free and non-free texts
Anything? Final check with Børre and Saara - waiting for them to return.
Corpus summary
TODO:
- trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
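A minimal sketch of the lumping suggestion above, written in Python. The function name, the input format (a mapping from file name to paragraph count) and the threshold of 4 are assumptions made for illustration only; the memo leaves X open, and the real corpus-summary processing may be organised quite differently.

```python
from typing import Dict, List, Tuple


def split_for_summary(paragraph_counts: Dict[str, int],
                      threshold: int) -> Tuple[List[str], List[str]]:
    """Split files into those that keep their own summary entry and
    those small enough to be lumped together on a shared page.

    A file is lumped when it has fewer than `threshold` paragraphs.
    """
    own_entry = sorted(name for name, count in paragraph_counts.items()
                       if count >= threshold)
    lumped = sorted(name for name, count in paragraph_counts.items()
                    if count < threshold)
    return own_entry, lumped


if __name__ == "__main__":
    # Invented file names and counts, for illustration only.
    counts = {"a.xml": 12, "b.xml": 2, "c.xml": 4, "d.xml": 5}
    own, lumped = split_for_summary(counts, threshold=4)
    print("own entry:", own)
    print("lumped:   ", lumped)
```

Keeping the threshold as a parameter makes it cheap to try different values of X before settling on one.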
Proofed vs unproofed corpus files
TODO:
- count the number of parallel files of type unproofed/proofread (Saara)
- done, 35 such file pairs
- delayed for the time being, move to Bugzilla (Sjur)
- done
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
We have now received the original PHP code from Tero/Per. It seems quite easy to
TODO:
- convert or adapt the received PHP to our needs (Tomi or Saara)
- move xml-out option to Bugzilla (Sjur)
- done
Hyphenator
Thomas is finished with adding ^ tags to the sme noun file.
Trond and Thomas have been working on the smj rule component, and have
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done - postponed till next week (w 24)
- Update the sme and sma rule sets with the insights gained from smj updates.
Automatic Bugzilla reminder for untouched bugs
We need to get a summary by e-mail for all bugs not touched in more than 5(?)
TODO:
- set up Bugzilla to send automatic reminders for bugs not touched in a given
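The selection step behind such a reminder could look roughly like the Python sketch below. It does not talk to Bugzilla at all: the `Bug` record, the function names and the example data are hypothetical, and the idle limit is assumed to be counted in days (the memo leaves both the number and the unit open). How the bug list is actually obtained from Bugzilla is left out on purpose.

```python
from datetime import date, timedelta
from typing import List, NamedTuple


class Bug(NamedTuple):
    """Hypothetical minimal bug record; not the Bugzilla data model."""
    bug_id: int
    summary: str
    last_changed: date


def stale_bugs(bugs: List[Bug], today: date, max_idle_days: int = 5) -> List[Bug]:
    """Return bugs idle for more than max_idle_days, longest-idle first."""
    cutoff = today - timedelta(days=max_idle_days)
    stale = [bug for bug in bugs if bug.last_changed < cutoff]
    return sorted(stale, key=lambda bug: bug.last_changed)


def reminder_text(bugs: List[Bug], today: date, max_idle_days: int = 5) -> str:
    """Build the plain-text body of one summary e-mail."""
    lines = [f"Bugs not touched in more than {max_idle_days} days:"]
    for bug in stale_bugs(bugs, today, max_idle_days):
        idle = (today - bug.last_changed).days
        lines.append(f"  #{bug.bug_id} ({idle} days idle): {bug.summary}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    example = [
        Bug(1, "example bug A", date(2006, 6, 1)),
        Bug(2, "example bug B", date(2006, 6, 10)),
    ]
    print(reminder_text(example, today=date(2006, 6, 12)))
```

If Bugzilla itself can schedule such reports, that would be preferable to a home-made script; this only illustrates the filtering rule.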
7. Linguistics
Name double-tagging
Conclusion, in a principled fashion:
- hardcoded sem-tags win
- There is a sem-tag conversion procedure: according to a hierarchy of sem-tags:
TODO:
- when adding new names, only use one sem-tag unless there are known objects
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
Topic: Actio+compound - how productive? How much does it destroy speller
Spelling error from typos.txt: 4 vuolggahansádji vuolggasadji
vuolggahan
vuolggahan vuolgga+N+Sg+Nom+Foc
vuolggahan vuolggahit+V+TV+PrfPrc
vuolggahan vuolggahit+V+TV+Ind+Prs+Sg1
vuolggahan vuolggahit+V+TV+Actio+Acc
vuolggahan vuolggahit+V+TV+Actio+Gen
vuolggahan vuolggahit+V+TV+Actio+Nom
vuolggahansadji
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
TODO:
- investigate Actio compounding (Thomas)
- discuss findings with Maaren, and later all of us (Thomas)
Lule Sámi
There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. Our poor treatment of number words becomes more visible
- names => waiting for the new name lexicon
- compounds? Shortening here as well, but not in written language (some
- loanwords? We should consider importing the ^LOAN words from sme and
TODO:
- 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
- progressing fine
- add proper numeral analysis/treatment (Thomas)
- add loanwords (e.g. Latin -ere verbs) (Thomas)
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
- progressing, investigating options
- develop the needed XQueries and interface (Sjur, Tomi)
- progressing, done some, committed
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
9. Public tender
We have received answers from LS, and are now waiting for the PL answers. Their
TODO:
- write clarification letters to PL (Sjur with help from
- done
- make the final evaluation of the offers based on the answers to the clarification letters
10. Other
Summer vacation
Who | When |
---|---|
Børre | August |
Linda | ? |
Maaren | ? |
Saara | July |
Sjur | at least 2 weeks in July, but still open |
Thomas | 3.7 - 7.8 |
Trond | 3.7 - 14.8 (last two weeks off at summer school) |
Tomi | 8.7 - 16.7, 2 more weeks in July and/or August |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
After the corpus issues have been somewhat settled, we should do a bug
Gobby
Installed to most computers (only Saara missing), now we need to test it in
Trond has asked Lars Nygård and Tero Avellan to
TODO:
- install Gobby (Saara)
- test it (Tomi, Trond, Sjur, Børre, Lars?)
- if successful, document its use within our project (Børre)
SEE 2.5 extensions
TODO:
- give the jspwiki syntax colouring mode to all (Sjur)
- JSPWiki.mode checked in to cvs in gt/src/ - just double-click the mode!
- write a perl script to extract all TODO items and sort them according to
- other modes on the wish list:
- twol mode
- xfst mode
- lexc mode
- vislcg mode
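The TODO-extraction script mentioned in the list above could be sketched as follows. The memo asks for Perl; Python is used here purely for illustration. The sort criterion is an assumption: tasks are grouped by the name(s) given in trailing parentheses, as in "(Børre)" or "(Tomi, Trond)", and items without such a suffix are filed under a made-up "(unassigned)" key. All function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Trailing "(Name)" or "(Name, Name)" or "(Name or Name)" at the end of an item.
OWNER_RE = re.compile(r"\(([^()]*)\)\s*$")


def extract_todos(text: str) -> List[Tuple[str, str]]:
    """Return (owner, task) pairs for list items that follow a TODO line."""
    todos: List[Tuple[str, str]] = []
    in_todo = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("TODO"):
            in_todo = True
            continue
        if in_todo and line.startswith("- "):
            item = line[2:].strip()
            match = OWNER_RE.search(item)
            if match:
                owners = [name.strip()
                          for name in re.split(r",|\s+or\s+", match.group(1))
                          if name.strip()]
                task = item[:match.start()].strip()
            else:
                owners, task = ["(unassigned)"], item
            todos.extend((owner, task) for owner in owners)
        else:
            # Any other line ends the current TODO list.
            in_todo = False
    return todos


def todos_by_owner(text: str) -> Dict[str, List[str]]:
    """Group tasks per owner, with owners in alphabetical order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for owner, task in extract_todos(text):
        grouped[owner].append(task)
    return {owner: grouped[owner] for owner in sorted(grouped)}


if __name__ == "__main__":
    sample = "Gobby\nTODO:\n- install Gobby (Saara)\n- test it (Tomi, Trond)\n"
    for owner, tasks in todos_by_owner(sample).items():
        print(owner, tasks)
```

Status lines such as "- done" and parenthetical remarks that are not names would need extra filtering before this could replace the hand-made task lists in the appendix.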
11. Next meeting, closing
19.06.2006 09:30
Sjur is away on Friday June 16.
Closed at 11:21.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- send contract to Kurt Tore Andersen
- call Brita Kåven again
- contact Ája (Kåfjord)
- Send renaming scripts to R. Valkeapää
- call Bård Eriksen
- corpus conversion:
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
- complete Min Áigi metadata
- corpus access:
- meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- set up the unix group structure for corpus users (also Saara, Trond)
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- test Gobby (with others)
- document use of Gobby within our project if above test is ok
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- set up the unix group structure for corpus users (also Børre, Trond)
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- fix bugs!
Sjur
- public tender:
- collect the last clarifying answers, and make a final proposal to the board
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- change corpus-summary processing to generate smaller pages
- meeting 23.5. at 9.30: discuss and decide upon the exact access policy we want
- test Gobby (with others)
- fix bugs!
Thomas
- investigate Actio compounding, first sme, later smj
- discuss findings with Maaren, and later all of us
- add proper numeral analysis/treatment to smj
- add loanwords (e.g. Latin -ere verbs) to smj
- work on compounding
- lule sámi incoming words
- bug-fixing
- sme G3 issue
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner documentation, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- test Gobby (with others)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- set up the unix group structure for corpus users (also Børre, Saara)
- test Gobby (with others)
- fix bugs!