Meeting_2006-06-06
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Public tender
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 06.06.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 37.
Present: Børre, Saara, Sjur, Thomas, Trond, Tomi
Absent: Maaren, Saara leaving 10.30.
Agenda accepted as is, some additions under Other.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Harald Gaski and Lill-Tove Fredriksen
- Harald Gaski and Lill-Tove Fredriksen
- Gather public texts, preferrably also parallel ones
- Done
- Done
- Continue converting text from input format to our xml
- Updating meta data in MinAigi, using Saaras script
- Updating meta data in MinAigi, using Saaras script
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- Not done
- review the paratext2xml converter
- Not done
- Not done
- convert smj NT to paratext
- Not done
- Not done
- Send out letters to the rest of the Iđut authors
- Not done
- Not done
- call Brita Kåven again towards the end of the week
- Not done
- Not done
- contact Ája (Kåfjord)
- Not done
- Not done
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Done
- Done
- wait for R. Valkeapää, call him this week
- He needed some renaming script, said I would send him some of ours.
- He needed some renaming script, said I would send him some of ours.
- send out contracts with accompanying letter
- public tender:
- corpus access:
- meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
- Not done
- Not done
- set upp the unix group structure for corpus users (also Saara, Trond)
- Not done
- Not done
- meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
- install Gobby for Thomas and Maaren, send /opt/local/ tarball to
- Done
- Done
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header.
- sent a note to the newsgroup.
- sent a note to the newsgroup.
- Implement parallel corpus upload in web upload script
- not done
- not done
- Install Gobby
- not done
- not done
- clean the free/ dir and test copying of only free texts to it
- done
- done
- Add some language recognition flags to write into the xsl file
- done
- done
- set up the unix group structure for corpus users (also Børre, Trond)
- count and examine pairs of proofread and nonproofed MinAigi files
- done
- done
- evaluate addition of a text-only corpus in the Oslo interface (with Trond)
- done
- done
- improve add-hyph-tags.pl to handle XML-structures better.
- done
- done
- Updated decode.pm to handle msword titles.
- almost ready
- almost ready
- convert nob and nno bible texts to be used as part of a parallel corpus
- We don't have parallel text markup in corpus.dtd.
- We don't have parallel text markup in corpus.dtd.
- fix bugs!
Sjur
- public tender:
- waiting for answers from the board, then act
- received answers from all, as well as a clarifying e-mail from Finnut; sent
- received answers from all, as well as a clarifying e-mail from Finnut; sent
- waiting for answers from the board, then act
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- progress - all look-up functions are now multi-collection aware
- progress - all look-up functions are now multi-collection aware
- implement editing functions
- change corpus-summary processing to generate smaller pages
- not done
- not done
- send bug report to Apple re filename matching and accented characters in
- done
- done
- meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
- give copies of the SEE JSPWiki mode to others
- checked in to CVS, it can be installed from there with a double-click
- checked in to CVS, it can be installed from there with a double-click
- fix bugs!
Thomas
- correct hyphenation of exceptions (sme)
- done
- done
- work on compounding and derivation
- nothing done this week
- nothing done this week
- sme G3 issue
- nothing done this week
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct:
- data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- install and test Gobby
- Set up the mechanism for the hash-mark transducer package
- add ccat option to analyse text while keeping the xml tags and structure
- fix bugs!
Trond
- better smj NT text
- Not done
- Not done
- get fin, swe, nob and nno NT and OT in paratext format
- Not done
- Not done
- install aligner, test it and give feedback
- Not done
- Not done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done
- Not done
- get/upgrade keys for Børre's room for Tomi and Thomas
- No response from key office, the key person has been away
- No response from key office, the key person has been away
- Rethink the doubletagging procedure for names, consider grammatically
- The rethinking is done, implementation is needed.
- The rethinking is done, implementation is needed.
- Work on the graphical corpus tag list
- Done some, but awaiting Tomi's script to get this issue going.
- Done some, but awaiting Tomi's script to get this issue going.
- send Saara smj files for language recognition
- No new smj files.
- No new smj files.
- create a short smj word list to help the trigram heuristics
- This was done earlier.
- This was done earlier.
- Call Árran again, and then transfer smj corpus discussions to Børre
- Worked on this issue
- Worked on this issue
- Put Saara and Tero in contact with each other
- Awaiting the dust to be settled after his new employment.
- Awaiting the dust to be settled after his new employment.
- ask Lars Nygård and Tero Avellan to install Gobby 0.3
- Issue open.
- Issue open.
- continue linguistic discussions while Maaren is in Tromsø
- Had one session.
- Had one session.
- meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
- Not done
- Not done
- set upp the unix group structure for corpus users (also Børre, Saara)
- Not done
- Not done
- evaluate addition of a text-only corpus in the Oslo interface (with Saara)
- We still go for the "everything first" strategy
- We still go for the "everything first" strategy
-
fix bugs!.
- Worked on bugs.
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write
- we will administer the corpus user accounts ourselves
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
- none last 2 weeks
Olavi Korhonen's Lule Sámi dictionary.
TODO:
- set up user account/corpus access for Olavi (Børre)
- we need the infrastructure for corpus access in place first
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- talked to Kurt Tore Andersen, will get a letter from us (Børre)
Bible texts
We will get text from Finland, but still haven't received any. Swedish html has
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Davvi Girji
Talked to Harald Gaski, he will send the contracts to the writers' organisation,
TODO:
- call Brita Kåven again towards the end of the week (Børre)
- call the authors (Børre)
- Harald Gaski
Min Áigi
TODO:
- send bug report to Apple (typing filenames in Terminal does not match, moving
- done
- done
- complete metainformation (Børre)
Kåfjord
Promised to send us texts. Some texts have arrived, but nothing from Ája.
TODO:
- contact Ája (Børre)
- Not done
Sámi Instituhtta
TODO:
- wait for R. Valkeapää, call him this week (Børre)
Čálliid Lágádus
Børre talked to them, they are positive, and will give us print-ready pdf's.
Árran
The negotiations are underway, they discuss it in a meeting today, and we have
TODO:
- call Bård Eriksen (Børre)
5. Corpus infrastructure
User accounts and access
TODO:
- discuss and decide upon the exact access policy we want to give corpus users;
- still not done, new date: Friday 9.6.2006, at 9.30
- still not done, new date: Friday 9.6.2006, at 9.30
- set upp the unix group structure to open for a new category of users:
- make a text-only corpus in the Oslo interface (dump the texts on omilia),
- Discussion on this issue after the previous meeting
- Conclusion: skip the text-only version, and go for the analyzed one right
- Conclusion: skip the text-only version, and go for the analyzed one right
- Discussion on this issue after the previous meeting
More texts to the graphical corpus interface:
We need to get the infrastructure complete to be able to do this, then it
Main obstacle for further progress: a tool to analyse text while keeping the xml
- parse xml
- send each <p> node to lookup etc
- analyse it
- add </s><s> after each <.>, <!>, <?> mark
- wrap up the analysis output in xml
- put it in a <p>node again
The first implementation in Perl (Saara), later possibly a C implementation
TODO:
- make an analyser that retains the xml structure (Saara)
- Finish the tag unification (korpustags.txt) (Trond)
- progressing, but open questions remain (awaiting text to see how things
- progressing, but open questions remain (awaiting text to see how things
- add text to the server (Lars)
Aligner
Trond and Saara will continue this issue.
We need markup of parallelism in the corpus DTD, at least an indication of which
Language recognition
TODO:
- create a short smj word list to help the trigram heuristics (Trond)
- done (but the smj.lm file suffer from the narrow corpus base. The smj.wm
- done (but the smj.lm file suffer from the narrow corpus base. The smj.wm
- Add some flag to write into the xsl file (Saara)
- done
Free and non-free texts
TODO:
- test the script(s) copying texts to the free/ dir which are explicitly
- done
Corpus summary
TODO:
- trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
- nothing done
- suggestion: lump together files with content less than X paragraphs (X < 5?)
Proofed vs unproofed corpus files
TODO:
- count the number of parallel files of type unproofed/proofread (Saara)
- done, 35 such file pairs
- delayed for the time being, move to Bugzilla (Sjur)
- done, 35 such file pairs
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
TODO:
- Put Saara and Tero in contact with each other (Trond)
- still open, awaiting both Tero and the more crucial corpus tasks
- still open, awaiting both Tero and the more crucial corpus tasks
- The paradigm generator should also have an xml-out option (for use in e.g.
- move to Bugzilla (Sjur)
Hyphenator
Thomas is finished with adding ^ tags to the sme noun file.
Trond and Thomas have been working on the smj rule component, and have
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done - postponed till next week (w 24)
- Still not done - postponed till next week (w 24)
- Update the sme and sma rule sets with the insights gained from smj updates.
- add exception marks to the sme lexicons (Thomas)
- done
Automatic Bugzilla reminder for untouched bugs
We need to get a summary by e-mail for all bugs not touched in more than 5(?)
TODO:
- set up Bugzilla to send automatic reminders for bugs not touched in a given
7. Linguistics
North Sámi
Topic: Actio+compound - how productive? How much does it destroy speller
Spelling error from typos.txt: 4 vuolggahansádji vuolggasadji vuolggahan vuolggahan vuolgga+N+Sg+Nom+Foc vuolggahan vuolggahit+V+TV+PrfPrc vuolggahan vuolggahit+V+TV+Ind+Prs+Sg1 vuolggahan vuolggahit+V+TV+Actio+Acc vuolggahan vuolggahit+V+TV+Actio+Gen vuolggahan vuolggahit+V+TV+Actio+Nom vuolggahansadji vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
TODO:
- Maaren is in Tromsø this week, we should continue our linguistic discussions
- not done, and Maaren is now back in Guovdageaidnu (?)
Lule Sámi
There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. Our poor treatment of number words becomes more visible
- names => waiting for the new name lexicon
- compounds? Shortening here as well, but not in written language (some
- loanwords? We should consider importing the ^LOAN words from sme and
TODO:
- 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
- add proper numeral analysis/treatment
- add loanwords (e.g. latin -ere verbs)
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interfarce
- progressing, still major work to be done
- progressing, still major work to be done
- develop the needed XQueries and interface (Sjur, Tomi)
- progressing, done some, commited
- progressing, done some, commited
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
- discussion started on eXist-list, nothing useful came up. We need to
Timeline:
- One-time conversion lexc2xml (sme to common)
- done (must be redone at D-day, but the *script* is done)
- done (must be redone at D-day, but the *script* is done)
- editing functions in risten.no - what about editing in emacs/see/other editors
- in the works
- in the works
- Automatic converson: xml2lexc (modulo language) (based on ccat)
- not done
9. Public tender
TODO:
- act, based on the answers (Sjur)
- done: we are asking for more clarification, and then pick the best
- done: we are asking for more clarification, and then pick the best
- write clarification letters to PL and LS (Sjur with help from
- LS - done
- PL - has started, not finished
- LS - done
10. Other
Summer vacation
Who | When |
---|---|
Børre | August |
Linda | ? |
Maaren | ? |
Saara | July |
Sjur | at least some in July, but still open |
Thomas | 3.7 - 7.8 |
Trond | 3.7 - 14.8 (last two weeks off at summer school) |
Tomi | 8.7 - 16.7, 2 more weeks in July and/or August |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug
279
After the corpus issues have been somewhat settled, we should do a bug
Forrest:Unicode & JSPWiki
Sjur has identified a way of making forrest run read correctly the jspwiki
forrest run -Dforrest.jvmargs="-Dfile.encoding=utf-8"
It is added to our Forrest documentation.
RAM on our G5
We got some defect RAM. Our business contact has received a whole party of
Gobby
0.3 is working fine on Mac, Linux and Windows. Should be installed on all
- Børre - ok
- Maaren - ok
- Saara - todo
- Sjur - ok
- Thomas - ok
- Tomi - ok
- Trond - ok
Trond has asked Lars Nygård and Tero Avellan to
SEE 2.5 extensions
TODO:
- give the jspwiki syntax colouring mode to all (Sjur)
- JSPWiki.mode checked in to cvs in gt/src/ - just double-click the mode!
- JSPWiki.mode checked in to cvs in gt/src/ - just double-click the mode!
- write a perl script to extract all TODO items and sort them according to
- other modes on the whish list:
- twol mode
- xfst mode
- lexc mode
- vislcg mode
- twol mode
11. Next meeting, closing
12.06.2006 09: 30
Closed at 11: 08
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the rest of the Iđut authors
- send contract to Kurt Tore Andersen
- call Brita Kåven again
- call Harald Gaski
- contact Ája (Kåfjord)
- Send renaming scripts to R. Valkeapää
- complete Min Áigi metadata
- call Bård Eriksen
- send out contracts with accompanying letter
- corpus access:
- meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
- set upp the unix group structure for corpus users (also Saara, Trond)
- meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpora of the new testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header
- Implement parallel corpus upload in web upload script
- Install Gobby
- set upp the unix group structure for corpus users (also Børre, Trond)
- Test the aligners once again
- make an analyser that retains the xml structure for Oslo
- discuss parallel text markup in newsgroup
- fix bugs!
Sjur
- public tender:
- write e-mail to PL and ask for more clarification
- write e-mail to PL and ask for more clarification
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- implement editing functions
- change corpus-summary processing to generate smaller pages
- meeting 23.5. t 9.30: discuss and decide upon the exact access policy we want
- move to Bugzilla:
- proofed/unproofed parallel text issue
- xml output of paradigm generator
- proofed/unproofed parallel text issue
- fix bugs!
Thomas
- hyphenation-rule-set
- work on compounding and derivation
- lule sámi incoming words
- sme G3 issue
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct:
- data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- add ccat option to analyse text while keeping the xml tags and structure
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Put Saara and Tero in contact with each other
- meeting 9.6. t 9.30: discuss and decide upon the exact access policy we want
- set upp the unix group structure for corpus users (also Børre, Saara)
- fix bugs!.