Meeting_2006-05-29
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Public tender
- 11. Other
- 12. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 29.05.2006
- Time: 09.30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:37.
Present: Børre, Saara, Sjur, Thomas, Trond, Tomi
Absent: Maaren
Main secretary: the whole concept dropped for now - working collaboratively
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Began cron job on http://galdu.org/samegillii
- Continue converting text from input format to our xml
- Done by convert2xml.pl, updated some .xsl files by hand
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not done
- Send out letters to the rest of the Iđut authors
- Contact with Kurt Tore Andersen, writer of schoolbooks. Was positive.
- call Brita Kåven again towards the end of the week
- Not done
- contact Ája (Kåfjord)
- Not done
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Not done
- Check the status & license of the corpus texts
- Done
- wait for R. Valkeapää, call him next week
- public tender:
- assist with letter to the project board
- Read through the letter, participated in discussions
- corpus access:
- zip an xml-stripped version of our free texts for Olavi
- Done!
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- Not done
- set up user account/corpus access for Olavi
- Not needed?
- set up the unix group structure for corpus users (also Saara, Trond)
- To be done after the above-mentioned meeting
- install latest SEE
- Done
- install Gobby for Thomas and Maaren
- Not done
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- these are not done
- Implement links to parallel files in corpus header.
- Implement parallel corpus upload in web upload script
- not done
- Check the status & license of the corpus texts and
- done
- rerun the corpus conversion
- done on a daily basis
- Install Gobby
- can I have the tarball discussed in the last meeting?
- update the corpus script(s) to only copy texts to the free/ dir which are
- done, some testing required (and cleaning the free-directory)
- Add some language recognition flags to write into the xsl file
- not yet ready
- rename corpus dirs, and create symlinks
- done, there is an unused directory gt which should be removed.
- set up the unix group structure for corpus users (also Børre, Trond)
- not done
- fix bugs!
Sjur
- public tender:
- write letter to the project board
- done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- not yet done
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- nothing this week
- change corpus-summary processing to generate smaller pages
- nothing yet - needs to be discussed
- send bug report to Apple re filename matching and accented characters in
- done
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- delayed
- fix bugs!
Thomas
- correct hyphenation of exceptions (sme)
- still working on this
- work on compounding and derivation
- planned: continued work at the workshop with Trond and Maaren
- sme G3 issue
- nothing done
- set up a linguistic workshop while Maaren is in Tromsø
- done
Tomi
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner documentation, install, provide feedback
- not done
- install and test Gobby
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- add ccat option to analyse text while keeping the xml tags and structure
- not done
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- Not done.
- install aligner, test it and give feedback
- Discussed issue with Lars, will continue with Saara
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done
- get/upgrade keys for Børre's room for Tomi and Thomas
- Rethink the doubletagging procedure for names, consider grammatically
- No news.
- Check the status & license of the corpus texts
- Done.
- Work on the graphical corpus tag list
- Made progress.
- send Saara smj files for language recognition
- Talked to Saara
- create a short smj word list to help the trigram heuristics
- Put Saara and Tero in contact with each other
- Not yet.
- ask Lars Nygård and Tero Avellan to install Gobby 0.3
- set up a linguistic workshop while Maaren is in Tromsø
- done
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- delayed
- set up the unix group structure for corpus users (also Børre, Saara)
- delayed until the principles are settled
- fix bugs!
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write
- Not done. We didn't have the meeting.
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
- Working on this. Summing up the most important writers from Davvi Girji.
New contracts:
- none in the last 2 weeks
Olavi Korhonen's Lule Sámi dictionary.
Phoned Korhonen. He was willing to sign the contracts, and wanted some
TODO:
- set up user account/corpus access for Olavi (Børre)
- we need the infrastructure for corpus access in place first
- done: sent a Word document containing all free texts (2400 pages), as well as
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- talked to Kurt Tore Andersen, who has written a lot of smj school books. He
Bible texts
We will get text from Finland, but still haven't received any. Swedish html has
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Davvi Girji
Called Brita Kåven last week. She said Davvi Girji os would give us permission
TODO:
- call Brita Kåven again towards the end of the week (Børre)
- call the authors (Børre)
Min Áigi
The Min Áigi format should be dealt with: \@ingress etc should be dealt with for
TODO:
- send bug report to Apple (typing filenames in Terminal does not match, moving
- not done yet
- will have to complete metainformation
Kåfjord
Promised to send us texts. Some texts have arrived, but nothing from Ája.
TODO:
- contact Ája (Børre)
- not done
Sámi Instituhtta
Børre contacted Richard Valkeapää, the IT consultant at NSI. He put it on
TODO:
- wait for R. Valkeapää, call him this week (Børre)
5. Corpus infrastructure
User accounts and access
We will probably have different kinds of users, some will only need access
TODO:
- before anything else is done: Zip an xml-stripped version of our free texts
- done
- discuss and decide upon the exact access policy we want to give corpus users;
- not done, new date: Thursday 1.6.2006, at 9.30
- set up the unix group structure to open for a new category of users:
- make a text-only corpus in the Oslo interface (dump the texts on omilia),
- Discussion on this issue after this meeting
Name change again?
TODO:
- rename corpus dirs, and create symlinks (Saara)
- done, remove the gt dir? Yes.
Free and non-free texts
More info in a previous meeting memo.
TODO:
- Check the status of the texts, again. (Børre, Trond, Saara)
- Complete (only new files need to be checked from now on)
- Rerun the conversion afterwards (Saara is the one with the magic spell)
- done automatically every night
- update the script(s) to only copy texts to the free/ dir which are explicitly
- done, but buggy
More texts to the graphical corpus interface:
We need to get the infrastructure complete to be able to do this, then it
TODO:
- We would like to have more than the NT in
- We add the largest texts first.
- We would like to have grammatical searchability, not only POS. (Saara,
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- The list of good candidates: The longest (admin) texts.
- We need a new option in ccat for analysing text while still keeping the
- xml texts number <p>, preprocess finds <.> <?> <!> and ccat numbers them as
- Then the aligner aligns...
Top-three priorities:
- Finish the tag unification (korpustags.txt) (Trond)
- change ccat to be able to create the right input for the corpus analysis
- add text to the server (Lars)
Language recognition
TODO:
- refine language recognition (Saara)
- in progress, continue discussion in
- create a short smj word list to help the trigram heuristics (Trond)
- send Saara smj files (Trond)
- Call Árran again (Trond, then hand it over to Sámediggi, that is, to
- Add some flag to write into the xsl file (Saara)
- unfinished
Some Lule Sami text is found on
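A rough sketch of how a short word list could support a trigram heuristic. This is not the recogniser actually used in the project: the function names, the scoring, the `word_bonus` weight and the toy data are all assumptions made for illustration. Each language gets a trigram profile built from sample text, and every word found in that language's word list adds a small bonus to its score.

```python
from collections import Counter
from typing import Dict, Optional, Set


def trigrams(text: str) -> Counter:
    """Count character trigrams, padding with spaces to mark word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def trigram_score(text: str, profile: Counter) -> float:
    """Share of the text's trigrams that also occur in the language profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    hits = sum(n for gram, n in grams.items() if gram in profile)
    return hits / total


def guess_language(text: str,
                   profiles: Dict[str, Counter],
                   word_lists: Dict[str, Set[str]],
                   word_bonus: float = 0.1) -> Optional[str]:
    """Pick the language with the best trigram score plus word-list bonus."""
    words = text.lower().split()
    best, best_score = None, -1.0
    for lang in sorted(profiles):  # sorted for deterministic tie handling
        score = trigram_score(text, profiles[lang])
        known = word_lists.get(lang, set())
        score += word_bonus * sum(1 for word in words if word in known)
        if score > best_score:
            best, best_score = lang, score
    return best


# Toy placeholder data, not real language samples.
profiles = {"lang_a": trigrams("abab abab"), "lang_b": trigrams("cdcd cdcd")}
word_lists = {"lang_b": {"zz"}}
by_trigrams = guess_language("abab", profiles, word_lists)
by_word_list = guess_language("zz", profiles, word_lists)
```

In the toy data the word list only matters when the trigram scores tie, which is the situation a short smj word list is meant to help with for closely related languages.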
Corpus summary
TODO:
- trim generated pages (now sme generates a table with 10 000 entries!)
- suggestion: lump together files with fewer than X paragraphs (X < 5?)
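A minimal sketch of the lumping suggestion, assuming the paragraph count per file is already available as a mapping. The function name `lump_small_files`, the file names and the default threshold are illustrative only and not part of the existing corpus-summary processing.

```python
from typing import Dict, List, Tuple


def lump_small_files(paragraph_counts: Dict[str, int],
                     min_paragraphs: int = 5
                     ) -> Tuple[List[Tuple[str, int]], Tuple[int, int]]:
    """Separate files that get their own table row from those that are lumped.

    Returns (rows, (number_of_lumped_files, paragraphs_in_lumped_files)).
    """
    rows: List[Tuple[str, int]] = []
    lumped_files = 0
    lumped_paragraphs = 0
    for name, count in sorted(paragraph_counts.items()):
        if count < min_paragraphs:
            # Small file: only contributes to the single summary row.
            lumped_files += 1
            lumped_paragraphs += count
        else:
            rows.append((name, count))
    return rows, (lumped_files, lumped_paragraphs)


# Hypothetical counts for four corpus files.
counts = {"a.xml": 120, "b.xml": 2, "c.xml": 4, "d.xml": 5}
rows, lumped = lump_small_files(counts, min_paragraphs=5)
```

With these counts, two files fall below the threshold and would share one summary row, so the generated table shrinks without losing the totals.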
Proofed vs unproofed corpus files
The Min Áigi material contains partially parallel unproofed vs proofed
For now there are so few such files that it hardly pays off, but Saara will
TODO:
- count the number of parallel files of type unproofed/proofread (Saara)
Aligner
Trond and Saara will continue this issue.
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
TODO:
- Put Saara and Tero in contact with each other (Trond)
- still open
- The paradigm generator should also have an xml-out option (for use in e.g.
Hyphenator
Thomas is finished with adding ^ tags to the smj noun file, and has
Trond and Thomas have been working on the smj rule component, and have
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done.
- Update the sme and sma rule sets with the insights gained from smj updates.
7. Linguistics
General issues
Rethink the doubletagging procedure for names, consider grammatically
Possible rule:
If Plc then Obj
North Sámi
TODO:
- set up a linguistic workshop while Maaren is in Tromsø (and remember
- done, more needed:
- Maaren is in Tromsø this week, we should continue our linguistic discussions.
Lule Sámi
There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. Our poor treatment of number words becomes more visible
- names => waiting for the new name lexicon
- compounds? Shortening here as well, but not in written language (some
- loanwords? We should consider importing the ^LOAN words from sme and
TODO:
- 50 unknown words left + 2 abbr. + moaddi etc. (numerals) need more checks
- add proper numeral analysis/treatment
- add loanwords (e.g. Latin -ere verbs)
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
- nothing done last week
- develop the needed XQueries and interface (Sjur, Tomi)
- progressing, done some, haven't committed
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, we'll wait a couple of days to see what's
- test and review when ready
Timeline:
- One-time conversion lexc2xml (sme to common)
- done (must be redone at D-day, but the *script* is done)
- editing functions in risten.no - what about editing in emacs/see/other editors
- in the works
- Automatic conversion: xml2lexc (modulo language) (based on ccat)
- not done
9. Spellers
We will remove this speller section till we have something to report.
10. Public tender
Finnut called, and here's their evaluation: if we think that the offers are
Sjur will send an e-mail to the project board, outlining the different
- accept the offers as is, or enter negotiations?
- if accept, which one?
TODO:
- write e-mail to the project board, ask for their opinion regarding the offers
- done, waiting for answers from the board
11. Other
Summer vacation
| Who | When |
|---|---|
| Børre | August |
| Linda | ? |
| Maaren | ? |
| Saara | July |
| Sjur | at least some in July, but still open |
| Thomas | 3.7 - 7.8 |
| Trond | 3.7 - 14.8 (last two weeks off at summer school) |
| Tomi | 8.7 - 16.7, more? |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
After the corpus issues have been somewhat settled, we should do a bug
Gobby
0.3 is working fine on Mac, Linux and Windows. Should be installed on all
- Børre - ok by copying /opt/local/ from Trond
- Maaren - Børre to do it
- Saara - todo
- Sjur - ok
- Thomas - Børre to do it
- Tomi - not working
- Trond - ok
Easy way out when the standard Darwin Ports installation isn't working:
Trond should ask Lars Nygård and Tero Avellan to
SEE 2.5 extensions
- syntax coloring of meeting memos
- script to extract tasks?
TODO:
- give the jspwiki syntax colouring mode to all (Sjur)
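The "script to extract tasks" idea could look roughly like the sketch below. It assumes the convention used in these memos, where a task line is a list item ending in the responsible people in parentheses. The function name `extract_tasks`, the regular expression and the optional `known_people` filter are hypothetical and not an existing SEE script.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Set

# A list item whose last element is "(Name)" or "(Name, Name, ...)".
TASK_RE = re.compile(r"^\s*-\s+(?P<task>.+?)\s*\((?P<who>[^()]+)\)\s*$")


def extract_tasks(memo: str,
                  known_people: Optional[Set[str]] = None
                  ) -> Dict[str, List[str]]:
    """Collect task lines per person from the text of a meeting memo."""
    tasks: Dict[str, List[str]] = defaultdict(list)
    for line in memo.splitlines():
        match = TASK_RE.match(line)
        if match is None:
            continue
        for person in match.group("who").split(","):
            person = person.strip()
            # Skip parentheses that are not names, e.g. "(sme)".
            if known_people is not None and person not in known_people:
                continue
            tasks[person].append(match.group("task"))
    return dict(tasks)


sample = """TODO:
- contact Ája (Børre)
- Check the status of the texts, again. (Børre, Trond, Saara)
- correct hyphenation of exceptions (sme)
- not done
"""
tasks = extract_tasks(sample, known_people={"Børre", "Trond", "Saara"})
```

Passing the participant list as `known_people` keeps language codes and other parenthesised remarks out of the result. Without it, every trailing parenthesis is treated as a list of names.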
12. Next meeting, closing
06.06.2006 09:30
Closed at 11:20
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again towards the end of the week
- contact Ája (Kåfjord)
- create weekly cron job to mirror Odin URL and detect new/updated pages
- wait for R. Valkeapää, call him this week
- public tender:
- corpus access:
- meeting 1.6 at 9.30: discuss and decide upon the exact access policy we want
- set up the unix group structure for corpus users (also Saara, Trond)
- install Gobby for Thomas and Maaren, send /opt/local/ tarball to
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpus of the New Testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header.
- Implement parallel corpus upload in web upload script
- Install Gobby
- clean the free/ dir and test copying of only free texts to it
- Add some language recognition flags to write into the xsl file
- set up the unix group structure for corpus users (also Børre, Trond)
- count and examine pairs of proofread and unproofed Min Áigi files
- evaluate addition of a text-only corpus in the Oslo interface (with Trond)
- fix bugs!
Sjur
- public tender:
- waiting for answers from the board, then act
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- change corpus-summary processing to generate smaller pages
- send bug report to Apple re filename matching and accented characters in
- meeting 1.6 at 9.30: discuss and decide upon the exact access policy we want
- fix bugs!
Thomas
- correct hyphenation of exceptions (sme)
- work on compounding and derivation
- sme G3 issue
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- install and test Gobby
- Set up the mechanism for the hash-mark transducer package
- add ccat option to analyse text while keeping the xml tags and structure
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- get/upgrade keys for Børre's room for Tomi and Thomas
- Rethink the doubletagging procedure for names, consider grammatically
- Work on the graphical corpus tag list
- send Saara smj files for language recognition
- create a short smj word list to help the trigram heuristics
- Call Árran again, and then transfer smj corpus discussions to Børre
- Put Saara and Tero in contact with each other
- ask Lars Nygård and Tero Avellan to install Gobby 0.3
- continue linguistic discussions while Maaren is in Tromsø
- meeting 1.6 at 9.30: discuss and decide upon the exact access policy we want
- set up the unix group structure for corpus users (also Børre, Saara)
- evaluate addition of a text-only corpus in the Oslo interface (with Saara)
- fix bugs!