Meeting_2006-05-22
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Public tender
- 11. Other
- 12. Summary, task list
- 13. Next meeting, closing
Meeting setup
- Date: 22.05.2006
- Time: 09.30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:56.
Present: Børre, Sjur, Thomas, Trond, Tomi
Absent: Maaren, Saara
Main secretary: Thomas (with help from others)
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- corpus work:
- send out contracts with accompanying letter
- Sent to Elle Márjá Vars
- Gather public texts, preferably also parallel ones
- Gathered, some added to the corpus
- Continue converting text from input format to our xml
- Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Not done
- Send out letters to the rest of the Iđut authors
- call Brita Kåven
- Done
- contact Kåfjord
- Will have to contact Ája
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Not done
- Check the status & license of the corpus texts
- Done some
- contact Korhonen & Kuhmunen
- Done
- public tender:
- continue public tender offer evaluation
- meeting with Sjur, Tomi and Trond on Friday afternoon
- meeting on Thursday 18.5. at 10.00 AM with Sjur, Tomi, Trond
- see above
- telephone meeting with Finnut
- Not done
- install latest SEE
- Not done
- install Gobby using Darwin Ports (also for Thomas and Maaren)
- Done for me
- move to Bugzilla:
- write docu for how to apply for a corpus user account
- Not done, but talked to Roy Dragseth about this (more below).
- fix bugs!
Maaren
- On sick leave
Saara
- been on sick leave
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header.
- not done
- Implement parallel corpus upload in web upload script
- not done
- Check the status & license of the corpus texts and
- not done
- Rerun corpus conversion
- done on daily basis
- Install Gobby
- got problems during installation
- make xsl conversion routine for the typographic tags
- done
- update the corpus script(s) to only copy texts to the free/ dir which are
- not done
- Add some language recognition flags to write into the xsl file
- not done
- rename corpus dirs, and create symlinks
- not done
- fix bugs!
Sjur
- public tender:
- read & evaluate received offers
- done
- meeting on Thursday 18.5. at 10.00 AM with Børre, Tomi, Trond
- we had a meeting later (I had technical problems with my Mac at the time)
- telephone meeting Friday with Finnut
- short call, will have the meeting on Monday 22 instead
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- nothing
- name lexicon:
- implement editing functions
- updated my local cvs sandbox to use the new server (victorio), and checked
- finalise refactoring for multiple collections (regular search interface)
- done some, but not finished
- update corpus-summary processing to adhere to the new structure
- done, but the generated page/file size for sme is way too large due to the
- send bug report to Apple re filename matching and accented characters in
- not done
- fix bugs!
Thomas
- correct hyphenation of exceptions
- finished Lule Sámi, now working with North Sámi
- correct hyphenation of smj -st-
- done
- work on compounding and derivation
- worked a little
- smj G3 issue
- finished
- sme G3 issue
- not done
- set up a linguistic workshop while Maaren is in Tromsø
- not done
Tomi
- move to Bugzilla:
- aspell UTF-8 suffix bug
- done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- done
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- not done
- XQuery refactoring and code development for our proper noun editor
- some done
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- not done
- read aligner docu, install, provide feedback
- not done
- install and test Gobby
- done, but not successfully
- Set up the mechanism for the hash-mark transducer package
- not done
- meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Trond
- done
- add ccat option to analyse text while keeping the xml tags and structure
- not done
- fix bugs!
Trond
- better smj NT text
- Not done
- get fin, swe, nob and nno NT and OT in paratext format
- Not done
- install aligner, test it and give feedback
- Not done
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done
- get/upgrade keys for Børre's room for Tomi and Thomas
- Ordered upgrade.
- Rethink the doubletagging procedure for names, consider grammatically
- Done some thinking
- Check the status & license of the corpus texts
- Not done
- Work on the graphical corpus tag list
- Done some.
- send Saara smj files for language recognition
- Asked for text, discussing with Árran
- Put Saara and Tero in contact with each other
- Not done
- meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Tomi
- Done
- ask Lars Nygård, Per Langgård and Tero Avellan to install Gobby 0.3
- Asked Per, not the others.
- set up a linguistic workshop while Maaren is in Tromsø
- Not done
- fix bugs!
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
- none last week
Olavi Korhonen's Lule Sámi dictionary.
Phoned Korhonen. He was willing to sign the contracts, and wanted some
TODO:
- set up user account/corpus access for Olavi (Børre)
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
Bible texts
We will get text from Finland, but still haven't received any. Swedish html has
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Davvi Girji
Called Brita Kåven last week. She said Davvi Girji os would give us permission
TODO:
- call Brita Kåven again towards the end of the week (Børre)
Min Áigi
The Min Áigi format should be dealt with: @ingress etc should be dealt with for
TODO:
- send bug report to Apple (typing filenames in Terminal does not match, moving
- not done yet
- make xsl conversion routine for the typographic tags (Saara)
- Done, some adjustment needed.
Kåfjord
Promised to send us texts. Some texts have arrived, but nothing from Ája.
TODO:
- contact Ája (Børre)
Sámi Instituhtta
Børre contacted Richard Valkeapää, the IT consultant at NSI. He put it on
TODO:
- wait for R. Valkeapää, call him next week (Børre)
5. Corpus infrastructure
User accounts and access
Talked to Roy Dragseth last week about this. Turns out that
This is from our contracts:
Contract 1
3.7 The recipient may grant a personal right of use of the text collection to persons who have signed the right-of-use contract in Appendix 2. The recipient shall not grant right of use of the text collection to persons whom there is reason to believe will breach the terms of the contract. The recipient undertakes to inform the provider immediately upon learning of a possible breach of these terms.
(Appendix 2 = Contract 3)
Contract 3
4.1 The user only has the right to use the texts for research, or for such commercial language technology or other similar purposes as do not conflict with the Norwegian Copyright Act (Lov om opphavsrett til åndsverk). The user may use the texts to make use of the linguistic features (e.g. statistical information, grammatical rules and semantic descriptions) that he/she has found through research, and may extract shorter quotations from the texts.
5.3 The user may not take larger portions of running text in the text collection than short quotations away from the server on which the text collection is installed. Temporary copies may be stored on the server itself, on the condition that the user takes data security into account. This restriction does not apply to public documents (e.g. NOU reports, reports to the Storting, etc.). Each document states clearly which license applies to it.
TODO:
- before anything else is done: Zip an xml-stripped version of our free texts
- discuss and decide upon the exact access policy we want to give corpus users;
- set up the unix group structure to open for a new category of users:
- make a text-only corpus in the Oslo interface (dump the texts on omilia),
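A minimal sketch of the first TODO item above (zipping an xml-stripped version of the free texts), assuming the converted texts are well-formed XML files stored below a single directory. The function names, the directory layout and the archive name are assumptions made for illustration, not the project's actual scripts.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path


def xml_to_text(xml_file: Path) -> str:
    """Return the text content of one XML file with all markup removed."""
    root = ET.parse(str(xml_file)).getroot()
    # itertext() yields every text node in document order.
    chunks = (chunk.strip() for chunk in root.itertext())
    return "\n".join(chunk for chunk in chunks if chunk)


def zip_stripped_texts(free_dir: Path, archive: Path) -> list:
    """Store a plain-text copy of every .xml file below free_dir in a zip archive.

    Returns the archive member names that were written.
    """
    written = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for xml_file in sorted(free_dir.rglob("*.xml")):
            # Keep the directory structure, but change the suffix to .txt.
            name = xml_file.relative_to(free_dir).with_suffix(".txt").as_posix()
            zf.writestr(name, xml_to_text(xml_file))
            written.append(name)
    return written
```

Calling `zip_stripped_texts(Path("free"), Path("free-texts.zip"))` would produce one archive that can be handed over without giving the recipient access to the server. Only the texts already cleared as free should be placed in the input directory.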
Name change again?
TODO:
- rename corpus dirs, and create symlinks (Saara)
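The rename-and-symlink step could look roughly like the sketch below. The function name and the directory names passed to it are hypothetical; the memo does not say which names will be used. The idea is only that the old name stays available as a symlink, so existing scripts keep working during the transition.

```python
import os
from pathlib import Path


def rename_with_symlink(corpus_root: Path, old_name: str, new_name: str) -> Path:
    """Rename corpus_root/old_name to corpus_root/new_name and leave a
    symlink under the old name that points to the new directory."""
    old_path = corpus_root / old_name
    new_path = corpus_root / new_name
    if new_path.exists():
        raise FileExistsError(f"{new_path} already exists")
    old_path.rename(new_path)
    # A relative link target stays valid if corpus_root itself is moved.
    os.symlink(new_name, old_path, target_is_directory=True)
    return new_path
```

Once all scripts refer to the new names, the symlinks can be removed again.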
Free and non-free texts
More info in a previous meeting memo.
TODO:
- Check the status of the texts, again. (Børre, Trond, Saara)
- some is done, not complete yet
- Rerun the conversion afterwards (Saara is the one with the magic spell)
- in progress
- update the script(s) to only copy texts to the free/ dir which are explicitly
- not done
- update the processing of the corpus summary files (Sjur)
- done
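A sketch of the copy rule for the free/ dir, under loudly stated assumptions: it assumes that "explicitly" means an explicit license marker in the corpus header, and it uses a hypothetical `<header><license type="free"/></header>` element for that. The real header format is not described in this memo, so the element name, the attribute and the function names are illustrative only. Files without the marker, and files that cannot be parsed, are never copied.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def is_explicitly_free(xml_file: Path) -> bool:
    """True only if the header carries the (hypothetical) free-license marker."""
    try:
        root = ET.parse(str(xml_file)).getroot()
    except ET.ParseError:
        return False  # unreadable files are treated as non-free
    license_el = root.find("./header/license")  # assumed element, see lead-in
    return license_el is not None and license_el.get("type") == "free"


def copy_free_texts(converted_dir: Path, free_dir: Path) -> list:
    """Copy explicitly free files to free_dir, keeping the directory structure.

    Returns the relative paths of the files that were copied.
    """
    copied = []
    for xml_file in sorted(converted_dir.rglob("*.xml")):
        if not is_explicitly_free(xml_file):
            continue
        target = free_dir / xml_file.relative_to(converted_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(xml_file, target)
        copied.append(target.relative_to(free_dir).as_posix())
    return copied
```

The default is deliberately "not free": a text with a missing or unknown license stays out of free/ until someone has checked its status.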
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in
- We add the largest texts first.
- We would like to have grammatical searchability, not only POS. (Saara,
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- The list of good candidates: The longest (admin) texts.
- We need a new option in ccat for analysing text while still keeping the
- xml texts number <p>, preprocess finds <.> <?> <!> and ccat numbers them as
- Then the aligner aligns...
Top-three priorities:
- discuss more with Lars on tag unification, and unify them (Trond)
- change ccat to be able to create the right input for the corpus analysis
- add text to the server (Lars)
Language recognition
TODO:
- refine language recognition (Saara)
- in progress, continue discussion in
- create a short smj word list to help the trigram heuristics (Trond)
- send Saara smj files (Trond)
- Trond has tried to get files, see task report above
- Add some flag to write into the xsl file (Saara):
- method: do not run lg recognition
- method: Choose between these 2: nob, sme, etc.
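To make the items above concrete, here is a small sketch of trigram-based language guessing with two of the refinements discussed: a short word list that supports the trigram heuristics, and an option that restricts the choice to a given set of languages (e.g. nob and sme). This is not the recogniser actually used in the corpus conversion; the function names, the scoring and the way the word list is weighted are all assumptions.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count character trigrams, padding every word with one space on each side."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def guess_language(text, profiles, wordlists=None, candidates=None):
    """Return the best-scoring language code.

    profiles:   language code -> Counter of trigrams from training text
    wordlists:  optional, language code -> set of characteristic words
    candidates: optional restriction of the languages to choose between
    """
    wordlists = wordlists or {}
    langs = list(candidates) if candidates else sorted(profiles)
    words = text.lower().split()
    sample = trigrams(text)
    scores = {}
    for lang in langs:
        profile = profiles[lang]
        total = sum(profile.values()) or 1
        # Trigram evidence: relative frequency in the profile, weighted by
        # how often the trigram occurs in the sample.
        score = sum(n * profile[tri] / total for tri, n in sample.items())
        # Word-list evidence: one point for every word found in the list.
        score += sum(1 for word in words if word in wordlists.get(lang, ()))
        scores[lang] = score
    # On a tie, the first language in langs wins.
    return max(langs, key=lambda lang: scores[lang])
```

In this sketch a word-list hit counts for much more than a single trigram, so a short list of very characteristic words can tip the balance between closely related languages such as sme and smj. Whether that weighting is suitable would have to be tested on real texts.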
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
TODO:
- Put Saara and Tero in contact with each other (Trond)
- still open
- The paradigm generator should also have an xml-out option (for use in e.g.
Aligner
Trond and Saara will continue this issue.
Hyphenator
Thomas is finished with adding ^ tags to the smj noun file, and has
Trond and Tomi have been working on the smj rule component, and have
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done.
- Update the sme and sma rule sets with the insights gained from smj updates.
7. Linguistics
General - hyphenation
See discussion, open questions and decision in a previous meeting memo.
TODO:
- Set up the mechanism for the hash transducer package - fst gymnastics, see
- add exceptions marks to the smj lexicon (boks^távva)
- done
North Sámi
TODO:
- set up a linguistic workshop while Maaren is in Tromsø (and remember
Lule Sámi
There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. Our poor treatment of number words becomes more visible
- names => waiting for the new name lexicon
- compounds? Shortening here as well, but not in written language (some
- loanwords? We should consider importing the ^LOAN words from sme and
TODO:
- 50 unknown words left + 2 abbreviations; moaddi etc. (numerals) need more checks
- add proper numeral analysis/treatment
- add loanwords (e.g. latin -ere verbs)
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
- improving, not finished
- develop the needed XQueries and interface (Sjur, Tomi)
- progressing, done some, haven't committed
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, we'll wait a couple of days to see what's
- test and review when ready
- Rethink the doubletagging procedure for names, consider grammatically
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
- aspell
- hunspell
10. Public tender
Finnut called, and here's their evaluation: if we think that the offers are
Sjur will send an e-mail to the project board, outlining the different
- accept the offers as is, or enter negotiations?
- if accept, which one?
TODO:
- meeting on Thursday 18.5. at 10.00 AM (Børre, Sjur, Tomi, Trond)
- done (on Friday)
- telephone meeting with Finnut (Børre, Sjur)
- short call on Friday, he called today (during the meeting, see above)
- write e-mail to the project board, ask for their opinion regarding the offers
11. Other
Summer vacation
Who | When |
---|---|
Børre | ? |
Linda | ? |
Maaren | ? |
Saara | July |
Sjur | ? |
Thomas | 3.7 - 7.8 |
Trond | July |
Tomi | 8.7 - 16.7, more? |
Bug fixing
45 open Divvun/Disamb bugs, and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279
After the corpus issues have been somewhat settled, we should do a bug
Gobby
0.3 is working fine on Mac, Linux and Windows. Should be installed on all
- Børre - ok by copying /opt/local/ from Trond
- Maaren - Børre to do it
- Saara - todo
- Sjur - ok
- Thomas - Børre to do it
- Tomi - todo
- Trond - ok
Easy way out when the standard Darwin Ports installation isn't working:
Trond should ask Lars Nygård and Tero Avellan to
SEE autosave AppleScript
Copy the following into a ScriptEditor window:
    tell application "SubEthaEdit"
        repeat until false is true
            save documents
            delay 60
        end repeat
    end tell
and click "run". All your SubEthaEdit documents will be automatically saved
12. Summary, task list
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again towards the end of the week
- contact Ája (Kåfjord)
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Check the status & license of the corpus texts
- wait for R. Valkeapää, call him next week
- public tender:
- assist with letter to the project board
- corpus access:
- zip an xml-stripped version of our free texts for Olavi
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- set up user account/corpus access for Olavi
- set up the unix group structure for corpus users (also Saara, Trond)
- install latest SEE
- install Gobby for Thomas and Maaren
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header.
- Implement parallel corpus upload in web upload script
- Check the status & license of the corpus texts and
- rerun the corpus conversion
- Install Gobby
- update the corpus script(s) to only copy texts to the free/ dir which are
- Add some language recognition flags to write into the xsl file
- rename corpus dirs, and create symlinks
- set up the unix group structure for corpus users (also Børre, Trond)
- fix bugs!
Sjur
- public tender:
- write letter to the project board
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- change corpus-summary processing to generate smaller pages
- send bug report to Apple re filename matching and accented characters in
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- fix bugs!
Thomas
- correct hyphenation of exceptions (sme)
- work on compounding and derivation
- sme G3 issue
- set up a linguistic workshop while Maaren is in Tromsø
Tomi
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
- read aligner docu, install, provide feedback
- install and test Gobby
- Set up the mechanism for the hash-mark transducer package
- add ccat option to analyse text while keeping the xml tags and structure
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- get/upgrade keys for Børre's room for Tomi and Thomas
- Rethink the doubletagging procedure for names, consider grammatically
- Check the status & license of the corpus texts
- Work on the graphical corpus tag list
- send Saara smj files for language recognition
- create a short smj word list to help the trigram heuristics
- Put Saara and Tero in contact with each other
- ask Lars Nygård and Tero Avellan to install Gobby 0.3
- set up a linguistic workshop while Maaren is in Tromsø
- meeting 23.5 at 9.30: discuss and decide upon the exact access policy we want
- set up the unix group structure for corpus users (also Børre, Saara)
- fix bugs!
13. Next meeting, closing
29.05.2006 09:30
Closed at 11:35