Meeting_2006-05-15
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Reviewing the task list from the last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Public tender
- 11. Other
- 12. Summary, task list
- 13. Next meeting, closing
Meeting setup
- Date: 15.05.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 54.
Present: Børre, Saara, Sjur, Thomas, Trond, Tomi
Absent: Maaren
Main secretary: Tomi/Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Some done
- Some done
- Gather public texts, preferrably also parallel ones
- Not done
- Not done
- Continue converting text from input format to our xml
- Done
- Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- Not done
- Not done
- review the paratext2xml converter
- Not done
- Not done
- convert smj NT to paratext
- Not done
- Not done
- Send out letters to the Iđut authors
- Some done
- Some done
- write docu for how to apply for a corpus user account (forms, recipients,
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- mirror Odin URL (create cron task to do it automatically?)
- Not done
- Not done
- read & evaluate received offers
- Done
- Done
- telephone meeting Wednesday with Finnut
- What happened with this? Sjur had a short meeting alone, as we are not ready
- What happened with this? Sjur had a short meeting alone, as we are not ready
- Check the status & license of the corpus texts
- Not done
- Not done
- fix bugs!
Maaren
- Working on missing list
- Done.
Saara
- Create a parallel corpora of the new testaments.
- grammatical searchability in the graphical corpus interface
- move to bugzilla:
- set up a weekly cron script to make free texts available
- add a possibility to upload whole documents for hyphenation (and also
- add a log of every word/text uploaded/hyphenated/analyzed etc.
- done
- done
- set up a weekly cron script to make free texts available
- Implement links to parallel files in corpus header.
- not done
- not done
- Implement parallel corpus upload in web upload script
- not done
- not done
- Implement turning off the language recognition in the xsl-file (and
- not done
- not done
- Refine language recognition
- not done
- not done
- Investigate the decomposed Unicode characters in file names -problem.
- solved
- solved
- Correct decomposed Unicode in Min Áigi file names
- done
- done
- Check the status & license of the corpus texts
- not finished yet, some discussion: bug 284
- not finished yet, some discussion: bug 284
- Rerun corpus conversion
- done extensively
- done extensively
- fix bugs!
Sjur
- read & evaluate received offers
- done, we need more info from the tenderers - questions sent, with a deadline
- done, we need more info from the tenderers - questions sent, with a deadline
- telephone meeting Wednesday with Finnut
- had a short meeting (without Børre), clarifying what's left regarding
- had a short meeting (without Børre), clarifying what's left regarding
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- nothing
- nothing
- name lexicon:
- implement editing functions
- started
- started
- write down the most common editing scenarios, to guide development
- done, quite helpful
- implement editing functions
Thomas
- correct hyphenation of exceptions
- not done
- not done
- correct hyphenation of smj -st-
- not done
- not done
- work on compounding and derivation
- worked
- worked
- smj G3 issue
- not done
- not done
- sme G3 issue
- not done
Tomi
- move to Bugzilla:
- aspell UTF-8 suffix bug
- not done
- not done
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- not done
- not done
- aspell UTF-8 suffix bug
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- not done
- not done
- XQuery refactoring and code development for our proper noun editor
- not done
- not done
- new version of xml2lexc (based on ccat), should handle complex names correct:
- not done
- not done
- write down the most common editing scenarios (to be used as guides for making
- done
- done
- implement data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- not done
- not done
- install and test Gobby, install new version of SEE
- not done
- not done
- Set up the mechanism for the hash-mark transducer package
- not done
- not done
- read & evaluate received offers
- done
- done
- fix bugs!
Trond
- better smj NT text, get fin NT texts
- Not done.
- Not done.
- get a key for Maaren in May
- Well, done, sort of.
- Well, done, sort of.
- install aligner, test it and give feedback
- Not done.
- Not done.
- correct hyphenation of word boundaries and exceptions with Thomas
- Not done (but Thomas has worked on this).
- Not done (but Thomas has worked on this).
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- Not done.
- Not done.
- get/upgrade keys for Børre's room for Tomi and Thomas
- Not done.
- Not done.
-
fix bugs!.
- Done some, but suggesting a collective effort wrt. the bugs.
- Done some, but suggesting a collective effort wrt. the bugs.
- Rethink the doubletagging procedure for names, consider grammatically
- Not done.
- Not done.
- write down the most common editing scenarios (to be used as guides for making
- Done, wasn't it?
- Done, wasn't it?
- read & evaluate received offers
- Work in progress.
- Work in progress.
- Check the status & license of the corpus texts
- Must check whether all is done.
3. Documentation
TODO:
- documentation on how to apply for a user account for the corpus repo
- The item will be moved to Bugzilla
4. Corpus gathering
Collecting
See a previous meeting memo for what's to be done.
TODO: Send out the rest of the letters (Børre)
New contacts:
- Elle Márjá Vars (sme, 38 titles in Bibsys, novels, plays, both iđut and Davvi
Odin
-
Børre will set up a weekly cron job to mirror the URL mentioned in Sæths
Olavi Korhonen's Lule Sámi dictionary.
- No news this week
TODO: Børre to contact Olavi Korhonen and Kuhmunen
KIO Grafisk and the Iđut books
TODO:
- send letters to the authors (Børre)
- one contacted, the rest this week
Bible texts
We will get text from Finland, but still haven't received any. We have got the
Swedish html has arrived, no paratext. Norsk bibelselskap has not sent
TODO:
- convert smj NT to paratext (Børre)
- get fin, swe, nob and nno NT and OT in paratext format. (Trond)
Davvi Girji
A talk with Brita Kåven, revealed that they would have a look at
Min Áigi
Agreement: they will send us updates each month. Standard license.
We have problems with Unicode characters in filenames. This was solved once
The Min Áigi format should be dealt with: \@ingress etc should be dealt with for
Done so far: Trond and Saara? Trond made a list of tags, and Saara will make
Saara is implementing the tag format.
TODO:
- reopen Bugzilla issue, and study the previous discussion and solution
- done
- done
- add filename extensions to files not having one
- done
- done
- Space in file names should be changed to underscore (and not to hyphen!).
- done
- done
- send bug report to Apple (typing filenames in Terminal does not match, moving
- make xsl conversion routine for the typographic tags (Saara)
Kåfjord
Promised to send us texts. Some texts have arrived, but nothing from Ája.
TODO: Børre will contact them.
Sámi Instituhtta
Børre contacted Richard Valkeapää, the IT-consult at NSI. He put it on his
5. Corpus infrastructure
https: //giellalt.uit.no/lang/corp/corpus-summary.html
TODO:
Changes and updates because of the Divvun public tender
User account admin and infra: see previous memo.
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see previous memo.
TODO:
- make free texts available
- done weekly by a cron script (but only if there are new files) (Saara)
- move to bugzilla: done
- move to bugzilla: done
- done weekly by a cron script (but only if there are new files) (Saara)
- Make a link, easily available, to these texts.
- done as a downloadable tar package.
Name change again? Trond has come up with some new suggestion:
gt -> gtbound/ gtbound -> bound/ gtfree -> free/
NB! keep symlinks from the old names to the new names for now.
Free and non-free texts
More info in a previous meeting memo.
TODO:
- Check the status of the texts, again. (Børre, Trond, Saara)
- Rerun the conversion afterwards (Saara is the one with the magic spell)
- check against bugs
- these are now fixed.
- these are now fixed.
- update the script(s) to only copy texts to the free/ dir which are explicitly
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in
- We add the largest texts first.
- We add the largest texts first.
- We would like to have grammatical searchability, not only POS. (Saara,
- For Lule Sámi: We would like to have a parallel corpus interface with NT
- Better Lule NT text still not made.
- Better Lule NT text still not made.
- The list of good candicates: The longest (admin) texts.
- We need a new option in ccat for analysing text while still keeping the
- xml texts number <p>, preprocess finds <.> <?> <!> and ccat numbers them as
- Then the aligner aligns...
- We need a new option in ccat for analysing text while still keeping the
Top-two priorities:
- Trond to discuss more with Lars on tag unification, and unify them.
- Tomi to change ccat to be able to create the right input for the corpus
- Lars to add text to the server.
Language recognition
TODO:
- refine language recognition (Saara)
- in progress, continue discussion in
- in progress, continue discussion in
- create a short word list to help the trigram heuristics
- Trond has made such lists for all lgs except smj.
- Trond has made such lists for all lgs except smj.
- send Saara smj files (Trond)
- Add some flag to write into the xsl file (Saara):
- method: do not run lg recognition
- method: Choose between these 2: nob, sme, etc.
- method: do not run lg recognition
6. Infrastructure
Paradigm generation
Greenland's language secretariat has a
Try e.g. illu "house", aput "snow" (or any word in st/kal/src/noun-kal-lex.txt)
TODO:
- Put Saara and Tero in contact with each other (Trond)
- The paradigm generator should also have an xml-out option (for use in e.g.
Aligner
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
-
Saara to install the aligner, everyone to read the documentation
- done, waiting for the test files from Bergen.
- done, waiting for the test files from Bergen.
- Trond and Saara will continue this issue.
Hyphenator
Trond and Thomas have been updating the propernoun file with ^ tags. We
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
- Still not done.
- Still not done.
- add a possibility to upload whole documents for hyphenation (and also
- moved to Bugzilla
- moved to Bugzilla
- we should log all and every word/text uploaded/hyphenated/analyzed etc
- we'll do it, but it does not have first priority (Saara)
- moved to Bugzilla
- we'll do it, but it does not have first priority (Saara)
7. Linguistics
General - hyphenation
See discussion, open questions and decission in the previous meeting memo.
We did a sme overkill: V^CV, when the rulebased hyphenator gave the right
TODO:
- Set up the mechanism for the hash transducer package - fst gymnastics, see
- add exceptions marks to the smj lexicon (boks^távva)
- Status? We discussed with Kåre Tjihkkom, and the principles are clear,
- Status? We discussed with Kåre Tjihkkom, and the principles are clear,
North Sámi
There are some heavy bugs (11 sme bugs all in all):
- he(a)ddjiid is one
- compounding cleanup - no shortening when normative, still shortening when
- vowel-shortening when compounding (we need the input from Pekka!)
We should have some linguistic workshops while Maaren is here.
TODO:
- set up a linguistic workshop while Maaren is in Tromsø (and remember
Lule Sámi
TODO:
- 50 unknown words left+2 abbr. +moaddi etc (numerals) need more checks
- There are some open issues in the marginal area of the smj transducer:
- numerals, e.g. These become more visible as we get real texts.
- names
- compounds? Shortening here as well, but not in written language (some
- loanwords?
- numerals, e.g. These become more visible as we get real texts.
8. Name lexicon infrastructure
TODO:
- refactor and prepare risten.no for multiple collections:
- refactor the code into more and more specific components according to our
- done for editors, some open issues for regular searches
- done for editors, some open issues for regular searches
- refactor the code into more and more specific components according to our
- write down the most common editing scenarios (to be used as guides for making
- Done.
- Done.
- develop the needed XQueries and interface (Sjur, Tomi)
- developing
- developing
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing this week
- nothing this week
- test and review when ready
- Rethink the doubletagging procedure for names, consider grammatically
9. Spellers
Nothing until the new proper noun lexicon is in place. We don't have enough
- aspell
- hunspell
10. Public tender
TODO:
- read the offers (Børre, Sjur, Trond, Tomi)
- done
- done
- meet on Tuesday 13 to sum up the findings (Børre, Sjur, Tomi, Trond)
- done, need more info - questions sent
- done, need more info - questions sent
- telephone meeting next Wednesday with Finnut (Børre, Sjur)
- postponed till we have more info.
- postponed till we have more info.
- new meeting on Thursday 18.5. at 10.00 AM (Børre, Sjur, Tomi Trond)
11. Other
Summer vacation
Who | When |
---|---|
Børre | ? |
Linda | ? |
Maaren | ? |
Saara | July |
Sjur | ? |
Thomas | 3.7 - 7.8 |
Trond | July |
Tomi | 8.7 - 16.7, more? |
Bug fixing
45 open Divvun/Disamb bugs, and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with bug 279 (Perl locale). Not much help...
After the corpus issues have been somewhat settled, we should do a bug
Move to victorio
Trond's victorio problem:
sme: FirstTag...1, ProperNoun...10000... *** Warning: Ignoring info strings. 20000...30000...flex scanner jammed make: *** [sme/bin/sme.save] Error 2 smj: Reading from 'smj/src/adv-smj-lex.txt' adv...1, Adverb...1064 Reading from 'smj/src/noun-smj-lex.txt' NounRoot...flex scanner jammed make: *** [smj/bin/smj.save] Error 2
sma, kal compile without jamming.cd
Saara compiles sme without problems, has problems with smj. Conclusion: it is a
Gobby
0.3 is working fine on Mac, Linux and Windows. Should be installed on all
- Børre - todo
- Maaren - Børre to do it
- Saara - todo
- Sjur - ok
- Thomas - Børre to do it
- Tomi - todo
- Trond - ok
Trond should ask Lars Nygård, Per Langgård and Tero Avellan to install
12. Summary, task list
Børre
- corpus work:
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Send out letters to the rest of the Iđut authors
- call Brita Kåven
- contact Kåfjord
- create weekly cron job to mirror Odin URL and detect new/updated pages
- Check the status & license of the corpus texts
- contact Korhonen & Kuhmunen
- send out contracts with accompanying letter
- public tender:
- continue public tender offer evaluation
- meeting on Thursday 18.5. at 10.00 AM with Sjur, Tomi, Trond
- telephone meeting with Finnut
- continue public tender offer evaluation
- install latest SEE
- install Gobby using Darwin Ports (also for Thomas and Maaren)
- move to Bugzilla:
- write docu for how to apply for a corpus user account
- write docu for how to apply for a corpus user account
- fix bugs!
Maaren
- On sick leave
Saara
- Create a parallel corpora of the new testaments.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement links to parallel files in corpus header.
- Implement parallel corpus upload in web upload script
- Check the status & license of the corpus texts and
- Rerun corpus conversion
- Install Gobby
- make xsl conversion routine for the typographic tags
- update the corpus script(s) to only copy texts to the free/ dir which are
- Add some language recognition flags to write into the xsl file
- fix bugs!
Sjur
- public tender:
- read & evaluate received offers
- meeting on Thursday 18.5. at 10.00 AM with Børre, Tomi, Trond
- telephone meeting Friday with Finnut
- read & evaluate received offers
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- name lexicon:
- implement editing functions
- write down the most common editing scenarios, to guide development
- finalise refactoring for multiple collections (regular search interface)
- implement editing functions
- update corpus-summary processing to adhere to the new structure
- send bug report to Apple re filename matching and accented characters in
- fix bugs!
Thomas
- correct hyphenation of exceptions
- correct hyphenation of smj -st-
- work on compounding and derivation
- smj G3 issue
- sme G3 issue
- set up a linguistic workshop while Maaren is in Tromsø
Tomi
-
move to Bugzilla:
- aspell UTF-8 suffix bug
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
- aspell UTF-8 suffix bug
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct:
- implement data synchronisation of proper nouns between risten.no and CVS
- read aligner docu, install, provide feedback
- install and test Gobby
- Set up the mechanism for the hash-mark transducer package
- meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Trond
- add ccat option to analyse text while keeping the xml tags and structure
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
- get/upgrade keys for Børre's room for Tomi and Thomas
- Rethink the doubletagging procedure for names, consider grammatically
- Check the status & license of the corpus texts
- Work on the graphical corpus tag list
- send Saara smj files for language recognition
- Put Saara and Tero in contact with each other
- meeting on Thursday 18.5. at 10.00 AM with Børre, Sjur, Tomi
- ask Lars Nygård, Per Langgård and Tero Avellan to install Gobby 0.3
- set up a linguistic workshop while Maaren is in Tromsø
- fix bugs!.
13. Next meeting, closing
22.05.2006 09: 30
Closed at 11: 30