Meeting_2008-03-31
Contents:
- Meeting setup
- Agenda
- Opening, agenda review, participants
- Updated task status since last meeting
- Pedagogical software online
- Documentation
- Corpus gathering
- Future plans, directions and ideas
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Proofing tools
- Other
- Next meeting, closing
- Appendix - task lists for the next five days
Meeting setup
- Date: 31.3.2008
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
Cf. one of the following, depending on context:
- the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
- the TOC in Forrest-rendered output, like HTML and PDF
Opening, agenda review, participants
Opened at 10: 16.
Present: Børre, Per-Eric, Sjur, Thomas, Trond
Absent: Maaren, Lene, Tomi
Agenda accepted as is.
Updated task status since last meeting
Børre
- gather sma texts
- Hunspell lexicon conversion
- InDesign documentation
- update the Changes document
- prepare migration to svn (with Sjur, Trond)
- fix bugs!
Lene
- Ped project
- Add a flag !^P^ for forms to be excluded from ped. speller
Maaren
- Put the list of possible sma corpus sources into a document
Per-Eric
- keep the contact with Kurt Tores family about his texts.
- Not heard anything from them
- Not heard anything from them
- try to find other authors who have smj texts digitaly, send contracts to
them - Work with missing list from Tjaktjalasta, Lars-Matto Tuolja.
- Nothing done
- Nothing done
- Work with missing list from texts written by Sigga Tuolja Sandström.
- Nothing done
- Nothing done
- Keep the contact with Ulf-Stefan Winka who has many more smj texts to add.
- Keeping the contact
- Keeping the contact
-
fix bugs!
- Nothing done
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not) - discuss more parallel texts
Sjur
- gather sma texts
- nothing new
- nothing new
- name db/risten.no
- nothing new
- nothing new
- make an improved sma project plan
- nothing new
- nothing new
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- nothing new
- nothing new
- prepare migration to svn (Børre, Trond)
- nothing new
- nothing new
-
fix bugs!
- other things:
- cleaning up papers and reiserekningshelvetet
Thomas
- look at test cases still not behaving properly
- not any this week
- not any this week
- try to fix 636
- done with a little help from my friends
- done with a little help from my friends
- Add a flag !^P^ for forms to be excluded from ped. speller
- not done
- not done
-
fix bugs!
- some done
Tomi
- Hunspell lexicon conversion
- document how compounding is controlled in the PLX conversion
- fix double hyphen bugs
- compile new speller lexicons
- Make a pedagogical speller (after MA thesis is delivered)
- fix bugs!
Trond
- Work on sma analyser and visl integration
- Not done
- Not done
- try to fix 636
- Done, fixed.
- Done, fixed.
- Prepare svn migration (with Sjur, Børre)
- Have read about the topic, that's all.
- Have read about the topic, that's all.
-
fix bugs!.
- The assigned one, at least.
Pedagogical software online
TODO:
- get an easy-to-remember URL (UiT/IT)
- More thorough skin, layout, ... (External person within the Ped team,
Internal forrest expert) This we will postpone until later - Make a pedagogical speller (Tomi when finished with his MA thesis)
- Add a flag !^P^ for forms to be excluded (Thomas, Lene)
- Turn off peripheral compounds (numbers, acros, perhaps names)
- Increase editing distance by one for suggestions? Only possible with limited
compounding
- Add a flag !^P^ for forms to be excluded (Thomas, Lene)
Documentation
TODO:
- start to reorganise the documentation (Børre, Sjur, Trond)
Corpus gathering
TODO:
- follow-up on the smj texts from Kurt Tore ( Per-Eric)
- get texts from Sigga Tuolja Sandstrøm, possibly through
Ulf-Stefan Winka (contract is ok now) (Per-Eric) - other contacts: Nord-Salten avis, Børge Strandskog, Lena Davidsson daughter to
Lars-Matto Tuolja - gather sma texts (Børre, Sjur, Trond, Joseph)
- Put the list of possible corpus sources into a document
gt/doc/lang/sma/sma-corpus-plan.jspwiki ( Maaren) - give contract with blank fields to Per-Eric ( Børre)
Future plans, directions and ideas
See a separate document in plan/strat/5year.jspwiki.
Infrastructure
To accomodate future enhancements in different directions:
- migrate to svn
- merge gt, kt and st into one, probably after the svn move
- more modularised make / build infra (prepare for smn, sms, sjd, others)
- close certain parts of the code repository (requires svn)
- set up the Leopard Server features for collaborative support:
- permanent chat rooms
- stored (and indexed) chat transcripts of the chat rooms
- iCal server / group calendars
- wiki
- permanent chat rooms
- wiki? (is part of Leopard Server) or other web-based documentation
- improve Forrest stability and i18n support
- reorganise the documentation content:
- differ between target groups
- get better grouping
- decide what to write in forrest and what in wiki
(cf. Apertium for a similar split) - update/add missing parts
- differ between target groups
- migrate lexicons to XML, splitting the task
- Name lexica (the Name project)
- Dictionaries (already in XML, task is to integrate them)
- Open POSes (Komi as a test case)
- Name lexica (the Name project)
- change the look of the documentation web
- sfst? Both as replacement for xfst and for hunspell/open-source proofing tools
- investigate the NSIS installer, potentially replacing the InstallShield
package from Polderland - corpus content moved to Max Planck repositories?
TODO:
- add Jabber account in iChat
- check that all accounts are ready for iChat on the G5 (Børre)
- UiT: Lene, Kimme, Saara (Trond)
- SD: Svenne, Joseph (Sjur)
- check that all accounts are ready for iChat on the G5 (Børre)
- prepare migration to svn (Børre, Sjur, Trond)
Linguistics
North Sámi
(nothing new, see proofing bugs below)
Lule Sámi
(nothing new, see proofing bugs below)
TODO:
-
sme->smj lexicon conversion to build bilingual lexicon resources, and
increase smj coverage (Trond, Svenne). - Add the words when all words are ready.
South Sámi
Nothing new.
Name lexicon infrastructure
TODO:
- fix i18n bug in risten.no/G5 (so they will work without the proper locale
request) (Sjur)- it works ok locally, set-up / config needs to be checked on the G5; probably
easy to fix- it works the same both locally and on the G5, relates to i18n setup in
forrest
- it works the same both locally and on the G5, relates to i18n setup in
- it works ok locally, set-up / config needs to be checked on the G5; probably
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara) - convert propernoun-($lang)-lex.txt to a derived file from common xml files
( Sjur, Tomi, Saara) - implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way) - start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists) - merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists) - publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
( linguists)
Proofing tools
Hunspell
OOo 2.4 is out, with support for the Sámi languages. And Skolelinux would like
To ease conversion to Hunspell and similar formats, it would be extremely useful
- decide which markers to use for derivations and inflections (pot. also
prefixes and suffixes) - Decide where to put the suffix boundaries, On parent lexicon or on child lexicon;
only on stem-suffix border, or also on suffix-suffix-border - Add the boundaries to lexc (already done for sme derivations and proper names)
- Do the boundaries in twol compulsatory when called for
Børre has made what could be called the first beta of the Hunspell speller
TODO:
- test the latest hunspell lexicon (Sjur)
- add smj to the soup, make sure it works roughly as good as sme
( Børre, Thomas, Per-Eric) - fix the remaining conversion bugs for sme ( Børre, Tomi)
- return to smj, and fix whatever is left to fix (Børre, Tomi)
- release a public beta at the end of April (Børre, Sjur)
- implement the derivations as separate "continuation lexicons"
( Børre, Tomi)- done for sme
Testing
Spelling Error Markup
TODO:
- Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) (Saara) - test new and nested error markup (Sjur)
Speller bugs
Open issues based on test results :
sme
- 425 - REGRESSION: other words from Divvun.no - two words rejected
- 426 - comp words from Divvun.no - guoktedássásaš accepted - still OPEN
- 435 - roman numbers - REGRESSION: inflection of single letter numbers
rejected (but is ok in smj)- we should pregenerate all numbers once and for all, and store them in a
separate lexicon file
- we should pregenerate all numbers once and for all, and store them in a
- 452 - REGRESSION: several lexical bugs - oažžuin + ožžuin
- 595 - prefix+name wihtout hyphen (ovdaLot instead of ovda-Lot) - still
OPEN - 600 - gen+hyph compound sámi-dáru - still OPEN
- 603 - suomabealdi accepted - still OPEN
- 606 - speller accepts VUOHTA compound - still OPEN
- 607 - acro + hyphen - still OPEN
-
NRKGA is acro + clitic accepted without colon - what is correct?
-
NRKGA is acro + clitic accepted without colon - what is correct?
- 611 - double hyphen sugg still accepted - still OPEN
- 613 - short gen. as second compound part - still OPEN
- 619 - numerals and pronouns to NAMÁK and SASJ fails - still OPEN
- 627 - prefix + hyhpen does not get accepted - still OPEN
- 629 - a taking part in compounding without hyphen - still OPEN
- 633 - double hyphens accepted in Word, not by cmdline speller - FIXED
- 634 - PropGen+hyph+PropGen - still OPEN
- 641 - numeral+noun compounds - still OPEN
- 642 - noun/adj/proper + hyphen + ain - still OPEN
- 644 - cased numeral+numeral compund - still OPEN
- 646 - adverb + hyphen + noun - still OPEN
- 647 - numerals+NOUN - still OPEN
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 649 - name + adj compound without hyphen - still OPEN
- 654 - speller does not recognize ordinals on -nuppelogát - still OPEN
- 655 - pron + nai - still OPEN
- 658 - Suggestion saame - still OPEN
- 660 - abbr. not recognised - FIXED
smj
- 435 - roman number - single letter numbers now recognised
- we should pregenerate all numbers once and for all, and store them in a
separate lexicon file - please note that inflection of single letter numerals is fine in
smj, as opposed to sme
- we should pregenerate all numbers once and for all, and store them in a
- 595 - prefix+name wihtout hyphen (tsåhkeLot instead of tsåhke-Lot) -
still OPEN - 600 - gen+hyph compound sáme-dáro - still OPEN
- 607 - acro + hyphen
-
NRKGA is acro + clitic accepted without colon - what is correct?
-
NRKGA is acro + clitic accepted without colon - what is correct?
- 616 - Bispadime-me-ráden - still OPEN, try to find an acro or abbr me
- 619 - numerals and pronouns to NAMÁK and SASJ fails - still OPEN
- 629 - a taking part in compound - still OPEN
- 634 - rop gen + hyphen + Prop gen - still OPEN
- 641 - numeral+noun compounds - still OPEN
- 644 - cased numeral+numeral compund - still OPEN
- 647 - numerals+NOUN - still OPEN
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 649 - name + adj compound without hyphen - still OPEN
- 650 - noun prefix+name compound without hyphen - still OPEN
- 658 - Suggestion saame - still OPEN
TODO:
- compile new speller lexicons (Tomi)
- document how compounding is controlled in the PLX conversion (Tomi)
Hyphenator bugs
Open issues based on test results :
sme
- 468 - REGRESSION: Márkomeanu
- 547 - REGRESSION: hyphen in front of vowel: Lotnolasealáhusas
- 548 - REGRESSION: mid syllable hyphenation: Háliidivččen
- 549 - REGRESSION: division without hyph: Váccedettiin
- 673 - adj-derivations: guovttenuppelotčoarvvagiin (the word is not rec.)
- 677 - NEW: Wrongly hyphenated ending -danidja
TODO:
- add remaining hyphenation bugs to Bugzilla (Thomas)
- done
smj
- 545 - REGRESSION: bad hyphenation in compounds: åhpadusorganisásjåvnån
(not recognised) - 546 - REGRESSION: obligatory hyph rules seem to work in facultative
manner: organisásjåvnån (not recognised) - 547 - REGRESSION: hyphen in front of vowel: Jienastimnjuolgadusá and
Orgánajs - 636 - hyphen before last char -> Divvun - FIXED
TODO:
- lexicalise europarádeministarjuogos ( Thomas)
- done
- done
- try to fix 636 (Thomas, Trond)
- done
InDesign tools
We're waiting for an update from Polderland.
Windows installer
This point is now moved to the section for future plans, and will be tackled as
Releases
TODO:
- update the Changes document (Sjur)
- documentation (Sjur)
- Norwegian translation received from Davvi Girji
Other
Corpus contracts + open source
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
( Sjur)
Next meeting, closing
The next meeting is 7.4.2008.
The meeting was closed at 11: 50.
Appendix - task lists for the next five days
Boerre
- Hunspell lexicon conversion
- InDesign documentation
- prepare migration to svn (with Sjur, Trond)
- release a public beta at the end of April (with Børre)
- fix bugs!
Lene
- Ped project
- Add a flag !^P^ for forms to be excluded from ped. speller
Maaren
- Put the list of possible sma corpus sources into a document
Per-Eric
- keep the contact with Kurt Tores family about his texts.
- try to find other authors who have smj texts digitaly, send them contracts
- Work with missing list from Tjaktjalasta, Lars-Matto Tuolja.
- Work with missing list from texts written by Sigga Tuolja Sandström.
- Keep the contact with Ulf-Stefan Winka who has many more smj texts to add.
- fix bugs!
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not) - discuss more parallel texts
Sjur
- gather sma texts
- name db/risten.no
- make an improved sma project plan
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- prepare migration to svn (Børre, Trond)
- add Jabber account in iChat for Svenne, Joseph
- test the latest hunspell lexicon
- release a public beta at the end of April (with Børre)
- update the Changes document
- fix bugs!
Thomas
- look at test cases still not behaving properly
- Add a flag !^P^ for forms to be excluded from ped. speller
- Hunspell: add smj to the soup, make sure it works roughly as good as sme
- fix bugs!
Tomi
- Hunspell lexicon conversion
- document how compounding is controlled in the PLX conversion
- fix double hyphen bugs
- compile new speller lexicons
- Make a pedagogical speller (after MA thesis is delivered)
- fix bugs!
Trond
- Work on sma analyser and visl integration
- Set up Jabber for Lene, Kimme, Saara
- Prepare svn migration (with Sjur, Børre)
- fix bugs!.

