Meeting_2007-06-04
Contents:
- Meeting setup
- Agenda
- 1. Opening, agenda review, participants
- 2. Updated task status since last meeting
- 3. Documentation
- 4. Corpus gathering
- 5. Corpus infrastructure
- 6. Infrastructure
- 7. Linguistics
- 8. Name lexicon infrastructure
- 9. Spellers
- 10. Other
- 11. Next meeting, closing
- Appendix - task lists for the next week
Meeting setup
- Date: 04.06.2007
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09: 38.
Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond
Absent: Saara
Agenda accepted as is.
2. Updated task status since last meeting
Boerre
- add sma texts to the corpus repository
- not done
- not done
- run all known spelling errors in the prooftest corpus through the speller
- not done
- not done
- add extraction of all known spelling errors in the regular corpus (not the
prooftest corpus), and check that they are properly corrected- not done
- not done
- update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority- nothing new
- nothing new
- study the Hunspell formalism in detail - after beta release
- read some more docu
- read some more docu
- add info to front page (incl. download links)
- done
- done
- write separate page with detailed info (incl. download links)
- a separate page for the beta speller, with installation instructions, etc.
- done
- done
- a separate page for the beta speller, with installation instructions, etc.
- translate press release, web pages
- done
- done
- contact Davvi Girji / Mikal Aase
- not done
- not done
- test download and installation of beta release candidate on all platforms
- done
- done
- test deinstallation of beta release candidate on all platforms
- done
- done
- linguistic testing of speller beta release candidate
- done
- done
- update Win installer package with latest speller lexicons
- done
- done
- write and prepare README file
- done
- done
-
fix bugs!
- fixed #405
Maaren
- lexicalise actio compounds
- working
- working
- Manually mark speller test documents for typos
- done a little
Per-Eric
- expand the smj typos list
- contineud the work with typos list
- contineud the work with typos list
- add missing smj words
- working and continue work with smj words
- working and continue work with smj words
- translate beta release docs to smj
- done
- done
- fix bugs!
Saara
- improve cgi-bin scripts
- add new features to the paradigm generator
- done, to be installed this week
- done, to be installed this week
- paradigm generator for Lule Sámi
- done, same
- done, same
- add new features to the paradigm generator
- add new XSL/XML headers for proofing test docs
- not done
- not done
- compilation of verb lists
- is this still a problem? No, at least if this relates to PLX conversion.
- is this still a problem? No, at least if this relates to PLX conversion.
- read the manual for graphical corpus interface and try to add files with Lars.
- reading done, waiting for Lars.
- reading done, waiting for Lars.
- fix bugs!
Sjur
- finish press release for the beta
- done
- done
- run all known spelling errors in the corpus through the speller
- started work on integrating ccat in testing, but not finished
- started work on integrating ccat in testing, but not finished
- document the AppleScript testing tool
- not done
- not done
- integrate regression self tests with the make file
- not done
- not done
- improve speller test bench
- not finished, see above
- not finished, see above
- integrate the ccat speller testing options in the make file
- not finished, see above
- not finished, see above
- fix internet setup for Per-Eric's satelite modem
- not finished, need to follow up the modem producer
- not finished, need to follow up the modem producer
- look over the Bugzilla status mails
- not done
- not done
- contact Davvi Girji / Mikal Aase
- not done
- not done
- ask Xerox for a commercial lisense for the xfst tools on the G5
- not done
- not done
- check with Sámi publishing houses whether support for CS2 is still needed
- Min Áigi is already using CS3, and Davvi Girji is moving to that one as well,
appearently. Áššu is still using QuarkXPress/MacSámi (8-bit)...
- Min Áigi is already using CS3, and Davvi Girji is moving to that one as well,
- send Beta RC Win and Mac installers to Polderland for testing
- done - both passed
- done - both passed
-
ń not considered word char in smj, not corrected to ŋ
- bug forwarded to Polderland - fix is easy, can be delivered any time
- bug forwarded to Polderland - fix is easy, can be delivered any time
- forward list of PR recipients to SD
- press release sent out, and we received a list of all recipients. We could
still check against it to see if there are more recipients we would like to target
- press release sent out, and we received a list of all recipients. We could
- update Mac installer package with latest speller lexicons
- done
- done
- write and prepare README file
- done
- done
-
fix bugs!
- Bugzilla is showing inconsistent behaviour for recently added bugs (407-409),
we need to investigate this. It also seems our webserver has been the target for a hacker attack, including fake bug reports in Bugzilla. This might explain why it has been so slow towards the end of the week.
- Bugzilla is showing inconsistent behaviour for recently added bugs (407-409),
Steinar
- test actio compounds before public beta release
- test download and installation on all platforms, following our instructions
- done some tests
- done some tests
- test deinstallation
- Beta testing: Align manually (shorter texts)
- not done
- not done
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue- added texts
- added texts
- Complete the semantic sets in sme-dis.rle
- no work this week
- no work this week
- missing lists
- no work this week
- no work this week
- fix bugs!
Thomas
- work with compounding
- worked and still working
- worked and still working
- Lack of lowering before hyphen: Twol rewrite.
- not done
- not done
- translate beta release docs to sme and smj
- done
- done
- implement –k ending in twol
- done
- done
- Test Actio compounds in speller beta release candidate
- done
- done
- linguistic testing of speller beta release candidate
- done
- done
-
smj: öä not accepted, only øæ (except for lexicalised names)
- not done
- not done
-
fix bugs!
- some work done
Tomi
- add compounding restrictions to the PLX convertion
- done, not finished
- done, not finished
- make PLX conversion test sample; add conversion testing to the make file
- not done
- not done
- improve prefix and middle-noun PLX conversion
- not done
- not done
- integrate the ccat speller testing options in the Makefile
- not done
- not done
- fix clitics in beta release sme speller
- done, not tested
- done, not tested
- first part of multiword expressions not accepted
- done, not finished
- done, not finished
-
fix bugs!
- fixed
Trond
- Work on the web corpus issues
- Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt - Go through the Num bugs
- update the smj proper noun lexicon, and refine the morphological
- fix bugs!.
3. Documentation
The open documentation issues fall into these three categories:
- Beta documentation for testers
- Documentation for the online corpora
- General documentation improvement after Steinar's test (for open-source
release)
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- delayed till after the beta release
- delayed till after the beta release
- document how to apply for access to closed corpus, and details on the corpus
and its use in general (Børre, Sjur, Trond)- delayed till after the beta release
- delayed till after the beta release
- correct and improve it based on feedback from Steinar ( Børre)
- low priority
- low priority
- beta documentation (see separate beta section below)
4. Corpus gathering
TODO:
-
sme texts: no new additions, fix corpus errors during this month
( Børre, Trond, Saara) - missing nob parallel texts should be added if such holes are found
( Børre, Trond) - Go through the list of missing or errouneous nob texts, based upon
Saara's perfect list (Børre, Trond) - add sma texts to the corpus repository (Børre)
- contact Davvi Girji / Mikal Aase ( Børre, Sjur)
- tried, no success yet
5. Corpus infrastructure
Nothing this week.
6. Infrastructure
TODO:
- update and fix our documentation and infrastructure as Steinar finds
problem areas (Børre) - fix internet setup for Per-Eric's satelite modem (Sjur, Børre)
- this influences iChat, SEE sharing, and ARD connetions
- this influences iChat, SEE sharing, and ARD connetions
- Translate the paradigm pages to Sámi, fix all the se links
( Børre, Trond)
7. Linguistics
North Sámi
Actio compounds: most of them are already accepted as freely made compounds,
BOAHTALADDAN *boahtalladanmuorra - not base form, but which form?=derived BOAHTIN boahtinmuorra - base form oahpahahttin for example causatives are allowed oahpaladdanvuohki is okei, and all these -laddat + subst
Actual speller behaviour on the examples above:
boahtalladanmuorra
boahtaladdanmuorra - not accepted, suggestions:
bohtaladdanmuorra - okei with -o- (Maaren disagrees?no she does not)
bohtaladdanmuorra
bohtaladdanmuorra bohtat+V+IV+Der1+Der/l+V+Der2+Der/adda+V+Der3+Der/n+N+Sg+Nom#muorra+N+Sg+Nom
boahtáladdanmuorra - okei with á
boahtinmuorra - ok
oahpahahttin - ok
oahpaladdanvuohki - not accepted (it should be?), suggestions:
oahppaladdanvuohki - this is correct according to Thomas
roahpaladdanvuohki
soahpaladdanvuohki
...
Maaren and Duomma disagrees about what is correct and not, needs to be
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
( Maaren)- vuolgin- and vuolgga- , both are okei vuolggasadji and vuolgindássi for eks
- possibly turn on free compounding as part of the PLX conversions (ie free
compounding in the spellers, but not in the analyzers/transducers)
- vuolgin- and vuolgga- , both are okei vuolggasadji and vuolgindássi for eks
- fix stuorra-oslolaš lower case o ( Sjur, Thomas, Trond)
- open up compounding for all actios (Tomi)
Lule Sámi
We received a pdf file containing the print ready dictionary. It has two
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
( Thomas, Trond) - write two-level rules to handle -ge/-k clitic alternation (Thomas)
- done
- done
-
ö/ä vs ø/æ in speller (Thomas, Sjur)
- lexicalise words from the Olavi missing list, but check against the pdf
original where in doubt (Inga) - investigate why actios of 3-syllable verbs are not accepted by the speller
( Thomas)- norm-lookup does not see these, ordinary look-up sees
- norm-lookup does not see these, ordinary look-up sees
- investigate why some adverbs of 3-syllable adjectives are not accepted by the
speller (Thomas)- norm-look-up sees some, but not all, ordinary look-up sees
norm-look-up:
tjiehkusit tjiehkusit tjiegos+A+Adv basstelit basstelit basstelit +? muttágit muttágit mutták+A+Adv gåbddågit gåbddågit gåbddåk+A+Adv allagit allagit allak+A+Adv gåbddågit gåbddågit gåbddåk+A+Adv galmasit galmasit galmasit +? galmmasit galmmasit galmas+A+Adv suohkadit suohkadit suohkat+A+Adv låssådit låssådit låssåt+A+Adv dibmásit dibmásit dimes+A+Adv bihtjasit bihtjasit bitjas+A+Adv bihtjasit bitjes+A+Adv oabmásit oabmásit oames+A+Adv oabmásit oabme+N+Sg+Ill+PxSg2 tjalmmisit tjalmmisit tjalmmis+A+Adv stuoragit stuoragit stuorak+A+Adv
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in this meeting memo.
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara) - convert propernoun-($lang)-lex.txt to a derived file from common xml files
( Sjur, Tomi, Saara) - implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way) - start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists) - merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists) - publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
( linguists)
9. Spellers
OOo speller(s)
The MS Office Beta is now delivered, thus the following items move up on the priority list.
TODO:
- add Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished) - study the Hunspell formalism in detail (Børre, Sjur, Tomi)
Testing
Spelling Error Markup
TODO:
- Manually mark test texts for typos (making them into gold standards)
( Steinar) - Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) (Saara)
Testing tools
TODO:
- document the AppleScript testing tool (Sjur)
- improve speller test bench (Sjur)
- integrate the ccat speller testing options in the Makefile (Sjur, Tomi)
Regression tests
TODO:
- add extraction of all known spelling errors in the corpus (not the
prooftest corpus), and check that they are properly corrected ( Børre, Sjur) - test the typos.txt list, and check that all entries are properly corrected
( Børre, Sjur) - consider how to do a regression self-test, ie, how to test the full
wordlist (Børre, Sjur)- extract all the base forms in the lexicon, and run them through the speller
- extract all SUB-marked entries, and run them through the lexicon
- integrate these in the make file (Sjur)
- extract all the base forms in the lexicon, and run them through the speller
Localisation
TODO:
- translate beta release docs to sme ( Thomas)
- done
- done
- translate beta release docs to smj ( Thomas)
- done
Lexicon conversion to the PLX format
TODO:
- install larger disks, new RAM on the G5 when they arrive (Børre)
- not received yet, needs to check
- not received yet, needs to check
- ask for mklex for Linux (victorio) from Polderland (Sjur)
- waiting for the offer (it should have been sent already)
- waiting for the offer (it should have been sent already)
- ask Xerox for a commercial lisense for the xfst tools on the G5 (Sjur)
- add compounding restrictions to the PLX convertion (Tomi)
- hopefully before the Drag meeting.
Compounding restrictions
How to include compounding restriction comment tags in the transducers:
giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp => (using a perl script or similar) +SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
TODO:
- improve prefix conversion to PLX (Tomi)
- improve middle noun conversion to PLX (Tomi)
- improve noun + adjective PLX conversion: ( Tomi)
- compounding stems - how do we generate them? Using the java client?
+SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn't it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only. - compounding tags - we need to obey them when making the transducers.
Suggestion - see above.
- compounding stems - how do we generate them? Using the java client?
- make conversion test sample; add conversion testing to the make file
( Tomi)- to regression test / QA the PLX conversion.
- to regression test / QA the PLX conversion.
- improve number conversion (Børre, Tomi)
- done
- done
- ask for larger disk for the web server (Trond, Børre)
- done and installed
Errors in latest speller build
This list should be added to Bugzilla, and the list of fixes and known issues in
TODO after the final public beta:
- first part of multiword expressions not accepted (Tomi)
- compound restrictions/rules not yet integrated in speller fsts (but the tags
necessary to do this can now be included in the transducer) (Tomi) - suomasápmelaš vs suopmasápmelaš - now both are accepted (which is an
improvement - earlier only suopmasápmelaš was accepted). What is correct? Only the first one. The second one should be banned by the compound restrictions. -
smj: öä not accepted, only øæ (except for lexicalised names)
( Thomas) - ŋ and ñ and ń in smj ( Sjur, Polderland)
- move this list to Bugzilla (Børre, Sjur)
Public Beta release
RELEASED on Tuesday May 29.
Press coverage has been very good within the Sámi community - the coverage in Sweden and Finland, and in Norwegian-language media is not known.
TEST before public beta release:
- Actio compounds (Thomas, Steinar)
- done
- done
- download and installation on all platforms, following our instructions
( Steinar, Børre)- ok
- ok
- deinstallation (Steinar, Børre)
- MUST DO - /Applications/SamiProofingTools/deinstaller
- done
- done
- MUST DO - /Applications/SamiProofingTools/deinstaller
- linguistic testing (typos, at least one of the documents marked up by
Steinar) (Thomas, Børre)- MUST DO
- typos already done
- MUST DO
TODO:
- fix clitics (Tomi)
- done, but not in released version, has to be tested
- done, but not in released version, has to be tested
- add info to front page (incl. download links) (Børre)
- done
- download page made, only needs to add the speller beta when it is ready.
- done
- done
- write separate page with detailed info (incl. download links) (Børre)
- done
- a separate page for the beta speller, with installation instructions, etc.
- done
- done
- done
- translate press release, web pages (Børre, Thomas, whoever)
- done
- done
- forward list of PR recipients to Berit Karen Paulsen
( Sjur)- update: the persons to contact are: Pål Hivand or
John Markus Kuhmunen - done
- update: the persons to contact are: Pål Hivand or
- update installer packages with latest speller lexicons (Børre (Win),
Sjur (Mac))- done
- done
- README file: ( Børre, Sjur)
- list of known issues
- how to give us feedback
- what to give us feedback about (including: send in your user dictionary after
a month or so) - inform about the rights to the source, as well as the availability of the
speller - needs to be written in all languages
- make a PDF of the readme, to avoid encoding problems; text versions should be
included too- done, and translated
- swedish missing, Thomas translate, Per-Eric proofread
- done, and translated
- list of known issues
- file list Windows (Børre)
- not complete
- not complete
- proofreading using our own/Office spellers
- done
- done
- check links (Børre)
- done
- done
- test smj on typos (Børre)
- tried, but got an error, thus skipped. Needs to be checked now.
- tried, but got an error, thus skipped. Needs to be checked now.
- final check of list of known issues - smj (Thomas - based on typos-results)
- done
- done
- then update installers (Sjur)
- done
- done
- e-mail SD-people/Leif-Åge et co with final links (Sjur)
- done
- done
- send final PR text to SD
- done
- done
- celebrate
- NOT done - will do in Drag: )
- NOT done - will do in Drag: )
- resend the press release to some channels in Sweden, Finland and Norway
( Sjur)
10. Other
Summer vacation
When are we taking it? Please fill in the table below:
| Name | Starting | Ending |
|---|---|---|
| Børre | x | x |
| Maaren | 9.7. | 10.8. |
| Per-Eric | 9.7. | 20.7. |
| Saara | 2.7 | 3.8 |
| Sjur | x | x |
| Steinar | x | x |
| Thomas | 9.7. | 12.8. |
| Tomi | 9.7. | 5.8. |
| Trond | 25.6. | x |
Divvun people also need to send the dates to Julie Eira or
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
( Sjur)
Bug fixing
41 open Divvun/Disamb bugs, and 23 risten.no bugs
TODO:
- look over the Bugzilla status mails (Børre)
The meeting in Drag
The Sámi Parliament board has its meeting June 19-21. We should use Monday 18.
- Maaren (?)
- Sjur
- Tomi
11. Next meeting, closing
The next meeting is 11.6.2007, 10: 30 Norwegian time.
The meeting was closed at 11: 56.
Appendix - task lists for the next week
Boerre
- add sma texts to the corpus repository
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
prooftest corpus), and check that they are properly corrected - update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority - study the Hunspell formalism in detail
- contact Davvi Girji / Mikal Aase
- install larger disks, new RAM on the G5 when they arrive
- move list of known bugs to Bugzilla
- update/check installed file list and paths for Windows
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Per-Eric
- expand the smj typos list
- add missing smj words
Saara
- improve cgi-bin scripts
- add new features to the paradigm generator
- paradigm generator for Lule Sámi
- add new features to the paradigm generator
- add new XSL/XML headers for proofing test docs
- Try to add files with Lars to the corpus interface.
- fix bugs!
Sjur
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- improve speller test bench
- integrate the ccat speller testing options in the make file
- fix internet setup for Per-Eric's satelite modem
- look over the Bugzilla status mails
- contact Davvi Girji / Mikal Aase
- ask Xerox for a commercial lisense for the xfst tools on the G5
- check with Sámi publishing houses whether support for CS2 is still needed
- fix stuorra-oslolaš lower case o
-
ö/ä vs ø/æ in speller
- study the Hunspell formalism in detail
- move list of known bugs to Bugzilla
- resend the press release to some channels in Sweden, Finland and Norway
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue - Complete the semantic sets in sme-dis.rle
- missing lists
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
-
smj: öä not accepted, only øæ (except for lexicalised names)
- fix stuorra-oslolaš lower case o
- investigate why actios of 3-syllable verbs are not accepted by the speller
- investigate why some adverbs of 3-syllable adjectives are not accepted by the
speller - fix bugs!
Tomi
- add compounding restrictions to the PLX convertion
- make PLX conversion test sample; add conversion testing to the make file
- improve prefix and middle-noun PLX conversion
- integrate the ccat speller testing options in the Makefile
- first part of multiword expressions not accepted
- open up compounding for all actios
- fix bugs!
Trond
- Work on the web corpus issues
- update the smj proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt - Go through the Num bugs
- fix stuorra-oslolaš lower case o
- fix bugs!.

