Meeting_2006-11-20

Contents:

Meeting setup
Agenda
1. Opening, agenda review, participants
2. Updated task status since last meeting
- Børre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond
3. Documentation
4. Corpus gathering
5. Corpus infrastructure
6. Infrastructure
7. Linguistics
8. Name lexicon infrastructure
9. Spellers
10. Other
11. Next meeting, closing
Appendix - task lists for the next week
- Boerre
- Maaren
- Saara
- Sjur
- Thomas
- Tomi
- Trond

Meeting setup

Date: 20.11.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09: 45.

Present: Sjur, Thomas, Tomi, Trond

Absent: Børre, Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact writers who already have received contracts
send file rename script to NSI; get their corpus
- Sent script, made it work on NSI.
consider a script for automatic testing of the spell checker in Word
- Fetched documentation for AppleScript and Microsoft Word
consider more testing routines
update Maaren's Forrest installation to r430284
sma discussions with SD (with Sjur, Trond)
add a simple password protection to risten.no in the G5
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
check corpus contract issue
port all i18n work to the main branch (from the i18n branch)
- Done on a local machine, will wait with committing untill next week when I'm back.
update all forrest installations, including local patches
- Have uploaded a tarball available at http://divvun.no/static_files/forrest.som.funker-2006.11.07.tar.gz
revitalize the Aspell work
- Sent an email to skolelinux' sami maillist, with a short instruction on how to build an aspell package from our fullword list
- Have wrestled with our fullword list, to find out how to build an aspell package as efficiently as possible. Conclusion: we will have to work on the affix list.
check corpus contract issue in a meeting Wednesday 9.30
- Done together with Sjur and Trond, sent an answer to Harald Gaski
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
- done some
investigate unrecognised word forms in the hyphenator
- done some

Saara

finalize server of the Xerox tools.
- waiting for pardigm grammar?
help Trond with some shell commands
- not done
fix bugs!
- done some
other things:
- continue bible work: sme and smj nts, nno and nob available ots, swe nt and available ot and fin nt. bible documentation is almost ready as well.

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
  - tried to finalise the SD-terms code refactoring - found probblems in eXist re file URL for imported modules
  - problems with the old risten.no installation (the present site) - fixed some but there are still problems with editing; instead of trying to patch the old code -> will try to make the present codebase stable enough to be taken into use, and upload that one
hire linguist and programmer
make new list of unrecognised forms in the hyphenator
- tried, but found new problems with the Xerox tools
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
consider more testing routines
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
ask Julie Eira about SD employee seminar
- done - we should attend
check corpus contract issue
- done
meeting discuss names and multilinguality Tuesday 14.11 12.30
- done
redo the derived words work for smj
revitalize the Aspell work
check corpus contract issue in a meeting Wednesday 9.30
fix bugs!
other things:
- bug hunting in the new Xerox tools

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
redo the derived words work for smj
- not done
investigate unrecognised word forms in hyphenator
- not done
decide how to specify compounding behaviour info in the lexicon
- not done
check why some SUB-marked entries got included in the normative transducer
- not done
create / check the paradigm grammar
- not done
investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
- done
meeting discuss names and multilinguality Tuesday 14.11 12.30
- done
fix bugs!
- not done

Tomi

make first version of the PLX data generation in lexc2xspell
- first version is out
add hyphenation points to the generated output
- they are being added
revitalize the Aspell work
- helped Børre on this one
fix bugs!
- not done

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts, first the Bible / NT
- Finally got hold of sma Bible texts, they are forthcoming.
add corpus user accounts and access issues to Bugzilla
- Done.
fix the corpus tag list in the cwb/ directory
- Still awaiting disamb clarification
investigate unrecognised word forms in the hyphenator
- Done some work with Sjur
decide how to specify compounding behaviour info in the lexicon
- Don't remember this one
sma discussions with SD (with Børre, Sjur)
- Now forthcoming
check corpus contract issue
- Done
meeting discuss names and multilinguality Tuesday 14.11 12.30
- Meeting held, will continue
redo the derived words work for smj
- Did we do this? No, not yet
check corpus contract issue in a meeting Wednesday 9.30
- Done.
fix bugs!.

3. Documentation

TODO:

update all forrest installations, including local patches (Børre)
port all i18n work to the main branch (from the i18n branch) (Børre)

4. Corpus gathering

Trond talked with Bierna Bientie, got permission to use the texts, so far as bound texts. We shall contact SD to get the texts they have there, he will send whatever they don't have.

The texts at SD are available via Sig-Britt Persson, Sjur will ask her about them.

TODO:

get sma Bible / NT texts (Trond)
- partly done
Discussions with the Sámi Parliament about sma ( Børre, Sjur, Trond)
send the filename renaming script to NSI; get their corpus (Børre)
ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)

5. Corpus infrastructure

Nothing new on any of the sub-issues.

User accounts and access

TODO:

add the issue with subissues to Bugzilla (Trond)
- done

More texts to the graphical corpus interface:

No new texts available in the web interface yet. Saara and Trond have got access to the Oslo computer, and could in principle add texts themselves. Some instructions will be necessary, though.

TODO:

add text to the server (Lars)
Discuss with Lars (Trond)

Aligner

Øystein Reigem has commented Børre's work, and will perhaps look at it.

TODO:

gather more parallel texts (Trond)
try out NT alignment strategies (Saara)

Language recognition

TODO:

get more sma texts, first the Bible / NT (Trond)
- done, more forthcoming

6. Infrastructure

Xerox tools wrapped as servers

Tomi has modified the server a bit, causing it to NOT work with the perl client at the moment. It should not be a big deal to fix it (three lines?). Saara will fix it.

The paradigm grammar needs to be written for Saara to be able to finish the server. The format should follow the Noun sample below:

N+Number+Case+Possessive?+Clitic?
A
V
Adv

Can we do without clitics? We would like to just list them in a PLX format that allows correct behaviour:

Is
InC lex
MeC lex
FiC clit

That is, we want mánás + go to be accepted, without letting mánás make compounds freely. We don't know how to specify this in the PLX format. Sjur will discuss clitics with Polderland, to solve this issue.

TODO:

finalise the paradigm generator (requires paradigm grammar) (Saara)
fix the corpus tag list in the cwb/ directory (Trond)
create / check the paradigm grammar as exemplified above (Thomas)
update the generating transducer to only be normative, and without the subclass tags (Sjur)
discuss clitics in PLX format with Polderland (Sjur)
update the corpus/cwb tags (Trond)

Hyphenator

TODO:

make new list of unrecognised forms (Sjur)
- tried but failed because of problems with the Xerox tools
investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)

7. Linguistics

Names and multilinguality

In the terms-sme.xml file:
<entry id="Helsset">
  <appl excl="speller,hyph"/> -- just an example! --
  <infl lexc="BERN"/>
  <senses>
    <sense ref="Helsingfors"/>
  </senses>
</entry>

<entry id="Helsingfors" type="secondary">
  <appl incl="disamb, IR" /> 
  <infl lexc="BERN"/>
  <senses>
    <sense ref="Helsingfors"/>
  </senses>
</entry>

TODO:

separate meeting for discussing this issue, Tuesday 14.11 after lunch (12.30) (Trond, Thomas, Sjur)
- done; new tasks added below

TODO:

finish first version of the editing (Sjur, Tomi)
add @type=secondary and @excl=speller,hyph to all names marked with !SUB ( Saara)
test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils ( linguists)

Cohort policy

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

This issue is dependent on CG3, and is delayed untill we know more about the possibilities with it.

Derivation and spellers like Aspell

TODO:

redo the work for smj, including discussion regarding Actio ( Thomas, Sjur, Trond)

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun
ábuhuvvože
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne

TODO:

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio ( Maaren, Thomas)
check why some SUB-marked entries got included in the normative transducer ( Thomas, Sjur)

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
hire new linguist (Sjur)

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in the meeting memo.

TODO:

develop the needed XQueries and UI (Sjur, Tomi)
add a simple password protection to risten.no in the G5 (Børre)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Polderland data generation

Tomi has made the first version of the PLX data generation, it ran for 6,5 hours before it crashed, and generated all entries until line 13639 in the noun-sme-lex.txt file (njuvdin):

13639 njuvdin:njuvdim BOAHTIN ;

The generated PLX file contains 1 439 273 lines, roughly 32 Mb.

Tomi: The hyphenator produces multiple suggestions. Sjur: it needs to be used with the hyphenation filter.

TODO:

decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
make first version of the PLX data generation (Tomi)
- done
add hyphenation points to the generated output (Tomi)
- done
update the hyphenation filter to always return one - if there are multiple alternatives after all the other filtering, pick the one with the most hyphenation points (Saara)
make PLX lexicon for all open POSes (Tomi)
add derivations to the PLX generation (Tomi)
add compound stems to the PLX generation (Tomi)
add closed POS and clitics to PLX generation (Tomi)

Automatic testing of the Word spellchecker

It should be possible to write a script that runs texts through Word from the command line, using a combination of shell script and AppleScript. MS Word has the needed AppleScript commands to run the spell checker.

TODO:

consider a script for automatic testing (Sjur, Børre)
consider more testing routines (Sjur, Børre)
consider infra for testing feedback (Børre, Sjur)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD ( Børre, Sjur)

Aspell

Børre has worked on the Aspell code, mainly to be able to explain how it works and help out Petter Reinholdtsen, who would like to include the Aspell sme speller in Linux distributions. He already gave back some improvements to our code.

TODO:

revitalize the Aspell work (Børre, Sjur, Tomi)
- Børre and Tomi did some

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

10. Other

Corpus contracts

TODO:

check corpus contract issue in a meeting Wednesday 9.30 ( Børre, Sjur, Trond)
- done
publish corpus contracts and project infra on NoDaLi-sta (Sjur)

Bug fixing

62 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

update Maaren's Forrest installation to r430284 (Børre)

Employee seminar in Alta

SD has an employee seminar in Alta 7.-8. December - should we go there? Yes.

TODO:

ask Julie Eira about SD employee seminar (Sjur)
- done
send in hotel room list (Divvun)

11. Next meeting, closing

The next meeting is 27.11.2006, 09: 30 Norwegian time.

The meeting was closed at 11: 28.

Appendix - task lists for the next week

Boerre

contact writers who already have received contracts
consider a script for automatic testing of the spell checker in Word
consider more testing routines
update Maaren's Forrest installation to r430284
sma discussions with SD (with Sjur, Trond)
add a simple password protection to risten.no in the G5
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
check corpus contract issue
port all i18n work to the main branch (from the i18n branch)
update all forrest installations, including local patches
send in hotel room list for employee seminar
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator
send in hotel room list for employee seminar

Saara

finalize server of the Xerox tools.
help Trond with some shell commands
update namelex2xml
add nor output file to namelex2xml
update hyph-filter.pl
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
make new list of unrecognised forms in the hyphenator
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
consider more testing routines
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
redo the derived words work for smj
revitalize the Aspell work
check corpus contract issue in a meeting Wednesday 9.30
update the generating transducer to only be normative
discuss clitics in PLX format with Polderland
publish corpus contracts and project infra on NoDaLi-sta
send in hotel room list for employee seminar
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
redo the derived words work for smj
investigate unrecognised word forms in hyphenator
decide how to specify compounding behaviour info in the lexicon
check why some SUB-marked entries got included in the normative transducer
create / check the paradigm grammar
investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
send in hotel room list for employee seminar
fix bugs!

Tomi

make PLX lexicon for all open POSes
add derivations to the PLX generation
add compound stems to the PLX generation
add closed POS and clitics to PLX generation
revitalize the Aspell work
send in hotel room list for employee seminar
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts
update the corpus tag list in the cwb/ directory
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Sjur)
redo the derived words work for smj
update the corpus/cwb tags (Trond)
fix bugs!.