Meeting_2006-11-27

Meeting setup

Date: 27.11.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10: 07.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact writers who already have received contracts
- Not done
consider a script for automatic testing of the spell checker in Word
- Downloaded documentation on Microsoft Office and AppleScript
consider more testing routines
- Not done
update Maaren's Forrest installation to r430284
- Not done
sma discussions with SD (with Sjur, Trond)
- Not done
add a simple password protection to risten.no in the G5
- Not done
consider infra for testing feedback
- Not done
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- Had a short discussion with Sjur about it.
check corpus contract issue
- Done
port all i18n work to the main branch (from the i18n branch)
- Done, but haven't set up the web server to use it.
update all forrest installations, including local patches
- It is available as: http: //divvun.no/static_files/forrest_divvun.tar.gz
send in hotel room list for employee seminar
- Done
fix bugs!
- Not done
Other work:
- Aspell: Contacted mailing list, doing some research.

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator
- not done
send in hotel room list for employee seminar
- done

Saara

finalize server of the Xerox tools.
- did some testing with parallel processing.
help Trond with some shell commands
update namelex2xml
- done
add nor output file to namelex2xml
- done by Sjur.
update hyph-filter.pl
- done
fix bugs!
- done some
other
- work with analyzed corpus: bug fixing, more efficient analysis, some new features to the analyzed documents.

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
  - done bug fixing in the existing risten.no code, + some additional testing
hire linguist and programmer
make new list of unrecognised forms in the hyphenator
- done, Trond sent it to Maaren
investigate unrecognised word forms in the hyphenator
- not done
decide how to specify compounding behaviour info in the lexicon
- still not decided, we need a short meeting to settle this one
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
consider more testing routines
- topic added to this meeting schedule
consider infra for testing feedback
- topic added to this meeting schedule
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
redo the derived words work for smj
- mostly done by Thomas
revitalize the Aspell work
- done by Børre
update the generating transducer to only be normative
- done
discuss clitics in PLX format with Polderland
- done
publish corpus contracts and project infra on NoDaLi-sta
- not done
send in hotel room list for employee seminar
- everyone sends in him-/herself
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
redo the derived words work for smj
- finished soon
investigate unrecognised word forms in hyphenator
- not done
decide how to specify compounding behaviour info in the lexicon
- not done
check why some SUB-marked entries got included in the normative transducer
- not done
create / check the paradigm grammar
- done
investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
- done
send in hotel room list for employee seminar
- done
fix bugs!
- not this week

Tomi

make PLX lexicon for all open POSes
- Not done
add derivations to the PLX generation
- not done
add compound stems to the PLX generation
- not done
add closed POS and clitics to PLX generation
- not done
revitalize the Aspell work
send in hotel room list for employee seminar
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Not done
get more sma texts
- Will have to wait until people return from sick leave.
update the corpus tag list in the cwb/ directory
- Done, will have to work more on this.
investigate unrecognised word forms in the hyphenator
- Not done.
decide how to specify compounding behaviour info in the lexicon
- Not done.
sma discussions with SD (with Børre, Sjur)
- Not done.
redo the derived words work for smj
- Not done.
fix bugs!.
- At least some done.

3. Documentation

There's a problem with the reliability of the Jetty (forrest run) version of our site. Thus, still using the static version on our main/public site.

TODO:

update all forrest installations, including local patches (Børre)
- created a tailored forrest distribution to download and install
port all i18n work to the main branch (from the i18n branch) (Børre)
- done

4. Corpus gathering

We finally received the Áššu and Min Áigi files from 1995 to 2001. Many of the Áššu files are in Quark Express and WriteNow. Quark files should be readable by InDesign (which we have), and WriteNow files should be readable by many Mac applications. All files can be expected to be encoded in MacSámi.

No newer files received from Áššu, Børre should contact them again.

To be able to convert WriteNow files (WriteNow is a long dead MacOS Classic application), we need the commercial software package MacLinkPlus from DataViz . It should be available from MacOffice in Tromsø, or can be bought online from their home page (USD 79.99).

More South Sámi files are forthcoming soon.

TODO:

get sma Bible / NT texts (Trond)
- not done due to illness
Discussions with the Sámi Parliament about sma ( Børre, Sjur, Trond)
send the filename renaming script to NSI; get their corpus (Børre)
- done
ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)
- not done

5. Corpus infrastructure

More texts to the graphical corpus interface:

Lars suggested changes (<p> only policy) to the corpus DTD (in news). Good suggestions, but we will consider waiting till next year (and new project money) to see whether we'll follow through.

TODO:

Consider whether to implement the <p> only policy or not (Saara)
add text to the server (Lars)
Discuss with Lars (Trond)

Aligner

The latest cvs version of our copy of the Aligner can be run from the command line, thanks to work by Børre:

The exact instructions on how to compile and run the program is in the README.txt file. Basic usage as follows:

tca2 -a anchorfile samifile otherlangfile

TODO:

gather more parallel texts (Trond)
re-analyze parallel files using the command-line version (Saara)
make a runnable version available on both the g5 and victorio (Børre)
- Done

Language recognition

TODO:

get more sma texts, first the Bible / NT (Trond)

6. Infrastructure

Xerox tools wrapped as servers

The PLX format for clitics have been checked with Polderland, and we should be able to do fine without them. Thus, we generate all paradigms without clitics.

TODO:

finalise the paradigm generator (requires paradigm grammar) (Saara)
- done, except needs to remove the clitics
fix the corpus tag list in the cwb/ directory (Trond)
- The list now matches the actual tags.
create / check the paradigm grammar as exemplified above (Thomas)
- done
update the generating transducer to only be normative, and without the subclass tags (Sjur)
- done
discuss clitics in PLX format with Polderland (Sjur)
- done

Hyphenator

The list of unrecognised words is found in victorio in /tmp/hyph-errors.txt, whereas words to check can be found in the same location in the file maaren-liste.txt.gz.

Typical problem words/reports (in hyph-errors.txt) are:

No forms for šveicalažžanat
No forms for šveicalažžaneamet
No forms for šveicalažžaneame
No forms for šveicalažžanan
No forms for šveicalažžanis
No forms for šveicalažžaneaset
....

TODO:

make new list of unrecognised forms (Sjur)
- done
investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- not done

7. Linguistics

Names and multilinguality

TODO:

finish first version of the editing (Sjur)
1. working on it
add @type=secondary and @excl=speller,hyph to all names marked with !SUB ( Saara)
1. done
test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils ( linguists)

Derivation and spellers like Aspell

TODO:

redo the work for smj, including discussion regarding Actio ( Thomas, Sjur, Trond)
- Thomas is working on it.

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun
ábuhuvvože
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne

TODO:

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio ( Maaren, Thomas)
check why some SUB-marked entries got included in the normative transducer ( Thomas, Sjur)

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
hire new linguist (Sjur)
- new suggestion in the meeting

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in the meeting memo.

TODO:

develop the needed XQueries and UI (Sjur, Tomi)
add a simple password protection to risten.no in the G5 (Børre)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Polderland data generation

All nouns are now generated in PLX, not yet the other POSes. Also the closed POSes will be generated the same way, which means we need paradigm grammars for them as well. In the simplest case that is just one feature/tag combination.

Also, there might be possible to ask lookup to print out the whole paradigm. This should be investigate.

We need to close the compound specification discussion. We'll have a meeting for it later in the week with Thomas, Trond and Sjur.

TODO:

write paradigm grammar for the closed POSes (Thomas)
check whether lookup can be used to generate paradigms for closed POSes ( Sjur)
decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
- meeting Tuesday at 12.30
update the hyphenation filter to always return one - if there are multiple alternatives after all the other filtering, pick the one with the most hyphenation points (Saara)
- done
make PLX lexicon for all open POSes (Tomi)
add derivations to the PLX generation (Tomi)
add compound stems to the PLX generation (Tomi)
add closed POS and clitics to PLX generation (Tomi)

Aspell

In preparation for the Aspell work, Børre has sent an e-mail to the Aspell developer e-mail list. Here is the most interesting part of the thread:

> What features in hunspell would you specifically like to have in aspell?

Possibly:

- Max. 65535 affix classes and twofold affix stripping

- Handling conditional affixes, circumfixes, fogemorphemes, forbidden
   words, pseudoroots and homonyms.

- Support complex compoundings

I believe some of these will benefit you.

However I only want to implement them if these is a clear benefit to it.
For example based on what several people have told be complex compounding
rules are not worth it.

Aspell is far more complex then Myspell and each feature needs to
implemented carefully so that it will behave sensibly with the
suggestion code.  Also it is important that the addition of the
feature won't degrade performance, Especially when the feature isn't used.

The whole thread is available here .

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

When the PLX-based speller is ready: use the generated word list as test input: all should be accepted (coverage self-testing). Pick random 1% and randomly change them with edit distance 1, run through speller = testing false positives

We need a meeting to plan testing. We'll do it shortly this week, and perhaps a longer meeting in Alta.

TODO:

meeting Wednesday at 9.30 (Sjur, Børre)
consider a script (shell+AppleScript?) for automatic testing of MS Word ( Sjur, Børre)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD ( Børre, Sjur)

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)

Bug fixing

57 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

update Maaren's Forrest installation to r430284 (Børre)

Employee seminar in Alta

TODO:

send in hotel room list (Divvun)
- done (all)

11. Next meeting, closing

The next meeting is 4.12.2006, 09: 30 Norwegian time.

The meeting was closed at 11: 27.

Appendix - task lists for the next week

Boerre

contact authors who have already received the corpus licensing contract
consider a script for automatic testing of the spell checker in Word
meeting Wednesday at 9.30 with Sjur to plan testing
sma discussions with SD (with Sjur, Trond)
add a simple password protection to risten.no in the G5
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
update all forrest installations, including local patches
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator

Saara

finalize server of the Xerox tools.
help Trond with some shell commands
re-analyze parallel files
consider implementing some new features to the corpus files
add closed POSes to the paradigm gen, if needed.
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
publish corpus contracts and project infra on NoDaLi-sta
ask SD/Sig-Britt Persson about some of the South Sámi bible texts
check whether lookup can be used to generate paradigms for closed POSes
meeting Wednesday at 9.30 with Børre to plan testing
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
redo the derived words work for smj
decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
check why some SUB-marked entries got included in the normative transducer
investigate unrecognised word forms in hyphenator
test editing of the xml files
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary)
write paradigm grammar for the closed POSes
fix bugs!

Tomi

make PLX lexicon for all open POSes
add derivations to the PLX generation
add compound stems to the PLX generation
add closed POS and clitics to PLX generation
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
sma discussions with SD (with Børre, Sjur)
redo the derived words work for smj fix bugs!.