Meeting_2007-01-08

Meeting setup

Date: 08.01.2007
Time: 09.00 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 9: 24.

Present: Børre, Saara, Sjur, Steinar, Thomas, Trond

Absent: Maaren, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact authors who have already received the corpus licensing contract
- still not done
continue work on script for automatic testing of the spell checker in Word
- nothing done
recreate our forrest tarball
- done
update setup and installation instructions for new users/computers
- done
create new forrest tarball
- done
cvs up of the public server should be done for xtdoc/sd/documentation/
- done
fix sme texts in corpus this month
- nothing done
find missing nob parallel texts in corpus
- nothing done
aligner status quo
- works
translate Windows installer to sme and smj
- translated a few strings
fix bugs!
- not done
other
- had to setup (www.)divvun.no to serve static content directly via apache. jetty runs out of memory if it has to serve huge files. Only for big/huge files, the rest of the site is served by Jetty as before.

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

help Trond with some shell commands
- this is related to bug 355.
fix sme texts in corpus this month
- done some
aligner status quo
- aligner is working
send aligned, xml nob texts to Lars
- not done
add correction markup to the xml files (string-to-correction markup)
- faced some problems
first new version of xml2lexc in Perl
- started, test version will be available today.
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
  - continued refactoring and redesign of the risten.no code: risten.no now has full i18n support, is using the latest Forrest to overcome several shortcomings of the previous version, and is soon ready for a more flexible editor
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix forrest installations for Maaren, Disamb
send Windows installer translation text to Børre and Thomas
- done
fix bugs!

Steinar

conversion error screening
- not done
missing lists
- done some work
report conversion errors to Saara
- not done
fix bugs!
- not done

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not done
decide how to specify compounding behaviour info in the lexicon
- nothing this week
translate Windows installer to sme and smj
- not done
fix bugs!
- worked hard with bugs

Tomi

On sick leave.

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- No smj this week.
get more sma texts
- Also no news.
aligner status quo
- done - working
decide how to specify compounding behaviour info in the lexicon
Set up work on missing and conversion screening with Steinar.
- Done (missing only).
fix sme texts in corpus this month
find missing nob parallel texts in corpus
- Not yet.
report conversion errors to Saara
- Done. Progress on conversion last week.
fix bugs!.
- Fixed some, actually, and found new ones

3. Documentation

Børre has done all(!) open issues - thanks!

TODO:

either fix forrest installations (Sjur), or create a new tarball ( Børre)
- done (new tarball)
cvs up of the public server should be done for xtdoc/sd/documentation/ - notice the directory! (Børre)
- done
update setup and installation instructions (Børre)
- done

4. Corpus gathering

TODO:

sme texts: no new additions, fix corpus errors during this month ( Børre, Trond, Saara)
missing nob parallel texts should be added if such wholes are found ( Børre, Trond)
Go through the list of missing or errouneous nob texts, based upon Saaras perfect list (Børre, Trond)

5. Corpus infrastructure

Aligner

TODO:

make an overview of status quo (Trond, Børre, Saara).
- done
go through other directories, fix parallelity information for other documents ( Børre)
- the sme/admin/ files are now double checked.
gather more parallel texts (Trond, Børre)
- not any more? (depends on what we are missing and how hard they are to get, but we will not put much time into it)
re-analyze parallel files using the command-line version (Saara)
- progressing
when aligned, send aligned, xml nob texts to Lars ( Saara)

Conversion issues

Initial nbsp? Cf these:

ccat -l sme -r zcorp/bound/sme/admin/depts/

preprocess --abbr=bin/abbr.txt --corr=src/typos.txt

lookup -flags mbTT -utf8 -f bin/missing

grep '\?'

cut -f1

sort

uniq -c

sort -nr

 19  –Danne (EM SPACE, U+2003)
 84  –Lea
 62 Bueng

Use UnicodeChecker to check spurious characters that look inocent (nbsp, em space, etc).

admin/depts still have many errors stemming from the problematic PDF conversion:

     34 kapih
     33 ttalat

TODO:

add correction markup to the xml files (string-to-correction markup) ( Saara)
report conversion errors to Saara ( Trond, Steinar)

6. Infrastructure

Nothing

7. Linguistics

North Sámi

Compounds with Actio as the first part. Earlier, we accepted such compounds, but they gave rise to so much disamb errors that we excluded them, and opted for lexicalisation. Now they turn up, but the solution is probably to lexicalise them.

bajásšaddaneavttuid
olgunastinlága
ávkkástallanvugiide

Another problem:

we get heajus-Oslo instead of heajos-Oslo
we get lojis-Oslo instead of lojes-Oslo
we get álkis-Oslo instead of álkes-Oslo

The fix here is the same as for smj below: Hyphen must be included in the right context triggering lowering.

Third issue: stuorra-oslolaš does not work (that is, initial lower case oslolaš after compound with hyphen - there is always hyphen in front of name-based words).

TODO:

Actio compounds: The disamb crew is satisfied. Now it is up to the divvun folks to see whether it is too hard to lexicalise ( Thomas, Maaren, Steinar)
Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
fix stuorra-oslolaš lower case o ( Sjur, Thomas, Trond)

Numbers:

Needs to be included in the non-rec transducers, and their inflection needs to be tested. Test specification:

inflection (case, number, ordinality)
generation of higher numerals (not inflected)

TODO:

include numbers in the non-recursive transducers (i.e. split the recursive and the non-recursive part of the numerals) (Trond, Thomas)
Set up test bed for numerals, test and revise (who?)
Make a test bed make num-paradigm ( Trond)
Go through the Num bugs (Trond, Thomas, Steinar)

Hyphenation problem

Vowel combinations that are not defined as diphthongs behave like diphthongs anyway, ie. they don't get hyphenated. In North sámi we have this problem when there is a or u in second position:

laam
laam    laam (expected: la-am)
liam
liam    liam
luam
luam    luam
lyam
lyam    lyam
lium
lium    lium
luum
luum    luum
lyum
lyum    lyum

For Lule sámi ALL Vowel combos behave like diphtongs.

South Sámi needs supervising, too, there are reports indicating it has onset maximising rather than coda maximising.

TODO:

write diphthong hyphenation pseudocode (Thomas)
rewrite hyphenation code (Trond)
ask Ove Lorentz to report on our sma hyphenator (Trond)

Lule Sámi

We had a twol problem -- now fixed by Thomas. We still have a problem with actor nouns of type juhttsat:juhttse: when hyphened we get juhttsa-muorra instead of juhttse-muorra. There is a similar problem in sme with adjectives (see above).

The fix is to include the hyphen in the right context triggering vowel fronting.

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
include numbers in the non-recursive transducers
Set up a test bed for numerals, test and revise (who?)

8. Name lexicon infrastructure

Sjur continued refactoring and redesign of the risten.no code: risten.no now has full i18n support, is using the latest Forrest to overcome several shortcomings of the previous version, and is soon ready for a more flexible editor.

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in the meeting memo.

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

TODO:

try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
1. soon ready: -)
finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils ( linguists)

9. Spellers

Polderland data generation

TODO:

decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
add closed POS and clitics to PLX generation (Børre, Tomi)
add compound stems to the PLX generation (Børre, Tomi)
add derivations to the PLX generation (Børre, Tomi)
Include numerals in the speller (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for testing Windows spellers (Børre, Sjur)

Localisation

TODO:

send translation text to Børre and Thomas ( Sjur)
- done
translate Windows installer text to sme and smj ( Børre, Thomas)
- Børre has started, Thomas (and perhaps Steinar and Maaren) will continue

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 15.1.2007, 09: 00 Norwegian time.

The meeting was closed at 10: 50.

Appendix - task lists for the next week

Boerre

contact authors who have already received the corpus licensing contract
continue work on script for automatic testing of the spell checker in Word
fix sme texts in corpus this month
find missing nob parallel texts in corpus
translate Windows installer to sme
fix bugs!
work on the Polderland data generation (PLX format conversion)
go through other directories, fix parallelity information for other documents

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

fix sme texts in corpus this month
send aligned, xml nob texts to Lars
add correction markup to the xml files (string-to-correction markup)
first new version of xml2lexc in Perl
fix bugs!

Sjur

name lexicon:
- rewrite the integration with forrest, to get a more flexible integration with proper i18n, solving some problems with the previous solution, and make a foundation for better search and editing interfaces.
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix stuorra-oslolaš lower case o
fix bugs!

Steinar

conversion error screening
missing lists
report conversion errors to Saara
Go through the Num bugs
Look at the actio compound issue when adding from missing lists
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
decide how to specify compounding behaviour info in the lexicon
translate Windows installer to sme and smj
Actio compounds: The disamb crew is satisfied. Now it is up to the divvun folks to see whether it is too hard to lexicalise
Lack of lowering before hyphen: Twol rewrite.
include numbers in the non-recursive transducers
Go through the Num bugs
Write diphthong hyphenation pseudocode
fix stuorra-oslolaš lower case o
fix bugs!

Tomi

add closed POS and clitics to PLX generation
add derivations to the PLX generation
add compound stems to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
decide how to specify compounding behaviour info in the lexicon
Set up work on missing and conversion screening with Steinar and Ilona.
fix sme texts in corpus this month
find missing nob parallel texts in corpus, go through Saara's list
report conversion errors to Saara
Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
Go through the Num bugs
Make numeral testbed
Rewrite hyphenation-code (pseudocode from Thomas) sme, smj
Get input on sma hyphenations
fix stuorra-oslolaš lower case o
include numbers in the non-recursive transducers for sme, smj
fix bugs!.