Meeting_2007-01-03

Meeting setup

Date: 03.01.2007
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10: 42.

Present: Børre, Saara, Sjur, Steinar, Thomas, Trond

Absent: Maaren, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact authors who have already received the corpus licensing contract
- not done
continue work on script for automatic testing of the spell checker in Word
- not done
update setup and installation instructions for new users/computers
- not done
create new forrest tarball
- not done
cvs up of the public server should be done for xtdoc/sd/documentation/
- not done
fix bugs!
- not done
other:
- fixed link errors in gtuit/doc

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

help Trond with some shell commands
re-analyze parallel files
- done, aligner now works.
Move to Bugzilla: vislcg server-friendly as feature request to the vislcg devs
- done
fix bugs!
- done some

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
  - started work on a refactor of some basic code, to allow greater flexibility, easier coding, and hopefully a better interface
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
- combinatorial behaviour decided; specification of form still not done
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix forrest installations for Maaren, Disamb
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- we've begun
decide how to specify compounding behaviour info in the lexicon
- on our way
go through the class of GAHPIR words, and try to generalise the compounding behaviour
- done
change whatever is needed based on the above generalisation
- done
fix bugs!
- worked with bugs

Tomi

add closed POS and clitics to PLX generation
add derivations to the PLX generation
add compound stems to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- We have done some work on this.
Go through the GAHPIR lexicon (with Thomas)
- Done
get more sma texts
- No news.
decide how to specify compounding behaviour info in the lexicon
- At least we worked on this, what was the final outcome again?
fix bugs!.
- Done some work on bugs, closed a couple.

3. Documentation

TODO:

either fix forrest installations (Sjur), or create a new tarball ( Børre)
cvs up of the public server should be done for xtdoc/sd/documentation/ - notice the directory! (Børre)
update setup and installation instructions (Børre)

4. Corpus gathering

We should probably concentrate upon mending what we have, up to Febrary 8th.

Børre has been fixing the meta information for many files, adding info on parallel language versions.

TODO:

sme texts: Fix what we have for this month (Børre, Trond, Saara)
missing nob parallel texts could be an issue (Børre, Trond)

5. Corpus infrastructure

Aligner

The command line aligner is in place, and most(?)/all(?) potential texts are aligned.

TODO:

make an overview of status quo (Trond, Børre, Saara). Børre has added parallellity status to all admin/sd/ texts
gather more parallel texts (Trond, Børre)
- not any more? (depends on what we are missing and how hard they are to get, but we will not put much time into it)
re-analyze parallel files using the command-line version (Saara)
- progressing
when aligned, send aligned, xml nob texts to Lars ( Saara)

Conversion issues

Analysis statistics:

	words	missing	% recognised
adm	1493746	28967	98,06
bib	241233	2651	98,90
fac	382353	11371	97,03
fic	70601	7847	88,89
law	284700	6269	97,80
new	4032848	17134	99,58

There are still a lot of conversion errors, and fixing them should be a priority this month. Error types:

pdf files (there will always be some errors here)
string-string replacement during the xsl conversion
- Conversion errors (but we want to fix the conversion)
- typos (should be marked up using our error markup tags)
- grammatical errors (as above)

So far, we only have had character replacement (even context-less character replacement). We can use XSL string replacement to correct spelling errors to constructs like, cf the corpus.dtd and a previous meeting memo :

<error correct="corrected text">erroneous text</error>

Either we use the function Trond found, or we install and use Saxon 8 and XSL 2, which have good string manipulation tools.

Flow:

orig
xsl
- corr-of-textstring (not implemented yet) (we is > we are) (we keep the orig text, and the correction is added as additional info in an attribute, see above)
xml
preprocess
corr-of-token=typos.txt eror>error
analysis

Do we have good tools for localising conversion errors? Problem: I spot an error in the analysis. Which file is it in? ccat removes file source info, and we are thus in the blind. Conclusion: we don't have a good tool for this, only grep.

TODO:

add correction markup to the xml files (string-to-correction markup) ( Saara)
report conversion errors to Saara ( Trond, Steinar)

6. Infrastructure

Xerox tools wrapped as servers

TODO:

find a way of integrating vislcg as a server, or send a feature request to the vislcg developers (Saara)
- move this to Bugzilla
  - done

7. Linguistics

Names and multilinguality

TODO:

finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils ( linguists)

North Sámi

TODO:

go through the class of GAHPIR words, and try to generalise the compounding behaviour (Thomas)
- done
change whatever is needed based on the above generalisation ( Thomas, Trond)
- done

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
- progressing - good!
update proper noun lexicon with copy of sme lexicon (Trond)
- done

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in the meeting memo.

TODO:

develop the needed XQueries and UI (Sjur, Tomi)
try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Polderland data generation

TODO:

decide how to specify compounding behaviour info for the lexicon ( Thomas, Trond, Sjur)
- partially done - compound stem specification still open
add closed POS and clitics to PLX generation (Tomi)
add derivations to the PLX generation (Tomi)
add compound stems to the PLX generation (Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for testing Windows spellers (Børre, Sjur)

Localisation

The Windows installer should be localised. We have received the string file from Polderland, now they should be translated.

The Mac installer they use is the standard MacOS X installer, and as sme isn't generally turned on as a UI language in MacOS X, there is no big point in translating it. It could be done, though: -)

TODO:

send translation text to Børre and Thomas ( Sjur)
translate Windows installer to sme and smj ( Børre, Thomas)

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- still not done

Bug fixing

55 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

11. Next meeting, closing

The next meeting is 8.1.2007, 09: 00 Norwegian time.

The meeting was closed at 12: 12.

Appendix - task lists for the next week

Boerre

contact authors who have already received the corpus licensing contract
continue work on script for automatic testing of the spell checker in Word
recreate our forrest tarball
update setup and installation instructions for new users/computers
create new forrest tarball
cvs up of the public server should be done for xtdoc/sd/documentation/
fix sme texts in corpus this month
find missing nob parallel texts in corpus
aligner status quo
translate Windows installer to sme and smj
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

help Trond with some shell commands
fix sme texts in corpus this month
aligner status quo
send aligned, xml nob texts to Lars
add correction markup to the xml files (string-to-correction markup)
first new version of xml2lexc in Perl
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix forrest installations for Maaren, Disamb
send Windows installer translation text to Børre and Thomas
fix bugs!

Steinar

conversion error screening
missing lists
report conversion errors to Saara
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
decide how to specify compounding behaviour info in the lexicon
translate Windows installer to sme and smj
fix bugs!

Tomi

add closed POS and clitics to PLX generation
add derivations to the PLX generation
add compound stems to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
get more sma texts
aligner status quo
decide how to specify compounding behaviour info in the lexicon
Set up work on missing and converision screening with Steinar.
fix sme texts in corpus this month
find missing nob parallel texts in corpus
report conversion errors to Saara
fix bugs!.