Meeting_2007-02-12

Meeting setup

Date: 12.02.2007
Time: 09:00 Norwegian time
Place: Internet
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:49.

Present: Børre, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

Maaren is working Tuesday and Friday this week.

2. Updated task status since last meeting

Børre

write form to request corpus user account
- not done
document how to apply for access to closed corpus, and details on the corpus and its use in general
- not done
update and fix our documentation and infrastructure as Steinar finds problem areas
- some done, received feedback from Steinar
continue work on script for automatic testing of the spell checker in Word
- not done
fix sme texts in corpus this month
- not done
find missing nob parallel texts in corpus
- not done
work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
  - done by Tomi
go through other directories, fix parallelism information for other documents
- not done
add sma texts to the corpus repository
- not done
move the G5 to the basement (Børre)
- didn't work out last week, because we needed the machine
fix bugs!

Maaren

lexicalise actio compounds

Saara

fix sme texts in corpus this month
- mostly done
continue aligning the rest of the parallel files
- continued
fix problems with xml2lexc if needed
fix bugs!

Sjur

name lexicon:
- refactor the rest of the SD-terms editor code
  - refocused to the propernoun things
- implement missing propnouns editing functions
  - started
- implement improvements decided upon in Tromsø
  - some done
hire linguist and programmer
- not done
publish corpus contracts and project infra as open-source on NoDaLi-sta
- not done
fix stuorra-oslolaš lower case o
- not done
write form to request corpus user account
- not done
document how to apply for access to closed corpus, and details on the corpus and its use in general
- not done
get an Intel Mac for Tomi
- not done
fix bugs!
- looked at some

Steinar

test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page.
- continued working, reported some problems
Complete the semantic sets in sme-dis.rle
- no work this week
missing lists
- no work this week
report conversion errors to Saara
- not done
Look at the actio compound issue when adding from missing lists
lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
Go through the Num bugs
- not done
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
work with compounding
- nothing this week
Lack of lowering before hyphen: Twol rewrite.
- nothing this week
Go through the sme Num bugs
- done
fix stuorra-oslolaš lower case o
- nothing this week
implement discontinuous case inflection for sme numbers
- soon finished
produce correct number base forms in the sme analyzer
- soon finished
fix bugs!
- worked

Tomi

improve numerals in the speller
- not finished
add prefixes to the PLX
add derivations to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- Not done.
fix sme texts in corpus this month
- Worked on this, final conference is over, now starts the long-term maintenance.
find missing nob parallel texts in corpus, go through Saara's list
- Will work more on this, now that we see how useful the tool is.
report conversion errors to Saara
- Done.
Go through the Num bugs
- Not done.
implement discontinuous case inflection for numbers
- Not looked into, Thomas has done this
produce correct number base forms in the analyzer
- Looked at some.
Write project presentation
- Done.
fix bugs!

3. Documentation

Børre has fixed some errors in the documentation, otherwise nothing new.

TODO:

write form to request corpus user account (Børre, Sjur, Trond)
document how to apply for access to closed corpus, and details on the corpus and its use in general (Børre, Sjur, Trond)
correct and improve the documentation based on feedback from Steinar (Børre)
- started

4. Corpus gathering

TODO:

sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
missing nob parallel texts should be added if such gaps are found (Børre, Trond)
Go through the list of missing or erroneous nob texts, based upon Saara's perfect list (Børre, Trond)
add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Alignment

Main news: We have a working parallel corpus online.

Notes about the interface (or lack of documentation): the first search field in the form must be filled in; to get the parallel texts in the search result, make sure to click "add phrase" and set the language to the other one.

TODO:

go through other directories (nob directories, sd directories), fix parallelism information for other documents (2 hours) (Børre)

Conversion issues

TODO:

report conversion errors to Saara (Trond, Steinar)

6. Infrastructure

Børre and Steinar have both started on the task of testing and correcting the documentation.

TODO:

test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page. (Steinar)
update and fix our documentation and infrastructure as Steinar finds problem areas (Børre)

7. Linguistics

Numbers:

Thomas is almost finished with correcting the number part of the sme analyzer.

TODO:

discontinuous case inflection in sme (but only for maximally three-part compound numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain) (Thomas)
- soon finished
produce correct number base forms in the sme analyzer (Thomas)
- soon finished
Go through the sme Num bugs (Thomas)

North Sámi

TODO:

lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Maaren)
fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- In Bugzilla? Yes. Which #? 350

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in the meeting memo.

Postponed:

TODO:

finish first version of the editing (Sjur)
1. working on it
test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
1. wrote a prototype xml2lexc converter in XQuery, just to test the performance and the complexity of the task - the result is quite promising, and might be a viable alternative to Perl/XML::Twig
implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
1. Sjur has a concrete suggestion for how to do this to ensure consistency between the different editors and servers, including emacs; basically it includes the following points, to be executed as a preflight on each commit:
  1. sort on entry ID using Unicode default (= sorting on character code)
  2. validate against DTD
  3. reformat using xmllint - that will ensure consistent whitespace
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils (linguists)
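
The preflight steps suggested above for data synchronisation (sort on entry ID, validate against the DTD, reformat with xmllint) could be sketched roughly as follows. This is only an illustration, not the agreed implementation: the element name "entry", the attribute name "id" and the function names are assumptions, since the memo does not specify them. Python compares strings by code point, which corresponds to "sorting on character code". Note that this sketch drops the DOCTYPE line and any comments outside the root element when rewriting the file, which is why the DTD is passed to xmllint explicitly.

```python
import subprocess
import xml.etree.ElementTree as ET


def sort_entries(xml_text, entry_tag="entry", id_attr="id"):
    """Return the XML with the root's entry children sorted by ID.

    Sorting uses plain string comparison, i.e. code point order.
    The tag and attribute names are assumptions for illustration.
    """
    root = ET.fromstring(xml_text)
    entries = [child for child in root if child.tag == entry_tag]
    others = [child for child in root if child.tag != entry_tag]
    entries.sort(key=lambda e: e.get(id_attr, ""))
    # Keep non-entry children first, then the sorted entries.
    root[:] = others + entries
    return ET.tostring(root, encoding="unicode")


def preflight(path, dtd_path):
    """Run the three preflight steps on one file before a commit.

    Raises subprocess.CalledProcessError if validation or
    reformatting fails. Requires xmllint to be installed.
    """
    # Step 1: sort on entry ID.
    with open(path, encoding="utf-8") as f:
        sorted_xml = sort_entries(f.read())
    with open(path, "w", encoding="utf-8") as f:
        f.write(sorted_xml)

    # Step 2: validate against the DTD.
    subprocess.run(
        ["xmllint", "--noout", "--dtdvalid", dtd_path, path], check=True
    )

    # Step 3: reformat for consistent whitespace.
    result = subprocess.run(
        ["xmllint", "--format", path], check=True, capture_output=True
    )
    with open(path, "wb") as f:
        f.write(result.stdout)
```

With code point order, upper case letters sort before lower case ones, and letters such as Á sort after the whole ASCII range. Every editor and server would produce the same order regardless of locale settings.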

9. Spellers

Polderland data generation

TODO:

send smj PLX data to Polderland (Børre, Tomi)
1. done
decide how to specify nouns requiring genitive first parts (Sjur, Thomas, Trond)
1. done
improve number conversion (Børre, Tomi)
1. working on it
add prefixes to the PLX (Børre, Tomi)
1. not yet
add derivations to the PLX generation (Børre, Tomi)
1. next after numbers are fixed
  1. not yet

OOo speller(s)

TODO after the MS Office Beta is delivered:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for Tomi (Sjur)
- not yet

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

translate beta release docs to sme (Thomas)
translate beta release docs to smj (Thomas)

Beta release

Tentative beta release: Thursday 15.2. - but it might be delayed till later in February, since we still have no beta from Polderland.

In the beta, sme is now Catalan, whereas smj is Basque.

DONE:

delivered PLX data of sme and smj including compounding
translated Windows installer to sme and smj

TODO:

write press release (Sjur)
add info to front page (incl. download links) (Børre)
write separate page with detailed info (incl. download links) (Børre)
test the beta release from Polderland thoroughly before it is released (all):
- download and installation
- documentation
- technical performance
- linguistic performance:
  - true positives
  - false positives
  - false negatives
  - suggestions
- all tests on both Mac and Win - Windows only (Børre, Sjur, Thomas)
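
The linguistic performance counts listed above can be summarised as precision and recall. The following is a minimal sketch under one assumption that the memo does not state: a "positive" is a word the speller flags as misspelled, so a false positive is a correct word that gets flagged and a false negative is a misspelling that is missed. The function name is hypothetical.

```python
def speller_scores(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a speller test run.

    precision: share of flagged words that really are misspelled
    recall:    share of real misspellings that were flagged
    """
    flagged = true_pos + false_pos
    actual_errors = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual_errors if actual_errors else 0.0
    return precision, recall
```

For example, a test text with 80 correctly flagged misspellings, 20 wrongly flagged correct words and 20 missed misspellings gives a precision of 0.8 and a recall of 0.8. Suggestion quality would have to be measured separately.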

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra as open-source on NoDaLi-sta (Sjur)

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

Moving G5

TODO:

move the G5 to the basement (Børre)

The KUNSTI conference

Thomas and Trond were there. The first presentation (for politicians) got some response. The second was more for insiders, i.e. the language technologists, but got only one response, about financing.

There was more informal feedback, both from Telenor and NTNU on text-to-speech, and from the text-based people on machine translation.

11. Next meeting, closing

The next meeting is 19.2.2007, 09:30 Norwegian time.

Sjur will be away next week on winter holidays, so Trond or Børre will head the next meeting.

The meeting was closed at 11:12.

Appendix - task lists for the next week

Børre

write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
update and fix our documentation and infrastructure as Steinar finds problem areas
continue work on script for automatic testing of the spell checker in Word
fix sme texts in corpus this month
find missing nob parallel texts in corpus
work on the Polderland data generation (PLX format conversion)
go through other directories, fix parallelism information for other documents
add sma texts to the corpus repository
move the G5 to the basement
add info to front page (incl. download links)
write separate page with detailed info (incl. download links)
fix bugs!

Maaren

lexicalise actio compounds

Saara

fix sme texts in corpus this month
continue aligning the rest of the parallel files
fix problems with xml2lexc if needed
have some holiday first
start improving the corpus interface for Sámi in Oslo.
fix bugs!

Sjur

name lexicon:
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
publish corpus contracts and project infra as open-source on NoDaLi-sta
fix stuorra-oslolaš lower case o
write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
get an Intel Mac for Tomi
write press release for the beta
fix bugs!

Steinar

test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page.
Complete the semantic sets in sme-dis.rle
missing lists
report conversion errors to Saara
Look at the actio compound issue when adding from missing lists
lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
Go through the Num bugs
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
work with compounding
Lack of lowering before hyphen: Twol rewrite.
fix stuorra-oslolaš lower case o
implement discontinuous case inflection for sme numbers
produce correct number base forms in the sme analyzer
translate beta release docs to sme and smj
fix bugs!

Tomi

improve numerals in the speller
add prefixes to the PLX
add derivations to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
fix sme texts in corpus this month
find missing nob parallel texts in corpus, go through Saara's list
Go through the Num bugs
fix bugs!