Meeting_2007-02-26

Meeting setup

  • Date: 26.02.2007
  • Time: 09.00 Norw. time
  • Place: Internet
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:10, continued at 13:02.

Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • write form to request corpus user account
    • Not done
  • document how to apply for access to closed corpus, and details on the corpus and its use in general
    • Not done
  • update and fix our documentation and infrastructure as Steinar finds problem areas
    • Begun working
  • continue work on script for automatic testing of the spell checker in Word
    • Not done
  • find missing nob parallel texts in corpus
    • Not done
  • work on the Polderland data generation (PLX format conversion)
    • Not done
  • go through other directories, fix parallelism information for other documents
    • Not done
  • add sma texts to the corpus repository
    • Not done
  • add info to front page (incl. download links)
    • Not done
  • write separate page with detailed info (incl. download links)
    • Not done
  • fix bugs!

Maaren

  • lexicalise actio compounds
    • done some

Saara

  • fix sme texts in corpus this month
  • continue aligning the rest of the parallel files
  • fix problems with xml2lexc if needed
  • have some holiday first
    • done
  • start improving the corpus interface for Sámi in Oslo.
  • fix bugs!
    • done some
  • other
    • started writing article(s)

Sjur

  • name lexicon:
    • refactor the rest of the SD-terms editor code
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
    • synchronisation between cvs and running db-s
  • hire linguist and programmer
  • publish corpus contracts and project infra as open-source on NoDaLi-sta
  • fix stuorra-oslolaš lower case o
  • write form to request corpus user account
  • document how to apply for access to closed corpus, and details on the corpus and its use in general
  • get an Intel Mac for Tomi
  • write press release for the beta
  • fix bugs!
  • other:
    • was on Winter Holiday last week
    • did some work on the beta release

Steinar

  • test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page.
    • tested most, reported some major problems, info about access to corpus is still missing
  • Complete the semantic sets in sme-dis.rle
    • no work this week
  • missing lists
    • no work this week
  • report conversion errors to Saara
    • not done
  • Look at the actio compound issue when adding from missing lists
    • not done
  • lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
  • Go through the Num bugs
    • not done
  • fix bugs!

Thomas

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
    • not done
  • work with compounding
    • begun adding tags
  • Lack of lowering before hyphen: Twol rewrite.
    • not done
  • fix stuorra-oslolaš lower case o
    • not done
  • implement discontinuous case inflection for sme numbers
    • finished
  • produce correct number base forms in the sme analyzer
    • finished
  • translate beta release docs to sme and smj
    • not done
  • fix bugs!
    • fixed some

Tomi

  • improve numerals in the speller
  • add prefixes to the PLX
  • add derivations to the PLX generation
  • fix bugs!

Trond

  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
    • Not done
  • fix sme texts in corpus this month
    • Worked on getting an overview
  • find missing nob parallel texts in corpus, go through Saara's list
    • Worked on getting an overview
  • Go through the Num bugs
    • No bugs resolved
  • fix bugs!
    • No bugs resolved
  • Last week was speller week.

3. Documentation

The open documentation issues fall into these three categories:

  • Beta documentation for testers
  • Documentation for the online corpora
  • General documentation improvement after Steinar's test (for open-source release)

TODO:

  • write form to request corpus user account (Børre, Sjur, Trond)
  • document how to apply for access to closed corpus, and details on the corpus and its use in general (Børre, Sjur, Trond)
  • correct and improve it based on feedback from Steinar (Børre)

4. Corpus gathering

The disamb project would like parallel texts relevant for terminological work, i.e. bilingual domain-specific texts. Using our parallel corpus (interface), we can give back a much-requested feature, filling some of the gaps in risten.no, as well as providing data for terminology workers.

TODO:

  • sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
  • missing nob parallel texts should be added if such holes are found (Børre, Trond)
  • Go through the list of missing or erroneous nob texts, based upon Saara's perfect list (Børre, Trond)
  • add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Alignment

The aligner output has too many errors. Possible tasks to improve it have been added to the TODO list below.

TODO

  • go through other directories (nob directories, sd directories), fix parallelism information for other documents (2 hours) (Børre)
  • Improve the automatic process:
    • Improve the anchor list and realign (Trond, Børre)
    • Only adding words does not improve alignment; the format has to be considered as well. If a word is truncated with a star (e.g. guo*) too early, the wrong word can be selected.
    • The documents still have some formatting issues which cause trouble in alignment (in some documents tables are included in the text, in others not, etc.).
    • Test and improve settings in the aligner
  • Align manually, especially shorter terminological texts (Trond, Steinar); this is a good idea. Saara will look for troublesome texts and prepare them for manual alignment.

Conversion issues

TODO:

  • report conversion errors to Saara (Trond, Steinar)
    • should be done anyway, all the time, by all of us, for all tools :-)

6. Infrastructure

Steinar has gone through all of the documentation except corpus access, written a report and sent it to Børre. The next step is to go through the report and fix whatever should be fixed.

TODO:

  • test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page. (Steinar)
    • done
  • add report to gt/doc/infra/, probably as infrareport.jspwiki (Steinar)
  • update and fix our documentation and infrastructure as Steinar finds problem areas (Børre)

7. Linguistics

Numbers:

TODO:

  • discontinuous case inflection in sme (but only for maximally three-part compound numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain) (Thomas)
    • done
  • produce correct number base forms in the sme analyzer (Thomas)
    • done
  • Go through the sme Num bugs (Thomas)
    • done

North Sámi

TODO:

  • lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Maaren)
    • continuous work
  • fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
    • postponed till after the public beta

Lule Sámi

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
    • Nothing new.

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo.

lexc2xml conversion bug - how do we convert complex entries with multiple inflections or multiple stems, or both?

<entry id="Budejju">
  <infl lexc="ACCRA">
    <stem>Budej3ju</stem>
  </infl>
  <infl lexc="ACCRASUB" type="secondary">
    <stem>Budeju</stem>
  </infl>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log/>
</entry>

Where do we draw the line between "words" and variants belonging to the same entry? Cf:

Trondheim  ?
Trondhjem  ?
Nidaros    (separate entry)

An example where stem variation is just an expression of paradigm variation:

Buckingham^shire:Buckingham^shire3 ACCRA-plc ; !SUB
Buckingham^shire:Buckingham^shire ACCRA-plc ;
! Note: the final vowel e3, used for e-stems that NEVER may have illatives
Buckinghamshire+N+Sg+Ill
Buckinghamshire+N+Sg+Ill        Buckinghamshirii
Buckinghamshire+N+Sg+Ill        Buckinghamshirej <== sic!

In SD-terms, Budejju and Budeju would be stored as separate entries, but linked to each other, and with one marked as a sub of the other. Thus:

<entry id="Budejju">
  <infl lexc="ACCRA">
    <stem>Budej3ju</stem>
  </infl>
  <variant ref=""/>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log>
    <comment date="date-of-conversion" who="xxx">
      Comment text
    </comment>
  </log>
</entry>
<entry id="Budejju" type="secondary">
  <infl lexc="ACCRASUB">
    <stem>Budeju</stem>
  </infl>
  <variant ref=""/>
  <senses>
    <sense ref="Budejju" sem="plc"/>
  </senses>
  <log/>
</entry>

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

MS Office speller

There are no language codes available for Office 2004, and there won't be any, if I interpret the e-mail I just received correctly. They might be added to Office 2008 (at least they will try). This is the reply we got from Microsoft:

Apologies about the delay. This is indeed a shortcoming of the application.
The product team is currently looking at how it can be addressed moving
forward.

For Office 2004, the only global language setting I'm aware of is the one in
the Custom tab in the document properties window.  There you can set the
Name field to "Language" then set the Type to text and enter the value
"Saami."  Obviously this doesn't provide object-model support via language
settings.

I hope this helps a little...

The official Microsoft language codes for Sámi languages are:

hex    dec    language-country combo
----   ----   ----------------------
043b   1083   Northern Sami - Norway
083b   2107   Northern Sami - Sweden
0c3b   3131   Northern Sami - Finland
103b   4155   Lule Sami - Norway
143b   5179   Lule Sami - Sweden
183b   6203   Southern Sami - Norway
1c3b   7227   Southern Sami - Sweden
203b   8251   Skolt Sami - Finland
243b   9275   Inari Sami - Finland

OOo speller(s)

TODO after the MS Office Beta is delivered:

  • add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
  • study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

Precision and recall testing

A testbed has been set up (Trond), and some texts are marked for errors and corrections (Steinar). Versions alpha, beta 0.1 and beta 0.2 have been tested.

Types of tests:

  1. Technical testing
  2. Testing for linguistic functionality
  3. Testing for lexical coverage
  4. Testing for normativity
  5. Testing the suggestions

TODO:

  • get an Intel Mac for Tomi (Sjur)
    • not yet
  • Include a testbed and results in the cvs (gt/doc/proof/spelling/testing) (Trond, Børre)
    • textid - nu_wds - tp - fp - tn - fn - prec - rec - acc - spellid - ref_to_txt
  • Store the tested texts, for reference (Trond, Børre)

The tester should identify these 4 values:

  • wds - number of words in the text
  • tp - correctly identified errors
  • fp - correctly written but marked as errors
  • fn - errors not marked as such

The spreadsheet will then calculate precision, recall and accuracy. Steinar has marked some texts like this: errors are makred§marked with the paragraph sign followed by the correct form. Way of finding precision: take out the § entries and evaluate them for tp and fp. Way of finding recall: remove the § entries and count the fn among the remaining words. Then fill in and collect the results.
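
The §-markup lends itself to simple scripting. A minimal sketch (in Python, assuming the markup format described above and a hypothetical set of word forms flagged by the speller; not an existing project script) of how the tp/fp/fn cells could be collected before they are entered into the spreadsheet:

import re

# Gold-standard markup: the misspelling immediately followed by "§"
# and the correct form, e.g. "Errors are makred§marked ...".
MARKUP = re.compile(r"(\S+)§(\S+)")

def count_cells(gold_text, flagged_words):
    """Return (tp, fp, fn) for one §-marked text, given the set of
    word forms the speller flagged (hypothetical input from a test run)."""
    errors = {wrong for wrong, correct in MARKUP.findall(gold_text)}
    clean_words = [w for w in gold_text.split() if "§" not in w]

    tp = sum(1 for w in errors if w in flagged_words)       # real errors caught
    fn = len(errors) - tp                                   # real errors missed
    fp = sum(1 for w in clean_words if w in flagged_words)  # correct words flagged
    return tp, fp, fn

# Example: one marked error caught, one false alarm -> (1, 1, 0)
print(count_cells("Errors are makred§marked in the text",
                  flagged_words={"makred", "text"}))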

Testing of suggestions should follow the same lines:

  • errs - number of errors in the text
  • tp - the intended word is among the suggestions
  • fp - the intended word is not among the suggestions
  • fn - no suggestions
  • tn - (not relevant??)

Ordering of suggestions:

  • place in the list of the intended correction
    • ordered first
    • ordered top-five
    • ordered below top-five

"Perceived Quality", ie for all recognised errors/tp:

  • number of correct suggestions at top
  • number of correct suggestions among top-five
  • number of correct suggestions below top-five
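
For the suggestion tests, the counting can follow the rank buckets listed above. A minimal sketch, assuming the intended correction and the speller's ordered suggestion list are available for each detected error (the function and its labels are hypothetical, not project code):

def suggestion_bucket(intended, suggestions):
    """Classify one detected error by where the intended correction
    appears in the ordered suggestion list."""
    if not suggestions:
        return "no-suggestions"       # fn in the suggestion test
    if intended not in suggestions:
        return "missing"              # fp in the suggestion test
    rank = suggestions.index(intended)
    if rank == 0:
        return "top"
    if rank < 5:
        return "top-five"
    return "below-top-five"

# Example: the intended form is second in the list -> "top-five"
print(suggestion_bucket("marked", ["market", "marked", "marker"]))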

Testing on unseen texts

We need to use unknown texts in order to measure the performance of the speller.

Regression tests

We need to ensure that we do not take steps backwards, i.e. all known spelling errors in the corpus should be correctly identified, with a proper suggestion among the top five. For this purpose we can use the regular corpus with correction markup.

Storing test texts

Test texts should be stored in the corpus catalogue, separated from the ordinary corpus files. They should be marked as to whether their unknown words have been added to the lexicon or not (in the former case, they cannot be used for testing of performance any more, only for regression testing). When the words have been added, the whole text can be transferred to the regular corpus repository.

TODO:

  • get speller test tool from Polderland (Sjur)
  • Set up (sub)directories (Børre, Saara)
  • Add potential test texts (Børre, Thomas, Trond, anyone, really)
  • Manually mark them for typos (making them into gold standards) (Steinar, Maaren)
  • Format the added texts in appropriate ways - use our existing xml format, with correct markup as decided earlier (the only thing that separates these documents from regular corpus documents is the directory (tree) in which they reside), thus regular corpus conversion tools, plus error markup (Børre, Saara, Steinar?)
  • Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Sjur, Børre)
  • Set up test record page in gt/doc/proof/spelling/testing/ (Børre)
  • Conduct tests on new beta versions on the basis of the unspoiled gold standard documents (whoever has time), and fill in data from testing (the testers: who?)
  • alternatively: make test scripts that will run the tests automatically, collect the numbers, and transform them into test results (who?), depending upon the functionality of the Polderland tools.

  • include the ones already tested in the testing/ catalogue
  • test 0.3 on the same texts
  • test each version before beta release

The b0.3 / 2007.02.26 version

Subjective impression: With the b0.3 version we have surpassed the alpha version as far as useful tools go. Trond will now (for the first time) migrate from Dutch (Alpha) to Catalan (latest internal beta candidate) in his daily work of producing lectures in Sámi.

Writing biillaviessu, ránskkabiila (gen+nom) gives an error, with the suggestion biilaviessu (nom+nom). Writing biilaviessu, ránskabiila (nom+nom) is accepted. Given this, it seems compounding behaves as it should, i.e. according to the default compounding scheme. No entries have received compound override tags yet, thus the default behaviour is expected everywhere.

SgNom does not have any L tag, but SgNomCmp does. Thus, biilaviessu is NOT a fp; it is correctly recognised and corrected, as shown above. The PLX entries look as follows:

sát^ne^gir^ji NIR       +N+Sg+Nom
sát^ne^gir^ji GaIALR    +N+Sg+Gen
sát^ne^gir^ji GaAL      +SgGenCmp


vies^su NIR
bii^la  NIR     +N+Sg+Nom
biil^la GaIALR
bii^la  NAL     +SgNomCmp
biil^la GaAL

The two bii^la entries above in effect create one entry bii^la NIALR with the correct compounding info. Thus, basic compounding is now working as expected, i.e. according to the defaults.

Known errors:

  • clitics do not work with W class words (uninflected words)
    • two options:
      • generate these with clitics (increases the number of words from 6 700 to 100 000) (Tomi, Saara)
      • ask Polderland to look at it - Sjur will do that

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

  • translate beta release docs to sme (Thomas)
  • translate beta release docs to smj (Thomas)

Lexicon conversion to the PLX format

We need to test that the conversion is correct and gives expected results in all cases, especially regarding compounding and derivation. For that we need a small set of test entries in lexc format, and the corresponding expected output in PLX format. By comparing the actual output with the expected output we get a measure of the quality of the conversion.
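
A sketch of the comparison step, assuming the lexc test sample has already been run through the conversion and both the expected and the actual PLX output exist as plain-text files (file names and layout are assumptions; a check like this could then be wired into the make file):

import sys

def compare_plx(expected_file, actual_file):
    """Compare expected and actual PLX output as sets of lines and
    report the entries that differ; returns the number of mismatches."""
    with open(expected_file, encoding="utf-8") as f:
        expected = {line.rstrip("\n") for line in f if line.strip()}
    with open(actual_file, encoding="utf-8") as f:
        actual = {line.rstrip("\n") for line in f if line.strip()}

    missing = expected - actual    # entries we should produce but do not
    spurious = actual - expected   # entries we produce but should not
    for line in sorted(missing):
        print("MISSING :", line)
    for line in sorted(spurious):
        print("SPURIOUS:", line)
    return len(missing) + len(spurious)

if __name__ == "__main__":
    # e.g.: python compare_plx.py testsample-expected.plx testsample-actual.plx
    sys.exit(1 if compare_plx(sys.argv[1], sys.argv[2]) else 0)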

TODO:

  1. add derivations to the PLX generation (Tomi)
  2. add prefixes to the PLX (Børre)
  3. middle nouns (Børre)
  4. make conversion test sample; add conversion testing to the make file (Tomi)
  5. improve number conversion (Børre, Tomi)

Public Beta release

Tentative public beta release: due to the initial linguistic bugs and poor coverage, it has been moved to Thursday 15.3, this time with derivations and numbers included :-)

Internal deadlines:

  • A date for when lexical updates should be checked in, in order to make it to the beta.
  • A plan for how many pre-betas we should compile, and when(?)
    • alpha = Dutch (sme) + French (smj)
    • beta 0.1 = the first Catalan (sme) + Basque (smj)
    • beta 0.2 = the second Catalan
    • beta 0.3 = 26. or 27.: compound beta
    • beta 0.4 = 2.3.: first derivation beta, also including numbers, prefixes.
    • beta 0.5 = 7.3.: final derivation beta, also including middle nouns

Linguistic issues still open:

  • derivations (Tomi)
  • numbers 1-20 (Børre)
  • prefixes (eahpe, ii-) (Børre)
  • middle nouns (LEXICON: lexc: Rmiddle, plx: L) (Børre)

The middle nouns are: beai, beal, geaš, oahpaheai, oai, vuol. They are also marginally used initially (not found in the corpus):

 beai+ShCmp:beai  Rreal ; (not used init in our corpus)
 beal+ShCmp:beal  Rreal ; (init with Num -goalmmat, -guđát, -nuppi, lexicalized)
 geaš+ShCmp:geaš  Rreal ; (not used init in our corpus)
 oahpaheai+ShCmp:oahpaheai  Rreal ;  init, but then actually 2-part 
 oai+ShCmp:oai     Rreal ;  (not used in corpus init; oaivuolli (SUB? yes!))
 vuol+ShCmp:vuol  Rreal ;  (not used in our corpus)

The PLX format does not allow encoding a stem as middle-only. For the public beta we will encode them as Left-only (which is really non-right), and evaluate their effect on the quality of the speller as we progress.

DONE:

  • delivered PLX data of sme and smj including compounding
  • translated Windows installer to sme and smj
  • installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for sme and one for smj)
  • added resources needed for compiling PLX lexicons to our cvs repo
  • tested the beta drop from Polderland - good we did, it is absolutely unacceptable (our responsibility - only linguistic errors (poor coverage) found so far)

TODO:

  • write press release (Sjur)
    • done first draft, see xtdoc/sd/.../xdocs/pr/
  • add info to front page (incl. download links) (Børre)
  • write separate page with detailed info (incl. download links) (Børre)
    • Sjur wrote a start
  • compile new speller lexicons using the mklex* tools on the G5, following the instructions given in the documentation from Polderland (in mklex beta drop) (Tomi, Børre)
    • done regularly now
  • add compilation of MS Office spellers part of the Makefile (Tomi)
  • install Windows and MS Office; test tools on Windows (Børre, Thomas)
  • collect a list of PR recipients, forward to Berit Karen Paulsen
  • questions for Polderland (Børre):
    • version info in the speller?
    • remaking/updating the installer packages with linguistic updates - who?

Version identification of speller lexicons

See the Norwegian spellers for an example, with the trigger string tfosgniL.

Suggestion:

nuvviD -> Divvun
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007  (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek

These correction rules (and their corresponding PLX entries) should be added automatically to the PLX file and the phonetic file as part of the compilation process, to include build date and version number.
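
A sketch of how such metadata entries could be appended during compilation, reusing the nuvviD pattern from the list above; the file names, the version/tag handling and the exact PLX and phonetic-rule line formats are assumptions here, not the real Polderland syntax:

from datetime import date

def version_lines(version_tag, contributors):
    """Build the metadata values to attach to the nuvviD trigger string
    (spaces become underscores, as in the suggestion list above)."""
    values = ["Divvun", "Veršuvdna_" + version_tag,
              date.today().strftime("%d.%m.%Y")] + contributors
    return ["nuvviD -> " + v.replace(" ", "_") for v in values]

def append_metadata(plx_file, phon_file, version_tag, contributors):
    """Append the trigger entry to the PLX file and the correction rules
    to the phonetic file as the last step of compilation."""
    with open(plx_file, "a", encoding="utf-8") as plx:
        plx.write("nuvviD\n")   # placeholder PLX entry for the trigger string
    with open(phon_file, "a", encoding="utf-8") as phon:
        phon.write("\n".join(version_lines(version_tag, contributors)) + "\n")

# Example (hypothetical file names; the tag would come from cvs):
# append_metadata("sme.plx", "sme.phon", "1.0b1", ["Sjur_Nørstebø_Moshagen"])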

Conversion from LexC to PLX

Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 = 83 hrs
Nouns compile at 3 sec/noun,            i.e. (23600*3) / 3600 = 19 hrs

This is so far acceptable for nouns, but on the edge of being unacceptable for adjectives. These times will multiply when we add derivation, meaning it will then take more than a week to convert the major POSes from LexC to PLX.

We need to investigate why adjectives are so slow, and try to improve on the conversion speed.

Testing

Different ways of testing:

  1. Impressionistic, functionality: try the program, try all the functions
  2. Impressionistic, coverage: try the program on different texts, look for false positives
  3. Systematic (in order of importance):
    1. Make a corpus of texts, from different genres (can be done before 0.2 release)
      1. For each text, detect precision
      2. For each text, detect recall
      3. For each text, detect accuracy

Before beta release: precision is important, but have a look at recall as well.

Recall and precision

  • precision = tp / (tp + fp) = true redlines / all redlines
    • can we trust that the redlines are actually errors?
    • Task: check all hits
    • (test p, are they tp or fp?)
  • recall = tp / (tp + fn) = true redlines / all errors in doc
    • can we trust that all errors are actually found?
    • Task: check every single word
    • (test p, are they tp or fp, test n, are they tn or fn?)
  • accuracy = (tp + tn) / (tp + fp + fn + tn) = overall performance

Definitions:

  • true positives (correctly recognised misspellings)
  • false positives (correct words erroneously marked as misspellings)
  • false negatives (misspellings not recognised by the speller)
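
With the four cells in place, the three figures follow directly from the formulas above; a minimal helper sketch (new code, not the existing spreadsheet):

def metrics(tp, fp, fn, tn=0):
    """Precision, recall and accuracy from the cells defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Example: 40 errors caught, 10 false alarms, 10 misses, 940 clean words
print(metrics(tp=40, fp=10, fn=10, tn=940))  # -> (0.8, 0.8, 0.98)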

10. Other

Corpus contracts

TODO:

  • publish corpus contracts and project infra as open-source on NoDaLi-sta (Sjur)

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 5.3.2007, 09:30 Norwegian time.

The meeting was closed at 11:05 the first time, then at 14:55.

Appendix - task lists for the next week

Børre

  • write form to request corpus user account
  • document how to apply for access to closed corpus, and details on the corpus and its use in general
  • update and fix our documentation and infrastructure as Steinar finds problem areas
  • continue work on script for automatic testing of the spell checker in Word
  • fix sme texts in corpus this month
  • find missing nob parallel texts in corpus
  • add prefixes to the PLX conversion
  • add middle nouns to the PLX conversion
  • improve number PLX conversion
  • go through other directories, fix parallelism information for other documents
  • add sma texts to the corpus repository
  • add info to front page (incl. download links)
  • write separate page with detailed info (incl. download links)
  • Improve automatic alignment process
  • Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
  • Store the tested texts, for reference
  • Set up (sub)directories for speller test documents
  • Add potential speller test texts
  • Mark-up the added speller test texts, using our existing xml format
  • Set up ways of adding meta-information to speller test docs
  • Set up test record page in gt/doc/proof/spelling/testing/
  • fix bugs!

Maaren

  • lexicalise actio compounds
  • Manually mark speller test documents for typos

Saara

  • continue aligning the rest of the parallel files
  • prepare files for manual alignment
  • add ABBR, ACR, clitics to closed classes + ADV to paradigm generator
  • add correction markup to test documents
  • update lexc2xml with comment field
  • start improving the corpus interface for Sámi in Oslo.
  • Set up (sub)directories for speller test documents
  • Mark-up the added speller test texts, using our existing xml format
  • fix bugs!

Sjur

  • name lexicon:
    • refactor the rest of the SD-terms editor code
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
  • hire linguist and programmer
  • publish corpus contracts and project infra as open-source on NoDaLi-sta
  • fix stuorra-oslolaš lower case o
  • write form to request corpus user account
  • document how to apply for access to closed corpus, and details on the corpus and its use in general
  • get an Intel Mac for Tomi
  • write press release for the beta
  • get speller test tool from Polderland
  • Set up ways of adding meta-information to speller test docs
  • fix bugs!

Steinar

  • Beta testing: Align manually (shorter texts)
  • Manually mark speller test texts for typos (making them into gold standards)
  • Infrastructure test: add report to gt/doc/infra/, probably as infrareport.jspwiki
  • Complete the semantic sets in sme-dis.rle
  • missing lists
  • Look at the actio compound issue when adding from missing lists
  • Align corpus manually
  • fix bugs!

Thomas

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  • work with compounding
  • Lack of lowering before hyphen: Twol rewrite.
  • fix stuorra-oslolaš lower case o
  • translate beta release docs to sme and smj
  • Add potential speller test texts
  • fix bugs!

Tomi

  • add derivations to the PLX generation
  • make PLX conversion test sample; add conversion testing to the make file
  • improve number PLX conversion
  • fix bugs!

Trond

  • Participate in the beta testing setup
  • Test the beta versions
  • Work on the parallel corpus issues
    • Discuss with Anders
    • Work on the aligner with Børre
    • fix sme texts in corpus this month
    • find missing nob parallel texts in corpus, go through Saara's list
  • Postpone these tasks to after the beta:
    • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
    • Go through the Num bugs
  • Improve automatic alignment process
  • Align corpus manually
  • Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
  • Store the tested texts, for reference
  • Add potential speller test texts
  • fix bugs!