Meeting_2011-06-06

Meeting setup

  • Date: 6.6.2011
  • Time: 10.45 Norw. time
  • Place: Internet
  • Tools: SubEthaEdit, iChat

Saksliste

  • Divvun 2.2 & Polderland
  • Korpus
  • sma-normering + fst

Opening, agenda review, participants

  • Opened at 11.00.
  • Present: Børre, Sjur, Tomi, Thomas (Divvun only)
  • Absent: none
  • Secretary: Sjur; Børre

Divvun 2.2 & Polderland

Numerals taking a long time to compile, break the compilation. The solution would be to compile them once and add the result to svn.

sme - the following do not work:

iige		#382 - verb + clitic
Máretgo		#415 - nominals with clitic not accepted
gulloban		#622 - Words with clitic -ban not accepted

Oslobiila	Oslo-biila	#397 - name compounding

Finncomm		#400 - multiword names: Finncomm Airlines
Seskaröcd	Seskarö-cd	#805 - Nouns+acronyms
Brønnøysund-registarii		#930 - proper+nouns

ovda--Lott-

ovdda	ovdda-	#666 - guovtte- and njealje-

njealjilogát	njealljelogát	#804 - guovttilogát, njealjilogát

goalmmát	#962 - some ordinals not recognized

Lule sámi
NRK-Finnmarku	NRK-Finnmárko	#805 - Nouns+acronyms

Suggestions:
ENØK-Finnmárko  name+name
NRK-finnmárkok  acro+noun

where:
ENØK is a name
finnmárkok is a noun
finnmárkok	Finnmárkko+N+Prop+Plc+Sg+Gen+Der1+Der/k+N+Sg+Nom

SMJ is looking very good, only problem is name and acro compounding (see above). name+acro is fine

Polderland:

  • drop that fixes Windows+Office issues
  • ask about hyphenated suggestions

TODO:

  1. change build process to exclude closed POS's and closed sets
  2. look into ACRO compounding bug
  3. compile new spellers
  4. test

Korpus

Børre has:

Freecorpus:
9428 files processed, 81 errors among them
The errors were distributed like this:
checkxml_after_faulty 0 0% of errors
error_markup 0 0% of errors
convert2xml 1 1% of errors
xsl 8 10% of errors
character_encoding 0 0% of errors
intermediate 16 20% of errors
faulty_chars 44 54% of errors
text_categorization 0 0% of errors
checkxml_after_checklang 12 15% of errors
checklang 0 0% of errors
To find which files caused the errors, do the command
grep "Conversion failed" tmp/*.log

Boundcorpus:
51760 files processed, 1407 errors among them
The errors were distributed like this:
checkxml_after_faulty 0 0% of errors
error_markup 0 0% of errors
convert2xml 13 1% of errors
xsl 2 0% of errors
character_encoding 0 0% of errors
intermediate 11 1% of errors
faulty_chars 1379 98% of errors
text_categorization 0 0% of errors
checkxml_after_checklang 2 0% of errors
checklang 0 0% of errors
To find which files caused the errors, do the command
grep "Conversion failed" tmp/*.log

Sjur has:

Freecorpus:
Processing finished
9430 files processed, 442 errors among them
The errors were distributed like this:
checkxml_after_faulty 1 0% of errors
error_markup 0 0% of errors
convert2xml 28 6% of errors
xsl 8 2% of errors
character_encoding 0 0% of errors
intermediate 338 76% of errors
faulty_chars 55 12% of errors
text_categorization 0 0% of errors
checkxml_after_checklang 12 3% of errors
checklang 0 0% of errors
To find which files caused the errors, do the command
grep "Conversion failed" tmp/*.log

The main problem here is not so much the quality of the conversions as such, but that the conversion results varies so much between runs and computers This is also what we have seen earlier with Trond. This problem will be less intruding with the stable corpus repository (see below), but needs to be followed up.

Stable corpus conversion

  • goal: a repository of converted files that are guaranteed to meet certain quality measures
  • each file is added manually after testing the conversion result

Conversion & test procedure:

  • convert2xml without errors
  • ccat > analyser = less than 3 % unknown words
  • ccat > grep all paragraphs not ending in a sentence delimiter = less than 1 %
  • parallel files must have valid parallel pointers

Most important: the stable repository must not contain useless data!

TODO:

  • create stable/converted/ & stable/goldstandard/converted branch of the corpus repositories (Børre)
  • add tested and verified converted files to the stable branch (Tomi)
    • one sme-nob pair from $GTFREE/orig/sme/admin/ today
  • fill up the stable branch of GTFREE with as much parallel data as required by Francis for the Forv.ordbok project (Tomi; Børre)

Sma-normering + fst

Noun file updated with the SGM decisions that are clear to us, verbs and adjectives left. Not too much work.

TODO:

  • missing: 1-2 days
  • sma-normering: 1-2 days
  • work for sma-oahpa fst's