Meeting_2011-11-11

Corpus meeting 11.11.11. :-)

Agenda

  • Status quo,
  • Evaluation
  • Next steps
  • More intelligent suggestions, anyone?

Status quo

  • First conversion ready wednesday.
  • Second conversion underway.

Improvements in sme and nob abbr files, and in anchor file. Made a style sheet to convert tmx files to html. Many errors in the tmx file are caused by bad abbr. handling in the nob files On time use: half the time goes to add sentences to the file, the other half to parallelise them Throughput on xserve yesterday: 25k bytes per minute Possible test: Check if the size of the anchor file affects the speed of tca2

cd prestable/tmx/smenob/ for file in *.tmx do xsltproc $GTHOME/gt/script/corpus/tmx2html.xsl $file > $file.html done

Here it seems the texts are different:

prestable/tmx/smenob/vuollasa-asahusat.html_id=115192.tmx.html

Filenames and directory structure

Root tmx dir:

prestable/tmx/SOURCELANG2TARGETLANG/

Below this point we follow the directory structure found elsewhere, ie GENRE/subdirs/file.tmx. This should give us:

prestable/tmx/nob2sme/admin/depts/regjeringen.no/xxx.tmx
  • Include bullet as sentence border???

TODO

  • Check a corpus against different-size anchor files, and measure time use. (Børre)
  • Add more words to anchor.txt, and possibly modularise / remove some files (Trond, Børre)
    • Common anchor file: anchor.txt
    • Genre-specific files: anchor_admin.txt, anchor_bible.txt, …
  • Harmonise the nob and sme abbr.txt files.
    • Testcase: STM55,
  • Improve encoding: Search for black question marks in the whole corpus (Børre)
    • 13 files contain the black marks (see below).

Black question mark files:

$ grep -lr '�' * | grep -v '\.svn'
nob/admin/depts/other_files/OTP200620070025000SE_12.html.xml
nob/admin/depts/regjeringen.no/7-narmere-om-planbestemmelsene.html_id=571096.xml
sme/admin/depts/other_files/STM_TS007SA.pdf.xml
sme/admin/depts/regjeringen.no/10.html_id=458508.xml
sme/admin/depts/regjeringen.no/2011--rievdadeami-aigi-afghanistanas.html_id=604390.xml
sme/admin/depts/regjeringen.no/7.html_id=458471.xml
sme/admin/depts/regjeringen.no/aigeguovdil.html_id=1150.xml
sme/admin/depts/regjeringen.no/bismagodderait.html_id=449030.xml
sme/admin/depts/regjeringen.no/historihkka.html_id=861.xml
sme/admin/depts/regjeringen.no/horingsbrev.html_id=499754.xml
sme/admin/depts/regjeringen.no/raehus-rahkadahtta-samegiela-doaibmaplan.html_id=514922.xml
sme/admin/depts/regjeringen.no/sami.html_id=615757.xml
smj/admin/depts/other_files/HP_2009_samisk_sprak_lulesam.pdf.xml

Evaluation

Conversion

ŋ not converted: samediggi-article-3002.html.tmx.html:

Son gii biddjo virgái ferte hálddašit davviriikkalaš giela , sámegiela ja e ? gelasgiela .

Personen som blir ansatt må beherske skandinavisk språk , samisk og engelsk .

mnd. is not sentence final:

Forøvrig tilsettes arbeidstakere etter gjeldende lover , reglement og overenskomster , herunder lønn og pensjon , samt 6 mnd .
	prøvetid .

Capital letter in names divides sentence: (boerre: I think the sentence division comes from the .)

Mun doaivvun strategiija maid mii plánet váikkuha ahte bargu ollislaš ja dássásaš bálvalusain sámi álbmoga váste šaddá álkit ja beaktileappot , dadjá várrepresideanta Ragnhild L .	
Nystad .

Jeg håper strategien vi legger opp til vil bidra til at arbeidet med å oppnå helhetlige og likeverdige tjenester til det samiske folket vil bli lettere og mer effektivt , sier visepresident Ragnhild L .
	Nystad .

The 1-0 issue

There are two types of 1-0 cases:

  1. The sentence is missing in the other language
  2. The 1-0 status reveals an alignment error (the match is in the neighbour pair)

Trond's impressionistic feeling: (1) is the overwhelmingly most common one.

Next meeting

Middle of next week.