Meeting_2011-11-11
Corpus meeting 11.11.11. :-)
Agenda
- Status quo
- Evaluation
- Next steps
- More intelligent suggestions, anyone?
Status quo
- First conversion ready on Wednesday.
- Second conversion underway.
Improvements in sme and nob abbr files, and in anchor file.
cd prestable/tmx/smenob/
Here it seems the texts are different:
prestable/tmx/smenob/vuollasa-asahusat.html_id=115192.tmx.html
Filenames and directory structure
Root tmx dir:
prestable/tmx/SOURCELANG2TARGETLANG/
Below this point we follow the directory structure found elsewhere, i.e. GENRE/subdirs/file.tmx. This should give us:
prestable/tmx/nob2sme/admin/depts/regjeringen.no/xxx.tmx
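A minimal sketch of how such a path could be assembled from its parts. The function name `tmx_path` and its parameters are hypothetical and only illustrate the naming scheme described above; they are not taken from the existing conversion scripts.

```python
from pathlib import PurePosixPath


def tmx_path(source_lang, target_lang, genre_subpath, basename,
             root="prestable/tmx"):
    """Build ROOT/SOURCELANG2TARGETLANG/GENRE/subdirs/file.tmx.

    genre_subpath is the genre plus any subdirectories,
    e.g. "admin/depts/regjeringen.no" (hypothetical parameter).
    """
    lang_pair = f"{source_lang}2{target_lang}"
    return PurePosixPath(root) / lang_pair / genre_subpath / f"{basename}.tmx"


print(tmx_path("nob", "sme", "admin/depts/regjeringen.no", "xxx"))
# prints: prestable/tmx/nob2sme/admin/depts/regjeringen.no/xxx.tmx
```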
- Include bullet as sentence border???
TODO
- Check a corpus against different-size anchor files, and measure
- Add more words to anchor.txt, and possibly modularise / remove
- Common anchor file: anchor.txt
- Genre-specific files: anchor_admin.txt, anchor_bible.txt, …
- Harmonise the nob and sme abbr.txt files.
- Testcase: STM55,
- Improve encoding: Search for black question marks in the whole corpus (Børre)
- 13 files contain the black marks (see below).
Black question mark files:
$ grep -lr '�' * | grep -v '\.svn'
nob/admin/depts/other_files/OTP200620070025000SE_12.html.xml
nob/admin/depts/regjeringen.no/7-narmere-om-planbestemmelsene.html_id=571096.xml
sme/admin/depts/other_files/STM_TS007SA.pdf.xml
sme/admin/depts/regjeringen.no/10.html_id=458508.xml
sme/admin/depts/regjeringen.no/2011--rievdadeami-aigi-afghanistanas.html_id=604390.xml
sme/admin/depts/regjeringen.no/7.html_id=458471.xml
sme/admin/depts/regjeringen.no/aigeguovdil.html_id=1150.xml
sme/admin/depts/regjeringen.no/bismagodderait.html_id=449030.xml
sme/admin/depts/regjeringen.no/historihkka.html_id=861.xml
sme/admin/depts/regjeringen.no/horingsbrev.html_id=499754.xml
sme/admin/depts/regjeringen.no/raehus-rahkadahtta-samegiela-doaibmaplan.html_id=514922.xml
sme/admin/depts/regjeringen.no/sami.html_id=615757.xml
smj/admin/depts/other_files/HP_2009_samisk_sprak_lulesam.pdf.xml
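The same search could be scripted so that it can be rerun after each conversion. This is only a sketch: the function name is hypothetical, and it assumes the files are UTF-8, so that the black question mark (U+FFFD) appears as the byte sequence EF BF BD. Like the grep pipeline, it skips .svn directories.

```python
import os

# U+FFFD REPLACEMENT CHARACTER as UTF-8 bytes (assumes UTF-8 encoded files).
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")


def files_with_replacement_char(root):
    """Return sorted paths (relative to root) of files containing U+FFFD."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .svn directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                if REPLACEMENT_BYTES in handle.read():
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Reading the raw bytes rather than decoding keeps the check equivalent to the grep: files that merely contain invalid byte sequences are not flagged, only files where the replacement character has already been written into the text.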
Evaluation
Conversion
ŋ not converted: samediggi-article-3002.html.tmx.html:
Son gii biddjo virgái ferte hálddašit davviriikkalaš giela , sámegiela ja e ?
Personen som blir ansatt må beherske skandinavisk språk , samisk og engelsk .
mnd. is not sentence final:
Forøvrig tilsettes arbeidstakere etter gjeldende lover , reglement og overenskomster , herunder lønn og pensjon , samt 6 mnd . prøvetid .
Capital letter in names divides sentence: (boerre: I think the sentence division comes from the "." after the initial)
Mun doaivvun strategiija maid mii plánet váikkuha ahte bargu ollislaš ja dássásaš bálvalusain sámi álbmoga váste šaddá álkit ja beaktileappot , dadjá várrepresideanta Ragnhild L . Nystad .
Jeg håper strategien vi legger opp til vil bidra til at arbeidet med å oppnå helhetlige og likeverdige tjenester til det samiske folket vil bli lettere og mer effektivt , sier visepresident Ragnhild L . Nystad .
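The two sentence-division problems above (mnd. and the initial in Ragnhild L. Nystad) can be illustrated with a small sketch. It is not how the actual preprocessor works: it is a hypothetical splitter over already tokenised text, where the one-entry `ABBREVIATIONS` set stands in for the real abbr files and a single capital letter before a full stop is treated as an initial.

```python
# Hypothetical stand-in for the entries in the abbr files.
ABBREVIATIONS = {"mnd"}


def split_sentences(tokens):
    """Split a token list into sentences at "." tokens.

    A "." does not end the sentence when the preceding token is a known
    abbreviation or a single capital letter (assumed to be an initial).
    """
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token != ".":
            continue
        previous = tokens[i - 1] if i > 0 else ""
        is_abbreviation = previous.lower() in ABBREVIATIONS
        is_initial = len(previous) == 1 and previous.isupper()
        if is_abbreviation or is_initial:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

For example, `split_sentences("samt 6 mnd . prøvetid .".split())` keeps the text as one sentence. The obvious weakness is that an abbreviation or initial that really does end a sentence is no longer split, so a rule like this would need checking against the corpus.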
The 1-0 issue
There are two types of 1-0 cases:
- (1) The sentence is missing in the other language
- (2) The 1-0 status reveals an alignment error (the match is in the neighbouring pair)
Trond's impression: (1) is by far the most common.
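To check that impression against the data, the 1-0 cases could be pulled out together with their neighbouring pairs for manual inspection. The sketch below assumes the alignment is available as a list of (source, target) string pairs in which an empty string marks the missing side; that representation and the function name are assumptions, not the actual tmx layout.

```python
def one_to_zero_cases(pairs):
    """Collect 1-0 cases with their neighbours.

    pairs: list of (source, target) strings; an empty or whitespace-only
    string marks a missing side (assumed representation).
    """
    cases = []
    for i, (source, target) in enumerate(pairs):
        if bool(source.strip()) == bool(target.strip()):
            continue  # both sides filled, or both empty: not a 1-0 case
        cases.append({
            "index": i,
            "pair": (source, target),
            "previous": pairs[i - 1] if i > 0 else None,
            "next": pairs[i + 1] if i + 1 < len(pairs) else None,
        })
    return cases
```

A reviewer can then decide for each case whether the sentence is really missing or whether its match sits in the previous or next pair.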
Next meeting
Middle of next week.