Meeting_2011-11-07
Corpus meeting 7.11.2011
- Conversion status
- what is (not) included in prestable
- what does it look like
- Next tasks:
- sentence alignment
- targets:
- forvaltningsordbok
- Autshumato/translation memory
Conversion status
What is (not) included
In prestable:
- included in sme: admin/ facta/ laws/
- at least 85 % analysable by the main language analyser - this percentage is
- parallel documents with at least 30 words, and where the word count diff
- all parallel pointers are ok (in all directions)
Here we count sme words.
Prestable total is 1 612 856.
Catalogue | prestable | conv. locally@Trond | conv. apache@vic |
---|---|---|---|
admin/sd/samediggi.no/ | 197 676 | xxx | 233 667 |
admin/sd/other_files/ | 335 543 | 1 925 954 | 1 935 348 |
admin/depts/other_files | 571 250 | 1 573 658 | 1 311 716 |
admin/depts/regjeringen.no | 218 730 | 1 592 557 | 1 613 487 |
prestable/converted/sme/admin/others/laws | 13 521 | 631 361 | 649 426 |
Total | 1 336 720 | xxx | 5 743 644 |
This is enough to start doing sentence alignment, but it also shows that there
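The column totals in the table above can be re-checked with a few lines of Python. This is only a sketch: the numbers are copied from the table, the `xxx` cells are represented as `None`, and the names `rows` and `column_total` are made up for this illustration.

```python
# Word counts copied from the table above, in the order
# (prestable, converted locally@Trond, converted apache@vic).
# None stands for the "xxx" cells, where no count was recorded.
rows = {
    "admin/sd/samediggi.no/": (197_676, None, 233_667),
    "admin/sd/other_files/": (335_543, 1_925_954, 1_935_348),
    "admin/depts/other_files": (571_250, 1_573_658, 1_311_716),
    "admin/depts/regjeringen.no": (218_730, 1_592_557, 1_613_487),
    "prestable/converted/sme/admin/others/laws": (13_521, 631_361, 649_426),
}


def column_total(index):
    """Sum one column; return None if any cell in it is missing."""
    values = [counts[index] for counts in rows.values()]
    if any(value is None for value in values):
        return None
    return sum(values)


prestable_total = column_total(0)
local_total = column_total(1)
apache_total = column_total(2)

# Share of the converted material (apache@vic) that made it into prestable.
share = prestable_total / apache_total
print(prestable_total, local_total, apache_total, f"{share:.1%}")
```

The prestable and apache@vic sums agree with the Total row (1 336 720 and 5 743 644), so in these catalogues somewhat under a quarter of the converted words are in prestable. The locally@Trond total stays undefined because one of its cells is missing.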
What does it look like
Document structure
Law bug still open.
Text
For sme, ligatures represent 400 errors. Others?
Of the 1.6 million words, approximately 1/3 come from pdf files.
All words split across lines (hyphenated) in pdf files are still lost.
Sjur and Trond to look at all-caps xfst script.
Do not forget this conversion error (missing đ):
    ~/freecorpus$ ccat prestable/converted/sme/laws/other_files/finnmarkkulahka_lov_web.pdf.xml \
        | preprocess | usme | grep '?' | l
    Bántideapmi     Bántideapmi +?
    aktivan         aktivan +?
    váfistit        váfistit +?
    finnmárkkulága  finnmárkkulága +?
    mear            mear +?    <=================================== hyph error
    rida            rida +?
    fidnet          fidnet +?
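To follow up on errors like these across many files, the unknown words in the analyser output could be tallied with a small script. The sketch below assumes the output format seen above (word form, then analysis, with `+?` marking an unknown word). The function name `unknown_words`, the repeated `mear` line, and the analysed `lea` line are made up for illustration.

```python
from collections import Counter


def unknown_words(lookup_lines):
    """Count word forms that the analyser marked as unknown with '+?'."""
    counts = Counter()
    for line in lookup_lines:
        fields = line.split()
        # Assumed layout: first field is the word form, '+?' appears as
        # its own field after it when the analyser found no analysis.
        if len(fields) >= 2 and "+?" in fields[1:]:
            counts[fields[0]] += 1
    return counts


# Illustrative input: the unknown words come from the notes above,
# the last line is a hypothetical successfully analysed word.
sample = [
    "Bántideapmi\tBántideapmi\t+?",
    "aktivan\taktivan\t+?",
    "mear\tmear\t+?",
    "mear\tmear\t+?",
    "lea\tleat+V+IV+Ind+Prs+Sg3",
]

counts = unknown_words(sample)
for word, n in counts.most_common():
    print(n, word)
```

Sorting by frequency puts systematic conversion errors (ligatures, hyphenation fragments such as `mear`) at the top of the list, ahead of words that are merely missing from the lexicon.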
Next tasks
Sentence alignment
Børre has started sentence alignment of the entire prestable parallel corpus.
Targets
- forvaltningsordbok
- Autshumato/translation memory
    cd GTLANG/st/nob/src
    make abbr
and that's it. Then:
    cat nobtext | preprocess --abbr=st/nob/bin/abbr.txt
Todo
- sentence alignment (already started) (Børre)
- The files are in /home/boerre/freecorpus/tmp
- testing the alignment output (Berit Merete, Børre)
- nob morphological analysis, look at ob.fst vs. nob.fst (Trond, Sjur)
- Look through list of TCA2 improvements
- Improve anchor file.
- Parallelised corpus to tmx (Ciprian, Børre)
- giza / word alignment incl. documentation of the process (Trond, Francis)
- pdf conversion improvements, incl. hyphenation (Børre)
- improving conversion of dirty documents (Børre)
Milestones
- 8.11. First sentence parallelisation done
- 11.11. Testing first sentence parallelisation
- 11.11. Improve document conversion (?) and sentence parallelisation
- Issues: Ligature, Hyphens, others…
- Improve nob fst
- 11.11. Status quo and some attempts at improvement
- Choose between ob.fst and nob.fst
- Look at missing words, add them
- Look at compounding
- Before Giza: Deliver a for-now optimal fst for nob.
- Before Giza: Convert xml parallel to tmx.
- Set up giza
- First word parallelisation
- Testing
Next meeting
Friday at 10.