Parallel Corpus Check Fix
Check and fix parallel corpus
Do this if you find files in freecorpus that aren't parallel anyway (or how to
To find files with wrong sentence alignment:
1. Open a terminal and run tmx2html.sh, which converts all .tmx files to HTML files.
2. Go to:
3. To find out which files contain a given word, grep for a word in the sentence that
nob2sme $ grep -rl bransjep . | grep -v '.svn' | less
./admin/depts/regjeringen.no/aktuelt.html_id=166.tmx
./admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.tmx
./admin/depts/regjeringen.no/nyheter.html_id=174.tmx
./admin/depts/regjeringen.no/pressemeldinger.html_id=184.tmx
Check all the listed files
4. Check if correctly aligned:
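The search in step 3 can also be wrapped in a small shell function. This is only a sketch: the name files_with_word is made up for illustration, and it assumes you run it from the directory that holds the .tmx files (nob2sme in the example above). Unlike grep -v '.svn', which drops every line containing any character followed by "svn", it filters only on a literal /.svn/ path component.

```shell
# Hypothetical helper (not part of the corpus tools): list the files under
# a directory that contain a given word, skipping Subversion metadata.
files_with_word() {
    word=$1
    dir=${2:-.}
    # -r: recurse into subdirectories, -l: print only the names of matching files.
    # The second grep drops paths that pass through a .svn directory.
    grep -rl -- "$word" "$dir" | grep -v '/\.svn/' | sort
}
```

For example, files_with_word bransjep . | less should give the same kind of list as the command in step 3.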
NOW YOU CAN START TO FIX THE FREECORPUS
1. Find all the files with the same id number in orig:
~ $ cd freecorpus
$ find orig -name '*id=210*' | grep -v ".svn"
orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210
orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210.xsl
orig/nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210
orig/nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210.xsl
orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.xsl
orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210
orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.xsl
2. Check if nob and sme are parallel files:
In SubEthaEdit, press cmd+r to open the files in a web browser.
3. If the files are not parallel files, change the sme xsl-file and delete
4. Convert the file to xml to check whether there are any errors in the xsl-file:
5. If the conversion succeeds, check in the xsl-file:
6. Find the rest of the files with the same id number:
$ find orig -name '*id=210*' | grep -v ".svn"
orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210
orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210.xsl
orig/nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210
orig/nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210.xsl
orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.xsl
7. svn rm the files for every language except sme (6 files: 2 eng, 2 nno, 2 nob):
8. Check in the changes:
10. Find the rest of the files with same id number in prestable and delete
freecorpus $ find prestable -name '*id=210*' | grep -v '.svn'
11. Delete the files (5 files: 3 in converted, 1 in tmx and 1 in toktmx):
12. Check in the changes:
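Several of the steps above come down to listing the files that share an id number and then removing some of them. The following is a minimal sketch of that, assuming the orig/<lang>/... layout shown in the listings and that it is run from the freecorpus directory. The function names are invented for this example and are not part of the corpus tools. The second function only prints the svn rm commands, so they can be reviewed before anything is deleted.

```shell
# Hypothetical helpers, not part of the corpus tools.

# List every file under a tree (for example orig or prestable) whose name
# contains the given id number, skipping Subversion metadata directories.
files_with_id() {
    tree=$1
    id=$2
    find "$tree" -name "*id=${id}*" ! -path '*/.svn/*' | sort
}

# Print (do not run) one svn rm command for each file in orig with the
# given id, leaving out the language to keep (assumed default: sme).
print_svn_rm() {
    id=$1
    keep=${2:-sme}
    files_with_id orig "$id" | grep -v "^orig/${keep}/" |
        while IFS= read -r f; do
            printf 'svn rm %s\n' "$f"
        done
}
```

Typical use would be files_with_id orig 210, then print_svn_rm 210, then files_with_id prestable 210. As with the original find command, the pattern *id=210* also matches longer ids such as id=2100, so read the list before deleting anything. Paths containing spaces would need quoting before the printed commands are run.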