Meeting_2011-04-11
Corpus meeting 11.4.2011
Present: Berit Merete, Børre, Ciprian, Tomi, Trond
Agenda
- Algorithm for dealing with scanning errors
- Sentence alignment
- Analysed corpora on xserve
Goal: Functioning corpus
Algorithm for dealing with scanning errors
The process has not been run, and we thus do not have
Run the same routine for nob.
Missing in nob:
- vŽre, and all æøå: converted/nob/admin/guovda/1.doc.xml
- Note: The document is marked xml:lang="kal"
/home/apache_corpus/freecorpus/converted/sme/admin/depts/other_files
8.9000 26196 2334 STM200420050011000SE_PDFS.pdf.xml
    Rá Rá +?
    ehusa ehusa +?
    jahkedie jahkedie +?
    áhusáššiid áhusáššiid +?
8.4300 30893 2605 STM200420050044000SE_PDFS.pdf.xml
    jahkedie jahkedie +?
    áhus áhus +?
    Rá Rá +?
    ádallamat ádallamat +?
8.3300 7320 610 Reindrift_Omraadeprotokoll_til_konvensjon_mellom_Norge_Sverige_Nordsamisk.pdf.xml
7.1500 14438 1033 273777-raportti_saami.pdf.xml
6.0100 57637 3464 OTP200620070025000SE_PDFS.pdf.xml
5.6600 1535 87 faktablad_nordsamiska_wordversion.doc.xml
4.5900 8931 410 260965-h-2179s_2.pdf.xml
4.4800 3325 149 sami_rapporter_bruk_samisk_flagg_SA.pdf.xml
4.4700 18766 840 203210-q-1066_samisk_lav.pdf.xml
4.4600 3874 173 sami_rapport_sametinget_vedlegg4_SA.pdf.xml

/home/apache_corpus/freecorpus/converted/sme/admin/depts/regjeringen.no
30.4900 341 104 130-000-ruvnnu-kvena-proeavttaide.html_id=573764.xml
    Rejeerinki Rejeerinki +?
    anttaa anttaa +?
    rahhaa rahhaa +?
    Porsangin Porsangin +?
    kolmekieliselle kolmekieliselle +?
    laulukirjale laulukirjale +?
26.6600 30 8 plakater-til-valgdagen.html_id=575739.xml
26.6600 15 4 neahttakarta-.html_id=313865.xml
25.0000 12 3 neahttakarta.html_id=223274.xml
24.2400 33 8 nytt-og-nytting.html_id=544857.xml
23.5200 17 4 neahttakarta-.html_id=313868.xml
23.2500 43 10 neahttakarta-.html_id=313744.xml
22.8500 35 8 gulaskuddannotahtta.html_id=588787.xml
22.8500 35 8 adreassalistu.html_id=588788.xml
22.2200 18 4 ohcanveahkki-.html_id=446705.xml
21.8700 32 7 forskrifter.html_id=623.xml
21.4200 42 9 julebesok-til-oslo-fengsel.html_id=629537.xml

/home/apache_corpus/freecorpus/converted/sme/admin/guovda
50.0000 10 5 GUOVDAGEAINNU_NUORAIDSKUVLLA_OAHPAHEDDJIID_PLÁKÁHTTA.doc.xml
33.3300 12 4 GUOVDAGEAINNU_NUORAIDSKUVLLA_OHPPIID_PLÁKÁHTTA.doc.xml
29.9500 227 68 KS_áššelistu_24.06.2004.doc.xml
13.8400 65 9 Gártnetluohkka_ÁRVVOŠTALLANSKOVVI_22.04.03.doc.xml
12.3100 138 17 Bajasdoallansiehtadus_FKB-data_Guovdageainnu_suohkanis_05.05.05.doc.xml
10.3800 10409 1081 1_2.doc.xml
8.3500 431 36 vinterskole.doc.xml
8.3100 493 41 Sakspapirer_på_samisk_31.10.03.doc.xml
8.1300 209 17 MEAHCCESKUVLA.doc.xml
7.9600 427 34 Mearraskuvla.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/others
15.6800 1326 208 uito-ohpenplana.txt.xml
15.1500 66 10 Reglement_Djupvik_havn.doc.xml
13.0800 107 14 VÁLGADIKKI.doc.xml
10.6700 637 68 skuterløyer_2006.doc.xml
9.4300 53 5 valgalistut_almmuhus.doc.xml
8.9200 112 10 SKJEMA___AMBULLERENDE.doc.xml
8.8800 45 4 Oversetting,_følgebrev.doc.xml
7.3600 95 7 UTBETALINGSANMODNING.doc.xml
7.1700 237 17 RETN.LINJER___KULTUR.doc.xml
7.0500 85 6 Reguleringsplan.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/sd/other_files
38.1800 6270 2394 dc1990-4.pdf.xml
26.0400 14338 3734 satnelistu.doc.xml
25.3600 138 35 stedsnavn4.doc.xml
20.7600 6592 1369 dc1991-2.pdf.xml
15.0600 9294 1400 dč1994-2.pdf.xml
14.7100 9357 1377 dc1990-3.pdf.xml
14.2600 1311 187 dc1993-3.pdf.xml
13.9800 12240 1712 dc1990-1.pdf.xml
13.3600 11341 1516 dč1994-1.pdf.xml
13.1200 160 21 64547_1_P.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/sd/samediggi.no
40.0000 5 2 samediggi-article-788.html.xml
40.0000 5 2 samediggi-article-315.html.xml
40.0000 5 2 samediggi-article-227.html.xml
40.0000 5 2 samediggi-article-225.html.xml
27.5500 196 54 samediggi-article-2933.html.xml
25.0000 8 2 samediggi-article-3179.html.xml
21.7300 23 5 samediggi-article-3114.html.xml
20.0000 35 7 samediggi-article-3217.html.xml
18.5100 27 5 samediggi-article-3451.html.xml
17.5700 165 29 samediggi-article-3683.html.xml
17.0800 158 27 samediggi-article-2738.html.xml
16.3900 61 10 samediggi-article-505.html.xml
16.0000 25 4 samediggi-article-2485.html.xml
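Each file row in the listing carries a percentage, two counts and a file name, and the percentage matches the second count divided by the first. A minimal Python sketch for picking out the files most worth a manual look, assuming one row per line. The names `Row`, `parse_row` and `worst_files`, the field names `total` and `flagged`, and both default thresholds are hypothetical and not part of the routine discussed at the meeting.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    percent: float   # first column of the listing
    total: int       # second column (assumed: tokens in the file)
    flagged: int     # third column (assumed: tokens flagged by the check)
    filename: str


# A data row is: percentage, two integers, then a file name ending in .xml.
ROW_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\d+)\s+(\d+)\s+(\S+\.xml)\s*$")


def parse_row(line: str) -> Optional[Row]:
    """Return a Row for a data line, or None for directory and sample lines."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    return Row(float(m.group(1)), int(m.group(2)), int(m.group(3)), m.group(4))


def worst_files(lines: Iterable[str],
                threshold: float = 10.0,
                min_total: int = 100) -> List[Row]:
    """Rows at or above the threshold, ignoring very small files, worst first."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    picked = [r for r in rows if r.percent >= threshold and r.total >= min_total]
    return sorted(picked, key=lambda r: r.percent, reverse=True)
```

The `min_total` cut-off is there because very short files (for example 5 tokens, 2 flagged) reach high percentages on almost no evidence.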
Tomi to look into this, and discuss with Børre on unclear points.
TODO
- Fix large parts of this problem. (Tomi)
- Challenge: How to fix.
- Write a report late this week (Tomi)
Ultimate goal:
- fix the file conversion, or
- move the file to e.g. the gold corpus for scanning errors or whatever, or
- remove the file altogether
[dstroke] [dstr juoga oke]
Error reports
Scanning errors
I found this error yesterday:
<tuv lang="sme-NO"> <seg>Sámediggi gávnnaha 1unddo1ažžan ahte fy1kagielda váldá oasi giellanjuolggadusaid ovttastahttimii ja di1álašvuodaid 1 áhčimiidda gielddain mat gu11et doaibmaguv1ui Finnmárkku fylkkas .</seg>
And this:
<tuv lang="sme-NO"> <seg>Dan lassin lea bálkkašumi vuoiti vuođđudan alccesis duodjefitnodaga , ja lea máhtolašvuođainis ja hutkái ¬vuođainis ožžon alla árvvu duodjeealáhusas .</seg> </tuv>
And this - đ is missing:
<tuv lang="sme-NO"> <seg>daid ektui , ja ahte gielddat ieža oidnet dárbbu doallat aktiivvalaš oktavuo a Sámedikkiin go galgá bargat kulturhistorjjá sihkkarastimiin , duo aštemiin , dutkamiin ja gaskkustemiin .</seg> </tuv>
Same error - đ is missing:
<tuv lang="sme-NO"> <seg>Orru ahte dán gealdagasas dat lea sámi kultuvra vuoittahallan ja ahte eiseválddiid dáiddaáŋgiruššamat vuo uduvvojit minoritehtakultuvrra siskilkeahtesvuhtii .</seg>
Son !! boahtán (!! instead of ii)
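Several of the samples above share surface patterns that are easy to search for: digits inside otherwise alphabetic words, a stray ¬ sign, and !! in place of ii. A minimal Python sketch for flagging such tokens, assuming whitespace tokenisation. The pattern names and the function `flag_tokens` are hypothetical, and the pattern list is illustrative rather than the routine discussed at the meeting.

```python
import re
from typing import List, Tuple

# Illustrative patterns only, derived from the samples in these notes.
# [^\W\d_] matches any Unicode letter (word character minus digits and "_").
SUSPICIOUS = {
    "digit_in_word": re.compile(r"[^\W\d_]\d|\d[^\W\d_]"),
    "stray_not_sign": re.compile("¬"),
    "double_bang": re.compile(r"!!"),
}


def flag_tokens(text: str) -> List[Tuple[str, str]]:
    """Return (token, pattern name) pairs for whitespace-separated tokens."""
    hits = []
    for token in text.split():
        for name, pattern in SUSPICIOUS.items():
            if pattern.search(token):
                hits.append((token, name))
    return hits
```

Run on the first sample, this would flag tokens such as fy1kagielda and gu11et. It does not catch the missing đ cases, where the only trace is an unexpected space inside a word, and identifiers that legitimately mix letters and digits would be flagged as well.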
Corpus conversion
Status quo:
- Works on Linux
- Mac:
- Has problems with perl version xyz
WARNING - NO MATCH
This message shows up when converting orig, and the issue is still open.
Sentence alignment
New program
Trond has talked to Knut Hofland. We will get a new TCA2 version.
TODO
- Put the new version of TCA2 in svn (?, make it accessible)
- Update our general TCA2 documentation if the old is obsolete (all)
Anchor list
Trond had made an anchor.fst, which unfortunately was flawed. A new one
TODO
- Make a nob-based new anchor list. (Trond)
- Thereafter, translate to sme (Berit Merete)
- Divide the anchor list in two: a. general, b. topic-specific. (Trond, Berit Merete)
Analysed corpora on xserve
Has anyone checked the output? No.
The cronjob did this
TODO
- Make sure we have a fresh version on Thursday. (Børre)
Error report, have a look:
tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090039000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090039000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090043000DDDPDFS.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090043000DDDPDFS.pdf
tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
tmp/Samiske_tall_forteller_II_Norsk.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_II_Norsk.pdf
tmp/retningslinjerforverneplanarbeid_sametinget.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/retningslinjerforverneplanarbeid_sametinget.pdf

drwxr-xr-x 4 cipriangerstenberger staff 136 7 apr 22:54 orig
drwxr-xr-x 201 cipriangerstenberger staff 6834 11 apr 13:29 tmp
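The failures in the report fall into two kinds: files that could not be converted to the intermediate xml format, and files whose language could not be categorised. A minimal Python sketch for grouping such lines by failure reason, assuming one log entry per line in the form shown in the report and source paths without spaces. The function name `group_failures` and the `<file>` placeholder are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# "<log file>:Conversion failed: <message containing the source path>"
LINE_RE = re.compile(r"^(?P<log>\S+\.log):Conversion failed: (?P<message>.+)$")
# Assumption: the source path is absolute and contains no spaces.
PATH_RE = re.compile(r"/\S+")


def group_failures(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each failure reason (path replaced by <file>) to the affected paths."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip directory listings and other non-matching lines
        message = m.group("message")
        path_match = PATH_RE.search(message)
        path = path_match.group(0) if path_match else ""
        reason = PATH_RE.sub("<file>", message, count=1)
        groups[reason].append(path)
    return dict(groups)
```

Grouping by reason makes it easier to see whether a fix to the converter or to the language guesser would clear the larger share of the failures.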