Meeting_2011-04-11

Corpus meeting 11.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

Agenda

  • Algorithm for dealing with scanning errors
  • Setningsparallellisering
  • Analyserte korpora på xserve

Goal:Functioning corpus

Algorithm for dealing with scanning errors

The process has not ben run, and we thus do not have new results.

Run the same routine for nob.

Missing in nob:

  • vŽre, and all æøå: converted/nob/admin/guovda/1.doc.xml
  • Note: The document is marked xml: lang="kal"
/home/apache_corpus/freecorpus/converted/sme/admin/depts/other_files
8.9000 26196 2334 STM200420050011000SE_PDFS.pdf.xml
Rá      Rá      +?
ehusa   ehusa   +?
jahkedie        jahkedie        +?
áhusáššiid      áhusáššiid      +?

8.4300 30893 2605 STM200420050044000SE_PDFS.pdf.xml
jahkedie        jahkedie        +?
áhus    áhus    +?
Rá      Rá      +?
ádallamat       ádallamat       +?

8.3300 7320 610 Reindrift_Omraadeprotokoll_til_konvensjon_mellom_Norge_Sverige_Nordsamisk.pdf.xml
7.1500 14438 1033 273777-raportti_saami.pdf.xml
6.0100 57637 3464 OTP200620070025000SE_PDFS.pdf.xml
5.6600 1535 87 faktablad_nordsamiska_wordversion.doc.xml
4.5900 8931 410 260965-h-2179s_2.pdf.xml
4.4800 3325 149 sami_rapporter_bruk_samisk_flagg_SA.pdf.xml
4.4700 18766 840 203210-q-1066_samisk_lav.pdf.xml
4.4600 3874 173 sami_rapport_sametinget_vedlegg4_SA.pdf.xml

/home/apache_corpus/freecorpus/converted/sme/admin/depts/regjeringen.no
30.4900 341 104 130-000-ruvnnu-kvena-proeavttaide.html_id=573764.xml
Rejeerinki      Rejeerinki      +?
anttaa  anttaa  +?
rahhaa  rahhaa  +?
Porsangin       Porsangin       +?
kolmekieliselle kolmekieliselle +?
laulukirjale    laulukirjale    +?

26.6600 30 8 plakater-til-valgdagen.html_id=575739.xml
26.6600 15 4 neahttakarta-.html_id=313865.xml
25.0000 12 3 neahttakarta.html_id=223274.xml
24.2400 33 8 nytt-og-nytting.html_id=544857.xml
23.5200 17 4 neahttakarta-.html_id=313868.xml
23.2500 43 10 neahttakarta-.html_id=313744.xml
22.8500 35 8 gulaskuddannotahtta.html_id=588787.xml
22.8500 35 8 adreassalistu.html_id=588788.xml
22.2200 18 4 ohcanveahkki-.html_id=446705.xml
21.8700 32 7 forskrifter.html_id=623.xml
21.4200 42 9 julebesok-til-oslo-fengsel.html_id=629537.xml

/home/apache_corpus/freecorpus/converted/sme/admin/guovda
50.0000 10 5 GUOVDAGEAINNU_NUORAIDSKUVLLA_OAHPAHEDDJIID_PLÁKÁHTTA.doc.xml
33.3300 12 4 GUOVDAGEAINNU_NUORAIDSKUVLLA_OHPPIID_PLÁKÁHTTA.doc.xml
29.9500 227 68 KS_áššelistu_24.06.2004.doc.xml
13.8400 65 9 Gártnetluohkka_ÁRVVOŠTALLANSKOVVI_22.04.03.doc.xml
12.3100 138 17 Bajasdoallansiehtadus_FKB-data_Guovdageainnu_suohkanis_05.05.05.doc.xml
10.3800 10409 1081 1_2.doc.xml
8.3500 431 36 vinterskole.doc.xml
8.3100 493 41 Sakspapirer_på_samisk_31.10.03.doc.xml
8.1300 209 17 MEAHCCESKUVLA.doc.xml
7.9600 427 34 Mearraskuvla.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/others
15.6800 1326 208 uito-ohpenplana.txt.xml
15.1500 66 10 Reglement_Djupvik_havn.doc.xml
13.0800 107 14 VÁLGADIKKI.doc.xml
10.6700 637 68 skuterløyer_2006.doc.xml
9.4300 53 5 valgalistut_almmuhus.doc.xml
8.9200 112 10 SKJEMA___AMBULLERENDE.doc.xml
8.8800 45 4 Oversetting,_følgebrev.doc.xml
7.3600 95 7 UTBETALINGSANMODNING.doc.xml
7.1700 237 17 RETN.LINJER___KULTUR.doc.xml
7.0500 85 6 Reguleringsplan.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/sd/other_files
38.1800 6270 2394 dc1990-4.pdf.xml
26.0400 14338 3734 satnelistu.doc.xml
25.3600 138 35 stedsnavn4.doc.xml
20.7600 6592 1369 dc1991-2.pdf.xml
15.0600 9294 1400 dč1994-2.pdf.xml
14.7100 9357 1377 dc1990-3.pdf.xml
14.2600 1311 187 dc1993-3.pdf.xml
13.9800 12240 1712 dc1990-1.pdf.xml
13.3600 11341 1516 dč1994-1.pdf.xml
13.1200 160 21 64547_1_P.doc.xml

/home/apache_corpus/freecorpus/converted/sme/admin/sd/samediggi.no
40.0000 5 2 samediggi-article-788.html.xml
40.0000 5 2 samediggi-article-315.html.xml
40.0000 5 2 samediggi-article-227.html.xml
40.0000 5 2 samediggi-article-225.html.xml
27.5500 196 54 samediggi-article-2933.html.xml
25.0000 8 2 samediggi-article-3179.html.xml
21.7300 23 5 samediggi-article-3114.html.xml
20.0000 35 7 samediggi-article-3217.html.xml
18.5100 27 5 samediggi-article-3451.html.xml
17.5700 165 29 samediggi-article-3683.html.xml
17.0800 158 27 samediggi-article-2738.html.xml
16.3900 61 10 samediggi-article-505.html.xml
16.0000 25 4 samediggi-article-2485.html.xml

Tomi to look into this, and discuss with Børre on unclear points.

TODO

  • Fix large parts of this problem. (Tomi)
    • Challenge: How to fix.
  • Write a report late this week (Tomi)

Ultimate goal:

  • fix the file conversion, or
  • move the file to e.g. the gold corpus for scanning errors or whatever, or
  • remove the file altogether
[dstroke]
[dstr juoga
oke]

Error reports

Scanning errors

I found this error yesterday:
<tuv lang="sme-NO">
            <seg>Sámediggi gávnnaha 1unddo1ažžan ahte fy1kagielda váldá oasi giellanjuolggadusaid ovttastahttimii ja di1álašvuodaid 1 áhčimiidda gielddain mat gu11et doaibmaguv1ui Finnmárkku fylkkas .</seg>

And this:          
 <tuv lang="sme-NO">
            <seg>Dan lassin lea bálkkašumi vuoiti vuođđudan alccesis duodjefitnodaga , ja lea máhtolašvuođainis ja hutkái ¬vuođainis ožžon alla árvvu duodjeealáhusas .</seg>
         </tuv>
           
And this - đ is missing:
<tuv lang="sme-NO">
            <seg>daid ektui , ja ahte gielddat ieža oidnet dárbbu doallat aktiivvalaš oktavuo a Sámedikkiin go galgá bargat kulturhistorjjá sihkkarastimiin , duo aštemiin , dutkamiin ja gaskkustemiin .</seg>
         </tuv>

Same error - đ is missing: 
<tuv lang="sme-NO">
            <seg>Orru ahte dán gealdagasas dat lea sámi kultuvra vuoittahallan ja ahte eiseválddiid dáiddaáŋgiruššamat vuo uduvvojit minoritehtakultuvrra siskilkeahtesvuhtii .</seg>

Son !! boahtán (!! pro ii)

Corpus conversion

Status quo:

  • Works on Linux
  • Mac:
    • Has problems with perl version xyz

WARNING - NO MATCH

This message shows up when converting orig, and the issue is still open.

Sentence alignment

New program

Trond has talked to Knut Hofland. We will get a new TCA2 version.

TODO

  • Put the new version of TCA2 in svn (?, make it accessible) and document the installation (Børre + friends)
  • Update our general TCA2 documentation if the old is obsolete (all)

Anchor list

Trond had made an anchor.fst, which unfortunately was flawed. A new one is finished and ok, but not tested or checked in. The question now is whether to take nob or sme as a starting point.

TODO

  • Make a nob-based new anchor list. (Trond)
  • Thereafter, translate to sme (Biret Merete)
  • Divide the anchor list in two: a. general, b. topic-specific. (Trond, Berit Merete)

Analysed corpara on xserve

Has anyone checked the output? No.

The cronjob did this

TODO

Make sure we have a fresh version on thursday. (Børre)

Error report, have a look:

tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090039000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090039000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090043000DDDPDFS.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090043000DDDPDFS.pdf
tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
tmp/Samiske_tall_forteller_II_Norsk.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_II_Norsk.pdf
tmp/retningslinjerforverneplanarbeid_sametinget.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/retningslinjerforverneplanarbeid_sametinget.pdf


drwxr-xr-x    4 cipriangerstenberger  staff   136  7 apr 22:54 orig
drwxr-xr-x  201 cipriangerstenberger  staff  6834 11 apr 13:29 tmp