Fetching parallel files from samediggi.no
This is an example on how to:
- fetch parallel documents
- add metadata that make the parallel documents refer to each other
- add the parallel documents and their metadata to the corpus repository
- convert them to giellateknos xml format
- move the converted documents to prestable/converted
Fetch parallel documents
- Open Safari, go to this address:
http://www.samediggi.no/Dokumeantta/Oza-dokumeanttaid
Click on "Lávdegottit", "Dievasčoahkkimat" and "Møter"
Choose some file and click on it. The document will be opened in another tab.
Copy the document link.
- Open another tab in the same Safari window (Press cmd+t) and go to:
http://www.sametinget.no/Dokumenter/Soek-etter-dokumenter
Click on "Utvalg", "Plenumsmøter" and "Møter"
Choose the norwegian version of the saami document and double click it. The document will open in a new tab.
(You now have the saami pages in the left tab and the norwegian pages in the right tab)
- Open a new window in SubEthaEdit (cmd+n)
- paste (cmd+v) the link of the saami document - press cmd+f and replace & with \& (replace all) - copy (cmd+c) the link
- Open a new Terminal window and go to:
freecorpus/orig/sme/admin/sd/other_files/
Fetch the saami document with this command: wget (cmd+v, to paste the link we just copied)
Change the name of the document: mv wf (press tab, the whole name will appear) dc2012-2.pdf
Make the xsl file: convert2xml dc2012-2.pdf
- Go back to Safari and copy the link to the norwegian document.
- Go to the SubEthaEdit window
- paste (cmd+v) the link of the norwegian document - press cmd+f and replace & with \& (replace all) - copy (cmd+c) the link
- Go the already opened Terminal. Press cmd+t to open a new tab and go to:
freecorpus/orig/nob/admin/sd/other_files/
Fetch the norwegian document using this command: wget (press cmd+v, then paste the link using cmd+c)
Change the name of the document: mv wf (press tab, the whole name will appear) sp2012-2.pdf
Make the .xsl file: convert2xml sp2012-2.pdf
(Now you have the saami pages in the left tab of Terminal and the norwegian pages in the right tab)
Add metadata
- Open the saami xsl file:
see dc2012-2.pdf.xsl
- Open the norwegian xsl file:
see sp2012-2.pdf.xsl
- Place the saami file on the left hand side and the norwegian file on the right hand side. Then fill in the metadata in the xsl files.
- These entries have be filled in:
This entry has to be entered in the saami xsl file (don't fill in "translated from" in the norwegian xsl file):
<xsl:variable name="filename" select="'http://innsyn.e-kommune.no/innsyn_sametinget_samisk/wfdocument.aspx?journalpostid=2012006934&dokid=368243&versjon=1&variant=P&ct=RA-PDF'"/>
-->NB!! This is the full link, note that you have to replace some characters in the link. (Paste the link into a clean SubEthaEdit document, use the search and replace function and replace & with (&). Don't include the paranthesis). Copy the link and paste it into the Terminal. IMPORTANT: replace & with &
<xsl:variable name="title" select="'Sámedikki dievasčoahkkin Čoahkkingirji 001/12'"/> <xsl:variable name="publisher" select="'Norgga Sámediggi'"/> <xsl:variable name="year" select="'2012'"/> <xsl:variable name="genre" select="'admin'"/> <xsl:variable name="translated_from" select="'nob'"/>
--->NB! Only use translated_from if it is a translated document!
<xsl:variable name="license_type" select="'free'"/> <xsl:variable name="sub_name" select="'Berit Eskonsipo'"/> <xsl:variable name="sub_email" select="'berit.nystad.eskonsipo@uit.no'"/> <xsl:variable name="mainlang" select="'sme'"/> <xsl:variable name="monolingual" select="'1'"/> <!--lg rec is off!--> <xsl:variable name="parallel_texts" select="'1'"/> <xsl:variable name="para_nob" select="'sp2012-1.pdf'"/>
Save!
Add the parallel documents and their metadata to the corpus repository
- Rerun convert2xml in the SME Terminál window (note that you can press up in the Terminal untill the right command appears.):
convert2xml dc2012-2.pdf
- Write svn stat and the result is:
? dc2012-2.pdf ? dc2012-2.pdf.xsl
(copy (cmd+v) dc2012-2.pdf)
- Write:
svn add (cmd+v) (cmd+v+.+tab)
- Write:
svn ci -m "your svn message"
- Rerun convert2xml in the NOB Terminal window (note that you can press up in the Terminal untill the right command appears.):
convert2xml sp2012-2.pdf
- Write svn stat and the result is:
? sp2012-2.pdf ? sp2012-2.pdf.xsl
(copy (cmd+v) sp2012-2.pdf)
- Write:
svn add (cmd+v) (cmd+v+.+tab)
- Write:
svn ci -m "your svn message"
Move the converted documents to prestable/converted
- In the SME Terminal, write:
cd
cd freecorpus/converted/sme/admin/sd/other_files - Write this:
pick (press tab, pick-parallel-docs.pl appears) dc2012-2.pdf.xml (pick-parallel-docs.pl only works on sme docs)
- Write this:
cd
cd freecorpus - Write svn stat and the result is:
? prestable/converted/nob/admin/sd/other_files/sp2012-2.pdf.xml ? prestable/converted/sme/admin/sd/other_files/dc2012-2.pdf.xml
- Write:
svn add (copy and paste both paths, remember to add a space between them) prestable/converted/nob/admin/sd/other_files/sp2012-2.pdf.xml prestable/converted/sme/admin/sd/other_files/dc2012-2.pdf.xml
- Write:
svn ci -m "added approved parallel files to prestable"
YOU NOW HAVE ADDED NEW PARALLEL DOCUMENTS TO PRESTABLE
by Berit Merete Nystad Eskonsipo, Børre Gaup