UsingXMLSH

A collection of examples

This is a short collection of examples serving as a starting point for how to use XMLSH. It is a shell-friendly interface to xml files, and allows fast and easy access to structured data, as long as you know your XPath! : D

Count the number of sme words in parallel files

First run the parallel info xsl script using Saxon (Saxon must be on your CLASSPATH - the saxonXSL alias assumes that it is found in ~/lib/saxon9.jar):

$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted

Then start xmlsh and extract some statistics from the xml files produced above:

$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288

Then off to some slightly more advanced XQuery: get all elements for which we have found a parallel file (as per above), extract the path to that file, and print it (we do this with both the created report files, and sort -u later):

xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
        < corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
        < corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit

Finally some traditional processing to extract the words and count them. The most conservative (and probably most reliable) method is to just count the words using wc:

$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
  849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
  964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
  977348