UsingXMLSH
A collection of examples
This is a short collection of examples to serve as a starting point for using XMLSH. XMLSH is a shell-friendly interface to XML files that allows fast and easy access to structured data, as long as you know your XPath! :D
Count the number of sme words in parallel files
First run the parallel info xsl script using Saxon (Saxon must be on your CLASSPATH - the saxonXSL alias assumes that it is found in ~/lib/saxon9.jar):
$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted
Then start xmlsh and extract some statistics from the xml files produced above:
$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288
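As a rough cross-check of such counts without xmlsh, the attribute can also be counted textually with standard tools. This is only a sketch: the miniature report below is a hypothetical stand-in for the real summary file, and the approach does no XML parsing, so it would also match the attribute on elements other than file and would miss single-quoted attribute values.

```shell
# Hypothetical miniature of a summary report; the real file layout may differ.
report=$(mktemp)
cat > "$report" <<'EOF'
<summary>
  <file parallelity="true"><location><t_loc>a.xml</t_loc></location></file>
  <file parallelity="false"><location><t_loc>b.xml</t_loc></location></file>
  <file parallelity="true"><location><t_loc>c.xml</t_loc></location></file>
</summary>
EOF

# Count textual occurrences of the attribute; no XML parsing is involved.
# grep -o prints one match per line, so wc -l gives the number of matches.
count=$(grep -o 'parallelity="true"' "$report" | wc -l | tr -d ' ')
echo "$count"
rm "$report"
```

If the textual count and the xquery count disagree on a real report, the XQuery result is the one to trust, since it actually evaluates the document structure.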
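Then on to some slightly more advanced XQuery: get the location text (t_loc from one report, h_loc from the other) of every file element with parallelity="true", and collect the results in sme-files.txt: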
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
    < corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
    < corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit
Finally, some traditional processing to extract the words and count them:
$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
977348
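The dedupe-then-count pattern used here can be tried out on toy data with nothing but standard tools. In this sketch, plain cat stands in for ccat -l sme (which belongs to the corpus tools), and the file names and contents are made up for illustration. The duplicate entry mimics what can happen when two query results are appended to the same list.

```shell
# Sketch: dedupe a file list, then count words across the listed files.
# `cat` stands in for `ccat -l sme`; all files here are throwaway samples.
workdir=$(mktemp -d)
printf 'okta guokte golbma\n' > "$workdir/a.txt"
printf 'njeallje vihtta\n' > "$workdir/b.txt"

# The same path appears twice, as it can after appending two query results.
printf '%s\n' "$workdir/a.txt" "$workdir/b.txt" "$workdir/a.txt" > "$workdir/files.txt"

# sort -u keeps one copy of each path, so no file is counted twice.
sort -u "$workdir/files.txt" > "$workdir/files.sorted.txt"
unique_files=$(wc -l < "$workdir/files.sorted.txt" | tr -d ' ')

# xargs passes the listed paths as arguments; wc -w counts whitespace-separated words.
total_words=$(xargs cat < "$workdir/files.sorted.txt" | wc -w | tr -d ' ')

echo "unique files: $unique_files"
echo "total words: $total_words"
rm -r "$workdir"
```

Note that plain xargs splits its input on whitespace, so this pattern assumes that none of the listed paths contain spaces.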