Corpus analysis on stallo
Pre analysis
Building tokenisers
Tokenisers are built once a week on gtlab. The comments in $GTHOME/tools/stallo-analysis/build-tokenisers.sh details what is being done.
The tokeniser build jobs are kicked off by this cronjob 45 19 * * sat source $HOME/.bash_profile && svn up $GTHOME/tools/stallo-analysis/build-tokenisers.sh && $GTHOME/tools/stallo-analysis/build-tokenisers.sh
The hfst tools on this machine are updated at least once a week from the nightly apertium repo.
Fetch converted corpus and kick off analysis
The analysis is kicked off by this cronjob on gtlab: 00 03 * * sun ssh stallo.uit.no ". $HOME/.bash_profile && ls -lR $HOME/.local/share/giella && svn up $HOME/svnrepos/langtech/tools && $HOME/svnrepos/langtech/tools/stallo-analysis/pre-analysis.sh"
The script $GTHOME/tools/stallo-analysis/pre-analysis.sh fetches the converted corpus files and dispatches separate analysis jobs for each corpus, language and type (type being xfst and hfst).
Analysis
The analysis is done on the boerre account on stallo. The comments in $GTHOME/tools/stallo-analysis/analyse.sh details what is being done. The grunt work is done by the program analyse_corpus .
Post analysis
Analysed files are sent to gtweb by this cron job. 00 08 * * sun ssh stallo.uit.no ". $HOME/.bash_profile && $HOME/svnrepos/langtech/tools/stallo-analysis/post-analysis.sh"
The script $GTHOME/tools/stallo-analysis/post-analysis.sh details what is being done.
Files and compilers
These are the repositeries found on stallo:
- The langtech repo: ~boerre/svnrepos/main/
- The freecorpus repo: ~boerre/svnrepos/freecorpus/
- The boundcorpus repo: ~boerre/svnrepos/boundcorpus/
- The rusbound repo: ~boerre/svnrepos/rusbound/
The xfst, hfst, vislcg3 and CorpusTools tools are installed in ~boerre/bin on stallo.
hfst: Build commands
cd ~/hfst/ git pull module load autoconf/2.69 module load automake/1.13.1 module load gcc/4.9.1 ./autogen.sh ./configure --enable-all-tools --with-unicode-handler=glib --prefix=/home/boerre --with-readline make make install
vislcg3: Build commands
module load CMake/3.6.2-foss-2016b module load Boost/1.61.0-foss-2016b module load ICU cd ~/svnrepos/vislcg3/ svn up cmake \ -DCMAKE_INCLUDE_PATH=/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/include/ \ -DCMAKE_LIBRARY_PATH=/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/lib/ \ -DCMAKE_EXE_LINKER_FLAGS=-L/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/lib \ . make make install
CorpusTools: Install commands
cd ~/svnrepos/main/tools/CorpusTools/ svn up python setup.py develop --user
Environment variables on stallo
These environment variables are set on stallo to make hfst and vislcg3 work as expected:
export PATH=$HOME/bin:$PATH export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH