Wikipedia as a corpus
For some of our languages we use Wikipedia as a corpus. This may be the only sizeable collection of electronic text available for the language.
Procedure
We use Moksha Mordvin (ISO code mdf) as an example; replace mdf with the code of the relevant language throughout.
Download the Wikipedia dump from http://dumps.wikimedia.org/mdfwiki/latest/ (the file mdfwiki-latest-pages-articles.xml.bz2).
Unzip the .bz2 file, and save the resulting .xml file somewhere.
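The download and unzip steps can be sketched as below. The wget line (commented out) uses the URL and filename from above and needs network access, so the unzip part is demonstrated on a tiny stand-in file instead of the real dump:

```shell
# Sketch of the download and unzip step (assumes wget, bzip2 and bunzip2 are installed).
# wget http://dumps.wikimedia.org/mdfwiki/latest/mdfwiki-latest-pages-articles.xml.bz2
# Demonstrated here on a tiny stand-in file instead of the real dump:
printf '<mediawiki>stand-in dump</mediawiki>\n' > mdfwiki-latest-pages-articles.xml
bzip2 -f mdfwiki-latest-pages-articles.xml            # compress: creates the .bz2
bunzip2 -kf mdfwiki-latest-pages-articles.xml.bz2     # unzip: restores the .xml
ls -l mdfwiki-latest-pages-articles.xml
```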
The text will be split into smaller files, so make a directory, say mdfwiki.
Extract the xml with the WikiExtractor.py script:
cat mdfwiki-latest-pages-articles.xml | WikiExtractor.py -o mdfwiki
The result is stored in directories AA, AB, etc., with 100 files in each.
Finally, strip the remaining XML tags and concatenate everything into one corpus file:

cat mdfwiki/*/wiki* | sed 's/<[^>]*>//g;' > mdfwikicorp.txt
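The tag-stripping step can be checked on a small hand-made sample. The <doc …> wrapper below mimics what WikiExtractor writes; the directory layout and the sed expression are the ones from the command above:

```shell
# Minimal check of the tag-stripping step on a hand-made sample file.
mkdir -p mdfwiki/AA
printf '<doc id="1" url="" title="Test">\nMokshan sample text.\n</doc>\n' > mdfwiki/AA/wiki_00
cat mdfwiki/*/wiki* | sed 's/<[^>]*>//g;' > mdfwikicorp.txt
cat mdfwikicorp.txt   # the <doc> tags are gone, only the text remains
```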
Wikipedia as a source for investigating corrections
Other versions of Wikipedia on the download site
A paper describing this in some detail is the following: