Wikipedia As Corpus
This page explains how to fetch a whole Wikipedia as raw text.
Do the following:
- Find the code for the language you want. It is the two-letter ISO 639-1 code (e.g. se for North Sámi). If the language has no two-letter code, use the three-letter code.
- Go to the download page. The URL http://dumps.wikimedia.org/sewiki/ gives you North Sámi; replace the se in sewiki with the language code you want.
- In the list that follows, choose the last one before latest/. The
- Download the .bz2 file found under the header
- When downloaded, open the .bz2 file.
- Extract it with the script WikiExtractor.py (which is in your
- The output is XML. If you want clean text, strip the tags with a command such as this one:
... | sed 's/<[^>]*>//g;' | ...
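Two of the steps above can be sketched in Python: building the download-page URL from a language code, and stripping tags in the same way as the sed command. This is only a minimal sketch. The function names dump_index_url and strip_tags are made up for illustration, and the URL pattern is the one quoted in the list above.

```python
import re

# URL pattern copied from the step list above ("sewiki" is North Sámi).
DUMP_ROOT = "http://dumps.wikimedia.org"

# Same pattern as the sed command: "<", any characters except ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def dump_index_url(lang_code):
    """Return the download-page URL for a language code (hypothetical helper)."""
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return f"{DUMP_ROOT}/{code}wiki/"


def strip_tags(text):
    """Remove tags line by line, as sed does (hypothetical helper).

    Working per line means a tag that spans a line break is left alone,
    which matches the behaviour of the sed one-liner.
    """
    lines = text.splitlines(keepends=True)
    return "".join(TAG_RE.sub("", line) for line in lines)


if __name__ == "__main__":
    print(dump_index_url("se"))
    print(strip_tags('<doc id="1">\nSome text\n</doc>\n'), end="")
```

Like the sed command, this removes only the tags themselves; the blank lines left behind by lines that held nothing but a tag are kept.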
For convenience, we often store the latest version in biggies, e.g. biggies/langs/vep/corp/vepwiki.txt. For larger Wikipedias, please store only part of it, e.g. only the files whose names begin with A (see the output).
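Storing only part of a large Wikipedia could be done along these lines. This is a sketch under assumptions, not a fixed procedure: it assumes the extracted files sit below one output directory and that "initial A" refers to the first path component below that directory (a file name or a subdirectory name). The names select_part and concatenate are hypothetical.

```python
from pathlib import Path


def select_part(output_dir, initial="A"):
    """List files whose first path component below output_dir starts with initial.

    Assumption: the first component is either the file name itself or the
    name of the subdirectory that holds the file.
    """
    root = Path(output_dir)
    picked = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        first = path.relative_to(root).parts[0]
        if first.startswith(initial):
            picked.append(path)
    return picked


def concatenate(paths, target):
    """Write the contents of all paths, in order, to one UTF-8 text file."""
    with open(target, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(Path(path).read_text(encoding="utf-8"))
```

A call such as concatenate(select_part("extracted"), "vepwiki.txt") would then produce one text file from the selected part, where "extracted" stands for whatever directory the extraction step wrote to.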