corpus_parallel_maintenance
Many documents are parallel, with the parallel content stored in the same file as the original.
- For each file, count the number of words
- For each file, count the number of words marked with the language of
- Estimate the ratio of the second count to the first
- Pick the files with a bad ratio and investigate them. Split them and reallocate the parts.
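
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: how a word is "marked" with a language is not specified here, so the caller supplies a hypothetical `is_marked` predicate, and the `low`/`high` bounds that define a "bad" ratio are placeholder values, not figures from this page. The function names are likewise made up for the example.

```python
from typing import Callable, Dict, List, Tuple


def count_words(text: str, is_marked: Callable[[str], bool]) -> Tuple[int, int]:
    """Return (total words, marked words) for the text of one file."""
    words = text.split()  # naive whitespace tokenization (assumption)
    marked = sum(1 for word in words if is_marked(word))
    return len(words), marked


def flag_files(
    texts: Dict[str, str],
    is_marked: Callable[[str], bool],
    low: float = 0.4,   # hypothetical lower bound for an acceptable ratio
    high: float = 0.6,  # hypothetical upper bound for an acceptable ratio
) -> List[Tuple[str, float]]:
    """Return (file name, ratio) for every file whose ratio is outside [low, high]."""
    flagged = []
    for name, text in sorted(texts.items()):
        total, marked = count_words(text, is_marked)
        # An empty file gets ratio 0.0, so it is flagged for inspection too.
        ratio = marked / total if total else 0.0
        if not (low <= ratio <= high):
            flagged.append((name, ratio))
    return flagged
```

For example, with an invented marking scheme where a marked word ends in `|xx`, one would call `flag_files(texts, lambda w: w.endswith("|xx"))` on a mapping from file name to file content. The flagged files are then candidates for manual investigation, splitting and reallocation.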