corpus project website
Omnia Russica (Lat. "all Russian") is an open-source corpus project containing 33 billion words.
Omnia Russica combines the major Russian corpus sources in a single pipeline.
Corpus | Format | Morphology | Syntax | Size |
---|---|---|---|---|
Wikipedia | vertical | TreeTagger | None | 0.5 G |
Taiga | CoNLL-U | UDPipe | UDPipe | 4.5 G |
Araneum Russicum | vertical | TreeTagger | None | 25 G |
Common Crawl | Plain text | None | None | 3 G |
Download plain text data (97 Gb)
We have a NoSketch Engine search interface here
omnia.russica@yandex.ru
T. Shavrina and V. Benko (2019). Omnia Russica: Even larger Russian corpus. In Proc. Corpora.
Check out our documentation on GitHub
For merging the data, the following principles are applied:
At the time of writing this paper (May 2019), all of the respective data has been collected and preprocessed, except for the Common Crawl part, which is still being scaled up to its limit. The Wikipedia and Araneum Russicum corpora now need to be retagged with UDPipe, and Taiga with TreeTagger, so that all parts share a uniform format suitable for subsequent merging and optional deduplication; the Common Crawl part also still has to be deduplicated and processed.
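
The two steps mentioned above, bringing the parts to a uniform format and deduplicating them, can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function names `conllu_to_vertical` and `dedup_documents` are hypothetical, the three-column vertical layout (word, lemma, tag) is an assumption, and exact-match hashing of normalized text stands in for whatever deduplication method is really used.

```python
import hashlib
from typing import Iterable, Iterator, List


def conllu_to_vertical(conllu: str) -> List[str]:
    """Convert CoNLL-U text to vertical lines: word<TAB>lemma<TAB>UPOS.

    The three-column layout and the <s>...</s> sentence markup are
    assumptions made for this sketch, not the project's documented format.
    """
    out: List[str] = []
    for line in conllu.splitlines():
        if not line:
            # A blank line ends the current sentence, if one is open.
            if out and out[-1] != "</s>":
                out.append("</s>")
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        if not out or out[-1] == "</s>":
            out.append("<s>")
        out.append("\t".join((cols[1], cols[2], cols[3])))
    if out and out[-1] != "</s>":
        out.append("</s>")
    return out


def dedup_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, dropping exact duplicates.

    Documents are compared after lowercasing and collapsing whitespace;
    near-duplicates are NOT detected by this simple approach.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha1(normalized.encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield doc
```

Storing only the hash digests keeps memory use proportional to the number of distinct documents rather than to their total length, which matters at the scale of billions of words. A corpus of this size would probably also need near-duplicate detection, which this sketch does not attempt.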