omnia-russica.github.io

corpus project website

Omnia Russica

About

Omnia Russica (Latin for "all Russian") is an open-source corpus project containing 33 billion words.

Omnia Russica combines major Russian corpus sources within one pipeline:

| Source | Format | Morphology | Syntax | Size (words) |
|---|---|---|---|---|
| Wikipedia | vertical | TreeTagger | None | 0.5 G |
| Taiga | CoNLL-U | UDPipe | UDPipe | 4.5 G |
| Araneum Russicum | vertical | TreeTagger | None | 25 G |
| Common Crawl | Plain text | None | None | 3 G |
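Because the sources come in different formats, bringing them into one pipeline involves format conversion. The sketch below illustrates one possible way to turn vertical-format lines into CoNLL-U rows. It is not the project's actual code. It assumes that vertical lines are tab-separated in the order word, tag, lemma (the usual TreeTagger output order), that `</s>` closes a sentence, and that the TreeTagger tag goes into the XPOS column while all unknown fields are filled with `_`. The function name `vertical_to_conllu` is hypothetical.

```python
def vertical_to_conllu(lines):
    """Convert vertical-format lines to a CoNLL-U string (illustrative sketch).

    Assumptions (not confirmed by the project): columns are word, tag, lemma,
    separated by tabs; structural tags such as <doc> and <s> sit on their own line.
    """
    out = []
    token_id = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Structural tags (<doc>, <s>, </s>, ...) carry no token.
        if line.startswith("<") and line.endswith(">"):
            if line.startswith("</s") and token_id:
                out.append("")  # a blank line terminates a CoNLL-U sentence
                token_id = 0
            continue
        parts = line.split("\t")
        word = parts[0]
        tag = parts[1] if len(parts) > 1 else "_"
        lemma = parts[2] if len(parts) > 2 else "_"
        token_id += 1
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        out.append("\t".join(
            [str(token_id), word, lemma, "_", tag, "_", "_", "_", "_", "_"]
        ))
    if token_id:
        out.append("")  # close a sentence that had no explicit </s>
    return "\n".join(out) + "\n" if out else ""


sample = [
    "<doc>",
    "<s>",
    "Мама\tNcfsnn\tмама",
    "мыла\tVmis\tмыть",
    "</s>",
    "</doc>",
]
converted = vertical_to_conllu(sample)
```

Rows produced this way carry no syntactic annotation (HEAD and DEPREL stay `_`), so a parser such as UDPipe would still be needed to fill them in.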

Download

Download plain text data (97 GB)

Web Interface

A NoSketch Engine search interface is available here

Contact and cite us

Check out our documentation on GitHub

Pipeline

For merging the data, the following principles are applied:

At the time of writing this paper (May 2019), all of the data has been collected and preprocessed except the Common Crawl part, which is still being scaled up. The Wikipedia and Araneum Russicum corpora now need to be retagged with UDPipe, and Taiga with TreeTagger, so that all parts share a uniform format suitable for subsequent merging and optional deduplication. The Common Crawl part also needs to be deduplicated and processed.
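The deduplication step mentioned above is not specified in detail. As a minimal sketch, exact-duplicate removal could hash a normalized version of each document and keep only the first occurrence. Everything here is an assumption for illustration: the function names `normalize` and `deduplicate`, the normalization (lowercasing and whitespace collapsing), and the choice of SHA-1. Near-duplicate detection, which web corpora often need, is not covered.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase, so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalization.

    Illustrative only: this removes exact duplicates, not near-duplicates.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Привет,  мир", "привет, мир", "Другой текст"]
unique_docs = deduplicate(docs)
```

Storing only digests rather than full texts keeps memory use proportional to the number of documents, which matters at the scale of billions of words.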