Multilingual Wikipedia Corpus

This is the website for the Multilingual Wikipedia Corpus used in "Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling" (ACL 2017).

Overview

Languages differ in what word formation processes they have. For character-level modeling it is therefore interesting to compare a model's performance across languages. Since there is at present no standard multilingual language modeling dataset, we created a new dataset, the Multilingual Wikipedia Corpus (MWC), a corpus of the same Wikipedia articles in 7 languages which manifest a range of morphological typologies. The MWC contains English, French, Spanish, German, Russian, Czech, and Finnish.

To attempt to control for topic divergences across languages, every language's data consists of the same articles. Although these are only comparable (rather than true translations), this ensures that the corpus has a stable topic profile across languages.

Download the dataset from the link below and build awesome character-level language models!

If you unzip the file, you will find wiki_{language} folders, and each folder contains

Our paper reports results on the regular dataset (ptb_format).

We know it is redundant to include the raw articles for each document, but we expect it will be interesting to work on document-level modeling rather than treating all documents as a single text (as is usual in neural language modeling).

Download

Construction

We constructed the MWC similarly to the WikiText-2 corpus. Articles were selected from Wikipedia in the 7 target languages. To keep the topic distribution approximately the same across the corpora, we extracted articles about entities that are described in all the languages. We used Wikidata entity annotations to find associations between articles in different languages.

We extracted articles that exist in all 7 languages and consist of more than 1,000 words in each of them. In total, the dataset is a collection of 797 articles. These cross-lingual articles are, of course, not usually translations, but they tend to be comparable. This filtering ensures that the topic profile in each language is similar. Each language corpus is approximately the same size as the WikiText-2 corpus.
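The selection step described above can be sketched as follows. This is only an illustration, not the script used to build the MWC: the `articles` mapping, the function name `select_shared_articles`, and counting words by whitespace splitting are all assumptions.

```python
from typing import Dict, List, Sequence

# Language codes for the 7 MWC languages (the codes themselves are an assumption).
LANGUAGES = ["en", "fr", "es", "de", "ru", "cs", "fi"]
MIN_WORDS = 1000  # "more than 1,000 words" per the description above


def select_shared_articles(
    articles: Dict[str, Dict[str, str]],
    languages: Sequence[str] = LANGUAGES,
    min_words: int = MIN_WORDS,
) -> List[str]:
    """Return Wikidata IDs whose article is long enough in every language.

    `articles` maps a language code to {wikidata_id: plain_text}.
    Words are counted by whitespace splitting, which is an assumption.
    """
    # Entities that have an article in every language.
    shared = set.intersection(*(set(articles[lang]) for lang in languages))
    # Keep only entities whose article exceeds the threshold in all languages.
    return sorted(
        qid
        for qid in shared
        if all(len(articles[lang][qid].split()) > min_words for lang in languages)
    )
```

With a toy input such as `{"en": {"Q1": "a b c"}, "fr": {"Q1": "x y z"}}`, `languages=["en", "fr"]`, and `min_words=2`, the function returns `["Q1"]`.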

Wikipedia markup was removed with WikiExtractor to obtain plain text. To remove rare characters, we used the same thresholds as for the WikiText-2 corpus. No tokenization or other normalization (e.g., lowercasing) was done.

Statistics

Counts

      Char. Types         Word Types            OOV rate      Tokens              Characters
Lang  Train  Valid  Test  Train   Valid  Test   Valid  Test   Train  Valid  Test  Train  Valid  Test
EN    307    160    157   193808  38826  35093  6.60%  5.46%  2.5M   0.2M   0.2M  15.6M  1.5M   1.3M
FR    272    141    155   166354  34991  38323  6.70%  6.96%  2.0M   0.2M   0.2M  12.4M  1.3M   1.6M
DE    298    162    183   238703  40848  41962  7.07%  7.01%  1.9M   0.2M   0.2M  13.6M  1.2M   1.3M
ES    307    164    176   160574  31358  34999  6.61%  7.35%  1.8M   0.2M   0.2M  11.0M  1.0M   1.3M
CS    238    128    144   167886  23959  29638  5.06%  6.44%  0.9M   0.1M   0.1M  6.1M   0.4M   0.5M
FI    246    123    135   190595  32899  31109  8.33%  7.39%  0.7M   0.1M   0.1M  6.4M   0.7M   0.6M
RU    273    184    196   236834  46663  44772  7.76%  7.20%  1.3M   0.1M   0.1M  9.3M   1.0M   0.9M
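For orientation, the count columns can be approximated for any split with a few lines of Python. The sketch below assumes whitespace-separated words, counts every distinct character (including whitespace) as a character type, and takes the OOV rate to be the fraction of validation or test tokens that never occur in the training split. None of these definitions is spelled out on this page, so the numbers it produces may not match the table exactly.

```python
from typing import Dict


def corpus_stats(text: str) -> Dict[str, int]:
    """Count character/word types and totals for one split (assumed definitions)."""
    tokens = text.split()  # whitespace tokenization is an assumption
    return {
        "char_types": len(set(text)),    # distinct characters, whitespace included
        "word_types": len(set(tokens)),  # distinct whitespace-separated words
        "tokens": len(tokens),
        "characters": len(text),
    }


def oov_rate(train_text: str, eval_text: str) -> float:
    """Fraction of evaluation tokens unseen in training (assumed definition)."""
    train_vocab = set(train_text.split())
    eval_tokens = eval_text.split()
    if not eval_tokens:
        return 0.0
    unseen = sum(1 for tok in eval_tokens if tok not in train_vocab)
    return unseen / len(eval_tokens)
```

Both functions take the full text of a split as a single string, so they work equally on a concatenated file or on the joined contents of the per-article files.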

Example

Each file in tr/va/te is named with its Wikidata ID.

Wikidata ID: Q1
Universe

The Universe is all of time and space and its contents. It includes planets, moons, minor planets, stars, galaxies, the contents of intergalactic space, and all matter and energy. The size of the entire Universe is unknown, but there are many hypotheses about the composition and evolution of the Universe.

The earliest scientific models of the Universe were developed by ancient Greek and Indian philosophers and were geocentric, placing the Earth at the center of the Universe. Over the centuries, more precise astronomical observations led Nicolaus Copernicus (1473–1543) to develop the heliocentric model with the Sun at the center of the Solar System. In developing the law of universal gravitation, Sir Isaac Newton (NS: 1643–1727) built upon Copernicus's work as well as observations by Tycho Brahe (1546–1601) and Johannes Kepler's (1571–1630) laws of planetary motion. Further observational improvements led to the realization that our Solar System is located in the Milky Way galaxy and is one of many solar systems and galaxies. It is assumed that galaxies are distributed uniformly and the same in all directions, meaning that the Universe has neither an edge nor a center. Discoveries in the early 20th century have suggested that the Universe had a beginning and that it is expanding at an increasing rate. The majority of mass in the Universe appears to exist in an unknown form called dark matter.

...
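Because each article is stored in its own file, a split can be loaded as a mapping from Wikidata ID to text for document-level modeling. The following is a minimal sketch. The helper name `load_split`, the UTF-8 encoding, and the idea that the ID is the file name without any extension are assumptions.

```python
from pathlib import Path
from typing import Dict


def load_split(split_dir: str) -> Dict[str, str]:
    """Read every file in a tr/va/te directory into {wikidata_id: text}.

    Assumes UTF-8 files whose name (minus any extension) is the Wikidata ID.
    """
    docs: Dict[str, str] = {}
    for path in sorted(Path(split_dir).iterdir()):
        if path.is_file():
            docs[path.stem] = path.read_text(encoding="utf-8")
    return docs
```

A call might look like `docs = load_split("wiki_en/tr")`, where the path is hypothetical and should be adjusted to wherever the archive was unpacked. `docs["Q1"]` would then hold the text of the Universe article shown above.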

Reference

Please refer to the following paper if needed.

@inproceedings{kawakami2017learning,
title={Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling},
author={Kawakami, Kazuya and Dyer, Chris and Blunsom, Phil},
booktitle={Proc. ACL},
year=2017
}