[su_spoiler title="Monolingual text data" open="no" style="fancy"]
[su_table]
- From WMT evaluation campaigns:
Corpus | Sizes (column order: CS, DE, EN, FI, RO, RU, TR, all languages combined; empty cells not shown) | Notes |
---|---|---|
Europarl v7/v8 | 32MB, 107MB, 99MB, 95MB | |
News Commentary | 13MB, 17MB, 20MB, 17MB, 65MB | |
Common Crawl | 10.5GB, 102GB, 103GB, 5.3GB, 11.3GB, 42GB, 18GB | SHA512 checksums. |
News Discussions | 1.7GB | V1 from 2014/15 |
News Crawl: 2007 | 3.7MB, 92MB, 198MB, 302MB | News Crawl: extracted article text from various online news publications. |
News Crawl: 2008 | 191MB, 313MB, 672MB, 2.3MB, 1.5GB | |
News Crawl: 2009 | 194MB, 296MB, 757MB, 5.1MB, 1.6GB | |
News Crawl: 2010 | 107MB, 135MB, 345MB, 2.5MB, 727MB | |
News Crawl: 2011 | 389MB, 746MB, 784MB, 564MB, 3.1GB | |
News Crawl: 2012 | 337MB, 946MB, 751MB, 568MB, 3.1GB | |
News Crawl: 2013 | 395MB, 1.6GB, 1.1GB, 730MB, 4.3GB | |
News Crawl: 2014 | 380MB, 2.1GB, 1.4GB, 52MB, 801MB, 5.3GB | |
News Crawl: 2015 | 360MB, 2.2GB, 1.3GB, 203MB, 125MB, 608MB, 4.8GB | |
[/su_table]
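The Common Crawl entry points to SHA512 checksums, so a downloaded file can be checked before use. The following is a minimal sketch, assuming the checksum list uses the common `sha512sum` line format (`<hex digest>  <file name>`), which this page does not confirm. The function names are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file.

    The file is read in chunks so that multi-gigabyte corpora
    do not have to be loaded into memory at once.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text):
    """Parse sha512sum-style lines into {file name: hex digest}.

    Assumed line format: "<hex digest>  <file name>". A leading "*"
    on the file name (binary-mode marker) is dropped.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(path, expected):
    """Return True if the file's digest matches the entry for its name."""
    name = Path(path).name
    return name in expected and sha512_of_file(path) == expected[name]
```

In use, the text of the checksum list would be passed to `parse_checksum_lines`, and `verify` would then be called once per downloaded file. A `False` result means either a mismatch or a file name that is missing from the list.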
[/su_spoiler]
[su_spoiler title="Multilingual text data" open="no" style="fancy"]
[su_table]
- From WMT evaluation campaigns:
File | Size | Language pairs (column order: CS-EN, DE-EN, FI-EN, RO-EN, RU-EN, TR-EN; empty cells not shown) | Notes |
---|---|---|---|
Europarl v7 | 628MB | ✓ ✓ | corpus home page |
Europarl v8 | 215MB | ✓ ✓ | corpus home page |
Common Crawl corpus | 876MB | ✓ ✓ ✓ | |
News Commentary v11 | 72MB | ✓ ✓ ✓ | |
CzEng 1.6pre | 3.1GB | ✓ | |
Yandex Corpus | 121MB | ✓ | corpus home page |
Wiki Headlines | 9.1MB | ✓ ✓ | Provided by CMU. |
SETIMES2 | ?? MB | ✓ ✓ | Distributed by OPUS |
[/su_table]
[/su_spoiler]
[su_spoiler title="Multimodal text+image" open="no" style="fancy"]
[su_table]
File | Size | Languages | Notes |
---|---|---|---|
Flickr8k | ?? MB | En | corpus home page |
Flickr30k | ?? MB | En | corpus home page |
MS COCO | ?? MB | En | corpus home page |
IAPRTC12 | ?? MB | En | corpus home page |
[/su_table]
[/su_spoiler]
[su_spoiler title="Multimodal text+speech" open="no" style="fancy"]
[su_table]
File | Size | Languages | Notes |
---|---|---|---|
WIT3 | ?? MB | [cs, de, fr, th, vi, zh] en | corpus home page |
[/su_table]
[/su_spoiler]