
Building artificial intelligence or machine learning systems is easier than ever. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the massive computing power available from AWS, Google Cloud, and other providers, means you can train state-of-the-art machine learning models on your laptop in an afternoon.

The importance of datasets to deep learning models is self-evident. However, depending on their nature, type, and domain, datasets are scattered across different platforms and are badly in need of organization.

Without data, our machine learning and deep learning models can do nothing. The people who create the datasets that let us train our models are heroes, even if they often don't get the thanks they deserve. Fortunately, the most valuable datasets have become academic benchmarks, widely cited by researchers, especially when comparing algorithmic changes; names like MNIST, CIFAR-10, and ImageNet are familiar both inside and outside the field.

If you use these datasets in your research, please remember to cite the original paper (citation links are provided below); if you use them in a commercial or educational project, consider adding a thank-you note and a link back to the original dataset.

We often use these datasets in our teaching because they are excellent examples of the kinds of data students are likely to encounter, and students can gauge their progress by comparing their work against published results on the same data. We also use Kaggle competition datasets, whose public leaderboards let students test their models against the best in the world, but Kaggle datasets are not included in this list.

Image classification

1) MNIST

The classic small (28×28 pixel) grayscale handwritten digit dataset, developed in the 1990s to test the most sophisticated models of its time; today, MNIST is viewed more as a foundational sanity check for deep learning. The fast.ai version of the dataset abandons the original special binary format in favor of standard PNG, which fits the normal workflow of most current codebases. If you want a single input channel as in the original, simply select one slice along the channel axis.

Citation: http://yann.lecun.com/exdb/publis/index.html#lecun-98

Download address: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
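The fast.ai image-classification archives store each image under a folder named after its class (the layout assumed below, `mnist_png/<split>/<digit>/<file>.png`, is an assumption based on the usual convention). A minimal sketch, using only the standard library, of recovering labels from that layout:

```python
from pathlib import PurePosixPath
from collections import defaultdict

def index_by_label(paths):
    """Group image paths by their class folder (the parent directory name)."""
    index = defaultdict(list)
    for p in paths:
        index[PurePosixPath(p).parent.name].append(p)
    return dict(index)

# Hypothetical paths following the mnist_png/<split>/<digit>/ layout.
sample = [
    "mnist_png/training/0/1.png",
    "mnist_png/training/0/21.png",
    "mnist_png/training/7/9998.png",
]
labels = index_by_label(sample)
print(sorted(labels))    # ['0', '7']
print(len(labels["0"]))  # 2
```

In practice you would build the path list with `Path("mnist_png").rglob("*.png")` after extracting the archive; the same label-from-folder idea applies to the other image datasets below.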

2) CIFAR10

10 categories and 60,000 32×32 pixel color images (50,000 training and 10,000 test), 6,000 images per category. It is widely used to benchmark new algorithms. The fast.ai version of the dataset abandons the original special binary format in favor of standard PNG, which fits the normal workflow of most current codebases.

Citation: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz

3) CIFAR100

Similar to CIFAR-10, except that CIFAR-100 has 100 categories, each with 600 images (500 training and 100 test), grouped into 20 superclasses. Each image therefore carries a "fine" label (its class) and a "coarse" label (its superclass).

Citation: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz
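To illustrate the fine/coarse labeling, here is a sketch using two of the 20 superclasses as listed in the CIFAR-100 documentation (the remaining 18 follow the same pattern):

```python
# Two of CIFAR-100's 20 superclasses, mapped to their five fine labels
# (per the CIFAR-100 documentation; the other 18 follow the same pattern).
SUPERCLASSES = {
    "aquatic mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
}

# Invert to a fine-label -> coarse-label (superclass) lookup.
COARSE_OF = {fine: coarse
             for coarse, fines in SUPERCLASSES.items()
             for fine in fines}

print(COARSE_OF["dolphin"])  # aquatic mammals
print(COARSE_OF["tulip"])    # flowers
```

Models are usually trained on the 100 fine labels, with the coarse labels used for hierarchical evaluation or coarse-grained baselines.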

4) Caltech-UCSD Birds-200-2011

An image dataset of photographs of 200 bird species (mostly North American) for fine-grained image recognition. Categories: 200; images: 11,788; annotations per image: 15 part locations, 312 binary attributes, and 1 bounding box.

Citation: http://vis-www.cs.umass.edu/bcnn/

Download address: https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz

5) Caltech 101

An image dataset of 101 object categories, with 40 to 800 images per category; most categories have about 50. Each image is roughly 300×200 pixels. This dataset can also be used for object detection and localization.

Citation: http://www.vision.caltech.edu/feifeili/Fei-Fei_GMBV04.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/caltech_101.tar.gz

6) Oxford-IIIT Pet

An image dataset of 37 pet categories, with roughly 200 images per category. The images vary widely in scale, pose, and lighting. This dataset can also be used for object detection and localization.

Citation: http://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz

7) Oxford 102 Flowers

An image dataset of 102 flower species (mostly common in the UK), each with 40 to 258 images. The images vary widely in scale, pose, and lighting.

Citation: http://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/oxford-102-flowers.tgz

8) Food-101

The dataset contains 101 food categories and 101,000 images in total, with 250 test images and 750 training images per category. The training images have not been cleaned. All images have been rescaled to a maximum side length of 512 pixels.

Citation: https://pdfs.semanticscholar.org/8e3f/12804882b60ad5f59aad92755c5edb34860e.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/food-101.tgz

9) Stanford cars

An image dataset of 196 vehicle categories containing 16,185 images: 8,144 training and 8,041 test, split roughly 50-50 within each category. Categories are defined by the vehicle's make, model, and year.

Citation: https://ai.stanford.edu/~jkrause/papers/3drr13.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/stanford-cars.tgz

Natural language processing

1) IMDb Large Movie Review Dataset

A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training and 25,000 for testing. The dataset also includes unlabeled data for additional use.

Citation: http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf

Download address: https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz

2) WikiText-103

Over 100 million tokens extracted from Wikipedia's verified Good and Featured articles. It is widely used for language modeling, including for the pretrained models used by the fastai library and the ULMFiT algorithm.

Citation: https://arxiv.org/abs/1609.07843

Download address: https://s3.amazonaws.com/fast-ai-nlp/wikitext-103.tgz
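The WikiText files are plain, pre-tokenized text in which article and section headings appear on their own lines wrapped in equals signs (e.g. ` = = Gameplay = = `). A minimal sketch, assuming that format, of skipping heading lines when reading the corpus:

```python
import re

def iter_text_lines(lines):
    """Yield non-empty, non-heading lines from WikiText-style input.
    Headings look like ' = Title = ' or ' = = Section = = '."""
    heading = re.compile(r"^\s*(= )+.*( =)+\s*$")
    for line in lines:
        if line.strip() and not heading.match(line):
            yield line.strip()

# Hand-made lines mimicking the WikiText layout (illustrative only).
sample = [
    " = Valkyria Chronicles III = ",
    "",
    "The game was released in 2011 .",
    " = = Gameplay = = ",
    "Players control units on a map .",
]
body = list(iter_text_lines(sample))
print(len(body))  # 2
```

Because the corpus is already tokenized, splitting the surviving lines on whitespace is usually enough to feed a language-model pipeline.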

3) WikiText-2

A small subset of WikiText-103, mainly used to test language-model training on small datasets.

Citation: https://arxiv.org/abs/1609.07843

Download address: https://s3.amazonaws.com/fast-ai-nlp/wikitext-2.tgz

4) WMT 2015 French/English Parallel texts

French/English parallel texts for training translation models, with over 20 million sentences in each language. The dataset was created by Chris Callison-Burch, who crawled millions of web pages and used a set of simple heuristics to transform French URLs into English URLs, on the assumption that those documents are translations of each other.

Citation: https://www.cis.upenn.edu/~ccb/publications/findings-of-the-wmt09-shared-tasks.pdf

Download address: https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz

5) AG News

496,835 news articles from more than 2,000 news sources, drawn from the four largest categories of the AG News corpus; only the title and description fields are used. Each category has 30,000 training samples and 1,900 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
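The text-classification archives above (AG News, Amazon, DBPedia, Sogou, Yahoo!, Yelp) ship as header-less CSV files whose first column is a 1-based class index. A sketch of parsing one such row for AG News; the sample row is made up, and the index-to-name mapping is an assumption based on the Zhang et al. release:

```python
import csv, io

# Assumed AG News class indices (1-4) in the Zhang et al. CSV release.
CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# A made-up row in the same shape as train.csv: label, title, description.
raw = '"3","Oil prices rise","Crude futures climbed on supply worries."\n'

for label, title, desc in csv.reader(io.StringIO(raw)):
    print(CLASSES[int(label)], "-", title)
# Business - Oil prices rise
```

For the real files, replace `io.StringIO(raw)` with an open file handle; `csv.reader` handles the quoting and embedded commas for you.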

6) Amazon Reviews – Full

34,686,770 reviews from 6,643,669 Amazon users on 2,441,053 products, sourced from the Stanford Network Analysis Project (SNAP). Each score class contains 600,000 training samples and 130,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_full_csv.tgz

7) Amazon Reviews – Polarity

34,686,770 reviews from 6,643,669 Amazon users on 2,441,053 products, sourced from the Stanford Network Analysis Project (SNAP). Each polarity class in this subset contains 1,800,000 training samples and 200,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz

8) DBPedia ontology

14 non-overlapping categories from DBpedia 2014, each with 40,000 training samples and 5,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/dbpedia_csv.tgz

9) Sogou news

2,909,551 news articles from 5 categories of the SogouCA and SogouCS news corpora. Each category contains 90,000 training samples and 12,000 test samples. The Chinese text has been converted to Pinyin.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/sogou_news_csv.tgz

10) Yahoo! Answers

The 10 main categories from the Yahoo! Answers Comprehensive Questions and Answers 1.0 dataset. Each category contains 140,000 training samples and 5,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz

11) Yelp Reviews – Full

1,569,264 samples from the 2015 Yelp Dataset Challenge. Each star rating has 130,000 training samples and 10,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz

12) Yelp Reviews – Polarity

1,569,264 samples from the 2015 Yelp Dataset Challenge. Each polarity in this subset has 280,000 training samples and 19,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz

Object detection and localization

1) Camvid: Motion-based Segmentation and Recognition Dataset

A segmentation dataset of 700 images with pixel-level semantic labels; each image was checked and validated by a second person to ensure accuracy.

Citation: https://pdfs.semanticscholar.org/08f6/24f7ee5c3b05b1b604357fb1532241e208db.pdf

Download address: https://s3.amazonaws.com/fast-ai-imagelocal/camvid.tgz

2) PASCAL Visual Object Classes (VOC)

A standard image dataset for class recognition; both the 2007 and 2012 versions are available here. The 2012 version has 20 categories; its 11,530 training images contain 27,450 ROI-annotated objects and 6,929 segmentations.

Citation: http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf

Download address: https://s3.amazonaws.com/fast-ai-imagelocal/pascal-voc.tgz

3) COCO

Currently the most commonly used dataset for image detection and localization is COCO (Common Objects in Context). Provided here are all files from the 2017 COCO dataset, along with a subset created by fast.ai. Details of each file can be found on the COCO download page (http://cocodataset.org/#download). The fast.ai subset contains all images from five selected categories: chairs, sofas, TV remotes, books, and vases.

fast.ai sample subset: https://s3.amazonaws.com/fast-ai-coco/coco_sample.tgz

Training images: https://s3.amazonaws.com/fast-ai-coco/train2017.zip

Validation images: https://s3.amazonaws.com/fast-ai-coco/val2017.zip

Test images: https://s3.amazonaws.com/fast-ai-coco/test2017.zip

Unlabeled images: https://s3.amazonaws.com/fast-ai-coco/unlabeled2017.zip

Test image info: https://s3.amazonaws.com/fast-ai-coco/image_info_test2017.zip

Unlabeled image info: https://s3.amazonaws.com/fast-ai-coco/image_info_unlabeled2017.zip

Train/val annotations: https://s3.amazonaws.com/fast-ai-coco/annotations_trainval2017.zip

Stuff train/val annotations: https://s3.amazonaws.com/fast-ai-coco/stuff_annotations_trainval2017.zip

Panoptic train/val annotations: https://s3.amazonaws.com/fast-ai-coco/panoptic_annotations_trainval2017.zip
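The annotation zips contain JSON files in the documented COCO format: top-level `images`, `annotations`, and `categories` arrays, with each annotation referencing an image id and a category id. A sketch of reading that structure, using a tiny hand-made sample in place of the real (much larger) file; the ids and names below are illustrative only:

```python
import json
from collections import Counter

# A tiny hand-made sample in the shape of COCO's instances_*.json files.
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "000000000001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 62},
    {"id": 11, "image_id": 1, "category_id": 62},
    {"id": 12, "image_id": 1, "category_id": 84}
  ],
  "categories": [
    {"id": 62, "name": "chair"},
    {"id": 84, "name": "book"}
  ]
}
""")

# Map category ids to names, then count annotations per category name.
names = {c["id"]: c["name"] for c in sample["categories"]}
counts = Counter(names[a["category_id"]] for a in sample["annotations"])
print(counts.most_common())  # [('chair', 2), ('book', 1)]
```

For serious work on the real files, the official `pycocotools` library provides indexed access to the same structure.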

Dataset Collector: Huang Shanqing (for learning use only)