
Building artificial intelligence or machine learning systems is easier than ever. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the massive computing power available from AWS, Google Cloud, and other providers, means you can train state-of-the-art machine learning models on your laptop in an afternoon.

The importance of datasets to deep learning models is self-evident. However, depending on their nature, type, and domain, datasets are scattered across different platforms and are badly in need of organization.

Without data, our machine learning and deep learning models can do nothing. The people who create the datasets that let us train our models are heroes, even if they often don't get the thanks they deserve. Fortunately, the most valuable datasets have become academic benchmarks, widely cited by researchers, especially when comparing algorithmic changes; names like MNIST, CIFAR-10, and ImageNet are familiar both inside and outside the field.

If you use these datasets in your research, please remember to cite the original paper (citation links are provided below); if you use them in a commercial or educational project, consider adding a thank-you note and a link back to the original dataset.

We often use these datasets in our teaching because they are excellent examples of the kinds of data students are likely to encounter, and students can gauge their progress by comparing their work against published results on the same data. We also use Kaggle competition datasets, whose public leaderboards let students test their models against the best in the world, but Kaggle datasets are not included in this list.

Image classification

1) MNIST

The classic small (28×28 pixel) grayscale handwritten digit dataset, developed in the 1990s to test the most sophisticated models of its time; today, MNIST is viewed more as a foundational sanity check for deep learning. The fast.ai version of the dataset abandons the original special binary format in favor of standard PNG, which fits the normal workflow of most current codebases. If you want a single input channel as in the original, simply select one slice along the channel axis.

Citation: http://yann.lecun.com/exdb/publis/index.html#lecun-98

Download address: https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
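The fast.ai image-classification archives store each image under a folder named after its class (the layout assumed below, `mnist_png/<split>/<digit>/<file>.png`, is an assumption based on the usual convention). A minimal sketch, using only the standard library, of recovering labels from that layout:

```python
from pathlib import PurePosixPath
from collections import defaultdict

def index_by_label(paths):
    """Group image paths by their class folder (the parent directory name)."""
    index = defaultdict(list)
    for p in paths:
        index[PurePosixPath(p).parent.name].append(p)
    return dict(index)

# Hypothetical paths following the mnist_png/<split>/<digit>/ layout.
sample = [
    "mnist_png/training/0/1.png",
    "mnist_png/training/0/21.png",
    "mnist_png/training/7/9998.png",
]
labels = index_by_label(sample)
print(sorted(labels))    # ['0', '7']
print(len(labels["0"]))  # 2
```

In practice you would build the path list with `Path("mnist_png").rglob("*.png")` after extracting the archive; the same label-from-folder idea applies to the other image datasets below.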

2) CIFAR10

10 categories and 60,000 32×32 pixel color images (50,000 training and 10,000 test), 6,000 images per category. It is widely used to benchmark new algorithms. The fast.ai version of the dataset abandons the original special binary format in favor of standard PNG, which fits the normal workflow of most current codebases.

Citation: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz

3) CIFAR100

Similar to CIFAR-10, except that CIFAR-100 has 100 categories, each with 600 images (500 training and 100 test), grouped into 20 superclasses. Each image therefore carries a "fine" label (its class) and a "coarse" label (its superclass).

Citation: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz
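To illustrate the fine/coarse labeling, here is a sketch using two of the 20 superclasses as listed in the CIFAR-100 documentation (the remaining 18 follow the same pattern):

```python
# Two of CIFAR-100's 20 superclasses, mapped to their five fine labels
# (per the CIFAR-100 documentation; the other 18 follow the same pattern).
SUPERCLASSES = {
    "aquatic mammals": ["beaver", "dolphin", "otter", "seal", "whale"],
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
}

# Invert to a fine-label -> coarse-label (superclass) lookup.
COARSE_OF = {fine: coarse
             for coarse, fines in SUPERCLASSES.items()
             for fine in fines}

print(COARSE_OF["dolphin"])  # aquatic mammals
print(COARSE_OF["tulip"])    # flowers
```

Models are usually trained on the 100 fine labels, with the coarse labels used for hierarchical evaluation or coarse-grained baselines.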

4) Caltech-UCSD Birds-200-2011

An image dataset of photographs of 200 bird species (mostly North American) for fine-grained image recognition. Categories: 200; images: 11,788; annotations per image: 15 part locations, 312 binary attributes, and 1 bounding box.

Citation: http://vis-www.cs.umass.edu/bcnn/

Download address: https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz

5) Caltech 101

An image dataset of 101 object categories, with 40 to 800 images per category; most categories have about 50. Each image is roughly 300×200 pixels. This dataset can also be used for object detection and localization.

Citation: http://www.vision.caltech.edu/feifeili/Fei-Fei_GMBV04.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/caltech_101.tar.gz

6) Oxford-IIIT Pet

An image dataset of 37 pet categories, with roughly 200 images per category. The images vary widely in scale, pose, and lighting. This dataset can also be used for object detection and localization.

Citation: http://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/oxford-iiit-pet.tgz

7) Oxford 102 Flowers

An image dataset of 102 flower species (mostly common in the UK), each with 40 to 258 images. The images vary widely in scale, pose, and lighting.

Citation: http://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/oxford-102-flowers.tgz

8) Food-101

The dataset contains 101 food categories and 101,000 images in total, with 250 test images and 750 training images per category. The training images have not been cleaned. All images have been rescaled to a maximum side length of 512 pixels.

Citation: https://pdfs.semanticscholar.org/8e3f/12804882b60ad5f59aad92755c5edb34860e.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/food-101.tgz

9) Stanford cars

An image dataset of 196 vehicle categories containing 16,185 images: 8,144 training and 8,041 test, split roughly 50-50 within each category. Categories are defined by the vehicle's make, model, and year.

Citation: https://ai.stanford.edu/~jkrause/papers/3drr13.pdf

Download address: https://s3.amazonaws.com/fast-ai-imageclas/stanford-cars.tgz

Natural language processing

1) IMDb Large Movie Review Dataset

A dataset for binary sentiment classification containing 25,000 highly polarized movie reviews for training and 25,000 for testing. The dataset also includes unlabeled data for additional use.

Citation: http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf

Download address: https://s3.amazonaws.com/fast-ai-nlp/imdb.tgz

2) WikiText-103

Over 100 million tokens extracted from Wikipedia's verified Good and Featured articles. It is widely used for language modeling, including for the pretrained models used by the fastai library and the ULMFiT algorithm.

Citation: https://arxiv.org/abs/1609.07843

Download address: https://s3.amazonaws.com/fast-ai-nlp/wikitext-103.tgz
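The WikiText files are plain, pre-tokenized text in which article and section headings appear on their own lines wrapped in equals signs (e.g. ` = = Gameplay = = `). A minimal sketch, assuming that format, of skipping heading lines when reading the corpus:

```python
import re

def iter_text_lines(lines):
    """Yield non-empty, non-heading lines from WikiText-style input.
    Headings look like ' = Title = ' or ' = = Section = = '."""
    heading = re.compile(r"^\s*(= )+.*( =)+\s*$")
    for line in lines:
        if line.strip() and not heading.match(line):
            yield line.strip()

# Hand-made lines mimicking the WikiText layout (illustrative only).
sample = [
    " = Valkyria Chronicles III = ",
    "",
    "The game was released in 2011 .",
    " = = Gameplay = = ",
    "Players control units on a map .",
]
body = list(iter_text_lines(sample))
print(len(body))  # 2
```

Because the corpus is already tokenized, splitting the surviving lines on whitespace is usually enough to feed a language-model pipeline.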

3) WikiText-2

A small subset of WikiText-103, mainly used to test language-model training on small datasets.

Citation: https://arxiv.org/abs/1609.07843

Download address: https://s3.amazonaws.com/fast-ai-nlp/wikitext-2.tgz

4) WMT 2015 French/English Parallel texts

French/English parallel texts for training translation models, with over 20 million sentences in each language. The dataset was created by Chris Callison-Burch, who crawled millions of web pages and used a set of simple heuristics to transform French URLs into English URLs, on the assumption that those documents are translations of each other.

Citation: https://www.cis.upenn.edu/~ccb/publications/findings-of-the-wmt09-shared-tasks.pdf

Download address: https://s3.amazonaws.com/fast-ai-nlp/giga-fren.tgz

5) AG News

496,835 news articles from more than 2,000 news sources, drawn from the four largest categories of the AG News corpus; only the title and description fields are used. Each category has 30,000 training samples and 1,900 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
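The text-classification archives above (AG News, Amazon, DBPedia, Sogou, Yahoo!, Yelp) ship as header-less CSV files whose first column is a 1-based class index. A sketch of parsing one such row for AG News; the sample row is made up, and the index-to-name mapping is an assumption based on the Zhang et al. release:

```python
import csv, io

# Assumed AG News class indices (1-4) in the Zhang et al. CSV release.
CLASSES = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# A made-up row in the same shape as train.csv: label, title, description.
raw = '"3","Oil prices rise","Crude futures climbed on supply worries."\n'

for label, title, desc in csv.reader(io.StringIO(raw)):
    print(CLASSES[int(label)], "-", title)
# Business - Oil prices rise
```

For the real files, replace `io.StringIO(raw)` with an open file handle; `csv.reader` handles the quoting and embedded commas for you.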

6) Amazon Reviews – Full

34,686,770 reviews from 6,643,669 Amazon users on 2,441,053 products, sourced from the Stanford Network Analysis Project (SNAP). Each score class contains 600,000 training samples and 130,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_full_csv.tgz

7) Amazon Reviews – Polarity

34,686,770 reviews from 6,643,669 Amazon users on 2,441,053 products, sourced from the Stanford Network Analysis Project (SNAP). Each polarity class in this subset contains 1,800,000 training samples and 200,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz

8) DBPedia ontology

14 non-overlapping categories from DBpedia 2014, each with 40,000 training samples and 5,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/dbpedia_csv.tgz

9) Sogou news

2,909,551 news articles from 5 categories of the SogouCA and SogouCS news corpora. Each category contains 90,000 training samples and 12,000 test samples. The Chinese text has been converted to Pinyin.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/sogou_news_csv.tgz

10) Yahoo! Answers

The 10 main categories from the Yahoo! Answers Comprehensive Questions and Answers 1.0 dataset. Each category contains 140,000 training samples and 5,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz

11) Yelp Reviews – Full

1,569,264 samples from the 2015 Yelp Dataset Challenge. Each star rating has 130,000 training samples and 10,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz

12) Yelp Reviews – Polarity

1,569,264 samples from the 2015 Yelp Dataset Challenge. Each polarity in this subset has 280,000 training samples and 19,000 test samples.

Citation: https://arxiv.org/abs/1509.01626

Download address: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz

Object detection and localization

1) Camvid: Motion-based Segmentation and Recognition Dataset

A segmentation dataset of 700 images with pixel-level semantic labels; each image was checked and validated by a second person to ensure accuracy.

Citation: https://pdfs.semanticscholar.org/08f6/24f7ee5c3b05b1b604357fb1532241e208db.pdf

Download address: https://s3.amazonaws.com/fast-ai-imagelocal/camvid.tgz

2) PASCAL Visual Object Classes (VOC)

A standard image dataset for class recognition; both the 2007 and 2012 versions are available here. The 2012 version has 20 categories; its 11,530 training images contain 27,450 ROI-annotated objects and 6,929 segmentations.

Citation: http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf

Download address: https://s3.amazonaws.com/fast-ai-imagelocal/pascal-voc.tgz

3) COCO

Currently the most commonly used dataset for image detection and localization is COCO (Common Objects in Context). Provided here are all files from the 2017 COCO dataset, along with a subset created by fast.ai. Details of each file can be found on the COCO download page (http://cocodataset.org/#download). The fast.ai subset contains all images from five selected categories: chairs, sofas, TV remotes, books, and vases.

fast.ai sample subset: https://s3.amazonaws.com/fast-ai-coco/coco_sample.tgz

Training images: https://s3.amazonaws.com/fast-ai-coco/train2017.zip

Validation images: https://s3.amazonaws.com/fast-ai-coco/val2017.zip

Test images: https://s3.amazonaws.com/fast-ai-coco/test2017.zip

Unlabeled images: https://s3.amazonaws.com/fast-ai-coco/unlabeled2017.zip

Test image info: https://s3.amazonaws.com/fast-ai-coco/image_info_test2017.zip

Unlabeled image info: https://s3.amazonaws.com/fast-ai-coco/image_info_unlabeled2017.zip

Train/val annotations: https://s3.amazonaws.com/fast-ai-coco/annotations_trainval2017.zip

Stuff train/val annotations: https://s3.amazonaws.com/fast-ai-coco/stuff_annotations_trainval2017.zip

Panoptic train/val annotations: https://s3.amazonaws.com/fast-ai-coco/panoptic_annotations_trainval2017.zip
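The annotation zips contain JSON files in the documented COCO format: top-level `images`, `annotations`, and `categories` arrays, with each annotation referencing an image id and a category id. A sketch of reading that structure, using a tiny hand-made sample in place of the real (much larger) file; the ids and names below are illustrative only:

```python
import json
from collections import Counter

# A tiny hand-made sample in the shape of COCO's instances_*.json files.
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "000000000001.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 62},
    {"id": 11, "image_id": 1, "category_id": 62},
    {"id": 12, "image_id": 1, "category_id": 84}
  ],
  "categories": [
    {"id": 62, "name": "chair"},
    {"id": 84, "name": "book"}
  ]
}
""")

# Map category ids to names, then count annotations per category name.
names = {c["id"]: c["name"] for c in sample["categories"]}
counts = Counter(names[a["category_id"]] for a in sample["annotations"])
print(counts.most_common())  # [('chair', 2), ('book', 1)]
```

For serious work on the real files, the official `pycocotools` library provides indexed access to the same structure.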

Dataset Collector: Huang Shanqing (for learning use only)