Beginners in machine learning often struggle to find data to practice on, yet scikit-learn ships with many built-in datasets that can be loaded in just two lines of code.

I. Built-in data sets

The small built-in datasets are loaded with sklearn.datasets.load_<name>:

| Loader | Data set | Task | Shape |
| --- | --- | --- | --- |
| load_boston | Boston house prices | regression | 506 × 13 |
| fetch_california_housing | California housing | regression | 20640 × 9 |
| load_diabetes | diabetes | regression | 442 × 10 |
| load_digits | handwritten digits | classification | 1797 × 64 |
| load_breast_cancer | breast cancer | classification/clustering | (357 + 212) × 30 |
| load_iris | irises | classification/clustering | (50 × 3) × 4 |
| load_wine | wine | classification | (59 + 71 + 48) × 13 |
| load_linnerud | physical training | multivariate regression | 20 × 3 |
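As a quick sanity check on the sizes listed above, each loader returns an object whose data and target shapes match the table (a minimal sketch, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris, load_wine, load_digits

# Each loader returns a Bunch object with .data (X) and .target (y)
for loader in (load_iris, load_wine, load_digits):
    bunch = loader()
    print(loader.__name__, bunch.data.shape, bunch.target.shape)
# load_iris (150, 4) (150,)
# load_wine (178, 13) (178,)
# load_digits (1797, 64) (1797,)
```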

How to use:

Dataset information keywords:

  • DESCR:

    Description of the data set

  • data:

    The feature data (i.e., X)

  • feature_names:

    Names of the feature columns

  • target:

    The data labels (i.e., y)

  • target_names:

    Names of the label classes (not available in regression datasets)
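The keywords above are attributes of the Bunch object each loader returns; a minimal sketch using load_iris:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(sorted(iris.keys()))   # includes 'DESCR', 'data', 'feature_names', 'target', 'target_names'
print(iris.DESCR[:60])       # free-text description of the data set
print(iris.feature_names)    # column names for iris.data
print(iris.target_names)     # class names for iris.target
```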

Usage method (load_iris as an example)

Data introduction:

  • It is generally used for classification tests.
  • The dataset contains 150 samples, divided into 3 classes with 50 samples per class.
  • Each sample has four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all positive floating-point values measured in centimeters.
  • These four features can be used to predict which iris species (Iris-setosa, Iris-versicolour, Iris-virginica) a sample belongs to.

Step 1: import the data

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X.shape, y.shape  # ((150, 4), (150,))

View the feature names:

iris.feature_names
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

To view the label names:

iris.target_names

The output is:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Divide into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

This splits the data into training and test sets at a 3:1 ratio, after which a machine-learning algorithm can be trained and evaluated.
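A quick check of the resulting split sizes (the random_state value here is an added assumption, used only for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 112 38, i.e. roughly 3:1
```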

Tip: convert the data to DataFrame format (either method works):

import pandas as pd
df_X = pd.DataFrame(iris.data, columns=iris.feature_names)  # features
df2 = pd.DataFrame(iris.target, columns=["target"])  # labels (y)
df = pd.concat([df_X, df2], axis=1)
df.head()

Or:

import numpy as np
import pandas as pd
col_names = iris['feature_names'] + ['target']
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns=col_names)
df.head()

Both methods produce the same output.

II. Online data sets (download required)

These datasets are loaded with sklearn.datasets.fetch_<name>.

They must be downloaded from the internet, which can be slow.

| Loader | Description |
| --- | --- |
| fetch_20newsgroups | One of the international standard datasets for text classification, text mining, and information retrieval research. It collects about 20,000 newsgroup documents, evenly divided among 20 newsgroups on different topics. Returns raw text, so a feature extractor is needed. |
| fetch_20newsgroups_vectorized | The vectorized version of the text data above; it returns already-extracted features, so no feature extractor is required. |
| fetch_california_housing | California housing data; call fetch_california_housing()['DESCR'] to see the details. |
| fetch_covtype | Forest cover types: 581,012 samples, each represented by 54 dimensions (12 attributes, two of which are one-hot encoded into 4 and 40 dimensions respectively); target is the cover type 1-7, and all attribute values are numeric. Call fetch_covtype()['DESCR'] to see what each attribute means. |
| fetch_kddcup99 | The dataset used in the 1999 KDD competition; KDD99 is still the de facto benchmark in network intrusion detection, laying the foundation for intrusion-detection research based on computational intelligence. Contains 41 features. |
| fetch_lfw_pairs | The task is face verification: given a pair of images, a binary classifier must predict whether the two images are of the same person. |
| fetch_lfw_people | A dataset of labeled faces. |
| fetch_mldata | Downloads datasets from mldata.org. |
| fetch_olivetti_faces | The Olivetti face image dataset. |
| fetch_rcv1 | The Reuters news corpus dataset. |
| fetch_species_distributions | A species distribution dataset. |

Usage is the same as for the built-in datasets, with an extra download step (example: fetch_20newsgroups):

from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')  # download the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

III. Generated data sets

These functions can generate data for classification, regression, clustering, manifold learning, and factorization tasks. For classification and clustering, they produce a matrix of sample feature vectors and the corresponding array of class labels.

  • make_blobs: multi-class, single-label datasets; assigns one or more normally distributed point clusters to each class.
  • make_classification: multi-class, single-label datasets; assigns one or more normally distributed point clusters to each class, and provides ways to add noise to the data, including correlated, uninformative, and redundant features.
  • make_gaussian_quantiles: divides a single Gaussian-distributed point set into two equal-sized sets as two classes.
  • make_hastie_10_2: produces a similar binary classification dataset with 10 dimensions.
  • make_circles and make_moons: generate 2D binary classification datasets for testing algorithm performance; noise can be added, producing circular decision-boundary data for binary classifiers.

For example:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
plt.title('make_moons function example')
plt.scatter(X[:,0],X[:,1],marker='o',c=y)
plt.show()
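make_blobs from the list above works the same way as make_moons; a minimal sketch:

```python
from sklearn.datasets import make_blobs

# 3 Gaussian clusters in 2D; X is the feature matrix, y the cluster label
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)
print(X.shape, y.shape)  # (150, 2) (150,)
print(set(y))            # {0, 1, 2}
```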


IV. Other data sets

Kaggle:

www.kaggle.com

Tianchi:

tianchi.aliyun.com/dataset

Sogou Laboratory:

www.sogou.com/labs/resour…

DC Contest:

www.pkbigdata.com/common/cmpt…

DF Contest:

www.datafountain.cn/datasets

Summary

This article shows machine-learning beginners how to use scikit-learn's built-in datasets, which can be loaded in just two lines of code for most machine-learning experiments.
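For reference, the two lines of code look like this:

```python
from sklearn.datasets import load_iris
iris = load_iris()  # iris.data and iris.target are now ready for any estimator
```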

Reference

Scikit-learn.org/stable/data…

On this site

The "Beginner Machine Learning" public account was founded by Dr. Huang Haiguang, who has more than 23,000 followers on Zhihu and ranks in the top 110 on GitHub (32,000+ stars). The account focuses on popular-science articles about artificial intelligence and provides learning roadmaps and introductory materials for beginners. Original works include Personal Notes on Machine Learning and Notes on Deep Learning.
