Beginners in machine learning often struggle to find data to practice on, yet scikit-learn ships with many built-in datasets that can be loaded in just two lines of code.

I. Built-in data sets

The small built-in datasets are loaded with sklearn.datasets.load_<name>:

| Loader | Data set | Task | Shape |
| --- | --- | --- | --- |
| load_boston | Boston house prices | regression | 506 × 13 |
| fetch_california_housing | California housing | regression | 20640 × 9 |
| load_diabetes | diabetes | regression | 442 × 10 |
| load_digits | handwritten digits | classification | 1797 × 64 |
| load_breast_cancer | breast cancer | classification/clustering | (357 + 212) × 30 |
| load_iris | irises | classification/clustering | (50 × 3) × 4 |
| load_wine | wine | classification | (59 + 71 + 48) × 13 |
| load_linnerud | physical training | multivariate regression | 20 × 3 |
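As a quick sanity check on the sizes listed above, each loader returns an object whose data and target shapes match the table (a minimal sketch, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris, load_wine, load_digits

# Each loader returns a Bunch object with .data (X) and .target (y)
for loader in (load_iris, load_wine, load_digits):
    bunch = loader()
    print(loader.__name__, bunch.data.shape, bunch.target.shape)
# load_iris (150, 4) (150,)
# load_wine (178, 13) (178,)
# load_digits (1797, 64) (1797,)
```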

How to use:

Dataset information keywords:

  • DESCR:

    Description of the data set

  • data:

    The feature data (i.e., X)

  • feature_names:

    Names of the feature columns

  • target:

    The data labels (i.e., y)

  • target_names:

    Names of the label classes (not available in regression datasets)
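The keywords above are attributes of the Bunch object each loader returns; a minimal sketch using load_iris:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(sorted(iris.keys()))   # includes 'DESCR', 'data', 'feature_names', 'target', 'target_names'
print(iris.DESCR[:60])       # free-text description of the data set
print(iris.feature_names)    # column names for iris.data
print(iris.target_names)     # class names for iris.target
```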

Usage method (load_iris as an example)

Data introduction:

  • It is generally used for classification tests.
  • The dataset contains 150 samples, divided into 3 classes with 50 samples per class.
  • Each sample has four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all positive floating-point values measured in centimeters.
  • These four features can be used to predict which iris species (Iris-setosa, Iris-versicolour, Iris-virginica) a sample belongs to.

Step 1: import the data

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X.shape, y.shape  # ((150, 4), (150,))

View the feature names:

iris.feature_names
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

To view the label names:

iris.target_names

The output is:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Divide into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

This splits the data into training and test sets at a 3:1 ratio, after which a machine-learning algorithm can be trained and evaluated.
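A quick check of the resulting split sizes (the random_state value here is an added assumption, used only for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 112 38, i.e. roughly 3:1
```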

Tip: convert the data to DataFrame format (either method works):

import pandas as pd
df_X = pd.DataFrame(iris.data, columns=iris.feature_names)  # features
df2 = pd.DataFrame(iris.target, columns=["target"])  # labels (y)
df = pd.concat([df_X, df2], axis=1)
df.head()

Or:

import numpy as np
import pandas as pd
col_names = iris['feature_names'] + ['target']
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns=col_names)
df.head()

Both methods produce the same output.

II. Online data sets (download required)

These datasets are loaded with sklearn.datasets.fetch_<name>.

They must be downloaded from the internet, which can be slow.

| Loader | Description |
| --- | --- |
| fetch_20newsgroups | One of the international standard datasets for text classification, text mining, and information retrieval research. It collects about 20,000 newsgroup documents, evenly divided among 20 newsgroups on different topics. Returns raw text, so a feature extractor is needed. |
| fetch_20newsgroups_vectorized | The vectorized version of the text data above; it returns already-extracted features, so no feature extractor is required. |
| fetch_california_housing | California housing data; call fetch_california_housing()['DESCR'] to see the details. |
| fetch_covtype | Forest cover types: 581,012 samples, each represented by 54 dimensions (12 attributes, two of which are one-hot encoded into 4 and 40 dimensions respectively); target is the cover type 1-7, and all attribute values are numeric. Call fetch_covtype()['DESCR'] to see what each attribute means. |
| fetch_kddcup99 | The dataset used in the 1999 KDD competition; KDD99 is still the de facto benchmark in network intrusion detection, laying the foundation for intrusion-detection research based on computational intelligence. Contains 41 features. |
| fetch_lfw_pairs | The task is face verification: given a pair of images, a binary classifier must predict whether the two images are of the same person. |
| fetch_lfw_people | A dataset of labeled faces. |
| fetch_mldata | Downloads datasets from mldata.org. |
| fetch_olivetti_faces | The Olivetti face image dataset. |
| fetch_rcv1 | The Reuters news corpus dataset. |
| fetch_species_distributions | A species distribution dataset. |

Usage is the same as for the built-in datasets, with an extra download step (example: fetch_20newsgroups):

from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')  # download the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25, random_state=33)

III. Generated data sets

These functions can generate data for classification, regression, clustering, manifold learning, and factorization tasks. For classification and clustering, they produce a matrix of sample feature vectors and the corresponding array of class labels.

  • make_blobs: multi-class, single-label datasets; assigns one or more normally distributed point clusters to each class.
  • make_classification: multi-class, single-label datasets; assigns one or more normally distributed point clusters to each class, and provides ways to add noise to the data, including correlated, uninformative, and redundant features.
  • make_gaussian_quantiles: divides a single Gaussian-distributed point set into two equal-sized sets as two classes.
  • make_hastie_10_2: produces a similar binary classification dataset with 10 dimensions.
  • make_circles and make_moons: generate 2D binary classification datasets for testing algorithm performance; noise can be added, producing circular decision-boundary data for binary classifiers.

For example:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
plt.title('make_moons function example')
plt.scatter(X[:,0],X[:,1],marker='o',c=y)
plt.show()
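make_blobs from the list above works the same way as make_moons; a minimal sketch:

```python
from sklearn.datasets import make_blobs

# 3 Gaussian clusters in 2D; X is the feature matrix, y the cluster label
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=42)
print(X.shape, y.shape)  # (150, 2) (150,)
print(set(y))            # {0, 1, 2}
```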


IV. Other data sets

Kaggle:

www.kaggle.com

Tianchi:

tianchi.aliyun.com/dataset

Sogou Laboratory:

www.sogou.com/labs/resour…

DC Contest:

www.pkbigdata.com/common/cmpt…

DF Contest:

www.datafountain.cn/datasets

Summary

This article shows machine-learning beginners how to use scikit-learn's built-in datasets, which can be loaded in just two lines of code for most machine-learning experiments.
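For reference, the two lines of code look like this:

```python
from sklearn.datasets import load_iris
iris = load_iris()  # iris.data and iris.target are now ready for any estimator
```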

Reference

Scikit-learn.org/stable/data…

On this site

The "Beginner Machine Learning" public account was founded by Dr. Huang Haiguang, who has more than 23,000 followers on Zhihu and ranks in the top 110 on GitHub (32,000+ stars). The account focuses on popular-science articles about artificial intelligence and provides learning roadmaps and introductory materials for beginners. Original works include Personal Notes on Machine Learning and Notes on Deep Learning.
