What is sklearn

sklearn (scikit-learn) is one of the most important libraries for machine learning. It not only ships with a good number of ready-to-use datasets, but also implements the vast majority of well-known machine learning algorithms. Its strengths are easy installation, usage that takes only a few lines of code, thorough documentation, and a very rich API.

Getting datasets for the learning phase

Machine learning grew out of statistics and can never be separated from data. Machine learning and deep learning are always built on data and ultimately serve data.

But data is hard to come by. Company-level data is not available to us: it either costs a lot of money or has to be collected by ourselves. So where do we get data during the learning phase? There are a few options:

  • sklearn – the scikit-learn website provides a small number of datasets, address: Scikit-learn.org/stable/data…
  • kaggle – Kaggle is a data mining platform whose website also provides datasets. By the way, Kaggle is now owned by Google. www.kaggle.com/datasets
  • UCI – UCI is the University of California, Irvine, a professional academic institution that provides a number of professional datasets, address: Archive.ics.uci.edu/ml/index.ph…

Here’s how they compare:

sklearn is a great library: it provides both datasets and the algorithms to process them, and the industry needs libraries like it. This is typical of Python: many libraries, each nicely encapsulated.


Install sklearn

It is easy to install with pip; here a separate Python environment has already been created:

pip3 install scikit-learn

The default PyPI server is hosted overseas, so for some people it will be very slow. In that case we can use the Tsinghua pip mirror:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple Scikit-learn

If sklearn fails to install: sklearn needs scipy and numpy support. They are normally installed automatically, but do check whether these two are present, and if not, install them with pip as well:

pip3 list
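A quick way to confirm that everything is importable is a minimal check from Python itself; this is just a sketch, and the version numbers will of course differ on your machine:

# Verify that numpy, scipy and sklearn can all be imported
import numpy
import scipy
import sklearn

print(numpy.__version__, scipy.__version__, sklearn.__version__)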

Load a sklearn dataset

sklearn's datasets module provides a large number of datasets, which come in two types:

  • Local datasets – small datasets that are bundled directly in the datasets package; load them with the datasets.load_*() API
  • Remote datasets – very large datasets, many of them gigabytes in size, which have to be downloaded from a remote server; load them with the datasets.fetch_*(data_home=None) API, where the parameter is the local directory the dataset is downloaded to. Once downloaded, the dataset is cached on disk and does not need to be downloaded again. If no directory is specified, the default is ~/scikit_learn_data/

* stands for the name of the dataset. For example, to load the iris dataset, the API is: datasets.load_iris().
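Here is a minimal sketch of the two loading styles; fetch_20newsgroups is just one example of a remote dataset, and the first call will take a while because it downloads the data:

from sklearn.datasets import load_iris, fetch_20newsgroups

# Local dataset: bundled with sklearn, loads instantly
iris = load_iris()

# Remote dataset: downloaded on first use and cached under data_home
# (defaults to ~/scikit_learn_data/ when data_home=None)
news = fetch_20newsgroups(data_home=None, subset="train")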


Data set object type

The dataset we load from sklearn is of type Bunch, which is derived from dict (it looks much like a Map to me) and supports two ways of getting values:

  • bunch["key"] – the usual dictionary-style lookup
  • bunch.key – attribute-style (dot) access to the same value

As we said before, a group of data in machine learning is collectively called a dataset, and the data consists of feature values plus target values. So the Bunch object we get in code is made up of the corresponding attributes. There are five attributes in a Bunch (see the sketch after this list):

  • data – the feature values
  • target – the target values
  • DESCR – the description of the dataset
  • feature_names – the names of the features
  • target_names – the names of the target values
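A minimal sketch of both access styles on the iris Bunch, using the attributes listed above:

from sklearn.datasets import load_iris

data = load_iris()

# Attribute-style access
print(data.feature_names)    # names of the four features
print(data.target_names)     # names of the three classes

# Dictionary-style access returns exactly the same objects
print(data["data"].shape)    # feature matrix
print(data["target"].shape)  # target vector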

That is the picture from a code point of view, but we still have not seen what the dataset actually looks like.


A classic iris data set

The dataset is a map plus a two-dimensional array: the outermost layer has a key/value structure, and inside the value is a two-dimensional array. Each row is one sample (also called a unit), and each column is a feature or a target.

In order to better understand the dataset, here we use iris, the most classic introductory example in sklearn:

Scholars collected four characteristics of irises and then divided the flowers into three categories. This is exactly how things get digitized: the obvious characteristics of a thing are abstracted and collected, and those are the features; then the things are grouped into representative types, and those are the targets.

Import the package

from sklearn.datasets import load_iris

Loading code:

from sklearn.datasets import load_iris

data = load_iris()
print(type(data))
print(data)

The dataset itself: what gets printed looks just like JSON (it is really a dict-like Bunch):

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '...',
 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
 'filename': '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv'}

The DESCR description actually contains a lot of information: which features were collected, how many classes the target is divided into, a table with the minimum, maximum, mean, standard deviation and class correlation of each feature, and finally a References section with the relevant literature. It reads a bit like a short paper…
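To print just this description on its own (DESCR is a plain string attribute of the Bunch):

print(data.DESCR)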

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments". IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

With shape we can also inspect the data, for example how many rows and columns it consists of:

print(data.data.shape)

(150, 4)
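The target, by contrast, is a one-dimensional array with one label per sample; a quick check:

print(data.target.shape)

(150,)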

Data partitioning

Data partitioning – the data we get from sklearn is for practice, which covers both training models and testing them, so we have to split the data in order to evaluate the model after training.

Generally, we split the data between training and test according to one of these ratios:

  • Training set: 70% 75% 80%
  • Test set: 30% 25% 20%

The API is in the sklearn.model_selection package:

# import the package
from sklearn.model_selection import train_test_split

train_test_split(x,y,test_size,random_state)

The train_test_split method takes 4 arguments; x and y are mandatory, the others can be omitted:

  • x – the feature values
  • y – the target values
  • test_size – the proportion of the test set, 0.25 by default
  • random_state – the random seed: when comparing the same dataset under multiple algorithms, use the same seed so the comparison is fair (see the sketch after this list)
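A minimal sketch of why the seed matters: with the same random_state, two calls produce exactly the same split, so different algorithms see identical training data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# Same seed -> same split every time
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

print((x_train_a == x_train_b).all())  # True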

The train_test_split method returns four values, which are usually named like this:

  • x_train – feature values of the training set
  • x_test – feature values of the test set
  • y_train – target values of the training set
  • y_test – target values of the test set

For the training set and the test set, we generally use these two words:

  • train – the training set
  • test – the test set

Then let’s run:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
x_train,x_test,y_train,y_test  = train_test_split(data.data,data.target,test_size = 0.2)

print("Eigenvalue of training set: \n",x_train,x_train.shape)

Printing only part of the data (not too much), you can see that the training set has 120 samples:

Feature values of the training set:
 [[5.  3.  1.6 0.2]
 [5.2 3.5 1.5 0.2]
 [5.1 3.8 1.6 0.2]
 [7.1 3.  5.9 2.1]
 [5.9 3.  4.2 1.5]
 [6.8 2.8 4.8 1.4]
 [6.7 3.3 5.7 2.1]
 [5.6 2.5 3.9 1.1]
 [6.  2.2 5.  1.5]
 [6.3 3.4 5.6 2.4]
 [4.8 3.  1.4 0.1]] (120, 4)
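As a quick follow-up check (a hypothetical addition, not part of the original run), the remaining 20% ends up in the test set, so with 150 samples and test_size = 0.2 the shapes work out to 120 and 30:

print(x_test.shape)                 # (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)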

Finally

sklearn is one of the most important libraries for machine learning. It not only provides us with data to learn from, but also a large number of machine learning APIs covering most well-known algorithms. This article is just meant to introduce the sklearn library (and it got rather wordy); later, when we get to feature engineering, we will use the algorithms inside sklearn, so stay tuned ~