@[toc]

The environment

Python 3.7

sklearn

numpy

scipy

pip3 install numpy
pip3 install scipy
pip3 install scikit-learn

(This article is a set of study notes.)

The data set

Sklearn comes with some commonly used data sets to help with testing.

sklearn.datasets

load_*:  for small datasets
fetch_*: for large datasets

Small dataset:  sklearn.datasets.load_iris()
Large dataset:  sklearn.datasets.fetch_20newsgroups(data_home=None, subset="train")

Note that fetching a large dataset actually downloads it from the corresponding website. data_home is the download directory (it has a default), and subset specifies which part you want: "train" for the training data, "test" for the test data, and "all" for both.

Example:

from sklearn.datasets import load_iris

iris = load_iris()
print("Data description:", iris["DESCR"])
print("Feature names:", iris.feature_names)
print("Feature values:", iris.data)
print("Full dataset:", iris)

This dataset describes the features of iris flowers.
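For the large-dataset case, here is a minimal sketch (not from the original notes) of fetching 20 newsgroups; it assumes you are online, and the first call downloads the data into data_home:

from sklearn.datasets import fetch_20newsgroups

# subset can be "train", "test" or "all"; data_home=None uses the default cache directory
news = fetch_20newsgroups(data_home=None, subset="train")
print("Number of documents:", len(news.data))
print("Category names:", news.target_names)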

Feature extraction

Purpose: to "flatten" other data types (such as text) with a specific algorithm, that is, to turn them into a numerical feature matrix that later steps can work with.

Dictionary feature extraction

Its job is to convert dictionary-type data into a matrix so that it can be used in later computations; dictionary feature extraction is exactly this kind of conversion.

Here’s an example:

data = [{"city":"Beijing"."template":25},
        {"city":"Shanghai"."template":55},
        {"city":"Nanchang"."template":45}]Copy the code

Look directly at the processed values:

Feature names: ['city=Shanghai', 'city=Beijing', 'city=Nanchang', 'template']
Feature values:
[[ 0.  1.  0. 25.]
 [ 1.  0.  0. 55.]
 [ 0.  0.  1. 45.]]

This is the uncompressed (dense) representation; the compressed form is a sparse matrix:

Feature names: ['city=Shanghai', 'city=Beijing', 'city=Nanchang', 'template']
Feature values:
  (0, 1)	1.0
  (0, 3)	25.0
  (1, 0)	1.0
  (1, 3)	55.0
  (2, 2)	1.0
  (2, 3)	45.0

Ok so let’s go straight to the code:

from sklearn.feature_extraction import DictVectorizer

def DictVectorizer_Test():
    data = [{"city": "Beijing", "template": 25},
            {"city": "Shanghai", "template": 55},
            {"city": "Nanchang", "template": 45}
            ]

    # Sparse output by default
    transfer = DictVectorizer()

    new_data = transfer.fit_transform(data)
    print("Feature names:", transfer.get_feature_names())
    print("Feature values:", new_data)

DictVectorizer is the key class used here.
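A side note, as a minimal sketch: if you want the dense matrix directly (the first output above), pass sparse=False, and inverse_transform goes back from the matrix to dicts.

from sklearn.feature_extraction import DictVectorizer

data = [{"city": "Beijing", "template": 25},
        {"city": "Shanghai", "template": 55},
        {"city": "Nanchang", "template": 45}]

# sparse=False returns a plain 2D numpy array instead of a scipy sparse matrix
transfer = DictVectorizer(sparse=False)
dense = transfer.fit_transform(data)
print(dense)

# inverse_transform maps the matrix back to a list of dicts
print(transfer.inverse_transform(dense))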

Text feature extraction

Now that we have a general idea of what feature extraction is, let’s look at feature extraction for text.

There are two main tools sklearn provides for text feature extraction, and the text itself splits into two cases: English and Chinese.

One at a time.

One-hot encoding

Skipping the theory and going straight to the effect: it counts how often each word appears in the text. As for what that is useful for, quite a lot.

To give you a direct example:

data = ["life is short i like python"."python is great but i prefer golan
Copy the code

Results after processing:

Feature names: ['but', 'golang', 'great', 'is', 'life', 'like', 'prefer', 'python', 'short']
Feature values:
[[0 0 0 1 1 1 0 1 1]
 [1 1 1 1 0 0 1 1 0]]

Again, this one is not compressed, and the compressed effect is similar to the above.

code

from sklearn.feature_extraction.text import CountVectorizer

def CountVectorizer_Test():
    data = ["life is short i like python", "python is great but i prefer golang"]
    transfer = CountVectorizer()
    new_data = transfer.fit_transform(data)
    print("Feature names:", transfer.get_feature_names())
    print("Feature values:", new_data.toarray())

Chinese text processing

Let's now do the same one-hot example, but with Chinese text.

Here we need a third-party library, jieba, whose main job is Chinese word segmentation.
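A minimal sketch of what jieba.cut does, assuming jieba is installed (pip3 install jieba); the exact segmentation may vary with the jieba version and dictionary:

import jieba

# jieba.cut returns a generator of tokens; join them with spaces so that
# CountVectorizer can later split on whitespace
text = "我爱大中华, 我是中国人"
print(" ".join(jieba.cut(text)))
# Prints something like: 我 爱 大 中华 ,  我 是 中国 人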

Directly on the code:

import jieba


def To_Chinese_World(worlds):
    """
    Segment each Chinese string in the list, in place.
    :param worlds: list of Chinese strings
    :return: None
    """
    index = 0
    for word in worlds:
        # Join the segmented tokens with spaces so CountVectorizer can split them later
        worlds[index] = " ".join(list(jieba.cut(word)))
        index += 1
def CountVectorizer_Chinese(data):

    To_Chinese_World(data)

    transfer = CountVectorizer()
    new_data = transfer.fit_transform(data)

    print("Feature names:", transfer.get_feature_names())
    print("Feature values:", new_data.toarray())
if __name__ == '__main__':
    data = ["我爱大中华, 我是中国人"]  # "I love Greater China, I am Chinese."
    CountVectorizer_Chinese(data)

TF-IDF processing

This is actually used for feature extraction of text. But the main purpose here is to extract keywords.

Definition of a keyword: it appears frequently in the current text but rarely in other texts, so it is a very strong signal for classification.

Then this TF-IDF is used to deal with this matter.

algorithm

The operation rules of this algorithm are relatively simple. Here’s an example:

    data = [
        "water water water x",
        "apple apple apple x",
        "pear pear pear x",
    ]

Treating each string as one document, it is obvious that the keywords are, respectively:

water apple pear

So how are they found?

TF is the term frequency: for example, the TF of apple in "apple apple apple x" is 3/4.

IDF is the inverse document frequency: count how many of the three documents contain apple (here, 1), so IDF = lg(3/1).

TF-IDF = TF * IDF
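A hand-computed sketch of the apple example (this is the textbook formula; sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its numbers differ slightly):

import math

tf = 3 / 4                    # "apple" appears 3 times among the 4 words of its document
idf = math.log10(3 / 1)       # 3 documents in total, 1 of them contains "apple"
print("TF-IDF of 'apple':", tf * idf)   # ~0.358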

code

from sklearn.feature_extraction.text import TfidfVectorizer

def TFIDFVectorizerTest():
    data = [
        "water water water x",
        "apple apple apple x",
        "pear pear pear x",
    ]

    transfer = TfidfVectorizer()
    new_data = transfer.fit_transform(data)
    print("Feature names:", transfer.get_feature_names())
    print("Feature values:", new_data.toarray())

Results:

Feature names: ['apple', 'pear', 'water']
Feature values:
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]

The keyword of the first document is water, and so on. (Note that the single-character token x is dropped by the vectorizer's default tokenizer, which only keeps words of two or more characters.)

At this point you can implement a simple article classifier yourself.
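For instance, here is a minimal sketch of such a classifier, combining TfidfVectorizer with a naive Bayes model on the 20 newsgroups dataset; the model choice and the lack of any tuning are my own simplifications, not something from these notes:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the train and test splits (downloads on first use)
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Turn the raw text into TF-IDF features
transfer = TfidfVectorizer()
x_train = transfer.fit_transform(train.data)
x_test = transfer.transform(test.data)

# Train a naive Bayes classifier and score it on the test split
estimator = MultinomialNB()
estimator.fit(x_train, train.target)
print("Test accuracy:", estimator.score(x_test, test.target))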

Data preprocessing

Normalization

This one is familiar: its role is to put features on a common scale, i.e. to make them dimensionless. It matters a lot for numerical prediction with neural networks; back in a mathematical-modeling project, a BP neural network simply would not work without normalizing the data first.

There are plenty of normalization algorithms to choose from; the simplest and most typical is min-max scaling:

x* = (x - min) / (max - min)

This maps all values into the range [0, 1] (by default).

The corresponding API:

sklearn.preprocessing.MinMaxScaler

To give you a direct example:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

def MinMaxScalerTest():
    """
    Normalization demo.
    :return: None
    """
    # 1. Load the data and keep only the first three columns; sep="\t" because the file is tab-separated
    data = pd.read_csv("dating.txt", sep="\t")
    data = data.iloc[:, :3]
    print("data:\n", data)

    # 2. Instantiate the converter; the default feature_range is [0, 1]
    # transfer = MinMaxScaler(feature_range=[2, 3])
    transfer = MinMaxScaler()

    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

Data set:

Link: pan.baidu.com/s/1Q-4zKBx3… Extraction code: 6666

data:
      milage     Liters  Consumtime
0      40920   8.326976    0.953952
1      14488   7.153469    1.673904
2      26052   1.441871    0.805124
3      75136  13.147394    0.428964
4      38344   1.669788    0.134296
..       ...        ...         ...
995    11145   3.410627    0.631838
996    68846   9.974715    0.669787
997    26575  10.650102    0.866627
998    48111   9.134528    0.728045
999    43757   7.882601    1.3324

data_new:
 [[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
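If you do not want to download dating.txt, here is a minimal self-contained sketch with made-up numbers that shows the same behaviour (each column is scaled independently into [0, 1]):

from sklearn.preprocessing import MinMaxScaler

data = [[90, 2, 10],
        [60, 4, 15],
        [75, 3, 13]]

transfer = MinMaxScaler()  # default feature_range is [0, 1]
print(transfer.fit_transform(data))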

Standardization

Because normalization is easily thrown off by extreme values and is not very stable (in scenarios where the data spread is large), there is another technique called standardization.

x* = (x - μ) / σ

μ is the mean value of all sample data, and σ is the standard deviation of all sample data.

In addition, data standardized with the z-score has a mean of 0 and a standard deviation of 1.

If that doesn't fully click, it doesn't matter; just use the API.

sklearn.preprocessing.StandardScaler

Example:

def StandardScalerTest():
    """
    Standardization demo.
    :return: None
    """
    data = pd.read_csv("dating.txt", sep="\t")
    data = data.iloc[:, :3]
    print("data:\n", data)

    # 2. Instantiate the converter
    transfer = StandardScaler()

    # 3. Call fit_transform
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)

You will notice the workflow is exactly the same; only the operator changes (to borrow Flink's terminology).
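Again, a minimal self-contained sketch with made-up numbers, which also checks the mean-0 / standard-deviation-1 property mentioned above:

from sklearn.preprocessing import StandardScaler

data = [[1.0, 200.0],
        [2.0, 300.0],
        [3.0, 400.0]]

transfer = StandardScaler()
data_new = transfer.fit_transform(data)
print(data_new)
print("column means:", data_new.mean(axis=0))  # close to 0
print("column stds: ", data_new.std(axis=0))   # close to 1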

Dimension reduction

This means reducing the dimensionality of the data; it is not the same "dimension" as in array or matrix storage in data structures.

Since we have been working with two-dimensional arrays, higher-dimensional data has to be brought down; it is a bit like flattening something. In other words, we give up some data (removing unnecessary interference and redundancy) in order to achieve that "flattening".

Feature selection

Plain English: Identify the main features from the original features.

Filter method

Data set:

Link: pan.baidu.com/s/1uj6cuR0P… Extraction code: 6666

Low variance filtering

Compute the variance of each feature and filter out the low-variance ones.

from sklearn.feature_selection import VarianceThreshold


def VarianceThresholdTest():
    data = pd.read_csv("factor_returns.csv")
    # Some columns are not needed, e.g. the first (index) column and the last two
    data = data.iloc[:, 1:-2]
    print("data:\n", data)

    # Keep only the features whose variance is above the threshold
    transfer = VarianceThreshold(threshold=10)

    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new, data_new.shape)

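A minimal self-contained sketch of the low-variance filter with made-up data; the two constant columns are dropped:

from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 1 are constant (variance 0), so the default threshold of 0 removes them
data = [[0, 2, 0, 3],
        [0, 2, 4, 3],
        [0, 2, 1, 2]]

transfer = VarianceThreshold()
print(transfer.fit_transform(data))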

The correlation coefficient

The goal here is to avoid redundancy: if two features have a correlation coefficient that is too high, we deal with it by dropping one of them.

How do you get the correlation coefficient? Take the pair of features as x and y and compute the (Pearson) correlation coefficient. As for the formula itself, check Baidu or ask a high-school math teacher.

Here’s an example:

import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv("factor_returns.csv")

# Compute the correlation coefficient between two features (pearsonr returns (r, p-value))
r1 = pearsonr(data["pe_ratio"], data["pb_ratio"])
print("Correlation between pe_ratio and pb_ratio:\n", r1)
r2 = pearsonr(data['revenue'], data['total_expense'])
print("Correlation between revenue and total_expense:\n", r2)

What do you do after that? Loop over the feature pairs, mark the highly correlated ones, and then drop them from the data, as sketched below.
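A minimal sketch of that filtering loop, assuming a pandas DataFrame named data and an arbitrary cutoff of 0.9; both the cutoff and the "keep the first feature of the pair" rule are illustrative choices, not from the original:

from itertools import combinations

from scipy.stats import pearsonr


def drop_highly_correlated(data, cutoff=0.9):
    """Drop one feature of every pair whose |correlation| exceeds the cutoff."""
    to_drop = set()
    for col_a, col_b in combinations(data.columns, 2):
        if col_a in to_drop or col_b in to_drop:
            continue
        r, _ = pearsonr(data[col_a], data[col_b])
        if abs(r) > cutoff:
            to_drop.add(col_b)  # arbitrarily keep the first of the pair
    return data.drop(columns=list(to_drop))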

Principal component analysis (PCA)

In plain English: high-dimensional data is projected down to lower dimensions; in the process old variables may be discarded, new ones created, and some information lost. Its role is to compress data, and it is applied in, for example, cluster analysis and regression analysis.

For example: take a set of two-dimensional points and compress them onto a straight line, keeping the line as close to the original points as possible.

from sklearn.decomposition import PCA


def PCATest():
    """
    PCA dimensionality reduction demo.
    :return: None
    """
    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

    # Instantiate the converter.
    # A float n_components keeps enough components to explain that fraction
    # of the variance (here 95%); an integer keeps exactly that many components.
    transfer = PCA(n_components=0.95)

    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
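For comparison, a minimal sketch of the integer form of n_components, reducing the same toy data to exactly two components:

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# An integer n_components keeps exactly that many principal components
transfer = PCA(n_components=2)
print(transfer.fit_transform(data))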

summary

Study hard and improve every day; the road you chose yourself, you have to walk it to the end no matter what!