What is sklearn

Sklearn is a third-party library on Python's PyPI. It provides many model algorithms that you can call directly. A machine learning workflow is basically divided into 6 steps: 1) data acquisition 2) data processing 3) feature engineering 4) model training, model saving and loading 5) model evaluation 6) model application

1/ Get data

Sklearn provides datasets for beginners in the sklearn.datasets module. After importing sklearn.datasets, the load_* and fetch_* functions can be used to get data: load_* loads small datasets bundled with the library, while fetch_* downloads large-scale datasets.
   # Get small-scale data
   from sklearn.datasets import load_iris 
   iris_object = load_iris() # Get small data

   # After obtaining the data, you can view some of its properties, such as:
   iris_object.data          # Feature data, a 2D array of shape (n_samples, n_features)
   iris_object.data.shape # rows and columns of feature data

   iris_object.target        # Target values, a 1D array

   iris_object.DESCR # Data description
   iris_object.feature_names # Feature name
   iris_object.target_names  # Names of the target classes
   
   # Get large-scale data
   from sklearn.datasets import fetch_20newsgroups
   news = fetch_20newsgroups() # Get large-scale data

2/ Data processing

The acquired data cannot be used directly; it needs to be split into a training set and a test set. sklearn.model_selection.train_test_split(x, y, test_size, random_state) takes 4 arguments: <1> x is the feature values of the dataset. <2> y is the target values of the dataset. <3> test_size is the size of the test set, a float. <4> random_state is the random seed; different seeds produce different random splits, so pass the same random_state each time if you want reproducible results. The return values are, in order: training set feature values, test set feature values, training set target values, test set target values.
   # e.g.
   from sklearn.model_selection import train_test_split
   x_train,x_test,y_train,y_test = train_test_split(iris_object.data,
                                                    iris_object.target,
                                                     test_size=0.2,
                                                    random_state=22)

3/ Feature engineering

Sklearn provides a very powerful interface for manipulating features.

1/ Feature extraction

① Dictionary feature extraction:

sklearn.feature_extraction.DictVectorizer(sparse=True)
DictVectorizer.fit_transform(): the input is a dictionary or an iterator containing dictionaries; the return value is a sparse matrix. Pass sparse=False to get a two-dimensional array instead.
DictVectorizer.inverse_transform(): the input is an array or a sparse matrix; the return value is the data in its format before conversion.
DictVectorizer.get_feature_names(): returns the category names.
Application scenarios: 1. when the dataset has many categorical features, convert the features to dictionary type and use DictVectorizer to perform the transformation; 2. when the data you get is already of dictionary type.
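A minimal sketch of dictionary feature extraction (the city/temperature records are made up for illustration):

    from sklearn.feature_extraction import DictVectorizer

    data = [{"city": "Beijing", "temperature": 100},
            {"city": "Shanghai", "temperature": 60},
            {"city": "Shenzhen", "temperature": 30}]

    transfer = DictVectorizer(sparse=False)   # return a 2D array instead of a sparse matrix
    result = transfer.fit_transform(data)
    print(transfer.get_feature_names())       # category names (get_feature_names_out() in newer sklearn)
    print(result)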

② Text feature extraction

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
stop_words: stop words, i.e. words that should not be treated as features during text feature extraction.
CountVectorizer.fit_transform(): the input is text or an iterator containing text strings; the return value is a sparse matrix, which can be converted directly to a two-dimensional array with the toarray() method.
CountVectorizer.inverse_transform(): the input is an array or a sparse matrix; the return value is the data in its format before conversion.
CountVectorizer.get_feature_names(): returns the category names.
For Chinese text, the jieba library can be used to segment a string into words first.
TF-IDF text feature extraction uses the fact that a word's frequency in one article differs greatly from its frequency in other articles to extract features. Method:
sklearn.feature_extraction.text.TfidfVectorizer(stop_words=[])
TfidfVectorizer.fit_transform(): the input is text or an iterator containing text strings; the return value is a sparse matrix, which can be converted directly to a two-dimensional array with the toarray() method.
TfidfVectorizer.inverse_transform(): the input is an array or a sparse matrix; the return value is the data in its format before conversion.
TfidfVectorizer.get_feature_names(): returns the category names.
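A minimal sketch of count and TF-IDF text feature extraction (the two example sentences and the stop-word list are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    texts = ["life is short, i like python",
             "life is too long, i dislike python"]

    count = CountVectorizer(stop_words=["is", "too"])   # listed words are ignored
    print(count.fit_transform(texts).toarray())         # word counts as a 2D array
    print(count.get_feature_names())

    tfidf = TfidfVectorizer(stop_words=["is", "too"])
    print(tfidf.fit_transform(texts).toarray())         # TF-IDF weights as a 2D array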

2/ Feature preprocessing

① Normalization:

sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1))
feature_range=(0, 1): the range to scale the data into.
MinMaxScaler.fit_transform(): the input is data in numpy array format [n_samples, n_features] (number of samples, number of features); the return value is an array of the same shape.
Disadvantage: this approach is heavily influenced by outliers.
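A minimal sketch of normalization (the small array is made up for illustration):

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    data = np.array([[90, 2, 10],
                     [60, 4, 15],
                     [75, 3, 13]])
    transfer = MinMaxScaler(feature_range=(0, 1))   # scale each feature into [0, 1]
    print(transfer.fit_transform(data))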

② Standardization:

sklearn.preprocessing.StandardScaler does not need a specified range: it transforms the data so that each feature has mean 0 and standard deviation 1.
StandardScaler.fit_transform(): the input is data in numpy array format [n_samples, n_features] (number of samples, number of features); the return value is an array of the same shape.
Standardization is better suited to processing large amounts of data and is more stable when there are enough samples.
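A minimal sketch of standardization (the array is made up for illustration):

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    data = np.array([[1., 2., 3.],
                     [4., 5., 6.],
                     [7., 8., 9.]])
    transfer = StandardScaler()
    result = transfer.fit_transform(data)
    print(result.mean(axis=0), result.std(axis=0))   # each column: mean 0, std 1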

3/ Feature dimension reduction:

It refers to reducing the number of features and removing irrelevant features.

① Variance filtering dimension reduction:

sklearn.feature_selection.VarianceThreshold(threshold=0.0)
VarianceThreshold.fit_transform(): the input is data in numpy array format [n_samples, n_features] (number of samples, number of features); the return value is an array with the low-variance features removed.
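A minimal sketch of variance filtering (the array is made up; the constant columns are dropped):

    from sklearn.feature_selection import VarianceThreshold
    import numpy as np

    data = np.array([[0, 2, 0, 3],
                     [0, 1, 4, 3],
                     [0, 1, 1, 3]])
    transfer = VarianceThreshold(threshold=0.0)   # drop features whose variance is not above 0
    print(transfer.fit_transform(data))           # columns 0 and 3 are removed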

② Correlation coefficient filtering dimension reduction:

The correlation coefficient is calculated using scipy.stats.pearsonr(x, y). The inputs are the data of the two features to compare, and the return value contains the correlation coefficient and the p-value.
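A minimal sketch of computing a correlation coefficient (the two feature columns are made up for illustration):

    from scipy.stats import pearsonr

    x = [12.5, 10.0, 8.0, 9.5, 11.0]    # values of feature 1
    y = [21.2, 18.9, 16.0, 17.5, 19.8]  # values of feature 2
    r, p_value = pearsonr(x, y)         # correlation coefficient and p-value
    print(r, p_value)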

③ Principal Component Analysis (PCA)

sklearn.decomposition.PCA(n_components=None) processes the data to implement dimensionality reduction.
n_components: a decimal means the percentage of information (variance) to retain; an integer means the number of features to reduce to.
PCA.fit_transform(): the input is data in numpy array format [n_samples, n_features] (number of samples, number of features); the return value is the transformed array with the specified number of dimensions.
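A minimal sketch of PCA (the array is made up for illustration):

    from sklearn.decomposition import PCA

    data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
    pca_ratio = PCA(n_components=0.95)   # keep 95% of the variance
    print(pca_ratio.fit_transform(data))
    pca_dims = PCA(n_components=2)       # reduce to 2 features
    print(pca_dims.fit_transform(data))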

4/ Model training (design model)

The basic usage of an algorithm in sklearn is: 1. instantiate an estimator class; 2. call estimator.fit(x_train, y_train) to train on the input data. Model evaluation: compare the predictions with the true values, y_predict = estimator.predict(x_test) and check y_predict == y_test, or compute the accuracy directly with accuracy = estimator.score(x_test, y_test).

<1> Classification algorithm:

① KNN algorithm

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
n_neighbors is the K value; algorithm defaults to 'auto' and generally does not need to be set.
Advantages: simple, easy to understand and easy to implement. Disadvantages: it is a lazy algorithm with a large amount of computation and a large memory overhead, and the K value is not fixed; you need to find the most suitable K value to achieve good results.
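A minimal sketch of the estimator workflow from step 4 using KNN, assuming the x_train/x_test/y_train/y_test iris split from step 2:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    x_train_s = scaler.fit_transform(x_train)
    x_test_s = scaler.transform(x_test)       # reuse the training-set statistics

    estimator = KNeighborsClassifier(n_neighbors=5)
    estimator.fit(x_train_s, y_train)
    y_predict = estimator.predict(x_test_s)
    print(y_predict == y_test)                # compare predictions with true values
    print(estimator.score(x_test_s, y_test))  # accuracy on the test set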

② Grid search and cross validation

sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None) returns an estimator object.
estimator: the estimator to tune.
param_grid: the parameters to search, e.g. {"n_neighbors": [1, 3, 5, 7, 9]}.
cv: the number of folds for cross-validation.
The following methods and attributes can be used: .fit() takes the training data for training; .score() outputs the accuracy of the best model. Best parameters: best_params_; best cross-validation score: best_score_; best estimator: best_estimator_; cross-validation results: cv_results_.
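A minimal sketch of tuning the K value with grid search and cross-validation, assuming the same iris split as above:

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
    estimator = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
    estimator.fit(x_train, y_train)

    print(estimator.score(x_test, y_test))   # accuracy with the best parameters
    print(estimator.best_params_)            # best parameter combination
    print(estimator.best_score_)             # best cross-validation score
    print(estimator.best_estimator_)         # best estimator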

③ Naive Bayes algorithm

sklearn.naive_bayes.MultinomialNB(alpha=1.0), where alpha is the Laplace smoothing coefficient. Advantages: classification efficiency is stable, it is not sensitive to missing data, and the algorithm is relatively simple, so it is often used for text classification. Disadvantages: because it assumes features are mutually independent, it produces poor results if the features in the dataset are correlated.
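A minimal sketch of text classification with naive Bayes on the 20 newsgroups data from step 1, combining TF-IDF features with MultinomialNB:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    news = fetch_20newsgroups(subset="all")
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.2)

    transfer = TfidfVectorizer()
    x_train_t = transfer.fit_transform(x_train)
    x_test_t = transfer.transform(x_test)

    estimator = MultinomialNB(alpha=1.0)     # alpha: Laplace smoothing coefficient
    estimator.fit(x_train_t, y_train)
    print(estimator.score(x_test_t, y_test))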

④ Decision tree:

Features are prioritized so that features with more impact are considered first; information gain can be used as the basis for this.
sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
criterion: defaults to 'gini', which uses the Gini coefficient as the basis for feature selection (the CART decision tree); 'entropy' uses information gain, i.e. the ID3 decision tree.
max_depth: depth of the tree (changing the depth can reduce decision tree overfitting).
random_state: random seed.
Decision tree visualization: sklearn.tree.export_graphviz(estimator, out_file="tree.dot", feature_names=...); feature names are displayed only if feature_names is passed.
Advantages: simple, easy to understand, and can be visualized. Disadvantages: without a depth limit it easily overfits.
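A minimal sketch of training and exporting a decision tree, assuming the iris split from step 2:

    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=22)
    estimator.fit(x_train, y_train)
    print(estimator.score(x_test, y_test))

    # Write the tree to tree.dot; render it with Graphviz to visualize
    export_graphviz(estimator, out_file="tree.dot",
                    feature_names=iris_object.feature_names)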

⑤ Random forest:

Training set randomness: bootstrap sampling (random sampling with replacement). Feature randomness: extract m features from the M available features, with M >> m.
sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)
max_features="auto": the maximum number of features per decision tree, i.e. the way m is chosen. "auto" and "sqrt" take the square root of M; "log2" takes log2(M); None uses M itself.
Typical grid-search parameters: {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 10, 15, 30]}.
Advantages: good accuracy; works well on high-dimensional samples.
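A minimal sketch of a random forest tuned with grid search, assuming the iris split from step 2 (the parameter grid is trimmed for speed):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {"n_estimators": [120, 200, 300], "max_depth": [5, 8, 15]}
    estimator = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=3)
    estimator.fit(x_train, y_train)
    print(estimator.score(x_test, y_test))
    print(estimator.best_params_)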

<2> Regression algorithm:

① Linear regression

The target value and the feature values are treated as a linear relationship to fit a regression model. A linear model is not the same as a linear relationship: a nonlinear relationship whose parameters are linear can also be called a linear model.
Normal equation: sklearn.linear_model.LinearRegression(fit_intercept=True)
fit_intercept: whether to compute the bias. LinearRegression.coef_: regression coefficients; LinearRegression.intercept_: bias.
Gradient descent: sklearn.linear_model.SGDRegressor(loss="squared_loss", fit_intercept=True, learning_rate='invscaling', eta0=0.01)
loss: loss function type, "squared_loss" is the least-squares loss; max_iter: number of iterations; fit_intercept: whether to compute the bias; learning_rate: a string, 'optimal': eta = 1.0/(alpha*(t+t0)), 'invscaling' (the default, as in the signature above): eta = eta0/pow(t, power_t).
SGDRegressor.coef_: regression coefficients; SGDRegressor.intercept_: bias.
Model evaluation method (mean squared error): sklearn.metrics.mean_squared_error(y_true, y_pred)
y_true: true values; y_pred: predicted values; returns a floating-point result.
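A minimal sketch comparing the normal equation and gradient descent, using the built-in diabetes dataset simply as a convenient regression example:

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, SGDRegressor
    from sklearn.metrics import mean_squared_error

    data = load_diabetes()
    x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, random_state=22)

    lr = LinearRegression(fit_intercept=True)          # normal equation
    lr.fit(x_train, y_train)
    print(lr.coef_, lr.intercept_)
    print(mean_squared_error(y_test, lr.predict(x_test)))

    sgd = SGDRegressor(learning_rate="invscaling", eta0=0.01, max_iter=1000)  # gradient descent
    sgd.fit(x_train, y_train)
    print(mean_squared_error(y_test, sgd.predict(x_test)))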

② Ridge regression

Regularization weakens the influence of some features, which addresses the problem of overfitting.
sklearn.linear_model.Ridge(alpha=1.0, fit_intercept=True, solver="auto", normalize=False)
alpha: regularization strength. solver: automatically selects an optimization method based on the dataset. normalize: whether to normalize the data; if True, you do not have to standardize the data beforehand, the effect is the same.
Ridge.coef_: regression coefficients; Ridge.intercept_: bias.
The Ridge method is equivalent to SGDRegressor(penalty='l2', loss="squared_loss"), except that the latter lacks the SAG solver.
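A minimal sketch of ridge regression, assuming the diabetes split from the sketch above:

    from sklearn.linear_model import Ridge

    ridge = Ridge(alpha=1.0, fit_intercept=True)   # alpha: regularization strength
    ridge.fit(x_train, y_train)
    print(ridge.coef_, ridge.intercept_)
    print(ridge.score(x_test, y_test))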

③ Logistic regression

sklearn.linear_model.LogisticRegression(solver="liblinear", C=1.0), where solver is the optimization algorithm and C is the inverse of the regularization strength.
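A minimal sketch of logistic regression, using the built-in breast cancer dataset simply as a convenient binary-classification example:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    cancer = load_breast_cancer()
    x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=22)

    estimator = LogisticRegression(solver="liblinear", C=1.0)
    estimator.fit(x_train, y_train)
    print(estimator.score(x_test, y_test))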

5/ Model evaluation (ROC curve and AUC indicators) :

sklearn.metrics.roc_auc_score(y_true, y_score)
y_true: the true category of each sample, which must be 0 (negative example) or 1 (positive example).
y_score: the predicted score, which can be the estimated probability of the positive class, a confidence value, or the return value of the classifier's decision function.
AUC can only be used to evaluate binary classification problems and is very suitable for evaluating classifier performance on imbalanced samples.
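A minimal sketch of computing AUC, assuming the logistic regression estimator and the breast cancer split from the previous sketch (labels there are already 0/1):

    from sklearn.metrics import roc_auc_score

    y_score = estimator.predict_proba(x_test)[:, 1]   # predicted probability of the positive class
    print(roc_auc_score(y_test, y_score))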

6/ Model save and load: sklearn.externals.joblib

joblib.dump(estimator, "my_ridge.pkl")      # save the trained model
estimator = joblib.load("my_ridge.pkl")     # load it back

7/ Model application