This is the 14th day of my participation in the August More Text Challenge


This section covers the feature extraction needs of machine learning and some commonly used feature extraction methods. The focus is on sklearn, with a look at the corresponding feature extraction in Spark's Java ml package, i.e. the most typical implementations in Python and Java.

1. Using datasets

Sklearn uses sklearn.datasets to load popular datasets

API:

    datasets.load_*()
        Gets the small datasets bundled with sklearn, e.g. load_iris().
    datasets.fetch_*(data_home=None, subset='all')
        Gets large datasets from the network.
        data_home: download path, defaults to ~/scikit_learn_data/
        subset: 'train', 'test' or 'all'

Both load_* and fetch_* return data of type datasets.base.Bunch. Bunch inherits from the dictionary type, so its values can be accessed like a dictionary with ["attr"], or as attributes with .attr.

    The returned Bunch contains the following attributes:
        data:           feature data
        target:         label array
        DESCR:          dataset description
        feature_names:  feature names
        target_names:   label names
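As an illustration, here is a minimal sketch (using the bundled iris dataset) of loading a dataset and reading the Bunch fields:

```python
from sklearn.datasets import load_iris

# A minimal sketch: load the bundled iris dataset and inspect the Bunch fields
iris = load_iris()
print(iris.data.shape)       # feature data, shape (150, 4)
print(iris.target[:5])       # label array
print(iris.feature_names)    # feature names
print(iris.target_names)     # label names
print(iris["DESCR"][:100])   # dataset description, accessed dict-style
```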

2. Dividing the training set and test set

To verify the effect of a model, the dataset usually needs to be divided into a training set and a test set.

The model_selection module in sklearn is responsible for splitting datasets and selecting models.

sklearn API

    sklearn.model_selection.train_test_split(*arrays, **options)
        test_size: the size of the test set, float, default 0.25
        random_state: random seed; different seeds produce different random splits
        return: training-set features x_train, test-set features x_test,
                training-set labels y_train, test-set labels y_test
                (in the order x_train, x_test, y_train, y_test)
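A minimal sketch of splitting the iris dataset; the test_size and random_state values are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# A minimal sketch: hold out 25% of the iris samples as a test set
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=22)
print(x_train.shape, x_test.shape)   # (112, 4) (38, 4)
```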

spark

In Spark, the usual approach is to collate the data into an RDD&lt;LabeledPoint&gt; structure and then use the randomSplit method to split it into a training set and a test set.

3. Data feature extraction

Factors influencing model results: data, algorithms (commonly used, openly available algorithms), and feature engineering.

Data and features determine the upper limit of machine learning, and models and algorithms only approximate this upper limit.

Machine learning is a statistical method by nature and can only process numeric data. Therefore, text, image, video and other types of data must be converted to numeric form before machine learning can process them.

Common feature extraction tools in Python: sklearn for feature engineering; pandas for cleaning and processing data.

The usual steps: feature extraction, feature preprocessing, feature dimensionality reduction.

4. Dictionary feature extraction: one-hot

Dictionary-style features such as gender, city and color are usually stored in a database as numeric codes, e.g. 0: male, 1: female. However, using such numeric codes directly in machine learning causes misunderstandings: different dictionary values are supposed to be completely "equal", but numbers like 0, 1 and 2 can mislead the algorithm into assuming a size relationship between them. Therefore, a common approach in machine learning is to convert dictionary values into one-hot encodings.

|        | Gender=male | Gender=female | One-hot encoding |
| ------ | ----------- | ------------- | ---------------- |
| male   | 1           | 0             | (1, 0)           |
| female | 0           | 1             | (0, 1)           |

sklearn API:

    sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
        sparse: return a sparse matrix. A sparse matrix stores only the coordinates of the
        non-zero entries, while a non-sparse matrix is the complete matrix. A sparse matrix
        saves storage space when there are very many dictionary values.
    DictVectorizer.fit_transform(X)
        X: a dict or an iterator of dicts. Returns a sparse matrix.
    DictVectorizer.inverse_transform(X)
        X: array or sparse matrix. Returns the data in its prior format.
    DictVectorizer.get_feature_names()
        Returns the category names.
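A minimal sketch of DictVectorizer; the city/temperature records are made-up sample data:

```python
from sklearn.feature_extraction import DictVectorizer

# A minimal sketch: one-hot encode a dictionary feature ("city")
data = [{"city": "Beijing", "temperature": 100},
        {"city": "Shanghai", "temperature": 60},
        {"city": "Shenzhen", "temperature": 30}]

transfer = DictVectorizer(sparse=False)   # sparse=True would return a sparse matrix instead
result = transfer.fit_transform(data)
print(transfer.get_feature_names_out())   # get_feature_names() in older sklearn versions
print(result)                             # city columns are one-hot, temperature stays numeric
```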

spark API:

See the official Spark demo (in the Spark installation directory): org.apache.spark.examples.ml.JavaOneHotEncoderExample

5. Text feature extraction: CountVectorizer

For text data, such as an article, the most basic feature in machine learning is the count of each word in the text, i.e. processing the text into the format [(word1, count1), (word2, count2), ...]. This is also the classic word-count starter example for MapReduce and Spark.

sklearn API:

    sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
        Returns a word-frequency matrix.
        stop_words: stop words, i.e. words you choose not to count, such as "to", "is".
    CountVectorizer.fit_transform(X)
        X: text or an iterable of text strings. Returns a sparse matrix.
    CountVectorizer.inverse_transform(X)
        X: array data or sparse matrix. Returns the matrix before the transformation.
    CountVectorizer.get_feature_names()
        Returns the list of words.

The sklearn implementation simply splits the text into words on spaces and counts how many times each word appears in the sample. Punctuation marks and single-letter words are dropped during counting.
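A minimal sketch of CountVectorizer on two made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A minimal sketch: build a word-frequency matrix for two short documents
texts = ["life is short, i like python",
         "life is too long, i dislike python"]

transfer = CountVectorizer(stop_words=["is", "too"])
result = transfer.fit_transform(texts)      # returns a sparse matrix
print(transfer.get_feature_names_out())     # get_feature_names() in older sklearn versions
print(result.toarray())                     # full count matrix; "i" is dropped as a single letter
```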

spark API:

For CountVectorizer, see the sample org.apache.spark.examples.ml.JavaCountVectorizerExample

Advantages: Fast and easy.

Disadvantages: in machine learning scenarios such as article classification, raw counts cannot reflect the important features of an article. For example, the words that appear most frequently, such as "here", "there", "we", "they", say nothing about the content of the article.

Supplement:

In text feature processing, word segmentation is an unavoidable obstacle. Segmenting English is relatively simple: just split on spaces. Chinese is much harder. I have tried HanLP in Java and jieba in Python, using both to segment The Deer and the Cauldron. Both require a lot of special handling for character names, martial-arts moves and the like (such as Theotto III, White Yi), and of the two, jieba was the simpler and more efficient. For more specialized articles, these open-source Chinese word segmentation tools do not perform very well.

6. Text feature extraction: TfidfVectorizer

TF-IDF can be used to evaluate how important a word is to one document within a corpus or document collection. For example, when classifying a pile of articles, articles in which words such as "computer", "software", "cloud" and "Java" appear many times are more likely to be classified as technology (these words appear rarely in the other classes, so they are important), while articles with many occurrences of "bank", "credit" and "credit card" are more likely to be classified as finance. Words such as "we", "you", "here" and "there", which appear frequently in all articles, are of little use for classification.

TF-IDF consists of TF and IDF.

TF: term frequency, the frequency with which a given word appears in a document.

IDF: inverse document frequency, a measure of how much general importance a word carries. It is the total number of documents divided by the number of documents containing the word, then taking the base-10 logarithm of the quotient.

Finally, TF-IDF = TF * IDF

Example:

Keyword: "economy". Corpus: 1000 articles, 10 of which contain "economy". Suppose one particular article has 100 words and "economy" appears in it once.

TF(economy) = 1/100 = 0.01; IDF(economy) = lg(1000/10) = 2

Final TF-IDF(economy) = TF(economy) * IDF(economy) = 0.02

sklearn API:

    sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None, ...)
        Returns the word weight matrix.
    TfidfVectorizer.fit_transform(X)
        X: text or an iterable of text strings. Returns a sparse matrix; the full matrix
        can be obtained with the toarray method.
    TfidfVectorizer.inverse_transform(X)
        Inverse transformation.
    TfidfVectorizer.get_feature_names()
        Returns the list of words.
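A minimal sketch of TfidfVectorizer, reusing the same kind of made-up sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A minimal sketch: compute TF-IDF weights instead of raw counts
texts = ["life is short, i like python",
         "life is too long, i dislike python"]

transfer = TfidfVectorizer(stop_words=["is", "too"])
result = transfer.fit_transform(texts)
print(transfer.get_feature_names_out())   # get_feature_names() in older sklearn versions
print(result.toarray())                   # TF-IDF weight matrix
```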

spark API:

Spark splits TF-IDF into HashingTF and IDF. See org.apache.spark.examples.ml.JavaTfIdfExample

7. Dimensionless data: normalization

Dimensionless: Transform feature data of different dimensions (different ranges, different orders of magnitude) into feature data more suitable for the algorithm model through some transformation functions.

Purpose: when the units or magnitudes of features differ greatly, or some features are several orders of magnitude larger than the others, the large features tend to dominate the target result, and some algorithms cannot learn the overall characteristics of the data. The idea is to make features with different units equally important.

Formula: X' = (x - min) / (max - min); X'' = X' * (mx - mi) + mi

min: the minimum value of the feature in the sample set; max: the maximum value of the feature in the sample set; mx, mi: the target range, (0, 1) by default.

sklearn API:

    sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), ...)
    MinMaxScaler.fit_transform(X)
        X: numpy ndarray of shape [n_samples, n_features].
        Returns an array of the same shape after the conversion.
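A minimal sketch of MinMaxScaler on a small made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A minimal sketch: scale every feature column into the default range [0, 1]
data = np.array([[90., 2., 10.],
                 [60., 4., 15.],
                 [75., 3., 13.]])

transfer = MinMaxScaler(feature_range=(0, 1))
result = transfer.fit_transform(data)
print(result)   # each column's minimum maps to 0 and its maximum maps to 1
```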

spark API:

For MinMaxScalerModel, see org.apache.spark.examples.ml.JavaMinMaxScalerExample

Disadvantages:

Outliers that lie far from the rest of the sample usually end up as the maximum or minimum value (much larger or smaller than the bulk of the data), which distorts the overall result. Therefore normalization is only suitable for traditional, precise, small-data scenarios. In big data scenarios, the standardization described below is used instead.

8. Dimensionless data: standardization

Standardization transforms the original data so that it has mean 0 and standard deviation 1.

Formula: X' = (x - mean) / std

mean is the sample mean and std is the sample standard deviation

sklearn API: sklearn.preprocessing.StandardScaler. The API is almost the same as for normalization (MinMaxScaler).
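A minimal sketch of StandardScaler on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A minimal sketch: standardize each feature column to mean 0 and standard deviation 1
data = np.array([[1., 2., 3.],
                 [2., 4., 5.],
                 [4., 6., 9.]])

transfer = StandardScaler()
result = transfer.fit_transform(data)
print(result.mean(axis=0))   # approximately 0 for every column
print(result.std(axis=0))    # approximately 1 for every column
```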

Advantages: A small number of outliers will not have a great influence on the mean and variance of the sample as a whole, and thus will not have a great influence on the results of the sample as a whole. It is suitable for big data scenarios with enough samples.

9. Data dimension reduction: variance selection method

Concept of dimensionality reduction: reducing the number of features in the sample feature matrix (a two-dimensional array).

Feature selection: the sklearn.feature_selection module

Filter methods

1) Variance selection method: filter low-variance features (overly concentrated data)

2) Correlation coefficient method: correlation degree between features. For example, weather humidity and rainfall are generally considered highly correlated characteristics.

Embedded methods

1) Decision tree

2) Regularization

3) Deep learning

Variance selection method: if a feature's variance is too small, the values of that feature are all very close together, which makes it of little use for learning a classification. The variance selection method deletes these low-variance features.

sklearn API:

    sklearn.feature_selection.VarianceThreshold(threshold=0.0)
    VarianceThreshold.fit_transform(X)
        X: numpy array of shape [n_samples, n_features].
        Returns: features whose variance in the training set is lower than the threshold are
        deleted. The default keeps all features with non-zero variance, i.e. it deletes any
        feature that has the same value in every sample.
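A minimal sketch of VarianceThreshold on a made-up matrix that contains two constant columns:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# A minimal sketch: drop features whose variance is 0 (same value in every sample)
data = np.array([[0, 2, 0, 3],
                 [0, 1, 4, 3],
                 [0, 1, 1, 3]])

transfer = VarianceThreshold(threshold=0.0)
result = transfer.fit_transform(data)
print(result)   # the constant first and last columns are removed
```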

SPARK API: Not found

10. Correlation coefficient

The correlation coefficient measures the correlation between features. There are many variants, such as the Pearson correlation coefficient and the Spearman correlation coefficient.

The formula is too esoteric, so let’s leave it at that.

For example, the Pearson correlation coefficient can be calculated with the API scipy.stats.pearsonr(x, y)
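A minimal sketch with made-up values:

```python
from scipy.stats import pearsonr

# A minimal sketch: correlation between two made-up feature columns
x = [12.5, 10.5, 18.7, 22.1, 9.3]
y = [21.2, 18.0, 32.5, 40.1, 16.4]

r, p_value = pearsonr(x, y)
print(r)          # close to 1, so x and y are strongly positively correlated
print(p_value)
```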

If the correlation between two features is strong, there are generally the following treatment methods:

1) Pick one

2) Weighted summation becomes a new feature

3) Principal component analysis

11. Data dimension reduction: PCA principal component analysis.

Definition: a method for converting high-dimensional data into low-dimensional data. Features that strongly influence the target value are retained, while features with relatively little influence are removed.

Effect: compresses the data dimensions, reducing the dimensionality (number of features) of the original data as much as possible while losing only a small amount of information.

Applications: regression analysis or cluster analysis

Understanding: the process of projecting high-dimensional data into a lower-dimensional space. For example, a two-dimensional point (x, y) can be projected onto a one-dimensional line y = f(x), or, like a shadow play, a three-dimensional object can be projected onto a two-dimensional plane. Put most simply: when there are too many features to analyze easily, PCA can be used to reduce the number of features.

sklearn API:

    sklearn.decomposition.PCA(n_components=None)
        Decomposes the data into a lower-dimensional space.
        n_components as an integer: the number of features to reduce to.
    PCA.fit_transform(X)
        X: data in numpy array format.
        Returns: an array of the specified dimension after the conversion.
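A minimal sketch of PCA on a made-up 3x4 feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# A minimal sketch: reduce 4 features to 2 principal components
data = np.array([[2., 8., 4., 5.],
                 [6., 3., 0., 8.],
                 [5., 4., 9., 1.]])

transfer = PCA(n_components=2)
result = transfer.fit_transform(data)
print(result.shape)   # (3, 2)
print(result)
```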

spark API:

For PCAModel, see org.apache.spark.examples.ml.JavaPCAExample

12. Transformer and Estimator in sklearn

Transformer and Estimator are two basic concepts in feature processing. Many of the operations above can be understood as concrete transformer and estimator workflows. In Java terms, they can be thought of as two parent classes.

General process of Transformer:

1. Instantiate a transformer class

2. Call fit_transform to transform the data.

This can also be done in two steps: the fit() method performs the calculation that determines the transformer's parameters, and the transform() method performs the actual transformation.
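A minimal sketch showing the equivalence, using MinMaxScaler as the transformer:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A minimal sketch: fit() computes the transformer's parameters, transform() applies them
data = np.array([[1., 2.], [3., 4.], [5., 6.]])

scaler = MinMaxScaler()
scaler.fit(data)                          # 1. compute per-column min and max
two_step = scaler.transform(data)         # 2. apply the transformation

one_step = MinMaxScaler().fit_transform(data)
print(np.allclose(two_step, one_step))    # True
```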

General process of estimator:

1. Instantiate an Estimator

2. Call estimator.fit(x_train, y_train) to train; the model is generated once the call completes.

3. Model evaluation

1) Directly compare the real value with the predicted value

y_predict = estimator.predict(x_test) generates the predicted values

y_test == y_predict compares the predicted values with the true values

2) Calculation accuracy

​ accuracy = estimator.score(x_test,y_test)
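A minimal sketch of the whole estimator workflow; KNeighborsClassifier here is just an arbitrary choice of estimator for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# A minimal sketch of the estimator workflow on the iris dataset
x_train, x_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.25, random_state=22)

estimator = KNeighborsClassifier(n_neighbors=3)   # 1. instantiate the estimator
estimator.fit(x_train, y_train)                   # 2. train the model

y_predict = estimator.predict(x_test)             # 3a. compare predictions with true values
print(y_predict == y_test)
print(estimator.score(x_test, y_test))            # 3b. accuracy
```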

This is a summary of the sklearn API. Spark also has an example of these two concepts; see org.apache.spark.examples.ml.JavaEstimatorTransformerParamExample.

Supplement: about Spark and sklearn

sklearn is Python's classic machine learning package, while Spark is the classic one-stop big data computing platform. For machine learning, however, sklearn is built on local numpy data, while Spark computes on distributed RDDs and DataFrames. Spark has its own Python support, but that package feels less polished than sklearn. The two do not seem to combine easily, yet to get the advantages of both for machine learning there is a recent project on GitHub, spark-sklearn, that does exactly this; it is worth a look.

Pypi address: pypi.org/project/spa…

Github address: github.com/databricks/…