Preface

Data mining is the process of cleaning and processing large amounts of data to extract useful information, which can then be applied to classification, recommendation systems, prediction, and so on.

1. The data mining process

1. Data selection

After analyzing the business requirements, select the data relevant to the business: original business data, open data sets, or structured website data collected by a crawler. Understanding the business requirements and selecting targeted data is a prerequisite for data mining.

2. Data preprocessing

The selected data commonly contains noise, incomplete records, and missing values, so it needs cleaning, imputation of missing items, integration, transformation, and generalization. Common tools include Python string processing (quite handy), regular-expression matching, pandas, and BeautifulSoup for HTML tag processing.

3. Feature engineering/data conversion

According to the selected algorithm, features are extracted from the preprocessed data and transformed into the input format required by the specific data mining algorithm.

4. Data mining

The selected data mining algorithm is used to process the data and obtain the information.

5. Explanation and evaluation

The information after data mining is analyzed and interpreted, and applied to the actual work field.

2. Introduction to common data mining algorithms

1. Association analysis algorithm

Association rule mining finds associations between data in different fields, subject to a minimum support threshold and a minimum confidence threshold. In association rule analysis, algorithm efficiency is the core problem. Classical algorithms include the Apriori algorithm, the AprioriTid algorithm, and the FP-growth algorithm.
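As a concrete illustration of support-based pruning, here is a minimal frequent-itemset sketch in the spirit of Apriori (not taken from the article; the transactions, the min_support value, and the frequent_itemsets name are made-up toy examples):

from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return itemsets whose support (fraction of transactions) >= min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, k, candidates = {}, 1, [frozenset([i]) for i in items]
    while candidates:
        # Count the support of each candidate itemset
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        survivors = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Generate (k+1)-item candidates from surviving k-item sets (Apriori property)
        k += 1
        units = sorted({i for c in survivors for i in c})
        candidates = [frozenset(c) for c in combinations(units, k)
                      if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent

# Toy market-basket transactions
baskets = [{"milk", "bread"}, {"milk", "diaper", "beer"},
           {"milk", "bread", "diaper"}, {"bread", "diaper"}]
print(frequent_itemsets(baskets, min_support=0.5))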

2. Classification algorithm

Decision tree algorithms: the classification or decision set is represented as a tree structure, from which rules are generated or discovered. Examples include the ID3, C4.5, SLIQ, SPRINT, and RainForest algorithms.

Naive Bayes classification algorithm: using the probability statistics of Bayes' theorem, the category with the largest probability is selected for classification;

Classification Based on Association (CBA) algorithm: a classification algorithm based on association rules;

MIND (Mining in Database) algorithm: implements the classification algorithm using user-defined functions (UDFs) inside the database;

Neural network classification algorithm: use the training set to train multiple neural networks, and use the trained model to classify the samples;

Rough set theory: its characteristic is that it does not require a quantitative description of features or attributes to be given in advance. Instead, it starts directly from the given problem and determines the approximation regions of the problem through the indiscernibility relation and indiscernibility classes, so as to find the internal laws of the problem.

Genetic algorithm (GA): a technique that simulates the process of biological evolution and optimizes the solution through replication (selection), crossover (recombination), and mutation.

3. Clustering algorithm

Unlike classification, cluster analysis deals with data objects whose classes are unknown. Cluster analysis is the process of grouping a collection of objects into clusters composed of similar objects. It can be divided into three types of methods:

Partitioning methods: given a database of N objects or tuples, a partitioning method constructs K partitions of the data, each of which represents one cluster, and K is less than N. The classical algorithm is k-means;

Hierarchical methods: a given set of data objects is decomposed hierarchically. The classical algorithm is the BIRCH algorithm;

Grid-based methods: a multi-resolution grid data structure is used. The space is quantized into a finite number of cells that form a grid structure, and all clustering analysis is performed on this grid. Common algorithms are STING, WaveCluster, and CLIQUE;

Summary

With the increasing accumulation of data and the diversification of database types, each data mining method has a limited scope of application and its own limitations, so it is difficult to obtain all the knowledge needed for decision making with a single method. Organically combining methods makes them complementary, and multi-method fusion will become the development trend of data mining algorithms.

3. Implementation of data mining algorithms

1. Relevant knowledge

(1) Distance measures: in data mining it is necessary to quantify the similarity of sample data, which is usually done by calculating the distance between samples. The common distance measures are described below.

Sample data: two samples are written as vectors X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn).

Manhattan distance: also known as city-block distance, like the distance walked from one intersection of a city block to another. In two-dimensional space (higher dimensions extend by the same formula) it is expressed as d(X, Y) = |x1 − y1| + |x2 − y2|.

Euclidean distance: the straight-line distance between two points. In two-dimensional space (higher dimensions extend in the same way) it is expressed as d(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2).

Minkowski distance: a generalization of a family of distance measures, d(X, Y) = (Σ |xi − yi|^p)^(1/p). When p = 1 it is the Manhattan distance; when p = 2 it is the Euclidean distance. The larger p is, the more the difference in a single dimension influences the overall distance.

Advantages and disadvantages of the Minkowski distance (including the Euclidean and Manhattan distances):

Advantages: Widely used.

Disadvantages: it does not account for the units of the components or for differences in their distributions (variance, expectation). (Unit differences between components can be eliminated by standardizing the data, as described below.)

Cosine similarity: the sample data are treated as vectors, and the correlation is measured by the cosine of the angle between the two vectors, cos θ = (X · Y) / (||X|| · ||Y||). The value lies in [-1, 1]: −1 means negative correlation, 0 means independence, and 1 means positive correlation.

Advantages: cosine similarity is independent of the magnitude of the vectors and depends only on their direction. It is used in document similarity (TF-IDF) and image similarity (histogram) calculations, and it still works when the sample values are sparse.

Disadvantages: cosine similarity is affected by translation of the vectors: if every component x is shifted to x + 1, the cosine value changes. (It can be understood as being influenced by the baseline of the samples, which can be eliminated by the Pearson correlation coefficient introduced next.)

Pearson correlation coefficient: measures the correlation between sample vectors, with values in [-1, 1]: r = Σ (xi − mean(x)) · (yi − mean(y)) / ( sqrt(Σ (xi − mean(x))^2) · sqrt(Σ (yi − mean(y))^2) ).

To reduce the number of passes over the data, there is an alternative single-pass formula that approximates the Pearson correlation coefficient: r ≈ (Σ xy − Σx · Σy / n) / ( sqrt(Σ x^2 − (Σx)^2 / n) · sqrt(Σ y^2 − (Σy)^2 / n) ).

Advantages of the Pearson correlation coefficient: it eliminates differences in rating scale between components (score inflation), and it is invariant to translation and scaling.
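A minimal sketch of these measures in plain Python (the helper names manhattan/minkowski/cosine_similarity/pearson and the sample vectors are my own illustrations, not the article's code base):

import math

def minkowski(x, y, p):
    # p = 1: Manhattan distance, p = 2: Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    # Single-pass approximation of the Pearson correlation coefficient
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2, sum_y2 = sum(a * a for a in x), sum(b * b for b in y)
    denom = math.sqrt(sum_x2 - sum_x ** 2 / n) * math.sqrt(sum_y2 - sum_y ** 2 / n)
    return (sum_xy - sum_x * sum_y / n) / denom if denom else 0.0

x, y = [3.0, 4.0, 3.5], [4.0, 4.5, 5.0]
print(minkowski(x, y, 1), minkowski(x, y, 2), cosine_similarity(x, y), pearson(x, y))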

(2) Data standardization: refer to the related article.

When the unit scales of the components differ greatly, the distance calculation is dominated by the components with larger scales. Data standardization can eliminate the influence of unit scale between different components and speed up model convergence. There are three commonly used methods:

Min-max normalization: scales the value range to (0, 1) without changing the data distribution: x' = (x − min) / (max − min), where max is the maximum value of the sample and min is the minimum value.

Z-score standardization: x' = (x − μ) / σ, where μ is the sample mean and σ is the sample standard deviation.

Modified Z-score: a more robust variant of the Z-score that replaces the mean with the median and the standard deviation with the absolute standard deviation, making it less sensitive to outliers.
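A minimal sketch of these three scalings (function names are my own; the modified Z-score shown here uses the median and the absolute standard deviation, following the robust variant described above):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def modified_z_score(values):
    # Median and absolute standard deviation are less sensitive to outliers
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    asd = sum(abs(v - median) for v in values) / len(values)
    return [(v - median) / asd for v in values]

print(min_max([1, 5, 9]), z_score([1, 5, 9]), modified_z_score([1, 5, 9]))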

(3) Effect evaluation of the algorithm:

Ten-fold cross-validation: the data set is randomly divided into ten equal parts; nine parts are used as the training set and one part as the test set, and this is repeated ten times. The key to cross-validation is dividing the ten parts as evenly as possible.

N-fold cross-validation, also known as the leave-one-out method: train with almost all of the data, leave one sample out for testing, and iterate until every sample has been tested once. The advantage of the leave-one-out method is that it is deterministic.
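A minimal sketch of k-fold splitting (a hand-rolled version for illustration only; in practice scikit-learn's KFold or LeaveOneOut would normally be used):

import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and yield (train_indices, test_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size] if i < k - 1 else indices[i * fold_size:]
        train = [idx for idx in indices if idx not in test]
        yield train, test

for train_idx, test_idx in k_fold_indices(20, k=10):
    pass  # fit on train_idx, evaluate on test_idx, then average the 10 scores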

2. Collaborative filtering recommendation algorithm

Code implementation, data set and reference paper: Movie recommendation — a collaborative filtering algorithm based on users and items

...
print("Items Base collaboration to recommend Slope One")
# Item-based collaborative recommendation algorithm: Slope One
r.slope_one_recommendation('lyy')

print("Items Base collaboration to recommend COS")
# Item-based collaborative recommendation algorithm: adjusted cosine similarity
r.cos_recommendation('lyy')

print("Users Base collaborative recommendation")
# User-based collaborative recommendation algorithm
r.user_base_recommendation("lyy")


(1) User-based collaborative recommendation algorithm

This method recommends items (for example, a band) based on what similar users like: to recommend a band to you, it finds users whose tastes are similar to yours and recommends the bands they rate highly.

The key to the algorithm is finding similar users: iteratively compute the distance between your ratings and each other user's ratings of the same bands to determine who is most similar to you. The distance can be calculated with the Manhattan distance, the Pearson correlation coefficient, and so on. A minimal sketch follows after the two drawbacks below.

Scalability: as the number of users increases, so does the amount of computation. The algorithm works fine with a few thousand users, but hits a bottleneck at a million.

Sparsity: in most recommendation systems the number of items is much larger than the number of users, so each user rates only a small fraction of the items, which makes the data sparse. For example, Amazon has millions of books, but any given user has rated only a few of them, so it is hard to find two similar users.
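A minimal user-based sketch under the assumptions above (ratings stored as nested dicts, Manhattan distance as the similarity measure; the ratings data and the names recommend_user_based/manhattan are illustrative, not the article's code):

def manhattan(r1, r2):
    """Manhattan distance over the items both users rated."""
    shared = set(r1) & set(r2)
    return sum(abs(r1[i] - r2[i]) for i in shared) if shared else float("inf")

def recommend_user_based(user, ratings, n=3):
    # Find the nearest neighbour, then suggest items the neighbour rated that the user has not
    others = [(manhattan(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = min(others)
    unseen = {item: score for item, score in ratings[nearest].items() if item not in ratings[user]}
    return sorted(unseen.items(), key=lambda kv: kv[1], reverse=True)[:n]

ratings = {
    "Ann":  {"Blues Traveler": 3.5, "Norah Jones": 4.5, "Phoenix": 5.0},
    "Ben":  {"Blues Traveler": 2.0, "Norah Jones": 4.0, "Deadmau5": 4.5},
    "Carl": {"Norah Jones": 5.0, "Phoenix": 4.0},
}
print(recommend_user_based("Carl", ratings))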

(2) Item-based collaborative recommendation algorithm

User-based collaborative filtering finds the most similar users by computing the distance between users (all rating data must be read into memory to make a recommendation) and recommends the items rated by similar users to the target user. Item-based collaborative filtering instead finds the most similar items (by building an item similarity model in advance) and then combines them with the user's own ratings to produce the recommendation.

Item-based collaborative recommendation algorithms are commonly implemented in one of the following two ways:

Adjusted cosine similarity algorithm:

The ratings of an item are treated as the item's attribute values, and the correlation s(i, j) is calculated by comparing the relative ratings of item i and item j. As with the Pearson correlation coefficient, every rating R(u, i) and R(u, j) given by a user u who rated both items has that user's average rating R̄(u) subtracted from it, to eliminate score inflation:

s(i, j) = Σ_u (R(u, i) − R̄(u)) · (R(u, j) − R̄(u)) / ( sqrt(Σ_u (R(u, i) − R̄(u))^2) · sqrt(Σ_u (R(u, j) − R̄(u))^2) )

Disadvantages of adjusted cosine similarity: sparsity; it relies on the user's rating data;
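A minimal sketch of adjusted cosine similarity between two items (the dict layout, sample ratings, and the adjusted_cosine name are illustrative assumptions):

import math

def adjusted_cosine(item_i, item_j, ratings):
    """ratings: {user: {item: score}}; similarity computed over users who rated both items."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            avg = sum(user_ratings.values()) / len(user_ratings)  # that user's average rating
            di, dj = user_ratings[item_i] - avg, user_ratings[item_j] - avg
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    return num / (math.sqrt(den_i) * math.sqrt(den_j)) if den_i and den_j else 0.0

ratings = {
    "Ann": {"Phoenix": 5, "Deadmau5": 2, "Norah Jones": 4},
    "Ben": {"Phoenix": 4, "Deadmau5": 3},
    "Carl": {"Phoenix": 5, "Deadmau5": 1},
}
print(adjusted_cosine("Phoenix", "Deadmau5", ratings))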

Slope One recommendation algorithm:

The first step is to calculate the average difference:

dev(i, j) = Σ_{u ∈ S_{i,j}(X)} (R(u, i) − R(u, j)) / card(S_{i,j}(X)), i.e. dev(i, j) is the average difference between the ratings of item i and item j over all users u who rated both items.

card(S_{j,i}(X)) is the number of users who have rated both item j and item i.

In the second step, the weighted Slope One algorithm is used to predict:

P^wS1(u)_j = Σ_{i ∈ S(u)−{j}} (dev(j, i) + R(u, i)) · c(j, i) / Σ_{i ∈ S(u)−{j}} c(j, i)

P^wS1(u)_j denotes the predicted rating of item j for user u.

S(u) − {j} is the set of all items i rated by user u, except j.

dev(j, i) is the average rating difference between item j and item i over the users who rated both.

c(j, i), i.e. card(S_{j,i}(X)), is the number of users who have rated both item j and item i.

Advantages of the Slope One algorithm: it is simple, and it scales well: when new ratings arrive, only the deviations for the affected item pairs need updating, without reprocessing the entire data set.

Disadvantages of the Slope One algorithm: sparsity; it relies on the user's rating data;
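A minimal weighted Slope One sketch under the same rating-dict assumption (deviation table plus weighted prediction; the function names and toy ratings are illustrative, not the article's implementation):

def slope_one_deviations(ratings):
    """Return dev[(i, j)] = average(R(u,i) - R(u,j)) and freq[(i, j)] = card(S_ij)."""
    dev, freq = {}, {}
    for user_ratings in ratings.values():
        for i in user_ratings:
            for j in user_ratings:
                if i != j:
                    key = (i, j)
                    freq[key] = freq.get(key, 0) + 1
                    dev[key] = dev.get(key, 0.0) + (user_ratings[i] - user_ratings[j])
    return {k: d / freq[k] for k, d in dev.items()}, freq

def slope_one_predict(user_ratings, target, dev, freq):
    # Weighted Slope One: (dev(target, i) + R(u, i)) weighted by co-rating counts
    num = den = 0.0
    for i, r in user_ratings.items():
        if i != target and (target, i) in dev:
            num += (dev[(target, i)] + r) * freq[(target, i)]
            den += freq[(target, i)]
    return num / den if den else None

ratings = {"Amy": {"PSY": 3, "Whitney": 4}, "Ben": {"PSY": 4, "Whitney": 5, "Taylor": 4}}
dev, freq = slope_one_deviations(ratings)
print(slope_one_predict({"PSY": 5, "Whitney": 3}, "Taylor", dev, freq))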

3. Classification algorithm

(1) KNN classification algorithm based on item eigenvalues

Code implementation: KNN classification algorithm for the iris data set

...
    # KNN algorithm
    def knn(self, oj_list):
        # oj_list: list of (distance, label) pairs for the nearest neighbours
        weight_dict = {"Iris-setosa": 0.0, "Iris-versicolor": 0.0, "Iris-virginica": 0.0}
        for atuple in oj_list:
            # Weight each neighbour's vote by the inverse of its distance
            weight_dict[atuple[1]] += (1.0 / atuple[0])
        rel_class = [(key, value) for key, value in weight_dict.items()]
        #print(sorted(rel_class, key=lambda x:x[1], reverse=True))
        rel_class = sorted(rel_class, key=lambda x: x[1], reverse=True)[0][0]
        return rel_class
...

The collaborative recommendation algorithms discussed above analyze all kinds of data generated by users, so they are also known as social filtering algorithms. They often suffer from data sparsity, poor scalability, and dependence on user data. Classification algorithms based on item feature values can mitigate these problems. The algorithm is divided into two steps:

The first step is to select feature values

The key of the algorithm is to pick features with strong discriminative power. In the case of the iris data set, sepal length, sepal width, petal length, and petal width were selected as feature values.

The second step is to calculate the distance

For example, the Manhattan distance between the feature values of a test sample and each training sample is calculated, the k nearest neighbours are taken, and the classification is predicted by distance-weighted voting.

The disadvantages of the KNN classification algorithm are as follows: it cannot quantify the confidence of the classification result, and it is a lazy (passive) learning algorithm: every test sample requires traversing the entire training set before it can be classified.
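Putting the two steps together, a minimal standalone sketch (the feature vectors, the k value, and the names manhattan/knn_predict are illustrative, not the article's class):

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(test_point, train_set, k=3):
    """train_set: list of (features, label). Inverse-distance weighted vote of the k nearest."""
    neighbours = sorted((manhattan(test_point, f), label) for f, label in train_set)[:k]
    weights = {}
    for dist, label in neighbours:
        weights[label] = weights.get(label, 0.0) + 1.0 / (dist + 1e-9)  # avoid division by zero
    return max(weights, key=weights.get)

train = [([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
         ([7.0, 3.2, 4.7, 1.4], "Iris-versicolor"),
         ([6.3, 3.3, 6.0, 2.5], "Iris-virginica")]
print(knn_predict([5.0, 3.4, 1.5, 0.2], train, k=1))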

(2) Bayesian classification algorithm

Code implementation: a naive Bayesian classification algorithm to distinguish news categories

...
    def train_data(self):
        # Conditional probability of each word in each category, from the training set
        for word in self.vocabulary:
            for category, value in self.prob.items():
                if word not in self.prob[category]:
                    count = 0
                else:
                    count = self.prob[category][word]
                # Smoothed conditional probability formula (add-one / Laplace smoothing)
                self.prob[category][word] = (count + 1) / (self.total[category] + len(self.vocabulary))
                
...


The Bayesian classification algorithm is a probabilistic classification algorithm. Compared with the KNN classification algorithm, it is an eager (active) learning algorithm: it builds a model from the training set and uses that model to classify new samples, which is much faster. The basic theory of the Bayesian classification algorithm is the conditional probability formula (in practice P(x | y&z) is not intuitive to obtain, while P(y | x) · P(z | x) is easier to compute), together with the assumption that the sub-events (y, z, ... there may be many in practice) are independent of each other (hence the name naive Bayes). When events y and z are assumed to be independent, we have:

P(x | y, z) = P(y | x) · P(z | x) · P(x) / (P(y) · P(z))

Step 1: Calculate prior probability and conditional probability

Prior probability: the probability of a single event occurring, such as P(buys green tea), P(buys organic food).

Conditional probability (posterior probability): the probability of x occurring given that event y has occurred, estimated from the observed data. For example P(buys organic food | buys green tea), given by the following formula, where nc is the number of samples in which both x and y occur and n is the total number of samples in which y occurs:

P(x | y) = nc / n

Modified conditional probability (the formula is from Tom Mitchell, Machine Learning; m is a constant, the equivalent sample size. There are many ways to determine m; here the number of categories of the predicted result can be used as m; for example, if the votes have two categories, for and against, then m equals 2. p is the corresponding prior probability; for example, if the probability of "for" is 0.5, then p is 0.5):

P(x | y) = (nc + m · p) / (n + m)

The second step is to make predictions based on Bayesian formula

Using the formula, compute and compare the probabilities of different x events given that y & z have occurred. For example, compute P(x = like) and P(x = dislike) and predict the event with the higher probability. Since P(y) · P(z) is the same for every x in the formula above, it can be simplified to computing the maximum of the numerator and predicting that class:

P(x | y, z) ∝ P(y | x) · P(z | x) · P(x)
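A minimal sketch of the prediction step using the simplified product above (the priors, conditional probabilities, and the classify name are toy illustrations; log-probabilities are used here only to avoid numerical underflow):

import math

def classify(words, priors, cond_prob):
    """Pick the category maximizing P(x) * prod(P(word | x)); computed in log space."""
    best, best_score = None, float("-inf")
    for category, prior in priors.items():
        score = math.log(prior)
        for word in words:
            score += math.log(cond_prob[category].get(word, 1e-9))
        if score > best_score:
            best, best_score = category, score
    return best

priors = {"sports": 0.5, "tech": 0.5}
cond_prob = {"sports": {"match": 0.2, "gpu": 0.01}, "tech": {"match": 0.02, "gpu": 0.3}}
print(classify(["gpu", "match"], priors, cond_prob))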

The advantages of the Bayesian algorithm are as follows: it can give the confidence of the classification result, and it is an eager learning algorithm, so classification is faster.

Disadvantages of the Bayesian algorithm: it requires data in a specific format; numerical data must be converted into categories, or a Gaussian distribution must be assumed, in order to compute the probabilities;

(3) Logistic regression classification algorithm

Code implementation: distinguishing pictures of cats

Note: network layers will later be added to the logistic regression classification algorithm, upgrading it into a neural network classification algorithm.

...
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Cost function: compute the cost and the gradients
def propagate(w, b, X, Y):
    m = X.shape[1]                          # number of samples
    A = sigmoid(np.dot(w.T, X) + b)         # predicted values
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dw = 1 / m * np.dot(X, (A - Y).T)       # gradient with respect to w
    db = 1 / m * np.sum(A - Y)              # gradient with respect to b
...    

The logistic regression classification algorithm takes an input feature vector X and outputs a prediction Y (in the range 0 to 1) that classifies X.

The first step is to get the linear regression function with respect to X

Linear regression gives WX + b, where W is the weight and b is the bias. However, this expression cannot be used directly as the predicted value, because the output Y needs to lie in the range (0, 1);

The second step is to convert by activating functions

The characteristic of an activation function is that it turns a linear function into a nonlinear one, and it is bounded, differentiable, and monotonic. This example uses sigmoid, so the output becomes the predicted value Y = sigmoid(WX + b);

The third step is to construct the Cost function

To train w and b so that they predict the real category better, a cost function must be constructed. Let ŷ = sigmoid(WX + b) be the predicted value and y the actual class label (0 or 1); the cost averages the loss over the m training samples:

J(w, b) = (1/m) · Σ_i L(ŷ(i), y(i))

where L(ŷ, y) = −( y · log(ŷ) + (1 − y) · log(1 − ŷ) ) is called the loss function.

The fourth step is to minimize the cost function by gradient descent, repeatedly updating w := w − α · dw and b := b − α · db, where α is the learning rate.
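A minimal sketch of this gradient-descent loop, reusing the propagate function above (the optimize name, learning rate, and iteration count are illustrative; it also assumes propagate returns dw, db, cost, since the article's snippet is truncated):

def optimize(w, b, X, Y, num_iterations=1000, learning_rate=0.01):
    for _ in range(num_iterations):
        # Assumed return values of propagate: gradients and current cost
        dw, db, cost = propagate(w, b, X, Y)
        # Move the parameters against the gradient to reduce the cost
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b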

4. Clustering algorithm

(1) Hierarchical clustering

Code implementation: hierarchical clustering of dog breeds

Hierarchical clustering treats each data point as its own cluster, and at each iteration merges the two closest clusters, until only one cluster remains.
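A minimal single-linkage sketch of this bottom-up merging (Euclidean distance; the function names, sample points, and stopping condition are illustrative, not the article's implementation):

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hierarchical_cluster(points, target_clusters=1):
    """Start with every point in its own cluster; repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Single linkage: cluster distance = closest pair of their members
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(euclidean(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters

print(hierarchical_cluster([(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9)], target_clusters=2))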

(2) K-means++ clustering

Code implementation: k-means++ clustering

Note: the difference between the k-means algorithm and k-means++ is that k-means directly selects k random points as the initial centers.

...
        # kmeans: initialize k random centers
        #random.seed(1)
        #center = [[self.data[i][r] for i in range(1, len((self.data)))]  
                  #for r in random.sample(range(len(self.data)), k)]

        # kmeans++: k centers are randomly selected based on distance weights
        # 1. Select a point at random
        center = []
        center.append(random.choice(range(len(self.data[0]))))
        # 2. Select other centers according to the probability of distance
        for i in range(self.k - 1):
            weights = [self.distance_closest(self.data[0][x], center) 
                     for x in range(len(self.data[0])) if x not in center]
            dp = [x for x in range(len(self.data[0])) if x not in center]
            total = sum(weights)
            # Weighting based on distance
            weights = [weight/total for weight in weights]
            num = random.random()
            x = -1
            i = 0
            while i < num :
                x += 1
                i += weights[x]
            center.append(dp[x])
        ... 


The k-means++ algorithm can be summarized as:

(1) Select k elements as center points one by one, weighted by their distance to the already-chosen centers: first, pick a point at random; then repeat the following until k points are selected: compute the distance D between each data point dp(n) and each existing center, take the minimum D(dp), and choose the next center with probability proportional to that distance.

(2) Assign each point to the cluster of its nearest center;

(3) Compute the new center point of each cluster. Repeat (2) and (3) until the stopping condition is met. A sketch of this assign-and-update loop follows below.
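A minimal sketch of steps (2) and (3), the assign-and-update loop that follows the initialization code above (it operates on plain 2-D points; the kmeans_iterate name and sample data are illustrative, not the article's class):

def kmeans_iterate(points, centers, max_iter=100):
    """Assign each point to its nearest center, then recompute centers, until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[i]
                       for i, cluster in enumerate(clusters)]
        if new_centers == centers:   # centers no longer move: converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans_iterate([(1, 1), (1, 2), (8, 8), (9, 8)], [(0, 0), (10, 10)])
print(centers)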


Welcome to follow the "Algorithm Advanced" official account, which regularly pushes good articles on Python, machine learning, deep learning, and other technologies. You are welcome to learn, exchange, and progress together!