1. Background

1) The problem

In practical machine learning applications there may be a very large number of features, some of which are irrelevant or correlated with one another. This easily leads to the following consequences:

1. The more features there are, the longer it takes to analyze them and to train the model, and the more complex the model becomes.

2. The more features there are, the more likely the "curse of dimensionality" becomes, and the model's generalization ability declines.

3. The more features there are, the more likely the sparse-feature problem that often occurs in machine learning becomes, degrading model performance.

4. The model may become unstable, that is, the solved parameters may fluctuate greatly with small changes in the samples.

Feature selection eliminates features that are irrelevant, redundant or without discriminative power, thereby reducing the number of features, shortening training and running time, and improving model accuracy.

2) How to perform feature selection

Feature selection means selecting a subset of all features so that the constructed model performs better and generalizes more strongly. How is feature selection done? Ideally, we would like to select the optimal subset of all features, i.e. the one that performs best on the current training and test data under some evaluation criterion.

From this perspective, feature selection can be viewed as one of three problems:

1. Select a fixed number of features from the original feature set to minimize the error rate of the classifier, which is an unconstrained combinatorial optimization problem;

2. For a given allowable error rate, it is a constrained optimization problem to find the feature subset with the smallest dimension;

3. Compromise between error rate and dimension of feature subset.

All three problems are NP-hard. When the feature dimension is small they can be solved exactly, but when the dimension is large the complexity becomes enormous, so exact solutions are rarely practical. Because computing the optimal solution is too expensive, we need algorithms that can find a good suboptimal solution within a given time budget. The following sections describe how such suboptimal solutions are found.

2. The general process of feature selection

The general process of feature selection is shown in Figure 3. First, a feature subset is generated from the full set of features, and the evaluation function scores that subset. The result is compared with the stopping criterion: if the criterion is satisfied the search stops, otherwise the next feature subset is generated and feature selection continues. The selected feature subset is then usually validated.

To sum up, the feature selection process generally includes four parts: the generation procedure for feature subsets, the evaluation function, the stopping criterion, and the validation procedure; a generic sketch of this loop is given below.
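As a rough sketch of this loop in Matlab (the generateNext, evaluate and validate function handles below are hypothetical placeholders rather than toolbox functions), assuming each handle works on a vector of feature indices:

% Generic feature-selection loop: generate a candidate subset, evaluate it,
% check the stopping criterion, and finally validate the best subset found.
function best = featureSelectionLoop(generateNext, evaluate, validate, threshold)
best = [];                         % best subset found so far
bestScore = -inf;
subset = generateNext([]);         % first candidate subset
while ~isempty(subset)
    score = evaluate(subset);      % evaluation function
    if score > bestScore
        best = subset;
        bestScore = score;
    end
    if bestScore >= threshold      % stopping criterion
        break;
    end
    subset = generateNext(subset); % generate the next candidate subset
end
validate(best);                    % validation, e.g. on held-out data
end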

Generation Procedure of feature subset

A subset generation method provides feature subsets to the evaluation function. According to how the search proceeds, feature selection can be divided into exhaustive (complete), heuristic and random methods.

The methods above do not change the original attributes of the features, while some other methods remove correlation by transforming the feature space, such as PCA, the Fourier transform and the wavelet transform.

Evaluation Function

The evaluation function is the criterion used to judge the quality of a feature subset, and its design differs across application scenarios; for example, some evaluation functions judge a subset by how evenly the data are distributed, others by the performance of the final model. Each evaluation function has its own advantages and disadvantages, so it must be chosen according to the actual situation. According to the evaluation criterion, methods can be divided into filter models, wrapper models and hybrid models. The filter model treats feature selection as a preprocessing step and evaluates the selected feature subset using intrinsic properties of the data, independently of the learning algorithm. The wrapper model uses the result of the subsequent learning algorithm as part of the evaluation criterion. Depending on whether the evaluation function is tied to the classification method used, criteria can be divided into independent criteria and relevance measures.

A filter measures the quality of a feature subset by analyzing its intrinsic characteristics. Filters are generally used for preprocessing and are independent of the choice of classifier. The filter principle is shown in Figure 1:

Figure 1. Filter principle (Ricardo Gutierrez-Osuna 2008)

The wrapper is essentially a classifier: it classifies the sample set using the selected feature subset, and the classification accuracy is the standard by which the quality of the subset is measured. The wrapper principle is shown in Figure 2.


Figure 2. Wrapper principle (Ricardo Gutierrez-Osuna 2008)

Stopping Criterion

The stopping criterion is tied to the evaluation function: when the evaluation function value reaches a given threshold, the search can stop. For example, with an independent criterion one can stop when the average distance between samples is maximized; with a relevance measure, one can stop when the classifier reaches its highest precision and recall.

Validation Procedure

The validity of the selected feature subset is measured on a test data set. It is best to use measures unrelated to the selection method used earlier, in order to reduce coupling.

 

Figure 3. Process of feature selection (M. Dash and H. Liu 1997)

The different approaches to each of these stages can be viewed as components and combined freely. For example, heuristic feature search can be combined with a relevance measure as the evaluation function.

3. The generation process of feature subsets

The generation process is the search through the feature subspace. Search algorithms are divided into complete search, heuristic search and random search, as shown in Figure 4.


Figure 4. Classification of search processes

 

Of course, these methods are not mutually exclusive; they can be combined to complement each other. The common search algorithms are briefly introduced below.

1) Complete search

Complete search is divided into exhaustive and non-exhaustive approaches. Because complete search takes the correlation between features into account, it is better able to find the optimal subset.

A. Breadth First Search

Algorithm description: traverse the feature subspace breadth-first.

Step 1: put the root node into the queue.

Step 2: remove the first node from the queue and check whether it is the target.

Sub-step: if the target is found, the search ends and the result is returned.

Sub-step: otherwise, enqueue all of its direct child nodes that have not yet been checked.

Step 3: if the queue is empty, every node has been checked; end the search and return "target not found".

Step 4: repeat Step 2.

Algorithm evaluation: this enumerates all feature combinations, i.e. exhaustive search; its time complexity is O(2^n), so it is not very practical.
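For small feature counts the enumeration is easy to write down. The sketch below is purely illustrative and assumes a user-supplied evaluate handle that scores a vector of feature indices (for example by cross-validated accuracy):

% Exhaustive (complete) search over all non-empty feature subsets.
% n is the number of features; evaluate is a handle that scores an index vector.
function [bestFeats, bestScore] = exhaustiveSearch(n, evaluate)
bestScore = -inf;
bestFeats = [];
for mask = 1:(2^n - 1)                 % every non-empty subset, encoded as a bit mask
    feats = find(bitget(mask, 1:n));   % indices of the selected features
    score = evaluate(feats);
    if score > bestScore
        bestScore = score;
        bestFeats = feats;
    end
end
end

The loop body runs 2^n - 1 times, which is exactly why the method stops being practical beyond a few dozen features.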

B. Branch and Bound search

Algorithm description: add branch-and-bound pruning on top of exhaustive search. For example, if it can be determined that no subset in a certain branch can be better than the best solution found so far, that branch can be cut off.

C. Beam Search

Algorithm description: first, the N highest-scoring features are selected as the initial feature subsets and added to a priority queue with a maximum length limit. At each step, the highest-scoring subset is removed from the queue, all subsets obtained by adding one more feature to it are generated exhaustively, and these are added back to the queue.

D. Best First Search

Algorithm description: similar to beam search; the only difference is that the length of the priority queue is not limited.

2) Heuristic search

Heuristic search is greedier: some algorithms do not consider the correlation between features at all, but only the effect of a single feature on the final result, whereas in reality there may be all kinds of correlations between features. Some algorithms improve on this, such as the plus-L minus-R selection algorithm and sequential floating selection.

A. Sequential Forward Selection (SFS)

Algorithm description: the feature subset X starts from the empty set; at each step the single feature x that makes the evaluation function J(X) optimal is added to X. In short, it is a simple greedy algorithm that each time selects the feature that most improves the evaluation function.

Algorithm evaluation: the disadvantage is that features can only be added, never removed. For example, suppose feature A is completely determined by features B and C; then once B and C are present, A is redundant. If sequential forward selection adds A to the feature set first and later adds B and C, the resulting subset contains the redundant feature A.
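A minimal sketch of the greedy forward pass, again with a hypothetical evaluate handle standing in for J(X) (the Statistics and Machine Learning Toolbox also offers sequentialfs for this kind of search):

% Sequential Forward Selection: start from the empty set and in each round add
% the single feature that most improves the evaluation function.
function selected = sfs(n, evaluate, targetSize)
selected = [];                          % current feature subset X
remaining = 1:n;
while numel(selected) < targetSize
    bestScore = -inf;
    bestIdx = 0;
    for f = remaining                   % try adding each unused feature
        score = evaluate([selected f]);
        if score > bestScore
            bestScore = score;
            bestIdx = f;
        end
    end
    selected = [selected bestIdx];      % keep the best addition
    remaining = setdiff(remaining, bestIdx);
end
end

Sequential backward selection is the same loop run in the opposite direction, starting from 1:n and removing one feature per round.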

B. Sequential Backward Selection (SBS)

Algorithm description: starting from the full feature set O, at each step the feature x whose removal makes the evaluation function value optimal is removed from O.

Algorithm evaluation: sequential backward selection is the mirror image of sequential forward selection; its disadvantage is that features can only be removed, never added.

In addition, both SFS and SBS are greedy algorithms and easily get stuck in local optima.

C. Bidirectional Search

Algorithm description: sequential forward selection (SFS) searches from the empty set while sequential backward selection (SBS) searches from the full set; the search stops when both reach the same feature subset C.

The motivation for bidirectional search is illustrated in the figure below: point O represents the starting point of the search and point A the target. The grey circle represents the possible search range of a one-way search, while the two green circles represent the search range of a bidirectional search; it is easy to see that the green area must be smaller than the grey one.


Figure 5. Two-way search

D. LRS (Plus-L minus-R Selection)

The algorithm has two forms:

<1> The algorithm starts from the empty set and adds L features in each round, and then removes R features to make the evaluation function value optimal. ( L> R )

<2> The algorithm starts from the complete set and removes R features and then adds L features in each round to make the evaluation function value optimal. ( L< R )

Algorithm evaluation: the plus-L minus-R selection algorithm combines the ideas of sequential forward selection and sequential backward selection; the choice of L and R is the key to the algorithm.

E. Sequential Floating Selection

Algorithm description: sequential floating selection was developed from the plus-L minus-R selection algorithm. The difference is that in sequential floating selection L and R are not fixed but "float", that is, they change during the search.

The sequence float selection comes in two variants, depending on the direction of the search.

<1> Sequential Floating Forward Selection (SFFS)

Algorithm description: starting from the empty set, in each round a subset x is selected from the unselected features such that adding x makes the evaluation function optimal, and then a subset z is selected from the already selected features such that removing z makes the evaluation function optimal.

<2> Sequential Floating Backward Selection (SFBS)

Algorithm description: Similar to SFFS, the difference is that SFBS starts from the complete set, and features are first removed and then added in each round.

Algorithm evaluation: sequential floating selection combines the characteristics of sequential forward selection, sequential backward selection and plus-L minus-R selection, and compensates for their shortcomings.

F. Decision Tree Method (DTM)

Algorithm description: run C4.5 or another decision-tree algorithm on the training set, and prune the tree after it is fully grown. The features appearing at the branch nodes of the final decision tree form the selected feature subset. Decision tree methods generally use information gain as the evaluation function.

3) Random Algorithm

A. Random Generation Plus Sequential Selection algorithm (RGSS)

Algorithm description: Generate a random subset of features, and then execute SFS and SBS algorithms on this subset.

Algorithm evaluation: it can be used as a supplement to SFS and SBS, helping to escape local optima.

B. Simulated Annealing algorithm (SA)

Algorithm evaluation: simulated annealing overcomes, to some extent, the tendency of sequential search algorithms to get stuck in local optima, but if the region of the optimal solution is very small (the so-called "golf-hole" landscape), simulated annealing has difficulty finding it.

C. Genetic Algorithms (GA)

Algorithm description: first, a batch of feature subsets is randomly generated and scored by the evaluation function; then the next generation of subsets is produced through crossover, mutation and other operations, where subsets with higher scores have a higher probability of being selected for reproduction. After N generations of reproduction and survival of the fittest, the population may contain a feature subset with the highest evaluation-function value.

A common disadvantage of random algorithms: they depend on random factors, so experimental results are hard to reproduce.

4) Feature transformation method

A. PCA

PCA (Principal Component Analysis) is a coordinate-transformation method that can remove redundant features.

During the transformation, components with small eigenvalues are discarded, which removes noise and correlation and reduces the number of features.
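A small sketch of this idea on a data matrix X whose rows are samples; the 95% variance threshold and the random placeholder data are arbitrary choices for illustration:

% PCA by eigen-decomposition of the covariance matrix: directions with small
% eigenvalues are dropped, which removes noise, correlation and redundant dimensions.
X = rand(100, 9);                        % placeholder data: 100 samples, 9 features
Xc = X - mean(X, 1);                     % center the data
[V, D] = eig(cov(Xc));                   % eigenvectors / eigenvalues of the covariance matrix
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);
explained = cumsum(lambda) / sum(lambda);
k = find(explained >= 0.95, 1);          % smallest k explaining 95% of the variance
scores = Xc * V(:, 1:k);                 % data projected onto the top-k principal components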

B. Wavelet transform

The wavelet transform is another feature-space transformation. Compared with the Fourier transform, it adapts better to abrupt changes in the signal.

4. Evaluation function

The evaluation function judges the quality of the feature subsets provided by the generation procedure.

1) Independent criteria

Independent criteria are usually applied in filter-model feature selection algorithms, which evaluate the selected feature subset using intrinsic properties of the training data, independently of any specific learning algorithm. Common criteria include distance measures, information measures, relevance measures and consistency measures.

This approach is recommended for more general-purpose feature selection because it is independent of any specific machine learning algorithm and applies to most subsequent learning methods.

2) Relevance measurement

The relevance criterion is usually used in wrapper-model feature selection algorithms: a learning algorithm is first fixed, and its performance is used as the evaluation criterion. For a particular learning algorithm this usually finds a better feature subset than the filter model, but the learning algorithm has to be invoked many times, which is generally expensive, and the selected subset may not suit other learning algorithms.

When building a pattern classification algorithm we can adopt the relevance measure according to our actual situation; it integrates better with our classification method and usually finds a better subset.

To sum up, the advantages, disadvantages and application of the two evaluation functions are summarized as follows:

Criterion: Independent criterion | Relevance measure
Advantages: generic, independent of any particular learning algorithm | may be optimal for the associated classification algorithm
Disadvantages: results are only average | not applicable to other learning algorithms
Applicable to: general-purpose feature selection, independent of the later learner | cases where the subsequent classification algorithm is fixed and known

Table 1. Advantages and disadvantages of the two types of evaluation criteria

3) Common evaluation functions

A. Chi-square test

The basic idea of the chi-square test is to judge whether a theory is correct from the deviation between the observed values and the theoretical values. Concretely, we first assume that the two variables are independent (the "null hypothesis"), and then observe the degree of deviation between the actual values (observations) and the theoretical values (the values that would be expected if the two really were independent). If the deviation is small enough, we regard it as natural sampling error, imprecision of measurement or chance, conclude that the two really are independent, and accept the null hypothesis. If the deviation is so large that it is unlikely to be due to chance or imprecise measurement, we conclude that the two are actually related, reject the null hypothesis and accept the alternative hypothesis.

Let the theoretical (expected) value be E and the observed values be x_i. The degree of deviation is calculated as follows:
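In standard notation, with n observations x_1, ..., x_n and a common expected value E as in the text, this deviation is

$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E}$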

This is the deviation measure used in the chi-square test. Once observations x1, x2, ..., xi, ..., xn have been collected, they can be substituted into the formula to obtain the chi-square value, which is then compared with a preset threshold. If the value exceeds the threshold (a large deviation), the null hypothesis is rejected; otherwise the null hypothesis is accepted. [Please refer to my other popular article on the chi-square test.]

B. Correlation

Using correlation to measure the quality of a feature subset is based on the assumption that a good subset should contain features that are highly correlated with the class (high relevance) and weakly correlated with each other (low redundancy).

The correlation coefficient can be used to measure the linear correlation between vectors:

$r(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$

C. Distance Metrics

Feature selection using a distance measure is based on the assumption that a good feature subset should make the distance between samples of the same class as small as possible and the distance between samples of different classes as large as possible. Fisher discriminant classification is based on the same idea.

Commonly used distance measures (similarity measures) include Euclidean distance, standardized Euclidean distance, Mahalanobis distance and so on.

D. Information Gain

Suppose there is a discrete variable Y taking values {y1, y2, ..., ym}, where yi occurs with probability Pi. The information entropy of Y is then defined as:

$H(Y) = -\sum_{i=1}^{m} P_i \log_2 P_i$

Information entropy describes uncertainty. It has the following properties: if the values of Y are unevenly distributed, the entropy is small; if the distribution is more even, the entropy is larger. In the extreme cases, if Y can take only one value, i.e. P1 = 1, then H(Y) attains its minimum value 0; if all values are equally probable, i.e. every probability is 1/m, then H(Y) attains its maximum value log2 m.

For a feature T, consider the amount of information the system has when T's value is unknown and when it is known; the difference between the two is the amount of information the feature brings to the system, i.e. its information gain. The former is the information entropy, the latter the conditional entropy.

Conditional entropy: the amount of information remaining in the system once the feature T is held fixed.

For a feature X with n possible values (x1, x2, ..., xn), compute the conditional entropy for each value and then take the weighted average over the values:
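In standard notation this is the conditional entropy

$H(Y \mid X) = \sum_{i=1}^{n} P(x_i)\, H(Y \mid X = x_i)$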

In text classification the feature word T takes only two values: t (T occurs) and t̄ (T does not occur). Then:
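In the usual notation, with C denoting the class variable:

$H(C \mid T) = P(t)\, H(C \mid t) + P(\bar{t})\, H(C \mid \bar{t})$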

Finally, the information gain of feature T is the difference between the two:

$IG(T) = H(C) - H(C \mid T)$

The problem with information gain [this problem exists for multi-class classification; for binary classification it does not] is that it can only examine a feature's contribution to the whole system, not to a particular category. This makes it suitable only for so-called "global" feature selection (all classes share the same feature set) and unsuitable for "local" feature selection (each category has its own feature set, because a word may be discriminative for one category but insignificant for another).

At the same time, information gain is biased towards features whose values are spread over many categories, so an improved method is to use the information gain ratio.

E. Classifier Error Rate

A specific classifier is used to classify the sample set with a given feature subset, and the accuracy of the classification is used to measure the quality of the feature subset.

Of the measures above, the chi-square test, correlation, distance and information gain belong to the filter type, while the classifier error rate belongs to the wrapper type.

Because the filter is independent of any specific classification algorithm, it generalizes well across different classifiers and is computationally cheap. The wrapper, because it uses a specific classification algorithm during evaluation, may transfer poorly to other classifiers and is computationally expensive.

5. Application examples

Here is a practical example. The basic recipe is heuristic search (sequential addition) + a relevance criterion (chi-square test, maximum entropy) + a precision/recall stopping criterion. The procedure is described in detail below, with a sketch after the steps.

Step 1: compute the chi-square value of each feature.

Step 2: take the top-N features.

Step 3: train the model and compute precision and recall on the test set.

Step 4: if the target is reached, stop; otherwise adjust N and go back to Step 2.
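A schematic of this loop is sketched below; scoreFun, trainFun and prFun are hypothetical handles standing in for whatever chi-square scoring, model training and precision/recall evaluation code is actually used:

% Heuristic search (add features in chi-square order) with a precision/recall
% stopping criterion.  scoreFun, trainFun and prFun are user-supplied handles.
function feats = chi2TopNSelect(X, y, Xtest, ytest, scoreFun, trainFun, prFun, targetP, targetR)
chi2 = scoreFun(X, y);                   % Step 1: chi-square value of every feature
[~, order] = sort(chi2, 'descend');
for topN = 1:numel(order)                % Step 2: take the top-N features
    feats = order(1:topN);
    model = trainFun(X(:, feats), y);    % Step 3: train and evaluate on the test set
    [p, r] = prFun(model, Xtest(:, feats), ytest);
    if p >= targetP && r >= targetR      % Step 4: stop once precision and recall reach the target
        return;
    end
end
end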

Data mining gives people the ability to realize the real value of data, that is, the information and knowledge hidden in it. Data mining refers to extracting interesting knowledge from large databases or data warehouses: hidden, previously unknown and potentially useful information. It is currently one of the most cutting-edge research directions in the database and information-decision fields. Below I share a small piece of research I did some time ago; it is also a simple example of data mining in practice.

1. Overview of data mining and cluster analysis

Data mining generally consists of the following steps:

(1) Analyze the problem: the source database must be evaluated to check whether it meets the data mining standards, and the algorithm best suited to the expected result is chosen.

(2) Extract, clean and verify the data: the extracted data are placed in a database whose structure is compatible with the data model; inconsistent and incompatible data are cleaned into a consistent format. Once the data have been extracted and cleaned, browse the created model to make sure all the data are present and complete.

(3) Model creation and debugging: a structure is generated after the algorithm is applied to the model. It is important to browse the data in the resulting structure to verify that it is an accurate representation of the “facts” in the source data. While it may not be possible to do this for every detail, by looking at the generated model, it is possible to discover important features.

(4) Query the data of the data mining model: once the model is established, the data can be used for decision support.

(5) Maintain the data mining model: after the data model is established, the characteristics of the initial data, such as validity, may change. Some information changes can have a big impact on accuracy because it changes the nature of the original model on which it is based. Therefore, it is very important to maintain the data mining model.

Cluster analysis is a core technology of data mining and has become a very active research topic in the field. It is based on the simple idea that "birds of a feather flock together": things are clustered or classified according to their characteristics. As an important research direction of data mining, cluster analysis has attracted more and more attention. The input of clustering is a set of data without category labels, and the number of clusters may or may not be known in advance. By analysing the data according to some clustering criterion, the set of records is partitioned so that similar records fall into the same cluster and dissimilar records fall into different clusters.

2. Feature selection and clustering analysis algorithm

Relief is a family of algorithms, including the original Relief and the later extensions ReliefF and RReliefF. RReliefF was proposed for regression problems where the target attribute is continuous. Only the Relief and ReliefF algorithms for classification problems are introduced below.

2.1 Relief algorithm

The Relief algorithm was first proposed by Kira and was initially limited to two-class classification problems. Relief is a feature-weighting algorithm: features are assigned weights according to the correlation between each feature and the class, and features whose weight falls below a threshold are removed. In Relief, the correlation between a feature and the class is based on the feature's ability to distinguish nearby samples. The algorithm randomly selects a sample R from the training set D, then finds the nearest neighbour H among the samples of the same class as R (called the near hit) and the nearest neighbour M among the samples of a different class (called the near miss), and updates the weight of each feature according to the following rule: if the distance between R and the near hit on a feature is smaller than the distance between R and the near miss, the feature helps distinguish same-class from different-class neighbours, and its weight is increased; conversely, if the distance between R and the near hit on a feature is larger than the distance between R and the near miss, the feature has a negative effect and its weight is decreased. The process is repeated m times, and the average weight of each feature is obtained. The larger a feature's weight, the stronger its discriminative power; the smaller the weight, the weaker it is. The running time of Relief grows linearly with the number of samplings m and the number of original features N, so it is very efficient. The specific algorithm is as follows:
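In its usual form, the core of the algorithm is the weight update applied at each of the m samplings: for every feature A,

$W(A) \leftarrow W(A) - \frac{\mathrm{diff}(A, R, H)}{m} + \frac{\mathrm{diff}(A, R, M)}{m}$

where, for numeric features, $\mathrm{diff}(A, X_1, X_2) = \frac{|X_1[A] - X_2[A]|}{\max(A) - \min(A)}$. This is the normalization (division by 9 for the 1-10 valued attributes) used in the Matlab code at the end of this post.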

2.2 ReliefF algorithm

Relief is widely used because it is simple, efficient and gives satisfactory results, but its limitation is that it can only handle two-class data. In 1994 Kononenko extended it to the ReliefF algorithm, which can handle multi-class problems (the RReliefF variant mentioned above handles regression problems with a continuous target attribute). Each time, ReliefF randomly takes a sample R from the training set, finds the k nearest neighbours of R (near hits) among the samples of the same class and k nearest neighbours (near misses) among the samples of each different class, and then updates the weight of each feature, as shown in the following formula:
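In its usual multi-class form the update for a feature A is

$W(A) \leftarrow W(A) - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, H_j)}{mk} + \sum_{C \neq \mathrm{class}(R)} \frac{P(C)}{1 - P(\mathrm{class}(R))} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R, M_j(C))}{mk}$

where the $H_j$ are the k near hits, the $M_j(C)$ are the k near misses taken from class C, and $P(C)$ is the prior probability of class C. For the two-class data used later in this post this reduces to the simpler update implemented in the Matlab code.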

The Relief family of algorithms is efficient and places no restriction on the data type. They are feature-weighting algorithms that give high weights to all features strongly correlated with the class, so their limitation is that they cannot effectively remove redundant features.

2.3 K-means clustering algorithm

Since clustering partitions data by natural similarity, the resulting clusters should have data that are as similar as possible within each cluster and as different as possible between clusters. It is therefore important to define a scale for measuring similarity. In general there are two ways: the first defines a distance between data points, describing how different they are; the second defines the similarity between data points directly. Some common ways of defining distance are listed below, with their formulas after the list:

1. Euclidean distance, the traditional notion of distance, suitable for 2- and 3-dimensional spaces.

2. Minkowski distance, an extension of the Euclidean distance, which can be understood as a distance in n-dimensional space.
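In standard notation, for points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ these are

$d_{\mathrm{Euclid}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{\mathrm{Minkowski}}(x, y) = \Big( \sum_{i=1}^{n} |x_i - y_i|^p \Big)^{1/p}$

with the Minkowski distance reducing to the Manhattan distance at p = 1 and to the Euclidean distance at p = 2.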

There are many kinds of clustering algorithms. When necessary, appropriate clustering algorithms can be selected according to the data types involved, the purpose of clustering and the application requirements. The k-means clustering algorithm is introduced below:

K-means is a common partition-based clustering algorithm. It takes k as a parameter and divides n objects into k clusters so that similarity within each cluster is high and similarity between clusters is low. The process is as follows: first, k objects are chosen at random as the centroids of the initial k clusters; then every remaining object is assigned to the nearest cluster according to its distance to each centroid; finally the centroid of each cluster is recomputed. This is repeated until the objective function no longer decreases. The centroid of a cluster is obtained by the following formula:
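In the usual notation, with $C_j$ the set of points assigned to cluster j, the centroid and the squared-error objective that K-means minimizes are

$c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x, \qquad E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2$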

In concrete implementations a maximum number of iterations is usually set to prevent an endless loop when the convergence condition is never met. K-means tries to find the k partitions that minimize the squared-error objective. It works well when the data are fairly evenly distributed and the clusters are clearly separated. For large data sets the algorithm is relatively scalable and efficient, with a cost of roughly O(nkt), where n is the number of objects, k the number of clusters and t the number of iterations; in general it terminates at a local optimum. However, K-means can only handle numeric attributes and is not suited to data with, for example, categorical attributes. It also requires the number of clusters k to be given in advance, which places a high demand on users. Moreover, since the initial cluster centres are chosen at random, different initial centres can greatly affect the result. In addition, K-means is not suitable for finding clusters with non-convex shapes or very different sizes, and it is sensitive to noise and outliers.

3. An example of medical data analysis

3.1 Data Description

The experimental data in this paper come from the well-known UCI machine learning repository, which holds a large collection of data sets for artificial intelligence and data mining; the website is archive.ics.uci.edu/ml/. The database is constantly…

The data selected in this paper is the Breast Cancer Wisconsin (Original) Data Set. The data come from clinical case reports of the University of Wisconsin Hospital in the USA, and each record has 11 attributes. The downloaded file is in ".data" format; it was converted with Excel and Matlab into Matlab's default data format and saved, so that the programs can load it conveniently.

The following table gives the names and descriptions of the 11 attributes of the data set:

 

From the converted data and the data description it can be seen that 9 indicators can be used for feature extraction; the sample number is only an identifier and the class column is only used to determine the class. The data-processing idea of this paper is to first use the ReliefF feature-extraction algorithm to compute the weight of each attribute and eliminate the least relevant ones, and then apply the K-means clustering algorithm to the remaining attributes.

3.2 Data preprocessing and procedures

In this paper, preprocessing is carried out after the data conversion. Since the data values already lie in the range 1-10, normalization is not required. However, some samples are incomplete, which would affect the program: these incomplete entries were not recorded, or were lost, for practical reasons, and appear as placeholder marks in the raw data.

This paper uses Matlab for the computations. Following the ReliefF algorithm flow described in Section 2, the ReliefF function that computes the feature weights is written first, and then the main program, which calls this function, analyses the results and draws the plots from which the conclusions are obtained.

The program is posted at the end.

3.3 Feature extraction of breast cancer data set

The ReliefF algorithm of Section 2.2 is used to compute the weight of each feature; features whose weight is below a threshold are removed, and in this paper the 2-3 features with the lowest weights are eliminated. Since the algorithm picks a random sample R during its run, the randomness leads to some variation in the resulting weights, so this paper averages: the main program is run 20 times and the results are summarized to obtain the mean of each weight. In the results below, the columns are the attribute numbers and the rows are the results of each run:

The following is the trend of the feature weights computed by the feature-extraction algorithm; the 20 runs show the same trend:

The results above, obtained by running the main program, are not very intuitive to read. The following plots are drawn to show the size distribution of each attribute's weight, as shown in the figure below:



Sorted from small to large, the weight relationship of the attributes is:

Attribute 9< attribute 5< attribute 7< attribute 4< attribute 2< attribute 3< attribute 8< attribute 1< attribute 6

With the weight threshold chosen as 0.02, attribute 9, attribute 4 and attribute 5 are eliminated.

From the feature weights above, attribute 6 (bare nuclei) is one of the main influencing factors, suggesting that in breast cancer patients the symptoms show up first as changes in the bare nuclei, followed by attributes 1 and 8, etc. The weights of the remaining attributes are close in size, but over many runs of the calculation they still show different degrees of importance. The following focuses on several important attributes. Below is the weight change of bare nuclei (attribute 6) over the 20 runs:

As the figure shows, the weight of this attribute is mostly around 0.22-0.26, the largest of all attributes. Now look at the weight distribution of attribute 1:

The feature weight of the clump thickness attribute fluctuates around 0.19-0.25, also a very high weight, indicating that this attribute is a very important criterion among the breast cancer indicators. Further analysis shows that clustering on attribute 6 and attribute 1 alone already reaches a success rate of 91.8%; this is discussed in detail with the K-means algorithm in the next section.

3.4 Cluster analysis of breast cancer data

In the previous section the ReliefF algorithm was used to obtain the importance (weights) of the attributes, which can provide a reference for clinical diagnosis, be used to analyse actual cases, help avoid misdiagnosis, and improve the speed and accuracy of diagnosis. Next, the data are analysed with the K-means clustering algorithm. This section compares, in several steps, the results of the clustering algorithm alone and combined with the ReliefF algorithm.

1. K-means analysis of the data set alone

Below, the K-means algorithm is applied to the data set on its own. Matlab ships several standard data mining algorithms, including the kmeans function used here for cluster analysis. First, all attribute columns (excluding the identifier and class columns) of the breast cancer data set are clustered directly. Since the data set has only two classes, the test first uses two clusters, with the following results: overall, the 683 records are divided into two categories with an overall accuracy of 94.44%; the accuracy of the first category is 93.56% and of the second 96.31%. The following are the distribution plots of the attribute values for different attributes after classification:

For reasons of space, only three feature attributes are shown. The plots give an intuitive view of the result of the K-means classification, and the boundary between category 1 and category 2 is relatively clear, although it is not easy to see from the plots what is classified correctly and what is not. The following table gives the cluster centre of each attribute in the classification result:

Judging from its performance, K-means classifies this data set accurately. On the one hand the cases in this data set may have very distinct characteristics; on the other hand, K-means works well for such a two-category task.

2.K-means analyzes data sets with ReliefF

In terms of classification accuracy, K-means alone already judges the breast cancer data set very accurately. However, considering the attribute weights produced by ReliefF, this section combines the ReliefF and K-means algorithms to analyse the data set, in order to draw some simple conclusions about this problem on the one hand and some methods for processing medical data on the other.

First, based on the conclusions of Section 3.3, this section preprocesses the data fed to K-means according to the weights of the different attributes, in order to obtain more accurate conclusions and deeper feature rules.

According to Section 3.3, attribute 9 < attribute 5 < attribute 7 < attribute 4 < attribute 2 < attribute 3 < attribute 8 < attribute 1 < attribute 6. Following the principle of the ReliefF algorithm, this paper assumes that the important attributes 6 and 1 should play a larger role in classification. Therefore, each attribute is tested on its own below; the detailed results are shown in the following table:

In overall classification accuracy, attribute 9 is the lowest and attribute 6 the highest, roughly matching the ReliefF results; the attributes with similar intermediate weights are not clearly distinguished. This shows that the feature weights do affect classification. In these single-attribute tests, only the column to be classified is extracted and fed to K-means; since the input data changes, the K-means result naturally changes, so judging an attribute by a single-column clustering alone is not reliable. Below, the highest and lowest single-attribute cases are selected and their attribute-value distributions are plotted, as shown in the figure below:

Next, data are selected for cluster analysis in order of decreasing feature weight, with the following conclusions:

1. All 9 attributes were directly selected, and the classification success rate was 94.44%;

2. Select attribute 6 and Attribute 1, and the classification success rate is 91.36%;

3. Select attributes 6,1,8,3, and the classification success rate is 93.85%;

4. Select attributes 6,1,8,3,2,4, and the classification success rate is 94.48%;

5. Select attributes 6,1,8,3,2,4,5,7, and the classification success rate is 95.02%;

As the tests above show, selecting the six attributes with the largest feature weights gives an accuracy close to that of selecting all attributes. We can therefore conclude that the attributes with the smallest feature weights may play only a small role in the actual diagnosis of breast cancer, that is, the values of these few attributes are not necessarily associated with breast cancer. This can serve as a diagnostic reference, or be noted for further research to confirm.

3. K-means clustering into three categories

Most of the results and conclusions about this data set can already be obtained from the two experiments above. However, in order to distinguish data of the same type more finely, a three-cluster split is attempted below: on the one hand to analyse the salient attributes of benign and malignant breast cancer, and on the other hand to look for a more reasonable scheme based on the result.

The Matlab kmeans function was used with the number of clusters set to 3. Since judging the result is more complicated with three clusters, the data were analysed manually, using all feature attributes. The results are as follows: there are 683 test records in total, of which 444 are benign and 239 malignant:

1. Among the records classified into the first category, 96.88% were benign;

2. Malignancy accounted for 100% of the records classified into the second category;

3. Malignant cases accounted for 92% of the records classified into the third category;

From the above results, the first category can be regarded as benign, the second as malignant and the third as mixed. The mixed category indicates that its data deviate further from the typical cases than the other data, so the classification accuracy of the first and second categories is analysed further:

1. The first category was benign, with a total of 448 data, and the classification accuracy rate was 96.88%.

2. The second category was malignant, with 99 data, and the classification accuracy was 100%;

3. The third category is mixed category, with a total of 136 data items

Judging by the per-category accuracy after splitting into three clusters, the effect improves: typical cases are classified more accurately, while the third (mixed) category remains indistinguishable. In this situation the value of the approach lies not in the overall classification accuracy, but in the fact that in certain special cases a patient can be diagnosed from the values of a few important attributes, which improves efficiency and accuracy and reduces the chance of misdiagnosis.

Above, all attributes were fed to K-means. Next, the ReliefF algorithm is combined with it: attributes with low feature weights are removed before the K-means processing. Following the conclusion of Section 3.3, the six attributes with the largest weights are used for the test: attribute 6, attribute 1, attribute 8, attribute 3, attribute 2 and attribute 4.

1. The first category was benign, with 281 data items, and the classification accuracy rate was 97.51%;

2. The second category was malignant, with 211 data items, and the classification accuracy rate was 97.16%.

3. The third category is mixed category, with 191 data items

The comparison shows that although the accuracy for the benign category increases, the number of records detected in it decreases, and the size of the third (mixed) category grows. This indicates that removing the attributes with smaller feature weights makes it easier to separate out the extreme cases and to detect them more accurately.

4. Main Matlab source code

1. ReliefF main program

 

 

 

% Main program
clear; clc;
load('matlab.mat')                       % load the converted breast cancer data set
D = data(:, 2:size(data,2));             % drop the sample-ID column, keep the features and the class
m = 80;                                  % number of random samplings in ReliefF
k = 8;                                   % number of nearest neighbours
N = 20;                                  % number of times the whole computation is repeated
for i = 1:N
    W(i,:) = ReliefF(D, m, k);           % feature weights of run i
end
for i = 1:N                              % plot the weights of every run
    plot(1:size(W,2), W(i,:));
    hold on;
end
for i = 1:size(W,2)                      % average weight of each attribute over the N runs
    result(1,i) = sum(W(:,i)) / size(W,1);
end
xlabel('attribute number');
ylabel('feature weight');
title('Feature weights of the breast cancer data computed by ReliefF');
axis([1 10 0 0.3])
% ------- plot the trend of each attribute over the 20 runs
name = char('clump thickness', 'uniformity of cell size', 'uniformity of cell shape', ...
    'marginal adhesion', 'single epithelial cell size', 'bare nuclei', ...
    'bland chromatin', 'normal nucleoli', 'mitoses');
name = cellstr(name);
for i = 1:size(W,2)
    figure
    plot(1:size(W,1), W(:,i));
    xlabel('run number');
    ylabel('feature weight');
    title(['Weight change of ' char(name(i)) ' (attribute ' num2str(i) ')']);
end

 

 

ReliefF function program

 

% ReliefF function: the training set D is the input (the sample-ID column has
% already been removed); the last column of D is the class label (2 = benign, 4 = malignant).
function W = ReliefF(D, m, k)
Rows = size(D,1);
Cols = size(D,2);                        % number of columns, including the class column
type2 = sum(D(:,Cols)==2) / Rows;        % proportion of class-2 samples (kept from the source, not used below)
type4 = sum(D(:,Cols)==4) / Rows;        % proportion of class-4 samples
% split the data set into the two classes first, which speeds up the neighbour search
D1 = zeros(0, Cols);                     % class-2 (benign) samples
D2 = zeros(0, Cols);                     % class-4 (malignant) samples
for i = 1:Rows
    if D(i,Cols) == 2
        D1(size(D1,1)+1,:) = D(i,:);
    elseif D(i,Cols) == 4
        D2(size(D2,1)+1,:) = D(i,:);
    end
end
W = zeros(1, Cols-1);                    % initialize all feature weights to 0
for i = 1:m                              % repeat the random sampling m times
    % randomly pick a sample R and get its near-hit and near-miss feature distances
    [R, Dh, Dm] = GetRandSamples(D, D1, D2, k);
    for j = 1:length(W)                  % update every feature weight according to the formula
        W(1,j) = W(1,j) - sum(Dh(:,j))/(m*k) + sum(Dm(:,j))/(m*k);
    end
end

 

ReliefF auxiliary function: find the k nearest neighbour samples

 

 

 

% Pick a random sample R and find its nearest neighbours.
% D : training set;  D1: class-2 (benign) subset;  D2: class-4 (malignant) subset
% Dh: per-feature distances to the k near hits;  Dm: per-feature distances to the k near misses
% (the Distance helper, not listed in this post, is assumed to return the
%  Euclidean distance between two sample rows)
function [R, Dh, Dm] = GetRandSamples(D, D1, D2, k)
% generate a random index first and take the selected sample R
r = ceil(1 + (size(D,1)-1)*rand);
R = D(r,:);                              % row r of D becomes the sample R
d1 = zeros(1,0);
d2 = zeros(1,0);
for i = 1:size(D1,1)                     % distance between R and every sample in D1
    d1(1,i) = Distance(R, D1(i,:));
end
for j = 1:size(D2,1)                     % distance between R and every sample in D2
    d2(1,j) = Distance(R, D2(j,:));
end
[v1, L1] = sort(d1);                     % sort the distances to class 2
[v2, L2] = sort(d2);                     % sort the distances to class 4
if R(1,size(R,2)) == 2                   % sample R is benign (class 2)
    H = D1(L1(1,2:k+1),:);               % near hits: same class, skipping R itself
    M = D2(L2(1,1:k),:);                 % near misses: the other class
else                                     % sample R is malignant (class 4)
    H = D2(L2(1,2:k+1),:);               % near hits from the malignant subset
    M = D1(L1(1,1:k),:);                 % near misses from the benign subset
end
% per-feature distance between two samples: |f1 - f2| / (max - min)
for i = 1:size(H,1)
    for j = 1:size(H,2)
        Dh(i,j) = abs(H(i,j) - R(1,j)) / 9;   % the data range is 1-10, so max - min = 9
        Dm(i,j) = abs(M(i,j) - R(1,j)) / 9;
    end
end

 

 

 

3. Main program of k-means algorithm

 

clc; clear;
load('matlab.mat')                       % load the converted breast cancer data set
N0 = 1; N1 = size(data,1);               % range of rows used in this run
data = data(N0:N1,:);
data1 = data(:, 2:size(data,2)-1);       % the 9 feature columns (column 1 = ID, last column = class);
                                         % swap in a subset such as data(:,[2,4,7,9]) for the reduced tests
opts = statset('Display','final');       % control options
[idx, ctrs, sumd, D] = kmeans(data1, 2, ...  % data1: data to cluster, 2: number of clusters (two classes here)
    'Distance', 'city', ...              % distance measure
    'Options', opts);                    % see the Matlab help for kmeans
t = [data(:, size(data,2)), idx(:,1)];   % true class label next to the assigned cluster index
d2 = data(idx==1, 11);                   % true labels of the records put into cluster 1
a = sum(d2==2);                          % how many of them are benign (class 2)
b = a / length(d2);
totalSum = 0;                            % number of correctly clustered records
rate1 = 0;                               % accuracy of the first category
rate2 = 0;                               % accuracy of the second category
if (b > 0.5)                             % cluster 1 is mostly benign
    totalSum = totalSum + a;
    rate1 = a / length(d2);
    totalSum = totalSum + sum(data(idx==2,11)==4);
    rate2 = sum(data(idx==2,11)==4) / length(data(idx==2,11));
else                                     % cluster 1 is mostly malignant
    totalSum = totalSum + sum(data(idx==1,11)==4);
    totalSum = totalSum + sum(data(idx==2,11)==2);
    rate1 = sum(data(idx==2,11)==2) / length(data(idx==2,11));
    rate2 = sum(data(idx==1,11)==4) / length(data(idx==1,11));
end
x1 = 1;                                  % index (column of data1) of the attribute to plot
x2 = 1;                                  % second attribute index (kept from the source; used in other plots)
plot(1:sum(idx==1), data1(idx==1,x1), 'r.', 'MarkerSize', 12);
hold on;
plot(sum(idx==1)+1 : sum(idx==1)+sum(idx==2), data1(idx==2,x1), 'b.', 'MarkerSize', 12);
xlabel('record number');
ylabel('attribute value');
title(['Value distribution of attribute ' num2str(x1)]);
legend('category 1', 'category 2');
axis([0 640 0 10])
rate = totalSum / size(t,1)              % overall clustering accuracy

 


Matlab algorithm main program