Original link:tecdat.cn/?p=18726 

 

A self-organizing map (SOM) neural network is an unsupervised data visualization technique that can be used to visualize high-dimensional data sets in a low-dimensional (usually 2-dimensional) representation. In this article, we examine how to use R to create a SOM for customer segmentation.

SOM was first described in 1982 by the Finnish researcher Teuvo Kohonen, whose work in the field has made him the most cited Finnish scientist in the world. A typical SOM visualization is a colorful 2D diagram of hexagonal nodes.

 

SOM

The SOM visualization consists of multiple “nodes”. Each node has:

  • A location on the SOM grid
  • A weight vector of the same dimension as the input space. (For example, if your input data has the variables “age,” “gender,” “height,” and “weight,” each node on the grid will also have values for these variables.)
  • Associated samples from the input data. Each sample in the input space is “mapped” or “linked” to a node on the grid, and one node can represent several input samples.

The key feature of SOM is that the topological features of the original input data are preserved on the map. This means that similar input samples (where similarity is defined over the input variables: age, sex, height, weight) are placed close together on the SOM grid. For example, all women around 55 years old and roughly 1.6 m tall will be mapped to nodes in the same area of the grid. Taking all variables into account, short people will be mapped elsewhere; tall men end up closer to tall women than to short, heavier men because they are more “alike” overall.
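To make “similarity” concrete: it is simply the Euclidean distance between two samples over the (usually scaled) input variables. A minimal illustration with two hypothetical samples (the variables and values are made up):

```r
# Euclidean distance between two hypothetical samples
# (variables and values are for illustration only)
person_a <- c(age = 55, height = 1.60, weight = 60)
person_b <- c(age = 56, height = 1.62, weight = 63)
dist_ab <- sqrt(sum((person_a - person_b)^2))
dist_ab  # ~3.16 in raw units; in practice the variables are scaled first
```

Note that without scaling, variables on large scales (such as age) dominate the distance, which is why the training data is standardized below.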

SOM heat map

A typical SOM visualization is the “heat map.” A heat map shows the distribution of a single variable across the SOM. Ideally, samples with similar values – for example, people of similar ages – should congregate in the same area.

The chart below uses two heat maps to illustrate the relationship between average education levels and unemployment.

 

SOM algorithm

The algorithm for generating SOM from the sample data set can be summarized as follows:

  1. Select the map size and type. Nodes can be arranged in a hexagonal or rectangular grid. In general, a hexagonal grid is best, since each node then has six immediate neighbors.
  2. Randomly initialize all node weight vectors.
  3. Select a random data point from the training data and present it to the SOM.
  4. Find the “best matching unit” (BMU) – the most similar node on the map. Similarity is calculated with the Euclidean distance formula.
  5. Determine the nodes in the BMU’s neighborhood. – The size of the neighborhood decreases with each iteration.
  6. Adjust the weights of the nodes in the BMU’s neighborhood toward the selected data point. – The learning rate decreases with each iteration. – The adjustment magnitude is proportional to each node’s proximity to the BMU.
  7. Repeat steps 3-6 for N iterations / until convergence.
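The update loop (steps 4-6) can be sketched in a few lines of R. This is a toy illustration under simplified assumptions (Gaussian neighborhood kernel, fixed learning rate within one step) – not the internals of the kohonen package used below:

```r
# One SOM training iteration: find the BMU for sample x, then pull the
# weights of nodes in the BMU's grid neighborhood toward x.
som_step <- function(codes, grid_xy, x, alpha, radius) {
  # Step 4: BMU = node whose weight vector is closest to x (Euclidean)
  d <- sqrt(rowSums(sweep(codes, 2, x)^2))
  bmu <- which.min(d)
  # Step 5: nodes within `radius` of the BMU on the grid
  grid_d <- sqrt(rowSums(sweep(grid_xy, 2, grid_xy[bmu, ])^2))
  nbh <- grid_d <= radius
  # Step 6: adjustment proportional to grid proximity (Gaussian kernel)
  h <- exp(-grid_d[nbh]^2 / (2 * radius^2))
  delta <- sweep(codes[nbh, , drop = FALSE], 2, x)  # codes - x
  codes[nbh, ] <- codes[nbh, , drop = FALSE] - alpha * h * delta
  codes
}
```

A full implementation would loop over random samples while shrinking `alpha` and `radius` each iteration, as described in steps 5 and 6.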

 

SOM in R

training

R can be used both to create and to visualize a SOM; the examples below use the kohonen package.

# Create a self-organizing map in R (kohonen package)
library(kohonen)

# Create a training data set (rows are samples, columns are variables);
# here, I select the subset of variables available in "data"
data_train <- data[, c(3, 4, 5, 8)]

# Change the data frame with training data to a matrix,
# standardizing all variables at the same time
data_train_matrix <- as.matrix(scale(data_train))

# SOM training process (grid size, rlen and alpha are illustrative choices)
som_grid <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal")
som_model <- som(data_train_matrix, grid = som_grid,
                 rlen = 100, alpha = c(0.05, 0.01), keep.data = TRUE)

visualization

Visualization allows you to examine the quality of the generated SOM and explore the relationships between variables in the data set.

  1. Training process: as the SOM training iterations progress, the distance from each node’s weight vector to the samples represented by that node decreases. Ideally, this distance reaches a minimum plateau. This plot option shows progress over time; if the curve is still decreasing, more iterations are required.

    # SOM training progress
    plot(som_model, type = "changes")

     

  2. Node counts: we can visualize the number of samples mapped to each node on the map. This can be used as a measure of map quality – ideally, the samples are distributed relatively uniformly. When choosing the map size, aim for at least 5-10 samples per node.

    # Node counts per map node
    plot(som_model, type = "counts")

     

  3. Neighbor distance: often referred to as the “U-matrix,” this visually represents the distance between each node and its neighbors, typically viewed on a grayscale. Areas with low neighbor distance indicate groups of similar nodes; areas with larger distances indicate that the nodes differ considerably. The U-matrix can be used to identify clusters within the SOM map.

    # U-matrix (neighbor distance) visualization
    plot(som_model, type = "dist.neighbours")

     

  4. Codes/weight vectors: each node’s weight vector consists of values of the original variables used to generate the SOM, and is representative of the samples mapped to that node. By visualizing the weight vectors across the entire map, we can see patterns in the distribution of samples and variables. The default visualization of the weight vectors is a “fan diagram,” which shows for each node a fan representation of the magnitude of each variable in its weight vector.

    # Weight (code) vector view
    plot(som_model, type = "codes")

     

  5. Heat maps: heat maps are perhaps the most important visualization of self-organizing maps. Typically, the SOM process creates multiple heat maps, which are then compared to identify areas of interest on the map. In this case, we visualize average education level across the SOM.

    # Heat map of a single variable (column 4 of the codebook vectors;
    # on newer kohonen versions use getCodes(som_model)[, 4])
    plot(som_model, type = "property", property = som_model$codes[, 4],
         main = colnames(data_train)[4])

     

    It should be noted that this default visualization draws a standardized version of the variable of interest.

    # Aggregate the original (unscaled) variable by the node each sample maps to
    var_unscaled <- aggregate(as.numeric(data_train[, 4]),
                              by = list(som_model$unit.classif), FUN = mean)

    It’s worth noting that the heat map above shows an inverse relationship between unemployment and education levels. Other heat maps displayed side by side can be used to build up a picture of different areas and their features.

     

    Heat maps with empty nodes in the SOM grid: in some cases, SOM training may leave some nodes in the SOM map empty. With a few lines of code, we can find the nodes missing from som_model$unit.classif and give them NA values – this step prevents empty nodes from distorting your heat map.

    # Draw unnormalized variables: aggregate the raw variable by map node
    var_unscaled <- aggregate(as.numeric(data_train_RAW),
                              by = list(som_model$unit.classif), FUN = mean)
    names(var_unscaled) <- c("Node", "Value")

    # Add NA values for unallocated (empty) nodes
    missingNodes <- which(!(seq(1, nrow(som_model$codes)) %in% var_unscaled$Node))
    var_unscaled <- rbind(var_unscaled,
                          data.frame(Node = missingNodes, Value = NA))

    # Resulting data frame, ordered by node so values line up with the map
    var_unscaled <- var_unscaled[order(var_unscaled$Node), ]

    # Now create the heat map using only the correct values
    plot(som_model, type = "property", property = var_unscaled$Value)

     

Clustering and segmentation of self-organizing maps

Clustering can be performed on the SOM nodes to discover groups of samples with similar metrics. An appropriate number of clusters can be estimated using the k-means algorithm and checking for an elbow point in the “within-cluster sum of squares” plot.

# Estimate the number of clusters: elbow in the within-cluster sum of squares
mydata <- som_model$codes
wss <- numeric(15)
for (i in 1:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(wss, type = "b", xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")

# Cluster the codebook vectors hierarchically into 6 clusters
som_cluster <- cutree(hclust(dist(som_model$codes)), 6)

# Show the clusters on the map, coloring nodes by cluster
pretty_palette <- rainbow(6)
plot(som_model, type = "mapping", bgcol = pretty_palette[som_cluster])
add.cluster.boundaries(som_model, som_cluster)

Ideally, the clusters discovered are contiguous on the map surface. To obtain contiguous clusters, a hierarchical clustering algorithm can be used that only groups together nodes that are similar AND adjacent on the SOM grid.

Mapping the clusters back to the original samples

When the clustering algorithm is applied as in the code example above, clusters are assigned to the nodes on the SOM map rather than to the original samples in the dataset.

# Assign to each original sample the cluster of the node it maps to
cluster_assignment <- som_cluster[som_model$unit.classif]
data$cluster <- cluster_assignment

Using the statistics and distributions of the training variables within each cluster to build a meaningful picture of the cluster characteristics is both an art and a science; the clustering and visualization process is usually iterative.
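As a starting point for that profiling, per-cluster means of the training variables can be computed with a one-liner (a sketch, assuming the node clusters from the hierarchical clustering step are stored in som_cluster and the trained model in som_model, as above):

```r
# Mean of each training variable within each cluster of original samples;
# cluster_assignment is the per-sample cluster label derived from the SOM
cluster_assignment <- som_cluster[som_model$unit.classif]
aggregate(data_train, by = list(cluster = cluster_assignment), FUN = mean)
```

Comparing these per-cluster means against the overall means is a simple way to spot which variables distinguish each segment.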

conclusion

Self-organizing maps (SOM) are a powerful tool in data science. Advantages include:

  • An intuitive way to discover customer segmentation information.
  • A relatively simple algorithm, whose results are easy to explain to non-data scientists.
  • New data points can be mapped onto a trained model for prediction.
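On the last point, the kohonen package exposes this directly via map(), which finds the best matching unit for each new sample. A sketch, where new_data is a hypothetical data frame with the same variables as the training set:

```r
# Map unseen samples onto the trained SOM (kohonen::map).
# new_data must be scaled with the training set's centering and
# scaling values, not its own.
new_matrix <- as.matrix(scale(new_data,
                              center = attr(data_train_matrix, "scaled:center"),
                              scale  = attr(data_train_matrix, "scaled:scale")))
mapped <- map(som_model, new_matrix)
mapped$unit.classif  # BMU (node index) for each new sample
```

Combined with the node-to-cluster assignment above, this yields a segment label for each new customer.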

Disadvantages include:

  • Because training iterates over the dataset, parallelization is lacking for very large datasets.
  • It is hard to represent a large number of variables in two dimensions.
  • SOM training requires clean, numerical data, which can be hard to obtain.
