Public account: You and the cabin by: Peter Editor: Peter

Hello, I’m Peter.

Many readers have asked me whether there are any good case studies in data analysis and data mining. The answer is yes, and they are all on Kaggle.

You just need to take the time to study them, or even to play around with them. Peter has no competition experience himself, but he often visits Kaggle to learn how the top competitors approach and solve problems.

To document the good methods of these experts, and to improve his own skills, Peter decided to start a column called Kaggle Case Sharing.

Case analyses will be published from time to time. The ideas all come from experts online, especially the top-ranked solutions; Peter’s role is mainly to organize the ideas and learn the techniques.

Today’s case is about clustering and uses the supermarket user segmentation data set. For the official page, see: supermarket

To make it easy to practice, reply "supermarket" to the public account to get the data set.

The notebook followed here is the top-ranked one.

Import libraries

# Data processing
import numpy as np
import pandas as pd
# KMeans clustering
from sklearn.cluster import KMeans

# drawing library
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected = True)

Exploratory data analysis (EDA)

Import data

First we import the data set:

The data has five attribute fields: customer ID, gender, age, annual income, and spending score.
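A minimal sketch of the loading step, assuming the data set has been saved locally as a CSV file. The file name `supermarket_customers.csv` and the helper `load_customers` are hypothetical; substitute whatever name the downloaded file actually has.

```python
import pandas as pd

# Hypothetical file name: adjust it to match the file you downloaded.
DATA_PATH = "supermarket_customers.csv"


def load_customers(path):
    """Read the customer CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Usage, once the file is in place:
# df = load_customers(DATA_PATH)
# df.head()
```

Wrapping the read in a small function keeps the path in one place, which helps when the notebook is rerun against a different copy of the file.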

Data exploration

1. Data shape

df.shape

# the results
(200, 5)

The data has 200 rows and 5 columns.

2. Missing values

df.isnull().sum()

# the results
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

As you can see, all fields are complete with no missing values

3. Data types

df.dtypes

# the results
CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object

All fields are int64 except Gender, which holds strings (dtype object).

4. Descriptive statistics

Descriptive statistics summarize the numerical fields with values such as the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% quartile is the median).
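As a small illustration of what such a summary looks like, here is a sketch using pandas `describe()` on a toy frame. The rows are made up for the example and are not values from the real data set; on the real data the call would simply be `df.describe()`.

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the real data set.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Annual Income (k$)": [15, 40, 60, 85],
    "Spending Score (1-100)": [10, 50, 60, 90],
})

# describe() reports count, mean, std, min, 25%, 50%, 75% and max
# for every numeric column.
summary = toy.describe()
print(summary)
```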

To simplify the later processing and plotting, we take two preparatory steps:

# 1. Set the drawing style
plt.style.use("fivethirtyeight")

# 2. Take out the three fields for key analysis
cols = df.columns[2:].tolist()
cols
# the results
['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

Histograms of the three attributes

Check the histogram of ‘Age’, ‘Annual Income (k$)’ and ‘Spending Score (1-100)’ to observe the overall distribution:

# drawing
plt.figure(1, figsize=(15, 6))  # Canvas size
n = 0

for col in cols:
    n += 1  # subplot position
    plt.subplot(1, 3, n)  # subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Adjust spacing between subplots
    # distplot is deprecated since seaborn 0.11;
    # sns.histplot(df[col], bins=20, kde=True) is the newer replacement
    sns.distplot(df[col], bins=20)  # Draw the histogram
    plt.title(f'Distplot of {col}')  # title
plt.show()  # Display the figure

Gender factors

Gender counts

First, see how many men and women there are in the data set. Whether gender affects the overall analysis is considered later.
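One way to get these counts is `value_counts()`. The sketch below uses a toy Gender column made up for illustration; on the real data the call would be `df["Gender"].value_counts()`.

```python
import pandas as pd

# Toy column for illustration only; not the real data.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

counts = toy["Gender"].value_counts()                # absolute counts per gender
shares = toy["Gender"].value_counts(normalize=True)  # the same as proportions

print(counts)
print(shares)
```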

Data distribution by gender

sns.pairplot(df.drop(["CustomerID"], axis=1),
             hue="Gender",  # grouping field
             aspect=1.5)
plt.show()

In the pair plot above, the distributions for the two genders largely overlap, so gender appears to have little influence on the other three fields.
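A numerical cross-check of that visual impression could compare per-gender means. This is a sketch on toy rows made up for illustration, not a result from the real data; on the real data the same call would be `df.groupby("Gender")[cols].mean()`.

```python
import pandas as pd

# Toy rows for illustration only; not the real data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [20, 30, 40, 50],
    "Spending Score (1-100)": [40, 60, 60, 40],
})

# Mean of each numeric field within each gender group
by_gender = toy.groupby("Gender")[["Age", "Spending Score (1-100)"]].mean()
print(by_gender)
```

If the group means (and spreads) are close, that supports treating gender as a minor factor; large gaps would suggest clustering each gender separately.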

The relationship between age and annual income by gender

plt.figure(1, figsize=(15, 6))  # Figure size

for gender in ["Male", "Female"]:
    plt.scatter(x="Age", y="Annual Income (k$)",  # The two fields to plot
                data=df[df["Gender"] == gender],  # Rows for one gender
                s=200, alpha=0.5, label=gender  # Marker size, transparency, legend label
               )

# Axis labels and title
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
plt.title("Age vs Annual Income w.r.t Gender")
plt.legend()  # Show which color belongs to which gender
# Display the figure
plt.show()

The relationship between annual income and spending score by gender

plt.figure(1, figsize=(15, 6))

for gender in ["Male", "Female"]:  # Same approach as the previous plot
    plt.scatter(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=df[df["Gender"] == gender],
                s=200, alpha=0.5, label=gender)

plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title("Annual Income vs Spending Score w.r.t Gender")
plt.legend()  # Show which color belongs to which gender
plt.show()

Data distribution by gender

Observe the data distribution with violin plots and swarm plots:

# Violin plots with swarm plots overlaid

plt.figure(1, figsize=(15, 7))
n = 0

for col in cols:
    n += 1  # Subplot index
    plt.subplot(1, 3, n)  # n-th subplot
    plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Adjust spacing between subplots
    # Draw two plots for each column, grouped by Gender
    sns.violinplot(x=col, y="Gender", data=df, palette="vlag")
    sns.swarmplot(x=col, y="Gender", data=df)
    # Axis label and title
    plt.ylabel("Gender" if n == 1 else ' ')
    plt.title("Violinplots & Swarmplots" if n == 2 else ' ')

plt.show()

These plots let us:

  • View how each field is distributed for each gender
  • Look for outliers and other anomalies
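The outlier check can also be done numerically. The sketch below uses the common 1.5 × IQR rule, which is a convention swapped in for illustration and not something the original notebook applies; `iqr_outliers` is a hypothetical helper and the series is toy data.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Toy values for illustration only; one value is deliberately extreme.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100], name="toy")
print(iqr_outliers(toy))
```

On the real data the helper could be applied column by column, for example `iqr_outliers(df["Annual Income (k$)"])`.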

Attribute correlation analysis

Here we look at the pairwise regression between attributes:

cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']  # Correlation analysis of the three attributes
plt.figure(1, figsize=(15, 6))
n = 0

for x in cols:
    for y in cols:
        n += 1  # n increases on each pass, moving to the next subplot
        plt.subplot(3, 3, n)  # 3-by-3 grid, n-th subplot
        plt.subplots_adjust(hspace=0.5, wspace=0.5)  # Spacing between subplots
        sns.regplot(x=x, y=y, data=df, color="#AE213D")  # Data and color for the plot
        plt.ylabel(y.split()[0] + " " + y.split()[1] if len(y.split()) > 1 else y)  # Shortened axis label

plt.show()

The resulting figure:

The figure above shows two things:

  • The panels on the main diagonal plot each attribute against itself, so the points fall on a straight line
  • The other panels plot pairs of different attributes, showing the scatter of the data together with the fitted regression line
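The same pairwise view can be summarized as a correlation matrix. This is a sketch on toy columns made up for illustration (constructed to be perfectly negatively correlated); on the real data the call would be `df[cols].corr()`.

```python
import pandas as pd

# Toy columns for illustration only; not the real data.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Spending Score (1-100)": [80, 60, 40, 20],
})

# Pearson correlation between every pair of columns.
# The diagonal is each column against itself, hence 1.
corr = toy.corr()
print(corr)
```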

Clustering between two attributes

The principle and procedure of the clustering algorithm are not explained here; basic familiarity is assumed.

K value selection

We determine the value of k by drawing an elbow plot of the data. References:

1. Parameter explanation from the official documentation: scikit-learn.org/stable/modu…

2. Chinese-language explanation: blog.csdn.net/qq_34104548…

df1 = df[['Age', 'Spending Score (1-100)']].iloc[:, :].values  # Data to be fitted
inertia = []   # Empty list, used to store the sum of squared distances to the centroids

for k in range(1, 11):  # k runs from 1 to 10; an upper bound of 5 or 10 is a common rule of thumb
    algorithm = (KMeans(n_clusters=k,  # k value
                       init="k-means++",  # Initialization method
                       n_init=10,  # Number of runs with different centroid seeds
                       max_iter=300,  # Maximum number of iterations
                       tol=0.0001,  # Convergence tolerance
                       random_state=111,  # Random seed
                       algorithm="elkan"))  # "full" was renamed "lloyd" in scikit-learn 1.1; "elkan" works in old and new versions
    algorithm.fit(df1)  # Fit data
    inertia.append(algorithm.inertia_)  # Sum of squared distances of samples to their closest centroid

Plot how the inertia (the sum of squared distances to the centroids) changes with k:

plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')  # The data is drawn twice with different markers
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)

plt.xlabel("Choice of K")
plt.ylabel("Inertia")
plt.show()

From the elbow plot, k=4 looks appropriate, so we use k=4 to fit the actual model.
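To make the quantity behind the elbow plot concrete, here is a minimal sketch that recomputes `inertia_` by hand as the sum of squared distances from each point to its assigned centroid. The four points are toy data made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy points for illustration only: two tight pairs far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(X)

# Inertia by hand: squared distance from each point to its own centroid.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

# Inertia for k = 1, 2, 3: it can only shrink as k grows,
# which is why we look for the bend rather than the minimum.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=111).fit(X).inertia_
    for k in (1, 2, 3)
]

print(manual_inertia, model.inertia_)
print(inertias)
```

Because inertia keeps falling as k increases, the elbow is read as the point where adding another cluster stops paying off, not as the lowest value.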

Clustering modeling

algorithm = (KMeans(n_clusters=4,  # k=4
                       init="k-means++",
                       n_init=10,
                       max_iter=300,
                       tol=0.0001,
                       random_state=111,
                       algorithm="elkan"))
algorithm.fit(df1)  # Fit the data

After fitting the data, we get the cluster labels and the four centroids:

labels1 = algorithm.labels_  # Results of classification (4 categories)
centroids1 = algorithm.cluster_centers_  # Final positions of the centroids

print("labels1:", labels1)
print("centroids1:", centroids1)

To visualize how the raw data is partitioned, the original notebook does the following, which I personally find a little tedious:

Perform data merge:
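The plotting code that follows uses `xx`, `yy` and `Z` without defining them. A minimal sketch of one way to build them, assuming the usual decision-region approach: lay a grid over the plotting area, merge the grid coordinates into two-column rows, and predict a cluster for each row. The toy points, the step `h` and the one-unit margins are assumptions for illustration; with the real data `points` would be `df1` and `model` the fitted `algorithm`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for df1 and the fitted model; not the real data.
points = np.array([[20.0, 80.0], [22.0, 78.0], [50.0, 20.0], [52.0, 22.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(points)

h = 0.5  # grid step (assumed value; smaller gives a smoother background)
x_min, x_max = points[:, 0].min() - 1, points[:, 0].max() + 1
y_min, y_max = points[:, 1].min() - 1, points[:, 1].max() + 1

# Coordinate grids covering the plotting area
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Merge the two grids into rows of (x, y) and predict a cluster for each row
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
```

`Z` comes out flat, one label per grid point, so it has to be reshaped to the grid shape before being passed to `imshow`.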

Show the classification effect:

plt.figure(1, figsize=(14, 5))
plt.clf()

Z = Z.reshape(xx.shape)

plt.imshow(Z,interpolation="nearest",
           extent=(xx.min(),xx.max(),yy.min(),yy.max()),
           cmap = plt.cm.Pastel2, 
           aspect = 'auto', 
           origin='lower')

plt.scatter(x="Age",
            y='Spending Score (1-100)', 
            data = df , 
            c = labels1 , 
            s = 200)

plt.scatter(x = centroids1[:,0], 
            y =  centroids1[:,1], 
            s = 300 , 
            c = 'red', 
            alpha = 0.5)

plt.xlabel("Age")
plt.ylabel("Spending Score(1-100)")

plt.show()

How would I do it myself? With pandas + Plotly.
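The construction of `df3` is not shown. One plausible way to build it, as a sketch: put the fitted two-column array into a DataFrame and attach the cluster labels. The helper `build_label_frame` is hypothetical, the column names are chosen to match the `px.scatter` call, and the inputs are toy values; with the real data the call would be `build_label_frame(df1, labels1)`.

```python
import numpy as np
import pandas as pd


def build_label_frame(points, labels):
    """Combine an (n, 2) array and its cluster labels into one DataFrame."""
    frame = pd.DataFrame(points, columns=["Age", "Spending Score(1-100)"])
    frame["Labels"] = labels  # one cluster label per row
    return frame


# Toy inputs for illustration only; not the real data.
toy_points = np.array([[20, 80], [50, 20]])
toy_labels = np.array([0, 1])

df3_demo = build_label_frame(toy_points, toy_labels)
print(df3_demo)
```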

Take a look at the results of the classification visualization:

px.scatter(df3,x="Age",y="Spending Score(1-100)",color="Labels",color_continuous_scale="rainbow")

The process above clusters on Age and Spending Score (1-100). The original notebook also clusters on Annual Income (k$) and Spending Score (1-100) using the same method.

That clustering yields five groups:

Clustering of 3 attributes

Cluster according to Age, Annual Income and Spending Score, and finally draw a 3D graph.

K value selection

The method is the same, but 3 fields are selected.

X3 = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].iloc[:, :].values  # Select the data for the 3 fields
inertia = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n,
                        init='k-means++', 
                        n_init = 10 ,
                        max_iter=300, 
                        tol=0.0001,  
                        random_state= 111  , 
                        algorithm='elkan') )
    algorithm.fit(X3)   # Fit data
    inertia.append(algorithm.inertia_)

Draw elbow diagram to determine K:

plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

We choose k=6 for the clustering.

Building the model

algorithm = (KMeans(n_clusters=6,  # the chosen value of k
                    init="k-means++",
                    n_init=10,
                    max_iter=300,
                    tol=0.0001,
                    random_state=111,
                    algorithm="elkan"))
algorithm.fit(X3)  # X3 is the three-field array built above

labels2 = algorithm.labels_  # Cluster label of each sample
centroids2 = algorithm.cluster_centers_  # Final positions of the six centroids

print(labels2)
print(centroids2)


Plotting

We use Plotly to show the 3D clustering:

df["labels2"] = labels2

trace = go.Scatter3d(
    x=df["Age"],
    y= df['Spending Score (1-100)'],
    z= df['Annual Income (k$)'],
    mode='markers',
    
    marker = dict(
        color=df["labels2"],
        size=20,
        line=dict(color=df["labels2"],width=12),
        opacity=0.8
    )
)

data = [trace]
layout = go.Layout(
    margin=dict(l=0,r=0,b=0,t=0),
    title="six Clusters",
    scene=dict(
        xaxis=dict(title="Age"),
        yaxis = dict(title  = 'Spending Score'),
        zaxis = dict(title  = 'Annual Income')
    )
)

fig = go.Figure(data=data,layout=layout)

fig.show()

The following is the final clustering effect: