Data source: user purchase records from the CDNow website, with four fields in total: user ID, purchase date, purchase quantity, and purchase amount.

1. Data preprocessing

import pandas as pd
import numpy as np
# Load the data visualization package
import matplotlib.pyplot as plt
# Visual display on the page
%matplotlib inline 
# Change the design style
plt.style.use('ggplot')

columns = ['user_id','order_dt','order_products','order_amount']
df = pd.read_table('CDNOW.txt',names=columns,sep='\s+')
  • The file is opened with read_table; the raw data has no header row, so the fields need to be named explicitly.

If the text mixes spaces and tabs ('\t') as separators, pass sep='\s+', which matches any whitespace.

  • user_id: user ID
  • order_dt: purchase date
  • order_products: purchase quantity
  • order_amount: purchase amount

In the consumer and e-commerce industries, analysis is generally built on these four fields: order number, order amount, purchase date and user ID.

df.head()

Observe the data and check that it has been parsed correctly. It is worth noting that a single user may make multiple purchases on the same day: user ID 2, for example, placed two orders on January 12. Don't overlook this detail; a quick check follows.
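A simple lookup is enough to verify this, assuming the frame loaded above:

df[df.user_id == 2]  # shows user 2's orders, including the two placed on the same day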

df.info()

  • Check whether the data has a null value: The original data does not have a null value.

If there are null values, see the earlier post Data Analysis – Missing Value Handling Method Summary; a minimal check is sketched below.
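A minimal null-value check; the dropna/fillna calls are only illustrative of the options, since this dataset needs neither.

df.isnull().sum()   # count missing values per column (all zeros here)
# If nulls did exist, something like df.dropna() or df.fillna(0) could be applied,
# depending on the business meaning of the field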

  • order_dt is stored as int and needs to be converted to datetime.

Type problems come up constantly when working with data. After loading a dataset, first confirm that every column has the correct data type; if not, convert it before going further.

# Data type conversion
df['order_dt'] = pd.to_datetime(df.order_dt,format="%Y%m%d")
df['month'] = df.order_dt.values.astype('datetime64[M]')
  • %Y matches a 4-digit year, such as 1990; %y matches a 2-digit year, such as 90 (see the example below).
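For example (hypothetical values, just to illustrate the two format codes):

pd.to_datetime('19970101', format='%Y%m%d')   # 4-digit year -> Timestamp('1997-01-01')
pd.to_datetime('970101', format='%y%m%d')     # 2-digit year -> Timestamp('1997-01-01')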
df.describe()

  • Most orders contain only a few items (2.4 on average), with some extreme values skewing the figure.
  • Users' spending per order is fairly stable: the mean is 35.89 yuan and the median 25.98, again with some interference from extreme values.
  • In general, consumption data follows a long-tail distribution: most users spend little, while a small number of users contribute most of the revenue, the familiar 80/20 pattern.

2. Analyze the consumption trend of users (monthly)

2.1 Total amount of monthly consumption

grouped_month = df.groupby(by='month')
order_month_amount = grouped_month.order_amount.sum()
order_month_amount.head()

order_month_amount.plot()

As can be seen from the figure above, the consumption amount reached the peak in the first three months, and the subsequent consumption was relatively stable with a slight downward trend.

2.2 Monthly consumption order number

consume_month_count = grouped_month.user_id.count()
consume_month_count.plot()

There are roughly 10,000 orders per month in the first three months, and about 2,500 per month in the later months.

2.3 Monthly purchase volume of products

grouped_month.order_products.sum().plot()

The monthly product purchase volume shows the same pattern: high in the early months, then a steady decline. Why? One hypothesis is that the early users are unusual, with outliers among them; another is promotional or marketing campaigns. With only consumption data, we cannot tell which.

2.4 Monthly number of consumers

# Method 1
user_month = df.groupby('month').user_id.apply(lambda x:len(x.drop_duplicates()))
user_month.plot()
# Method 2
groupby_user = df.groupby(['month','user_id']).count().reset_index()[['month','user_id']]
groupby_user.groupby('month').count().plot()

The number of monthly consumers is lower than the number of monthly orders, but the difference is not large. In the first three months the monthly consumer count is between 8,000 and 10,000; in later months it averages under 2,000.

2.5 An easier way: pivot tables

df.pivot_table(index = 'month',
               values = ['order_products','order_amount','user_id'],
               aggfunc = {'order_products':'sum','order_amount':'sum','user_id':'count'}).head()

Pivot tables are simpler and faster to write; in practice, use whichever approach is more convenient.

2.6 Trend of average monthly consumption amount of users

avg_month_amount = order_month_amount/user_month
avg_month_amount.plot()

2.7 Trend of average consumption times of monthly users

avg_month_count = consume_month_count/user_month
avg_month_count.plot()

3. Analysis of individual consumption of users

3.1 Descriptive statistics of user consumption amount and consumption frequency

grouped_user = df.groupby(by='user_id')
grouped_user.sum().describe()

  • From a per-user perspective, each user bought 7 CDs on average, with a maximum of 1,033. The average spend per user (customer unit price) is 106 yuan with a standard deviation of 240. Combined with the quantiles and the maximum, the mean sits close to the 75th percentile, so there must be a small number of high-spending users.
  • The average spend per user is 106 yuan while the median is only 43 yuan; as above, extreme values pull up the mean.

3.2 Scatter diagram of user consumption amount and consumption frequency

grouped_user.sum().plot.scatter(x='order_amount',y='order_products')

Most points are crowded into the lower-left part of the scatter plot, squeezed together by a few extreme values. We can also see that most values are below 4,000, so we can apply a filter.

grouped_user.sum().query('order_amount<4000').plot.scatter(x='order_amount',y='order_products')

There is a strong positive correlation between the number of CDs a user buys and the amount of money spent. This is easy to understand: the unit price of a CD fluctuates within a narrow range, so the plot essentially reflects the unit price of the product. With a wider variety of goods and a broader price range, the points would spread out in a fan shape.

3.3 Distribution of user consumption amount

grouped_user.sum().order_amount.plot.hist(bins=20)

As the histogram shows, most users' consumption amounts are concentrated at the low end, and a small number of outliers interfere with the reading. Filtering can be used to exclude the extreme values.

3.4 Distribution of user consumption times

grouped_user.sum().query('order_products<100').order_products.plot.hist(bins=40)

  • The underlying message: consumption data almost always follows this kind of distribution. A small share of users accounts for most of the spending.
  • Chebyshev's theorem is used here to screen out outliers and keep roughly 95% of the data. For example, how was the cutoff of fewer than 100 purchased products chosen? Mean (7) + 5 × standard deviation (17) ≈ 92, hence the <100 filter; a rough sketch of the calculation follows.
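A sketch of that cutoff calculation, reading the mean and standard deviation from describe() (the exact numbers depend on the data):

stats = grouped_user.sum().order_products.describe()
cutoff = stats['mean'] + 5 * stats['std']   # roughly 7 + 5 * 17 ≈ 92, hence the <100 filter used above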

3.5 Proportion of accumulated consumption amount of users (what percentage of users accounts for what percentage of consumption amount)

#cumsum
user_cumsum = grouped_user.sum().sort_values('order_amount').apply(lambda x:x.cumsum()/x.sum())
user_cumsum.reset_index().order_amount.plot()

The figure shows that the bottom 50% of users contribute only about 15% of total spending, while the top 5,000 users contribute about 60%. In other words, simply maintaining those 5,000 users would already achieve 60% of the revenue KPI, and managing them well could push their share to 70%-80%. A rough check of these figures is sketched below.
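A rough way to read those proportions off user_cumsum, the cumulative-share table built above (the index positions used here are an approximation):

n = len(user_cumsum)
bottom_half_share = user_cumsum.order_amount.iloc[n // 2]      # share contributed by the bottom 50% of users, about 0.15
top_5000_share = 1 - user_cumsum.order_amount.iloc[n - 5001]   # share contributed by the top 5,000 users, about 0.6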

4. User consumption behavior

4.1 User’s First consumption (First purchase)

In many industries the first purchase is a very important dimension, closely tied to acquisition channels. Especially in industries with a high customer unit price and low retention, knowing where first-time customers come from opens up many operational options.

# Find each user's earliest order date, i.e. the time of their first purchase
grouped_user.min().order_dt.value_counts().plot()

Users' first purchases are concentrated in the first three months, with a sharp fluctuation between February 11 and February 25. Possible explanations include a change in acquisition channels or in reward and incentive schemes; February might suggest a Spring Festival effect, but this is a foreign site with no Spring Festival, so that can be ruled out. These are the points worth analyzing further (a zoom-in on that window is sketched below).
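To zoom in on that window, something like the following can be used (the dates are an assumption; the CDNow records begin in early 1997):

first_purchase = grouped_user.min().order_dt.value_counts().sort_index()
first_purchase.loc['1997-02-01':'1997-03-01'].plot()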

4.2 User’s last consumption

A user’s last purchase is related to churn

grouped_user.max().order_dt.value_counts().plot()

There is an obvious cliff-like drop. The first three months bring in many new users, while later months see few; consumption there comes mainly from old (repeat) customers. This also indicates that the vast majority of users purchase once and never again.

The distribution of last purchases is wider than that of first purchases, but most last purchases are still concentrated in the first three months, showing that many users never buy again after their first order. As time goes on, the count of last purchases rises again, meaning churn accumulates over time and user loyalty slowly declines.

4.3 New and old customer consumption ratio (how many users only consume once)

user_life = grouped_user.order_dt.agg(['min','max'])
user_life.head()

(user_life['min'] == user_life['max']).value_counts()

Half of the users made only one purchase

4.4 User Stratification

4.4.1 RFM

RFM model describes the value status of a customer through three indicators: the recent purchase behavior, the total frequency of purchase and the amount of money spent.

  • Recency (R): the time of the last purchase; the more recently a customer bought, the more responsive they tend to be.
  • Frequency (F): the higher the F value, the more frequently the customer buys; otherwise the customer is less active.
  • Monetary (M): the higher the M value, the higher the customer's value; otherwise the lower the customer's value.
rfm = df.pivot_table(index = 'user_id',
                     values = ['order_products','order_amount','order_dt'],
                     aggfunc = {'order_dt':'max','order_amount':'sum','order_products':'sum'})
rfm.head()
# order_amount: total amount spent; order_products: total number of products bought; order_dt: date of the last purchase

R is the number of days since the user's last purchase, measured from the latest date in the data:

rfm['R'] = (rfm.order_dt.max()-rfm.order_dt)/np.timedelta64(1,'D')  # divided by the unit, it becomes a plain number of days
rfm.rename(columns = {'order_products':'F','order_amount':'M'},inplace = True)
rfm.head()

# Divide each RFM dimension into two levels: above and below the mean
rfm[['R','F','M']].apply(lambda x:x-x.mean())

def rfm_func(x):
    level = x.apply(lambda x:'1' if x>=0 else '0')
    label = level.R + level.F + level.M
    d = {
         '111':'Key Value Account',
         '011':'Important Retention customer',
         '101':'Important Retention',
         '001':'Key development client',
         '110':'General Value Customer',
         '010':'General retention customer',
         '100':'General Retention',
         '000':'General Development clients'
    }
    result = d[label]
    return result

rfm['label'] = rfm[['R','F','M']].apply(lambda x:x-x.mean()).apply(rfm_func,axis=1)
rfm

Count and visualize users in each segment:

rfm.loc[rfm.label == 'Key Value Account','color'] = 'g'
rfm.loc[~(rfm.label == 'Key Value Account'),'color'] = 'r'
rfm.plot.scatter('F','R',c=rfm.color)

rfm.groupby('label').sum()

The RFM stratification shows that most users fall into the important retention segment, but this is influenced by extreme values, so in practice the RFM segmentation thresholds should be set according to the business:

  • Try to cover the largest share of value with a small number of key users
  • Don't choose the segmentation just to make the numbers look good

4.4.2 User life Cycle (New, Old, Active, Reflux, and Outflow)

pivoted_counts = df.pivot_table(index = 'user_id',
                                columns = 'month',
                                values = 'order_dt',
                                aggfunc = 'count').fillna(0)
pivoted_counts.head()

The life-cycle classification only needs to know whether the user purchased in a given month; the number of purchases does not matter here, so we simplify the model. Users who made no purchase in a particular month show up as NaN, which is filled here with fillna.

df_purchase = pivoted_counts.applymap(lambda x: 1 if x>0 else 0)
df_purchase.tail()

Looking at the tail of the data, users with IDs above 20,000 illustrate a problem: in the real business scenario they only registered when they made their first purchase in March, yet the pivot fills their January and February with 0. We therefore need logic that treats the first purchase as the start of the life cycle, rather than naively counting from January.

def active_status(data):
    status = []
    for i in range(18):
        # If there is no consumption this month
        if data[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')
                else:
                    status.append('unactive')
            else:
                status.append('unreg')

        # If there is consumption this month
        else:
            if len(status) == 0:
                status.append('new')
            else:
                if status[i-1] == 'unactive':
                    status.append('return')
                elif status[i-1] == 'unreg':
                    status.append('new')
                else:
                    status.append('active')
    return pd.Series(status, index = df_purchase.columns)

The judgment splits into two branches, depending on whether the user purchased this month. If there is no purchase this month:

  • If it was unregistered before, it is still unregistered
  • If there was previous consumption, it is lost/inactive
  • In other cases, it is not registered

If there is any consumption this month

  • If it is the first purchase, the user is new
  • If the user purchased before and was inactive last month, it is a return (reflux)
  • If last month's status was unregistered, the user is new
  • Otherwise, the user is active

The mainstream approach in industry

Here the user life-cycle state changes are computed in one pass with a pivot table, but in a real business scenario they would more likely be maintained as intermediate tables in SQL. Given the monthly purchase list, the state table is built month by month: last month's status is left-joined onto the users who purchased this month, so, for example, a user who was new or returning last month becomes active this month, and a user with no record last month is new. A sketch of the same idea in pandas follows.
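A minimal pandas sketch of that join-based idea; last_month and this_month are hypothetical illustrative tables, not part of the dataset:

# Last month's status table and this month's purchasers (toy data)
last_month = pd.DataFrame({'user_id': [1, 2, 3],
                           'status': ['new', 'active', 'unactive']})
this_month = pd.DataFrame({'user_id': [2, 3, 4]})

# Left join this month's purchasers onto last month's status
merged = this_month.merge(last_month, on='user_id', how='left')
merged['status_now'] = merged['status'].map(
    {'new': 'active', 'active': 'active', 'return': 'active', 'unactive': 'return'}
).fillna('new')   # no record last month -> treated as a new user
merged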

purchase_status = df_purchase.apply(active_status,axis = 1)
purchase_status.tail()

# Count of different active users per month
purchase_status_ct = purchase_status.replace('unreg',np.NaN).apply(lambda x:pd.value_counts(x))
purchase_status_ct

purchase_status_ct.fillna(0).T.head()
#purchase_status_ct.fillna(0).T.plot.area()

purchase_status_ct.fillna(0).T.apply(lambda x:x/x.sum(),axis=1)

The table above shows how users' consumption status changes from month to month:

  • Active users: users who keep purchasing; they reflect the quality of ongoing customer operations
  • Returning (reflux) users: users who had stopped purchasing and came back this month; they reflect win-back operations
  • Inactive users: they correspond to churn

4.5 User Purchase Period (by order)

4.5.1 User Consumption Cycle Description

order_diff = grouped_user.apply(lambda x:x.order_dt - x.order_dt.shift())
order_diff.head(10)
# After grouping by user, each order date is subtracted from the previous (shifted) one to get the interval between orders

The value for user_id 1 is null, indicating that this customer placed only one order. For user_id 2, the second order was placed on the same day as the first, so the interval is 0 days.

4.5.2 Distribution of user consumption cycle

(order_diff / np.timedelta64(1,'D')).hist(bins=20)

  • The order cycle is exponential
  • The average purchase cycle is 68 days
  • The vast majority of users spend less than 100 days on purchases
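A quick way to reproduce the interval statistics listed above:

(order_diff / np.timedelta64(1, 'D')).describe()   # mean is about 68 days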

4.6 User Life Cycle (First & Last Consumption)

4.6.1 User Life Cycle Description

(user_life['max'] - user_life['min']).describe()

The data is heavily skewed: the median of 0 days means more than half of users have a life cycle of 0 days (they purchased only once), while the mean is 134 days and the maximum 544 days.

4.6.2 User Life Cycle Distribution

((user_life['max'] - user_life['min']) / np.timedelta64(1,'D')).hist(bins=40)

  • The user life-cycle distribution is heavily affected by users who buy only once (they can be excluded)
  • The average user spent 134 days, and the median was only 0 days
# Filter out users who bought only once
u_1 = ((user_life['max'] - user_life['min']).reset_index()[0] / np.timedelta64(1,'D'))
u_1[u_1 > 0].hist(bins = 40)

Filtering for lifetime > 0 removes the one-time buyers, but many remaining users still have life cycles close to 0 days: lower-quality users who bought twice and then stopped. Users should therefore be guided as much as possible within their first 30 days. A smaller group is concentrated between 50 and 300 days, an ordinary life cycle, while high-quality, loyal users cluster beyond 400 days.

5. Analysis of repurchase rate and repurchase rate

Repurchase rate: the proportion of users who purchase multiple times in a natural month

Buyback rate: the proportion of users who purchased in one time window and purchase again in the next.

5.1 Repurchase rate

pivoted_counts

purchase_r = pivoted_counts.applymap(lambda x: 1 if x > 1 else np.NaN if x==0 else 0)
purchase_r.head()

(purchase_r.sum() / purchase_r.count()).plot(figsize = (10,4))

The repurchase rate stays stable at about 20%. In the first three months it is lower, because of the influx of a large number of new users who purchased only once.

The repurchase rate is computed by dividing sum by count. Both functions ignore NaN, and NaN marks users who made no purchase that month; count therefore gives the total number of users who purchased (whether the flag is 0 or 1), while sum gives the number who purchased more than once. This substitution trick can be used in SQL as well. As the figure shows, the repurchase rate is low in the early months because of the flood of new users: in January, for example, the repurchase rate among new customers is only about 6%. In later months the remaining users are the old customers left after the wave, and the repurchase rate stabilizes at around 20%. Comparing new and old customers alone, the repurchase rate differs by roughly a factor of three.

5.2 Buyback rate

Next, calculate the buyback rate: the percentage of users who purchase in one time window and purchase again in the next. If 1,000 users purchased in January and 300 of them purchased again in February, the buyback rate is 30%.

df_purchase.head()

def purchase_back(data):
    status = []
    for i in range(17):
        # Purchased this month
        if data[i] == 1:
            if data[i+1] == 1:
                status.append(1)
            if data[i+1] == 0:
                status.append(0)
        # No purchase this month
        else:
            status.append(np.NaN)
    # The last month has no "next month", so it is marked NaN
    status.append(np.NaN)
    # Return a Series indexed by month, mirroring active_status above
    return pd.Series(status, index = df_purchase.columns)
purchase_b = df_purchase.apply(purchase_back,axis=1)
purchase_b.head()

  • 0 means the user purchased this month but not the next month; 1 means the user purchased this month and again the next month.
  • The judgment function works as follows: data is the input, i.e. the record of whether the user purchased in each of the 18 months, and status is an empty list that collects the buyback flag month by month. Because there are 18 months, each month is checked in a loop. The core if logic: if the user purchased this month and also purchased the next month, mark 1; if they purchased this month but not the next, mark 0; if they did not purchase this month at all, mark NaN, which is excluded from later statistics. Applying the function to every row gives the desired result.
  • Finally, the buyback rate is calculated just like the repurchase rate, using count and sum (see the sketch below). The figure shows that the buyback rate is higher than the repurchase rate, at roughly 30%, with little difference from old customers. Taking the buyback and repurchase rates together, new customers are of lower overall quality than old customers, while old customers are more loyal (higher buyback rate) even though their purchase frequency is slightly lower. These are the consumption characteristics of CDNow's users.
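The plotting code for that last figure is not shown above; a sketch mirroring the repurchase-rate calculation (assuming purchase_b from the previous cell) would be:

(purchase_b.sum() / purchase_b.count()).plot(figsize = (10,4))   # buyback rate by month, roughly 30%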