
RAYW, Internet Data analysis.

Blog: zhihu.com/people/wang-rui-54-41

In this example, clustering uses features covering registration (reg_length, days since registration), activity (rec_act_length, days since last active; act_days, active days in the last 7 days), and monetization (ad_pd, average daily ad clicks over the last 7 days; read_pd, average daily reads over the last 7 days).
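Since the article's CSV is not public, a minimal synthetic stand-in with the same five columns can help readers follow along. The value ranges below are loosely modeled on the descriptive statistics shown later; the numbers themselves are invented, not the article's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-in for user_value_source_data.csv (values are invented)
data = pd.DataFrame({
    'reg_length':     rng.integers(1, 1538, n).astype(float),  # days since registration
    'rec_act_length': rng.integers(1, 8, n),                   # days since last active
    'act_days':       rng.integers(1, 8, n),                   # active days in last 7 days
    'ad_pd':          rng.exponential(0.8, n).round(3),        # daily ad clicks, last 7 days
    'read_pd':        rng.exponential(3.0, n).round(3),        # daily reads, last 7 days
})
print(data.shape)  # (1000, 5)
```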

Library imports

We first import the libraries we need: NumPy and pandas for data manipulation, matplotlib for plotting, and scikit-learn for standardization and K-means clustering.

```python
import os
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

mpl.rcParams['font.sans-serif'] = ['SimHei']  # specify default font (for Chinese labels)
mpl.rcParams['axes.unicode_minus'] = False    # fix minus sign '-' rendering as a square
# Enable inline matplotlib display (Jupyter)
%matplotlib inline
```

Data import

Pandas provides a variety of readers for loading data, whether from CSV, Excel, or a SQL database. In this case, we read from CSV.

```python
# Read the source CSV into a DataFrame, keeping only the five feature columns
data = pd.read_csv('user_value_source_data.csv')[
    ['reg_length', 'rec_act_length', 'act_days', 'ad_pd', 'read_pd']]
```

Data observation

Data observation is a very important step. By observing the data we can find dirty data, missing values, or unreasonable outliers, which helps us improve data quality and lays the groundwork for the clustering that follows.

The data source in this example contains about 630,000 user records. At this scale we cannot inspect records one by one; instead we rely on descriptive statistics and distribution plots. The data comes from the company's database, is relatively clean, and is entirely numeric, so observation and processing are fairly straightforward.

Field attributes and descriptive statistics

```python
# View field information: column names, non-null counts, dtypes, etc.
data.info()
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 623202 entries, 0 to 732007
# Data columns (total 5 columns):
# reg_length        623202 non-null float64
# rec_act_length    623202 non-null int64
# act_days          623202 non-null int64
# ad_pd             623202 non-null float64
# read_pd           623202 non-null float64
# dtypes: float64(3), int64(2)
# memory usage: ...

# Descriptive statistics for each field
data.describe()
#           reg_length  rec_act_length       act_days          ad_pd        read_pd
# count  623202.000000   623202.000000  623202.000000  623202.000000  623202.000000
# mean      690.140974        1.756763       5.183326       0.765262       3.048918
# std       481.744966        1.562748       2.343711       1.716068       5.806001
# min         1.000000        1.000000       1.000000       0.000000       0.000000
# 25%       219.000000        1.000000       3.000000       0.000000       0.000000
# 50%       677.000000        1.000000       7.000000       0.143000       0.286000
# 75%      1119.000000        2.000000       7.000000       0.714000       3.000000
# max      1537.000000        7.000000       7.000000      14.857000      29.857000
```

Viewing the value distribution

Taking daily ad clicks (ad_pd) as an example, a histogram of the distribution shows that the data is heavily skewed, clustered near 0. This situation holds for essentially every field, so a logarithmic transform can be applied to make the distributions more even.

```python
# Histogram of daily ad clicks, zoomed to the [0, 100] range
ad_pd = data['ad_pd']
plt.figure()
ad_pd.plot.hist(bins=2000, figsize=(10, 6), xlim=[0, 100])
```

Data processing

Missing value handling

If the current data has no missing values, this step can be skipped. When only a small amount of data is missing, those rows can simply be dropped, with little impact on clustering. If the field is a categorical variable, missingness itself can sometimes be treated as its own category with a meaningful business interpretation.
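As a hedged sketch of that check, pandas' isnull() and dropna() can count and drop missing rows. The tiny frame below is illustrative, not the article's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ad_pd':   [0.1, np.nan, 0.7],
                   'read_pd': [3.0, 1.0, np.nan]})
print(df.isnull().sum())   # per-column missing counts: 1 and 1
cleaned = df.dropna()      # drop any row containing a missing value
print(len(cleaned))        # 1
```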

In addition, a missing value is not necessarily NULL; it may also appear as a specific string such as 'NULL' or 'empty', or as a sentinel value such as -1 or 999. When processing data, remember to check whether such values exist, or whether a single value is abnormally over-represented in the distribution. In these cases we can use the replace() method to unify missing values into the form we want.

```python
# Unify missing values: replace the string 'NULL' with an empty string
data = data.replace('NULL', '')
```

Outlier handling

In the earlier data observation we looked at each field's distribution. Based on business logic, if a field contains abnormally large values, we should consider removing them so they do not distort the clustering result.
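The article's cutoffs (ad_pd < 100, read_pd < 200) are business judgments; one common alternative, sketched here on invented numbers, is to derive a cutoff from a high quantile of the field itself:

```python
import pandas as pd

# Invented sample with one extreme outlier
s = pd.Series([0.0, 0.1, 0.5, 0.7, 1.2, 14.9, 250.0])
cap = s.quantile(0.999)  # data-driven threshold near the top of the distribution
kept = s[s < cap]        # the extreme value 250.0 is filtered out
print(len(kept))         # 6
```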

```python
# Keep only rows within reasonable business ranges
filter_ = (data['ad_pd'] < 100) & (data['read_pd'] < 200)
data = data[filter_]
```

Data standardization

In this case, the skewed data can first be log-transformed and then standardized with a z-score.

```python
def standardize(df):
    # Log-transform to reduce skew, then z-score each column, rounded to 4 decimals
    data_log = df.apply(lambda x: np.log(x + 1))
    Zdata = data_log.apply(lambda x: round((x - x.mean()) / x.std(), 4))
    return Zdata
```
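As a usage sketch (not from the article): applying standardize() to a small invented frame and feeding the result to the KMeans class imported at the top. The cluster count n_clusters=5 is an assumed value here, not the article's choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def standardize(df):
    # Log-transform, then z-score each column, rounded to 4 decimals
    data_log = df.apply(lambda x: np.log(x + 1))
    return data_log.apply(lambda x: round((x - x.mean()) / x.std(), 4))

rng = np.random.default_rng(42)
# Invented skewed data standing in for the real features
demo = pd.DataFrame({'ad_pd':   rng.exponential(0.8, 200),
                     'read_pd': rng.exponential(3.0, 200)})
z = standardize(demo)
# n_clusters=5 is an assumption for illustration
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(z)
labels = km.labels_
```

After fitting, km.cluster_centers_ holds each cluster's center in the standardized feature space, which is what gets interpreted back into business segments.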

