After collecting data on 4,000+ lipsticks from Taobao, I found…

Hello, I’m Peter

I finally got my hands on lipstick recently: I collected data on 4,000+ lipsticks from the web, cleaned and analyzed it, and it felt like discovering a new continent!

Want to know what this picture means? Be sure to read the full article.

Import libraries

import pandas as pd
import numpy as np
import re 
import jieba

# display all columns
# pd.set_option('display.max_columns', None)

# display all rows
# pd.set_option('display.max_rows', None)

# set the display length of value to 100, default to 50
# pd.set_option('max_colwidth',100)

# plotting-related imports
import matplotlib.pyplot as plt
from pyecharts.globals import CurrentConfig, OnlineHostType   # import in advance so charts do not fail to render
from pyecharts import options as opts  # configuration items
from pyecharts.charts import Bar, Scatter, Pie, Line,Map, WordCloud, Grid, Page  # Class of each graph
from pyecharts.commons.utils import JsCode   
from pyecharts.globals import ThemeType,SymbolType

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots  # subgraph

Data and information

The data comes from Taobao (TB). The crawl collected five fields:

  • Price
  • Store
  • Place of shipment
  • Number of payers
  • Description

After importing the data, we found 4,450 records in total:

Data exploration

View basic information about the data, including its size, missing values, and data types.
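
A minimal sketch of this kind of inspection, using a tiny hand-made frame in place of the real crawl. The columns "The number" and "Place of shipment" follow the code in this article; the other column names and all values are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in for the crawled data; values are illustrative, not real results.
df = pd.DataFrame({
    "Price": [59.0, 165.0, 320.0],
    "Store": ["Store A", "Store B", "Store C"],
    "Place of shipment": ["Guangdong Guangzhou", None, "Shanghai"],
    "The number": ["3000+ people pay", None, "56 people pay"],
    "Description": ["matte lipstick", "lip balm", "lipstick set"],
})

n_rows, n_cols = df.shape      # data size
missing = df.isnull().sum()    # number of missing values per column
dtypes = df.dtypes             # data type of each column

print(n_rows, n_cols)
print(missing)
print(dtypes)
```

Here the missing-value counts show which fields need filling before any conversion, which is why the preprocessing focuses on the buyers and shipment columns.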

Data preprocessing

Preprocessing focuses on two fields: the number of buyers and the place of shipment.

1. Number of buyers

The original number-of-buyers field is a string: each value ends with a suffix such as "people pay", and some values also contain a "+".

We convert it to numeric data, replacing missing values with 0:

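# Fill missing values with "0 people pay" so every row has the same format
df["The number"] = df["The number"].fillna("0 people pay")

def change_buy_people(x):
    # Values containing "+": keep the part before the "+"
    if "+" in x:
        return x.split("+")[0]
    # Values ending in "people pay": keep the part before "people"
    elif "people" in x:
        return x.split("people")[0].strip()
    return x  # leave anything else unchanged

df["The number"] = df["The number"].apply(change_buy_people)
df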

The second step is to remove the + sign:

Step 3: handle values expressed in units of ten thousand (万)

Get the final result!!
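
The three steps can be sketched as one function. This is not the author's exact code: it assumes the raw values look like "3000+人付款" or "1.5万+人付款" (the translated code in this article shows the suffix as "people pay"), and the name `parse_buyers` is hypothetical.

```python
import re
import pandas as pd

def parse_buyers(raw):
    """Convert a raw buyers string to an integer count (0 if missing)."""
    if pd.isna(raw):
        return 0
    # Capture the leading number and an optional "万" (ten thousand) unit.
    match = re.search(r"(\d+(?:\.\d+)?)(万)?", str(raw))
    if match is None:
        return 0
    value = float(match.group(1))
    if match.group(2):
        value *= 10_000  # "1.5万" means 15,000
    return int(round(value))

# Illustrative samples in the assumed raw format.
sample = pd.Series(["3000+人付款", "1.5万+人付款", "56人付款", None])
buyers = sample.apply(parse_buyers)
print(buyers.tolist())
```

Because the regular expression only captures the number and the unit, the "+" sign and the text suffix are dropped without separate string replacements.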

2. Handle the place of shipment

Domestic values contain a province and a city, so we split them into two columns. The place of shipment also includes foreign countries, such as the United States and South Korea, which are kept as they are.

df["Place of shipment"] = df["Place of shipment"].fillna("No information")

# Values with a space are "province city"; values without one are kept whole
df["Province _ country"] = df["Place of shipment"].apply(lambda x: x.split(" ")[0] if " " in x else x)
df["The city"] = df["Place of shipment"].apply(lambda x: x.split(" ")[1] if " " in x else x)
df.head()

That is the whole data-processing workflow. Does it make sense?

Data analysis

This section skips the detailed processing steps and mainly shows the results:

  • Store quantity analysis
  • Price analysis
  • Payer analysis
  • Place of origin analysis
  • Description word cloud

Store quantity analysis

Looking at how the records are distributed across stores, we take the top 30:

  • Tmall International stores are the most numerous
  • Among brands, Watsons has the most

By share of stores, Tmall still accounts for the most.
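
A minimal sketch of how such a ranking and share could be computed with pandas. The column name "Store" and the sample rows are assumptions, not the article's data.

```python
import pandas as pd

# Illustrative rows; each row stands for one lipstick listing.
df = pd.DataFrame({"Store": ["A", "B", "A", "C", "A", "B"]})

store_counts = df["Store"].value_counts()   # listings per store, sorted descending
top_stores = store_counts.head(30)          # top 30 stores (fewer if there are fewer stores)
store_share = (store_counts / store_counts.sum()).round(3)  # proportion of listings per store

print(top_stores)
print(store_share)
```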

Price analysis

Lipstick prices range from low to high. Let’s first look at the summary statistics:

  • The average price is 165 yuan! Really?
  • The highest unit price in the data is 6,160 yuan!! Really expensive 😅
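
Summary figures like these can be read from `describe()`. The sketch below uses made-up prices (only the 6160 maximum echoes the article) and an assumed column name "Price".

```python
import pandas as pd

# Illustrative prices in yuan, not the crawled data.
prices = pd.Series([39.9, 59.0, 128.0, 320.0, 6160.0], name="Price")

summary = prices.describe()       # count, mean, std, min, quartiles, max
mean_price = prices.mean()
median_price = prices.median()    # less sensitive to extreme values than the mean
max_price = prices.max()

print(summary)
```

With a single very expensive item in the data, the mean is pulled well above the median, which is one reason an average can look surprisingly high.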

Look at the distribution of the data on the violin chart: sure enough, the maximum of 6,160 is a ridiculous outlier!

Look at the price distribution of different stores: we found the most expensive store

Christian Louboutin: The queen’s scepter

The following picture is from the official website. It looks very expensive. I won’t click on it

Payer analysis

Each shop has its own number of payers for lipstick. First look at the big picture of the data:

Conclusion:

  1. The average number of payers is 1,220. I don’t know whether that is high or low. Shouldn’t it be lower?
  2. The maximum number of payers is 350,000. Impressive!

Let’s look at the distribution of the data: stores with a lot of payers are, after all, few

Analysis of the number of payers in the store

The figure below shows the number of lipstick payers for different stores. We find that:

  • Perfect Diary’s flagship store is the most popular
  • The flagship stores of MAC, Colorkey and 3CE also have a large number of payers
  • When we calculate the average number of payers per store, Colorkey ranks first
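
The difference between the two rankings (total versus average) can be sketched with a grouped aggregation. The column names "Store" and "Payers" and all values are assumptions for illustration.

```python
import pandas as pd

# Illustrative listings: a store with many listings can lead on total
# while a store with one strong listing leads on average.
df = pd.DataFrame({
    "Store": ["Store A", "Store A", "Store A", "Store B", "Store C", "Store C"],
    "Payers": [5000, 3000, 1000, 8000, 200, 400],
})

per_store = df.groupby("Store")["Payers"].agg(total="sum", average="mean")
by_total = per_store.sort_values("total", ascending=False)
by_average = per_store.sort_values("average", ascending=False)

print(by_total)
print(by_average)
```

In this toy data, Store A leads on total payers while Store B leads on the average, which mirrors how the two charts in this section can rank stores differently.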

The graph above shows the number of payers for each store

The graph below shows the average number of payers per store

Brand Introduction:

1. Colorkey (Colaqi): a cosmetics brand of Meishang (Guangzhou) Cosmetics Co., Ltd. Its products include face, eye and lip makeup, makeup remover, beauty tools and perfumes.

2. Perfect Diary: a brand of Guangzhou Yixian E-commerce Co., Ltd. Perfect Diary explores European and American fashion trends and combines them with the facial and skin characteristics of Asian people to develop high-quality, well-designed, easy-to-use makeup products for a new generation of women.

3. Huaxizi: a cosmetics brand registered with the State Trademark Administration by Zhejiang Yige Enterprise Management Group Co., Ltd. Based on the skin characteristics and makeup needs of Oriental women, it uses flower essences and Chinese herbal extracts as core ingredients, together with modern cosmetics R&D and manufacturing technology, to create healthy, skin-nourishing makeup products suited to Oriental women.

The first two are both local brands in Guangzhou, which has a developed cosmetic industry, and the third is a brand from Hangzhou.

Place of origin analysis

The graph above covers domestic shipping locations. Both the data and the map coloring show that Guangdong, Zhejiang and Shanghai have the most shipping locations.

The graph below includes all shipping places, including those abroad such as the United States, Japan and Korea. Most stores ship from Guangdong, Shanghai, Zhejiang, Jiangsu and nearby areas.

Description word cloud

Draw the description text as a word cloud:

Let’s take a look at the top 50 words: lipstick, matte, moisturizer, lip balm and authentic are among the words stores use most often.
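
A minimal sketch of the word counting behind a top-50 list. It assumes the descriptions are already split into words by spaces; for the original Chinese text, a segmenter such as jieba (imported at the top of this article) would take the place of `split()`. The sample descriptions and the stopword set are made up.

```python
from collections import Counter

# Illustrative descriptions, not the crawled data.
descriptions = [
    "matte lipstick authentic",
    "moisturizing lip balm authentic",
    "matte lipstick gift set",
]
stopwords = {"set"}  # hypothetical words to ignore

# Flatten all descriptions into one list of words, skipping stopwords.
words = [w for text in descriptions for w in text.split() if w not in stopwords]
word_freq = Counter(words)
top_words = word_freq.most_common(50)   # list of (word, count), highest counts first

print(top_words)
```

The resulting list of (word, count) pairs is the usual input shape for a word cloud chart.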

Quantitative analysis of lipstick brands

Brands mentioned in each store’s descriptions:

x_data = df9["index"].tolist()[:20]
y_data = df9["Brand"].tolist()[:20]


c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add("", [list(z) for z in zip(x_data, y_data)])
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Number of lipstick Brands"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="90%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)

c.render_notebook()
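
The article does not show how `df9` is built. One possible way, sketched here under assumptions, is to match each description against a list of brand names and count the matches. The brand list, the sample descriptions and the helper `find_brand` are all hypothetical; only the column names "index" and "Brand" come from the code above.

```python
import pandas as pd

known_brands = ["MAC", "Colorkey", "3CE"]  # hypothetical, partial brand list

def find_brand(description):
    """Return the first known brand mentioned in a description, else None."""
    lowered = description.lower()
    for brand in known_brands:
        if brand.lower() in lowered:   # naive substring match
            return brand
    return None

# Illustrative descriptions, not the crawled data.
descriptions = pd.Series([
    "MAC matte lipstick",
    "colorkey lip glaze",
    "3CE velvet lip tint",
    "mac bullet lipstick",
    "unbranded lip balm",
])

brands = descriptions.apply(find_brand).dropna()
counts = brands.value_counts()

# Same layout the pie chart code expects: brand name in "index", count in "Brand".
df9 = pd.DataFrame({"index": counts.index, "Brand": counts.values})
print(df9)
```

Substring matching is crude (a short brand name can appear inside another word), so a real brand list would likely need cleaning or word-boundary checks.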

Conclusion

This analysis was a real eye-opener:

  1. I learned new things about data crawling and used new methods in data processing
  2. Most importantly, I gained a deeper understanding of lipstick. I had no idea there were so many brands!!

Finally, I’d like to share a website that Peter found on GitHub by accident: zhangwenli.com/lipstick/?r…

This is a lipstick visualization website made by a blogger. On the website, we can see the lipstick colors of many brands. If you are interested, you can play around with it.

  • We can click on one of the colors
  • The details are displayed in the upper left corner