The text and pictures in this article come from the internet and are for learning and communication only, with no commercial purpose. If you have any questions, please contact us.

The following article comes from the "Learning Python" account, by J Brother.

If you are new to Python, you can follow the link below to watch a free introductory Python tutorial:

https://v.douyu.com/author/y6AZ4jn9jwKW

Preface

Not long ago, friends in Guangzhou and Shenzhen were probably still wearing short sleeves while envying the snowy scenes in the north. Then, last week, Guangzhou and Shenzhen also got a cold snap, and everyone joined the "cooling group chat".

To help everyone fight the cold, I scraped down jacket data from JD.com. Why not Tmall? The reason is simple: its slider CAPTCHA is a bit troublesome.

Data acquisition

JD.com is a dynamically loaded Ajax site, so its data can only be obtained by parsing the backend interface or by using the Selenium automated testing tool. Earlier posts on this official account have covered dynamic web crawlers; interested readers can look them up.

Selenium was used for data acquisition this time. Because my Chrome browser updates frequently, the old ChromeDriver no longer matched it, so I disabled the browser's auto-update and downloaded the driver version corresponding to my browser.
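If keeping the driver and browser versions in sync by hand becomes tedious, the third-party webdriver-manager package can download a matching ChromeDriver automatically. A minimal sketch, assuming webdriver-manager is installed (it is not used in the original script):

# Hypothetical alternative: let webdriver-manager download a ChromeDriver that matches
# the installed Chrome version (assumes `pip install webdriver-manager`; Selenium 3-style call).
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(ChromeDriverManager().install())  # driver path resolved automatically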

Then, using Selenium, I searched for down jackets on JD.com, logged in by scanning the QR code with my phone, and collected each product's name, price, store name, number of comments and other information.
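The search and QR-code login step itself is not shown in this excerpt. Below is a minimal sketch of what it might look like; it reuses the browser, wait, url and keyword objects defined in the script that follows, and the element locators (the 'key' search box, the '#J_goodsList' results container) are assumptions about JD's page layout rather than code from the original.

# Hypothetical sketch of the search/login step; locators are assumptions about JD's page.
from selenium.webdriver.common.keys import Keys

def search():
    browser.get(url)                             # open the JD home page
    time.sleep(20)                               # pause while logging in by scanning the QR code
    input_box = wait.until(EC.presence_of_element_located((By.ID, 'key')))  # search box (id assumed)
    input_box.send_keys(keyword)                 # type the search keyword
    input_box.send_keys(Keys.ENTER)              # submit the search
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))  # results loaded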

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from lxml import etree
import random
import json
import csv
import time

browser = webdriver.Chrome('/path/to/chromedriver')  # local ChromeDriver matching the browser version
wait = WebDriverWait(browser, 50)  # set the maximum wait time
url = 'https://www.jd.com/'
data_list = []  # global list to store the scraped data
keyword = "羽绒服"  # search keyword: "down jacket"

# turn to the given page and wait for all products on it to load
def page_click(page_number):
    try:
        # scroll to the bottom of the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait until the "next page" button is clickable, then click it
        button = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.p-next > em')))
        button.click()
        # wait until the first 30 products have loaded
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(30)")))
        # scroll to the bottom again to trigger loading of the remaining products
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait until all 60 products have loaded
        wait.until(EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")))
        # confirm the page number shown at the bottom matches the requested page
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number)))
        html = browser.page_source  # get the page source
        prase_html(html)  # call the parsing function to extract the data
    except TimeoutException:
        return page_click(page_number)
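The pagination function above hands the page source to prase_html(), which is not included in this excerpt, and the csv import suggests the rows were written out to a CSV file afterwards. Below is a minimal sketch of what those two helpers might look like; the XPath expressions are assumptions about JD's goods-list markup and may need adjusting against the live page, and the raw column names match the ones renamed later in the cleaning step.

# Hypothetical sketch of the missing helpers; the XPath expressions are assumptions.
def prase_html(html):
    doc = etree.HTML(html)                                  # parse the page source with lxml
    items = doc.xpath('//*[@id="J_goodsList"]/ul/li')       # one <li> per product (assumed layout)
    for item in items:
        title = ''.join(item.xpath('.//div[contains(@class, "p-name")]//em//text()')).strip()
        price = ''.join(item.xpath('.//div[@class="p-price"]//i/text()'))
        shop_name = ''.join(item.xpath('.//div[@class="p-shop"]//a/text()'))
        comment = ''.join(item.xpath('.//div[@class="p-commit"]//a/text()'))
        data_list.append({'title': title, 'price': price,
                          'shop_name': shop_name, 'comment': comment})

def save_csv():
    # write the collected rows to a CSV file using the raw column names
    with open('down.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'price', 'shop_name', 'comment'])
        writer.writeheader()
        writer.writerows(data_list)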

Data cleaning

Import data

import pandas as pd
import numpy as np

df = pd.read_csv("down.csv")  # the scraped down jacket data
df.sample(10)

Rename columns

df = df.rename(columns={'title': 'item name', 'price': 'item price',
                        'shop_name': 'store name', 'comment': 'comment count'})

Viewing Data Information

df.info()

"""
Things to clean up:
1. there may be duplicate rows
2. the store name column has a missing value
3. the comment count column needs to be cleaned

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4950 entries, 0 to 4949
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   item name      4950 non-null   object
 1   item price     4950 non-null   float64
 2   store name     4949 non-null   object
 3   comment count  4950 non-null   object
dtypes: float64(1), object(3)
memory usage: 154.8+ KB
"""

Deleting Duplicate Data

df = df.drop_duplicates()

Missing value handling

df["store name"] = df["store name"].fillna("unknown")  # placeholder for the one missing store name

Product name cleaning

Thickness

tmp = []
for i in df["item name"]:
    if "加厚" in i:        # "thickened"
        tmp.append("thickened")
    elif "轻薄" in i:      # "lightweight"
        tmp.append("lightweight")
    else:
        tmp.append("other")
df["thickness"] = tmp

Fit

tmp = []
for i in df["item name"]:
    if "宽松" in i:        # "loose"
        tmp.append("loose")
    elif "修身" in i:      # "slim"
        tmp.append("slim")
    else:
        tmp.append("other")
df["fit"] = tmp

Style

tmp = []
for i in df["item name"]:
    if "时尚" in i:        # "fashion"
        tmp.append("fashion")
    elif "休闲" in i:      # "casual"
        tmp.append("casual")
    elif "简约" in i:      # "simple"
        tmp.append("simple")
    else:
        tmp.append("other")
df["style"] = tmp
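The three loops above follow the same keyword-matching pattern, so they can also be collapsed into one small helper applied with pandas. A sketch, assuming the same keyword lists as above:

# Optional refactor: one keyword-mapping helper instead of three near-identical loops.
def map_keywords(name, keyword_map, default="other"):
    # return the label of the first keyword found in the product name
    for keyword, label in keyword_map.items():
        if keyword in name:
            return label
    return default

df["thickness"] = df["item name"].apply(map_keywords, keyword_map={"加厚": "thickened", "轻薄": "lightweight"})
df["fit"] = df["item name"].apply(map_keywords, keyword_map={"宽松": "loose", "修身": "slim"})
df["style"] = df["item name"].apply(map_keywords, keyword_map={"时尚": "fashion", "休闲": "casual", "简约": "simple"})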

Item price cleaning

Df [" price range "] = pd cut (df [" commodities "], [0, 100300, 500, 700, 1000000],labels=['100 yuan below ','100 yuan -300 yuan ','300 yuan -500 Yuan ','500 yuan -700 yuan ','700 yuan -1000 yuan ','1000 yuan above '],right=False)Copy the code

Comment count cleaning

import re

# extract the numeric part of the comment count, e.g. "2.5万+" -> 2.5
df['number'] = [re.findall(r'(\d+\.{0,1}\d*)', i)[0] for i in df['comment count']]
df['number'] = df['number'].astype('float')  # convert to a numeric type
# extract the unit "万" (ten thousand) if present
df['unit'] = [''.join(re.findall(r'(万)', i)) for i in df['comment count']]
df['unit'] = df['unit'].apply(lambda x: 10000 if x == '万' else 1)
# multiply the number by its unit to get the actual comment count
df['comment count'] = df['number'] * df['unit']
df['comment count'] = df['comment count'].astype("int")
df.drop(['number', 'unit'], axis=1, inplace=True)  # drop the helper columns
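The same conversion can also be done without the helper columns by using pandas string methods directly. A sketch, assuming the raw values look like "5000+" or "2.5万+":

# Vectorized alternative (assumes raw values such as "5000+" or "2.5万+").
nums = df['comment count'].astype(str)
value = nums.str.extract(r'(\d+\.?\d*)')[0].astype(float)    # numeric part
unit = nums.str.contains('万').map({True: 10000, False: 1})  # 万 means "ten thousand"
df['comment count'] = (value * unit).astype(int)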

Store name cleaning

# the original cleaning rule is garbled in the source; shown here as a simple .str cleanup
df["store name"] = df["store name"].str.strip()

Visualization

Import visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # set a font that can display Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly

import jieba
import re
from pyecharts.charts import *
from pyecharts import options as opts
from pyecharts.globals import ThemeType
import stylecloud
from IPython.display import Image

Descriptive statistics
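A quick numeric summary of the two numeric columns can be pulled straight from pandas; a minimal sketch:

# Summary statistics (count, mean, std, quartiles, max) for the numeric columns.
df[['item price', 'comment count']].describe().round(2)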

Correlation analysis
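For the correlation itself, a minimal sketch using the Pearson coefficient (pandas' default) and a seaborn heatmap:

# Pearson correlation between item price and comment count.
corr = df[['item price', 'comment count']].corr()
print(corr)
sns.heatmap(corr, annot=True, cmap='Blues')  # quick visual check of the correlation matrix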

Histogram of item price distribution

fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["item price"], color="red", bins=10)  # histogram + KDE of item prices
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("item price", fontsize=16)

Histogram of comment count distribution

fig, axes = plt.subplots(figsize=(15, 8))
sns.distplot(df["comment count"], color="green", bins=10, rug=True)  # histogram with a rug plot
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel("comment count", fontsize=16)

Relationship between comment count and item price

fig, axes = plt.subplots(figsize=(15, 8))
sns.scatterplot(x="comment count", y="item price", data=df)  # scatter of comment count vs. item price
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

Down jacket price distribution

df2 = df["price range"].astype("str").value_counts()
print(df2)
df2 = df2.sort_values(ascending=False)
regions = df2.index.to_list()
values = df2.to_list()
c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add("", list(zip(regions, values)))
    .set_global_opts(legend_opts=opts.LegendOpts(is_show=False),
                     title_opts=opts.TitleOpts(title="Down jacket price range distribution",
                                               subtitle="", pos_top="0.5%", pos_left='left'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%", font_size=14))
)
c.render_notebook()

Top 10 stores by comment count

df5 = df.groupby('store name')['comment count'].mean()
df5 = df5.sort_values(ascending=True)
df5 = df5.tail(10)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px", height="600px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw horizontal bars
    .set_global_opts(title_opts=opts.TitleOpts(title="Top 10 stores by comment count", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()

Fit

df5 = df.groupby('fit')['item price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw horizontal bars
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by fit", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()

Thickness

df5 = df.groupby('thickness')['item price'].mean()
df5 = df5.sort_values(ascending=True)[:2]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw horizontal bars
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by thickness", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()

Style

df5 = df.groupby('style')['item price'].mean()
df5 = df5.sort_values(ascending=True)[:4]
#df5 = df5.tail(10)
df5 = df5.round(2)
print(df5.index.to_list())
print(df5.to_list())
c = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1000px", height="500px"))
    .add_xaxis(df5.index.to_list())
    .add_yaxis("", df5.to_list())
    .reversal_axis()  # draw horizontal bars
    .set_global_opts(title_opts=opts.TitleOpts(title="Average down jacket price by style", subtitle=""),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=11)),
                     # yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=12)),
                     yaxis_opts=opts.AxisOpts(axislabel_opts={"rotate": 30}))
    .set_series_opts(label_opts=opts.LabelOpts(font_size=16, position='right'))
)
c.render_notebook()

Down jacket word cloud
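The word cloud code itself is not included in this excerpt, but the imports above (jieba, stylecloud and IPython's Image) suggest the general approach: segment the product names with jieba, render them with stylecloud, and display the saved image. A minimal sketch under those assumptions; the font path and icon name are placeholders:

# Hypothetical sketch of the word cloud step, based on the libraries imported above.
text = ' '.join(df['item name'])          # concatenate all product names into one string
words = ' '.join(jieba.cut(text))         # segment the (Chinese) text into words

stylecloud.gen_stylecloud(
    text=words,
    font_path='SimHei.ttf',               # a font that can render Chinese (placeholder path)
    icon_name='fas fa-snowflake',         # snowflake shape to match the winter theme (placeholder)
    output_name='down_jacket_wordcloud.png',
)
Image(filename='down_jacket_wordcloud.png')  # display the saved image in the notebook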