Moment For Technology

Python crawler in practice: scraping and analyzing bra sales data, with a surprising finding

Posted on Dec. 3, 2022, 9:57 a.m. by 熊雅涵
Category: Back-end  Tags: Back-end, Crawler

Preface

Today I'm going to use Python to crawl user reviews on JD.com. Through data analysis and visualization, we can find out which bra color is most popular among women.

Results

Process analysis

Open the developer tools (right-click → Inspect, or press F12) and switch to the Network tab. On the user reviews page, we find that the browser sends a request like the following.

The request takes three query parameters: productId, page, and pageSize. The last two are paging parameters; productId is the ID of each item, and an item's review records can be fetched through this ID. So we only need to collect the productId of each item to fetch its reviews. Next, let's analyze the search page source code.
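To make the parameter structure concrete, here is a minimal sketch of how those three parameters combine into a request URL. The endpoint, callback name, and extra parameters are copied from the request observed in the Network tab, so treat them as assumptions that may change between sessions:

```python
from urllib.parse import urlencode

def build_comment_url(product_id, page, page_size=10):
    """Build the review-API URL observed in the browser's Network tab."""
    base = 'https://sclub.jd.com/comment/productPageComments.action'
    params = {
        'callback': 'fetchJSON_comment98vv53282',  # JSONP callback name (may vary)
        'productId': product_id,
        'score': 0,       # 0 = all reviews
        'sortType': 5,    # sort by recency
        'page': page,
        'pageSize': page_size,
    }
    return base + '?' + urlencode(params)
```

For example, `build_comment_url('12345', 2)` produces a URL containing `productId=12345`, `page=2`, and `pageSize=10`.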

Through analysis, we find that each item sits in an `li` tag, and each `li` tag has a `data-pid` attribute whose value is the item's productId.

Obtaining the product IDs

First, we need to get the product IDs from the search page; they provide the productId values for crawling user reviews below. key_word is the search keyword, here "bra".

import re
import requests


def find_product_id(key_word):
    """Query commodity IDs from the search page."""
    jd_url = 'https://search.jd.com/Search'
    product_ids = []
    # Crawl the first 3 pages of search results
    for i in range(1, 4):
        param = {'keyword': key_word, 'enc': 'utf-8', 'page': i}
        response = requests.get(jd_url, params=param)
        # Extract the product IDs
        ids = re.findall('data-pid="(.*?)"', response.text, re.S)
        product_ids += ids
    return product_ids

With the product IDs from the first three pages in a list, we can now crawl the reviews.

By analyzing the Preview tab, we find that the response is JSONP: a callback name wrapped around a JSON payload (as shown below). We only need to strip the wrapper characters to get the JSON object we want.

The comments field in the JSON object holds the review records we want.
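As a side note, instead of hard-coding the callback name when stripping the wrapper, a more defensive sketch is to take everything between the first '(' and the last ')' of the response:

```python
import json

def unwrap_jsonp(text):
    """Strip a JSONP wrapper without depending on the callback name.
    Assumes the payload between the outer parentheses is valid JSON."""
    start = text.index('(') + 1
    end = text.rindex(')')
    return json.loads(text[start:end])
```

This keeps working even if JD.com rotates the `fetchJSON_comment98vv...` callback name between sessions.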

""" Get comment content """
def get_comment_message(product_id) :
    urls = ['https://sclub.jd.com/comment/productPageComments.action?' \
            'callback=fetchJSON_comment98vv53282' \
            'productId={}' \
            'score=0sortType=5' \
            'page={}' \
            'pageSize=10isShadowSku=0rid=0fold=1'.format(product_id, page) for page in range(1.11)]
    for url in urls:
        response = requests.get(url)
        html = response.text
        # Delete useless characters
        html = html.replace('fetchJSON_comment98vv53282('.' ').replace('); '.' ')
        data = json.loads(html)
        comments = data['comments']
        t = threading.Thread(target=save_mongo, args=(comments,))
        t.start()

In this method, only the URLs for the first 10 pages of reviews are built and placed in the urls list. The reviews on each page are fetched in a loop, and a thread is started to save the data to MongoDB.

We went on to analyze the review records and found the two fields we want:

productColor: the product's color

productSize: the product's size

import pymongo

# MongoDB client
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database (created automatically on first write)
db = client.jd
# product collection (also created automatically, no need to create it manually)
product_db = db.product


def save_mongo(comments):
    """Save a page of review records to MongoDB."""
    for comment in comments:
        product_data = {}
        # Color (flush_data normalizes the raw description)
        product_data['product_color'] = flush_data(comment['productColor'])
        # Size
        product_data['product_size'] = flush_data(comment['productSize'])
        # Review text
        product_data['comment_content'] = comment['content']
        # Creation time
        product_data['create_time'] = comment['creationTime']
        # Insert into MongoDB (insert_one replaces the deprecated insert)
        product_db.insert_one(product_data)

Because each product describes its color and size differently, we do some simple data cleaning for statistical purposes.

def flush_data(data):
    """Normalize color/size descriptions to a small fixed set of labels.
    The substrings below are translations of the keywords matched in the
    original Chinese descriptions; 'color' stands for skin tone."""
    if 'skin' in data:
        return 'color'
    if 'black' in data:
        return 'black'
    if 'purple' in data:
        return 'purple'
    if 'powder' in data:
        return 'pink'
    if 'blue' in data:
        return 'blue'
    if 'white' in data:
        return 'white'
    if 'grey' in data:
        return 'grey'
    if 'champagne' in data:
        return 'Champagne'
    if 'amber' in data:
        return 'Amber'
    if 'red' in data:
        return 'red'
    if 'A' in data:
        return 'A'
    if 'B' in data:
        return 'B'
    if 'C' in data:
        return 'C'
    if 'D' in data:
        return 'D'
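As a design note, the same cleaning step can be sketched with a keyword table instead of a long if-chain, which makes adding a new color a one-line change. The keyword → label pairs here mirror the ones used above:

```python
# Ordered (keyword, label) pairs: first match wins, as in the if-chain.
COLOR_MAP = [
    ('skin', 'color'), ('black', 'black'), ('purple', 'purple'),
    ('powder', 'pink'), ('blue', 'blue'), ('white', 'white'),
    ('grey', 'grey'), ('champagne', 'Champagne'), ('amber', 'Amber'),
    ('red', 'red'), ('A', 'A'), ('B', 'B'), ('C', 'C'), ('D', 'D'),
]

def flush_data_v2(data):
    """Table-driven variant of flush_data."""
    for keyword, label in COLOR_MAP:
        if keyword in data:
            return label
    return None  # unrecognized description
```

For example, `flush_data_v2('black lace')` returns `'black'` and `flush_data_v2('75B')` returns `'B'`.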

Now that these pieces are written, all that's left is to connect them.

import threading

# Create a thread lock
lock = threading.Lock()


def spider_jd(ids):
    """Worker thread: take product IDs off the shared list and fetch reviews."""
    while True:
        # Lock while checking and mutating the shared list, so two
        # threads cannot grab the same ID (or pop from an empty list)
        lock.acquire()
        if not ids:
            lock.release()
            break
        # Take the first element and remove it to avoid repeated fetching
        id = ids.pop(0)
        # Release the lock
        lock.release()
        # Fetch the reviews
        get_comment_message(id)


product_ids = find_product_id('bra')
# Start 4 threads to fetch reviews
for i in range(1, 5):
    t = threading.Thread(target=spider_jd, args=(product_ids,))
    # Start the thread
    t.start()
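As an alternative to the manual lock, Python's queue.Queue is already thread-safe, so the worker loop can be sketched without explicit locking. Here `process` stands in for get_comment_message:

```python
import queue
import threading

def crawl_with_queue(ids, process, workers=4):
    """Distribute IDs to worker threads via a thread-safe queue."""
    q = queue.Queue()
    for pid in ids:
        q.put(pid)

    def worker():
        while True:
            try:
                # Non-blocking get: an empty queue means we're done
                pid = q.get_nowait()
            except queue.Empty:
                break
            process(pid)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

This also waits for all workers to finish before returning, which the lock-based version above does not do.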

Check MongoDB after running:

Once we have the data, we can use the Matplotlib library to chart it and make the results more intuitive.

import pymongo
import matplotlib.pyplot as plt


client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# jd database
db = client.jd
# product collection
product_db = db.product
# Count the following colors
color_arr = ['color', 'black', 'purple', 'pink', 'blue', 'white', 'grey', 'Champagne', 'red']

color_num_arr = []
for i in color_arr:
    # count_documents replaces the deprecated count
    num = product_db.count_documents({'product_color': i})
    color_num_arr.append(num)

# Matplotlib colors used to draw each slice
display_colors = ['bisque', 'black', 'purple', 'pink', 'blue', 'white', 'gray', 'peru', 'red']

# labeldistance: how far the label text sits from the center; 1.1 means 1.1 radii
# autopct: number format inside the pie; %3.1f%% is a float with one decimal place
# shadow: whether the pie has a shadow
# startangle: 0 means the first wedge starts at 0 degrees, counterclockwise;
#             starting from 90 degrees usually looks better
# pctdistance: how far the percentage text sits from the center
# patches, l_text, p_text: return values; p_text is the text inside the pie,
#                          l_text the label text outside it
patches, l_text, p_text = plt.pie(color_num_arr, labels=color_arr, colors=display_colors,
                                  labeldistance=1.1, autopct='%3.1f%%', shadow=False,
                                  startangle=90, pctdistance=0.6)
# Change the text size:
# iterate over each text object and call its set_size method
for t in l_text:
    t.set_size(30)
for t in p_text:
    t.set_size(20)
# Equal x and y scales so the pie is round
plt.axis('equal')
plt.title("Underwear color scale", fontproperties="SimHei")
plt.legend()
plt.show()

Running the code, we find that skin tone is the most popular color, followed by black.

Next, let's look at the distribution of sizes, shown as a bar chart.

import pymongo
import matplotlib.pyplot as plt


index = ["A", "B", "C", "D"]

client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
db = client.jd
product_db = db.product

value = []
for i in index:
    # count_documents replaces the deprecated count
    num = product_db.count_documents({'product_size': i})
    value.append(num)

# In Matplotlib 2.x+ the first positional argument is x
# (the old 'left' keyword was removed)
plt.bar(index, height=value, color="green", width=0.5)

plt.show()

Running it, we find that size B is the most common.

That's the end of this article; thanks for reading. In the next installment of this Python crawler series, we'll look at buying a gift for your girlfriend.

To thank my readers, I'd like to share some of the programming resources I've collected recently, as a way of giving back to each of you.

The goodies mainly include:

① More than 2,000 Python e-books (mainstream and classic titles)

② The Python Standard Library (Chinese version)

③ Project source code (forty or fifty interesting, classic practice projects with their source)

④ Videos on Python basics, crawlers, web development, and big-data analysis (suitable for beginners)

⑤ A Python learning roadmap

⑥ Live access to a two-day Python crawler boot camp

All done~ see profile or private message for complete source code.
