Article | carefree huan

Source: Python Technology (ID: pythonall)

In my last article, I explored girls' lingerie sizes, color preferences, and comments by crawling review data from NetEase Yanxuan. Judging from the response, I had inadvertently offended one group of readers while satisfying the peculiar tastes of another. Since this was unintentional, I feel genuinely guilty. So, to even the score for the girls, I resolutely sacrificed sleep and worked overtime late into the night to crawl NetEase Yanxuan's men's underwear data. Let's see what I found.

Crawling the data

First, we enter the keyword “men's underwear” in the search box of NetEase Yanxuan, and the page shows the list of matching men's underwear products:

Search results

We click on the first item and then on “Comment”, and see the following information:

Comment message

Analyzing the request list, we can easily see that the comment data comes from the request to you.163.com/xhr/comment… We then filter out the request parameters that are not needed, and find that only itemId and page are required.

itemId is the ID of the product, and page is the page number being requested; by default each page holds 40 entries. So the prerequisite for fetching comment data is obtaining the corresponding product IDs.
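
As a minimal sketch of that request (the full URL appears in the code below; the item ID here is just the one from the sample comment later in this article):

```python
COMMENT_URL = 'https://you.163.com/xhr/comment/listByItemByTag.json'

def comment_params(item_id, page):
    # Only itemId and page are required; the API returns 40 comments per page.
    return {'itemId': item_id, 'page': page}

# The request itself would then be (not executed here):
# requests.get(COMMENT_URL, params=comment_params(3544005, 1)).json()
```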

Clicking a product on the search page takes us to its details page, so the search page's product list must contain each product's ID. We go back to the search results page and look for the search request:

List of goods

Similarly, analyzing the requests on the search page, we find you.163.com/xhr/search/… After checking the request parameters one by one, we find that only the keyword and page parameters are needed.

After the request analysis is complete, we can start coding. The code is as follows:



import json
import time

import requests


# Get a list of items
def search_keyword(keyword):
    uri = 'https://you.163.com/xhr/search/search.json'
    query = {
        "keyword": keyword,
        "page": 1
    }
    try:
        res = requests.get(uri, params=query).json()
        result = res['data']['directly']['searcherResult']['result']
        product_id = []
        for r in result:
            product_id.append(r['id'])
        return product_id
    except:
        raise


# Get comments
def details(product_id):
    url = 'https://you.163.com/xhr/comment/listByItemByTag.json'
    try:
        c_list = []
        for i in range(1, 100):
            query = {
                "itemId": product_id,
                "page": i,
            }
            res = requests.get(url, params=query).json()
            if not res['data']['commentList']:
                break
            print("Crawling page %s of comments" % i)
            c_list.extend(res['data']['commentList'])
            time.sleep(1)
        return c_list
    except:
        raise


product_id = search_keyword("men's underwear")
r_list = []
for p in product_id:
    r_list.extend(details(p))

with open('./briefs.txt', 'w') as f:
    for r in r_list:
        try:
            f.write(json.dumps(r, ensure_ascii=False) + '\n')
        except:
            print('Wrong')

For simplicity, I only grabbed the reviews of the 40 items on the first page of results and saved them to briefs.txt. A preview of the file looks like this:

Storing data
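
Before analysis, the saved file can be read back, one JSON record per line. A minimal sketch (the default path matches the one used above):

```python
import json

def load_comments(path='./briefs.txt'):
    # Each line of the file holds one JSON-encoded comment record.
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```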

Analyzing the data

With the data captured, we can move on to exploration. I want to analyze the data from three angles, color, size, and comments, and see some “characteristics” of men's underwear.

Let's first look at the structure of the data:

{
    "skuInfo": ["Color: black", "Size: M"],
    "frontUserName": "S****",
    "frontUserAvatar": "https://yanxuan.nosdn.127.net/0da37937c896cac1955bda8522d5754f.jpg",
    "content": "very good",
    "createTime": 1592965119969,
    "picList": [],
    "commentReplyVO": null,
    "memberLevel": 5,
    "appendCommentVO": null,
    "star": 5,
    "itemId": 3544005
}

Looking closely at the comment data, we can see that color and size live in the skuInfo array, while the comment text is in the content field. Looking a little further through the data, we find that colors come in several formats:

  • A single color, for example: Color: light grey
  • A bundle of several colors, for example: Color: (black + grey + light grey) 3 PCS
  • Multiple selected colors, for example: black + navy blue
  • Other, for example: Specification: pack of 5 pieces

The last category contains no color information at all, so I filter it out. For the others, splitting on “+” removes the interference and yields the individual colors.

The size data has a uniform format and can be used directly.
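
A minimal parsing sketch for the skuInfo entries under these rules (the labels here are the translated forms shown above; the real data is in Chinese):

```python
import re

def parse_sku(sku_info):
    """Extract a list of colors and the size from a comment's skuInfo entries.

    Handles the three observed color formats:
      - single color:        "Color: light grey"
      - bundled colors:      "Color: (black + grey + light grey) 3 PCS"
      - multi-select colors: "Color: black + navy blue"
    Entries without a color (e.g. "Specification: pack of 5 pieces") are ignored.
    """
    colors, size = [], None
    for entry in sku_info:
        key, _, value = entry.partition(':')
        key, value = key.strip().lower(), value.strip()
        if key == 'color':
            # Strip parentheses and piece counts like "3 PCS", then split on "+".
            value = re.sub(r'[()]|\d+\s*PCS', '', value, flags=re.IGNORECASE)
            colors.extend(c.strip() for c in value.split('+') if c.strip())
        elif key == 'size':
            size = value
    return colors, size
```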

I plotted the colors and sizes as bar charts and the comments as a word cloud. The results look like this:

Color distribution

The colors didn't surprise me. Black is way ahead, but the grays added together probably beat it. In short, black and gray are the popular choices.

Size distribution

In terms of size, the top three are XL, L, and XXL, with little difference between XL and L.

Comments cloud

As the comments show, for men just as for women, comfort always comes first and quality second. It makes sense: underwear of good quality that is uncomfortable to wear is its own small sadness~
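
The word tally behind the cloud can be sketched as follows (the real Chinese comments would first need a segmenter such as jieba; this sketch just splits on whitespace):

```python
from collections import Counter

def word_freq(records, top=20):
    # Tally words across all comment 'content' fields.
    counts = Counter()
    for rec in records:
        counts.update(rec.get('content', '').split())
    return counts.most_common(top)
```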

Conclusion

NetEase Yanxuan targets young people under the age of 35, so the results of this analysis roughly reflect the choices of that age group. So, young men, while you scoff at the fact that most women wear a size 13, don't forget that waistlines start bulging before middle age. Exercise more and pay attention to body management!
