It took me three days to finish writing this crawler, so I'm writing up the process here.

The crawler scrapes user reviews from JD.com; from that data we can work out which lingerie colors Chinese women prefer, and their actual cup sizes.

Page analysis

First, open the JD.com home page and search for “women’s lingerie”. The results page looks like this:

The URL is as follows:

https://search.jd.com/Search?keyword=内衣女&enc=utf-8&wq=内衣女&pvid=d7426e03119404f98c55ecb20de787e4
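If you'd rather not hand-assemble that query string, the same URL can be built with the standard library. A minimal sketch; the English keyword stands in for the real Chinese one, and `pvid` (a session value JD attaches on its own) is omitted:

```python
from urllib.parse import urlencode

# Build the JD search URL from its parts. "women's lingerie" is a stand-in
# for the real (Chinese) keyword; JD's session parameter pvid is omitted.
params = {"keyword": "women's lingerie", "enc": "utf-8", "wq": "women's lingerie"}
url = "https://search.jd.com/Search?" + urlencode(params)
print(url)
```

`urlencode` also takes care of percent-encoding non-ASCII keywords, so the Chinese search term works the same way.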

Price, title, item ID

On this page we need to grab each item's price, title, and ID. Price and title are straightforward. The ID matters because, to get the review and rating counts later, we have to visit each item's details page. Take the four products in the screenshot above as an example: clicking any of them opens its details page.

The details-page URLs of the four items in the figure above are:

https://item.jd.com/100014910630.html

https://item.jd.com/56716346225.html

https://item.jd.com/100003425463.html

https://item.jd.com/100005850583.html

Comparing these four URLs, the only part that changes is the string of digits at the end, which is the item ID. So let's check whether that number exists somewhere in the page source.
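Since only the ID changes, the details-page URLs can be generated directly from scraped item IDs. A one-line sketch based on the pattern of the four URLs above:

```python
def detail_url(product_id):
    # JD details pages follow the pattern https://item.jd.com/<id>.html
    return f"https://item.jd.com/{product_id}.html"

print(detail_url("100014910630"))  # https://item.jd.com/100014910630.html
```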

Open the developer tools and search the page source for any of those numbers, for example: 100014910630

As the figure shows, the data sits right in the page source, which makes extraction easy.

I used XPath to extract the price, title, and item ID.

The core code looks like this:

    def get_info(self, html):
        titles = []
        prices = []
        comment_counts = []
        good_counts = []
        selector = etree.HTML(html)
        lis = selector.xpath('//*[@id="J_goodsList"]/ul/li')
        for li in lis:
            # The title text contains stray tabs/newlines, so strip them out
            title = li.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0].replace('\t\n', '').replace('\n', '')
            titles.append(title)

            price = float(li.xpath('.//div[@class="p-price"]/strong/i/text()')[0])
            prices.append(price)

            # The comment link's id attribute looks like "J_comment_<item id>"
            product_id = li.xpath('.//div[@class="p-commit"]/strong/a/@id')[0].replace('J_comment_', '')

Review and positive-review counts

Now we need to think about where data like review and like counts comes from. Imagine 10,000 people bought lingerie today and left reviews; would a programmer have to edit the page source by hand every time?

If you think so, you're underestimating JD.com's programmers.

So it can't work that way; the data must either be loaded asynchronously via Ajax or rendered from a JS request.

Hard work pays off: after refreshing the details page and watching how the data loads, I finally found it. The counts come from a JS request.

As shown in the figure below:

Both the positive-review count and the total review count appear here.

At first glance the response looks almost like JSON, but not quite, so it needs some light processing.

Delete the leading jQuery7484346( and the trailing ); and what remains parses as JSON.
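The stripping step above can be sketched like this. The sample payload mimics the shape of the real response (the outer field name is an assumption; CommentCountStr is the field the crawler actually reads later):

```python
import json
import re

# Example JSONP payload shaped like JD's; the outer "CommentsCount" key
# is illustrative, CommentCountStr matches the field used later in the post.
raw = 'jQuery7484346({"CommentsCount": [{"CommentCountStr": "20万+"}]});'

# Strip the callback name and "(" from the front, then ");" from the end
payload = re.sub(r'^\w+\(', '', raw).rstrip(';').rstrip(')')
data = json.loads(payload)
print(data["CommentsCount"][0]["CommentCountStr"])  # 20万+
```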

Next, click the Headers tab to see the request's URL, as shown below:

Notice anything interesting in the figure above? The referenceIds parameter is exactly our item ID. Testing showed the other parameters can stay fixed; just swap in each item's ID and you can fetch the review and like counts for every product.

The core code looks like this:

    def get_comment_count(self, id):
        url = f'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={id}&callback=jQuery5250105&_=1615718915139'
        time.sleep(0.3)  # throttle requests a little
        html = requests.get(url, headers=self.headers).content.decode('gbk')
        # Strip the JSONP wrapper so the rest parses as plain JSON
        text = html.replace('jQuery5250105(', '').replace(');', '')
        json_text = json.loads(text)
        comment_count = jsonpath.jsonpath(json_text, '$..CommentCountStr')[0].replace('+', '')
        good_count = jsonpath.jsonpath(json_text, '$..GoodCountStr')[0].replace('+', '')
        # "万" means ten thousand, e.g. "20万" -> 200000
        if '万' in good_count:
            good_count = int(good_count.replace('万', '')) * 10000
        if '万' in comment_count:
            comment_count = int(comment_count.replace('万', '')) * 10000

        return comment_count, good_count

One bit of cleanup is needed here: some counts contain the character 万 (ten thousand), for example “20万+”. We strip the 万 and multiply by 10,000.
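That normalization can live in a small helper. A sketch; the `float()` step is my own precaution in case a value like "1.5万" ever appears, since the post only shows whole-number counts:

```python
def normalize_count(text):
    # Turn JD's count strings into integers: "20万+" -> 200000, "9000+" -> 9000.
    # float() before multiplying guards against fractional values like "1.5万"
    # (an assumption; the post only shows whole numbers).
    text = text.replace("+", "")
    if "万" in text:
        return int(float(text.replace("万", "")) * 10000)
    return int(text)

print(normalize_count("20万+"))  # 200000
```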

For picking values out of the JSON I use the jsonpath module; it's much easier than walking nested dictionaries by hand.

Next, get_info() calls get_comment_count() for each item to collect the review and positive-review counts.

The specific code is as follows:

    def get_info(self, html):
        titles = []
        prices = []
        comment_counts = []
        good_counts = []
        selector = etree.HTML(html)
        lis = selector.xpath('//*[@id="J_goodsList"]/ul/li')
        for li in lis:
            title = li.xpath('.//div[@class="p-name p-name-type-2"]/a/em/text()')[0].replace('\t\n', '').replace('\n', '')
            titles.append(title)

            price = float(li.xpath('.//div[@class="p-price"]/strong/i/text()')[0])
            prices.append(price)

            product_id = li.xpath('.//div[@class="p-commit"]/strong/a/@id')[0].replace('J_comment_', '')
            # One request per item: unpack both counts from a single call
            comment_count, good_count = self.get_comment_count(product_id)
            comment_counts.append(int(comment_count))
            good_counts.append(int(good_count))

        result = zip(titles, prices, comment_counts, good_counts)
        return result

Save the data

I usually save the data to a MySQL database, which makes it easy to import and export later. Install the pymysql module first:

pip install pymysql

In the database I created a table named Underware5 with four fields: title, price, commentcount, and goodcount. commentcount is the number of reviews, and goodcount is the number of positive reviews.
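The saving step looks roughly like this. The post targets MySQL via pymysql; the runnable sketch below uses the standard library's sqlite3, which shares the same DB-API shape. To target MySQL instead, swap `sqlite3.connect(":memory:")` for `pymysql.connect(host=..., user=..., password=..., db=...)` and change the `?` placeholders to `%s`. The sample row is made up:

```python
import sqlite3

# Same table/columns as the post: Underware5(title, price, commentcount, goodcount)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Underware5 (title TEXT, price REAL, commentcount INTEGER, goodcount INTEGER)"
)

# In the crawler, rows would come from get_info(); this row is a made-up example
rows = [("sample bra", 99.0, 200000, 190000)]
conn.executemany("INSERT INTO Underware5 VALUES (?, ?, ?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM Underware5").fetchone()[0])  # 1
```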

Lingerie sales ranking

Generally, the more reviews an item has, the more it has sold. So I made a simple chart ranking items by review and positive-review counts, as shown below:

Top of the list are brands such as “Cat People” (Miiow), Ouyang Nana, “Antarctica” (Nanjiren), and “Urban Beauty” (Cosmo Lady). I took the top 10 as a reference.

Underwear price ranking

From the price data, I took the ten most expensive items and plotted a simple chart, shown below:

The most expensive item is the “Velvet Goose Victoria”, at a whopping 558 yuan. The other brands sit between 360 and 390 yuan.

That lingerie must have something special going for it.

Analyzing Ouyang Nana

I took a quick look at the Ouyang Nana brand; its sales are high on both Taobao and JD. So let's break it down on its own and look at the reviews, sizes, ratings, and the colors women prefer.

Since the reviews live on the details page, that's where we'll dig into the data.

Again, the same question: if 1,000 reviews come in today, does a programmer hand-embed each one into the page source?

Obviously not.

So here too I guessed the data is most likely rendered from a JS request. After refreshing the details page with a clear goal in mind, I quickly found where the data lives, as shown below:

Note that a plain refresh isn't enough: you have to scroll down until the reviews appear, which is what triggers the data load.

Again, click the Headers tab to see the data's URL, as shown below:

The first parameter is the item ID, and the second is the page number within the review section.

So, for a given item, the only parameter that changes is the page number.

JD does have anti-scraping measures here: there's no way to pull down all 200,000 reviews, because the API stops returning data after page 99. Still, I scraped two Ouyang Nana storefronts and collected roughly 2,000 reviews for the analysis.
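The page walk described above can be sketched as a generator. The endpoint and parameter names here are illustrative stand-ins for the ones shown in the Headers tab; the point is that only `page` changes between requests, capped at page 99:

```python
from urllib.parse import urlencode

def comment_urls(product_id, max_page=99):
    # Yield one comment-page URL per page for a given item.
    # Endpoint/parameter names are illustrative; only `page` varies,
    # and JD stops serving data after page 99.
    base = "https://club.jd.com/comment/productPageComments.action"
    for page in range(max_page + 1):
        yield f"{base}?{urlencode({'productId': product_id, 'page': page})}"

urls = list(comment_urls("100014910630", max_page=2))
print(len(urls))  # 3
```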

The ladies' reviews

Why are Ouyang Nana's sales so high? You have to see what buyers say in the reviews.

From the word cloud above, it's easy to see that the lingerie is not only well made and good quality, but also comfortable to wear. No wonder so many people buy it.

The ladies' real sizes

Here my analysis came up empty.

It turns out Ouyang Nana's lingerie has no cup sizes at all: it's one-size-fits-all, reportedly very stretchy, and suited to women from A to C cups.

Analyzing Urban Beauty

Since the brand above is all one-size, I suspect many “old drivers” (veterans) lost interest, so I scraped Urban Beauty's data as well to make it up to you.

The ladies' reviews

Urban Beauty's reviews look much like Ouyang Nana's, so both brands' styles seem well liked.

The ladies' favorite colors

From the two images above, it's clear that the most popular color is pink, followed by pale silver.

The ladies' real sizes

Come on, old drivers, get in the car.

This is probably what you really came for; I wonder how many of you skipped the code above and scrolled straight here.

I don't know what you saw, but I drew a blank; beyond knowing that A, B, and C are cup sizes, I was lost.

After some expert guidance I learned that the leading number is the underbust measurement, though different regions label sizes differently.

It looks like the ladies' sizes cluster mostly around 32B and 36B.

I've heard Urban Beauty's customers are mostly students; looking at the cup-size distribution, those who understand, understand.

Closing thoughts

Nothing is accomplished overnight; that goes for life, and it goes for learning.

There's no such thing as “master it in three days” or “learn it in seven”.

Only persistence can lead to success!

Pegging the Books said:

Every word of this article is written with care, because I want to be worthy of everyone who follows me. Click “like” and “looking” at the end so I know you're working hard on your studies too.

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am a reader and a dedicated learner. The more you know, the more you realize you don't. See you next time!