I. Project description

1. Project background

One day, a friend sent me a link, Item.jd.com/10000049965…, and asked how to crawl all the comments on this product. I opened it and saw, wow, nearly 3 million comments, which is not a small number. A closer look, however, revealed that 2.34 million+ of them are default positive comments, and only a small share are genuinely valuable. It is also obvious that the web page displays only 100 pages of data, with 10 comments per page.

You could use Selenium to click through each page and scrape it, but isn't that inefficient? Wouldn't it be more direct to use Requests? In many cases, the data displayed on a web page is rendered from JSON data requested by the page. Will JD's comments work the same way? OK, let's find out!

2. Project environment

This small project uses Python for the crawling and doesn't require much configuration; just install the Requests library, a staple of many crawlers. Don't tell me you can crawl without installing Requests.
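If it is not installed yet, a single pip command (assuming pip is available on your system) takes care of it:

pip install requests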

II. Project implementation

1. Project analysis

As mentioned above, much of the data in a web page is rendered from requested JSON data, so let's see whether JD works the same way. Open the browser's developer tools, select the Network tab, and look closely at the requests related to comments. You will find a request link, Club.jd.com/comment/pro…, as shown in the picture above. It describes the overall comment situation for the product: the total number of comments, the number of default positive comments, the number of positive comments, the positive rate, and so on. It is not what we want, but we are one step closer.

Continuing the search, we find another link containing comments, Club.jd.com/comment/pro…, as shown in the figure. It carries 10 comments, which should be the first page of comments for this product. Click to view, and comparing it with the comments displayed on the page confirms this is exactly what we are looking for. Since the returned data is not JSON in a standard format, I chose regular expressions to extract the relevant content.

One problem remains: we only got 1 page of comments, so how do we get all of them? Could the answer be hidden in the link? Looking at the link Club.jd.com/comment/pro…, the request carries a page parameter; stepping it from 0 to 99 walks through all 100 visible pages. Now that the analysis is done, we can start coding.
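As a quick sanity check of this analysis, the snippet below (a minimal sketch using the request URL that appears in the code later in this article) fetches page 0 and prints the start of the response; it should show a fetchJSON_comment98(...) wrapper rather than bare JSON, which is why the standard JSON route isn't straightforward:

import requests

# Minimal check of the endpoint: `page` (0-99) selects which page of 10
# comments is returned; productId identifies the product
url = ('https://club.jd.com/comment/productPageComments.action'
       '?callback=fetchJSON_comment98&productId=100000499657'
       '&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1')
text = requests.get(url).text
print(text[:100])  # expected to begin with the fetchJSON_comment98( wrapper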

2. Code implementation

Import modules and define constants

import re
import time
import csv
import os
import requests
import html

# Set the request header
headers = {
    'cookie': 'shshshfp=22dd633052035d21be92463ffa35684d; shshshfpa=ab283f84-c40f-9710-db89-84a8d3366a81-1586333030; __jda=122270672.1586333031101106032014.1586333031.1586333031.1586333031.1; __jdv=122270672|direct|-|none|-|1586333031103; __jdc=122270672; shshshfpb=bUe7tI9%2FOOaJKd7vP0EtSOg%3D%3D; __jdu=1586333031101106032014; areaId=22; ipLoc-djd=22-1977-1980-0; 3AB9D23F7A4B3C9B=7XEQD4BFTGEH44EK7LN7HLFCHJW6W2NS5VJOQOCHABZVI7LXJJIW3K2IX5MTPZ4TBERBLY6TRQR5CA3S3IYVLQ2JGI; jwotest_product=99; shshshsID=a7457cee6a4a9fa285fe2cff44c6bd17_4_1586333142454; __jdb=122270672.4.1586333031101106032014|1.1586333031; JSESSIONID=8C21549A613B83F0CB86EF1F38FD63D3.s1',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

Import the required modules and define the request headers for Requests, which reduces the chance of the crawl being blocked by anti-crawling measures.

The main comment-crawling function

def comment_crawl(page, writer):
    """Crawl one page of comments."""
    print('Currently downloading page %d comments' % (page + 1))
    url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100000499657&score=0&sortType=5&page=%d&pageSize=10&isShadowSku=0&fold=1' % page
    # Request the data from the link
    text = requests.get(url, headers=headers).text
    # Match the data with a regular expression
    comment_list = re.findall(
        r'guid":".*?"content":"(.*?)".*?"creationTime":"(.*?)".*?"replyCount":(\d+),"score":(\d+).*?"usefulVoteCount":(\d+).*?"imageCount":(\d+).*?"images":',
        text)
    for result in comment_list:
        # Pull each field out of the regular expression result
        content = html.unescape(result[0]).replace('\n', ' ')
        comment_time = result[1]
        reply_count = result[2]
        score = result[3]
        vote_count = result[4]
        image_count = result[5]
        # Write the data
        writer.writerow((comment_time, score, reply_count, vote_count, image_count, content))

This is the body of the crawl: after the data is requested, the desired content is matched with a regular expression. One thing to note is that html.unescape(result[0]) is used to convert HTML entities such as … back into normal characters, so that the extracted text is more complete and readable.
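As an aside, since the response is the JSON payload wrapped in a fetchJSON_comment98(...) callback, another option is to strip the wrapper and parse the payload with the json module instead of a regex. This is a sketch under that assumption (including a top-level 'comments' list, which the fields matched above suggest), not the approach used in this article:

import json

# Alternative to the regex: peel off the JSONP wrapper and parse as JSON.
# Relies on the `headers` dict defined earlier.
def fetch_comments(page):
    url = ('https://club.jd.com/comment/productPageComments.action'
           '?callback=fetchJSON_comment98&productId=100000499657'
           '&score=0&sortType=5&page=%d&pageSize=10&isShadowSku=0&fold=1' % page)
    text = requests.get(url, headers=headers).text
    payload = text[text.find('(') + 1:text.rfind(')')]  # drop the callback wrapper
    return json.loads(payload)['comments']  # assumed top-level key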

The main function

if __name__ == '__main__':
    # Main entry point
    start = time.time()
    if os.path.exists('DATA.csv'):
        os.remove('DATA.csv')
    with open('DATA.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f)
        writer.writerow(('Comment time', 'Score', 'Replies', 'Likes', 'Images', 'Comment content'))
        for page in range(100):
            comment_crawl(page, writer)
    run_time = time.time() - start
    print('Run time is %d minutes %d seconds.' % (run_time // 60, run_time % 60))

The main function creates a CSV file to save the data and times the program to show how efficiently it runs.

III. Project analysis and description

1. Run tests

The whole small project is very simple; the focus is on the analysis process and the ideas behind it. Once the analysis is done, the code is easy to implement. A test run (shown in the screenshot) was quite efficient, fetching nearly a thousand comments in 23 seconds; a screenshot of the resulting data follows. If you need to get reviews for another product, just change the productId in the URL inside the function. The full code can be downloaded from Download.csdn.net/download/CU….
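To make that productId swap easier, a hypothetical refactor (my sketch, not the downloadable code) pulls the id out into a constant so it only needs changing in one place:

# Hypothetical tweak: keep productId out of the hard-coded URL string
PRODUCT_ID = '100000499657'  # replace with the id of the product to crawl
URL_TEMPLATE = ('https://club.jd.com/comment/productPageComments.action'
                '?callback=fetchJSON_comment98&productId=%s&score=0'
                '&sortType=5&page=%d&pageSize=10&isShadowSku=0&fold=1')
url = URL_TEMPLATE % (PRODUCT_ID, 0)  # page 0 of the chosen product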

2. Possible improvements

  • A single thread is fine when the data volume is small, but once more comments need to be crawled, it can become an efficiency bottleneck, so multithreading or multiprocessing can be used instead. The main function is improved along these lines (a fuller sketch appears after this list):

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(3)
...
for page in range(100):
    pool.submit(comment_crawl, page, data_list)

The code can be downloaded from Download.csdn.net/download/CU… for further study. In the demo run, the running time dropped by about one third, a clear efficiency improvement.

  • JD's anti-crawling measures are relatively light, so this code takes few precautions against them; crawling a small amount is fine, but heavy crawling will certainly trigger the anti-crawling mechanism and cause the crawl to fail.
  • Extensibility still needs work. At present the script only crawls JD product reviews; other e-commerce platforms such as Taobao are harder to handle, which places further demands on the code.
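For reference, here is a fuller sketch of the threaded main loop from the first point above. It assumes comment_crawl as defined earlier and keeps the pool small; note that concurrent csv.writer calls can interleave, so a production version should guard the write step with a lock:

import csv
from concurrent.futures import ThreadPoolExecutor

with open('DATA.csv', 'a+', newline='', encoding='gb18030') as f:
    writer = csv.writer(f)
    writer.writerow(('Comment time', 'Score', 'Replies', 'Likes', 'Images', 'Comment content'))
    # 3 workers keep the request rate modest, which also helps avoid
    # triggering JD's anti-crawling measures
    with ThreadPoolExecutor(3) as pool:
        for page in range(100):
            pool.submit(comment_crawl, page, writer)
    # exiting the `with` block waits for all submitted pages to finish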

3. Other notes

  • This project is for learning and exchange only; it must not be used for malicious crawling, illegal profit, or other improper purposes.
  • If it infringes anyone's interests, please get in touch and it will be removed.