This article is participating in Python Theme Month.

Background story

At the time, the author was deep in thought when a WeChat message from a former leader suddenly popped up. My heart jumped: was he recruiting me onto his team, with a promotion and a raise? It turned out to be a job request. He sent over a video of a crawler-like desktop app that scrapes data from Douyin: for a given video, it finds the users who commented and sends each of them a private message. Presumably the operator messages the commenters on a video in bulk, pitching a product by private message; anyone who responds becomes a conversion. He asked: how much would it cost to implement?

To start

  • First of all, when it comes to crawlers, the author is both excited and nervous, with that little thrill of "is this prison food?" in the back of my mind.
Just kidding!! I immediately had several solutions in mind: which technology should I use? How can I do it well?
  • The solution <time to show off>
Python's familiar scrapy crawler framework is great to use!!
But the author is a test engineer, not a crawler engineer; crawling is just an occasional hobby!! The catch is that the author had only ever skimmed the scrapy framework, so this would mean starting from scratch.
So, with the browser's F12 debugger open, could the data be crawled directly through the API? Disappointment!!
A direct requests.get(url) call returns an error and no data. A closer look reveals an encrypted signature parameter in the request: it changes with every browser request, and without knowing how it is generated, the data cannot be crawled this way.
Since the author is a tester, could Selenium, the UI automation testing tool, do the job?
Yes. There are always more solutions than difficulties.
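The encrypted-parameter problem can be seen by comparing two captures of the same request. A minimal sketch, using a made-up endpoint and a made-up `_signature` parameter name (both are assumptions for illustration; the real site's names differ):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical example: two captures of the "same" search request.
# The endpoint path and the `_signature` parameter name are invented
# for illustration; the real site's parameter names may differ.
url_a = "https://www.xxxxyin.com/api/search/?keyword=test&_signature=abc123"
url_b = "https://www.xxxxyin.com/api/search/?keyword=test&_signature=xyz789"

params_a = parse_qs(urlparse(url_a).query)
params_b = parse_qs(urlparse(url_b).query)

# The plain parameters match, but the signature differs on every request,
# so replaying a captured URL with requests.get() will not keep working.
print(params_a["keyword"] == params_b["keyword"])        # True
print(params_a["_signature"] == params_b["_signature"])  # False
```

This is exactly why a recorded request cannot simply be replayed: the signature is computed fresh by the page's JavaScript each time.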

Using Selenium to crawl Douyin video data

  • Prerequisites: basic Python, XPath syntax, and some feel for UI automation
# Prepare the Python + Selenium environment and download the Chrome driver.
# First run a small demo to verify that everything works.

from selenium import webdriver

driver = webdriver.Chrome(executable_path="../chromedriver.exe")
driver.implicitly_wait(30)
driver.maximize_window()

driver.get("http://www.baidu.com")

# do something

driver.quit()
  • With the goal clear, open the official site in the browser and search for the keyword to get the URL <code demo below>
"" Created on July 22, 2021 @Author: Qguan"

import re
from time import sleep

from selenium import webdriver

Initialize the driver object
driver=webdriver.Chrome(executable_path=".. /chromedriver.exe")
driver.implicitly_wait(30)
driver.maximize_window()

# Open the target url
url="https://www.xxxxyin.com/search/%E4%B8%8A%E6%B5%B7%E6%95%B4%E5%9E%8B"
driver.get(url)
# May be the reason for anti - pickling, there is jigsaw verification
sleep(5) # Manual here, for testing
# You can use image processing to handle slider verification
# Finally, of course, use headless mode

Get the current handle
# handler=driver.current_window_handle

Get all elements of the current result page
video_pic=driver.find_elements_by_xpath("//div[@style='display: block;']/ul/li/div/a[1]")

# counter
i=1
for video in video_pic:
    Walk through the click element
    video.click()
    Get all browser handles
    handlers=driver.window_handles
    # Switch to the latest one
    driver.switch_to_window(handlers[-1])
    
    Get jump page elements: title, like, comment, post time, username
    titles=driver.find_elements_by_xpath("//div/div[2]/div[1]/div[1]/div[1]/div[2]/h1/span[2]/span/span/span/span")
    if len(titles)>0:
        title=""
        for tit in titles:
            title+=tit.text
    else:
        title="Did not get the full title"
    
    praise=driver.find_element_by_xpath("//div/div[2]/div[1]/div[1]/div[1]/div[2]/div/div[1]/div[1]/span").text
    comment=driver.find_element_by_xpath("//div/div[2]/div[1]/div[1]/div[1]/div[2]/div/div[1]/div[2]/span").text
    open_time=driver.find_element_by_xpath("//div/div[2]/div[1]/div[1]/div[1]/div[2]/div/div[2]/span").text
    
    # Video author name
    username=driver.find_element_by_xpath("//div/div[2]/div[1]/div[2]/div/div[1]/div[2]/a/div/span/span/span/span/span")
    
    Click the user name to jump to the user details page
    username.click()
    
    Get the url of the current page, close the current page
    c_url=driver.current_url
    driver.close() # Why can't it be closed
    
    param_url=c_url.split("?") [1] # split url? Parameter of splicing
    Extract user video ID and user ID by regular matching
    author_id=re.findall("(\d{11})",param_url)[0]
    group_id=re.findall(r"(\d{19})",param_url)[0]
    
    # Console output results
    print("The first {} article, video title: {}, id: {}, user id: {}, comments: {}, thumb up number: {}, release time: {}".format(i,title,group_id,author_id,comment,praise,open_time))
    
    # Switch to page 1
    driver.switch_to_window(handlers[0])
    
    i+=1 # counter increment by 1
     
Exit the driver
driver.quit()

Copy the code
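The id-extraction step at the end of the loop can be exercised on its own. A minimal sketch with a made-up detail-page URL (the path and parameter names are invented; only the 11-digit and 19-digit patterns come from the code above):

```python
import re

# Made-up URL; only the digit lengths of the ids matter here.
c_url = "https://www.xxxxyin.com/video/detail?author_id=12345678901&group_id=1234567890123456789"

param_url = c_url.split("?")[1]  # the query string after "?"

# Note: \d{11} would also match the first 11 digits of the 19-digit id,
# so this relies on the 11-digit id appearing first in the query string.
author_id = re.findall(r"(\d{11})", param_url)[0]  # 11-digit user id
group_id = re.findall(r"(\d{19})", param_url)[0]   # 19-digit video id

print(author_id)  # 12345678901
print(group_id)   # 1234567890123456789
```

The caveat in the comment matters: if the parameter order were reversed, the 11-digit pattern would grab a slice of the 19-digit id instead.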
  • The results are output to the console, and the key crawled information can be read off directly.

Conclusion

  • In terms of usability, Selenium is less suited to crawling than scrapy, mainly because scrapy makes database operations convenient and is more purpose-built.
With Selenium you need to be familiar with element-location strategies and solve more problems at the UI level; even the database tables have to be built yourself, and other frameworks may be needed for certain problems.
  • From the perspective of learning cost, though, any tool that solves the problem quickly is a good tool.
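Since the conclusion notes that with Selenium the storage layer is yours to build, here is a minimal sqlite3 sketch for persisting the crawled fields (the table and column names are made up to mirror the variables in the script above):

```python
import sqlite3

# In-memory database for the sketch; use a file path in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS videos (
           group_id   TEXT PRIMARY KEY,  -- video id
           author_id  TEXT,              -- user id
           title      TEXT,
           praise     TEXT,
           comment    TEXT,
           open_time  TEXT
       )"""
)

# A row shaped like one iteration of the crawl loop (values are made up).
row = ("1234567890123456789", "12345678901", "demo title", "1.2w", "345", "2021-07-22")
conn.execute("INSERT OR REPLACE INTO videos VALUES (?, ?, ?, ?, ?, ?)", row)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
print(count)  # 1
```

`INSERT OR REPLACE` keyed on the video id keeps the table deduplicated if the same video is crawled twice.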