This article is participating in Python Theme Month. See the link to the event for more details

Recap

In the article “Thinking caused by Nugget login” I analyzed how the Juejin (“Nugget”) login works. Although it is feasible to assemble the login parameters in a program and call the login interface directly, doing so is not cost-effective. Moreover, if the login policy changes, the effort may end up wasted.

So today, let's approach it the other way around and use Selenium, a tool for testing web applications, to complete the login.

Technical background

The techniques used in this article are listed below; a passing familiarity is all you need. If you do not know them yet, that is fine too; it will not stop you from reading on.

  • Introduction to cookies
  • Front-end knowledge: XPath tutorial
  • Image processing: OpenCV Chinese documentation
  • Web application testing: Python Selenium user guide

Deciphering the login

Let’s look directly at how to complete the login with Selenium, which is roughly divided into the following steps:

  • Get slide verification code by simulating browser operation
  • Crack the slide captcha
    • A. Calculate the distance the slider needs to travel
    • B. Simulate a browser drag and drop to pass the slider verification code
  • Simulate the browser to achieve login

1. Environment Introduction

  • Hardware: Mac
  • Language: Python 3
  • Editor: PyCharm
  • Browser: Chrome

2. Obtain the sliding verification code

Analyzing the login process shows that the slider verification code consists of two images: the slider image and the background image. Here you only need to drive the browser with Selenium up to the step just before login and read the image URLs of the verification code. This step is not difficult. The code is as follows:

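import time

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
driver = webdriver.Chrome(executable_path="./chromedriver")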
driver.get("https://juejin.cn/")

# The steps below walk from the home page to the captcha image URLs; you can follow along in a browser

# Click the login button (the label in the XPath must match the text shown on the live page)
login_button = driver.find_element(By.XPATH, '''//button[text()="login"]''')
ActionChains(driver).move_to_element(login_button).click().perform()

time.sleep(2)

# Select account password to log in
other_login_span = driver.find_element(By.XPATH, '''//span[text()="other login methods"]''')
ActionChains(driver).move_to_element(other_login_span).click().perform()
time.sleep(2)

# Enter the username and password
username_input = driver.find_element(By.XPATH, '//input[@name="loginPhoneOrEmail"]')
password_input = driver.find_element(By.XPATH, '//input[@name="loginPassword"]')
username_input.send_keys("")  # Juejin account
password_input.send_keys("")  # Juejin password

# Click login
login_button = driver.find_element(By.XPATH, '''//button[text()="login"]''')
ActionChains(driver).move_to_element(login_button).click().perform()
time.sleep(2)

# Get the slider captcha image
verify_image1 = driver.find_element(By.XPATH, '''//img[@id="captcha-verify-image"]/../img[1]''')
verify_image2 = driver.find_element(By.XPATH, '''//img[@id="captcha-verify-image"]/../img[2]''')
verify_image1_src = verify_image1.get_attribute("src")
verify_image2_src = verify_image2.get_attribute("src")
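
The fixed time.sleep(2) pauses above assume the page is ready within two seconds. A polling wait is one alternative; the following is a minimal sketch in plain Python, where wait_until and its parameters are hypothetical names that do not appear in the original code. Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Hypothetical usage with the driver from the snippet above.
# find_elements returns an empty (falsy) list until the image exists:
# images = wait_until(
#     lambda: driver.find_elements(By.XPATH, '//img[@id="captcha-verify-image"]')
# )
```

Polling returns as soon as the element appears and fails loudly when it never does, instead of continuing with a missing element.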

3. Crack the slider verification code

The following figure shows the corresponding slider, background image and the corresponding recognition process

Put simply, recognition means dragging the slider image into the notch in the background image.

3.1 Positioning the slider coordinates

OpenCV template matching in Python is used here. Template matching finds the small region of a whole image that best matches a given sub-image (the template). Roughly, the template is slid over the image to be searched from left to right and top to bottom, and at each position the degree of match between the template and the sub-image it overlaps is computed. The higher the degree of match, the more likely the two are the same.

The first two parameters of cv2.matchTemplate are the two images: the image to search (here the background image) and the template (here the slider image). The third parameter is the matching method; in this case TM_CCOEFF_NORMED (normalized correlation coefficient matching) is used. The function returns a result array holding the comparison score at each position.

For example, if the size of the input image (the original image) is W × H and the size of the template is w × h, then the size of the returned result is (W - w + 1) × (H - h + 1).
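
To make the sliding-window idea and the result size concrete, here is a minimal pure-Python sketch. It is not what OpenCV does internally: it swaps in a simpler score (sum of squared differences, where lower is better), and match_template_ssd and the toy arrays are illustrative assumptions.

```python
def match_template_ssd(image, templ):
    """Slide templ over image and return a grid of sum-of-squared-differences scores.

    image and templ are 2D lists of numbers (rows of pixel values).
    A lower score means a better match, unlike TM_CCOEFF_NORMED where higher is better.
    """
    H, W = len(image), len(image[0])
    h, w = len(templ), len(templ[0])
    result = []
    for top in range(H - h + 1):
        row = []
        for left in range(W - w + 1):
            score = sum(
                (image[top + i][left + j] - templ[i][j]) ** 2
                for i in range(h)
                for j in range(w)
            )
            row.append(score)
        result.append(row)
    return result


# Toy 5 x 4 image (W = 5, H = 4) containing the 2 x 2 template at row 1, column 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
templ = [
    [9, 8],
    [7, 6],
]
result = match_template_ssd(image, templ)
print(len(result), len(result[0]))  # rows = H - h + 1, columns = W - w + 1
```

The template can only be placed where it fits entirely inside the image, which is why each dimension of the result shrinks by the template size minus one.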

Here slider_pic and background_pic are the grayscale, binarized images.

import cv2
import numpy as np

# Read the slider image
slider_pic = cv2.imread("slider_pic")
# Read the background image
background_pic = cv2.imread("background_pic")
# Score the slider (template) against every position in the background (image to search)
result = cv2.matchTemplate(background_pic, slider_pic, cv2.TM_CCOEFF_NORMED)

# top, left is the slider's position relative to the upper left corner of the background image
top, left = np.unravel_index(result.argmax(), result.shape)

Note: np.unravel_index(result.argmax(), result.shape) returns the index of the maximum value in the multidimensional array, which corresponds to the pixel coordinates of the best match. Since the slider only moves horizontally, top can simply be ignored here.
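
What the argmax plus unravel_index combination does for a 2D array can be sketched in plain Python: find the flat index of the largest score, then turn it back into a row and a column. The helper argmax_2d and the sample scores are made up for illustration.

```python
def argmax_2d(result):
    """Return (top, left) of the largest value in a 2D list of scores."""
    width = len(result[0])
    flat = [score for row in result for score in row]
    # Index of the first maximum in the flattened, row-major list.
    flat_index = max(range(len(flat)), key=flat.__getitem__)
    # Row-major layout: row = index // width, column = index % width.
    top, left = divmod(flat_index, width)
    return top, left


scores = [
    [0.1, 0.2, 0.3, 0.1],
    [0.0, 0.4, 0.9, 0.2],
    [0.3, 0.1, 0.2, 0.5],
]
top, left = argmax_2d(scores)
print(top, left)
```

The first value is the row (vertical offset) and the second is the column (horizontal offset), so left is the one that matters for a horizontal slider.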

3.2 Generate slide track

As we all know, it is hard for a person to drag an image at a constant speed, so the program needs to simulate human dragging and make the motion trajectory less regular. The idea and code here come from an article on Geetest slider verification code recognition.

Because the slider cannot be moved to the target position in a single jump, a normal drag has to be simulated here. The logic is that the slider accelerates uniformly over the first part of the drag and decelerates uniformly over the last part, which can be computed with the constant-acceleration formulas from physics.

The slider's acceleration is denoted a, its current velocity v, its initial velocity v0, its displacement x, and the elapsed time t. They are related as follows:

x = v0 * t + 0.5 * a * t * t
v = v0 + a * t

# where track is the track of the slider drag
def get_track(distance):  # distance is the total distance to slide
    # Moving track
    track = []
    # Current displacement
    current = 0
    # Deceleration threshold
    mid = distance * 4 / 5
    # Calculate interval
    t = 0.2
    # velocity
    v = 1

    while current < distance:
        if current < mid:
            # acceleration
            a = 4
        else:
            # deceleration
            a = -3
        v0 = v
        # Current speed
        v = v0 + a * t
        # Move distance
        move = v0 * t + 1 / 2 * a * t * t
        # Current displacement
        current += move
        # Add track
        track.append(round(move))
    return track
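
Because each step is rounded and the last step can overshoot, the steps in the track will not in general add up to exactly distance. One possible fix, sketched below, is to append a single corrective step. close_track is a hypothetical helper that is not part of the original code, and get_track is repeated in compact form only so the example runs on its own.

```python
def get_track(distance):
    """Same accelerate-then-decelerate scheme as the function above."""
    track, current, v, t = [], 0, 1, 0.2
    mid = distance * 4 / 5
    while current < distance:
        a = 4 if current < mid else -3
        v0 = v
        v = v0 + a * t
        move = v0 * t + 0.5 * a * t * t
        current += move
        track.append(round(move))
    return track


def close_track(track, distance):
    """Append one corrective step so that the integer steps sum to distance."""
    residual = round(distance) - sum(track)
    return track + [residual] if residual else list(track)


track = close_track(get_track(100), 100)
print(sum(track))
```

The corrective step may be negative, which moves the slider back slightly after an overshoot.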

4. Operate the page slider

In the previous step, OpenCV was used to locate the slider in the verification image and the track was generated. The next step is to drag the slider on the page.


# Locate the slider button
verify_div = driver.find_element(By.XPATH, '''//div[@class="sc-kkGfuU bujTgx"]''')

# Press and hold the left mouse button
ActionChains(driver).click_and_hold(verify_div).perform()
time.sleep(0.5)

# Walk through the track (the list returned by get_track) to slide
for t in track:
    time.sleep(0.01)
    ActionChains(driver).move_by_offset(xoffset=t, yoffset=0).perform()
# Release the mouse to complete the drag
ActionChains(driver).release(on_element=verify_div).perform()

Note: The time.sleep() calls in the code add brief pauses between page actions so that they are not fired in rapid succession.

5. Get a cookie

Once the slider verification code has been passed, the login succeeds. At this point you only need to read the cookies for juejin.cn.

# Get the current page cookie
driver.get_cookies()

'''
[{'domain': '.juejin.cn', 'expiry': 1632902943, 'httpOnly': False, 'name': 'MONITOR_WEB_ID', 'path': '/', 'secure': False, 'value': 'e7fa2492-...-8ff5e04c6727'},
 # ...
 {'domain': '.juejin.cn', 'expiry': 1630310943, 'httpOnly': True, 'name': 'sessionid_ss', 'path': '/', 'sameSite': 'None', 'secure': True, 'value': 'bfac25b956...f7be812054f'},
 {'domain': '.juejin.cn', 'expiry': 1630310943, 'httpOnly': True, 'name': 'sessionid', 'path': '/', 'secure': False, 'value': 'bfac25b956b...42f7be812054f'},
]
'''
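
To reuse the session outside the browser, the list of cookie dictionaries usually has to be flattened. The following is a minimal sketch; the helper names and the shortened sample values are placeholders, not real session data.

```python
def cookies_to_dict(cookies):
    """Map cookie name -> value from the list returned by driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookies_to_header(cookies):
    """Build the value of an HTTP Cookie request header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Shortened sample in the shape shown above; the values are placeholders.
sample = [
    {"domain": ".juejin.cn", "name": "MONITOR_WEB_ID", "path": "/", "value": "abc"},
    {"domain": ".juejin.cn", "name": "sessionid", "path": "/", "value": "xyz"},
]
print(cookies_to_dict(sample))
print(cookies_to_header(sample))
```

The dictionary form suits HTTP clients that accept a name-to-value mapping, and the header string can be sent as a raw Cookie header.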

Afterword

Points to optimize

  • The images are saved locally and read several times; this could be optimized by processing them directly in memory.
  • All the code sits in a single file; it could be split into separate objects by function.
  • Recognition fails with some probability, so an error retry mechanism needs to be added (see figure below).
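
The retry mechanism from the last point could look something like the following minimal sketch. The names retry and solve_captcha_once are hypothetical; the action is assumed to be a callable that makes one attempt and returns True on success.

```python
import time


def retry(action, attempts=3, delay=1.0):
    """Call action() until it returns a truthy value, at most `attempts` times.

    Returns True on the first success, False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            time.sleep(delay)  # give the page time to load a fresh captcha
    return False


# Hypothetical usage, where solve_captcha_once(driver) performs one
# locate-and-drag attempt and reports whether the login went through:
# logged_in = retry(lambda: solve_captcha_once(driver), attempts=5, delay=2.0)
```

Capping the number of attempts keeps a persistently failing recognition from looping forever.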

References

  • OpenCV image processing based on Python
  • numpy.unravel_index
  • Python Selenium User Guide