This article uses Selenium to log in to a website automatically and take a screenshot, and Requests to grab page content that requires a login. Let’s learn how.

  • Selenium: An umbrella project for a range of tools and libraries that support web browser automation.
  • Requests: The only Non-GMO HTTP library for Python, safe for human consumption.

Why choose Selenium for automatic login?

Selenium simulates the process of manually opening a browser and logging in.

Compared with logging in through direct HTTP requests, this has several benefits:

  1. Avoid dealing with login-form complications (iframes, Ajax, etc.) and the details of parsing them.
    • Selenium simply follows the same steps a user would.
  2. Avoid emulating headers, recording cookies, and other details of completing a login over HTTP.
    • Selenium relies on the browser’s own functionality.
  3. It is easy to wait for pages to load, detect special cases (login verification, etc.), and add further logic.

In addition, watching the automated login play out in a visible browser window looks quite impressive to a layperson.

Why choose Requests to grab web content?

When you only need to grab some content after logging in, rather than crawl the whole site, Requests is a convenient choice.

1) Prepare Selenium

Basic Environment: Python 3.7.4 (Anaconda3-2019.10)

Install Selenium with pip:

pip install selenium

Get Selenium version information:

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium
>>> print('Selenium version is {}'.format(selenium.__version__))
Selenium version is 3.141.0
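The code in this article uses the Selenium 3 locator helpers such as find_element_by_name, which were removed in Selenium 4.3. The following is a minimal sketch, assuming a plain dotted version string, of checking for that up front. The helper names parse_version and has_legacy_locators are hypothetical and not part of Selenium.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.141.0' into a tuple of ints."""
    parts = []
    for piece in text.split('.')[:3]:
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or '0a5'
        parts.append(int(piece))
    return tuple(parts)


def has_legacy_locators(version_text):
    """True if the find_element_by_* helpers should still be available."""
    return parse_version(version_text) < (4, 3, 0)


# '3.141.0' is the version used in this article.
print(has_legacy_locators('3.141.0'))
```

With a real installation you would pass selenium.__version__ instead of the literal string.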

2) Prepare the browser and its driver

Download and install Google Chrome from www.google.com/chrome/

Download the Chromium/Chrome WebDriver: chromedriver.storage.googleapis.com/index.html

Then add the WebDriver directory to PATH, for example:

# macOS, Linux
echo 'export PATH=$PATH:/opt/WebDriver/bin' >> ~/.profile

# Windows
setx /m path "%path%;C:\WebDriver\bin"
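Before starting the browser, it can help to confirm that the driver really is reachable through PATH. The following is a small sketch using only the standard library. The function name find_driver is hypothetical, and the executable name chromedriver is an assumption about how the downloaded file is named.

```python
import shutil


def find_driver(name='chromedriver'):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print('chromedriver not found on PATH')
else:
    print('chromedriver found at {}'.format(path))
```

If the lookup returns None, open a new terminal first, since changes to ~/.profile or setx only affect sessions started afterwards.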

3) Start coding!

Reading login configuration

Login information is private, so we read it from a JSON configuration file:

# load config
import json
from types import SimpleNamespace as Namespace

secret_file = 'secrets/douban.json'
# {
#   "url": {
#     "login": "https://www.douban.com/",
#     "target": "https://www.douban.com/mine/"
#   },
#   "account": {
#     "username": "username",
#     "password": "password"
#   }
# }
with open(secret_file, 'r', encoding='utf-8') as f:
  config = json.load(f, object_hook=lambda d: Namespace(**d))

login_url = config.url.login
target_url = config.url.target
username = config.account.username
password = config.account.password
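To see what the object_hook in the listing above does, here is a self-contained sketch that parses the same structure from an in-memory string instead of a file. The values are the placeholders from the comment above, and load_config is a hypothetical helper name.

```python
import json
from types import SimpleNamespace as Namespace

# Placeholder values in the same shape as secrets/douban.json.
sample = '''
{
  "url": {
    "login": "https://www.douban.com/",
    "target": "https://www.douban.com/mine/"
  },
  "account": {"username": "username", "password": "password"}
}
'''


def load_config(text):
    """Parse JSON text so nested objects are reachable with dot access."""
    # json calls object_hook for every decoded JSON object, innermost first,
    # so each nested dict becomes a Namespace before its parent is built.
    return json.loads(text, object_hook=lambda d: Namespace(**d))


config = load_config(sample)
print(config.url.login)
print(config.account.username)
```

Dot access only works for keys that are valid Python identifiers, so a key such as "user-name" would still need getattr.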

Selenium Automatic Login

This example uses the Chrome WebDriver, with Douban as the test site.

Open the login page and automatically enter the user name and password to log in:

# automated testing
import sys  # needed for sys.exit() in the timeout handlers

from selenium import webdriver

# Chrome Start
opt = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=opt)
# Chrome opens with "Data;" with selenium
# https://stackoverflow.com/questions/37159684/chrome-opens-with-data-with-selenium
# Chrome End

# driver.implicitly_wait(5)

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 5)

print('open login page ... ')
driver.get(login_url)
driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])

driver.find_element_by_css_selector('li.account-tab-account').click()
driver.find_element_by_name('username').send_keys(username)
driver.find_element_by_name('password').send_keys(password)
driver.find_element_by_css_selector('.account-form .btn').click()
try:
  wait.until(EC.presence_of_element_located((By.ID, "content")))
except TimeoutException:
  driver.quit()
  sys.exit('open login page timeout')
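The wait.until(...) call above is an explicit wait: it keeps checking a condition until it holds or the timeout expires. The following is a simplified stand-in for that idea in plain Python, not Selenium's actual implementation. The names wait_until, WaitTimeout, and element_present are hypothetical, and the 0.5 second default polling interval mirrors WebDriverWait's default.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout('condition not met within {} s'.format(timeout))
        time.sleep(poll)


# Stand-in for "the element is present": truthy on the third call.
calls = {'n': 0}


def element_present():
    calls['n'] += 1
    return 'element' if calls['n'] >= 3 else None


print(wait_until(element_present, timeout=2.0, poll=0.01))
```

The real WebDriverWait additionally ignores NoSuchElementException while polling and raises TimeoutException, which is what the except branches in this article catch.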

If you use Internet Explorer, it looks like this:

# Ie Start
# Selenium Click is not working with IE11 in Windows 10
# https://github.com/SeleniumHQ/selenium/issues/4292
opt = webdriver.IeOptions()
opt.ensure_clean_session = True
opt.ignore_protected_mode_settings = True
opt.ignore_zoom_level = True
opt.initial_browser_url = login_url
opt.native_events = False
opt.persistent_hover = True
opt.require_window_focus = True
driver = webdriver.Ie(options=opt)
# Ie End

To set additional capabilities, you can:

cap = opt.to_capabilities()
cap['acceptInsecureCerts'] = True
cap['javascriptEnabled'] = True

Open the target page and take a screenshot

print('open target page ... ')
driver.get(target_url)
try:
  wait.until(EC.presence_of_element_located((By.ID, "board")))
except TimeoutException:
  driver.quit()
  sys.exit('open target page timeout')

# save screenshot
driver.save_screenshot('target.png')
print('saved to target.png')
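The listing above does not check whether the screenshot was actually written. As a hedged sketch, one simple sanity check is to confirm that the saved file starts with the standard 8-byte PNG signature. The helper name looks_like_png is hypothetical, and the demo writes a stand-in file rather than a real screenshot.

```python
import os
import tempfile

# Every valid PNG file starts with these eight bytes.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def looks_like_png(path):
    """Return True if the file at path begins with the PNG signature."""
    with open(path, 'rb') as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


# Demo with a stand-in file; a real run would check 'target.png' instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, 'target.png')
    with open(demo, 'wb') as f:
        f.write(PNG_SIGNATURE + b'placeholder image data')
    print(looks_like_png(demo))
```

This only checks the file header, so it catches empty or truncated output but does not validate the image itself.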

Copy the cookies to Requests and fetch the HTML

# save html
import requests

requests_session = requests.Session()
selenium_user_agent = driver.execute_script("return navigator.userAgent;")
requests_session.headers.update({"user-agent": selenium_user_agent})
for cookie in driver.get_cookies():
  requests_session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

# driver.delete_all_cookies()
driver.quit()

resp = requests_session.get(target_url)
resp.encoding = resp.apparent_encoding
# resp.encoding = 'utf-8'
print('status_code = {0}'.format(resp.status_code))
with open('target.html', 'w', encoding='utf-8') as fout:
  fout.write(resp.text)

print('saved to target.html')
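The loop above copies only the name, value, and domain of each cookie. The following sketch, which is an assumption about what may be useful rather than the author's code, also carries over the path, secure flag, and expiry when the browser reports them. The helper name to_requests_cookie_kwargs and the sample cookie are made up for illustration.

```python
def to_requests_cookie_kwargs(selenium_cookie):
    """Map one Selenium cookie dict to keyword arguments for cookies.set()."""
    kwargs = {
        'name': selenium_cookie['name'],
        'value': selenium_cookie['value'],
    }
    # Copy optional attributes only when the browser reported them.
    for key in ('domain', 'path', 'secure'):
        if key in selenium_cookie:
            kwargs[key] = selenium_cookie[key]
    # Selenium calls the expiry timestamp 'expiry'; Requests calls it 'expires'.
    if 'expiry' in selenium_cookie:
        kwargs['expires'] = selenium_cookie['expiry']
    return kwargs


# Made-up cookie in the shape returned by driver.get_cookies().
sample = {
    'name': 'sessionid', 'value': 'abc123', 'domain': '.douban.com',
    'path': '/', 'secure': True, 'httpOnly': True, 'expiry': 1700000000,
}
print(to_requests_cookie_kwargs(sample))
```

With a live session, this would be used inside the loop as requests_session.cookies.set(**to_requests_cookie_kwargs(cookie)).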

4) Run tests

You can temporarily add the WebDriver directory to PATH:

# macOS, Linux
export PATH=$(pwd)/drivers:$PATH

# Windows
set PATH=%cd%\drivers;%PATH%

Run the Python script, and the following output is displayed:

$ python douban.py
Selenium version is 3.141.0
--------------------------------------------------------------------------------
open login page ...
open target page ...
saved to target.png
status_code = 200
saved to target.html

The run produces the screenshot target.png and the HTML content target.html.

Conclusion

What if the login process runs into a verification step?

  1. Slider verification can be simulated with Selenium.
    • The slide distance can be determined with an image gradient algorithm.
  2. Image CAPTCHAs can be recognized with a Python AI library.
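As a rough illustration of the gradient idea mentioned above, the sketch below finds the column where brightness changes most sharply in a small grayscale image given as nested lists. It is a toy on synthetic data, not the author's method: a real slider CAPTCHA would need an image library to load pixels, some noise handling, and an offset for the slider's starting position. The name find_gap_x is hypothetical.

```python
def find_gap_x(gray):
    """Return the column where brightness changes most sharply.

    gray is a list of rows, each a list of 0-255 grayscale values.
    """
    width = len(gray[0])
    # Sum of absolute horizontal differences for every column x >= 1.
    strength = [0] * width
    for row in gray:
        for x in range(1, width):
            strength[x] += abs(row[x] - row[x - 1])
    # max() returns the first column if several share the largest value,
    # which here is the left edge of the gap.
    return max(range(1, width), key=lambda x: strength[x])


# Synthetic 5x10 background: bright everywhere, dark gap in columns 6 to 8.
image = [[200] * 6 + [50] * 3 + [200] for _ in range(5)]
print(find_gap_x(image))
```

The returned column could then serve as the target x offset for a simulated drag, after converting image pixels to on-screen pixels.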

References

The code for this article is available as a Gist: gist.github.com/ikuokuo/116…

  • Selenium: www.selenium.dev/documentati…
  • WebDriver: www.selenium.dev/documentati…
  • requests: requests.readthedocs.io/en/latest/
  • Requestium: github.com/tryolabs/re…
  • Selenium Requests: github.com/cryzed/Sele…
