Life is short. I use Python

Portals to the previous articles in this series:

Python crawler (1): The Beginning

Python crawler (2): Preparation (1), basic library installation

Python crawler (3): Preparation (2), Linux basics

Python crawler (4): Preparation (3), Docker basics

Python crawler (5): Preparation (4), database basics

Python crawler (6): Preparation (5), crawler framework installation

Python crawler (7): HTTP basics

Python crawler (8): Web page basics

Python crawler (9): Crawler basics

Python crawler (10): Session and Cookies

Python crawler (11): Urllib

Python crawler (12): Urllib

Python crawler (13): Urllib

Python crawler (14): Urllib

Python crawler (15): Urllib

Python crawler (16): Urllib

Python crawler (17): Basic usage of Requests

Python crawler (18): Advanced Requests operations

Python crawler (19): Xpath basic operations

Python crawler (20): Advanced Xpath

Python crawler (21): Parsing library Beautiful Soup

Python crawler (22): Beautiful Soup

Python crawler (23): Getting started with the pyQuery parser

Python crawler (24): 2019 Douban movie rankings

Python crawler (25): Crawling stock information

You can't even afford to buy a second-hand house in Shanghai

Introduction

I don't know how everyone got on with the previous hands-on crawler articles, but the updates continue. In this article we introduce a tool that we already installed earlier in the series: Selenium. Calling it a "tool" may not be entirely accurate: in industry it is mostly used for automated testing, which is why the title also calls it an automated testing framework.

As for why it got this name, we won't go into the details here; foreigners certainly have a vivid imagination.

Selenium can drive a browser through its driver to perform specific actions, which makes it very friendly for crawling pages that are rendered dynamically by JavaScript.

On pages rendered dynamically by JavaScript, the JavaScript is usually compiled and bundled, so what you see is minified code that is very difficult to read.

The reason it is compiled and bundled is that the authors don't want anyone to read it, but since a browser has to be able to run it for everyone to see, that is a bit awkward...

A common bundling tool is webpack, among others.

Interested students can leave a message in the comments section; if enough people are interested, I can share some front-end content as well.

Preparation

Before we get started, if you haven't set up your environment yet, I suggest you take a look at the previous articles in this series and fix your environment first.

Make sure you have installed Chrome, configured ChromeDriver correctly, and installed the Selenium library properly.
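If chromedriver happens not to be on your PATH, you can also point Selenium at it explicitly. A minimal sketch, Selenium 3 style; the path below is only a placeholder for wherever you put your own chromedriver:

from selenium import webdriver

# The path is a placeholder -- replace it with the location of your own chromedriver
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get('https://www.baidu.com')
browser.close()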

First of all, here is the official website:

The official document: https://selenium.dev/selenium/docs/api/py/api.html

If you run into any problems, go to the official documentation first; if you can't read it, a translation tool will help.

Basic operation

With that in mind, let’s take a look at some of the basics of Selenium. Let’s start with a simple demo:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
input = browser.find_element_by_id('kw')
input.send_keys('Geek digger')
input.send_keys(Keys.ENTER)
print(browser.current_url)
print(browser.get_cookies())
print(browser.page_source)

Run the code above and a Chrome window will pop up with a notice that Chrome is being controlled by automated test software. It then opens Baidu and types "Geek digger" into the search box to search.

After the search results come out, the console prints the current URL, cookies and source code of the web page.

The console output is too long, so I won't paste it here.

As you can see, what Selenium gets is the content that is actually rendered in the browser. DOM nodes generated by JavaScript on dynamically loaded pages are also available through Selenium.

This is easy to explain: Selenium reads the page directly from what the browser has already rendered.
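As a quick (and very rough) way to convince yourself of this, you can compare the raw HTML that Requests downloads with the page source Selenium reads back from the browser. This sketch is not from the original article; Requests was installed in one of the preparation articles, and the exact difference depends entirely on the target site:

import requests
from selenium import webdriver

url = 'https://www.baidu.com'

# Raw HTML as the server returns it, before any JavaScript has run
static_html = requests.get(url).text

# The DOM after the browser has executed the page's JavaScript
browser = webdriver.Chrome()
browser.get(url)
rendered_html = browser.page_source
browser.close()

print(len(static_html), len(rendered_html))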

Declaring browser objects

Selenium supports a wide variety of browsers, such as:

from selenium import webdriver

# Each of the following creates a driver for a different browser
browser = webdriver.Android()
browser = webdriver.BlackBerry()
browser = webdriver.Chrome()
browser = webdriver.Edge()
browser = webdriver.Firefox()
browser = webdriver.Ie()
browser = webdriver.Opera()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()

We can see the familiar Internet Explorer, Edge, Firefox, Opera, and so on.

Accessing web pages

To access a web page, use the get() method, passing as the argument the URL of the site we want to visit:

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.jd.com/')
print(browser.page_source)

With the few lines of code above, the browser opens automatically, visits JD.com, and the page source of JD.com is printed to the console.

Of course, if you want the program to close the browser automatically when it is done, you can use:

browser.close()

With this line added, you can see the browser open, visit JD.com, and then close again in a flash.
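If you would like to actually see the page before the window disappears, a minimal sketch that simply waits a few seconds before closing (the 5-second pause is arbitrary and only for demonstration):

import time
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
time.sleep(5)  # keep the window open for a moment so we can see the page
browser.close()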

Finding a single node

After we get the web page, the first step is to find the DOM nodes, and then we can extract data directly from those nodes.

With Selenium, however, we can not only find nodes and retrieve data, but also simulate user actions such as typing something into a search box or clicking a button. First, let's look at how to find a node:

As you can see from the picture above, JD's search box has the id key, so to get the input field by its id, our code would look like this:

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('https://www.jd.com/')
input_key = browser.find_element_by_id('key')
print(input_key)

The results are as follows:

<selenium.webdriver.remote.webelement.WebElement (session="86d1ae1419bee22099a168dfbf921a27", element="53047804-ad39-4dfd-b3fb-a149fb1c8ac8")>

As you can see, the element type we get is WebElement.
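Once we have a WebElement, we can read information from it directly. A minimal sketch that builds on the example above; tag_name, get_attribute() and text are standard WebElement properties, and what they return naturally depends on the page:

# input_key is the WebElement found above
print(input_key.tag_name)                 # the tag of the node, e.g. 'input'
print(input_key.get_attribute('class'))   # read an attribute of the node
print(input_key.text)                     # the visible text (empty for an input box)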

Here’s a quick list of all the ways to get a single node:

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
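As an aside (not from the original article), the same search box could just as well be located with some of the other strategies above, for example:

# All three of these point at the same node on JD's homepage
input_by_id = browser.find_element_by_id('key')
input_by_css = browser.find_element_by_css_selector('#key')
input_by_xpath = browser.find_element_by_xpath('//*[@id="key"]')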

In addition, Selenium provides a generic method, find_element(), which takes two arguments: the locating strategy By and a value. In fact, the lookup in the example above could also have been written like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()

browser.get('https://www.jd.com/')
input_key1 = browser.find_element(By.ID, 'key')
print(input_key1)

I won't paste the result here; you can run it yourself and compare.

Finding multiple nodes

Let’s say we want to find all the items in the navigation bar on the left:

It could be written like this:

lis = browser.find_elements_by_css_selector('.cate_menu li')
print(lis)

The results are as follows:

[<selenium.webdriver.remote.webelement.WebElement (session="6341ab4f39733b5f6b6bd51508b62f1d", element="8e0d1a8c-d5dc-4b1f-8250-7f0eca864ea7")>, <selenium.webdriver.remote.webelement.WebElement (session="6341ab4f39733b5f6b6bd51508b62f1d", element="15cd4dc9-42f4-4ed7-9258-9aa29073243c")>, 
......]

There are too many results, so the rest are omitted here.

The following lists all the multi-node selection methods:

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

There is also a generic find_elements() method for multi-node selection.
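To show what you would typically do with the result, here is a minimal sketch that loops over the navigation items found above and prints their text (assuming the same JD homepage and the same .cate_menu li selector as before):

lis = browser.find_elements_by_css_selector('.cate_menu li')
for li in lis:
    print(li.text)  # the visible text of each navigation entry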

That's it for this article; we'll continue with interactive operations in the next one.

Sample code

All of the code in this series will be available on Github and Gitee.

Example code -Github

Example code -Gitee

Reference

https://selenium-python.readthedocs.io/api.html

https://cuiqingcai.com/5630.html