Hello everyone, I'm Xiao Cai, a man who wants to be a man who talks about architecture! If you also want to become the person I want to become, follow me and keep me company, so Xiao Cai is no longer lonely!

This article focuses on getting started with Python

Refer to it if necessary

If it helps, don't forget to like and support!

My WeChat official account, "The Vegetable Farmer Says", is now live. If you haven't followed it yet, remember to follow it!

Hello everybody. This is Cai Bu Cai, formerly known as "The Vegetable Farmer Says". Don't get lost just because the name and profile picture changed!

Recently, to broaden my language horizons, I spent this week learning how to play with Python. After picking it up I thought: wow, this is good stuff. When you first learn a language, have you ever felt that it is kind of fun and you want to try everything with it?

When it comes to Python, people's first reactions are probably crawlers and automated testing; Web development with Python comes up far less often. Comparatively speaking, Java is still the most popular language for Web development in China, but that doesn't mean Python is unsuited for the Web. As far as I know, its most commonly used Web frameworks are Django, Flask, and so on.

Django is a heavyweight framework that provides many handy tools and encapsulates a great deal, so you don't have to reinvent the wheel yourself.

Flask's advantage is that it is small, but that is also its disadvantage: being flexible means you have to build more wheels yourself, or spend more time on configuration.

But this article is about neither Web development in Python nor Python basics; it is about getting started with automated testing and crawlers in Python.

In my opinion, if you already have development experience in another language, Xiao Cai suggests starting straight from examples and learning as you read; the syntax and such are much the same (later we will learn Python by drawing comparisons with Java), and you can basically read the code in one pass. But if you have no development experience in any language, then learn Python from scratch; videos and books are both good choices. Liao Xuefeng's blog is recommended here; it contains a good Python tutorial.

I. Automated testing

Python can do a lot of things, and many of them are interesting things.

Of course, when learning a language you have to find something interesting to do so that you learn faster. For example, say you want to scrape the pictures or videos off some website, right ~

What is automated testing? In a word: you write a script (a .py file), run it, and it automatically executes your testing process in the background. And there is a great tool that can help you with automated testing: that tool is Selenium.

Selenium is a Web automation testing tool that makes it easy to simulate a real user operating the browser. It supports all the major browsers, such as Internet Explorer, Chrome, Firefox, Safari, Opera, etc. Here we use Python for the demonstration, but Selenium doesn't just support Python; it has client drivers for multiple programming languages.

1) Preparation

In order for the demo to go smoothly, we need to do some preparation first; otherwise the browser may not open properly.

Step 1

First check the browser version. We are using Edge, so we can enter edge://version in the address bar to check the version, then go to the driver store and install the matching driver version: Microsoft Edge WebDriver (Windows.net).

Step 2

Then unzip the downloaded driver file into the Scripts folder of your Python installation directory.
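If you'd rather not put the driver on the PATH, you can also point Selenium at the driver file explicitly. A minimal sketch, assuming Selenium 4 and a made-up driver path:

from selenium import webdriver
from selenium.webdriver.edge.service import Service

# The path below is hypothetical; point it at wherever you unzipped msedgedriver.exe
service = Service(executable_path="D:/drivers/msedgedriver.exe")
driver = webdriver.Edge(service=service)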

2) Browser operation

With the preparation done, let's look at the following simple piece of code:
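A minimal sketch of autotest.py, assuming the Edge setup from the preparation step:

from selenium import webdriver

# Create the Edge browser object, maximize the window, and open Baidu
driver = webdriver.ChromiumEdge()
driver.maximize_window()
driver.get("http://www.baidu.com")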

Including the import, that's just 4 lines of code. Enter python autotest.py in the terminal and you get the following demonstration:

You can see that the script automatically opens the browser, maximizes the window, and opens the Baidu page: three automatic operations. Our learning has moved a step forward. Doesn't it feel a little interesting? Let yourself sink in bit by bit!

Here are a few common approaches to browser manipulation:

| Method | Description |
| --- | --- |
| webdriver.xxx() | Create a browser (driver) object |
| maximize_window() | Maximize the window |
| get_window_size() | Get the browser window size |
| set_window_size() | Set the browser window size |
| get_window_position() | Get the browser window position |
| set_window_position(x, y) | Set the browser window position |
| close() | Close the current tab/window |
| quit() | Close all tabs/windows |
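A quick sketch exercising a few of these methods; the values shown in the comments are illustrative, not guaranteed:

from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.set_window_size(1024, 768)    # Resize the window
print(driver.get_window_size())      # e.g. {'width': 1024, 'height': 768}
driver.set_window_position(100, 50)  # Move the window
print(driver.get_window_position())  # e.g. {'x': 100, 'y': 50}
driver.quit()                        # Close all tabs/windows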

These are, of course, Selenium's basic browser operations; better things are yet to come ~

When we open the browser, we naturally want to do more than just open a web page; after all, a programmer's ambition is boundless! We also want to operate on page elements automatically, which brings us to Selenium's locating operations.

3) Locate elements

Page element locating is nothing strange to front-end developers; with JS it is very easy to locate elements, for example:

  • Locate by ID

document.getElementById("id")

  • Locate by name

document.getElementsByName("name")

  • Locate by tag name

document.getElementsByTagName("tagName")

  • Locate by class name

document.getElementsByClassName("className")

  • Locate via CSS selector

document.querySelectorAll("css selector")

Selenium, being an automated testing tool, naturally provides its own ways of locating page elements; there are eight of them, as follows:

  1. ID locating

driver.find_element_by_id("id")

When we open the Baidu page, we can find that the ID of the input box is kw.

Once we know the element ID, we can use the ID to locate the element as follows

from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu home page
driver.get("http://baidu.com")

# Locate the element by ID
i = driver.find_element_by_id("kw")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")

  2. name attribute locating

driver.find_element_by_name("name")

Locating by name is similar to locating by ID: find the element's name value, then call the corresponding API. Usage is as follows:

from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu home page
driver.get("http://baidu.com")

# Locate the element by its name attribute
i = driver.find_element_by_name("wd")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")
  3. Class name locating

driver.find_element_by_class_name("className")

The usage is the same as for ID and name: find the corresponding className and use it to locate, as in the sketch below ~
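A minimal sketch, assuming Baidu's search box carries the class name s_ipt (verify the real class name in the developer tools first):

from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://www.baidu.com")
# "s_ipt" is an assumed class name; check it with F12 before relying on it
i = driver.find_element_by_class_name("s_ipt")
i.send_keys("The vegetable farmer said.")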

  4. Tag name locating

driver.find_element_by_tag_name("tagName")

We rarely use this approach in daily work, because HTML defines things through tags: an input is input, a table is table... Every element is a tag, and one tag is often used to define a whole class of things, so a page may contain multiple divs, inputs, tables, and so on. That makes it hard to locate an element precisely by tag alone, as the sketch below shows.
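A small sketch of why tag locating is imprecise; a tag lookup typically matches many elements at once:

from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://www.baidu.com")
# find_elements (plural) returns a list of every matching tag
inputs = driver.find_elements_by_tag_name("input")
print("input tags on this page: %d" % len(inputs))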

  5. CSS selector locating

driver.find_element_by_css_selector("cssVale")

This approach requires knowing the five kinds of CSS selectors:

The five big CSS selectors

  1. Element selector

The most common CSS selector is the element selector, which in HTML documents usually refers to an HTML element, such as:

html { background-color: black; }
p { font-size: 30px; background-color: gray; }
h2 { background-color: red; }
  2. Class selector

A dot . followed by the class name forms a class selector, for example:

.deadline { color: red; }
span.deadline { font-style: italic; }
  3. ID selector

ID selectors are somewhat similar to class selectors, but the difference is significant: an element can carry several classes in its class attribute, yet it can only have one unique ID attribute. To use an ID selector, prefix the ID value with a hash #, for example:

#top { ...}
  4. Attribute selector

We can select elements based on their attributes and their values, for example:

a[href][title] { ...}
  5. Descendant (derived) selector

Also known as a context selector, it uses the document DOM structure for CSS selection. Such as:

body li { ...}
h1 span { ...}

Of course, this is only a brief introduction to the selectors; for more, consult the CSS documentation yourself ~

Now that we know the selectors, we can happily use CSS selectors to locate elements:

from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu home page
driver.get("http://baidu.com")

# Locate the element with an ID selector
i = driver.find_element_by_css_selector("#kw")
# Type a value into the input box
i.send_keys("The vegetable farmer said.")
  6. Link text locating

driver.find_element_by_link_text("linkText")

This method is used specifically to locate text links. For example, on Baidu's home page we can see link elements such as News, hao123, Map, and so on.

Then we can locate such an element by its link text:

from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu home page
driver.get("http://baidu.com")

# Locate the element by its link text and click it
driver.find_element_by_link_text("hao123").click()

  7. Partial link text locating

driver.find_element_by_partial_link_text("partialLinkText")

This approach is a complement to link_text: sometimes the text of a hyperlink is very long, and typing all of it would be cumbersome and ugly.

In fact, we only need to cut out a portion of the string for Selenium to understand what we are selecting, using the partial_link_text method, as sketched below.
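A minimal sketch, assuming the fragment "hao" matches only the hao123 link on the page:

from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.maximize_window()
driver.get("http://www.baidu.com")
# Locate by a fragment of the link text and click it
driver.find_element_by_partial_link_text("hao").click()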

  8. XPath expression locating

driver.find_element_by_xpath("xpathName")

Ideally, every element would have a unique id, name, class, or link-text attribute, and we could locate it by that unique value. But sometimes the element we want to locate has no id, name, or class attribute, or several elements share the same attribute values, or the values change every time the page refreshes. In that case we can only locate by XPath or CSS. And of course you don't have to work out the XPath by hand: just open the developer tools with F12, find the element on the page, right-click it, and choose Copy XPath.

Then position it in the code:

from selenium import webdriver

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu home page
driver.get("http://www.baidu.com")

# Locate the element by XPath and type into it
driver.find_element_by_xpath("//*[@id='kw']").send_keys("The vegetable farmer said.")

4) Element operation

Of course we don't just want to select elements; we want to act on them after selecting them. In the demos above we already used click() and send_keys("value"); here are a few more operations:

| Method | Description |
| --- | --- |
| click() | Click the element |
| send_keys("value") | Simulate keyboard input |
| clear() | Clear the element's content, e.g. an input box |
| submit() | Submit the form |
| text | Get the element's text content |
| is_displayed() | Check whether the element is visible |
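A short sketch chaining a few of these operations on the Baidu search box:

from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://www.baidu.com")

i = driver.find_element_by_id("kw")
i.send_keys("The vegetable farmer said.")
print(i.is_displayed())  # True if the input box is visible
i.clear()                # Clear what we just typed
i.send_keys("python")
i.submit()               # Submit the surrounding form, like pressing Enter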

Do these look familiar? They are exactly the basic operations we saw in JS ~!

5) Practical exercises

Having learned the operations above, we can simulate a shopping flow on the Xiaomi mall. The code is as follows:

from selenium import webdriver

item_url = "https://www.mi.com/buy/detail?product_id=10000330"

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the shopping page
driver.get(item_url)
# Implicit wait, so a slow network doesn't break the flow before the page loads
driver.implicitly_wait(30)

# Select the address
driver.find_element_by_xpath("//*[@id='app']/div[3]/div/div/div/div[2]/div[2]/div[3]/div/div/div[1]/a").click()
driver.implicitly_wait(10)
# Click to choose the address manually
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div["
                             "1]/div/div/div[2]/span[1]").click()
# Select Fujian province
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[13]").click()
driver.implicitly_wait(10)
# Select the city
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the district
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the street
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(20)

# Click "add to cart"
driver.find_element_by_class_name("sale-btn").click()
driver.implicitly_wait(20)

# Click "go to cart to check out"
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div[1]/div[2]/a[2]").click()
driver.implicitly_wait(20)

# Click "check out"
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div/div/div[1]/div[4]/span/a").click()
driver.implicitly_wait(20)

# Click to agree to the terms
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div[3]/button[1]").click()

The effect is as follows:

This exercise puts our learning into practice. Of course, when a flash sale comes around, you might as well write a script like this to practice ~ :boom:. And if the item is out of stock, we can add a while loop to keep polling the page!

II. Crawlers

Above we demonstrated how to use Selenium for automated testing. Next, let's demonstrate another powerful capability of Python: writing crawlers.

Before learning about crawlers, we need to understand a few necessary tools

1) Page loader

The Python standard library already provides modules such as urllib, urllib2, and httplib for making HTTP requests, but their APIs are not elegant ~ they demand a huge amount of work, including overriding various methods, just to complete the simplest tasks. Naturally programmers can't tolerate that, so people have developed a variety of excellent third-party libraries to use instead ~

  • requests

Requests is an HTTP library written in Python under the Apache2 license. It is a high-level wrapper over Python's built-in modules, letting users easily perform all the operations available in a browser when making network requests (a short sketch follows this list).

  • scrapy

The difference between requests and Scrapy is that Scrapy is a heavier framework: it is a site-level crawler, whereas requests is a page-level library, with less built-in concurrency and performance tooling than Scrapy.
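As a quick taste of how concise the requests API is, here is a minimal sketch (the URL is just an example):

import requests

# One call issues the GET request; requests handles connections and decoding
resp = requests.get("https://www.liaoxuefeng.com/", timeout=10)
print(resp.status_code)                  # e.g. 200
print(resp.headers.get("Content-Type"))  # Response headers behave like a dict
html = resp.text                         # Decoded response body, ready to parse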

2) Page parsers

  • BeautifulSoup

BeautifulSoup is a module that takes an HTML or XML string, formats it, and then lets you quickly find specified elements in the HTML or XML using the methods it provides (a small sketch follows below).

  • scrapy.Selector

Selector is based on parsel, a relatively advanced wrapper: it selects parts of an HTML document via specific XPath or CSS expressions. It is built on top of the lxml library, which means the two are very similar in speed and parsing accuracy.

See the Scrapy documentation for details.
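Here is the small BeautifulSoup sketch promised above, run on an inline document instead of a real page:

from bs4 import BeautifulSoup

# A tiny document stands in for a fetched page
html_doc = "<div class='x-wiki-content'><p>Hello, Python!</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
# Find an element by class and read its text
div = soup.find("div", class_="x-wiki-content")
print(div.text)            # -> Hello, Python!
print(soup.find_all("p"))  # Every <p> tag in the document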

3) Data storage

Once we have crawled the content down, we need a corresponding storage medium to keep it in; a short code sketch of the first two options follows the list below.

Detailed database operations will be covered in a future Web development blog post

  • TXT text

Stored via ordinary file operations.

  • sqlite3

SQLite is a lightweight database: an ACID-compliant relational database management system contained in a relatively small C library.

  • mysql

No need for much introduction; those who know, know. The old flame of Web development.
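Here are the storage sketches promised above: minimal examples with made-up file names and data. sqlite3 ships with Python, so nothing extra needs installing:

# --- TXT: plain file operations ---
content = "chapter text..."
with open("chapter1.txt", "a", encoding="utf8") as f:
    f.write(content)

# --- sqlite3: a lightweight embedded database ---
import sqlite3

conn = sqlite3.connect("books.db")  # Creates the file if it doesn't exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS chapter (title TEXT, content TEXT)")
cur.execute("INSERT INTO chapter VALUES (?, ?)", ("chapter1", content))
conn.commit()                       # Persist the insert
conn.close()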

4) Practical exercises

A web crawler, more precisely called web data collection, is easier to understand than it sounds: it programmatically requests data (an HTML page) from a web server, then parses the HTML to extract the data you want.

We can roughly divide it into three steps:

  • Get the HTML data for the given URL
  • Parse the HTML to extract the target data
  • Store the data

Of course, all this requires that you understand the simple syntax of Python and the basic operations of HTML

Let's practice with the combination requests + BeautifulSoup + text file. Suppose we want to crawl Liao Xuefeng's Python tutorial ~

# Import the requests library
import requests
# Import file-operation libraries
import codecs
import os
from bs4 import BeautifulSoup
import sys
import importlib

importlib.reload(sys)

# Give the request a header so it mimics Chrome
global headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
server = 'https://www.liaoxuefeng.com/'
# Address of Liao Xuefeng's Python tutorial
book = 'https://www.liaoxuefeng.com/wiki/1016959663602400'
# Define the storage location
global save_path
save_path = 'D:/books/python'
if os.path.exists(save_path) is False:
    os.makedirs(save_path)


# Get the content of one chapter
def get_contents(chapter):
    req = requests.get(url=chapter, headers=headers)
    html = req.content
    html_doc = str(html, 'utf8')
    bf = BeautifulSoup(html_doc, 'html.parser')
    texts = bf.find_all(class_="x-wiki-content")
    # Take the text of the x-wiki-content div; \xa0 is a non-breaking space
    content = texts[0].text.replace('\xa0' * 4, '\n')
    return content


# Write a chapter to a file
def write_txt(chapter, content, code):
    with codecs.open(chapter, 'a', encoding=code) as f:
        f.write(content)


# Main method
def main():
    res = requests.get(book, headers=headers)
    html = res.content
    html_doc = str(html, 'utf8')
    # Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Get all chapter links
    a = soup.find('div', id='1016959663602400').find_all('a')
    print('Total entries: %d' % len(a))
    for each in a:
        try:
            chapter = server + each.get('href')
            content = get_contents(chapter)
            chapter = save_path + "/" + each.string.replace("?", "") + ".txt"
            write_txt(chapter, content, 'utf8')
        except Exception as e:
            print(e)


if __name__ == '__main__':
    main()

When we run the program, we can see the crawled tutorial content under D:/books/python!

With that, we have implemented a simple crawler. But do crawl carefully ~!

This article has explored Python's uses from two angles: automated testing and crawlers.

No empty talk, no laziness; be a program ape who talks big and builds architecture together with Xiao Cai ~ Follow me and keep me company, so Xiao Cai is no longer lonely. See you later!

Work a little harder today, and tomorrow you'll have one less favor to beg for!

I am Xiao Cai, a man who grows stronger together with you. 💋

My WeChat official account, "The Vegetable Farmer Says", is now live. If you haven't followed it yet, remember to follow it!