

Author: Cui Qingcai

Preface

In the last section we learned the basics of PhantomJS, which is essentially a headless browser (a browser with no interface) that runs JavaScript. But is that enough to write a crawler? What does it have to do with Python? Where is the Python crawler? We have covered all those libraries and this is what you show me? Don't worry, dear reader: this section introduces the tool that answers all of these questions.

Introduction

What is Selenium? In a word: an automated testing tool. It supports a variety of browsers, including mainstream graphical browsers such as Chrome, Safari and Firefox. With a Selenium plug-in installed in one of these browsers, web interface testing becomes straightforward. In other words, Selenium can drive these browsers. PhantomJS is a browser too, so does Selenium support it? Yes, it does, and the two work together seamlessly.

And the good news does not stop there: Selenium can be used from many languages, such as Java, C# and Ruby. What about Python? Of course it is supported!

So the answer is Selenium + PhantomJS + Python. PhantomJS renders the page and executes its JavaScript, Selenium drives the browser and connects it to Python, and Python does the post-processing: a perfect trio!

Someone might ask: why use the headless PhantomJS instead of a regular browser? The answer is efficiency: with no interface to draw, there is less overhead.

Selenium has two major versions. At the time of writing, the latest release is 2.53.1 (2016/3/22).

Selenium 2, also known as WebDriver, has one major new feature: it integrates WebDriver, once a competitor of Selenium, into Selenium 1.0. In other words, Selenium 2 is a merger of the Selenium and WebDriver projects. It is compatible with Selenium 1 and supports both the Selenium and WebDriver APIs.

See the WebDriver introduction for more details.

With that, we should have a general idea of what Selenium is, so let's enter the new world of dynamic crawling.

This article draws on the Selenium website and the Selenium Python documentation.

Installation

The server used here is a Tencent Cloud instance.


First, install Selenium with pip:

pip install selenium

Alternatively, download the source code, unpack it, and run the following command to install:

python setup.py install

Once it is installed, we can start exploring how to scrape with it.

Quick start

A first look

Let's get a taste of Selenium with a small example. We test it in Chrome here so that we can see what happens, and switch to PhantomJS when it is time to actually crawl.

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')

Run this code and it will automatically open the browser and visit Baidu.

If the program fails and no browser opens, then either Chrome is not installed or the Chrome driver cannot be found through the PATH environment variable. Download the browser driver and add the location of the driver file to PATH.

For example, since I am on Mac OS, I simply put the downloaded file in /usr/bin.
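Before launching the browser, you can ask Python whether the driver can actually be found on PATH. This is only a minimal sketch: it assumes the Chrome driver executable has its usual name, chromedriver, and the helper name driver_on_path is made up for this example.

```python
import shutil


def driver_on_path(executable="chromedriver"):
    """Return True if the named driver executable can be found on PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if driver_on_path():
        print("chromedriver found on PATH")
    else:
        print("chromedriver not found: download it and add its folder to PATH")
```

If the check reports that the driver is missing, fixing PATH first saves a confusing failure when webdriver.Chrome() is called.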

Simulating a submission

The following code simulates submitting a search: it waits for the page to finish loading, types text into the search box, and presses Enter to submit.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("http://www.python.org")
assert "Python" in driver.title
# Locate the search box by its name attribute
elem = driver.find_element_by_name("q")
# Type the query, then press Enter to submit it
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
# Print the source of the rendered page
print(driver.page_source)

Again, run it in Chrome to see it in action.

The driver.get method navigates to the page at the given URL. WebDriver waits until the page has fully loaded (that is, until the "onload" event has fired) before returning control to your script. Note that if the page loads a lot of content through Ajax, WebDriver may not know when it has completely loaded.

WebDriver offers a number of ways to find elements through the find_element_by_* methods. For example, a text input can be located by its name attribute with the find_element_by_name method.

Next we send keys, which is like typing on the keyboard. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys. Here we type the text and then simulate pressing Enter.

Last but not least, we get the source of the page after it has been rendered: simply print the page_source property.

In this way, we can crawl dynamic web pages.
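Once the rendered source is in hand, the rest is ordinary Python. As a rough sketch of that post-processing step, the snippet below pulls every link out of an HTML string with the standard library's html.parser module. The sample markup and the names LinkCollector and extract_links are invented for illustration; in a real crawl the string would come from driver.page_source.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Stand-in for driver.page_source (made-up markup).
sample_source = """
<html><body>
  <a href="/about">About</a>
  <a href="https://www.python.org/downloads/">Downloads</a>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

print(extract_links(sample_source))
```

The third anchor has no href attribute, so only the first two links are collected.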

Test cases

With these features, we can of course also write test cases.

import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Chrome()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()

Run the program: the functionality is the same, but this time it is wrapped in a standard test class.

The test case class inherits from unittest.TestCase, which tells the unittest module that this is a test case. The setUp method handles initialization and is called before every test method in the class. Test method names must start with test so that they are run automatically. The tearDown method is called after every test method and is the place for cleanup. There you can call either close or quit: quit exits the entire browser, whereas close closes one tab, although if it is the only tab open, most browsers will exit entirely by default.
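To see this calling order without opening a browser, here is a small sketch that uses only unittest. The LifecycleDemo class, the calls list and run_demo are made up for this illustration; they simply record the order in which unittest invokes each method.

```python
import io
import unittest

calls = []  # records the order in which unittest invokes each method


class LifecycleDemo(unittest.TestCase):

    def setUp(self):
        calls.append("setUp")

    def test_a(self):
        calls.append("test_a")

    def test_b(self):
        calls.append("test_b")

    def tearDown(self):
        calls.append("tearDown")


def run_demo():
    """Run both tests quietly and return the recorded call order."""
    calls.clear()
    suite = unittest.TestLoader().loadTestsFromTestCase(LifecycleDemo)
    # Send the runner's report to an in-memory buffer to keep the output clean.
    unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return list(calls)


print(run_demo())
```

Running it prints ['setUp', 'test_a', 'tearDown', 'setUp', 'test_b', 'tearDown'], so setUp and tearDown wrap every single test method.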

Page operations

Page interaction

Just fetching pages isn't much use; what we really need is to interact with them, for example by clicking and typing. The prerequisite is finding the elements on the page, and WebDriver provides a variety of ways to do so. For example, here is a form input box.

<input type="text" name="passwd" id="passwd-id" />

We can get it in any of these ways:

element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
elements = driver.find_elements_by_tag_name("input")  # plural form: returns a list
element = driver.find_element_by_xpath("//input[@id='passwd-id']")

You can also look up a link by its text, but be careful: the text has to match exactly, so this is not a very robust way to match.

Also note that when several elements match an XPath, only the first match is returned. If nothing matches, NoSuchElementException is raised.
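When a missing element is an expected outcome rather than an error, one common workaround is to call the plural find_elements, which returns an empty list instead of raising. The sketch below wraps that in a helper. The names find_first_or_none and FakeDriver are hypothetical, and the stand-in driver exists only so the snippet runs without a browser; the string "xpath" is the value of By.XPATH.

```python
def find_first_or_none(driver, xpath):
    """Return the first element matching xpath, or None if nothing matches."""
    # find_elements returns a list (possibly empty) and does not raise
    # NoSuchElementException, unlike find_element.
    matches = driver.find_elements("xpath", xpath)
    return matches[0] if matches else None


class FakeDriver:
    """Stand-in for a WebDriver: maps XPath strings to canned results."""

    def __init__(self, pages):
        self._pages = pages

    def find_elements(self, by, value):
        return list(self._pages.get(value, []))


driver = FakeDriver({"//input": ["<input #1>", "<input #2>"]})
print(find_first_or_none(driver, "//input"))   # first of the two matches
print(find_first_or_none(driver, "//select"))  # nothing matches
```

With a real driver the helper is called the same way, and the caller checks for None instead of catching the exception.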

Once you have the element, the next step is of course to type into it, using the following method:

element.send_keys("some text")

You can also use the Keys class to simulate pressing a key.

element.send_keys("and some", Keys.ARROW_DOWN)

The send_keys method can be called on any element, which makes it possible to test keyboard shortcuts such as those used in Gmail. A side effect is that typing into a text field does not clear it automatically: the new text is appended to what is already there. To clear the content of an input, use the following method.

element.clear()

This clears the text already entered.
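The difference is easy to see in a small sketch. The replace_text helper and the FakeInput stand-in are hypothetical names; FakeInput only imitates the append-and-clear behaviour described above so that the snippet runs without a browser.

```python
def replace_text(element, text):
    """Clear an input element, then type text into it."""
    element.clear()
    element.send_keys(text)


class FakeInput:
    """Stand-in for a text input: send_keys appends, clear empties."""

    def __init__(self, value=""):
        self.value = value

    def send_keys(self, text):
        self.value += text

    def clear(self):
        self.value = ""


field = FakeInput("old")
field.send_keys("new")
print(field.value)  # the new text was appended to the existing content
replace_text(field, "new")
print(field.value)  # only the new text remains
```

The first print shows oldnew, the second shows new.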

Filling in forms

We already know how to enter text into a text field, but what about other form elements? For example, a drop-down list can be handled as follows:

element = driver.find_element_by_xpath("//select[@name='name']")
all_options = element.find_elements_by_tag_name("option")
for option in all_options:
    print("Value is: %s" % option.get_attribute("value"))
    option.click()

We first get the SELECT element, that is, the drop-down list, then print the value of each of its options and select each one in turn. As you can see, this is not a very efficient approach.

In fact, WebDriver provides a class called Select to help us do this.

from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_name('name'))
select.select_by_index(index)
select.select_by_visible_text("text")
select.select_by_value(value)

As you can see, options can be selected by index, by visible text, or by value. It's very convenient.

What about deselecting everything? Very simple:

select = Select(driver.find_element_by_id('id'))
select.deselect_all()

This deselects all options. Note that deselect_all only works on a SELECT that allows multiple selections.

We can also obtain all currently selected options as follows:

select = Select(driver.find_element_by_xpath("xpath"))
all_selected_options = select.all_selected_options

And all available options are obtained with:

options = select.options

After filling in a form, we eventually have to submit it. How? Very simple:

driver.find_element_by_id("submit").click()

This simulates clicking the Submit button to submit the form.

Alternatively, you can call the element.submit() method on an individual element. WebDriver then looks for the form that contains the element, and if the element is not inside a form, it raises NoSuchElementException.

Drag and drop

To drag and drop, first specify the element to drag and the target element to drop it on, then use the ActionChains class to perform the action.

from selenium.webdriver import ActionChains

element = driver.find_element_by_name("source")
target = driver.find_element_by_name("target")

action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

This drags the element from source to target.

Page switching

A browser is bound to have many windows, so there must be a way to switch between them. The method for switching windows is as follows:

driver.switch_to_window("windowName")

Alternatively, you can use the window_handles property to get the handle of every window. For example,

for handle in driver.window_handles:
    driver.switch_to_window(handle)
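A common use of window_handles is to find a window by what it shows. The sketch below loops over the handles and stops at the first window whose title contains a given string. The names switch_to_window_titled and FakeWindowDriver are hypothetical; the stand-in only mimics the few attributes used here so that the snippet runs without a browser, and a real driver object would be passed in its place.

```python
def switch_to_window_titled(driver, title):
    """Switch to the first window whose title contains `title`.

    Returns that window's handle, or None if no window matches, in which
    case focus is restored to the window that was active before the call.
    """
    original = driver.current_window_handle
    for handle in driver.window_handles:
        driver.switch_to_window(handle)
        if title in driver.title:
            return handle
    driver.switch_to_window(original)
    return None


class FakeWindowDriver:
    """Stand-in driver holding a mapping of window handle -> page title."""

    def __init__(self, titles):
        self._titles = titles
        self.window_handles = list(titles)
        self.current_window_handle = self.window_handles[0]

    def switch_to_window(self, handle):
        self.current_window_handle = handle

    @property
    def title(self):
        return self._titles[self.current_window_handle]


driver = FakeWindowDriver({"w1": "Home", "w2": "Search results"})
print(switch_to_window_titled(driver, "Search"))   # handle of the matching window
print(switch_to_window_titled(driver, "Missing"))  # no window matches
```

Restoring the original window when nothing matches keeps later calls from running against whichever window the loop happened to visit last.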

Frames are switched in a similar way:

driver.switch_to_frame("frameName.0.child")

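Sub-frames are reached by separating the path with dots, and a frame can also be given by its index. The call above switches focus to the frame named child inside the first sub-frame of the frame called frameName.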

Handling pop-ups

When an action triggers a pop-up message on the page, how do you handle it or read its text?

alert = driver.switch_to_alert()

This returns an object representing the pop-up.
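With that object you can read the message through its text attribute and confirm or cancel the pop-up with accept() or dismiss(). Here is a minimal sketch; the names read_and_accept_alert, FakeAlert and FakeAlertDriver are hypothetical, and the two stand-in classes exist only so that the snippet runs without a browser.

```python
def read_and_accept_alert(driver):
    """Return the text of the current pop-up and confirm it."""
    alert = driver.switch_to_alert()
    message = alert.text  # the message shown in the pop-up
    alert.accept()        # like clicking OK; alert.dismiss() would cancel
    return message


class FakeAlert:
    """Stand-in for the pop-up object: holds a message and an accepted flag."""

    def __init__(self, text):
        self.text = text
        self.accepted = False

    def accept(self):
        self.accepted = True


class FakeAlertDriver:
    """Stand-in driver that always returns the same pop-up."""

    def __init__(self, alert):
        self._alert = alert

    def switch_to_alert(self):
        return self._alert


popup = FakeAlert("Are you sure?")
print(read_and_accept_alert(FakeAlertDriver(popup)))
print(popup.accepted)
```

The helper returns the message so that a crawler can log it or act on it before moving on.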

History

So how do you move forward and backward through the browser history?

driver.forward()
driver.back()

Cookies

Add a cookie to the current page as follows:

# Go to the correct domain
driver.get("http://www.example.com")

# Now set the cookie. This one's valid for the entire domain
cookie = {'name': 'foo', 'value': 'bar'}
driver.add_cookie(cookie)

Get the cookies of the page as follows:

# Go to the correct domain
driver.get("http://www.example.com")

# And now output all the available cookies for the current URL
driver.get_cookies()

That is all there is to handling cookies, which is also very simple.
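get_cookies() returns a list of dictionaries, one per cookie, each with at least a name and a value key. For post-processing it is often handier to flatten that list into a single name-to-value mapping. A small sketch, in which the helper name cookies_to_dict and the sample data are made up for illustration:

```python
def cookies_to_dict(cookies):
    """Flatten a list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Made-up sample in the shape returned by driver.get_cookies().
sample_cookies = [
    {"name": "foo", "value": "bar", "domain": "www.example.com", "path": "/"},
    {"name": "session", "value": "abc123", "domain": "www.example.com", "path": "/"},
]

print(cookies_to_dict(sample_cookies))
```

This prints {'foo': 'bar', 'session': 'abc123'}.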

Element selection

The following API methods select a single element:

find_element_by_id

find_element_by_name

find_element_by_xpath

find_element_by_link_text

find_element_by_partial_link_text

find_element_by_tag_name

find_element_by_class_name

find_element_by_css_selector

Multiple element selection

find_elements_by_name

find_elements_by_xpath

find_elements_by_link_text

find_elements_by_partial_link_text

find_elements_by_tag_name

find_elements_by_class_name

find_elements_by_css_selector

You can also use the By class to specify which selection method to use:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

Some of the attributes of the By class are as follows:

ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

See the element selection chapter of the official documentation for more details on how to select elements.

Page waiting

This part is very important. More and more web pages use Ajax, so a program cannot be sure when an element has finished loading. This makes locating elements harder and raises the probability of an ElementNotVisibleException.

So Selenium provides two kinds of waits: implicit and explicit.

An implicit wait tells WebDriver to keep looking for up to a fixed amount of time whenever an element is not immediately available, while an explicit wait specifies a condition and waits until that condition holds.

Explicit waiting

An explicit wait specifies a condition together with a maximum wait time. If the condition is still not met when the time runs out, a TimeoutException is raised.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://somedomain/url_that_delays_loading")
try:
    # Wait up to 10 seconds for the element to appear in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

By default, WebDriverWait checks the condition every 500 ms until it succeeds, and it returns immediately if the element already exists.
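The polling behaviour is easy to picture with a plain-Python sketch. This is not Selenium's actual implementation, just an illustration of the idea: wait_until, WaitTimeout and the demo condition are hypothetical names, and WaitTimeout stands in for Selenium's TimeoutException.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() every poll_interval seconds until it returns a
    truthy value, then return that value. Raise WaitTimeout once the
    timeout has elapsed without success."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Demo: a condition that only succeeds on its third check.
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2.0, poll_interval=0.01))
print(attempts["count"])
```

The condition is checked once up front, which is why a wait returns at once when the element is already there; in the demo it takes three checks before the value element comes back.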

Here are some built-in wait conditions that you can use directly instead of writing your own.

  • title_is
  • title_contains
  • presence_of_element_located
  • visibility_of_element_located
  • visibility_of
  • presence_of_all_elements_located
  • text_to_be_present_in_element
  • text_to_be_present_in_element_value
  • frame_to_be_available_and_switch_to_it
  • invisibility_of_element_located
  • element_to_be_clickable (the element is displayed and enabled)
  • staleness_of
  • element_to_be_selected
  • element_located_to_be_selected
  • element_selection_state_to_be
  • element_located_selection_state_to_be
  • alert_is_present
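
For example, to wait until an element is clickable:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))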

Implicit waiting

An implicit wait is simpler: just set a wait time in seconds.

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10) # seconds
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")

If it is not set, the default wait time is 0.

Application framework

For page testing and analysis, the official documentation offers a fairly clear code structure (the page test architecture) that you can use as a reference.

API

Finally, and most importantly, there is the API reference. I hope you will practice with it often.

Conclusion

That covers the basic use of Selenium: page interaction and retrieving the source after the page has been rendered. This way, even pages rendered by JavaScript can be scraped with ease. That's it!

Related reading

Building a Selenium + PhantomJS environment on Tencent Cloud Ubuntu


Published by the Tencent Cloud technology community with the author's authorization. Please credit the source when reproducing. Original link: www.qcloud.com/community/a…