When it comes to Python crawlers, are requests, bs4, and scrapy the only options? One crawler library that has already earned 3k+ stars on GitHub is MechanicalSoup:

This article will explain the crawler package from the following dimensions:

  • What are the features of MechanicalSoup
  • Where does MechanicalSoup fit in
  • A code walkthrough of the MechanicalSoup workflow

MechanicalSoup introduction

MechanicalSoup not only scrapes data from websites like a regular crawler package, but can also automate interaction with websites through simple commands. Under the hood it is built on the BeautifulSoup (also known as bs4) and Requests libraries, so if you are familiar with those two, you will feel right at home.

Therefore, MechanicalSoup can be very useful if you need to interact with a site repeatedly during development, such as clicking buttons or filling out forms. Next, let's go straight to the code to see how this crawler package works.

MechanicalSoup installation

# Direct installation
pip install mechanicalsoup

# Download and install the development version from GitHub
pip install git+https://github.com/MechanicalSoup/MechanicalSoup

MechanicalSoup

We will use two examples to show how MechanicalSoup fetches web content and interacts with web pages. First, let's crawl the hot posts on the Hupu forum.

We first open the home page of the Hupu community. You can see that there are several posts with red titles; we want to crawl and save the titles of those posts. Start by creating a browser instance:

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()

Now we open the Hupu forum site in the browser instance; a 200 response means the request succeeded:

browser.open('https://bbs.hupu.com/')
<Response [200]>

Our browser instance is now on the Hupu forum homepage. Next we need to get the list of articles on the page. This part is a bit tricky because we need to uniquely identify the attributes of the tag that contains the article list. However, you can do this easily with the developer tools in a browser like Chrome:

Inspecting the element, we find a ul tag inside a div whose class is "list". The hot posts are marked with class="red", so we can find the titles we need using the same methods as bs4.

result = browser.get_current_page().find('div', class_="list")
result = list(result.find('ul'))
bbs_list = []
for i in range(len(result)):
    if result[i] != '\n':
        bbs_list.append(result[i])
bbs_top = []
for i in bbs_list:
    bbs_top.append(i.find('span', class_="red"))
bbs_top

Looking at the result, we have successfully saved the titles together with their tags, and the plain title text can now be retrieved simply with .text.

[<span class="red">If I could do January 27 over again, maybe...... </span>,
 <span class="red">...</span>,
 ...,
 None, None, None, None, None, None, None]
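The filtering and .text extraction described above can be sketched offline with BeautifulSoup alone, using a small hand-written HTML fragment (the markup below is illustrative and simplified, not Hupu's actual page):

```python
from bs4 import BeautifulSoup

# Illustrative markup mimicking the structure described above:
# a div.list containing a ul whose hot posts carry span.red titles.
html = """
<div class="list"><ul>
  <li><span class="red">Hot post one</span></li>
  <li><span>Ordinary post</span></li>
  <li><span class="red">Hot post two</span></li>
</ul></div>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find("div", class_="list").find("ul").find_all("li")
# find() returns None for items without a red span, so filter them out,
# then pull the plain text with .text
titles = [li.find("span", class_="red") for li in items]
titles = [t.text for t in titles if t is not None]
print(titles)  # ['Hot post one', 'Hot post two']
```

The None entries in the output above come from exactly this behaviour: ordinary posts have no span with class="red", so find() returns None for them.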

Let’s look at another example of how MechanicalSoup interacts with a website. This time we’ll take a simpler case: using MechanicalSoup to perform a Baidu search.

As before, we first create an instance in the browser and open the Baidu home page.

import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open('https://www.baidu.com/')
<Response [200]>

Once the response succeeds, let’s select the form that needs to be submitted:

browser.select_form()
browser.get_current_form().print_summary()
<input name="bdorz_come" type="hidden" value="1"/>
<input name="ie" type="hidden" value="utf-8"/>
<input name="f" type="hidden" value="8"/>
<input name="rsv_bp" type="hidden" value="1"/>
<input name="rsv_idx" type="hidden" value="1"/>
<input name="tn" type="hidden" value="baidu"/>
<input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
<input autofocus="" class="bg s_btn" id="su" type="submit" value="Google it."/>

You can see that the input we want to fill is the second-to-last line (name="wd"), so we can fill it as follows:

browser["wd"] = 'the early python'

You can then use the following command to open a local copy of the page in your default browser, with the form filled in with the value we provided.

browser.launch_browser()


As you can see, the search box is already filled with the query; the next step is to ask the browser we created to click the search button for us:

browser.submit_selected()
<Response [200]>

A return of 200 means success: the click has been simulated. Next, use browser.get_current_page() to view the content of the returned page!

Conclusion

The two examples above are simple, but that’s basically how MechanicalSoup works: create a browser instance, then use the browser to perform the action you want, and even open a local visual page to preview the content of your form before submitting it! What are you waiting for? Hurry up and try it!

===================================================================

My WeChat public account: Get up early python