Data on the Internet is exploding, and with Python crawlers we can get a lot of valuable data:

1. Crawl data to conduct market research and business analysis

Crawl high-quality answers on Zhihu and screen the best content under each topic; scrape listing information from real estate websites to analyze housing price trends and compare prices across regions; crawl job postings from recruitment websites to analyze talent demand and salary levels across industries.

2. Use crawled data as raw data for machine learning and data mining

For example, if you want to build a recommendation system, you can crawl data across more dimensions and build better models.

3. Crawl high-quality resources: pictures, text, and video

Crawl product (store) reviews and various image sites to obtain picture resources and comment text data.

With the right method, being able to crawl data from mainstream websites within a short time is actually quite easy to achieve.

However, it is recommended that you set a specific goal from the very beginning. Driven by that goal, your learning will be more focused and efficient. Here is a smooth quick-start learning path that requires no prior background:

1. Understand the basic principles and workflow of crawlers

2. Requests + XPath for the common crawler routine

3. Understand how to store unstructured data

4. Cope with the anti-crawler measures of specific websites

5. Scrapy and MongoDB, and advanced distributed crawling

01

Understand the basic principles and workflow of crawlers

Most crawlers follow the process of “sending request — obtaining page — parsing page — extracting and storing content”, which simulates how we use a browser to obtain webpage information.

Simply put, we send a request to the server and get the page returned. After parsing the page, we can extract the information we want and store it in a specified document or database.

In this step, you can learn the basics of the HTTP protocol and of web pages, such as POST/GET requests, HTML, CSS, and JS.
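As a minimal illustration of that flow (the URL below is just a placeholder), a few lines of Python with the requests library already cover sending the request, obtaining the page, and storing the content; parsing comes in the next step:

```python
import requests

# Send a request and obtain the returned page (placeholder URL)
response = requests.get("https://example.com")
response.encoding = "utf-8"

# The page comes back as plain HTML text; parsing/extraction is the next step
html = response.text

# Store the raw page in a local file for now
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)
```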

02

Learn the Python package and implement the basic crawler process

There are many crawler-related packages in Python: urllib, Requests, BeautifulSoup (bs4), Scrapy, PySpider, and so on. It is recommended that you start with Requests + XPath: Requests connects to websites and returns pages, while XPath parses pages and extracts data.

If you’ve ever used BeautifulSoup, you’ll find XPath much easier; it eliminates all the work of checking element code layer by layer. Once you master it, you will find that the basic routines of crawlers are much the same, and ordinary static websites are no trouble at all: Xiaozhu, Douban, Qiushibaike, Tencent News, and so on can basically get you started.

Here’s an example using Douban short comments:

Select the first short comment, then right-click and choose “Inspect” to view the source code

Copy the XPath information for the short comment

By positioning the element, we get the XPath for the first comment:

If we want to crawl many short comments, it is natural to copy more XPaths like this:

If you compare the XPaths for the first, second, and third comments, you’ll notice a pattern: only the number after “li” is different, and it corresponds exactly to the comment’s position. So if we want to crawl all the comments on this page, we simply drop that number.
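For illustration only (the exact paths depend on Douban’s current page structure, so treat these as hypothetical), the copied XPaths might look like this, with only the index after li changing:

```
//*[@id="comments"]/ul/li[1]/div[2]/p    first comment
//*[@id="comments"]/ul/li[2]/div[2]/p    second comment
//*[@id="comments"]/ul/li[3]/div[2]/p    third comment
//*[@id="comments"]/ul/li/div[2]/p       drop the index to match every comment
```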

With the XPath information, we can crawl the comments with just a few lines of code:

Crawl all the short comment information on the page
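The code in the original screenshot is not reproduced here, but a minimal sketch of the same idea, using requests plus lxml with a hypothetical comment XPath and a placeholder URL, might look like this:

```python
import requests
from lxml import etree

# Placeholder URL for a Douban short-comment page
url = "https://movie.douban.com/subject/1292052/comments"

# Some sites (Douban included) may also require a User-Agent header; see section 04
response = requests.get(url)

# Parse the returned HTML and pull out every comment with one XPath
# (the path is illustrative; copy the real one from your browser)
tree = etree.HTML(response.text)
comments = tree.xpath('//*[@id="comments"]//div[@class="comment"]/p/text()')

for comment in comments:
    print(comment.strip())
```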

Of course, if you need to crawl websites that load content asynchronously, you can learn to capture and analyze the real requests with your browser’s developer tools, or learn Selenium to automate the browser; with that, dynamic websites such as Zhihu, Mtime, and Tripadvisor are basically no problem.
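For dynamically loaded pages, a minimal Selenium sketch (assuming Chrome and its driver are installed; the URL and XPath are placeholders) looks roughly like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real browser so JavaScript-rendered content actually loads
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-page")  # placeholder URL

# Extract elements after the page has rendered (placeholder XPath)
items = driver.find_elements(By.XPATH, '//div[@class="item"]/a')
for item in items:
    print(item.text)

driver.quit()
```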

You also need to know the basics of Python, such as:

File reading and writing: used to read parameters and save crawled content

Lists and dicts: used to store and organize crawled data

Conditional statements (if/else): used to decide whether a step in the crawler is executed

Loops and iteration (for/while): used to repeat crawler steps

03

Storage of unstructured data

The data crawled back can be stored locally as a document or stored in a database.

For small amounts of data, you can use Python’s built-in syntax or pandas to store the data in text or CSV files. Continuing with the example above:

Storing with Python’s built-in file operations:
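The original code is shown as a screenshot; a minimal sketch with plain Python file operations, assuming a `comments` list from the crawl above, might be:

```python
# Assume `comments` is the list of comment strings extracted earlier
with open("comments.txt", "w", encoding="utf-8") as f:
    for comment in comments:
        f.write(comment.strip() + "\n")
```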

Storing with pandas:
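And the pandas version, again assuming the same `comments` list:

```python
import pandas as pd

# Assume `comments` is the list of comment strings extracted earlier
df = pd.DataFrame({"comment": comments})
df.to_csv("comments.csv", index=False, encoding="utf-8-sig")
```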

Either piece of code can store the crawled comment information; just paste it after the crawling code.

Store the short comment data for that page

You may find that the crawled data is not clean: there may be errors, omissions, and other problems, so you may also need to clean it. Common steps include (see the pandas sketch after this list):

Missing values: delete or fill in rows with missing data

Duplicate values: detect and remove duplicates

Whitespace and outliers: clear unnecessary whitespace and extreme or abnormal values

Data grouping: split the data, apply a function, and recombine the results
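A hedged pandas sketch of these cleaning steps, assuming a DataFrame with a single `comment` column saved from the previous step:

```python
import pandas as pd

# Load the crawled comments saved earlier
df = pd.read_csv("comments.csv")

df = df.dropna(subset=["comment"])           # missing values: drop incomplete rows
df = df.drop_duplicates()                    # duplicate values: detect and remove
df["comment"] = df["comment"].str.strip()    # whitespace: trim leading/trailing spaces

# Grouping: split, apply a function, recombine (here, counting comments by length bucket)
df["length"] = df["comment"].str.len()
print(df.groupby(df["length"] // 50).size())
```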

04

Master various skills to cope with anti-crawling measures of specific websites

Crawling one page is fine, but we usually want to crawl multiple pages.

This time, look at how the URL changes when you turn the page. Taking the short comment page as an example again, let’s see how the URLs of several pages differ:

Looking at the first four pages, we can see the pattern: different pages differ only in the number at the end of the URL. Taking five pages as an example, we can write a loop to update the page address, as sketched below.
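As a sketch (the pagination parameter here is hypothetical; check how the page number actually appears in your target URL), the loop could look like:

```python
import requests
from lxml import etree

all_comments = []
for page in range(5):
    # Hypothetical pattern: the page offset appears as a query parameter at the end
    url = f"https://movie.douban.com/subject/1292052/comments?start={page * 20}"
    response = requests.get(url)
    tree = etree.HTML(response.text)
    all_comments.extend(tree.xpath('//*[@id="comments"]//div[@class="comment"]/p/text()'))

print(f"Collected {len(all_comments)} comments")
```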

Of course, crawling has its frustrations, such as IP blocking, strange captchas, User-Agent access restrictions, and dynamic loading.

To cope with these anti-crawler measures, you need some more advanced skills, such as controlling access frequency, using a proxy IP pool, capturing packets, and OCR processing of captchas.

For example, we often find that the URL of some websites does not change after turning the page; this is usually asynchronous loading. Using the developer tools to extract and analyze the page’s loading requests often yields unexpected rewards.

Analyze the loaded information through developer tools

For another example, if we find that a web page cannot be accessed through code, we can try adding User-Agent information, or even the browser’s cookie information.

User-Agent information in the browser

Adding User-Agent information to the code
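As a sketch, adding a User-Agent (and optionally a cookie) to a requests call looks like this; copy the actual strings from your own browser’s developer tools:

```python
import requests

headers = {
    # Copy the User-Agent string shown in your browser's developer tools
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    # "Cookie": "paste your browser cookie here if the site requires a login",
}

response = requests.get("https://example.com", headers=headers)  # placeholder URL
print(response.status_code)
```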

When choosing between efficient development and anti-crawler measures, websites usually lean toward the former, which leaves room for crawlers. Once you master these skills, the vast majority of websites will no longer give you trouble.

05

Scrapy and MongoDB, and advanced distributed crawling

A solid grasp of the Scrapy framework is useful in more complex situations, where ordinary volumes of data and code are no longer a problem.

Scrapy is a very powerful crawler framework. Not only is it easy to build requests with, but its powerful selectors also make it easy to parse responses. Most impressively, it offers extremely high performance, letting you engineer your crawlers in a modular way.
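A minimal Scrapy spider, just to show the shape of the framework (the URL and selector are placeholders):

```python
import scrapy


class CommentSpider(scrapy.Spider):
    name = "comments"
    start_urls = ["https://example.com/comments"]  # placeholder start page

    def parse(self, response):
        # Scrapy's built-in selectors support XPath directly
        for text in response.xpath('//div[@class="comment"]/p/text()').getall():
            yield {"comment": text.strip()}
```

A standalone spider like this can be run with `scrapy runspider spider.py -o comments.json`, and Scrapy takes care of scheduling and export.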

Distributed crawling of rental listings

As the amount of crawled data grows, you will naturally need a database, and MongoDB makes it easy to store large-scale data. The database knowledge needed here is actually very simple, mainly how to insert and retrieve data, so just learn it when you need it.

MongoDB stores job information
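For reference, a minimal pymongo sketch (assuming a MongoDB instance running locally on the default port; the database and collection names are placeholders) for inserting and reading crawled records:

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (placeholder database/collection names)
client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler"]["comments"]

# Inserting crawled records is one line; reading them back is just as simple
collection.insert_one({"comment": "example comment", "page": 1})
for doc in collection.find().limit(5):
    print(doc)
```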

Distributed crawling sounds scary, but it simply applies the principle of multi-threading to let multiple crawlers work at the same time. You need to master three tools: Scrapy + MongoDB + Redis.

Scrapy is used for the basic page crawling, MongoDB is used to store the crawled data, and Redis is used to store the queue of pages to be crawled, that is, the task queue.
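If you wire the three together with the scrapy-redis extension (one common choice, not the only one), the relevant part of settings.py is roughly the following sketch; the setting names come from scrapy-redis and the Redis address is a placeholder:

```python
# settings.py -- sketch of sharing the task queue between crawler workers via scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # de-duplication shared across workers
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = "redis://localhost:6379"                         # placeholder Redis address
```

Crawled items can then be written to MongoDB in an ordinary Scrapy item pipeline.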

At this point, you’re ready to write distributed crawlers.

As you can see, along this learning path you can already become quite an experienced crawler developer, and the path is smooth. So at the beginning, try not to systematically chew through a whole subject; find a practical project (start with something simple like Douban or Xiaozhu) and get started right away.

Because crawler technology does not require you to systematically master a language, nor does it require sophisticated database skills, the effective approach is to learn these scattered knowledge points through real projects; that way you can be sure that each time you learn exactly the part you need most.

Of course, the only trouble is that for each specific problem, finding the right learning resources and knowing how to screen and filter them is a big problem many beginners face.

But don’t worry, we have prepared a very systematic crawler course. In addition to providing you with a clear learning path, we have selected the most practical learning resources and a large library of mainstream crawler cases. After a short period of study, you will be able to master crawling and get the data you want.

More than 3,000 students have joined this course. After a short period of learning, many of them have progressed from 0 to 1 and can write their own crawlers and crawl large-scale data.

If you hope to learn crawling in a short time and avoid detours:

It is very inefficient to start with theory, grammar, and programming languages. We start directly from concrete cases and learn specific knowledge points through hands-on practice, and we plan a systematic learning path for you so that you do not have to face scattered knowledge points on your own.

This article is from the cloud community partner “Xiao Zhan Learning Python”; for related information, you can follow “Xiao Zhan Learning Python”.