1. Many people start learning crawlers without knowing what they will use them for, and end up having learned the skill but never applying it.

Most people don’t actually need to learn crawlers, because the company they work for already has databases that supply the data for business analysis.

When do you need a crawler?

When the data you need for your work isn’t available internally and you have to collect it from the Internet, a crawler can simulate a browser, open web pages, and extract just the data you want, which greatly improves efficiency.

In addition, crawlers can power automated spreadsheets: the crawler collects specific data and then saves it automatically to Excel, so the table builds and updates itself.

Whether you are building a crawler or an automated table, the crawler always goes through the same four steps: simulate a browser request (send the request) -> get the response content (the raw page) -> parse the content (extract the useful data) -> save the data (to a database or an Excel file).
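As a minimal sketch of those four steps, assuming the third-party requests and beautifulsoup4 packages are installed (the URL, CSS selector and output file below are placeholders for illustration only):

```python
# A minimal sketch of the four-step crawler process.
import csv

import requests
from bs4 import BeautifulSoup

# 1. Simulate a browser request (send an HTTP GET with a browser-like User-Agent).
url = "https://example.com/articles"          # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)

# 2. Get the response content (the raw HTML of the page).
html = response.text

# 3. Parse the content (extract the pieces we care about).
soup = BeautifulSoup(html, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select("h2.title")]  # placeholder selector

# 4. Save the data (here to a CSV file that Excel can open).
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows([[t] for t in titles])
```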

2. There are two kinds of crawler tools. One is the point-and-click (no-code) crawler tool, which is operated through a visual interface and is quick to pick up. The other is writing your own crawler in Python, which has a certain learning cost.

3. Point-and-click crawler tools

1) Houyi, official website: Houyicaiji.com

Open the official website and click “Document Center” below to get started.

2) GooSeeker, gooseeker.com

4. Python crawlers

Writing a crawler with code involves a lot of knowledge, and many people learn it in the wrong order and go from getting started straight to giving up. Here is an overall roadmap for learning Python crawlers.

1) Learning to write a crawler is like learning to cook. First learn basic Python syntax; that is the recipe you need to be familiar with before you can cook anything.

2) Just as a chef relies on a kitchen knife, a Python crawler relies on a few important built-in libraries such as urllib and http. They are our basic tools for downloading web pages (see the sketch below).
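For example, a page can be downloaded with nothing but the built-in urllib; the URL below is a placeholder:

```python
# Downloading a page with the built-in urllib, no third-party packages needed.
from urllib.request import Request, urlopen

req = Request(
    "https://example.com",                     # placeholder URL
    headers={"User-Agent": "Mozilla/5.0"},     # pretend to be a browser
)
with urlopen(req, timeout=10) as resp:
    html = resp.read().decode("utf-8")

print(html[:200])  # first 200 characters of the downloaded page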

3) A chef is expected to know the ingredients before cooking. For crawlers, the equivalent basic requirement is learning web page parsing tools such as regular expressions (re), BeautifulSoup (bs4) and XPath (lxml). Only with them can you understand the structure of a specific website and successfully crawl its data. The sketch after this item compares the three.
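As a sketch, here are the three parsing approaches applied to the same toy HTML snippet, assuming beautifulsoup4 and lxml are installed:

```python
# Three common ways to parse the same HTML snippet: re, BeautifulSoup and lxml XPath.
import re

from bs4 import BeautifulSoup
from lxml import etree

html = '<ul><li class="item">Apple</li><li class="item">Banana</li></ul>'

# Regular expression: quick, but fragile on real-world HTML.
re_items = re.findall(r'<li class="item">(.*?)</li>', html)

# BeautifulSoup: tolerant of messy HTML, easy to read.
soup_items = [li.get_text() for li in BeautifulSoup(html, "html.parser").select("li.item")]

# lxml XPath: fast and expressive.
xpath_items = etree.HTML(html).xpath('//li[@class="item"]/text()')

print(re_items, soup_items, xpath_items)  # each list is ['Apple', 'Banana']
```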

4) Once you know the recipe and the ingredients and have a knife in hand, you can start crawling some simple websites and get a feel for the whole data-collection process. At this point you are a beginner.

If you want to do more than cook simple dishes and really get into the crawler world, the following steps are essential on the way to becoming a crawler expert:

5) Understand common anti-crawler mechanisms and how to deal with them, such as request headers, robots.txt, time intervals between requests, proxy IPs, hidden form fields, etc. (see the sketch below).
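A sketch of some of these measures with the third-party requests library; the site, proxy address and delay below are illustrative assumptions, not fixed rules:

```python
# Basic politeness and anti-blocking measures: check robots.txt, send browser-like
# headers, wait between requests, and optionally route traffic through a proxy.
import time
from urllib import robotparser

import requests

# Respect robots.txt before crawling.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")   # placeholder site
rp.read()

headers = {"User-Agent": "Mozilla/5.0"}          # look like a normal browser
proxies = {"http": "http://127.0.0.1:8080"}      # placeholder proxy IP (optional)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    if not rp.can_fetch("*", url):
        continue                                  # skip pages the site disallows
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(url, resp.status_code)
    time.sleep(2)                                 # time interval between requests
```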

6) Learn to crawl trickier websites and handle problems such as logins, cookies and dynamically rendered pages (see the sketch below).
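A minimal sketch of a cookie-based login with requests.Session; the login URL and form field names are hypothetical, and heavily dynamic pages usually need a browser-automation tool such as Selenium instead:

```python
# Log in once with a POST request; the Session object stores the cookies
# and sends them automatically on later requests.
import requests

session = requests.Session()

login_data = {"username": "me", "password": "secret"}      # placeholder credentials
session.post("https://example.com/login", data=login_data, timeout=10)

# Subsequent requests carry the login cookies.
resp = session.get("https://example.com/members-only", timeout=10)
print(resp.status_code)
```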

7) Understand how a crawler works together with a database and how to store the crawled data (see the sketch below).
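For example, crawled rows can be written into a local SQLite database with the built-in sqlite3 module; the table and columns below are hypothetical:

```python
# Save crawled rows into a local SQLite database.
import sqlite3

rows = [("First post", "https://example.com/1"),
        ("Second post", "https://example.com/2")]           # placeholder data

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)")
conn.executemany("INSERT INTO articles (title, url) VALUES (?, ?)", rows)
conn.commit()
conn.close()
```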

8) Learn to use Python’s multi-threading and multi-processing to crawl in parallel and improve efficiency (see the sketch below).
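A small sketch with the standard-library ThreadPoolExecutor, which suits I/O-bound crawling; the URLs are placeholders:

```python
# Fetch several pages in parallel with a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)
```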

9) Learn a crawler framework such as Scrapy or PySpider (a minimal Scrapy spider is sketched below).
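As an illustration, a minimal Scrapy spider looks roughly like this; quotes.toscrape.com is a public practice site used in Scrapy’s own tutorial:

```python
# A minimal Scrapy spider (pip install scrapy); run with:
#   scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote's text and author with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```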

10) Learn distributed crawling (for very large data volumes).

5. Recommended channels for learning Python crawlers include CSDN, BiliBili and Zhihu. These sites have many free and excellent resources that can help you get started quickly with crawlers.

6. To save the crawled data to Excel automatically, Python has many libraries to choose from: xlwings, XlsxWriter, xlrd, xlwt, pandas, win32com, xlutils, etc.

These libraries make it easy to add, delete, rewrite and format Excel content, but it is not worth trying them all; that takes too much time. xlwings, xlrd and xlwt are enough and cover almost every need of automated Excel tables.

xlwings can not only read and write Excel but also handle formatting and even VBA operations. It is powerful and easy to use; a small sketch follows.
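A sketch of writing crawled rows to Excel with xlwings; note that xlwings drives a real Excel installation (Windows or macOS), and the data and file name below are placeholders:

```python
# Write rows into a new workbook and apply simple formatting with xlwings.
import xlwings as xw

rows = [["title", "url"],
        ["First post", "https://example.com/1"],
        ["Second post", "https://example.com/2"]]   # placeholder data

app = xw.App(visible=False)                  # start Excel in the background
wb = app.books.add()
sht = wb.sheets[0]

sht.range("A1").value = rows                 # write a 2-D list in one call
sht.range("A1:B1").color = (200, 200, 200)   # shade the header row
sht.autofit()                                # fit column widths to the content

wb.save("crawl_result.xlsx")
wb.close()
app.quit()
```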

The official xlwings site (www.xlwings.org) has rich examples and tutorial videos, though they are in English.

7. A reminder: a crawler is only one technical means of obtaining data. Some data is sensitive, and if you cannot tell which data may be collected and which crosses legal lines, you may end up in the news for the wrong reasons. Be extra careful.

Whether a crawler is legal can be judged from three perspectives: how the data is collected, how the collection behaves, and what the data is used for.

1) The collection approach is the most important. In general, collecting unpublished, unlicensed data that contains sensitive information is illegal no matter how it is obtained.

So before collecting such sensitive data, especially users’ personal information or data belonging to other commercial platforms, check the relevant laws and regulations and find an appropriate way to obtain it.

2) Collection behavior should be restrained. If a crawler interferes with or even damages another company’s servers or business, that behavior is not acceptable.

3) The purpose of use is also key. Even if the data was collected legally, using it improperly can still be illegal. Common violations include breaching data licensing agreements or using data beyond the agreed scope, selling personal information, and improper business practices.

8. Finally, if you don’t have to use a crawler, don’t. zhuanlan.zhihu.com/p/192273352