What is a crawler

Simply put, a crawler is a program that fetches the data you need from the web and, if you want, saves it according to rules you define. The data can be text, an image, or a file; it depends on what you are crawling.

In theory, a crawler has only two steps: first obtain the HTML source of a page, then parse that HTML to extract the data. In practice, it is usually more trouble than that.
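The two steps can be sketched with nothing but the standard library. This is a minimal illustration, not a complete crawler: the names fetch_html, TitleParser, and extract_title are made up for this example, and it assumes the data you want is simply the page title.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Step 1: download the page and decode it to text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Step 2: walk the HTML and keep the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.title += data


def extract_title(html):
    """Parse an HTML string and return its title, or "" if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Calling extract_title(fetch_html(url)) chains the two steps. Keeping them as separate functions means the parsing step can be tried on a saved HTML string without touching the network.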

Here are some handy libraries for writing crawlers in Python:

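Common network request libraries:

urllib, urllib2, and requests. urllib and urllib2 are built into Python (urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request), while requests is a third-party library.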
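As a rough sketch of what a request looks like with the built-in module, the following builds a GET request with query parameters and a User-Agent header. The helper names build_request and get_text and the "my-crawler/0.1" agent string are assumptions made for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(url, params=None, user_agent="my-crawler/0.1"):
    """Build a GET request with query parameters and a User-Agent header."""
    if params:
        url = url + "?" + urlencode(params)
    return Request(url, headers={"User-Agent": user_agent})


def get_text(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With the third-party requests library, roughly the same thing is a single call: requests.get(url, params=params, headers=headers, timeout=10), which is a large part of why many people prefer it.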

Common parsing libraries and crawler frameworks:

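BeautifulSoup, lxml, HTMLParser, Selenium, and Scrapy. Of these, HTMLParser ships with Python (in Python 3 it is the HTMLParser class in the html.parser module); the others are third-party.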

BeautifulSoup parses HTML into a tree of Python objects that you can search and manipulate directly;

lxml parses both XML and HTML, and its main advantage is speed;

Selenium drives a real browser through a browser driver. With it, you can have the browser perform operations for you, such as typing in a verification code.

Scrapy is a powerful and well-known crawler framework that makes crawling simple websites easy.
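Of the libraries above, only HTMLParser needs no installation, so it is used here for a small parsing sketch: collecting the links on a page, which is how a crawler finds the next pages to visit. LinkParser and extract_links are names invented for this example; BeautifulSoup or lxml would do the same job with less code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href (value is None or empty).
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in an HTML string, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Resolving each href with urljoin matters because pages often use relative links such as "/docs", which cannot be requested until they are combined with the URL of the page they came from.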

What a crawler author needs to know

Learn to read HTML, so you can tell which part of the page source holds which piece of data.

Learn to use the browser's debugging tools. Writing crawlers also means learning to capture packets, so you can see how a site's requests and responses are actually transmitted.

Advanced crawler

Once you have mastered basic crawlers, you will want more data from harder sites, and you will find that getting it is not easy: many sites now have anti-crawler mechanisms.

Some sites require you to log in and carry the session into later requests. Some require proxy IPs. Some encrypt their data, and the encryption may vary from site to site. Some return JS code that the browser has to execute. These cannot all be covered in a few words.
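The first two cases, keeping a login session and going through a proxy, can be sketched with the standard library as follows. Everything site-specific here is an assumption: make_session and build_login_request are made-up helper names, and the form field names "username" and "password" are hypothetical, since every site names its login fields differently.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, ProxyHandler, Request, build_opener


def make_session(proxy_url=None):
    """Return (opener, cookie jar); the opener reuses cookies across requests."""
    jar = CookieJar()
    handlers = [HTTPCookieProcessor(jar)]
    if proxy_url:
        # Route both HTTP and HTTPS traffic through the given proxy.
        handlers.append(ProxyHandler({"http": proxy_url, "https": proxy_url}))
    return build_opener(*handlers), jar


def build_login_request(login_url, username, password):
    """Build a form POST; the field names are hypothetical placeholders."""
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    # Passing data makes urllib send the request as a POST.
    return Request(login_url, data=form)
```

After opener.open(build_login_request(...)) succeeds, any cookies the server set are stored in the jar, and later calls to opener.open(...) on the same opener send them back automatically. Encrypted parameters and JS-generated content are not handled by this sketch.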

There is no need to be discouraged, though. Because these problems differ from site to site, and so do their solutions, all we can do is solve each one as we meet it. The only way to get better is to practice and write more code.

If you are interested in crawlers, you can also follow my public account, poetry code, and learn together with me.