Preface

With the advent of the big data era, more and more technologies that keep pace with the times are being valued and applied in this field. Crawler technology is undoubtedly one of the best among them, and many students are learning it, whether for employment or out of interest.

To do crawler work, you only need to learn crawling itself, right? It's not that simple. Most people who sell you these shortcut ideas have an agenda. As a programmer with more than ten years of experience, I personally feel that mastering crawlers involves 8 areas of knowledge!

Today I want to tell you what skills a qualified crawler engineer should learn, and which of them we should focus on.

1. Python

Importance: ★★★★ (Top priority)

Any technology needs language support. Among the many programming languages, Python is undoubtedly the most suitable for crawler development. "Most suitable" does not mean it is the only option; other languages such as Java and C can also be used for crawler development. Students who are not familiar with them can look this up online; there is plenty of material.

The Python language is the foundation: only if your knowledge at this stage is solid will later learning go smoothly. The most important topic at this stage is object-oriented programming, and it is also the hardest, so spend extra time on this part of Python. Otherwise, later on you will keep running into the feeling of regretting not having learned enough when you actually need it.
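As a minimal sketch of why object-oriented programming matters for crawler work (the class names are made up for the example), a base class can hold the shared download step while subclasses only override the parsing step:

```python
import requests

class BaseSpider:
    """Shared downloading behaviour lives in the base class."""

    def __init__(self, start_url):
        self.start_url = start_url

    def fetch(self):
        # Download the page and return its HTML text.
        return requests.get(self.start_url, timeout=10).text

    def parse(self, html):
        raise NotImplementedError("subclasses decide how to parse the page")

class TitleSpider(BaseSpider):
    """A subclass only overrides the parsing step."""

    def parse(self, html):
        start = html.find("<title>") + len("<title>")
        return html[start:html.find("</title>")]

spider = TitleSpider("https://example.com/")
print(spider.parse(spider.fetch()))
```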

2. Front-end knowledge

Importance: ★★ (Understanding)

This content is for understanding only. Since we are not front-end developers, we do not have to dig deep into it, but we must understand basic page tags and structure; otherwise we will not be able to analyze pages when we capture data later. Of course, if you have the energy, you can go a little further; after all, the more knowledge you have, the smoother crawler development will be. Arrange this part of your learning according to your own situation.
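If it helps to see what "page tags and structure" means in practice, here is a tiny sketch that walks the tags of a made-up page with Python's built-in html.parser:

```python
from html.parser import HTMLParser

# A tiny, made-up page used only to illustrate tag structure.
PAGE = """
<html>
  <body>
    <div class="article">
      <h1>Title</h1>
      <p>First paragraph</p>
      <a href="/next">next page</a>
    </div>
  </body>
</html>
"""

class TagPrinter(HTMLParser):
    """Print each opening tag and its attributes to reveal the page structure."""
    def handle_starttag(self, tag, attrs):
        print(tag, dict(attrs))

TagPrinter().feed(PAGE)
```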

3. Network programming

Importance: ★★★ (Study)

Network programming really is an important part. If you can sort out this knowledge clearly, the whole crawling process becomes very clear.

But again, our main job is crawling, so for this part we only need to learn the basics of network programming, for example: network communication protocols (especially HTTP and HTTPS), request methods, the request and response process, and so on.
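To make the request/response process concrete, here is a minimal sketch using the requests library (the URL and User-Agent are placeholders for the example):

```python
import requests

# A placeholder URL used only for illustration.
url = "https://example.com/"

# The request side: method, URL, headers (and optionally a body).
response = requests.get(url, headers={"User-Agent": "my-crawler/0.1"}, timeout=10)

# The response side: status code, headers, and body.
print(response.status_code)                    # e.g. 200
print(response.headers.get("Content-Type"))    # e.g. text/html; charset=UTF-8
print(len(response.text), "characters of HTML")
```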

4. Data storage

Importance: ★★★★ (Top priority)

The importance of data storage is obvious: one part of crawler development is fetching the data, and the other part is storing it.

In this section, in addition to common storage formats such as JSON, TXT and HTML, you should also master CSV and MongoDB; MongoDB in particular is almost a must for enterprise interviews. MySQL and Redis are a plus as well: although MongoDB can cover most requirements, knowing more storage technologies is certainly good for your competitiveness.
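Here is a minimal sketch of three of the storage paths mentioned above, assuming pymongo is installed and a MongoDB instance is running locally on the default port (the records are made up for the example):

```python
import csv
import json
from pymongo import MongoClient  # assumes pymongo is installed

# Made-up scraped records used only for illustration.
records = [
    {"title": "Article A", "url": "https://example.com/a"},
    {"title": "Article B", "url": "https://example.com/b"},
]

# JSON file
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV file
with open("items.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(records)

# MongoDB (assumes a local instance on the default port)
client = MongoClient("mongodb://localhost:27017")
client["crawler"]["items"].insert_many(records)
```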

5. Data analysis

Importance: ★★★ (Master)

Strictly speaking, this is not part of a crawler developer's core responsibilities, but many enterprises now list it in their requirements for crawler engineers, which shows that the bar keeps getting higher.

So you can study this part at the last stage. Skills to master include numpy, pandas, missingno, jieba, and so on.
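As a small pandas sketch of the kind of cleaning a crawler team may be asked to do (the rows are made up for the example):

```python
import pandas as pd

# Made-up scraped rows, including a duplicate and a missing value.
df = pd.DataFrame([
    {"title": "Article A", "views": 100},
    {"title": "Article A", "views": 100},
    {"title": "Article B", "views": None},
])

cleaned = (
    df.drop_duplicates()          # remove exact duplicate rows
      .dropna(subset=["views"])   # drop rows with a missing view count
      .reset_index(drop=True)
)
print(cleaned)
```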

6. JavaScript language

Importance: ★★★ (Study)

We all understand that we learn Python to develop crawlers, so why learn JavaScript?

The answer is simple. More and more web pages now use JS encryption, and that is a big hurdle when we scrape data. At the same time, learning the entire JS language obviously costs too much, so the three stars do not mean it is unimportant; they mean the learning cost is high for now. Students who want to get into JS reverse engineering can leave this content for a later stage. For the moment, it is enough to be familiar with JS encryption and with Python's common ways of executing JS, such as PyV8. That's it.
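To make the idea concrete, here is a minimal sketch of calling a JS function from Python. It uses the PyExecJS library (imported as execjs) as one common option rather than PyV8, and the sign function is made up for the example:

```python
import execjs  # PyExecJS; one common way to run JS from Python

# A made-up "encryption" function standing in for the JS a site ships.
JS_SOURCE = """
function sign(data) {
    // pretend this is the site's obfuscated signing logic
    return data.split("").reverse().join("") + "_signed";
}
"""

ctx = execjs.compile(JS_SOURCE)
signature = ctx.call("sign", "user123")
print(signature)  # "321resu_signed"
```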

7. Mobile terminal technology

Importance: ★★★ (Master)

For a qualified crawler engineer today, grabbing data from web pages alone is far from enough. With the development of the Internet, data on mobile devices has more and more reference value, so this technical point is still very important. So what technologies should we learn?

First, learn some simple Android basics, such as what controls Android provides; second, master UIAutomator2; finally, master packet capture tools such as Fiddler.
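As a minimal sketch of driving a device with UIAutomator2 (it assumes the uiautomator2 Python package and an ADB-connected device; the package name and control text are made up for the example):

```python
import uiautomator2 as u2  # assumes the uiautomator2 package is installed

# Connect to the device over ADB (the uiautomator2 agent must be set up on it).
d = u2.connect()

# Open a hypothetical app and tap a control by its visible text.
d.app_start("com.example.news")   # package name is made up for the example
d(text="Hot").click()             # tap the tab labelled "Hot"

# Read the text of a control located by its (hypothetical) resource id.
print(d(resourceId="com.example.news:id/title").get_text())
```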

Of course, the points listed here only give a rough direction; there are many details to learn along the way. They are offered only as a reference for your study.

8. Crawler knowledge

Importance: ★★★★ (Top priority)

This is the most important part. Crawler technology is what we, as crawler engineers, live on. So let's go over the skills to master in general:

Be proficient in web page parsing techniques such as regular expressions, XPath, BS4 (BeautifulSoup), etc.

Study crawling strategies and anti-blocking rules; solve problems such as account bans, IP bans, and forced page redirects, and improve the efficiency and quality of page scraping

Be familiar with CAPTCHA recognition, simulated login, data cleaning, deduplication, database storage, etc.

Be proficient in the Scrapy framework and distributed crawlers (see the minimal spider sketch below)

From the list above it is not hard to see that the knowledge mainly covers four aspects: web page parsing, anti-crawling techniques, writing data to storage, and the Scrapy framework. Each of them takes real effort to learn well, which requires us to keep accumulating in our day-to-day study so that we can apply these knowledge points flexibly.
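As a minimal sketch of what a Scrapy spider looks like (it targets quotes.toscrape.com, a public practice site, and mixes CSS selectors with XPath, so it touches both the framework and the parsing skills above):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """A tiny spider: download the page, parse each quote block, yield items."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
            }
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o quotes.json`, which writes the yielded items straight to a JSON file.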