Python has become a star programming language now that machine learning, artificial intelligence, and algorithm engineers command astronomically high salaries. The language is known for its concise syntax, clear logic, and large ecosystem of third-party libraries. “Life is short, I use Python,” as the saying goes, and more and more people are learning it. It is said that information technology courses for elementary school students in some regions now even include Python. The trend seems irresistible.

Of course, in a market economy, any popular trend becomes a source of profit for a small number of people. For example, there are programmer training institutions large and small on the market, along with all kinds of online classes. Even in the self-media industry, more and more marketing accounts use “teach Python” to make money.

I’ve been looking at some of these Python marketing accounts lately, and it’s interesting to note that 70% of the accounts promoting and introducing Python are talking about Python web crawlers.

For the record, I taught myself Python in a week two years ago and used the language proficiently during my master’s internship. However, because I am not a programmer by training (my major was materials science), I don’t have much computer science knowledge, and I didn’t know much about so-called web crawlers.

These marketing accounts’ favorite pitches are: “Learn Python crawlers in a seven-day crash course”, “50% off now if you join the learning community”, “The first three live classes are free, and the last four classes are guaranteed to teach you Python crawlers”… All the various course-selling accounts follow this model.

So what exactly is a Python crawler?

After spending a day browsing various “Teach you to write Python crawlers” marketing articles, I had a rough idea of what they were all about.

To sum up, in three steps:

1. Use Python to connect to specific websites.

2. Use Python to grab web page information and save it locally.

3. Parse the pulled information, then store or visualize it.
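The three steps above can be sketched with nothing but the standard library. This is only a minimal illustration, not what any particular course teaches: the function names `fetch`, `parse_links`, and `store`, and the sample HTML, are my own assumptions.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Steps 1 and 2: connect to a site and pull the raw page text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(html):
    """Step 3a: parse the pulled page into a list of link targets."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(links, path):
    """Step 3b: store the parsed result locally as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, indent=2)


# Hypothetical sample page, so the parsing step can be tried without a network.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="https://example.com/">Home</a></p>'
sample_links = parse_links(SAMPLE_HTML)
```

With a real page, the calls would chain as `store(parse_links(fetch(url)), "links.json")`, where `url` is whatever site you are allowed to fetch.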

To be honest, this routine left me rather baffled. It’s as if you opened a photo site, found the URL of an image, and then downloaded the image from that URL… This is not so much teaching web crawlers as teaching basic Python applications…

I looked into it further and found that crawlers in the real world are far from this simple. Common crawlers, such as those behind search engines, handle massive numbers of page requests, crawl the information, and supply the pages presented in search results. This involves a great deal of knowledge about computer networks, algorithms, and data structures, and the demands on front-end knowledge are high as well. For example, to make a crawler more efficient, should it search breadth-first or depth-first? How do you handle high concurrency and load on the network side? You also have to face the anti-crawling mechanisms of many websites, so how do you pull data in a reasonable and legal way? These problems cannot be taught in a day or two. They require long-term technical training and accumulated experience.
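To make the breadth-first versus depth-first question concrete, here is a small sketch. It assumes the site is already available as an in-memory dictionary mapping each page to the pages it links to (the hypothetical `SITE` below), which stands in for actually fetching and parsing pages. The only difference between the two strategies is whether the frontier is used as a queue or as a stack.

```python
from collections import deque


def crawl_order(graph, start, strategy="bfs", max_pages=100):
    """Return the order in which pages would be visited.

    graph: dict mapping a page to the list of pages it links to
           (a stand-in for fetching a page and extracting its links).
    strategy: "bfs" treats the frontier as a queue, "dfs" as a stack.
    """
    if strategy not in ("bfs", "dfs"):
        raise ValueError("strategy must be 'bfs' or 'dfs'")

    frontier = deque([start])
    seen = {start}  # never schedule the same page twice
    order = []

    while frontier and len(order) < max_pages:
        # Queue: take the oldest page. Stack: take the newest page.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)

        links = graph.get(url, [])
        if strategy == "dfs":
            # Push in reverse so the first link on the page is explored first.
            links = reversed(links)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical link structure of a tiny site.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1", "/"],
}

bfs_order = crawl_order(SITE, "/", "bfs")
dfs_order = crawl_order(SITE, "/", "dfs")
```

Breadth-first reaches every page close to the start before going deeper, while depth-first follows one branch to the end first. A real crawler would also need politeness delays, error handling, and respect for each site's rules, none of which this sketch attempts.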

I have to say, the marketing accounts’ eagerness to boast about so-called “instant web crawlers” is not something to admire. Complete beginners should not start with this trivial kind of “crawler” either. They should first lay a solid foundation in computer knowledge, or at the very least master Python.

In my opinion, there is no need at all to hand money to these training institutions and marketing accounts in order to learn Python. If your English is good, go straight to the official documentation. If you don’t want to read English, buy a textbook and type out the code yourself. If you prefer to study online, there are many excellent free learning resources available. As for crawlers, wait until you’ve mastered Python.

Finally, I wrote a short piece of code soon after working out what the marketing accounts mean by a “crawler”. The basic idea is the three steps I mentioned above: 1. connect to an image site with a search term (Dilieba as an example); 2. extract the addresses of the pictures with a regular expression; 3. download the images locally from those addresses.
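A minimal sketch of those three steps might look like the following. It is not the exact code I ran: the search URL format (a `q` query parameter), the regular expression, and the function names are all assumptions for illustration, and a real image site will need its own URL pattern and its own expression.

```python
import os
import re
from urllib.parse import urlencode, urljoin, urlsplit
from urllib.request import urlopen

# Assumed pattern: <img ... src="...jpg|jpeg|png|gif">. Real pages vary a lot.
IMG_SRC = re.compile(
    r'<img[^>]+src=["\']([^"\']+\.(?:jpg|jpeg|png|gif))["\']',
    re.IGNORECASE,
)


def search_url(base, keyword):
    """Step 1: build the search page URL (the 'q' parameter is an assumption)."""
    return base + "?" + urlencode({"q": keyword})


def find_image_urls(html, page_url):
    """Step 2: pull image addresses out of the page with a regular expression."""
    return [urljoin(page_url, src) for src in IMG_SRC.findall(html)]


def download(url, folder="images", timeout=10):
    """Step 3: download one image into a local folder and return its path."""
    os.makedirs(folder, exist_ok=True)
    name = os.path.basename(urlsplit(url).path) or "image"
    path = os.path.join(folder, name)
    with urlopen(url, timeout=timeout) as resp, open(path, "wb") as f:
        f.write(resp.read())
    return path


def fetch(url, timeout=10):
    """Pull the search result page as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_images(base, keyword, folder="images"):
    """Run the three steps in order and return the saved file paths."""
    page_url = search_url(base, keyword)
    html = fetch(page_url)
    return [download(u, folder) for u in find_image_urls(html, page_url)]
```

Calling `crawl_images(base, "Dilieba")` with the search address of a site that permits it would save whatever images the expression finds. Regular expressions are brittle for HTML, which is one more reason this barely counts as a crawler.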

Then come the results.

Of course, if you really want to learn to build so-called crawlers, it is better to first learn the fundamentals of computing, especially networking and algorithms, and to master a programming language (Python is not the only choice; Java, PHP, Go, and others can be used to write crawlers too). I am still a very long way from writing a real web crawler. In fact, this article is only meant to point out one truth: if you want real skill, you cannot avoid hard work.

In this world, every promised shortcut to instant mastery is a lie.

Off-topic: if you are willing to follow me and learn along with me, I swear I will not cheat you out of your money the way the course-selling marketing accounts fleece their followers. I’m an honest person.

Author: Jia Hutu

Source: Zhihu