The workman is like water (blog.csdn.net/yanbober). No reprinting without permission; please respect the author's work. [Contact me via private message]

1 Background

In this era of flooding data and relentless pursuit of efficiency, Python can make life a lot easier: web development, desktop gadgets, glue scripts, big data processing, image processing, machine learning, and much more. For example, suppose that while developing for Android we want to automatically convert the *.png images in our drawable directory to WebP. The common approach is a third-party tool, and many of those can only convert one image at a time, while the core of the job takes just two lines of Python; Python also makes it very convenient to customize the conversion in bulk (paths, file names, and so on). For example, here is a very simple Python PNG-to-WebP batch conversion tool I wrote. The source is as follows:

#!/usr/bin/env python3

from PIL import Image
from glob import glob
import os

"""
Note: this only shows the core idea; the script can be improved to detect and
convert a whole Android project automatically.
1. Place the script in your Android PNG directory.
2. Run: python3 image2webp.py
3. WebP images corresponding to all PNG images in the current folder are
   generated under the output directory.
"""

def image2webp(inputFile, outputFile):
    try:
        image = Image.open(inputFile)
        # WEBP saving expects RGB/RGBA; convert any other mode first.
        if image.mode != 'RGBA' and image.mode != 'RGB':
            image = image.convert('RGBA')

        image.save(outputFile, 'WEBP')
        print(inputFile + ' has been converted to ' + outputFile)
    except Exception as e:
        print('Error: ' + inputFile + ' failed to convert to ' + outputFile + ' (' + str(e) + ')')

matchFileList = glob('*.png')
if len(matchFileList) <= 0:
    print("There are no *.png files in this directory (run this script inside your *.png directory)!")
    exit(-1)

outputDir = os.getcwd() + "/output"
for pngFile in matchFileList:
    fileName = os.path.splitext(pngFile)[0]
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)
    image2webp(pngFile, outputDir + "/" + fileName + ".webp")

print("Conversion done! All webp files are in the output directory!")

Shocking, right? Life is short, I use Python! That is true, but instead of exploring Python's other mysteries, this series dives straight into one vertical domain: the Python crawler. Benign crawlers (search-engine crawlers are a win-win; the underground workshops that scrape data indiscriminately, such as harvesting email addresses, are unwelcome or outright illegal) actually benefit most websites, while malicious crawlers are the opposite. Normally, if we want data from a website, we should access it legally and with authorization through its open API. But companies are companies, and open API permissions are often held back, so sometimes we have to resort to brute force to grab valuable data, and that is a big part of why crawlers exist.


2 Crawler basics

Crawling actually touches on quite a lot. The more important prerequisites are probably: a grasp of basic Python syntax and some commonly used built-in or third-party modules, familiarity with WEB development, familiarity with data persistence (relational and non-relational databases, files) and caching, and familiarity with techniques such as regular expressions.

2-1 The unwritten rules

Anyone familiar with the WEB knows that websites generally define a robots.txt and a Sitemap, and these definitions are meant to guide us in writing reasonable crawlers. For example, let's look at the robots.txt file of the Juejin ("rare earth nuggets") site (juejin.cn), available at juejin.im/robots.txt:

User-agent: *
Disallow: /timeline
Disallow: /submit-entry
Disallow: /subscribe/all?sort=newest
Disallow: /search

Sitemap: https://juejin.im/sitemap/sitemappart1.xml
......
Sitemap: https://juejin.im/sitemap/sitemappart4.xml

Accessing one of the Sitemap URLs defined in robots.txt (juejin.im/sitemap/sit…) gives something like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://juejin.cn</loc>
    <priority>1.0</priority>
    <changefreq>always</changefreq>
  </url>
  <url>
    <loc>https://juejin.im/welcome/android</loc>
    <priority>0.8</priority>
    <changefreq>hourly</changefreq>
  </url>
  ......
</urlset>

As you can see, the robots.txt file clearly suggests (note: suggests only; a malicious crawler could not care less) what restrictions a crawler should respect when crawling the site, and generally following these rules greatly reduces the risk of your crawler being blocked. The Sitemap lists almost every page on the site, and we can crawl it directly or use it for other purposes, though not every site provides one. So robots.txt and the Sitemap are only unwritten conventions; we generally comply with them as appropriate, for example by honoring the suggested request interval and the disallowed paths and agent types, and the rest comes down to your own integrity.
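If you want your crawler to honor these suggestions programmatically, Python's standard library already ships a robots.txt parser. A minimal sketch, using the Juejin file above as the example:

# A minimal sketch using the standard library's robots.txt parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://juejin.im/robots.txt')
rp.read()
# can_fetch() returns False for paths the site asks crawlers to skip,
# e.g. /timeline in the Disallow rules above.
print(rp.can_fetch('*', 'https://juejin.im/timeline'))
print(rp.can_fetch('*', 'https://juejin.im/welcome/android'))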

2-2 Basic Tools

As the saying goes, "to do a good job, you must first sharpen your tools," and crawling needs a few sharp tools. For Python development I choose PyCharm and Sublime; for the browser, Chrome plus some WEB development plug-ins such as FireBug, Wappalyzer, and Chrome Sniffer, which make it easier to analyze a target site. In particular, you must master the browser's F12 developer tools and the site's cookies, or crawling will not be much fun. Of course, one core part of crawling is sifting the valuable data out of what you capture, and to do that well you need a fairly precise understanding of the target pages, which in turn means roughly knowing what technologies the pages are built with, so that page analysis becomes more efficient. There are many ways to identify the technologies a page uses: browser plug-ins such as the aforementioned Chrome Sniffer; builtwith.com, where you enter the page you want to crawl and it identifies the stack; or Python's builtwith module, which sadly does not currently support Python 3.x and requires manual modification after installation.
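For reference, typical usage of the builtwith module looks roughly like the sketch below (as noted above, under Python 3.x you may need to patch the installed module first; the printed output is only illustrative):

# Rough usage of the builtwith module; it returns a dict mapping technology
# categories to the technologies it detects on the given page.
import builtwith

print(builtwith.parse('https://juejin.im'))
# e.g. {'web-servers': ['Nginx'], 'javascript-frameworks': ['Vue.js'], ...}  (illustrative)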

Of course, there is one more rarely mentioned weapon to know: Baidu and Google. Why? Because for large projects we may need to roughly estimate how many pages the whole site has, as a reference when choosing a crawling approach, so it pays to know both sides. Here is an example with Juejin, as shown in the figure:



As you can see, Baidu's site: search command tells us that Juejin has about 27,256 indexed pages (this is only a reference value, not fully accurate). When we really need to crawl the whole site, we should choose the crawling scheme and strategy with a number of that scale in mind, to keep the crawler efficient.

2-3 Basic crawling ideas

The most central technology in crawling is probably the HTTP request; at a minimum we should master the HTTP POST and GET methods. Next come the meanings of the request and response headers and how to trace them with the browser and other tools, because most problems in the request stage of a crawler come from headers: at the very least make sure the disguised User-Agent, Referer, Cookie and so on are correct, follow redirect links returned in the headers, and decompress gzip response bodies. There is also urlencoding the POST data you send, and so on. So before writing a crawler, make sure you have a solid grounding in front-end and back-end knowledge and a sufficient understanding of HTTP.
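To make these points concrete, here is a minimal urllib sketch; the URL, header values and form fields are placeholders rather than any real site's API. It disguises User-Agent/Referer/Cookie, sends urlencoded POST data, and decompresses a gzip response body.

# A minimal sketch of the request plumbing discussed above; the URL, headers
# and form fields are placeholders.
import gzip
import urllib.request
from urllib.parse import urlencode

url = 'http://www.example.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'http://www.example.com/',
    'Cookie': 'sessionid=xxxx',
    'Accept-Encoding': 'gzip',
}
post_data = urlencode({'user': 'name', 'password': 'secret'}).encode('utf-8')

request = urllib.request.Request(url, data=post_data, headers=headers)
response = urllib.request.urlopen(request)

raw = response.read()
# If the server honored Accept-Encoding, the body comes back gzip-compressed.
if response.getheader('Content-Encoding') == 'gzip':
    raw = gzip.decompress(raw)
print(raw.decode('utf-8'))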

The first thing to do is analyze the site you want to crawl. Here are some general tips for that:

  1. First check whether the target site has a responsive mobile version; if it does, prefer crawling the mobile pages (they are usually mostly content and far less bloated than the PC pages, which makes analysis easier).

  2. When analyzing cookies, it is recommended to use the browser's incognito mode; otherwise you will constantly have to clear cookies. Cookie analysis is very important when studying a target site, so make sure you get it right.

  3. Determine whether a page is static or dynamically rendered, so you can adopt different crawling strategies and tools.

  4. Examine the page source to find the layout rules of the data that is valuable to you, such as specific CSS selectors, so you can define the parsing rules to apply after fetching (a minimal example follows this list).

  5. Decide how to handle the captured data once it has been cleaned, for example whether to store it or use it directly, and how to store it.
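For point 4, here is a minimal BeautifulSoup sketch of turning a CSS rule into a parse rule; the HTML snippet and the selectors are made up for illustration, not taken from any real site.

# A minimal sketch for tip 4: the HTML and the CSS selectors are made-up examples.
from bs4 import BeautifulSoup

html = '<div class="article"><h1 class="title">Hello</h1><p class="summary">World</p></div>'
soup = BeautifulSoup(html, 'html.parser')

title = soup.select_one('div.article h1.title').get_text()
summary = soup.select_one('div.article p.summary').get_text()
print(title, summary)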

Once these routines are clear you can start writing crawler code, but there are still plenty of details to watch out for, such as avoiding repeated crawling of the same URL, URL traps, excluding invalid URLs, and handling exceptions while crawling. If you want your crawler to be robust, all of these have to be considered; a small sketch of the URL filtering part follows.
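As a taste of those routines, here is a minimal sketch of filtering out invalid or repeated URLs before they enter the crawl queue. It is illustrative only: a simplified version of the deduplication the UrlManager module below already does, plus a basic validity check.

# A minimal sketch of URL filtering: deduplicate with a set and drop links
# that are not plain http(s) URLs.
from urllib.parse import urlparse

seen_urls = set()

def accept_url(url):
    if url is None:
        return False
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        return False  # exclude javascript:, mailto:, malformed links, ...
    if url in seen_urls:
        return False  # already crawled or already queued
    seen_urls.add(url)
    return True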

Of course, the above only covers the core basics of crawling; large crawler projects involve many more scattered knowledge points, which we will get to as this series progresses. For now, let's try our hand.


3 A first taste

Enough talk about the basics; without some practice, what's the point? Talk is cheap, show me the code! Since this is a Python 3.x crawler series, let's first get ourselves crawling. Start by looking at the typical flow of a crawler, as shown in the figure below (the figure is taken from the web):

As you can see, the core flow of a crawler is: take a URL, download the data (web page) that URL points to, and parse out the data that is valuable to you for your own use. The core mechanism of a crawler is simply to repeat this loop over and over, day after day, crawling away on your behalf.

Following the flow chart above, here is a simple crawler program to convey the idea and the charm of crawling. It deep-crawls the Baidu Baike "Android" entry summary and the summaries of its derivative entries (click through to GitHub to view the crawler module's source). This crawler is not particularly robust, but it is enough to illustrate the flow chart. Its package structure is shown below:



Run python3 spider_main.py on the command line, or run spider_main.py from PyCharm, and you will see the crawler start fetching data (note: this crawler depends on the external module BeautifulSoup; if it is not installed, run pip install beautifulsoup4 first). By default the crawler only follows 30 links in depth. After it has crawled those 30 links it automatically writes an HTML table page named something like out_2017-06-13_21:55:57.html to the current directory. Open that file and you will find the crawl results, as follows:

How about that? We crawled some Android-related entries and their in-depth links from Baidu Baike, then output a WEB page styled to our own taste. Of course, we could also write the data into a database, expose it through a RESTful interface written in PHP, and return it to an App as structured JSON. If you like this, you no longer need to hunt everywhere for a free API (such as aggregated-data services) when building a small App; you can grab and use data automatically with your own hands. But never grab data without authorization and use it directly in a commercial App, or you may get sued.
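If you would rather persist the crawled entries than render an HTML table, a minimal sqlite3 sketch could look like the following; it assumes the url/title/summary dictionaries that the parser below produces, and the table name is made up.

# A minimal sketch of persisting the parsed data (url/title/summary dicts)
# into SQLite instead of an HTML table.
import sqlite3

def save_items(datas, db_file='baike_items.db'):
    conn = sqlite3.connect(db_file)
    conn.execute('CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT, summary TEXT)')
    for data in datas:
        conn.execute('INSERT OR REPLACE INTO items (url, title, summary) VALUES (?, ?, ?)',
                     (data['url'], data['title'], data['summary']))
    conn.commit()
    conn.close()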

The following is the source of this crawler (github.com/yanbober/Sm…); you can compare it against the crawler flow chart above.

spider_main.py is the object-oriented [scheduler] from the crawler flow chart above. The scheduler iterates over the URLs to crawl from UrlManager, hands each one to HtmlDownLoader to download, passes the downloaded content to HtmlParser, and finally sends the valuable data to HtmlOutput for use.
# Imports of the project's sibling modules, as used below (module names taken
# from the usage in __init__; adjust to the actual package layout).
import url_manager
import html_downloader
import html_parser
import html_output

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.parser = html_parser.HtmlParser()
        self.out_put = html_output.HtmlOutput()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print("craw %d : %s" % (count, new_url))
                html_content = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_content, "utf-8")
                self.urls.add_new_urls(new_urls)
                self.out_put.collect_data(new_data)
                # crawl only 30 links deep by default (otherwise it is too slow); adjust as needed.
                if count >= 30:
                    break
                count = count + 1
            except Exception as e:
                print("craw failed! \n"+str(e))
        self.out_put.output_html()

if __name__ == "__main__":
    rootUrl = "http://baike.baidu.com/item/Android"
    objSpider = SpiderMain()
    objSpider.craw(rootUrl)
url_manager.py is the [URL manager] from the crawler flow chart above; it manages the queue of URLs to crawl and their deduplication.
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.used_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.used_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        temp_url = self.new_urls.pop()
        self.used_urls.add(temp_url)
        return temp_url
html_downloader.py is the [downloader] from the crawler flow chart above; it downloads the content of the given URL. Here only HTTP code 200 is handled; in practice you should also deal with 4xx/5xx responses, for example by retrying.
import urllib.request

class HtmlDownLoader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
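As the note above says, a more realistic downloader would also disguise the User-Agent and retry on failure. Here is a minimal sketch of such a variant; it is illustrative only and not part of the original project.

# A minimal sketch of a more robust download() variant: adds a User-Agent
# header and a simple retry loop.
import time
import urllib.request

def download_with_retry(url, retries=3, wait_seconds=2):
    if url is None:
        return None
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(retries):
        try:
            response = urllib.request.urlopen(request, timeout=10)
            if response.getcode() == 200:
                return response.read()
        except Exception as e:
            print('download failed (%d/%d): %s' % (attempt + 1, retries, e))
        time.sleep(wait_seconds)
    return None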
html_parser.py is the [parser] from the crawler flow chart above; it parses the page content the downloader fetched, using rules we define for the content we care about. The parsed data is returned as a dictionary.
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

class HtmlParser(object):
    def parse(self, url, content, html_encode="utf-8"):
        if url is None or content is None:
            return
        soup = BeautifulSoup(content, "html.parser", from_encoding=html_encode)
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data


    def _get_new_urls(self, url, soup):
        new_urls = set()
        links = soup.find_all("a", href=re.compile(r"/item/\w+"))
        for link in links:
            url_path = link["href"]
            new_url = urljoin(url, url_path)
            new_urls.add(new_url)
        return new_urls


    def _get_new_data(self, url, soup):
        data = {"url": url}
        title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1")
        data["title"] = title_node.get_text()
        summary_node = soup.find("div", class_="lemma-summary")
        data["summary"] = summary_node.get_text()
        return data
html_output.py is the [applicator] from the crawler flow chart above; it puts the parsed data to use. Here it simply renders everything collected in the datas list as a table in a simple WEB page.
import time

class HtmlOutput(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        file_name = time.strftime("%Y-%m-%d_%H:%M:%S")
        with open("out_%s.html" % file_name, "w", encoding="utf-8") as f_out:  # utf-8, since the Baike summaries are Chinese
            f_out.write("<html>")
            f_out.write(r'<head>'
                        r'<link rel="stylesheet" '
                        r'href="https://cdn.bootcss.com/bootstrap/3.3.7/css/bootstrap.min.css" '
                        r'integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" '
                        r'crossorigin="anonymous"></head>')
            f_out.write("<body>")
            f_out.write(r'<table class="table table-bordered table-hover">')

            item_css = ['active', 'success', 'warning', 'info']
            for data in self.datas:
                index = self.datas.index(data) % len(item_css)
                f_out.write(r'<tr class="'+item_css[index]+r'">')
                f_out.write('<td>%s</td>' % data["url"])
                f_out.write('<td>%s</td>' % data["title"])
                f_out.write('<td>%s</td>' % data["summary"])
                f_out.write("</tr>")

            f_out.write("</table>")
            f_out.write("</body>")
            f_out.write("</html>")

Whoa! Isn't that great? By now you should have an overall sense of what a small Python crawler looks like. If so, the next article will walk through other Python crawler techniques step by step. (The code above is short, but if parts of it still look unfamiliar, you will need to brush up on the basics yourself; the details are beyond the scope of this series.)

^-^ Of course, if you have read this far and found it helpful, feel free to scan the QR code and tip a small amount toward badminton shuttles (they are getting expensive these days). It is both encouragement and a form of sharing. Thank you!
