One, Foreword

Today I want to share how to crawl the product quotation data of China Agricultural Network (zhongnongwang.com) with an ordinary single-threaded crawler, a multi-threaded crawler, and a coroutine-based crawler, so as to compare how single threads, multiple threads, and coroutines perform in a web crawler.

Target URL: www.zhongnongwang.com/quote/produ…

Crawl the product name, latest quotation, unit, number of quotations, quotation time and other information, and save them to a local Excel file.

Two, Crawling test

Paging through the list pages shows the pattern of the URL changes:

https://www.zhongnongwang.com/quote/product-htm-page-1.html
https://www.zhongnongwang.com/quote/product-htm-page-2.html
https://www.zhongnongwang.com/quote/product-htm-page-3.html
https://www.zhongnongwang.com/quote/product-htm-page-4.html
https://www.zhongnongwang.com/quote/product-htm-page-5.html
https://www.zhongnongwang.com/quote/product-htm-page-6.html
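Based on this pattern, the URL for any page can be built by substituting the page number. Here is a minimal sketch, assuming the same 50-page range used later in this article:

# Build the list page URLs from the pagination rule above (illustrative sketch)
base = 'https://www.zhongnongwang.com/quote/product-htm-page-{}.html'
urls = [base.format(page) for page in range(1, 51)]
print(urls[0])     # first list page
print(urls[-1])    # page 50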

Inspecting the web page shows that its structure is simple, so the data is easy to parse and extract.

Get all the tr tags under the tbody of the table tag whose class is tb, then iterate over them to extract the product name, latest quotation, unit, number of quotations, quotation time and other information.

# -*- coding: UTF-8 -*-
"""
@File    :demo.py
@Author  :Ye Tingyun
@CSDN    :https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

url = 'https://www.zhongnongwang.com/quote/product-htm-page-1.html'
headers = {
    "Accept-Encoding": "gzip",    # Use gzip compression to transfer data for faster access
    "User-Agent": ua.random
}
rep = requests.get(url, headers=headers)
print(rep.status_code)    # 200
html = etree.HTML(rep.text)
items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
logging.info(f'How many pieces of information does this page have: {len(items)}')    # one page has 20 pieces of information

# Iterate to extract data
for item in items:
    name = ''.join(item.xpath('.//td[1]/a/text()'))     # Product name
    price = ''.join(item.xpath('.//td[3]/text()'))      # Latest quotation
    unit = ''.join(item.xpath('.//td[4]/text()'))       # Unit
    nums = ''.join(item.xpath('.//td[5]/text()'))       # Number of quotations
    time_ = ''.join(item.xpath('.//td[6]/text()'))      # Quote time
    logging.info([name, price, unit, nums, time_])

The running results are as follows:

The data can be crawled successfully. Next, an ordinary single-threaded crawler, a multi-threaded crawler, and a coroutine-based crawler are used in turn to crawl 50 pages of data and save them to Excel.

Three, Single-threaded crawler

# -*- coding: UTF-8 -*-
"""
@File    :single_thread.py
@Author  :Ye Tingyun
@CSDN    :https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from datetime import datetime

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['Name', 'Latest quotation', 'Unit', 'Number of quotations', 'Quote time'])
start = datetime.now()

for page in range(1, 51):
    url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
    headers = {
        "Accept-Encoding": "gzip",    # Use gzip compression to transfer data for faster access
        "User-Agent": ua.random
    }
    rep = requests.get(url, headers=headers)
    # print(rep.status_code)
    html = etree.HTML(rep.text)
    items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
    logging.info(f'How many pieces of information does this page have: {len(items)}')    # one page has 20 pieces of information
    # Iterate to extract data
    for item in items:
        name = ''.join(item.xpath('.//td[1]/a/text()'))     # Product name
        price = ''.join(item.xpath('.//td[3]/text()'))      # Latest quotation
        unit = ''.join(item.xpath('.//td[4]/text()'))       # Unit
        nums = ''.join(item.xpath('.//td[5]/text()'))       # Number of quotations
        time_ = ''.join(item.xpath('.//td[6]/text()'))      # Quote time
        sheet.append([name, price, unit, nums, time_])
        logging.info([name, price, unit, nums, time_])


wb.save(filename='data1.xlsx')
delta = (datetime.now() - start).total_seconds()
logging.info(f'Time: {delta}s')

The running results are as follows:

The single-threaded crawler has to finish one page before it can move on to the next, and it is also affected by the network conditions at the time. It took 48.528703s to crawl all the data, which is relatively slow.

Four, Multi-threaded crawler

# -*- coding: UTF-8 -*-
"""
@File    :multi_thread.py
@Author  :Ye Tingyun
@CSDN    :https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
from datetime import datetime

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['Name', 'Latest quotation', 'Unit', 'Number of quotations', 'Quote time'])
start = datetime.now()


def get_data(page):
    url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
    headers = {
        "Accept-Encoding": "gzip",    # Use gzip compression to transfer data for faster access
        "User-Agent": ua.random
    }
    rep = requests.get(url, headers=headers)
    # print(rep.status_code)
    html = etree.HTML(rep.text)
    items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
    logging.info(f'How many pieces of information does this page have: {len(items)}')    # one page has 20 pieces of information
    # Iterate to extract data
    for item in items:
        name = ''.join(item.xpath('.//td[1]/a/text()'))     # Product name
        price = ''.join(item.xpath('.//td[3]/text()'))      # Latest quotation
        unit = ''.join(item.xpath('.//td[4]/text()'))       # Unit
        nums = ''.join(item.xpath('.//td[5]/text()'))       # Number of quotations
        time_ = ''.join(item.xpath('.//td[6]/text()'))      # Quote time
        sheet.append([name, price, unit, nums, time_])


def run():
    # Crawl pages 1-50 with a thread pool
    with ThreadPoolExecutor(max_workers=6) as executor:
        future_tasks = [executor.submit(get_data, i) for i in range(1, 51)]
        wait(future_tasks, return_when=ALL_COMPLETED)

    wb.save(filename='data2.xlsx')
    delta = (datetime.now() - start).total_seconds()
    print(f'Time: {delta}s')


run()

The running results are as follows:

The multi-threaded crawler is far more efficient: it finished in 2.648128s, a much faster crawl.

Five, Asynchronous coroutine crawler

# -*- coding: UTF-8 -*-
"""
@File    :demo1.py
@Author  :叶庭云
@CSDN    :https://yetingyun.blog.csdn.net/
"""
import aiohttp
import asyncio
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from datetime import datetime

# Logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['Name', 'Latest quotation', 'Unit', 'Number of quotations', 'Quote time'])
start = datetime.now()


class Spider(object):
    def __init__(self):
        # self.semaphore = asyncio.Semaphore(6)    # Sometimes you need a semaphore to control the number of coroutines
        self.header = {
            "Accept-Encoding": "gzip",    # Use gzip compression to transfer data for faster access
            "User-Agent": ua.random
        }

    async def scrape(self, url):
        # async with self.semaphore:    # Set the maximum semaphore when the number of coroutines needs to be limited
        session = aiohttp.ClientSession(headers=self.header,
                                        connector=aiohttp.TCPConnector(ssl=False))
        response = await session.get(url)
        result = await response.text()
        await session.close()
        return result

    async def scrape_index(self, page):
        url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
        text = await self.scrape(url)
        await self.parse(text)

    async def parse(self, text):
        html = etree.HTML(text)
        items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
        logging.info(f'How many pieces of information does this page have: {len(items)}')    # one page has 20 pieces of information
        # Iterate to extract data
        for item in items:
            name = ''.join(item.xpath('.//td[1]/a/text()'))     # Product name
            price = ''.join(item.xpath('.//td[3]/text()'))      # Latest quotation
            unit = ''.join(item.xpath('.//td[4]/text()'))       # Unit
            nums = ''.join(item.xpath('.//td[5]/text()'))       # Number of quotations
            time_ = ''.join(item.xpath('.//td[6]/text()'))      # Quote time
            sheet.append([name, price, unit, nums, time_])
            logging.info([name, price, unit, nums, time_])

    def main(self):
        # Tasks for pages 1-50
        scrape_index_tasks = [asyncio.ensure_future(self.scrape_index(page)) for page in range(1, 51)]
        loop = asyncio.get_event_loop()
        tasks = asyncio.gather(*scrape_index_tasks)
        loop.run_until_complete(tasks)


if __name__ == '__main__':
    spider = Spider()
    spider.main()
    wb.save('data3.xlsx')
    delta = (datetime.now() - start).total_seconds()
    print('Time: {:.3f}s'.format(delta))

The running results are as follows:

The coroutine-based asynchronous crawler is faster still: it took only 0.930s to crawl 50 pages of data. The aiohttp + asyncio asynchronous crawler is impressively fast. As long as the server can withstand high concurrency, an asynchronous crawler can raise the number of concurrent requests, and the gain in crawling efficiency is considerable, even faster than multi-threading.

All three crawlers fetched the same 50 pages of data and saved them locally; the results are as follows:

Six, Summary and review

Today I demonstrated a simple single-threaded crawler, a multi-threaded crawler, and a coroutine-based asynchronous crawler. In general, the asynchronous crawler is the fastest, the multi-threaded crawler is slower, and the single-threaded crawler is the slowest, since it can only start crawling the next page after the previous page has finished.

However, the coroutine-based asynchronous crawler is not as easy to write, and data has to be fetched with aiohttp instead of the requests library. In addition, when crawling a large amount of data, an asynchronous crawler should set a maximum semaphore to control the number of coroutines and keep it from crawling too fast (a sketch of this is given below). Therefore, when actually writing Python crawlers, we generally use multi-threading to speed things up. Note, however, that websites limit the frequency of IP access: crawling too fast may get your IP blocked, so when speeding up with multi-threading we usually crawl data concurrently through proxy IPs.
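
For reference, here is a minimal sketch of how the semaphore that is commented out in the coroutine crawler above could be used to cap concurrency. The limit of 6 and the helper names fetch/main are illustrative assumptions, not part of the original code:

import asyncio
import aiohttp

semaphore = asyncio.Semaphore(6)    # at most 6 requests in flight (illustrative value)

async def fetch(session, url):
    async with semaphore:    # each coroutine must acquire the semaphore before sending a request
        async with session.get(url) as response:
            return await response.text()

async def main(urls):
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# pages = asyncio.get_event_loop().run_until_complete(main(url_list))    # url_list built as shown earlier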

  • Multithreading: a technique that allows multiple threads to execute concurrently, implemented in software or hardware. A computer with multithreading capability has hardware that lets it execute more than one thread at a time, improving overall processing performance. Systems with this capability include symmetric multiprocessors, multi-core processors, and chip-level multithreading or simultaneous multithreading processors. Within a program, these independently running pieces of code are called threads, and the concept of programming with them is called multithreading.
  • Asynchronous: a way for unrelated units within a program to complete a task without communicating or coordinating with each other during the process. For example, when a crawler downloads a web page, the scheduler can move on to other tasks after calling the downloader, without having to keep communicating with it to coordinate behavior. Downloading and saving different web pages are unrelated operations, so they do not need to notify or coordinate with each other, and the time at which each asynchronous operation completes is uncertain. In short, asynchronous means unordered.
  • Coroutine: also known as a micro-thread or fiber, a coroutine is a lightweight user-mode thread. A coroutine has its own register context and stack. When a coroutine switch is scheduled, the register context and stack are saved elsewhere; when the coroutine is switched back, the previously saved register context and stack are restored. A coroutine can therefore preserve the state of its last call (a specific combination of all its local state), and each re-entry is equivalent to resuming from the state of the previous call. Coroutines are essentially single-process: compared with multiple processes, they need no thread context switching and no locking or synchronization of atomic operations, and the programming model is very simple. We can use them to implement asynchronous operations. In a web crawler scenario, for example, after we send a request we have to wait some time for the response, but during that wait the program can do many other things and switch back to continue processing once the response arrives. This makes full use of the CPU and other resources, which is the advantage of coroutines (see the minimal sketch below).
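
To make the coroutine idea concrete, here is a minimal, self-contained sketch (independent of the crawler above) in which two coroutines overlap their waiting time, so the total run time is roughly the longest single wait rather than the sum of both:

import asyncio
import time

async def task(name, seconds):
    print(f'{name} started')
    await asyncio.sleep(seconds)    # while this task waits, the event loop runs the other one
    print(f'{name} finished after {seconds}s')

async def main():
    await asyncio.gather(task('A', 2), task('B', 3))

start = time.perf_counter()
asyncio.run(main())
print(f'Total: {time.perf_counter() - start:.1f}s')    # about 3s, not 5s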

Author: Ye Tingyun

CSDN:yetingyun.blog.csdn.net/

May passion outlast the long years. Discover the joy of learning, keep learning and improving, and share it with you.
