First published on the WeChat official account "Jie Ge's IT Journey" (ID: Jake_Internet). For reprint authorization, please contact WeChat ID: Hc220088.

WeChat official account: Jie Ge's IT Journey. Reply "Chengdu second-hand housing data" in the background to obtain the complete data used in this article.

Asyncio + aiohttp asynchronous crawler

This article first reviews the basic concepts of concurrency and parallelism, blocking and non-blocking, synchronous and asynchronous, multithreading, multiprocessing, and coroutines. It then implements an asyncio + aiohttp asynchronous crawler that scrapes Lianjia's second-hand housing listings for Chengdu, and runs a simple efficiency comparison against a multithreaded version.

1. Basic Concepts

Concurrency and parallelism

  • Concurrency: at any instant only one instruction is executed, but instructions from multiple processes are switched between so rapidly that, at the macro level, the processes appear to run simultaneously. At the micro level they do not run at the same time; time is divided into slices in which the processes alternate quickly.

  • Parallelism: multiple instructions execute simultaneously on multiple processors, so the work truly proceeds together, at both the micro and the macro level.

Blocking and non-blocking

  • Blocking: a program is in a blocked state when it is suspended because it has not obtained the computing resource it needs. A program is said to block on an operation when, while waiting for that operation to complete, it cannot continue to do anything else on its own.

  • Nonblocking: A program is said to be nonblocking if it does not block while waiting for an operation and can continue to process other things.
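To make the contrast concrete, here is a minimal sketch (not from the original article) that issues the same plain HTTP request with a blocking socket and then a non-blocking one; the host example.com and the buffer size are arbitrary choices for illustration.

import socket

# Blocking: recv() suspends the program until data arrives.
blocking_sock = socket.create_connection(("example.com", 80))
blocking_sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
data = blocking_sock.recv(4096)      # the program can do nothing else while waiting here

# Non-blocking: recv() returns immediately; if no data is ready yet it raises
# BlockingIOError, and the program is free to do other work and try again later.
nonblocking_sock = socket.create_connection(("example.com", 80))
nonblocking_sock.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
nonblocking_sock.setblocking(False)
try:
    data = nonblocking_sock.recv(4096)
except BlockingIOError:
    pass                             # nothing to read yet; do something else and poll again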

Synchronous and asynchronous

  • Synchronous: to complete a task, different program units must coordinate through some form of communication during execution; we say these program units execute synchronously.

  • Asynchronous: a task can be completed without the program units involved communicating or coordinating with each other; unrelated program units can proceed asynchronously.

Multithreading

Multithreading is a technique that enables concurrent execution of multiple threads from software or hardware. Multithreaded computers have hardware that allows them to execute more than one thread at a time, improving overall processing performance. Systems with this capability include symmetric multiprocessors, multicore processors, and chip-level multiprocessors or simultaneous multithreading processors. In a program, these independently running pieces of program are called threads, and the concept of programming with them is called multithreading.
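As a minimal illustration (my own sketch, not from the original article), the snippet below runs five simulated I/O-bound tasks in separate threads; because a sleeping thread yields the CPU, all five finish in roughly one second instead of five.

import threading
import time

def worker(name):
    # simulate an I/O-bound task; while one thread sleeps, the others keep running
    time.sleep(1)
    print(f'{name} finished')

threads = [threading.Thread(target=worker, args=(f'thread-{i}',)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # all five tasks complete in roughly 1 second instead of 5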

Multiprocessing

Every program running on the system is a process, and each process contains one or more threads. A process can also be viewed as the dynamic execution of a whole program or of part of it. A thread is a set of instructions, or a special segment of a program, that can execute independently within the program; it can be understood as the context in which the code runs. Threads are essentially lightweight processes that carry out multiple tasks within a single program. Multiprocessing exploits the CPU's multiple cores to execute several tasks in parallel at the same time, which can greatly improve execution efficiency.
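A minimal multiprocessing sketch (again not from the original article): a CPU-bound function is mapped over a pool of worker processes so that separate cores can work in parallel; the pool size of 4 is an arbitrary example value.

from multiprocessing import Pool

def square(n):
    # a CPU-bound task; separate processes can run on separate cores in parallel
    return n * n

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))   # [0, 1, 4, 9, ...]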

Coroutines

A coroutine, also known as a micro-thread or fiber, is a lightweight user-mode thread.

A coroutine has its own register context and stack. When the coroutine is scheduled away, its register context and stack are saved elsewhere; when it is switched back, the saved register context and stack are restored. A coroutine can therefore preserve the state of its last call, that is, a specific combination of all its local state, and every re-entry resumes from the state of the previous call.

A coroutine is essentially a single process: compared with multiple processes, coroutines avoid the overhead of thread context switching and of locking and synchronizing atomic operations, and the programming model is very simple.

We can use coroutines to implement asynchronous operations. In a web crawler, for example, after we send a request we have to wait some time for the response; during that wait the program can do many other things and switch back to continue processing once the response arrives. This makes full use of the CPU and other resources, which is the advantage of coroutines.
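The following minimal asyncio sketch (not from the original article) illustrates this: while one coroutine awaits its simulated network delay, the event loop runs the others, so three two-second waits complete in about two seconds overall.

import asyncio

async def fetch(name, delay):
    # await hands control back to the event loop while this coroutine "waits",
    # so other coroutines can run in the meantime
    await asyncio.sleep(delay)    # stands in for waiting on a network response
    return f'{name} done after {delay}s'

async def main():
    results = await asyncio.gather(fetch('task-1', 2), fetch('task-2', 2), fetch('task-3', 2))
    print(results)                # all three finish in about 2 seconds, not 6

asyncio.run(main())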

2. Asyncio + aiohttp asynchronous crawler

Basic idea of crawler:

  • Determine the destination URL
  • Send a request to get a response
  • Parse the response to extract the data
  • Save the data
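As a minimal end-to-end sketch of these four steps (not from the original article; it uses example.com as a stand-in target), the snippet below requests a page, parses it with lxml, extracts the heading with XPath, and saves it to a file.

import requests                 # step 2: send a request
from lxml import etree          # step 3: parse the response

url = 'https://example.com/'    # step 1: determine the target URL
resp = requests.get(url)        # step 2: get the response
html = etree.HTML(resp.text)
title = html.xpath('//h1/text()')[0]    # step 3: extract the data with XPath
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)              # step 4: save the data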

Look at the source code of the web page to find the data we want to extract

Inspect and analyze the page:

On each page, the information about every listing sits under an li tag.

Page 1: cd.lianjia.com/ershoufang/
Page 2: cd.lianjia.com/ershoufang/pg2/
Page 3: cd.lianjia.com/ershoufang/pg3/
…
Page 100: cd.lianjia.com/ershoufang/pg100/

The pagination rule is easy to spot, so we can construct the list of request URLs, as sketched below.
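A small sketch of building that URL list (page 1 uses the bare listing URL, the remaining pages follow the pg{n} pattern used in the code below):

urls = ['https://cd.lianjia.com/ershoufang/'] + [
    f'https://cd.lianjia.com/ershoufang/pg{page}/' for page in range(2, 101)
]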

The asynchronous crawler code is as follows:

import asyncio
import aiohttp
from lxml import etree
import logging
import datetime
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
# column headers for the fields collected below
sheet.append(['title', 'house info', 'location', 'unit price (yuan/m2)', 'followers / listing time', 'tags'])

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
start = datetime.datetime.now()


class Spider(object):
    def __init__(self):
        self.semaphore = asyncio.Semaphore(10)  # limit on concurrent requests (value assumed)
        self.header = {
            "Host": "cd.lianjia.com",
            "Referer": "https://cd.lianjia.com/ershoufang/",
            "Cookie": "<your lianjia.com session cookie>",  # the article embedded a long captured session cookie here
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/84.0.4147.89 Safari/537.36"
        }

    async def scrape(self, url):
        async with self.semaphore:
            session = aiohttp.ClientSession(headers=self.header)
            response = await session.get(url)
            result = await response.text()
            await session.close()
            return result

    async def scrape_index(self, page):
        url = f'https://cd.lianjia.com/ershoufang/pg{page}/'
        text = await self.scrape(url)
        await self.parse(text)

    async def parse(self, text):
        html = etree.HTML(text)
        lis = html.xpath('//*[@id="content"]/div[1]/ul/li')
        for li in lis:
            house_data = li.xpath('.//div[@class="title"]/a/text()')[0]              # listing title
            house_info = li.xpath('.//div[@class="houseInfo"]/text()')[0]            # house information
            address = ' '.join(li.xpath('.//div[@class="positionInfo"]/a/text()'))   # location
            price = li.xpath('.//div[@class="priceInfo"]/div[2]/span/text()')[0]     # unit price, yuan/m2
            attention_num = li.xpath('.//div[@class="followInfo"]/text()')[0]        # followers and listing time
            tag = ' '.join(li.xpath('.//div[@class="tag"]/span/text()'))             # tags
            sheet.append([house_data, house_info, address, price, attention_num, tag])
            logging.info([house_data, house_info, address, price, attention_num, tag])

    def main(self):
        scrape_index_tasks = [asyncio.ensure_future(self.scrape_index(page)) for page in range(1, 101)]
        loop = asyncio.get_event_loop()
        tasks = asyncio.gather(*scrape_index_tasks)
        loop.run_until_complete(tasks)


if __name__ == '__main__':
    spider = Spider()
    spider.main()
    wb.save('house.xlsx')
    delta = (datetime.datetime.now() - start).total_seconds()
    print("time: {:.3f}s".format(delta))

The running results are as follows:

It took 15.976 s to successfully crawl 100 pages of data containing 3,000 listings.
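One possible refinement, not part of the original code: scrape() above opens and closes a new ClientSession for every request, whereas aiohttp works well with a single session shared across requests. A minimal sketch of that pattern (the function name and parameters are illustrative):

import asyncio
import aiohttp

async def fetch_all(urls, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                return await response.text()

    # one shared ClientSession for all requests instead of one per request
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# example: pages = asyncio.run(fetch_all([f'https://cd.lianjia.com/ershoufang/pg{p}/' for p in range(1, 11)]))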

The multithreaded version of the crawler is as follows:

import requests
from lxml import etree
import openpyxl
from concurrent.futures import ThreadPoolExecutor
import datetime
import logging

headers = {
    "Host": "cd.lianjia.com",
    "Referer": "https://cd.lianjia.com/ershoufang/",
    "Cookie": "<your lianjia.com session cookie>",  # the article embedded the same captured session cookie here
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/84.0.4147.89 Safari/537.36"
}

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['title', 'house info', 'location', 'unit price (yuan/m2)', 'followers / listing time', 'tags'])
start = datetime.datetime.now()


def get_house(page):
    if page == 1:
        url = "https://cd.lianjia.com/ershoufang/"
    else:
        url = f"https://cd.lianjia.com/ershoufang/pg{page}/"
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    lis = html.xpath('//*[@id="content"]/div[1]/ul/li')
    for li in lis:
        house_data = li.xpath('.//div[@class="title"]/a/text()')[0]              # listing title
        house_info = li.xpath('.//div[@class="houseInfo"]/text()')[0]            # house information
        address = ' '.join(li.xpath('.//div[@class="positionInfo"]/a/text()'))   # location
        price = li.xpath('.//div[@class="priceInfo"]/div[2]/span/text()')[0]     # unit price, yuan/m2
        attention_num = li.xpath('.//div[@class="followInfo"]/text()')[0]        # followers and listing time
        tag = ' '.join(li.xpath('.//div[@class="tag"]/span/text()'))             # tags
        sheet.append([house_data, house_info, address, price, attention_num, tag])
        logging.info([house_data, house_info, address, price, attention_num, tag])


if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=6) as executor:
        executor.map(get_house, [page for page in range(1, 101)])
    wb.save('house.xlsx')
    delta = (datetime.datetime.now() - start).total_seconds()
    print("time: {:.3f}s".format(delta))

The running results are as follows:

It took 16.796 s to successfully crawl 100 pages of data containing 3,000 listings.

3. Data acquisition

Land market data is generally published by the local public resources trading center, but usually only for the current week or month, so we have to turn to a specialized land website for transaction data, for example Tudinet: www.tudinet.com/market-0-0-…

The site structure is simple: construct the paginated URLs, then parse each page with XPath to extract the data.

The crawler code is as follows:

import requests
from lxml import etree
import random
import time
import logging
import openpyxl
from datetime import datetime

wb = openpyxl.Workbook()
sheet = wb.active
# column headers for the fields collected below
sheet.append(['location', 'transfer form', 'launch time', 'land area', 'planned building area',
              'address', 'transaction status', 'area code', 'planned use'])
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')

user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko)",  # UA string incomplete in the source
]

start = datetime.now()


def get_info(page):
    headers = {"User-Agent": random.choice(user_agent)}
    url = f'https://www.tudinet.com/market-254-0-0-0/list-pg{page}.html'
    resp = requests.get(url, headers=headers).text
    time.sleep(1)
    html = etree.HTML(resp)
    lis = html.xpath('//div[@class="land-l-cont"]/dl')
    # print(len(lis))
    for li in lis:
        try:
            location = li.xpath('.//dd/p[7]/text()')[0]        # land location
            transfer_form = li.xpath('.//dt/i/text()')[0]      # transfer form
            launch_time = li.xpath('.//dd/p[1]/text()')[0]     # launch time
            land_area = li.xpath('.//dd/p[3]/text()')[0]       # land area
            planning_area = li.xpath('.//dd/p[5]/text()')[0]   # planned building area
            address = li.xpath('.//dd/p[4]/text()')[0]         # land address
            state = li.xpath('.//dd/p[2]/text()')[0]           # transaction status
            area_code = li.xpath('.//dt/span/text()')[0]       # area code
            planned_use = li.xpath('.//dd/p[6]/text()')[0]     # planned use
            data = [location, transfer_form, launch_time, land_area, planning_area,
                    address, state, area_code, planned_use]
            sheet.append(data)
            logging.info(data)
        except Exception as e:
            logging.info(e.args[0])
            continue


def main():
    for i in range(1, 101):
        get_info(i)
        logging.info(f'page {i} finished')
        time.sleep(random.randint(1, 2))
    wb.save(filename="real_estate_info.xlsx")


if __name__ == '__main__':
    main()
    delta = (datetime.now() - start).total_seconds()
    print(f'time: {delta}s')

Running the crawler extracted records for 3,158 plots of land in Chengdu; the results are as follows:

4. Data viewing

The data is clean and complete, and can be directly used for data analysis.

5. Analyze land transaction data

1) Land transfer form & land transaction status

import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.globals import CurrentConfig, ThemeType

CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'  # local pyecharts assets

# read the two columns of interest
df = pd.read_excel('real_estate_info.xlsx').loc[:, ['transfer form', 'transaction status']]
df1 = df['transfer form'].value_counts()
df2 = df['transaction status'].value_counts()

data_pair_1 = [(i, int(j)) for i, j in zip(df1.index, df1.values)]
data_pair_2 = [(i, int(j)) for i, j in zip(df2.index, df2.values)]

c = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.DARK, width="1100px"))
    .add("transfer form", data_pair_1, center=["30%", "50%"],          # left pie
         label_opts=opts.LabelOpts(is_show=True))
    .set_colors(['red', 'blue', 'purple'])
    .add("transaction status", data_pair_2, center=["70%", "50%"],     # right pie
         label_opts=opts.LabelOpts(is_show=True))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Land transfer form & transaction status"),
        legend_opts=opts.LegendOpts(is_show=False))
    .set_series_opts(tooltip_opts=opts.TooltipOpts(
        trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"))
    .render("pie_.html")
)

We use pyecharts pie charts for statistical analysis and visualization. From September 2015 to February 2020, land in Chengdu was transferred mainly by listing (67.73%) and auction (31.45%), with only a small share, 0.82%, sold through bidding. Plots that did not sell or were passed in made up less than half of Chengdu's land auctions, while sold plots accounted for as much as 65.77%; the overall transaction rate was fairly high, possibly because there were many interested bidders and the offers were suitable.

2) Land transaction area

import pandas as pd
import pyecharts.options as opts
from pyecharts.charts import Bar
from pyecharts.globals import CurrentConfig, ThemeType

CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

df = pd.read_excel('real_estate_info.xlsx').loc[:, ['launch time', 'land area', 'planned building area']]

# split the launch-time string on the character for "year" to get the year
date = df['launch time'].str.split('年', expand=True)[0]
df['year'] = date   # add a new year column

# strip the trailing unit character and convert the areas to float
df['land area'] = df['land area'].str[:-1].map(float)
df['planned building area'] = df['planned building area'].str[:-1].map(float)

# group by year and sum; convert m2 to 10,000 m2
land_area = df.groupby('year').agg({'land area': 'sum'}) / 10000
planned_area = df.groupby('year').agg({'planned building area': 'sum'}) / 10000
print(land_area, type(land_area))        # <class 'pandas.core.frame.DataFrame'>
print(planned_area, type(planned_area))

years = [int(y) for y in land_area.index[1:-1]]
# keep two decimal places
ydata_1 = [float('{:.2f}'.format(i)) for i in land_area['land area'][1:-1]]
ydata_2 = [float('{:.2f}'.format(j)) for j in planned_area['planned building area'][1:-1]]

bar = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_xaxis(xaxis_data=years)
    .add_yaxis('land area (10,000 m2)', ydata_1, label_opts=opts.LabelOpts(is_show=False))
    .add_yaxis('planned building area (10,000 m2)', ydata_2, label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(name='year'),
        yaxis_opts=opts.AxisOpts(name='10,000 m2'))
    .set_series_opts(markpoint_opts=opts.MarkPointOpts(data=[
        opts.MarkPointItem(type_="max", name="maximum"),
        opts.MarkPointItem(type_="min", name="minimum"),
    ]))
    .render('bar_.html')
)

From 2016 to 2019 the land transaction area grew year by year, peaking in 2018, when the total planned building area for the year reached about 41.56 million m². The transacted area in 2019 then fell back somewhat from the 2018 level.

import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import CurrentConfig, ThemeType

CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

df = pd.read_excel('real_estate_info.xlsx').loc[:, ['launch time', 'land area', 'planned building area']]
df['land area'] = df['land area'].str[:-1].map(float)
df['planned building area'] = df['planned building area'].str[:-1].map(float)

# keep the year-and-month part of the launch-time string as the month label
date = df['launch time'].str.split('月', expand=True)[0].map(lambda x: x + '月')
# print(date)
df['month'] = date

# keep only records from 2019 onwards
df1 = df[(df['launch time'].str[:4] == '2020') | (df['launch time'].str[:4] == '2019')]
df2 = df1.groupby('month').agg({'land area': 'sum'}) / 10000
df3 = df1.groupby('month').agg({'planned building area': 'sum'}) / 10000
# print(df2)
# print(df3)

month = df2.index.tolist()
ydata_1 = [float('{:.2f}'.format(i)) for i in df2['land area']]
ydata_2 = [float('{:.2f}'.format(j)) for j in df3['planned building area']]

bar = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_xaxis(xaxis_data=month)
    .add_yaxis('land area (10,000 m2)', ydata_1, stack='stack1',              # stack the two series
               label_opts=opts.LabelOpts(is_show=False))
    .add_yaxis('planned building area (10,000 m2)', ydata_2, stack='stack1',
               label_opts=opts.LabelOpts(is_show=False))
    .reversal_axis()   # horizontal bar chart
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(name='10,000 m2'),
        yaxis_opts=opts.AxisOpts(name='month'))
    .render('reverse_bar.html')
)

From January 2019 to February 2020, Chengdu's land transactions were relatively active in 2019, with the transacted area fluctuating considerably. Planned building area peaked in December 2019 at 8.1747 million m², after which the transacted area in January and February 2020 fell sharply, likely due in part to the COVID-19 outbreak in China at the beginning of the year.

3) The planned use of the land

import pandas as pd
from pyecharts.charts import Radar
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig, ThemeType

CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

df = pd.read_excel('real_estate_info.xlsx')['planned use']
datas = df.value_counts()
items = datas.index.tolist()
colors = ['#FF0000', '#FF4500', '#00FA9A', '#FFFFF0', '#FFD700']

# one radar indicator per planned-use category
labels = [opts.RadarIndicatorItem(name=items[i], max_=50, color=colors[i]) for i in range(len(items))]
value = [int(j) for j in datas.values]

radar = (
    Radar(init_opts=opts.InitOpts(theme=ThemeType.DARK))
    .add_schema(schema=labels)
    .add(series_name='share of planned land use (%)',
         data=[[round((x / sum(value)) * 100, 3) for x in value]],
         areastyle_opts=opts.AreaStyleOpts(opacity=0.5, color='blue'))  # fill the enclosed area
    .set_global_opts()
    .render('radar.html')
)

The transacted land is mainly designated for industrial use, which accounts for 43.667% of the total; a large share goes to commercial/office, comprehensive, and other uses, while residential land accounts for only 5.098%. This also reflects Chengdu's emphasis on industrial development: according to publicly available figures, during the 12th Five-Year Plan period Chengdu's industrial output grew by about 14.4% a year on average, the highest among the 15 sub-provincial cities, strongly supporting the city's GDP in reaching the trillion-yuan level.

4) Land transaction regions

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib as mpl

df = pd.read_excel('real_estate_info.xlsx')

# read the list of administrative regions from a local file
with open('test.txt', encoding='utf-8') as f:
    areas = f.read().split(', ')

# one column per region: the planned building area if the region name appears in the address, else 0
for item in areas:
    df[item] = [eval(df.loc[x, 'planned building area'][:-1]) if item in df.loc[x, 'address'] else 0
                for x in range(len(df['address']))]

date = df['launch time'].str.split('年', expand=True)[0]
df['year'] = date   # add a new year column

df1 = df[areas]
df1.index = df['year']
df2 = df1.groupby('year').sum()
# print(df2.iloc[:5, :])
# print(type(df2.iloc[:5, :]))

datas = np.array(df2.iloc[:5, :].T)   # transpose: regions as rows, years as columns
# print(datas, type(datas))
x_label = [year for year in range(2015, 2020)]
y_label = areas

mpl.rcParams['font.family'] = 'Kaiti'
fig, ax = plt.subplots(figsize=(15, 9))
sns.heatmap(data=df2.iloc[:5, :].T, linewidths=0.25, linecolor='black', ax=ax, annot=True, fmt='.1f')
# x / y axis titles
ax.set_xlabel('year', fontdict={'size': 18, 'weight': 'bold'})
ax.set_ylabel('administrative region', fontdict={'size': 18, 'weight': 'bold'})
ax.set_title('Total planned building area of each administrative region, 2015-2019 (m2)',
             fontsize=25, x=0.5, y=1.02)
# hide the frame
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
plt.savefig('heat_map.png')
plt.show()

Looking at the transaction regions, every administrative region except Shuangliu County and Pixian County recorded some land transactions each year. Longquanyi District and Qingbaijiang District had the largest transacted land area in 2018 and 2019, and their land markets were hot.

Other notes

  • The data analysis in this article is for study and research purposes only, and the conclusions it offers are for reference only.
  • The simple tests above show that, provided the server can bear a high level of concurrent requests, using asynchronous requests flexibly and raising the number of concurrent requests can greatly improve crawling efficiency.
  • The crawler code is provided solely for exchanging knowledge about Python crawlers and must not be used for any other purpose; you bear the consequences of any misuse.
  • It is not advisable to scrape too much data, as it burdens the server; take only a little.
  • Corrections of any shortcomings are welcome.