Life is short, I use Python.

Previous articles in this series:

Python crawler (1): The beginning

Python crawler (2): Pre-preparation (1) basic library installation

Python crawler (3): Pre-preparation (2) Linux basics

Python crawler (4): Pre-preparation (3) Docker basics

Python crawler (5): Pre-preparation (4) database basics

Python crawler (6): Pre-preparation (5) crawler framework installation

Python crawler (7): HTTP basics

Python crawler (8): Web basics

Python crawler (9): Crawler basics

Python crawler (10): Session and Cookies

Python crawler (11): Urllib

Python crawler (12): Urllib

Python crawler (13): Urllib

Python crawler (14): Urllib

Python crawler (15): Urllib

Python crawler (16): Urllib

Python crawler (17): Basic usage of Requests

Python crawler (18): Advanced Requests operations

Python crawler (19): XPath basic operations

Python crawler (20): Advanced XPath

Python crawler (21): The parsing library Beautiful Soup

Python crawler (22): Beautiful Soup

Python crawler (23): Getting started with pyQuery

Python crawler (24): 2019 Douban movie rankings

Python crawler (25): Crawling stock information

Introduction

Why second-hand houses and not new ones? Emmmmmmmmm…

That is a story of blood and tears…

Your author has already cried himself faint in the bathroom. Wake up, folks, the sun has not even set yet.

And don't look down on second-hand houses; they only sound like something everyone can afford.

Analysis

Enough rambling; let's get down to business. The target page link has already been found: https://sh.lianjia.com/ershoufang/pg1/.

There are still quite a lot of listings. With the real estate market in a slump this year, prices are said to be not too high.

Your author actually has an ulterior motive here: after more than five years in Shanghai, if something suitable really does turn up in the crawled data, all the better, and it can also help you scope out the approach.

First, analyze the page link itself. It is actually fairly obvious: the last segment of the URL contains pg1, which your author guesses stands for page 1. If you don't believe it, change it to pg2 and see; the meaning is clear enough.
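Just to make that concrete, here is a tiny sketch, assuming (as guessed above) that the trailing pg<N> segment is simply the page number:

```python
# Minimal sketch: generate the first few Lianjia list-page URLs,
# assuming the trailing pg<N> segment is just the page number.
for page in range(1, 4):
    print('https://sh.lianjia.com/ershoufang/pg%d/' % page)
```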

Open any listing at random and go to its detail page to look at the data:

The data here is quite comprehensive, so this is where the detailed fields will be taken from.

Also take a look at the detail page link: https://sh.lianjia.com/ershoufang/107102012982.html.

Where does this number come from?

Of course, your author found it in the DOM structure of the outer list page.

Call it an old driver's intuition; impressive or not, there it is.
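To show what that intuition amounts to, here is a small sketch that pulls the id out of the first item on the list page and builds the detail-page link from it. The attribute name data-lj_action_housedel_id is the one used in the code further down; the User-Agent header is an assumption for illustration.

```python
import requests
from pyquery import PyQuery

# Illustrative sketch: grab the first list item on page 1, read its house id
# from the data-lj_action_housedel_id attribute, and build the detail-page URL.
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like header
html = requests.get('https://sh.lianjia.com/ershoufang/pg1/', headers=headers).text
doc = PyQuery(html)
first_item = doc('.sellListContent li').eq(0)
house_id = first_item.attr('data-lj_action_housedel_id')
print('https://sh.lianjia.com/ershoufang/%s.html' % house_id)
```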

Writing the code

The idea is the usual one: build a list of listing IDs from the outer list pages, then loop through that list, crawl each detail page, and write the data to MySQL.

The request and parsing libraries used in this article are Requests and PyQuery.

Don't ask why; the answer is simply that your author likes them.

Because they are simple.

Let’s first define a method to crawl the outer listings:

```python
import requests
from pyquery import PyQuery

# Collect the house ids from the outer list pages.
# headers (with a browser User-Agent) is assumed to be defined elsewhere.
def get_outer_list(maxNum):
    house_list = []
    for i in range(1, maxNum + 1):
        url = 'https://sh.lianjia.com/ershoufang/pg' + str(i)
        print('Crawling link: %s' % url)
        response = requests.get(url, headers=headers)
        doc = PyQuery(response.text)
        num = 0
        # each <li> in the list container carries the house id in a data attribute
        for item in doc('.sellListContent li').items():
            num += 1
            house_list.append(item.attr('data-lj_action_housedel_id'))
        print('This page contains %d listings' % num)
    return house_list
```

Here we first obtain the list of house IDs, which makes it easy to build the detail-page links in the next step. The input parameter is the maximum number of pages to crawl, which must not exceed the actual number of pages; Lianjia currently shows at most 100 pages, so 100 is the most that can be passed here.
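For example, a short usage sketch of the method above:

```python
# Usage sketch: crawl all 100 visible list pages and collect the house ids.
house_ids = get_outer_list(100)
print('Collected %d house ids in total' % len(house_ids))
```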

Once the list of IDs is obtained, we need to fetch the detailed information for each house. This time there is quite a lot of data, and it takes some effort to parse it all:

```python
def get_inner_info(id_list):
    for i in id_list:
        try:
            response = requests.get('https://sh.lianjia.com/ershoufang/' + str(i) + '.html',
                                    headers=headers)
            doc = PyQuery(response.text)
            # basic attributes, with the label spans stripped out
            base_li_item = doc('.base .content ul li').remove('.label').items()
            base_li_list = []
            for item in base_li_item:
                base_li_list.append(item.text())
            # transaction attributes
            transaction_li_item = doc('.transaction .content ul li').items()
            transaction_li_list = []
            for item in transaction_li_item:
                transaction_li_list.append(item.children().not_('.label').text())
            insert_data = {
                "id": i,
                "danjia": doc('.unitPriceValue').remove('i').text(),  # unit price
                "zongjia": doc('.price .total').text() + '万',         # total price (10k CNY)
                "quyu": doc('.areaName .info').text(),                 # district
                "xiaoqu": doc('.communityName .info').text(),          # community
                "huxing": base_li_list[0],          # layout
                "louceng": base_li_list[1],         # floor
                "jianmian": base_li_list[2],        # built area
                "jiegou": base_li_list[3],          # layout structure
                "taoneimianji": base_li_list[4],    # inner floor area
                "jianzhuleixing": base_li_list[5],  # building type
                "chaoxiang": base_li_list[6],       # orientation
                "jianzhujiegou": base_li_list[7],   # building structure
                "zhuangxiu": base_li_list[8],       # decoration
                "tihubili": base_li_list[9],        # elevator-to-household ratio
                "dianti": base_li_list[10],         # elevator
                "chanquan": base_li_list[11],       # property rights term
                "guapaishijian": transaction_li_list[0],   # listing date
                "jiaoyiquanshu": transaction_li_list[1],   # transaction ownership
                "shangcijiaoyi": transaction_li_list[2],   # last transaction
                "fangwuyongtu": transaction_li_list[3],    # house usage
                "fangwunianxian": transaction_li_list[4],  # house age
                "chanquansuoshu": transaction_li_list[5],  # ownership
                "diyaxinxi": transaction_li_list[6]        # mortgage info
            }
            cursor.execute(sql_insert, insert_data)
            conn.commit()
            print(i, ': written successfully')
        except Exception:
            print(i, ': write failed')
            continue
```
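Both methods rely on a few globals defined elsewhere in the full source: headers, conn, cursor and sql_insert. For completeness, here is a hedged sketch of what that glue might look like; the connection parameters, database name and table name are assumptions for illustration, and only the named placeholders need to match the keys of insert_data:

```python
import pymysql

# Assumed request header with a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0'}

# Assumed MySQL connection; host, credentials, database and table names are placeholders.
conn = pymysql.connect(host='localhost', user='root', password='password',
                       db='lianjia', charset='utf8mb4')
cursor = conn.cursor()

# Named-parameter INSERT whose placeholders match the keys of insert_data.
sql_insert = (
    "INSERT INTO ershoufang (id, danjia, zongjia, quyu, xiaoqu, huxing, louceng, jianmian, "
    "jiegou, taoneimianji, jianzhuleixing, chaoxiang, jianzhujiegou, zhuangxiu, tihubili, "
    "dianti, chanquan, guapaishijian, jiaoyiquanshu, shangcijiaoyi, fangwuyongtu, "
    "fangwunianxian, chanquansuoshu, diyaxinxi) VALUES ("
    "%(id)s, %(danjia)s, %(zongjia)s, %(quyu)s, %(xiaoqu)s, %(huxing)s, %(louceng)s, "
    "%(jianmian)s, %(jiegou)s, %(taoneimianji)s, %(jianzhuleixing)s, %(chaoxiang)s, "
    "%(jianzhujiegou)s, %(zhuangxiu)s, %(tihubili)s, %(dianti)s, %(chanquan)s, "
    "%(guapaishijian)s, %(jiaoyiquanshu)s, %(shangcijiaoyi)s, %(fangwuyongtu)s, "
    "%(fangwunianxian)s, %(chanquansuoshu)s, %(diyaxinxi)s)"
)

# Run the whole flow: collect the ids from the list pages, then crawl every detail page.
get_inner_info(get_outer_list(100))
```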

With the two most critical methods written, let's take a look at your author's results:

These prices give your author a touch of high blood pressure.

As expected of the great magic capital: no matter how many hands a house has passed through, the prices are still quite a sight.

Summary

As the results show, although Lianjia claims there are more than 60,000 listings, we can only get about 3,000 of them from the pages, far from all the data. However, if filter conditions are added, the total does change, so the site has most likely hard-capped the display at 100 pages to keep the data from being crawled in full.

The trick runs deep: as long as the data is never shown, you folks simply cannot crawl it. For the average user, the first few pages are more than enough anyway; very few people will page all the way to the end.

That's it for the code. If you want the complete source, you can get it from the code repository.

Sample code

All of the code in this series will be available on GitHub and Gitee.

Example code - GitHub

Example code - Gitee