Background

This article is the sequel to “Big Data Analysis 01: Chengdu Second-hand House (Average Price)”. In the previous article, we learned how to use a crawler to collect data and looked at the average price in each district to get a general idea. However, two problems remained: (1) the crawler collected a large amount of duplicate data, which distorted the analysis results; (2) the analysis did not help users locate houses that suit them. This article explains in detail how to solve these two problems.

Data deduplication

The solution comes from a question I asked myself, “How does a crawler deduplicate?”; interested readers can go and have a look. Following the suggestion there, I went back and relearned the Scrapy framework:

Scrapy runs like this:

First, the engine takes a URL from the Scheduler for the next fetch. The engine wraps the URL as a Request and passes it to the Downloader. The Downloader downloads the resource and wraps it as a Response, which is handed to the spider for parsing. If the spider parses out an Item, the Item is handed over to the Item Pipeline for further processing. If it parses out a link (URL), the URL is handed over to the Scheduler to wait for fetching.

The Scheduler takes care of URL de-duplication, but only for requests that actually pass through it. So I removed the requests module and sent all requests using scrapy.Request. In the end, I got more than 20,000 non-duplicated records, only a few hundred off from the official count shown by Lianjia. It is not clear whether Lianjia itself has duplicated data or whether I lost that part of the data while entering verification codes; I will follow up later. Either way, the data now reflects the real situation.
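The flow described above can be sketched as a toy model in plain Python. This is not Scrapy's implementation, only an illustration of why routing every URL through one scheduler removes duplicates: the scheduler remembers what it has already queued. The `FAKE_SITE` dictionary, the `download` and `crawl` functions, and the URLs are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page).
# Pages link back to each other, which is what produces duplicates in a naive crawl.
FAKE_SITE = {
    "/list/1": (["house-A", "house-B"], ["/list/2", "/list/1"]),
    "/list/2": (["house-C"], ["/list/1", "/list/3"]),
    "/list/3": (["house-D"], ["/list/2"]),
}


def download(url):
    """Stand-in for the Downloader: return the parsed 'response' for a URL."""
    return FAKE_SITE.get(url, ([], []))


def crawl(start_url):
    """Toy engine loop: Scheduler -> Downloader -> spider -> Pipeline."""
    queue = deque([start_url])  # Scheduler queue of pending URLs
    seen = {start_url}          # URLs already scheduled (the de-duplication step)
    pipeline = []               # stand-in for the Item Pipeline
    fetched = 0

    while queue:
        url = queue.popleft()
        items, links = download(url)
        fetched += 1
        pipeline.extend(items)      # parsed items go to the pipeline
        for link in links:          # parsed links go back to the scheduler
            if link not in seen:    # skip anything that was already scheduled
                seen.add(link)
                queue.append(link)
    return pipeline, fetched


items, fetched = crawl("/list/1")
print(items)    # each house appears once
print(fetched)  # each page is fetched once
```

A request sent outside this loop (for example with a separate HTTP library) never touches `seen`, which is why mixing two ways of sending requests can let duplicates through.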

Locating housing

First, I re-created the pivot table of average housing prices in each district, so you can compare it with the one in the previous article to see the difference between the duplicated data and the complete, deduplicated data:

Then, I wanted to know which areas people are paying the most attention to, so I stacked each listing's “viewings” and “views” together as its attention, and got the following chart:
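As a rough sketch of that aggregation in plain Python: attention is taken here as viewings plus views, summed per district. The field names and the sample rows are hypothetical and are not taken from the crawled data.

```python
from collections import defaultdict

# Hypothetical sample rows; field names are assumptions, not the original schema.
listings = [
    {"district": "Shuangliu", "viewings": 12, "views": 230},
    {"district": "Shuangliu", "viewings": 3, "views": 90},
    {"district": "Wenjiang", "viewings": 8, "views": 150},
    {"district": "Longquanyi", "viewings": 20, "views": 310},
]

# Sum viewings + views for every listing, grouped by district.
attention_by_district = defaultdict(int)
for row in listings:
    attention_by_district[row["district"]] += row["viewings"] + row["views"]

# Rank districts from most to least attention.
ranking = sorted(attention_by_district.items(), key=lambda kv: kv[1], reverse=True)
for district, attention in ranking:
    print(district, attention)
```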

It seems that after the purchase restrictions in Tianfu New Area and the High-tech Zone, everyone began looking at houses in the surrounding districts, such as Longquanyi, Wenjiang, and Shuangliu.

Add the number of viewings and views to the list, and then keep only the listings where that number is above 200:

It so happens that a colleague at my company is also getting ready to buy a house. He wants a two-bedroom apartment in Shuangliu priced at 60-90W (600,000 to 900,000 yuan). Using his conditions plus the “heat” filter, I filtered out the following data:
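A filter like this could be written as follows. This is only a sketch under assumptions: “heat” is taken to be viewings plus views, the threshold of 200 is applied to that sum, and the field names, the `matches` helper, and the sample rows are all hypothetical.

```python
# Hypothetical sample rows; prices are in units of 10,000 yuan (W).
listings = [
    {"title": "listing-1", "district": "Shuangliu", "bedrooms": 2,
     "price_wan": 75, "viewings": 15, "views": 260},
    {"title": "listing-2", "district": "Shuangliu", "bedrooms": 2,
     "price_wan": 95, "viewings": 30, "views": 400},   # too expensive
    {"title": "listing-3", "district": "Shuangliu", "bedrooms": 3,
     "price_wan": 88, "viewings": 10, "views": 300},   # wrong layout
    {"title": "listing-4", "district": "Wenjiang", "bedrooms": 2,
     "price_wan": 70, "viewings": 25, "views": 280},   # wrong district
    {"title": "listing-5", "district": "Shuangliu", "bedrooms": 2,
     "price_wan": 62, "viewings": 2, "views": 40},     # too little heat
]


def matches(row, district, bedrooms, min_price, max_price, min_heat):
    """Return True if a listing meets the buyer's conditions and the heat threshold."""
    heat = row["viewings"] + row["views"]  # assumption: heat = viewings + views
    return (
        row["district"] == district
        and row["bedrooms"] == bedrooms
        and min_price <= row["price_wan"] <= max_price
        and heat > min_heat
    )


shortlist = [
    row["title"]
    for row in listings
    if matches(row, "Shuangliu", 2, 60, 90, 200)
]
print(shortlist)
```

Keeping the conditions as parameters makes it easy to rerun the same filter for a different district, layout, or budget.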

Finally, let’s take a look at where our data is concentrated. The measure used here is the average price; in the chart, redder areas indicate higher prices and more buildings:

Thanks for reading, and give a thumbs up if you think it’s good.