Author: Zone7, a scrappy back-end engineer who loves to write and share.

An overview of this article:

  • Preface
  • The statistical results
  • Crawler code implementation
  • Crawler analysis implementation
  • Afterword

Preface

Since most of my friends are in Guangzhou and Shenzhen, many of them have asked me to analyze the current state of rental prices in Guangzhou. An earlier version of this analysis was published on several channels, but the crawler has since been upgraded in many details, and the source code is worth exploring. This analysis collected 23,339 listings covering 11 districts of Guangzhou, as shown in the figure below:

Sample data

The districts toward the end of the list have little data because housing stock there is scarce, so this survey is not very rigorous; treat it as light entertainment.

The statistical results

Let's look at the statistical results first and then at the technical analysis. Distribution of housing stock in Guangzhou (by district): Tianhe holds the majority of the listings, but rents there are expensive.

Housing distribution

Unit rent (average rent per square metre per month): the price of one square metre for one month. In the chart, the larger the square, the higher the price.

Unit rent (yuan/㎡/month)

It can be seen that Tianhe, Yuexiu, and Haizhu have all passed the 50-yuan mark, at 75.042, 64.249, and 59.621 yuan/㎡/month respectively, several times higher than the other districts. If you rent a 20 ㎡ room in Tianhe:

75.042 x 20 = 1500.84

Add 200 yuan for water, electricity, and property management:

1500.84 + 200 = 1700.84

For a normal routine where breakfast is 10 yuan, lunch 15 yuan, and dinner 15 yuan:

1700.84 + 40 x 30 = 2900.84

So you need about 2,900.84 yuan a month just for daily life. Add the occasional restaurant meal, some new clothes each month, transportation, and the costs of dating and shopping with a girlfriend, and another 2,500 yuan is easily spent:

2900.84 + 2500 = 5400.84

One thousand each for Mom and Dad:

5400.84 + 2000 = 7400.84

On a monthly salary of 10,000 yuan you would still have a little left to save, which is better than Shenzhen, although salaries in Guangzhou are not as high as Shenzhen's.
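As a quick sanity check, here is that back-of-the-envelope budget as a short script (the numbers are the averages quoted above; the script is only an illustration):

```python
# rough monthly budget for a 20 square-metre room in Tianhe, using the averages above
unit_price = 75.042                       # yuan per square metre per month
rent = unit_price * 20                    # 1500.84
utilities = rent + 200                    # water, electricity, property fees
meals = utilities + (10 + 15 + 15) * 30   # three meals a day for a month
extras = meals + 2500                     # restaurants, clothes, transport, dating
family = extras + 2000                    # 1000 each for Mom and Dad
print(round(family, 2))                   # 7400.84
```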

Unit rent (average per square metre per day):

That is the price of one square metre for one day.

Unit rent (yuan/㎡/day)

Haha, feel the price of every inch of land.

Apartment layouts: the market is dominated by three-bedroom two-living-room and two-bedroom two-living-room units. Renting together with friends is the best choice; otherwise a series of uncomfortable things can happen when you share with someone you don't know. In the chart, the larger the font, the more listings of that layout.

Apartment layouts

Rental area statistics: units of 30-90 square metres account for the majority. These days many people can only rent a place together with a few friends and huddle for warmth.

Rental area statistics

This is a word cloud of the listing descriptions, where the larger the font, the more often the tag appears. [Home, full set, luxury, complete] account for a large share, indicating that the supporting facilities are fairly complete.

Listing descriptions

Crawler analysis

  • Request libraries: scrapy, requests
  • HTML parsing: BeautifulSoup
  • The word cloud: wordcloud
  • Data visualization: Pyecharts
  • Database: MongoDB
  • Database connection: Pymongo
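The code excerpts below do not show their import section; a sketch of what they would need is given here (the source of imread is an assumption, as older code often pulled it from scipy instead of matplotlib):

```python
# imports assumed by the snippets below (a sketch, not the author's exact file)
from bs4 import BeautifulSoup                         # HTML parsing
from scrapy import Request                            # follow-up requests in the spider
import jieba.analyse                                  # keyword extraction for the word cloud
import matplotlib.pyplot as plt                       # displaying the word-cloud image
from matplotlib.pyplot import imread                  # reading the mask image
from wordcloud import WordCloud, ImageColorGenerator  # drawing the word cloud
from os import path
```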

Crawler code implementation

The data is crawled with the Scrapy framework, optimized in several respects, such as automatic generation of page addresses. For each district, the first list page has a different URL form from the later pages, but both follow fixed patterns, so the addresses have to be spliced together piece by piece.

First-page address:

```
# page 1
http://gz.zu.fang.com/house-a073/
```

Addresses of the following pages:

```
# page 2
http://gz.zu.fang.com/house-a073/i32/
# page 3
http://gz.zu.fang.com/house-a073/i33/
# page 4
http://gz.zu.fang.com/house-a073/i34/
```
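Given those two patterns, page-address generation reduces to plain string splicing. A minimal standalone sketch (the function name and arguments are illustrative, not from the article):

```python
def build_page_urls(base_url, district_path, total_pages):
    """Splice list-page URLs: page 1 has no suffix, page N ends with /i3N/."""
    urls = [base_url + district_path]  # page 1, e.g. .../house-a073/
    for page in range(2, total_pages + 1):
        urls.append(base_url + district_path + "i3" + str(page) + "/")
    return urls

# build_page_urls("http://gz.zu.fang.com/", "house-a073/", 4)
# -> ['.../house-a073/', '.../house-a073/i32/', '.../house-a073/i33/', '.../house-a073/i34/']
```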


First, parse the district home-page URLs:

```python
def head_url_callback(self, response):
    soup = BeautifulSoup(response.body, "html5lib")
    # the district list lives in a dl tag
    dl = soup.find_all("dl", attrs={"id": "rentid_D04_01"})
    my_as = dl[0].find_all("a")  # one a tag per district
    for my_a in my_as:
        if my_a.text == "不限":  # the "no limit" link covers the whole city
            self.headUrlList.append(self.baseUrl)
            self.allUrlList.append(self.baseUrl)
            continue
        if "周边" in my_a.text:  # skip the "periphery" links
            continue
        self.allUrlList.append(self.baseUrl + my_a["href"])
        self.headUrlList.append(self.baseUrl + my_a["href"])
    print(self.allUrlList)
    url = self.headUrlList.pop(0)
    yield Request(url, callback=self.all_url_callback, dont_filter=True)
```

Then parse the remaining page URLs.

Here we first get the total number of pages in each district; only then can we splice together the concrete page addresses.

```python
def all_url_callback(self, response):  # parse the total page count of a district
    soup = BeautifulSoup(response.body, "html5lib")
    div = soup.find_all("div", attrs={"id": "..."})  # pagination container
    span = div[0].find_all("span")
    span_text = span[0].text  # e.g. "共25页"; slice out the number
    for index in range(int(span_text[1:len(span_text) - 1])):
        if index == 0:
            pass  # the first page is already in the list
            # self.allUrlList.append(self.baseUrl + my_a["href"])
        else:
            if self.baseUrl == response.url:
                self.allUrlList.append(response.url + "house/i3" + str(index + 1) + "/")
                continue
            self.allUrlList.append(response.url + "i3" + str(index + 1) + "/")
    if len(self.headUrlList) == 0:
        url = self.allUrlList.pop(0)
        yield Request(url, callback=self.parse, dont_filter=True)
    else:
        url = self.headUrlList.pop(0)
        yield Request(url, callback=self.all_url_callback, dont_filter=True)
```

Finally, parse the data on each page:

```python
def parse(self, response):  # parse one page of listings
    self.logger.info("==========================")
    soup = BeautifulSoup(response.body, "html5lib")
    divs = soup.find_all("dd", attrs={"class": "info rel"})  # one dd per listing
    for div in divs:
        ps = div.find_all("p")
        try:  # some listings lack fields, so guard the whole block
            # from the page source, each p tag holds a piece of the information
            # we want, so walk through the p tags and check what they carry
            for index, p in enumerate(ps):
                text = p.text.strip()
                print(text)
            roomMsg = ps[1].text.split("|")
            area = roomMsg[2].strip()[:len(roomMsg[2]) - 1]  # drop the trailing unit
            item = RenthousescrapyItem()
            item["title"] = ps[0].text.strip()
            item["rooms"] = roomMsg[1].strip()
            item["area"] = int(float(area))
            item["price"] = int(ps[len(ps) - 1].text.strip()[:len(ps[len(ps) - 1].text.strip()) - 3])
            item["address"] = ps[2].text.strip()
            item["traffic"] = ps[3].text.strip()
            if (self.baseUrl + "house/") in response.url:  # city-wide pages carry no district
                item["region"] = "不限"
            else:
                item["region"] = ps[2].text.strip()[:2]
            item["direction"] = roomMsg[3].strip()
            print(item)
            yield item
        except:
            print("bad, exception")
            continue
    if len(self.allUrlList) != 0:
        url = self.allUrlList.pop(0)
        yield Request(url, callback=self.parse, dont_filter=True)
```
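The parse method above fills a RenthousescrapyItem, whose definition the article does not show; inferred from the fields used, it would look roughly like this:

```python
import scrapy

class RenthousescrapyItem(scrapy.Item):
    # fields inferred from parse() above; a sketch, not the author's exact file
    title = scrapy.Field()      # listing title
    rooms = scrapy.Field()      # layout, e.g. "3室2厅"
    area = scrapy.Field()       # floor area in square metres
    price = scrapy.Field()      # monthly rent in yuan
    address = scrapy.Field()    # address text
    traffic = scrapy.Field()    # nearby transport
    region = scrapy.Field()     # district name
    direction = scrapy.Field()  # orientation
```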

Data analysis implementation

The statistics are produced mainly with pymongo aggregation operations and then displayed with the relevant chart libraries. Data analysis:

```python
# average unit price (yuan per square metre per month) of one district
def getAvgPrice(self, region):
    areaPinYin = self.getPinyin(region=region)
    collection = self.zfdb[areaPinYin]
    totalPrice = collection.aggregate([{'$group': {'_id': '$region', 'total_price': {'$sum': '$price'}}}])
    totalArea = collection.aggregate([{'$group': {'_id': '$region', 'total_area': {'$sum': '$area'}}}])
    totalPrice2 = list(totalPrice)[0]["total_price"]
    totalArea2 = list(totalArea)[0]["total_area"]
    return totalPrice2 / totalArea2

# monthly unit price of every district
def getTotalAvgPrice(self):
    totalAvgPriceList = []
    totalAvgPriceDirList = []
    for index, region in enumerate(self.getAreaList()):
        avgPrice = self.getAvgPrice(region)
        totalAvgPriceList.append(round(avgPrice, 3))
        totalAvgPriceDirList.append({"value": round(avgPrice, 3), "name": region + " " + str(round(avgPrice, 3))})
    return totalAvgPriceDirList

# daily unit price of every district
def getTotalAvgPricePerDay(self):
    totalAvgPriceList = []
    for index, region in enumerate(self.getAreaList()):
        avgPrice = self.getAvgPrice(region)
        totalAvgPriceList.append(round(avgPrice / 30, 3))
    return (self.getAreaList(), totalAvgPriceList)

# number of listings in every district
def getAnalycisNum(self):
    analycisList = []
    for index, region in enumerate(self.getAreaList()):
        collection = self.zfdb[self.pinyinDir[region]]
        print(region)
        totalNum = collection.aggregate([{'$group': {'_id': '', 'total_num': {'$sum': 1}}}])
        totalNum2 = list(totalNum)[0]["total_num"]
        analycisList.append(totalNum2)
    return (self.getAreaList(), analycisList)

# weight of each district, i.e. its share of all listings
def getAreaWeight(self):
    result = self.zfdb.rent.aggregate([{'$group': {'_id': '$region', 'weight': {'$sum': 1}}}])
    areaName = []
    areaWeight = []
    for item in result:
        if item["_id"] in self.getAreaList():
            areaWeight.append(item["weight"])
            areaName.append(item["_id"])
            print(item["_id"])
            print(item["weight"])
            # print(type(item))
    return (areaName, areaWeight)

# titles of the first 1000 listings, for the word cloud
def getTitle(self):
    collection = self.zfdb["rent"]
    queryArgs = {}
    projectionFields = {'_id': False, 'title': True}
    searchRes = collection.find(queryArgs, projection=projectionFields).limit(1000)
    content = ''
    for result in searchRes:
        print(result["title"])
        content += result["title"]
    return content

# distribution of room layouts
def getRooms(self):
    results = self.zfdb.rent.aggregate([{'$group': {'_id': '$rooms', 'weight': {'$sum': 1}}}])
    roomList = []
    weightList = []
    for result in results:
        roomList.append(result["_id"])
        weightList.append(result["weight"])
    return (roomList, weightList)

# listing counts per floor-area range
def getAcreage(self):
    results0_30 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 0, '$lte': 30}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results30_60 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 30, '$lte': 60}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results60_90 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 60, '$lte': 90}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results90_120 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 90, '$lte': 120}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results120_200 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 120, '$lte': 200}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results200_300 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 200, '$lte': 300}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results300_400 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 300, '$lte': 400}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results400_10000 = self.zfdb.rent.aggregate([
        {'$match': {'area': {'$gt': 400, '$lte': 10000}}},
        {'$group': {'_id': '', 'count': {'$sum': 1}}}
    ])
    results0_30_ = list(results0_30)[0]["count"]
    results30_60_ = list(results30_60)[0]["count"]
    results60_90_ = list(results60_90)[0]["count"]
    results90_120_ = list(results90_120)[0]["count"]
    results120_200_ = list(results120_200)[0]["count"]
    results200_300_ = list(results200_300)[0]["count"]
    results300_400_ = list(results300_400)[0]["count"]
    results400_10000_ = list(results400_10000)[0]["count"]
    attr = ["0-30 ㎡", "30-60 ㎡", "60-90 ㎡", "90-120 ㎡",
            "120-200 ㎡", "200-300 ㎡", "300-400 ㎡", "400+ ㎡"]
    value = [results0_30_, results30_60_, results60_90_, results90_120_,
             results120_200_, results200_300_, results300_400_, results400_10000_]
    return (attr, value)
```
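These methods rely on a pymongo handle self.zfdb that the excerpt does not define; a plausible setup (host, port, and database name are assumptions) would be:

```python
from pymongo import MongoClient

class Analysis:
    def __init__(self):
        # connection details are assumptions; the article stores each district
        # in its own collection plus a combined "rent" collection
        client = MongoClient("localhost", 27017)
        self.zfdb = client["zfdb"]
```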

Data Display:

```python
# pie chart
def showPie(self, title, attr, value):
    from pyecharts import Pie
    pie = Pie(title)
    pie.add("aa", attr, value, is_label_show=True)
    pie.render()

# treemap
def showTreeMap(self, title, data):
    from pyecharts import TreeMap
    treemap = TreeMap(title, width=1200, height=600)
    treemap.add("中文", data, is_label_show=True, label_pos='inside', label_text_size=19)
    treemap.render()

# bar chart
def showLine(self, title, attr, value):
    from pyecharts import Bar
    bar = Bar(title)
    bar.add("sz", attr, value, is_convert=False, is_label_show=True,
            label_text_size=18, is_random=True,
            # xaxis_interval=0, xaxis_label_textsize=9,
            legend_text_size=18, label_text_color=["#000"])
    bar.render()

# word cloud drawn with the wordcloud package
def showWorkCloud(self, content, image_filename, font_filename, out_filename):
    d = path.dirname(__name__)
    # content = open(path.join(d, filename), 'rb').read()
    # topK defaults to 20; withWeight controls whether keyword weights are returned
    tags = jieba.analyse.extract_tags(content, topK=100, withWeight=False)
    text = " ".join(tags)
    # the image that shapes the word cloud
    img = imread(path.join(d, image_filename))
    wc = WordCloud(font_path=font_filename,
                   background_color='black',
                   mask=img,              # the cloud follows the shape of the image
                   max_words=400,
                   max_font_size=100,     # largest font size; defaults to the image height if unset
                   # width=600,           # canvas width and height
                   # height=400,
                   margin=2)
    wc.generate(text)
    img_color = ImageColorGenerator(img)
    plt.imshow(wc.recolor(color_func=img_color))
    plt.axis("off")
    plt.show()
    wc.to_file(path.join(d, out_filename))

# word cloud drawn with pyecharts
def showPyechartsWordCloud(self, attr, value):
    from pyecharts import WordCloud
    wordcloud = WordCloud(width=1300, height=620)
    wordcloud.add("", attr, value, word_size_range=[20, 100])
    wordcloud.render()
```
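Tying analysis and display together, a hypothetical driver might look like this, assuming the methods above live on the Analysis class sketched earlier (the chart titles are illustrative):

```python
analysis = Analysis()

# per-district daily unit rent -> bar chart
attr, value = analysis.getTotalAvgPricePerDay()
analysis.showLine("广州各区每日每平方米租金", attr, value)

# per-district listing counts -> pie chart
areaName, areaWeight = analysis.getAreaWeight()
analysis.showPie("广州各区房源分布", areaName, areaWeight)
```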

Afterword

It has been three or four months since my last analysis of the rental market, and my technical level has improved somewhat in that time, so grinding away at code really is the quickest way to grow. Finally, in the face of changing external conditions, we should keep building our hard skills to improve our ability to survive.
