Overview

  • Preface
  • The statistical results
  • Crawler analysis
  • Crawler code implementation
  • Data analysis implementation
  • Afterword

Preface

Recently, rents in second-tier cities have risen. How much have they risen overall? We don’t know, so to find out, Zone used Python to crawl Shenzhen rental listings from Fang x. Here is the sample data for this crawl:

The statistical results

Let’s look at the statistical results first and then at the technical analysis. Housing distribution in Shenzhen (by district): Futian and Nanshan have the most listings, but rents in these two districts are also quite high.

Unit rent (average rent per square meter per month): this is the price of 1 square meter per month. In the chart, the bigger the rectangle, the higher the price.

It can be seen that Futian and Nanshan rank first, at 114.874 and 113.483 yuan per square meter per month respectively, several times the level of other districts. If you rent a 20-square-meter room in Futian:

114.874 x 20 = 2297.48

Add 200 yuan for water, electricity and the property management fee:

2297.48 + 200 = 2497.48

Let’s be frugal with food (breakfast 10 yuan, lunch 25 yuan, dinner 25 yuan) and budget 50 yuan a day:

2497.48 + 50 x 30 = 3997.48

Yes, 3,997.48 yuan just to survive. Add the occasional restaurant meal, some new clothes every month, transportation, dating and going shopping with a girlfriend, and another 3,500 yuan is easily gone:

3997.48 + 3500 = 7497.48

One thousand yuan each for Mom and Dad:

7497.48 + 2000 = 9497.48

Nearly ten thousand yuan a month is easily gone, and you join the "moonlight clan" (people who spend their entire salary every month).
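The running total above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in this section: the variable names are illustrative, and the figures are the ones used in the calculation lines above.

```python
# Sketch of the monthly budget worked out above (all amounts in yuan).
UNIT_RENT = 114.874   # Futian average, yuan per square meter per month
ROOM_AREA = 20        # square meters

rent = UNIT_RENT * ROOM_AREA   # rent for a 20-square-meter room
utilities = 200                # water, electricity, property fee
food = 50 * 30                 # 50 yuan a day for 30 days
lifestyle = 3500               # restaurants, clothes, transport, dating
parents = 2000                 # money sent home

total = rent + utilities + food + lifestyle + parents
print(f"rent:  {rent:.2f}")
print(f"total: {total:.2f}")
```

Changing `UNIT_RENT` or `ROOM_AREA` shows how quickly the total moves with the district and room size.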

Unit rent (average rent per square meter per day)

This is the price of 1 square meter per day.
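The per-day figure is simply the monthly unit rent divided by 30, which is how the analysis code later in this post derives it. A small sketch using the two monthly averages quoted earlier (the dictionary names are illustrative):

```python
# Monthly unit rents quoted earlier, in yuan per square meter per month.
monthly_unit_rent = {"Futian": 114.874, "Nanshan": 113.483}

# Assume a 30-day month, as the analysis code does.
daily_unit_rent = {
    district: round(price / 30, 3)
    for district, price in monthly_unit_rent.items()
}

for district, price in daily_unit_rent.items():
    print(district, price)
```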

Apartment layout

The listings are mainly 3-bedroom, 2-living-room and 2-bedroom, 2-living-room apartments. Renting a whole apartment together with friends is the best choice; otherwise, a series of uncomfortable things can happen when you share with people you don’t know. In the chart, the larger the font, the more listings there are of that layout.

Rental area statistics

Listings of 30-90 square meters account for the majority. As things stand, the only option is for a few friends to rent an apartment together and huddle together for warmth.
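The area ranges behind this chart are half-open: each one matches areas greater than its lower bound and at most its upper bound. A minimal sketch of that bucketing in plain Python, using made-up areas rather than the crawled data; `BOUNDS`, `LABELS` and `bucket_label` are hypothetical names:

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of each bucket in square meters; a bucket covers (lower, upper].
BOUNDS = [30, 60, 90, 120, 200, 300, 400]
LABELS = ["0-30", "30-60", "60-90", "90-120",
          "120-200", "200-300", "300-400", "400+"]

def bucket_label(area):
    """Return the label of the (lower, upper] range containing a positive area."""
    return LABELS[bisect_left(BOUNDS, area)]

# Made-up sample areas, chosen to hit the boundaries.
sample_areas = [25, 30, 31, 58, 75, 90, 110, 450]
counts = Counter(bucket_label(area) for area in sample_areas)

for label in LABELS:
    print(label, counts[label])
```

Using `bisect_left` puts a value equal to a bound (for example exactly 30) in the lower bucket, which matches the "greater than, at most" rule.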

Word cloud of rental descriptions

This is a word cloud built from the rental descriptions, where the larger the font, the more often the term appears. [Fine decoration] takes up a large share, which suggests that long-term rental apartments also occupy a large part of the market.
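The rule "more occurrences, larger font" can be illustrated without any word cloud library. The sketch below counts terms and maps the counts linearly onto a font-size range. The sample terms, the size range and the function name `font_sizes` are assumptions for illustration; real word cloud libraries apply their own scaling.

```python
from collections import Counter

def font_sizes(terms, min_size=20, max_size=100):
    """Map each term's frequency linearly onto [min_size, max_size]."""
    counts = Counter(terms)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        term: min_size + (count - lo) * (max_size - min_size) / span
        for term, count in counts.items()
    }

# Made-up descriptor terms, not taken from the crawled data.
sample_terms = ["fine decoration", "near metro", "fine decoration",
                "fine decoration", "near metro", "south facing"]
sizes = font_sizes(sample_terms)

for term, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{term}: {size:.0f}")
```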

Crawler approach

First, the listings for each district of Shenzhen are crawled from Fang x, then stored in a MongoDB database, and finally the data is analyzed.

Part of database data:

/* 1 */
{
    "_id" : ObjectId("5b827d5e8a4c184e63fb1325"),
    "traffic" : "It is about 567 meters away from shajing Electronic City bus station.", // traffic description
    "address" : "Bao'an - Shajing - Ming Holi City", // address
    "price" : 3100, // price
    "area" : 110, // area
    "direction" : "Face south \r\n", // orientation
    "title" : "Shajing hao Li Cheng hardcover three room furniture neat bag live high-rise south at any time to see the house.", // title
    "rooms" : "Three rooms, two rooms", // layout
    "region" : "Baoan" // district
}

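Each stored document carries the two fields that the later statistics rely on, `price` and `area`. As a rough illustration, the sample record can be treated as a plain dictionary and its unit rent computed directly. This is a sketch: `unit_rent` is a hypothetical helper and not part of the original code, and reading `price` as monthly rent in yuan is an assumption based on the rest of the article.

```python
# A trimmed copy of the sample document from the database.
record = {
    "address": "Bao'an - Shajing - Ming Holi City",
    "price": 3100,   # assumed: monthly rent in yuan
    "area": 110,     # square meters
    "region": "Baoan",
}

def unit_rent(doc):
    """Monthly rent per square meter for one listing."""
    return doc["price"] / doc["area"]

print(round(unit_rent(record), 3))
```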
Crawler analysis

  • Request library: requests
  • HTML parsing: BeautifulSoup
  • Word cloud: wordcloud
  • Data visualization: pyecharts
  • Database: MongoDB
  • Database connection: pymongo

Crawler code implementation

First, right-click the page, view the page source, and locate the elements we want to crawl.

# Assumes `from bs4 import BeautifulSoup` and a requests.Session stored in self.session.
def getOnePageData(self, pageUrl, reginon="不限"):
    # "不限" means "no limit"; the collection is chosen from self.region
    rent = self.getCollection(self.region)
    self.session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36'})
    res = self.session.get(pageUrl)
    soup = BeautifulSoup(res.text, "html.parser")
    divs = soup.find_all("dd", attrs={"class": "info rel"})  # one <dd class="info rel"> per listing
    for div in divs:
        ps = div.find_all("p")
        # Some listings are incomplete and ads are mixed into the page, so the
        # expected tags may be missing; catch the error and skip that entry.
        try:
            # Each <p> tag holds one piece of the information we want.
            for index, p in enumerate(ps):
                text = p.text.strip()
                print(text)  # print to check that this is the information we want
            print("=" * 35)
            # Parse the fields and save them to MongoDB.
            roomMsg = ps[1].text.split("|")
            area = roomMsg[2].strip()[:len(roomMsg[2]) - 2]  # drop the trailing unit
            priceText = ps[-1].text.strip()
            rentMsg = self.getRentMsg(
                ps[0].text.strip(),
                roomMsg[1].strip(),
                int(float(area)),
                int(priceText[:-3]),  # drop the 3-character unit suffix
                ps[2].text.strip(),
                ps[3].text.strip(),
                ps[2].text.strip()[:2],
                roomMsg[3],
            )
            # insert() is deprecated in PyMongo 3 and removed in PyMongo 4
            rent.insert_one(rentMsg)
        except Exception:
            continue
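The slicing in `getOnePageData` is easier to follow on a single string. The sketch below parses one pipe-separated info line and one price string with regular expressions instead of fixed slices. The sample strings are assumed examples of what the page might contain, and `parse_info` and `parse_price` are hypothetical helpers rather than the author's code.

```python
import re

def parse_info(info):
    """Split 'type|layout|area|direction' and pull the number out of the area."""
    parts = [part.strip() for part in info.split("|")]
    area_match = re.search(r"\d+(?:\.\d+)?", parts[2])
    if area_match is None:
        raise ValueError(f"no area in {info!r}")
    return {
        "rooms": parts[1],
        "area": int(float(area_match.group())),
        "direction": parts[3],
    }

def parse_price(text):
    """Extract the leading integer from a price string such as '3100元/月'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no price in {text!r}")
    return int(match.group(1))

# Assumed sample strings, for illustration only.
print(parse_info("整租 | 3室2厅 | 110㎡ | 朝南"))
print(parse_price("3100元/月"))
```

Matching the digits explicitly means the parse does not depend on the unit suffix having a fixed length, and a malformed line raises a clear `ValueError` instead of producing a silently wrong number.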

Data analysis implementation

Data analysis:

    # Average rent for a district (yuan per square meter)
    def getAvgPrice(self, region):
        areaPinYin = self.getPinyin(region=region)
        collection = self.zfdb[areaPinYin]
        totalPrice = collection.aggregate([{'$group': {'_id': '$region', 'total_price': {'$sum': '$price'}}}])
        totalArea = collection.aggregate([{'$group': {'_id': '$region', 'total_area': {'$sum': '$area'}}}])
        totalPrice2 = list(totalPrice)[0]["total_price"]
        totalArea2 = list(totalArea)[0]["total_area"]
        return totalPrice2 / totalArea2

    # Get how much each district costs per square meter per month
    def getTotalAvgPrice(self):
        totalAvgPriceList = []
        totalAvgPriceDirList = []
        for index, region in enumerate(self.getAreaList()):
            avgPrice = self.getAvgPrice(region)
            totalAvgPriceList.append(round(avgPrice, 3))
            totalAvgPriceDirList.append({"value": round(avgPrice, 3), "name": region + "" + str(round(avgPrice, 3))})

        return totalAvgPriceDirList

    # Get how much each district costs per square meter per day
    def getTotalAvgPricePerDay(self):
        totalAvgPriceList = []
        for index, region in enumerate(self.getAreaList()):
            avgPrice = self.getAvgPrice(region)
            totalAvgPriceList.append(round(avgPrice / 30, 3))
        return (self.getAreaList(), totalAvgPriceList)

    # Obtain the number of statistical samples in each district
    def getAnalycisNum(self):
        analycisList = []
        for index, region in enumerate(self.getAreaList()):
            collection = self.zfdb[self.pinyinDir[region]]
            print(region)
            totalNum = collection.aggregate([{'$group': {'_id': ' ', 'total_num': {'$sum': 1}}}])
            totalNum2 = list(totalNum)[0]["total_num"]
            analycisList.append(totalNum2)
        return (self.getAreaList(), analycisList)

    # Obtain the proportion of housing in each district
    def getAreaWeight(self):
        result = self.zfdb.rent.aggregate([{'$group': {'_id': '$region', 'weight': {'$sum': 1}}}])
        areaName = []
        areaWeight = []
        for item in result:
            if item["_id"] in self.getAreaList():
                areaWeight.append(item["weight"])
                areaName.append(item["_id"])
                print(item["_id"])
                print(item["weight"])
                # print(type(item))
        return (areaName, areaWeight)

    # Get the title data used to build the word cloud
    def getTitle(self):
        collection = self.zfdb["rent"]
        queryArgs = {}
        projectionFields = {'_id': False, 'title': True}  # use a dictionary to specify the required fields
        searchRes = collection.find(queryArgs, projection=projectionFields).limit(1000)
        content = ' '
        for result in searchRes:
            print(result["title"])
            content += result["title"]
        return content

    # Obtain apartment type data (e.g., 3 bedrooms and 2 halls)
    def getRooms(self):
        results = self.zfdb.rent.aggregate([{'$group': {'_id': '$rooms', 'weight': {'$sum': 1}}}])
        roomList = []
        weightList = []
        for result in results:
            roomList.append(result["_id"])
            weightList.append(result["weight"])
        # print(list(result))
        return (roomList, weightList)

    # Count the listings in each rental-area range
    def getAcreage(self):
        # (lower bound, upper bound, label); each range matches lower < area <= upper
        ranges = [
            (0, 30, "0-30 square meters"),
            (30, 60, "30-60 square meters"),
            (60, 90, "60-90 square meters"),
            (90, 120, "90-120 square meters"),
            (120, 200, "120-200 square meters"),
            (200, 300, "200-300 square meters"),
            (300, 400, "300-400 square meters"),
            # lower bound is 400 so this range does not overlap the previous one
            (400, 10000, "400+ square meters"),
        ]
        attr = []
        value = []
        for lower, upper, label in ranges:
            results = self.zfdb.rent.aggregate([
                {'$match': {'area': {'$gt': lower, '$lte': upper}}},
                {'$group': {'_id': ' ', 'count': {'$sum': 1}}}
            ])
            rows = list(results)
            attr.append(label)
            # an empty range yields no group document, so count it as 0
            value.append(rows[0]["count"] if rows else 0)
        return (attr, value)
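`getAvgPrice` divides the summed rent by the summed area, which gives an area-weighted average rather than the mean of each listing's own price per square meter. The sketch below reproduces the two `$group`/`$sum` totals in plain Python on made-up listings so the difference is visible. No MongoDB is involved and the data is illustrative only.

```python
from collections import defaultdict

# Made-up listings: (region, monthly price in yuan, area in square meters).
listings = [
    ("Futian", 6000, 50),
    ("Futian", 3000, 20),
    ("Baoan", 3100, 110),
]

# Equivalent of grouping by region and summing price and area.
total_price = defaultdict(int)
total_area = defaultdict(int)
for region, price, area in listings:
    total_price[region] += price
    total_area[region] += area

# Area-weighted average: total rent divided by total area.
weighted = {region: total_price[region] / total_area[region]
            for region in total_price}

# For comparison: the plain mean of per-listing unit rents.
per_listing = defaultdict(list)
for region, price, area in listings:
    per_listing[region].append(price / area)
simple_mean = {region: sum(values) / len(values)
               for region, values in per_listing.items()}

print(weighted)
print(simple_mean)
```

With the weighted form, large apartments pull the district figure toward their own unit rent, which is worth keeping in mind when reading the per-district numbers.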

Data Display:

    # Show pie chart
    def showPie(self, title, attr, value):
        from pyecharts import Pie
        pie = Pie(title)
        pie.add("aa", attr, value, is_label_show=True)
        pie.render()

    # Show the treemap
    def showTreeMap(self, title, data):
        from pyecharts import TreeMap
        data = data
        treemap = TreeMap(title, width=1200, height=600)
        treemap.add("Shenzhen", data, is_label_show=True, label_pos='inside', label_text_size=19)
        treemap.render()

    # Show bar chart
    def showLine(self, title, attr, value):
        from pyecharts import Bar
        bar = Bar(title)
        bar.add("Shenzhen", attr, value, is_convert=False, is_label_show=True, label_text_size=18, is_random=True,
                # xaxis_interval=0, xaxis_label_textsize=9,
                legend_text_size=18, label_text_color=["#000"])
        bar.render()

    # Show word clouds
    # Assumes module-level imports such as: from os import path; import jieba.analyse;
    # import matplotlib.pyplot as plt; from wordcloud import WordCloud, ImageColorGenerator;
    # plus an imread function for loading the mask image.
    def showWorkCloud(self, content, image_filename, font_filename, out_filename):
        d = path.dirname(__name__)
        # content = open(path.join(d, filename), 'rb').read()
        # Keyword extraction based on the TF-IDF algorithm. topK is the number of
        # highest-weighted keywords to return (default 20); withWeight controls
        # whether the keyword weights are returned as well.
        tags = jieba.analyse.extract_tags(content, topK=100, withWeight=False)
        text = " ".join(tags)
        # Background image that defines the shape
        img = imread(path.join(d, image_filename))
        # Specify a Chinese font, otherwise the text will be garbled
        wc = WordCloud(font_path=font_filename,
                       background_color='black',
                       # Word cloud shape
                       mask=img,
                       # Maximum number of words
                       max_words=400,
                       # Maximum font size; defaults to the image height if not specified
                       max_font_size=100,
                       # Canvas width and height have no effect when mask is set
                       # width=600,
                       # height=400,
                       margin=2,
                       # Ratio of horizontally placed words; the default is 0.9, so 0.1 are vertical
                       prefer_horizontal=0.9)
        wc.generate(text)
        img_color = ImageColorGenerator(img)
        plt.imshow(wc.recolor(color_func=img_color))
        plt.axis("off")
        plt.show()
        wc.to_file(path.join(d, out_filename))
        
    # Show pyecharts word cloud
    def showPyechartsWordCloud(self, attr, value):
        from pyecharts import WordCloud
        wordcloud = WordCloud(width=1300, height=620)
        wordcloud.add("", attr, value, word_size_range=[20, 100])
        wordcloud.render()

Afterword

A lot has been happening recently. The surge in rents is really the power of capital entering the rental market. Long-term rental apartment operators, such as Ziroom and Danke, bid up rents against each other and have customers sign third-party loan agreements. Expanding early may cost them some money, but once the market is monopolized, housing is a rigid need, so they will not lose money in the end. Finally, in response to changes in external conditions, we should build up our own hard skills so that we are better able to survive.

This article was first published on the WeChat official account [Zone7]. Follow the account to get the latest posts, and reply [Shenzhen rent] in the account's chat to get the source code.