This is day 28 of my participation in the November Writing Challenge. Event details: The Last Writing Challenge of 2021.

Foreword

The data used in this article is the Wuhan second-hand housing information that I scraped from Lianjia earlier. This time we'll dig into the secrets behind the data.

The main Python libraries covered in this article:

  • pandas: reads the contents of the CSV file and processes the data.
  • matplotlib: a Python plotting toolkit built on NumPy. It provides rich charting tools and is mainly used to draw statistical charts.
  • seaborn: a Python visualization package built on Matplotlib. It offers a higher-level API that makes plotting easier. Its charts tend to look more attractive than the Matplotlib equivalents, but it gives less control over drawing details. Seaborn is generally regarded as a complement to Matplotlib, not a substitute. It is also highly compatible with NumPy and pandas data structures and with statistical packages such as SciPy and statsmodels.
  • pyecharts: a library for generating ECharts charts. ECharts is an open-source JavaScript data visualization library from Baidu. It is commonly used to draw dynamic charts and maps, and the visual results are very good.
  • jieba: a very popular Chinese word segmentation package. It has three main segmentation modes: full mode, precise mode (used in this article), and search engine mode. You can add a custom dictionary before segmenting to improve accuracy.
  • collections: mainly the Counter class, used to count how many times each value occurs.

Without further ado, let’s get to the point.

1. Read data

Start by reading the house_info.csv file and looking at the structural information of the data set.

import pandas as pd

df = pd.read_csv('house_info.csv')
df.info()

According to the above information, there are 27 columns in the dataset. The house_label column has many missing values, and the floor and house_area columns are of type object and should be converted to a numeric type.
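As a small illustration of how these two observations can be checked directly, the sketch below counts missing values per column and lists the non-numeric columns. The toy DataFrame and its values are invented for illustration and are not taken from house_info.csv.

```python
import pandas as pd

# Toy stand-in for house_info.csv; every value here is invented for illustration.
toy = pd.DataFrame({
    "house_label": ["VR decoration", None, None],
    "floor": ["High floor (33 floors)", "Low floor (6 floors)", "Mid floor (18 floors)"],
    "house_area": ["85.99m²", "120.5m²", "60m²"],
    "unit_price": [19364.0, 15000.0, 24600.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Columns that are not numeric yet, i.e. candidates for conversion
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

On the real data, the same two calls on df would show which columns need cleaning before any statistics are computed.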

2. Data preprocessing

2.1 Missing value processing

First, delete the rows that contain missing values. A total of 5,108 rows are deleted.

df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
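One way to obtain the number of deleted rows is to compare the row count before and after dropping. The sketch below does this on an invented toy DataFrame; it is an illustration, not the author's exact bookkeeping.

```python
import pandas as pd

# Invented toy data: two of the four rows have a missing label
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "house_label": ["VR decoration", None, "Near subway", None],
})

rows_before = len(toy)
# Drop every row with at least one missing value, then renumber the index from 0
cleaned = toy.dropna().reset_index(drop=True)
rows_removed = rows_before - len(cleaned)
print(rows_removed)
```

If only some columns matter for the analysis, dropna also accepts a subset argument, which keeps rows whose missing values are in other columns.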

2.2 Column processing

Since the map will be drawn with pyecharts later, and the latitude and longitude data for the East Lake High-tech Zone and the Zhuankou Development Zone are not detailed, these two areas are assigned to Hongshan District and Hannan District respectively, according to their approximate geographical location.

Processing steps:

  • Extract the number from the floor column
  • Change the house area from "85.99m²" to "85.99"
  • Assign East Lake High-tech to Hongshan and Zhuankou Development Zone to Hannan

# Extract the number from the floor column
df['floor'] = df['floor'].str.extract(r'(\d+)', expand=False).astype('int')
# Change the house area from "85.99m²" --> 85.99 (keep the leading number, drop the unit)
df['house_area'] = df['house_area'].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype('float')

# Assign East Lake High-tech to Hongshan and Zhuankou Development Zone to Hannan
# (the region names are shown translated here; match them to the values in your CSV)
df.loc[df['region'] == 'East Lake High-tech', 'region'] = 'Hongshan'
df.loc[df['region'] == 'Zhuankou Development Zone', 'region'] = 'Hannan'
# Append ' District' to region, e.g. 'Hanyang' --> 'Hanyang District'
df['region'] = df['region'] + ' District'
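To make the regular-expression step concrete, here is a minimal sketch on invented sample strings. The exact wording of the floor and area values in house_info.csv is an assumption; only the idea of pulling the first number out of each string is taken from the article.

```python
import pandas as pd

# Invented sample values; the real strings in the CSV may be formatted differently.
toy = pd.DataFrame({
    "floor": ["High floor (33 floors in total)", "Low floor (6 floors in total)"],
    "house_area": ["85.99m²", "120m²"],
})

# First run of digits in each floor string
toy["floor"] = toy["floor"].str.extract(r"(\d+)", expand=False).astype("int")
# Leading decimal number in each area string, ignoring the unit suffix
toy["house_area"] = (
    toy["house_area"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype("float")
)
print(toy)
```

Matching the number itself, instead of slicing off a fixed number of trailing characters, keeps the conversion working whether the unit is written with one character or two.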

Use the describe() function to view summary statistics for the numeric columns. To include all columns, pass include='all' (the default is None).

df.describe()

The figure shows that second-hand houses in Wuhan have on average 17 followers, an average total price of 1.84 million yuan, an average unit price of 19,364 yuan/m², an average floor of 22, and an average area of 95 m². It also gives the standard deviation, minimum, first quartile, median, third quartile, and maximum.
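The difference between the default call and include='all' is easiest to see on a tiny example. The sketch below uses an invented two-column DataFrame; the column names mirror the article, the values do not.

```python
import pandas as pd

# Invented toy data with one text column and one numeric column
toy = pd.DataFrame({
    "region": ["Hongshan District", "Wuchang District", "Hongshan District"],
    "total_price": [184.0, 260.0, 150.0],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # also summarizes text columns

print(numeric_summary)
print(full_summary)
```

In the output, the rows labeled 25%, 50% and 75% are the first quartile, the median and the third quartile. For text columns, include='all' adds rows such as unique, top and freq.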

3. Bar chart of second-hand houses in each district

Get each district in the data and the number of houses in it, and draw a bar chart.

import pyecharts.options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

region_list = df['region'].value_counts().index.tolist()
house_count_list = df['region'].value_counts().values.tolist()

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(region_list)
c.add_yaxis("Wuhan City", house_count_list)
c.set_global_opts(title_opts=opts.TitleOpts(title="Bar chart of second-hand houses in Wuhan districts", subtitle=""),
                  xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(interval=0)))
# c.render("Bar chart of second-hand housing quantity in Wuhan districts.html")
c.render_notebook()

Hongshan District's count includes the East Lake High-tech Zone after the merge, but even before the merge both areas had a large number of second-hand houses. Next come the riverside districts. Now let's use a 2D map and a 3D map to look at the distribution of housing prices across the districts.

4. 2D map of housing price distribution in each district

Collect each district's name and its median unit price (the median is less affected by extreme values), load the local Wuhan map data (latitude and longitude information of each district), and draw a 2D map of the housing price distribution.

region_list = df['region'].value_counts().index.tolist()
median_unit_price = []
for region in region_list:
    median_unit_price.append(df.loc[df['region'] == region, 'unit_price'].median())
    
# Draw 2D map
import json
from pyecharts.charts import Map

# Load Wuhan map data
with open('Wuhan.json', encoding='utf-8') as f:
    json_data = json.load(f)

data_pair = [list(z) for z in zip(region_list, median_unit_price)]

text_style = opts.TextStyleOpts(color='#fff')
c = Map(init_opts=opts.InitOpts(width='1500px', height='700px', bg_color='#404a58'))
# Register the map under the same name that is passed to maptype below
c.add_js_funcs("echarts.registerMap('Wuhan City', {});".format(json.dumps(json_data)))
c.add(series_name="Wuhan City", data_pair=data_pair, maptype="Wuhan City", label_opts=opts.LabelOpts(color='#fff'))
c.set_global_opts(legend_opts=opts.LegendOpts(textstyle_opts=text_style),
                  title_opts=opts.TitleOpts(title="Wuhan", title_textstyle_opts=text_style),
                  visualmap_opts=opts.VisualMapOpts(split_number=6, max_=30000, range_text=['high', 'low'],
                                                     textstyle_opts=text_style))
# c.render("2D map of housing price distribution in Wuhan districts.html")
c.render_notebook()
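The per-district median computed in the loop above can also be obtained with a single groupby. This is an alternative sketch, not the article's code. The toy prices are invented and include one extreme listing to show why the median is used instead of the mean.

```python
import pandas as pd

# Invented toy data; 90000 plays the role of an extreme listing
toy = pd.DataFrame({
    "region": ["Wuchang District"] * 3 + ["Xinzhou District"] * 3,
    "unit_price": [24000, 25000, 90000, 7500, 7800, 8000],
})

median_by_region = toy.groupby("region")["unit_price"].median()
mean_by_region = toy.groupby("region")["unit_price"].mean()

# Same [name, value] pair layout that the map code builds with zip
data_pair = [[region, price] for region, price in median_by_region.items()]
print(data_pair)
print(mean_by_region)
```

In this toy data the single extreme listing pulls the Wuchang mean far above its median, while the median stays close to the typical price.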

According to the map, the areas with high housing prices are concentrated in central Wuhan, led by Wuchang District at 24,600 yuan/m². Prices in the other central districts are also above 15,000 yuan/m². The lowest prices are in Xinzhou District, with a median unit price of 7,806 yuan/m². Let's look at this again with a 3D map.

5. 3D map of housing price distribution in each district

The required data is the same as for the 2D map. Because the code is longer, it is not shown here (friends who need it can get it at the end of the article). Compared with 2D, the 3D map makes the price differences among districts more obvious, and it looks pretty impressive too! Next, let's take a closer look at the unit-price outliers in each district with a box plot.

6. Box plot of second-hand housing unit prices in each district

Collect each district's name and its unit prices, and draw a box plot.

# Collect the unit prices of second-hand houses in each district
unit_price_list = []
for region in region_list:
    unit_price_list.append(df.loc[df['region'] == region, 'unit_price'].values.tolist())

# Draw a box diagram
from pyecharts.charts import Boxplot

c = Boxplot(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(region_list)
c.add_yaxis("Wuhan City", c.prepare_data(unit_price_list))
c.set_global_opts(title_opts=opts.TitleOpts(title="Box plot of second-hand housing unit prices in Wuhan districts"),
                 xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(interval=0)))
# c.render("boxplot_base.html")
c.render_notebook()

In the pyecharts box plot, the upper and lower whiskers are the maximum and minimum values, which differs from a standard box plot, where the whiskers mark the largest and smallest non-outlier observations. Judging by the distribution of the upper and lower quartiles, Hongshan District, Jiang'an District and Wuchang District, which have higher housing prices, show a typical right skew (outliers are concentrated on the larger-value side, with a long tail). This suggests that for many second-hand houses, because of location, decoration and other factors, the unit price deviates sharply from the local average.
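The two whisker conventions mentioned above can be compared numerically. The sketch below uses invented prices and the common 1.5 × IQR rule for the standard box plot. The quartiles come from NumPy's default linear interpolation, which is an assumption and may differ slightly from how pyecharts computes them.

```python
import numpy as np

# Invented unit prices with one very expensive listing (right-skewed)
prices = np.array([12000, 15000, 16000, 17000, 18000, 19000, 20000, 60000])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Convention 1: whiskers at the extremes of the data
minmax_whiskers = (prices.min(), prices.max())

# Convention 2: whiskers at the most extreme points within 1.5 * IQR of the box
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = prices[(prices >= lower_fence) & (prices <= upper_fence)]
tukey_whiskers = (inside.min(), inside.max())
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(minmax_whiskers, tukey_whiskers, outliers)
print(prices.mean() > median)  # a mean above the median is a sign of right skew
```

With the min/max convention the expensive listing stretches the upper whisker, while the 1.5 × IQR convention reports it as a separate outlier.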

7. Second-hand housing area distribution and price diagram

Because adding a trend line to a pyecharts scatter chart is inconvenient, we use seaborn directly to plot the distribution of house area and the relationship between area and price.

import matplotlib.pyplot as plt
import seaborn as sns

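f, [ax1, ax2] = plt.subplots(1, 2, figsize=(16, 6))

# Housing area
# (histplot replaces distplot, and fill replaces shade; both are deprecated in recent seaborn)
sns.histplot(df['house_area'], kde=True, stat='density', ax=ax1, color='r')
sns.kdeplot(df['house_area'], fill=True, ax=ax1)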
ax1.set_xlabel('area')

# The relationship between house size and price
sns.regplot(x='house_area', y='total_price', data=df, ax=ax2)
ax2.set_xlabel('area')
ax2.set_ylabel('total')

plt.show()

The area of second-hand houses is mainly between 60 and 130 m². The most eye-catching listing is a 400 m² house with a total price of 20 million yuan, which really stands out from the crowd. 😂
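By default, the trend line that regplot draws is a straight-line (first-order) regression fit. As a rough sketch of the same idea without plotting, the example below fits a line with NumPy on invented, exactly linear data. The numbers are illustrative only and say nothing about the real Wuhan prices.

```python
import numpy as np

# Invented data that lies exactly on the line total_price = 2 * area - 10
area = np.array([60.0, 80.0, 100.0, 120.0, 140.0])
total_price = np.array([110.0, 150.0, 190.0, 230.0, 270.0])

# Least-squares fit of a first-degree polynomial: slope and intercept
slope, intercept = np.polyfit(area, total_price, 1)

# Price the fitted line predicts for a 100 m² home
predicted_100 = slope * 100 + intercept
print(slope, intercept, predicted_100)
```

On real data the points scatter around the line, and a single very large, very expensive listing can noticeably tilt the fitted slope.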

8. 3D bar chart of floor and housing price distribution

Now let's look at the relationship between floor and housing price in each district. I've heard that the river view in Wuhan is beautiful in the evening, so high floors should cost more.

The meaning of each axis in the figure:

  • X-axis: floors, with 5 floors as intervals
  • Y-axis: district name
  • Z-axis: unit price

In other districts the price differs little between floors, but Wuchang District and Jianghan District stand out. Owing to the advantages of East Lake and the riverside location, prices on their high floors are generally higher than on the low floors.
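The code for this chart is not shown in the article. As a rough sketch of how data for such a chart could be prepared, the example below groups floors into 5-floor intervals with pd.cut and takes the mean unit price for each interval and district. Both the binning function and the use of the mean are assumptions, and the toy values are invented.

```python
import pandas as pd

# Invented toy data
toy = pd.DataFrame({
    "region": ["Wuchang District", "Wuchang District", "Wuchang District", "Jianghan District"],
    "floor": [3, 4, 28, 12],
    "unit_price": [20000, 22000, 30000, 18000],
})

# Bin edges 0, 5, 10, ..., 35 give the intervals (0, 5], (5, 10], ...
bins = list(range(0, 40, 5))
toy["floor_bin"] = pd.cut(toy["floor"], bins=bins)

# Mean unit price per (floor interval, district); observed=True skips empty combinations
grouped = (
    toy.groupby(["floor_bin", "region"], observed=True)["unit_price"]
    .mean()
    .reset_index()
)

# Rows of [x, y, z] = [floor interval, district, unit price]
bar3d_data = [[str(b), r, p] for b, r, p in grouped.itertuples(index=False)]
print(bar3d_data)
```

Each row follows the axis layout listed above: floor interval on x, district on y, unit price on z.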

9. Horizontal bar chart of house types

Count the house types and the number of houses of each type, and draw a horizontal bar chart.

series = df['house_type'].value_counts()
series.sort_index(ascending=False, inplace=True)
house_type_list = series.index.tolist()
count_list = series.values.tolist()

c = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add_xaxis(house_type_list)
c.add_yaxis("Wuhan City", count_list)
c.reversal_axis()
c.set_series_opts(label_opts=opts.LabelOpts(position="right"))
c.set_global_opts(title_opts=opts.TitleOpts(title="Horizontal bar chart of second-hand housing types in Wuhan"),
                  datazoom_opts=[opts.DataZoomOpts(yaxis_index=0, type_="slider", orient="vertical")])
# c.render("Horizontal bar chart of second-hand housing types in Wuhan.html")
c.render_notebook()

You can see that the main house types are:

  • One bedroom, one living room, one kitchen, one bathroom
  • Two bedrooms, one living room, one kitchen, one bathroom
  • Two bedrooms, two living rooms, one kitchen, one bathroom
  • Three bedrooms, one living room, one kitchen, one bathroom
  • Three bedrooms, two living rooms, one kitchen, one bathroom
  • Three bedrooms, two living rooms, one kitchen, two bathrooms
  • Four bedrooms, one living room, one kitchen, two bathrooms

The most common is two bedrooms, two living rooms, one kitchen and one bathroom, which matches what most young people need: the bigger ones are unaffordable, and the smaller ones are too cramped to live in.

10. Pie chart of house decoration

Now let's look at decoration. My guess is that few second-hand houses are bare shell (unfinished). To see the actual situation, count the decoration types and the number of houses of each type, and draw a pie chart.

decoration_list = df['decoration'].value_counts().index.tolist()
count_list = df['decoration'].value_counts().values.tolist()

from pyecharts.charts import Pie

c = Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add(series_name="House Decoration",
        data_pair=[list(z) for z in zip(decoration_list, count_list)],
        rosetype="radius",
        radius="55%",
        center=["50%", "50%"],
        label_opts=opts.LabelOpts(is_show=False, position="center"))
c.set_global_opts(title_opts=opts.TitleOpts(
                  title="Wuhan second-hand housing decoration pie chart",
                  pos_left="center",
                  pos_top="20",
                  title_textstyle_opts=opts.TextStyleOpts(color="#fff")),
                  legend_opts=opts.LegendOpts(is_show=False))
c.set_series_opts(tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"),
                label_opts=opts.LabelOpts(color="rgba(255, 255, 255, 255)"))
# c.render("customized_pie.html")
c.render_notebook()

According to the chart, more than half of the second-hand houses are finely decorated. After all, people have lived in them, and even with a bit of decoration a house can certainly be resold at a more attractive price. Nearly 25% of the second-hand houses are simply decorated, and the rest are other decoration types and bare shell. Bare-shell second-hand houses are indeed rare, about as expected.

11. Relationship chart between elevator and housing price

Below, we look at the proportion of second-hand houses with and without elevators, and at the prices of houses with and without elevators. According to the chart, houses with elevators make up the vast majority in every district. Except in Dongxihu District, where houses without an elevator cost slightly more than those with one, houses with elevators are more expensive in all the other districts, and the gap is most obvious in Wuchang District. This also confirms the relationship between floor and price seen above: high-rise buildings in Wuchang District are popular because of the view. The bar for Xinzhou District overlaps with the line chart, so drawing the two charts separately would work better.
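The code for this chart is not shown either. A minimal sketch of how the two quantities could be computed with pandas follows. The column name elevator, its yes/no values, the district names and all prices are hypothetical, and using the median unit price is an assumption.

```python
import pandas as pd

# Invented toy data; 'elevator' is a hypothetical column name
toy = pd.DataFrame({
    "region": ["District A"] * 4 + ["District B"] * 2,
    "elevator": ["yes", "yes", "yes", "no", "yes", "no"],
    "unit_price": [26000, 25000, 27000, 18000, 12000, 12500],
})

# Share of listings with an elevator in each district
elevator_share = (toy["elevator"] == "yes").groupby(toy["region"]).mean()

# Median unit price by district and elevator status, one column per status
price_table = toy.groupby(["region", "elevator"])["unit_price"].median().unstack()

# Positive gap: houses with an elevator are more expensive in that district
price_gap = price_table["yes"] - price_table["no"]
print(elevator_share)
print(price_gap)
```

The share would feed the bar series and the two price columns the line series, with a negative gap marking a district where houses without an elevator cost more.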

12. Funnel chart of popular second-hand housing labels

Collect the labels of popular second-hand houses that are followed by more than 3 people and draw a funnel chart to see what these houses have in common.

from collections import Counter

# Only count hot second-hand houses with more than three followers
detail_df = df.loc[df['follower_numbers'] > 3]
label_list = []
for house_label in detail_df['house_label'].values.tolist():
    label_list += house_label.split(', ')
label_and_count = Counter(label_list)
label_and_count = label_and_count.most_common()

from pyecharts.charts import Funnel

c = Funnel(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
c.add("Goods", [list(z) for z in label_and_count])
c.set_global_opts(title_opts=opts.TitleOpts(title="Wuhan popular second-hand housing label funnel map"))
c.render("Wuhan popular second-hand housing tag funnel graph.html")
c.render_notebook()

According to the chart, VR decoration is clearly the most frequent label among popular second-hand houses. I tried VR viewing on Lianjia before; it is really convenient and leaves no blind spots, but it made me a little dizzy, ha ha! Next come viewing at any time and ownership of at least two years or five years; after all, those mean paying less tax.
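To show what the counting step produces before it reaches the funnel chart, here is a small standalone sketch using only the standard library. The label strings are invented and only mimic the comma-separated format that the code above splits on.

```python
from collections import Counter

# Invented label strings, one per listing
house_labels = [
    "VR decoration, Viewing at any time",
    "VR decoration, Owned for two years",
    "VR decoration",
]

label_list = []
for house_label in house_labels:
    label_list += house_label.split(", ")

# (label, count) tuples sorted from most to least frequent
label_and_count = Counter(label_list).most_common()

# Funnel data in the [name, value] layout used above
funnel_data = [list(pair) for pair in label_and_count]
print(funnel_data)
```

Because most_common already sorts by count, the widest band of the funnel corresponds to the first entry.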

13. Keywords in popular listing titles

Next we extract keywords (hot words) from the titles of popular second-hand houses. First, load the local stop-word list.

def load_stopwords(read_path):
    """
    Read each line of the file and save it to a list.
    :param read_path: path of the file to read
    :return: list containing each line of the file
    """
    result = []
    with open(read_path, "r", encoding='utf-8') as f:
        for line in f.readlines():
            line = line.strip('\n')  # Remove line breaks for each element in the list
            result.append(line)
    return result

# load Chinese stop words
stopwords = load_stopwords('wordcloud_stopwords.txt')

Segment each title with jieba and remove the stop words.

import jieba

# Add a custom dictionary
jieba.load_userdict("Custom dictionary.txt")

token_list = []
# Word segmentation for the title content and save the word segmentation results in the list
for title in detail_df['title']:
    tokens = jieba.lcut(title, cut_all=False)
    token_list += [token for token in tokens if token not in stopwords]
len(token_list)
29203
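With tens of thousands of tokens, checking each one against a list of stop words scans the list every time. One possible refinement, sketched below on invented tokens and without jieba, is to convert the stop words to a set first. This is a suggestion, not part of the original code.

```python
# Tokens as a segmenter might return them, one list per title; invented for illustration
titles_tokens = [
    ["elevator", "the", "well", "lit"],
    ["the", "fine", "decoration", "elevator"],
]
stopwords = ["the", "well"]

# A set gives constant-time membership tests on average
stopword_set = set(stopwords)

token_list = [
    token
    for tokens in titles_tokens
    for token in tokens
    if token not in stopword_set
]
print(token_list)
print(len(token_list))
```

The filtered result is identical to filtering against the list; only the lookup cost changes.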

Using the token list, the Counter class counts how many times each word occurs, and the 100 most frequent words are used to draw the word cloud.

from pyecharts.charts import WordCloud
from collections import Counter

token_count_list = Counter(token_list).most_common(100)
new_token_list = []
for token, count in token_count_list:
    new_token_list.append((token, str(count)))

c = WordCloud()   
c.add(series_name="Hot word", data_pair=new_token_list, word_size_range=[20, 200])
c.set_global_opts(
    title_opts=opts.TitleOpts(
        title="Wuhan hot second-hand housing title keywords", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
    ),
    tooltip_opts=opts.TooltipOpts(is_show=True),
)
c.render("Wuhan hot second-hand housing title keywords.html")
c.render_notebook()

Words that appear often in the titles of popular second-hand houses include: elevator, floor, lighting, fine decoration, housing, owned for two years, transportation, and so on. There are also some location-related words. These keywords can serve as a reference for sellers, and they may also be what we need to pay attention to when buying a house.

Conclusion

After analyzing the data from so many angles, we have a rough picture of the second-hand housing market in Wuhan: prices in the city center start at about 15,000 yuan/m², and the lowest prices on the outskirts are about 8,000 yuan/m². Choose the floor according to your own needs. If you want a view, a high floor is fine, but the price is generally higher, and if money is no object, Wuchang District is very appealing. An area of about 100 m² is enough. A much larger home may cost far more, since the box plot shows that in every district some listings are priced well above the average. Decoration is purely a matter of personal taste. I prefer to decorate a home myself: only you understand your own style, and someone else's decoration may never feel quite right. For the layout, choose the popular two bedrooms, two living rooms, one kitchen and one bathroom. There are other things to pay attention to as well, such as lighting, building age, transportation, and environment. Ha ha, I don't sell houses, so I can only draw some superficial insights from the data. Take them with a smile, because this data alone is certainly not enough for a real analysis. All in all, three words: can't afford, goodbye!


Remember to “like” 👇🏻