This is the 8th day of my participation in Gwen Challenge

Author: Cola

Source: Coke’s path to data analysis

Please contact authorization for reprinting (wechat ID: data_COLA)

I. Descriptive statistical analysis

In Excel, the 【 Description statistics 】 function in the 【 Data Analysis 】 function can be used to view the statistical indicators commonly used in the data set, but it can only be used for the statistics of numerical data.

Pandas provides a descriptive statistical analysis of an entire dataset using the describe method. The results are for numerical data only.

Df_list.describe ()Copy the code

The results are as follows, and you can see that count, mean, STD, min, Max, 25%, 50%, and 75% represent 3/4, median, and 1/4 digits, respectively.

The ranks of transpose

Because the field is too many, so here can be transposed, convenient to view, with. T transpose

Describe ().tCopy the code

The result is shown in the figure, which is more in line with the habit of a table. It can be seen that only numerical data can be counted, while character data cannot be counted.

It is observed that the minimum value, 1/4 digit, median digit and 3/4 digit of the field of minimum_nights are all 1, indicating that the minimum number of nights in most houses is 1 day. The same applies to the number of comments per month (reviews_per_month) field

Second, group analysis

Excel PivotTable can be used to calculate data groups.

Take a look at the values of the Neighborhood_new field and count the number of occurrences using the value_COUNTS method

Df_list [" Neighborhood_new "].value_counts()Copy the code

As a result, you can see the total number of neighborhood_new categories and their frequency of occurrence in descending order. You can see that the Chaoyang district has the most properties and the Pingdale district has the least.

You can also use the groupby method for group counting

# grouping df_list. Groupby (" neighborhood_new ") [r]. "neighborhood_new" count ()Copy the code

You get the same answer

The room_type_new column can also be grouped to see the results

df_list["room_type_new"].value_counts()
Copy the code

The Entire home has the most rooms and the Shared room has the fewest rooms.

3. Crossover analysis

grouping

For regional grouping, the level of housing prices in different regions can also be counted by groupBY method, but multiple summary methods can be used in one time by AGG method.

df_list.groupby("neighborhood_new")["price"].agg(["max","min","mean","count"])
Copy the code

As shown in the figure, the neighborhood_new field is grouped, and the maximum and minimum average values of the grouped prices are calculated and counted. It can be seen that huairou District has the highest average house prices, while Fengtai District has the lowest.

Group room types and rank the results in descending order of mean

r_p = df_list.groupby("room_type_new")["price"].agg(["max","min","mean","count"]).reset_index()
r_p.sort_values("mean",ascending = False)
Copy the code

As shown in the figure, the average house price of whole rent is the highest, while that of joint rent is the lowest, which is a very reasonable result.

perspective

Pivot_table and Pivottables in Excel are the same type of operations. The first parameter is the data to be PivotTable. Values is the value area in the Excel PivotTable, which is the field to be summarized. The columns parameter is the column area, and the AGgFUc parameter is the type to summarize values.

pd.pivot_table(df_list,values = "price",index = "neighborhood_new",
                columns = "room_type_new",aggfunc = "mean",margins = True)
Copy the code

The result is shown in the figure, which shows the price distribution of whole rent, shared rent and single room in each region.

4. Correlation analysis

Correlation analysis is used to describe the results of the correlation between variables, represented by correlation coefficient R, r>0 means positive correlation, r<0 means negative correlation, the closer the absolute value of R is to 1, the higher the correlation is. Excel can directly calculate the correlation coefficient of each field by using the correlation coefficient function in the data analysis tool.

Python can use the corR function to calculate the correlation coefficient between data, for the entire table, and take the result to four decimal places

Df_list.corr ().round(4)Copy the code

The result is as follows, and the correlation coefficients between the columns are obtained.

But what we’re really interested in here is the correlation between them and the prices, which is the red part of the graph, and you can sort the values in this column.

Numerical sequence

Numeric sorting is to arrange the entire table in ascending or descending order, using the sort_values method. To descending the price column of the data box after the correlation coefficient is calculated, the first parameter is to sort the column, and the second parameter is ascending = False, which is True by default.

Corr_p = df_list.corr().round(4) corr_p["price"].sort_values(ascending = False)Copy the code

The results are as follows. It can be seen that the house price has the highest correlation with latitude and longitude. Besides, compared with other variables, availability_365 has the most positive correlation with price. Second, the number of reviews per month (reviews_per_month) is negatively correlated with price.