Preface

In this project, we crawl rental listings from 58.com and use that data to predict the prices of new listings.

The source code consists of three parts: the machine learning model, the web front end, and the crawler.

The prediction is done with regression models. The results are fairly simple, and the project is a good way to learn basic regression prediction in Python.

This article implements three regression algorithms (a minimal set-up sketch follows the list):


1. Support vector regression (SVR)

2. Logistic regression

3. Kernel ridge regression (ridge regression with the kernel trick and L2 regularization)
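As a minimal sketch (assuming scikit-learn, which the training script below also uses), the three estimators could be constructed like this:

from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_ridge import KernelRidge

svr = SVR(kernel='rbf', gamma=0.1)                    # support vector regression
logistic = LogisticRegression()                       # logistic regression (strictly a classifier; the script tries it as an alternative)
kernel_ridge = KernelRidge(kernel='rbf', gamma=0.1)   # kernel ridge regression (L2 regularization)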

The implementation process

The complete code for the project has been uploaded to GitHub; the implementation for this section is here:

https://github.com/TomorrowIsBetter/crawler/tree/master/price_prediction

The index.csv file in that directory contains part of the crawled data. Let's look at the file format directly:


_id,date,areas,square,methods,price,direction,type,houseAreas

http://cc.58.com/zufang/34009015301200x.shtml,1.53e+12,0,190.2,1,20000,0,4 rooms 2 hall,CITIC City (villa)

http://cc.58.com/zufang/33977466998861x.shtml,1.53e+12,0,190.2,1,20000,0,4 rooms 2 hall,CITIC City (villa)

http://cc.58.com/zufang/32214749419981x.shtml,1.53e+12,5,400,1,15000,0,4 rooms 3 hall,GEN Uptown

http://cc.58.com/zufang/34129082983861x.shtml,1.53e+12,0,500,1,15000,0,5 rooms 3 hall 2 bath,China Shipping Lane East County
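As a quick sanity check, the file can be loaded with pandas and inspected. A minimal sketch, assuming the column layout shown above:

import pandas

# Load the crawled listings and look at the columns used later for training.
df = pandas.read_csv("index.csv")
print(df.columns.tolist())
print(df[['areas', 'square', 'direction', 'price']].head())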

In first-tier cities such as Beijing, hardly anyone rents or posts listings on 58.com directly, and most of the data comes from agencies. In second- and third-tier cities, however, the site is much more usable, so the data captured here comes from several second- and third-tier Chinese cities.

Model training

The algorithms are statistical learning methods implemented in Python with the scikit-learn library. Part of the code is shown below:


import numpy as np
import pandas
import matplotlib.pyplot as plt
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib
from sklearn.utils import shuffle
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV


def normalization(data, tag=""):
    # Scale a numeric column: subtract the mean and divide by the range.
    mean = data.mean()
    maximum = data.max()
    minimum = data.min()
    print(tag, mean, maximum, minimum)
    return (data - mean) / (maximum - minimum)


# Load the crawled data and shuffle it twice.
df = pandas.read_csv("index.csv")
df = shuffle(df)
df = shuffle(df)
square = df['square'].values
square = normalization(square)
areas = df['areas'].values / 5
direction = df['direction'].values / 4
price = df['price'].values
# price = normalization(price)

print(areas.shape, square.shape, direction.shape)
data = np.array([areas, square, direction])
data = data.T

# 80/20 split into training and validation sets.
train_fraction = .8
train_number = int(df.shape[0] * train_fraction)
X_train = data[:train_number]
X_test = data[train_number:]
y_train = price[:train_number]
y_test = price[train_number:]
print(np.max(price))

# model: grid search over SVR hyperparameters with 5-fold cross-validation
clf = GridSearchCV(SVR(kernel='rbf', gamma=0.1),
                   {"C": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)}, cv=5)
# clf = GridSearchCV(LogisticRegression(),
#                    {"C": [1e0, 1e1, 1e2, 1e3], "random_state": list(range(10))}, cv=5)
# clf = GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1),
#                    {"alpha": [1e0, 1e1, 1e2, 1e3], "gamma": np.logspace(-2, 2, 5)}, cv=5)
clf.fit(X_train, y_train)
result = clf.score(X_train, y_train)
test = clf.score(X_test, y_test)
c = clf.best_params_
y = clf.predict(X_test)
x = list(range(len(y)))

# Upper plot: predictions (red) vs. actual values (green).
plt.subplot(2, 1, 1)
plt.scatter(x=x, y=y, color='r')
plt.scatter(x=x, y=y_test, color='g')
print(clf.best_params_, result, test)

# Absolute deviation between predictions and actual values.
deviation = y - y_test
deviation = deviation.flatten()
deviation = abs(deviation)
print(np.median(deviation))

# Lower plot: histogram of the deviations; then save the trained model.
plt.subplot(2, 1, 2)
plt.hist(deviation, 10)
joblib.dump(clf, "model.m")
plt.show()


The complete code is at this location:

https://github.com/TomorrowIsBetter/crawler/blob/master/price_prediction/train.py

The code was written in a hurry, so the coding style is not entirely standardized.
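For illustration, here is a minimal sketch of how the saved model.m could be loaded and used to predict the price of a new listing. It assumes the same feature order and scaling as the training script (areas / 5, range-normalized square, direction / 4); the numbers below are placeholders:

import numpy as np
import joblib  # on older scikit-learn versions: from sklearn.externals import joblib

clf = joblib.load("model.m")

# Placeholder feature values for a hypothetical listing, scaled the same way as in training.
areas = 3 / 5                        # area category divided by 5
square = (60 - 90.0) / (500 - 20)    # (value - mean) / (max - min), using the statistics printed by normalization()
direction = 2 / 4                    # direction category divided by 4

X_new = np.array([[areas, square, direction]])
print(clf.predict(X_new))            # predicted monthly rent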

Model specification

The code is divided into several parts:

1. Data preprocessing: load the data, shuffle it, and normalize it

2. Data set splitting: split the data into a training set and a validation set

3. Model training: automatically search for the best hyperparameters and train the model on the training set

4. Data visualization: visualize the prediction results and the deviation distribution


Normalization

The numerical features are normalized according to the following formula (the normalization function in the code above):

x_norm = (x - mean(x)) / (max(x) - min(x))

The categorical fields are labeled {0, 1, ..., N}; they are not normalized this way but are simply divided by N.
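A tiny sketch of both scalings on made-up values:

import numpy as np

square = np.array([45.0, 60.0, 120.0, 500.0])                            # square metres (made-up values)
square_norm = (square - square.mean()) / (square.max() - square.min())   # (x - mean(x)) / (max(x) - min(x))

areas = np.array([0, 1, 3, 5])    # categorical label in {0, 1, ..., N}, with N = 5 as in the script
areas_norm = areas / 5            # categorical fields are simply divided by N

print(square_norm, areas_norm)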

Data set shuffling

The training data must be shuffled to prevent bias in the prediction results caused by a skewed (ordered) training set.

If you want to see the effect, you can comment out these two lines of code:


df = shuffle(df)


df = shuffle(df)

After that, you can see intuitively how training on unshuffled data distorts the model's predictions.
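To see why shuffling matters, here is a small made-up example: if the rows happen to be ordered by price, an 80/20 split without shuffling leaves all the expensive listings in the validation set, so the model never sees them during training:

import numpy as np

# Made-up prices, ordered from cheapest to most expensive.
price = np.arange(1000, 21000, 1000)
train_number = int(len(price) * 0.8)

print(price[:train_number].max())   # the training set tops out at a low price
print(price[train_number:].min())   # the validation set contains only expensive listings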

Data set splitting

In this case, the data set is split into a training set and a validation set, with 5-fold cross-validation used inside the hyperparameter search. For deep learning with very large data sets, the data is usually divided into training, validation, and test sets; here the data set is relatively small, so the training and validation sets are split 8:2.
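For reference, the same 8:2 split (with shuffling) could also be done with scikit-learn's train_test_split helper; this is just a sketch with toy data, not what the script above uses:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and price vector built in the training script.
data = np.random.rand(100, 3)
price = np.random.rand(100) * 20000

X_train, X_test, y_train, y_test = train_test_split(data, price, train_size=0.8, shuffle=True)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)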

The test results

Here is the visualization the program produces when it runs:



Green points are the actual values of the samples and red points are the predicted results. As the upper plot shows, the predictions are acceptable. The lower plot shows the deviation between the predicted and actual values as a histogram: the distribution is skewed toward the left of the axis, mainly concentrated in the range [0, 500], indicating that the model achieves a useful prediction.
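For example, the share of predictions that fall within that [0, 500] band can be computed directly from the deviations in the script; a sketch with made-up numbers:

import numpy as np

# Made-up absolute deviations between predicted and actual prices.
deviation = np.array([120, 430, 60, 900, 310, 480, 1500, 200])

within_500 = np.mean(deviation <= 500)
print(np.median(deviation), within_500)   # median deviation and the fraction within [0, 500]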

Conclusion

The training data and the other parts of the implementation are on GitHub for you to check out:

https://github.com/TomorrowIsBetter/crawler

This part lives in the price_prediction directory; the other parts are the crawler, implemented with Node.js, and the web front end, built with Ant Design. Readers interested in the web front end can take a look at those as well.

For those interested in big data, machine learning, and system architecture, take a look at this course: Mastering the Spark Machine Learning Library for Big Data Development.

The knowledge covered in that course is more systematic, and questions about machine learning and big data are discussed in the course's Q&A section.