K nearest neighbor is, to me, one of the purest and clearest ideas there is. The k-nearest neighbors algorithm (KNN) is simply this idea applied to data.

Your salary is determined by the people around you.

Your level is determined by the level of the people closest to you.

The world you see is determined by the people around you.

But an idea remains just an idea unless it can be turned into code and applied to data.

We deepen our understanding of the method by asking questions and then applying the method to solve them.

Question: If you’re an Airbnb host, how do you rent your house?

Analysis: On the Airbnb platform, tenants choose a listing they are satisfied with based on its rental information, mainly the price, number of bedrooms, type of house, location, and so on. Setting the rent for a house is therefore closely tied to market dynamics: for the same type of house, if we charge too much, no one rents it; if we charge too little, we leave profit on the table.

Answer: Collect information on listings with conditions similar to our house, find the ones most similar to ours, and then take the average of their prices as the rent for our house.

This is the k-nearest neighbors (KNN) algorithm. Its core idea: the label of an unlabeled sample is decided by a vote of the k nearest neighbors around it (for a regression target such as price, by averaging their values).
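As a toy illustration, here is a minimal NumPy sketch with made-up numbers; for a regression target such as rent, the prediction is simply the average of the k nearest neighbors' values.

import numpy as np

# hypothetical 1-D example: neighbors' feature values and their prices
features = np.array([2, 3, 4, 4, 6, 8])
prices = np.array([80, 95, 110, 105, 150, 200])

query, k = 5, 3
nearest = np.argsort(np.abs(features - query))[:k]  # indices of the k closest samples
prediction = prices[nearest].mean()                  # regression: average the neighbors' prices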

This article walks through the whole application process of the algorithm on the rent pricing problem, covering the following parts.

  1. Read the data
  2. Data processing
  3. Hand-written KNN prediction
  4. Model prediction with sklearn
  5. Hyperparameter optimization
  6. Cross validation
  7. Conclusion

To be clear, this dataset is public, and you can find a lot of material on the subject online. This article tries to explain it completely and accurately, but if you can find more detailed material, that’s great.

1. Read data

We find that the target variable price, as well as cleaning_fee and security_deposit, are stored in the wrong format, and several other variables are character-typed, so they need to be handled. I have transposed the dataframe for easier viewing.
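For reference, a minimal way to load and inspect the data with pandas; the file name here is an assumption, so adjust it to wherever your copy of the public listings extract lives.

import pandas as pd
import numpy as np

# assumed file name for the public D.C. Airbnb listings extract
dc_listings = pd.read_csv('dc_airbnb.csv')
# transpose head() so the many columns are easier to scan
print(dc_listings.head(3).T)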

2. Data processing

Here we only deal with price, so that we can focus on the idea of the algorithm itself.

# process the target variable price and convert it to a numeric type
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)  # regex=False so '$' is a literal character
dc_listings['price'] = stripped_dollars.astype('float')

# the K nearest neighbor algorithm is still a model, so we need a training set and a test set
sample_num = len(dc_listings)
# shuffle the data first so that the train/test split is random and valid
dc_listings = dc_listings.loc[np.random.permutation(sample_num)]
train_df = dc_listings.iloc[0:int(0.7*sample_num)]
test_df = dc_listings.iloc[int(0.7*sample_num):]

3. Hand-written KNN prediction

Following the definition of the k-nearest neighbors algorithm, we write the code directly. For simplicity and efficiency, we predict from a single variable only.

The number of people a listing accommodates should be highly correlated with rent, and so should its size. Here we use the former.

Our goal is to understand the algorithmic logic. In practice, a single variable is not usually considered.

# the model is built on train_df
def predict_price(new_listing):
    temp_df = train_df.copy()
    # distance to the query listing on the single feature 'accommodates'
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    # take the 5 nearest neighbors and average their prices
    nearest_neighbor_prices = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

# evaluation is done on test_df
test_df['predicted_price'] = test_df['accommodates'].apply(predict_price)
# common error metrics: MAE (mean absolute error), MSE (mean squared error), RMSE (root mean squared error)
test_df['squared_error'] = (test_df['predicted_price'] - test_df['price']) ** 2
mse = test_df['squared_error'].mean()
rmse = mse ** (1/2)

It is worth emphasizing that the model is built on the training set, while the prediction is evaluated on the test set. Strictly speaking there is also a third class of samples, OOT (out-of-time) samples drawn from a different time period.

The results show that the prediction is reasonably effective even though we used only the accommodates variable (the number of people a listing sleeps) to select neighbors.

4. Model prediction with sklearn

This time we use more variables: we simply drop the string columns and the variables that are hard to interpret, and use everything that is left.

When we use multiple variables, they are measured on different scales, so we need to standardize them. Standardization preserves the shape of each variable's distribution while putting the variables on a comparable scale, so they can be combined in a single distance calculation.

# Remove non-numeric variables and inappropriate variables
drop_columns = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']
dc_listings = dc_listings.drop(drop_columns, axis=1)
# remove columns with a high percentage of missing variables
dc_listings = dc_listings.drop(['cleaning_fee', 'security_deposit'], axis=1)
# remove rows with missing values
dc_listings = dc_listings.dropna(axis=0)
# Multiple variables have different dimensions and need to be standardized
normalized_listings = (dc_listings - dc_listings.mean())/(dc_listings.std())
normalized_listings['price'] = dc_listings['price']

# So we have a data set that can be used for modeling, and the training set test set is divided 7:3
train_df = normalized_listings.iloc[0:int(0.7*len(normalized_listings))]
test_df = normalized_listings.iloc[int(0.7*len(normalized_listings)):]
# price is y, and the rest of the variables are X
features = train_df.columns.tolist()
features.remove('price')

The processed data set is as follows, where Price is the target we want to predict and the rest are available variables.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn.fit(train_df[features], train_df['price'])
predictions = knn.predict(test_df[features])
mse = mean_squared_error(test_df['price'], predictions)
rmse = mse ** (1/2)

The final RMSE is 111.9, smaller than the 117.4 of the univariate KNN, so the result improves. Strictly speaking, the comparison is not entirely fair, because we dropped a small number of samples with missing features.

5. Hyperparameter Optimization

In parts 3 and 4 we simply preset k=5, but that was an arbitrary choice. Whether this value is reasonable, let alone optimal, still needs to be determined.

Here k is a hyperparameter. For any data set, as long as you use KNN, you have to decide on this k value.

The value of k is not learned from the data by the model; it is preset, and then selected after the fact according to the results. Every hyperparameter, in KNN and in other algorithms alike, is determined this way.

import matplotlib.pyplot as plt
%matplotlib inline

hyper_params = [x for x in range(1, 21)]
rmse_values = []
features = train_df.columns.tolist()
features.remove('price')

for hp in hyper_params:
    knn = KNeighborsRegressor(n_neighbors=hp, algorithm='brute')
    knn.fit(train_df[features], train_df['price'])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df['price'], predictions)
    rmse = mse**(1/2)
    rmse_values.append(rmse)

plt.plot(hyper_params, rmse_values, c='r', linestyle='-', marker='+')

We find that as k grows, the deviation between the predicted price and the real price tends to shrink, i.e., the prediction becomes more accurate. But note that the larger k is, the larger the amount of computation.

To pick k we can use the elbow method: choose the inflection point in the figure above, the "elbow" of the curve.

k=7 or k=10 is probably a better choice than k=5.
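As a rough cross-check of the elbow reading, one can also look at which k in the sweep above gives the lowest test RMSE; the small sketch below reuses hyper_params and rmse_values from the loop (taking the minimum is only a heuristic, not the elbow method itself).

# pick the k with the lowest test RMSE from the sweep above
best_k = hyper_params[int(np.argmin(rmse_values))]
print("k with the lowest test RMSE:", best_k)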

6. Cross Validation

Our results above depend entirely on one particular training set and test set, and although the split was random, a single split can still be a matter of chance, especially when the sample size is not large.

Cross validation is designed to solve this problem: we split the same sample set into different training and test sets, retrain and predict for each split, and then look at the results together.

The most widely used form is n-fold cross validation: the data set is randomly divided into n parts; in each round, n-1 parts serve as the training set and the remaining part as the test set, giving n rounds of training and prediction.

We can just write the logic by hand as follows.

sample_num = len(normalized_listings)
normalized_listings.loc[normalized_listings.index[0:int(0.2*sample_num)], "fold"] = 1
normalized_listings.loc[normalized_listings.index[int(0.2*sample_num):int(0.4*sample_num)], "fold"] = 2
normalized_listings.loc[normalized_listings.index[int(0.4*sample_num):int(0.6*sample_num)], "fold"] = 3
normalized_listings.loc[normalized_listings.index[int(0.6*sample_num):int(0.8*sample_num)], "fold"] = 4
normalized_listings.loc[normalized_listings.index[int(0.8*sample_num):], "fold"] = 5

fold_ids = [1, 2, 3, 4, 5]
def train_and_validate(df, folds):
    fold_rmses = []
    for fold in folds:
        # Train
        model = KNeighborsRegressor()
        train = df[df["fold"] != fold]
        test = df[df["fold"] == fold].copy()
        model.fit(train[features], train["price"])
        # Predict
        labels = model.predict(test[features])
        test["predicted_price"] = labels
        mse = mean_squared_error(test["price"], test["predicted_price"])
        rmse = mse**(1/2)
        fold_rmses.append(rmse)
    return fold_rmses

rmses = train_and_validate(normalized_listings, fold_ids)
avg_rmse = np.mean(rmses)

In engineering practice we should make full use of existing tools and resources. The sklearn library contains implementations of the commonly used machine learning algorithms, which we can use directly for this validation.

from sklearn.model_selection import cross_val_score, KFold
kf = KFold(5, shuffle=True, random_state=1)
model = KNeighborsRegressor()
mses = cross_val_score(model, normalized_listings[features], normalized_listings["price"], scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)

Cross-validation results are more reliable, especially on small data sets, because they reduce chance error to some extent.

Combining cross validation with hyperparameter optimization, we can generally obtain the best KNN prediction for this data set.
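For example, sklearn's GridSearchCV can sweep k under cross validation in a single call; here is a minimal sketch, assuming normalized_listings and features as prepared above.

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# sweep k = 1..20, scoring each value with 5-fold cross validation
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsRegressor(algorithm='brute'),
                    param_grid,
                    scoring='neg_mean_squared_error',
                    cv=KFold(5, shuffle=True, random_state=1))
grid.fit(normalized_listings[features], normalized_listings['price'])
best_k = grid.best_params_['n_neighbors']
best_rmse = (-grid.best_score_) ** 0.5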

# hyperparameter optimization: here we vary the number of folds
num_folds = [x for x in range(2, 50, 2)]
rmse_values = []

for fold in num_folds:
    kf = KFold(fold, shuffle=True, random_state=1)
    model = KNeighborsRegressor()
    mses = cross_val_score(model, normalized_listings[features], normalized_listings["price"], scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    std_rmse = np.std(rmses)
    rmse_values.append(avg_rmse)

plt.plot(num_folds, rmse_values, c='r', linestyle='-', marker='+')

We see the same trend: the larger k is, the better the result. And because cross validation alleviates over-fitting to some extent, the ideal k value can be larger, that is, the model can afford to be more complex.

7. Conclusion

From the core idea of the k-nearest neighbors algorithm and the code above, we can see that it is an instance-based learning method: it depends entirely on the instances in the training set.

The algorithm involves almost no mathematics and is easy to understand. However, it is not suited to large data sets, because every prediction requires computing the distance from the query point to every sample in the training set and then sorting those distances in increasing order, which is a huge amount of computation.
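One practical mitigation, shown below as a sketch: sklearn's KNeighborsRegressor can build a tree-based index over the training set instead of the brute-force search used above, so each query no longer scans every training sample.

from sklearn.neighbors import KNeighborsRegressor

# kd_tree (or ball_tree) builds an index once at fit time,
# trading indexing cost for much faster neighbor queries
knn = KNeighborsRegressor(n_neighbors=5, algorithm='kd_tree')
knn.fit(train_df[features], train_df['price'])
predictions = knn.predict(test_df[features])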

If a mathematical function can describe the relationship between the feature variables and the target variable, then once that function has been fitted on the training set, prediction becomes a simple calculation and the computational cost drops dramatically.
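As a concrete contrast, a linear model learns a small set of coefficients from the training set once, after which a prediction is just a dot product and the training data is no longer needed. A minimal sketch, assuming the same train_df, test_df and features as above:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(train_df[features], train_df['price'])   # fit the function once on the training set
predictions = lr.predict(test_df[features])     # prediction is just x·w + b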

Most other classical machine learning algorithms are essentially such function-representation problems. We will look at them later.