1. Introduction

  • The task is to classify the data sets in ex2data1.txt and ex2data2.txt with logistic regression. ex2data1.txt contains students' historical scores on two exams and whether they were admitted, while ex2data2.txt contains two test results for chip products and whether each chip passed. The goal is to use the algorithm to classify students and chips: to judge which students will be admitted and which chips are good.

  • The pieces to implement are the sigmoid function, the cost function, the gradient descent function, regularization, feature mapping, and drawing the decision boundary.

  • Import the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-dark-palette')  # use plt.style.available to see which styles are available
from sklearn.metrics import classification_report  # this package provides the classification evaluation report

2. Prepare data

data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  # look at the first five rows
exam1 exam2 admitted
0 34.623660 78.024693 0
1 30.286711 43.894998 0
2 35.847409 72.902198 0
3 60.182599 86.308552 1
4 79.032736 75.344376 1
  • Let’s look at the data
data.describe()
exam1 exam2 admitted
count 100.000000 100.000000 100.000000
mean 65.644274 66.221998 0.600000
std 19.458222 18.582783 0.492366
min 30.058822 30.603263 0.000000
25% 50.919511 48.179205 0.000000
50% 67.032988 67.682381 1.000000
75% 80.212529 79.360605 1.000000
max 99.827858 98.869436 1.000000
sns.set(context="notebook", style="darkgrid")

sns.lmplot('exam1', 'exam2', hue='admitted', data=data,  # the hue parameter colors the points by class
           height=6, 
           fit_reg=False,
           scatter_kws={"s": 30}  # set the size of the plotted points
          )
plt.show()  # take a look at the data

  • Helper functions for reading the data
def get_X(df):  # read the features
    # use pd.concat to add an x_0 column of ones to the feature matrix X (the independent variables)
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # ones is an m-row, 1-column dataframe
    data = pd.concat([ones, df], axis=1)  # merge data by column
    return data.iloc[:, :-1].values  # return the feature matrix as an ndarray


def get_y(df):  # read the label
    # the last column is the target value, i.e. the dependent variable
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df


def normalize_feature(df):
    # keep the features on the same scale
    return df.apply(lambda column: (column - column.mean()) / column.std())  # feature scaling
X = get_X(data)
print(X.shape)

y = get_y(data)
print(y.shape)
(100, 3)
(100,)

3. Sigmoid function

Classification problems, especially binary classification problems, are black and white: what we need to know is whether the label is 0 or 1, and with what probability. Classification problems are also often nonlinear, so plain linear regression fails in most of these scenarios. The sigmoid function is a hypothesis function that satisfies these requirements: give it a large number and it returns a value close to one; give it a very small (very negative) number and it returns a value close to zero, squashing any complex nonlinear combination of the inputs into a number between 0 and 1.

It is commonly written as $g\left( z \right)=\frac{1}{1+{{e}^{-z}}}$. Combining it with the hypothesis function of linear regression gives the hypothesis function of the logistic regression model: ${{h}_{\theta }}\left( x \right)=\frac{1}{1+{{e}^{-{{\theta }^{T}}x}}}=\frac{1}{1+{{e}^{-(\theta_0x_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n)}}}$

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(np.arange(-10, 10, step=0.01),
        sigmoid(np.arange(-10, 10, step=0.01)))
ax.set_ylim((-0.1, 1.1))
ax.set_xlabel('z', fontsize=18)
ax.set_ylabel('g(z)', fontsize=18)
ax.set_title('Sigmoid Function', fontsize=18)
plt.show()

4. Cost function

The cost function (objective function) of logistic regression no longer uses the sum-of-squares objective from linear regression. Because the hypothesis function now squashes its output to values between 0 and 1, the sum-of-squares cost becomes non-convex and full of local minima, so gradient descent may fail to converge or may never find the global optimum.

Therefore an improvement is needed. Given the purpose of the task, the function we want should behave as follows: since this is a classification problem, when the actual label is 1 the cost should keep decreasing as $h_\theta(x)$ moves toward 1, and when the actual label is 0 the cost should keep increasing as $h_\theta(x)$ moves toward 1, like the two curves below:
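
A minimal matplotlib sketch of those two curves, $-\log(h)$ for $y=1$ and $-\log(1-h)$ for $y=0$ over sigmoid outputs $h\in(0,1)$ (the figure size and sampling grid are arbitrary choices here):

h = np.linspace(0.001, 0.999, 500)   # possible values of the hypothesis h_theta(x)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(h, -np.log(h))              # cost when the actual label is y = 1: falls to 0 as h -> 1
ax1.set_title('y = 1: -log(h)')
ax2.plot(h, -np.log(1 - h))          # cost when the actual label is y = 0: blows up as h -> 1
ax2.set_title('y = 0: -log(1 - h)')
for ax in (ax1, ax2):
    ax.set_xlabel('h')
    ax.set_ylabel('cost')
plt.show()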

Such a function can be defined as:

$$\mathrm{Cost}\left(h_{\theta}(x),y\right)=\begin{cases}-\log\left(h_{\theta}(x)\right) & \text{if } y=1\\-\log\left(1-h_{\theta}(x)\right) & \text{if } y=0\end{cases}$$

To simplify this:

$$\mathrm{Cost}\left(h_{\theta}(x),y\right)=-y\log\left(h_{\theta}(x)\right)-(1-y)\log\left(1-h_{\theta}(x)\right)$$

Substituting $y=1$ and $y=0$ separately shows that this single expression matches the two cases discussed above.

From this we can get the new cost function:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right]$$
theta = np.zeros(3)  # theta is a 3x1 column vector, corresponding to x0, x1 and x2

theta

array([0., 0., 0.])
def cost(theta, X, y):
    # this is the cost function J(theta) derived above
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))

# X @ theta is equivalent to X.dot(theta)
cost(theta, X, y)
0.6931471805599458

With θ initialised to zero, every prediction is 0.5, so the cost is −ln(0.5) = ln 2 ≈ 0.693.

5. Gradient descent

The batch gradient descent update rule is:

$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$

Applied to the cost function we just obtained:

$$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

The following is the derivation. Consider that

$$g(z)=\frac{1}{1+e^{-z}},\qquad g'(z)=g(z)\left(1-g(z)\right)$$

so that

$$\frac{\partial}{\partial\theta_j}h_{\theta}(x)=h_{\theta}(x)\left(1-h_{\theta}(x)\right)x_j$$

Therefore:

$$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_{\theta}(x^{(i)})}-\frac{1-y^{(i)}}{1-h_{\theta}(x^{(i)})}\right]\frac{\partial}{\partial\theta_j}h_{\theta}(x^{(i)})=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

  • In vectorized form:

$$\nabla_{\theta}J(\theta)=\frac{1}{m}X^{T}\left(\mathrm{sigmoid}(X\theta)-y\right)$$

def gradient(theta, X, y):
    # batch gradient: computes the whole gradient vector in one vectorized step
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
gradient(theta, X, y)
ones     -0.100000
exam1   -12.009217
exam2   -11.262842
dtype: float64
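
As a quick sanity check of the analytic gradient, it can be compared against a central finite-difference approximation of the cost; a minimal sketch (the helper name numerical_gradient and the step size eps are arbitrary choices here):

def numerical_gradient(theta, X, y, eps=1e-6):
    # central finite differences: (J(theta + eps*e_j) - J(theta - eps*e_j)) / (2*eps)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

print(np.allclose(gradient(theta, X, y), numerical_gradient(theta, X, y), atol=1e-4))  # expect True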

6. Fitting parameters

  • Here I use scipy.optimize.minimize to find the parameters
import scipy.optimize as opt
res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
print(res)

     fun: 0.2034977030343232
     jac: ones    -3.908000e-09
          exam1   -6.926737e-05
          exam2   -4.659433e-05
          dtype: float64
 message: 'Optimization terminated successfully.'
    nfev: 75
    nhev: 0
     nit: 31
    njev: 287
  status: 0
 success: True
       x: ones    -25.158217
          exam1     0.206207
          exam2     0.201446
          dtype: float64

7. Prediction and validation on the training set

def predict(x, theta):
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)
final_theta = res.x
y_pred = predict(X, final_theta)

print(classification_report(y, y_pred))  # compare the true labels with the model's predictions

              precision    recall  f1-score   support

           0       0.87      0.85      0.86        40
           1       0.90      0.92      0.91        60

    accuracy                           0.89       100
   macro avg       0.89      0.88      0.88       100
weighted avg       0.89      0.89      0.89       100
  • Precision: of all samples predicted positive, the fraction that are actually positive.

  • Recall: of all samples that are actually positive, the fraction that were successfully predicted positive.

TP: True Positive, predicted positive and actually positive

TN: True Negative, predicted negative and actually negative

FP: False Positive, predicted positive but actually negative

FN: False Negative, predicted negative but actually positive

F1-score: the harmonic mean of precision and recall:

$$F_1=\frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$

F1 reaches its best value at 1 (perfect precision and recall) and its worst value at 0.
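
A small worked example of these definitions, with hypothetical confusion-matrix counts chosen for illustration (they happen to be close to class 1 in the report above):

# hypothetical counts for one class
tp, fp, fn = 55, 6, 5

precision = tp / (tp + fp)            # 55 / 61 ≈ 0.90
recall = tp / (tp + fn)               # 55 / 60 ≈ 0.92
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)          # roughly matches class 1 in the report above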

8. Finding the decision boundary

From the previous introduction (see also Stats.stackexchange.com/questions/9…) we know that the boundary is where $h_\theta(x)=0.5$, i.e. where

$$\theta^{T}x=\theta_0+\theta_1x_1+\theta_2x_2=0$$
print(res.x) # This is the θ parameter for the final fit
ones    -25.158217
exam1     0.206207
exam2     0.201446
dtype: float64
coef = -(res.x / res.x[2])  # Eliminate the third variable and construct a linear function
print(coef)

x = np.arange(130, step=0.1)
y = coef[0] + coef[1]*x
ones    124.887907
exam1    -1.023631
exam2    -1.000000
dtype: float64
sns.set(context="notebook", style="ticks", font_scale=1.5)

sns.lmplot('exam1', 'exam2', hue='admitted', data=data, 
           height=6, 
           fit_reg=False, 
           scatter_kws={"s": 25}
          )

plt.plot(x, y, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')
plt.show()

This is the end of the first task set

9. Regularized logistic regression

The regularized logistic regression is applied to the second dataset

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
df.head()
test1 test2 accepted
0 0.051267 0.69956 1
1 0.092742 0.68494 1
2 0.213710 0.69225 1
3 0.375000 0.50219 1
4 0.513250 0.46564 1
sns.set(context="notebook", style="ticks", font_scale=1.5)

sns.lmplot('test1', 'test2', hue='accepted', data=df, 
           height=6, 
           fit_reg=False, 
           scatter_kws={"s": 50}
          )

plt.title('Regularized Logistic Regression - Data Preview')
plt.show()

10. Feature Mapping

In order to fit the data better we need more features, but without inventing them out of thin air: new features must be generated from the existing ones. The way to do that is with polynomial terms. Mapping the two features up to the sixth power gives:

$$\mathrm{mapFeature}(x_1,x_2)=\begin{bmatrix}1\\x_1\\x_2\\x_1^2\\x_1x_2\\x_2^2\\x_1^3\\\vdots\\x_1x_2^5\\x_2^6\end{bmatrix}$$
def feature_mapping(x, y, power, as_ndarray=False):
    # return the mapped features as either an ndarray or a dataframe
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
                for i in np.arange(power + 1)
                for p in np.arange(i + 1)}

    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)
x1 = np.array(df.test1)
x2 = np.array(df.test2)
data = feature_mapping(x1, x2, power=6)
print(data.shape)
data.head()

(118, 28)
f00 f10 f01 f20 f11 f02 f30 f21 f12 f03 . f23 f14 f05 f60 f51 f42 f33 f24 f15 f06
0 1.0 0.051267 0.69956 0.002628 0.035864 0.489384 0.000135 0.001839 0.025089 0.342354 . 0.000900 0.012278 0.167542 1.815630 e-08 2.477505 e-07 0.000003 0.000046 0.000629 0.008589 0.117206
1 1.0 0.092742 0.68494 0.008601 0.063523 0.469143 0.000798 0.005891 0.043509 0.321335 . 0.002764 0.020412 0.150752 6.362953 e-07 4.699318 e-06 0.000035 0.000256 0.001893 0.013981 0.103256
2 1.0 0.213710 0.69225 0.045672 0.147941 0.479210 0.009761 0.031616 0.102412 0.331733 . 0.015151 0.049077 0.158970 9.526844 e-05 3.085938 e-04 0.001000 0.003238 0.010488 0.033973 0.110047
3 1.0 0.375000 0.50219 0.140625 0.188321 0.252195 0.052734 0.070620 0.094573 0.126650 . 0.017810 0.023851 0.031940 2.780914 e-03 3.724126 e-03 0.004987 0.006679 0.008944 0.011978 0.016040
4 1.0 0.513250 0.46564 0.263426 0.238990 0.216821 0.135203 0.122661 0.111283 0.100960 . 0.026596 0.024128 0.021890 1.827990 e-02 1.658422 e-02 0.015046 0.013650 0.012384 0.011235 0.010193

5 rows × 28 columns

data.describe()
f00 f10 f01 f20 f11 f02 f30 f21 f12 f03 . f23 f14 f05 f60 f51 f42 f33 f24 f15 f06
count 118.0 118.000000 118.000000 118.000000 118.000000 118.000000 1.180000 e+02 118.000000 118.000000 118.000000 . 118.000000 1.180000 e+02 118.000000 1.180000 e+02 118.000000 1.180000 e+02 118.000000 1.180000 e+02 118.000000 1.180000 e+02
mean 1.0 0.054779 0.183102 0.247575 0.025472 0.301370 5.983333 e-02 0.030682 0.015483 0.142350 . 0.018278 4.089084 e-03 0.115710 7.837118 e-02 0.000703 1.893340 e-02 0.001705 2.259170 e-02 0.006302 1.257256 e-01
std 0.0 0.496654 0.519743 0.248532 0.224075 0.284536 2.746459 e-01 0.134706 0.150143 0.326134 . 0.058513 9.993907 e-02 0.299092 1.938621 e-01 0.058271 3.430092 e-02 0.037443 4.346935 e-02 0.090621 2.964416 e-01
min 1.0 0.830070 0.769740 0.000040 0.484096 0.000026 5.719317 e-01 0.358121 0.483743 0.456071 . 0.142660 4.830370 e-01 0.270222 6.472253 e-14 0.203971 2.577297 e-10 0.113448 2.418097 e-10 0.482684 1.795116 e-14
25% 1.0 0.372120 0.254385 0.043243 0.178209 0.061086 5.155632 e-02 0.023672 0.042980 0.016492 . 0.001400 7.449462 e-03 0.001072 8.086369 e-05 0.006381 1.258285 e-04 0.005749 3.528590 e-04 0.016662 2.298277 e-04
50% 1.0 0.006336 0.213455 0.165397 0.016521 0.252195 2.544062 e-07 0.006603 0.000039 0.009734 . 0.001026 8.972096 e-09 0.000444 4.527344 e-03 0.000004 3.387050 e-03 0.000005 3.921378 e-03 0.000020 1.604015 e-02
75% 1.0 0.478970 0.646562 0.389925 0.100795 0.464189 1.099616 e-01 0.086392 0.079510 0.270310 . 0.021148 2.751341 e-02 0.113020 5.932959 e-02 0.002104 2.090875 e-02 0.001024 2.103622 e-02 0.001289 1.001215 e-01
max 1.0 1.070900 1.108900 1.146827 0.568307 1.229659 1.228137 e+00 0.449251 0.505577 1.363569 . 0.287323 4.012965 e-01 1.676725 1.508320 e+00 0.250577 2.018260 e-01 0.183548 2.556084 e-01 0.436209 1.859321 e+00

8 rows × 28 columns

11. Regularized Cost function

When a linear function cannot classify the data properly, a nonlinear function is used instead. But the resulting curve can bend too much, leading to overfitting, or not bend enough, which is underfitting. To prevent overfitting, the key is to shrink the parameters of the high-order features so that those terms carry less weight in the hypothesis function; in other words, to penalize them.

For example, for a hypothesis function like this:

$$h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_1^2+\theta_3x_1^3+\theta_4x_1^4$$

we can regularize it like this:

$$\min_{\theta}\;\frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2+1000\,\theta_3^2+10000\,\theta_4^2$$

Here $\theta_3$ and $\theta_4$ receive extra penalties to reduce their influence. But for a general problem we do not know in advance which parameters to punish, so we let the algorithm decide: we simply sum the squares of all the parameters and add a coefficient $\lambda$ to constrain them; the choice of $\lambda$ is of course also a matter of experience. This gives the general form of the regularized cost function:

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)

y = get_y(df)
print(y.shape)
(118, 28)
(118,)
def regularized_cost(theta, X, y, l=1):
    # regularized cost function; theta_0 (the bias term x0) is not regularized
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()

    return cost(theta, X, y) + regularized_term
regularized_cost(theta, X, y, l=1)
0.6931471805599454

Because θ is initialised to 0, the regularized cost function has the same value as the unregularized cost function (the penalty term is zero).

12. Regularized Gradient


The gradient of the regularized cost function is

$$\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\qquad(j\ge 1)$$

Since $\theta_0$ is not regularized, minimizing this cost function with gradient descent splits the update into two cases:

$$\begin{aligned}\theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\\theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\qquad j=1,2,\dots,n\end{aligned}$$

For $j=1,2,\dots,n$ the update formula can be rearranged into:

$$\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

It can be seen that the change introduced by the regularization term is that, in each iteration, $\theta_j$ is shrunk a little before the usual gradient step.

def regularized_gradient(theta, X, y, l=1):
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n

    # by doing this, no offset is on theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])

    return gradient(theta, X, y) + regularized_term
regularized_gradient(theta, X, y)

array([8.47457627e-03, 1.87880932e-02, 7.77711864e-05, 5.03446395e-02,
       1.15013308e-02, 3.76648474e-02, 1.83559872e-02, 7.32393391e-03,
       8.19244468e-03, 2.34764889e-02, 3.93486234e-02, 2.23923907e-03,
       1.28600503e-02, 3.09593720e-03, 3.93028171e-02, 1.99707467e-02,
       4.32983232e-03, 3.38643902e-03, 5.83822078e-03, 4.47629067e-03,
       3.10079849e-02, 3.10312442e-02, 1.09740238e-03, 6.31570797e-03,
       4.08503006e-04, 7.26504316e-03, 1.37646175e-03, 3.87936363e-02])
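
To double-check the rearranged update formula from the derivation above, a small sketch (the learning rate alpha and the random test vector theta_test are arbitrary choices here) verifying that shrinking $\theta_j$ by the factor $(1-\alpha\lambda/m)$ and then taking an unregularized gradient step is the same as one step along the regularized gradient:

alpha, l = 0.1, 1                                # assumed learning rate and lambda, for illustration only
m = len(X)
theta_test = np.random.rand(X.shape[1]) * 0.1    # an arbitrary non-zero parameter vector

plain_grad = gradient(theta_test, X, y)
step_a = theta_test - alpha * regularized_gradient(theta_test, X, y, l)
step_b = theta_test.copy()
step_b[0] = theta_test[0] - alpha * plain_grad[0]                           # theta_0 is not shrunk
step_b[1:] = theta_test[1:] * (1 - alpha * l / m) - alpha * plain_grad[1:]  # shrink, then gradient step
print(np.allclose(step_a, step_b))               # True: the two forms of the update are identical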

13. Regularized fitting parameters

import scipy.optimize as opt
print('init cost = {}'.format(regularized_cost(theta, X, y)))

res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
res
init cost = 0.6931471805599454

     fun: 0.5290027297127226
     jac: array([-5.59000348e-08, -1.28838122e-08, -6.36410934e-08,  1.48052446e-08,
       -2.17946454e-09, -3.76148722e-08,  9.78876709e-09, -1.82798254e-08,
        1.13128886e-08, -1.07496536e-08,  9.72446799e-09, -3.07080137e-09,
       -8.99944467e-09,  3.95483551e-09, -2.61273742e-08,  4.27780929e-10,
       -1.11055205e-08, -6.79817860e-10, -5.00207423e-09,  2.66918207e-09,
       -1.42573657e-08,  2.66682830e-09, -3.70874575e-09, -1.41882519e-10,
       -1.24101649e-09, -1.53332708e-09,  3.89033012e-10, -2.18628962e-08])
 message: 'Optimization terminated successfully.'
    nfev: 7
    nhev: 0
     nit: 6
    njev: 68
  status: 0
 success: True
       x: array([ 1.27273933,  0.62527083,  1.18108774, -2.01995945, -0.91742379,
       -1.43166442,  0.12400731, -0.36553516, -0.35723847, -0.17512854,
       -1.45815594, -0.05098912, -0.61555563, -0.27470594, -1.19281681,
       -0.24218847, -0.20600683, -0.04473089, -0.27778458, -0.29537795,
       -0.45635707, -1.04320269,  0.02777149, -0.29243126,  0.01556672,
       -0.32737949,  0.14388703,  0.92465318])

14. Regularized prediction

final_theta = res.x
y_pred = predict(X, final_theta)

print(classification_report(y, y_pred))
              precision    recall  f1-score   support

           0       0.90      0.75      0.82        60
           1       0.78      0.91      0.84        58

    accuracy                           0.83       118
   macro avg       0.84      0.83      0.83       118
weighted avg       0.84      0.83      0.83       118

15. Using different values of λ (λ is a constant)

  • Draw decision boundaries

We find all the grid points satisfying $|\theta^{T}x|<\varepsilon$ (the inner product cannot be exactly 0, so a small threshold is used) and plot them as the boundary.

def feature_mapped_logistic_regression(power, l):
    """
    power: int, generate polynomial features from x1 and x2 up to this degree
    l: int, the regularization constant lambda
    """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)

    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])

    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x

    return final_theta
def find_decision_boundary(density, power, theta, threshhold):
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)

    cordinates = [(x, y) for x in t1 for y in t2]
    x_cord, y_cord = zip(*cordinates)  # zip(*) unpacks the list of pairs
    # e.g. zipping [1, 2, 3] with [4, 5, 6] gives [(1, 4), (2, 5), (3, 6)]; zip(*) turns it back into (1, 2, 3) and (4, 5, 6)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a dataframe

    inner_product = mapped_cord.values @ theta

    decision = mapped_cord[np.abs(inner_product) < threshhold]  # cannot be exactly 0, so use a small threshold

    return decision.f10, decision.f01
# find the points lying on the decision boundary
def draw_boundary(power, l):
    """
    power: degree of the mapped features
    l: the regularization constant lambda
    """
    density = 1000
    threshhold = 2 * 10**-3

    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshhold)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    sns.lmplot('test1', 'test2', hue='accepted', data=df, height=6, fit_reg=False, scatter_kws={"s": 100})

    plt.scatter(x, y, c='r', s=10)
    plt.title('Decision boundary')
    plt.show()
draw_boundary(power=6, l=1) #lambda=1

draw_boundary(power=6, l=0)  # lambda = 0: no regularization, overfitting

draw_boundary(power=6, l=100)  # lambda = 100: too much regularization, underfitting
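
To put rough numbers on the overfitting/underfitting comparison, a short sketch (reusing the helpers above; sklearn's accuracy_score is used here, and the exact accuracies are not from the original post) that reports training-set accuracy for the three values of λ:

from sklearn.metrics import accuracy_score

for l in (0, 1, 100):
    final_theta = feature_mapped_logistic_regression(power=6, l=l)
    y_pred = predict(X, final_theta)   # X and y are the power-6 mapped features and labels from above
    print('lambda = {:>3}: train accuracy = {:.2f}'.format(l, accuracy_score(y, y_pred)))

The ordering of the three accuracies mirrors the boundaries above: the unregularized model fits the training set most closely, while λ = 100 underfits.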