
Logistic regression

Introduction

First, an intuitive understanding: logistic regression is a classification model. Second, it is a linear model whose parameter space is the same as that of linear regression; all of the learned information is contained in w and b.

In other words, logistic regression adds another mapping function f on top of the linear regression output. The sigmoid function is generally used for this mapping. The formulas are as follows:


f(x_{11} w_1 + x_{12} w_2 + x_{13} w_3 + \dots + x_{1j} w_j + \dots + x_{1m} w_m + b) = \hat y_1

f(x) = \frac{1}{1 + e^{-x}}

In a word: logistic regression assumes that the data follow a Bernoulli distribution, and solves for the parameters by maximizing the likelihood function with gradient descent, thereby achieving binary classification. Specifically:


p(y_i = 1 \mid x_i; w, b) = f(x_i w + b) = \frac{1}{1 + e^{-(x_i w + b)}}


p(y_i = 0 \mid x_i; w, b) = 1 - f(x_i w + b) = \frac{e^{-(x_i w + b)}}{1 + e^{-(x_i w + b)}}
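
As a quick numeric check (a minimal sketch with made-up values, not from any real data), the two probabilities are complementary for any sample:

import numpy as np

w = np.array([1.0, -2.0])    # hypothetical weights
b = 0.5                      # hypothetical bias
x_i = np.array([0.3, 0.8])   # one hypothetical sample

z = x_i @ w + b
p1 = 1 / (1 + np.exp(-z))            # p(y_i = 1 | x_i; w, b)
p0 = np.exp(-z) / (1 + np.exp(-z))   # p(y_i = 0 | x_i; w, b)
print(p1, p0, p1 + p0)               # p1 + p0 is always 1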

Loss function

Calculate the likelihood function for all samples:


L(w, b) = \prod_{i=1}^{n} [f(x_i w + b)]^{y_i} [1 - f(x_i w + b)]^{1 - y_i}

Logarithmic likelihood function:


\log L(w, b) = \log \prod_{i=1}^{n} [f(x_i w + b)]^{y_i} [1 - f(x_i w + b)]^{1 - y_i}

We want to maximize the likelihood function, so the loss function (cost function) can be defined as the negative log-likelihood (that is, the log-likelihood multiplied by a minus sign):


J(w, b) = -\log L(w, b) = -\sum_{i=1}^n \left\{ y_i \log [f(x_i w + b)] + (1 - y_i) \log [1 - f(x_i w + b)] \right\}

Calculate f'(x) in advance:


\begin{aligned}
f'(x) &= \left( \frac{1}{1 + e^{-x}} \right)' \\
&= -\frac{1}{(1 + e^{-x})^2} (1 + e^{-x})' \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= f(x)\,[1 - f(x)]
\end{aligned}
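
This identity is easy to verify numerically (a minimal sketch comparing against a central finite difference):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x, h = 0.7, 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # finite-difference derivative
analytic = sigmoid(x) * (1 - sigmoid(x))               # f(x) * (1 - f(x))
print(numeric, analytic)                               # the two values agree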

The partial derivative with respect to w_j:

\begin{aligned}
\frac{\partial J(w,b)}{\partial w_j} &= \frac{\partial}{\partial w_j} \left\{ -\sum_{i=1}^n y_i \log [f(x_i w + b)] + (1-y_i) \log [1-f(x_i w + b)] \right\} \\
&= -\sum_{i=1}^n \left\{ y_i \frac{1}{f(x_i w + b)} - (1-y_i) \frac{1}{1-f(x_i w + b)} \right\} \frac{\partial f(x_i w + b)}{\partial w_j} \\
&= -\sum_{i=1}^n \left\{ y_i \frac{1}{f(x_i w + b)} - (1-y_i) \frac{1}{1-f(x_i w + b)} \right\} f(x_i w + b)\,[1-f(x_i w + b)] \frac{\partial (x_i w + b)}{\partial w_j} \\
&= -\sum_{i=1}^n \left\{ y_i [1-f(x_i w + b)] - (1-y_i) f(x_i w + b) \right\} \frac{\partial (x_i w + b)}{\partial w_j} \\
&= -\sum_{i=1}^n \left\{ y_i [1-f(x_i w + b)] - (1-y_i) f(x_i w + b) \right\} x_{ij} \\
&= -\sum_{i=1}^n \left\{ y_i - f(x_i w + b) \right\} x_{ij} \\
&= \sum_{i=1}^n \left\{ f(x_i w + b) - y_i \right\} x_{ij} \\
&= \sum_{i=1}^n \left\{ \hat y_i - y_i \right\} x_{ij}
\end{aligned}

Similarly, take the partial derivative with respect to b:

\frac{\partial J(w,b)}{\partial b} = \sum_{i=1}^n [f(x_i w + b) - y_i] = \sum_{i=1}^n (\hat y_i - y_i)
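
In vectorized NumPy form, the loss and both gradients can be written compactly (a minimal sketch with made-up toy data; X has one row per sample):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0.5, 1.2], [1.5, 0.3], [0.2, 0.8]])  # toy design matrix: n=3 samples, m=2 features
y = np.array([1, 0, 1])                             # toy labels
w = np.zeros(2)
b = 0.0

y_hat = sigmoid(X @ w + b)                                    # f(x_i w + b) for every sample
J = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # J(w, b)
grad_w = X.T @ (y_hat - y)                                    # dJ/dw_j = sum_i (y_hat_i - y_i) x_ij
grad_b = np.sum(y_hat - y)                                    # dJ/db   = sum_i (y_hat_i - y_i)
print(J, grad_w, grad_b)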

w and b can then be solved iteratively by gradient descent with learning rate α:


w_j = w_j - \alpha \sum_{i=1}^{n} (f(x_i w + b) - y_i)\, x_{ij}


b = b - \alpha \sum_{i=1}^{n} (f(x_i w + b) - y_i)

These two update rules are also the key to implementing the algorithm by hand.

Example

Pure Python implementation

For now, only the first two classes of the iris dataset are used, so the task is binary classification.

# For now, implement binary classification only
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

def get_train_test():
    iris = load_iris()
    index = list(iris.target).index(2)  # keep only class 0 and class 1
    X = iris.data[:index]
    y = iris.target[:index]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    return X_train, y_train, X_test, y_test


def sigmoid(x):
    return 1 / (1 + np.exp(-x))
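

# The original post uses a LogisticRegression class below without showing its
# definition; the following is a minimal sketch of such a class, assuming plain
# batch gradient descent with the update rules derived above:
# w_j -= alpha * sum_i (f(x_i w + b) - y_i) x_ij, and likewise for b.
class LogisticRegression:
    def __init__(self, lr=0.001, epoch=1000):
        self.lr = lr        # learning rate alpha
        self.epoch = epoch  # number of gradient-descent iterations

    def fit(self, X, y):
        n, m = X.shape
        self.w = np.zeros(m)
        self.b = 0.0
        for _ in range(self.epoch):
            y_hat = sigmoid(X @ self.w + self.b)  # predicted probabilities
            error = y_hat - y                     # (y_hat_i - y_i)
            self.w -= self.lr * X.T @ error       # gradient step for w
            self.b -= self.lr * error.sum()       # gradient step for b
        return self

    def predict(self, X):
        # returns probabilities; threshold at 0.5 outside to get class labels
        return sigmoid(X @ self.w + self.b)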


lr = LogisticRegression()
X_train, y_train, X_test, y_test = get_train_test()

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

# predictions are probabilities, so threshold at 0.5 and compare with the labels
print(y_test == (predictions > 0.5))




sklearn

from sklearn.datasets import load_iris
iris = load_iris()


X = iris.data
Y = iris.target
# Divide the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# Import the model and call LogisticRegression()
from sklearn.linear_model import LogisticRegression
# lr = LogisticRegression(penalty='l2', solver='newton-cg', multi_class='multinomial')
lr = LogisticRegression()
lr.fit(x_train, y_train)

# Evaluate the model
print('Logistic regression training set accuracy: %.3f' % lr.score(x_train, y_train))
print('Logistic regression test set accuracy: %.3f' % lr.score(x_test, y_test))
from sklearn import metrics
pred = lr.predict(x_test)
accuracy = metrics.accuracy_score(y_test, pred)
print('Accuracy of logistic regression model: %.3f' % accuracy)


Maximum entropy model

First, an intuitive understanding of the maximum entropy model: it is a model that maximizes entropy. The idea is that, apart from the constraints that have already been given, the model should remain as uncertain, or as random, as possible. Expressing this randomness mathematically leads to the maximum entropy model.
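
As a quick illustration of "as random as possible" (a minimal sketch, not from the original post): with no constraints at all, the uniform distribution has the highest entropy.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))  # small epsilon avoids log(0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: highest entropy (~1.386)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # skewed: lower entropy (~0.940)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic: entropy ~0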

The core update formula used in the code:


\delta_i = \frac{1}{f^*(x, y)} \log \frac{E_{\hat P}(f_i)}{E_P(f_i)}

The parameter w_i can then be obtained by iterating:


w_i = w_i + \delta_i

Python implementation

import numpy as np

class MaxEntropy(object):
    def __init__(self, lr=0.01, epoch=1000):
        self.lr = lr   # learning rate
        self.N = None  # Number of data samples
        self.n = None  # Number of distinct (x, y) feature pairs
        self.hat_Ep = None
        # self.sampleXY = []
        self.labels = None
        self.xy_couple = {}
        self.xy_id = {}
        self.id_xy = {}
        self.epoch = epoch
    def _rebuild_X(self,X) :
        X_result = []
        for x in X:
            print(x,self.X_columns)
            X_result.append([y_s + '_' + x_s for x_s, y_s in zip(x, self.X_columns)])
        return X_result
        
    def build_data(self,X,y,X_columns) :
        self.X_columns = X_columns
        self.y = y

        
        self.X = self._rebuild_X(X)
        self.N = len(X)
        self.labels = set(y)
        for x_i,y_i in zip(self.X,y):
            for f in  x_i:
              
                self.xy_couple[(f,y_i)]  = self.xy_couple.get((f,y_i),0) + 1

                
        self.n = len(self.xy_couple.items())

    def fit(self,X,y,X_columns) :
        self.build_data(X,y,X_columns)
        self.w = [0] * self.n
        for _ in range(self.epoch):
            for i in range(self.n):
                # Original update: self.w[i] += 1/self.n * np.log(self.get_hat_Ep(i) / self.get_Ep(i))
                self.w[i] += self.lr * np.log(self.get_hat_Ep(i) / self.get_Ep(i))  # multiply by 1/self.n, or use a small learning rate
                # print(_,np.log(self.get_hat_Ep(i) / self.get_Ep(i) ) )
    
    def predict(self,X) :
        print(X)
        X = self._rebuild_X(X)
        print(X)
        
        result = [{} for _ in range(len(X))] 
        for i,x_i in enumerate (X):
            for y in self.labels:
                # print(x_i)
                result[i][y] = self.get_Pyx(x_i,y)
        return result
       


    def get_hat_Ep(self,index) :
        
        self.hat_Ep = [0]*(self.n)
        for i,xy in enumerate(self.xy_couple):
            self.hat_Ep[i] = self.xy_couple[xy] / self.N
            self.xy_id[xy] = i
            self.id_xy[i] = xy
        return self.hat_Ep[index]




    def get_Zx(self,x_i) :
        Zx = 0
        for y in self.labels:
            count = 0
            for f in x_i :
                if (f,y) in self.xy_couple:
                    count += self.w[self.xy_id[(f,y)]]
            Zx +=  np.exp(count)
        return  Zx
    def get_Pyx(self,x_i,y) :
        
        count = 0
        for f in x_i :
            if (f,y) in self.xy_couple:
                count += self.w[self.xy_id[(f,y)]]
           

        return np.exp(count) / self.get_Zx(x_i)

    def get_Ep(self,index) :
        f,y = self.id_xy[index]
        # print(f,y)
        ans = 0
        # print(self.X)
        for x_i in self.X:
            if f not in x_i:
                continue
            pyx = self.get_Pyx(x_i,y)
            ans += pyx / self.N
            # print("ans",ans,pyx)
        return ans
        
data_set = [['youth', 'no', 'no', '1', 'refuse'],
            ['youth', 'no', 'no', '2', 'refuse'],
            ['youth', 'yes', 'no', '2', 'agree'],
            ['youth', 'yes', 'yes', '1', 'agree'],
            ['youth', 'no', 'no', '1', 'refuse'],
            ['mid', 'no', 'no', '1', 'refuse'],
            ['mid', 'no', 'no', '2', 'refuse'],
            ['mid', 'yes', 'yes', '2', 'agree'],
            ['mid', 'no', 'yes', '3', 'agree'],
            ['mid', 'no', 'yes', '3', 'agree'],
            ['elder', 'no', 'yes', '3', 'agree'],
            ['elder', 'no', 'yes', '2', 'agree'],
            ['elder', 'yes', 'no', '2', 'agree'],
            ['elder', 'yes', 'no', '3', 'agree'],
            ['elder', 'no', 'no', '1', 'refuse'],
            ]
columns = ['age', 'working', 'house', 'credit_situation', 'labels']
X_columns = columns[:-1]

X = [i[:-1] for i in data_set]
Y = [i[-1] for i in data_set]

train_X = X[:12]
test_X = X[12:]
train_Y = Y[:12]
test_Y = Y[12:]



mae = MaxEntropy()
mae.fit(train_X,train_Y,X_columns)

mae.predict(test_X)


    
        
