Author | Angel Das, Compiled by | VK, Source | Towards Data Science

Introduction

Artificial neural networks (ANNs), an advanced form of machine learning, are the core of deep learning. Artificial neural networks involve the following concepts: the input and output layers, hidden layers, the neurons within the hidden layers, forward propagation, and back propagation.

Simply put, the input layer is the set of independent variables, the output layer represents the final output (the dependent variable), and the hidden layers consist of neurons where equations and activation functions are applied. Forward propagation works through these equations to obtain the final output, while back propagation uses gradient descent to update the parameters accordingly. For more information about the process, see the article below.

Towardsdatascience.com/introductio…
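To make the forward pass described above concrete, here is a minimal NumPy sketch with one hidden layer. The layer sizes, weights, and activations are purely illustrative, not taken from the article.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy example: 3 input features, 4 hidden neurons, 1 output
x = np.array([0.5, -1.2, 3.0])               # input layer (independent variables)
W1, b1 = np.random.randn(4, 3), np.zeros(4)  # hidden layer weights and biases
W2, b2 = np.random.randn(1, 4), np.zeros(1)  # output layer weights and biases

h = np.tanh(W1 @ x + b1)                     # hidden layer: linear equation + activation
y_hat = sigmoid(W2 @ h + b2)                 # output layer: final prediction
print(y_hat)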

Deep neural network

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). A DNN has many weights and bias terms, each of which requires training. Backpropagation determines how to adjust every weight and every bias term across all neurons to reduce the error. The process repeats until the network converges to a minimum error.

The algorithm steps are as follows:

  • Get training and test data to train and validate the output of the model. All statistical assumptions involving correlation and outlier handling still apply and must be addressed.

  • The input layer consists of the independent variables and their respective values. The training set is divided into multiple batches; one full pass through the complete training set is called an epoch. The more epochs, the longer the training takes

  • Each batch is passed to the input layer, which sends it to the first hidden layer. The outputs of all neurons in that layer are computed (for each mini-batch). The results are passed to the next layer, and the process is repeated until we get the output of the last layer, the output layer. This is forward propagation: just like making predictions, except that all intermediate results are retained because they are needed for back propagation

  • The output error of the network is then measured using a loss function that compares the expected output to the actual output of the network

  • The contribution of each parameter to the error term is calculated

  • The algorithm performs gradient descent to adjust weights and parameters based on the learning rate (back propagation), and the process is repeated

It is important to randomly initialize the weights of all hidden layers, otherwise the training will fail.

For example, if all weights and biases are initialized to zero, all neurons in a given layer will be exactly the same, so backpropagation will affect them in exactly the same way and they will remain identical. In other words, although there are hundreds of neurons per layer, your model will behave as if there were only one neuron per layer: it won’t be very smart. Conversely, if you randomly initialize the weights, you break the symmetry, allowing back propagation to train different neurons
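A quick illustrative check of this symmetry problem (a toy sketch, with made-up sizes): with zero-initialized weights, every neuron in a layer computes the same output and receives the same gradient, so the neurons never become different from one another.

import numpy as np

x = np.array([1.0, 2.0])        # a single training example with 2 features
W = np.zeros((3, 2))            # 3 hidden neurons, all weights initialized to zero
b = np.zeros(3)

h = np.tanh(W @ x + b)          # every neuron produces the identical output
print(h)                        # [0. 0. 0.] -- the neurons are indistinguishable

# the gradient with respect to each row of W is also identical, so after any number
# of backpropagation steps the rows of W remain equal to one another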

The activation function

The activation function is key to gradient descent. Gradient descent cannot make progress on a flat surface, so it is important to have a well-defined, non-zero derivative that allows gradient descent to progress at each step. Sigmoid is commonly used for logistic regression problems, but there are other popular options.

Hyperbolic tangent function

This function is s-shaped and continuous, with an output range between -1 and +1. At the beginning of training, the output of each layer is more or less zero centered, thus facilitating faster convergence.

Rectified linear unit (ReLU)

It is not differentiable at 0, and its derivative is 0 for negative inputs. Elsewhere it produces good output and, more importantly, is faster to compute. Because the function has no maximum output value, some of the problems that might occur during gradient descent are handled well.
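For reference, a minimal sketch of the three activations mentioned above (definitions only, for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # output in (0, 1), used for logistic-style outputs

def tanh(z):
    return np.tanh(z)                # output in (-1, +1), roughly zero-centered

def relu(z):
    return np.maximum(0, z)          # 0 for negative inputs, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))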

Why do we need activation functions?

Suppose f(x) = 2x + 5 and g(x) = 3x − 1, two functions with different weights on their inputs. When we chain these functions, we get f(g(x)) = 2(3x − 1) + 5 = 6x + 3, which is again a linear equation. Without nonlinearity, a deep neural network is equivalent to a single linear equation and cannot handle a complex problem space.
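The same point in code (a small illustrative snippet): composing two linear functions yields another linear function, so stacking layers without a nonlinearity adds no expressive power.

def f(x):
    return 2 * x + 5

def g(x):
    return 3 * x - 1

for x in [0, 1, 2]:
    print(f(g(x)), 6 * x + 3)   # identical: f(g(x)) collapses to the single line 6x + 3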

Loss function

When dealing with regression, we do not need any activation function on the output layer. The loss function typically used for training regression problems is the mean squared error. However, outliers in the training set are better handled with the mean absolute error. Huber loss is another widely used error function in regression-based tasks.

When the error is less than a threshold t (typically 1), the Huber loss is quadratic, but when the error is greater than t, it is linear. The linear part makes it less sensitive to outliers than the mean squared error, while the quadratic part converges faster and gives more precise values than the mean absolute error.
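A small NumPy sketch of the Huber loss described above (the threshold t, called delta in Keras, is assumed to be 1 here); in practice tf.keras.losses.Huber can be used directly.

import numpy as np

def huber_loss(y_true, y_pred, t=1.0):
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2                  # used when |error| <= t
    linear = t * np.abs(error) - 0.5 * t ** 2     # used when |error| > t
    return np.where(np.abs(error) <= t, quadratic, linear).mean()

print(huber_loss(np.array([3.0, 0.0]), np.array([2.5, 4.0])))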

Binary cross entropy, categorical cross entropy, or sparse categorical cross entropy are usually used in classification problems. Binary cross entropy is used for binary classification, while categorical or sparse categorical cross entropy is used for multi-class classification problems. You can find more details about the loss functions at the link below.

Note: Categorical cross entropy is used when the dependent variable is one-hot encoded, and sparse categorical cross entropy is used when the labels are provided as integers.

keras.io/api/losses/
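A small illustrative example of the note above, using the Keras loss functions with made-up values for a 3-class problem:

import numpy as np
import tensorflow as tf

y_pred = np.array([[0.7, 0.2, 0.1]])                 # predicted class probabilities

# one-hot labels -> categorical cross entropy
y_onehot = np.array([[1.0, 0.0, 0.0]])
print(tf.keras.losses.categorical_crossentropy(y_onehot, y_pred).numpy())

# integer labels -> sparse categorical cross entropy (same result)
y_int = np.array([0])
print(tf.keras.losses.sparse_categorical_crossentropy(y_int, y_pred).numpy())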

Developing an ANN in Python

We will use Kaggle’s credit data to develop a fraud detection model using Jupyter Notebook. The same approach can be implemented in Google Colab.

The data set contains transactions made by European cardholders via credit cards in September 2013. This data set shows transactions that occurred within two days, with 492 frauds out of 284,807 transactions. The data set is highly unbalanced, with positive classes (fraud) accounting for 0.172% of all transactions.

www.kaggle.com/mlg-ulb/cre…

import tensorflow as tf
print(tf.__version__)

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
import tensorflow as tf

from sklearn import preprocessing

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc

import matplotlib.pyplot as plt
from tensorflow.keras import optimizers

import seaborn as sns

from tensorflow import keras

import random as rn

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
os.environ["PYTHONHASHSEED"] = "0"   # fix the hash seed for reproducibility

tf.random.set_seed(1234)
np.random.seed(1234)
rn.seed(1254)

The dataset consists of the following attributes: Time, the principal components (V1–V28), Amount, and Class. More information can be found on Kaggle.

raw_df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')
raw_df.head()

Since most attributes are principal components, their correlations are always 0. The only column where outliers can occur is the Amount column. Here’s a quick look at its statistics.
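The summary below can be reproduced with a pandas one-liner on the dataframe loaded above:

raw_df['Amount'].describe()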

count    284807.00
mean         88.35
std         250.12
min           0.00
25%           5.60
50%          22.00
75%          77.16
max       25691.16
Name: Amount, dtype: float64

Outliers are crucial for detecting fraud because the underlying assumption is that higher transaction amounts can be a sign of fraudulent activity. However, the boxplot does not reveal any specific trend supporting the above hypothesis.
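The boxplot referred to here can be drawn, for example, with seaborn (already imported earlier); the exact plot is not shown in the text, so this is an illustrative version:

plt.figure(figsize=(8, 3))
sns.boxplot(x=raw_df['Amount'])
plt.title('Distribution of transaction Amount')
plt.show()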

Prepare the input/output and train/test data

X_data = raw_df.iloc[:, :-1]    # all columns except the last (the independent variables)

y_data = raw_df.iloc[:, -1]     # the last column, Class (the dependent variable)

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size = 0.2, random_state = 7)

X_train = preprocessing.normalize(X_train)

The Amount variable and the principal-component variables use different scales, so the dataset is scaled. Scaling plays an important role in gradient descent: scaled data converges much faster.
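Note that preprocessing.normalize performs row-wise (L2) scaling; if column-wise standardization is what is intended here, a StandardScaler fit on the training data is a common alternative (an illustrative sketch, not the code used above):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on the training data only
X_test = scaler.transform(X_test)         # reuse the same scaling for the test data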

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output:

(227845, 29)   # number of records x number of columns
(56962, 29)
(227845,)
(56962,)

Develop the neural network layer

The output above shows that we have 29 input features to work with, so the shape of the input layer is 29. The general structure of any artificial neural network architecture is outlined below.

+----------------------------+----------------------------+
|      Hyper Parameter       |   Binary Classification    |
+----------------------------+----------------------------+
| # input neurons            | One per input feature      |
| # hidden layers            | Typically 1 to 5           |
| # neurons per hidden layer | Typically 10 to 100        |
| # output neurons           | 1 per prediction dimension |
| Hidden activation          | ReLU, Tanh, sigmoid        |
| Output layer activation    | Sigmoid                    |
| Loss function              | Binary Cross Entropy       |
+----------------------------+----------------------------+

+----------------------------+----------------------------------+
|      Hyper Parameter       |    Multiclass Classification     |
+----------------------------+----------------------------------+
| # input neurons            | One per input feature            |
| # hidden layers            | Typically 1 to 5                 |
| # neurons per hidden layer | Typically 10 to 100              |
| # output neurons           | 1 per class                      |
| Hidden activation          | ReLU, Tanh, sigmoid              |
| Output layer activation    | Softmax                          |
| Loss function              | Categorical Cross Entropy /      |
|                            | Sparse Categorical Cross Entropy |
+----------------------------+----------------------------------+
The inputs to the Dense function:
  1. units – the output dimension
  2. activation – the activation function; if not specified, a linear (identity) activation is used
  3. use_bias – Boolean, whether the layer uses a bias vector
  4. kernel_initializer – initializer for the kernel weights matrix
  5. bias_initializer – initializer for the bias vector
model = Sequential(layers=None, name=None)
model.add(Dense(10, input_shape = (29,), activation = 'tanh'))   # hidden layer 1
model.add(Dense(5, activation = 'tanh'))                         # hidden layer 2
model.add(Dense(1, activation = 'sigmoid'))                      # output layer

adam = optimizers.Adam(learning_rate = 0.001)

model.compile(optimizer = adam, loss = 'binary_crossentropy', metrics=['accuracy'])

Architecture Summary

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 10)                300       
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
=================================================================
Total params: 361
Trainable params: 361
Non-trainable params: 0
_________________________________________________________________

Let’s try to understand the output above (the output specification is provided for the two hidden layers):

  1. We created a neural network with one input layer, two hidden layers, and one output layer

  2. The first hidden layer has 10 neurons and receives the 29 input variables. So its weight matrix has shape 10 x 29 and its bias vector has shape 10 x 1

  3. Total number of layer 1 parameters = 10 x 29 + 10 x 1 = 300

  4. The first layer produces 10 output values, using tanh as the activation function. The second layer has 5 neurons and 10 inputs, so its weight matrix is 5 x 10 and its bias vector is 5 x 1

  5. Layer 2 total parameters = 5 x 10 + 5 x 1 = 55

  6. Finally, the output layer has one neuron, but it has 5 inputs from hidden layer 2 plus a bias term, so its number of parameters = 5 + 1 = 6
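The arithmetic in the list above can be double-checked directly from the layer shapes (a quick illustrative snippet):

layer_shapes = [(29, 10), (10, 5), (5, 1)]                    # (inputs, neurons) per Dense layer
params_per_layer = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
print(params_per_layer, sum(params_per_layer))                # [300, 55, 6] 361 -- matches model.summary()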

model.fit(X_train, y_train.values, batch_size = 2000, epochs = 20, verbose = 1)
Epoch 1/20
114/114 [==============================] - 0s 2ms/step - loss: 0.3434 - accuracy: 0.9847
Epoch 2/20
114/114 [==============================] - 0s 2ms/step - loss: 0.1029 - accuracy: 0.9981
Epoch 3/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0518 - accuracy: 0.9983
Epoch 4/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0341 - accuracy: 0.9986
Epoch 5/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0255 - accuracy: 0.9987
Epoch 6/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0206 - accuracy: 0.9988
Epoch 7/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0174 - accuracy: 0.9988
Epoch 8/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0152 - accuracy: 0.9988
Epoch 9/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0137 - accuracy: 0.9989
Epoch 10/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0125 - accuracy: 0.9989
Epoch 11/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0117 - accuracy: 0.9989
Epoch 12/20
114/114 [==============================] - 0s 2ms/step - loss: 0.0110 - accuracy: 0.9989
Epoch 13/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0104 - accuracy: 0.9989
Epoch 14/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0099 - accuracy: 0.9989
Epoch 15/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0095 - accuracy: 0.9989
Epoch 16/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0092 - accuracy: 0.9989
Epoch 17/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0089 - accuracy: 0.9989
Epoch 18/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0087 - accuracy: 0.9989
Epoch 19/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0084 - accuracy: 0.9989
Epoch 20/20
114/114 [==============================] - 0s 1ms/step - loss: 0.0082 - accuracy: 0.9989

Evaluation of the output

X_test = preprocessing.normalize(X_test)

results = model.evaluate(X_test, y_test.values)

1781/1781 [==============================] - 1s 614us/step - loss: 0.0086 - accuracy: 0.9989
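Accuracy alone is misleading on such an imbalanced dataset, so it is worth inspecting the confusion matrix and precision/recall with the metrics imported earlier (an illustrative sketch; a 0.5 decision threshold is assumed):

y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()   # convert probabilities to class labels

print(confusion_matrix(y_test.values, y_pred))
print('Precision:', precision_score(y_test.values, y_pred))
print('Recall   :', recall_score(y_test.values, y_pred))
print('F1 score :', f1_score(y_test.values, y_pred))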

Use TensorBoard to analyze the learning curves

TensorBoard is a great interactive visualization tool for viewing learning curves during training, comparing learning curves of multiple runs, analyzing training metrics, and more. This tool is installed automatically with TensorFlow.

import os
import time

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)
 
run_logdir = get_run_logdir()

tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)

model.fit(X_train, y_train.values, batch_size = 2000, epochs = 20, verbose = 1, callbacks=[tensorboard_cb])

%load_ext tensorboard
%tensorboard --logdir=./my_logs --port=6006

Hyperparameter tuning

As mentioned earlier, there are no predefined rules for how many hidden layers or how many neurons are best for a given problem space. We can use RandomizedSearchCV or GridSearchCV to fine-tune some parameters. The tunable parameters are summarized as follows:

  • Number of hidden layers

  • Neurons per hidden layer

  • The optimizer

  • Learning rate

  • Epochs

Declare a function to build the model

def build_model(n_hidden_layer=1, n_neurons=10, input_shape=29):

    # create model
    model = Sequential()
    model.add(Dense(10, input_shape = (input_shape,), activation = 'tanh'))
    for layer in range(n_hidden_layer):
        model.add(Dense(n_neurons, activation="tanh"))
    model.add(Dense(1, activation = 'sigmoid'))

    # compile model
    model.compile(optimizer ='Adam', loss = 'binary_crossentropy', metrics=['accuracy'])

    return model

Clone the model using the scikit-learn wrapper

from sklearn.base import clone
 
keras_class = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn = build_model, epochs = 100, batch_size = 10)
clone(keras_class)

keras_class.fit(X_train, y_train.values)

Create a random search grid

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hidden_layer": [1, 2, 3],
    "n_neurons": [20, 30],
    # "learning_rate": reciprocal(3e-4, 3e-2),
    # "opt": ['Adam']
}

rnd_search_cv = RandomizedSearchCV(keras_class, param_distribs, n_iter=10, cv=3)

rnd_search_cv.fit(X_train, y_train.values, epochs=5)

Check the best parameters

rnd_search_cv.best_params_

{'n_neurons': 30, 'n_hidden_layer': 3}

rnd_search_cv.best_score_

model = rnd_search_cv.best_estimator_.model

Optimizers should also be fine-tuned because they affect gradient descent, convergence, and automatic adjustment of the learning rate.

  • Adadelta – a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, rather than accumulating all past gradients
  • Stochastic gradient descent – commonly used; the learning rate needs to be fine-tuned with a search grid
  • Adagrad – for other optimizers, the learning rate is constant for all parameters and every cycle. Adagrad, however, changes the learning rate “η” for each parameter at every time step “t”, based on the derivative of the error function
  • Adam – Adam (adaptive moment estimation) uses first- and second-order momentum to avoid jumping over local minima, preserving an exponentially decaying average of past gradients
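Any of these can be tried by swapping the optimizer passed to model.compile (the learning rate below is illustrative and would normally be tuned as well):

from tensorflow.keras import optimizers

opt = optimizers.Adadelta(learning_rate=1.0)   # or optimizers.SGD(...), optimizers.Adagrad(...), optimizers.Adam(...)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])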

In general, better output can be obtained by increasing the number of layers rather than the number of neurons per layer.

References

Aurélien Géron (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA: O’Reilly Media.

Original link: towardsdatascience.com/a-beginners…
