0x00 Summary

Alink is a new-generation machine learning algorithm platform developed by Alibaba on top of the real-time computing engine Flink, and the first machine learning platform in the industry to support both batch and streaming algorithms. This article and the next analyze the implementation of the multilayer perceptron in Alink.

Since there is little public information about Alink, what follows is largely my own deduction, and it will certainly contain omissions. I hope readers will point them out, and I will update the article at any time.

0x01 Background Concepts

Almost all deep learning algorithms can be described by a fairly simple recipe: a specific dataset, a cost function, an optimization procedure, and a model.

1.1 Feedforward neural network

In a Feedforward Neural Network (FNN), neurons are divided into groups according to the order in which they receive information, and each group can be regarded as a neural layer. The neurons in each layer receive the outputs of the neurons in the previous layer and pass their own outputs to the neurons in the next layer. Information propagates through the network in one direction only, with no reverse flow of information (which is not the same thing as error back propagation): there is no feedback anywhere in the network, and signals travel one way from the input layer to the output layer, so the network can be represented by a directed acyclic graph. In a feedforward neural network, layer 0 is called the input layer, the last layer is called the output layer, and the intermediate layers are called hidden layers.

In a feedback neural network, by contrast, neurons can receive not only signals from other neurons but also their own feedback signals. Compared with feedforward networks, neurons in a feedback network have memory and take different states at different times. Information propagation in a feedback network can be one-way or two-way, so it can be represented by a directed cyclic graph or an undirected graph.

The main goal of a feedforward network is to approximate some function f*. For example, a regression function y = f*(x) maps an input x to a value y. The feedforward network defines y = f(x; θ) and learns the value of the parameters θ that brings the result closest to the optimal function.

For example, suppose we have three functions f^(1), f^(2) and f^(3) connected in a chain to form f(x) = f^(3)(f^(2)(f^(1)(x))). Such chains are the most commonly used structures in neural networks. In this case, f^(1) is called the first layer of the network, f^(2) the second layer, and so on. The length of the chain is called the depth of the model; it is from this term that the name "deep learning" arose.

Now the question is: why do we need feedforward networks when we already have linear machine learning models? Because linear models are limited to linear functions, whereas neural networks are not. When our data is not linearly separable, a linear model faces an approximation problem that neural networks handle with ease. The hidden layers are used to add nonlinearity and change the representation of the data so that the function generalizes better.

1.2 Back Propagation

How should we understand this "back propagation"? The core task in deep learning is to find the weights W and biases b for which the global error function Loss meets the requirements. When the computed Loss does not meet the requirements (that is, the error is too large), the error observed at the output layer is transmitted back to the hidden layers through "back propagation" and distributed among the different neurons, so that the "weight" of each neuron can be adjusted; adjustment continues until the Loss meets the requirements. This is the core idea of error back propagation.

Here we should first clear up a common confusion: in some places, "back propagation" is used to refer to the whole learning algorithm of a deep model, which is inaccurate. The overall learning algorithm actually answers two questions:

  • How is the cost information transmitted to each layer of the deep model?
  • How should a layer's parameters be updated based on the information passed to it?

In a given structure, information flowing forward along the structure is called forward propagation; correspondingly, back propagation refers to information flowing backward along the structure, from the output toward the input.

In a feedforward neural network, the input is propagated forward and gradually abstracted into features along the way, while the cost information, or error, between the current output and the expected output is propagated backward. The information transmitted to each layer is the cost information between that layer's output value and that layer's "expected output".

In today's mainstream frameworks, back propagation combines cost information with gradients with the help of a computational graph. Back propagation therefore neither exists only in neural networks or deep models, nor does it represent the whole learning algorithm of a deep model: it answers only the first question above. How to update the parameters based on the cost information, and how to optimize more efficiently, is the business of the optimization algorithm. The most effective modern optimization algorithms are mainly based on gradient descent, with many innovations built on top of it.

The training process of a deep model can be summarized as follows: given an established network structure and performance index, define the cost/target/error function in detail; propagate each input (or batch of inputs) forward to the output layer; compute the cost information under the defined cost function; pass it back through every layer of the deep model by back propagation; update the parameters of each layer with gradients based on that cost information; and repeat until the stop condition is met and training is complete.
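The loop below is a minimal, runnable sketch of that process for a one-parameter linear model y = w * x with squared error. The data, model, and learning rate are invented for illustration; Alink's actual training loop is driven by the L-BFGS optimizer analyzed later.

public class TrainingLoopSketch {
    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4};
        double[] ts = {2, 4, 6, 8};   // targets generated by t = 2 * x
        double w = 0.0;               // randomly initialized parameter
        double eta = 0.05;            // learning rate

        for (int iter = 0; iter < 200; iter++) {
            double cost = 0.0, grad = 0.0;
            for (int i = 0; i < xs.length; i++) {
                double y = w * xs[i];                       // forward propagation
                cost += 0.5 * (y - ts[i]) * (y - ts[i]);    // cost information
                grad += (y - ts[i]) * xs[i];                // back propagation: dE/dw
            }
            w -= eta * grad;                                // gradient-based update
            if (cost < 1e-9) { break; }                     // stop condition
        }
        System.out.println("w ~= " + w);                    // converges to 2.0
    }
}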

1.3 Cost Function

What the cost function does is show the difference between our model’s approximate value and the actual target value we are trying to achieve.

Usually the cost function contains at least one term that makes the learning process a statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function amounts to maximum likelihood estimation. The cost function may also contain additional terms, such as regularization terms.

In some cases, we cannot actually compute the cost function for computational reasons. As long as we have some way to approximate its gradient, however, we can still use iterative numerical optimization to approach the target.

Like other machine learning algorithms, feedforward networks are trained with gradient-based learning methods, in which algorithms such as stochastic gradient descent are used to minimize the cost function. The whole training process depends largely on our choice of cost function, more or less as in other parametric models.

The cost function of the back propagation algorithm must satisfy two properties:

  • The cost function must be expressible as an average.

  • The cost function must not depend on any activation values of the network other than those of the output layer.

The cost function mainly takes the form C(W, B, Sr, Er), where W is the weights of the neural network, B the biases, Sr the input of a single training sample, and Er the expected output of that sample.

1.4 Optimization Process

1.4.1 Iteration method

At the beginning of training, the weights w and biases b of an algorithm model are assigned randomly; in theory, the starting point can appear anywhere on the function's graph. How do we get it to find the value we want?

This is where the idea of "iteration" comes in: we can probe by substituting points to the left and right of the current x; if the point on the left has a smaller value than the point on the right, we move x to the left, and we keep trying until we find the "minimum". This is why training an algorithm model takes time: it iterates.

1.4.2 Gradient descent

With the iteration method comes another problem: trying points one by one will eventually find the value we need, but is there a way to take larger steps when far from the "extremum" and smaller steps near it (to avoid overshooting the extremum), so that "convergence" is faster and more accurate? If the graph is a "quadratic function", then the closer a point is to the "minimum", the smaller the "partial derivative" of the function at that point (the slope of the function there). This leads to the following method:


$$x_{n+1} = x_n - \eta \frac{df(x)}{dx}$$

The core idea of gradient descent: η controls the "step length" of the move, and the latter factor is the "partial derivative" of the function at the current point. Thus the closer a point is to the extreme point, the smaller the "partial derivative" and the shorter the moving "step"; conversely, far from the extreme point, the next "step" will be larger.

Carrying this formula over to our algorithm model ties the "moving step length" to Loss and (w, b), so that rapid "convergence" can be achieved.

Through the cooperation of the "iteration method" and "gradient descent", each update gets closer and closer to the extremum, until the update becomes very small or already falls within our error tolerance. At the end of training, the (W, b) obtained is the model we were looking for.
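As a minimal illustration, the following gradient descent on f(x) = (x − 3)² implements exactly the update rule above; the step automatically shrinks near the minimum because the derivative 2(x − 3) shrinks there. The starting point and learning rate are made-up numbers.

public class GradientDescentDemo {
    public static void main(String[] args) {
        double x = 10.0;                 // arbitrary starting point
        double eta = 0.1;                // learning rate
        for (int n = 0; n < 100; n++) {
            double grad = 2 * (x - 3);   // f'(x) for f(x) = (x - 3)^2
            x = x - eta * grad;          // x_{n+1} = x_n - eta * f'(x_n)
        }
        System.out.println(x);           // ~= 3.0, the minimum
    }
}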

1.5 Related Formulas

Here are some of the relevant formulas, excerpted for reference.

1.5.1 Weighted sum h


$$h_j = \sum_{i=0}^{M} w_{ij} x_i$$

h_j is the weighted sum of all inputs to the current node.

1.5.2 Neuron output value a


$$a_j = g(h_j) = g\left(\sum_{i=0}^{M} w_{ij} x_i\right)$$
  • a_j is the output value of neuron j in the hidden layer.
  • g(·) is the activation function, w is the weight, and x is the input.
  • a_j = x_{jk}: the output value of a neuron in the current layer equals the input value fed to neuron k of the next layer.

1.5.3 Output value y of the output layer


$$y = a_k = g(h_k) = g\left(\sum_{j=0}^{M} w_{jk} x_{jk}\right)$$
  • y is the value of the output layer, i.e. the final result.
  • h_k is the weighted sum of the inputs to neuron k of the output layer.

1.5.4 Activation function g(h)

Using the Sigmoid function:


$$g(h) = \sigma(h) = \frac{1}{1 + e^{-h}}$$

Derivative of the sigmoid function:


$$\sigma'(x) = \sigma(x)\,[1 - \sigma(x)]$$

Since a_j = g(h_j):


$$g'(h) = a_j(1 - a_j)$$
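In code, the two formulas look like this (a minimal sketch, not Alink's SigmoidFunction):

public class Sigmoid {
    public static double g(double h) {
        return 1.0 / (1.0 + Math.exp(-h));   // g(h) = 1 / (1 + e^-h)
    }

    public static double gPrime(double h) {
        double a = g(h);                     // a_j = g(h_j)
        return a * (1.0 - a);                // g'(h) = a_j (1 - a_j)
    }

    public static void main(String[] args) {
        System.out.println(g(0.0));          // 0.5
        System.out.println(gPrime(0.0));     // 0.25
    }
}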

1.5.5 Loss function E

Using a sum-of-squares error function


$$E = \frac{1}{2}\sum_{k=1}^{N}(y_k - t_k)^2$$
  • The square prevents error points on either side of the hyperplane from cancelling each other out (y − t takes both positive and negative values).
  • The factor 1/2 is chosen so that, when taking the gradient, it cancels the 2 produced by differentiating the square.

1.5.6 Error back propagation — updating the weights

Gradient descent is used to find the optimal solution, that is, we need the partial derivative of the loss function E with respect to the weight w:


$$\frac{\partial E}{\partial w_{ik}} = \frac{\partial E}{\partial h_k}\,\frac{\partial h_k}{\partial w_{ik}}$$

The right-hand side can be read as follows: to know how the output error E changes when the weight w changes, we observe how the error E changes with the activation-function input h, and how that input h changes with the weight w.

h_k is the weighted sum of all inputs to neuron k in the output layer, i.e. the input value of the activation function g(h).

1.5.7 Output-layer delta term δ_o

The first factor on the right-hand side is the important one; it is called the delta term (or error term). We derive the output layer's delta term through the chain rule:


$$\delta_o(k) = \frac{\partial E}{\partial h_k} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial h_k} = (y - t)\,g'(h_k)$$

You can then update the weight w of the output layer.

1.5.8 Updating the output-layer weight w_jk

For the loss function, the gradient descent method is used to update the weight:


$$w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}}$$

Substituting the delta term gives:


$$w_{jk} = w_{jk} - \eta\,\delta_o(k)\,a_i$$

a_i is the output value of the previous layer, i.e. the input value x_i of the output layer.
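Putting formulas 1.5.6 to 1.5.8 together, here is one concrete weight update for a single sigmoid output neuron with squared error; all numbers are invented for illustration.

public class OutputWeightUpdate {
    public static void main(String[] args) {
        double[] a = {0.5, 0.8};    // a_i: outputs of the previous layer
        double[] w = {0.4, -0.2};   // weights into output neuron k
        double t = 1.0;             // expected output
        double eta = 0.5;           // learning rate

        double h = 0.0;             // h_k: weighted sum of inputs
        for (int i = 0; i < w.length; i++) {
            h += w[i] * a[i];
        }
        double y = 1.0 / (1.0 + Math.exp(-h));    // y = g(h_k), sigmoid output

        double delta = (y - t) * y * (1.0 - y);   // delta_o(k) = (y - t) g'(h_k)
        for (int i = 0; i < w.length; i++) {
            w[i] -= eta * delta * a[i];           // w <- w - eta * delta_o(k) * a_i
        }
        System.out.println(java.util.Arrays.toString(w));
    }
}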

0x02 Sample code

The sample code for this article is as follows:

public class MultilayerPerceptronClassifierExample {
    public static void main(String[] args) throws Exception {
        BatchOperator data = Iris.getBatchData();

        MultilayerPerceptronClassifier classifier = new MultilayerPerceptronClassifier()
                .setFeatureCols(Iris.getFeatureColNames())
                .setLabelCol(Iris.getLabelColName())
                .setLayers(new int[] {4, 5, 3})
                .setMaxIter(100)
                .setPredictionCol("pred_label")
                .setPredictionDetailCol("pred_detail");

        BatchOperator res = classifier.fit(data).transform(data);
        res.print();
    }
}

Iris is defined as follows

public class Iris {
    final static String URL = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/iris.csv";
    final static String SCHEMA_STR
            = "sepal_length double, sepal_width double, petal_length double, petal_width double, category string";

    public static BatchOperator getBatchData() {
        return new CsvSourceBatchOp(URL, SCHEMA_STR);
    }

    public static StreamOperator getStreamData() {
        return new CsvSourceStreamOp(URL, SCHEMA_STR);
    }

    public static String getLabelColName() {
        return "category";
    }

    public static String[] getFeatureColNames() {
        return new String[] {"sepal_length", "sepal_width", "petal_length", "petal_width"};
    }
}

0x03 Overall training logic

The MultilayerPerceptronTrainBatchOp class is the implementation of batch training.

protected BatchOperator train(BatchOperator in) {
	return new MultilayerPerceptronTrainBatchOp(this.getParams()).linkFrom(in);
}

Following the usual routine, we go straight to the linkFrom function of MultilayerPerceptronTrainBatchOp.

The general idea is as follows:

  • 1) Obtain some meta information, such as the label name, feature column names, and feature types;
  • 2) Obtain the training data: trainData = getTrainingSamples;
  • 3) Train:
    • 3.1) Get the initial weights: initialWeights = getInitialWeights();
    • 3.2) Build the topology: topology = FeedForwardTopology.multiLayerPerceptron;
    • 3.3) Build the trainer FeedForwardTrainer, which will:
      • 3.3.1) initialize the model;
      • 3.3.2) construct the objective function;
      • 3.3.3) build the optimizer based on the objective function; here the optimizer is L-BFGS;
    • 3.4) Train to obtain the final weights: weights = trainer.train;
  • 4) Output the model as DataSet<Row>;
  • 5) Convert the DataSet<Row> into a Table;
@Override
public MultilayerPerceptronTrainBatchOp linkFrom(BatchOperator<?>... inputs) {
    BatchOperator<?> in = checkAndGetFirst(inputs);

    // 1) Get meta information, such as label name, feature column names, feature types
    final String labelColName = getLabelCol();
    final String vectorColName = getVectorCol();
    final boolean isVectorInput = !StringUtils.isNullOrWhitespaceOnly(vectorColName);
    final String[] featureColNames = isVectorInput ? null :
        (getParams().contains(FEATURE_COLS) ? getFeatureCols() :
            TableUtil.getNumericCols(in.getSchema(), new String[]{labelColName}));

    final TypeInformation<?> labelType = in.getColTypes()[TableUtil.findColIndex(in.getColNames(), labelColName)];
    DataSet<Tuple2<Long, Object>> labels = getDistinctLabels(in, labelColName);

    // The program variables are as follows:
    // labelColName = "category"
    // vectorColName = null
    // isVectorInput = false
    // featureColNames = {String[4]@6412}
    //  0 = "sepal_length"
    //  1 = "sepal_width"
    //  2 = "petal_length"
    //  3 = "petal_width"
    // labelType = {BasicTypeInfo@6414} "String"
    // labels = {MapOperator@6415}

    // 2) Obtain the training data
    DataSet<Tuple2<Double, DenseVector>> trainData =
        getTrainingSamples(in, labels, featureColNames, vectorColName, labelColName);

    // 3) Train
    final int[] layerSize = getLayers();
    final int blockSize = getBlockSize();
    // 3.1) Get the initial weights
    final DenseVector initialWeights = getInitialWeights();
    // 3.2) Build the topology
    Topology topology = FeedForwardTopology.multiLayerPerceptron(layerSize, true);
    // 3.3) Build the trainer
    FeedForwardTrainer trainer = new FeedForwardTrainer(topology,
        layerSize[0], layerSize[layerSize.length - 1], true, blockSize, initialWeights);
    // 3.4) Train to get the final weights
    DataSet<DenseVector> weights = trainer.train(trainData, getParams());

    // 4) Output the model
    DataSet<Row> modelRows = weights
        .flatMap(new RichFlatMapFunction<DenseVector, Row>() {
            @Override
            public void flatMap(DenseVector value, Collector<Row> out) throws Exception {
                List<Tuple2<Long, Object>> bcLabels = getRuntimeContext().getBroadcastVariable("labels");
                Object[] labels = new Object[bcLabels.size()];
                bcLabels.forEach(t2 -> {
                    labels[t2.f0.intValue()] = t2.f1;
                });

                MlpcModelData model = new MlpcModelData(labelType);
                model.labels = Arrays.asList(labels);
                model.meta.set(ModelParamName.IS_VECTOR_INPUT, isVectorInput);
                model.meta.set(MultilayerPerceptronTrainParams.LAYERS, layerSize);
                model.meta.set(MultilayerPerceptronTrainParams.VECTOR_COL, vectorColName);
                model.meta.set(MultilayerPerceptronTrainParams.FEATURE_COLS, featureColNames);
                model.weights = value;
                new MlpcModelDataConverter(labelType).save(model, out);
            }
        })
        .withBroadcastSet(labels, "labels");

    // 5) Convert the DataSet<Row> into a Table
    setOutput(modelRows, new MlpcModelDataConverter(labelType).getModelSchema());

    return this;
}

3.1 Overall logic sample diagram

An example of the overall logic is shown below, with the order of the initialization steps tweaked for better illustration.

multiLayerPerceptron                    // build the topology
   │
   ▼
FeedForwardTopology                     // the topology; its member `layers` holds every layer, e.g. AffineLayer
   │
   ▼
FeedForwardTrainer(topology)            // generate the trainer
   │
   ├── getTrainingSamples               // get the training data as <label index, vector>
   │     └── trainData = stack()        // compress the training data into vectors
   │
   ├── initModel()                      // initialize the model
   │
   ├── AnnObjFunc                       // objective function [topology, topologyModel];
   │                                    // member `topology` is the network topology,
   │                                    // member `topologyModel` is the computation model generated from it
   │
   ├── optimizer = new Lbfgs(.., annObjFunc, ..)   // optimizer generated from the objective function
   │                                               // (during the training process)
   └── optimizer.initCoefWith(initCoef)            // initialize the optimizer

optimizer.optimize()                    // L-BFGS iterative training
   ├── calc gradient (using the topology model)
   │     1. compute the output of each layer
   │     2. compute the output-layer loss
   │     3. compute the delta of each layer
   │     4. compute the gradient of each layer
   ├── calc direction
   │     (the objective function's topology model is not used here)
   ├── calc losses (using the topology model)
   │     1. compute the output of each layer
   │     2. compute the output-layer loss
   └── update model
         (the objective function's topology model is not used here)


3.2 Overview of L-BFGS training call logic

As illustrated above, L-BFGS is our optimizer; its key steps are as follows:

  • CalcGradient() — compute the gradient
  • CalDirection(...) — compute the direction
  • CalcLosses(...) — compute the losses
  • UpdateModel(...) — update the model

The algorithm framework is basically unchanged across models; what differs is the specific objective function and loss function. For example, linear regression uses UnaryLossObjFunc with SquareLossFunc as the loss function, while the multilayer perceptron uses AnnObjFunc as its objective function.

Specifically for the multilayer perceptron, the steps in L-BFGS that touch the objective function are as follows (a schematic sketch follows the list):

CalcGradient — compute the gradient

  • 1) Call AnnObjFunc.updateGradient;
    • 1.1) Call the objective function's topology model, topologyModel.computeGradient, to compute:
      • 1.1.1) the output of each layer: forward(data, true)
      • 1.1.2) the output-layer loss: labelWithError.loss
      • 1.1.3) the delta of each layer: layerModels.get(i).computePrevDelta
      • 1.1.4) the gradient of each layer: layerModels.get(i).grad

CalDirection — compute the direction

  • The objective function's topology model is not used here.

CalcLosses — compute the losses

  • 1) Call AnnObjFunc.calcSearchValues, which internally calls calcLoss to compute the loss;
    • 1.1) Call topologyModel.computeGradient to compute the loss:
      • 1.1.1) the output of each layer: forward(data, true)
      • 1.1.2) the output-layer loss: labelWithError.loss

UpdateModel — update the model

  • The objective function's topology model is not used here.
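To make the four stages concrete, here is a schematic of one optimizer iteration in plain Java. This is a generic gradient-descent skeleton for orientation only, not Alink's Lbfgs class: the direction here is plain steepest descent and the step is fixed, whereas real L-BFGS builds the direction from a history of gradients and picks the step with a line search over the computed losses.

import java.util.function.DoubleUnaryOperator;

public class OptimizerSkeleton {
    public static void main(String[] args) {
        DoubleUnaryOperator f = x -> (x - 3) * (x - 3);       // stand-in objective function
        DoubleUnaryOperator grad = x -> 2 * (x - 3);          // its gradient
        double coef = 0.0;                                    // initial coefficient
        for (int iter = 0; iter < 50; iter++) {
            double g = grad.applyAsDouble(coef);              // 1) CalcGradient
            double dir = -g;                                  // 2) CalDirection (steepest descent here)
            double step = 0.25;                               // 3) CalcLosses: evaluate f along dir;
            double loss = f.applyAsDouble(coef + step * dir); //    a real line search would try several steps
            coef += step * dir;                               // 4) UpdateModel
        }
        System.out.println("coef ~= " + coef);                // converges to 3.0
    }
}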

3.3 Obtaining training data

The getTrainingSamples function extracts the training data from the raw input.

Example raw data:

5.1	3.5	1.4	0.2	Iris-setosa
5	2	3.5	1	Iris-versicolor
5.1	3.7	1.5	0.4	Iris-setosa
6.4	2.8	5.6	2.2	Iris-virginica
6	2.9	4.5	1.5	Iris-versicolor

It mainly does the following:

  • 1) Obtain metadata, such as the indices of the feature columns and the index of the label column;
  • 2) Broadcast labels, to be used later in the open function;
  • 3) In the open function, build a label : index mapping;
  • 4) The map function has two execution paths; both convert a row into a <label index, vector> tuple:
    • 4.1) the original input contains a vector column, e.g. "5.1 3.5 1.4 0.2 Iris-setosa", where "5.1 3.5 1.4 0.2" is the vector;
    • 4.2) the original input has no vector column, only feature columns, e.g. "5.1 3.5 1.4 0.2 Iris-setosa";

The specific code is as follows:

private static DataSet<Tuple2<Double, DenseVector>> getTrainingSamples(
        BatchOperator data, DataSet<Tuple2<Long, Object>> labels,
        final String[] featureColNames, final String vectorColName, final String labelColName) {

    // 1) Obtain metadata, such as the indices of the feature columns and the index of the label column
    final boolean isVectorInput = !StringUtils.isNullOrWhitespaceOnly(vectorColName);
    final int vectorColIdx = isVectorInput ? TableUtil.findColIndex(data.getColNames(), vectorColName) : -1;
    final int[] featureColIdx = isVectorInput ? null : TableUtil.findColIndices(data.getSchema(),
        featureColNames);
    final int labelColIdx = TableUtil.findColIndex(data.getColNames(), labelColName);

    // The program variables are as follows:
    // isVectorInput = false
    // vectorColIdx = -1
    // featureColIdx = {int[4]@6443}
    //  0 = 0
    //  1 = 1
    //  2 = 2
    //  3 = 3
    // labelColIdx = 4

    DataSet<Row> dataRows = data.getDataSet();
    return dataRows
        .map(new RichMapFunction<Row, Tuple2<Double, DenseVector>>() {
            transient Map<Comparable, Long> label2index;

            @Override
            public void open(Configuration parameters) throws Exception {
                List<Tuple2<Long, Object>> bcLabels = getRuntimeContext().getBroadcastVariable("labels");
                this.label2index = new HashMap<>();
                // 3) Build a label : index mapping
                bcLabels.forEach(t2 -> {
                    Long index = t2.f0;
                    Comparable label = (Comparable) t2.f1;
                    this.label2index.put(label, index);
                });
                // The variables are:
                // this = {MultilayerPerceptronTrainBatchOp$2@11578}
                //  label2index = {HashMap@11580} size = 3
                //   "Iris-versicolor" -> {Long@11590} 2
                //   "Iris-virginica" -> {Long@11592} 1
                //   "Iris-setosa" -> {Long@11594} 0
            }

            @Override
            public Tuple2<Double, DenseVector> map(Row value) throws Exception {
                Comparable label = (Comparable) value.getField(labelColIdx);
                Long labelIdx = this.label2index.get(label);

                if (isVectorInput) { // 4.1) The original input contains a vector
                    Vector vec = VectorUtil.getVector(value.getField(vectorColIdx));
                    // Convert to a tuple like <label index, vector>
                    if (null == vec) {
                        return new Tuple2<>(labelIdx.doubleValue(), null);
                    } else {
                        return new Tuple2<>(labelIdx.doubleValue(),
                            (vec instanceof DenseVector) ? (DenseVector) vec
                                : ((SparseVector) vec).toDenseVector());
                    }
                } else { // 4.2) The original input has no vector, only feature columns
                    int n = featureColIdx.length;
                    DenseVector features = new DenseVector(n);
                    for (int i = 0; i < n; i++) {
                        double v = ((Number) value.getField(featureColIdx[i])).doubleValue();
                        features.set(i, v);
                    }
                    // Convert to a tuple like <label index, vector>
                    return Tuple2.of(labelIdx.doubleValue(), features);
                }
            }
        })
        .withBroadcastSet(labels, "labels"); // 2) Broadcast labels for use in open()
}

3.4 Topology Construction

FeedForwardTopology.multiLayerPerceptron does the work of constructing the feedforward neural network topology.

public static FeedForwardTopology multiLayerPerceptron(int[] layerSize, boolean softmaxOnTop) {
    List<Layer> layers = new ArrayList<>((layerSize.length - 1) * 2);
    for (int i = 0; i < layerSize.length - 1; i++) {
        layers.add(new AffineLayer(layerSize[i], layerSize[i + 1]));
        if (i == layerSize.length - 2) {
            if (softmaxOnTop) {
                layers.add(new SoftmaxLayerWithCrossEntropyLoss());
            } else {
                layers.add(new SigmoidLayerWithSquaredError());
            }
        } else {
            layers.add(new FuntionalLayer(new SigmoidFunction()));
        }
    }
    return new FeedForwardTopology(layers);
}

To review the concept: feedforward neural networks are called networks because they are usually represented by a composition of many different functions. The model is associated with a directed acyclic graph that describes how the functions are composed together.

Each neuron receives the input of the previous layer and outputs to the next layer, from the input layer through to the output layer; there is no feedback anywhere in the network. Each layer contains several neurons; neurons in the same layer are not connected to each other, and information is transmitted between layers in one direction only. The first layer is called the input layer, the last layer the output layer, and the layers in between are the hidden layers; there can be one hidden layer or several.

FeedForwardTopology is the topology of the feedforward neural network, the logical representation of the network layers described above. The topology contains several layers, from the hidden layers through to the output layer.

/**
 * The topology of a feed forward neural network.
 */
public class FeedForwardTopology extends Topology {
    /**
     * All layers of the topology.
     */
    private List<Layer> layers;
}

The constructed topology variable looks roughly as follows, divided into four layers:

  • Affine layer AffineLayer: an affine transformation, i.e. a linear transformation plus a translation: h = WX + b;
  • Functional layer FuntionalLayer: its function is a SigmoidFunction, i.e. the activation layer corresponding to the previous affine layer;
  • Affine layer AffineLayer;
  • Output layer SoftmaxLayerWithCrossEntropyLoss;

Here an affine layer and a functional layer together constitute a hidden unit. Most hidden units can be described as accepting an input vector x, computing an affine transformation z = Wᵀx + b, and then applying an element-wise nonlinear function g(z). Most hidden units differ only in the form of the activation function g(z).
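As a concrete illustration, the sketch below runs one hidden unit's forward pass in plain Java: an affine transform z = Wx + b followed by an element-wise sigmoid. The weights, biases, and the Iris-like input are made-up numbers; Alink's AffineLayerModel and FuntionalLayerModel perform the same computation on its own vector types.

public class HiddenUnitSketch {
    public static void main(String[] args) {
        double[][] w = {{0.1, 0.2, 0.3, 0.4},   // 5x4 weight matrix (numOut x numIn)
                        {0.5, 0.4, 0.3, 0.2},
                        {0.2, 0.1, 0.0, 0.1},
                        {0.3, 0.3, 0.3, 0.3},
                        {0.1, 0.1, 0.2, 0.2}};
        double[] b = {0.1, 0.1, 0.1, 0.1, 0.1}; // bias, one entry per output neuron
        double[] x = {5.1, 3.5, 1.4, 0.2};      // one Iris-like sample

        double[] a = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            double z = b[j];
            for (int i = 0; i < x.length; i++) {
                z += w[j][i] * x[i];             // affine: z_j = sum_i w_ji * x_i + b_j
            }
            a[j] = 1.0 / (1.0 + Math.exp(-z));   // sigmoid activation g(z_j)
        }
        System.out.println(java.util.Arrays.toString(a));
    }
}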

Printing the variables at runtime makes this clearer. With setLayers(new int[]{4, 5, 3}), the layers are set up accordingly as 4, 5, 3:

this = {FeedForwardTopology@4951} 
 layers = {ArrayList@4944}  size = 4
      0 = {AffineLayer@4947}  // affine layer
       numIn = 4
       numOut = 5
      1 = {FuntionalLayer@4948} 
       activationFunction = {SigmoidFunction@4953}  // activation function
      2 = {AffineLayer@4949}  // affine layer
       numIn = 5
       numOut = 3
      3 = {SoftmaxLayerWithCrossEntropyLoss@4950}  // output layer with cross-entropy loss

3.4.1 AffineLayer

y = A·x + b. The attributes of the affine layer are:

public class AffineLayer extends Layer {
    public int numIn;
    public int numOut;

    public AffineLayer(int numIn, int numOut) {
        this.numIn = numIn;
        this.numOut = numOut;
    }

    @Override
    public LayerModel createModel() {
        return new AffineLayerModel(this);
    }

    ...
}

3.4.2 FuntionalLayer

y = f(x), where the activationFunction here is the f(x):

public class FuntionalLayer extends Layer {
    public ActivationFunction activationFunction;

    @Override
    public LayerModel createModel() {
        return new FuntionalLayerModel(this);
    }
}

3.4.3 SoftmaxLayerWithCrossEntropyLoss

3.4.3.1 Softmax

The output function is basically always a Softmax function, defined as follows:


$$\sigma_i(z) = \frac{\exp(z_i)}{\sum_{j=1}^{m}\exp(z_j)}, \quad i = 1, \dots, m$$

The output vector of Softmax is a probability distribution: the probability that the sample belongs to each class. Its role in Logistic Regression is to convert linear predictions into category probabilities.

Assuming z_i = w_i·x + b_i is the linear prediction result for the i-th category, feeding the z_i into Softmax exponentiates each one to make it non-negative and then divides by the sum of all terms to normalize. Each σ_i = σ_i(z) can then be interpreted as the probability, or likelihood, that the observed data x belongs to category i.

Therefore, the goal of training the fully connected layer's W is that its output W·x, after passing through the Softmax layer, assigns the highest predicted probability to the true label.
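The following is a minimal softmax sketch matching the formula above (plain Java for illustration, not Alink's implementation). Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

public class Softmax {
    public static double[] softmax(double[] z) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : z) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        double[] out = new double[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = Math.exp(z[i] - max);   // exp(z_i), shifted for stability
            sum += out[i];
        }
        for (int i = 0; i < z.length; i++) {
            out[i] /= sum;                   // normalize so the entries sum to 1
        }
        return out;
    }

    public static void main(String[] args) {
        // three made-up linear predictions W.x for three classes
        double[] probs = softmax(new double[] {2.0, 1.0, 0.1});
        System.out.println(java.util.Arrays.toString(probs)); // ~ [0.659, 0.242, 0.099]
    }
}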

3.4.3.2 Softmax loss

Once you understand Softmax, it's time to talk about Softmax Loss. What does softmax loss mean? The details are as follows:


$$L = -\sum_{j=1}^{T} y_j \log S_j$$
  • L stands for the loss.
  • S_j is the j-th value of the Softmax output vector S, representing the probability that this sample belongs to the j-th category.
  • y_j sits under the summation sign with j ranging from 1 to T, so y is a 1×T vector in which only one of the T values is 1 and the other T−1 values are 0. Which one is 1? The entry corresponding to the true label is 1, and everything else is 0.

So this formula actually has a simpler form:


$$L = -\log S_j$$

where j is understood to point to the true label of the current sample.

3.4.3.3 Cross entropy

With softmax loss sorted out, we can look at cross entropy, whose formula is as follows:


$$E = -\sum_{j=1}^{T} y_j \log P_j$$

Most modern neural networks are trained with maximum likelihood. This means that the cost function is the negative log-likelihood, which is equivalent to the cross entropy between the training data and the model distribution. The exact form of the cost function varies from model to model.

In information theory, cross entropy involves two probability distributions P and Q, where P is the true distribution and Q a non-true (estimated) distribution: for the same set of events, it is the average number of bits needed to encode an event's occurrence using Q. In a neural network (machine learning), cross entropy can serve as a loss function: P is the distribution of the true labels, Q is the trained model's predicted label distribution, and the cross-entropy loss measures the similarity between P and Q.

Does it look similar to the Softmax Loss formula? When the input P of the cross entropy is the output of Softmax, cross entropy equals softmax loss. P_j is the j-th value of the input probability vector P, so if your probabilities come from the Softmax formula, the cross entropy is exactly softmax loss.
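A tiny demonstration of this equality, reusing the hypothetical Softmax class from the sketch above: with a one-hot P, the cross entropy collapses to −log Q at the true label, exactly the softmax loss.

public class CrossEntropyDemo {
    public static double crossEntropy(double[] p, double[] q) {
        double e = 0.0;
        for (int j = 0; j < p.length; j++) {
            if (p[j] > 0) {
                e -= p[j] * Math.log(q[j]);   // E = -sum_j y_j log P_j
            }
        }
        return e;
    }

    public static void main(String[] args) {
        double[] q = Softmax.softmax(new double[] {2.0, 1.0, 0.1});
        double[] p = {1.0, 0.0, 0.0};                 // one-hot true label: class 0
        System.out.println(crossEntropy(p, q));       // ~ 0.417
        System.out.println(-Math.log(q[0]));          // same value: -log S_j
    }
}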

One advantage of deriving the cost function from maximum likelihood is that it removes the burden of designing a cost function for each model: specifying a model p(y|x) automatically determines the cost function −log p(y|x). The gradient of the cost function must also be large and predictable enough to serve as a good guide for the learning algorithm.

3.4.3.4 SoftmaxLayerWithCrossEntropyLoss

SoftmaxLayerWithCrossEntropyLoss is, as its name says, a softmax layer with cross-entropy loss.

public class SoftmaxLayerWithCrossEntropyLoss extends Layer {
    @Override
    public LayerModel createModel() {
        return new SoftmaxLayerModelWithCrossEntropyLoss();
    }
}

3.5 Building the trainer

Recall the sample code:

.setLayers(new int[] {4, 5, 3})

This specifies the structure of the neural network: the input layer has 4 units, the hidden layer 5, and the output layer 3.

The code for generating the trainer is as follows:

FeedForwardTrainer trainer = new FeedForwardTrainer(topology,
        layerSize[0], layerSize[layerSize.length - 1], true, blockSize,
        initialWeights);

FeedForwardTrainer is a feedforward neural network trainer.

public class FeedForwardTrainer implements Serializable {
    private Topology topology;
    private int inputSize;
    private int outputSize;
    private int blockSize; // data block size, 64 by default; used by the stack function when compressing data
    private boolean onehotLabel;
    private DenseVector initialWeights;
}

The trainer variable prints as follows:

trainer = {FeedForwardTrainer@6456} 
 topology = {FeedForwardTopology@6455} 
  layers = {ArrayList@4963}  size = 4
   0 = {AffineLayer@6461} 
   1 = {FuntionalLayer@6462} 
   2 = {AffineLayer@6463} 
   3 = {SoftmaxLayerWithCrossEntropyLoss@6464} 
 inputSize = 4
 outputSize = 3
 blockSize = 64
 onehotLabel = true
 initialWeights = null

We can see that the core object of training is FeedForwardTrainer, which contains the topology, which in turn contains the four layers.

Let us also preview the optimizer and objective function used by the trainer: the trainer uses the optimizer to optimize the objective function.

Here the optimizer is Lbfgs; it contains the objective function AnnObjFunc, which in turn contains the topology and the topology model.

public class AnnObjFunc extends OptimObjFunc {
    private Topology topology;
    private transient TopologyModel topologyModel = null;
}

The topology model is generated from the topology; here it is FeedForwardModel, in which each layer has a corresponding model such as AffineLayerModel or FuntionalLayerModel.

For example, AffineLayerModel.eval is a simple affine transformation WX + b.

This completes the first part of the multilayer perceptron analysis. Stay tuned for the next installment.

0xEE Personal information

★★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want timely notifications of my articles, or to see the technical material I recommend, please follow this account.
