Before we start, let’s take a look at the final code:

1. The branch with the Eigen matrix back end (https://github.com/OneRaynyDay/autodiff/tree/eigen)

2. The branch that supports scalars only (https://github.com/OneRaynyDay/autodiff/tree/master)

I worked with Minh Le on this project.

Why?

If you’re a computer science (CS) student, you’ve probably heard the phrase “don’t roll your own ____” a thousand times. The blank covers cryptography, standard libraries, parsers, and more. I think by now it should also include ML libraries.

Regardless, it is a lesson worth learning. People nowadays take TensorFlow and libraries like it for granted; they treat them as black boxes and make them run, but not many people understand how they work behind the scenes. It is just a non-convex optimization problem! Stop shuffling code around without understanding it just to make it look right.

TensorFlow

In TensorFlow’s code there is an important component that lets you string computations together into something called a “computation graph.” The computation graph is a directed graph G = (V, E) where, for some nodes u1, u2, …, un, v ∈ V and edges e1, e2, …, en ∈ E with ei = (ui, v), there is some function that maps u1, …, un to v.

For example, if we have x + y = z, then (x,z), (y,z)∈E.

This is very useful for evaluating arithmetic expressions, and we can read the results off at the sinks of the computation graph. Sinks are vertices v ∈ V such that ∄ e = (v, u); in other words, these vertices have no outgoing edges to other vertices. Similarly, sources are vertices v ∈ V such that ∄ e = (u, v).

In our case, we always place values at the input sources, and the values propagate to the sinks.
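To make the idea concrete, here is a minimal sketch of a computation-graph node and a forward evaluation that pushes values from the sources toward a sink. This is my own toy example, not the project’s code; the project’s actual classes appear later in this post.

    #include <memory>
    #include <stdexcept>
    #include <vector>

    // Toy node type: leaves (sources) hold values, interior nodes hold an operation.
    struct Node {
        enum class Op { leaf, plus, multiply };
        Op op = Op::leaf;
        double value = 0;                             // only meaningful for leaves
        std::vector<std::shared_ptr<Node>> children;  // the inputs feeding this vertex
    };

    // Evaluate a vertex by recursively evaluating its inputs first,
    // so values flow from the sources toward the sinks.
    double eval(const std::shared_ptr<Node>& n) {
        switch (n->op) {
            case Node::Op::leaf:     return n->value;
            case Node::Op::plus:     return eval(n->children[0]) + eval(n->children[1]);
            case Node::Op::multiply: return eval(n->children[0]) * eval(n->children[1]);
        }
        throw std::logic_error("unknown op");
    }

    int main() {
        // x + y = z from the example above: edges (x, z) and (y, z).
        auto x = std::make_shared<Node>();
        auto y = std::make_shared<Node>();
        x->value = 2;
        y->value = 3;
        auto z = std::make_shared<Node>();
        z->op = Node::Op::plus;
        z->children = {x, y};
        return eval(z) == 5 ? 0 : 1;  // z is a sink; reading it gives the result
    }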

Reverse-mode differentiation

If you find my explanation lacking, please refer to these slides.

Differentiation is a core requirement for many of the models in TensorFlow, because we need it to run gradient descent. Everyone who has graduated from high school should know what a derivative is. For a composite function built from elementary functions, you just need the derivatives of those elementary pieces and then apply the chain rule.

Super succinct overview

If we have a function of two variables, we can take its derivative with respect to x and with respect to y; stacking those partial derivatives gives the gradient. For composite functions such as f(g(h(x))), the derivative follows from the chain rule.
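For concreteness, here is a small worked example of my own (the specific function is illustrative and is not taken from the original post), written in LaTeX:

    f(x, y) = x^2 y + y

    \frac{\partial f}{\partial x} = 2xy,
    \qquad
    \frac{\partial f}{\partial y} = x^2 + 1,
    \qquad
    \nabla f = \left( 2xy,\; x^2 + 1 \right)

    \frac{d}{dx} f\big(g(h(x))\big) = f'\big(g(h(x))\big)\, g'\big(h(x)\big)\, h'(x)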

Reverse mode in 5 minutes

For now, just remember that our graph is a DAG (directed acyclic graph) and that we apply the chain rule exactly as in the previous example, like so:

    x -> h -> g -> f

As a graph, we can read off the answer at f; however, we can also traverse it in reverse:

    dx <- dh <- dg <- df

So it looks just like the chain rule! We multiply the derivatives along the path to get the final result. Here is an example of a computation graph, sketched below.
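This is my own illustration of the idea for the chain x -> h -> g -> f; the derivative at the source is just the product of the local derivatives along the path:

    forward pass (values):       x -> h(x) -> g(h(x)) -> f(g(h(x)))
    reverse pass (derivatives):  df/dx <- df/dh <- df/dg <- df/df = 1

    df/dg = f'(g(h(x)))
    df/dh = f'(g(h(x))) * g'(h(x))
    df/dx = f'(g(h(x))) * g'(h(x)) * h'(x)    (the product of the derivatives along the path)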

This reduces it to a graph traversal problem. Has anyone noticed that this is just topological sorting and depth-first/breadth-first search?

Yes: in order to support topological sorting in both directions, we need to keep both a set of parents and a set of children for each node, and the sinks in one direction become the sources in the other direction, and vice versa.
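Here is a minimal sketch of that traversal, assuming a plain adjacency-list representation (a hypothetical helper, not the project’s API). Passing each node’s children gives the forward (source-to-sink) order; passing each node’s parents instead gives the reverse order used for backpropagation.

    #include <cstddef>
    #include <queue>
    #include <vector>

    // graph[v] lists the vertices that v has edges to (its children or its parents,
    // depending on which direction we want to topologically sort).
    std::vector<std::size_t> topological_order(
            const std::vector<std::vector<std::size_t>>& graph) {
        std::vector<std::size_t> indegree(graph.size(), 0), order;
        for (const auto& edges : graph)
            for (std::size_t v : edges) ++indegree[v];

        std::queue<std::size_t> ready;  // vertices whose inputs are all processed
        for (std::size_t v = 0; v < graph.size(); ++v)
            if (indegree[v] == 0) ready.push(v);

        while (!ready.empty()) {
            std::size_t u = ready.front();
            ready.pop();
            order.push_back(u);
            for (std::size_t v : graph[u])
                if (--indegree[v] == 0) ready.push(v);
        }
        return order;  // a valid ordering only if the graph is a DAG
    }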

Implementation

Minh Le and I started designing this project before school started. We decided to use the Eigen library as the back end for linear algebra; it has a matrix class called MatrixXd, which we use in our project:

    class var {
        // Forward declaration
        struct impl;

    public:
        // For initialization of new vars by ptr
        var(std::shared_ptr<impl>);

        var(double);
        var(const MatrixXd&);
        var(op_type, const std::vector<var>&);
        ...

        // Access/Modify the current node value
        MatrixXd getValue() const;
        void setValue(const MatrixXd&);
        op_type getOp() const;
        void setOp(op_type);

        // Access internals (no modify)
        std::vector<var>& getChildren() const;
        std::vector<var> getParents() const;
        ...
    private:
        // PImpl idiom requires forward declaration of the class:
        std::shared_ptr<impl> pimpl;
    };

    struct var::impl {
    public:
        impl(const MatrixXd&);
        impl(op_type, const std::vector<var>&);
        MatrixXd val;
        op_type op;
        std::vector<var> children;
        std::vector<std::weak_ptr<impl>> parents;
    };

Here we use an idiom called “pImpl,” which stands for “pointer to implementation.” It has many uses, such as decoupling the implementation from the interface and allocating the actual data on the heap while keeping only a lightweight handle on the stack. The side effects of pImpl are a slight slowdown at runtime but a significant reduction in compile time. More importantly for us, it lets the data structure persist across multiple function calls and returns. A tree data structure like this should be persistent.
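As a toy illustration of why this matters (my own sketch, not the library’s code): copies of a pImpl’d handle share one heap-allocated implementation, so the underlying node outlives any single stack frame that holds a handle to it.

    #include <memory>

    class handle {
        struct impl { double value = 0; };  // the real data lives on the heap
        std::shared_ptr<impl> pimpl;        // the handle itself is just a pointer
    public:
        handle() : pimpl(std::make_shared<impl>()) {}
        void set(double v) { pimpl->value = v; }
        double get() const { return pimpl->value; }
    };

    int main() {
        handle a;
        handle b = a;                  // copies the pointer, not the data
        b.set(42);
        return a.get() == 42 ? 0 : 1;  // both handles observe the same impl
    }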

We have an enum that tells us which operation each node performs:

    enum class op_type {
        plus,
        minus,
        multiply,
        divide,
        exponent,
        log,
        polynomial,
        dot,
        ...
        none // no operators. leaf.
    };

The actual class that performs the evaluation of this tree is called expression:

    class expression {
    public:
        expression(var);
        ...
        // Recursively evaluates the tree.
        double propagate();
        ...
        // Computes the derivative for the entire graph.
        // Performs a top-down evaluation of the tree.
        void backpropagate(std::unordered_map<var, double>& leaves);
        ...
    private:
        var root;
    };
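Assuming the scalar flavor of the API and the overloaded operators shown later in this post, usage might look roughly like this (a sketch, not code taken from the repository):

    // Sketch only: exact constructors and signatures may differ from the repository.
    var x(2.0), y(3.0);
    var z = x * y + x;               // builds the expression tree; nothing is evaluated yet

    expression expr(z);
    double value = expr.propagate(); // forward pass: 2*3 + 2 = 8

    // Ask for derivatives with respect to the leaves we care about.
    std::unordered_map<var, double> leaves{{x, 0.0}, {y, 0.0}};
    expr.backpropagate(leaves);      // afterwards leaves[x] should be 4 and leaves[y] should be 2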

In backpropagation, our code can do something like the following:

    backpropagate(node, dprev):
        derivative = differentiate(node)*dprev
        for child in node.children:
            backpropagate(child, derivative)

It’s almost like doing a depth-first search (DFS). Do you see that?
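In C++ the recursion might look roughly like the following. This is a sketch rather than the repository’s exact code, and differentiate(node, child) stands for a hypothetical helper that returns the local partial derivative of node with respect to that child (dispatching on op_type):

    // Accumulates each leaf's gradient into the map, descending depth-first.
    void backpropagate(var node, double dprev,
                       std::unordered_map<var, double>& leaves) {
        if (node.getChildren().empty()) {   // leaf: accumulate the incoming gradient
            leaves[node] += dprev;
            return;
        }
        for (var& child : node.getChildren()) {
            double derivative = differentiate(node, child) * dprev;  // chain rule step
            backpropagate(child, derivative, leaves);                // depth-first descent
        }
    }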

Why C++?

In practice, C++ may not be the right language for this sort of thing. We could have spent less time developing in a functional language like OCaml. I now understand why Scala is used in machine learning, mainly because of Spark. However, there are many benefits to using C++.

Eigen (library)

For example, we can directly use the linear algebra library that TensorFlow itself uses, called Eigen. It is a lazily evaluated linear algebra library, with a flavor similar to our expression tree: we build up expressions that are evaluated only when we really need them. However, Eigen resolves which templates to use at compile time, which means less runtime overhead. I have a lot of respect for whoever wrote Eigen, because looking at its template errors nearly blinded me!

Their code looks something like this:

    Matrix A(...), B(...);
    auto lazy_multiply = A.dot(B);
    typeid(lazy_multiply).name(); // the class name is something like Dot_Matrix_Matrix.
    Matrix(lazy_multiply); // functional-style casting forces evaluation of this matrix.

This lazy-evaluation technique is very powerful, and it is one reason Eigen is one of the main back ends of TensorFlow; the library includes other optimizations besides lazy evaluation as well.

Operator overloading

Developing a library like this in Java would have been pleasant: no shared_ptrs, unique_ptrs, or weak_ptrs; we would get a real, usable GC (garbage collector). That saves a great deal of development time, not to mention execution time. However, Java does not allow operator overloading, so we could not write:

    // These 3 lines code up an entire neural network!
    var sigm1 = 1 / (1 + exp(-1 * dot(X, w1)));
    var sigm2 = 1 / (1 + exp(-1 * dot(sigm1, w2)));
    var loss = sum(-1 * (y * log(sigm2) + (1-y) * log(1-sigm2)));

This is the actual code, by the way. Isn’t it beautiful? I would say it is even more elegant than the Python wrapper for TensorFlow! And just so you know, these are matrices too.

In Java, chaining calls like add() and divide() is ugly. More importantly, operator overloading lets the user rely on the usual order of operations (PEMDAS), which C++’s operators respect.

Features, not bugs

This library has a feature for which TensorFlow does not offer a well-defined API (or it does and I am not aware of it): we can backpropagate only to the particular sources we are interested in, rather than training every weight. This is very useful for transfer learning with convolutional neural networks, because large networks such as VGG19 are often truncated, a few extra layers are attached, and only the weights of those new layers are trained on samples from the new domain.
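With the expression class above, that might look roughly like the following. The variable names are hypothetical, and I am assuming the leaves map is what selects which sources receive gradients:

    // Only the leaves placed in the map receive gradients, so the truncated
    // VGG-style weights stay frozen while the newly attached layers are trained.
    std::unordered_map<var, double> trainable{{w_new1, 0.0}, {w_new2, 0.0}};
    expression expr(loss);
    expr.propagate();               // forward pass through the whole network
    expr.backpropagate(trainable);  // gradients flow only toward w_new1 and w_new2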

Benchmarks

Training a classifier on the Iris dataset for 10000 epochs, with the same hyperparameters in each case, we have:

  1. TensorFlow’s neural network: 23812.5 ms
  2. scikit-learn’s neural network: 22412.2 ms
  3. autodiff’s neural network, iterative, with optimization: 25397.2 ms
  4. autodiff’s neural network, iterative, no optimization: 29052.4 ms
  5. autodiff’s neural network, recursive, no optimization: 28121.5 ms

Surprisingly, scikit-learn is the fastest of all. That is probably because we did not do large matrix multiplications. It may also be that TensorFlow requires extra compilation steps, such as variable initialization. Or perhaps we had to run the loop in Python rather than in C (Python loops are really slow!). I am not sure myself. I fully understand that this is by no means a comprehensive benchmark, since it uses a single data point in a specific situation. However, this library’s performance does not yet represent the state of the art, so we encourage readers to work with us to improve it.


Author: Liu Xiaokun

Source: 51CTO