Hello everyone, I am Chen Dihao from the Fourth Paradigm, currently responsible for the architecture and implementation of the Wevin machine learning platform.

Today I am very glad to share the topic "Building an Infrastructure Platform for Machine Learning", which mainly introduces the underlying principles and engineering implementation of machine learning. Further exchanges after the conference are welcome.

Compared with big data, cloud computing, and deep learning, infrastructure is not a very popular concept. Many programmers use MySQL, Django, Spring, and Hadoop to develop business logic from the start of their careers without ever really participating in an infrastructure project. The field of machine learning is similar: many business applications can be built with Caffe, TensorFlow, AWS, or Google CloudML, and frameworks and platforms rise and fall as the industry develops, but the pursuit of highly available, high-performance, flexible, and easy-to-use infrastructure is almost eternal.

In "Why AI Engineers Need to Know Some Architecture", Wang Yonggang of Google pointed out that research is more than knowing algorithms, implementing an algorithm does not equal solving the problem, solving the problem does not equal solving it in production, and architecture knowledge is the common language that lets engineers collaborate efficiently. Google keeps its AI research ahead of the industry thanks to its powerful infrastructure capabilities, and advances in infrastructure are also what make deep learning and automated machine learning possible. In the future, more people will pay attention to underlying architecture and design.

Therefore, today’s topic is to introduce the machine learning infrastructure, including the following aspects:

  1. Layered design of infrastructure;

  2. Numerical computation of machine learning;

  3. Reimplementation of TensorFlow;

  4. Design of distributed machine learning platform.

Imagine how many layers of abstraction we pass through when writing a TensorFlow application on AWS. First, leaving aside the physical servers and network bandwidth, the abstraction of TCP/IP and other protocols lets us operate an AWS virtual machine as if it were a local one. Second, the abstractions of the operating system and the programming language let us write code that follows the Python specification without being aware of physical memory addresses or the system calls that read and write disks. Then we use the TensorFlow library, where we actually only call the top-level Python API; underneath are Protobuf serialization and SWIG cross-language invocation, then gRPC or RDMA communication, and at the lowest level matrix computation performed by the Eigen or CUDA libraries.

Therefore, to achieve decoupling and abstraction between pieces of software, system architects often employ layered architectures that hide low-level implementation details, with each layer serving as the infrastructure for the applications above it.

So how do we survive in a layered world?

One might ask: since others have already implemented the operating system and the programming language, do we still need to care about the underlying implementation details? There is no standard answer to this question; different people reach different conclusions at different times. Here are two examples.

The first is "Sacrificing performance in 99% of cases for the sake of 1%": copy.deepcopy() in the Python standard library. The 1% of cases means that an object may contain references to itself or shared references to sub-objects, so the copy must record every object it has already copied. In 99% of cases objects do not reference themselves, yet for 100% compatibility the library gives up more than 6 times the performance. With a deeper understanding of the Python source code, we can implement a deep copy tailored to our own business objects and avoid these performance issues.
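A minimal sketch of the idea (the Point class and the timing harness are illustrative, not from the original article): when we know our objects contain no self-references, a hand-written copy can skip deepcopy's bookkeeping.

```python
import copy
import time

class Point(object):
    """A plain object that we know contains no self-references."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def fast_copy(self):
        # Safe only because we know the object graph is a tree:
        # no memo of already-copied objects is needed.
        return Point(self.x, self.y)

points = [Point(i, i + 1) for i in range(100000)]

start = time.time()
copied = [copy.deepcopy(p) for p in points]  # generic path keeps a memo dict
print("copy.deepcopy: %.3fs" % (time.time() - start))

start = time.time()
copied = [p.fast_copy() for p in points]     # hand-rolled copy, no bookkeeping
print("fast_copy:     %.3fs" % (time.time() - start))
```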

Another example is "Pluto: a distributed heterogeneous deep learning framework", shared by Yang Jun of Alibaba at the Strata Data Conference. It uses TensorFlow's control_dependencies to place hot and cold data in and out of GPU memory, greatly reducing memory usage while users remain almost unaware of it. For those who know the source code, TensorFlow's dynamic computation graph project, TensorFlow Fold, is also built on control_dependencies, and implementing dynamic computation graphs in a declarative machine learning framework is far from easy. Neither of these techniques appears in the official TensorFlow documentation; only a deep understanding of the source code makes such gains in functionality and performance possible. So if you maintain the TensorFlow framework as enterprise infrastructure, it is necessary to break through the abstraction of TensorFlow's Python API.
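As an aside, here is a minimal sketch of what control_dependencies does in the TensorFlow 1.x API; it illustrates only the ordering primitive, not Pluto's actual memory-swapping logic.

```python
import tensorflow as tf

a = tf.Variable(1.0)
increment_a = tf.assign_add(a, 1.0)

# Force increment_a to run before reading a, even though the read
# does not consume increment_a's output as a data dependency.
with tf.control_dependencies([increment_a]):
    b = tf.identity(a) * 2.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(b))  # a is incremented to 2.0 first, so b is 4.0
```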

Most important of all is the implementation of the machine learning algorithms themselves. Next, we break through the abstractions and take a deeper look at the underlying implementation principles.

Machine learning is essentially a series of numerical computations, which is why TensorFlow positions itself not as a deep learning library but as a numerical computation library. There is no need to panic on hearing terms like Shannon entropy, Bayes' theorem, or back propagation: it is all mathematics, and a computer can be programmed to evaluate it.

Anyone who has been exposed to machine learning knows that LR usually refers to logistic regression or linear regression; the former is a classification algorithm and the latter a regression algorithm. Both LRs have tunable hyperparameters such as the number of epochs, the learning rate, and the optimizer, and implementing them ourselves helps us understand their principles and tuning techniques.

The following is a minimal Python implementation of linear regression. The model is simply y = w * x + b.
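The sketch below is consistent with the gradient formulas discussed next; the training data and hyperparameters are illustrative.

```python
# A pure-Python linear regression y = w * x + b trained by gradient
# descent on the mean squared error.
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated by y = 2 * x + 1

w, b = 0.0, 0.0
learning_rate = 0.01
epoch_number = 10000

for epoch in range(epoch_number):
    grad_w, grad_b = 0.0, 0.0
    for x, y in zip(x_data, y_data):
        # Partial derivatives of the MSE loss (y - w * x - b) ** 2
        grad_w += -2 * x * (y - w * x - b)
        grad_b += -2 * (y - w * x - b)
    w -= learning_rate * grad_w / len(x_data)
    b -= learning_rate * grad_b / len(x_data)

print("w = %f, b = %f" % (w, b))  # converges toward w = 2, b = 1
```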

As you can see from this example, implementing a machine learning algorithm does not depend on libraries like scikit-learn or TensorFlow; it is essentially a numerical algorithm, and only performance differs between languages. Careful readers may ask why the gradient of w here is -2 * x * (y - w * x - b) while the gradient of b is -2 * (y - w * x - b), and how we can be sure that accuracy rises as the loss falls. This is guaranteed mathematically. We define the loss function (mean squared error) as the square of (y - w * x - b), which means the closer the prediction is to y, the smaller the loss; the goal is to find the values of w and b that minimize the loss. Taking the partial derivatives of the loss with respect to w and b yields exactly these two formulas.

If you are interested, the proof of the partial derivatives of MSE under linear regression follows.
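The proof takes only a few lines of the chain rule. For a single sample with loss $L(w, b) = (y - wx - b)^2$:

$$\frac{\partial L}{\partial w} = 2(y - wx - b) \cdot \frac{\partial (y - wx - b)}{\partial w} = -2x\,(y - wx - b)$$

$$\frac{\partial L}{\partial b} = 2(y - wx - b) \cdot \frac{\partial (y - wx - b)}{\partial b} = -2\,(y - wx - b)$$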

Logistic regression is similar to linear regression, but because classification requires normalizing w * x + b into a probability, the sigmoid function is generally applied, which can be written in Python as 1.0 / (1 + numpy.exp(-x)). Since the form of the prediction differs, the definition of the loss function differs too, and so does the numerical formula obtained by taking partial derivatives; those interested can work through that derivation in the same way.
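A minimal sketch of logistic regression under the standard cross-entropy loss, whose gradients reduce to the simple form (prediction - label); the toy data and hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy one-dimensional data: class 0 below x = 2, class 1 above it
x_data = np.array([0.5, 1.0, 1.5, 2.5, 3.0, 3.5])
y_data = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b, learning_rate = 0.0, 0.0, 0.1
for epoch in range(1000):
    prediction = sigmoid(w * x_data + b)
    # Cross-entropy loss gradients: (prediction - label), times x for w
    grad_w = np.mean((prediction - y_data) * x_data)
    grad_b = np.mean(prediction - y_data)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(sigmoid(w * 1.0 + b))  # well below 0.5, i.e. class 0
print(sigmoid(w * 3.0 + b))  # well above 0.5, i.e. class 1
```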

As you can see, the resulting partial derivatives are very simple and can easily be implemented in any programming language. But our own implementation is probably not the most efficient, so why not just use scikit-learn, MXNet, and other open source libraries that already implement these algorithms?

Understanding the algorithm is in fact an important basis for using it in engineering. In real business scenarios, a sample may have billions or even tens of billions of features. From the implementation above, we know that the size of an LR model equals the dimensionality of the sample features, so a model that accepts ten-billion-dimensional features itself has ten billion parameters. Stored as standard single-precision floating-point numbers, the parameters of such a model exceed 40 GB, and a single machine cannot load features of that dimensionality either.
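The arithmetic is easy to verify:

```python
# Back-of-the-envelope size of a ten-billion-dimensional LR model
num_parameters = 10 * 1000 ** 3   # one weight per feature: 1e10 parameters
bytes_per_parameter = 4           # single-precision float
total_gb = num_parameters * bytes_per_parameter / 1000 ** 3
print("%d GB" % total_gb)         # 40 GB of weights alone
```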

Therefore, although scikit-learn implements a high-performance LR algorithm through its native interface, it only supports training on a single machine; MXNet does not support SparseTensor, so its training efficiency on ultra-high-dimensional sparse data is very low; TensorFlow supports SparseTensor and model parallelism and can train ten-billion-dimensional models, but it is not optimized for LR and is not very efficient in this scenario. For this scenario, the Fourth Paradigm implemented an ultra-high-dimensional, high-performance machine learning library based on a Parameter Server, supporting both model parallelism and data parallelism. Only on such a basis can the training efficiency of large-scale LR, GBDT, and other algorithms meet engineering needs.

There are many other interesting algorithms in machine learning, such as decision trees, SVMs, neural networks, and naive Bayes, all of which can be implemented in engineering with only a modest mathematical foundation; space does not permit covering them here. So far we have used an imperative programming interface: we derive the partial-derivative formulas in advance, and the code executes sequentially like any other script. As we know, TensorFlow provides a declarative programming interface that describes a computation graph so that execution can be deferred and optimized, which we cover next.

First of all, you might wonder: do we really need to reimplement TensorFlow? Its flexible programming interface, high-performance computing based on Eigen and CUDA, and support for distributed training and HDFS integration are difficult for an individual or even an enterprise to match, and if we need an imperative programming interface we can use MXNet. There is no strong need for yet another TensorFlow-like framework.

In fact, in the process of learning TensorFlow I implemented a TensorFlow-like project of my own. It not only made me marvel at the exquisite design of TensorFlow's source code and interfaces, but also deepened my understanding of concepts such as declarative programming, DAG execution, automatic differentiation, and back propagation. In benchmark tests, this pure-Python implementation even trained a linear regression model 22 times faster than TensorFlow, though of course only under specific scenarios; the main reason is the overhead of switching between the Python and C++ languages inside TensorFlow.

The project is MiniFlow, a numerical computation library that implements the chain rule and automatic differentiation, supports both imperative and declarative programming, and is compatible with the TensorFlow Python API. Anyone interested can take part in its development at https://github.com/tobegit3hub/miniflow; an API comparison follows.
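As a sketch of what that compatibility means in practice, a script written against the TensorFlow 1.x session API should run with only the import changed (the exact operator coverage is whatever the project currently implements):

```python
import tensorflow as tf
# import miniflow as tf  # drop-in replacement: only the import changes

a = tf.constant(10.0)
b = tf.constant(32.0)
c = tf.add(a, b)

sess = tf.Session()
print(sess.run(c))  # 42.0
```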

Anyone who has read the source code of TensorFlow or MXNet (or NNVM) will notice that both abstract the concepts of Op, Graph, Placeholder, and Variable, and represent a model's computation flow as a DAG. So how do we implement a similar interface?

Unlike the earlier LR code, where we worked out the derivative formulas in advance for a fixed loss function, a graph-based model lets users define their own loss function: they may use the traditional mean squared error or supply an arbitrary mathematical formula. This requires the framework itself to perform automatic differentiation.

Therefore, for Op, the smallest operation a user can compose, the platform must implement basic operators such as ConstantOp, AddOp, and MultipleOp, and user-defined operators must be able to join the automatic differentiation process without affecting the framework's own training loop. Every Op class implements forward() and grad() so that model training can proceed automatically, and overloading Python operators provides a more convenient interface for developers, as the sketch below shows.
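A minimal sketch of such a base class; the forward()/grad() names follow the text, while the wrt parameter and the operator overloads are illustrative rather than MiniFlow's verbatim source.

```python
class Op(object):
    """Base class: every operator implements forward() and grad()."""

    def forward(self):
        """Compute this node's value."""
        raise NotImplementedError

    def grad(self, wrt=None):
        """Compute this node's derivative with respect to `wrt`."""
        raise NotImplementedError

    # Overloading Python operators gives users a natural interface:
    # w * x + b builds AddOp(MultipleOp(w, x), b) automatically.
    def __add__(self, other):
        return AddOp(self, other)

    def __mul__(self, other):
        return MultipleOp(self, other)
```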

For constants (ConstantOp) and variables (VariableOp), the forward operation simply returns the node's own value; the derivative of a constant is 0, while the partial derivative of a variable with respect to itself is 1 and with respect to any other variable is 0. The code is roughly as follows.
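A minimal sketch consistent with that description (illustrative, not the verbatim MiniFlow source):

```python
class ConstantOp(Op):
    def __init__(self, value):
        self.value = value

    def forward(self):
        # A constant's forward pass returns its own value
        return self.value

    def grad(self, wrt=None):
        # The derivative of a constant is always 0
        return 0


class VariableOp(Op):
    def __init__(self, value):
        self.value = value

    def forward(self):
        return self.value

    def grad(self, wrt=None):
        # d(x)/dx = 1; with respect to any other variable it is 0
        return 1 if wrt is None or wrt is self else 0
```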

More importantly, we must implement the forward and backward logic of operators such as AddOp, MinusOp, MultipleOp, DivideOp, and PowerOp. By the chain rule, the derivative of any complex mathematical formula can then be reduced to derivatives of these basic operators.

Take addition and subtraction: we know that the derivative of a sum equals the sum of the derivatives, and from this mathematical principle we can easily implement AddOp; MinusOp is similar and is not repeated here.
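A sketch of AddOp under the same conventions:

```python
class AddOp(Op):
    def __init__(self, op1, op2):
        self.op1, self.op2 = op1, op2

    def forward(self):
        return self.op1.forward() + self.op2.forward()

    def grad(self, wrt=None):
        # The derivative of a sum is the sum of the derivatives
        return self.op1.grad(wrt) + self.op2.grad(wrt)
```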

Multiplication and division are more complicated, since the derivative of a product is clearly not the product of the derivatives. Take x and x²: their product is x³, whose derivative is 3x², while the product of their derivatives is 1 · 2x = 2x. So we need the product rule, whose basic formula is (f · g)' = f' · g + f · g', and the code is as follows.
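A sketch of MultipleOp under the same conventions, followed by a small usage check:

```python
class MultipleOp(Op):
    def __init__(self, op1, op2):
        self.op1, self.op2 = op1, op2

    def forward(self):
        return self.op1.forward() * self.op2.forward()

    def grad(self, wrt=None):
        # Product rule: (f * g)' = f' * g + f * g'
        return (self.op1.grad(wrt) * self.op2.forward() +
                self.op1.forward() * self.op2.grad(wrt))


# Usage: d(x * x)/dx = 2x
x = VariableOp(3.0)
y = x * x           # builds MultipleOp(x, x) via operator overloading
print(y.forward())  # 9.0
print(y.grad(x))    # 6.0, i.e. 2 * 3.0
```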

DivideOp and PowerOp work in a similar way: mathematics guarantees that we only need to code the basic forward and backward operations. Space does not permit a detailed walkthrough of MiniFlow's source code here; the complete implementation is available through the GitHub link above. Benchmark results using the same API are given below: MiniFlow performs better in scenarios that process small amounts of data and would otherwise frequently switch between the Python and C++ environments.

We have now introduced machine learning algorithms and gone deep into the libraries that implement them. Not everyone has the ability to rewrite or optimize these layers, and much of the time we are simply users of these algorithms. From another angle, however, we all need to maintain a highly available computing platform for machine learning training and prediction. The following describes how to build a distributed machine learning platform.

With the development of big data and cloud computing, a highly available, distributed machine learning platform has become a basic requirement. Caffe, TensorFlow, and our own high-performance machine learning libraries only solve the problems of numerical computation, algorithm implementation, and model training; task isolation, scheduling, and failover all need to be implemented by the platform above them.

So what are the capabilities needed to design an infrastructure platform for the whole process of machine learning?

First, resource isolation must be implemented. In a cluster that shares underlying computing resources, a user's training task should not be affected by other tasks, and CPU, memory, and GPU resources should be isolated as much as possible. If a Hadoop or Spark cluster is used, task processes are placed in cgroups by default to guarantee CPU and memory isolation; as container technologies such as Docker mature, we can also use projects like Kubernetes and Mesos to launch and manage users' model training tasks.

Second, resource scheduling and sharing must be realized. As GPUs spread into general-purpose computing, more and more scheduling tools support GPU scheduling. In some enterprises, however, GPU cards are still assigned exclusively, so resources cannot be dynamically scheduled or shared, which inevitably wastes computing resources severely. When designing a machine learning platform, common cluster-sharing scenarios should be considered as much as possible, such as supporting model training, model storage, and model serving on the same cluster; the typical example is Google's Borg system.

Next, the platform needs flexible compatibility. Machine learning is developing rapidly, and more and more frameworks target different scenarios; a flexible platform architecture can be compatible with almost all mainstream frameworks, avoiding frequent infrastructure changes as the business evolves. Docker is currently a very suitable container format specification: by writing a Dockerfile we can describe a framework's runtime environment and system dependencies, and on this basis the platform can support TensorFlow, MXNet, Theano, CNTK, Torch, Caffe, Keras, scikit-learn, XGBoost, PaddlePaddle, Gym, Neon, Chainer, PyTorch, Deeplearning4j, Lasagne, DSSTNE, H2O, GraphLab, MiniFlow, and so on.
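For illustration, a minimal sketch of such a Dockerfile; the base image, versions, and entry point are assumptions, not the platform's actual configuration.

```dockerfile
# Describe one framework's runtime environment and dependencies
FROM ubuntu:16.04

RUN apt-get update && \
    apt-get install -y python python-pip && \
    pip install tensorflow==1.0.0

# The user's training script becomes the container's entry point
COPY train.py /app/train.py
CMD ["python", "/app/train.py"]
```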

Finally, API services for machine learning scenarios need to be implemented. Around the three main workflows of machine learning (model development, model training, and model serving), we can define APIs for submitting a training task, creating a development environment, starting a model service, and submitting an offline prediction task, and implement the Web service interface in a familiar programming language. These are essentially the steps for implementing a Google CloudML-like cloud deep learning platform.
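A hypothetical sketch of one such endpoint using Flask; the route, fields, and in-memory job store are invented for illustration and are not the Wevin platform's real API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = []

@app.route("/api/v1/train_jobs", methods=["POST"])
def submit_train_job():
    job = request.get_json()
    # A real platform would validate the request, pull the framework's
    # Docker image, and schedule the task on Kubernetes or Mesos
    job["id"] = len(jobs)
    job["state"] = "submitted"
    jobs.append(job)
    return jsonify(job), 201

@app.route("/api/v1/train_jobs", methods=["GET"])
def list_train_jobs():
    return jsonify(jobs)

if __name__ == "__main__":
    app.run()
```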

Of course, to implement a machine learning platform that also covers data import, data processing, feature engineering, and model evaluation, we need to integrate big data tools such as HDFS, Spark, and Hive, and introduce workflow management tools such as Azkaban and Oozie to achieve ease of use and a low barrier to entry.

To sum up, machine learning infrastructure consists of machine learning algorithms, machine learning libraries, and machine learning platforms. Depending on business needs, we can pick specific layers for in-depth research and secondary development: using existing wheels is as important as adapting the wheel to our needs.

Today, when machine learning and artificial intelligence are so popular, I hope you will also pay attention to the underlying infrastructure. Algorithm researchers can learn more about engineering design and implementation, and R&D engineers can learn more about the principles and optimization of algorithms, so that machine learning delivers greater benefit on a suitable infrastructure platform in real application scenarios.

That’s all for today’s sharing, thank you very much 🙂

Dihao Chen is the platform architect of Wevin at the Fourth Paradigm. He previously worked as an infrastructure R&D engineer at Xiaomi Technology and UnitedStack. He is active in open source communities such as Kubernetes and TensorFlow, has implemented the Cloud Machine Learning cloud deep learning platform, and his GitHub account is at https://github.com/tobegit3hub.



The Fourth Paradigm's Wevin platform is the first mature commercial full-process artificial intelligence platform in China: it solves distributed machine learning modeling while also addressing application-layer problems, ultimately lowering the threshold for commercial AI applications. A trial version of the Wevin platform is now officially open to the public; you are welcome to scan the QR code to register for free.