TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

This paper introduces the key components and overall process of building a general-purpose machine learning platform on top of TensorFlow, without going into deep implementation detail.

This is a KDD 2017 Applied Data Science paper. The authors are all from Google, and there are many of them; in my experience that usually means they are the main developers of the platform. The platform described in the paper is called TensorFlow Extended (TFX), meaning that its core components are implemented with TensorFlow; it can be understood as a system built on top of TensorFlow. So what is the difference between TFX and TF? As I understand it, TF provides a set of functions that can be used for model training and related work, while TFX is a complete system built on TF that covers the whole machine learning lifecycle: it includes the machine-learning functionality TF already provides, such as model training, and adds important capabilities such as data analysis and validation, model warm-starting, online serving, and model release.

The point of this paper is not to teach you step by step how to build a TFX system, but to explain its main components, the key points to consider when implementing them, and the lessons the team learned along the way. In other words, it is more of a "teach a man to fish" paper. For anyone who wants to build a general machine learning platform like TFX, it offers important guidance. Below I excerpt and interpret the parts I find most valuable.

Overall design principles

The core design principles of TFX include the following:

  • Build a unified platform that can serve multiple learning tasks. This requires the system to be sufficiently general and extensible.
  • Support continuous training and serving. These two things sound simple, but once you consider the details of risk control and automated problem detection, they are not.
  • Support manual intervention. Gracefully involving people in the process to solve problems that machines cannot is also a challenge.
  • Reliability and stability. This means not only that the service does not crash, but also that it remains stable and reliable when problems occur at the data level.

Data analysis, transformation and validation

Data is the core of a machine learning system; how the data is processed determines the quality of the whole model. In this part, the authors cover the key points of data analysis, transformation, and validation.

Data analysis

Data analysis refers to automatic statistical analysis of the data entering the system, such as the distribution of feature values, the presence of features across samples, the number of features per sample, and so on. It also supports statistics over slices of the data, for example statistics on positive versus negative samples, or on data from different countries. One difficulty is that, given the data volume and the timeliness requirements, the exact values of many statistics are hard to compute, so streaming approximation algorithms are often used to obtain good approximate values.
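The paper does not spell out which approximation algorithms are used, but the flavor is single-pass computation over a stream of values. As a purely illustrative sketch (not TFX's implementation), the following combines Welford's online mean/variance update with reservoir sampling for approximate quantiles:

```python
import random

class StreamingStats:
    """Single-pass statistics over a stream of numeric values: exact
    count/mean/variance via Welford's algorithm, plus a fixed-size
    reservoir sample for approximate quantiles."""

    def __init__(self, reservoir_size=1000):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                 # running sum of squared deviations
        self.reservoir = []           # uniform random sample of the stream
        self.reservoir_size = reservoir_size

    def update(self, x):
        # Welford's online update for mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        # Reservoir sampling: keep each new element with probability k/n.
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(x)
        elif random.randrange(self.n) < self.reservoir_size:
            self.reservoir[random.randrange(self.reservoir_size)] = x

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def approx_quantile(self, q):
        ordered = sorted(self.reservoir)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]
```

A real production pipeline would use distributed, mergeable sketches for things like quantiles and top-k values, but the single-pass structure is the same.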

Data transformation

Data transformation refers to the various transformations from raw data into trainable features, such as discretization, feature-to-ID mapping, and so on. The paper also discusses some details of handling sparse features.

An important point in this section is ensuring that data transformations are consistent between training and serving; inconsistency here often leads to poor model performance. TFX's approach is to avoid such inconsistencies by exporting the data transformations as part of the model. In other words, consistency is ensured by reusing the same code in the training phase and the serving phase, rather than implementing the logic twice. In my experience, failing to do so not only leads to inconsistent feature transformations but also increases development effort and the effort of verifying correctness.
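As a hedged illustration of "the transformation is part of the model", here is a sketch using present-day Keras preprocessing layers (not necessarily the mechanism TFX used internally; the feature names, bucket boundaries, and vocabulary are invented):

```python
import tensorflow as tf

# Raw inputs exactly as they arrive at both training and serving time.
age = tf.keras.Input(shape=(1,), name="age", dtype=tf.float32)
country = tf.keras.Input(shape=(1,), name="country", dtype=tf.string)

# Discretization: 4 boundaries -> 5 buckets.
age_bucket = tf.keras.layers.Discretization(
    bin_boundaries=[18.0, 30.0, 45.0, 60.0])(age)
# Feature mapping: 3 known values + 1 out-of-vocabulary index -> 4 ids.
country_id = tf.keras.layers.StringLookup(vocabulary=["US", "CN", "DE"])(country)

# One-hot encode the integer ids so the dense layer can consume them.
age_onehot = tf.keras.layers.CategoryEncoding(
    num_tokens=5, output_mode="one_hot")(age_bucket)
country_onehot = tf.keras.layers.CategoryEncoding(
    num_tokens=4, output_mode="one_hot")(country_id)

features = tf.keras.layers.Concatenate()([age_onehot, country_onehot])
output = tf.keras.layers.Dense(1, activation="sigmoid")(features)
model = tf.keras.Model(inputs=[age, country], outputs=output)

# Exporting the model exports the transformations with it, so the serving
# side receives raw "age"/"country" values and never re-implements the logic.
tf.saved_model.save(model, "exported_model")
```

Because the bucketization and vocabulary lookup live inside the exported graph, there is no second implementation to drift out of sync with training.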

Data validation

Data validation refers to checking whether the data entering the system meets expectations and whether there are anomalies. TFX introduces a data schema that specifies constraints on the data, such as the type of each feature, whether it is required, its minimum and maximum values, and so on. The purpose is to prevent data that does not meet expectations from entering the model training stage and hurting model quality. With this schema, the data entering the system can be validated; going further, the system can also make suggestions. In particular, when the data itself has changed and the original constraints should change with it, the system can automatically detect the likely change and report it to developers, who then decide whether to accept the suggested schema update. The paper includes a figure showing an example of schema-based validation, with the suggested changes highlighted in red.
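TFX's data validation component was later released as the open-source TensorFlow Data Validation (TFDV) library. The sketch below shows the general shape of schema-based validation with it; the column names and toy data are invented, and the exact anomalies reported depend on the inferred schema:

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Toy training and serving data (invented for illustration).
train_df = pd.DataFrame({"age": [23, 41, 35, 29],
                         "country": ["US", "CN", "DE", "US"]})
serving_df = pd.DataFrame({"age": [31, None], "country": ["US", "FR"]})

# Compute statistics over the training data and infer an initial schema
# (feature types, presence, value domains) from them.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Validate new data against the schema; the returned anomalies describe
# issues such as missing values or values outside the known string domain.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(serving_stats, schema=schema)
print(anomalies)
```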

The constraints mentioned above can be extended, but the authors caution that overly complex constraints are hard to specify well and hard to maintain when the data changes. To make data validation serve the system well and be easy for users to work with, they follow some core design principles:

  • Users should be able to see at a glance what went wrong and how it affects them.
  • Anomalies should be straightforward, so the user knows how to handle them. For example, it is fine to say that a feature value is outside a certain range, but not that the KL divergence between two distributions exceeds a threshold.
  • Give new schema suggestions that account for the natural evolution of the data. A lot of data changes over time, and that has to be taken into account.
  • Encourage users to treat data anomalies like bugs. TFX therefore allows data anomalies to be logged, tracked, and resolved the way bugs are.

Users can also track how anomalies change over time to find room for improvement in feature engineering.

Model training

There is not much to elaborate on in the model training part, which mainly uses TensorFlow for training. What is worth mentioning is model warm-starting. Warm-starting addresses the problem that a model needs a long time of training to converge: a previously trained model can supply the weights of some common features as the initial state of the new model, which makes it converge faster and thus speeds up training. To make this logic general, TFX abstracted it and open-sourced it (probably as part of TF; I have not verified this).
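The paper does not show code for this, but the idea can be sketched in present-day Keras as copying the expensive, shared parameters (for example an embedding table) from the previous production model into a freshly initialized one. All names, shapes, and the checkpoint handling here are illustrative assumptions:

```python
import tensorflow as tf

def build_model(vocab_size=10000, embed_dim=32):
    """Toy model: a large embedding table plus a small dense head."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, name="item_embedding"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid", name="head"),
    ])

# The previously trained model; in practice it would be loaded from the last
# production checkpoint rather than built fresh as it is here.
previous_model = build_model()
previous_model.build(input_shape=(None, 20))

# Fresh model for the new training run. Warm-start it by copying only the
# shared, slow-to-learn weights (the embedding table); the head keeps its
# random initialization and is trained from scratch.
new_model = build_model()
new_model.build(input_shape=(None, 20))
new_model.get_layer("item_embedding").set_weights(
    previous_model.get_layer("item_embedding").get_weights())
```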

Model evaluation and validation

A machine learning system is a complex system with many components, which means bugs can occur in many places; in many cases these bugs do not crash the system and are hard to find, so the model needs to be evaluated and validated.

Defining a "good" model

The authors define a good model as one that can be served safely and has the desired prediction quality. Serving safely means the model does not break the serving system for reasons such as using too many resources or producing data in the wrong format. Prediction quality refers to the accuracy of the model's predictions, which is closely tied to business impact.

Sensitivity of validation

One of the challenges of model validation is choosing the sensitivity of the checks. If they are too sensitive, small fluctuations in the data trigger alarms; alarms then become so frequent that people start ignoring them. If they are not sensitive enough, real problems are missed. The authors' experience is that when a model genuinely goes wrong, it usually causes large changes across many metrics, so the sensitivity can be set fairly coarse. Of course, this is business-dependent, so decide according to your own situation.
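As an illustration of what such a coarse-grained gate might look like (the metric, the threshold, and the function are made up for this article, not TFX's actual check):

```python
def should_push(candidate_auc: float, serving_auc: float,
                max_relative_drop: float = 0.01) -> bool:
    """Allow the candidate to replace the serving model only if its metric
    does not regress by more than a fairly generous relative threshold."""
    if serving_auc <= 0:
        return True  # no meaningful baseline yet
    relative_drop = (serving_auc - candidate_auc) / serving_auc
    return relative_drop <= max_relative_drop

# A small fluctuation passes, a large regression is blocked.
print(should_push(candidate_auc=0.796, serving_auc=0.800))  # True
print(should_push(candidate_auc=0.700, serving_auc=0.800))  # False
```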

Slice validation

In addition to validating the model as a whole, it is sometimes necessary to validate it on a particular slice of the data, for example only on male users. This is very useful for fine-grained evaluation and optimization of the model.
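A minimal illustration of slice-based evaluation (the data, slice key, and metric are invented, not taken from the paper):

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """examples: list of dicts with 'label', 'prediction', and slice features.
    Returns a per-slice accuracy, computed by grouping on slice_key."""
    grouped = defaultdict(list)
    for ex in examples:
        grouped[ex[slice_key]].append(ex)
    return {
        value: sum(int(ex["label"] == round(ex["prediction"])) for ex in exs) / len(exs)
        for value, exs in grouped.items()
    }

examples = [
    {"label": 1, "prediction": 0.9, "gender": "male"},
    {"label": 0, "prediction": 0.2, "gender": "male"},
    {"label": 1, "prediction": 0.4, "gender": "female"},
]
print(accuracy_by_slice(examples, "gender"))  # {'male': 1.0, 'female': 0.0}
```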

Other

The later part of the paper covers serving-level performance optimizations and the application to Google Play, which I will skip here. Readers interested in those details can look up the original paper.