Machine learning, one of the hottest technologies of recent years, is not only behind many well-known "artificial intelligence" products but also fundamentally improves traditional Internet products. At the recent ArchSummit 2018 Global Architect Summit, Yuan Kai, Chief Data Architect of Getui, shared "Design and Construction of a Data Platform for Machine Learning", drawing on his years of experience in data platform construction and data product development.

I. Background: Machine learning application scenarios in Getui's business

Getui is an independent intelligent big data service provider whose main business covers developer services, precision marketing services, and big data services in various vertical fields. Machine learning is involved in many of these businesses and products:

1. Intelligent push notifications based on accurate user profiles. User tags are largely produced by machine learning: models are trained and then used to predict and classify user groups;

2. Advertising crowd orientation;

3. Visitor flow forecasting for commercial districts;

4. Fake devices are common in mobile development, and machine learning helps developers identify whether new users are genuine;

5. Personalized content recommendation;

6. Churn and retention cycle prediction.

II. The machine learning workflow



1. The original data is processed by ETL and stored in the data warehouse.

2. The blue part above represents machine learning: first the sample data is matched against our own data, then that data is processed to generate features, a step known as feature engineering. Based on these features, appropriate algorithms are selected to train a model, and finally the model is applied to the full data set to output predictions.

The standard machine learning workflow: we take a specific business problem and turn it into a data problem, or evaluate whether it can be solved with data at all. Once the data has been imported and filtered, we analyze it against the business problem and goals, and reprocess it as needed.

Next comes feature engineering: we identify the variables in the data that are related to the target, construct or derive features from them, and eliminate meaningless ones. Roughly 80% of our time goes into feature engineering. Once features are selected, we train the model with algorithms such as logistic regression or RNNs. The model then needs to be validated to determine whether it meets the goal. If it does not, the data may simply be unrelated to the target and need to be collected again, or our earlier exploration may have been inadequate, in which case we re-explore the existing data and redo the feature engineering. If the final model meets business expectations, we apply it to the business line. A minimal sketch of this train/validate/apply loop is shown below.
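To make the loop concrete, here is a minimal sketch in Python using scikit-learn with logistic regression (one of the algorithms mentioned above). The file names and column names are illustrative assumptions, not Getui's actual data.

```python
# Minimal sketch of the train / validate / apply loop described above.
# File paths and the "label" column are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Sample data produced by matching labelled samples against our own data
samples = pd.read_csv("matched_samples.csv")
X, y = samples.drop(columns=["label"]), samples["label"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature engineering is reduced to simple scaling here; in practice it dominates the effort
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Validate against the business goal (e.g. a minimum AUC) before going any further
auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
print(f"validation AUC: {auc:.3f}")

# If the model meets expectations, apply it to the full data set
full_data = pd.read_csv("full_population.csv")
full_data["score"] = model.predict_proba(full_data[X.columns])[:, 1]
```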



III. Common problems in bringing machine learning projects to production

Although the process above looks clear, many problems arise in concrete implementation. Here I will share some of our past practice and experience.

1. Most companies have now entered the big data era. Compared with the earlier small-data stage, the skills required of our modelers and algorithm experts in machine learning and data mining are higher, and the work has become harder.

In the past, data preprocessing, data analysis, machine learning itself, and deployment could all be completed on a single machine. With large data volumes, however, we may need to rely on the Hadoop ecosystem.

2. Supervised learning often requires sample matching. The data warehouse may hold trillions of records, so the extraction cycle is very long and a lot of time is spent simply waiting for the data to be pulled.

3. In most cases, each business line is mined by only one or two algorithm engineers, so it is common for different groups to use inconsistent modeling tools or irregular implementation processes. This disunity leads to a lot of duplicated code, and the modeling process never gets properly consolidated within the team.

4. Many machine learning algorithm engineers have limitations rooted in their professional background and may be relatively weak in engineering awareness and experience. The common practice is that algorithm engineers write the feature generation and training code in the experimental stage and hand it over to the engineering developers, but that code cannot run on the full data volume, so the engineers re-implement it to make it highly available and efficient. Even then the translation is often imperfect, resulting in high communication costs and long lead times to go live.

5. One of the big challenges in machine learning is working with the data itself, which is very expensive because so much time goes into exploring it.

6. Multiple Getui business lines use machine learning, but their efforts are not unified, which leads to repeated development and a lack of a platform for accumulating and sharing results. As a consequence, some of the more useful derived features have never been widely reused.

IV. Getui's solutions to these machine learning problems

Let’s start with the goals of our platform:

First, we want the internal modeling process to be normalized.

Second, we want to provide an end-to-end solution that covers the entire process from model development to live application.

Third, we want the platform's data, especially the feature data that has been developed, to be operable and shareable among different teams within the company.

Fourth, the platform is not aimed at machine learning beginners; it is intended to improve the modeling efficiency of expert and semi-expert algorithm engineers. At the same time, it should support multi-tenancy to ensure data security.

Here is our overall design, broken down into two main parts:



The lower half is the modeling platform, also called the experimental platform, which is mainly used by algorithm engineers. The modeling platform includes:

1. An IDE in which to carry out data exploration and data experiments, with support for project management and sharing.

2. Management of the feature data that has been developed, so that all platform users can see these data assets.

3. During sample matching, the sample ID may differ from our internal ID, so a unified ID mapping service is required (see the sketch after this list).

4. Helping algorithm engineers extract data quickly from trillions of records is also very important.

5. Beyond the basic algorithms, machine learning involves a lot of repeated or similar code, which we need to encapsulate into reusable functions.

6. Support for packaging and deploying model services.

7. Support for model version management.

8. Real-time monitoring of models in actual business use, tracking availability, accuracy, and other metrics.
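As a rough illustration of items 3 and 4, the following PySpark sketch matches externally supplied sample IDs to internal IDs through a mapping table and then extracts only the matched feature rows. The table and column names (dw.id_mapping, dw.features, external_id, internal_id) are hypothetical, not Getui's actual schema.

```python
# Sketch of sample matching and fast extraction with PySpark (items 3 and 4 above).
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample_matching").enableHiveSupport().getOrCreate()

# Labelled samples supplied by the business side, keyed by an external ID
samples = spark.read.csv("hdfs:///tmp/samples.csv", header=True)

# Unified ID mapping table maintained by the platform
id_map = spark.table("dw.id_mapping")      # columns: external_id, internal_id

# Resolve external IDs to internal IDs, then pull only the matched feature rows
matched = samples.join(id_map, samples["external_id"] == id_map["external_id"], "inner")
features = spark.table("dw.features")      # wide feature table in the warehouse
training_set = (matched.select("internal_id", "label")
                       .join(features, on="internal_id", how="left"))

training_set.write.mode("overwrite").saveAsTable("ml_tmp.training_set")
```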

The upper half is the production environment, running the data processing pipeline and interfacing with the data modeling platform.


In the production environment, the feature data consumed by the models falls into two types:

One type is real-time feature data: real-time data collection generates real-time features, which are stored in different clusters according to different business requirements.

The other type is offline feature data, which, after processing, is stored in Hive for the model application side to use.

In production, we can provide online prediction APIs or offline prediction data for the business lines to use. A sketch of the offline path follows.
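Below is a hedged sketch of the offline path: features are read from Hive, scored in batch, and the prediction data is written back for the business line. The table names and the model file are illustrative assumptions.

```python
# Sketch of the offline prediction path: features from Hive, batch scoring,
# results written back for the business line. Table names and model file are assumptions.
import joblib
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offline_predict").enableHiveSupport().getOrCreate()

features_df = spark.table("dw.offline_features")   # offline feature data prepared in Hive
model = joblib.load("model.pkl")                   # model trained on the modeling platform

# For brevity the features are pulled to the driver; a real job would score
# in a distributed way (for example with a pandas UDF) over the full data volume.
features = features_df.toPandas()
feature_cols = [c for c in features.columns if c != "internal_id"]
features["score"] = model.predict_proba(features[feature_cols])[:, 1]

# Offline prediction data for the business line to consume
spark.createDataFrame(features[["internal_id", "score"]]) \
     .write.mode("overwrite").saveAsTable("ml.offline_predictions")
```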

V. Specific points from putting the plan into practice

First, let’s talk about Jupyter:

The benefits of choosing Jupyter as the primary modeling IDE, rather than building a custom visual drag-and-drop modeling tool, are interactive analysis, high modeling efficiency, easy extensibility, and low development cost. Visual drag-and-drop platforms such as Microsoft Azure do make the whole process very clear and help beginners get started quickly, but our target users are experts and semi-experts, so Jupyter was the more appropriate choice.

To support multi-tenancy we run Jupyter through JupyterHub. The underlying machine learning frameworks are TensorFlow, PySpark, scikit-learn, and so on. For data exploration and processing, sparkmagic makes it very easy to run Spark code written in Jupyter on the Spark cluster, as in the rough sketch below.
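Here is a rough sketch of how sparkmagic is typically used from notebook cells; exact flags vary by sparkmagic version, and the Livy endpoint shown is an assumption.

```python
# Notebook cells, not a single script. Flags may differ across sparkmagic versions,
# and the Livy endpoint is an assumption.

# Cell 1: load the sparkmagic extension in an IPython kernel
%load_ext sparkmagic.magics

# Cell 2: register a remote PySpark session backed by a Livy server in front of the cluster
%spark add -s exploration -l python -u http://livy-server:8998

# Cell 3: everything in a %%spark cell now runs on the Spark cluster, not the notebook host
%%spark -s exploration
df = spark.table("dw.features")
df.groupBy("province").count().show()
```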

Jupyter has no ready-made version control or project management, so we solved this by integrating Git.

In addition, to improve modelers' efficiency in Jupyter, we introduced further plug-ins. For example, we turned typical mining pipelines into Jupyter templates, so that for a similar business need you only have to extend the template. This addresses the lack of standardization, avoids a lot of duplicated code, and lays a good foundation for turning experimental code into production code.

Second, the utility functions:

We provide the main machine learning-related libraries and tools internally:

1) Standardized ID Mapping service API.

2) A data extraction API that works regardless of the underlying storage type, which analysts can call directly (a hypothetical sketch follows this list).

3) Standardized visualization function libraries and utility classes.

4) Jupyter2AzkabanFlow: code or scripts originally written in Jupyter can be automatically converted into an Azkaban flow, which solves the problem of code reuse in the feature engineering stage.
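As an illustration of what the storage-agnostic extraction API in item 2 might look like, here is a hypothetical sketch; the function name, parameters, and source types are invented for illustration and are not Getui's actual library.

```python
# Hypothetical sketch of a storage-agnostic extraction API (item 2 above);
# names and source types are invented for illustration.
from pyspark.sql import SparkSession, DataFrame

def extract(source: str, query: str, spark: SparkSession = None) -> DataFrame:
    """Return a Spark DataFrame regardless of where the data lives."""
    spark = spark or SparkSession.builder.enableHiveSupport().getOrCreate()
    if source == "hive":
        return spark.sql(query)                # query is HiveQL
    if source == "parquet":
        return spark.read.parquet(query)       # query is an HDFS path
    if source == "jdbc":
        url, table = query.split("::")         # e.g. "jdbc:mysql://host/db::table_name"
        return (spark.read.format("jdbc")
                     .option("url", url)
                     .option("dbtable", table)
                     .load())
    raise ValueError(f"unknown source type: {source}")

# Usage: analysts call the same entry point whatever the backing store is
# df = extract("hive", "SELECT internal_id, age, province FROM dw.profile")
```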

Third, about using Tensorflow:

For TensorFlow we chose TensorFlowOnSpark. Native TensorFlow's distributed support is not good enough: node information has to be specified manually, which makes it hard to use.

TensorFlowOnSpark solves the cluster management problem of native distributed TensorFlow, and existing code can be migrated to it easily.

YARN supports mixed GPU and CPU clusters, which facilitates resource reuse. A rough launch sketch is shown below.
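For reference, here is a rough launch sketch modelled on TensorFlowOnSpark's published examples; the exact arguments of TFCluster.run can differ between versions, and the training function body is omitted.

```python
# Rough launch sketch modelled on TensorFlowOnSpark's published examples;
# TFCluster.run arguments can differ between versions, and the training body is omitted.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Runs on every executor; ctx carries the TensorFlow role (ps / worker) and task index.
    import tensorflow as tf
    # ... build and train the model here, using ctx.job_name and ctx.task_index ...

sc = SparkContext(conf=SparkConf().setAppName("tfos_sketch"))
num_executors = 4
num_ps = 1   # parameter servers; note the flexibility limits mentioned in the experiences below

cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        tensorboard=False, input_mode=TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()
```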


Fourth, about delivering models to applications:

For model delivery, we turned the entire prediction code into frameworks, providing a set of standard options for analysts to choose from directly. The output model file format is constrained: only PMML or TensorFlow PB can be selected, for example. With this standardization, the modeler's work is decoupled from the system developer's, who simply uses a standard library of prediction functions.
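As one hedged example of producing such a standardized artifact, the sklearn2pmml package can export a scikit-learn pipeline to PMML (it invokes the JPMML converter, so a Java runtime is required). This is a sketch of the general approach, not Getui's actual delivery code.

```python
# Sketch: exporting a scikit-learn pipeline to PMML with sklearn2pmml
# (it invokes the JPMML converter, so a Java runtime is required).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

pipeline = PMMLPipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# The resulting .pmml file is the artifact handed to the standard prediction library
sklearn2pmml(pipeline, "example_model.pmml")
```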

Finally, let’s share some of our experiences:

First, TensorFlowOnSpark limits the number of parameter servers (PS), and resource allocation between worker and PS nodes is not very flexible.

Second, Jupyter needed some customization in practice, and there were version compatibility problems with some open source libraries.

Third, PMML has performance bottlenecks: some come from the repeated construction of Java objects, some from format conversion overhead. Capturing JVM information helps analyze and optimize the specifics.

Fourth, when Spark and Hive are used in production, easy-to-use diagnostic tools are needed: modelers are not Spark or Hive experts and may not know how to diagnose and optimize problems.

Fifth, we should treat the models and the feature library as assets, evaluate their value regularly, and manage their life cycle well.

Sixth, there are lower-level concerns as well; for example, hardware selection should balance bandwidth, memory, and GPUs.

Finally, the cost of growing and maintaining the technology stack must be balanced; introducing too many new tools and technologies can create operations and maintenance difficulties.

These are some of my experiences with machine learning, and I hope they are helpful to you. You are also welcome to discuss related issues with me in the comments section!