This article is based on Dihao Chen’s talk at “OpenMLDB Meetup No.1”.

Quickly Build a Full-Process Online AI Application Based on OpenMLDB v0.4.0

Since the beginning of the project, OpenMLDB has included many performance optimizations, such as LLVM-based JIT compilation, which generates optimized code for different CPU architectures, whether Linux or macOS servers, or even Apple computers with the latest ARM-based M1 chips. OpenMLDB can be optimized for each of these scenarios.

As mentioned earlier, OpenMLDB can outperform Spark by 10x or more in some scenarios. We have made many code optimizations on top of Spark, including window-skew optimization and window-parallel optimization, which open-source Spark does not support. We even modified the Spark source code to achieve these customized performance optimizations for AI scenarios.

OpenMLDB also optimizes storage. Traditional database services are mostly file-based, and storage built on B+ tree data structures is not well suited to high-performance online AI applications, which often require optimizations for time-series characteristics. We implemented a multi-level skip-list data structure organized by partition keys and sort keys, which further improves OpenMLDB's read and write performance on time-series data.

We recently released OpenMLDB version 0.4.0, which has a number of performance and functionality improvements. In this article, we introduce some of the new features of OpenMLDB version 0.4.0, and how to quickly build a full-process online AI application based on this new version.

First of all, let me briefly introduce myself. My name is Chen Dihao, and I am currently a platform architect at 4Paradigm. I am a core developer and PMC member of the OpenMLDB project, currently focusing on distributed systems and database design.


Today I will introduce three aspects to you:

  1. The new full-process features of OpenMLDB 0.4.0
  2. Quick start with the OpenMLDB 0.4.0 standalone and cluster versions
  3. How to quickly build a full-process online AI application with OpenMLDB

【 01 | Introduction to the new full-process features of OpenMLDB 0.4.0 】

OpenMLDB 0.4.0 full-process feature 1 | Unified online and offline storage

The first new feature is unified online and offline storage for OpenMLDB, that is, a consistent table view. The upper right corner of the slide shows the information of our old table definition.

In 0.4.0 we added a unified table view, which brings offline storage and online storage together. In the lower right corner, an offline table is added under the definition of the regular table. Its information includes the offline storage path, the data format of the offline storage, whether it is a deep copy, and other attributes.

Unified storage is a design rarely seen among databases in the industry. Offline tables and online tables share one set of table names and schemas, one set of index information, and one SQL parsing engine. We compile SQL with a parsing engine implemented in C++, and the two storage engines share the same data import and export flow. In other words, both offline and online tables can use the same SQL statements to import and export data. The only difference is that offline and online tables have independent persistent storage: the online storage is the fully in-memory, high-performance multi-level skip-list storage just mentioned, while offline we support local file storage and distributed storage such as HDFS, meeting the different requirements of offline and online scenarios.
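As a minimal sketch of this shared import flow (the table name and file path are made up for illustration, and option details may vary by version), the same LOAD DATA statement serves both storage engines; only the session's execution mode decides where the data lands:

```sql
-- Import into offline storage (e.g. local files or HDFS)
SET @@session.execute_mode = 'offline';
LOAD DATA INFILE '/data/train.csv' INTO TABLE t1;

-- The identical statement in online mode fills the in-memory storage
SET @@session.execute_mode = 'online';
LOAD DATA INFILE '/data/train.csv' INTO TABLE t1;
```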

OpenMLDB 0.4.0 full-process feature 2 | Highly available offline task management

The second new feature is a highly available offline task management service called the TaskManager service.

The highly available offline task management service supports Spark task management on local or YARN clusters. By extending the SQL syntax, it supports managing tasks with SQL statements such as SHOW JOBS, SHOW JOB, and STOP JOB. It supports a variety of built-in data tasks, including importing online data, importing offline data, and exporting offline data.
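Job management from the SQL client looks roughly like this (the job ID is hypothetical):

```sql
SHOW JOBS;    -- list all submitted offline tasks and their states
SHOW JOB 1;   -- inspect the task with ID 1
STOP JOB 1;   -- stop the task with ID 1
```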

OpenMLDB 0.4.0 full-process feature 3 | End-to-end AI workflow

The third important feature is the implementation of a true end-to-end AI workflow, which can be driven from an SDK or from the CLI.

The list on the left shows the 8 steps of implementing an end-to-end AI application: database creation, table creation, offline data import, offline feature extraction, model training with a machine learning framework, SQL online deployment, online data import, and online feature serving.

From steps 1 through 8, almost every step can be implemented via OpenMLDB’s SDK or command line:

  • Creating a database and creating a table use standard SQL statements, which are fully supported;
  • Data import uses the LOAD DATA INFILE syntax that databases such as SQL Server and MySQL also support, that is, importing data from a file into a table (see the sketch below).
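A condensed sketch of these first steps (the database name, table name, and file path are for illustration only):

```sql
CREATE DATABASE demo_db;
USE demo_db;
-- assume a table t1 has already been created in demo_db
LOAD DATA INFILE '/data/train.csv' INTO TABLE t1;
```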

We also support offline and online data import, as well as offline feature extraction. As mentioned earlier, we use an extended SQL language for feature computation. Offline SQL task submission is integrated into the command line: you can execute a standard SQL statement, such as the SELECT SUM example here, directly from the command line. Once submitted, the task goes to a distributed computing cluster, such as a YARN cluster, where distributed offline feature computation is performed.
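In offline mode, such a statement is submitted as a job rather than answered interactively; a minimal sketch (the column name is illustrative):

```sql
SET @@session.execute_mode = 'offline';
SELECT SUM(trip_duration) FROM t1;   -- submitted as a distributed offline task
SHOW JOBS;                           -- check the task's status
```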

For step 5, machine learning model training:

  • We support external machine learning training frameworks like TensorFlow, PyTorch, LightGBM, XGBoost or OneFlow.
  • Because we generate standard sample data formats, such as CSV, LIBSVM or TFRecords, users can use frameworks such as TensorFlow for model training.
  • These frameworks can also be submitted to local, YARN cluster, K8S cluster, etc., for distributed training;
  • They support GPU and other hardware acceleration, and are fully compatible with our feature database OpenMLDB.

After model training, when our SQL features are ready to go live, we can directly execute a DEPLOY command to bring the SQL online and launch our online feature service. This is also implemented through our SQL extensions.
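Deployment is a one-line statement; a sketch with a made-up deployment name and a simplified query (the real query would be the full feature-extraction SQL):

```sql
DEPLOY demo_service
  SELECT id, SUM(trip_duration) OVER w AS total_duration
  FROM t1
  WINDOW w AS (PARTITION BY id ORDER BY pickup_datetime
               ROWS BETWEEN 10 PRECEDING AND CURRENT ROW);

SHOW DEPLOYMENTS;   -- list the deployed online services
```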

Then, when the service goes live, it needs to be primed with some historical time-series features, which we call the feature reservoir. Importing a user's historical data can likewise be done with the SQL LOAD DATA statement.

Once this is done, we have an online service that supports HTTP and RPC interfaces, which clients can access using standard HTTP requests or our Java or Python SDKs. In the future, we will also integrate this functionality into the CLI, enabling the entire end-to-end AI workflow from the command line.


【 02 | Quick start with the OpenMLDB 0.4.0 standalone and cluster versions 】

After introducing the full-process features of 0.4.0, let me give you a quick overview of getting started with the standalone and cluster versions.

First of all, the differences between the standalone and cluster versions are:

  • The standalone version is easy to deploy and has few modules. You only need to download a pre-compiled binary; there are no external dependencies. The standalone version still offers the full range of features and supports both Linux and macOS, whether the ARM architecture of M1-based Macs or the x86 architecture of Intel CPUs. It is therefore suitable for functional testing and small-scale PoC testing.

  • The Cluster edition has a full and rich feature set.

    • It supports high availability and all nodes are highly available with no single point of failure.
    • It supports mass storage, and although our online data is stored in memory, it supports horizontal scaling of storage. As the amount of data increases, simply add general-purpose x86 storage servers horizontally.
    • It is high-performance and supports distributed parallel computing both offline and online, reducing modeling and feature extraction time.

OpenMLDB 0.4.0 standalone quick start | Start the standalone OpenMLDB database

Using the standalone version is very simple. Both the standalone and cluster versions are open source on GitHub, and the same codebase downloaded from GitHub supports both. For the standalone version, we provide a script that launches the three components it requires. On the right is the architecture, which includes a Name Server service and an API Server service. The underlying data is stored on a single Tablet, and users can access the service using the command line or an SDK.

OpenMLDB 0.4.0 standalone quick start | Use the OpenMLDB client

Using the client is very simple. After starting the services with the script, you can use a MySQL-like command-line client to connect to the OpenMLDB database with an IP address and port. After connecting, basic cluster information, including the version number, is displayed.

OpenMLDB 0.4.0 standalone quick start | Execute standard SQL

Once connected, we can use standard SQL statements.

The SQL statements we currently support are listed on the left side of the slide, and you can find a more detailed introduction on our documentation website. Basic DML and DDL statements are already supported, as are SELECT INTO and various SELECT subqueries.

On the right are some screenshots of executing SQL commands. The general process of using a database is:

  1. Create a database, then USE the database; subsequent SQL operations run against this default DB;
  2. Create a table, which also follows standard ANSI SQL syntax. Compared with standard SQL, however, table creation can additionally specify the index and the time column (see the sketch below);
  3. Run SHOW TABLES to view the tables that have been created.

We also support standard SQL INSERT statements, which insert single rows of data into a table that can then be queried with SELECT statements. These are some of the capabilities OpenMLDB provides as a basic online database.
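A sketch of table creation with an index and time column, plus a basic insert and query (the column names are illustrative; the INDEX clause is the extension mentioned above):

```sql
CREATE TABLE t1 (
    vendor_id int,
    pickup_datetime timestamp,
    trip_duration int,
    INDEX(KEY=vendor_id, TS=pickup_datetime)  -- partition key and time column
);

INSERT INTO t1 VALUES (1, 1639000000000, 660);
SELECT * FROM t1;
```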

OpenMLDB 0.4.0 cluster quick start | Start the cluster version of the OpenMLDB database

Now let me introduce the cluster version. The cluster version starts in a similar way to the standalone version, with a start-all script. Compared with the standalone edition, the cluster edition offers high availability and consists of more components.

  • In addition to the Tablet, Name Server, and API Server mentioned earlier, we start two Tablets by default for high availability, ensuring that all data has at least two replicas.
  • You can configure the number of data replicas and the size of the cluster in the configuration files.
  • Importantly, offline task management is supported in version 0.4.0, so a highly available task management module called TaskManager is also added.

To the right of the slide is a basic architecture diagram. In addition to OpenMLDB itself, the high-availability implementation currently relies on a ZooKeeper cluster. Some of OpenMLDB's basic metadata, including the master node information and anything that needs to be persisted, is stored in ZooKeeper. When the Name Server starts, it registers its address with ZooKeeper for high availability, and the Tablets find the master Name Server through ZooKeeper and listen for metadata updates.

OpenMLDB 0.4.0 cluster quick start | Cluster version configuration files

Deploying the cluster version is relatively complex. It adds the new TaskManager module, and it is worth looking at the configuration files of the individual components. What matters most is that nearly every component must be configured with the ZooKeeper address and path, to ensure that all components connect to the same ZooKeeper cluster. Highly available metadata management is implemented through the Zab protocol to ensure high availability of the entire cluster.
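Purely as an illustration (the exact file names and keys below are an assumption based on OpenMLDB's usual configuration layout and may differ in your version), the ZooKeeper settings typically look like this:

```
# conf/tablet.flags (similar flags for nameserver and apiserver)
--zk_cluster=127.0.0.1:2181
--zk_root_path=/openmldb

# conf/taskmanager.properties
zookeeper.cluster=127.0.0.1:2181
zookeeper.root_path=/openmldb
```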

OpenMLDB 0.4.0 cluster quick start | Use the cluster version client

The cluster version of the client is slightly different from the standalone version. The OpenMLDB command-line client no longer specifies the Name Server's IP address and port directly: the Name Server is also highly available, so its address may change during failover. Instead, you configure the ZooKeeper information. After startup, additional configuration and version information related to the cluster version is printed.

It works in a similar way to the standalone version, with the SQL statements mentioned earlier. You can use it as a very high-performance, fully in-memory time-series database, or simply as a SQL-enabled database.

OpenMLDB 0.4.0 cluster quick start | Advanced features of the cluster version

The cluster edition also has some more advanced features; here are two of them:

1. Offline mode and online mode. This is unique to the cluster edition: in the standalone edition all computing happens on a single machine, so there is no distinction between online and offline modes. The cluster version supports massive data storage such as HDFS, and its offline computing layer is based on Spark. So how do the offline and online modes work?

    • We support a standard SQL SET statement, and you can see that execute_mode is online. In online mode, all SQL statements are executed online, that is, they look up data in memory.
    • SET @@session.execute_mode = "offline" switches the mode to offline.
    • You can see that the current mode is now offline. A SQL query in offline mode does not read from memory: in real scenarios such as risk control or fraud-ring detection, the offline data may be massive, anywhere from a few to several hundred terabytes. Such queries certainly cannot return results interactively, and the results cannot be aggregated on a single node. So in offline mode we treat each SQL query as a task, and you can see the task's basic information, including its ID, type, and status.
    • After executing the SQL, version 0.4.0 provides commands such as SHOW JOBS to view task status, log information, and so on, implementing offline task management. This management functionality is also integrated into the CLI.

2. Deploying SQL as an online service. This is supported by both the cluster and standalone versions, and it is something other online databases do not support. After the database and tables are created, and once the data scientist has verified that a feature-extraction SQL statement is effective, the user can bring it online with the DEPLOY command and then view the deployed service with SHOW DEPLOYMENTS. A Deployment is similar to a stored procedure in SQL: each Deployment corresponds to a deployable SQL statement, and when using the online service, the client executes that SQL online by deployment name, much like calling a stored procedure.


【 03 | Workshop – Quickly build a full-process online AI application 】

Finally, I will take you through a workshop to quickly build a full-process online AI application from the command line.

Application scenarios

This is the scenario we will demonstrate: a Kaggle competition called New York City Taxi Trip Duration, a machine learning scenario for estimating taxi trip durations. We download the historical taxi trip data provided by the competition, and the developer or modeler needs to use this data, with machine learning methods, to estimate trip durations for the newly presented test set. The training data is not large, with 11 columns in total and roughly 1 million rows. Its distinguishing characteristic is that it contains timestamped time-series data, which is quite important for this trip-duration prediction scenario: we need to estimate the final trip duration based on each taxi's history as well as the features of the preceding sequence.

OpenMLDB 0.4.0 technical solution

This demo uses the technical solution based on OpenMLDB 0.4.0. Here is a summary:

  1. Feature extraction language: SQL, the language most familiar to modeling scientists;
  2. Model training framework: LightGBM is used in this example; of course you can use TensorFlow or PyTorch if you prefer;
  3. Offline storage engine: local file storage, because the sample data volume is not large, only about 1 million rows, perhaps tens of megabytes. In real scenarios machine learning samples may be larger and more complex, so OpenMLDB also supports HDFS storage;
  4. Online storage engine: OpenMLDB's high-performance time-series storage, an in-memory storage based on the multi-level skip-list data structure;
  5. Online estimation service: the OpenMLDB API Server provides standard RESTful and RPC interfaces.

Step 1: Run the OpenMLDB image

Next, we will demonstrate. When modeling with OpenMLDB, we first need to set up an OpenMLDB database runtime environment.

OpenMLDB provides a test demo image. The underlying implementation of OpenMLDB is C++, which is relatively stable and easy to install. We can use the official Docker image we provide on GitHub; in a macOS or Linux environment, you can also download the source code and compile and run it locally.

After running the image, we enter the container; the screenshot shows the contents of the complete Dockerfile. It downloads the binary, starts the server and client, and installs the required libraries, such as pandas for Python. The image is very clean, and no additional components need to be downloaded.

Step 2: Start the OpenMLDB cluster

The second step is to start the OpenMLDB cluster, either using init.sh (a script we provide), the start script that ships with the OpenMLDB project, or a binary you compiled yourself.

Because we are demonstrating the fully featured cluster version this time, we start the ZooKeeper service along with the components we depend on, such as the Tablets, Name Server, API Server, and TaskManager. Once these components are started, we have the functionality of the cluster edition of OpenMLDB. If you are interested, you can also check the contents of the script: init.sh supports both the standalone and cluster versions, and for the cluster version it additionally starts a ZooKeeper instance and all the OpenMLDB components.

Starting the components is actually very simple; it is just the content of the start-all script. The script defines a number of components and loops over them, launching each one. These components are binaries compiled from the OpenMLDB C++ project for the respective platform, launched under the mon supervision tool.

Step 3: Create the database and tables

After the services have started, we can connect using a MySQL-like client. As long as the ZooKeeper address is configured, the client automatically finds the Name Server address and enters the database; at this point, you can execute most standard SQL statements.

In order to demonstrate our taxi end-to-end machine learning modeling process, we will:

  1. Create a test DB with CREATE DATABASE, then USE the database;
  2. With the SHOW DATABASES command, you can see that the database has been created;
  3. Then we create a table in the database. Since we have not yet started offline model training, we cannot know in advance which indexes the table needs, so we let the user create the table without specifying any index. The table has 11 columns, corresponding to the 11 columns of Kaggle's competition data set, including multiple timestamp columns (see the sketch below);
  4. You can then see that creation succeeded and the table, named t1, has been created.
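A sketch of these commands, with the table definition following the 11 columns of the Kaggle data set (the column types shown here are indicative):

```sql
CREATE DATABASE demo_db;
USE demo_db;

CREATE TABLE t1 (
    id string, vendor_id int,
    pickup_datetime timestamp, dropoff_datetime timestamp,
    passenger_count int,
    pickup_longitude double, pickup_latitude double,
    dropoff_longitude double, dropoff_latitude double,
    store_and_fwd_flag string,
    trip_duration int
);

SHOW TABLES;
```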

Step 4: Import offline data

In step 4 we need to import the offline data, the training data provided in the Kaggle competition. Currently we support importing various data formats, including Parquet and CSV.

To perform the offline data import, we need to switch the current execution mode to offline and then run a LOAD DATA statement.

Why switch to offline? If we did not switch, the import would become an online import. If the offline data volume is very large, or the data is imported from HDFS, shipping all of it into our online in-memory storage would be unreliable, so switching the execution mode to offline is very important.

The import SQL is then executed, which submits a task whose status can be seen with SHOW JOBS; it is an ImportOfflineData task. Within a matter of seconds the task completes and the data is imported. If we look at the table again, we can see that before the first import there was no offline data and no offline path, while after a successful offline import the table attributes contain the offline information, indicating the path the offline data has been imported to. You can see that the data file has been imported correctly.
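The import itself looks roughly like this (the file path and options here are illustrative):

```sql
SET @@session.execute_mode = 'offline';
LOAD DATA INFILE 'file:///work/taxi-trip/data/train.csv'
INTO TABLE t1 OPTIONS (format='csv', header=true);

SHOW JOBS;   -- the ImportOfflineData task appears here
```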

Step 5: Feature extraction using offline data

Let's continue the demo in offline mode. After the offline data import, offline feature extraction can be carried out. This step takes a different amount of time in different modeling scenarios: the modeling scientist needs to choose which features to extract and then iterate on the feature-extraction SQL script.

Next, we can use a sliding window (an OVER window) to compute time-series features, calculating aggregate values over the window such as min and max. We can also compute single-row features that transform an individual row.

Finally, after the SQL executes, we need to store the extracted sample data somewhere. It can be exported to a local path, or to distributed HDFS storage if the data volume is relatively large.
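An abridged sketch of such a feature-extraction query (the aggregates and window bounds are simplified from the actual demo script):

```sql
SELECT trip_duration, passenger_count,
       SUM(pickup_latitude) OVER w AS vendor_sum_pl,
       MAX(pickup_latitude) OVER w AS vendor_max_pl
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime
             ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
INTO OUTFILE '/tmp/feature_data' OPTIONS (format='csv');
```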

From SHOW JOBS we can see that the job ID is 2. The job's state changes from Submitted to Running because it is executed in a distributed fashion. Although it is not a very complex SQL statement, its data comes from t1's offline storage. In real offline feature extraction, the data volume may be too large for the SQL computation to fit in local memory, so we submit the task to a local or YARN-based Spark cluster for execution.

You can see the state change to FINISHED, indicating that the data has been exported successfully. Our SQL statement specified the export path, and from the command line we can see that the sample data has been exported correctly. To support more training frameworks, other sample formats are supported in addition to the default CSV. The content of this data file is the sample data generated by the SQL statement just described.

Step 6: Use sample data for model training

This sample data can be used for training with an open-source machine learning framework. Here we use our train script; you can take a brief look at its contents.

The first part consolidates the sample data, merging multiple CSV files into a single CSV file, and then reads the features out of the CSV with pandas. The machine learning modeling steps that follow will be familiar to modeling users:

  1. First, the samples are split into a training set and a validation set, and the label column is extracted;
  2. The dataset is loaded and the machine learning model we use is configured, such as a GBDT, decision tree, or DNN model;
  3. LightGBM's train function is called to begin training.

This script can also be replaced with a training script for any open-source machine learning framework such as TensorFlow, PyTorch, or OneFlow. We execute the script; because the sample data is small, it quickly trains the model and exports it to the output path, and we can then use this model to go online. However, we must keep in mind that what affects our predictions is not only the model: our input is the raw data provided by Kaggle, so the end-to-end machine learning process must include both the feature extraction and the model.

Step 7: Deploy SQL online

We need to bring the SQL we just modeled with online, using the high-performance online feature computation capability provided by OpenMLDB. We re-enter the OpenMLDB database from the client and switch to the default DB. The deployed SQL is the same as the offline feature-extraction SQL above, and no feature-specific development is needed. On DEPLOY, OpenMLDB partitions by the specified keys and sorts by the time column: these keys are pre-indexed and the data is stored by index.

Step 8: Import online data into OpenMLDB

After DEPLOY we can make online predictions. At prediction time we want online estimation, and because our features are temporal window features, every feature computation needs to aggregate (min, max, and so on) over the preceding window. So there is usually a "reservoir" operation: importing some recent data into the online database. The online import is also a distributed task; after executing it, we can see that a job has been submitted and runs quickly. In the local environment, its performance is very good.

Step 9: Start the HTTP estimation service and online estimation

With the online data imported, in step 9 we can start the estimation service for online estimation. We have a script called start predict server, a very simple Python HTTP server that we packaged. The HTTP server wraps the client's data, sends requests to the API Server, and prints the results. When raw data comes in, sample data is obtained through feature computation; the LightGBM model just produced by training is loaded, receives the feature samples returned online, and returns the estimated results.

Step 10: Perform online feature extraction calculation

We started the predict server, and the last step is to do online prediction.

The predict script we packaged is an HTTP client that takes the raw data columns we provide as parameters (the input is raw data, including string columns, not post-extraction samples) and is executed directly through Python. A single online feature extraction completes within 10 milliseconds.

How is this execution speed achieved? It is related to the SQL complexity of the user's modeling. For simple features like these, if the data volume is relatively small, it can even be done in less than 1 millisecond. Adding the pure feature computation time to the model inference time, the user gets what is effectively an instant response. Shown here are the online feature samples produced by our SQL statement, and the prediction returned by LightGBM.

Some people may think running a few scripts is nothing special: you just do some SQL calculations, and querying data in MySQL can also finish in tens of milliseconds or 100 milliseconds. What's the difference?

  • For example, if you pass in a feature where trip_duration is 10 and then normalize or transform it, that is a single-row feature. Obviously single-row feature performance is very high: whether implemented in Python or C++, multiplying or adding one raw value takes roughly one CPU operation.
  • The features we support, however, are time-series aggregate features over sliding windows. To compute one row, you do not just do arithmetic on that row: you pull all of that row's data from, say, the preceding day out of the database, then slide the window according to the "ROWS BETWEEN" or "RANGE BETWEEN" window definition and aggregate the data inside it. Our sub-10-millisecond figure is the result of aggregating these two window features.
  • Without a specialized time-series database, fetching all of the preceding day's data from something like MySQL might alone take more than 100 milliseconds. We can do the window data fetch, the window feature aggregation, and finally the model inference within 10 to 20 ms overall, which is closely tied to our storage architecture design and is the difference between OpenMLDB and other OLTP databases. And then finally we get the prediction.

Summary of the full-process online AI application

In fact, the previous steps are relatively simple: start the OpenMLDB cluster, create the database and table, then perform the offline data import and offline feature computation. When the feature-extraction SQL looks good, we bring it online; after that, we start an HTTP-capable prediction service and make predictions. In version 0.4.0, almost all steps can be executed from the SQL command line. In the future, we also plan to support online feature extraction from the command line, and even to extend the SQL syntax to support model training from the command line.

To recap, I first introduced the new full-process features of OpenMLDB 0.4.0, then the quick start for the standalone and cluster versions, and finally demonstrated, through a Kaggle competition scenario, how to quickly build an online AI application with OpenMLDB.

Welcome to join our community. All of the project's documentation and code are on GitHub; if you are interested, you are welcome to file issues and contribute code through pull requests. You are also welcome to join our WeChat group. That's all for my sharing. Thank you very much for listening.