
Kubeflow

Kubeflow in brief

The Kubeflow project is built on containers and Kubernetes. It aims to give data scientists, machine learning engineers, and operations staff an agile platform for deploying, developing, training, releasing, and managing machine learning workloads. It leverages cloud native technology so that users can deploy, use, and manage the most popular machine learning software more quickly and easily.

Kubeflow is an integration of open source projects from many domains, such as Jupyter, TF Serving, Katib, Fairing, and Argo. It covers the different stages of machine learning: data preprocessing, model training, model prediction, service management, and so on. As long as Kubernetes is installed, it can be deployed locally, in an on-premises data center, or in the cloud.

The core components of Kubeflow

The diagram above shows the whole industrial machine learning process, from data collection and validation to model training and service delivery. Each widget in the figure corresponds to a component included in Kubeflow, which shows both the scale of Kubeflow's ambition and the breadth of its functionality. Here is an overview of each component:

  • Jupyter Notebooks: Create and manage multi-user interactive Jupyter notebooks
  • TensorFlow/PyTorch: Currently the main machine learning engines supported
  • Seldon: Provides deployment of machine learning models on Kubernetes
  • TF Serving: Provides online deployment of TensorFlow models, with version management and the ability to switch models without stopping the online service
  • Argo: A workflow engine based on Kubernetes
  • Pipelines: A workflow project for machine learning scenarios built on Argo, which provides creation, scheduling, and management of machine learning workflows, along with a Web UI
  • Ambassador: An API gateway that exposes a unified external service interface
  • Istio: Provides microservice management and telemetry collection
  • Ksonnet: Kubeflow uses ksonnet to deploy the required Kubernetes resources to the Kubernetes cluster
  • Operator: Provides resource scheduling and distributed training capabilities for the different machine learning frameworks (TF-Operator, PyTorch-Operator, Caffe2-Operator, MPI-Operator, MXNet-Operator)
  • Katib: A hyperparameter search and simple model-structure search system built on the various Operators, supporting parallel search and distributed training. Hyperparameter optimization has not yet been applied at scale in practical work, so this part of the technology still needs some time to mature
  • Pachyderm: Version control for data, similar to how Git handles code. You can track data states over time, back-test historical data, share data with teammates, and revert to previous data states

Kubeflow characteristics

  • Based on Kubernetes, with cloud native features: elastic scaling, high availability, DevOps, etc.
  • Integrate a large number of machine learning tools

The basic features of Kubeflow are briefly introduced above. Next, we introduce Kubeflow Pipelines in detail.

KubeFlow Pipelines in brief

Since Kubeflow v0.1.3, Pipelines has been a core component of Kubeflow. Kubeflow's main purpose is to simplify the process of running machine learning tasks on Kubernetes, with the ultimate goal of providing a complete, usable pipeline that covers the end-to-end machine learning process from data to model. Pipelines is a workflow platform for building and deploying machine learning workflows, so in this sense it is no surprise that it is a core component of Kubeflow.

Kubeflow Pipelines include:

  • A user interface (UI) for managing and tracking experiments, jobs, and runs.
  • An engine for scheduling multi-step ML workflows.
  • An SDK for defining and manipulating pipelines and components.
  • Notebooks for interacting with the system using the SDK (see the sketch after this list).
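For example, from a Jupyter notebook you can talk to the Pipelines backend through the SDK's client object. A minimal sketch, assuming the kfp v1 SDK and a Pipelines API endpoint at the placeholder address below:

import kfp

# Connect to the Kubeflow Pipelines API server.
# The host below is a placeholder; inside the cluster the client can often
# be created with no arguments, otherwise point it at your own endpoint.
client = kfp.Client(host='http://localhost:8080/pipeline')

# List existing experiments and pipelines to verify the connection.
print(client.list_experiments())
print(client.list_pipelines())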

The structure of KubeFlow Pipelines

The figure above shows the structure diagram of Kubeflow Pipelines, which is divided into eight parts:

  • Python SDK: A domain-specific language (DSL) for creating Kubeflow Pipelines components.
  • DSL Compiler: Converts the Python code into a static YAML configuration file.
  • Pipeline Web Server: The front-end service for Pipelines. It collects various data in order to display the list of currently running pipelines, the execution history of pipelines, and the debugging information and execution status of each pipeline run.
  • Pipeline Service: The back-end service for Pipelines, which calls the Kubernetes service to create pipeline runs from the YAML.
  • Kubernetes Resources: Creates the CRDs needed to run the Pipeline.
  • Machine Learning Metadata Service: Monitors the Kubernetes resources created by the Pipeline Service and persists the state of these resources in the ML metadata service (it stores the input/output data exchanged between the task containers).
  • Artifact Storage: Used to store Metadata and Artifacts. Kubeflow Pipelines stores metadata in a MySQL database and artifacts in an artifact store such as a MinIO server or cloud storage.
  • Orchestration Controllers: Task Orchestration, such as the Argo Workflow controller, which coordinates task-driven workflows.

Main features of KubeFlow Pipelines

  • End-to-end orchestration: Enables and simplifies orchestration of machine learning workflows.
  • Easy experimentation: Allows you to easily try out multiple ideas and methods and manage your various trials and experiments.
  • Easy to reuse: Enables you to reuse components and workflows to quickly create end-to-end solutions without having to rebuild each time.

Example

Background

When training a new ML model task, most data scientists and ML engineers will likely first develop some new Python scripts or interactive notebooks that perform the data extraction and pre-processing necessary to build clean data sets for training models. Then, they might create several additional scripts or notebooks to try out different types of models or different machine learning frameworks. Finally, they will collect and explore metrics to assess each model’s performance on the test data set, and then determine which model to deploy into production.

This is obviously an oversimplification of a real machine learning workflow, but the point is that this generic approach requires a lot of human involvement and can’t be easily reused by anyone other than the engineers who originally developed it.

We can use KubeFlow Pipelines to solve these problems. Instead of thinking of data preparation, model training, model validation, and model deployment as a single code base for the particular model we are working on, we can think of this workflow as a series of separate modular steps, each focused on a specific task.

Environment preparation

pip install kfp
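To confirm that the SDK is installed and importable, you can optionally print its version (this article assumes the v1 kfp SDK):

python -c "import kfp; print(kfp.__version__)"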

Design workflow

We will create a total of four components, as shown below:

  • preprocess-data: This component loads the Boston housing price data set from sklearn.datasets and then splits it into a training set and a test set.
  • train-model: This component trains a model to predict the median Boston house value from the Boston housing price data set.
  • test-model: This component computes and outputs the mean squared error of the model on the test data set.
  • deploy-model: We will not focus on model deployment in this article, so this component only logs a message saying that it is deploying the model. In a real-world scenario, this could be a generic component for deploying any model to production.

Next, we write code, develop components, and make images.

Preprocessing component development (preprocess-data)

First, write the preprocessing code preprocess.py:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def _preprocess_data():
    X, y = datasets.load_boston(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    np.save('x_train.npy', X_train)
    np.save('x_test.npy', X_test)
    np.save('y_train.npy', y_train)
    np.save('y_test.npy', y_test)


if __name__ == '__main__':
    print('Preprocessing data...')
    _preprocess_data()

Then, write the Dockerfile for the image:

FROM python:3.7-slim

WORKDIR /app

RUN pip install -U scikit-learn numpy

COPY preprocess.py ./preprocess.py

ENTRYPOINT [ "python", "preprocess.py" ]

Next, build the image:

docker build -t wintfru/boston_pipeline_preprocess:v1  -f Dockerfile .
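Before pushing, you can optionally smoke-test the image locally; with the Dockerfile above, the container should print the preprocessing message, write the .npy files inside the container, and exit:

docker run --rm wintfru/boston_pipeline_preprocess:v1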

Finally, push the image to the remote repository:

docker push wintfru/boston_pipeline_preprocess:v1

Remaining component development (train-model, test-model, deploy-model)

The process of developing the remaining components is similar to that of preprocess-data; a sketch of what the training component might look like is shown below. See the reference documentation for the full details.
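As an illustration only, here is a minimal sketch of what a hypothetical train.py for the train-model component could look like. It follows the contract assumed by the pipeline below (it reads the files passed via --x_train/--y_train and writes the model to model.pkl under the /app working directory); the specific model and helper names are assumptions rather than the reference implementation.

import argparse
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression


def _train_model(x_train_path, y_train_path):
    # Load the training arrays produced by the preprocess-data component.
    x_train = np.load(x_train_path)
    y_train = np.load(y_train_path)

    # Fit a simple regression model on the Boston housing data.
    model = LinearRegression()
    model.fit(x_train, y_train)

    # Persist the trained model where the pipeline expects it (/app/model.pkl).
    with open('model.pkl', 'wb') as f:
        pickle.dump(model, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--x_train', required=True)
    parser.add_argument('--y_train', required=True)
    args = parser.parse_args()
    print('Training model...')
    _train_model(args.x_train, args.y_train)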

Build workflow

First, we orchestrate the workflow in pipeline.py:

import kfp
from kfp import dsl


def preprocess_op():
    return dsl.ContainerOp(
        name='Preprocess Data',
        image='wintfru/boston_pipeline_preprocess:v1',
        arguments=[],
        file_outputs={
            'x_train': '/app/x_train.npy',
            'x_test': '/app/x_test.npy',
            'y_train': '/app/y_train.npy',
            'y_test': '/app/y_test.npy',
        }
    )


def train_op(x_train, y_train):
    return dsl.ContainerOp(
        name='Train Model',
        image='wintfru/boston_pipeline_train:v1',
        arguments=[
            '--x_train', x_train,
            '--y_train', y_train
        ],
        file_outputs={
            'model': '/app/model.pkl'
        }
    )


def test_op(x_test, y_test, model):
    return dsl.ContainerOp(
        name='Test Model',
        image='wintfru/boston_pipeline_test:v1',
        arguments=[
            '--x_test', x_test,
            '--y_test', y_test,
            '--model', model
        ],
        file_outputs={
            'mean_squared_error': '/app/output.txt'
        }
    )


def deploy_model_op(model):
    return dsl.ContainerOp(
        name='Deploy Model',
        image='wintfru/boston_pipeline_deploy:v1',
        arguments=[
            '--model', model
        ]
    )


@dsl.pipeline(
    name='Boston Housing Pipeline',
    description='An example pipeline that trains and logs a regression model.'
)
def boston_pipeline():
    _preprocess_op = preprocess_op()

    _train_op = train_op(
        dsl.InputArgumentPath(_preprocess_op.outputs['x_train']),
        dsl.InputArgumentPath(_preprocess_op.outputs['y_train'])
    ).after(_preprocess_op)

    _test_op = test_op(
        dsl.InputArgumentPath(_preprocess_op.outputs['x_test']),
        dsl.InputArgumentPath(_preprocess_op.outputs['y_test']),
        dsl.InputArgumentPath(_train_op.outputs['model'])
    ).after(_train_op)

    deploy_model_op(
        dsl.InputArgumentPath(_train_op.outputs['model'])
    ).after(_test_op)


# client = kfp.Client()
# client.create_run_from_pipeline_func(boston_pipeline, arguments={})

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(boston_pipeline, __file__ + '.yaml')



We then compile pipeline.py into a YAML configuration file for the Kubernetes task:

python pipeline.py
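Alternatively, instead of uploading the YAML through the UI, the run can be submitted programmatically with the SDK client, as the commented-out lines in pipeline.py hint. A minimal sketch, assuming the Pipelines endpoint is reachable from where the script runs:

import kfp

from pipeline import boston_pipeline

# Connect to the Pipelines API server (pass host=... when running outside the cluster).
client = kfp.Client()

# Compile the pipeline function and submit it as a new run in one step.
client.create_run_from_pipeline_func(boston_pipeline, arguments={})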

Execute workflow

First, open the Kubeflow Pipelines graphical user interface and upload the YAML file.

The workflow is then run, and the DAG diagram is shown below.

We can also view input and output results for each component, console logs, and so on.


Conclusion

This article introduced the basic architecture and components of Kubeflow and Kubeflow Pipelines, and showed how to use Kubeflow Pipelines to implement a simple machine learning workflow that loads some data, trains a model, evaluates it on a held-out data set, and then "deploys" it. By using Kubeflow Pipelines, we were able to encapsulate each step of this workflow in its own workflow component, each running in its own isolated Docker container. This encapsulation keeps the steps of our machine learning workflow loosely coupled and opens up the possibility of reusing components in future workflows. For example, nothing in our training component is specific to the Boston housing price data set; we can reuse it whenever we want to train a regression model with sklearn.


Reference documentation

  • Machine Learning Pipelines with Kubeflow