
Compiled by Heart of the Machine

Kubeflow is a machine learning toolkit published by Google, dedicated to making machine learning on Kubernetes easy, portable, and extensible. The goal is not to recreate other services, but to provide a straightforward way to use the best open-source (OSS) solutions. The repository contains manifests for creating:

  • JupyterHub for creating and managing interactive Jupyter Notebooks

  • TensorFlow Training Controller that can be configured to use CPUs or GPUs and adjusted to the size of a cluster with a single setting

  • TF Serving Container

This document details the steps for running the Kubeflow project in any environment where Kubernetes runs.

The goal of Kubeflow

The goal is to make machine learning easier by leveraging Kubernetes’ strengths:

  • Simple, repeatable, portable deployments across diverse infrastructure (laptop <-> ML rig <-> training cluster <-> production cluster)

  • Deploy and manage loosely coupled microservices

  • Scaling based on demand

Because machine learning practitioners use such a diverse set of tools, a core goal is to let you customize the stack to your needs and let the system take care of the "boring stuff." While we have started with a narrow set of technologies, we are working with many different projects to include additional tools. Ultimately, we want a simple set of ML stacks that can easily be used anywhere Kubernetes is already running, and that configure themselves according to the cluster they are deployed in.

Setup

This document assumes that you already have a Kubernetes cluster available. Additional configuration may be required for specific Kubernetes installations.

Minikube

Minikube is a tool that makes it easy to run Kubernetes locally. Minikube runs a single-node Kubernetes cluster inside a VM on your laptop, so users can try Kubernetes or do day-to-day development work against it. The steps below apply to a Minikube cluster; this document currently uses the latest version, 0.23.0. kubectl must be configured to talk to the Minikube cluster.
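The local workflow can be sketched as follows (a sketch, assuming Minikube and kubectl are installed; `minikube` is the context name Minikube registers by default):

```shell
# Start a local single-node cluster and point kubectl at it.
# Guarded so the sketch is a no-op where minikube is unavailable.
if command -v minikube >/dev/null 2>&1; then
  minikube start
  # kubectl must target the Minikube cluster before the later steps
  kubectl config use-context minikube
  kubectl cluster-info
else
  echo "minikube not found; skipping"
fi
```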

Google Kubernetes Engine

Google Kubernetes Engine is a managed environment for deploying containerized applications. It brings the latest innovations in developer productivity, resource efficiency, automated operations, and open-source flexibility to accelerate time to market and iteration.

Google has more than 15 years of experience running production workloads in containers, and it has built that knowledge into Kubernetes, the industry-leading open-source container orchestration system that powers Kubernetes Engine.

If you are using Google Kubernetes Engine, grant yourself the required RBAC role before creating the manifests, so that you can create or edit other RBAC roles:


     
    kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin [email protected]

Quick start

Run the following command to quickly set up all the components of the stack:


     
    kubectl apply -f components/ -R

The above command sets up JupyterHub, an API for training using TensorFlow, and a set of deployment files for serving. These services are configured to help users move from training to serving with TensorFlow across different environments in a low-friction, portable way. Refer to the instructions for each of these components.
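As a quick sanity check, you can list what the manifests created (a sketch; assumes kubectl is installed and pointed at the target cluster):

```shell
# Verify the quick-start components after `kubectl apply -f components/ -R`.
# Guarded so the sketch is a no-op where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  # Workloads and services the manifests created
  kubectl get deployments,services,statefulsets
  # Pods should reach the Running state within a few minutes
  kubectl get pods
else
  echo "kubectl not found; skipping"
fi
```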

Usage

This section describes the different components and the necessary steps for startup.

Create a Notebook

When you created all the manifests needed for JupyterHub, a load balancer service was created as well. You can check its details with the kubectl command line.


     
    kubectl get svc
    NAME         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
    kubernetes   ClusterIP      10.11.240.1    <none>        443/TCP        1h
    tf-hub-0     ClusterIP      None           <none>        8000/TCP       1m
    tf-hub-lb    LoadBalancer   10.11.245.94   xx.yy.zz.ww   80:32481/TCP   1m

If you are using minikube, run the following command to get the notebook URL.


     
    minikube service tf-hub-lb --url
    http://xx.yy.zz.ww:31942

For some cloud deployments, the LoadBalancer service can take up to five minutes to be assigned an external IP. Re-run the kubectl get svc command until the EXTERNAL-IP field is populated.
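The wait can be scripted; a minimal sketch, assuming the service is named tf-hub-lb as in the listing above and that the load balancer reports an IP address rather than a hostname:

```shell
# Poll until the LoadBalancer's external IP is assigned.
# Guarded so the sketch is a no-op where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  # jsonpath extracts the ingress IP; empty output means "still pending"
  until kubectl get svc tf-hub-lb \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' | grep -q .; do
    echo "waiting for external IP..."
    sleep 10
  done
  kubectl get svc tf-hub-lb
else
  echo "kubectl not found; skipping"
fi
```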

Once you have an external IP, you can access it in your browser. The hub is set up to accept any username/password combination by default. Once you have entered your username and password, you can start a single-notebook server, configure computing resources (memory/CPU/GPU), and continue with single-node training.

We also provide standard Docker images that can be used to train TensorFlow models on Jupyter.

  • gcr.io/kubeflow/tensorflow-notebook-cpu

  • gcr.io/kubeflow/tensorflow-notebook-gpu

In the Spawn window, when starting a new Jupyter instance, you can provide one of the above images, depending on whether you want to run on a CPU or GPU. The images include all the necessary plugins (including TensorBoard for model visualization). Note: GPU-based images can be several gigabytes in size and may take several minutes to download locally.

In addition, when running on Google Kubernetes Engine, the public address is exposed as an insecure endpoint by default. For production deployments with SSL and authentication, see the documentation: https://github.com/google/kubeflow/blob/master/components/jupyterhub.

Training

The TFJob controller takes a YAML specification for a master, parameter servers, and workers, and uses it to run distributed TensorFlow. The quick start above deployed the TFJob controller and installed the new tensorflow.org/v1alpha1 API. You can submit a job specification to that API to create a new TensorFlow training deployment.

The following is an example specification:


     
    apiVersion: "tensorflow.org/v1alpha1"
    kind: "TfJob"
    metadata:
      name: "example-job"
    spec:
      replicaSpecs:
        - replicas: 1
          tfReplicaType: MASTER
          template:
            spec:
              containers:
                - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
                  name: tensorflow
              restartPolicy: OnFailure
        - replicas: 1
          tfReplicaType: WORKER
          template:
            spec:
              containers:
                - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
                  name: tensorflow
              restartPolicy: OnFailure
        - replicas: 2
          tfReplicaType: PS
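At runtime, each replica of a job like the one above learns its role from a TF_CONFIG-style JSON document, the convention distributed TensorFlow uses for cluster discovery. A minimal sketch of parsing it, assuming the controller populates TF_CONFIG for each replica; the hostnames below are hypothetical, derived from the example job name:

```python
import json

def parse_tf_config(tf_config_json):
    """Extract this replica's role and the cluster layout from a
    TF_CONFIG-style JSON document."""
    cfg = json.loads(tf_config_json)
    cluster = cfg["cluster"]  # maps job name -> list of host:port addresses
    task = cfg["task"]        # this replica's job name and index
    return {
        "job_name": task["type"],
        "task_index": task["index"],
        "num_workers": len(cluster.get("worker", [])),
        "num_ps": len(cluster.get("ps", [])),
    }

# Example matching the spec above: 1 master, 1 worker, 2 parameter servers.
example = json.dumps({
    "cluster": {
        "master": ["example-job-master-0:2222"],
        "worker": ["example-job-worker-0:2222"],
        "ps": ["example-job-ps-0:2222", "example-job-ps-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})
print(parse_tf_config(example))
```

A training script would branch on `job_name` to decide whether to run parameter-server or worker logic.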

For details, see the tensorflow/k8s project. For more information about running TensorFlow jobs on Kubernetes with the TFJob controller, see tf-controller-examples/.

Serving a model

For a detailed guide to deploying a model with the built-in TensorFlow Serving component, see https://github.com/google/kubeflow/tree/master/components/k8s-model-server.

Original link: https://github.com/google/kubeflow

This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.

✄ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Join Heart of the Machine (full-time reporter/intern) : [email protected]

Contribute or seek coverage: [email protected]

Advertising & Business partnerships: [email protected]