
Compiled by Heart of the Machine

Kubeflow is a machine learning toolkit published by Google, dedicated to making machine learning on Kubernetes easy, portable, and extensible. The goal is not to recreate other services, but to provide a straightforward way to use the best open-source (OSS) solutions. The repository contains manifests for creating:

  • JupyterHub for creating and managing interactive Jupyter Notebooks

  • TensorFlow Training Controller that can be configured to use CPUs or GPUs and adjusted to the size of a cluster with a single setting

  • TF Serving Container

This document details the steps for running the Kubeflow project in any environment where Kubernetes runs.

The goal of Kubeflow

The goal is to make machine learning easier by leveraging Kubernetes’ strengths:

  • Simple, repeatable, portable deployments across diverse infrastructure (laptop <-> ML rig <-> training cluster <-> production cluster)

  • Deploy and manage loosely coupled microservices

  • Scaling based on demand

Because machine learning practitioners use such a diverse set of tools, a core goal is to let you customize the stack to your needs and let the system take care of the "boring stuff." While we have started with a narrow set of technologies, we are working with many different projects to include additional tools. Ultimately, we want a simple set of ML stacks that can easily be used anywhere Kubernetes is already running, and that configure themselves according to the cluster they are deployed in.

Setup

This document assumes that you already have a Kubernetes cluster available. Additional configuration may be required for specific Kubernetes installations.

Minikube

Minikube is a tool that makes it easy to run Kubernetes locally. Minikube runs a single-node Kubernetes cluster inside a VM on your laptop, so users can try Kubernetes or do day-to-day development work against it. The steps below apply to a Minikube cluster; this document currently uses the latest version, 0.23.0. kubectl must be configured to talk to the Minikube cluster.
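The local workflow can be sketched as follows (a sketch, assuming Minikube and kubectl are installed; `minikube` is the context name Minikube registers by default):

```shell
# Start a local single-node cluster and point kubectl at it.
# Guarded so the sketch is a no-op where minikube is unavailable.
if command -v minikube >/dev/null 2>&1; then
  minikube start
  # kubectl must target the Minikube cluster before the later steps
  kubectl config use-context minikube
  kubectl cluster-info
else
  echo "minikube not found; skipping"
fi
```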

Google Kubernetes Engine

Google Kubernetes Engine is a managed environment for deploying containerized applications. It brings the latest innovations in developer productivity, resource efficiency, automated operations, and open-source flexibility to accelerate time to market and iteration.

Google has more than 15 years of experience running production workloads in containers, and it has built that knowledge into Kubernetes, the industry-leading open-source container orchestration system that powers Kubernetes Engine.

If you are using Google Kubernetes Engine, grant yourself the required RBAC role before creating the manifests, so that you can create or edit other RBAC roles:


     
    kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin [email protected]

Quick start

Run the following command to quickly set up all the components of the stack:


     
    kubectl apply -f components/ -R

The above command sets up JupyterHub, an API for training using TensorFlow, and a set of deployment files for serving. These services are configured to help users move from training to serving with TensorFlow across different environments in a low-friction, portable way. Refer to the instructions for each of these components.
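As a quick sanity check, you can list what the manifests created (a sketch; assumes kubectl is installed and pointed at the target cluster):

```shell
# Verify the quick-start components after `kubectl apply -f components/ -R`.
# Guarded so the sketch is a no-op where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  # Workloads and services the manifests created
  kubectl get deployments,services,statefulsets
  # Pods should reach the Running state within a few minutes
  kubectl get pods
else
  echo "kubectl not found; skipping"
fi
```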

Usage

This section describes the different components and the necessary steps for startup.

Create a Notebook

When you created all the manifests needed for JupyterHub, a load balancer service was created as well. You can check its details with the kubectl command line.


     
    kubectl get svc
    NAME         TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
    kubernetes   ClusterIP      10.11.240.1    <none>        443/TCP        1h
    tf-hub-0     ClusterIP      None           <none>        8000/TCP       1m
    tf-hub-lb    LoadBalancer   10.11.245.94   xx.yy.zz.ww   80:32481/TCP   1m

If you are using minikube, run the following command to get the notebook URL.


     
    minikube service tf-hub-lb --url
    http://xx.yy.zz.ww:31942

For some cloud deployments, the LoadBalancer service can take up to five minutes to be assigned an external IP. Re-run the kubectl get svc command until the EXTERNAL-IP field is populated.
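The wait can be scripted; a minimal sketch, assuming the service is named tf-hub-lb as in the listing above and that the load balancer reports an IP address rather than a hostname:

```shell
# Poll until the LoadBalancer's external IP is assigned.
# Guarded so the sketch is a no-op where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  # jsonpath extracts the ingress IP; empty output means "still pending"
  until kubectl get svc tf-hub-lb \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' | grep -q .; do
    echo "waiting for external IP..."
    sleep 10
  done
  kubectl get svc tf-hub-lb
else
  echo "kubectl not found; skipping"
fi
```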

Once you have an external IP, you can access it in your browser. The hub is set up to accept any username/password combination by default. Once you have entered your username and password, you can start a single-notebook server, configure computing resources (memory/CPU/GPU), and continue with single-node training.

We also provide standard Docker images that can be used to train TensorFlow models on Jupyter.

  • gcr.io/kubeflow/tensorflow-notebook-cpu

  • gcr.io/kubeflow/tensorflow-notebook-gpu

In the Spawn window, when starting a new Jupyter instance, you can provide one of the above images, depending on whether you want to run on a CPU or GPU. The images include all the necessary plugins (including TensorBoard for model visualization). Note: GPU-based images can be several gigabytes in size and may take several minutes to download locally.

In addition, when running on Google Kubernetes Engine, the public address is exposed as an insecure endpoint by default. For production deployments with SSL and authentication, see the documentation: https://github.com/google/kubeflow/blob/master/components/jupyterhub.

Training

The TFJob controller takes a YAML specification for a master, parameter servers, and workers, and uses it to run distributed TensorFlow. The quick start above deployed the TFJob controller and installed the new tensorflow.org/v1alpha1 API. You can submit a job specification to that API to create a new TensorFlow training deployment.

The following is an example specification:


     
    apiVersion: "tensorflow.org/v1alpha1"
    kind: "TfJob"
    metadata:
      name: "example-job"
    spec:
      replicaSpecs:
        - replicas: 1
          tfReplicaType: MASTER
          template:
            spec:
              containers:
                - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
                  name: tensorflow
              restartPolicy: OnFailure
        - replicas: 1
          tfReplicaType: WORKER
          template:
            spec:
              containers:
                - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
                  name: tensorflow
              restartPolicy: OnFailure
        - replicas: 2
          tfReplicaType: PS
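At runtime, each replica of a job like the one above learns its role from a TF_CONFIG-style JSON document, the convention distributed TensorFlow uses for cluster discovery. A minimal sketch of parsing it, assuming the controller populates TF_CONFIG for each replica; the hostnames below are hypothetical, derived from the example job name:

```python
import json

def parse_tf_config(tf_config_json):
    """Extract this replica's role and the cluster layout from a
    TF_CONFIG-style JSON document."""
    cfg = json.loads(tf_config_json)
    cluster = cfg["cluster"]  # maps job name -> list of host:port addresses
    task = cfg["task"]        # this replica's job name and index
    return {
        "job_name": task["type"],
        "task_index": task["index"],
        "num_workers": len(cluster.get("worker", [])),
        "num_ps": len(cluster.get("ps", [])),
    }

# Example matching the spec above: 1 master, 1 worker, 2 parameter servers.
example = json.dumps({
    "cluster": {
        "master": ["example-job-master-0:2222"],
        "worker": ["example-job-worker-0:2222"],
        "ps": ["example-job-ps-0:2222", "example-job-ps-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})
print(parse_tf_config(example))
```

A training script would branch on `job_name` to decide whether to run parameter-server or worker logic.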

For details, see the tensorflow/k8s project. For more information about running TensorFlow jobs on Kubernetes with the TFJob controller, see tf-controller-examples/.

Serving a model

For a detailed guide to deploying a model with the built-in TensorFlow Serving component, see https://github.com/google/kubeflow/tree/master/components/k8s-model-server.

Original link: https://github.com/google/kubeflow

This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.

✄ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Join Heart of the Machine (full-time reporter/intern) : [email protected]

Contribute or seek coverage: [email protected]

Advertising & Business partnerships: [email protected]