Introduction: The company's R&D work needed a machine learning (ML) technology stack for recommendation, so we started dabbling in Kubeflow, TensorFlow, YouTubeDNN, and the like for the first time.

Environment

CentOS 7.5
Kubernetes (k8s) v1.20.5
Docker 19.03.15

Installation

Please refer to the official documentation: https://www.kubeflow.org/docs/started/k8s/kfctl-k8s-istio/

wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar xf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
mv kfctl /usr/sbin

# Create a working directory (any path you like):
mkdir -p /letv/kubeflow/kf-test

export PATH=$PATH:/usr/sbin
export KF_NAME=kf-test
export BASE_DIR=/letv/kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_FILE=${KF_DIR}/kfctl_k8s_istio.v1.2.0.yaml

cd /letv/kubeflow/kf-test
kfctl apply -V -f ${CONFIG_FILE}
kubectl -n kubeflow get all
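Note that ${CONFIG_FILE} must actually exist in ${KF_DIR} before kfctl apply runs. If you have not fetched it yet, the v1.2 docs pull it from the kubeflow/manifests repository; verify the exact URL against the docs for your version:

# Fetch the v1.2 kfdef manifest into the path kfctl expects (URL per the
# v1.2 docs; confirm it for your Kubeflow version):
wget -O ${CONFIG_FILE} https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml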

Create the PV

Creating the Kubeflow pods with kfctl apply hit two problems:

  1. gcr.io images cannot be downloaded. The solution is to route the server through a proxy.

What I did was download the images locally through a proxy, push them to Harbor, and then have the server pull the many images from Harbor (sketched below). Another approach is to pull the gcr.io images through Alibaba Cloud mirrors.
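A minimal sketch of that pull/retag/push round trip. harbor.example.com and the image name/tag are placeholders; the real list is whichever gcr.io images your pods report as failing to pull:

# On a machine that can reach gcr.io:
docker pull gcr.io/kubeflow-images-public/centraldashboard:v1.2.0  # example image name
docker tag gcr.io/kubeflow-images-public/centraldashboard:v1.2.0 \
    harbor.example.com/gcr/centraldashboard:v1.2.0
docker push harbor.example.com/gcr/centraldashboard:v1.2.0

# On each Kubernetes node, pull from Harbor and retag back to the gcr.io
# name so the pod specs resolve without modification:
docker pull harbor.example.com/gcr/centraldashboard:v1.2.0
docker tag harbor.example.com/gcr/centraldashboard:v1.2.0 \
    gcr.io/kubeflow-images-public/centraldashboard:v1.2.0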

  2. PVs and PVCs need to be created; there are four PVCs to bind (katib-mysql, metadata-mysql, minio-pvc, mysql-pv-claim).

My approach is to write PV and PVC YAML files backed by NFS, for example:

root@vm-10-124-65-248 ~/test/kubeflow/PV$ cat PV_katib-mysql.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    name: pv-katib-mysql
  name: pv-katib-mysql
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: katib-mysql
    namespace: kubeflow
  nfs:
    path: /letv/nfs_kubeflow/data/v1/katib-mysql
    server: 10.124.65.247
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
root@vm-10-124-65-248 ~/test/kubeflow/PV$ cat PVC_katib-mysql.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app.kubernetes.io/component: katib
    app.kubernetes.io/name: katib-controller
  name: katib-mysql
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem
  volumeName: pv-katib-mysql
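After applying the four PV/PVC pairs, make sure every claim actually binds; the dependent pods stay in Pending until all four PVCs report Bound:

kubectl get pv
kubectl -n kubeflow get pvc   # all four PVCs should show STATUS Bound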

Another pitfall: some knative-serving deployments pin their image by sha256 digest, which no longer matches once the images have been re-pushed through a private registry, so the image reference has to be edited by hand, for example: kubectl edit deploy activator -n knative-serving
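What the edit looks like, sketched with placeholder registry paths; the container name "activator" is an assumption here, so check it first with kubectl -n knative-serving get deploy activator -o yaml:

# Inside `kubectl edit deploy activator -n knative-serving`, change the
# digest-pinned reference:
#   image: gcr.io/knative-releases/...@sha256:<digest>
# to a tag your registry can actually serve:
#   image: harbor.example.com/knative-releases/activator:<tag>
# Equivalent one-liner:
kubectl -n knative-serving set image deploy/activator \
    activator=harbor.example.com/knative-releases/activator:<tag>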

Dashboard

http://<master-ip>:31380 (the Istio ingress gateway's NodePort)

Entering the dashboard for the first time prompts you to create a namespace; subsequent pods, such as those created by Notebook Servers, will live in that namespace.

Run the Notebook Servers

Please refer to: aws.amazon.com/cn/blogs/ch…

Here’s a pitfall I ran into: when creating a test model with Notebook Servers, I also needed to create a PV and PVC for the notebook, otherwise the model data will not be retained. The notebook’s workspace volume is mounted at /home/jovyan inside the container, which is where that data lives.
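A minimal NFS-backed PV sketch for that workspace volume, assuming the notebook is named test in namespace admin and that the UI requested a PVC named workspace-test; all names and paths here are placeholders, so check the real PVC name with kubectl get pvc -n <your-namespace>:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-workspace-test
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: workspace-test   # must match the PVC the notebook requested
    namespace: admin
  nfs:
    path: /letv/nfs_kubeflow/data/v1/workspace-test
    server: 10.124.65.247
  persistentVolumeReclaimPolicy: Retain
EOF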

Run a custom image in Notebook Servers

When creating a Notebook Server on Kubeflow’s dashboard, you can choose a custom image instead of one of the stock ones.

So I tried to run with a TensorFlow image. After the pod was created there was indeed a problem: kubectl describe showed an image pull failure caused by a docker login / registry permission error. Digging further:

It turned out the serviceAccount (sa) configuration was missing a piece:

  • Kubeflow needs permission to pull the image from our registry, and no pull credential was configured.
  • So I created a secret in the notebook's namespace holding the Docker registry authentication info; I named it myregistry.
  • Even with the secret in place, the StatefulSet that the Kubeflow UI creates for the notebook runs under an sa named default-editor.
  • That sa, default-editor, does not reference any secret, so the secret has to be added to its imagePullSecrets (see the sketch after this list).
  • With the secret attached, the StatefulSet starts properly.
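A sketch of both steps; the namespace (admin), registry URL, and credentials are placeholders:

# 1. Create a docker-registry secret in the notebook's namespace:
kubectl -n admin create secret docker-registry myregistry \
    --docker-server=harbor.example.com \
    --docker-username=<user> \
    --docker-password=<password>

# 2. Attach it to the default-editor serviceAccount as an imagePullSecret,
#    so every pod running under that sa can pull from the private registry:
kubectl -n admin patch serviceaccount default-editor \
    -p '{"imagePullSecrets": [{"name": "myregistry"}]}'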