Author: Xu Xiaozhou (Xiao Yuan) | Source: Alibaba Cloud Native official account

Background

Thanks to the natural advantages of cloud computing in resource cost and elastic scaling, more and more customers are willing to build AI systems on the cloud. Cloud native technology, represented by containers and Kubernetes, has become the shortest path to unlocking the value of the cloud, and building AI platforms on Kubernetes has become a trend.

When faced with more complex models or larger amounts of data, the computing power of a single machine often cannot meet the demand. With distributed training frameworks such as Alibaba's AIACC or the community's Horovod, a single-machine training job can be extended to a distributed one with only a few lines of code changes. Common choices on Kubernetes are the Kubeflow community's tf-operator, which supports the TensorFlow PS mode, and mpi-operator, which supports Horovod's MPI AllReduce mode.

The status quo

Kubernetes and cloud computing provide agility and scalability. With components such as Cluster Autoscaler we can set elastic policies for training tasks, creating resources on demand and reducing GPU idle time.

But this scaling model still falls somewhat short for offline tasks such as training:

  • Fault tolerance is not supported: when some workers fail because of device problems, the whole task has to be stopped and restarted.
  • Training tasks usually run for a long time and occupy a large amount of computing power, yet they lack flexibility: when resources are tight, they cannot be released for other workloads unless the task is terminated.
  • Because training runs for a long time and workers cannot be reconfigured dynamically, preemptible (spot) instances cannot be used safely to maximize cost performance on the cloud.

Giving training tasks elasticity is the key to improving cost performance. Distributed frameworks such as Horovod have recently begun to support Elastic Training: a training task can dynamically add or remove training workers during execution without being interrupted. This requires a small adaptation in the training code; see horovod.readthedocs.io/en/stable/e… .

For details on how Elastic Training works, see this Elastic Horovod design document.
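
As a rough illustration of the kind of change involved, the sketch below wraps a Keras training loop with Horovod's elastic API. It follows the pattern from the Horovod elastic documentation rather than this project's example script; the toy model, dataset, and hyperparameters are placeholders chosen for brevity.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Toy data and model, stand-ins for a real training setup.
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (x[..., tf.newaxis] / 255.0, y)).batch(64).repeat()

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

# Wrap the mutable training state so it can be committed and broadcast to
# new workers whenever the worker set changes.
state = hvd.elastic.KerasState(model, batch=0, epoch=0)

callbacks = [
    hvd.elastic.CommitStateCallback(state),
    hvd.elastic.UpdateEpochStateCallback(state),
]

@hvd.elastic.run  # restarts from the last committed state when workers change
def train(state):
    model.fit(dataset, steps_per_epoch=200, epochs=5 - state.epoch,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

train(state)

The launcher then starts such a script through horovodrun with --min-np/--max-np and --host-discovery-script, as in the TrainingJob example later in this article.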

In mpi-operator, the workers participating in training are designed and maintained as static resources. Supporting the elastic training mode makes tasks more flexible, but also brings challenges to operations. For example:

  • The horovodrun command provided by Horovod must be used as the entry point. In Horovod, the launcher logs in to the workers through SSH, so the SSH tunnel between the launcher and the workers needs to be opened.

  • Horovod's Elastic Driver module, which drives the elasticity, obtains the latest worker topology by executing the user-specified discover_hosts script, and pulls up or stops worker instances accordingly. When the workers change, the return value of the discover_hosts script must be updated first; it prints the currently available hosts, one per line (the expected output format is sketched after this list).

  • In scenarios such as preemption or spot pricing, it is sometimes necessary to scale specified workers, and Kubernetes' native workload objects such as Deployment and StatefulSet cannot meet this requirement.
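
For reference, the host-discovery contract that Horovod expects is small: the script only needs to be an executable that prints the currently usable hosts, one per line, optionally suffixed with the slot count. The stand-in below merely illustrates that output format (the worker names mirror the pods created later in this article); the script that et-operator actually generates and mounts at /etc/edl/discover_hosts.sh is not reproduced here.

#!/usr/bin/env python3
# Minimal stand-in for a host discovery script: Horovod's elastic driver
# executes it periodically and reads one "<hostname>:<slots>" entry per line.
print("elastic-training-worker-0:1")
print("elastic-training-worker-1:1")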

The solution

To solve the above problems, we designed and developed et-operator, which provides a TrainingJob CRD to describe a training task, plus ScaleOut and ScaleIn CRDs to describe scale-out and scale-in operations. Combining them makes our training tasks much more flexible. The solution is open source, and requirements, discussion, and feedback are all welcome.

Open source solution address: github.com/AliyunConta…

Design

The TrainingJob Controller has the following functions:

  • Maintain the create/delete lifecycle of the TrainingJob and manage its sub-resources.
  • Perform scale-out and scale-in operations.
  • Fault tolerance: when a worker is evicted, create a new worker to join the training.

1. Create resources

The TrainingJob sub-resources are created in the following sequence:

  • Create the key pair required for SSH access and store it in a Secret.
  • Create the workers, including a Service and a Pod for each, and mount the public key from the Secret.
  • Create a ConfigMap containing the discover_hosts script and the hostfile.
  • Create the launcher and mount the ConfigMap. The hostfile is copied from the ConfigMap into a separate directory by an initContainer, because it is later modified as the worker topology changes.

TrainingJob

The configuration of the TrainingJob CR is divided into Launcher and Worker. By default, et-operator generates a hostfile and a discover_hosts script based on how workers are allocated. The discover_hosts script is mounted into the launcher at /etc/edl/discover_hosts.sh and passed to horovodrun in the entry script via the --host-discovery-script parameter. In the worker settings you can specify the worker image and GPUs, and constrain the allowed range of worker replicas with maxReplicas/minReplicas.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  cleanPodPolicy: Running
  etReplicaSpecs:
    launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - sh
            - -c
            - horovodrun -np 2 --min-np 1 --max-np 9 --host-discovery-script
              /etc/edl/discover_hosts.sh python /examples/elastic/tensorflow2_mnist_elastic.py
            image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
    worker:
      maxReplicas: 9
      minReplicas: 1
      replicas: 2
      template:
        spec:
          containers:
          - image: registry.cn-huhehaote.aliyuncs.com/lumo/horovod:master-tf2.1.0-torch1.4.0-mxnet-py3.6-gpu
            imagePullPolicy: Always
            name: mnist-elastic
            resources:
              limits:
                nvidia.com/gpu: "1"
              requests:
                nvidia.com/gpu: "1"
status:
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1
  - elastic-training-worker-2
  - elastic-training-worker-3
  phase: Succeeded
  replicaStatuses:
    Launcher:
      active: 1
      succeeded: 1
    Worker:
      active: 4

2. Worker scale-out and scale-in

In addition to the TrainingJob, et-operator supports ScaleOut and ScaleIn CRDs to deliver scale-out and scale-in operations for training tasks.

When a ScaleOut CR is delivered, the ScaleOutController triggers a Reconcile. The work here is simple: based on the Selector field in the ScaleOut CR, find the TrainingJob that the scaler targets and set it as the CR's OwnerReference.

Take an example of a ScaleOut operation:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  creationTimestamp: "2020-11-04T13:54:26Z"
  name: scaleout-ptfnk
  namespace: default
  ownerReferences:
  - apiVersion: kai.alibabacloud.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: TrainingJob
    name: elastic-training    # points to the owning TrainingJob
    uid: 075b9c4a-22f9-40ce-83c7-656b329a2b9e
spec:
  selector:
    name: elastic-training
  toAdd:
    count: 2

The TrainingJobController detects updates to ScaleOut CRs owned by a TrainingJob, which triggers the TrainingJob's Reconcile. It filters the ScaleIn and ScaleOut CRs whose OwnerReference points to the TrainingJob, and decides whether to scale out or scale in based on their creation time and status time.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: TrainingJob
metadata:
  name: elastic-training
  namespace: default
spec:
  # ... launcher and worker spec
status:
  currentScaler: ScaleIn:default/scaleout-ptfnk
  phase: Scaling
  currentWorkers:
  - elastic-training-worker-0
  - elastic-training-worker-1

Figure: ScaleOut task CR.

Figure: ScaleIn task CR.

Figure: detailed working process.

Run

1. Install et-operator

mkdir -p $(go env GOPATH)/src/github.com/aliyunContainerService
cd $(go env GOPATH)/src/github.com/aliyunContainerService
git clone https://github.com/aliyunContainerService/et-operator
cd et-operator
kubectl create -f deploy/all_in_one.yaml 

Check CRD installation:

# kubectl get crd
NAME                                    CREATED AT
scaleins.kai.alibabacloud.com           2020-11-11T11:16:13Z
scaleouts.kai.alibabacloud.com          2020-11-11T11:16:13Z
trainingjobs.kai.alibabacloud.com       2020-11-11T11:16:13Z

Check the running state of the controller, which is installed in the kube-ai namespace by default:

# kubectl -n kube-ai get po
NAME                                         READY   STATUS              RESTARTS   AGE
et-operator-controller-manager-7877968489-c5kv4   0/2     ContainerCreating   0          5s

2. Run TrainingJob

Run the prepared example:

kubectl apply -f examples/training_job.yaml

Check the running status:

# kubectl get trainingjob
NAME                          PHASE     AGE
elastic-training              Running   77s

# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          7s
elastic-training-worker-0                 1/1     Running            0          10s
elastic-training-worker-1                 1/1     Running            0          9s

3. Scale in the training job's workers

When scaling in, you can specify the workers to remove with the spec.toDelete.count or spec.toDelete.podNames field of the ScaleIn CR.

If count is used to specify how many workers to remove, workers are removed from the highest index downward.

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    count: 1

To scale in specific workers, configure podNames instead:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-workers
spec:
  selector:
    name: elastic-training
  toDelete:
    podNames:
    - elastic-training-worker-1

Run a scale-in example that removes one worker by count:

kubectl create -f examples/scale_in_count.yaml

Check the scale-in status and the training pods:

# kubectl get scalein
NAME                   PHASE            AGE
scalein-sample-t8jxd   ScaleSucceeded   11s

# kubectl get po
NAME                        READY   STATUS    RESTARTS   AGE
elastic-training-launcher   1/1     Running   0          47s
elastic-training-worker-0   1/1     Running   0          50s

4. Scale out the training job's workers

In the ScaleOut CR, specify the number of workers to add with the spec.toAdd.count field:

apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: elastic-training-scaleout-9dtmw
  namespace: default
spec:
  selector:
    name: elastic-training
  timeout: 300
  toAdd:
    count: 2

Run the example:

kubectl create -f examples/scale_out.yaml

Check the scale-out status and the training pods:

# kubectl get scaleout
NAME                                     PHASE            AGE
elastic-training-scaleout-9dtmw          ScaleSucceeded   30s
# kubectl get po
NAME                                      READY   STATUS             RESTARTS   AGE
elastic-training-launcher                 1/1     Running            0          2m5s
elastic-training-worker-0                 1/1     Running            0          2m8s
elastic-training-worker-1                 1/1     Running            0          40s
elastic-training-worker-2                 1/1     Running            0          40s

Conclusion

et-operator provides a set of training and scaling CRDs and controllers that make it easy to run elastic distributed training on Kubernetes. It supports delivering distributed training tasks and, by integrating with the distributed framework, dynamically scales the workers of a running training task out and in. This gives training tasks elasticity; combined with preemptible (spot) instances, it makes better use of the cloud's resource flexibility and cost advantages.