PyTorch is an open source machine learning and deep learning library, primarily developed by Facebook, used across a wide and growing range of use cases such as image recognition, natural language processing, translation, recommender systems, and more. Long popular in research, PyTorch has also gained significant traction in industry in recent years due to its ease of use and deployment.

Google Cloud AI Platform is a fully managed, end-to-end platform for data science and machine learning on Google Cloud. Leveraging Google's expertise in artificial intelligence, AI Platform offers a flexible, scalable, and reliable platform to run your machine learning workloads. AI Platform has built-in support for PyTorch through Deep Learning Containers that are performance-optimized, compatibility-tested, and ready to deploy.

In this new series of blog posts, PyTorch on Google Cloud, we aim to share how to build, train, and deploy PyTorch models at scale, and how to create repeatable machine learning pipelines on Google Cloud.

Why use PyTorch on Google Cloud AI Platform?

Cloud AI Platform provides flexible, scalable hardware and secure infrastructure to train and deploy PyTorch-based deep learning models.

  • Flexibility. AI Platform Notebooks and AI Platform Training give you the flexibility to design your compute resources to match any workload, while the platform manages most of the dependencies, networking, and monitoring. Spend your time building models, not worrying about infrastructure.
  • Scalability. Run your experiments on AI Platform Notebooks using a pre-built PyTorch container or a custom container, and scale the same code by training the model on GPUs or TPUs with the high availability of AI Platform Training.
  • Security. AI Platform leverages the same global-scale technical infrastructure designed to provide security through Google's entire information-processing lifecycle.
  • Support. AI Platform works closely with PyTorch and NVIDIA to ensure top-level compatibility between AI Platform and NVIDIA GPUs, including PyTorch framework support.

Here's a quick reference to PyTorch support on Google Cloud:


In this article, we will cover:

  1. Setting up a PyTorch development environment on JupyterLab notebooks using AI Platform Notebooks
  2. Building a sentiment classification model using PyTorch and training it on AI Platform Training

You can find the code for this post in the GitHub repository and the accompanying Jupyter notebooks.

Let’s get started!

Use case and dataset

In this article, we use PyTorch to fine-tune a transformer model (BERT base) from the Hugging Face Transformers library for a sentiment analysis task. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pretrained in a self-supervised fashion on a large corpus of unlabeled text. We will begin experimenting with the IMDb sentiment classification dataset on AI Platform Notebooks. We recommend using an AI Platform Notebooks instance with limited compute for development and experimentation. Once we are satisfied with the local experiments on the notebook, we show how to submit the same Jupyter notebook to the AI Platform Training service to scale the training with larger GPUs. AI Platform Training streamlines the training pipeline by spinning up infrastructure for the training job and shutting it down when training completes, without requiring you to manage the infrastructure yourself.

In the following posts, we will show how to deploy and serve these PyTorch models on the AI Platform Prediction service.

Create a development environment on AI Platform Notebooks

We will use JupyterLab notebooks on AI Platform Notebooks as our development environment. Before you begin, you must set up a project on Google Cloud Platform and enable the AI Platform Notebooks API.

Please note that you will be charged when you create an AI Platform Notebooks instance. You pay only for the time your notebook instance is up and running. You can choose to stop the instance, which saves your work and charges only for boot disk storage until you restart it. Delete the instance when you are finished.

You can create an AI Platform Notebooks instance in one of two ways:

  1. Use a pre-built PyTorch image from the AI Platform Deep Learning VM (DLVM) image family, or
  2. Use a custom container with your own packages

Create a notebook instance with the pre-built PyTorch DLVM image

AI Platform Notebooks instances are Deep Learning VM instances with the JupyterLab notebook environment enabled and ready to use. AI Platform Notebooks provides PyTorch image families supporting multiple PyTorch versions. You can create a new notebook instance from the Google Cloud Console or the command-line interface (CLI). We will use the gcloud CLI to create a notebook instance with an NVIDIA Tesla T4 GPU. From Cloud Shell or any terminal with the Cloud SDK installed, run the following command to create a new notebook instance.
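The original command is not reproduced here; a representative gcloud command might look like the following sketch, where the instance name, zone, machine type, and DLVM image family are placeholders to adapt to your project:

```bash
# Create a notebook instance from a pre-built PyTorch 1.7 DLVM image
# with one NVIDIA Tesla T4 GPU (names and zone are illustrative).
gcloud notebooks instances create pytorch-sentiment-nb \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=pytorch-1-7-cu110-notebooks \
    --machine-type=n1-standard-8 \
    --accelerator-type=NVIDIA_TESLA_T4 \
    --accelerator-core-count=1 \
    --install-gpu-driver \
    --location=us-central1-a
```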

To interact with the new notebook instance, go to the AI Platform Notebooks page in the Google Cloud Console and click the "OPEN JUPYTERLAB" link next to the new instance. The link becomes active once the instance is ready.

Most of the libraries needed to experiment with PyTorch are already installed on a new instance created from the pre-built PyTorch DLVM image. To install additional dependencies, run %pip install in a notebook cell. For the sentiment classification use case, we will install additional packages such as the Hugging Face Transformers and Datasets libraries.
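For example, a notebook cell along these lines installs both libraries (versions are left unpinned here; in practice you may want to pin the versions your code was tested against):

```
%pip install transformers datasets
```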

Create a notebook instance with a custom container

Another way to install dependencies, instead of pip-installing them in the notebook instance, is to package them in a Docker container image derived from an AI Platform Deep Learning Containers image, creating a custom container. You can use such a custom container to create AI Platform Notebooks instances or AI Platform Training jobs. Here is an example of creating a notebook instance with a custom container:

1. Create a Dockerfile with one of the AI Platform Deep Learning Containers images as the base image (here we use the PyTorch 1.7 GPU image) and install the packages or frameworks you need; for the sentiment classification use case, that means the transformers and datasets libraries. A sketch follows below.
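A minimal Dockerfile might look like this; the base image tag is an assumption based on the Deep Learning Containers naming scheme:

```dockerfile
# Start from a PyTorch 1.7 GPU Deep Learning Containers image (tag is illustrative).
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7

# Install the additional Python dependencies for the sentiment use case.
RUN pip install transformers datasets
```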

2. Build the image with Cloud Build from your terminal or Cloud Shell, and note the image location, gcr.io/{project_id}/{image_name}, as in the sketch below.
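A representative Cloud Build invocation (the image name is a placeholder):

```bash
# Build the container image and push it to Container Registry.
gcloud builds submit --tag gcr.io/${PROJECT_ID}/pytorch-sentiment-nb:latest .
```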

3. Use the command line to create a notebook instance from the custom image created in step 2, as shown below.
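A representative command, assuming gcloud's container flags for Notebooks (instance name and zone are placeholders):

```bash
# Create a notebook instance from the custom container image.
gcloud notebooks instances create pytorch-custom-nb \
    --container-repository=gcr.io/${PROJECT_ID}/pytorch-sentiment-nb \
    --container-tag=latest \
    --machine-type=n1-standard-8 \
    --accelerator-type=NVIDIA_TESLA_T4 \
    --accelerator-core-count=1 \
    --install-gpu-driver \
    --location=us-central1-a
```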

Train a PyTorch model on AI Platform

Once you have created your AI Platform Notebooks instance, you are ready to experiment. Let's first look at the model specifics for this use case.

Model specifics

To analyze the sentiment of movie reviews in the IMDb dataset, we will fine-tune the pre-trained BERT model from Hugging Face. Fine-tuning involves taking a model that has been trained for one task and adjusting it for another, similar task. Specifically, it involves copying all layers of the pre-trained model, including weights and parameters, except the output layer. A new output classifier layer is then added to predict labels for the current task. The final step is to train the new output layer from scratch, while the parameters of all the layers of the pre-trained model are frozen. This allows learning from the pre-trained representations and "fine-tuning" the higher-order feature representations so they are more relevant to the specific task, in this case sentiment analysis.

For the sentiment analysis task here, the pre-trained BERT model has already encoded a lot of information about language, because the model was trained in a self-supervised fashion on a large corpus of English text. Now we only need to tweak it slightly, using its outputs as features for the sentiment classification task. This means faster development iterations on a much smaller dataset, instead of training a task-specific natural language processing (NLP) model on a larger training dataset.

Pre-trained model with a classification layer: the blue boxes represent the pre-trained BERT encoder modules. The output of the encoder is pooled over a linear layer whose number of outputs equals the number of target labels (classes).

To train the sentiment classification model, we will:

  • Preprocess and transform (tokenize) the review data
  • Load the pre-trained BERT model and add a sequence classification head for sentiment analysis
  • Fine-tune the BERT model for sentence classification

The following is a code snippet for preprocessing the data and fine-tuning the pre-trained BERT model. For the complete code and a detailed explanation of these tasks, please refer to the Jupyter notebook.
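Since the original snippet is not reproduced here, below is a condensed sketch of those steps using the Hugging Face Trainer API; the checkpoint name, batch size, and epoch count are assumptions, while the 2e-5 learning rate and the accuracy-only compute_metrics function match what the text describes:

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load the IMDb reviews dataset and tokenize the review text.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)

# Load the pre-trained BERT model with a new sequence classification head
# (two labels: negative and positive).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)

# Capture only accuracy during evaluation; extend this function to report more.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Hyperparameters live in TrainingArguments; the very small learning rate
# (2e-5) avoids destroying the pre-trained representations.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=2,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```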

In the snippet above, notice that the weights of the encoder (also referred to as the base model) are not frozen. This is why a very small learning rate (2e-5) is chosen, to avoid loss of the pre-trained representations. The learning rate and other hyperparameters are captured in the TrainingArguments object. During training we capture only the accuracy metric; you can modify the compute_metrics function to capture and report additional metrics.

In the next post in this series, we will explore integrating the hyperparameter tuning service with Cloud AI Platform.

Train the model on Cloud AI Platform

While you can experiment locally on your AI Platform Notebooks instance, larger datasets or models often demand training with vertically scaled compute resources or horizontally distributed training. The most effective way to do this is the AI Platform Training service: it provisions the specific compute resources the job requires, runs the training job, and deletes the compute resources once the training job completes.

Before running the training application with AI Platform Training, the training application code and its required dependencies must be packaged and uploaded to a Google Cloud Storage bucket that your Google Cloud project can access. There are two ways to package the application and run it on AI Platform Training:

  1. Package the application and Python dependencies manually using Python setup tools
  2. Use custom containers to package the dependencies with Docker containers

You can structure your training code in any way you prefer. Refer to the GitHub repository or Jupyter notebooks for our recommended approach to structuring training code.

Package the application manually with Python setup tools

For this sentiment classification task, we package the training code with its standard Python dependencies (transformers, datasets, and tqdm) in a setup.py file. The find_packages() function in setup.py includes the training code in the package as dependencies. A sketch follows.
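A minimal setup.py consistent with that description might look as follows; the package name and version are placeholders:

```python
from setuptools import find_packages, setup

setup(
    name="trainer",            # placeholder package name
    version="0.1",
    packages=find_packages(),  # include the training code in the package
    install_requires=[
        "transformers",
        "datasets",
        "tqdm",
    ],
    description="PyTorch sentiment classification training application.",
)
```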

You can now submit the training job to Cloud AI Platform Training using the gcloud command from Cloud Shell or a terminal with the Google Cloud SDK installed. The gcloud ai-platform jobs submit training command stages the training application in the GCS bucket and submits the training job. We attach two NVIDIA Tesla T4 GPUs to the training job to accelerate training.
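A representative invocation is sketched below; the job name, bucket, region, extra module arguments, and pre-built PyTorch training image URI are assumptions to adapt to your project:

```bash
JOB_NAME=pytorch_sentiment_$(date +%Y%m%d_%H%M%S)

# Stage the packaged trainer and submit the job with two NVIDIA Tesla T4 GPUs.
gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region=us-central1 \
    --master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-7 \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --master-accelerator=type=nvidia-tesla-t4,count=2 \
    --job-dir=gs://${BUCKET_NAME}/${JOB_NAME} \
    --package-path=./trainer \
    --module-name=trainer.task \
    -- \
    --num-epochs=2
```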

Use custom containers for training

To create a training job with a custom container, you define a Dockerfile that installs the dependencies required for the training job and sets the training code as the entry point. You then build and test the Docker image locally to verify it before using it with AI Platform Training. A sketch follows.
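A minimal training Dockerfile under these assumptions (base image tag, directory layout, and module path are all placeholders):

```dockerfile
# Base image: PyTorch 1.7 GPU Deep Learning Containers image (tag is illustrative).
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7

WORKDIR /app

# Install the training dependencies and copy in the training code.
RUN pip install transformers datasets tqdm
COPY trainer/ ./trainer/

# Run the training module when the container starts.
ENTRYPOINT ["python", "-m", "trainer.task"]
```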

Before submitting the training job, you push the image to Google Container Registry and then submit the training job to Cloud AI Platform Training using the gcloud ai-platform jobs submit training command, as shown below.
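Representative commands (image name, job name, and region are placeholders):

```bash
# Push the locally built and tested image to Container Registry.
docker push gcr.io/${PROJECT_ID}/pytorch-sentiment-trainer:latest

# Submit the training job using the custom container.
gcloud ai-platform jobs submit training pytorch_sentiment_custom \
    --region=us-central1 \
    --master-image-uri=gcr.io/${PROJECT_ID}/pytorch-sentiment-trainer:latest \
    --scale-tier=CUSTOM \
    --master-machine-type=n1-standard-8 \
    --master-accelerator=type=nvidia-tesla-t4,count=2
```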

Once the job is submitted, you can monitor its status and progress either in the Google Cloud Console or with the gcloud commands shown below.
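For example (the job name is a placeholder):

```bash
# Check the current status of the training job.
gcloud ai-platform jobs describe pytorch_sentiment_custom

# Stream the training job's logs to the terminal.
gcloud ai-platform jobs stream-logs pytorch_sentiment_custom
```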

You can also monitor the job status and view the job logs from the AI Platform Jobs page in the Google Cloud Console.

Let's run a prediction call locally on the trained model with a few examples (refer to the notebook for the full code); a minimal sketch follows. The next post in this series will show how to deploy this model on the AI Platform Prediction service.
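A minimal local prediction sketch, reusing the model and tokenizer objects from the fine-tuning sketch above (the example reviews and label mapping are illustrative):

```python
import torch

reviews = [
    "A masterpiece from start to finish.",
    "Two hours of my life I will never get back.",
]

# Tokenize the examples and run a forward pass without gradient tracking.
inputs = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
model = model.to("cpu")
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

# For the IMDb setup here, label 0 = negative, label 1 = positive.
predictions = logits.argmax(dim=-1)
print(predictions.tolist())
```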

Clean up the notebook environment

After you finish experimenting, you can stop or delete the AI Platform Notebooks instance. Delete the notebook instance to prevent any further charges. If you want to save your work, you can choose to stop the instance instead.

What’s next?

In this article, we developed a PyTorch model on AI Platform Notebooks, a fully customizable development environment. We then trained the model on the Cloud AI Platform Training service, a fully managed service for training machine learning models at scale.

References

  • Introduction to AI Platform Notebooks

Repository

  • Code and accompanying notebooks

In the next post in this series, we will look at hyperparameter tuning on Cloud AI Platform and at deploying PyTorch models on the AI Platform Prediction service. We encourage you to explore the Cloud AI Platform features we covered.

Stay tuned. Thank you for reading! Have questions or want to chat? Find the authors here: Rajesh [Twitter | LinkedIn] and Vaibhav.

Thanks to Amy Unruh and Karl Weinmeister for their help and comments.