Original text: hippocampus-garden.com/kaggle_cola…

How to Kaggle with Colab Pro & Google Drive

Translation by KBSC13

Contact information:

GitHub: github.com/ccc013/AI_a…

Zhihu column: Machine learning and Computer Vision, AI paper notes

WeChat official account: AI algorithm notes

Preface

Colab Pro (currently available only in the US, Canada, Japan, Brazil, Germany, France, India, the UK, and Thailand) offers easy access to accelerated cloud computing resources that would otherwise be expensive and cumbersome to maintain yourself. Unlike the free version, Colab Pro lets you use TPUs and high-end GPUs such as the V100 and P100, access high-memory instances, and keep notebooks running for up to 24 hours, all for $10 per month.

Colab Pro can meet the resource requirements of most Kaggle competitions. There is one problem, however: each session lasts at most 24 hours, so you have to prepare the dataset again for every session, which can take a while depending on how you do it. The table below compares five ways of preparing a Kaggle dataset in terms of initial loading time and disk read/write speed:

Unfortunately, as the table shows, no single method is fast on both counts. Since we want to iterate over the dataset many times while training a model, fast disk reads and writes matter more. In this case I would choose the third option: download the dataset once through the Kaggle API, save it as a ZIP archive on Google Drive, and unzip it onto the instance's disk when the session starts. The next section walks through this process step by step.

Kaggle on Colab Pro

Download the data set to Google Drive

First, download the dataset through the Kaggle API and save it as a ZIP archive on Google Drive, following these steps:

  1. Log in to https://www.kaggle.com/<YourKaggleID>/account and download kaggle.json

  2. Create a folder on Google Drive called kaggle, then upload kaggle.json to it
  3. Start a Colab session
  4. Mount Google Drive by clicking the icon in the upper right corner, as shown below

  5. Copy kaggle.json from Google Drive into the current session and change its file permissions as follows:
! mkdir -p ~/.kaggle
! cp ./drive/MyDrive/kaggle/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
  6. (Optional) Upgrade the Kaggle API. This package is pre-installed on Colab instances, but as of May 2021 the pre-installed version lags behind the one used in Kaggle Notebooks, and the two behave slightly differently.
! pip install -U kaggle
  7. Download the dataset to Google Drive via the Kaggle API. The download can take a while to complete, and the files may take a few more minutes to show up in the Google Drive interface.
! mkdir -p ./drive/MyDrive/kaggle/<CompetitionID>
! kaggle competitions download -c <CompetitionID> -p ./drive/MyDrive/kaggle/<CompetitionID>

You can also upgrade your Google Drive plan to get more storage space.
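Putting steps 5-7 together, the following is a minimal sketch (using the article's <CompetitionID> placeholder and the kaggle/ folder layout above) of a cell that downloads the ZIP only when it is not already cached on Google Drive, so that later sessions can reuse it:

import os

# Placeholder competition ID, matching the article's <CompetitionID> examples
COMP = "<CompetitionID>"
ZIP_DIR = "./drive/MyDrive/kaggle/" + COMP
ZIP_PATH = ZIP_DIR + "/" + COMP + ".zip"

# Download through the Kaggle API only if the ZIP is not already cached on Drive
if not os.path.exists(ZIP_PATH):
    os.makedirs(ZIP_DIR, exist_ok=True)
    !kaggle competitions download -c {COMP} -p {ZIP_DIR}
else:
    print("ZIP already cached on Google Drive, skipping download")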

Unzip the file to the instance

Unzip the archive into the current session as follows; this step also takes some time:

! mkdir -p <CompetitionID>
! unzip -q ./drive/MyDrive/kaggle/<CompetitionID>/<CompetitionID>.zip -d <CompetitionID>
# You can also extract only part of the dataset to save time and disk space
! unzip -q ./drive/MyDrive/kaggle/<CompetitionID>/<CompetitionID>.zip train/* -d <CompetitionID>
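If you are not sure which part of the archive to extract, you can first inspect its contents with Python's zipfile module; a small sketch, using the same placeholder path:

import zipfile

# Same placeholder path as above; adjust to your competition
zip_path = "./drive/MyDrive/kaggle/<CompetitionID>/<CompetitionID>.zip"

with zipfile.ZipFile(zip_path) as zf:
    # Print the first few entries to decide which folders (e.g. train/*) to extract
    for name in zf.namelist()[:20]:
        print(name)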

Now it is time to train the model. After training, you can export the weight files as a Kaggle dataset and submit predictions through the Kaggle API. For the full usage of the API, see github.com/Kaggle/kagg…
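For reference, the two API operations mentioned above might look roughly like the following sketch; the folder and file names here (weights_export, model_weights.pth, submission.csv) are hypothetical placeholders, not from the original article:

# Package the trained weights as a Kaggle dataset
! mkdir -p weights_export
! cp model_weights.pth weights_export/    # hypothetical weight file produced by training
! kaggle datasets init -p weights_export    # creates dataset-metadata.json; edit it before the next step
! kaggle datasets create -p weights_export

# Submit a prediction file to the competition
! kaggle competitions submit -c <CompetitionID> -f submission.csv -m "first submission"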

Speed comparison

Unzipping the archive from Google Drive still takes a while. Is this really faster than downloading directly with the Kaggle API or gsutil? To answer this question, I prepared the dataset of the house price prediction competition (www.kaggle.com/c/house-pri… , 935KB) with each of the three methods and measured how long it took. The results are as follows:

The results may vary somewhat depending on where the instance is located, but in most cases extracting the archive from Google Drive is the fastest approach.
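If you want to run this kind of comparison yourself, here is a rough sketch of how the three methods could be timed in a Colab cell; the competition ID and GCS path are placeholders, and the numbers you get will depend on your instance:

import subprocess, time

COMP = "<CompetitionID>"  # e.g. the house price competition used above

def timed(label, cmd):
    # Run a shell command and report how long it took
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    print(f"{label}: {time.time() - start:.1f} s")

# 1) Download directly with the Kaggle API
timed("kaggle api", f"kaggle competitions download -c {COMP} -p /tmp/api")

# 2) Unzip the ZIP cached on Google Drive
timed("drive + unzip",
      f"unzip -q ./drive/MyDrive/kaggle/{COMP}/{COMP}.zip -d /tmp/drive")

# 3) Copy from the competition's GCS bucket with gsutil
#    (requires authentication and the GCS path, see the next section)
timed("gsutil", "gsutil -m cp -r gs://<GCSPath> /tmp/gcs")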

Pay attention to disk size

Colab Pro currently offers a 150GB disk, so compressed files cannot exceed 75GB.
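You can check how much disk space is left on the instance at any point with the standard df command, for example:

! df -h /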

Is it possible to mount external storage?

Mounting Google Cloud Storage Buckets

Colab can mount Google Cloud Storage (GCS) buckets and access Kaggle datasets without downloading them. First, authorize your account with the following code:

from google.colab import auth
auth.authenticate_user()

Next, install gcsfuse:

! echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
! curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
! apt update
! apt install gcsfuse

Then open a Kaggle Notebook for the competition you are interested in and run the following code to get the GCS (Google Cloud Storage) path of its dataset:

from kaggle_datasets import KaggleDatasets
print(KaggleDatasets().get_gcs_path())

For example, for House Prices - Advanced Regression Techniques, the result looks like this:

gs://kds-ecc57ad1aae587b0e86e3b9422baab9785fc1220431f0b88e5327ea5

The GCS bucket can now be mounted with gcsfuse:

! mkdir -p <CompetitionID>
! gcsfuse  --implicit-dirs --limit-bytes-per-sec -1 --limit-ops-per-sec -1 <GCSPath without gs://> <CompetitionID>

Running the commands above completes the mount within a second. But once you start iterating over the dataset, disk access is very slow. The exact speed depends on the regions where the Colab instance and the GCS bucket are located, but in general this mounting approach should be avoided.
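To see the effect on your own instance, you can time a read from the mounted bucket against the same file copied to local disk, along these lines (the file names are placeholders, not from the article):

import time

def time_read(path):
    # Read a file once and report the elapsed time
    start = time.time()
    with open(path, "rb") as f:
        n = len(f.read())
    print(f"{path}: {n} bytes in {time.time() - start:.2f} s")

time_read("<CompetitionID>/train.csv")      # served through the gcsfuse mount
time_read("/content/local_copy/train.csv")  # same file copied to the instance's disk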

You can check which region the Colab instance is running in as follows:

! curl ipinfo.io

The region of the GCS bucket should, in principle, be obtainable with the following command, but I got an AccessDeniedException that I could not resolve.

! gsutil ls -Lb gs://kds-ecc57ad1aae587b0e86e3b9422baab9785fc1220431f0b88e5327ea5

Mounting Google Drive

This method is too slow for disk access!
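For completeness, "this method" means reading the dataset straight from the Drive mount without first copying or unzipping it to the local disk, i.e. something like the standard drive.mount call below:

from google.colab import drive
drive.mount('/content/drive')

# Reading training data directly from the mount point, e.g.
# /content/drive/MyDrive/kaggle/<CompetitionID>/..., is what turns out to be too slow.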


Conclusion

Colab Pro gives you access to better accelerators such as TPUs or P100/V100 GPUs. It is a paid service at $10 per month, and each session runs for at most 24 hours, so if training takes longer than that you have to run several sessions, which means reloading the dataset and the weight files saved from the previous run each time.

Therefore, to make the most of the Colab Pro session time, you want to minimize the time spent loading the dataset. With this in mind, the author compared five methods. Since most of a session goes into training the model, fast disk reads matter most, so the chosen approach is to extract a ZIP archive cached on Google Drive onto the Colab instance's disk; the article walks through this procedure. The author also looked into mounting external storage instead, but disk access that way is too slow, so it is not recommended.

In addition, Colab Pro provides only 150GB of disk space, and the compressed archive cannot exceed about 75GB, so this approach is not suitable for competitions with very large datasets or models with a huge number of parameters. For smaller datasets, however, you can still comfortably train models with Colab Pro.
