Intro

MLPerf: Fair and useful benchmarks for measuring training and inference performance of ML hardware, software, and services.

I will go through the steps to set up and run one of the MLPerf training benchmarks. This should help you start executing MLPerf benchmarks on your own hardware, whether it is on-premises or in the cloud.

Prerequisites

Hardware environment:

I use a server equipped with 8 x NVIDIA Tesla V100 16GB SXM2 GPUs to demonstrate how to run one of the MLPerf training benchmarks.

Software environment:

  • Ubuntu 16.04.6 LTS
  • CUDA
  • docker
  • nvidia-docker2

There are plenty of articles on the internet about how to install these on your system, so I won't go through that here.
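
Before going further, it is worth confirming that the GPU driver and the NVIDIA Docker runtime work together. A minimal sanity check (the CUDA image tag below is only an example; pick one that matches your installed driver):

nvidia-smi
docker --version
docker run --rm --runtime=nvidia nvidia/cuda:10.1-base nvidia-smi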

Step-by-step guide

Clone the latest MLPerf training results repository

For the training benchmarks I am using training_results_v0.6 rather than the reference implementations provided in the mlperf/training repository. Please note that the reference implementations are valid starting points for benchmark implementations, but they are not fully optimized and are not intended to be used for "real" performance measurements of software frameworks or hardware.

git clone https://github.com/mlperf/training_results_v0.6.git

In this repo, there are directories for each vendor submission (Google, Intel, NVIDIA, etc.) which contain the code and scripts to generate the results. I will only focus on the NVIDIA submission, since I want to run the benchmark on NVIDIA GPUs.

cpcadm@ubuntu:~$ cd training_results_v0.6/
cpcadm@ubuntu:~/training_results_v0.6$ ls
Alibaba  CONTRIBUTING.md  Fujitsu  Google  Intel  LICENSE  NVIDIA  README.md
cpcadm@ubuntu:~/training_results_v0.6$ cd NVIDIA/; ls
benchmarks  LICENSE.md  README.md  results  systems
cpcadm@ubuntu:~/training_results_v0.6/NVIDIA$ cd benchmarks/; ls
gnmt  maskrcnn  minigo  resnet  ssd  transformer

There are six different training benchmarks in the NVIDIA/benchmarks directory. I use the first one, GNMT, a recurrent neural network model similar to the one from Google, used for language translation.

Since I only have a single node system, I will pick the submitted result for a single node (NVIDIA DGX-1) and use its documentation to run GNMT on my system.

Download and verify dataset

Execute the script to download and verify the dataset. It took me around 2 hours to download everything; the actual duration will depend on your network connection. The dataset requires around 13 GB of filesystem space.

cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks$ cd gnmt/implementations; ls
data  download_dataset.sh  logs  pytorch  verify_dataset.sh
cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ bash download_dataset.sh
cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ du -sh data/
13G     data/

Execute the script to verify that the dataset has been correctly downloaded.

cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ bash verify_dataset.sh
OK: correct data/train.tok.clean.bpe.32000.en
OK: correct data/train.tok.clean.bpe.32000.de
OK: correct data/newstest_dev.tok.clean.bpe.32000.en
OK: correct data/newstest_dev.tok.clean.bpe.32000.de
OK: correct data/newstest2014.tok.bpe.32000.en
OK: correct data/newstest2014.tok.bpe.32000.de
OK: correct data/newstest2014.de

Launch training jobs

The scripts and code to execute the training job are inside the pytorch directory. Let's explore the files within this directory.

cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations$ cd pytorch/; ls -1
bind_launch.py
config_DGX1_multi.sh
config_DGX1.sh
config_DGX2_multi_16x16x32.sh
config_DGX2_multi.sh
config_DGX2.sh
Dockerfile
LICENSE
logs
mlperf_log_utils.py
preprocess_data.py
README.md
requirements.txt
run_and_time.sh
run.sub
scripts
seq2seq
setup.py
train.py
translate.py

You need to configure config_<system>.sh to reflect your system configuration. If your system has 8 or 16 GPUs, you can use the existing config_DGX1.sh or config_DGX2.sh file, respectively, to launch the training job.

Parameters to edit:

DGXNGPU=8
DGXSOCKETCORES=18
DGXNSOCKET=2
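
For example, if your server had 4 GPUs and two 12-core sockets (hypothetical values, not my system), the same three lines would read:

DGXNGPU=4
DGXSOCKETCORES=12
DGXNSOCKET=2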

You can use the nvidia-smi command to get GPU information and the lscpu command to retrieve CPU information, in particular (see the commands after this list):

  • Core(s) per socket: 18
  • Socket(s): 2
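
These are the fields I read the values from (standard nvidia-smi and lscpu options):

nvidia-smi --query-gpu=index,name,memory.total --format=csv
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)'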

Next, build the Docker image from the pytorch directory (where the Dockerfile lives) to prepare for the training jobs.

docker build -t mlperf-nvidia:rnn_translator .
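
To confirm the build succeeded, the new image should show up in the local image list:

docker images mlperf-nvidia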

Now we are ready to execute the training job using the launch script run.sub, setting the environment variables for the dataset, the log files, and the config file.

DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> PULL=0 DGXSYSTEM=<system> sudo ./run.sub

For my test, I will be using config_DGX1.sh and therefore specify DGXSYSTEM as DGX1. I also set PULL to 0 to indicate that a local image should be used instead of pulling the Docker image from a repository. I created a new directory 'logs' to store the benchmark log files and provide the data directory path when launching the benchmark run, as shown below:

cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/pytorch$ DATADIR=/home/cpcadm/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/data LOGDIR=/home/cpcadm/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs PULL=0 DGXSYSTEM=DGX1 sudo ./run.sub

If everything goes well, it will execute 10 trial runs of the benchmark and store the log files in the specified directory. Since I specified 8 GPUs in the config file, you will see all 8 GPUs being utilized to train the GNMT model.

You can use the command watch -d -n 1 nvidia-smi to monitor GPU usage continuously.
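
If you prefer plain scrolling output instead of watch, nvidia-smi can also poll on its own (these are standard nvidia-smi options):

nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5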



And use htop to monitor the CPU utilization if you like.

The benchmark run time is recorded in the log file. The results for my runs are shown below. The average time for a run was around 150 minutes, and it took more than 25 hours to finish all 10 iterations. You can modify run.sub to limit the number of runs if you don't wish to run all 10 iterations.
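
I have not reproduced run.sub here, but a quick way to locate the loop that controls the number of trials before editing it is to search the script (the keywords below are just a guess at what to look for):

grep -n -E 'for |seq ' run.sub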

cpcadm@ubuntu:~/training_results_v0.6/NVIDIA/benchmarks/gnmt/implementations/logs$ grep RNN_TRANSLATOR *.log
200601020951809569527_1.log:RESULT,RNN_TRANSLATOR,,5193,nvidia,2020-05-31 06:11:52 PM
200601020951809569527_2.log:RESULT,RNN_TRANSLATOR,,6487,nvidia,2020-05-31 07:38:43 PM
200601020951809569527_3.log:RESULT,RNN_TRANSLATOR,,5191,nvidia,2020-05-31 09:27:08 PM
200601020951809569527_4.log:RESULT,RNN_TRANSLATOR,,5196,nvidia,2020-05-31 10:53:57 PM
200601020951809569527_5.log:RESULT,RNN_TRANSLATOR,,5194,nvidia,2020-06-01 12:20:51 AM
200601020951809569527_6.log:RESULT,RNN_TRANSLATOR,,5194,nvidia,2020-06-01 01:47:34 AM
200601020951809569527_7.log:RESULT,RNN_TRANSLATOR,,6493,nvidia,2020-06-01 03:14:18 AM
200601020951809569527_8.log:RESULT,RNN_TRANSLATOR,,5193,nvidia,2020-06-01 05:02:41 AM
200601020951809569527_9.log:RESULT,RNN_TRANSLATOR,,5196,nvidia,2020-06-01 06:29:24 AM
200601020951809569527_10.log:RESULT,RNN_TRANSLATOR,,5198,nvidia,2020-06-01 07:56:10 AM
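
Since the recorded run time (which appears to be in seconds) is the fourth comma-separated field of each RESULT line, a quick one-liner can average it across the runs:

grep RNN_TRANSLATOR *.log | awk -F, '{sum += $4; n++} END {if (n) printf "average: %.0f seconds over %d runs\n", sum/n, n}'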

The other training benchmarks in the remaining directories (maskrcnn, minigo, resnet, ssd, and transformer) can be run using similar steps: download the dataset, build the Docker container, and launch the training job. You can use the MLPerf training benchmarks to compare different GPU systems, evaluate different software frameworks, and so on.

The most important thing

If you run into an error related to numactl <n-m,n-m>, it is because the calculated CPU ranges fall outside your system's actual range. You have to modify bind_launch.py; the workaround that worked for me was to reduce the upper end of the range by 4.

I am not sure why the calculation goes wrong. After editing the file, you need to rebuild the Docker image.
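
Before patching, it is worth checking how many CPUs your system actually exposes, since that is the range numactl --physcpubind must stay within:

numactl --hardware
lscpu | grep -i -E '^CPU\(s\)|NUMA node'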

This was my main motivation for writing this article, and I got most of the steps from this article.