Compiled by Heart of the Machine from L7, Curtis Northcutt's blog.

For deep learning training, the RTX 2080 Ti is the most cost-effective graphics card in Nvidia's latest generation of GPUs (see: First Titan RTX Deep Learning review results: Which GPU should you pick for 2019?). But even without going for a Titan, at around 9,000 yuan (roughly $1,300) per card, the GPU alone is expensive. In this article, Curtis Northcutt of MIT shows us a straightforward way to build a deep learning workstation with three RTX 2080 Tis.

In his configuration, the entire system costs $6,200, about half the price of a comparable machine from Lambda Labs, an AI hardware vendor. Let's see how he built such a powerful machine for his lab.

I built a multi-GPU deep learning workstation for the MIT Quantum Computing Laboratory and the Digital Learning Laboratory. Searching the web, I couldn't find a single article that covered all the details of the build.

I did find vendors of pre-built machines, such as the Lambda GPU Workstation. The only problem: one of these machines costs $12,500. It's a great platform for cutting-edge deep learning research, but of no use if you can't afford it. In this article, I'll present my own build, which matches or exceeds that configuration and costs about half as much: just $6,200. I'll share every configuration detail for the benefit of other researchers.

If you're building a smaller deep learning machine, you'll find this article equally useful: throughout the text I've included options that can further reduce costs.

At the end of the article, I give a time/cost comparison between the self-built machine and a Google Compute Engine (GCE) deep learning VM, using PyTorch's ImageNet/ResNet50 training as the benchmark.

Perfect configuration?

There is no such thing as a perfect configuration, because everyone's needs are different. Even if one existed, the optimal configuration would change as new hardware is released. So this article aims to give you a configuration that is as close to optimal as possible.

All the components of a deep learning workstation

Here’s the list.

I ordered all the components online from Newegg, but Amazon and other retailers carry them as well. If you can find them cheaper at a local electronics market, that works too.

All the components of a deep learning workstation.

As of January 31, 2019, each component and its price are as follows:

  • 3 EVGA Nvidia RTX 2080 Ti GPUs

  • EVGA GeForce 2080 Ti, $3,570 ($1,190 each)


  • 20-thread CPU (central processing unit)

  • Intel Core i9-9820X Skylake-X 10-core 3.3 GHz, $850


  • X299 motherboard (all other components connect to the motherboard)

  • ASUS WS X299 SAGE LGA 2066 Intel X299, $492


  • The case

  • Corsair Carbide Series Air 540, $130


  • 2 TB M.2 SSD (solid-state drive)

  • Intel 660P Series M.2 2280 2TB PCI-Express, $280


  • 3 TB mechanical hard drive (for speed-insensitive file storage)

  • Seagate BarraCuda ST3000DM008 3TB 7200 RPM, $85


  • 128 GB of RAM

  • Four Corsair Vengeance LPX 32 GB modules, $740 ($185 each)


  • 1600 W PSU (power supply)

  • EVGA SuperNOVA 1600 W P2, $347 (a 1300 W supply shut off and restarted during the ImageNet/ResNet50 benchmark)


  • The CPU cooler

  • Corsair Hydro Series H100i PRO Low Noise Version, $110

Purchased with a Newegg member account, the total cost of all components before tax is $6,200 (the power-supply upgrade adds another $107).

Deep learning Workstation view.

Considerations for each component

There are three goals to keep in mind when choosing components such as GPUs, RAM, CPUs, and motherboards:

  1. Maximum speed and capacity

  2. Avoid bottlenecks between components

  3. Spend less

I listed all the components needed to build a workstation and the considerations for each. The components are arranged in order of their effect on the performance of deep learning model training.

GPU

  • Based on benchmarks, the RTX 2080 Ti is the best GPU under $2,500.

  • Buy an aftermarket GPU (e.g., EVGA or MSI) instead of the Nvidia Founders Edition.

  • Be aware that the RTX 2080 Ti can overheat.

  • This workstation uses open-fan GPUs rather than blower-style ones because they are cheaper, but blower-style GPUs may perform better.

GPUs are the most important component of a deep learning machine, and also the most expensive. You should generally decide which GPU to use first: every other component choice in the build follows from it. There are many blog posts on how to choose a GPU that meets your needs.

If you want a high-performance GPU, I recommend buying the RTX 2080 Ti and not letting marketing distract you. If you want to do your own research and pick a cost-effective GPU, check Videocardbenchmark.net and choose the best-performing GPU in your price range. Unless you have a budget above $2,500, the RTX 2080 Ti is the best bet. If you can accept roughly 30% lower performance, you can buy the cheaper RTX 2080 or the older GTX 1080 Ti. For serious deep learning, I recommend a GPU with at least 11 GB of memory, which is exactly what the RTX 2080 Ti has.

When shopping for the RTX 2080 Ti, you will notice a large number of brands on the market: EVGA, Gigabyte, Asus, MSI, and so on. These are so-called aftermarket GPUs. You can also buy Nvidia's Founders Edition directly. In general, if you're after the best performance, don't buy the Founders Edition. Companies like EVGA tune their cards for performance and sometimes overclock them; the Founders Edition is Nvidia's first pass, not the best one. Aftermarket GPUs typically come with one to three fans, the implication being that more fans mean better performance. Some of that is just marketing; two fans are usually enough. The main advice here: buy an aftermarket GPU from EVGA, Gigabyte, Asus, or MSI.

Note that aftermarket GPUs come at many different prices. Overclocked GPUs tend to be more expensive, and the extras often don't translate into real performance gains. You can usually just buy the cheapest one.

Some buyers have complained that the RTX 2080 Ti overheats. I built my workstation with only three GPUs to improve cooling airflow. If that causes no problems, I will add a fourth RTX 2080 Ti.

I used open-fan GPUs (fans on the underside of each card) in this build because they cost less. Blower-style GPUs exhaust air out of the chassis and can perform better here: with the motherboard I used, the GPUs sit packed tightly together, which keeps the open fans from moving air effectively. A blower-style GPU pushes its hot air straight out of the side of the chassis.

Solid-state drive (SSD)

  • SSD <> GPU data transfer is a major bottleneck in deep learning training and inference.

  • M.2 SSDs are roughly six times faster than standard SSDs.

  • If your budget allows, buy an M.2 SSD. You will need an M.2-compatible motherboard.

Moving data from disk to GPU is a major deep learning bottleneck that can dramatically slow training and testing. An M.2 SSD alleviates this: the most expensive M.2 SSDs write at about 3,500 MB/s, compared with roughly 500 MB/s for a standard SSD.

I bought a cheaper M.2 SSD for this build, with write speeds of about 1,800 MB/s but a larger 2 TB capacity. You might find it more useful to buy a smaller 256 GB M.2 SSD that writes faster and costs less; this is a good way to get better performance for less money. The only caveat is to make sure all of your training data fits on the M.2 SSD.
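If you want to check whether the drive, rather than the GPU, is the bottleneck on your own machine, you can time raw sequential reads of a large file. Below is a minimal sketch of that idea; the file paths are hypothetical placeholders, and note that the OS page cache will inflate the numbers if you read the same file twice.

```python
import time

def sequential_read_mb_per_s(path, block_size=64 * 1024 * 1024):
    """Read a large file front to back and report throughput in MB/s."""
    total_bytes = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total_bytes += len(block)
    return total_bytes / (time.time() - start) / 1e6

# Hypothetical usage: compare the M.2 SSD against the mechanical drive.
# print(sequential_read_mb_per_s("/mnt/m2/imagenet/train_archive.tar"))
# print(sequential_read_mb_per_s("/mnt/hdd/imagenet/train_archive.tar"))
```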

The motherboard

  • To support multiple GPUs, you need enough PCIe lanes.

  • This means you'll need an X299 (Intel CPU) or X399 (AMD CPU) motherboard.

  • You can choose a cheaper board, but if your budget allows, consider a workstation-class motherboard.

Motherboards are hard to shop for: there are so many choices that many people don't know why some boards cost more than others. For deep learning, the most important spec is the number of PCIe lanes. The motherboard in my build has 44 PCIe lanes. That means with 3 GPUs (each able to use 16 lanes), I can run 2 GPUs at 16 lanes each and 1 GPU at 8 lanes (40 lanes in total). Most benchmarks show a negligible performance difference between GPUs running at 8 versus 16 lanes, though the gap could grow in the future. At a minimum, make sure your motherboard has enough PCIe lanes to cover the minimum each GPU requires; for 3 RTX 2080 Ti GPUs, that means at least 24 PCIe lanes.
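Once the machine is assembled, you can check how many lanes each card actually negotiated: nvidia-smi exposes the current and maximum PCIe link width per GPU. A small sketch (it just shells out to the standard nvidia-smi tool, which is assumed to be on the PATH):

```python
import subprocess

# Report each GPU's current and maximum PCIe link width (e.g. 8 or 16 lanes).
output = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,name,pcie.link.width.current,pcie.link.width.max",
    "--format=csv",
]).decode()
print(output)
```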

Another consideration is whether to choose an X299 (Intel CPU) or X399 (AMD CPU) motherboard. Intel CPUs are faster per thread, but AMD CPUs are generally cheaper for the same number of threads. I chose an Intel processor (20 threads with fast single-thread performance), hence the need for an X299 motherboard.

More reliable (and more expensive) motherboards are often referred to as workstation motherboards. Whether the extra reliability is worth the higher price is debatable. I chose a workstation motherboard for my build, but if you want a cheaper option, consider the SUPERMICRO X299 motherboard: it has everything I need and is about $100 cheaper.

CPU

  • Choose an Intel X-series CPU (X299 motherboard) or an AMD Threadripper (X399 motherboard).

  • Intel CPUs are faster per thread, but AMD CPUs give you more threads for the same cost.

Choose a CPU based on your computing needs by considering the following questions:

  1. Do you need to run a lot of multithreading?

  2. Do you need each thread to run fast?

If the answer to (1) is “yes” and to (2) is “no”, you can opt for the 32-thread AMD Ryzen Threadripper 2950X and spend less. If the answer to (2) is “yes”, you'll probably want an Intel CPU.

For Intel, you need an Intel Core X-series CPU for multi-GPU deep learning. Only X-series CPUs work with X299 motherboards, and only X299 motherboards have enough PCIe lanes to support multiple GPUs. If you're only using 2 GPUs, you can cut the motherboard + CPU cost by choosing a cheaper 300-series Intel CPU and an LGA 1151 motherboard (instead of X299). That lets you run one GPU on 16 PCIe lanes and the other on 8 lanes (most LGA 1151 motherboards have 24 PCIe lanes, but check carefully before buying).
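One concrete place those CPU threads matter is data loading: in PyTorch, each DataLoader worker is a separate process that decodes and augments images while the GPUs train. The sketch below shows one way to size num_workers from the available threads; the dataset path and the exact workers-per-GPU ratio are illustrative assumptions, not recommendations from the article.

```python
import os
import torch
from torchvision import datasets, transforms

num_gpus = max(1, torch.cuda.device_count())
cpu_threads = os.cpu_count() or 1            # 20 on the i9-9820X used in this build
# Leave a couple of threads free for the OS and the training processes themselves.
workers_per_gpu = max(1, (cpu_threads - 2) // num_gpus)

train_set = datasets.ImageFolder(
    "/data/imagenet/train",                   # hypothetical path on the M.2 SSD
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    num_workers=workers_per_gpu,
    pin_memory=True,                          # speeds up host-to-GPU copies
)
```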

The case

  • Choose a case that fits your motherboard (ATX is the standard size; mini-ATX is smaller).

  • Choose a case with room for airflow to keep the GPUs cool.

  • The Carbide Series™ Air 540 High Airflow ATX Cube Case is suitable for a deep learning workstation.

For a multi-GPU workstation, airflow and heat dissipation are of paramount importance. Choose a case that fits your motherboard. Most boards that support multiple GPUs are ATX, so pick a case that accepts an ATX motherboard. If you're unsure which case to buy, the Carbide Series™ Air 540 High Airflow ATX Cube Case is a solid choice.

Hard disk drive

If the M.2 SSD cannot meet your storage needs, buy a 7,200 RPM mechanical drive.

If the M.2 SSD is too small for your storage needs, you can add a mechanical hard drive. It is cheaper and comes in two speeds: 5,400 RPM (slower) and 7,200 RPM (faster). RPM stands for revolutions per minute; these drives physically spin inside the computer, so they make some noise, but they are cheap. Buy a 7,200 RPM one.

Memory

  • Buy low-profile RAM and make sure it fits in your build.

  • Avoid buying brands you haven’t heard of.

With RAM, you need to consider capacity, physical size, and latency. My build uses 128 GB of RAM, though you can drop to 64 GB or 32 GB depending on your dataset sizes. If you can afford it, I recommend 128 GB so you can load an entire dataset into memory when training your deep learning model, avoiding the hard-drive <> RAM bottleneck on every epoch.
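If your dataset really does fit in RAM, one simple pattern is to read every sample into memory once so that epochs after the first never touch the disk. Here's a minimal sketch of that idea; the one-tensor-file-per-sample layout is a hypothetical example, not how the article stored ImageNet.

```python
import glob
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Load all samples into RAM up front to avoid disk reads during training."""

    def __init__(self, root):
        # Hypothetical layout: one saved (image_tensor, label) pair per .pt file.
        self.samples = [torch.load(p) for p in sorted(glob.glob(f"{root}/*.pt"))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Usage sketch: dataset = InMemoryDataset("/data/preprocessed_tensors")
```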

For a multi-GPU workstation, be sure to buy low-profile RAM, meaning modules with a shorter height. A lot has to fit on the motherboard, and tall RAM modules can block other components. Corsair Vengeance is a good low-profile option.

If you aren't filling every RAM slot, check the motherboard documentation. Putting the RAM in the right slots matters! The motherboard and its manual will usually specify where the modules should go.

PSU (Power supply)

  • Make sure your PSU supplies enough wattage. See this PSU calculator: https://outervision.com/power-supply-calculator

  • Each RTX 2080 Ti draws about 300 W.

  • Choose a fully modular PSU, because fewer cables means better airflow.

  • My 1300 W PSU caused the workstation to restart at maximum load; 1600 W is the right size for this build.

You may see PSUs labeled Gold, Platinum, and so on. These labels indicate the PSU's efficiency rating, with Platinum > Gold > Silver > Bronze > basic. A Bronze PSU, for example, wastes more electricity than a Platinum PSU doing the same work. If you care about saving power (and being green), consider a Platinum or Gold PSU.

For the workstation described here, I originally bought a Seasonic PRIME 1300 W PSU, but when I ran distributed PyTorch ImageNet/ResNet50 training with all GPUs maxed out, the workstation kept restarting. I switched to the EVGA SuperNOVA 1600 P2 and the problem went away. Note that when I used sudo nvidia-smi -pl 180 to lower the GPU power limit from 250 W to 180 W, the 1300 W PSU was sufficient. I still recommend the 1600 W PSU, though, since otherwise you are limiting GPU speed.
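As a rough sanity check on PSU sizing, you can add up the worst-case draw of the major components and leave some headroom. The wattages below are ballpark figures I'm assuming for illustration (three ~300 W GPUs, a ~165 W TDP CPU, plus drives, fans, and RAM), not measurements from this build:

```python
# Back-of-the-envelope PSU sizing (all wattages are assumed ballpark figures).
gpu_watts = 3 * 300      # three RTX 2080 Ti cards at roughly 300 W each
cpu_watts = 165          # i9-9820X TDP; it can spike higher under heavy load
other_watts = 150        # motherboard, RAM, drives, fans, AIO pump
headroom = 1.25          # ~25% margin for transient spikes

estimated_load = gpu_watts + cpu_watts + other_watts
recommended = estimated_load * headroom
print(f"Estimated load: {estimated_load} W, recommended PSU: {recommended:.0f} W")
# -> roughly 1215 W of load, ~1520 W recommended, which is why 1300 W was marginal
```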

The cooling system

  • Generally, good airflow and proper cable management are enough for GPU cooling.

  • The Corsair H100i is enough to cool a high-performance (i9 X-series) CPU.

  • Even so, keep the machine in a cool, air-conditioned room if possible.

You have plenty of options, from simple cooling fans to system-wide water cooling. In general, if the case is roomy and the cables are managed well, you don't need anything fancy. My CPU does not come with a stock cooler, so I used the Corsair H100i, a standard choice in deep learning workstations. A cheaper option is the Noctua NH-U9S CPU cooler; the reason I didn't buy it is that it's large and might block some of the RAM slots. If you only need 32 GB of RAM, that cooler is a fine choice.

Benchmarking vs. Google Compute Engine

I benchmarked the machine against a Google Compute Engine (GCE) deep learning virtual machine. These VMs are advertised as pre-built and optimized for deep learning, using CUDA versions and drivers built from source for their hardware. GCE doesn't offer the Nvidia RTX 2080 Ti, so I used Tesla K40 GPUs instead. Depending on the benchmark task, an RTX 2080 Ti delivers roughly two to four times the performance of a Tesla K40, so to be fair I compared one RTX 2080 Ti in my machine against four Tesla K40s in the GCE VM.

For the benchmark, I used PyTorch's distributed ImageNet example. I downloaded the ImageNet 2012 training and validation sets and ran the following command on both my machine and the GCE deep learning VM:

python examples/imagenet/main.py -a resnet18 --lr 0.1 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 "/location/where/I/stored/imagenet/"

GCE Deep learning VM specifications

The VM specifications I created are as follows:

  • Architecture: 64-bit, x86_64

  • K40 GPU Quantity: 8

  • Memory: 394 GB

  • RAM: 172 GB

  • Number of CPU threads: 24

ImageNet training time baseline

Comparison of the time required to train one epoch:

  • One epoch took 37.5 minutes on one RTX 2080 Ti in the workstation I built

  • One epoch took 86.3 minutes on 4 Tesla K40 GPUs in the GCE VM

These values were averaged over 50 epochs. The code is the same as above, with no other processes running on either machine.
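For reference, per-epoch times like these can be collected by wrapping the training loop with a timer and averaging; the snippet below is a generic sketch with a stub in place of the real training step, not the instrumentation used for the numbers above.

```python
import time

def train_one_epoch():
    """Stub standing in for a real epoch (e.g. the PyTorch ImageNet example's train())."""
    time.sleep(0.1)

epoch_times = []
for epoch in range(5):                       # the article averaged over 50 epochs
    start = time.time()
    train_one_epoch()
    epoch_times.append(time.time() - start)

mean_minutes = sum(epoch_times) / len(epoch_times) / 60
print(f"Mean epoch time: {mean_minutes:.2f} minutes")
```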

The cost of training each epoch on GCE

The GCE setup I used was not the most cost-effective one, and the training cost was:

Training one epoch on 4 Tesla K40 GPUs costs $12.77

So training ImageNet for 100 epochs on the Tesla K40 GPUs would cost about $1,277. The virtual machine as a whole costs about $21 per hour.
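The arithmetic behind those numbers, plus a break-even figure I added for illustration (how many hours of the $21/hour VM cost as much as the $6,200 build), looks like this:

```python
# GCE cost arithmetic from the figures above.
cost_per_epoch = 12.77        # USD, one ImageNet epoch on 4 x Tesla K40
epochs = 100
print(f"100-epoch training run: ${cost_per_epoch * epochs:,.0f}")       # ~$1,277

vm_cost_per_hour = 21         # USD, the full GCE VM
workstation_cost = 6200       # USD, the build described in this article
breakeven_hours = workstation_cost / vm_cost_per_hour
print(f"VM hours that cost as much as the workstation: {breakeven_hours:.0f}")  # ~295 hours
```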

Compare that to Lambda’s 4-GPU workstation

The workstation I built was designed to optimize the cost/performance trade-off. If you want a build that more closely matches Lambda's 4-GPU workstation, Lambda CEO Stephen Balaban shared a few suggestions on Reddit:

  • Add an extra blower-style GPU ($1,349)

  • Pay $159 each to upgrade the other 3 GPUs to blower-style models ($477 total)

  • Add a hot-swappable drive tray ($50)

  • Add 1600W PSU ($107)

  • Upgrade CPU from 10 to 12 cores ($189)

  • The original workstation cost $6,200

After these adjustments, the total cost of the whole machine is about $8,372, roughly $4,000 less than the Lambda workstation.

Other

My operating system is Ubuntu Server 18.04 LTS, and I use TensorFlow with CUDA 10.1 (installed from source) and PyTorch. When I ran all three GPUs at full capacity for long periods, I found that the topmost GPU overheated and downclocked, causing a 5%-20% performance drop. This is probably a consequence of the open dual-fan GPU design. If that worries you, a blower-style GPU is recommended to avoid overheating and downclocking.
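If you want to catch this kind of thermal throttling yourself, nvidia-smi can report each GPU's temperature and SM clock while a job runs. A small polling sketch (assumes the standard nvidia-smi tool is installed; the one-minute interval is arbitrary):

```python
import subprocess
import time

# Poll temperature and SM clock for every GPU once a minute. A card that runs
# hotter and whose clock sits well below the others is likely throttling.
while True:
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,temperature.gpu,clocks.sm",
        "--format=csv,noheader",
    ]).decode()
    print(time.strftime("%H:%M:%S"), output.strip())
    time.sleep(60)
```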


The original link: l7.curtisnorthcutt.com/build-pro-d…