The emergence of deep learning (AI) has changed how we program to solve problems: the solution is no longer an intuitive set of rules expressed directly in code.

Take a simple example: how do we recognize a picture of a digit as the number 9? The intuitive rule is "a small circle on top and a vertical line underneath." But what if the digit is tilted a little? What if the top circle doesn't close? What if the vertical line bends? Hand-written rules (an endless switch statement) never converge, so instead we have to recognize anything that "looks like a nine" in a self-learning way, which is exactly how our brain learns. The first time we see the digit we are told it is a nine, so the image gets a label. The next time we see something similar, it is labeled 9 again; after seeing many examples, including some written rather differently, we can recognize a 9 on our own. This is a process learned over a long time by the tens of billions of neurons in our brain.

The real workings of the human brain remain a mystery, but this process inspired neural network algorithms that can learn from existing data. As the saying goes, diligence makes up for a lack of talent: the algorithms are not as capable as the human brain, so they compensate by learning from huge amounts of data. The MNIST dataset has 60,000 handwritten digit images, ImageNet has nearly 15 million images, and the YouTube-8M video dataset is measured in terabytes. In Google's Open Images dataset, the subset used for the Open Images Challenge alone reaches 18TB.

AI rests on three pillars: algorithms, computing power, and data (storage). Algorithms have mature frameworks and are the domain of mathematicians and researchers; computing power is handled by CPUs and, increasingly, GPUs. But with data at this scale, the memory and disk of a single machine can hardly hold it all, and for compute components as powerful as CPUs/GPUs, repeatedly fetching data from remote storage and waiting on I/O is a waste of resources. Is there a solution that keeps the data close to the computation and can still hold a large amount of it? Caches are the silver bullet! This article is based on the FAST '20 paper Quiver: An Informed Storage Cache for Deep Learning.

In the discussion that follows, one concept is central: the mini-batch. It is not easy to grasp without hands-on experience.

The essence of deep learning optimization is gradient descent, and there are two basic ways to decide when the parameters are updated.

In the first, the loss function is computed over the whole data set, then the gradient of the loss with respect to each parameter is computed and the parameters are updated. This is called batch gradient descent (BGD): every parameter update requires traversing all samples in the data set, so it is computationally expensive, slow, and does not support online learning.

In the second, each time a single sample is trained on, the loss is computed and the parameters are updated from its gradient. This is called stochastic gradient descent (SGD). It is fast, but its convergence behavior is poor: it may oscillate around the optimum without ever reaching it, and two successive updates may even cancel each other out, making the objective function fluctuate more violently.

To overcome the shortcomings of both methods, a compromise, mini-batch gradient descent, is now generally used. The data is divided into batches and the parameters are updated once per batch. A group of samples in a batch jointly determines the direction of the gradient, so the descent is less likely to go astray; it reduces the randomness.

It is shown in a schematic diagram as follows:

Blue: batch gradient descent, i.e. mini-batch size = M; purple: stochastic gradient descent, i.e. mini-batch size = 1; green: mini-batch gradient descent, i.e. 1 < mini-batch size < M.
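To make the three variants concrete, here is a minimal sketch in PyTorch (toy data and a toy linear model, purely illustrative): setting batch_size to the full data set size gives batch gradient descent, setting it to 1 gives SGD, and anything in between is mini-batch gradient descent.

```python
import torch

# Toy data: 100 samples with 3 features each (made up for illustration).
X = torch.randn(100, 3)
y = torch.randn(100, 1)

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

batch_size = 10  # 1 -> SGD, 100 (all data) -> batch GD, in between -> mini-batch GD

for epoch in range(5):
    perm = torch.randperm(X.size(0))          # shuffle once per epoch
    for start in range(0, X.size(0), batch_size):
        idx = perm[start:start + batch_size]  # indexes of this mini-batch
        loss = loss_fn(model(X[idx]), y[idx]) # loss on the mini-batch only
        optimizer.zero_grad()
        loss.backward()                       # gradient of the mini-batch loss
        optimizer.step()                      # one parameter update per mini-batch
```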

In each round, the mini-batch size was set to 5, the data set was the digits 1 to 20, and each batch X was obtained through torch.utils.data.DataLoader.
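A minimal sketch of that setup (the class actually lives at torch.utils.data.DataLoader):

```python
from torch.utils.data import DataLoader

dataset = list(range(1, 21))                       # the digits 1 to 20
loader = DataLoader(dataset, batch_size=5, shuffle=True)

for epoch in range(2):
    for step, x in enumerate(loader):
        # Each x is one mini-batch of 5 digits in a random order.
        print(f"epoch {epoch}, batch {step}: {x.tolist()}")
```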

01 The basics of deep learning training

Deep learning training (DLT) takes training data as input and, by learning from numerous clues, produces a model that represents that training data.

For training, DLT draws a small random sample (a mini-batch, usually 32 to 512 items) at a time and uses SGD to gradually learn the parameters and improve accuracy.

Training data: generally we can think of it as a list, where each element is a pair <input, label>. The input may be an image or a piece of speech, while the label represents the semantics of the input, which is exactly what the deep learning network needs to learn in order to classify inputs correctly. ImageNet's entire data set, for example, is about 1.5 million images, each around 200KB.

To access training data randomly, the DLT framework traverses the data via an index sequence. Suppose the training data consists of 1 million files: a list of file indexes is maintained and randomly shuffled, and data is then fetched from back-end storage one mini-batch at a time. When all the data has been traversed once, one epoch is complete. For the next epoch, the indexes are shuffled again and the process repeats. A typical DLT job runs many epochs, for example 50 to 200.
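A sketch of this index-based traversal, with the file list and the fetch function as placeholders:

```python
import random

def run_epochs(file_list, batch_size, num_epochs, fetch):
    """Traverse every file exactly once per epoch, in a fresh random order."""
    indices = list(range(len(file_list)))
    for epoch in range(num_epochs):
        random.shuffle(indices)                     # re-randomize for each epoch
        for start in range(0, len(indices), batch_size):
            batch_idx = indices[start:start + batch_size]
            # fetch() pulls the raw files from back-end storage (placeholder).
            batch = [fetch(file_list[i]) for i in batch_idx]
            yield epoch, batch
```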

Data conversion: raw data fetched from storage is transformed by the DLT framework, for example converting color images to black and white, or decoding images into matrices of pixel values. This is usually done on the CPU.
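For example, with torchvision (one possible choice; the paper does not prescribe a specific library), such a CPU-side conversion pipeline could look like:

```python
from PIL import Image
from torchvision import transforms

# Decode, convert to grayscale, and turn the image into a pixel tensor on the CPU.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # color -> black and white
    transforms.Resize((224, 224)),                # fixed input size for the model
    transforms.ToTensor(),                        # PIL image -> float pixel matrix
])

tensor = preprocess(Image.open("example.jpg"))    # "example.jpg" is a hypothetical file
```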

Multitasking: because DLT is a trial-and-error process, users routinely run several jobs at once with different parameters. All of these jobs access the same complete data set, just in different random orders.

02 I/O characteristics of deep learning

Let's list the main characteristics of DLT tasks from the perspective of I/O access:

Shareability: In DLT training there is a large degree of I/O overlap, both within a single training job and across jobs. Within a job, the same data set is traversed many times (once per epoch), so if the data can be cached during the first epoch, all subsequent epochs become much faster. More importantly, this sharing extends across jobs: the same training data set is often used with different parameters, or the same model, in several different jobs. These jobs may run on different machines, but they all access the same underlying data set.

Random access: The shareability of the data makes DLT very cache-friendly, but only if the entire data set can be cached. Otherwise, DLT's random access pattern easily penetrates a partial cache: if only 20% of the data can be cached, it will quickly be flushed out again by subsequent random accesses.

Partial caching nevertheless matters for DLT, because training data is usually large and keeps growing; even a data set of a few million items such as ImageNet reaches several terabytes in size.

Substitutability: From the I/O perspective, a training epoch cares about two things: (a) each training sample must be accessed exactly once; (b) each mini-batch must be a random sequence. Interestingly, the exact order in which a given set of samples is consumed does not affect the accuracy of the training job, which means I/O requests are substitutable: a DLT job can replace a particular set of requested files with a random set of other files that have not yet been accessed. From a cache perspective, this is a unique opportunity to improve the hit ratio: even if the cache holds only 20% of the data, a lookup for uncached data can be answered with cached content by substitution, without breaking the randomness and uniqueness requirements of training.
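A rough sketch of the substitution idea (illustrative helper, not Quiver's actual implementation): serve cache hits directly, and for each miss substitute some other not-yet-accessed item that happens to be cached, deferring the missed one to a later pass.

```python
def substitute_batch(requested, cache, not_yet_seen, pending):
    """Return a full batch without blocking on cache misses.

    requested:     indexes the DLT framework asked for in this mini-batch
    cache:         dict index -> data for items currently cached
    not_yet_seen:  set of indexes not accessed yet in this epoch
    pending:       set collecting missed indexes to be served later
    """
    batch = []
    for idx in requested:
        if idx in cache:
            batch.append(cache[idx])                 # cache hit: use it directly
            not_yet_seen.discard(idx)
        else:
            pending.add(idx)                         # defer the miss
            # Substitute any other unseen index that happens to be cached.
            substitute = next((j for j in not_yet_seen if j in cache), None)
            if substitute is not None:
                batch.append(cache[substitute])
                not_yet_seen.discard(substitute)
    return batch
```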

Predictability: The run time of each mini-batch can be measured ahead of time, so it can be used to estimate how sensitive a training job is to I/O performance; the cache can then be tuned so that I/O-sensitive jobs benefit from it.

03 Cache design

To sum up the characteristics of deep learning:

  1. A large amount of data is required
  2. Many training jobs run in parallel on multiple machines
  3. Each training job runs many times (epochs)
  4. Every epoch traverses all of the data
  5. The data set stays relatively fixed across different training parameters and across jobs running on different machines

Given these characteristics, several questions naturally arise when designing the cache: cache capacity is limited, so how do we handle misses that penetrate to back-end storage? What is the eviction and replacement policy? How is security guaranteed when different users access different data? And so on.

Quiver is a distributed cache that is deeply integrated with the DLT framework: the cache client is embedded in the training job's I/O path, which gives the cache server much richer information on which to base its policies.

The system structure

In a public cloud VM environment, each GPU VM has a local SSD, and each DLT job runs in its own container, so even when multiple users run DLT jobs, each job runs in an isolated environment.

Each user's data is stored in the cloud storage of their own account, which ensures privacy and access control. With the distributed cache, even if a training job is moved between hosts by the scheduler, the cached data still speeds up training.

Data security

Quiver is designed as a shared distributed cache, so ensuring data security across different jobs and different users is essential. Quiver guarantees that users can only see data they are entitled to access, which at first seems to conflict with cache reuse: if two users each keep a copy of the same data set, say ImageNet, in their own storage accounts, then logically the cache should hold one copy per user, which makes caching far less efficient. Quiver resolves this tension between reuse and isolation through content-addressed caching.

Content-addressed cache

For a cache, the basic behavior is a <key, value> mapping: a query by key quickly returns the corresponding value. In Quiver, instead of using file name and offset as the cache key, the key is a hash of the cached content. The granularity of a cache entry is determined by the specific DLT task, for example a single image, and the entry is located by the hash of its content (for example SHA-1) for both insertion and lookup. The advantage of content hashing is that a file with the same content needs to be kept in the cache only once, no matter where it comes from or what it is named, and that single copy can be shared between different users.

To preserve data isolation, Quiver accesses training data through a digest index. For each piece of data the digest index holds an entry <content_hash: file_location>. When multiple users own a data set with identical content, each user's file_location is different because the data sits in their own storage account, but the content_hash values are all the same.
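A minimal sketch of the two mechanisms together (illustrative names and storage paths): SHA-1 of the content as the cache key, plus a per-user digest index mapping content_hash to file_location.

```python
import hashlib

def content_key(data: bytes) -> str:
    # Identical content always yields the same key, regardless of file name or owner.
    return hashlib.sha1(data).hexdigest()

image_bytes = b"raw JPEG bytes"  # placeholder content

# Per-user digest index: content_hash -> file_location in that user's own storage.
digest_index_user_a = {content_key(image_bytes): "user_a_storage/imagenet/img_001.jpg"}
digest_index_user_b = {content_key(image_bytes): "user_b_storage/photos/cat.jpg"}

# Both users end up with the same content_hash but different file_location values,
# so the cache stores a single copy while each user resolves only their own entries.
```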

Cache server

The cache servers use local SSDs as the key-value storage medium, and the key space is distributed across multiple cache servers via consistent hashing.
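A simplified sketch of such key-space distribution (a bare-bones hash ring; a production system would typically add virtual nodes and replication):

```python
import bisect
import hashlib

class HashRing:
    """Map each content hash to one of the cache servers via a consistent hash ring."""

    def __init__(self, servers):
        self._ring = sorted((self._hash(s), s) for s in servers)
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def server_for(self, content_hash: str) -> str:
        # Walk clockwise on the ring to the first server at or after the key's position.
        pos = bisect.bisect(self._points, self._hash(content_hash)) % len(self._ring)
        return self._ring[pos][1]

ring = HashRing(["cache-0", "cache-1", "cache-2"])
print(ring.server_for("9a0364b9e99bb480dd25e1f0284c8555"))  # illustrative content hash
```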

Cache Manager

Since Quiver is a distributed cache, a coordinator, the Cache Manager, is needed to coordinate insertion and eviction across all cache servers.

The Cache Manager also evaluates how much each training job benefits from the cache: it asks the cache servers to deliberately miss a number of mini-batches for a given job, compares those mini-batch times with the times of mini-batches served from the cache, and uses the difference to prioritize caching for the most I/O-sensitive jobs.

Cache client

The cache client runs as part of the training job and accesses training data by hooking into the interface layer of DLT frameworks such as PyTorch. In PyTorch, a Dataset is used to iterate over all training data; internally it maintains a randomized list of file indexes, and the next() interface returns the next mini-batch. Quiver tweaks this interface so that, when the upper layer requests a set of files, a digest file is used to access the cache first.
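A sketch of what such a cache-aware dataset wrapper could look like in PyTorch; the cache client and remote store APIs here are assumptions for illustration, not Quiver's actual interfaces.

```python
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wrap an existing data set and consult the cache before the remote store."""

    def __init__(self, digest_index, cache_client, remote_store, transform=None):
        self.digest_index = digest_index      # list of (content_hash, file_location)
        self.cache = cache_client             # distributed cache client (assumed API)
        self.remote = remote_store            # user's own cloud storage (assumed API)
        self.transform = transform

    def __len__(self):
        return len(self.digest_index)

    def __getitem__(self, idx):
        content_hash, location = self.digest_index[idx]
        data = self.cache.get(content_hash)   # try the cache first
        if data is None:
            data = self.remote.read(location) # fall back to remote I/O
            self.cache.put(content_hash, data)
        return self.transform(data) if self.transform else data
```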

The client also sends information about the training job to the Cache Manager, such as the time taken by each mini-batch, which the Cache Manager can use to optimize its caching strategy.

Substitution hit ratio

In a conventional cache, if a mini-batch contains 512 files, the Dataset supplies 512 file indexes and the contents are fetched from back-end storage; if only some of them hit the cache, remote I/O is still needed. In Quiver, the client looks ahead over more candidate indexes (for example, 10 times the mini-batch amount); if 512 of them hit the cache, they are returned to the training job, so the job is never blocked by cache misses. Meanwhile, Quiver marks the missed indexes as pending and continues until the index list has been fully traversed, at which point a new pass starts over only the pending data.
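Roughly, the lookahead behaves as in the sketch below (hypothetical helper; the batch size and lookahead factor are the example numbers from the text):

```python
def next_batch(index_iter, cache, pending, batch_size=512, lookahead=10):
    """Scan ahead over candidate indexes and fill the batch from cache hits only."""
    hits, scanned = [], 0
    for idx in index_iter:
        scanned += 1
        if idx in cache:
            hits.append(cache[idx])   # enough hits and the trainer never blocks on I/O
            if len(hits) == batch_size:
                break
        else:
            pending.append(idx)       # revisit only these in the next pass
        if scanned >= lookahead * batch_size:
            break                     # stop after scanning 10x the mini-batch amount
    return hits, pending
```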

Assume that only 10% of the data is currently cached; for simplicity, think of it as a contiguous 10% of the original data. Since the DLT job looks up data in random order, a lookup sequence of length k is expected to contain about k/10 cached items, so scanning a sequence of length 10k yields roughly the k hits needed to fill a mini-batch. When the next pass over the pending data begins, another 10% of the data may already be in the cache, giving roughly a 1/9 hit rate over the remaining items. Notably, this also holds for multi-job training: although each job accesses the training data in its own random order, collectively the jobs can run with effectively full cache hits.

Training accuracy

Given the I/O substitutability described above, it is reasonable to question the accuracy of the final training results. The data from the original paper is borrowed here to illustrate this point.

04 Management of the cache

As described earlier, when only part of the data is cached, Quiver makes additional passes over the file index within a single epoch. To obtain a good hit ratio on those later passes, another portion of the data must be prefetched into the cache.

Quiver solves this by keeping two chunks of the whole data set in the cache at a time. First, the digest index of the data set is divided into fixed-size chunks in a striped fashion: for example, each contiguous 10% of the data set is defined as a partition, each partition is divided into 10 stripe units, and each chunk consists of one stripe unit from every partition. While training jobs are operating on the first chunk, the second chunk is loaded into the cache; by the time some jobs finish their first pass and start the next one, that data is already cached, and the jobs proceed in this pipelined manner.
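A sketch of the striped chunking described above (illustrative sizes; leftover items from uneven division are ignored for simplicity):

```python
def striped_chunks(digest_index, num_partitions=10, units_per_partition=10):
    """Split the digest index so each chunk holds one stripe unit from every partition."""
    n = len(digest_index)
    part_size = n // num_partitions
    unit_size = part_size // units_per_partition
    chunks = [[] for _ in range(units_per_partition)]
    for p in range(num_partitions):
        base = p * part_size
        for u in range(units_per_partition):
            start = base + u * unit_size
            chunks[u].extend(digest_index[start:start + unit_size])  # unit u of partition p
    return chunks   # chunk u contains one unit of every partition
```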

A potential problem is when to evict the first chunk. Evict too early and jobs that have not finished with it lose their cached data; keep it too long and the third chunk cannot be loaded. In Quiver, once the second chunk has been loaded into the cache, the first chunk is marked as evictable: new lookups are served from the second chunk, while jobs still working on the first chunk keep running against it. Only when all jobs have finished traversing the first chunk is its data actually removed from the cache, and loading of the third chunk begins.

In this process, if one training job is much slower than the others, it will prevent the previous chunk from being released. Generally, multiple jobs training the same model take roughly the same time per pass, but the scenario of different models being trained on the same data set cannot be avoided. However, if a job is clearly much slower, its per-mini-batch time on the GPU is long, meaning it is not sensitive to I/O performance, so cache misses will not hurt its training efficiency much. Quiver therefore sets a threshold after which the first chunk is forcibly invalidated and evicted.

05 Effect of the cache

The author compares the effect under the following test configurations; the measured data shows that training performance is indeed greatly improved.

Timeline of Mini-batches

Throughput increase

06 Conclusion

In deep learning scenarios, most attention goes to improving compute and network performance, while storage relies on existing solutions, such as manually staging data onto SSDs close to the GPUs in advance. With Quiver, the authors provide an automated way to eliminate the storage bottleneck. Some intrusion into the training framework is unavoidable, but because Quiver understands the nature of training I/O, it can significantly improve cache utilization even when the cache holds only part of the data.