When training a TensorFlow model, achieving optimal training performance requires an efficient input pipeline, one that can prepare the data for the next training step before the current step finishes. The tf.data API helps us build such a flexible and efficient input pipeline. It provides a series of transformation operations that make it easy to parallelize the input data, and these transformations are described in detail in this article.

The unoptimized approach

A training process usually includes the following steps (a minimal sketch of this loop follows the list):

  1. Open a handle to the input data file.

  2. Fetch a batch of data from the file.

  3. Use the data to train the model.

  4. Repeat steps 2-3 until training is complete.
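
As a rough illustration, the unoptimized pipeline corresponds to a plain synchronous loop like the sketch below. The file pattern and the train_step placeholder are hypothetical, not part of the original implementation:

import tensorflow as tf

# Hypothetical stand-in for one training step; a real implementation would
# compute gradients and update the model's weights.
def train_step(batch):
    return tf.strings.length(batch)

def train_unoptimized(file_pattern, num_steps, batch_size):
    # Synchronous pipeline: each iteration first fetches a batch (the model
    # sits idle), then trains on it (the input pipeline sits idle).
    dataset = tf.data.TextLineDataset(tf.io.gfile.glob(file_pattern)).batch(batch_size)
    it = iter(dataset)
    for _ in range(num_steps):
        batch = next(it)   # step 2: fetch a batch of data from the file
        train_step(batch)  # step 3: use the data to train the model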

When the data input process is not optimized, the time cost of each part of the training process is shown in the figure below:

As the figure shows, in this synchronous implementation the model sits idle while data is being read from the file, and the input pipeline sits idle while the model is training. The total training time is the sum of the time spent in each part, which seriously hurts training efficiency, so the data input process needs to be optimized.

Data prefetching

Data prefetching means that while the model is executing training step s, the input pipeline is simultaneously reading from the file the data needed for step s+1. Compared with the unoptimized approach, prefetching reduces the combined time cost of steps 2 and 3 from their sum to the maximum of the two.

The tf.data API provides the prefetch transformation for this purpose, which decouples the time at which data is produced from the time at which it is consumed. The transformation uses a background thread and an internal buffer to prefetch elements from the input dataset before they are requested, although there is no guarantee that the prefetch has completed by the time an element is needed.

The number of prefetched elements should be equal to (or greater than) the number of batches consumed by a single training step. This parameter is adjustable: you can specify it manually, or set it to tf.data.experimental.AUTOTUNE, in which case the tf.data runtime tunes the number of prefetched elements dynamically.
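
A minimal sketch of applying the prefetch transformation; the range dataset stands in for whatever input pipeline precedes it:

import tensorflow as tf

# Hypothetical upstream pipeline; any tf.data.Dataset works here.
dataset = tf.data.Dataset.range(100000).batch(32)

# Overlap data production with consumption; AUTOTUNE lets the tf.data
# runtime pick the buffer size dynamically.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)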

The time cost of the training process with the prefetch transformation applied is shown in the figure below:

You can see that the data read and training times overlap, reducing the overall time overhead.

Parallel data extraction

In a real training environment, the input data might be stored in a remote file system such as HDFS. Because of differences between local and remote storage, an input pipeline that works well locally may become an I/O bottleneck when reading data remotely, for the following reasons:

  1. Time to first byte: reading the first byte of a file from remote storage can take orders of magnitude longer than reading it from local storage.

  2. Read throughput: while remote storage typically provides large aggregate bandwidth, reading a single file sequentially may utilize only a small fraction of this bandwidth, so throughput is low.

In addition, once the raw data has been loaded into memory, it may be necessary to deserialize and/or decrypt it (for example, data in protobuf format), which requires additional computation. Granted, this overhead exists whether the data is stored locally or remotely, but if the data is not prefetched effectively, reading it remotely can make the overhead even greater.

To mitigate the cost of data extraction, the interleave transformation can be used to parallelize the data loading step, interleaving the contents of multiple dataset files (for example, several TextLineDataset readers). The number of files read concurrently is controlled by the cycle_length parameter: elements are interleaved across cycle_length files. The number of consecutive samples taken from each file is controlled by block_length: block_length elements are read from one file before moving on to the next. The degree of parallelism of the file reads is controlled by num_parallel_calls, which, like the other transformations, also accepts tf.data.experimental.AUTOTUNE, delegating the choice of parallelism level to the tf.data runtime.
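
A minimal sketch of interleaving several text files in parallel; the glob pattern is a hypothetical placeholder:

import tensorflow as tf

# Hypothetical shard pattern for the input files.
filenames = tf.data.Dataset.list_files("/path/to/data/part-*")

dataset = filenames.interleave(
    tf.data.TextLineDataset,    # open each matched file as a line dataset
    cycle_length=4,             # interleave across 4 files at a time
    block_length=16,            # take 16 consecutive lines from a file before moving on
    num_parallel_calls=tf.data.experimental.AUTOTUNE,  # parallelize the reads
)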

With its default arguments, the interleave transformation reads single samples from multiple data files in turn, but sequentially; its time cost is shown in the figure below:

You can see that interleave fetches samples alternately from the two dataset files, but performance does not improve because the reads are still sequential. After setting num_parallel_calls, the time overhead is shown below:

Because multiple dataset files can now be loaded in parallel, the time spent waiting to open files sequentially is reduced, which in turn reduces the overall training time.

Parallel data transformation

When preparing input data, it is often necessary to preprocess the raw input. For this purpose tf.data provides the map transformation, which applies a user-defined function to each element of the input dataset. Because the input elements are independent of one another, this preprocessing can be distributed across multiple CPU cores.

Like the prefetch and interleave transformations, the map transformation provides a num_parallel_calls parameter to specify the degree of parallelism. You can set this value yourself, or set it to tf.data.experimental.AUTOTUNE to delegate the decision to the tf.data runtime.
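
A minimal sketch of a parallelized map; the parse_line function and the input file are hypothetical examples:

import tensorflow as tf

# Hypothetical per-element preprocessing: split a tab-separated line and
# convert the pieces to floats.
def parse_line(line):
    return tf.strings.to_number(tf.strings.split(line, "\t"), out_type=tf.float32)

dataset = tf.data.TextLineDataset(["/path/to/data.txt"])  # hypothetical file
dataset = dataset.map(
    parse_line,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,  # spread calls across CPU cores
)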

For an unparallelized map transformation, the time cost is shown in the figure below:

The overall time cost of the training process is the sum of the preprocessing time and the time of the other parts. After parallelizing the map transformation, the time cost is shown in the figure below:

You can see that the time cost of the map calls now overlaps, which reduces the overall training time.

Data cache

The cache transformation can cache the processed dataset in memory or in local storage, avoiding repeating the same operations (such as opening files and reading data) in every epoch. The time cost of a basic cache transformation is shown below:

It can be seen that in the second epoch, because the dataset is cached, the time cost of opening files, reading data, and preprocessing is saved. This is because all operations on the dataset before the cache transformation are executed only during the first epoch; subsequent epochs directly reuse the data cached by the cache transformation.

If the user-defined function used in the map transformation is expensive, apply the cache transformation after the map transformation, as long as the preprocessed dataset still fits into memory or local storage, so that each subsequent epoch avoids the map cost. If the user-defined function increases the space required to store the dataset beyond the cache capacity, either apply the map transformation after the cache transformation, or consider preprocessing the data before training to reduce resource usage.
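
A minimal sketch of caching a preprocessed dataset; the parsing lambda and the input file are hypothetical:

import tensorflow as tf

dataset = tf.data.TextLineDataset(["/path/to/data.txt"])  # hypothetical file

# Expensive per-element preprocessing (hypothetical), with cache applied
# after map so the work is done only during the first epoch.
dataset = dataset.map(
    lambda line: tf.strings.to_number(tf.strings.split(line, "\t")),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
).cache()  # caches in memory; pass a filename to cache on local storage instead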

Vectorizing the map transformation

Calling a user-defined function in a map transformation incurs per-call overhead, so it is best to vectorize the user-defined function (that is, have it process a batch of inputs at once) and apply the batch transformation before the map transformation, so that map is applied to each batch of data rather than to individual samples.
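
A minimal sketch contrasting the two orderings; the increment function stands in for an arbitrary vectorizable user-defined function:

import tensorflow as tf

def increment(x):
    return x + 1  # works on a single element or on a whole batch

dataset = tf.data.Dataset.range(10000)

# Scalar mapping: the user-defined function is invoked once per element.
slow = dataset.map(increment).batch(256)

# Vectorized mapping: batch first, so the function is invoked once per batch.
fast = dataset.batch(256).map(increment)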

For a dataset that applies the batch transformation after the map transformation, the time cost is shown in the figure below:

You can see that the map function is applied to every individual sample; although each call executes quickly, the sheer number of calls hurts the overall time performance. The time cost when the batch transformation is applied before the map transformation is shown in the figure below:

You can see that the map function is now invoked once per batch of samples. Although each invocation takes longer, the overhead is incurred far less often, improving the overall time performance.

Conclusion

Best practices for TensorFlow data input include the following:

  1. Use the prefetch transformation to overlap the time overhead of data production and data consumption.

  2. Use the interleave transformation to parallelize reading the dataset files.

  3. Parallelize the map transformation by setting its num_parallel_calls parameter.

  4. Use the cache transformation to cache data in memory or local storage during the first epoch.

  5. Vectorize the user-defined functions passed to the map transformation.

Code implementation

Based on the best practices above, this section uses the tf.data API to build the TensorFlow input pipeline. The implementation is as follows:

import tensorflow as tf


def make_dataset(input_pattern, shuffle_size, batch_size):
    # Parsing function used by map; note that it is vectorized and operates
    # on a whole batch of lines at a time.
    def labeler(record):
        # Each line holds 32 tab-separated fields: the label followed by 31 features.
        fields = tf.io.decode_csv(
            record,
            record_defaults=['0'] * 32,
            field_delim='\t',
        )
        data = tf.strings.to_number(fields[1:32], out_type=tf.int32)
        label = tf.strings.to_number(fields[:1], out_type=tf.int32)

        # decode_csv yields one tensor per field, so transpose to get
        # shapes [batch_size, 31] and [batch_size, 1].
        data = tf.transpose(data)
        label = tf.transpose(label)

        return data, label

    # Interleave reads across the matched files in parallel.
    filenames = tf.data.Dataset.list_files(input_pattern)
    dataset = filenames.interleave(
        lambda filename: tf.data.TextLineDataset(filename),
        cycle_length=tf.data.experimental.AUTOTUNE,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    # Shuffle, then batch so that the vectorized map below sees whole batches of lines.
    dataset = dataset.repeat().shuffle(shuffle_size).batch(batch_size)
    dataset = dataset.map(
        lambda ex: labeler(ex),
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    ).cache()

    # Overlap data production with model consumption.
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

    return dataset
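
A brief usage sketch; the file pattern and hyperparameter values below are hypothetical:

dataset = make_dataset("/path/to/train-*.tsv", shuffle_size=10000, batch_size=256)

for data, label in dataset.take(1):
    print(data.shape, label.shape)  # e.g. (256, 31) and (256, 1)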

Notes

  1. When using tf.data's various transformation operations, pay attention to the order in which they are applied; generally speaking, an order that keeps the memory footprint lower is preferable.

  2. If the result of the map transformation is too large to fit into memory, another trade-off is to split the user-defined preprocessing function into two parts (if it can be split): a time-consuming part and a memory-consuming part. Apply the time-consuming part before the cache transformation so that its results are cached, and apply the memory-consuming part after it so that its results are not cached, as sketched below.
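
A minimal sketch of this split; expensive_parse and memory_heavy_augment are hypothetical placeholders for the two halves of the preprocessing function:

import tensorflow as tf

def expensive_parse(line):
    # Hypothetical time-consuming part: parse the raw line into numbers.
    return tf.strings.to_number(tf.strings.split(line, "\t"))

def memory_heavy_augment(x):
    # Hypothetical memory-consuming part: expand each sample (here by tiling),
    # which would blow up the cache if it were applied before it.
    return tf.tile(tf.expand_dims(x, 0), [8, 1])

dataset = tf.data.TextLineDataset(["/path/to/data.txt"])  # hypothetical file
dataset = (
    dataset
    .map(expensive_parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .cache()  # cache only the results of the time-consuming part
    .map(memory_heavy_augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
)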
