Writing Efficient Input Pipelines Using TensorFlow’s Data API
Author(s): Jonathan Quijas
A Brief Primer on Dataset Processing for Machine Learning Engineering at Scale
This post is heavily inspired by an exercise in chapter 12 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. I highly recommend this book.
Working with very large datasets is a common scenario in Machine Learning Engineering. Usually, these datasets will be too large to fit in memory. This means the data must be retrieved from disk on the fly during training.
Disk access is multiple orders of magnitude slower than memory access, so efficient retrieval is of high priority.
In this post, we will learn how to create a simple yet powerful input pipeline to efficiently load and preprocess a dataset using the tf.data API.
Input Pipelines using the TensorFlow Data API
Disclaimer: TensorFlow provides excellent documentation of its modules and APIs. For more details, refer to the official site.
The main actor in this tutorial is the tf.data.Dataset class. The objective is to:
- Create a tf.data.Dataset from data files using the tf.data.Dataset.list_files() static method
- Apply a sequence of functions on the tf.data.Dataset instance
Creating the Dataset
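The tf.data.Dataset.list_files() static method returns a tf.data.Dataset of file paths that match one or more glob patterns (shell-style wildcards such as *, not regular expressions). By default, it returns the matching paths in shuffled order.
We will assume our dataset is a collection of separate .csv files stored in a single directory, ./datadir, with names matching file_*.csv (for example, file_0.csv, file_1.csv, and so on).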
To create a tf.data.Dataset of file paths:
import tensorflow as tf
filepath_dataset = tf.data.Dataset.list_files("./datadir/file_*.csv")
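As a quick check of what list_files() returns, the sketch below builds a small throwaway directory and lists it. The temporary directory, the file names (file_0.csv and so on), and the three-column contents are assumptions made up for illustration; they are not the dataset used in this post.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for ./datadir: three tiny CSV files in a temp directory.
datadir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(datadir, f"file_{i}.csv"), "w") as f:
        f.write("x1,x2,label\n")  # header row
        f.write(f"{i}.0,{i + 0.5},{i % 2}\n")

# list_files() takes a glob pattern. It shuffles the paths by default;
# shuffle=False turns that shuffling off.
filepath_dataset = tf.data.Dataset.list_files(
    os.path.join(datadir, "file_*.csv"), shuffle=False
)

# Each element is a scalar string tensor holding one file path.
file_names = [os.path.basename(p.numpy().decode()) for p in filepath_dataset]
print(file_names)
```

Passing shuffle=False is handy for validation and test data, where a reproducible file order is usually preferable.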
The file paths are a good first step, but we need the actual data found inside the files for learning. We will chain a sequence of methods, beginning from file paths, and ending on actual data to use during training.
Transforming the data
Some of the most common and useful tf.data.Dataset methods are: interleave(), shuffle(), batch(), and prefetch(). We will briefly cover each of these in this article.
During model training, it helps to shuffle instances so the model does not learn spurious patterns from the ordering of the training data (for example, a file that is sorted alphabetically). The interleave() method provides a simple way to get coarse-grained shuffling of a dataset split across separate smaller files: it reads from several files at a time and alternates between them, so rows from different files end up mixed together.
Coarse-grained in this context means at the file level. Finer-grained shuffling happens at the .csv row level, which we will see shortly.
By using a lambda function, we generate a tf.data.experimental.CsvDataset from each individual file path.
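A minimal sketch of this step follows. The temporary files, the three-column layout (two float features and an integer label), the header row, and the cycle_length value are all assumptions chosen for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical data: 3 files, each with a header and 2 rows of (x1, x2, label).
datadir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(datadir, f"file_{i}.csv"), "w") as f:
        f.write("x1,x2,label\n")
        f.write(f"{i}.0,0.5,0\n")
        f.write(f"{i}.0,1.5,1\n")

filepath_dataset = tf.data.Dataset.list_files(os.path.join(datadir, "file_*.csv"))

# One entry per column; a bare dtype marks the column as required.
record_defaults = [tf.float32, tf.float32, tf.int32]

csv_dataset = filepath_dataset.interleave(
    # Build one CsvDataset per file path, skipping the header row.
    lambda filepath: tf.data.experimental.CsvDataset(
        filepath, record_defaults=record_defaults, header=True
    ),
    cycle_length=2,  # read from 2 files at a time (illustrative value)
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Each element is a tuple of scalar tensors, one per column.
rows = [tuple(col.numpy() for col in row) for row in csv_dataset]
print(len(rows))  # 3 files x 2 data rows each
```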
The variable csv_dataset now refers to a single tf.data.Dataset whose elements are rows. The interleave() method builds one CsvDataset per file path and combines (or interleaves) the rows of these individual datasets into that single dataset; its cycle_length argument controls how many files are read at a time.
Setting num_parallel_calls to tf.data.experimental.AUTOTUNE (newer TensorFlow releases also expose it as tf.data.AUTOTUNE) tells TensorFlow to choose the level of parallelism for reading the .csv files dynamically at runtime. This increases pipeline efficiency, as long as the machine has a multicore processor with multithreading support (most machines do today).
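To see what interleaving does to element order, independent of CSV parsing, here is a small sketch with in-memory datasets standing in for files. The source datasets and the cycle_length and block_length values are illustrative assumptions.

```python
import tensorflow as tf

# Three in-memory "files": source i yields the value i twice.
sources = tf.data.Dataset.range(3)

interleaved = sources.interleave(
    lambda i: tf.data.Dataset.from_tensors(i).repeat(2),
    cycle_length=2,  # pull from 2 sources at a time
    block_length=1,  # take 1 element from a source before moving to the next
)

order = [int(x.numpy()) for x in interleaved]
print(order)
```

The first two sources alternate until they are exhausted, and only then does the third source enter the cycle. A larger cycle_length mixes more files together at once.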
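As previously mentioned, shuffling the data prevents the model from learning spurious patterns in the training data. It also improves the convergence of gradient-based methods, such as training a neural network with gradient descent. To shuffle the data, call the shuffle() method on the dataset with a buffer size, here named shuffle_buffer_size.
The shuffle_buffer_size value sets the number of elements held in the shuffle buffer: the dataset fills this buffer from the original (unshuffled) stream and draws each output element at random from it. A buffer at least as large as the dataset gives a full uniform shuffle, while a smaller buffer shuffles only locally. This value should be set according to the size of the dataset and the amount of memory available. Refer to the shuffle() documentation for more details.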
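The effect of the buffer size can be seen on a toy dataset. This is a sketch with made-up values: the ten-element range dataset, the seed, and the buffer sizes are assumptions for illustration only.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# buffer_size=1: each element is drawn from a buffer holding only itself,
# so the order does not change.
no_shuffle = [int(x.numpy()) for x in dataset.shuffle(buffer_size=1)]

# buffer_size >= dataset size: every element is in the buffer before the
# first draw, which gives a full uniform shuffle.
shuffle_buffer_size = 10
full_shuffle = [
    int(x.numpy()) for x in dataset.shuffle(shuffle_buffer_size, seed=42)
]

print(no_shuffle)    # same order as the input
print(full_shuffle)  # a permutation of 0..9
```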
NOTE: Shuffling is not necessary during validation or testing. In fact, some scenarios require predictions on the test set to be in the same order as the provided test data (for example, Kaggle competitions). Keep this in mind and disable shuffling if you need to!
Another common preprocessing step is batching the data with the batch() method: the model processes small chunks at a time instead of the entire dataset at once. This makes it possible to train on very large datasets, since they never need to be held in memory all at once.
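As a small illustration of batch(), the sketch below groups a toy dataset into batches. The ten-element dataset and the batch size of 4 are assumptions picked to make the output easy to read.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Group consecutive elements into batches of 4; the last batch is smaller.
batches = [b.numpy().tolist() for b in dataset.batch(4)]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# drop_remainder=True discards the final partial batch, so every batch
# has the same size.
full_batches = [
    b.numpy().tolist() for b in dataset.batch(4, drop_remainder=True)
]
print(full_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```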
Efficient deep learning pipelines use the GPU for model training, while data fetching and preprocessing occur on a separate device, typically the CPU.
If the GPU has to wait for the CPU to finish loading and preprocessing, it sits idle and wastes valuable cycles. Prefetching helps prevent this: while the GPU trains on one batch, the CPU prepares the next. This increases efficiency by keeping the GPU busy.
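Enabling prefetching is as simple as calling the prefetch() method, usually as the last step of the pipeline, with the number of elements (here, batches) to prepare ahead of time.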
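A minimal sketch of the call, using a toy dataset; the batch size and the prefetch depth of 1 are illustrative assumptions.

```python
import tensorflow as tf

batch_size = 4  # illustrative value

dataset = tf.data.Dataset.range(10).batch(batch_size)

# Keep one batch ready while the previous one is being consumed.
# Passing tf.data.experimental.AUTOTUNE instead lets tf.data tune the
# prefetch buffer size at runtime.
dataset = dataset.prefetch(1)

# Prefetching changes when elements are produced, not what they contain.
batches = [b.numpy().tolist() for b in dataset]
print(batches)
```

Because prefetch() comes after batch(), the unit being prefetched is a whole batch rather than a single row.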
Putting it All Together
Combining all the methods into a single function gives the final pipeline: list the files, interleave their rows, shuffle, batch, and prefetch.
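One possible version of such a function is sketched here. The function name csv_reader_dataset, its parameters and default values, the training flag, and the temporary sample files are all hypothetical choices made for this illustration; adapt the column types and sizes to your own data.

```python
import os
import tempfile

import tensorflow as tf


def csv_reader_dataset(file_pattern, record_defaults, n_readers=4,
                       shuffle_buffer_size=1000, batch_size=32, training=True):
    """Build a batched, prefetched dataset of CSV rows (hypothetical helper)."""
    # Shuffle the file order only when training.
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=training)
    dataset = dataset.interleave(
        lambda filepath: tf.data.experimental.CsvDataset(
            filepath, record_defaults=record_defaults, header=True
        ),
        cycle_length=n_readers,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    if training:
        # Row-level shuffling; skipped for validation and test data.
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)


# Hypothetical data so the sketch runs end to end: 3 files x 4 rows each.
datadir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(datadir, f"file_{i}.csv"), "w") as f:
        f.write("x1,x2,label\n")
        for j in range(4):
            f.write(f"{i}.0,{j}.0,{j % 2}\n")

train_set = csv_reader_dataset(
    os.path.join(datadir, "file_*.csv"),
    record_defaults=[tf.float32, tf.float32, tf.int32],
    shuffle_buffer_size=12,
    batch_size=5,
)

# Each element is a tuple of batched columns.
for x1, x2, label in train_set:
    print(x1.shape, x2.shape, label.shape)
```

Calling the same function with training=False yields the rows without any shuffling, which suits validation and test sets.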
In this article, we learned to create a simple yet efficient input pipeline using the tf.data API. We covered how to:
- Handle very large datasets broken into multiple separate files using the tf.data.Dataset.list_files() static method
- Enable parallelism by leveraging multiple threads of execution
- Improve convergence using coarse-grained data shuffling with the interleave() dataset method
- Improve convergence using fine-grained data shuffling with the shuffle() method
- Maximize GPU utilization using the prefetch() method
- Combine all these concepts into a single function that yields a dataset ready for model consumption
References
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
- Official TensorFlow documentation
Writing Efficient Input Pipelines Using TensorFlow’s Data API was originally published in Towards AI on Medium.