Writing Efficient Input Pipelines Using TensorFlow’s Data API

Last Updated on January 7, 2023 by Editorial Team

A Brief Primer on Dataset Processing for Machine Learning Engineering at Scale

This post is heavily inspired by an exercise in Chapter 12 of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. I highly recommend this book.

Problem Statement

Working with very large datasets is a common scenario in Machine Learning Engineering. Usually, these datasets will be too large to fit in memory. This means the data must be retrieved from disk on the fly during training.

Disk access is multiple orders of magnitude slower than memory access, so efficient retrieval is a high priority.

Fortunately, TensorFlow’s tf.data API provides a simple and intuitive interface to load, preprocess, and even prefetch data.

In this post, we will learn how to create a simple yet powerful input pipeline to efficiently load and preprocess a dataset using the tf.data API.

Input Pipelines using the TensorFlow Data API

Disclaimer: TensorFlow provides excellent documentation of its modules and APIs. For more details, refer to the official site.

The main actor in this tutorial is the tf.data.Dataset class. The objective is to:

Create a tf.data.Dataset from data files using the tf.data.Dataset.list_files() static method
Apply a sequence of functions on the tf.data.Dataset instance

Creating the Dataset

The tf.data.Dataset.list_files() method returns a tf.data.Dataset of file paths matching one or more glob patterns (shell-style wildcards such as *, not regular expressions). By default, it shuffles the order of the matched paths.

We will assume our dataset is a collection of separate files stored as follows:

datadir/
    file_001.csv
    file_002.csv
    ...
    file_n.csv

To create a tf.data.Dataset of file paths:

import tensorflow as tf
filepath_dataset = tf.data.Dataset.list_files('./datadir/file_*.csv')

The file paths are a good first step, but we need the actual data inside the files for learning. We will chain a sequence of methods, beginning with file paths and ending with the actual data used during training.
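To see what the file path dataset contains, here is a minimal sketch. It assumes a temporary directory with three toy CSV files standing in for ./datadir/; the file contents and the shuffle=False argument are illustrative choices, not part of the original setup.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy directory standing in for ./datadir/.
data_dir = tempfile.mkdtemp()
for i in range(1, 4):
    with open(os.path.join(data_dir, f"file_{i:03d}.csv"), "w") as f:
        f.write("x1,x2,y\n1.0,2.0,3.0\n")

# shuffle=False keeps the listing order deterministic for inspection;
# the default (shuffle=True) shuffles the order of the file paths.
filepath_dataset = tf.data.Dataset.list_files(
    os.path.join(data_dir, "file_*.csv"), shuffle=False
)

# Each element is a scalar string tensor holding one file path.
file_names = [
    os.path.basename(path.numpy().decode("utf-8")) for path in filepath_dataset
]
print(file_names)
```

Each element of the dataset is a path, not file contents, which is why further transformations are needed.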

Transforming the data

Some of the most common and useful tf.data.Dataset methods are: interleave(), shuffle(), batch(), and prefetch(). We will briefly cover each of these in this article.

Interleave

When training a model, it can be beneficial to shuffle instances to avoid learning spurious patterns caused by the ordering of the training data (for example, a file being alphabetically sorted). The interleave() method provides a simple way to enable coarse-grained shuffling of a dataset made up of separate smaller files: it reads from several files at a time and alternates between them.

Coarse-grained in this context means at the file level. Finer-grained shuffling happens at the .csv row level, which we will see shortly.

By using a lambda function, we generate a tf.data.experimental.CsvDataset from each individual file path.
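A minimal sketch of this step follows. The toy files, the three-float column layout, the header row, and cycle_length=3 are all illustrative assumptions; adapt record_defaults to the real columns of your files.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy data: three small CSV files, each with a header and two rows.
data_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(data_dir, f"file_{i:03d}.csv"), "w") as f:
        f.write("x1,x2,y\n")
        f.write(f"{i}.0,{i}.5,{i}\n")
        f.write(f"{i}.1,{i}.6,{i}\n")

filepath_dataset = tf.data.Dataset.list_files(os.path.join(data_dir, "file_*.csv"))

# One entry per column; a bare dtype marks the column as required (assumed: 3 floats).
record_defaults = [tf.float32, tf.float32, tf.float32]

# Map each file path to a CsvDataset and interleave the rows of up to
# cycle_length files at a time.
csv_dataset = filepath_dataset.interleave(
    lambda path: tf.data.experimental.CsvDataset(path, record_defaults, header=True),
    cycle_length=3,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

# Each element is a tuple of three scalar tensors (one per column).
rows = [tuple(float(value) for value in row) for row in csv_dataset]
print(len(rows))
```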

The lambda maps each file path to its own CsvDataset. The interleave() method then combines (interleaves) the rows of these individual datasets into a single tf.data.Dataset, referred to here as csv_dataset.

Setting num_parallel_calls to tf.data.experimental.AUTOTUNE tells TensorFlow to choose the level of parallelism for reading the .csv files dynamically, based on the available CPU. This increases pipeline efficiency as long as the machine has a multicore processor with multithreading support (most machines do today).

Shuffle

As previously mentioned, shuffling the data prevents the model from learning spurious patterns in the training data. It also improves the convergence of gradient-based methods, such as training a neural network with Gradient Descent. To shuffle the data, use the shuffle() method.

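The shuffle_buffer_size value specifies the size of the buffer that shuffle() fills with elements from the original (unshuffled) dataset and then samples from at random. A buffer at least as large as the dataset gives a uniform shuffle; a smaller buffer uses less memory but shuffles less thoroughly. Set this value according to the size of the dataset and the amount of memory available. Refer to the shuffle() documentation for more details.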
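As a minimal sketch, the call looks like this. The range dataset is a stand-in for the interleaved CSV dataset, and the buffer size and seed are arbitrary illustrative values.

```python
import tensorflow as tf

# Stand-in for the interleaved CSV dataset: the integers 0..9 in order.
dataset = tf.data.Dataset.range(10)

# Hypothetical value; tune it to the dataset size and available memory.
shuffle_buffer_size = 5

# The seed is optional and only used here to make runs repeatable.
shuffled = dataset.shuffle(shuffle_buffer_size, seed=42)

# Every element still appears exactly once, in a different order.
values = [int(value) for value in shuffled]
print(values)
```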

NOTE: Shuffling is not necessary during validation or testing. In fact, some scenarios require predictions on the test set to follow the same order as the provided test data (for example, Kaggle competitions). Keep this in mind and disable shuffling if you need to!

Batch

Another common data preprocessing technique is batching the data: processing small chunks at a time instead of the entire dataset at once. The batch() method groups consecutive elements of the dataset into batches of a given size. This makes it possible to process and train on very large datasets, since they never need to be held in memory all at once.
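A minimal sketch of batching, using a range dataset as a stand-in for the CSV rows and an arbitrary batch size of 4:

```python
import tensorflow as tf

# Stand-in for the row-level dataset: the integers 0..9.
dataset = tf.data.Dataset.range(10)

batch_size = 4  # hypothetical value
batched = dataset.batch(batch_size)

# Ten elements in batches of four: the last batch holds the remainder.
batch_sizes = [int(batch.shape[0]) for batch in batched]
print(batch_sizes)  # [4, 4, 2]
```

If every batch must have the same size, passing drop_remainder=True to batch() discards the final, smaller batch.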

Prefetch

Efficient deep learning pipelines leverage the GPU for model training, whereas data fetching and preprocessing occur on a separate device such as the CPU.

If the GPU has to wait for the CPU to finish loading and preprocessing, then it will sit idle, wasting valuable cycles doing nothing. Prefetching the data ahead of time helps prevent this. This greatly increases efficiency, as it maximizes GPU utilization (no wasted or idle cycles).

Enabling prefetching is as simple as calling the prefetch() method, usually as the last step of the pipeline.
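A minimal sketch, again with a range dataset standing in for the real data. Letting TensorFlow pick the buffer size with AUTOTUNE is one reasonable choice; a fixed value such as 1 (one batch prepared ahead) also works.

```python
import tensorflow as tf

# Stand-in pipeline: ten integers in batches of four.
dataset = tf.data.Dataset.range(10).batch(4)

# Prepare upcoming batches in the background while the current one is consumed.
prefetched = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Prefetching changes when elements are produced, not what they contain.
batches = [batch.numpy().tolist() for batch in prefetched]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```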

Putting it All Together

All of these methods can be combined into a single function that takes a file pattern and returns a dataset ready for training.
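One possible shape for such a function is sketched below. The function name csv_reader_dataset, its parameters and default values, the all-float column layout, and the toy files are assumptions made for illustration, not the original listing.

```python
import os
import tempfile

import tensorflow as tf


def csv_reader_dataset(file_pattern, n_columns, cycle_length=4,
                       shuffle_buffer_size=1000, batch_size=32, training=True):
    """Build a batched, prefetched dataset from CSV files matching a glob pattern.

    Hypothetical helper: assumes every file has a header row and n_columns
    float columns.
    """
    autotune = tf.data.experimental.AUTOTUNE
    record_defaults = [tf.float32] * n_columns

    # Shuffle file order only when training, so evaluation order is stable.
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=training)

    # Coarse-grained mixing: read several files at once when training,
    # one file after another otherwise.
    dataset = dataset.interleave(
        lambda path: tf.data.experimental.CsvDataset(
            path, record_defaults, header=True),
        cycle_length=cycle_length if training else 1,
        num_parallel_calls=autotune,
    )

    # Fine-grained (row-level) shuffling, training only.
    if training:
        dataset = dataset.shuffle(shuffle_buffer_size)

    return dataset.batch(batch_size).prefetch(autotune)


# Usage on hypothetical toy data: three files with two rows each.
data_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(data_dir, f"file_{i:03d}.csv"), "w") as f:
        f.write("x1,x2,y\n")
        f.write(f"{i}.0,{i}.5,{i}\n")
        f.write(f"{i}.1,{i}.6,{i}\n")

eval_dataset = csv_reader_dataset(
    os.path.join(data_dir, "file_*.csv"), n_columns=3,
    batch_size=4, training=False)

batch_sizes = []
labels = []
for x1, x2, y in eval_dataset:
    batch_sizes.append(int(y.shape[0]))
    labels.extend(y.numpy().tolist())
print(batch_sizes)  # [4, 2]
```

The training flag reflects the earlier note on shuffling: with training=False, neither the file order nor the rows are shuffled, so validation and test data keep a stable order.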

Conclusion

In this article, we learned to create a simple yet efficient input pipeline using the tf.data API. We covered how to:

Handle very large datasets broken into multiple separate files using the tf.data.Dataset.list_files() static method
Enable parallelism by leveraging multiple threads of execution
Improve convergence using coarse-grained data shuffling with the interleave() dataset method
Improve convergence with fine-grained data shuffling using the shuffle() method
Maximize GPU utilization using the prefetch() method
Combine all these concepts into a single function that yields a dataset ready for model consumption

Sources

Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow
Official TensorFlow documentation
https://en.wikipedia.org/wiki/Memory_hierarchy

Writing Efficient Input Pipelines Using TensorFlow’s Data API was originally published in Towards AI on Medium.

Published via Towards AI

Author(s): Jonathan Quijas