
Writing Efficient Input Pipelines Using TensorFlow’s Data API

Last Updated on January 7, 2023 by Editorial Team

Author(s): Jonathan Quijas

Machine Learning

A Brief Primer on Dataset Processing for Machine Learning Engineering at Scale

Photo by Claudio Testa on Unsplash

This post is heavily inspired by an exercise in Chapter 12 of Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. I highly recommend this book.

Problem Statement

Working with very large datasets is a common scenario in Machine Learning Engineering. Usually, these datasets will be too large to fit in memory. This means the data must be retrieved from disk on the fly during training.

Disk access is multiple orders of magnitude slower than memory access, so efficient retrieval is of high priority.

Fortunately, TensorFlow’s tf.data API provides a simple and intuitive interface to load, preprocess, and even prefetch data.

In this post, we will learn how to create a simple yet powerful input pipeline to efficiently load and preprocess a dataset using the tf.data API.

Input Pipelines using the TensorFlow Data API

Disclaimer: TensorFlow provides excellent documentation of its modules and APIs. For more details, refer to the official site.

The main actor in this tutorial is the tf.data.Dataset class. The objective is to:

  • Create a tf.data.Dataset from data files using the tf.data.Dataset.list_files() static method
  • Apply a sequence of transformation methods to the tf.data.Dataset instance

Creating the Dataset

The tf.data.Dataset.list_files() static method returns a tf.data.Dataset of file paths from a provided list of paths or a glob-style file pattern.

We will assume our dataset is a collection of separate files stored as follows:

datadir/
    file_001.csv
    file_002.csv
    ...
    file_n.csv

To create a tf.data.Dataset of file paths:

filepath_dataset = tf.data.Dataset.list_files('./datadir/file_*.csv')

The file paths are a good first step, but we need the actual data inside the files for learning. We will chain a sequence of methods, starting from the file paths and ending with data ready to use during training.

Transforming the data

Some of the most common and useful tf.data.Dataset methods are: interleave(), shuffle(), batch(), and prefetch(). We will briefly cover each of these in this article.

Interleave

During model training, it is often beneficial to shuffle instances to avoid learning spurious patterns due to the ordering of the training data (for example, a file being alphabetically sorted). The interleave() method provides a simple way to enable coarse-grained shuffling of a dataset composed of separate smaller files.

Coarse-grained in this context means at the file level. Finer-grained shuffling would be at the .csv row level, which we will see shortly.

By using a lambda function, we generate a tf.data.experimental.CsvDataset from each individual file path.
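A minimal sketch of what this interleaving step might look like, continuing from the filepath_dataset created above. The record_defaults schema (nine float columns), the header setting, and the cycle_length are illustrative assumptions, since the actual .csv layout is not shown here:

import tensorflow as tf

csv_dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.experimental.CsvDataset(
        filepath,
        record_defaults=[tf.float32] * 9,  # assumed column types; match your files
        header=True,                       # assumes each file starts with a header row
    ),
    cycle_length=5,  # how many files are read from concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)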

The variable csv_dataset is now a reference to a tf.data.Dataset composed of a collection of CsvDataset datasets. The interleave() method combines (or interleaves) all of the individual CsvDataset objects into a single dataset (in this case, csv_dataset).

Setting num_parallel_calls to tf.data.experimental.AUTOTUNE tells TensorFlow to automatically choose the best number of threads to read the .csv files. This increases pipeline efficiency, as long as the machine has a multicore processor with multithreading support (which most machines have today).

Shuffle

As previously mentioned, shuffling the data prevents the model from learning spurious patterns in the training data. It also improves the convergence of gradient-based methods, such as training a neural network with Gradient Descent. To shuffle the data, use the shuffle() method:
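A minimal sketch, continuing from the csv_dataset built above (the buffer size is an illustrative assumption):

shuffle_buffer_size = 10_000  # assumed value; tune to dataset size and available memory
csv_dataset = csv_dataset.shuffle(shuffle_buffer_size)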

The shuffle_buffer_size value specifies the size of the buffer into which elements from the original (unshuffled) dataset are loaded before being sampled at random. Set it according to the size of the dataset and the amount of memory available. Refer to the shuffle() documentation for more details.

NOTE: Shuffling is not necessary during validation or testing. In fact, some scenarios require predictions on the test set to be made in the same order as the provided test data (for example, Kaggle competitions). Keep this in mind and disable shuffling if you need to!

Batch

Another common preprocessing technique is batching: processing small chunks of data at a time instead of the entire dataset at once. This makes it possible to process and train on very large datasets, since they never need to be held in memory in their entirety.
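A minimal sketch, continuing from the shuffled dataset above (the batch size is an illustrative choice):

batch_size = 32  # assumed value; pick according to your model and hardware
csv_dataset = csv_dataset.batch(batch_size)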

Prefetch

Efficient deep learning pipelines leverage the GPU for model training, while data fetching and preprocessing occur on a separate device, typically the CPU.

If the GPU has to wait for the CPU to finish loading and preprocessing, it sits idle, wasting valuable cycles. Prefetching the data ahead of time helps prevent this and greatly increases efficiency by maximizing GPU utilization (no wasted or idle cycles).

Prefetching data is as simple as calling the prefetch() method:
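For example, continuing from the batched dataset above:

# Let TensorFlow decide how many batches to prepare ahead of time.
csv_dataset = csv_dataset.prefetch(tf.data.experimental.AUTOTUNE)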

Putting it All Together

Combining all of these methods into a single function, the final pipeline can look something like this:
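Here is a minimal sketch under the same assumptions as the snippets above (nine float columns per row; illustrative cycle_length, buffer, and batch sizes). The function name csv_reader_dataset is just a placeholder:

import tensorflow as tf

def csv_reader_dataset(file_pattern, n_columns=9, cycle_length=5,
                       shuffle_buffer_size=10_000, batch_size=32):
    # Dataset of file paths matching the glob pattern.
    filepath_dataset = tf.data.Dataset.list_files(file_pattern)
    # Coarse-grained shuffling: interleave rows from several files at once.
    dataset = filepath_dataset.interleave(
        lambda filepath: tf.data.experimental.CsvDataset(
            filepath,
            record_defaults=[tf.float32] * n_columns,  # assumed column types
            header=True,
        ),
        cycle_length=cycle_length,
        num_parallel_calls=tf.data.experimental.AUTOTUNE,
    )
    # Fine-grained, row-level shuffling.
    dataset = dataset.shuffle(shuffle_buffer_size)
    # Batch, then prefetch so the CPU prepares the next batch while the GPU trains.
    return dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)

train_set = csv_reader_dataset('./datadir/file_*.csv')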

Conclusion

In this article, we learned to create a simple yet efficient input pipeline using the tf.data API. We covered how to:

  • Handle very large datasets broken into multiple separate files using the tf.data.Dataset.list_files() static method
  • Enable parallelism by leveraging multiple threads of execution
  • Improve convergence using coarse-grained data shuffling with the interleave() dataset method
  • Improve convergence using fine-grained data shuffling with the shuffle() method
  • Maximize GPU utilization using the prefetch() method
  • Combine all these concepts into a single function that yields a dataset ready for model consumption

Sources

  • Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. O'Reilly Media.
  • TensorFlow tf.data API documentation: https://www.tensorflow.org/guide/data