
Writing TFRecord Files the Right Way

Last Updated on July 20, 2023 by Editorial Team

Author(s): Dimitre Oliveira

Originally published on Towards AI.

How to properly generate TFRecord files from your datasets

The TFRecord format is a simple format for storing a sequence of binary records.

In this post, you will learn why and when you should use the TFRecord format, along with the code necessary to use it.

This post is a more detailed version of the tutorial that I wrote for the official Keras.io code example page.

Converting your data into TFRecord has many advantages, such as:

  • More efficient storage: the TFRecord data can take up less space than the original data; it can also be partitioned into multiple files.
  • Fast I/O: the TFRecord format can be read with parallel I/O operations, which is useful for TPUs or multiple hosts.
  • Self-contained files: the TFRecord data can be read from a single source — for example, the COCO2017 dataset originally stores data in two folders (“images” and “annotations”).

An important use case of the TFRecord data format is training on TPUs. First, TPUs are fast enough to benefit from optimized I/O operations. In addition, TPUs require data to be stored remotely (e.g. on Google Cloud Storage), and using the TFRecord format makes it easier to load the data without batch-downloading.

Performance using the TFRecord format can be further improved if you also use it with the tf.data API.
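As a rough sketch of that combination (the file name and record contents here are just for illustration), a TFRecord shard can be read through tf.data with parallel reads and prefetching:

```python
import tensorflow as tf

# Write a tiny TFRecord shard so the read sketch is self-contained;
# in practice you would point at your own generated shards.
with tf.io.TFRecordWriter("example.tfrec") as writer:
    writer.write(b"hello record")

# num_parallel_reads lets tf.data read several shards concurrently,
# and prefetch overlaps input I/O with downstream computation.
dataset = tf.data.TFRecordDataset(
    ["example.tfrec"], num_parallel_reads=tf.data.AUTOTUNE
).prefetch(tf.data.AUTOTUNE)

records = [record.numpy() for record in dataset]
```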

Here, you will learn how to convert data of different types (image, text, and numeric) into TFRecord.

The COCO 2017 dataset

We will be using the COCO2017 dataset, because it has many different types of features, including images, floating-point data, and lists. It will serve as a good example of how to encode different features into the TFRecord format.

This dataset contains two kinds of data: images and annotation metadata.

The images are a collection of JPG files, and the metadata is stored in a JSON file which, according to the official site, contains the following properties:

id: int,
image_id: int,
category_id: int,
segmentation: RLE or [polygon], object segmentation mask
bbox: [x,y,width,height], object bounding box coordinates
area: float, area of the bounding box
iscrowd: 0 or 1, is single object or a collection

Let’s look at one sample from the dataset:

{'area': 367.89710000000014,
 'bbox': [265.67, 222.31, 26.48, 14.71],
 'category_id': 72,
 'id': 34096,
 'image_id': 525083,
 'iscrowd': 0,
 'segmentation': [[267.51, 222.31,
                   292.15, 222.31,
                   291.05, 237.02,
                   265.67, 237.02]]}

Starting the data conversion

To start the data conversion process, first, we need to define a few functions.
Beginning with the conversion from raw data to TensorFlow types:

import os
import json

import tensorflow as tf
import matplotlib.pyplot as plt


def image_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()])
    )

def bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))

def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def float_feature_list(value):
    """Returns a float_list from a list of floats / doubles."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def create_example(image, path, example):
    feature = {
        "image": image_feature(image),
        "path": bytes_feature(path),
        "area": float_feature(example["area"]),
        "bbox": float_feature_list(example["bbox"]),
        "category_id": int64_feature(example["category_id"]),
        "id": int64_feature(example["id"]),
        "image_id": int64_feature(example["image_id"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

These functions are very intuitive but let us get a better understanding of how they will be used.

The basic functions are bytes_feature, float_feature, and int64_feature. They convert basic data such as strings and numeric values (integers and floats) into the equivalent TensorFlow feature types. The difference between float_feature and float_feature_list is that the latter converts a list of floats rather than a single value.
Next, we have image_feature, which is used to convert images. Images could also be stored as a plain byte string or a float list, but encoding them with encode_jpeg is more space-efficient.
Finally, create_example brings it all together: it receives all the necessary data, converts each field to the appropriate TensorFlow feature type, and builds a dictionary that will later be serialized and written to a TFRecord file.
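To see the round trip in miniature, the sketch below (using only the float_feature helper defined above) serializes a single-feature Example and parses it back:

```python
import tensorflow as tf

def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

# Build an Example with one feature and serialize it to bytes,
# exactly as a TFRecord writer would store it.
example = tf.train.Example(
    features=tf.train.Features(feature={"area": float_feature(367.89)})
)
serialized = example.SerializeToString()

# Parsing reverses the process, given a matching feature description.
parsed = tf.io.parse_single_example(
    serialized, {"area": tf.io.FixedLenFeature([], tf.float32)}
)
area = float(parsed["area"].numpy())
```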

Now we can define a few parameters to start the process.

num_samples is the number of data samples in each TFRecord file.

num_tfrecords is the total number of TFRecord files that we will create.

root_dir = "datasets"  # input data root folder
tfrecords_dir = "tfrecords"  # output data folder
images_dir = os.path.join(root_dir, "val2017")  # input images folder
# input annotations folder and filepath
annotations_dir = os.path.join(root_dir, "annotations")
annotation_file = os.path.join(annotations_dir, "instances_val2017.json")

with open(annotation_file, "r") as f:  # load annotation data as list
    annotations = json.load(f)["annotations"]

num_samples = 4096
num_tfrecords = len(annotations) // num_samples

if not os.path.exists(tfrecords_dir):
    os.makedirs(tfrecords_dir)  # creating TFRecords output folder

Generate data in the TFRecord format

Now we can iterate over the COCO2017 data and create the TFRecord files. The file name format will be file_{number}-{count}.tfrec (this is optional, but including the sequence number and sample count in the file names can make bookkeeping easier).

for tfrec_num in range(num_tfrecords):
    samples = annotations[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

    with tf.io.TFRecordWriter(
        tfrecords_dir + "/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        for sample in samples:
            image_path = f"{images_dir}/{sample['image_id']:012d}.jpg"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())

Here is what happens inside this loop:

First, we slice the annotations list to take only the samples that will be written during this iteration; the slice size is based on num_samples, the number of samples per TFRecord file that we defined earlier.

Next, we use tf.io.TFRecordWriter to create the TFRecord file, and a nested loop iterates over the samples sliced in the first step.

In the inner loop, we first build the image file path and use it to read the image with tf.io.read_file and tf.io.decode_jpeg. We then call create_example with those attributes; this function returns a TensorFlow Example that we serialize with example.SerializeToString() and, finally, write to the TFRecord file created in the second step.

Explore one sample from the generated TFRecord

To open the newly created TFRecord file, we are going to need a parse function. This function takes care of converting the sequence of binary records back into the appropriate TensorFlow data types.

def parse_tfrecord_fn(example):
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
        "area": tf.io.FixedLenFeature([], tf.float32),
        "bbox": tf.io.VarLenFeature(tf.float32),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "id": tf.io.FixedLenFeature([], tf.int64),
        "image_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    example["bbox"] = tf.sparse.to_dense(example["bbox"])
    return example

Read an image and display it:

raw_dataset = tf.data.TFRecordDataset(f"{tfrecords_dir}/file_00-{num_samples}.tfrec")
parsed_dataset = raw_dataset.map(parse_tfrecord_fn)

for features in parsed_dataset.take(1):
    for key in features.keys():
        if key != "image":
            print(f"{key}: {features[key]}")

    print(f"Image shape: {features['image'].shape}")
    plt.figure(figsize=(7, 7))
    plt.imshow(features["image"].numpy())
    plt.show()

Output:

bbox: [473.07 395.93 38.65 28.67]
area: 702.1057739257812
category_id: 18
id: 1768
image_id: 289343
path: b'datasets/val2017/000000289343.jpg'
Image shape: (640, 529, 3)

Another advantage of TFRecord is that you can write many features to a file and later parse only the few that you actually need.
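For instance, a reduced parse function could declare only the fields a classifier needs and silently skip the rest (a sketch reusing the feature names written above):

```python
import tensorflow as tf

def parse_labels_only(serialized):
    # Declare only the features we want; any other features stored
    # in the record (path, area, bbox, ...) are simply ignored.
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(serialized, feature_description)
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    return example
```

This function can be passed to raw_dataset.map in place of parse_tfrecord_fn; parsing fewer features also means less decoding work per record.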

Now you could train a model using the newly generated TFRecord files; to see an example, check out the Keras.io tutorial that I wrote.

Conclusion

This article demonstrated that, thanks to TFRecord, you can have your data come from a single source instead of reading images and annotations from different ones. This can make storing and reading data simpler and more efficient. For more information, see the TFRecord and tf.train.Example tutorial.

References

TFRecord and tf.train.Example (TensorFlow)
Creating TFRecords (Keras.io)
TFRecords Basics (Kaggle)

