Writing TFRecord Files the Right Way
Last Updated on July 20, 2023 by Editorial Team
Author(s): Dimitre Oliveira
Originally published on Towards AI.
How to properly generate TFRecord files from your datasets
The TFRecord format is a simple format for storing a sequence of binary records.
In this post you will learn why and when you should use the TFRecords format, and the code necessary to use it.
This post is a more detailed version of the tutorial that I wrote for the official Keras.io code example page.
Converting your data into TFRecord has many advantages, such as:
- More efficient storage: the TFRecord data can take up less space than the original data; it can also be partitioned into multiple files.
- Fast I/O: the TFRecord format can be read with parallel I/O operations, which is useful for TPUs or multiple hosts.
- Self-contained files: the TFRecord data can be read from a single source; for example, the COCO2017 dataset originally stores data in two folders ("images" and "annotations").
An important use case of the TFRecord data format is training on TPUs. First, TPUs are fast enough to benefit from optimized I/O operations. In addition, TPUs require data to be stored remotely (e.g. on Google Cloud Storage), and using the TFRecord format makes it easier to load the data without batch-downloading.
Performance using the TFRecord format can be further improved if you also use it with the tf.data API.
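As a minimal sketch of what that combination looks like (the shard paths below are hypothetical placeholders), a tf.data pipeline can read multiple TFRecord shards in parallel and prefetch batches while the model trains:

```python
import tensorflow as tf

# Sketch only: these shard paths are hypothetical placeholders.
filenames = ["tfrecords/file_00.tfrec", "tfrecords/file_01.tfrec"]

dataset = (
    tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap file reads with model execution
)
```

Letting AUTOTUNE pick the parallelism means the same pipeline works on a laptop or a multi-host TPU setup without manual tuning.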
Here, you will learn how to convert data of different types (image, text, and numeric) into TFRecord.
The COCO 2017 dataset
We will be using the COCO2017 dataset, because it has many different types of features, including images, floating-point data, and lists. It will serve as a good example of how to encode different features into the TFRecord format.
This dataset has two sets of fields: images and annotation meta-data.
The images are a collection of JPG files and the meta-data are stored in a JSON file which, according to the official site, contains the following properties:
id: int,
image_id: int,
category_id: int,
segmentation: RLE or [polygon], object segmentation mask
bbox: [x,y,width,height], object bounding box coordinates
area: float, area of the bounding box
iscrowd: 0 or 1, is single object or a collection
Let's look at one sample from the dataset:
{'area': 367.89710000000014,
 'bbox': [265.67, 222.31, 26.48, 14.71],
 'category_id': 72,
 'id': 34096,
 'image_id': 525083,
 'iscrowd': 0,
 'segmentation': [[267.51, 222.31, 292.15, 222.31, 291.05, 237.02, 265.67, 237.02]]}
Starting the data conversion
To start the data conversion process, first, we need to define a few functions.
Beginning with the conversion from raw data to TensorFlow types:
import os
import json
import tensorflow as tf

def image_feature(value):
    """Returns a bytes_list from an image tensor, encoded as JPEG."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()])
    )

def bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))

def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def float_feature_list(value):
    """Returns a float_list from a list of floats / doubles."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def create_example(image, path, example):
    feature = {
        "image": image_feature(image),
        "path": bytes_feature(path),
        "area": float_feature(example["area"]),
        "bbox": float_feature_list(example["bbox"]),
        "category_id": int64_feature(example["category_id"]),
        "id": int64_feature(example["id"]),
        "image_id": int64_feature(example["image_id"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
These functions are very intuitive but let us get a better understanding of how they will be used.
The basic functions are bytes_feature, float_feature, and int64_feature; they convert basic data such as strings and numbers (integers and floats) into the equivalent TensorFlow feature types. The difference between float_feature and float_feature_list is that float_feature_list converts a whole list of floats, not just a single value.
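To make that distinction concrete, here is a minimal sketch of the underlying protos these helpers build (the values are borrowed from the sample annotation shown earlier):

```python
import tensorflow as tf

# float_feature wraps a single value in a FloatList...
single = tf.train.Feature(float_list=tf.train.FloatList(value=[367.9]))

# ...while float_feature_list wraps a whole list, e.g. a bounding box.
bbox = tf.train.Feature(
    float_list=tf.train.FloatList(value=[265.67, 222.31, 26.48, 14.71])
)
```

Both produce a tf.train.Feature; the only difference is how many values the FloatList holds.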
Next, we have image_feature, which is used to convert images. Images could also be converted with a regular string or float list, but using encode_jpeg is more efficient. Finally, create_example brings it all together: it receives all the necessary data, converts it to the appropriate TensorFlow types, and builds a feature dictionary that will later be serialized and written to a TFRecord file.
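As a rough sketch of what create_example produces, here is a minimal tf.train.Example built by hand from just two of the COCO fields (values taken from the sample shown earlier) and serialized to bytes:

```python
import tensorflow as tf

# Minimal, hand-built example with two of the COCO fields.
feature = {
    "area": tf.train.Feature(float_list=tf.train.FloatList(value=[367.9])),
    "category_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[72])),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))

# The serialized bytes are what TFRecordWriter.write() expects.
payload = example.SerializeToString()
```

The full create_example does exactly this, only with all seven fields.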
Now we can define a few parameters to start the process.
num_samples is the number of data samples in each TFRecord file, and num_tfrecords is the total number of TFRecord files that we will create.
root_dir = "datasets"  # input data root folder
tfrecords_dir = "tfrecords"  # output data folder
images_dir = os.path.join(root_dir, "val2017")  # input images folder

# input annotations folder and filepath
annotations_dir = os.path.join(root_dir, "annotations")
annotation_file = os.path.join(annotations_dir, "instances_val2017.json")

with open(annotation_file, "r") as f:  # load annotation data as list
    annotations = json.load(f)["annotations"]

num_samples = 4096
num_tfrecords = len(annotations) // num_samples
if len(annotations) % num_samples:
    num_tfrecords += 1  # extra file for the remaining samples

if not os.path.exists(tfrecords_dir):
    os.makedirs(tfrecords_dir)  # creating TFRecords output folder
Generate data in the TFRecord format
Now we can iterate over the COCO2017 data and create the TFRecord files. The file name format will be file_{number}.tfrec
(this is optional, but including the number sequences in the file names can make counting easier).
for tfrec_num in range(num_tfrecords):
    samples = annotations[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]
    with tf.io.TFRecordWriter(
        tfrecords_dir + "/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        for sample in samples:
            image_path = f"{images_dir}/{sample['image_id']:012d}.jpg"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())
Here is what happens inside this loop:
First, we slice the annotations list to take only the samples that will be written during this iteration; the slice size is the num_samples value that we defined earlier for each TFRecord file. Next, we use tf.io.TFRecordWriter to create the TFRecord file, and an inner loop iterates over the samples sliced in the first step. Inside that inner loop, we build the image file path, read the file with tf.io.read_file, and decode it with tf.io.decode_jpeg. We then call the create_example function with those attributes; it returns a TensorFlow example that we serialize with example.SerializeToString() and, finally, write to the TFRecord file created in the second step.
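As a quick sanity check after writing (a hypothetical addition, not part of the original tutorial), you can count how many records actually landed in a shard by iterating over it without parsing:

```python
import tensorflow as tf

def count_records(tfrecord_path):
    """Count the serialized examples stored in a TFRecord file."""
    return sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
```

Comparing this count against len(samples) for each shard confirms that no records were dropped during writing.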
Explore one sample from the generated TFRecord
To open the newly created TFRecord file, we are going to need a parse function; this function will take care of converting the sequence of binary records back into the appropriate TensorFlow data types.
def parse_tfrecord_fn(example):
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
        "area": tf.io.FixedLenFeature([], tf.float32),
        "bbox": tf.io.VarLenFeature(tf.float32),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "id": tf.io.FixedLenFeature([], tf.int64),
        "image_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    example["bbox"] = tf.sparse.to_dense(example["bbox"])
    return example
Read an image and display
import matplotlib.pyplot as plt

raw_dataset = tf.data.TFRecordDataset(f"{tfrecords_dir}/file_00-{num_samples}.tfrec")
parsed_dataset = raw_dataset.map(parse_tfrecord_fn)

for features in parsed_dataset.take(1):
    for key in features.keys():
        if key != "image":
            print(f"{key}: {features[key]}")
    print(f"Image shape: {features['image'].shape}")
    plt.figure(figsize=(7, 7))
    plt.imshow(features["image"].numpy())
    plt.show()
Output:
bbox: [473.07 395.93 38.65 28.67]
area: 702.1057739257812
category_id: 18
id: 1768
image_id: 289343
path: b'datasets/val2017/000000289343.jpg'
Image shape: (640, 529, 3)
Another advantage of the TFRecord format is that you can store many features in each record and later read back only the ones you need.
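As a sketch of that selective reading (parse_slim and slim_description are hypothetical names, not part of the original tutorial), you simply declare a smaller feature description; tf.io.parse_single_example skips every field you leave out:

```python
import tensorflow as tf

# Hypothetical "slim" description: declare only the fields you need.
# Any other features stored in the record are simply skipped.
slim_description = {
    "image_id": tf.io.FixedLenFeature([], tf.int64),
    "bbox": tf.io.VarLenFeature(tf.float32),
}

def parse_slim(serialized):
    return tf.io.parse_single_example(serialized, slim_description)
```

This lets one set of TFRecord files serve several tasks, e.g. a detection model can read the bounding boxes while a classifier ignores them.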
Now you could train a model using the newly generated TFRecord files; to see an example, check out the Keras.io tutorial that I wrote.
Conclusion
This article demonstrates that, instead of reading images and annotations from different sources, you can have your data come from a single source thanks to TFRecord. This process can make storing and reading data simpler and more efficient. For more information, you can go to the TFRecord and tf.train.Example tutorial.
References
– TFRecord and tf.train.Example (TensorFlow)
– Creating TFRecords (Keras.io)
– TFRecords Basics (Kaggle)