
A Gentle Introduction to Audio Classification With Tensorflow

Last Updated on May 3, 2021 by Editorial Team

Author(s): Dimitre Oliveira

Deep Learning

Applying deep learning to classify audio with Tensorflow

Source: https://www.tensorflow.org/tutorials/audio/simple_audio

We have seen many recent advances in deep learning in the vision and language fields. It is intuitive to understand why CNNs perform very well on images, given the local correlation of pixels, and why sequential models like RNNs or transformers perform very well on language, given its sequential nature. But what about audio? What types of models and processes are used when we are dealing with audio data?

In this article, you will learn how to approach a simple audio classification problem. You will learn some of the common and efficient methods used, and the Tensorflow code to implement them.

Disclaimer: The code presented here is based on my work developed for the β€œRainforest Connection Species Audio Detection” Kaggle competition, but for demonstration purposes, I will use the β€œSpeech Commands” dataset.

Waveforms

We usually have audio files in the “.wav” format, commonly referred to as waveforms. A waveform is a time series containing the signal amplitude at each specific time step. If we visualize one of those waveform samples, we get something like this:

x-axis is the time and y-axis is the normalized signal amplitude
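If you want to reproduce this kind of plot yourself, a minimal sketch could look like the following (it assumes waveform is a 1-D float tensor decoded from a .wav file, as shown later in this article):

import matplotlib.pyplot as plt

def plot_waveform(waveform):
    # Plot the signal amplitude at each sample index.
    plt.figure(figsize=(10, 3))
    plt.plot(waveform.numpy())
    plt.xlabel('Time (samples)')
    plt.ylabel('Normalized amplitude')
    plt.show()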

Intuitively, one might consider modeling this data like a regular time series (e.g. stock price forecasting) using some kind of RNN model. This could in fact be done, but since we are using audio signals, a more appropriate choice is to transform the waveform samples into spectrograms.

Spectrograms

A spectrogram is an image representation of the waveform signal: it shows the signal's frequency intensity range over time, which is very useful when we want to evaluate the signal's frequency distribution over time. Below is the spectrogram representation of the waveform image we saw above.

x-axis is the sampled time and y-axis is the frequency

Speech Commands use case

To make this tutorial simpler, we will be using the “Speech Commands” dataset. It contains one-second audio clips of spoken words like “down”, “go”, “left”, “no”, “right”, “stop”, “up” and “yes”.
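If you want to follow along, the dataset can be downloaded and extracted with tf.keras.utils.get_file; the snippet below is a minimal sketch following the layout used by the official TensorFlow tutorial linked in the references (the URL and folder names are taken from that tutorial):

import pathlib
import tensorflow as tf

# Download and unzip the mini Speech Commands dataset into ./data.
data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
    tf.keras.utils.get_file(
        'mini_speech_commands.zip',
        origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
        extract=True,
        cache_dir='.', cache_subdir='data')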

Audio processing with Tensorflow

Now that we have an idea of how audio data is processed for use with deep learning models, we can proceed to look at the code implementation. Our pipeline will follow the simple workflow described by the diagram below:

Simple audio processing diagram

Note that in our use case, at the 1st step the data is loaded directly from “.wav” files, and the 3rd step (cropping) is optional, since each audio file is only one second long. Cropping may be a good idea for longer files, and also for keeping a fixed length across all samples; a sketch of this step follows below.
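As a rough illustration of that optional cropping/padding step (this helper is not part of the article's pipeline, and target_len is an assumed fixed sample count), it could look like this:

def crop_or_pad(waveform, target_len=16000):
    # Crop anything beyond target_len samples...
    waveform = waveform[:target_len]
    # ...and zero-pad at the end if the clip is shorter.
    pad_amount = target_len - tf.shape(waveform)[0]
    return tf.pad(waveform, [[0, pad_amount]])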

Loading the data

import tensorflow as tf

def load_dataset(filenames):
    # Build a tf.data.Dataset where each element is one .wav file path.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    return dataset

The load_dataset function is responsible for turning the list of .wav file paths into a Tensorflow dataset; the actual files are read and decoded in the next step.
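For example, assuming the dataset was extracted to data/mini_speech_commands as shown earlier, the file paths can be gathered and shuffled like this before building the dataset:

# Collect every .wav path (one subfolder per command) and shuffle them.
filenames = tf.io.gfile.glob(str(data_dir) + '/*/*.wav')
filenames = tf.random.shuffle(filenames)
dataset = load_dataset(filenames)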

Extracting waveform and label

import os
import numpy as np

# List the command folders; each folder name is a class label.
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']

def decode_audio(audio_binary):
    # Decode the .wav bytes into a float tensor and drop the channel axis.
    audio, _ = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1)

def get_label(filename):
    # The label is the parent folder name of the file.
    label = tf.strings.split(filename, os.path.sep)[-2]
    # One-hot encode it using its position in the commands list.
    return tf.one_hot(tf.argmax(label == commands), len(commands))

def get_waveform_and_label(filename):
    label = get_label(filename)
    audio_binary = tf.io.read_file(filename)
    waveform = decode_audio(audio_binary)
    return waveform, label

After loading the .wav files, we need to decode them. This can be done with the tf.audio.decode_wav function, which turns the .wav files into float tensors. Next, we need to extract the labels from the files; in this specific use case, we can get the label from each sample's file path, and after that we just need to one-hot encode the labels.

Here is an example. First, we get a file path like this one:

"data/mini_speech_commands/up/50f55535_nohash_0.wav"

Then we extract the text after the second “/”; in this case, the label is “up”. Finally, we use the commands list to one-hot encode the labels.

Commands: ['up' 'down' 'go' 'stop' 'left' 'no' 'yes' 'right']
Label = 'up'
After one-hot encoding:
Label = [1, 0, 0, 0, 0, 0, 0, 0]
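To make this concrete, here is the same extraction as runnable code (using the example path above; the printed values assume the commands order shown):

filename = 'data/mini_speech_commands/up/50f55535_nohash_0.wav'
label = tf.strings.split(filename, os.path.sep)[-2]                # -> b'up'
one_hot = tf.one_hot(tf.argmax(label == commands), len(commands))  # -> [1, 0, 0, 0, 0, 0, 0, 0]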

Transforming waveforms into spectrograms

The next step is to convert the waveform files into spectrograms. Luckily, Tensorflow has a function that can do that: tf.signal.stft applies a short-time Fourier transform (STFT) to convert the audio into the time-frequency domain. We then apply the tf.abs operator to remove the signal phase and keep only the magnitude. Note that the tf.signal.stft function has parameters like frame_length and frame_step that affect the generated spectrogram; I will not go into details about how to tune them, but you can refer to this video to learn more.

def get_spectrogram(waveform, padding=False, min_padding=48000):
    # (The padding arguments come from the original competition code and are unused here.)
    waveform = tf.cast(waveform, tf.float32)
    # Short-time Fourier transform: time-frequency representation of the signal.
    spectrogram = tf.signal.stft(waveform, frame_length=2048, frame_step=512, fft_length=2048)
    # Keep only the magnitude, discarding the phase.
    spectrogram = tf.abs(spectrogram)
    return spectrogram

def get_spectrogram_tf(waveform, label):
    spectrogram = get_spectrogram(waveform)
    # Add a channel dimension so the spectrogram can be treated as an image.
    spectrogram = tf.expand_dims(spectrogram, axis=-1)
    return spectrogram, label
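As a quick sanity check on these parameters: a one-second Speech Commands clip sampled at 16 kHz has 16,000 samples, so with frame_length=2048 and frame_step=512 the STFT yields 1 + (16000 - 2048) // 512 = 28 frames, each with fft_length // 2 + 1 = 1025 frequency bins:

spectrogram = get_spectrogram(tf.zeros([16000]))
print(spectrogram.shape)  # (28, 1025)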

Transforming spectrograms into RGB images

The final step is to transform the spectrograms into RGB images. This step is optional, but here we will be using a model pre-trained on the ImageNet dataset, which requires input images with 3 channels; otherwise, you could keep the spectrograms with only one channel.

def prepare_sample(spectrogram, label):
    # Resize to a fixed image size and replicate the single channel to RGB.
    spectrogram = tf.image.resize(spectrogram, [HEIGHT, WIDTH])
    spectrogram = tf.image.grayscale_to_rgb(spectrogram)
    return spectrogram, label

Combining it all together

HEIGHT, WIDTH = 128, 128
AUTO = tf.data.AUTOTUNE

def get_dataset(filenames, batch_size=32):
    dataset = load_dataset(filenames)
    dataset = dataset.map(get_waveform_and_label, num_parallel_calls=AUTO)
    dataset = dataset.map(get_spectrogram_tf, num_parallel_calls=AUTO)
    dataset = dataset.map(prepare_sample, num_parallel_calls=AUTO)
    dataset = dataset.shuffle(256)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTO)
    return dataset

Bringing it all together, we have the get_dataset function that takes the filenames as input and, after going through all the steps described above, returns a Tensorflow dataset with RGB spectrogram images and their labels.
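As a quick shape check (assuming FILENAMES holds the list of .wav paths gathered earlier), one batch should contain 128x128 RGB spectrograms and one-hot labels:

sample_ds = get_dataset(FILENAMES)
spectrograms, labels = next(iter(sample_ds))
print(spectrograms.shape)  # (32, 128, 128, 3)
print(labels.shape)        # (32, 8)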

The model

import efficientnet.tfkeras as efn  # EfficientNet backbone from the 'efficientnet' pip package
from tensorflow.keras import Model
from tensorflow.keras import layers as L

def model_fn(input_shape, N_CLASSES):
    inputs = L.Input(shape=input_shape, name='input_audio')
    # EfficientNetB0 pre-trained on ImageNet, without its classification head.
    base_model = efn.EfficientNetB0(input_tensor=inputs,
                                    include_top=False,
                                    weights='imagenet')
    x = L.GlobalAveragePooling2D()(base_model.output)
    x = L.Dropout(.5)(x)
    output = L.Dense(N_CLASSES, activation='softmax', name='output')(x)
    model = Model(inputs=inputs, outputs=output)
    return model

Our model will have an EfficientNetB0 backbone; on top of it, we add a GlobalAveragePooling2D layer followed by a Dropout layer, with a final Dense layer that performs the actual multi-class classification.

With a small dataset, EfficientNetB0 may be a good baseline: it achieves decent accuracy while still being a fast and light model.

Training

from tensorflow.keras import losses, metrics

CHANNELS = 3   # RGB spectrogram images
N_CLASSES = 8  # the eight spoken commands in the dataset

model = model_fn((None, None, CHANNELS), N_CLASSES)
model.compile(optimizer=tf.optimizers.Adam(),
              loss=losses.CategoricalCrossentropy(),
              metrics=[metrics.CategoricalAccuracy()])

# FILENAMES: the list of training .wav paths (see the glob example earlier).
model.fit(x=get_dataset(FILENAMES),
          steps_per_epoch=100,
          epochs=10)

The training code is very standard for a Keras model, so you probably won't find anything new here.
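Once trained, the same preprocessing functions can be reused at inference time. The sketch below is not from the original article, just one way to classify a single clip with the pipeline defined above:

def predict_command(filename, model):
    waveform, _ = get_waveform_and_label(filename)       # decode the audio (label is ignored)
    spectrogram, _ = get_spectrogram_tf(waveform, None)  # STFT magnitude + channel dim
    spectrogram, _ = prepare_sample(spectrogram, None)   # resize and convert to RGB
    probs = model(tf.expand_dims(spectrogram, axis=0), training=False)
    return commands[int(tf.argmax(probs[0]))]

print(predict_command('data/mini_speech_commands/up/50f55535_nohash_0.wav', model))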

Conclusion

Now you should have a clearer understanding of the workflow for applying deep learning to audio files. While this is not the only way to do it, it is one of the best options in terms of the ease-of-use/performance trade-off. If you are going to model audio, you may also want to consider other promising approaches, like transformers.

As additional preprocessing steps, you can truncate or pad the waveforms. This might be a good idea in cases where your samples have different lengths, or if the samples are too long and you only need a smaller part of them; you can find the code for how to do it in the references section below.

References
Simple audio recognition: Recognizing keywords
Rainforest-Audio classification Tensorflow starter
Rainforest-Audio classification TF Improved


A Gentle Introduction to Audio Classification With Tensorflow was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
