

NLP with RNNs and Beam Search

Last Updated on July 25, 2023 by Editorial Team

Author(s): Akshith Kumar

Originally published on Towards AI.

All you need to know about beam search and RNNs.

Photo by Bradyn Trollip on Unsplash

Recurrent Neural Networks (RNNs) are a type of neural network that is commonly used in Natural Language Processing (NLP) tasks. The main advantage of RNNs is their ability to handle sequential data, which is a common feature of text data.

Here’s an overview of how RNNs can be used for NLP tasks:

  1. Preprocessing: The first step in using RNNs for NLP is to preprocess the text data. This may involve tasks such as tokenization, stemming, and removing stop words.
  2. Word Embeddings: Before feeding the text data into an RNN, it is common to convert each word into a dense vector representation, known as a word embedding. This allows the RNN to better capture the semantic meaning of words.
  3. Sequence Processing: Once the text data has been preprocessed and converted into word embeddings, it can be fed into an RNN for sequence processing. The RNN will take a sequence of word embeddings as input and output a hidden state at each time step. This hidden state captures the context of the current input word, as well as the context of all previous words in the sequence.
  4. Output Processing: Depending on the specific NLP task, the output of the RNN may need to be processed further. For example, in a text classification task, the output of the RNN may be fed into a fully connected layer with a softmax activation to produce a probability distribution over the possible classes.
  5. Training: The RNN is trained using backpropagation through time (BPTT), which is a variant of backpropagation that takes into account the sequential nature of the input data.
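
To make step 3 more concrete, here is a minimal sketch of a simple recurrent cell in plain Python. The `rnn_forward` name, the weights, and the toy 2-dimensional embeddings are illustrative assumptions rather than part of any library; the only point is that each hidden state depends on both the current input and the previous hidden state.

```python
import math

def rnn_forward(inputs, w_x, w_h, b):
    """Run a toy recurrent cell over a sequence and return every hidden state.

    inputs: list of input vectors (one per time step)
    w_x:    input-to-hidden weights, indexed [hidden][input]
    w_h:    hidden-to-hidden weights, indexed [hidden][hidden]
    b:      bias vector with one entry per hidden unit
    """
    hidden_size = len(b)
    h = [0.0] * hidden_size              # initial hidden state: all zeros
    states = []
    for x in inputs:
        new_h = []
        for i in range(hidden_size):
            total = b[i]
            total += sum(w_x[i][j] * x[j] for j in range(len(x)))
            total += sum(w_h[i][k] * h[k] for k in range(hidden_size))
            new_h.append(math.tanh(total))
        h = new_h
        states.append(h)
    return states

# Hypothetical 2-dimensional "embeddings" for a three-word sequence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_x = [[0.5, -0.3], [0.2, 0.8]]
w_h = [[0.1, 0.4], [-0.2, 0.3]]
b = [0.0, 0.0]

states = rnn_forward(embeddings, w_x, w_h, b)
```

The first and third inputs are identical, yet their hidden states differ, because the third state also carries information about the two earlier inputs.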

Let’s look at RNNs in more depth.

Generating Text

Chopping the sequential dataset into multiple windows

The training set now consists of a single sequence of over a million characters, so we can’t train the neural network directly on it: the RNN would be equivalent to a deep network with over a million layers, and we would have a single (very long) instance to train it on. Instead, we use the dataset’s window() method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset is then a fairly short substring of the whole text, and the RNN is unrolled only over the length of these substrings. This is called truncated backpropagation through time.

n_steps = 100
window_length = n_steps + 1  # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

We can try tuning n_steps; it is easier to train RNNs on shorter input sequences, but of course, the RNN will not be able to learn any pattern longer than n_steps, so don’t make it too small.

dataset = dataset.flat_map(lambda window: window.batch(window_length))

The flat_map() method takes a function as an argument, which lets us transform each nested dataset (here, batching each window into a single tensor) before flattening.
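
As a rough illustration of what window() followed by flat_map() produces, the following plain-Python sketch builds the same kind of overlapping windows from a list. The `make_windows` helper is hypothetical and only mimics the behaviour with drop_remainder=True; it is not part of TensorFlow.

```python
def make_windows(sequence, window_length, shift=1):
    """Return every full window of `window_length` items, moving `shift` items each time.

    Windows that would run past the end of the sequence are dropped,
    mirroring drop_remainder=True.
    """
    windows = []
    for start in range(0, len(sequence) - window_length + 1, shift):
        windows.append(sequence[start:start + window_length])
    return windows

char_ids = list(range(10))               # stand-in for an encoded text of 10 characters
windows = make_windows(char_ids, window_length=4, shift=1)
for w in windows[:3]:
    print(w)
# [0, 1, 2, 3]
# [1, 2, 3, 4]
# [2, 3, 4, 5]
```

With shift=1, consecutive windows overlap in all but one position, which is why a long text yields a very large number of training instances.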

batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

We shuffle the windows, batch them, and then separate the inputs (the first 100 characters of each window) from the targets (the last 100 characters, i.e. the inputs shifted one character ahead).

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

We encode the input features as one-hot vectors (the targets stay as integer character IDs), then add prefetching.
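
The input/target split and the one-hot encoding can also be sketched without TensorFlow. The helper names `split_input_target` and `one_hot` below are hypothetical, and the four-character vocabulary is only an assumption for the example.

```python
def split_input_target(window):
    """Inputs are all items but the last; targets are the same items shifted one step ahead."""
    return window[:-1], window[1:]

def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at position `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

window = [3, 0, 2, 1]                    # four encoded characters, vocabulary of size 4
inputs, targets = split_input_target(window)
encoded_inputs = [one_hot(i, depth=4) for i in inputs]

print(inputs)    # [3, 0, 2]
print(targets)   # [0, 2, 1]
```

Only the inputs are one-hot encoded; the targets remain integer IDs, which is what a sparse categorical cross-entropy loss expects.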

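Text: The quick brown fox jumped over the lazy dog. The lazy dog slept all day.

Window size: 6 words (this toy example works on words rather than characters and uses non-overlapping windows to stay short; the final, shorter window would be discarded by drop_remainder=True)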

1. "The quick brown fox jumped over"
2. "the lazy dog. The lazy dog"
3. "slept all day."

Shuffled dataset:
1. "slept all day."
2. "The quick brown fox jumped over"
3. "the lazy dog. The lazy dog"

Building and training the Char-RNN Model

To predict the next character based on the previous 100 characters, we can use an RNN with 2 GRU layers of 128 units each and 20% dropout on the inputs of each layer (dropout=0.2); the recurrent_dropout argument can be added to also apply dropout to the hidden states. The output layer is a time-distributed Dense layer.

The TimeDistributed wrapper applies a layer to every temporal slice of an input. In other words, the same Dense layer is applied independently at each time step, so there is one output per input time step.

max_id = 39  # number of distinct characters

model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, epochs=5)

Pre-processing text to predict by the model

Creating a function to pre-process the text as we did while building the model.

def preprocess(texts):
    # Character IDs from the tokenizer start at 1, so shift them to start at 0
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

Predicting the next char by RNN model.

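X_new = preprocess(['How are yo'])
# predict_classes() was removed from recent TensorFlow versions; take the argmax instead
Y_pred = np.argmax(model.predict(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]  # 1st sentence, last char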

Generating Fake Shakespearean Text

To generate new text using the RNN model, we could feed some text into it, make the model predict the most likely next letter, add it at the end of the text, then give the extended text to the model to predict the next letter, and so on. But in practice, this often leads to the same words being repeated over and over again. Instead, we can pick the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s tf.random.categorical() function.

The categorical() function samples random class indices, given the class log probabilities (logits).

We can divide the logits by a number called the temperature, which we can tweak as we wish: a temperature close to 0 will favor the high-probability characters, while a very high temperature will give all characters nearly equal probability.

def next_char(text, temperature=1):
    X_new = preprocess([text])
    # Keep only the probabilities predicted for the last time step
    y_prob = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_prob) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
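
To see the effect of the temperature in isolation, the sketch below rescales a small, made-up probability distribution in plain Python. The `apply_temperature` helper and the example probabilities are assumptions for illustration; it computes the distribution that sampling from log(p) / temperature is equivalent to, and it requires all probabilities to be strictly positive.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    scaled = [math.log(p) / temperature for p in probs]
    max_scaled = max(scaled)             # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]
cold = apply_temperature(probs, 0.2)     # sharpens the distribution
neutral = apply_temperature(probs, 1.0)  # leaves it unchanged
hot = apply_temperature(probs, 5.0)      # flattens it toward uniform

for name, dist in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At a temperature of 0.2 almost all of the probability mass moves to the most likely character, while at 5.0 the three characters become nearly equally likely.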

Next, we create a small function that repeatedly calls next_char() to get the next character and appends it to the given text.

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

# generate text
print(complete_text('t', temperature=0.2))  # try temperatures of 1 and 2 as well to see the difference

Stateless RNN

Here, at each training iteration, the model starts with a hidden state full of zeros, updates this state at every time step, and after the last time step throws it away, as it is not needed anymore.

Stateful RNN

In a stateless RNN, the final hidden state is discarded after each batch. If we instead preserve the final state after processing one training batch and use it as the initial state for the next training batch, the model can learn long-term patterns even though backpropagation only covers short sequences. This is called a stateful RNN.

When building a stateful RNN, each input sequence in a batch must start exactly where the corresponding sequence in the previous batch left off. So a stateful RNN needs sequential, non-overlapping input sequences (rather than the shuffled, overlapping windows used for the stateless RNN).

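# Assumption: `encoded` holds the integer-encoded text and `train_size` the number
# of characters kept for training; neither is defined in this excerpt.
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)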

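We must not shuffle the dataset here. Unfortunately, batching is also harder when preparing a dataset for a stateful RNN. If we called batch(32), then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of those windows where it left off. The simplest solution is to use batches containing a single window.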
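
The requirement that each input sequence continues where the previous one left off can be checked with a small plain-Python sketch. The `sequential_windows` helper is hypothetical; it mimics window(window_length, shift=n_steps, drop_remainder=True) on a list.

```python
def sequential_windows(sequence, n_steps):
    """Build windows of n_steps + 1 items, each starting n_steps items after the previous one."""
    window_length = n_steps + 1
    windows = []
    for start in range(0, len(sequence) - window_length + 1, n_steps):
        windows.append(sequence[start:start + window_length])
    return windows

char_ids = list(range(13))               # stand-in for an encoded text of 13 characters
windows = sequential_windows(char_ids, n_steps=4)
inputs = [w[:-1] for w in windows]       # drop the last item, which only serves as a target

print(inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Because the shift equals n_steps, the inputs do not overlap and each one begins right after the last character of the previous input, which is what lets the preserved hidden state remain meaningful.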

max_id = 39

# A stateful layer needs a fixed batch size, so we pass batch_input_shape
# (tf.keras / Keras 2 API) with a batch size of 1 to match dataset.batch(1).
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, batch_input_shape=[1, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2, stateful=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

# Reset the states at the start of each epoch, before going back to the beginning of the text.
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, epochs=5, callbacks=[ResetStatesCallback()])

Once this stateful model is trained, it can only make predictions for batches of the same size as the ones used during training. To avoid this restriction, create an identical stateless model and copy the stateful model’s weights to it.

The tft.compute_and_apply_vocabulary() function (from TensorFlow Transform) goes through the dataset to find all distinct words and build the vocabulary, and it generates the TF operations required to encode each word using this vocabulary.


During training, the model should ignore the padding tokens rather than spend capacity learning that they carry no information. To do this, we can simply add mask_zero=True when creating the Embedding layer. This means that padding tokens (whose ID is 0) will be ignored by all downstream layers.

The way this works is that the Embedding layer creates a mask tensor equal to K.not_equal(inputs, 0), where K = keras.backend. The model below builds the same mask explicitly and passes it to the recurrent layers:

embed_size = 128
K = keras.backend
inputs = keras.layers.Input(shape=[None])
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation='sigmoid')(z)
model = keras.Model(inputs=[inputs], outputs=[outputs])
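
As a small illustration of what the mask contains, the sketch below computes the equivalent of not_equal(inputs, 0) for a padded batch in plain Python. The `padding_mask` helper and the token IDs are made up for the example.

```python
def padding_mask(batch, pad_id=0):
    """Return True for real tokens and False for padding, one flag per position."""
    return [[token != pad_id for token in sequence] for sequence in batch]

# Two sequences of word IDs, padded with zeros to the same length.
batch = [[5, 12, 7, 0, 0],
         [3, 9, 0, 0, 0]]
mask = padding_mask(batch)

print(mask[0])   # [True, True, True, False, False]
print(mask[1])   # [True, True, False, False, False]
```

Time steps where the mask is False are the ones the recurrent layers skip.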

Bidirectional RNNs

The Bidirectional layer will create a clone of the GRU layer (but in the reverse direction), and it will run both and concatenate their outputs. So although the GRU layer has 10 units, the Bidirectional layer will output 20 values per time step.

keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))
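
The doubling of the output size can be illustrated with a toy sketch in plain Python. The `run_rnn` function below is a made-up stand-in for a recurrent layer, not a GRU, and unlike Keras (where the backward copy has its own weights) both directions share the same update rule here; the sketch only shows the reverse, re-align, and concatenate steps.

```python
def run_rnn(inputs, units):
    """Toy recurrent layer: every output has `units` values that depend on all inputs seen so far."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = 0.5 * state + x          # stand-in for a real recurrent update
        outputs.append([state] * units)
    return outputs

def bidirectional(inputs, units):
    """Run the toy layer left-to-right and right-to-left, then concatenate per time step."""
    forward = run_rnn(inputs, units)
    backward = run_rnn(inputs[::-1], units)[::-1]   # process reversed input, then re-align
    return [f + b for f, b in zip(forward, backward)]

outputs = bidirectional([1.0, 2.0, 3.0], units=10)
print(len(outputs), len(outputs[0]))   # 3 20
```

At the first time step the forward half has only seen the first input, while the backward half has already seen the whole sequence, which is the extra context a bidirectional layer provides.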

Beam Search

Suppose we train an encoder-decoder model to translate French sentences into English, and we use it to translate ‘Comment vas-tu?’. We hope it will output ‘How are you?’, but unfortunately it outputs ‘How will you?’. The model made a mistake when it chose ‘will’, and it could not go back and fix it, so it tried to complete the sentence as best it could. By greedily outputting the most likely word at every step, it ended up with a suboptimal translation.

How can we give the model a chance to go back and fix mistakes it made earlier? One of the most common solutions is beam search: it keeps track of a shortlist of the k most promising sentences (say, the top three), and at each decoder step it tries to extend them by one word, keeping only the k most likely sentences. The parameter k is called the beam width.

For example, suppose we use the model to translate ‘Comment vas-tu?’ using beam search with a beam width of 3. In the first decoder step, the model will output an estimated probability for each possible word. Suppose the top 3 words are ‘How’ (estimated probability of 75%), ‘What’ (3%), and ‘You’ (1%). That’s our shortlist so far.

  • Next, we create three copies of our model and use them to find the next word for each sentence. Each model will output one estimated probability per word in the vocabulary.
  • The first model will try to find the next word in the sentence ‘How’ and perhaps it will output a probability of 36% for the word ‘will’, 32% for the word ‘are’, 16% for the word ‘do’, and so on.
  • After computing the probabilities of all 30,000 two-word sentences that these models considered (3 × 10,000, assuming a vocabulary of 10,000 words), we keep only the top 3: ‘How will’ (75% × 36% = 27%), ‘How are’ (75% × 32% = 24%), and ‘How do’ (75% × 16% = 12%). Right now, ‘How will’ is at the top, but ‘How are’ has not been eliminated.
  • Then we repeat the same process: we use three models to predict the next word in each of these three sentences and compute the probabilities of the resulting three-word sentences.
  • Perhaps ‘How are you’ (10%) now comes out ahead of ‘How will you’ (5%), and so on. We now have a perfectly reasonable translation. We boosted our encoder-decoder model’s performance without any extra training, simply by using it more wisely.
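
The walkthrough above can be reproduced with a minimal beam search in plain Python. This is a sketch under stated assumptions: the `TOY_MODEL` table stands in for a trained decoder, its first two levels use the probabilities from the example, and the remaining conditional probabilities are made up so that the three-word scores come out close to the 10% and 5% quoted above.

```python
import math

def beam_search(next_word_probs, beam_width, n_steps):
    """Keep the `beam_width` most probable partial sentences at every step.

    next_word_probs: function mapping a tuple of words to {next_word: probability}
    Returns a list of (sentence, log_probability) pairs, best first.
    """
    beams = [((), 0.0)]                  # one empty sentence with log-probability log(1) = 0
    for _ in range(n_steps):
        candidates = []
        for sentence, log_prob in beams:
            for word, prob in next_word_probs(sentence).items():
                candidates.append((sentence + (word,), log_prob + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical conditional probabilities standing in for a trained model.
TOY_MODEL = {
    (): {"How": 0.75, "What": 0.03, "You": 0.01},
    ("How",): {"will": 0.36, "are": 0.32, "do": 0.16},
    ("What",): {"are": 0.5, "is": 0.3},
    ("You",): {"are": 0.6, "will": 0.2},
    ("How", "will"): {"you": 0.18},
    ("How", "are"): {"you": 0.42},
    ("How", "do"): {"you": 0.30},
}

def toy_next_word_probs(sentence):
    return TOY_MODEL.get(sentence, {})

greedy = beam_search(toy_next_word_probs, beam_width=1, n_steps=3)
beam = beam_search(toy_next_word_probs, beam_width=3, n_steps=3)
print(" ".join(greedy[0][0]))   # How will you
print(" ".join(beam[0][0]))     # How are you
```

A beam width of 1 is just greedy decoding, which commits to ‘will’ at the second step. With a beam width of 3, ‘How are’ survives that step and ends up with the highest overall score. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on longer sentences.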

Implementing beam search using TensorFlow Addons:

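import tensorflow_addons as tfa

# decoder_cell, output_layer, encoder_state, embedding_decoder, start_tokens and
# end_token are assumed to be defined by the encoder-decoder model (not shown here).
beam_width = 10
decoder = tfa.seq2seq.beam_search_decoder.BeamSearchDecoder(
    cell=decoder_cell, beam_width=beam_width, output_layer=output_layer)

# Replicate the encoder's final state once per beam
decoder_initial_state = tfa.seq2seq.beam_search_decoder.tile_batch(
    encoder_state, multiplier=beam_width)

outputs, _, _ = decoder(
    embedding_decoder, start_tokens=start_tokens, end_token=end_token,
    initial_state=decoder_initial_state)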

With all this, we can get good translations for fairly short sentences (especially if we use pre-trained word embeddings). Unfortunately, this model will be really bad at translating long sentences. Attention mechanisms are the game-changing innovation that addressed this problem.

Thank you for your time. I hope you liked it.


Published via Towards AI
