

Bigram Models Simplified

Author(s): Ghadah AlHabib

Originally published on Towards AI.



Introduction to Text Generation

In Natural Language Processing (NLP), text generation produces text that resembles human writing, with tasks ranging from simple ones such as auto-completing sentences to complex ones such as writing articles or stories. Text generation algorithms can be broadly classified into deep learning-based methods (deep generative models) and probabilistic methods. Deep learning methods include RNNs, LSTMs, and GANs; probabilistic methods include Markov processes. Probabilistic models assign a probability to each possible next word, and they learn about language by being trained to predict upcoming words from neighboring words.

Introduction to N-gram Language Models

An n-gram model is the simplest kind of language model: it estimates the probability of a word given the n-1 previous words, and it can assign probabilities to entire sequences. A 2-gram model is called a bigram model; in it, the probability of a word depends only on its immediate predecessor.
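The idea can be written compactly. The notation below is the usual textbook convention and an assumption of this sketch rather than something taken from the article: $w_i$ is the i-th word, $w_0$ is a start-of-sequence marker, and $C(\cdot)$ is a count in the training text, with $C(w_{i-1})$ counting occurrences of $w_{i-1}$ that are followed by another word.

```latex
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```

For example, if a word occurs twice in the training text and is followed by "simple" on one of those occasions, the estimated probability of "simple" after that word is 1/2.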

First, the model is trained on a large amount of text to learn these dependencies by counting how often word pairs, triplets, and so on occur. Then, to generate text, the model starts with an initial word and repeatedly samples the next word from the learned conditional probability distribution over the words that followed the current word.
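As a minimal sketch of the training step, the snippet below turns pair counts into conditional probabilities. The function name `train_bigram_probabilities` and the toy sentence are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_probabilities(tokens):
    """Return {word: {next_word: P(next_word | word)}} for a list of tokens.

    Hypothetical helper for illustration; not part of the article's code.
    """
    counts = defaultdict(Counter)
    # Count how often each word is followed by each other word.
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        # Normalize the counts so they sum to 1 for each context word.
        probs[prev] = {word: count / total for word, count in followers.items()}
    return probs

toy_tokens = "the cat sat on the mat".split()
probs = train_bigram_probabilities(toy_tokens)
print(probs["the"])  # {'cat': 0.5, 'mat': 0.5}
```

Here "the" is followed once by "cat" and once by "mat", so each gets probability 0.5. The last token, "mat", has no follower and therefore no entry.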

For further reading: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Text Generation Based on the Probability of Word Sequences

import random
from collections import defaultdict, Counter

# Small training corpus
text = ("This is a simple example to illustrate how a 2-gram model works. "
        "This example is simple but effective for understanding 2-gram models.")
tokens = text.split()  # split on whitespace; punctuation stays attached to words

# Map each word to a Counter of the words that follow it
bigram_model = defaultdict(Counter)
for i in range(len(tokens) - 1):
    bigram_model[tokens[i]][tokens[i + 1]] += 1

def predict_next_word(current_word):
    """Sample the next word in proportion to the observed bigram counts."""
    if current_word in bigram_model:
        possible_words = list(bigram_model[current_word].keys())
        word_weights = list(bigram_model[current_word].values())
        return random.choices(possible_words, weights=word_weights)[0]
    else:
        # Word has no recorded follower: fall back to a random corpus token
        return random.choice(tokens)

current_word = "bigram"
generated_text = [current_word]
number_of_words_to_be_generated = 40
for _ in range(number_of_words_to_be_generated):
    next_word = predict_next_word(current_word)
    generated_text.append(next_word)
    current_word = next_word
print(' '.join(generated_text))

This code snippet uses ‘defaultdict’ to create a dictionary that supplies a default value for missing keys, and ‘Counter’ to count hashable objects; both are classes from Python’s ‘collections’ module. It then splits the predefined text on whitespace into tokens (each word is a token, with any punctuation still attached). This text is the corpus from which the model learns the patterns in the sequence.
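To see how these pieces behave in isolation, here is a small sketch. The variable name `follow_counts` and the sample strings are made up for this demonstration.

```python
from collections import Counter, defaultdict

# Hypothetical name for this demo; each missing key gets an empty Counter.
follow_counts = defaultdict(Counter)
follow_counts["a"]["simple"] += 1   # the entry for "a" is created on first access
follow_counts["a"]["2-gram"] += 1
follow_counts["a"]["simple"] += 1

print(follow_counts["a"])            # Counter({'simple': 2, '2-gram': 1})
print(follow_counts["a"]["unseen"])  # 0: a Counter returns 0 for missing keys

# str.split() with no arguments splits on whitespace only.
demo_tokens = "This is a test.".split()
print(demo_tokens)                   # ['This', 'is', 'a', 'test.']
```

Because of these defaults, the counting loop never needs to check whether a word has been seen before it increments a count.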

Next, we build the bigram model, which uses ‘defaultdict’ and ‘Counter’ to count occurrences. In the for loop, we iterate over each pair of adjacent words in ‘tokens’, and for every word we keep track of how often each subsequent word follows it in the corpus.

After we have built the model, we predict the next word by passing the current word (‘current_word’) to the ‘predict_next_word’ function. The function checks whether the current word is in the model. If so, it samples the next word at random, weighting each candidate by how often it followed the current word. If not, meaning the word was never seen with a following word in the training data, the function chooses a random token from the entire corpus.
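The two branches of ‘predict_next_word’ can be exercised directly. The sketch below rebuilds the model from the first corpus so that it runs on its own; the fixed seed and the 1000-sample check are assumptions added for this demonstration.

```python
import random
from collections import Counter, defaultdict

text = ("This is a simple example to illustrate how a 2-gram model works. "
        "This example is simple but effective for understanding 2-gram models.")
tokens = text.split()
bigram_model = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_model[prev][nxt] += 1

def predict_next_word(current_word):
    if current_word in bigram_model:
        followers = bigram_model[current_word]
        return random.choices(list(followers), weights=list(followers.values()))[0]
    return random.choice(tokens)

random.seed(0)  # assumption: seeded only so that repeated runs match

# "to" is followed only by "illustrate", so there is a single candidate.
print(predict_next_word("to"))                # illustrate

# "This" is followed by "is" and "example" once each, so the split is roughly even.
samples = Counter(predict_next_word("This") for _ in range(1000))
print(samples)

# "bigram" never occurs in the corpus, so the fallback returns some corpus token.
print(predict_next_word("bigram") in tokens)  # True
```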

Finally, we begin text generation by specifying the starting word and the number of words to generate (the range of the for loop). In each iteration, the loop calls ‘predict_next_word’, appends the returned word to the list of generated text, and updates the current word to this new word, creating a chain of text. The choice of each next word is governed by the frequencies stored in the bigram model. Note that the starting word "bigram" does not occur in the first corpus, so the word after it comes from the random fallback.
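The generation loop can also be wrapped in a reusable function. This is one possible arrangement, not the article's code: the names `build_bigram_model` and `generate_text` and the optional `seed` parameter are hypothetical additions that make runs repeatable.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Count, for every word, the words that immediately follow it."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

def generate_text(tokens, start_word, num_words, seed=None):
    """Return start_word followed by num_words sampled words (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    model = build_bigram_model(tokens)
    current_word = start_word
    generated = [current_word]
    for _ in range(num_words):
        followers = model.get(current_word)  # .get() does not create missing keys
        if followers:
            current_word = rng.choices(list(followers),
                                       weights=list(followers.values()))[0]
        else:
            current_word = rng.choice(tokens)  # fallback for words without followers
        generated.append(current_word)
    return " ".join(generated)

tokens = ("This is a simple example to illustrate how a 2-gram model works. "
          "This example is simple but effective for understanding 2-gram models.").split()
print(generate_text(tokens, "bigram", 40, seed=42))
```

Passing the same seed reproduces the same text; leaving it as `None` gives a different chain on each run, as in the original script.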

What does the ‘bigram_model’ contain?
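One way to answer this is to print the counts directly. The sketch below rebuilds the model from the first corpus and lists each word together with the words that follow it; the output shown in the comments was worked out by hand from that corpus.

```python
from collections import Counter, defaultdict

text = ("This is a simple example to illustrate how a 2-gram model works. "
        "This example is simple but effective for understanding 2-gram models.")
tokens = text.split()
bigram_model = defaultdict(Counter)
for i in range(len(tokens) - 1):
    bigram_model[tokens[i]][tokens[i + 1]] += 1

# Each key is a context word; each value counts the words that followed it.
for word, followers in bigram_model.items():
    print(f"{word!r}: {dict(followers)}")
# First lines of the output:
# 'This': {'is': 1, 'example': 1}
# 'is': {'a': 1, 'simple': 1}
# 'a': {'simple': 1, '2-gram': 1}

print(len(bigram_model))          # 15 distinct context words
print("models." in bigram_model)  # False: the final token has no follower
```

With such a short corpus every pair occurs exactly once, so each word with two followers offers an even choice between them.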

Output of model 1:

Example Output 1: bigram is a 2-gram model that works. This example is simple but effective for understanding 2-gram model works. This is a simple example is a 2-gram model works. This is a 2-gram model that works. This example illustrates how a 2-gram

Example Output 2: bigram simple example to illustrate how a 2-gram models. illustrate how a 2-gram model works. This example is a simple example to illustrate how a 2-gram model works. This example to illustrate how a simple example illustrates how a simple

Suppose we were to increase the length of the corpus and have the model train on more data:

text = """This is an extended example to illustrate how a 2-gram model works with
a larger corpus. By using more text, we can provide the model with more context,
which should improve its predictive accuracy. The 2-gram model, also known as a
bigram model, predicts the next word based on the previous one, creating chains
of words that form sentences. While simple in concept, bigram models are a
fundamental part of natural language processing and can be quite effective in
various applications. They serve as the building blocks for more complex models
and algorithms in the field of computational linguistics. Understanding how
bigram models function is essential for grasping the basics of text generation
and language modeling. This corpus includes a variety of sentences to help
demonstrate the versatility of the 2-gram approach. As we continue to expand the
corpus, the model's ability to generate coherent and contextually relevant text
should increase, showcasing the power of even simple probabilistic models in
understanding and generating human language."""

Output of model 2:

Example Output 1: bigram models are bigram models in concept, bigram models and generating human language. provide the model works with a larger corpus. By using more context, it should improve its predictive accuracy. The 2-gram model with more text, we can provide

Example Output 2: bigram model, also known as the basics of computational linguistics. Understanding how a fundamental part of text should increase and showcasing the building blocks for more complex models are bigram models and contextually relevant text generation and can be quite effective

Example Output 3: bigram models and contextually relevant text generation and algorithms in concept, bigram model, also known as a fundamental part of natural language modeling. This corpus includes a 2-gram model, also known as a larger corpus. By using more complex models and

Interpreting the differences between the outputs: Statistical Significance

When the model is trained on a larger corpus, it sees more examples of how words are used in different contexts, which helps it estimate word-pair probabilities more accurately. In a bigram model, the probability of one word following another is estimated from how often the pair occurs in the corpus. A larger corpus offers more instances of each word pair, so the estimated probabilities rest on more observations and are statistically more reliable. As a result, the model’s predictions are less likely to be skewed by rare or unusual usage found in smaller datasets.
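A rough way to check how much evidence stands behind each estimate is to count how many word pairs were seen only once. The helper `bigram_stats` below is a hypothetical sketch, shown here on the first (short) corpus.

```python
from collections import Counter

def bigram_stats(text):
    """Summarize how much evidence a corpus provides per bigram (hypothetical helper)."""
    tokens = text.split()
    pair_counts = Counter(zip(tokens, tokens[1:]))
    seen_once = sum(1 for count in pair_counts.values() if count == 1)
    return {
        "tokens": len(tokens),
        "distinct_bigrams": len(pair_counts),
        # Share of distinct pairs whose estimate rests on a single observation.
        "seen_once_share": seen_once / len(pair_counts) if pair_counts else 0.0,
    }

small_corpus = ("This is a simple example to illustrate how a 2-gram model works. "
                "This example is simple but effective for understanding 2-gram models.")
print(bigram_stats(small_corpus))
# {'tokens': 22, 'distinct_bigrams': 21, 'seen_once_share': 1.0}
```

In the short corpus every pair occurs exactly once, so each estimate rests on a single observation. Passing the extended corpus to the same function shows how that share changes as more text is added.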

Thank you for reading!

Let’s Connect!

Twitter: https://twitter.com/ghadah_alha/

LinkedIn: https://www.linkedin.com/in/ghadah-alhabib/
