Bigram Models Simplified

Author(s): Ghadah AlHabib

Originally published on Towards AI.

Introduction to Text Generation

In Natural Language Processing, text generation creates text that can resemble human writing, ranging from simple tasks like auto-completing sentences to complex ones like writing articles or stories. Text generation algorithms can be broadly classified into deep learning-based methods (deep generative models) and probabilistic methods. Deep learning methods include RNNs, LSTMs, and GANs; probabilistic methods include Markov processes. Probabilistic models assign a probability to each possible next word, and they learn about language by being trained to predict upcoming words from neighboring words.

Introduction to N-gram Language Models

The N-gram model is the simplest kind of language model: it estimates the probability of a word given the previous n-1 words, and from these estimates it assigns probabilities to entire sequences. A 2-gram model is called a bigram model; in it, the probability of a word depends only on its immediate predecessor.
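As a sketch in standard notation, the bigram assumption and its count-based estimate can be written as follows. The symbols $w_i$ (the i-th word) and $C(\cdot)$ (the number of times a word or word pair occurs in the training text) are introduced here for illustration and are not part of the original article.

```latex
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```

Here $w_0$ is assumed to be a start-of-sequence marker, so the first factor is simply the probability of the first word.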

First, the model is trained on a large amount of text to learn dependencies by counting how often word pairs, triplets, etc. occur. Then, to generate text, the model starts with an initial word and probabilistically selects each next word from the learned conditional distribution of words that follow the current one.

For further reading: https://web.stanford.edu/~jurafsky/slp3/3.pdf

Text Generation Based on the Probability of Word Sequences

import random
from collections import defaultdict, Counter

# Training corpus
text = "This is a simple example to illustrate how a 2-gram model works. " \
       "This example is simple but effective for understanding 2-gram models."
tokens = text.split()

# Count how often each word is followed by each other word
bigram_model = defaultdict(Counter)
for i in range(len(tokens) - 1):
    bigram_model[tokens[i]][tokens[i + 1]] += 1

def predict_next_word(current_word):
    if current_word in bigram_model:
        possible_words = list(bigram_model[current_word].keys())
        word_weights = list(bigram_model[current_word].values())
        # Sample a successor in proportion to its observed count
        return random.choices(possible_words, weights=word_weights)[0]
    else:
        # Unseen word: fall back to a random token from the corpus
        return random.choice(tokens)

# Generate a chain of words starting from the chosen word
current_word = "bigram"
generated_text = [current_word]
number_of_words_to_be_generated = 40
for _ in range(number_of_words_to_be_generated):
    next_word = predict_next_word(current_word)
    generated_text.append(next_word)
    current_word = next_word
print(' '.join(generated_text))

This code snippet imports two classes from the ‘collections’ module: ‘defaultdict’, a dictionary that provides a default value for missing keys, and ‘Counter’, which counts hashable objects. It then splits the predefined text into tokens on whitespace, so each word (with any attached punctuation) is a token. This text will be used to learn the patterns in the sequence.

Then, we build the bigram model, which uses ‘defaultdict’ and ‘Counter’ to count occurrences. In the for loop, we iterate over each pair of adjacent words in ‘tokens’, and for every word we keep track of how frequently each possible subsequent word follows it in the corpus.

After we have built the model, we predict the next word by passing the current word (‘current_word’) to the ‘predict_next_word’ function. It checks whether the current word is in the model. If so, the function samples the next word with probability proportional to how often each candidate followed the current word. If not, meaning the word was never seen with a successor in the training data, the function picks a random token from the entire corpus.
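To make the sampling weights concrete, the sketch below turns the raw counts into the relative frequencies that the weighted choice effectively uses. The helper name ‘next_word_probabilities’ is hypothetical and is not part of the original snippet; the corpus is the same short text as in the code above.

```python
from collections import defaultdict, Counter

text = ("This is a simple example to illustrate how a 2-gram model works. "
        "This example is simple but effective for understanding 2-gram models.")
tokens = text.split()

# Same counting step as the original snippet, written with zip over adjacent tokens
bigram_model = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    bigram_model[current_word][next_word] += 1

def next_word_probabilities(current_word):
    """Hypothetical helper: relative frequency of each observed successor."""
    # .get() avoids creating an empty entry for unseen words in the defaultdict
    followers = bigram_model.get(current_word)
    if not followers:
        return {}
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

print(next_word_probabilities("2-gram"))  # {'model': 0.5, 'models.': 0.5}
print(next_word_probabilities("to"))      # {'illustrate': 1.0}
```

An empty dictionary for an unseen word corresponds to the fallback branch of ‘predict_next_word’.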

Afterwards, we begin the text generation by specifying the starting word and the number of words to be generated (the range of the for loop). In each iteration, the loop calls ‘predict_next_word’, appends the returned word to the list of generated text, and then updates the current word to this new word to create a chain of text. The choice of the next word is driven by the frequencies counted in the bigram model.

What does the ‘bigram_model’ contain?
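One way to inspect the model is to print the counters stored for a few words. This is a minimal sketch that rebuilds ‘bigram_model’ from the short corpus used above; the list of words to print is an arbitrary choice made here for illustration.

```python
from collections import defaultdict, Counter

text = ("This is a simple example to illustrate how a 2-gram model works. "
        "This example is simple but effective for understanding 2-gram models.")
tokens = text.split()

# Map each word to a Counter of the words that follow it
bigram_model = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    bigram_model[current_word][next_word] += 1

# Print the successor counts for a few words that occur in the corpus
for word in ["This", "a", "2-gram", "model"]:
    print(word, "->", dict(bigram_model[word]))

# This -> {'is': 1, 'example': 1}
# a -> {'simple': 1, '2-gram': 1}
# 2-gram -> {'model': 1, 'models.': 1}
# model -> {'works.': 1}
```

Each key is a word from the corpus and each value counts the words seen immediately after it. The starting word "bigram" is not a key, so the first generated word comes from the random fallback.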

Output of model 1:

Example Output 1: bigram is a 2-gram model that works. This example is simple but effective for understanding 2-gram model works. This is a simple example is a 2-gram model works. This is a 2-gram model that works. This example illustrates how a 2-gram

Example Output 2: bigram simple example to illustrate how a 2-gram models. illustrate how a 2-gram model works. This example is a simple example to illustrate how a 2-gram model works. This example to illustrate how a simple example illustrates how a simple

Suppose we were to increase the length of the corpus and have the model train on more data:

text = """This is an extended example to illustrate how a 2-gram model works with
 a larger corpus. By using more text, we can provide the model with more context, 
 which should improve its predictive accuracy. The 2-gram model, also known as a 
 bigram model, predicts the next word based on the previous one, creating chains 
 of words that form sentences. While simple in concept, bigram models are a 
 fundamental part of natural language processing and can be quite effective in 
 various applications. They serve as the building blocks for more complex models 
 and algorithms in the field of computational linguistics. Understanding how 
 bigram models function is essential for grasping the basics of text generation 
 and language modeling. This corpus includes a variety of sentences to help
 demonstrate the versatility of the 2-gram approach. As we continue to expand the
 corpus, the model's ability to generate coherent and contextually relevant text
 should increase, showcasing the power of even simple probabilistic models in 
 understanding and generating human language."""

Output of model 2:

Example Output 1: bigram models are bigram models in concept, bigram models and generating human language. provide the model works with a larger corpus. By using more context, it should improve its predictive accuracy. The 2-gram model with more text, we can provide

Example Output 2: bigram model, also known as the basics of computational linguistics. Understanding how a fundamental part of text should increase and showcasing the building blocks for more complex models are bigram models and contextually relevant text generation and can be quite effective

Example Output 3: bigram models and contextually relevant text generation and algorithms in concept, bigram model, also known as a fundamental part of natural language modeling. This corpus includes a 2-gram model, also known as a larger corpus. By using more complex models and

Interpreting the differences between the outputs: Statistical Significance

When the model is trained on a larger corpus, it sees more examples of how words are used in different contexts, which helps it predict more plausible word pairs. In a bigram model, the probability of a word following another is estimated from how often the pair occurs in the corpus. A larger corpus offers more instances of each word pair, so these estimated probabilities become more statistically reliable. This means the model’s predictions are less likely to be skewed by rare or unusual usage found in smaller datasets.
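A rough way to see how thin the evidence is in a small corpus is to count how many times each word appears as the first word of a pair. The sketch below does this for the short corpus from the first example; the helper name ‘context_support’ is hypothetical, and the same function could be called on the longer corpus for comparison.

```python
from collections import Counter

def context_support(text):
    """Hypothetical helper: number of bigrams observed for each context word."""
    tokens = text.split()
    # Every token except the last one starts exactly one bigram
    return Counter(tokens[:-1])

small_corpus = ("This is a simple example to illustrate how a 2-gram model works. "
                "This example is simple but effective for understanding 2-gram models.")

support = context_support(small_corpus)
seen_once = sum(1 for count in support.values() if count == 1)
print(f"{seen_once} of {len(support)} context words were observed only once")
# 9 of 15 context words were observed only once
```

For a word observed only once, the model has a single possible successor, so generation simply replays the training text at that point.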

Thank you for reading!

Let’s Connect!

Twitter: https://twitter.com/ghadah_alha/

LinkedIn: https://www.linkedin.com/in/ghadah-alhabib/

Published via Towards AI
