Deep Dive into Modern Natural Language Processing

Last Updated on October 13, 2025 by Editorial Team

Author(s): Sunil Rao

Originally published on Towards AI.

NLP models have quietly shaped your digital world, from virtual assistants to search results, and their evolution is accelerating.
Read this article for a clear overview of the NLP architectures, from RNNs to the transformative power of Attention, that are driving this profound shift in how machines understand and generate language.


Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a valuable way.
Think of it as teaching a computer to read, listen, and write just like a person does. The ultimate goal is to bridge the communication gap between humans and machines.

At its core, an NLP system takes unstructured human language as input and transforms it into a structured format that a computer can work with as output.

  • Input is almost always some form of human language. This could be text from a document, a social media post, a customer review, or spoken words converted to text. This data is “unstructured” because it doesn’t follow a predefined format like a database table.
  • Output is structured data. This could be a category, a score, a summary, or another piece of text that has been processed. This structured output is what allows a program to make decisions or perform an action.

Let’s consider a common NLP task: sentiment analysis. A company wants to know what customers think about its new product based on online reviews.

  • Input: Customer review (unstructured text).

“I absolutely love the new camera on this phone! The battery life is amazing too, but the screen is a bit too dim for my liking.”

  • NLP Process: The NLP model would analyze this text. It would identify keywords like “love” and “amazing” as positive indicators and “too dim” as a negative indicator. It might also recognize that the review is about specific features like “camera,” “battery life,” and “screen.”
  • Output: Structured data that a computer can analyze. This could be in a format like JSON.
{
  "overall_sentiment": "Positive",
  "sentiment_score": 0.85,
  "key_topics": [
    { "topic": "camera", "sentiment": "Positive" },
    { "topic": "battery life", "sentiment": "Positive" },
    { "topic": "screen", "sentiment": "Negative" }
  ]
}

This structured output can now be easily used by a software application — for example, to populate a dashboard that visualizes customer feedback. 📊

As a software engineer, you interact with NLP applications constantly, often without realizing it.

  • Search Engines: When you type a query into Google, NLP models work to understand the intent behind your words, correct spelling errors, and find the most relevant web pages, even if they don’t contain your exact keywords.
  • Code Autocompletion & AI Assistants: Tools like GitHub Copilot, Tabnine, and IntelliSense in IDEs use NLP models trained on vast amounts of code. They understand the context of what you’re writing to suggest the next line of code, complete function names, or even generate entire code blocks.
  • Spam Filtering: Your email service (like Gmail) uses sophisticated NLP classifiers to analyze incoming emails. It reads the content, sender information, and other features to determine if an email is legitimate or spam and automatically sorts it for you. 📧
  • Chatbots & Virtual Assistants: When you interact with a support chatbot on a website or use assistants like Siri or Google Assistant, you’re using NLP. These systems parse your requests to understand what you want and generate a relevant response or perform an action.
  • Machine Translation: Services like Google Translate use advanced NLP models (specifically, Neural Machine Translation) to translate text or speech from one language to another, preserving context and grammatical structure.
Source: NLP Vs NLU vs NLG

NLP is the overarching field of AI that deals with the entire interaction between computers and human (natural) languages.
Its goal is to take raw, messy, unstructured text or speech and process it into a format that a machine can handle, and then execute a task. It’s the full spectrum of capabilities for a machine to “read” and “speak.”
Its focus is bridging the communication gap: converting language to data, and data back to language. Key tasks include Tokenization (splitting sentences into words), Part-of-Speech Tagging, Machine Translation, and Sentiment Analysis.
The Entire Translation Process (Input → Output)
Think of NLP as the entire language department in a business, which handles everything from receiving a letter in a foreign language to drafting a reply.

NLU is a critical subset of NLP that focuses specifically on translating raw human language into meaningful, structured data for the machine.
Its goal is to go beyond simple word recognition to truly grasp the context, intent, and meaning of the input. NLU is the core of how AI systems interpret human commands, even when they are ambiguous or poorly phrased. Its focus is deeper comprehension: what does the user mean?
Key tasks are:
Intent Classification (Ex: Is the user asking to book a flight or check the weather?),
Named Entity Recognition (NER) (Ex: Identifying “Paris” as a Location), and Sentiment Analysis (Ex: Is the tone positive, negative, or neutral?).
Reading Comprehension Specialist (Input → Meaning)
Imagine an incoming letter with the sentence: “I need to exchange this bat.”
Simple NLP (Word Recognition): The word “bat” is a noun.
NLU (Deep Understanding): An NLU model analyzes the context (Ex: is the letter about baseball equipment or a nocturnal animal?) and determines the Intent is Product Return and the Entity "bat" refers to a Sports Item.

NLG is the other critical subset of NLP that focuses on translating structured, machine-readable data back into fluent, human-like language.
It’s the process of the computer “writing” or “speaking” to a human. This involves making choices about grammar, sentence structure, word choice, and tone to ensure the output is natural and coherent.
Its focus is creating human-like output: how should the machine respond? Key stages are Text Planning (deciding the key message), Sentence Structuring (grammar and syntax), and Linguistic Realization (selecting the final words and phrases).
The Copywriter/Speechwriter (Meaning → Output)
A machine has a structured data record:
Data Input: {Alert: Stock: AAPL, Change: −5%, Time: Today}
NLG Goal: Convert this structured data into a conversational alert.
NLG Output: “Apple’s stock price dropped 5 percent today, continuing a major decline.” (A formal tone for a financial report)
Alternative NLG Output: “Heads up! $AAPL just fell 5% today.” (A casual tone for a trading app alert)

All 3 form the pipeline of any conversational AI system (like a modern chatbot or voice assistant). The process flows in a circle:

  • User Input: User says, “Hey Google, is it going to rain in Boston tomorrow?”
  • NLP (Initial Processing): System first uses Speech Recognition (classic NLP task) to convert the audio into a text string.
  • NLU (Understanding): NLU component analyzes the text to extract:
    Intent: Get Weather Forecast
    Entities: Location = "Boston", Timeframe = "tomorrow"
  • Data Retrieval (Machine Action): Based on the NLU output, system executes an action: querying a weather API with the structured parameters (Boston, tomorrow).
    API returns structured data (Ex. RainProbability: 80%, Temp: 55°F).
  • NLG (Generation): NLG component takes this structured weather data and generates a natural language response.
  • AI Output: System says: “In Boston tomorrow, there is an 80 percent chance of rain with temperatures around 55 degrees.”

In this complete loop, NLP is the broad container, NLU handles the input intelligence, and NLG handles the output intelligence.
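To make the loop concrete, here is a minimal Python sketch of the same flow; the keyword rules and the fake weather lookup are illustrative stand-ins for real NLU models and APIs, and all function names below are hypothetical:

# Toy NLU -> action -> NLG loop for the weather example above.
def nlu(text):
    # Very naive NLU: extract intent and entities with keyword rules.
    lowered = text.lower()
    intent = "GetWeatherForecast" if "rain" in lowered or "weather" in lowered else "Unknown"
    location = "Boston" if "boston" in lowered else None
    timeframe = "tomorrow" if "tomorrow" in lowered else "today"
    return {"intent": intent, "entities": {"location": location, "timeframe": timeframe}}

def fetch_weather(location, timeframe):
    # Stand-in for a real weather API call; returns structured data.
    return {"rain_probability": 80, "temp_f": 55}

def nlg(data, entities):
    # NLG: turn structured data back into a natural-language response.
    return (f"In {entities['location']} {entities['timeframe']}, there is an "
            f"{data['rain_probability']} percent chance of rain with temperatures "
            f"around {data['temp_f']} degrees.")

parsed = nlu("Hey Google, is it going to rain in Boston tomorrow?")
weather = fetch_weather(**parsed["entities"])
print(nlg(weather, parsed["entities"]))
# In Boston tomorrow, there is an 80 percent chance of rain with temperatures around 55 degrees.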

Basic End-to-End NLP Workflow

Let’s illustrate this with a practical example: Building an email spam classifier. 📧

Source: NLP Workflow

Stage 1: Data Collection
The first step is always to gather the data you need. For our example, we need a large dataset of emails, with each email clearly labeled as either “spam” or “not spam” (often called “ham”). This could come from public datasets or internal company data.
Ex: A collection of thousands of .txt or .eml files, and a corresponding file (e.g., a CSV) that maps each file to its label (spam or ham).

Stage 2: Text Preprocessing (Data Cleaning)
Raw text is messy. It’s full of noise that can confuse a model. This stage cleans and standardizes the text to make it easier for a machine to understand.
Input Text: "Hi!! YOU've WON a PRIZE!! Click here http://spam.com to claim NOW..."
Common Steps:

  • Lowercasing: Convert all text to lowercase. -> "hi!! you've won a prize!!..."
  • Removing Punctuation & Special Characters: -> "hi youve won a prize click here..."
  • Removing Stop Words: Eliminate common words with little semantic value (like “a,” “the,” “is,” “in”). -> "hi youve won prize click here..."
  • Tokenization: Split the text into individual words or “tokens.” -> ['hi', 'youve', 'won', 'prize', 'click', 'here']
  • Stemming/Lemmatization: Reduce words to their root form (“won” -> “win”). -> ['hi', 'youve', 'win', 'prize', 'click', 'here']

Output: Clean, uniform list of tokens for each email.

Stage 3: Text Representation
Machine learning models don’t understand words; they only understand numbers. This stage converts the cleaned text into a numerical representation.

Input: Clean tokens ['hi', 'youve', 'win', 'prize', 'click', 'here'].
Technique: Common method is Bag-of-Words (BoW) or TF-IDF.
Model creates a vocabulary of all unique words in the entire dataset. Then, it represents each email as a vector (a list of numbers) where each number corresponds to the frequency of a word from the vocabulary.
Output: Vector like [0, 1, 0, 0, 1, 1, 0, ...] where each position represents a word in the total vocabulary and value is its count or TF-IDF score.

Stage 4: Model Building & Training

Now that you have numerical data, you can train a machine learning model. You’ll split your dataset into a training set (to teach the model) and a testing set (to see how well it learned).

Input: Numerical vectors (features) and their corresponding labels (spam/ham).
Process: You choose a classification algorithm, such as Naive Bayes, Support Vector Machine (SVM), or a neural network. You then feed the training data into this algorithm.
Model learns the patterns of numbers (words) that are typically associated with spam versus ham.
Output: Trained model file that can take a new, unseen email vector and predict whether it’s spam.
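As a rough sketch of this stage, here is a minimal scikit-learn pipeline using TF-IDF features and a Naive Bayes classifier; the tiny in-line dataset and the expected predictions are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy data; a real project would use thousands of labeled emails
# and split them into separate training and test sets first.
train_texts = [
    "win a free prize now", "claim your free money today",
    "meeting agenda attached", "please review the quarterly report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Vectorizer + classifier in one pipeline: raw text in, label out.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["claim your free prize"]))           # expected: ['spam']
print(model.predict(["please see the attached agenda"]))  # expected: ['ham']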

Stage 5: Evaluation
Once the model is trained, you need to check how well it performs on the unseen test data. This tells you if your model is actually effective.
Process: Model predicts the label for each email in the test set. You then compare its predictions to the actual labels.
Metrics: You measure its performance using metrics like:

  • Accuracy: What percentage of emails did it classify correctly?
  • Precision: Of all the emails it flagged as spam, how many were actually spam? (Important for not annoying users by filtering legitimate mail).
  • Recall: Of all the actual spam emails, how many did it successfully catch? (Important for protecting users).

Output: A report saying something like: “The model has 98% accuracy, 95% precision, and 97% recall.”
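A minimal sketch of computing these metrics with scikit-learn; the true and predicted labels below are made-up values for illustration:

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for a 5-email test set.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "spam", "spam"]

print("Accuracy: ", accuracy_score(y_true, y_pred))                     # 0.8
print("Precision:", precision_score(y_true, y_pred, pos_label="spam"))  # 0.75
print("Recall:   ", recall_score(y_true, y_pred, pos_label="spam"))     # 1.0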

Stage 6: Deployment & Monitoring
If the model’s performance is good enough, it’s time to put it into a real-world application. This could be an API that your email server calls to check each incoming email. The job isn’t over; you must continuously monitor its performance to ensure it doesn’t degrade over time as spammers change their tactics.
Ex: Trained model is integrated into the email server.
New email arrives: "Hi team, please review the attached document for our meeting tomorrow."
Prediction: The email goes through the same preprocessing and feature extraction steps and is fed to the model, which outputs a prediction: ham. The email is then delivered to the user's inbox.

Let’s dive deeper into each stage of the NLP workflow:

1. Data Collection

The data collection phase is the crucial foundational step in any NLP project. This is where you gather the raw text or speech data that your model will learn from. The quality, quantity, and relevance of this data directly impact the performance and success of the final NLP model. Without a robust and representative dataset, the model will be biased or unable to generalize to real-world scenarios.

These terms are often used interchangeably but refer to distinct processes in the data lifecycle.

  • Data Collection is the broad process of acquiring or gathering data from various sources. This is the initial step of obtaining the raw material. Think of it as finding and picking the ingredients for a recipe.
  • Data Ingestion is a more technical term referring to the process of transferring data from its source to a storage system like a database or data lake. It focuses on the mechanics of getting the data “into” the system, which can be done in batches (e.g., daily) or in real-time (e.g., streaming social media data).
  • Data Integration is the process of combining data from multiple, disparate sources into a unified, coherent view. This often involves cleaning, transforming, and structuring the data so it can be analyzed together. For example, merging customer data from a sales database and a support ticket system to get a complete customer profile.
Source: Data Collection

Data Sources: Where to Collect
The source of your data depends on your project’s goal.

  • Publicly Available Datasets: Many open-source datasets are available for common NLP tasks like sentiment analysis, machine translation, or text summarization. Ex: IMDB movie reviews, Twitter datasets, and Wikipedia text.
# Reading a CSV file from a public dataset
import pandas as pd

# Assume 'emails.csv' has 2 columns: 'text' and 'label'
df_csv = pd.read_csv('emails.csv')
print("--- Reading from CSV ---")
print(df_csv.head())
  • Web Scraping: You can programmatically extract text from websites, forums, or social media platforms. This is a common method for gathering large amounts of specific, public data.
    Ex: Scraping Amazon for product reviews, a news site for articles, or a forum for user discussions.
    Using libraries like BeautifulSoup or Scrapy in Python to parse HTML content.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://example.com/'
response = requests.get(url)
scraped_data = []

soup = BeautifulSoup(response.content, 'html.parser')

# Find specific HTML elements and extract their text
heading = soup.find('h1').get_text()
paragraph = soup.find('p').get_text()

# Append the extracted data to our list
scraped_data.append({'heading': heading, 'paragraph': paragraph})

df = pd.DataFrame(scraped_data)
  • APIs: Many services like Twitter, Reddit, or financial news providers offer APIs that allow you to collect data in a structured way. This is a reliable and efficient method for obtaining specific data streams.
    Ex: Querying the X (formerly Twitter) API for tweets about a certain topic or the Reddit API for comments in a subreddit.
    Making programmatic HTTP requests to the API endpoint and parsing the JSON or XML response.
    Ex: Fetching data from a web API
# Fetch data from a sample API
import requests
import pandas as pd

response = requests.get('https://jsonplaceholder.typicode.com/posts')
data_json = response.json() # The response is a list of dictionaries

# Convert the JSON (list of dicts) into a DataFrame
df_api = pd.DataFrame(data_json)
df_api = df_api[['title', 'body']]
print("\n--- Reading from API (JSON) ---")
print(df_api.head())
  • Internal Company Data: This includes customer support tickets, emails, internal reports, and call center transcripts. This type of data is valuable for building models tailored to a specific business need, such as an internal chatbot or a system to analyze employee feedback.
    Ex: Customer support tickets, product reviews from your e-commerce site, internal documents, chat logs.
    Data collection often involves querying internal databases (SQL, NoSQL).
    Ex: Directly query a database and load the results into a DataFrame.
# Create a connection engine, e.g. for a SQLite database
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_database.db')

# Write a query to select the data
query = "SELECT comment_text, sentiment_label FROM customer_feedback"
# Execute the query and load into a DataFrame
df_sql = pd.read_sql(query, engine)
print("\n--- Reading from SQL Database ---")
print(df_sql.head())

Some of the common issues you’ll face in this stage.

a. Insufficient Data:
This is when you don’t have enough data to train a model that can generalize well to new, unseen examples.
A model trained on a small dataset is likely to “overfit.”
Solution: Data Augmentation. This is the process of creating new, “fake” data from your existing data. It’s a powerful technique to artificially boost your dataset’s size and variety. Here are the main methods:

  • Synonym Replacement: Replace words with their synonyms.

Original: “The movie was fantastic and very funny.”

Augmented: “The film was wonderful and highly amusing.”

  • Bigram Flip / Random Swap: Swap the positions of 2 words in the sentence. This can sometimes break grammar but introduces valuable noise.

Original: “The car is very fast.”

Augmented: “The is car very fast.”

  • Back Translation: Translate the text to another language and then translate it back to the original. This often results in a paraphrased version of the original sentence. 🤖

Original (EN): “I need to book a flight for tomorrow morning.”

Translate (to Spanish): “Necesito reservar un vuelo para mañana por la mañana.”

Translate Back (to English): “I need to reserve a flight for tomorrow morning.”

  • Adding Noise: Introduce random typos by swapping, deleting, or inserting characters to make the model more robust to real-world user errors.

Original: “Please review the document.”

Augmented: “Please review the documnet.”
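As a small illustration of the synonym-replacement method, here is a hedged sketch using NLTK's WordNet; a production pipeline would also check part of speech and skip stop words:

import random
from nltk.corpus import wordnet

# nltk.download('wordnet')  # run once if WordNet is not already installed

def synonym_replace(sentence, n=1):
    # Replace up to n words that have WordNet synonyms with a random synonym.
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_replace("The movie was fantastic and very funny", n=2))
# Output varies per run, e.g. replacing "movie" with "film"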

b. Low-Quality or Noisy Data
Your collected data might be full of irrelevant information, like HTML tags, typos, slang, or emojis, which can confuse the model. This is why the next stage, preprocessing, is so critical.

c. Unlabeled Data
For many tasks (like our spam classifier), you need labeled data. Getting humans to label thousands or millions of text examples is expensive and time-consuming.
Techniques like semi-supervised or unsupervised learning can help, but they are more advanced.

d. Data Bias:
Data you collect might not represent the real world. For example, if you train a sentiment analyzer only on movie reviews, it might perform poorly on financial news. Similarly, data can contain societal biases (related to gender, race, etc.) that your model will learn and perpetuate.

Data Formats
Format dictates how text is packaged and stored.

  • Raw Text/Document: .txt, .csv (for simple text/metadata pairings), .pdf, .docx. This is the most common input.
  • Semi-structured: JSON and XML. Often used when collecting data via APIs, where the text is nested alongside metadata (e.g., a Twitter JSON object contains the text of the tweet, plus fields for user ID, timestamp, and location).
  • Annotated/Labeled Data: Custom formats (often JSON or XML with specific tagging) where human annotators have added labels (e.g., marking parts of speech, sentiment, or entities).

Data Models
This refers to how data is logically organized for storage and retrieval.

  • Relational (Structured): Tables with fixed schemas (rows/columns) and defined relationships, used for storing metadata and labeled features (e.g., ReviewID, SentimentScore, Date).
    Structured Data is highly organized and follows a predefined schema, typically in a tabular format with rows and columns. It is easy to search and analyze using traditional databases and tools. In NLP, this could include a spreadsheet of customer reviews with columns for Rating, Date, and Review Text. The Review Text itself is unstructured, but the entire dataset is structured.
  • Document/Key-Value (Unstructured/NoSQL): Stores data in flexible, semi-structured documents (e.g., JSON or BSON), used for storing the raw text itself, especially when the text structure is highly variable.
    Unstructured Data has no predefined format or schema. It’s the most common data type in NLP and is the primary focus of most NLP projects. Ex: emails, social media posts, news articles, audio recordings, and legal documents. Extracting meaning from this data requires sophisticated NLP techniques.
  • Graph (Nodal): Stores data as nodes (entities) and edges (relationships).
    Knowledge Graph construction from text (e.g., nodes for ‘Person’ and ‘Organization’ connected by a ‘Works For’ edge).

Data Storage Engine and Processing
Data engineering separates storage and processing based on the goal: fast transactions or complex analysis.

  • Transactional (OLTP — Online Transactional Processing): Used for frequent, small, high-speed read/write operations.
    Ex: NoSQL Database (like MongoDB) storing the latest incoming customer feedback (text) that an operational dashboard needs to display instantly.
  • Analytical (OLAP — Online Analytical Processing): Used for complex queries and analysis over large historical datasets, often for machine learning training.
    Ex: Data Lake (e.g., AWS S3, Google Cloud Storage) or a Data Warehouse (e.g., Snowflake, BigQuery) storing all historical customer feedback to train a sentiment analysis model.

Modes of Dataflow
A data pipeline is the core infrastructure responsible for the movement and transformation of data.

  • ETL (Extract, Transform, Load): Traditional data flow. Data is Extracted from the source, Transformed (cleaned, tokenized, normalized) by the data engineer, and then Loaded into the destination (e.g., a Data Warehouse).
  • ELT (Extract, Load, Transform): Modern data flow. Data is Extracted and immediately Loaded into a powerful platform (like a Data Lake), and then Transformed in place using cloud-native tools. This is common in NLP because raw text (unstructured data) is often loaded first and transformed later by ML teams.

Data Collection Methods and Tools: How to Collect
The methods you use often depend on your data source and scale.

  • Manual Collection: Labor-intensive process where humans manually transcribe or enter data. This is often used for small, specialized datasets or for transcribing audio recordings.
  • Automated Collection: Using scripts, bots, or software to automatically gather data from the web (web scraping) or via APIs.
  • Crowdsourcing: Platforms like Amazon Mechanical Turk can be used to hire a large number of people to perform tasks like transcribing audio, labeling text, or creating specific types of data.

Data Repositories: Where to Store Data
Once collected, the data needs to be stored in a system optimized for its type and intended use.

  • Data Warehouse: Centralized repository for structured data.
    Data is cleaned, transformed, and organized before being loaded (ETL - Extract, Transform, Load).
    It's optimized for fast, complex queries and reporting for business intelligence. Think of it as a highly organized library with a strict cataloging system.
  • Data Mart: Smaller, more focused version of a data warehouse, designed for a specific business unit or department (Ex: marketing, sales). It contains a subset of data from the main warehouse.
  • Data Lake: Massive, centralized repository that stores all types of data (structured, semi-structured, and unstructured) in its raw, native format. Data is loaded first and transformed only when it’s needed (ELT - Extract, Load, Transform). It's flexible and ideal for data scientists who need to work with raw, diverse data to build machine learning models.

2. Text Preprocessing

Text preprocessing is the process of cleaning and preparing raw text data for analysis. It’s a crucial step in any NLP pipeline because machine learning models require structured, numerical data, not raw text. By normalizing the text, we reduce noise and improve the quality of the data, which leads to better model performance.

Source: Text processing

Let’s illustrate steps of text preprocessing using the example of an email spam/ham classifier. We’ll use the Natural Language Toolkit (NLTK), a popular Python library for NLP.

Our raw text input is a string that represents an email:

text = "Hey there! We have a new offer for you. Get 50% discount on all products. Limited time offer."

1. Case Conversion
First step is to convert all characters to a uniform case, typically lowercase. This ensures that words like “Free” and “free” are treated as the same word, reducing the vocabulary size.

text = text.lower()
# Output: "hey there! we have a new offer for you. get 50% discount on all products. limited time offer."

2. Punctuation Removal
Next, we remove punctuation marks. Punctuation often doesn’t contribute to the meaning of a sentence in tasks like sentiment analysis or spam detection. We can use Python’s built-in string module and str.translate() method.

import string
text = text.translate(str.maketrans('', '', string.punctuation))
# Output: "hey there we have a new offer for you get 50 discount on all products limited time offer"

3. Tokenization
Tokenization breaks a stream of text into smaller units called tokens. These can be words, sentences, or subwords.

  • Word Tokenization: Separates a text into a list of words.
  • Sentence Tokenization: Separates a text into a list of sentences.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download tokenizer models if not already present
# nltk.download('punkt')

# Sentence Tokenization
sentences = sent_tokenize(text)
# Output: ['hey there we have a new offer for you get 50 discount on all products limited time offer']

# Word Tokenization
words = word_tokenize(text)
# Output: ['hey', 'there', 'we', 'have', 'a', 'new', 'offer', 'for', 'you', 'get', '50', 'discount', 'on', 'all', 'products', 'limited', 'time', 'offer']

4. Stop Word Removal
Stop words are common words like “the,” “is,” “a,” etc., that often don’t add much value to the meaning of the text. Removing them can reduce the feature space and speed up processing.

from nltk.corpus import stopwords

# Download stopwords if not already present
# nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
# Output: ['hey', 'new', 'offer', 'get', '50', 'discount', 'products', 'limited', 'time', 'offer']

5. Stemming & Lemmatization
These techniques reduce words to their base or root form, helping to normalize different forms of the same word.

  • Stemming
    Stemming uses a simple, heuristic algorithm to chop off the ends of words, often resulting in a “stem” that is not a real word.
    Ex: a stemmer might remove “ing,” “es,” or “s” from words.
    It’s generally faster than lemmatization because it doesn’t need to consult a dictionary. This makes it suitable for applications where processing speed is a priority, such as information retrieval.
    The output is a word stem that may or may not be a valid word.
    Ex:
    “running” → “run”
    “goes” → “go”
    “studies” → “studi”
    (not a valid word)
  • Lemmatization
    Lemmatization uses a dictionary and morphological analysis to convert a word to its lemma, or base dictionary form. It considers the word’s part of speech to ensure the output is a valid word.
    It’s slower than stemming because it involves a more complex process of looking up words in a dictionary.
    The output is always a valid word.
    Ex:
    “running” → “run” (verb)
    “is,” “was,” “am” → “be”
    “geese” → “goose” (handles irregular plurals)
    “better” → “good” (handles irregular forms)

In short, if accuracy and semantic correctness are more important, use lemmatization.
If you need a quick and dirty solution for tasks where performance is critical, stemming might be sufficient.

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download wordnet if not already present
# nltk.download('wordnet')

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
# Output: ['hey', 'new', 'offer', 'get', '50', 'discount', 'product', 'limit', 'time', 'offer']

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
# Output: ['hey', 'new', 'offer', 'get', '50', 'discount', 'product', 'limited', 'time', 'offer']

6. Part-of-Speech (POS) Tagging
Part-of-Speech (POS) tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as a noun, verb, adjective, or adverb. It is a fundamental step in many NLP pipelines, as it helps machines understand the grammatical structure and role of words in a text.

A part of speech is a category of words that have similar grammatical properties. Think of it as classifying words based on their function in a sentence. For example:

  • Nouns (NN): Refer to a person, place, thing, or idea (e.g., “dog,” “New York,” “love”).
  • Verbs (VB): Describe an action or state of being (e.g., “run,” “is,” “think”).
  • Adjectives (JJ): Modify or describe nouns (e.g., “happy,” “blue,” “tall”).
  • Adverbs (RB): Modify verbs, adjectives, or other adverbs (e.g., “quickly,” “very,” “well”).

POS tagging goes beyond these basic categories, providing more detailed tags to capture nuances like plural nouns (NNS), proper nouns (NNP), past tense verbs (VBD), and so on.

Custom Grammar Parsing involves defining rules to parse sentences and extract specific patterns. This is often used in Chunking, where a parser identifies and groups words into meaningful chunks, such as noun phrases or verb phrases.

  • We first perform POS tagging on our text.
  • Then we define a custom grammar using regular expressions, often in a RegexpParser. For example, we can define a grammar to find a noun phrase (NP) that consists of a determiner (DT) followed by an optional adjective (JJ) and then a noun (NN).
  • Parser then applies this grammar to the POS-tagged text to identify and extract the defined chunks.
# Assuming you have the original text processed and word tokenized
text = "Hey there! We have a new offer for you."
words = ['hey', 'there', '!', 'we', 'have', 'a', 'new', 'offer', 'for', 'you', '.']


# Download tagger data if not already present
nltk.download('averaged_perceptron_tagger')

tagged_words = nltk.pos_tag(words)
# Example output for a sentence: [('We', 'PRP'), ('have', 'VBP'), ('a', 'DT'), ('new', 'JJ'), ('offer', 'NN'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')]

# Define a custom grammar to find noun phrases (NP)
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
result = parser.parse(tagged_words)
# The parser identifies and groups the words based on the grammar.
# For example, it might identify '(NP a/DT new/JJ offer/NN)' as a noun phrase.

NOTE: spaCy is generally “better” than NLTK for production-level applications due to its speed, efficiency, and streamlined design. However, NLTK remains a powerful and relevant tool, especially for academic and research purposes. The choice between them depends entirely on the specific needs of your project.

Use NLTK when:

  • You’re learning NLP and want to understand the underlying algorithms. Its vast collection of tutorials and resources makes it an excellent teaching tool.
  • You’re doing academic research and need to experiment with different algorithms and custom workflows.
  • You require access to a wide variety of corpora and lexical resources that aren’t available in spaCy.

Use spaCy when:

  • You’re building an application for production where speed and efficiency are critical, like a chatbot or a document-processing pipeline.
  • You need to perform standard NLP tasks like Named Entity Recognition (NER) or dependency parsing with state-of-the-art accuracy out of the box.
  • You’re processing large volumes of text and need to do so quickly.
  • You want a simple, “it just works” solution without having to manually select and tune algorithms.

Here’s how you would use spaCy to perform POS tagging on a sample sentence:

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")
text = "The quick brown fox jumps over the lazy dog."

# Process the text with the nlp object
doc = nlp(text)

# Iterate through the tokens and print the word and its POS tag
for token in doc:
    print(f"{token.text:<10} {token.pos_:<10} {token.tag_}")

# Output:
The        DET        DT
quick      ADJ        JJ
brown      ADJ        JJ
fox        NOUN       NN
jumps      VERB       VBZ
over       ADP        IN
the        DET        DT
lazy       ADJ        JJ
dog        NOUN       NN
.          PUNCT      .

In the output above, token.pos_ gives the universal POS tag (e.g., DET, ADJ, NOUN), while token.tag_ provides a more specific, fine-grained tag (e.g., DT for determiner, JJ for adjective, NN for noun).

Tags are the labels assigned to words during POS tagging. spaCy uses both universal POS tags and its own more detailed tag set. For example:

  • NN (Noun, singular or mass)
  • VBZ (Verb, third-person singular present)
  • JJ (Adjective)
  • DT (Determiner)
  • IN (Preposition or subordinating conjunction)
  • NNP (Proper noun, singular)

These tags provide valuable information for downstream tasks and are essential for many NLP applications.

POS tags can be used as a powerful filter to remove tokens that are irrelevant to a specific task. This is a form of token normalization.
For example, in a sentiment analysis task, we might be more interested in adjectives and adverbs (like “amazing,” “terrible,” “very”) that convey emotion, and nouns that are the subject of that emotion. Conversely, we might want to filter out tokens like determiners (a, an, the) and prepositions (of, in, on), as they often do not carry significant sentiment.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "The product's performance was absolutely amazing, but the price was way too high for what it offered."

doc = nlp(text)

# Define a list of POS tags to keep
# We want to keep nouns, verbs, adjectives, and adverbs for sentiment analysis
pos_to_keep = ["NOUN", "VERB", "ADJ", "ADV"]

# Filter the tokens based on their POS tag
filtered_tokens = [token.text for token in doc if token.pos_ in pos_to_keep]
print(filtered_tokens)
# Output:

['product', 'performance', 'was', 'absolutely', 'amazing', 'price', 'was', 'way', 'too', 'high', 'offered']

As you can see, this process effectively removed irrelevant tokens, leaving behind a more focused list of words that are likely to be more important for tasks like sentiment analysis or information retrieval.

7. Chunking
Chunking, also known as shallow parsing, is an NLP technique that groups words into meaningful phrases, such as noun phrases, verb phrases, or prepositional phrases. It’s a less detailed form of parsing compared to full parsing, which creates a complete parse tree of a sentence.

Chunking works by identifying and grouping consecutive words that belong to the same syntactic category. A common approach is to identify noun phrases (NPs).
For example, in the sentence “The big red car raced down the street,” a chunking process would identify “The big red car” as a single noun phrase.

Chunking often relies on part-of-speech (POS) tagging as a preliminary step. Once each word is tagged with its part of speech (ex: noun, verb, adjective), a set of rules or a model can be applied to identify contiguous sequences that form chunks.
For instance, a rule might be: NP: {<DT>?<JJ>*<NN>+}. This rule, in regular expression format, says a noun phrase (NP) can be:

  • An optional determiner (DT)
  • Followed by zero or more adjectives (JJ)
  • Followed by 1 or more nouns (NN)
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

# Print all noun chunks in the document
for chunk in doc.noun_chunks:
    print(chunk.text)

# Output:
The quick brown fox
the lazy dog

8. Named Entity Recognition (NER)
NER is a natural language processing (NLP) task that finds and classifies named entities in a text into predefined categories like people, organizations, locations, dates, and more.

NER systems essentially read text and tag specific words or phrases as named entities.
For example, in the sentence, “Tim Cook, the CEO of Apple, announced the new iPhone in Cupertino on September 9, 2025” an NER system would identify:
Tim Cook: Person
Apple: Organization
Cupertino: Location
September 9, 2025: Date

The process generally involves 2 main steps:

  • Entity Identification: System determines the boundaries of the entity (e.g., recognizing that “Tim Cook” is 1 entity, not 2 separate words).
  • Entity Categorization: System assigns a type or category to the identified entity.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple Inc. is a technology company headquartered in Cupertino, California. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne."

# Process the text with the NLP pipeline
doc = nlp(text)

# Print the identified entities and their labels
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}, Explanation: {spacy.explain(ent.label_)}")

This code snippet will output the recognized entities and their corresponding labels, such as ORG (Organization) for "Apple Inc.", GPE (Geopolitical Entity) for "Cupertino", and PERSON for "Steve Jobs", "Steve Wozniak", and "Ronald Wayne".

There are 4 primary approaches to building NER systems, each with its own advantages and complexity.

1. Dictionary-Based Approach
This is the simplest method, relying on a predefined list or dictionary of entities.
For example, a dictionary of all country names or a list of company names. System works by performing a direct lookup for words or phrases in text.
It’s fast and easy to implement, but extremely limited and brittle: it can’t handle new or unseen entities, spelling variations, or context-specific meanings.

2. Rule-Based Approach
This approach uses a set of handcrafted rules or patterns to identify entities.
For example, a rule might state that any word capitalized after a title like “Mr.” or “Dr.” is a person’s name.
Another rule could look for specific patterns for dates (e.g., Month Day, Year).
It is highly accurate for the rules it’s designed for and doesn’t require training data. But it’s time-consuming to create and maintain rules, and it is very difficult to scale. It also struggles with ambiguity and exceptions.

3. Machine Learning (ML) Approach
This method uses classical machine learning models like Conditional Random Fields (CRFs) or Support Vector Machines (SVMs). System is trained on a labeled dataset where each word is tagged with its entity type (or “O” for “outside” of an entity). Model learns patterns from features like capitalization, word shape, and surrounding words to make predictions.
It’s more robust and scalable than rule-based systems, as it can learn from data and generalize to unseen text.
It requires a significant amount of high-quality, labeled training data. Feature engineering can be complex and time-consuming.

4. Deep Learning (DL) Approach
This is the state-of-the-art approach, using neural network architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTMs), and more recently, Transformer-based models (like BERT). These models can automatically learn complex features and context from raw text, eliminating the need for manual feature engineering.
It provides the highest accuracy and can capture subtle, long-range dependencies in the text. Highly scalable and adaptable to different languages.
It is computationally expensive to train and requires large amounts of data. Models can be complex and difficult to interpret.

NER is a fundamental NLP task with a wide range of real-world applications across various industries. Some common examples include:

  • Information Retrieval and Search Engines: Enhancing search results by identifying entities in queries, allowing for more precise results.
  • Customer Support and Chatbots: Identifying customer names, product names, and issue types in support tickets or chat transcripts to route requests correctly.
  • Healthcare and Medical Research: Extracting patient details, drug names, symptoms, and medical procedures from electronic health records (EHRs) and research papers.
  • Financial Services: Analyzing news articles and financial reports to identify company names, key people, and stock tickers for market sentiment analysis.
  • Legal Industry: Automatically sifting through legal documents to extract names of parties, dates, laws, and case citations.
  • Social Media Monitoring: Tracking brand mentions, product names, and public figures in social media feeds to analyze trends and public opinion.

9. Relationship Extraction
Relationship extraction is an NLP task that identifies and classifies semantic relationships between entities in a text.
For example, it can identify that a specific person “works for” a specific company or that a company is “headquartered in” a specific location.

Relationship extraction generally follows a multi-step process:

  1. Named Entity Recognition (NER): First, a model identifies and labels the entities in the text (e.g., Person, Organization, Location). For instance, in the sentence “Tim Cook, the CEO of Apple, announced the new iPhone,” NER would identify “Tim Cook” as a Person and “Apple” as an Organization.
  2. Relation Detection and Classification: After the entities are identified, the system analyzes the words and syntax between and around them to determine if a relationship exists.
  • Rule-based approach uses predefined patterns or regular expressions. For example, a rule might be Person followed by "the CEO of" followed by Organization.
  • Machine Learning Models are trained on a labeled dataset of entity pairs and their corresponding relationship types. The model learns to classify the relationship based on features like the words in between the entities, their part-of-speech tags, and their syntactic structure.
  • Deep Learning is the most common modern approach, using neural networks to learn complex patterns and contexts automatically.
    For example, a model might use a Transformer-based architecture (like BERT) to understand the full context of a sentence and predict the relationship between 2 entities.
from transformers import pipeline

# Babelscape/rebel-large is a seq2seq model that generates relation triplets as text.
# Transformers has no dedicated relation-extraction pipeline task, so we load the
# model through the text2text-generation pipeline (usage adapted from the model card).
extractor = pipeline("text2text-generation",
                     model="Babelscape/rebel-large",
                     tokenizer="Babelscape/rebel-large")

# Sample text
text = "The iPhone was announced by Steve Jobs, the CEO of Apple."

# Keep the generated token ids and decode them ourselves so that REBEL's special
# markers (<triplet>, <subj>, <obj>) are preserved in the output string.
generated = extractor(text, return_tensors=True, return_text=False)
decoded = extractor.tokenizer.batch_decode([generated[0]["generated_token_ids"]])
print(decoded[0])

# The decoded string linearizes (head, relation, tail) triplets, for example
# something like (Steve Jobs, employer, Apple) for this sentence; the model card
# provides a small parsing helper to turn it into structured records.

3. Text Representation

Text representation, or feature extraction, is the process of converting raw text into a numerical format that a computer can understand and process. Since machine learning models can only work with numbers, converting words and sentences into vectors or matrices is a fundamental step in nearly every NLP task. The goal is to capture the semantic and syntactic meaning of the text as accurately as possible in this numerical form.

Source: Text Representation

We need text representation because computers can’t directly process human language. They require numerical input to perform calculations, identify patterns, and make predictions. Text representation matters because:

  • It translates human-readable text into a machine-understandable format.
  • It makes it possible to apply mathematical algorithms for tasks like sentiment analysis, text classification, and machine translation.
  • Effective text representation methods can capture the semantic relationships between words.
    For example, the words “king” and “queen” might be represented by vectors that are close to each other in a multi-dimensional space.

To understand text representation, it’s essential to grasp a few core concepts:

  • Document: A single piece of text, such as a tweet, a sentence, an email, or an entire book.
  • Corpus: A collection of documents.
    For example, a corpus could be a set of all news articles from a specific year.
  • Vocabulary: The set of all unique words found in a corpus.
  • Feature: An individual, measurable property of a document. In NLP, features can be words, n-grams (sequences of words), or other numerical attributes.
  • Feature Engineering: The process of selecting and transforming raw data into features that can be used to build a model. This often involves techniques like creating word counts, TF-IDF scores, or part-of-speech tags.
  • Vector: A numerical list or array that represents a document, word, or feature. In NLP, vectors are the primary output of text representation techniques.
  • Bag-of-Words (BoW): A simple text representation model where a document is represented as the multiset of its words, without considering grammar or word order. The value in each dimension of the vector corresponds to the frequency of a word in the document.

Ex: Corpus: “The cat sat on the mat.” and “The dog sat on the log.”
Vocabulary: {“the”, “cat”, “sat”, “on”, “mat”, “dog”, “log”}
Vector for “The cat sat on the mat.”: [2, 1, 1, 1, 1, 0, 0] (counts of each word from the vocabulary)
This vector tells us which words are present and how often, but it loses the order of the words.
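A quick sketch that reproduces the vector above with plain Python (no library needed):

from collections import Counter

vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "log"]
document = "The cat sat on the mat."

# Lowercase, strip the period, count, then read the counts off in vocabulary order.
tokens = document.lower().replace(".", "").split()
counts = Counter(tokens)
vector = [counts[word] for word in vocabulary]
print(vector)  # [2, 1, 1, 1, 1, 0, 0]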

Here are some of the most common text representation techniques.

1. One-Hot Encoding
It is a simple text representation technique that converts categorical data, like words, into a numerical format. It creates a binary vector for each word in a vocabulary, with a length equal to the size of the vocabulary.

To create a one-hot encoded vector for a word, you first need to define a vocabulary from a corpus of text. The vocabulary is the list of all unique words.

  • Create a Vocabulary: Collect all unique words from your documents and assign each a unique index. For example:

Vocabulary = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}

  • Generate Binary Vectors: For each word, create a vector of zeros with a length equal to the vocabulary size. Place a 1 at the index corresponding to the word.
    The word ‘cat’ is at index 1. Its one-hot vector is: [0, 1, 0, 0, 0]
    The word ‘sat’ is at index 2. Its one-hot vector is: [0, 0, 1, 0, 0]
    The word ‘mat’ is at index 4. Its one-hot vector is: [0, 0, 0, 0, 1]

When representing a sentence, you can either create a one-hot vector for each word individually or combine them into a single vector by summing them up (a Bag-of-Words approach).

While simple, one-hot encoding has significant drawbacks that make it unsuitable for most complex NLP tasks.

  • The size of the one-hot vector is directly proportional to the size of the vocabulary.
    For a large corpus, the vocabulary can contain hundreds of thousands of words, leading to extremely long and sparse vectors. This makes computation slow and inefficient, requiring a lot of memory.
  • One-hot encoding treats every word as completely independent. The vectors for ‘cat’ [0, 1, 0, 0, 0] and 'dog' [0, 0, 0, 1, 0] are completely orthogonal, meaning their dot product is zero. The model cannot infer that 'cat' and 'dog' are semantically similar. This lack of a captured relationship is a major limitation, as it prevents the model from understanding context and meaning.
  • Representation of a word is always the same, regardless of its context in a sentence. For example, word “bank” has same one-hot vector whether it refers to a financial institution or the side of a river. This inability to handle polysemy (words with multiple meanings) is a major flaw.

You can implement one-hot encoding for text representation using libraries like scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus
corpus = ["The cat sat on the mat", "The dog sat on the log"]

# Create a CountVectorizer instance
# The 'binary=True' argument makes it a one-hot-like encoding (presence/absence)
vectorizer = CountVectorizer(binary=True)

# Fit the vectorizer to the corpus and transform the text
X = vectorizer.fit_transform(corpus)

# Get the feature names (the vocabulary)
vocabulary = vectorizer.get_feature_names_out()

# Print the vocabulary
print("Vocabulary:", vocabulary)

# Print the one-hot encoded vectors (as a sparse matrix)
print("One-Hot Encoded Vectors:")
print(X.toarray())

2. Bag of Words [BoW]
This model is a simple text representation technique that represents a document as an unordered collection of words, or a “bag.” It completely ignores grammar and word order but keeps track of word frequencies.

To create a BoW model, you follow these steps:

  • Create a Vocabulary: Collect all unique words from a set of documents (your corpus) to form a vocabulary. Each word is assigned a unique index.
  • Count Word Frequencies: For each document, you create a vector where the length is equal to the vocabulary size. Each entry in the vector represents the count of a specific word from the vocabulary in that document.

Example with Spam Email 📧 Let’s say we have 2 emails:

  • Email 1 (Spam): “Free money, claim your prize now!”
  • Email 2 (Not Spam): “Please confirm your meeting attendance.”

Step 1: First, we clean the text (e.g., lowercase, remove punctuation) and create a vocabulary of unique words: ['free', 'money', 'claim', 'your', 'prize', 'now', 'please', 'confirm', 'meeting', 'attendance']

Step 2: Now, we count the occurrences of each word in each email based on our vocabulary.

Vector for Email 1 (Spam):
'free': 1
'money': 1
'claim': 1
'your': 1
'prize': 1
'now': 1
All other words: 0
BoW Vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Vector for Email 2 (Not Spam):
'please': 1
'confirm': 1
'your': 1
'meeting': 1
'attendance': 1
All other words: 0
BoW Vector: [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

These numerical vectors can now be used as features for a machine learning model to classify emails as spam or not spam.

Bag of n-grams
A bag of n-grams is an extension of the BoW model. An n-gram is a contiguous sequence of n items from a text. Instead of counting single words (unigrams), it counts sequences of 2 (bigrams), 3 (trigrams), or more words.

The main limitation of BoW is that it loses word order and context. For example, “good morning” and “morning good” would have the same BoW vector, but their meanings are different.
N-grams capture some of this local context and word order.

Ex: Sentence: “The food was not good.”
BoW Vector: Counts for “the,” “food,” “was,” “not,” “good.”
Model might classify this as positive because “good” is present.
Bag of Bigrams: Counts for "the food", "food was", "was not", "not good".
Model would now see the phrase "not good", which is a strong indicator of negative sentiment. This additional feature helps the model make a more accurate prediction.
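A short sketch of extracting bigrams with scikit-learn's ngram_range parameter (here restricted to bigrams only):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The food was not good."]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['food was' 'not good' 'the food' 'was not']
print(X.toarray())  # [[1 1 1 1]]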

Drawbacks of Bag of Words

  • Vocabulary can become extremely large for a big corpus. This results in very long vectors with many zeros (sparse vectors), which is computationally expensive and can lead to the curse of dimensionality.
  • BoW doesn’t capture the relationship between words. The vectors for “good” and “fantastic” would be completely different, even though they are semantically similar.
  • It completely discards the order of words. “Dog bites man” and “man bites dog” would have the exact same BoW vector, even though their meanings are entirely different. This is a critical flaw for tasks that require understanding sentence structure.
from sklearn.feature_extraction.text import CountVectorizer

# Example emails
corpus = [
"Free money, claim your prize now!",
"Please confirm your meeting attendance."
]

# Create a CountVectorizer instance
# We can also specify n-grams here (e.g., ngram_range=(1, 2) for bigrams)
vectorizer = CountVectorizer(binary=False, lowercase=True)

# Learn the vocabulary from the corpus and transform the text
X = vectorizer.fit_transform(corpus)

# Print the vocabulary (feature names)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the BoW matrix (as a dense array for readability)
print("\nBag of Words Vectors (as a sparse matrix):\n")
print(X.toarray())

The code snippets can look similar because a common implementation of one-hot encoding uses a CountVectorizer with the binary=True parameter. This special case of BoW acts just like one-hot encoding by only marking a word's presence (1) or absence (0), rather than its frequency.
By default, when binary=False, CountVectorizer performs a true Bag-of-Words representation by counting word frequencies.

3. TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic used to reflect how important a word is to a document in a corpus. It’s a widely used technique for information retrieval and text mining.
Core idea is that a word’s importance increases proportionally to the number of times it appears in the document but is offset by the frequency of the word in the entire corpus. This helps to filter out common words like “the” or “is” which are frequent in all documents but don’t hold much specific meaning.

Term Frequency (TF) measures how often a word appears in a specific document: TF(t, d) = (number of times t appears in d) / (total number of words in d). The more frequent the word, the higher its TF score.

Let’s say we have 2 documents to classify: a “Science” article and a “Sports” article.

  • Document 1 (Science): “The atom is the smallest unit of matter. An atom has protons and electrons.”
  • Document 2 (Sports): “The team won the football match. The players are very happy.”

For Document 1, the word “atom” appears 2 times, and there are 14 total words. TF(′atom′, Document1) = 2/14 ≈ 0.143

Document Frequency (DF) is the number of documents in the corpus that contain a specific term.
High DF means the word is common and likely not very unique.

Inverse Document Frequency (IDF) is the inverse of the document frequency: IDF(t, D) = log(N / DF(t)), where N is the total number of documents in the corpus. It decreases the weight of words that appear very frequently across all documents, which helps to emphasize words that are unique to a few documents.

Using our example corpus with 2 documents:

  • The word “the” appears in both Document 1 and Document 2.
    DF(′the′)=2
    IDF(′the′)=log(2/2)=log(1)=0
    A word with an IDF of 0 has no discriminatory power.
  • The word “atom” appears only in Document 1.
    DF(′atom′)=1
    IDF(′atom′)=log(2/1) = approx 0.301

Final TF-IDF score for a word in a document is product of its TF and IDF scores.

TF−IDF(t,d,D)=TF(t,d) * IDF(t,D)

TF-IDF for “atom” in Document 1:
TF−IDF = 0.143 × 0.301 ≈ 0.043

TF-IDF for “the” in Document 1:
TF−IDF = TF * 0 = 0

This shows that “atom” has a higher TF-IDF score than “the”, making it a more important term for classifying the document.

Drawbacks of TF-IDF

  • TF-IDF only considers word frequency and not the meaning or context. It can’t tell the difference between “car” and “automobile,” which are semantically similar.
  • Like Bag-of-Words, TF-IDF treats a document as an unordered collection of words. It doesn’t capture the relationship or syntax between words, meaning “good food” and “food is good” have similar representations.
  • For large vocabularies, the resulting vectors are very long and sparse (mostly zeros), which can be computationally expensive and inefficient.
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus for category classification
corpus = [
"The atom is the smallest unit of matter. An atom has protons and electrons.", # Science
"The team won the football match. The players are very happy." # Sports
]

# Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Get the feature names (the vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Print the vocabulary
print("Vocabulary:", feature_names)

# Print the TF-IDF vectors (as a dense array for readability)
print("\nTF-IDF Vectors:")
print(X.toarray())

# You can also get the IDF scores
print("\nIDF Scores:")
for i, name in enumerate(feature_names):
print(f"{name}: {vectorizer.idf_[i]:.4f}")

Text representation techniques discussed till now like One-Hot Encoding, Bag-of-Words (BoW), and TF-IDF suffer from a major drawback: they fail to capture the semantic relationships between words.

  • One-Hot Encoding — Creates extremely large and sparse vectors. It treats every word as a completely independent entity, so there’s no way to tell that “king” and “queen” are related.
  • Bag-of-Words — Ignores word order and context. It counts words but loses the sentence structure. “The dog bit the man” and “The man bit the dog” have the same representation.
  • TF-IDF — Improves on BoW by weighting important words, but still doesn’t understand semantics. “Car” and “automobile” have completely different representations.

Word Embeddings solve these problems by representing words as dense vectors in a continuous vector space, where words with similar meanings are located close to each other.

4. Word Embedding
Word Embedding is a text representation technique where words or phrases from the vocabulary are mapped to vectors of real numbers. These vectors, also known as distributed representations, are dense and typically have a much lower dimensionality (e.g., 100–300 dimensions) compared to one-hot encoding.
Core idea is that a word’s meaning can be inferred from its context — the words that surround it. Models learn these embeddings by predicting a word from its context or vice versa. This process captures semantic and syntactic relationships, allowing for powerful analogies like:

vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)
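
As a quick, hedged illustration of this analogy, the sketch below uses gensim's downloader utility to fetch a small set of pretrained GloVe vectors (the dataset name "glove-wiki-gigaword-100" and the ability to download it are assumptions of this example, not requirements of the technique):

import gensim.downloader as api

# Download a small set of pretrained GloVe vectors (~128 MB), one of gensim's standard datasets
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity>)]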

Types of Word Embedding Techniques

1. Word2Vec:
Word2Vec is a neural network-based text representation technique that learns to represent words as dense vectors, called word embeddings. It was developed by Google and is effective at capturing the semantic and syntactic relationships between words.
Core idea is that a word’s meaning can be inferred from its context — the words that surround it.

Word2Vec has 2 main architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. Both models use a shallow neural network to learn the word embeddings.

Source: Word2Vec

a. Continuous Bag-of-Words (CBOW)
CBOW’s goal is to predict a target word given its surrounding context words. It’s like a fill-in-the-blank puzzle.
Model takes a window of context words, averages their vectors, and then tries to predict the word that should appear in the middle.
Ex: Consider the sentence, “The cat sat on the mat.”
If the target word is ‘sat’, the context words could be ‘cat’ and ‘on’.
CBOW model would take the embeddings for ‘cat’ and ‘on’, combine them, and try to predict the embedding for ‘sat’.
CBOW is faster to train than Skip-gram and performs well with frequent words.

b. Skip-gram
Skip-gram’s goal is the opposite of CBOW: it predicts the context words given a target word. It takes a single word and tries to predict the words that are likely to appear within a certain window around it.
Ex: Using the same sentence, “The cat sat on the mat.”
If the target word is ‘sat’, Skip-gram model would take the embedding for ‘sat’ and try to predict the embeddings for words like ‘cat’, ‘on’, ‘the’, and ‘mat’.
Skip-gram is slower to train but is known to work better with smaller corpora and is more effective at capturing semantic relationships for rare words.

from gensim.models import Word2Vec

# Sample corpus (tokenized sentences)
corpus = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "walked", "on", "the", "street"],
["a", "car", "drove", "by", "a", "truck"],
["man", "is", "the", "king", "of", "jungle"],
["woman", "is", "the", "queen", "of", "a", "family"]
]

# Train the Word2Vec model
# By default sg=0, which trains a CBOW model; set sg=1 to train Skip-gram
model = Word2Vec(sentences=corpus,
vector_size=100, # Dimensionality of the word vectors
window=5, # Maximum distance between the current and predicted word
min_count=1, # Ignores all words with total frequency lower than this
workers=4) # Use 4 CPU cores for training

# Get the vector for a word
print("Vector for 'cat':\n", model.wv['cat'])

# Find the most similar words
print("\nWords most similar to 'cat':\n", model.wv.most_similar('cat'))

When you initialize the Word2Vec model, you can adjust several parameters to control its training and the resulting word embeddings.

  • sentences: This is the input data, which must be a list of tokenized sentences.
  • vector_size: Dimensionality of the word vectors. A higher number captures more information but can be computationally expensive. Common values are between 100 and 300.
  • window: Maximum distance between the current and predicted word within a sentence. A smaller window focuses on local context, while a larger one considers a broader context.
  • min_count: Ignores all words with a total frequency lower than this value. This is useful for filtering out rare words that don't provide much signal and reduces the vocabulary size.
  • workers: Number of CPU cores to use for training. A higher number speeds up the process.
  • sg: This is the crucial parameter for choosing between CBOW and Skip-gram:
    sg=0 (default): Trains a CBOW model.
    sg=1: Trains a Skip-gram model.

For a more detailed and visually rich explanation of Word2Vec, I would recommend Jay Alammar’s blog post The Illustrated Word2vec

2. GloVe (Global Vectors for Word Representation)
GloVe (Global Vectors for Word Representation) is an unsupervised learning model that generates word embeddings by combining the advantages of 2 major embedding approaches: global matrix factorization (like Latent Semantic Analysis) and local context window methods (like Word2Vec).

  • Latent Semantic Analysis (LSA): LSA is a technique that analyzes the co-occurrence of words in a corpus. It creates a large matrix where rows are words and columns are documents, and the values are word counts (or TF-IDF scores). LSA then uses a mathematical technique called Singular Value Decomposition (SVD) to reduce the dimensionality of this matrix, capturing the “latent” or underlying semantic relationships between words.
    Pros: LSA is based on global statistics, meaning it uses information from the entire corpus, which can lead to better representations for less frequent words.
    Cons: It performs poorly on capturing more nuanced word relationships (e.g., analogies).
  • Context-based Models (e.g., Word2Vec): These models, as discussed earlier, learn embeddings by analyzing the local context of words. They predict words based on their neighbors within a small window.
    Pros: They are great at capturing local, syntactic, and semantic relationships, making them effective for tasks like word analogy.
    Cons: They don’t use the global co-occurrence statistics of the entire corpus, which can be less efficient.

GloVe’s unique approach is that it trains a model to learn word vectors such that their dot product is equal to the logarithm of their co-occurrence probability. This means it tries to find vectors that satisfy the global co-occurrence statistics of the entire corpus.

from glove import Corpus, Glove
import numpy as np


# 1. Create a corpus from a list of tokenized sentences
# This is the same input format as Word2Vec.
corpus = Corpus()
sentences = [
['the', 'cat', 'sat', 'on', 'the', 'mat'],
['a', 'dog', 'ran', 'down', 'the', 'street'],
['the', 'cat', 'was', 'chasing', 'a', 'mouse']
]
corpus.fit(sentences, window=5)

# The corpus object has now built the co-occurrence matrix.
# You can view the dimensions of the matrix.
print(f"Co-occurrence matrix shape: {corpus.matrix.shape}")

# 2. Train the GloVe model
# no_components: dimensionality of the word vectors
# learning_rate: learning rate for training
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix,
epochs=30,
no_threads=4,
verbose=True)

# Add words to the model's vocabulary
glove.add_dictionary(corpus.dictionary)

# 3. Use the trained model
# Get the vector for a word
print("\nVector for 'cat':\n", glove.word_vectors[glove.dictionary['cat']])

# Find the most similar words
print("\nWords most similar to 'cat':\n", glove.most_similar('cat'))


# Demonstrate the model's ability to capture relationships via vector arithmetic.
# Note: glove-python's most_similar() accepts a single word, so the analogy
# (cat - dog + mouse) is computed manually and ranked by cosine similarity.
analogy_vec = (glove.word_vectors[glove.dictionary['cat']]
               - glove.word_vectors[glove.dictionary['dog']]
               + glove.word_vectors[glove.dictionary['mouse']])
similarities = glove.word_vectors @ analogy_vec / (
    np.linalg.norm(glove.word_vectors, axis=1) * np.linalg.norm(analogy_vec) + 1e-9)
inverse_dictionary = {idx: word for word, idx in glove.dictionary.items()}
top_ids = np.argsort(-similarities)[:3]
print("\nAnalogy result (cat - dog + mouse):", [inverse_dictionary[i] for i in top_ids])

GloVe and Word2Vec are both popular word embedding techniques, but they differ fundamentally in how they learn word representations.
The main distinction is that Word2Vec is a predictive model that learns from local context, while GloVe is a count-based model that leverages global co-occurrence statistics.

  • Learning Method
    Word2Vec (predictive): Uses a shallow neural network to predict words from context (CBOW) or context from a word (Skip-gram).
    GloVe (count-based / matrix factorization): Learns word vectors by factorizing a global word co-occurrence matrix.
  • Scope
    Word2Vec (local context): Learns relationships based on a sliding window, focusing on words that are close to each other in a sentence.
    GloVe (global statistics): Uses information from the entire corpus, capturing how often words appear together across all documents.
  • Performance
    Word2Vec: Computationally efficient and scales well to large datasets. Often performs better on word similarity tasks.
    GloVe: Can be more effective at capturing semantic relationships like analogies due to its use of global statistics.
  • Training
    Word2Vec: Trained incrementally on word-context pairs, one at a time.
    GloVe: Builds a single large co-occurrence matrix first, then trains the model by minimizing a loss function that tries to reconstruct this matrix.

3. FastText
FastText is an extension of the Word2Vec model that addresses its limitations, particularly with rare and out-of-vocabulary (OOV) words. Key innovation is that FastText treats words not as single, indivisible units, but as a “bag of character n-grams.”

Core principle behind FastText is that the morphological structure of a word contains important semantic information. By breaking words into subword units, FastText can capture this information. This is especially useful for morphologically rich languages like Turkish or German, where words have many forms.

Here’s a step-by-step breakdown of how it works:

  1. Subword Decomposition: For each word in the corpus, FastText breaks it down into a set of character n-grams. The minimum and maximum lengths of these n-grams are configurable parameters. Special characters, < and >, are added to denote the beginning and end of a word.
    Ex: Let’s take the word “running” with a min n-gram length of 3 and a max of 4. With the boundary markers added, the word becomes <running>.
    3-grams: <ru, run, unn, nni, nin, ing, ng>
    4-grams: <run, runn, unni, nnin, ning, ing>
    The set of subwords for “running” would include all of these n-grams plus the special whole-word token <running> (a small helper sketch after this list shows how such n-grams can be generated).
  2. Learning Embeddings: FastText uses either the Skip-gram or CBOW architecture, similar to Word2Vec. However, instead of learning a single vector for each word, it learns a vector for each subword n-gram.
    The embedding for a word is the sum (or average) of its subword n-gram vectors.
    Ex: the vector for “running” is the sum of the vectors for <ru, run, …, and running itself.
  3. Handling OOV Words: This subword approach is what allows FastText to handle OOV words. If it encounters a new word like “misunderestimated” during inference, it can generate a vector for it by summing the vectors of its known subwords (e.g., “mis,” “under,” “esti,” “mat,” “ed”). This vector will be more meaningful than a random vector and can be used for downstream tasks
    Ex: Consider 2 words: “apple” and “apples”.
    A simple Word2Vec model would treat them as 2 completely separate words and learn 2 unrelated vectors.
    FastText, however, would break both words down into their n-grams. For instance, both would share n-grams like ap, app, and ple. The final vectors for "apple" and "apples" will therefore be very similar because they share many subword components, correctly reflecting their semantic relationship.
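
To make the subword decomposition concrete, here is a minimal, illustrative helper (a sketch of the idea, not FastText's internal code) that generates boundary-marked character n-grams for a word:

def char_ngrams(word, minn=3, maxn=4):
    # Add the boundary markers FastText uses to mark the start and end of a word
    marked = f"<{word}>"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the special whole-word token
    return grams

print(sorted(char_ngrams("running")))
# Shared subwords explain why 'apple' and 'apples' end up with similar vectors:
print(char_ngrams("apple") & char_ngrams("apples"))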

When to Use Each
The choice of which model to use depends on your specific needs and the characteristics of your dataset.

  • Use Word2Vec: When your corpus is large and you need a fast and efficient way to get high-quality word vectors. It’s a great baseline and often performs well enough for many tasks.
  • Use GloVe: When you want to leverage global co-occurrence statistics. GloVe can sometimes capture semantic relationships like analogies better than Word2Vec. It’s a solid choice when you need a balance between local context and global information.
  • Use FastText: When your corpus contains a lot of rare or out-of-vocabulary (OOV) words, or when you are working with a morphologically rich language (like Finnish, Turkish, or German). FastText’s ability to handle unseen words makes it a powerful choice for these scenarios. It’s also a strong contender for text classification tasks where the presence of specific subword features might be highly indicative of a category.
import fasttext
import logging

# Set up logging for a better view of the training process
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample corpus
# Note: fastText requires a file as input for training
# We'll create a simple text file for this example.
data_file_path = "fasttext_corpus.txt"
with open(data_file_path, "w", encoding="utf-8") as f:
    f.write("the cat sat on the mat\n")
    f.write("a dog ran down the street\n")
    f.write("the cat was chasing a mouse\n")
    f.write("the boy is a runner\n")
    f.write("the girl is running\n")
    f.write("he is a king and she is a queen\n")

# 1. Train the fastText model
# We use skip-gram architecture by default
model = fasttext.train_unsupervised(data_file_path,
model='skipgram',
dim=100, # Dimension of word vectors
minn=3, # Min length of char ngrams
maxn=6, # Max length of char ngrams
minCount=1) # Minimum word frequency threshold; 1 keeps every word

# 2. Use the trained model
# Get the vector for a known word
print("Vector for 'cat':\n", model.get_word_vector('cat'))

# 3. Handle out-of-vocabulary (OOV) words
# The word 'running' was in our corpus.
# The word 'runner' was also in our corpus.
# The word 'runners' was NOT in our corpus.
# FastText can still generate a vector for 'runners' by using subword information.
print("\nVector for OOV word 'runners':\n", model.get_word_vector('runners'))

# 4. Find the most similar words
print("\nWords most similar to 'cat':\n", model.get_nearest_neighbors('cat'))
print("\nWords most similar to 'running':\n", model.get_nearest_neighbors('running'))

Here are a few more word embedding techniques, in the contextualized embeddings category, which will be discussed in an upcoming article along with deep neural networks.

4. ELMo
ELMo was one of the first major breakthroughs in contextualized embeddings. It uses a bidirectional LSTM (Long Short-Term Memory) network to create a word’s vector representation. Unlike static embeddings, an ELMo vector for a word is a function of the entire sentence it appears in. This allows it to capture context-specific meaning. For example, the word “bank” in “river bank” would have a different ELMo vector than in “financial bank.” The final ELMo vector is a linear combination of the vectors from each layer of the bidirectional LSTM.

5. Transformer-Based Models: These models are the current state of the art for text representation. They rely on the Transformer architecture and its self-attention mechanism, which can weigh the importance of different words in a sentence when encoding a specific word. This allows them to capture long-range dependencies and complex relationships far more effectively than LSTMs.

  • BERT (Bidirectional Encoder Representations from Transformers): BERT is pre-trained on a massive corpus using 2 tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM involves masking a certain percentage of words in a sentence and training the model to predict the original words. This forces BERT to learn a deep bidirectional understanding of the entire sentence. The result is a highly expressive embedding for each word that’s a function of its full context.
  • GPT (Generative Pre-trained Transformer): While GPT models (like GPT-2, GPT-3, and GPT-4) are primarily known for their generative capabilities, they also produce powerful contextualized embeddings. Unlike BERT’s bidirectional approach, GPT models are unidirectional, or autoregressive. They are trained to predict the next word in a sequence. The embeddings from GPT are useful for tasks that require an understanding of forward-looking context, such as text completion or summarization.
  • RoBERTa (Robustly Optimized BERT Pretraining Approach): RoBERTa is an optimized version of BERT. Its key differences include training on more data, using larger batch sizes, and removing the Next Sentence Prediction task, which was found to be unnecessary. This leads to slightly better performance on many downstream tasks.
  • XLNet: XLNet combines the best of both worlds. It uses a different pre-training objective called Permutation Language Modeling, which allows it to capture bidirectional context without the limitations of BERT’s masking. It’s an autoregressive model that learns context from both directions by considering all possible permutations of words in a sentence.

4. Model Building and Training in NLP

After the data has been preprocessed and represented numerically, you can begin the model building and training phase. This involves selecting an appropriate algorithm, training it on your data, and evaluating its performance. This phase is crucial for ensuring the model can accurately understand and process new text.

Steps for Model Building and Training

  1. Data Splitting: Before training, you need to split your dataset into 3 parts:
Source: Data Splitting
  • Training Set: The largest portion (typically 70–80% of the data) used to train the model. The model learns patterns from this data.
  • Validation Set: Used during the training process to tune hyperparameters and check for overfitting. This helps prevent the model from learning the training data too well and performing poorly on new data.
  • Test Set: A completely unseen portion of the data (typically 10–20%) used for a final, unbiased evaluation of the model’s performance after training is complete.
import numpy as np
from sklearn.model_selection import train_test_split

# Sample data
# X: text documents
# y: corresponding labels (0 or 1 for binary classification)
X = np.array(['this is a great movie', 'this is a terrible film', 'a wonderful day to go outside', 'the worst experience of my life'])
y = np.array([1, 0, 1, 0])

# Split the data into 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train:", X_train)
print("y_train:", y_train)
print("X_test:", X_test)
print("y_test:", y_test)

2. Model Selection: During the model selection phase of an NLP project, you have a wide range of models to choose from, depending on the complexity of your task, the size of your dataset, and the type of text representation you’ve chosen. These models can be broadly categorized into traditional machine learning models and deep learning models.

Traditional Machine Learning Models: These models are often a good starting point for NLP tasks, especially with limited data. They are typically used with Bag-of-Words or TF-IDF representations.

  • Naive Bayes: A simple probabilistic classifier based on Bayes’ theorem. It’s often used for text classification, such as spam filtering or sentiment analysis, due to its efficiency and good performance.
  • Logistic Regression: A linear classifier that’s highly effective for binary and multi-class text classification tasks. It works well and is easy to interpret.
  • Support Vector Machines (SVM): A powerful supervised learning model that finds the optimal hyperplane to separate data points into different classes. SVMs are well-suited for high-dimensional data, making them a strong choice for text classification.
  • Conditional Random Fields (CRF): A statistical model used for structured prediction tasks, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. CRFs consider the context of the entire sequence of words, leading to more accurate predictions.

Deep Learning Models: These models are the state-of-the-art for more complex NLP tasks, especially when using word embeddings.

  • Recurrent Neural Networks (RNNs): Designed to process sequential data, RNNs maintain a hidden state that captures information from previous words in a sentence. They’re effective for tasks like language modeling and machine translation. However, they struggle with long-range dependencies.
  • Long Short-Term Memory (LSTM) Networks: More advanced type of RNN that solves the vanishing gradient problem. LSTMs use “gates” to control the flow of information, allowing them to remember information over long sequences. LSTMs and their variants (like GRU) are widely used for tasks like text generation and sentiment analysis.
  • Transformer Models: Current backbone of most cutting-edge NLP. They use a self-attention mechanism to weigh the importance of different words in a sequence, allowing them to process text in parallel and understand long-range dependencies better than RNNs. Popular transformer-based models include:
  • BERT (Bidirectional Encoder Representations from Transformers): A powerful model that learns a deep, bidirectional representation of a sentence by considering the entire context. It’s excellent for tasks requiring a deep understanding of context, such as question answering, sentiment analysis, and text classification.
  • GPT (Generative Pre-trained Transformer): A family of models designed for text generation. Unlike BERT, GPT models are unidirectional, learning to predict the next word in a sequence.
  • RoBERTa (Robustly Optimized BERT Pretraining Approach): An optimized version of BERT that improves performance by training on more data and for longer periods.

3. Training the Model
The training process involves feeding the preprocessed and represented data into the chosen algorithm. The model’s internal parameters are adjusted through an iterative process to minimize a loss function, which measures the difference between the model’s predictions and the actual values.
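
As a minimal sketch of this step (reusing the X_train/y_train split from the data-splitting example above, with TF-IDF and Logistic Regression as illustrative choices), training amounts to fitting a pipeline on the training data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumes X_train, y_train, X_test, y_test from the train_test_split example
model = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# fit() iteratively adjusts the classifier's parameters to minimize its loss
model.fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))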

4. Hyperparameter Tuning
Hyperparameter tuning is the process of finding the optimal set of configuration settings for a machine learning model to achieve the best possible performance. These settings, called hyper-parameters, are external to the model and are not learned from the data. Instead, a data scientist must set them manually before the training process begins.
For example, when training a neural network for a text classification task, you might need to decide on the number of hidden layers, the learning rate, or the batch size. Choosing the right combination of these values can drastically change the model’s performance, preventing issues like underfitting or overfitting.

Source: HyperParameter Tuning
  • Hyper-parameters: These are external configuration settings that you set manually before the training process. They control the training algorithm’s behavior and the model’s architecture.
    Ex: learning_rate, number_of_epochs, batch_size, number_of_hidden_layers, and dropout_rate.
  • Model Parameters: These are internal variables that the model learns automatically from the training data during the training process. They are the actual values that make up the model and are used to make predictions.
    Ex: The weights and biases in a neural network or the coefficients in a Logistic Regression model.

hyperparameter space is the set of all possible values for each hyperparameter you want to tune. You define this space by specifying a range of values for each hyperparameter. Goal of hyperparameter tuning is to find the single best combination of values within this defined space.
Ex: If you want to tune 2 hyperparameters for a classifier:

  • C (regularization strength): [0.1, 1, 10, 100]
  • ngram_range: [(1, 1), (1, 2)]

hyperparameter space is Cartesian product of these lists. The tuning process would test every combination, such as (C=0.1, ngram_range=(1,1)), (C=0.1, ngram_range=(1,2)), and so on.

Hyperparameter tuning is typically an iterative process. You define the search space, select a tuning technique, train and evaluate a model for each combination of hyperparameters, and then select the best one based on a performance metric like accuracy or F1-score.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Sample text data and labels
X = ['great movie', 'terrible acting', 'wonderful film', 'bad experience']
y = [1, 0, 1, 0]

# Create a pipeline
pipeline = Pipeline([
('vectorizer', TfidfVectorizer()),
('classifier', LogisticRegression())
])

# Define the hyperparameter space to search
param_grid = {
'vectorizer__ngram_range': [(1, 1), (1, 2)], # Single words or pairs of words
'classifier__C': [0.1, 1, 10] # Regularization strength
}

# Create a GridSearchCV object
# cv=2 means 2-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=2, verbose=1)

# Perform the grid search on the data
grid_search.fit(X, y)

# Print the best parameters found
print("Best parameters found: ", grid_search.best_params_)

Hyperparameter Optimization Techniques: These are the algorithms used to intelligently search the hyperparameter space.

a. Grid Search: This is the most basic technique. It performs an exhaustive search over all specified combinations of hyperparameters in the defined grid.
Pros: Guaranteed to find the best combination within the given search space.
Cons: Extremely computationally expensive, especially with many hyperparameters or large search spaces.

b. Random Search: Instead of exhaustively checking every combination, it samples a fixed number of random combinations from the search space.
Pros: Much faster than grid search. Often finds a near-optimal solution because important hyperparameters can have a wider range of optimal values.
Cons: Not guaranteed to find the absolute best combination.
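
For comparison with the GridSearchCV example above, here is a minimal sketch of Random Search using scikit-learn's RandomizedSearchCV (the pipeline and parameter names mirror the earlier example; the candidate values are illustrative):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

X = ['great movie', 'terrible acting', 'wonderful film', 'bad experience']
y = [1, 0, 1, 0]

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression())
])

# Candidate values to sample from, rather than an exhaustive grid
param_distributions = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.01, 0.1, 1, 10, 100]
}

# n_iter controls how many random combinations are tried
random_search = RandomizedSearchCV(pipeline, param_distributions,
                                   n_iter=5, cv=2, random_state=42, verbose=1)
random_search.fit(X, y)

print("Best parameters found:", random_search.best_params_)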

c. Bayesian Optimization: This is a more advanced and efficient technique. It treats hyperparameter tuning as a regression problem. It builds a probabilistic model (surrogate model) that maps hyperparameters to an objective score (e.g., accuracy). It then uses this model to intelligently choose the next set of hyperparameters to try, balancing the exploration of new areas and the exploitation of known good areas.
Pros: Significantly more efficient than Grid or Random Search, especially for complex models or large search spaces.
Cons: More complex to implement. Popular libraries include Optuna and Hyperopt.

5. Model Evaluation
After training, you evaluate your model using the test set to ensure it generalizes well to new, unseen data. Common evaluation metrics include:

a. Classification Metrics: These are used for tasks where you predict a category, such as sentiment analysis, spam detection, or topic classification.

Accuracy:
Accuracy measures the percentage of correct predictions out of all predictions made. It’s the simplest metric but can be misleading for imbalanced datasets.

Accuracy = Number of Correct Predictions / Total Number of Predictions

Ex: For a spam filter, if you correctly classify 95 out of 100 emails, your accuracy is 95%.

Precision, Recall, and F1-Score:
These metrics are more informative, especially for imbalanced classes (e.g., when spam emails are rare).

  • Precision measures how many of the items identified as positive were actually positive. It answers the question, “Of all the emails I flagged as spam, how many were actually spam?”

Precision = True Positives / (True Positives + False Positives)

  • Recall measures how many of the actual positive items were identified. It answers the question, “Of all the actual spam emails, how many did I correctly identify?”

Recall = True Positives / (True Positives + False Negatives)

  • F1-Score is the harmonic mean of precision and recall. It provides a balanced measure of the model’s performance and is useful when you need to consider both precision and recall.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

Ex: You have 100 emails: 90 are “Not Spam,” and 10 are “Spam.”

Your model predicts 8 emails as “Spam,” but only 5 are actually “Spam.”
Precision: 5/(5+3) = 0.625 (62.5% of the emails flagged as spam were correct).
Recall: 5/(5+5) = 0.50 (you only found 50% of the actual spam emails).
F1-Score: 2 × (0.625 × 0.50) / (0.625 + 0.50) ≈ 0.556.
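
The same numbers can be reproduced with scikit-learn's metric functions; this is a minimal sketch that encodes the spam example above (10 actual spam emails, 8 flagged, 5 of them correctly):

from sklearn.metrics import precision_score, recall_score, f1_score

# 10 actual spam emails (label 1) and 90 legitimate emails (label 0)
y_true = [1] * 10 + [0] * 90
# The model flags 8 emails as spam: 5 true positives and 3 false positives
y_pred = [1] * 5 + [0] * 5 + [1] * 3 + [0] * 87

print("Precision:", precision_score(y_true, y_pred))  # 5 / (5 + 3) = 0.625
print("Recall:   ", recall_score(y_true, y_pred))     # 5 / (5 + 5) = 0.50
print("F1-Score: ", f1_score(y_true, y_pred))         # ≈ 0.556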

b. Sequence-to-Sequence Metrics: These metrics are used for tasks like machine translation, text summarization, and question answering, where the output is a sequence of words.

  • BLEU (Bilingual Evaluation Understudy): BLEU measures the similarity between a machine-generated text and a set of reference texts. It’s a precision-based metric that counts the number of n-grams (sequences of words) in the generated text that also appear in the reference texts. A score of 1.0 means a perfect match.

Ex: Generated Translation: “The cat is on the mat.”
Reference Translations: “A cat sat on the rug.” and “The cat sat on the mat.”
BLEU would look for matching unigrams (“the,” “cat”), bigrams (“the cat”), etc., to score the generated sentence’s quality.
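
A minimal sketch with NLTK's sentence-level BLEU implementation (smoothing is applied because such short sentences have few higher-order n-gram matches; the exact score depends on the smoothing method chosen):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a cat sat on the rug".split(),
    "the cat sat on the mat".split()
]
candidate = "the cat is on the mat".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")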

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is primarily used for summarization. It measures the overlap of n-grams between the generated summary and one or more reference summaries. It’s a recall-based metric, meaning it focuses on how much of the information from the reference summary is present in the generated one.

c. Generation Metrics: These are used for evaluating text generation tasks like story writing or chatbot responses.

  • Perplexity: Perplexity measures how well a language model predicts a sample of text. A lower perplexity score indicates a better model. It essentially quantifies how “surprised” the model is by the next word in a sequence. A good model with low perplexity assigns a high probability to the correct next word.
    Ex: A model that predicts “The capital of France is Paris” would have a lower perplexity than one that predicts “The capital of France is London,” because “Paris” is a much more likely next word.
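
As a tiny numeric sketch (the probabilities below are made up for illustration), perplexity is the exponential of the average negative log-probability the model assigns to the correct next words, so confident correct predictions yield a lower score:

import numpy as np

# Probabilities a hypothetical language model assigns to the correct next words
probs_good_model = [0.4, 0.5, 0.6, 0.7]    # mostly confident, correct predictions
probs_poor_model = [0.05, 0.1, 0.08, 0.2]  # low probability on the correct words

def perplexity(probabilities):
    # exp of the average negative log-likelihood
    return float(np.exp(-np.mean(np.log(probabilities))))

print("Good model perplexity:", round(perplexity(probs_good_model), 2))  # lower (~1.9)
print("Poor model perplexity:", round(perplexity(probs_poor_model), 2))  # higher (~10.6)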

d. Semantic Similarity Metrics: These metrics evaluate how semantically similar 2 pieces of text are, often used for tasks like information retrieval or semantic search.

Cosine Similarity: Cosine Similarity measures the cosine of the angle between 2 vectors. In NLP, this is often used to compare 2 document or word vectors. A score of 1 means the vectors are identical, 0 means they are orthogonal (unrelated), and -1 means they are opposite. Ex:
Document A vector: [2, 1, 0, 1] (counts of “apple,” “orange,” “car,” “truck”)
Document B vector: [1, 2, 0, 1]
Cosine Similarity would measure the angle between these vectors to determine how similar the documents are.
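
A minimal NumPy sketch for the 2 document vectors above:

import numpy as np

doc_a = np.array([2, 1, 0, 1])  # counts of "apple", "orange", "car", "truck"
doc_b = np.array([1, 2, 0, 1])

cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(f"Cosine similarity: {cosine:.3f}")  # 5 / 6 ≈ 0.833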

Deploying NLP models to production and monitoring their performance are critical steps for any real-world application. The best strategies ensure the model is reliable, scalable, and continues to perform well over time.

Deployment Strategies

Choosing a deployment strategy depends on the application’s specific needs, such as latency requirements, traffic volume, and computational resources.

a. On-Demand (Serverless) Deployment: This is ideal for applications with unpredictable or low traffic. Services like AWS Lambda or Google Cloud Functions automatically scale the model up or down based on demand.
You package your model and its code into a function. The cloud provider runs this function only when an API call is made.
Pros: Highly cost-effective (you only pay for what you use), automatically scales, and requires minimal infrastructure management.
Cons: Can introduce cold start latency for the first request after a period of inactivity.

b. Real-Time (API-Based) Deployment: This is the most common approach for interactive applications like chatbots or search engines where low latency is crucial.
The NLP model is wrapped in a REST API (e.g., using frameworks like Flask or FastAPI) and hosted on a dedicated server (virtual machine or container). Each API request sends text to the model, which returns a prediction in real-time.
Pros: Low latency, high availability, and easy to integrate with other services.
Cons: Requires more infrastructure management and can be more expensive than serverless options if traffic is low.
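
A minimal sketch of such an API with FastAPI, assuming a hypothetical scikit-learn pipeline saved as sentiment_model.joblib (the file name, endpoint, and label format are illustrative, not a prescribed setup):

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical pre-trained pipeline (e.g., TF-IDF + classifier) saved with joblib
model = joblib.load("sentiment_model.joblib")

class PredictionRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictionRequest):
    label = int(model.predict([request.text])[0])
    return {"label": label}

# Run locally with: uvicorn app:app --host 0.0.0.0 --port 8000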

c. Batch Deployment: This is used for processing large volumes of data offline, where real-time predictions are not necessary. Examples include analyzing a month’s worth of customer reviews or a large legal document corpus.
Data is collected in batches and processed by the model at scheduled intervals (e.g., once a day). The results are then stored for later analysis.
Pros: Very efficient for large datasets, as it can be optimized for throughput rather than latency.
Cons: Not suitable for applications requiring real-time predictions.

d. Edge Deployment: This strategy involves deploying the model directly on a device (e.g., a smartphone, smart speaker, or IoT device).
The model is converted to a lightweight format (e.g., using ONNX or TensorFlow Lite) and embedded into the application itself.
Pros: Extremely low latency (no network calls), works offline, and enhances user privacy as data doesn’t leave the device.
Cons: Limited by the device’s computational power and memory. Model updates are also more complex.

Monitoring NLP Applications

Once an NLP application is deployed, monitoring is essential to ensure it continues to perform as expected and to detect issues before they impact users. Monitoring strategies can be broken down into 2 main types: technical monitoring and model-centric monitoring.

a. Technical Monitoring: This focuses on the health and performance of the infrastructure and code.

  • Latency: Time taken for the model to return a prediction. High latency can indicate performance bottlenecks.
  • Throughput: The number of requests processed per second.
  • Error Rates: The frequency of failed requests (e.g., a 500 error).

Tools: Cloud-native monitoring services (e.g., AWS CloudWatch, Google Cloud Monitoring) and application performance monitoring (APM) tools (e.g., Datadog, New Relic) are commonly used to track these metrics.

b. Model-Centric Monitoring: This is specific to machine learning models and focuses on the quality of predictions.

  • Model Drift: This occurs when the model’s performance degrades over time because the data it’s seeing in production has changed.
    Ex: Sentiment analysis model trained on social media text from 2020 might struggle with new slang and emojis in 2025. Monitoring this involves comparing the distribution of input data over time.
  • Data Drift: Change in the distribution of the input data itself. This is often the cause of model drift.
    Ex: Your spam filter suddenly starts receiving emails with a completely new type of scam that it has never seen before. You can monitor this by tracking the statistical properties of the incoming text.

Use a library like Evidently AI or Great Expectations to compare the statistical properties of your training data with your production data.
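With Evidently, for example, a data drift check looks roughly like the sketch below (a hedged sketch; class names and module paths vary between Evidently versions, and train_data / prod_data are assumed to be pandas DataFrames with the same columns):

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_data: the data the model was trained on; current_data: recent production data
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_data, current_data=prod_data)
report.save_html("data_drift_report.html")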
  • Concept Drift: Change in the relationship between the input data and the target variable.
    Ex: Model that classifies “positive” sentiment based on certain keywords might see those keywords’ meanings change over time due to cultural shifts or slang.
  • Performance Metrics: Continuously monitor key performance indicators (KPIs) like accuracy, precision, and recall on a sample of live data.
    Ex: You might manually label a small portion of daily predictions to calculate the model’s precision and recall and compare it to the initial test set performance.

Log the model’s predictions and periodically sample them for manual review or use A/B testing to compare a new model’s performance to the current one.
Ex: Deploy a new version of your sentiment analysis model to 10% of users and measure if its user engagement metrics improve compared to the old model.

Topic modeling

Topic modeling is an unsupervised machine learning technique used to discover the abstract “topics” that occur in a collection of documents. It helps you understand what a large set of documents is about by identifying hidden themes or subjects.

Source: Topic Modeling

Ex: Given a collection of news articles, a topic model might identify topics like “Politics” (related to words like election, government, policy), “Technology” (with words like software, data, algorithm), or “Sports” (with words like team, game, player).

  • Topics: In topic modeling, a topic is not a single word but a collection of words that frequently appear together. Model represents a topic as a probability distribution over the vocabulary.
    Ex: a “Health” topic might be defined by high probabilities for words like doctor, patient, hospital, symptoms, while a “Finance” topic might have high probabilities for stock, market, economy, investment.
  • Document-Topic Distribution: Model assumes that each document is a mixture of several topics. Document is not assigned to a single topic; instead, it’s represented as a probability distribution over all topics.
    Ex: a document about a company’s new healthcare technology might be 80% “Technology” and 20% “Health.”
  • Topic-Word Distribution: Conversely, each topic is a mixture of words. This is the model’s output — it learns which words are most likely to belong to a specific topic.

How Do Topic Models Work?
Topic models work by iteratively analyzing the co-occurrence of words in documents. The process is based on the idea that if 2 words often appear together in the same documents, they are likely part of the same topic.

Common algorithm, Latent Dirichlet Allocation (LDA), works as follows:

  1. Initialization: The algorithm randomly assigns a topic to each word in every document.
  2. Iteration: For each word in each document, the model performs 2 steps:
  • It calculates the probability of the word belonging to each topic, based on the words already assigned to that topic.
  • It calculates the probability of the word belonging to each topic, based on the topics already present in the document.
  3. Reassignment: The word is then reassigned to a new topic based on these calculated probabilities. This process is repeated for every word in every document for a number of iterations.
  4. Convergence: Over time, this iterative process settles into a stable state where the word-topic and document-topic distributions are well-defined and reflect the underlying topics.

Source: LDA Topic modeling

Types of Topic Modeling Techniques
There are several types of topic modeling techniques, with the most common being statistical methods and more modern neural methods.

a. Latent Semantic Analysis (LSA): LSA is a linear algebra technique that uses Singular Value Decomposition (SVD) on a term-document matrix. It identifies latent topics by reducing the dimensionality of the matrix, grouping similar documents and terms.

b. Latent Dirichlet Allocation (LDA): LDA is a probabilistic, generative model that’s the most widely used topic modeling technique. It assumes that documents are generated from a mixture of topics and that each topic generates words from its own distribution.

c. Non-Negative Matrix Factorization (NMF): NMF is a matrix factorization technique similar to LSA but with the constraint that all matrix values must be non-negative. This is useful because word counts and frequencies are non-negative, which leads to a more interpretable result.

d. Neural Topic Models (e.g., BERTopic): Modern techniques use deep learning models like BERT to create powerful contextualized embeddings. BERTopic, for example, combines these embeddings with clustering techniques to find topics. It generates high-quality topics that are more coherent than traditional methods.

Code snippet to implement Topic Modeling using Latent Dirichlet Allocation (LDA) with the popular gensim library.

 # 1. Import necessary libraries
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords if you haven't already
# nltk.download('stopwords')

# 2. Sample data
# Each entry in the list is a document.
documents = [
"The federal government is working to improve policies related to public health and safety.",
"Scientists are researching new data on climate change and its effect on our planet.",
"The new political party is focused on economic reform and social justice.",
"Researchers presented their findings on the new algorithm at the technology conference.",
"Medical professionals are concerned about the spread of infectious diseases.",
"A team of programmers is developing a new software for data analysis."
]

# 3. Preprocessing the data
# This is a crucial step to clean the text for the model.
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def preprocess_text(text):
    # Tokenize the text and remove stop words
    # simple_preprocess tokenizes and lowercases the text
    tokens = [word for word in simple_preprocess(text, deacc=True) if word not in stop_words]
    return tokens

# Apply the preprocessing function to all documents
processed_docs = [preprocess_text(doc) for doc in documents]

# 4. Create a dictionary and corpus
# The dictionary is a mapping of words to unique IDs
id2word = corpora.Dictionary(processed_docs)
# The corpus is a list of bags-of-words, where each bag represents a document
corpus = [id2word.doc2bow(doc) for doc in processed_docs]

# 5. Build the LDA model
# We set num_topics to the number of topics we want to find
num_topics = 3
lda_model = gensim.models.LdaMulticore(corpus=corpus,
id2word=id2word,
num_topics=num_topics,
random_state=100,
chunksize=100,
passes=10,
per_word_topics=True)

# 6. View the discovered topics
# The model gives you the top words for each topic with their weights
print("Discovered Topics:")
for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic: {idx} \nWords: {topic}\n")

# 7. Evaluate the model (optional but recommended)
# A higher coherence score indicates more coherent and meaningful topics.
coherence_model = CoherenceModel(model=lda_model,
texts=processed_docs,
dictionary=id2word,
coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score}")

Since a basic understanding of deep neural networks is helpful for learning about RNNs, you can refer to this article, which includes a detailed explanation.

Comprehensive Guide to Deep Learning — Neural Networks

Ever wondered how AI can recognize faces, or translate languages in an instant? The magic behind these capabilities…

medium.com

We need Recurrent Neural Networks (RNNs) because traditional neural networks, like the feedforward networks we’ve discussed in my previous article, have a fundamental limitation: they cannot handle sequential data or remember past information. RNNs were created to address this very issue.

Earlier neural networks, such as Feedforward Neural Networks (FNNs), operate on a simple principle: they assume that each input is independent of all other inputs. This works perfectly for tasks like image classification, where the content of one image doesn’t depend on the previous one.

However, this independence assumption breaks down completely for sequential tasks. Consider these examples:

  • Language: The meaning of a word in a sentence depends on the words that came before it. The word “bank” has a different meaning in “river bank” than in “money bank”.
    An FNN processing “money bank” would see “money” and “bank” as separate inputs, losing the context.
  • Time Series: To predict tomorrow’s stock price, you need to know today’s and yesterday’s prices. An FNN has no mechanism to carry this historical information forward.

Because FNNs have no “memory,” they cannot recognize patterns or dependencies across a sequence of inputs.

How RNNs Addressed the Issue
RNNs solved this problem by introducing a recurrent loop within their structure. This loop allows information to persist from one step of the sequence to the next.

  • Internal Memory: An RNN has an internal state, often called a hidden state, which acts as a form of short-term memory. When an input is fed into the network, the hidden state is updated based on both the current input and the hidden state from the previous time step.
  • Sequence Processing: This loop enables the network to process inputs one at a time while maintaining a representation of the information seen so far. This makes RNNs ideal for tasks like:
  • NLP: Processing sentences, machine translation, and text generation.
  • Speech Recognition: Converting spoken words into text.
  • Time Series Analysis: Predicting stock prices or weather patterns.

In short, while earlier neural networks are excellent for data without a temporal or sequential dependency, RNNs were a breakthrough because they allowed neural networks to learn from sequences, opening the door to a vast new range of applications.

A Recurrent Neural Network (RNN) is a type of neural network designed to handle sequential data, such as text, speech, or time series.
Unlike a traditional feedforward neural network (FFNN), an RNN has internal memory that allows it to process sequences by considering past information. This makes it a foundational model for many NLP tasks.

Source: RNN

The term “recurrent” refers to this looping mechanism, where the same operation is applied at each time step, with the output of the current step being fed back as an input to the next step. This creates a kind of “memory” that allows the network to process sequential data effectively.

RNN Architecture

The basic RNN architecture consists of a repeating module or cell. This module takes 2 inputs at each time step:

  1. Current Input: The data point for the current time step (e.g., a word in a sentence).
  2. Previous Hidden State: The memory from the previous time step.

Inside the module, these 2 inputs are combined (e.g., through a weighted sum) and passed through a non-linear activation function (like tanh or ReLU). This process generates 2 outputs:

  1. Current Output: A prediction or an output for the current time step.
  2. Updated Hidden State: The new memory that is passed to the next time step.

This recurring loop is what gives the RNN its “memory.”

Let’s use an example to walk through how an RNN processes the sentence, “The movie was great!” for sentiment analysis

Step 1: Initialization The RNN starts with an initial hidden state, usually a vector of zeros

Step 2: Processing the First Word (“The”)

  • Input: The numerical representation (e.g., a word embedding) of the word “The”
  • Process: The RNN combines the input for “The” with the initial hidden state
  • Output: The network produces a temporary output and an updated hidden state that encodes information about the word “The” This new hidden state is now the memory for the next step

Step 3: Processing the Second Word (“movie”)

  • Input: The numerical representation of the word “movie”
  • Process: The RNN combines the input for “movie” with the hidden state from the previous step.
  • Output: It generates a new output and an updated hidden state that contains information about both “The” and “movie”

Step 4: Continuing the Sequence (“was,” “great!”)
This process repeats for each word in the sentence. With each new word, the RNN updates its hidden state, which accumulates a richer representation of the sentence’s context. By the time it processes the word “great,” the hidden state contains a “memory” of the entire sentence up to that point.

Step 5: Final Output
After processing the last word, the final output and hidden state can be used to make a prediction.
For a sentiment classification task, the final hidden state would be fed into a classification layer (e.g., a simple feedforward layer with a softmax activation) to predict the sentiment — in this case, “Positive.”

Elman Network
The Elman network, also known as a Simple Recurrent Network (SRN), is one of the foundational types of RNNs. It was introduced by Jeffrey Elman in 1990. Its key feature is the inclusion of a “context layer” that explicitly stores a copy of the hidden layer’s output from the previous time step. This context layer feeds back into the hidden layer, providing the network with its memory.
An Elman network is a three-layer network:

  • Input Layer: Takes the current input from the sequence.
  • Hidden Layer: The main processing layer. It receives input from both the input layer and the context layer.
  • Context Layer: A special layer that holds a copy of the hidden layer’s activations from the previous time step. This is the source of the network’s recurrence.
  • Output Layer: Produces the network’s prediction.

Let’s see how an Elman network processes a sentence similar to the earlier example: “This movie was great!”

Time Step 1: “This”
Input Layer: Receives the word embedding for “This”.
Context Layer: Contains an initial state (e.g., all zeros).
Hidden Layer: Calculates its activations based on “This” and the initial context.
Context Layer Update: The hidden layer’s activations are copied to the context layer for the next time step.

Time Step 2: “movie”
Input Layer: Receives the word embedding for “movie”.
Context Layer: Contains the activations from the previous hidden layer (after processing “This”).
Hidden Layer: Calculates its activations based on “movie” and the context from the first time step.
Context Layer Update: The new hidden layer activations are copied to the context layer.

This process continues for “was” and “great!” By the time the network processes “great!”, the context layer contains a summary of the entire sentence so far (“This movie was…”). Final output from the network is then used to predict the sentiment, which would be “Positive.”

Elman network, despite its simplicity, demonstrates core principle of RNNs: using a recurrent connection to remember and process information in sequences.

Folded model is the conceptual and compact representation of an RNN. It shows a single, repeating neural network module with a loop. This loop represents the network’s recurrent nature, where the output of the hidden layer at time step t is fed back as an input to the same hidden layer at time step t+1.

  • xₜ​​: The input at the current time step.
  • hₜ​: The hidden state (or “memory”) at the current time step.
  • Oₜ​: The output at the current time step.
  • U, W, V: The weight matrices that are shared across all time steps.

This diagram is useful for understanding that the RNN is a single function that is repeatedly applied to a sequence of data.

Source: Folded RNN

Calculations within this folded module are the same at every time step. Network uses 3 shared weight matrices:

  • U: Weights for the input-to-hidden layer connection.
  • W: Weights for the hidden-to-hidden (recurrent) connection.
  • V: Weights for the hidden-to-output connection.

New hidden state (hₜ) is calculated by combining the current input (xₜ​) and the previous hidden state (hₜ₋₁​). This is a crucial step as it integrates new information with the network’s existing memory. Combined value is then passed through a non-linear activation function, such as tanh or ReLU, to enable the network to learn complex patterns.

hₜ = tanh(U · xₜ + W · hₜ₋₁)

Output is calculated based on the newly computed hidden state(hₜ). For many NLP tasks, this output is the final prediction for that time step.
An activation function like softmax is often used here for classification tasks to produce a probability distribution over the possible classes.

Oₜ = softmax(V · hₜ)
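
To make these 2 equations concrete, here is a minimal NumPy sketch of a single forward step with randomly initialized weights (the dimensions are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 3, 2   # illustrative sizes

U = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x_t = rng.normal(size=input_dim)  # current input (e.g., a word embedding)
h_prev = np.zeros(hidden_dim)     # previous hidden state (zeros at the first step)

h_t = np.tanh(U @ x_t + W @ h_prev)  # new hidden state
o_t = softmax(V @ h_t)               # output probabilities for this time step

print("h_t:", h_t)
print("o_t:", o_t)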

Unfolded model is a sequential representation of RNN’s behavior over time. It “unrolls” the loop to show the network as a chain of identical neural network modules, one for each time step in the sequence. Each module takes the input at its time step and the hidden state from the previous time step to produce an output and a new hidden state.

Source: Folded RNN & Unfolded RNN

Types of RNNs

There are several types of Recurrent Neural Networks (RNNs), each designed to handle specific types of sequence-to-sequence problems. Main variations are based on the input and output structures.

Source: Types of RNN

1. One-to-One
This is the most basic structure, where a single input corresponds to a single output. It’s essentially a feedforward neural network, as there is no true sequential processing or memory.
Application: Basic classification tasks where the input and output are not sequences. Ex: Image classification. You input a single image and get a single label (e.g., “cat,” “dog”).

2. One-to-Many
Single input produces a sequence of outputs.
Application: Image captioning, where a single image is described by a sequence of words. Ex: RNN takes an image of a baseball game as input and generates the sentence “A baseball player is at bat.”

3. Many-to-One
Sequence of inputs is processed to produce a single output.
Application: Sentiment analysis and text classification. The entire sequence of words is used to determine a single outcome.
Ex: Model takes a movie review like “The movie was absolutely fantastic, I loved every moment of it!” and outputs a single label: “Positive.”

4. Many-to-Many
Sequence of inputs produces a sequence of outputs. This type can be further divided into 2 subtypes:

  • Many-to-Many (Encoder-Decoder): Sequence is processed and then a new, separate sequence is generated. There is a pause between the input and output sequences.
    Application: Machine translation. Entire input sentence must be read before the translation can begin.
    Ex: RNN takes the English sentence “How are you?” and outputs the translated French sentence “Comment allez-vous?”
  • Many-to-Many (Synchronous): Output is generated at each time step as the input is being processed. Lengths of the input and output sequences are the same.
    Application: Video classification at the frame level, where each frame is classified. Ex: RNN takes a sequence of video frames and at each frame, it outputs a label, such as “running,” “jumping,” or “standing.”

When we train an RNN, we are teaching it to learn the relationships and patterns within sequential data, such as a sentence. The network adjusts its internal parameters (weights and biases) to minimize the difference between its predictions and the actual data. This process is how the network learns to predict the next word in a sentence or classify the sentiment of a review.

Backpropagation Through Time (BPTT)

BPTT is the algorithm used to train RNNs. It’s a specialized version of the standard backpropagation algorithm designed for models that process sequences over time. BPTT is used after the network has made a prediction on a full sequence and the error (or loss) for that prediction has been calculated.

Source: BPTT

Primary goal of BPTT is to calculate the gradients of the loss with respect to the RNN’s weights, so those weights can be updated to improve performance.

Process of BPTT involves 2 main steps: a forward pass and a backward pass.

  1. Forward Pass: RNN processes the entire input sequence from start to finish. At each step, it calculates a hidden state and an output, and a loss is computed by comparing the predicted output to the actual output. The network stores these intermediate hidden states and outputs, which are needed for the backward pass.
  2. Backward Pass (BPTT): Starting from the last time step, the algorithm works backward through the sequence. It calculates the gradient of the loss with respect to the weights at each time step. Because the weights are shared across all time steps, the gradients from each step are summed up. This accumulation of gradients is crucial as it accounts for the influence of a weight on the loss at every point in the sequence.

Let’s use the sentence “Agentic AI: intelligent, autonomous AI systems that can reason, make decisions, and act independently to perform complex tasks without constant human guidance” to train an RNN to predict the next word. The network processes the sentence word by word.

Forward Pass:

  • Network takes “Agentic” as input, calculates a hidden state, and tries to predict the next word. Let’s say it incorrectly predicts “robot” instead of “AI.” It calculates a loss for this mistake.
  • It continues this process for the entire sentence. Network’s hidden state at each step is influenced by all the words that came before it.
  • By the end of the sentence, a total loss is computed, which is the sum of the losses at each step.

Backward Pass (BPTT):

  • BPTT begins at the end of the sentence. It takes the loss from the final prediction and calculates the gradients for the weights (U, W, V) at that time step.
  • It then propagates this error backward to the previous time step, and the one before that, all the way to the start of the sentence.
  • At each step, it adds the local gradients to the running total. This process ensures that the gradients for the weights are influenced by the errors from the entire sequence.
    Ex: incorrect prediction “robot” was likely due to the initial weights. BPTT makes sure that this error contributes to the final gradient, leading to a more accurate weight adjustment.

Once the backward pass is complete, the total accumulated gradient is used to update the shared weights. This update improves the network’s ability to make better predictions on the next sequence.
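
To ground the forward and backward passes described above, here is a hedged NumPy sketch of BPTT for a tiny vanilla RNN. The weight names U, W, V follow the earlier equations, while the data and the simple squared-error loss are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 4, 3, 5, 2

# Shared weights, reused at every time step
U = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input -> hidden
W = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden
V = rng.standard_normal((output_dim, hidden_dim)) * 0.1  # hidden -> output

xs = [rng.standard_normal(input_dim) for _ in range(T)]   # dummy input sequence
ys = [rng.standard_normal(output_dim) for _ in range(T)]  # dummy targets

# ---- Forward pass: store every hidden state; they are needed for the backward pass ----
hs = {-1: np.zeros(hidden_dim)}
outs, loss = {}, 0.0
for t in range(T):
    hs[t] = np.tanh(U @ xs[t] + W @ hs[t - 1])        # hₜ = tanh(U·xₜ + W·hₜ₋₁)
    outs[t] = V @ hs[t]                               # linear output (regression-style for simplicity)
    loss += 0.5 * np.sum((outs[t] - ys[t]) ** 2)      # per-step loss, summed over the sequence

# ---- Backward pass (BPTT): gradients of the shared weights are summed over all time steps ----
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dh_next = np.zeros(hidden_dim)                        # error flowing back from step t+1
for t in reversed(range(T)):
    dout = outs[t] - ys[t]                            # dLoss/dOutput at step t
    dV += np.outer(dout, hs[t])
    dh = V.T @ dout + dh_next                         # local error + error from the future
    da = (1.0 - hs[t] ** 2) * dh                      # backprop through tanh
    dU += np.outer(da, xs[t])
    dW += np.outer(da, hs[t - 1])
    dh_next = W.T @ da                                # pass the error back one more step

# One gradient-descent update on the shared weights
lr = 0.01
U -= lr * dU
W -= lr * dW
V -= lr * dV
print(f"total sequence loss before the update: {loss:.4f}")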

Let’s build a simple sentiment analysis model for IMDB movie reviews using an RNN. The process involves loading the data, preprocessing it, building the RNN model, training it, and evaluating its performance.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Load the data
# We'll limit the vocabulary to the top 10,000 most frequent words
vocab_size = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# 2. Pad sequences to ensure uniform length
# Set a max length for the reviews
max_len = 256
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# 3. Build the RNN model
model = models.Sequential()
# Embedding layer to convert word indices into dense vectors
model.add(layers.Embedding(vocab_size, 128))
# SimpleRNN layer with 128 hidden units
model.add(layers.SimpleRNN(128))
# Output dense layer with sigmoid activation for binary classification
model.add(layers.Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Print a summary of the model architecture
model.summary()

# 4. Train the model
history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=32,
                    validation_split=0.2)

# 5. Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {accuracy}")

# Example of using the model to predict sentiment on a new review
new_review = "This movie was absolutely amazing and breathtaking, a masterpiece!"
# Preprocess the new review
# Keras's IMDB encoding reserves indices 0-2 and shifts real words by 3,
# so we apply the same offset here and map unknown/rare words to the OOV index (2).
word_index = imdb.get_word_index()
review_encoded = [word_index.get(word, None) for word in new_review.lower().split()]
review_encoded = [i + 3 if i is not None and i + 3 < vocab_size else 2 for i in review_encoded]
review_padded = pad_sequences([review_encoded], maxlen=max_len)

# Make a prediction
prediction = model.predict(review_padded)[0][0]
print(f"\nPrediction score: {prediction}")
if prediction > 0.5:
    print("The sentiment is POSITIVE.")
else:
    print("The sentiment is NEGATIVE.")

Major disadvantages of standard Recurrent Neural Networks (RNNs) are the vanishing and exploding gradient problems, which make them difficult to train on long sequences, and their slow computation due to their sequential nature.

Vanishing Gradients

Vanishing gradients occur when the gradients become extremely small as they are backpropagated through many time steps. This effectively means that the model’s memory of past information fades away over time.
The gradient is a measure of how much a small change in a weight will affect the network’s final loss. During Backpropagation Through Time (BPTT), this gradient is multiplied by the weight matrix (W) at each time step as it is passed backward through the network. If the values in the weight matrix are small (e.g., less than 1), repeatedly multiplying the gradient by these small numbers will cause it to shrink exponentially, eventually becoming negligible.

Ex: Consider the sentence: “Agentic AI: intelligent, autonomous AI systems that can reason… to perform complex tasks without constant human guidance.”

Source: Vanishing & Exploding Gradient

An RNN is trying to predict the final word, “guidance.”
The word “Agentic” at the beginning of the sentence is crucial for understanding the context. However, the connection between the start of the sentence and the end is very long.
As the network's error for its final prediction is backpropagated to the beginning of the sentence, the gradients will pass through dozens of layers (one for each word). If the weights are small, the gradient will vanish, and the network will not be able to adjust its weights to remember the context of "Agentic." As a result, the network effectively "forgets" the beginning of the sentence, making it difficult to learn long-range dependencies.

Exploding Gradients

Exploding gradients occur when the gradients become extremely large during backpropagation. This is the opposite problem of vanishing gradients.
This happens when the values in the weight matrix (W) are large (Ex. greater than 1). As the gradient is multiplied by these large numbers at each time step, it grows exponentially, becoming so large that it causes the weight updates to overshoot the optimal solution. The model becomes unstable and its performance drops dramatically.

Ex: Using the same sentence, if the network has large weights, the gradient from the final error will explode as it backpropagates to the beginning of the sentence. This will cause the network’s weights to be updated by a huge amount, essentially “blowing up” the model. While easier to detect and fix than vanishing gradients, exploding gradients make the training process unstable and can lead to a useless model.

The most effective ways to overcome both vanishing and exploding gradients are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, together with Gradient Clipping. We will explore these in detail.

Ex: Let’s go back to our sentence: “Agentic AI… to perform complex tasks without constant human guidance.”

  • An LSTM or GRU would use its gates to explicitly store the relevant context of “Agentic AI.” When the model processes the final part of the sentence, it can access this stored information directly, rather than relying on a continuously updated hidden state that might have “forgotten” the beginning.
    The gradient can flow more directly through these gates, avoiding the chain of multiplications that causes the vanishing problem in standard RNNs. This ensures that the weights at the beginning of the network get meaningful updates, allowing the model to learn long-term dependencies.
  • Gradient clipping is a simple and effective technique used to combat exploding gradients. It’s a simple fix that prevents the gradients from getting too large during backpropagation.
    If the gradient’s value exceeds a certain threshold, it is clipped or scaled down to a predetermined maximum value. This prevents the weights from being updated by an extreme amount, which would destabilize the network.
    Ex: During BPTT on our long sentence, if the calculated gradient for a specific weight becomes 50 (a very large value), we can set a clipping threshold of, say, 10. The gradient will then be scaled down to 10, ensuring the weight update is manageable and the network remains stable.
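
Building on the second remedy above, here is a minimal sketch of how gradient clipping is typically enabled; the threshold of 1.0 is an arbitrary value chosen for illustration, not a recommendation.

import tensorflow as tf

# Keras: clip at the optimizer level, by global norm or by value
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)    # rescale gradients if their norm exceeds 1.0
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=1.0) # or clip each gradient element to [-1, 1]

# PyTorch equivalent (called between loss.backward() and optimizer.step()):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)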

LSTM (Long Short-Term Memory)

An LSTM network is a type of Recurrent Neural Network (RNN) designed to overcome the vanishing gradient problem. It’s an enhanced version of the RNN with a more complex internal structure that allows it to learn and remember long-term dependencies in sequential data. LSTMs are widely used for tasks like machine translation, speech recognition, and time series forecasting.

The LSTM cell is the core component of a Long Short-Term Memory (LSTM) network. It’s a specialized unit designed to handle sequential data by selectively remembering or forgetting information. It has a complex internal structure that includes a cell state and various gates.

  1. Cell State (Cₜ)
    The cell state is the core of the LSTM’s memory. Think of it as a “conveyor belt” that runs through the entire sequence, carrying important information from one time step to the next. The cell state is a separate pathway for information to flow, which helps to maintain a stable gradient and prevent it from vanishing or exploding over long sequences.

Hidden State (hₜ): This is the LSTM’s short-term memory. It’s the output of the cell at each time step and contains information about the current input and the context of the recent past. The hidden state is used as the input for the next time step and is also used to generate the final output.

Cell vs. Hidden: The cell state stores information over a very long duration (the entire sequence), while the hidden state provides a more immediate, filtered output of the cell state for the current time step.

2. Gates
The gates are the neural networks within the LSTM cell that regulate the flow of information. They act like filters, deciding what information to keep, forget, or use. Each gate is a sigmoid neural network, and its output is a number between 0 and 1, where 0 means “block this information” and 1 means “allow this information to pass.”

Source: LSTM

a. Forget Gate:
The Forget Gate is responsible for deciding what old information to discard from the long-term memory (cell state). It takes the current input (xₜ) and the previous hidden state (hₜ₋₁) and outputs a value between 0 and 1 for each piece of information in the cell state.
A 0 means “completely forget this,” while a 1 means “completely keep this.” It helps the network filter out irrelevant noise from the past.

fₜ​ = σ(Wf​⋅ [hₜ₋₁​, xₜ​​] + bf)

b. Input Gate & “Learn” Function
The Input Gate decides what new information to learn and store in the long-term memory. It has 2 parts:

  • A sigmoid layer (the “Input Gate” itself) decides which new values to update.
  • A tanh layer (the “learn” gate) creates a vector of new candidate values (~Cₜ) that could be added to the cell state.

These 2 parts work together to select and prepare new, relevant information to be added to the cell’s memory.

iₜ​ = σ (Wᵢ⋅[ hₜ₋₁​, xₜ​​ ​] + bᵢ)

~Cₜ ​= tanh(W꜀​⋅ [hₜ₋₁​, xₜ​​ ​] + b꜀​)

c. “Remember” Function
This is not a separate gate but rather the core function of the LSTM cell that updates the long-term memory by combining the results of the Forget and Input gates. It allows the network to selectively remember a mix of old and new information. The old cell state is multiplied by the forget gate’s output (to discard what’s forgotten), and the new candidate values are added after being filtered by the input gate.

Cₜ = fₜ​⋅ Cₜ₋₁​ + iₜ ⋅ ~Cₜ

Here, Cₜ₋₁​ is the old cell state, and Cₜ is the new cell state.

d. Output Gate & “Use” Function
The Output Gate decides what information from the updated long-term memory will be used to create the new short-term memory (hidden state). It filters the updated cell state to generate a new hidden state that serves as the output for the current time step and the input for the next. This ensures that only relevant information is passed on.

oₜ = σ(Wₒ ⋅ [hₜ₋₁, xₜ] + bₒ)

hₜ​ = oₜ⋅ tanh(Cₜ​)
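
The gate equations above condense into just a few lines of code. Below is a minimal NumPy sketch of a single LSTM cell step; the weight matrices and inputs are random stand-ins rather than trained values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3

# One weight matrix per gate; each acts on the concatenation [h_{t-1}, x_t]
Wf, Wi, Wc, Wo = [rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1 for _ in range(4)]
bf, bi, bc, bo = [np.zeros(hidden_dim) for _ in range(4)]

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])        # [hₜ₋₁, xₜ]
    f_t = sigmoid(Wf @ z + bf)               # forget gate: what to discard from Cₜ₋₁
    i_t = sigmoid(Wi @ z + bi)               # input gate: which new values to write
    c_tilde = np.tanh(Wc @ z + bc)           # candidate values ~Cₜ
    c_t = f_t * c_prev + i_t * c_tilde       # "remember": Cₜ = fₜ·Cₜ₋₁ + iₜ·~Cₜ
    o_t = sigmoid(Wo @ z + bo)               # output gate
    h_t = o_t * np.tanh(c_t)                 # "use": hₜ = oₜ·tanh(Cₜ)
    return h_t, c_t

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.standard_normal(input_dim), h, c)
print("h_t:", h.round(3), "C_t:", c.round(3))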

To perform sentiment analysis using an LSTM on the IMDB movie review dataset, we’ll follow a similar process to the RNN example, but with a more powerful model architecture. The core change is replacing the SimpleRNN layer with an LSTM layer, which is better at handling long sequences.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Load the data
vocab_size = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# 2. Pad sequences to ensure uniform length
max_len = 256
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# 3. Build the LSTM model
model = models.Sequential()
# Embedding layer to convert words into dense vectors
model.add(layers.Embedding(vocab_size, 128))
# LSTM layer with 128 hidden units
model.add(layers.LSTM(128))
# Output dense layer with sigmoid activation for binary classification
model.add(layers.Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Print a summary of the model architecture
model.summary()

# 4. Train the model
history = model.fit(x_train, y_train,
                    epochs=5,
                    batch_size=32,
                    validation_split=0.2)

# 5. Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f"\nTest Accuracy: {accuracy}")

# Example of using the model to predict sentiment on a new review
new_review = "This movie was incredibly boring and a waste of time, but the ending was spectacular!"
# Preprocess the new review
# Keras's IMDB encoding reserves indices 0-2 and shifts real words by 3,
# so we apply the same offset here and map unknown/rare words to the OOV index (2).
word_index = imdb.get_word_index()
review_encoded = [word_index.get(word, None) for word in new_review.lower().split()]
review_encoded = [i + 3 if i is not None and i + 3 < vocab_size else 2 for i in review_encoded]
review_padded = pad_sequences([review_encoded], maxlen=max_len)

# Make a prediction
prediction = model.predict(review_padded)[0][0]
print(f"\nPrediction score: {prediction}")
if prediction > 0.5:
    print("The sentiment is POSITIVE.")
else:
    print("The sentiment is NEGATIVE.")

LSTMs are a significant improvement over standard RNNs, but they still have limitations, particularly with very long sequences and in their computational efficiency.

1. Limited “Attention Span”
While LSTMs are good at handling long-term dependencies, they still have a limited “attention span.” The cell state is a fixed-size vector, and as it processes very long sequences, it can still lose some information. For tasks that require understanding context from a document with thousands of words, LSTMs can struggle to retain information from the very beginning. This problem is similar to the vanishing gradient issue in standard RNNs, though far less severe. This limitation is a key reason why more advanced architectures with explicit attention mechanisms were developed.

2. Computational Inefficiency
LSTMs are computationally expensive and slow to train compared to feed-forward networks. Their primary bottleneck is their sequential nature:

  • No Parallelization: LSTMs must process each element in a sequence one after the other. The computation for a given time step depends on the output of the previous time step. This means you can’t parallelize the processing of a single sequence on modern hardware like GPUs, making training and inference slow for long sequences.
  • High Complexity: Each LSTM cell has a more complex internal structure than a standard RNN cell, with 3 gates and a cell state. This increases the number of parameters and computations per time step, contributing to a slower overall process.

3. Handling Multiple Sequences
LSTMs are not naturally suited for handling multiple, separate sequences simultaneously. While you can batch sequences of similar length, the sequential dependency within each sequence still requires step-by-step processing. Architectures like the Transformer, on the other hand, can process all elements of a sequence in parallel, making them much more efficient for tasks that involve processing large volumes of text. This parallelization capability is a major reason why the Transformer has become the dominant model in NLP.

Types of LSTM

1. Vanilla LSTM
The standard LSTM architecture with a single layer. It has a cell state and 3 gates (forget, input, and output) to regulate information flow.
What it addresses: The vanishing gradient problem of simple RNNs.
Structure: A single LSTM layer followed by a dense output layer.
Use Case: Sentiment analysis on short texts, time series forecasting.
Ex: Classifying movie reviews as positive or negative.
Limitation: It struggles to capture complex, multi-level dependencies and is slow on very long sequences.

2. Stacked LSTM
Multiple LSTM layers stacked on top of each other. The output of one LSTM layer serves as the input for the next. This increases the network’s depth and allows it to learn more complex features.
What it addresses: Capturing more abstract and hierarchical representations of data.
Structure: The output of each LSTM layer (except the last one) is passed as a sequence to the next LSTM layer.
Use Case: Advanced language modeling, machine translation, and complex sequence-to-sequence problems.
Ex: A stacked LSTM could be used to translate a sentence from English to French, where the first layer learns basic word patterns and subsequent layers learn more complex grammatical structures.
Limitation: Computationally expensive and still suffers from the sequential bottleneck.

3. Bidirectional LSTM (Bi-LSTM)
An architecture that consists of 2 separate LSTMs: one processes the input sequence forward (e.g., left to right), and the other processes it backward (right to left). The outputs of both LSTMs are combined to form a single final output.
What it addresses: The limitation of a standard LSTM to access future context. For example, in the sentence “I love the movie, it was so good,” the word “love” provides context for “good.” But in a regular LSTM, a forward pass would process “love” before “good.” A Bi-LSTM processes both directions simultaneously.
Structure: 2 LSTMs with opposing directions.
Use Case: Named Entity Recognition (NER), where the classification of a word (e.g., “Paris”) depends on both the words before and after it.
Ex: Identifying a person’s name like “M. L. King” requires looking at the full phrase. A forward LSTM might see “M. L.” but only the backward one can recognize “King” as a surname, allowing for the correct classification.
Limitation: Computationally more expensive than a single LSTM.

4. Convolutional LSTM (ConvLSTM)
An LSTM variant that uses convolutional operations within its gates. This allows it to process and learn from spatial or spatio-temporal data, such as images and video.
What it addresses: Applying LSTM’s memory capabilities to data with a grid-like structure, like images or video frames.
Structure: The input, hidden state, and cell state are all 3D tensors (with height, width, and channels), and the gates use 2D convolutional filters to perform their operations.
Use Case: Weather forecasting, video prediction, and action recognition in videos.
Ex: Predicting the next frame in a video sequence based on previous frames.
Limitation: More complex and computationally intensive than standard LSTMs.

5. Gated Recurrent Unit (GRU)
A simplified version of the LSTM. It combines the cell state and hidden state into a single hidden state and uses only 2 gates: a reset gate and an update gate.
What it addresses: Reduces the complexity and number of parameters of an LSTM while retaining most of its power in handling vanishing gradients.
Structure: Simpler than an LSTM with fewer gates.
Use Case: Any task where LSTMs are used. Often preferred due to its faster computation and comparable performance on many tasks.
Ex: Can be used for sentiment analysis or machine translation, often with results similar to an LSTM but with less training time.
Limitation: May not perform as well as LSTMs on very specific, highly complex problems.

6. Encoder-Decoder LSTM with Attention Mechanism
This architecture consists of 2 LSTMs: an encoder that reads the entire input sequence and compresses it into a single context vector, and a decoder that uses this vector to generate the output sequence. The attention mechanism is a layer that allows the decoder to “pay attention” to specific parts of the input sequence, rather than relying solely on the single context vector.
What it addresses: The bottleneck of the standard encoder-decoder model, where all information must be compressed into a fixed-size vector, regardless of the sequence length.
Structure: An encoder LSTM, a decoder LSTM, and an attention layer that connects the two.
Use Case: State-of-the-art machine translation, text summarization, and question answering.
Ex: When translating “I like the food at the restaurant,” the attention mechanism helps the decoder focus on “restaurant” when generating the translated word for “restaurant.”
Limitation: The attention mechanism still adds computational complexity, which is why the Transformer model, which uses attention exclusively, has largely replaced it.
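
To make a few of these variants concrete, here is a hedged Keras sketch of a stacked LSTM, a Bidirectional LSTM, and a GRU classifier for binary sentiment. The vocabulary size (10,000) and layer sizes are arbitrary choices for illustration, not recommendations.

from tensorflow.keras import layers, models

# Stacked LSTM: intermediate layers must return full sequences for the next LSTM
stacked = models.Sequential([
    layers.Embedding(10000, 128),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(1, activation='sigmoid'),
])

# Bidirectional LSTM: one LSTM reads left-to-right, the other right-to-left
bidirectional = models.Sequential([
    layers.Embedding(10000, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation='sigmoid'),
])

# GRU: same usage pattern as an LSTM, with fewer parameters per unit
gru = models.Sequential([
    layers.Embedding(10000, 128),
    layers.GRU(64),
    layers.Dense(1, activation='sigmoid'),
])

stacked.summary()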

Sequence-to-Sequence (Seq2Seq) Models

Sequence-to-Sequence (Seq2Seq) model is a powerful neural network architecture designed to transform one sequence of data (the input) into another sequence of data (the output), where the lengths of the 2 sequences can differ. This architecture revolutionized tasks like Neural Machine Translation (NMT), text summarization, and dialogue systems.

Source: Seq2Seq model

The standard Seq2Seq model, particularly the early versions that used RNNs/LSTMs, consists of 2 main components: an Encoder and a Decoder.

I. Encoder: Reading and Compressing (English → Vector)
The encoder’s role is to process the entire input sequence and compress its meaning, context, and structure into a single, fixed-size vector.

  • Input Processing — Tokenization and Embedding
    The source sentence is broken into tokens (words). Each token is converted into a numerical embedding vector.
  • Encoding Step-by-Step: Accumulating Context (using LSTMs)
    The encoder (usually an LSTM or GRU) processes each word’s embedding sequentially. At each time step, it updates its hidden state (hₜ) and cell state (Cₜ), integrating the current word’s meaning with the context accumulated from all previous words.
  • Context Vector Generation: Summarizing the Sentence
    After processing the last word in the input sequence, the encoder’s final hidden state and cell state are taken as the Context Vector. This vector is intended to be a complete summary, or “thought,” of the entire source sentence.

Ex: Translating “I love machine learning” (English)
The encoder processes: “I”→”love”→”machine”→”learning”.
The final hidden state after processing “learning” becomes the Context Vector C.

II. Decoder: Generating the Output (Vector → French)
The decoder’s role is to take the context vector and generate the target sequence one word at a time, making sure the generated sentence is grammatically correct and faithful to the source’s meaning.

  • Initialization: Receiving the Summary
    The decoder (also an LSTM or GRU) is initialized using the Context Vector as its starting hidden state and cell state. The first input it receives is a special token, typically “<START>”.
  • Output Generation: Predicting the First Word
    The decoder uses its initial state and the “<START>” token to predict the first word of the target sentence.
    It uses a Softmax layer to output a probability distribution over the entire target vocabulary.
  • Target Sentence Construction: Autoregressive Generation
    The word with the highest probability is selected as the output (Ex: “J’aime”). This generated word is then fed back as the input for the next time step.
  • Iteration: Continuing the Sequence
    The decoder continues this loop, generating one word at a time, until it predicts a special “<END>” token, which signals the completion of the translation.

Ex: Translating to French: “J’aime l’apprentissage automatique”

  • Input: “<START>” + C. Output: “J’aime”.
  • Input: “J’aime” + C. Output: “l’apprentissage”.
  • Input: “l’apprentissage” + C. Output: “automatique”.
  • Input: “automatique” + C. Output: “<END>”.
  • Final Output: “J’aime l’apprentissage automatique”

Context Vector is the lynchpin connecting the encoder and decoder in a sequence-to-sequence (Seq2Seq) model. It is the fixed-size numerical summary generated by the encoder that encapsulates the entire meaning, structure, and relevant information of the source input sequence.
Its utilization in the decoder is critical for achieving accurate translation or sequence generation.

The Context Vector is used for 2 main purposes in the decoder of a standard (RNN/LSTM-based) Seq2Seq model:

1. Initialization of the Decoder’s State
The most direct use of the Context Vector is to initialize the starting hidden state (h₀​) and cell state (c₀) of the decoder’s RNN or LSTM.
By setting the decoder’s memory (its initial state) to the compressed representation of the entire source sentence, the decoder begins its work with a full understanding of the input.

It’s like a translator reading a source document and forming a complete understanding before speaking the first word of the translation.
The Context Vector is that initial understanding.

2. Providing Continuous Context
In some Seq2Seq implementations, the Context Vector is fed as an additional input at every time step of the decoder’s generation process.
This provides a constant, unchanging reference to the source sentence’s meaning, ensuring that the decoder remains grounded in the original text as it generates the target sequence.
Without this constant reminder, the decoder’s internal state might drift away from the original meaning, especially when translating very long sentences.

This example will translate a simple sequence (e.g., numbers to reverse numbers) or a small corpus.

We will define the overall process in 5 logical parts.

Importing and Data Preparation
This stage sets up the environment, defines the vocabulary, and creates data pipelines.

# 1. Importing Necessary Dependencies
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import random

# Setting the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Dummy data for a simple demonstration (e.g., reversing sequences)
RAW_DATA = [("hello", "olleh"), ("world", "dlrow"), ("python", "nohtyp")]

# Special tokens
SOS_token = 0 # Start of Sequence
EOS_token = 1 # End of Sequence

# 2. Creating Vocabulary and Numericalizer
class Vocab:
    def __init__(self, name):
        self.name = name
        self.word2index = {"<SOS>": SOS_token, "<EOS>": EOS_token}
        self.word2count = {}
        self.index2word = {SOS_token: "<SOS>", EOS_token: "<EOS>"}
        self.n_words = 2  # Count SOS and EOS

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

# Helper function to convert a sequence of words to a sequence of indices
def tensor_from_sequence(vocab, sequence):
    indices = [vocab.word2index[word] for word in sequence]
    indices.append(EOS_token)
    return torch.tensor(indices, dtype=torch.long, device=device).view(-1, 1)

# Initialize vocab objects
input_vocab = Vocab("input")
output_vocab = Vocab("output")
for input_seq, output_seq in RAW_DATA:
    for word in input_seq:
        input_vocab.add_word(word)
    for word in output_seq:
        output_vocab.add_word(word)

# 3. Data Loaders (Simplified for demonstration)
training_pairs = [(tensor_from_sequence(input_vocab, inp), tensor_from_sequence(output_vocab, out))
                  for inp, out in RAW_DATA]

def get_random_batch():
    return random.choice(training_pairs)

Building the Model
We define the Encoder and the Decoder using PyTorch’s nn.LSTM module.

# Model Hyperparameters
HIDDEN_SIZE = 256
EMBEDDING_DIM = 256

# --- ENCODER CLASS ---
class EncoderLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_dim):
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size

        # Embedding layer converts tokens to vectors
        self.embedding = nn.Embedding(input_size, embedding_dim)

        # LSTM layer: input_size=embedding_dim, output_size=hidden_size
        self.lstm = nn.LSTM(embedding_dim, hidden_size)

    def forward(self, input_tensor, hidden_state, cell_state):
        # input_tensor is a single word index (1x1)
        embedded = self.embedding(input_tensor).view(1, 1, -1)

        # Forward pass through the LSTM
        output, (hidden_state, cell_state) = self.lstm(embedded, (hidden_state, cell_state))

        return output, hidden_state, cell_state

    def init_hidden(self):
        # Initialize the hidden state and cell state to zeros
        return (torch.zeros(1, 1, self.hidden_size, device=device),
                torch.zeros(1, 1, self.hidden_size, device=device))

# --- DECODER CLASS ---
class DecoderLSTM(nn.Module):
    def __init__(self, output_size, hidden_size, embedding_dim):
        super(DecoderLSTM, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size)

        # Output layer to predict the next word
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_tensor, hidden_state, cell_state):
        # input_tensor is the previously predicted word (1x1)
        embedded = self.embedding(input_tensor).view(1, 1, -1)

        # Apply ReLU (optional activation)
        # output = F.relu(embedded)

        # Forward pass through the LSTM
        output, (hidden_state, cell_state) = self.lstm(embedded, (hidden_state, cell_state))

        # Prediction layer
        output = self.softmax(self.out(output[0]))

        return output, hidden_state, cell_state

Training the Model
This involves initialization, a single training step function, and the main training loop.

# --- Model Initialization, Weight Initialization, Optimizer and Loss Initialization ---
encoder = EncoderLSTM(input_vocab.n_words, HIDDEN_SIZE, EMBEDDING_DIM).to(device)
decoder = DecoderLSTM(output_vocab.n_words, HIDDEN_SIZE, EMBEDDING_DIM).to(device)

# Using Adam optimizer and Negative Log Likelihood Loss
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.01)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.01)
criterion = nn.NLLLoss()

# --- Creating a Training Loop (single step) ---
def train_step(input_tensor, target_tensor, encoder, decoder,
               encoder_optimizer, decoder_optimizer, criterion):

    # 1. Initialize Encoder States
    encoder_hidden, encoder_cell = encoder.init_hidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    loss = 0

    # 2. ENCODER LOOP
    for ei in range(input_length):
        _, encoder_hidden, encoder_cell = encoder(
            input_tensor[ei], encoder_hidden, encoder_cell)

    # 3. DECODER INITIALIZATION
    # Decoder starts with the <SOS> token as input
    decoder_input = torch.tensor([[SOS_token]], device=device)

    # Decoder states are initialized with the final encoder states (Context Vector)
    decoder_hidden = encoder_hidden
    decoder_cell = encoder_cell

    # 4. DECODER LOOP (using Teacher Forcing for training)
    for di in range(target_length):
        decoder_output, decoder_hidden, decoder_cell = decoder(
            decoder_input, decoder_hidden, decoder_cell)

        # Calculate loss (decoder_output is the log-softmax of the prediction)
        loss += criterion(decoder_output, target_tensor[di])

        # Teacher Forcing: use the *true* target word as the next input
        decoder_input = target_tensor[di]

    # 5. Backpropagation and Optimization
    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length

# --- Main Training Loop ---
def train_model(encoder, decoder, n_iters=500):
    print("Starting training...")
    total_loss = 0

    for iter in range(1, n_iters + 1):
        input_tensor, target_tensor = get_random_batch()

        loss = train_step(input_tensor, target_tensor, encoder, decoder,
                          encoder_optimizer, decoder_optimizer, criterion)
        total_loss += loss

        if iter % 100 == 0:
            avg_loss = total_loss / 100
            print(f'Iteration {iter} - Loss: {avg_loss:.4f}')
            total_loss = 0

# train_model(encoder, decoder, n_iters=1000)  # Uncomment to run training

Creating a Function to Translate the Sentence
This is the inference function, which operates without “teacher forcing” (it uses its own predictions as input).

def translate_sentence(encoder, decoder, sentence):
    with torch.no_grad():
        # 1. ENCODER INFERENCE
        input_tensor = tensor_from_sequence(input_vocab, sentence)
        input_length = input_tensor.size(0)
        encoder_hidden, encoder_cell = encoder.init_hidden()

        for ei in range(input_length):
            _, encoder_hidden, encoder_cell = encoder(
                input_tensor[ei], encoder_hidden, encoder_cell)

        # 2. DECODER INFERENCE (Initialization)
        decoder_input = torch.tensor([[SOS_token]], device=device)
        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

        translated_words = []

        # 3. DECODER LOOP
        # We limit the length to prevent infinite loops
        max_length = 10
        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_cell = decoder(
                decoder_input, decoder_hidden, decoder_cell)

            # Get the word with the highest probability
            topv, topi = decoder_output.data.topk(1)

            # Check for the EOS token
            if topi.item() == EOS_token:
                translated_words.append('<EOS>')
                break
            else:
                translated_words.append(output_vocab.index2word[topi.item()])

            # Use the model's prediction as the next input (NO Teacher Forcing)
            decoder_input = topi.squeeze().detach()

        return ' '.join(translated_words)

Evaluating the Model
Once trained, the model can be tested using the translation function.

# --- EVALUATION ---
# Assuming the model has been trained by uncommenting train_model() above

# Example sentences from the RAW_DATA
sentence_hello = "hello"
sentence_world = "world"

# translation_hello = translate_sentence(encoder, decoder, sentence_hello)
# translation_world = translate_sentence(encoder, decoder, sentence_world)

# print(f"Input: {sentence_hello}, Output: {translation_hello}")
# print(f"Input: {sentence_world}, Output: {translation_world}")

The primary limitation of the original Sequence-to-Sequence (Seq2Seq) model, particularly those built with RNNs/LSTMs, was the information bottleneck created by the Context Vector.
The Attention mechanism solved this by replacing the single fixed-size vector with a dynamic focus on the input sequence during decoding.

Information Bottleneck
The original Seq2Seq architecture required the Encoder to compress the entire input sequence (no matter how long) into a single, fixed-size vector called the Context Vector. This single vector was then used to initialize and condition the Decoder.

  • Problem: For short sentences, this vector was usually sufficient to capture the full meaning. However, for long input sequences, forcing all the nuanced information (meaning, grammar, long-range dependencies) into this fixed-size vector caused a loss of information. The network would struggle to “remember” the beginning of a long sentence by the time it finished encoding the end.
  • Result: The quality of the output sequence (e.g., the translation) deteriorated significantly as the input sentence length increased.

Ex: Imagine translating a long English sentence: “The brilliant, critically acclaimed director, who started his career making low-budget horror films, has finally released his latest movie, which is spectacular.”
When the encoder finishes, the Context Vector C is supposed to hold the entire meaning. However, for a model with a fixed memory capacity, C might only retain the most recent words (“released his latest movie, which is spectacular”) while forgetting the details from the beginning (“brilliant, critically acclaimed director”). Consequently, the decoder might produce an incomplete or contextually inaccurate translation.

The Attention mechanism fundamentally solved the bottleneck problem by eliminating the need to compress all information into a single vector. Instead, it allows the decoder to dynamically access the full set of encoder hidden states at every decoding step.

How Attention Mechanism Works

The Attention mechanism works by creating a new, dynamic Context Vector (Cₜ​) at every time step t of the decoding process. This vector is an informed summary of the encoder’s entire output, weighted by relevance.

The process involves 4 main steps: Query, Key-Value Matching (Scoring), Weighting, and Context Generation.

  1. Query: Decoder’s Current State (sₜ)
    At time t, the decoder generates a Query based on its current hidden state. This query represents the decoder’s demand: “What information do I need to generate the next word?”
  2. Key-Value Matching (Scoring): Alignment Function (eₜᵢ)
    The Query (sₜ) is compared against every Key (which is typically every hidden state hᵢ produced by the encoder). The comparison (often a dot product or concatenation) yields an alignment score (eₜᵢ), indicating how well the i-th input word matches the decoder’s current need.
  3. Weighting: Softmax Activation (αₜᵢ)
    The raw scores (eₜᵢ) are passed through a Softmax function to normalize them into a set of Attention Weights (αₜᵢ). These weights are probabilities that sum to 1, showing the distribution of “attention” across the input words.
  4. Context Generation: Weighted Sum (Cₜ)
    The final dynamic Context Vector (Cₜ) is calculated as the weighted sum of all the encoder’s hidden states (the Values), using the attention weights (αₜᵢ). This vector is a highly focused summary of the source input, customized for the current output word.
  5. Output Prediction
    The decoder then combines its current state (sₜ) with the dynamic Context Vector (Cₜ) to make the final prediction for the next output word.

Ex: English to French Translation
Source Sentence (Encoder Outputs): “The apple is red.”
(Hidden states: h₁ = h_the, h₂ = h_apple, h₃ = h_is, h₄ = h_red)

Target Output Step (Decoder is predicting the word for “red”):

  1. Query (sₜ): The decoder’s hidden state, reflecting that it needs to output a word related to color.
  2. Scoring: The Query (sₜ) is compared with all encoder states (h₁, h₂, h₃, h₄):
  • Score(Query, h_the) = low
  • Score(Query, h_apple) = low
  • Score(Query, h_is) = low
  • Score(Query, h_red) = HIGH

3. Weighting (αₜᵢ): Softmax normalizes these scores:

  • α_red ≈ 0.90
  • α_others ≈ 0.10 (combined)

4. Context Generation (Cₜ): The new Context Vector is overwhelmingly dominated by the vector for the word “red” because it has a 90% weight.

Cₜ = 0.90 * h_red + 0.10 * (weighted sum of the other hᵢ)

5. Output: The decoder uses this highly focused Cₜ to confidently predict the French word, “rouge.”
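
The scoring, softmax, and weighted-sum steps in this example can be reproduced in a few lines. The following NumPy sketch uses random stand-in vectors for the encoder states and the decoder query, so only the mechanics (not the actual numbers) are meaningful.

import numpy as np

rng = np.random.default_rng(0)
hidden = 8

encoder_states = rng.standard_normal((4, hidden))              # stand-ins for h_the, h_apple, h_is, h_red
query = encoder_states[3] + 0.1 * rng.standard_normal(hidden)  # a decoder state "looking for" red

scores = encoder_states @ query                                # dot-product alignment scores e_ti
weights = np.exp(scores - scores.max())                        # softmax -> attention weights α_ti
weights = weights / weights.sum()
context = weights @ encoder_states                             # dynamic context vector C_t

print("attention weights:", weights.round(2))                  # the weight on h_red should dominate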

Global Attention (Soft Attention)

In Global Attention, the decoder calculates attention weights over all of the encoder’s source hidden states for every single time step of the decoding process. This means the dynamic context vector is a weighted average of the entire source sequence.

The decoder’s query (sₜ) is compared against all encoder keys (h₁, …, h_N).
Pros: Highly accurate because it always has access to the full context, regardless of where the relevant information lies.
Cons: Computationally expensive for very long sequences (e.g., documents) due to the need to compute N alignment scores for every output step.
Ex: Standard sentence-level Neural Machine Translation (NMT). To translate a word, the decoder looks at every single word in the source sentence.

Local Attention

Local Attention is designed to address the computational bottleneck of Global Attention by having the decoder focus on only a small, pre-selected window of source hidden states, rather than the entire sequence.

The model first predicts an aligned position (pₜ) in the source sequence and then defines a fixed-size window around this position (e.g., 10 words). The attention mechanism then only computes weights and the context vector within this window.
Pros: Much more computationally efficient than Global Attention, making it suitable for processing very long documents or paragraphs.
Cons: If the actual aligned information lies outside the predicted local window, the model can miss crucial context.

Local Attention has 2 sub-variants:

a. Local-m (Monotonic Alignment)
The aligned position pₜ is simply set to the previous aligned position, pₜ = pₜ₋₁. This assumes the input and output sequences are generally monotonically ordered (i.e., words translate in the same order as they appear).
Use Case: Languages with similar word order (e.g., English to Spanish).

b. Local-p (Predictive Alignment)
The aligned position pₜ is predicted dynamically by a sub-network (e.g., a small feed-forward network) based on the decoder’s current hidden state.
Use Case: Languages with highly different word orders (e.g., English to Japanese), where the model needs to be flexible about where it focuses its attention.

Implementing the Attention Mechanism in the provided Seq2Seq LSTM code requires modifying the Decoder to allow it to dynamically access and weight the Encoder’s hidden states, rather than relying solely on the final context vector.

Here’s how to integrate the Global Attention mechanism (a common type) into the Encoder-Decoder LSTM architecture using PyTorch.

The implementation involves 3 key changes:

  1. Encoder: Must now return all its hidden states.
  2. Attention Class: A new module is created to calculate the alignment scores and the dynamic context vector.
  3. Decoder: The standard decoder is replaced with an AttnDecoder that uses the calculated context vector before making the final word prediction.

1. Modifying the Encoder
The encoder must now return its output sequence (all hidden states) in addition to its final hidden and cell states.

# --- MODIFIED ENCODER CLASS ---
class EncoderLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, embedding_dim):
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, embedding_dim)

        # NOTE: return_sequences=True is needed in Keras/TensorFlow.
        # In PyTorch, we collect all outputs manually in the forward loop.
        self.lstm = nn.LSTM(embedding_dim, hidden_size)

    def forward(self, input_tensor, hidden_state, cell_state):
        embedded = self.embedding(input_tensor).view(1, 1, -1)

        # Output is the hidden state for the current time step
        output, (hidden_state, cell_state) = self.lstm(embedded, (hidden_state, cell_state))

        return output, hidden_state, cell_state  # output here is h_t

# NOTE: The overall training loop must be changed to collect all encoder_outputs:
# encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
# for ei in range(input_length):
#     encoder_output, encoder_hidden, encoder_cell = encoder(...)
#     encoder_outputs[ei] = encoder_output[0, 0]

2. Creating the Attention Module
This module takes the decoder’s current hidden state (Query) and all encoder hidden states (Keys/Values) to compute the attention weights and the context vector.

# --- ATTENTION MECHANISM CLASS ---
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        self.hidden_size = hidden_size

        # We'll use the 'dot' method (a simple dot product for scoring)
        if self.method == 'general':  # A slightly more complex scoring function
            self.attn = nn.Linear(self.hidden_size, hidden_size)
        elif self.method == 'concat':  # A common way to score
            self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
            self.v = nn.Parameter(torch.rand(hidden_size))

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (1, 1, hidden_size) -> Query
        # encoder_outputs: (input_length, hidden_size) -> Keys/Values

        # Calculate attention scores (raw energies)
        if self.method == 'dot':
            # Score = Query^T * Key, broadcast over all encoder states
            attn_energies = torch.sum(decoder_hidden * encoder_outputs, dim=2)

        elif self.method == 'general':
            # Score = Query^T * W * Key
            attn_energies = torch.sum(decoder_hidden * self.attn(encoder_outputs), dim=2)

        # 1. Normalize energies to get attention weights (probabilities)
        attn_weights = nn.functional.softmax(attn_energies.squeeze(0), dim=0)

        # 2. Compute the dynamic context vector (weighted sum of encoder outputs)
        # bmm needs 3D tensors: (1, 1, L) x (1, L, H) -> (1, 1, H)
        context = attn_weights.view(1, 1, -1).bmm(encoder_outputs.unsqueeze(0))

        # context: (1, hidden_size), attn_weights: (input_length,)
        return context.squeeze(0), attn_weights

3. Modifying the Decoder with Attention
The AttnDecoder combines the standard LSTM operation with the attention mechanism.

# --- MODIFIED DECODER CLASS WITH ATTENTION ---
class AttnDecoderLSTM(nn.Module):
    def __init__(self, output_size, hidden_size, embedding_dim, dropout_p=0.1, method='dot'):
        super(AttnDecoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p

        self.embedding = nn.Embedding(output_size, embedding_dim)
        self.dropout = nn.Dropout(self.dropout_p)
        self.lstm = nn.LSTM(embedding_dim + hidden_size, hidden_size)  # Input now includes the Context Vector

        self.attn = Attn(method, hidden_size)  # Initialize the Attention module

        # Output layer is fed the concatenation of hidden state and context vector
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_tensor, hidden_state, cell_state, encoder_outputs):
        # 1. Embed and Dropout the input
        embedded = self.dropout(self.embedding(input_tensor)).view(1, 1, -1)

        # 2. Get the Dynamic Context Vector (Attention)
        # The query is the decoder's current hidden state
        context, attn_weights = self.attn(hidden_state, encoder_outputs)

        # 3. Concatenate Embedded Input and Context Vector
        # The LSTM input is now enriched with the focused context
        lstm_input = torch.cat((embedded[0], context), 1).unsqueeze(0)

        # 4. LSTM Forward Pass
        output, (hidden_state, cell_state) = self.lstm(lstm_input, (hidden_state, cell_state))

        # 5. Final Prediction
        # The final prediction uses a combination of the LSTM's output and the context
        output = torch.cat((output[0], context), 1)
        output = self.concat(output)
        output = self.out(output)

        output = self.softmax(output)

        return output, hidden_state, cell_state, attn_weights

4. Updating the Training Step
The training function must be updated to pass all encoder_outputs to the decoder at every step.


# OLD: No Attention
# decoder_output, decoder_hidden, decoder_cell = decoder(decoder_input, decoder_hidden, decoder_cell)

# NEW: With Attention
decoder_output, decoder_hidden, decoder_cell, attn_weights = decoder(
    decoder_input, decoder_hidden, decoder_cell, encoder_outputs)

The Transformer architecture, introduced in the 2017 paper “Attention Is All You Need,” is a model that revolutionized sequence processing by entirely replacing recurrence (RNNs/LSTMs) with the Attention mechanism.

Source: Transformer

In simple terms, the Transformer is a powerful network that allows every word in a sentence to instantly access and weigh the importance of every other word, regardless of how far apart they are.

The Core Idea: Parallel Attention over Recurrence
Imagine translating a sentence. An LSTM processes it word-by-word, creating a memory state at each step. If the sentence is long, the initial words are often forgotten by the time it reaches the end — a sequential bottleneck.

The Transformer, however, processes the entire sentence in parallel at once. It uses the Self-Attention mechanism to calculate how relevant every word is to every other word, instantly establishing long-range dependencies.

  • RNN/LSTM: Slow, sequential, poor memory retention over long distances.
  • Transformer: Fast, parallel, excellent long-range memory retention.

Architecture: Encoder and Decoder Stacks
The original Transformer architecture uses the same Encoder-Decoder structure as the Seq2Seq models, but the internal components are entirely different.

Encoder
The Encoder is the half of the Transformer architecture responsible for taking the input sequence and transforming it into a rich, contextual representation.
It consists of 3 main components per layer: Input Processing, Multi-Head Self-Attention, and the Feed-Forward Network.

a. Input Embedding
This step is standard in most NLP models. Each word (token) in the input sequence is converted into a continuous numerical vector (an embedding). This vector captures the semantic meaning of the word.

Why Positional Encoding is Used
The core problem the Transformer solves is the sequential bottleneck of RNNs/LSTMs by processing the entire input sequence in parallel.

  • Problem: By processing all words at once, the Transformer loses the inherent order of the words. Without knowing the order, the meaning of a sentence is lost (e.g., “Dog bites man” is different from “Man bites dog”).
  • Solution: Positional Encoding (PE): A vector is created that uniquely represents the position of a word in the sequence. This PE vector is then added to the word’s Embedding vector.

Benefit of Positional Encoding: Positional Encoding gives the Transformer a sense of sequence order and distance. It allows the Self-Attention mechanism to distinguish between the first word, the last word, and 2 identical words appearing at different positions in the sentence. This is crucial for capturing syntactic and grammatical relationships that depend on word order.
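
As a reference point, here is a minimal NumPy sketch of the sinusoidal positional encoding used in “Attention Is All You Need”; the sequence length and model dimension are arbitrary values chosen for illustration.

import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                            # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                            # cosine on odd dimensions
    return pe

# The PE matrix is simply added to the word embeddings:
# embeddings = embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(10, 16).shape)  # (10, 16)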

b. Multi-Head Self-Attention
Self-Attention is the mechanism that allows a word in a sequence to look at and weigh the relevance of every other word in that same sequence to determine its own context-aware representation.
It achieves this through 3 learned linear projections of the input word vector (x):

  • Query (Q): Represents the information the current word is looking for.
  • Key (K): Represents the information the other words possess.
  • Value (V): Contains the actual content that will be summed up to form the new representation.

The new output vector for a word is calculated as a weighted sum of all Value vectors, where the weights are determined by the similarity between the word’s Q and every word’s K.

Ex: In the sentence: “The city council approved the budget because it was necessary.” When calculating the new vector for the word “it”:

  • The Query of “it” is compared to the Keys of all other words.
  • The Keys for “city council” and “budget” will have the highest match scores with the Query of “it.”
  • The final contextual vector for “it” will be heavily weighted by the Values of “city council” and “budget,” correctly indicating that “it” refers to the budget.
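
Putting the Q, K, V description into code, the following NumPy sketch computes single-head self-attention as softmax(QKᵀ / √dₖ)V. The projection matrices and token vectors are random placeholders, not learned weights.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 16
X = rng.standard_normal((seq_len, d_model))            # one vector per token

Wq, Wk, Wv = [rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # learned projections (random here)

scores = Q @ K.T / np.sqrt(d_k)                        # how relevant every word is to every other word
weights = softmax(scores)                              # each row is a probability distribution
Z = weights @ V                                        # context-aware representation per token

print(Z.shape)  # (5, 16)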

Multi-Head Attention is simply running the Self-Attention mechanism multiple times in parallel (Ex: 8 times).

  • Each independent “head” learns to pay attention to different kinds of relationships.
  • One head might learn syntactic relationships (e.g., linking a verb to its subject).
  • Another head might learn semantic relationships (e.g., linking a pronoun to its antecedent, like the “it” and “budget” example).
  • The outputs from all attention heads are concatenated and then linearly projected back to the desired output dimension, creating a single, highly refined vector that captures multiple perspectives on the word’s context.

c. Feed-Forward Network (FFN)
The FFN is a simple, position-wise, two-layer fully connected network that is applied identically and independently to the output of the Multi-Head Attention sub-layer at every position in the sequence.

Structure: FFN(x)=max(0, xW₁ + b₁​)W₂ + b₂

Role:
It adds non-linearity to the model, allowing it to learn more complex features.
It further processes and transforms the contextualized information generated by the attention sub-layer.

Key Feature: While the attention mechanism mixes information across all positions in the sequence, the FFN allows the network to apply local, internal processing to the resulting vector at each word’s position.

In essence, the Encoder stack’s job is to take raw words, inject position information, contextually enrich them using multiple parallel attention mechanisms, and finally refine those representations using a simple neural network.

Decoder
The Transformer’s Decoder is the part of the architecture responsible for generating the output sequence (e.g., the translated sentence) based on the contextual information provided by the Encoder. It does this one word at a time, incorporating 3 key layers per stack: Masked Self-Attention, Encoder-Decoder Attention, and the Feed-Forward Network.

a. Masked Multi-Head Self-Attention
This layer is essentially the same as the Self-Attention in the Encoder, but with one critical addition: a mask.

What is Masked Self-Attention?
When the Decoder generates the output sequence, it must do so autoregressively, meaning it predicts the next word based only on the words it has already generated.
Problem: During parallel training, the Decoder is fed the entire target sentence at once. Without a mask, the attention mechanism could “cheat” by looking at the subsequent (future) words in the target sequence when calculating the attention score for the current word.
Solution: The Mask: The Masked Self-Attention layer applies a triangular mask (often composed of negative infinity values) to the scoring matrix. This mask effectively blocks the attention weights for any future positions.

The masking operation is the only difference.
Function: If the Decoder is trying to generate the word at position 5, the mask ensures that the attention weights for positions 6, 7, 8, etc., are set to zero (or near zero, via negative infinity passed into Softmax).
Benefit: This preserves the sequential generation property of the decoder during parallel training, ensuring the model’s behavior during training matches its sequential generation process during inference.

Ex: If the target sequence is “I am happy.” and the decoder is calculating the contextual vector for the word “am” (position 2):

  • Unmasked Attention would see and use “happy” and “.”
  • Masked Attention only sees “I” and “am” (itself). The words “happy” and “.” are masked, preventing information leakage from the future.
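
A short sketch of the causal (look-ahead) mask described above: positions to the right of the diagonal are set to negative infinity before the softmax, so each word can only attend to itself and earlier words. The scores here are random placeholders.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))       # raw (unmasked) attention scores

# True above the diagonal = "future" positions that must not be attended to
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                                 # blocked scores become 0 after the softmax

weights = softmax(scores)
print(weights.round(2))                                # row t only has weight on positions <= t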

b. Encoder-Decoder Attention (Cross-Attention)
This is the second, unique attention layer in the Decoder, providing the crucial link between the Encoder and the Decoder.

What is Encoder-Decoder Attention?
This mechanism allows the Decoder to dynamically focus on the relevant parts of the source input sequence (from the Encoder) whenever it generates an output word. It operates identically to the Attention mechanism used in the Seq2Seq-with-Attention architecture.

Cross-Attention
In this layer, the Q, K, and V vectors come from different sources:

  • Query (Q)
    Source — Comes from the output of the previous Masked Self-Attention layer in the Decoder.
    Role — Asks: “What source content do I need to look at to generate the next target word?”
  • Key (K)
    Source — Comes from the final outputs of the Encoder stack.
    Role — Represents the content available in the source sequence.
  • Value (V)
    Source — Comes from the final outputs of the Encoder stack.
    Role — Represents the actual source information to be weighted and summed into the context vector.

Ex: To translate the French phrase “chat noir” (black cat), which has a different word order from English:

  1. Decoder is about to output “black.” Its Query (Q) vector reflects this need.
  2. Q is compared to K vectors of the Encoder outputs (“chat” & “noir”).
  3. Attention score will be highest for the K of “noir.”
  4. Cross-Attention layer creates a dynamic context vector dominated by the information from “noir”, allowing the decoder to correctly output “black” (even though “noir” came second in French).

c. Feed-Forward Network (FFN) in the Decoder

The FFN in the Decoder serves the same purpose as the one in the Encoder.

  • Function: It is a simple, position-wise, two-layer fully connected network applied independently to the output of the Cross-Attention layer at every position.
  • Role: It adds non-linearity and applies further local processing to the contextualized vectors before the final output prediction is made.

These final components are crucial for stabilizing the training process and generating the final output probabilities.

d. Add & Normalize
This is applied immediately after the Multi-Head Attention and the Feed-Forward Network sub-layers.

  • “Add” (Residual Connection): The input to the sub-layer is added to the output of the sub-layer. This technique, borrowed from ResNet architectures, helps train deep networks by letting the gradient flow directly through the layers, mitigating the vanishing gradient problem.
  • “Normalize” (Layer Normalization): The output is then normalized across the features for each sample. This stabilizes the training process by maintaining similar input distributions to the subsequent layers.
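Both the position-wise FFN and the Add &amp; Normalize wrapper reduce to a few lines. A minimal sketch with arbitrary sizes (not tied to the model built later):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, LayerNormalization

d_model, ff_dim = 64, 128
x = tf.random.normal((2, 10, d_model))   # (batch, positions, features): output of an attention sub-layer

# Position-wise FFN: the same two Dense layers are applied to every position independently
ffn = keras.Sequential([Dense(ff_dim, activation="relu"), Dense(d_model)])

# "Add" (residual connection) then "Normalize" (LayerNormalization over the feature axis)
out = LayerNormalization(epsilon=1e-6)(x + ffn(x))

print(out.shape)  # (2, 10, 64): the shape is preserved, which is what makes the residual addition possible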

e. Linear
The final layer in the Decoder is a Linear layer (also called a fully-connected layer).
It takes the final contextualized vector from the top layer of the Decoder stack and projects it into a much larger vector space — specifically, a space equal to the size of the entire target vocabulary.

f. Softmax
The final operation is the Softmax function applied to the output of the Linear layer.
Softmax converts the raw scores (logits) from the Linear layer into a probability distribution over all possible words in the target vocabulary.
Output: The word with the highest probability is chosen as the model’s prediction for the current time step.
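A brief sketch of this final projection (the vocabulary size and vector below are arbitrary placeholders):

import tensorflow as tf
from tensorflow.keras.layers import Dense

vocab_size, d_model = 10000, 64                       # arbitrary sizes for illustration
decoder_top = tf.random.normal((1, 1, d_model))       # final Decoder vector for the current position

linear = Dense(vocab_size)                            # "Linear": projects into vocabulary-sized logits
probs = tf.nn.softmax(linear(decoder_top), axis=-1)   # "Softmax": logits -> probability distribution

predicted_id = tf.argmax(probs, axis=-1)              # greedy choice: the most probable token
print(probs.shape, int(predicted_id[0, 0]))           # (1, 1, 10000) and the chosen vocabulary index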

Implementing the full Transformer architecture in Keras/TensorFlow for a simple sequence-to-sequence task is quite involved, as it requires defining custom layers for multi-head attention, positional encoding, and the complete Encoder-Decoder structure.

Since the previous example was a simplified character-level reversal, I will provide a Keras/TensorFlow implementation that uses the core components of the Transformer to achieve a simple Seq2Seq task, focusing on the architecture’s spirit.

This implementation uses Keras’s built-in layers and the custom MultiHeadAttention layer, which is the heart of the Transformer.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Layer, Input, Embedding, Dropout, Dense, MultiHeadAttention, LayerNormalization
import numpy as np
import random
import string

# Hyperparameters
MAX_SEQUENCE_LENGTH = 10
VOCAB_SIZE = 30 # a-z + space + special tokens (recomputed from the vocabulary below)
EMBED_DIM = 64
NUM_HEADS = 4
FF_DIM = 128
NUM_LAYERS = 2 # Number of encoder/decoder stacks
DROPOUT_RATE = 0.1

Data Preparation: We’ll use a simple character-level reversal task (like the LSTM example) to demonstrate the architecture.

# Create vocabulary and tokenizers
def create_vocab():
    chars = string.ascii_lowercase + ' '
    char_to_index = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a'..'z' and ' ' -> 1..27
    # Special tokens get indices of their own, so they never collide with the characters above
    char_to_index.update({'<SOS>': 0, '<EOS>': 28, '<PAD>': 29})
    index_to_char = {i: ch for ch, i in char_to_index.items()}
    return char_to_index, index_to_char

CHAR_TO_INDEX, INDEX_TO_CHAR = create_vocab()
VOCAB_SIZE = len(CHAR_TO_INDEX)

# Generate simple data pairs
def generate_data(num_samples=1000):
    data = []
    for _ in range(num_samples):
        length = random.randint(3, MAX_SEQUENCE_LENGTH - 2)
        input_seq = ''.join(random.choices(string.ascii_lowercase, k=length))

        # Tokenize and pad input
        input_tokens = [CHAR_TO_INDEX[ch] for ch in input_seq]
        input_padded = input_tokens + [CHAR_TO_INDEX['<PAD>']] * (MAX_SEQUENCE_LENGTH - len(input_tokens))

        # Tokenize and pad output (reversed with SOS/EOS)
        output_seq = input_seq[::-1]
        output_tokens = [CHAR_TO_INDEX['<SOS>']] + [CHAR_TO_INDEX[ch] for ch in output_seq] + [CHAR_TO_INDEX['<EOS>']]
        output_padded = output_tokens + [CHAR_TO_INDEX['<PAD>']] * (MAX_SEQUENCE_LENGTH + 2 - len(output_tokens))

        # We need 2 outputs for training: the shifted input and the target output
        encoder_input = input_padded[:MAX_SEQUENCE_LENGTH]
        decoder_input = output_padded[:MAX_SEQUENCE_LENGTH]        # Shifted target
        decoder_target = output_padded[1:MAX_SEQUENCE_LENGTH + 1]  # Target output

        data.append((encoder_input, decoder_input, decoder_target))

    encoder_inputs, decoder_inputs, decoder_targets = zip(*data)

    return (np.array(encoder_inputs), np.array(decoder_inputs), np.array(decoder_targets))

X_train, Y_shifted_train, Y_target_train = generate_data()

Positional Embedding Layer: This custom layer adds position information to the token embeddings. For simplicity, it uses a learned position Embedding rather than the fixed sinusoidal encoding from the original paper.

class PositionalEmbedding(Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = Embedding(vocab_size, embed_dim)
        self.position_embeddings = Embedding(sequence_length, embed_dim)
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim

    def call(self, inputs):
        # inputs is the token index sequence (batch_size, sequence_length)
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)

        # Add the 2 embeddings
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        # We need to compute a mask for the <PAD> tokens
        mask = tf.not_equal(inputs, CHAR_TO_INDEX['<PAD>'])
        return mask

Transformer Encoder Layer: A single layer of the Encoder stack, containing Multi-Head Attention and the Feed-Forward Network.

class TransformerEncoderLayer(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, mask=None):
        # Multi-Head Self Attention
        attn_output = self.att(inputs, inputs, attention_mask=mask)
        attn_output = self.dropout1(attn_output)

        # Add & Normalize 1
        out1 = self.layernorm1(inputs + attn_output)

        # Feed Forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)

        # Add & Normalize 2
        return self.layernorm2(out1 + ffn_output)

Transformer Decoder Layer: A single layer of the Decoder stack, including Masked Self-Attention and Cross-Attention.

class TransformerDecoderLayer(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # Masked Multi-Head Self Attention
        self.att1 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Encoder-Decoder (Cross) Attention
        self.att2 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
        self.dropout3 = Dropout(rate)

    def call(self, inputs, encoder_outputs, mask=None):
        # inputs: Decoder input (shifted target)
        # encoder_outputs: Output from the Encoder stack
        # mask: padding mask over the encoder positions, used by cross-attention

        # 1. Masked Multi-Head Self Attention
        # use_causal_mask=True (available in TF 2.10+) blocks attention to future decoder positions
        attn_output1 = self.att1(inputs, inputs, use_causal_mask=True)
        attn_output1 = self.dropout1(attn_output1)
        out1 = self.layernorm1(inputs + attn_output1)  # Add & Norm 1

        # 2. Encoder-Decoder (Cross) Attention (K, V from encoder_outputs, Q from decoder out1)
        attn_output2 = self.att2(out1, encoder_outputs, attention_mask=mask)
        attn_output2 = self.dropout2(attn_output2)
        out2 = self.layernorm2(out1 + attn_output2)  # Add & Norm 2

        # 3. Feed Forward
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output)

        return self.layernorm3(out2 + ffn_output)  # Add & Norm 3

Building and Training the Transformer Model — The main function to construct and compile the full Encoder-Decoder model.

def build_transformer(max_seq_len, vocab_size, embed_dim, num_heads, ff_dim, num_layers):
    # --- Input Layers ---
    encoder_inputs = Input(shape=(max_seq_len,), dtype=tf.int32, name="encoder_inputs")
    decoder_inputs = Input(shape=(max_seq_len,), dtype=tf.int32, name="decoder_inputs")

    # --- ENCODER ---
    # Positional Embedding
    encoder_embedding_layer = PositionalEmbedding(max_seq_len, vocab_size, embed_dim)
    encoder_outputs = encoder_embedding_layer(encoder_inputs)

    # Padding mask over encoder positions, expanded to (batch, 1, seq_len) so it
    # broadcasts over the query positions inside MultiHeadAttention
    encoder_mask = encoder_embedding_layer.compute_mask(encoder_inputs)
    encoder_padding_mask = tf.expand_dims(encoder_mask, axis=1)

    # Encoder Stacks
    for i in range(num_layers):
        encoder_outputs = TransformerEncoderLayer(embed_dim, num_heads, ff_dim, rate=DROPOUT_RATE)(
            encoder_outputs, mask=encoder_padding_mask
        )

    # --- DECODER ---
    # Positional Embedding
    decoder_embedding_layer = PositionalEmbedding(max_seq_len, vocab_size, embed_dim)
    decoder_outputs = decoder_embedding_layer(decoder_inputs)

    # Decoder Stacks
    for i in range(num_layers):
        # The causal (masked) self-attention is handled inside TransformerDecoderLayer via
        # use_causal_mask=True; the mask passed here is the encoder padding mask for cross-attention
        decoder_outputs = TransformerDecoderLayer(embed_dim, num_heads, ff_dim, rate=DROPOUT_RATE)(
            decoder_outputs, encoder_outputs, mask=encoder_padding_mask
        )

    # --- Output Layers ---
    # Linear and Softmax (Dense output layer)
    decoder_outputs = Dense(vocab_size, activation="softmax")(decoder_outputs)

    # Define the final model
    transformer = keras.Model(
        inputs=[encoder_inputs, decoder_inputs],
        outputs=decoder_outputs,
        name="transformer"
    )

    # Compile the model
    transformer.compile(
        "adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

    return transformer

# Build and train the model
transformer_model = build_transformer(MAX_SEQUENCE_LENGTH, VOCAB_SIZE, EMBED_DIM, NUM_HEADS, FF_DIM, NUM_LAYERS)
transformer_model.summary()

# Training (Note: For this simple data, a few epochs are sufficient)
# transformer_model.fit(
#     [X_train, Y_shifted_train],
#     Y_target_train,
#     batch_size=32,
#     epochs=20
# )

Creating a Function to Translate the Sentence — The inference function must be sequential because the decoder must generate one word, then use that word as the input for the next step (autoregressive decoding).

def translate_sentence_transformer(model, input_text):
    # 1. Prepare Encoder Input
    input_tokens = [CHAR_TO_INDEX[ch] for ch in input_text if ch in CHAR_TO_INDEX]
    input_padded = input_tokens + [CHAR_TO_INDEX['<PAD>']] * (MAX_SEQUENCE_LENGTH - len(input_tokens))

    encoder_input = np.array([input_padded])  # (1, MAX_SEQUENCE_LENGTH)

    # 2. Prepare Decoder Input (Starts with <SOS>)
    decoder_input_tokens = [CHAR_TO_INDEX['<SOS>']]

    for i in range(MAX_SEQUENCE_LENGTH - 1):
        # Pad the current decoder input to the max sequence length
        decoder_input_padded = decoder_input_tokens + [CHAR_TO_INDEX['<PAD>']] * (MAX_SEQUENCE_LENGTH - len(decoder_input_tokens))
        decoder_input = np.array([decoder_input_padded])  # (1, MAX_SEQUENCE_LENGTH)

        # 3. Predict the next word's probabilities
        predictions = model.predict([encoder_input, decoder_input])

        # Get the prediction for the *current* time step (the last generated token position)
        predicted_token_index = np.argmax(predictions[0, i, :])

        # 4. Check for <EOS>
        if predicted_token_index == CHAR_TO_INDEX['<EOS>']:
            break

        # 5. Append the predicted word for the next loop iteration
        decoder_input_tokens.append(predicted_token_index)

    # 6. Convert token indices back to characters (excluding <SOS>)
    translated_text = "".join([INDEX_TO_CHAR[i] for i in decoder_input_tokens[1:]])
    return translated_text

# Example of use (after training the model)
# print(f"Input: hello, Output: {translate_sentence_transformer(transformer_model, 'hello')}")

The “Encoder-Decoder” architecture, like the original Transformer, is designed for sequence-to-sequence tasks where the input and output sequences are distinct (Ex: Translation). However, many modern and highly successful models simplify this into Encoder-only or Decoder-only structures, each suited for different tasks.

Encoder-Only Models (e.g., BERT, RoBERTa)
Encoder-only models are designed to generate a rich, contextual representation of the input text. They excel at understanding a sequence rather than generating a new one.

An Encoder-only model consists solely of the Transformer Encoder stack. Its core mechanism is bidirectional self-attention, meaning every token attends to all other tokens (both before and after it) in the input sequence. This allows the model to compute a deep, non-directional understanding of the context for every single word.
Goal: To output a contextualized embedding for every input token.
Training: Typically trained using Masked Language Modeling (MLM), where the model must predict randomly masked tokens based on the full, unmasked context of the surrounding words. This forces the model to learn deep contextual relationships.

Ex [BERT]: Consider the sentence: “The river bank was muddy.”

When the Encoder-only model processes the word “bank,” its self-attention mechanism looks both backward (at “river”) and forward (at “was muddy”).

  • If the model sees “river,” it knows “bank” likely refers to the geographical shore.
  • If the sentence were “The money bank was open,” it would see “money” and know “bank” refers to the financial institution.

The model outputs a vector for “bank” that already contains the precise contextual meaning.
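If you want to see Masked Language Modeling in action, a quick sketch using the Hugging Face transformers library (an extra dependency, not used elsewhere in this article) with a pretrained BERT looks like this:

from transformers import pipeline

# Masked Language Modeling with a pretrained encoder-only model (downloads weights on first run)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both sides of the context ("river" and "was muddy") to fill in the blank
for prediction in fill_mask("The river [MASK] was muddy."):
    print(prediction["token_str"], round(prediction["score"], 3))

The top predictions typically relate to riverbanks and riverbeds, which only makes sense if the model is reading the words on both sides of the masked position.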

Use Cases: Encoder-only models are used for tasks that involve deeply analyzing the input text:

  • Sentiment Analysis: Classifying a text as positive, negative, or neutral.
  • Named Entity Recognition (NER): Identifying and classifying entities (people, places, organizations) in text.
  • Question Answering (Extractive): Finding the exact answer span within a given document.
  • Text Classification: Categorizing documents (e.g., topic labeling).

Decoder-Only Models (e.g., GPT, Llama)
Decoder-only models are designed for generative tasks. They excel at predicting the next token in a sequence, effectively learning the grammar, style, and content of human language.

A Decoder-only model consists solely of the Transformer Decoder stack, typically keeping just the masked self-attention and FFN sub-layers, since there is no Encoder output to cross-attend to. The critical difference from the Encoder is the use of causal (or masked) self-attention.
Goal: To generate text one token at a time, predicting token t based only on the tokens generated before it (tokens 1 through t−1).
Training: Typically trained using Causal Language Modeling (CLM), where the model tries to predict the next word given all previous words in a sequence. This is a highly efficient way to learn language structure.
Ex [GPT]: When a Decoder-only model is prompted with “The quick brown fox,” it generates text in a sequential, left-to-right manner:

  • Input: “The quick brown fox”
  • Output 1: Predicts “jumps” (based on “The quick brown fox”).
  • Input: “The quick brown fox jumps”
  • Output 2: Predicts “over” (based on “The quick brown fox jumps”).
  • …and so on, until it generates an <EOS> (End of Sequence) token.

The masked attention ensures that the model can never peek ahead, making it ideal for simulating human text generation.
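A similar quick sketch of this left-to-right generation with a pretrained decoder-only model (again assuming the Hugging Face transformers library, here with its PyTorch backend; greedy decoding is used purely for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the prompt; the causal mask means each step only sees the prompt plus previously generated tokens
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Greedy decoding: at every step, pick the single most probable next token
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))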

Use Cases: Decoder-only models are used for tasks that involve creating new content:

  • Text Generation: Writing articles, stories, and emails.
  • Chatbots and Dialogue Systems: Holding multi-turn conversations and generating context-aware responses.
  • Code Generation: Writing or completing programming code snippets.
  • Zero-shot/Few-shot Prompting: Utilizing a single prompt to guide a model to perform various tasks (translation, summarization, Q/A) without explicit training for those tasks.

Thank you for following along on this detailed journey through modern Natural Language Processing architectures, from the foundational steps of tokenization and the internal memory of LSTMs to the parallel might of the Transformer.

Please feel free to clap 👏 and provide any feedback you have.

See you later in the next article!

Published via Towards AI