
🔎 Decoding LLM Pipeline – Step 1: Input Processing & Tokenization
Last Updated on March 12, 2025 by Editorial Team
Author(s): Ecem Karaman
Originally published on Towards AI.
🔹 From Raw Text to Model-Ready Input
In my previous post, I laid out the 8-step LLM pipeline, decoding how large language models (LLMs) process language behind the scenes. Now, let's zoom in, starting with Step 1: Input Processing.
In this post, I'll explore exactly how raw text transforms into structured numeric inputs that LLMs can understand, diving into text cleaning, tokenization methods, numeric encoding, and chat structuring. This step is often overlooked, but it's crucial because the quality of input encoding directly affects the model's output.
🧩 1. Text Cleaning & Normalization (Raw Text → Pre-Processed Text)
Goal: Raw user input → standardized, clean text for accurate tokenization.
📌 Why Text Cleaning & Normalization?
- Raw input text is often messy (typos, casing, punctuation, emojis) → normalization ensures consistency.
- Essential prep step → reduces tokenization errors, ensuring better downstream performance.
- Normalization Trade-off: GPT models preserve formatting & nuance (more token complexity); BERT aggressively cleans text → simpler tokens, reduced nuance, ideal for structured tasks (see the comparison snippet after the code below).
🔍 Technical Details (Behind-the-Scenes)
- Unicode normalization (NFKC/NFC) → standardizes characters that look identical but have different underlying code points (e.g., "é" as a single code point vs. "e" plus a combining accent).
- Case folding (lowercasing) → reduces vocab size, standardizes representation.
- Whitespace normalization → removes unnecessary spaces, tabs, line breaks.
- Punctuation normalization (consistent punctuation usage).
- Contraction handling ("don't" → "do not", or kept intact, depending on model requirements). GPT tokenizers typically preserve contractions; BERT-based models may split them.
- Special character handling (emojis, accents, punctuation).
import unicodedata
import re

def clean_text(text):
    text = text.lower()                         # Case folding (lowercasing)
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = re.sub(r"\s+", " ", text).strip()    # Collapse extra whitespace
    return text

raw_text = "Hello! How's it going? 😊"
cleaned_text = clean_text(raw_text)
print(cleaned_text)  # hello! how's it going? 😊
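To see the normalization trade-off above in practice, here is a minimal sketch (assuming the Hugging Face transformers library and the public gpt2 and bert-base-uncased checkpoints): BERT's uncased tokenizer lowercases text as part of its own preprocessing, while GPT-2's tokenizer keeps the input as-is.

from transformers import AutoTokenizer

# bert-base-uncased applies its own lowercasing normalizer before splitting;
# gpt2 keeps the raw text untouched (casing and punctuation preserved).
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello World"
print(bert_tokenizer.tokenize(text))  # ['hello', 'world'] -> lowercased by the tokenizer itself
print(gpt2_tokenizer.tokenize(text))  # ['Hello', 'ĠWorld'] -> casing preserved; Ġ marks the leading space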
🔡 2. Tokenization (Pre-Processed Text → Tokens)
Goal: Raw text → tokens (subwords, words, or characters).
Tokenization directly impacts model quality & efficiency.
📌 Why Tokenization?
- Models can't read raw text directly → must convert it to discrete units (tokens).
- Tokens: Fundamental unit that neural networks process.
Example: "interesting" → ["interest", "ing"]
🔍 Behind the Scenes
Tokenization involves:
- Mapping text → tokens based on a predefined vocabulary.
- Whitespace and punctuation normalization (e.g., spaces → special markers like Ġ).
- Segmenting unknown words into known subwords.
- Balancing vocabulary size & computational efficiency.
- Can be deterministic (fixed rules) or probabilistic (adaptive segmenting)
🔹 Tokenizer Types & Core Differences
✅ Subword Tokenization (BPE, WordPiece, Unigram) is most common in modern LLMs due to balanced efficiency and accuracy.
Types of Subword Tokenizers:
- Byte Pair Encoding (BPE): Iteratively merges frequent character pairs (GPT models).
- Byte-Level BPE: BPE, but operates at the byte level, allowing better tokenization of non-English text (GPT-4, LLaMA-2/3)
- WordPiece: Optimizes splits based on likelihood in training corpus (BERT).
- Unigram: Iteratively removes unlikely tokens to converge on an optimal vocabulary (T5).
- SentencePiece: A tokenizer framework (implementing BPE and Unigram) that works on raw text directly and is whitespace-aware (DeepSeek, multilingual models).
- GPT-4 and GPT-3.5 use BPE → good balance of vocabulary size and performance.
- BERT uses WordPiece → more structured subword approach; slightly different handling of unknown words.
📌 The core tokenizer algorithms are public, but specific models often use customized versions of them (e.g., BPE defines how text gets split, but GPT models use their own trained variant of BPE). These model-specific tokenizer customizations optimize performance.
# GPT-2 (BPE) Example
from transformers import AutoTokenizer
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer_gpt2.tokenize("Let's learn about LLMs!")
print(tokens)
# ['Let', "'s", 'Ġlearn', 'Ġabout', 'ĠLL', 'Ms', '!']
# The Ġ prefix indicates that the token is preceded by whitespace
# OpenAI GPT-4 tokenizer example (via tiktoken library)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Let's learn about LLMs!")
print(tokens) # Numeric IDs of tokens
print(encoding.decode(tokens)) # Decoded text
🔢 3. Numerical Encoding (Tokens → Token IDs)
Goal: Convert tokens into unique numerical IDs.
- LLMs don't process text directly → they operate on numbers; tokens themselves are still text-based units.
- Every token has a unique integer representation in the modelβs vocabulary.
- Token IDs (integers) enable efficient tensor operations and computations inside neural layers.
🔍 Behind the Scenes
Vocabulary lookup tables efficiently map tokens → unique integers (token IDs).
- Vocabulary size defines model constraints (memory usage & performance); e.g., GPT-2/GPT-3 use ~50K tokens, while GPT-4's cl100k_base vocabulary is ~100K:
  - Small vocabulary: fewer parameters, less memory, but more token splits.
  - Large vocabulary: richer context, higher precision, but increased computational cost.
- Lookup tables are hash maps: Allow constant-time token-to-ID conversions (O(1) complexity).
- Special tokens (e.g., [PAD], <EOS>, [CLS]) have reserved IDs → standardized input format.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("LLMs decode text.")
print("Tokens:", tokens) # Tokens: ['LL', 'Ms', 'Δ decode', 'Δ text', '.']
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids) # Token IDs: [28614, 12060, 35120, 1499, 13]
📜 4. Formatting Input for LLMs (Token IDs → Chat Templates)
Goal: Structure tokenized input for conversational models (multi-turn chat).
- Why: LLMs like GPT-4, Claude, LLaMA expect input structured into roles (system, user, assistant).
- Behind-the-scenes: Models use specific formatting and special tokens → maintain conversation context and roles (see the sketch below).
🔍 Behind the Scenes
Chat Templates Provide:
- Role Identification: Clearly separates system instructions, user inputs, and assistant responses.
- Context Management: Retains multi-turn conversation history → better response coherence.
- Structured Input: Each message wrapped with special tokens or structured JSON → helps the model distinguish inputs clearly.
- Metadata (optional): May include timestamps, speaker labels, or token counts per speaker (for advanced models).
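Libraries make this formatting explicit. As a minimal sketch (assuming the Hugging Face transformers library; the Zephyr checkpoint is just one example of a model that ships a chat template, and the rendered special tokens differ per model):

from transformers import AutoTokenizer

# Any model that ships a chat template works here; Zephyr is only an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain tokenization in one sentence."},
]

# Render the conversation into the model's expected chat format (special tokens included)
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

# Or go straight to model-ready token IDs
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
print(input_ids.shape)  # (1, sequence_length)

The key point: the template, not the user, inserts the role markers and special tokens, so the same messages list can be reused across different chat models.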
📐 5. Model Input Encoding (Structured Text → Tensors)
Goal: Convert numeric token IDs → structured numeric arrays (tensors) compatible with GPU-based neural computation.
✅ Why Tensors?
- Neural networks expect numeric arrays (tensors) with uniform dimensions (batch size Γ sequence length), not simple lists of integers.
- Token IDs alone = discrete integers; tensor arrays add structure & context (padding, masks).
- Proper padding, truncation, and batching → directly affect model efficiency & performance.
🔍 Technical Details (Behind-the-Scenes)
- Padding: Adds special tokens ([PAD]) to shorter sequences → uniform tensor shapes.
- Truncation: Removes excess tokens from long inputs → ensures compatibility with fixed context windows (e.g., GPT-2: 1024 tokens).
- Attention Masks: Binary tensors distinguishing real tokens (1) vs. padding tokens (0) → prevent the model from attending to padding tokens during computation.
- Tensor Batching: Combines multiple inputs into batches → optimized parallel computation on GPU.
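Putting these steps together, here is a minimal sketch using the Hugging Face tokenizer API with PyTorch tensors (GPT-2 ships no pad token, so reusing its EOS token for padding is a common workaround and an assumption made for this example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no [PAD] token; reuse EOS for padding

batch = ["LLMs decode text.", "Tokenization turns raw text into model-ready tensors."]

encoded = tokenizer(
    batch,
    padding=True,         # pad shorter sequences to the longest in the batch
    truncation=True,      # drop tokens beyond the context window
    max_length=1024,      # GPT-2's fixed context size
    return_tensors="pt",  # return PyTorch tensors shaped (batch_size, seq_len)
)

print(encoded["input_ids"].shape)  # torch.Size([2, seq_len])
print(encoded["attention_mask"])   # 1 = real token, 0 = padding token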
🔍 Key Takeaways
✅ Input processing is more than just tokenization: it includes text cleaning, tokenization, numerical encoding, chat structuring, and final model input formatting.
✅ Tokenizer type shapes model trade-offs: BPE (GPT), WordPiece (BERT), Unigram (T5); the choice affects vocabulary size, speed, and complexity.
✅ Chat-based models rely on structured formatting (chat templates), which directly impacts coherence, relevance, and conversation flow.
✅ Token IDs → tensors is critical: it ensures numeric compatibility for efficient neural processing.
📖 Next Up: Step 2 – Neural Network Processing
Now that we've covered how raw text becomes structured model input, the next post will break down how the neural network processes this input to generate meaning, covering embedding layers, attention mechanisms, and more.
If you've enjoyed this article:
💻 Check out my GitHub for projects on AI/ML, cybersecurity, and Python
🔗 Connect with me on LinkedIn to chat about all things AI
💡 Thoughts? Questions? Let's discuss! 🚀
Published via Towards AI