
🔎 Decoding LLM Pipeline – Step 1: Input Processing & Tokenization
Last Updated on March 12, 2025 by Editorial Team
Author(s): Ecem Karaman
Originally published on Towards AI.
🔹 From Raw Text to Model-Ready Input
In my previous post, I laid out the 8-step LLM pipeline, decoding how large language models (LLMs) process language behind the scenes. Now, let's zoom in, starting with Step 1: Input Processing.
In this post, I'll explore exactly how raw text transforms into structured numeric inputs that LLMs can understand, diving into text cleaning, tokenization methods, numeric encoding, and chat structuring. This step is often overlooked, but it's crucial because the quality of input encoding directly affects the model's output.
🧩 1. Text Cleaning & Normalization (Raw Text → Pre-Processed Text)
Goal: Raw user input → standardized, clean text for accurate tokenization.
📌 Why Text Cleaning & Normalization?
- Raw input text is often messy (typos, casing, punctuation, emojis) → normalization ensures consistency.
- Essential prep step → reduces tokenization errors, ensuring better downstream performance.
- Normalization Trade-off: GPT models preserve formatting & nuance (more token complexity); BERT aggressively cleans text → simpler tokens, reduced nuance, ideal for structured tasks (see the comparison snippet after the code below).
🔍 Technical Details (Behind-the-Scenes)
- Unicode normalization (NFKC/NFC) → standardizes characters that look identical but have different underlying code points (e.g., "é" as a single code point vs. "e" plus a combining accent).
- Case folding (lowercasing) → reduces vocab size, standardizes representation.
- Whitespace normalization → removes unnecessary spaces, tabs, line breaks.
- Punctuation normalization (consistent punctuation usage).
- Contraction handling ("don't" → "do not", or kept intact, depending on model requirements). GPT tokenizers typically preserve contractions; BERT-based models may split them.
- Special character handling (emojis, accents, punctuation).
import unicodedata
import re

def clean_text(text):
    text = text.lower()                         # Case folding (lowercasing)
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = re.sub(r"\s+", " ", text).strip()    # Collapse extra whitespace
    return text

raw_text = "Hello! How's it going? 😊"
cleaned_text = clean_text(raw_text)
print(cleaned_text)  # hello! how's it going? 😊
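To see the normalization trade-off above in practice, here is a minimal sketch (assuming the Hugging Face transformers library and the public gpt2 and bert-base-uncased checkpoints): BERT's uncased tokenizer lowercases text as part of its own preprocessing, while GPT-2's tokenizer keeps the input as-is.

from transformers import AutoTokenizer

# bert-base-uncased applies its own lowercasing normalizer before splitting;
# gpt2 keeps the raw text untouched (casing and punctuation preserved).
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello World"
print(bert_tokenizer.tokenize(text))  # ['hello', 'world'] -> lowercased by the tokenizer itself
print(gpt2_tokenizer.tokenize(text))  # ['Hello', 'ĠWorld'] -> casing preserved; Ġ marks the leading space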
🔡 2. Tokenization (Pre-Processed Text → Tokens)
Goal: Raw text → tokens (subwords, words, or characters).
Tokenization directly impacts model quality & efficiency.
📌 Why Tokenization?
- Models can't read raw text directly → must convert it to discrete units (tokens).
- Tokens: Fundamental unit that neural networks process.
Example: "interesting" → ["interest", "ing"]
🔍 Behind the Scenes
Tokenization involves:
- Mapping text → tokens based on a predefined vocabulary.
- Whitespace and punctuation normalization (e.g., spaces → special markers like Ġ).
- Segmenting unknown words into known subwords.
- Balancing vocabulary size & computational efficiency.
- Can be deterministic (fixed rules) or probabilistic (adaptive segmenting)
🔹 Tokenizer Types & Core Differences
✅ Subword Tokenization (BPE, WordPiece, Unigram) is most common in modern LLMs due to balanced efficiency and accuracy.
Types of Subword Tokenizers:
- Byte Pair Encoding (BPE): Iteratively merges frequent character pairs (GPT models).
- Byte-Level BPE: BPE, but operates at the byte level, allowing better tokenization of non-English text (GPT-4, LLaMA-2/3)
- WordPiece: Optimizes splits based on likelihood in training corpus (BERT).
- Unigram: Iteratively removes unlikely tokens to converge on an optimal vocabulary (T5).
- SentencePiece: A tokenizer framework (implementing BPE and Unigram) that works on raw text directly and is whitespace-aware (DeepSeek, multilingual models).
- GPT-4 and GPT-3.5 use BPE → good balance of vocabulary size and performance.
- BERT uses WordPiece → more structured subword approach; slightly different handling of unknown words.
📌 The core tokenizer algorithms are public, but specific models often use customized versions of them (e.g., BPE defines how text gets split, but GPT models use their own trained variant of BPE). These model-specific tokenizer customizations optimize performance.
# GPT-2 (BPE) Example
from transformers import AutoTokenizer
tokenizer_gpt2 = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer_gpt2.tokenize("Let's learn about LLMs!")
print(tokens)
# ['Let', "'s", 'Ġlearn', 'Ġabout', 'ĠLL', 'Ms', '!']
# The Ġ prefix indicates that the token is preceded by whitespace
# OpenAI GPT-4 tokenizer example (via tiktoken library)
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Let's learn about LLMs!")
print(tokens) # Numeric IDs of tokens
print(encoding.decode(tokens)) # Decoded text
🔢 3. Numerical Encoding (Tokens → Token IDs)
Goal: Convert tokens into unique numerical IDs.
- LLMs don't process text directly → they operate on numbers; tokens themselves are still text-based units.
- Every token has a unique integer representation in the modelβs vocabulary.
- Token IDs (integers) enable efficient tensor operations and computations inside neural layers.
🔍 Behind the Scenes
Vocabulary lookup tables efficiently map tokens → unique integers (token IDs).
- Vocabulary size defines model constraints (memory usage & performance); e.g., GPT-2/GPT-3 use ~50K tokens, while GPT-4's cl100k_base vocabulary is ~100K:
  - Small vocabulary: fewer parameters, less memory, but more token splits.
  - Large vocabulary: richer context, higher precision, but increased computational cost.
- Lookup tables are hash maps: Allow constant-time token-to-ID conversions (O(1) complexity).
- Special tokens (e.g., [PAD], <EOS>, [CLS]) have reserved IDs → standardized input format.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("LLMs decode text.")
print("Tokens:", tokens) # Tokens: ['LL', 'Ms', 'Δ decode', 'Δ text', '.']
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids) # Token IDs: [28614, 12060, 35120, 1499, 13]
📜 4. Formatting Input for LLMs (Token IDs → Chat Templates)
Goal: Structure tokenized input for conversational models (multi-turn chat).
- Why: LLMs like GPT-4, Claude, LLaMA expect input structured into roles (system, user, assistant).
- Behind-the-scenes: Models use specific formatting and special tokens → maintain conversation context and roles (see the sketch below).
🔍 Behind the Scenes
Chat Templates Provide:
- Role Identification: Clearly separates system instructions, user inputs, and assistant responses.
- Context Management: Retains multi-turn conversation history → better response coherence.
- Structured Input: Each message wrapped with special tokens or structured JSON → helps the model distinguish inputs clearly.
- Metadata (optional): May include timestamps, speaker labels, or token counts per speaker (for advanced models).
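Libraries make this formatting explicit. As a minimal sketch (assuming the Hugging Face transformers library; the Zephyr checkpoint is just one example of a model that ships a chat template, and the rendered special tokens differ per model):

from transformers import AutoTokenizer

# Any model that ships a chat template works here; Zephyr is only an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain tokenization in one sentence."},
]

# Render the conversation into the model's expected chat format (special tokens included)
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)

# Or go straight to model-ready token IDs
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
print(input_ids.shape)  # (1, sequence_length)

The key point: the template, not the user, inserts the role markers and special tokens, so the same messages list can be reused across different chat models.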
📐 5. Model Input Encoding (Structured Text → Tensors)
Goal: Convert numeric token IDs → structured numeric arrays (tensors) compatible with GPU-based neural computation.
✅ Why Tensors?
- Neural networks expect numeric arrays (tensors) with uniform dimensions (batch size Γ sequence length), not simple lists of integers.
- Token IDs alone = discrete integers; tensor arrays add structure & context (padding, masks).
- Proper padding, truncation, and batching → directly affect model efficiency & performance.
🔍 Technical Details (Behind-the-Scenes)
- Padding: Adds special tokens ([PAD]) to shorter sequences → uniform tensor shapes.
- Truncation: Removes excess tokens from long inputs → ensures compatibility with fixed context windows (e.g., GPT-2: 1024 tokens).
- Attention Masks: Binary tensors distinguishing real tokens (1) vs. padding tokens (0) → prevent the model from attending to padding tokens during computation.
- Tensor Batching: Combines multiple inputs into batches → optimized parallel computation on GPU.
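Putting these steps together, here is a minimal sketch using the Hugging Face tokenizer API with PyTorch tensors (GPT-2 ships no pad token, so reusing its EOS token for padding is a common workaround and an assumption made for this example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no [PAD] token; reuse EOS for padding

batch = ["LLMs decode text.", "Tokenization turns raw text into model-ready tensors."]

encoded = tokenizer(
    batch,
    padding=True,         # pad shorter sequences to the longest in the batch
    truncation=True,      # drop tokens beyond the context window
    max_length=1024,      # GPT-2's fixed context size
    return_tensors="pt",  # return PyTorch tensors shaped (batch_size, seq_len)
)

print(encoded["input_ids"].shape)  # torch.Size([2, seq_len])
print(encoded["attention_mask"])   # 1 = real token, 0 = padding token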
🔍 Key Takeaways
✅ Input processing is more than just tokenization: it includes text cleaning, tokenization, numerical encoding, chat structuring, and final model input formatting.
✅ Tokenizer type shapes model trade-offs: BPE (GPT), WordPiece (BERT), Unigram (T5); the choice affects vocabulary size, speed, and complexity.
✅ Chat-based models rely on structured formatting (chat templates), which directly impacts coherence, relevance, and conversation flow.
✅ Token IDs → tensors is critical: it ensures numeric compatibility for efficient neural processing.
📖 Next Up: Step 2 – Neural Network Processing
Now that we've covered how raw text becomes structured model input, the next post will break down how the neural network processes this input to generate meaning, covering embedding layers, attention mechanisms, and more.
If you've enjoyed this article:
💻 Check out my GitHub for projects on AI/ML, cybersecurity, and Python
🔗 Connect with me on LinkedIn to chat about all things AI
💡 Thoughts? Questions? Let's discuss! 🚀
Published via Towards AI