Unlock Hidden Data: How LLMs Extract Product Brand, Name & Size with AI Precision
Author(s): Taha Azizi
Originally published on Towards AI.
Imagine a world where your inventory lists magically standardize themselves. No more manual data entry, no more ambiguous product names, just clean, actionable insights. Sound like a dream? With the power of Large Language Models (LLMs), it’s rapidly becoming reality.
Every business dealing with products — from e-commerce giants to local grocery stores — grapples with a fundamental headache: unstructured product data. Think about it: supplier manifests, internal inventories, competitive listings… they all describe the same physical item in wildly different ways.
- “Coca-Cola Classic 12oz Cans 24-pack”
- “Coke Can, 12 oz, 24 Count”
- “Soda, Cola, 24×12 fl oz”
Extracting critical, structured information like the Brand, the Core Product Name, and the Size from these messy, free-form descriptions is a monumental task. Traditionally, this meant armies of data entry clerks, complex rule-based systems that broke with every new product variation, or endless, frustrating manual clean-up.
But what if you could teach an AI to understand context, identify patterns, and spit out perfectly structured data? That’s exactly what we’re doing.
The Problem: When “Long Names” Hide Data Gold
Our goal is simple: given a product’s “long name” (like the examples above), we want to extract:
- Product Core Name: The main type of product (e.g., “Cola,” “Chocolate Bar,” “Ground Coffee”).
- Product Brand: The manufacturer or brand name (e.g., “Coca-Cola,” “Snickers,” “Folgers”).
- Size: The quantity with units (e.g., “12oz,” “24-pack,” “30.5 oz”).
Why is this hard? Because humans understand context and nuance; machines, not so much. “12 oz” could be at the beginning, middle, or end. “Pack” could be “PK”, “count”, or implied. Brands might be abbreviated. This is where LLMs shine.
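To make that brittleness concrete, here’s a naive size-extraction regex run against the three listings above (a toy illustration, not code from the project):
Python
import re

listings = [
    "Coca-Cola Classic 12oz Cans 24-pack",
    "Coke Can, 12 oz, 24 Count",
    "Soda, Cola, 24×12 fl oz",
]

# First attempt: a number glued directly to "oz"
size_pattern = re.compile(r"\d+oz")

for listing in listings:
    match = size_pattern.search(listing)
    print(listing, "->", match.group() if match else "MISSED")
# Only the first listing matches; "12 oz" (with a space) and "12 fl oz"
# both slip through, and the pack count is ignored entirely.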
The AI Solution: Conversing with Your Data via LLMs
Instead of writing endless if-else statements or brittle regex patterns, we tap into the natural language understanding capabilities of LLMs. Our approach is surprisingly elegant: we simply ask the LLM to do the extraction for us.
Here’s the magic trick: Prompt Engineering.
We craft a precise instruction, a “prompt,” that guides the LLM to act as a data extractor. This prompt isn’t just a question; it’s a carefully structured set of rules and examples that tells the LLM exactly what we need and in what format.
Python
# The essence of our LLM prompt for extraction
prompt = f"""
Analyze this product and extract information: "{product_long_name}"
Extract and return ONLY valid JSON in this exact format:
{{"product_core_name": "main product type", "product_brand": "brand name", "size": "size with units"}}
Rules:
- product_core_name: Main product type without brand, size, or packaging.
- product_brand: Brand/manufacturer name or "Unknown".
- size: Size/quantity with units or "Unknown".
Examples:
- "Coca-Cola Classic 12oz Cans 24-pack" -> {{"product_core_name": "Cola", "product_brand": "Coca-Cola", "size": "12oz 24-pack"}}
- "Oreo Original Chocolate Sandwich Cookies 14.3oz" -> {{"product_core_name": "Chocolate Sandwich Cookies", "product_brand": "Oreo", "size": "14.3oz"}}
Return ONLY the JSON, no other text:
"""
# Using Ollama to interact with a local LLM (e.g., gemma2:27b)
# response = ollama.chat(model='gemma2:27b', messages=[{'role': 'user', 'content': prompt}])
# extracted_data = json.loads(response['message']['content'])
This prompt does several crucial things:
- Clearly defines the task: “Analyze this product and extract information.”
- Specifies the output format: “ONLY valid JSON in this exact format.” This is key for structured data.
- Provides clear rules: Each field’s definition is explicit.
- Gives examples: Few-shot examples (even just a couple) greatly improve the LLM’s understanding and consistency.
- Reinforces output constraints: “Return ONLY the JSON, no other text.” This minimizes conversational filler.
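Putting it all together, a minimal end-to-end call looks something like this (a sketch assuming the ollama Python package is installed and gemma2:27b has been pulled locally; the sample product name is illustrative):
Python
import json

import ollama

product_long_name = "Folgers Classic Roast Ground Coffee 30.5 oz"

prompt = f"""
Analyze this product and extract information: "{product_long_name}"
... (rest of the prompt shown above) ...
"""

# One call to the local model, then parse the JSON it returns
response = ollama.chat(model='gemma2:27b', messages=[{'role': 'user', 'content': prompt}])
extracted = json.loads(response['message']['content'])
print(extracted)
# e.g., {'product_core_name': 'Ground Coffee', 'product_brand': 'Folgers', 'size': '30.5 oz'}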
The Backend: main.py in Action
Our main.py script orchestrates this process. It reads your input CSV (e.g., Data_Internal.csv with a LONG_NAME column), iterates through each product, sends the prompt to a local Ollama LLM (such as gemma2:27b), parses the JSON response, and adds the extracted PRODUCT_CORE_NAME, PRODUCT_BRAND, and SIZE to new columns in your dataset.
Crucially, it includes a fallback mechanism. If the LLM’s response isn’t perfectly parsable JSON (which can happen occasionally), a simple rule-based regex extractor kicks in to ensure some data is still captured. This makes the system robust for real-world messy data.
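At a high level, that orchestration loop amounts to something like this (a sketch using pandas; the file and column names come from the article, the rest is assumption):
Python
import pandas as pd

df = pd.read_csv("Data_Internal.csv")

# Run the extraction once per row, then fan the resulting dict out into new columns
extracted = df["LONG_NAME"].apply(
    lambda name: extract_product_info(name, llm_model="gemma2:27b", llm_temperature=0.0)
)
df["PRODUCT_CORE_NAME"] = extracted.apply(lambda d: d.get("product_core_name", "Unknown"))
df["PRODUCT_BRAND"] = extracted.apply(lambda d: d.get("product_brand", "Unknown"))
df["SIZE"] = extracted.apply(lambda d: d.get("size", "Unknown"))

df.to_csv("Data_Internal_enriched.csv", index=False)
The heart of that loop is the extraction function itself: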
Python
# Simplified snippet from main.py showing the core extraction function
import json
import re

import ollama

def extract_product_info(long_name: str, llm_model: str, llm_temperature: float) -> dict:
    """Extract product information using a single Ollama LLM call with a prompt."""
    prompt = f""" # ... (full prompt as shown above) ... """
    try:
        # Pass the temperature through to the model (low values keep extraction deterministic)
        response = ollama.chat(
            model=llm_model,
            messages=[{'role': 'user', 'content': prompt}],
            options={'temperature': llm_temperature},
        )
        result = response['message']['content'].strip()
        # Robust JSON parsing: grab the first {...} block even if the model adds filler text
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        else:
            print(f"WARNING: No JSON found for {long_name[:50]}, using fallback.")
            return fallback_extraction(long_name)  # simple regex fallback
    except Exception as e:
        print(f"ERROR: LLM call or JSON parsing failed for '{long_name[:50]}...': {str(e)}")
        return fallback_extraction(long_name)  # fallback on error
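The snippet above leans on fallback_extraction, which the article doesn’t show; a crude rule-based version might look like this (a hedged sketch, not the project’s actual implementation):
Python
import re

def fallback_extraction(long_name: str) -> dict:
    """Heuristic fallback: pull out a size-looking token and treat the
    first word as the brand. Deliberately crude; the LLM is the primary path."""
    size_match = re.search(
        r'\d+(?:\.\d+)?[-\s]*(?:fl\s*oz|oz|lb|kg|g|ml|l|pack|count|ct|pk)\b',
        long_name, re.IGNORECASE)
    size = size_match.group() if size_match else "Unknown"

    words = long_name.split()
    brand = words[0] if words else "Unknown"
    # Everything after the brand, minus the size token, approximates the core name
    remainder = long_name[len(brand):]
    if size != "Unknown":
        remainder = remainder.replace(size, "")
    core = remainder.strip(" ,-") or "Unknown"

    return {"product_core_name": core, "product_brand": brand, "size": size}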
The Impact: Clean Data, Clear Decisions
The result? Your raw, unstructured product list is transformed into a clean, structured dataset, ready for:
- Accurate Product Matching: Now that you have consistent brand, name, and size fields, matching external lists to internal ones becomes dramatically easier (see the short matching sketch after this list).
- Enhanced Analytics: Understand product categories, sales by brand, and inventory by size.
- Improved Inventory Management: No more ordering duplicates because “Coke Can” and “Coca-Cola Soda” weren’t recognized as the same item.
- Better Customer Experience: Consistent product information across all touchpoints.
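For example, once both lists carry the same three structured columns, matching can be a normalized join (a sketch; the external file name and normalization rules are assumptions):
Python
import pandas as pd

internal = pd.read_csv("Data_Internal_enriched.csv")
external = pd.read_csv("Data_External_enriched.csv")  # same extraction applied to the other list

def add_match_keys(df: pd.DataFrame) -> pd.DataFrame:
    # Lowercase and strip spaces so "12 oz" and "12oz" line up
    for col in ("PRODUCT_BRAND", "PRODUCT_CORE_NAME", "SIZE"):
        df[col + "_KEY"] = df[col].str.lower().str.replace(" ", "", regex=False)
    return df

matched = add_match_keys(internal).merge(
    add_match_keys(external),
    on=["PRODUCT_BRAND_KEY", "PRODUCT_CORE_NAME_KEY", "SIZE_KEY"],
    suffixes=("_internal", "_external"),
)
print(f"{len(matched)} products matched across the two lists")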
Get Started Today!
This project demonstrates a practical, effective way to leverage LLMs for a common data challenge. If you’re struggling with messy textual data, this pattern can be applied to countless other extraction tasks.
You can explore the full code and run it yourself! The project is available on GitHub:
https://github.com/Taha-azizi/Product_Info_Extractor

Set up Ollama, pull an open-source model, and unleash the power of AI on your unstructured data. The future of data management is here, and it’s powered by intelligent language models.