Unlock Hidden Data: How LLMs Extract Product Brand, Name & Size with AI Precision
Author(s): Taha Azizi
Originally published on Towards AI.
Imagine a world where your inventory lists magically standardize themselves. No more manual data entry, no more ambiguous product names, just clean, actionable insights. Sound like a dream? With the power of Large Language Models (LLMs), it’s rapidly becoming reality.
Every business dealing with products — from e-commerce giants to local grocery stores — grapples with a fundamental headache: unstructured product data. Think about it: supplier manifests, internal inventories, competitive listings… they all describe the same physical item in wildly different ways.
- “Coca-Cola Classic 12oz Cans 24-pack”
- “Coke Can, 12 oz, 24 Count”
- “Soda, Cola, 24×12 fl oz”
Extracting critical, structured information like the Brand, the Core Product Name, and the Size from these messy, free-form descriptions is a monumental task. Traditionally, this meant armies of data entry clerks, complex rule-based systems that broke with every new product variation, or endless, frustrating manual clean-up.
But what if you could teach an AI to understand context, identify patterns, and spit out perfectly structured data? That’s exactly what we’re doing.
The Problem: When “Long Names” Hide Data Gold
Our goal is simple: given a product’s “long name” (like the examples above), we want to extract:
- Product Core Name: The main type of product (e.g., “Cola,” “Chocolate Bar,” “Ground Coffee”).
- Product Brand: The manufacturer or brand name (e.g., “Coca-Cola,” “Snickers,” “Folgers”).
- Size: The quantity with units (e.g., “12oz,” “24-pack,” “30.5 oz”).
Why is this hard? Because humans understand context and nuance; machines, not so much. “12 oz” could be at the beginning, middle, or end. “Pack” could be “PK”, “count”, or implied. Brands might be abbreviated. This is where LLMs shine.
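To make that brittleness concrete, here’s a naive size-extraction regex run against the three listings above (a toy illustration, not code from the project):
Python
import re

listings = [
    "Coca-Cola Classic 12oz Cans 24-pack",
    "Coke Can, 12 oz, 24 Count",
    "Soda, Cola, 24×12 fl oz",
]

# First attempt: a number glued directly to "oz"
size_pattern = re.compile(r"\d+oz")

for listing in listings:
    match = size_pattern.search(listing)
    print(listing, "->", match.group() if match else "MISSED")
# Only the first listing matches; "12 oz" (with a space) and "12 fl oz"
# both slip through, and the pack count is ignored entirely.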
The AI Solution: Conversing with Your Data via LLMs
Instead of writing endless if-else statements or brittle regex patterns, we tap into the natural language understanding capabilities of LLMs. Our approach is surprisingly elegant: we simply ask the LLM to do the extraction for us.
Here’s the magic trick: Prompt Engineering.
We craft a precise instruction, a “prompt,” that guides the LLM to act as a data extractor. This prompt isn’t just a question; it’s a carefully structured set of rules and examples that tells the LLM exactly what we need and in what format.
Python
# The essence of our LLM prompt for extraction
prompt = f"""
Analyze this product and extract information: "{product_long_name}"
Extract and return ONLY valid JSON in this exact format:
{{"product_core_name": "main product type", "product_brand": "brand name", "size": "size with units"}}
Rules:
- product_core_name: Main product type without brand, size, or packaging.
- product_brand: Brand/manufacturer name or "Unknown".
- size: Size/quantity with units or "Unknown".
Examples:
- "Coca-Cola Classic 12oz Cans 24-pack" -> {{"product_core_name": "Cola", "product_brand": "Coca-Cola", "size": "12oz 24-pack"}}
- "Oreo Original Chocolate Sandwich Cookies 14.3oz" -> {{"product_core_name": "Chocolate Sandwich Cookies", "product_brand": "Oreo", "size": "14.3oz"}}
Return ONLY the JSON, no other text:
"""
# Using Ollama to interact with a local LLM (e.g., gemma2:27b)
# response = ollama.chat(model='gemma2:27b', messages=[{'role': 'user', 'content': prompt}])
# extracted_data = json.loads(response['message']['content'])
This prompt does several crucial things:
- Clearly defines the task: “Analyze this product and extract information.”
- Specifies the output format: “ONLY valid JSON in this exact format.” This is key for structured data.
- Provides clear rules: Each field’s definition is explicit.
- Gives examples: Few-shot examples (even just a couple) greatly improve the LLM’s understanding and consistency.
- Reinforces output constraints: “Return ONLY the JSON, no other text.” This minimizes conversational filler.
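Putting it all together, a minimal end-to-end call looks something like this (a sketch assuming the ollama Python package is installed and gemma2:27b has been pulled locally; the sample product name is illustrative):
Python
import json

import ollama

product_long_name = "Folgers Classic Roast Ground Coffee 30.5 oz"

prompt = f"""
Analyze this product and extract information: "{product_long_name}"
... (rest of the prompt shown above) ...
"""

# One call to the local model, then parse the JSON it returns
response = ollama.chat(model='gemma2:27b', messages=[{'role': 'user', 'content': prompt}])
extracted = json.loads(response['message']['content'])
print(extracted)
# e.g., {'product_core_name': 'Ground Coffee', 'product_brand': 'Folgers', 'size': '30.5 oz'}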
The Backend: main.py in Action
Our main.py script orchestrates this process. It reads your input CSV (e.g., Data_Internal.csv with a LONG_NAME column), iterates through each product, sends the prompt to a local Ollama LLM (such as gemma2:27b), parses the JSON response, and adds the extracted PRODUCT_CORE_NAME, PRODUCT_BRAND, and SIZE to new columns in your dataset.
Crucially, it includes a fallback mechanism. If the LLM’s response isn’t perfectly parsable JSON (which can happen occasionally), a simple rule-based regex extractor kicks in to ensure some data is still captured. This makes the system robust for real-world messy data.
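At a high level, that orchestration loop amounts to something like this (a sketch using pandas; the file and column names come from the article, the rest is assumption):
Python
import pandas as pd

df = pd.read_csv("Data_Internal.csv")

# Run the extraction once per row, then fan the resulting dict out into new columns
extracted = df["LONG_NAME"].apply(
    lambda name: extract_product_info(name, llm_model="gemma2:27b", llm_temperature=0.0)
)
df["PRODUCT_CORE_NAME"] = extracted.apply(lambda d: d.get("product_core_name", "Unknown"))
df["PRODUCT_BRAND"] = extracted.apply(lambda d: d.get("product_brand", "Unknown"))
df["SIZE"] = extracted.apply(lambda d: d.get("size", "Unknown"))

df.to_csv("Data_Internal_enriched.csv", index=False)
The heart of that loop is the extraction function itself: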
Python
# Simplified snippet from main.py showing the core extraction function
import json
import re

import ollama

def extract_product_info(long_name: str, llm_model: str, llm_temperature: float) -> dict:
    """Extract product information using a single Ollama LLM call with a prompt."""
    prompt = f""" # ... (full prompt as shown above) ... """
    try:
        # Pass the temperature through to the model (low values keep extraction deterministic)
        response = ollama.chat(
            model=llm_model,
            messages=[{'role': 'user', 'content': prompt}],
            options={'temperature': llm_temperature},
        )
        result = response['message']['content'].strip()
        # Robust JSON parsing: grab the first {...} block even if the model adds filler text
        json_match = re.search(r'\{.*\}', result, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        else:
            print(f"WARNING: No JSON found for {long_name[:50]}, using fallback.")
            return fallback_extraction(long_name)  # simple regex fallback
    except Exception as e:
        print(f"ERROR: LLM call or JSON parsing failed for '{long_name[:50]}...': {str(e)}")
        return fallback_extraction(long_name)  # fallback on error
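The snippet above leans on fallback_extraction, which the article doesn’t show; a crude rule-based version might look like this (a hedged sketch, not the project’s actual implementation):
Python
import re

def fallback_extraction(long_name: str) -> dict:
    """Heuristic fallback: pull out a size-looking token and treat the
    first word as the brand. Deliberately crude; the LLM is the primary path."""
    size_match = re.search(
        r'\d+(?:\.\d+)?[-\s]*(?:fl\s*oz|oz|lb|kg|g|ml|l|pack|count|ct|pk)\b',
        long_name, re.IGNORECASE)
    size = size_match.group() if size_match else "Unknown"

    words = long_name.split()
    brand = words[0] if words else "Unknown"
    # Everything after the brand, minus the size token, approximates the core name
    remainder = long_name[len(brand):]
    if size != "Unknown":
        remainder = remainder.replace(size, "")
    core = remainder.strip(" ,-") or "Unknown"

    return {"product_core_name": core, "product_brand": brand, "size": size}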
The Impact: Clean Data, Clear Decisions
The result? Your raw, unstructured product list is transformed into a clean, structured dataset, ready for:
- Accurate Product Matching: Now that you have consistent brand, name, and size fields, matching external lists to internal ones becomes dramatically easier (see the short matching sketch after this list).
- Enhanced Analytics: Understand product categories, sales by brand, and inventory by size.
- Improved Inventory Management: No more ordering duplicates because “Coke Can” and “Coca-Cola Soda” weren’t recognized as the same item.
- Better Customer Experience: Consistent product information across all touchpoints.
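For example, once both lists carry the same three structured columns, matching can be a normalized join (a sketch; the external file name and normalization rules are assumptions):
Python
import pandas as pd

internal = pd.read_csv("Data_Internal_enriched.csv")
external = pd.read_csv("Data_External_enriched.csv")  # same extraction applied to the other list

def add_match_keys(df: pd.DataFrame) -> pd.DataFrame:
    # Lowercase and strip spaces so "12 oz" and "12oz" line up
    for col in ("PRODUCT_BRAND", "PRODUCT_CORE_NAME", "SIZE"):
        df[col + "_KEY"] = df[col].str.lower().str.replace(" ", "", regex=False)
    return df

matched = add_match_keys(internal).merge(
    add_match_keys(external),
    on=["PRODUCT_BRAND_KEY", "PRODUCT_CORE_NAME_KEY", "SIZE_KEY"],
    suffixes=("_internal", "_external"),
)
print(f"{len(matched)} products matched across the two lists")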
Get Started Today!
This project demonstrates a practical, effective way to leverage LLMs for a common data challenge. If you’re struggling with messy textual data, this pattern can be applied to countless other extraction tasks.
You can explore the full code and run it yourself! The project is available on GitHub:
https://github.com/Taha-azizi/Product_Info_Extractor

Set up Ollama, pull an open-source model, and unleash the power of AI on your unstructured data. The future of data management is here, and it’s powered by intelligent language models.