Automating Product Matching with LLMs: A Step-by-Step Guide
Author(s): Taha Azizi
Originally published on Towards AI.
A Deep Dive into Intelligent Product Matching for E-commerce and Supply Chain Efficiency
In today’s fast-paced digital economy, businesses constantly grapple with vast amounts of data, especially when managing product inventories from diverse suppliers. The challenge of accurately matching external product lists with internal catalogs can be a significant bottleneck, often relying on time-consuming, error-prone manual processes. Imagine a scenario where new shipments arrive weekly, each introducing new products that need to be seamlessly integrated into your existing system. This isn’t just a hypothetical; it’s a real problem facing many stakeholders, including those operating convenience-store-like markets.
This article delves into an intelligent, automated solution designed to tackle this very problem. We’ll explore how a combination of data engineering, advanced NLP techniques, and Large Language Models (LLMs) can create a robust system for exact product matching, focusing on manufacturer, name, and size. This approach, exemplified by a recent project on product matching, aims to maximize accuracy while minimizing human intervention.
The Core Challenge: Exact Product Matching
Our objective is precise: to map external supplier products to internal market products only when their manufacturer, name, and size are identical. This strict requirement often makes traditional matching methods fall short, as even minor discrepancies can lead to mismatches.
For instance, consider these examples of correct and incorrect matches:
Correct Matches:

| External_Product_Name | Internal_Product_Name |
| :--- | :--- |
| DIET LIPTON GREEN TEA W/ CITRUS 20 OZ | Lipton Diet Green Tea with Citrus (20oz) |
| CH-CHERRY CHS CLAW DANISH 4.25 OZ | Cloverhill Cherry Cheese Bearclaw Danish (4.25oz) |

Wrong Matches:

| External_Product_Name | Internal_Product_Name |
| :--- | :--- |
| Hersheys Almond Milk Choco 1.6 oz | Hersheys Milk Chocolate with Almonds (1.85oz) |
| COOKIE PEANUT BUTTER 2OZ | Famous Amos Peanut Butter Cookie (2oz) |
The subtle differences, like “1.6 oz” vs. “1.85oz” or “Almond Milk” vs. “Milk Chocolate,” are critical.
Step 1: Data Understanding and Preprocessing
Any robust AI solution begins with a thorough understanding and cleaning of the data. We start with two CSV files: `Data_Internal.csv` and `Data_External.csv`.
Initial exploration reveals the structure and content of our product lists. Key columns include `NAME`, `OCS_NAME`, and `LONG_NAME` in the internal data, and `PRODUCT_NAME` in the external data.
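A minimal loading-and-cleanup sketch is shown below. The file and column names come from the article; the `normalize` helper and the derived `_clean` columns are illustrative assumptions about the preprocessing step.

```python
import pandas as pd

internal_df = pd.read_csv("Data_Internal.csv")
external_df = pd.read_csv("Data_External.csv")

def normalize(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so later comparisons are consistent."""
    return " ".join(str(name).lower().split())

# Keep the raw columns and add cleaned versions for matching.
for col in ["NAME", "OCS_NAME", "LONG_NAME"]:
    internal_df[col + "_clean"] = internal_df[col].map(normalize)
external_df["PRODUCT_NAME_clean"] = external_df["PRODUCT_NAME"].map(normalize)
```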
Step 2: The Multi-Layered Matching Strategy
We employ a phased approach, progressively increasing the sophistication of our matching algorithms.
2.1 Attempt 1: Exact Matching — The Baseline
Our first instinct is always to check for perfect, direct matches. We attempt to find a `PRODUCT_NAME` in the external data that precisely matches `NAME` or `LONG_NAME` in the internal data.
2.2 Attempt 2: Fuzzy Matching — Embracing Variations
Real-world data rarely offers perfect consistency. Product names often have minor spelling errors, abbreviations, or reorderings. Fuzzy matching accounts for these variations by calculating a similarity score between strings. We utilize `rapidfuzz`'s `token_set_ratio`, which is robust to word order and missing words.
2.3 Attempt 3: Vector Database Matching — Understanding Semantics
To move beyond superficial string comparisons, we leverage the power of semantic understanding through embedding models. We transform product names into high-dimensional numerical vectors, where similar products are represented by vectors that are close in space.
The `SentenceTransformer` library with the 'all-MiniLM-L6-v2' model is used to create these embeddings. A FAISS (Facebook AI Similarity Search) index is then built for efficient similarity searches.
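A sketch of the embedding and retrieval step; the model name is from the article, while the cosine-similarity setup via `IndexFlatIP` on normalized vectors and the value of `K` are assumptions:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the internal catalog once and index it for fast nearest-neighbour search.
internal_texts = internal_df["LONG_NAME"].astype(str).tolist()
internal_vecs = model.encode(internal_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(internal_vecs.shape[1])  # inner product == cosine on unit vectors
index.add(np.asarray(internal_vecs, dtype="float32"))

# Retrieve the top-K internal candidates for every external product name.
K = 5
external_vecs = model.encode(external_df["PRODUCT_NAME"].astype(str).tolist(),
                             normalize_embeddings=True)
scores, ids = index.search(np.asarray(external_vecs, dtype="float32"), K)
```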
2.4 Attempt 4: LLM-Enhanced Vector Matching — The Power of Prompt Engineering
This is where the true “intelligent” aspect comes into play. We combine the efficiency of vector similarity search with the nuanced understanding of Large Language Models (LLMs). The idea is to use the vector database to retrieve a small set of highly relevant candidates, and then use an LLM to perform a fine-grained, rule-based validation on these candidates.
We define a prompt template that explicitly asks the LLM to compare an external product name with a potential internal match and determine if they are an exact match, considering manufacturer, name, and size. The LLM used here is ‘gemma3:27b’.
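The article names the model but not the serving stack, so the sketch below assumes 'gemma3:27b' is served locally through Ollama and called with its Python client; the prompt wording is illustrative, not the author's exact template.

```python
import ollama  # assumes gemma3:27b is available via a local Ollama server

PROMPT_TEMPLATE = """You are validating product matches for a store catalog.
External product: {external}
Candidate internal product: {internal}
Answer YES only if the manufacturer, product name, and size are all identical.
Otherwise answer NO. Reply with exactly one word: YES or NO."""

def llm_is_exact_match(external: str, internal: str) -> bool:
    """Ask the LLM whether the candidate pair is an exact match."""
    response = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(external=external, internal=internal)}],
    )
    return response["message"]["content"].strip().upper().startswith("YES")
```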
2.5 Attempt 5: Few-Shot Prompting — Learning from Examples (Unsuccessful)
To address the remaining inaccuracies, especially concerning size discrepancies, few-shot prompting was attempted. This involves providing the LLM with a few examples of correct and incorrect matches directly within the prompt, guiding its reasoning process.
Result: Few-shot prompting did not yield satisfactory results in this specific scenario. This might be due to the subtle nature of the differences or the limited number of examples provided.
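For illustration, a few-shot variant can be built by prepending worked examples to the same prompt; the pairs below mirror the correct and incorrect matches shown earlier in the article, and the exact formatting is an assumption.

```python
# Prepend labeled examples to guide the LLM's reasoning (few-shot prompting).
FEW_SHOT_EXAMPLES = """Example (match):
External: DIET LIPTON GREEN TEA W/ CITRUS 20 OZ
Internal: Lipton Diet Green Tea with Citrus (20oz)
Answer: YES

Example (no match, size differs):
External: Hersheys Almond Milk Choco 1.6 oz
Internal: Hersheys Milk Chocolate with Almonds (1.85oz)
Answer: NO

"""

few_shot_prompt = FEW_SHOT_EXAMPLES + PROMPT_TEMPLATE
```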
2.6 Attempt 6: Sequential Prompting — Double-Checking for Size Accuracy
Since few-shot prompting didn’t resolve the size issue, a sequential LLM approach was implemented. After the initial LLM validation, a second LLM call was made specifically to verify size compatibility. If a potential match passed the first LLM check but failed the second size-specific check, it was then nulled out.
This involves sending potential matches to another LLM prompt (`prompt_size.txt`) designed solely for size verification.
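A sketch of that second pass. The article keeps the prompt in `prompt_size.txt`; the wording below is an illustrative stand-in, and the Ollama call mirrors the earlier sketch.

```python
# Size-only check: a candidate that passed the first LLM validation must also
# pass this prompt, otherwise the match is nulled out.
SIZE_PROMPT = """Compare only the package sizes of these two products.
External product: {external}
Candidate internal product: {internal}
Reply YES if the sizes are identical (treat "20 OZ" and "(20oz)" as the same), otherwise NO."""

def llm_size_matches(external: str, internal: str) -> bool:
    response = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user",
                   "content": SIZE_PROMPT.format(external=external, internal=internal)}],
    )
    return response["message"]["content"].strip().upper().startswith("YES")
```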
The culmination of this process is a table that lists every external item and its corresponding matched internal product. If no exact match is found (after all stages of validation, including LLM checks), the internal product column will be `NULL`. This directly fulfills the stakeholder's requirement for an automated mapping system.
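A sketch of that final assembly, reusing the FAISS results (`ids`) and the two LLM checks from the snippets above; the loop structure and output file name are assumptions.

```python
# For each external product, keep the first vector candidate that survives both
# LLM checks; otherwise record None, which becomes NULL in the exported table.
matched = []
for row_idx, row in enumerate(external_df.itertuples()):
    internal_match = None
    for cand_idx in ids[row_idx]:
        candidate = internal_df.iloc[int(cand_idx)]["LONG_NAME"]
        if llm_is_exact_match(row.PRODUCT_NAME, candidate) and \
                llm_size_matches(row.PRODUCT_NAME, candidate):
            internal_match = candidate
            break
    matched.append(internal_match)

external_df["Matched_Internal_Product"] = matched
external_df.to_csv("product_matching_results.csv", index=False)
```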
Reflections and Future Directions
This project demonstrates the power of a hybrid AI approach for complex data matching, combining the strengths of:
- Semantic Search (Vector Databases): For efficient retrieval of semantically similar candidates.
- LLM Validation (Prompt Engineering): For precise, rule-based verification of exact matches and handling nuanced details like size.
- Sequential Reasoning (Multi-Stage Prompting): To refine results and address specific discrepancies systematically.
Together, these layers move the system beyond simplistic methods, offering a robust and scalable solution for product alignment.
However, it’s crucial to acknowledge certain aspects for future improvement:
- Lack of Labeled Data: The current evaluation relied on manual inspection. In a real-world scenario, a labeled dataset is indispensable for quantitatively measuring accuracy, precision, and recall, and for training a supervised model.
- Exact Match Constraint: The strict “exact match” requirement inherently increases false negatives (missed potential matches). For use cases where some flexibility is allowed, tuning parameters like `K` (number of vector candidates) and `J` (number of candidates sent to the LLM) could yield more matches.
- Prompt Engineering Optimization: The LLM prompts can always be further refined. Techniques like Chain-of-Thought (CoT) prompting could be explored to encourage more detailed reasoning from the LLM.
- Scalability for Larger Datasets: For extremely large datasets, optimizing the vector database indexing, potentially using more advanced FAISS indices or distributed solutions, would be critical.
- User Interface and Feedback Loop: Integrating this system into a user-friendly frontend where human reviewers can provide feedback on matches would create a powerful continuous learning loop, iteratively improving accuracy over time.
This project, available on GitHub (Taha-azizi/product-matching-system), serves as a strong foundation for building intelligent automation in product data management, illustrating how thoughtful AI integration can transform manual workflows into efficient, accurate processes. The future of data management is undoubtedly automated, and solutions like this are paving the way.