Automating Product Matching with LLMs: A Step-by-Step Guide
Author(s): Taha Azizi
Originally published on Towards AI.
A Deep Dive into Intelligent Product Matching for E-commerce and Supply Chain Efficiency
In today’s fast-paced digital economy, businesses constantly grapple with vast amounts of data, especially when managing product inventories from diverse suppliers. The challenge of accurately matching external product lists with internal catalogs can be a significant bottleneck, often relying on time-consuming, error-prone manual processes. Imagine a scenario where new shipments arrive weekly, each introducing new products that need to be seamlessly integrated into your existing system. This isn’t just a hypothetical; it’s a real problem facing many stakeholders, including those operating convenience-store-like markets.
This article delves into an intelligent, automated solution designed to tackle this very problem. We’ll explore how a combination of data engineering, advanced NLP techniques, and Large Language Models (LLMs) can create a robust system for exact product matching, focusing on manufacturer, name, and size. This approach, exemplified by a recent project on product matching, aims to maximize accuracy while minimizing human intervention.
The Core Challenge: Exact Product Matching
Our objective is precise: to map external supplier products to internal market products only when their manufacturer, name, and size are identical. This strict requirement often makes traditional matching methods fall short, as even minor discrepancies can lead to mismatches.
For instance, consider these examples of correct and incorrect matches:
Correct Matches:

| External_Product_Name | Internal_Product_Name |
| :--- | :--- |
| DIET LIPTON GREEN TEA W/ CITRUS 20 OZ | Lipton Diet Green Tea with Citrus (20oz) |
| CH-CHERRY CHS CLAW DANISH 4.25 OZ | Cloverhill Cherry Cheese Bearclaw Danish (4.25oz) |

Wrong Matches:

| External_Product_Name | Internal_Product_Name |
| :--- | :--- |
| Hersheys Almond Milk Choco 1.6 oz | Hersheys Milk Chocolate with Almonds (1.85oz) |
| COOKIE PEANUT BUTTER 2OZ | Famous Amos Peanut Butter Cookie (2oz) |
The subtle differences, like “1.6 oz” vs. “1.85oz” or “Almond Milk” vs. “Milk Chocolate,” are critical.
Step 1: Data Understanding and Preprocessing
Any robust AI solution begins with a thorough understanding and cleaning of the data. We start with two CSV files: `Data_Internal.csv` and `Data_External.csv`.
Initial exploration reveals the structure and content of our product lists. Key columns include `NAME`, `OCS_NAME`, and `LONG_NAME` in the internal data, and `PRODUCT_NAME` in the external data.
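A minimal loading-and-cleanup sketch is shown below. The file and column names come from the article; the `normalize` helper and the derived `_clean` columns are illustrative assumptions about the preprocessing step.

```python
import pandas as pd

internal_df = pd.read_csv("Data_Internal.csv")
external_df = pd.read_csv("Data_External.csv")

def normalize(name: str) -> str:
    """Lowercase, trim, and collapse whitespace so later comparisons are consistent."""
    return " ".join(str(name).lower().split())

# Keep the raw columns and add cleaned versions for matching.
for col in ["NAME", "OCS_NAME", "LONG_NAME"]:
    internal_df[col + "_clean"] = internal_df[col].map(normalize)
external_df["PRODUCT_NAME_clean"] = external_df["PRODUCT_NAME"].map(normalize)
```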
Step 2: The Multi-Layered Matching Strategy
We employ a phased approach, progressively increasing the sophistication of our matching algorithms.
2.1 Attempt 1: Exact Matching — The Baseline
Our first instinct is always to check for perfect, direct matches. We attempt to find a `PRODUCT_NAME` in the external data that precisely matches `NAME` or `LONG_NAME` in the internal data.
2.2 Attempt 2: Fuzzy Matching — Embracing Variations
Real-world data rarely offers perfect consistency. Product names often have minor spelling errors, abbreviations, or reorderings. Fuzzy matching accounts for these variations by calculating a similarity score between strings. We utilize `rapidfuzz`'s `token_set_ratio`, which is robust to word order and missing words.
2.3 Attempt 3: Vector Database Matching — Understanding Semantics
To move beyond superficial string comparisons, we leverage the power of semantic understanding through embedding models. We transform product names into high-dimensional numerical vectors, where similar products are represented by vectors that are close in space.
The `SentenceTransformer` library with the 'all-MiniLM-L6-v2' model is used to create these embeddings. A FAISS (Facebook AI Similarity Search) index is then built for efficient similarity searches.
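A sketch of the embedding and retrieval step; the model name is from the article, while the cosine-similarity setup via `IndexFlatIP` on normalized vectors and the value of `K` are assumptions:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the internal catalog once and index it for fast nearest-neighbour search.
internal_texts = internal_df["LONG_NAME"].astype(str).tolist()
internal_vecs = model.encode(internal_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(internal_vecs.shape[1])  # inner product == cosine on unit vectors
index.add(np.asarray(internal_vecs, dtype="float32"))

# Retrieve the top-K internal candidates for every external product name.
K = 5
external_vecs = model.encode(external_df["PRODUCT_NAME"].astype(str).tolist(),
                             normalize_embeddings=True)
scores, ids = index.search(np.asarray(external_vecs, dtype="float32"), K)
```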
2.4 Attempt 4: LLM-Enhanced Vector Matching — The Power of Prompt Engineering
This is where the true “intelligent” aspect comes into play. We combine the efficiency of vector similarity search with the nuanced understanding of Large Language Models (LLMs). The idea is to use the vector database to retrieve a small set of highly relevant candidates, and then use an LLM to perform a fine-grained, rule-based validation on these candidates.
We define a prompt template that explicitly asks the LLM to compare an external product name with a potential internal match and determine if they are an exact match, considering manufacturer, name, and size. The LLM used here is ‘gemma3:27b’.
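The article names the model but not the serving stack, so the sketch below assumes 'gemma3:27b' is served locally through Ollama and called with its Python client; the prompt wording is illustrative, not the author's exact template.

```python
import ollama  # assumes gemma3:27b is available via a local Ollama server

PROMPT_TEMPLATE = """You are validating product matches for a store catalog.
External product: {external}
Candidate internal product: {internal}
Answer YES only if the manufacturer, product name, and size are all identical.
Otherwise answer NO. Reply with exactly one word: YES or NO."""

def llm_is_exact_match(external: str, internal: str) -> bool:
    """Ask the LLM whether the candidate pair is an exact match."""
    response = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(external=external, internal=internal)}],
    )
    return response["message"]["content"].strip().upper().startswith("YES")
```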
2.5 Attempt 5: Few-Shot Prompting — Learning from Examples (Unsuccessful)
To address the remaining inaccuracies, especially concerning size discrepancies, few-shot prompting was attempted. This involves providing the LLM with a few examples of correct and incorrect matches directly within the prompt, guiding its reasoning process.
Result: Few-shot prompting did not yield satisfactory results in this specific scenario. This might be due to the subtle nature of the differences or the limited number of examples provided.
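For illustration, a few-shot variant can be built by prepending worked examples to the same prompt; the pairs below mirror the correct and incorrect matches shown earlier in the article, and the exact formatting is an assumption.

```python
# Prepend labeled examples to guide the LLM's reasoning (few-shot prompting).
FEW_SHOT_EXAMPLES = """Example (match):
External: DIET LIPTON GREEN TEA W/ CITRUS 20 OZ
Internal: Lipton Diet Green Tea with Citrus (20oz)
Answer: YES

Example (no match, size differs):
External: Hersheys Almond Milk Choco 1.6 oz
Internal: Hersheys Milk Chocolate with Almonds (1.85oz)
Answer: NO

"""

few_shot_prompt = FEW_SHOT_EXAMPLES + PROMPT_TEMPLATE
```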
2.6 Attempt 6: Sequential Prompting — Double-Checking for Size Accuracy
Since few-shot prompting didn’t resolve the size issue, a sequential LLM approach was implemented. After the initial LLM validation, a second LLM call was made specifically to verify size compatibility. If a potential match passed the first LLM check but failed the second size-specific check, it was then nulled out.
This involves sending potential matches to another LLM prompt (`prompt_size.txt`) designed solely for size verification.
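A sketch of that second pass. The article keeps the prompt in `prompt_size.txt`; the wording below is an illustrative stand-in, and the Ollama call mirrors the earlier sketch.

```python
# Size-only check: a candidate that passed the first LLM validation must also
# pass this prompt, otherwise the match is nulled out.
SIZE_PROMPT = """Compare only the package sizes of these two products.
External product: {external}
Candidate internal product: {internal}
Reply YES if the sizes are identical (treat "20 OZ" and "(20oz)" as the same), otherwise NO."""

def llm_size_matches(external: str, internal: str) -> bool:
    response = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user",
                   "content": SIZE_PROMPT.format(external=external, internal=internal)}],
    )
    return response["message"]["content"].strip().upper().startswith("YES")
```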
The culmination of this process is a table that lists every external item and its corresponding matched internal product. If no exact match is found (after all stages of validation, including LLM checks), the internal product column will be `NULL`. This directly fulfills the stakeholder's requirement for an automated mapping system.
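A sketch of that final assembly, reusing the FAISS results (`ids`) and the two LLM checks from the snippets above; the loop structure and output file name are assumptions.

```python
# For each external product, keep the first vector candidate that survives both
# LLM checks; otherwise record None, which becomes NULL in the exported table.
matched = []
for row_idx, row in enumerate(external_df.itertuples()):
    internal_match = None
    for cand_idx in ids[row_idx]:
        candidate = internal_df.iloc[int(cand_idx)]["LONG_NAME"]
        if llm_is_exact_match(row.PRODUCT_NAME, candidate) and \
                llm_size_matches(row.PRODUCT_NAME, candidate):
            internal_match = candidate
            break
    matched.append(internal_match)

external_df["Matched_Internal_Product"] = matched
external_df.to_csv("product_matching_results.csv", index=False)
```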
Reflections and Future Directions
This project demonstrates the power of a hybrid AI approach for complex data matching, combining the strengths of:
- Semantic Search (Vector Databases): For efficient retrieval of semantically similar candidates.
- LLM Validation (Prompt Engineering): For precise, rule-based verification of exact matches and handling nuanced details like size.
- Sequential Reasoning (Multi-Stage Prompting): To refine results and address specific discrepancies systematically.
Together, these layers move the system beyond simplistic methods, offering a robust and scalable solution for product alignment.
However, it’s crucial to acknowledge certain aspects for future improvement:
- Lack of Labeled Data: The current evaluation relied on manual inspection. In a real-world scenario, a labeled dataset is indispensable for quantitatively measuring accuracy, precision, and recall, and for training a supervised model.
- Exact Match Constraint: The strict “exact match” requirement inherently increases false negatives (missed potential matches). For use cases where some flexibility is allowed, tuning parameters like `K` (number of vector candidates) and `J` (number of candidates sent to the LLM) could yield more matches.
- Prompt Engineering Optimization: The LLM prompts can always be further refined. Techniques like Chain-of-Thought (CoT) prompting could be explored to encourage more detailed reasoning from the LLM.
- Scalability for Larger Datasets: For extremely large datasets, optimizing the vector database indexing, potentially using more advanced FAISS indices or distributed solutions, would be critical.
- User Interface and Feedback Loop: Integrating this system into a user-friendly frontend where human reviewers can provide feedback on matches would create a powerful continuous learning loop, iteratively improving accuracy over time.
This project, available on GitHub (Taha-azizi/product-matching-system), serves as a strong foundation for building intelligent automation in product data management, illustrating how thoughtful AI integration can transform manual workflows into efficient, accurate processes. The future of data management is undoubtedly automated, and solutions like this are paving the way.