
Matching and Analyzing Products in Marketplaces Using LLMs

Author(s): Igor Novikov

Originally published on Towards AI.

Image by the author

There is a classic problem in any marketplace: making sense of product listings, made worse by users writing a mess of a description for really simple products. You've got a simple, everyday item, like a blue T-shirt, described as everything from "Sky Blue" to "Royal Blue" to "Navy Aqua Marine." Some sellers even cram product details into the image itself, scribbling "Best Shirt Ever!" in neon green letters over a poorly lit photo. The result? Two identical products can appear as if they came from completely different universes.

This makes searching, categorizing, and analyzing these products a real headache (or a job security guarantee for data scientists, depending on how you look at it). For pricing management, a marketplace often needs to parse competitors' listings to do competitive pricing analysis, which adds another level of complexity, since each platform has its own product card format.

Of course, over the years, several ways to deal with this have been developed:

  1. Category-Specific Matcher
    Train a dedicated "matcher" model or algorithm for each product category: one for electronics, one for clothing, one for cosmetics, and so on. Each approach is specialized but can become a juggling act if you have 10,000 categories.
  2. Candidate Search Using Embeddings
    Embeddings are vector representations of data (e.g., product descriptions or names) used to determine similarity. Using text or image processing methods (e.g., word2vec, sentence-transformers), it is possible to find similar products.
  3. Attribute Extraction for Each Product
    Product data (such as brand, model, color, size, etc.) is extracted from descriptions to analyze and match products at a deeper level.
  4. Gradient Boosting
    Gradient boosting algorithms (such as CatBoost or XGBoost) are used for classification tasks, determining whether products are similar or not. These models are trained on pre-labeled data and take into account both textual and numerical attributes.
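As a toy illustration of approach 3, attributes can be pulled out of a description with hand-written rules. The vocabulary lists below are invented for the sketch; real systems use trained extractors or, as discussed later, LLMs:

```python
import re

# Tiny hand-written vocabularies; a real extractor would be far richer.
COLORS = {"red", "blue", "black", "white", "green"}
MATERIALS = {"cotton", "leather", "hemp", "wool", "polyester"}

def extract_attributes(description: str) -> dict:
    """Rule-based extraction of a few attributes from a product description."""
    words = set(re.findall(r"[a-z\-]+", description.lower()))
    return {
        "color": next((c for c in sorted(COLORS) if c in words), None),
        "material": next((m for m in sorted(MATERIALS) if m in words), None),
        "demographic": "men" if "men" in words
                       else ("women" if "women" in words else None),
    }

print(extract_attributes("Stylish bright red cotton T-shirt for men"))
# {'color': 'red', 'material': 'cotton', 'demographic': 'men'}
```

Extractors like this are cheap but brittle: they miss every phrasing not in the word lists, which is exactly the gap LLM-based extraction closes.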

These methods help to a point, but people are still amazingly creative in describing items. One common problem is attribute hell: when you have 400 ways to say "comfortable," attribute-based matching starts to fail.

For clothing and similar items, which can be identical but described very differently ("Eco-friendly hemp T-shirt" vs. "100% sustainable plant-based top"), matching can only be done from photos, based on colors, shapes, and fabric patterns.

The new approach uses the multimodal abilities of LLMs and Vision-Language Models (VLMs) to do this. Many platforms are already doing it:

"We've used multiple large language models to accurately create or improve over 850 million pieces of data in a catalog.

Without the use of generative AI, this work would have required nearly 100 times the current headcount to complete in the same amount of time."

Doug McMillon, CEO, Walmart

Here is the general framework:

1. Candidate Search Using Embeddings

The first step is the same as before: embeddings allow the system to compute the similarity between different products efficiently. For example, embeddings generated by advanced models (like OpenAI's embedding models or Sentence-Transformers) can be used to identify potential matches by comparing the vectors of product descriptions or attributes.
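A minimal sketch of candidate search. Here a toy bag-of-words vector stands in for a real embedding model (in production you would encode texts with, say, a sentence-transformer), but the ranking mechanics are the same:

```python
from collections import Counter
from math import sqrt

# Toy bag-of-words "embedding"; a stand-in for real sentence-transformer vectors.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_candidates(query: str, catalog: list[str], k: int = 3) -> list[str]:
    """Rank catalog entries by vector similarity to the query."""
    qv = embed(query)
    ranked = sorted(catalog, key=lambda p: cosine(qv, embed(p)), reverse=True)
    return ranked[:k]

catalog = [
    "Royal Blue cotton T-shirt for men",
    "Black leather handbag",
    "Sky blue t-shirt",
]
print(top_candidates("blue t-shirt", catalog, k=2))
# ['Sky blue t-shirt', 'Royal Blue cotton T-shirt for men']
```

The candidates returned here are only a shortlist; the heavier LLM-based steps below decide whether any of them is actually the same product.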

2. Attribute Extraction for Each Product Using LLMs

LLMs extract specific product attributes (e.g., brand, color, size, material) from their textual descriptions.

For instance, "Stylish bright red cotton T-shirt for men" is broken down into color=red, material=cotton, targeted demographic=men, etc. It's like having your own personal (text) shopper who's annoyingly accurate.

See examples below.

3. Matching Two Products Based on Extracted Attributes Using LLMs

Once attributes are extracted, LLMs are used to compare two products based on these attributes and determine if they are a match.

LLMs can consider both the explicit similarities (e.g., same brand and size) and implicit ones (e.g., "eco-friendly" vs. "sustainable").

For example, matching "Men's Nike Air Max sneakers" with "Nike Air Max shoes for men" by recognizing that they refer to the same product.

See examples below.

4. Matching Based on Images Using VLMs (Vision-Language Models)

Some attributes, like color, design, or unique patterns, are best analyzed visually rather than textually. VLMs combine visual and textual data for more accurate product matching. These models analyze product images alongside their descriptions to better understand the product.

For example: matching a product image of a black leather handbag with another similar handbag by identifying visual features (shape, texture, etc.) and combining them with textual data.

Benefits of LLMs for matching

  • More accurate matching: fewer false mismatches (like that time you got a "turquoise" shirt instead of "teal").
  • Precision and recall in the 90–99% range, especially with fine-tuned models.
  • Handles diverse data: text, images, random emojis, you name it.
  • Enhanced understanding of product descriptions and visual features.
  • No specialized training needed: no more building and maintaining 50 different specialized matchers.

Drawbacks

  • Requires much more computing power.
  • Latency is high: fine for batch processing, but not for real time.

Overall, for some categories there is no point in using an LLM if it can be avoided: if the category is simple and standard algorithms handle it well, they will be significantly cheaper. For example, car parts have VINs that identify the part, and matching by VINs is enough for most purposes. LLMs are more suitable for processing medium and complex categories.

Attribute extraction using an LLM

I have a product card from the "Refrigerators" category on a marketplace. 
I need to extract and format attributes from it.

key_attributes_list = [
    {
        "name": "model",
        "attribute_comment": "Refrigerator model. Include brand or company name but exclude color and size."
    },
    {
        "name": "capacity",
        "attribute_comment": "Total volume of the refrigerator, typically measured in liters (L). Look for terms like 'Total Capacity'. If unavailable, set value to null."
    },
    {
        "name": "energy_efficiency",
        "attribute_comment": "Extract the energy efficiency class, such as 'A++', 'A+', or 'B'. Look for terms like 'Energy Efficiency Class'. If unavailable, set value to null."
    },
    {
        "name": "number_of_doors",
        "attribute_comment": "Number of doors, such as '1', '2', or 'Side-by-Side'. Look for terms like 'Doors', 'Number of Doors'. If missing, set value to null."
    },
    {
        "name": "freezer_position",
        "attribute_comment": "Position of the freezer, such as 'Top', 'Bottom', or 'Side'. Extract from terms like 'Freezer Position'. If unspecified, set value to null."
    },
    {
        "name": "defrost_system",
        "attribute_comment": "Type of defrost system, such as 'No Frost' or 'Manual Defrost'. Look for terms like 'Defrost System'. If not mentioned, set value to null."
    },
    {
        "name": "dimensions",
        "attribute_comment": "Include height, width, and depth in centimeters (e.g., '180x70x60'). Look for terms like 'Dimensions'. If not available, set value to null."
    }
]

Output:

extracted_attributes = {
    "model": "Upstreman UF70",
    "capacity": "7.0 cubic feet",
    "energy_efficiency": "ENERGY STAR certified",
    "number_of_doors": "1",
    "freezer_position": "Convertible",
    "defrost_system": "Low Frost Design",
    "dimensions": "56.1x21.3x22.76 inches",
}
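The extracted values are strings, while downstream comparison logic usually wants numbers. A small normalization sketch, with field formats assumed from the example above:

```python
import re

def parse_capacity(text):
    """Pull the leading number out of a string like '7.0 cubic feet'."""
    m = re.search(r"(\d+(?:\.\d+)?)", text or "")
    return float(m.group(1)) if m else None

def parse_dimensions(text):
    """Turn '56.1x21.3x22.76 inches' into a (H, W, D) tuple of floats."""
    nums = re.findall(r"\d+(?:\.\d+)?", text or "")
    return tuple(float(n) for n in nums[:3]) if len(nums) >= 3 else None

print(parse_capacity("7.0 cubic feet"))            # 7.0
print(parse_dimensions("56.1x21.3x22.76 inches"))  # (56.1, 21.3, 22.76)
```

Returning None for unparseable values mirrors the null convention used in the extraction prompt.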

Matching with an LLM

Matching Color example

Prompt:

I have two product cards from the category "Freezers" on a marketplace.
I need to compare the attribute "color."
The response should include:
is_match: A boolean indicating whether the values match.
reason: A detailed explanation of the comparison result.

Format the answer in JSON.

Output:

{
"is_match": false,
"reason": "The color attributes do not match. Product 1 has 'White' while Product 2 has 'Stainless Steel'."
}

Matching multiple attributes

Prompt:

I have two product cards from the category "Freezers" on a marketplace.
I need to compare their attributes to determine if the two products are similar enough to be considered the same model.

Attributes to compare:
Model name: Replace + with plus and remove spaces for comparison.
Capacity: Match if the difference is less than or equal to 0.2 cubic feet.
Energy efficiency: Match if both are certified by ENERGY STAR or have identical efficiency ratings.
Dimensions: Match if the height, width, and depth differences are each within 1 inch.
Color: Match only if the color is identical.

The response should include:
is_match: A boolean indicating whether the two products are considered the same model.
reason: A detailed explanation of the comparison result.

Format the answer in JSON.

Output:

{
"is_match": true,
"reason": "The two freezers are considered the same model based on the following comparisons:
1. Model name matches after normalization ('UpstremanUF70').
2. Capacity is similar (7.0 vs 6.9 cubic feet, difference is 0.1 cubic feet, which is within the acceptable range of 0.2 cubic feet).
3. Both are ENERGY STAR certified, indicating identical energy efficiency.
4. Dimensions are nearly identical (differences: height 0.1 inch, width 0.1 inch, depth 0.1 inch, all within the acceptable tolerance of 1 inch).
5. Both products share a similar color: 'White' and 'Pearl White'."
}
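The rules in the prompt above are mechanical enough to double-check in plain code, which is useful for spot-validating LLM verdicts. A sketch assuming attribute dicts that have already been normalized to numbers (all field names are illustrative):

```python
def normalize_model(name: str) -> str:
    # Per the prompt: replace '+' with 'plus' and remove spaces.
    return name.replace("+", "plus").replace(" ", "").lower()

def is_same_model(a: dict, b: dict) -> bool:
    """Apply the matching rules from the prompt deterministically."""
    if normalize_model(a["model"]) != normalize_model(b["model"]):
        return False
    if abs(a["capacity"] - b["capacity"]) > 0.2:  # cubic feet tolerance
        return False
    if a["energy_star"] != b["energy_star"]:
        return False
    if any(abs(x - y) > 1.0 for x, y in zip(a["dims"], b["dims"])):  # inches
        return False
    return a["color"] == b["color"]  # strict color rule from the prompt

p1 = {"model": "Upstreman UF70", "capacity": 7.0, "energy_star": True,
      "dims": (56.1, 21.3, 22.76), "color": "White"}
p2 = {"model": "UpstremanUF70", "capacity": 6.9, "energy_star": True,
      "dims": (56.2, 21.4, 22.7), "color": "White"}
print(is_same_model(p1, p2))  # True
```

Note that the strict color rule would reject 'White' vs. 'Pearl White'; recognizing those as the same color is exactly the kind of fuzzy judgment the LLM adds on top of rules like these.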

Working with Photos and Images

Attribute extraction

Say we have an image like this. We are going to use the same prompt as for text extraction:

I have a product card from the "Refrigerators" category on a marketplace. 
I need to extract and format attributes from it.
key_attributes_list = [
…

    {
        "name": "dimensions",
        "attribute_comment": "Include height, width, and depth in centimeters (e.g., '180x70x60'). Look for terms like 'Dimensions'. If not available, set value to null."
    }
]

Output:

{
"dimensions": "56.10x21.30x22.76 inches"
}

One more:

Output:

{
"category": "Headphones",
"brand": "KVIDIO",
"color": "Black",
"features": [
"Over-ear design",
"Wireless",
"Bluetooth connectivity",
"Padded ear cushions"
]
}

Comparing clothing items

Prompt:

As a fashion expert, you are tasked with comparing two photographs of women's tops from a marketplace.
Your analysis should focus exclusively on the tops themselves, disregarding:
Any other visible clothing items, even if part of the outfit or matching in style.
Variations in poses, body movements, or how the tops are worn.

Your goal:
Determine whether the tops are identical or an exact match.
Provide a clear "Identical" or "Different" answer, followed by a single concise sentence explaining your reasoning.
Answer in JSON format.

And the result? The model was able to detect a very subtle difference in the neckline, which I, being a man, was not able to notice.

Output:

{
"result": "Different",
"reason": "The tops have distinct necklines; the first features a crew neck while the second has a scoop neck."
}

Here’s another one:

Output:

{
"result": "Identical",
"reasoning": "Both tops have the same color, design, and fabric characteristics, including the long sleeves, slim fit, and light aqua tone."
}

Perfecto!

Models and throughput

Throughput is very important for a marketplace, as there are thousands of products, with more added every day. Throughput will depend on two things:

  1. Your hardware
  2. Model size and type

If you use a 70B model like Llama or Qwen, it will work well but slowly. So unless you have a supercomputer, you will face 0–5 RPS for large models on commercially available GPUs.

To improve performance, several things can be done.

One is using a smaller model fine-tuned on your product dataset. How small depends on the task, and you will likely have to train several small models, as each can only fit so many categories. 7B models in general work well for text extraction, and 1B models for a limited subset of fine-tuned attributes. It is possible to distill such a model from a bigger one.

That may improve your throughput by 10–20x. However, note that smaller models can't handle lots of attribute extraction in one go and will struggle with complex prompts, so you'll have to test on your task.

Quantization also shows a 20–50% improvement in RPS without a significant drop in output quality.

At huge volumes, it doesn't make sense to use anything but a self-hosted model, as the costs of OpenAI or Anthropic APIs would be prohibitively high, but they are good for prototyping and trying things out. For the self-hosted scenario, I would go with the latest instruct versions of Llama or Qwen, as they have shown the best results in our testing: start with 70B, then optimize down to smaller models until you stop getting satisfactory performance.

You will likely have to fine-tune the model for certain categories; for example, the medical space has a lot of abbreviations, as does the construction industry. A generic model may not handle those well, so a LoRA pass is a good idea in this case. Additionally, take the language into consideration: for example, Llama works well in English but is likely to hallucinate in Chinese, where Qwen would be a better choice.

Have fun!
