
LLM-Powered Metadata Extraction Algorithm

Last Updated on October 12, 2024 by Editorial Team

Author(s): Vladyslav Fliahin

Originally published on Towards AI.

Introduction

Source: Image generated by the author using AI (Flux AI)

Did you know that businesses receive thousands of customer reviews daily, each containing valuable insights? Yet making sense of this data and processing it is a challenging task. The volume of unstructured data keeps growing: social media posts, customer reviews, articles, and more. Processing this data and extracting meaningful insights is crucial for businesses that want to understand customer feedback.

Many techniques have been developed to process unstructured data, such as sentiment analysis, keyword extraction, named entity recognition, and parsing. But these methods often fail on more complex tasks. The evolution of Large Language Models (LLMs) enabled a level of understanding and information extraction that classical NLP algorithms struggle with.

Use Case

Suppose you own a business and want to understand how customers feel about your products and identify ways to improve them. Numerical ratings in reviews give a general sense of satisfaction, but they say little about specific features. To truly understand the feedback on a product, you have to process a large number of reviews: the pros and cons, what could be improved, the average usage time of the product, and so on. Extracting these essential features for further analysis requires a human-level understanding of the text. This is where LLMs come into play, with their ability to interpret customer feedback and present it in a structured form that is easy to analyze.

This article focuses on using LLMs to extract meaningful metadata from product reviews, specifically via the OpenAI API.

Data

We decided to use the Amazon reviews dataset. Below is a citation from the readme of the chosen dataset:

The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The Amazon reviews full score dataset is constructed by randomly taking 600,000 training samples and 130,000 testing samples for each review score from 1 to 5. In total there are 3,000,000 training samples and 650,000 testing samples.

The files train.csv and test.csv contain all the training samples as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 5), review title and review text.

Data processing

Since our main area of interest is extracting metadata from reviews, we had to choose a subset of reviews and label it manually with selected fields of interest. We have chosen the following structure to be extracted from reviews:

class ProductReview(BaseModel):
    pros: list[str]
    cons: list[str]
    product_features: list[str]
    use_case: str
    experience: str
    usage_duration: str
    improvements: list[str]
    stars_of_review: int
    support_quality: int
    refund: bool

Here is the breakdown of the fields:

  • Pros: list of events or features that had a positive impact on the customer experience
  • Cons: list of events or features that hurt the customer experience
  • Product features: list of features mentioned in the review
  • Use case: the use case of the product mentioned in the review
  • Experience: the customer’s overall experience based on the review’s semantics. It can be positive, negative, or mixed
  • Usage duration: how long the customer used the product when writing the review if mentioned.
  • Improvements: list of features or fixes the customer suggested in the review
  • Stars of review: LLM’s prediction of the review star, based on the review text

We chose 62 reviews as a test dataset and subsequently labeled their metadata.

Some potentially valuable fields (support quality and whether the customer requested a refund) weren’t present in our subset, but we kept them in the pipeline in case they appear in your dataset.
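For illustration, here is a hypothetical populated instance of the model above (the review it would come from and all values are invented):

example = ProductReview(
    pros=["battery lasts a full day", "lightweight"],
    cons=["charging cable is too short"],
    product_features=["battery", "charging cable", "weight"],
    use_case="daily commuting",
    experience="positive",
    usage_duration="3 months",
    improvements=["include a longer cable"],
    stars_of_review=4,
    support_quality=-1,  # support not mentioned in the review
    refund=False,        # no refund mentioned
)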

Pipeline

Source: Image generated by the author using AI (Flux AI)

We leveraged a zero-shot prompting approach with OpenAI’s LLM, specifically GPT-4o. It allows reviews to be interpreted and data to be extracted without large labeled datasets, which cuts training and pipeline setup time while enabling fast setup and results. Zero-shot prompting lets an LLM perform tasks it wasn’t explicitly trained for, based solely on a prompt describing the task at hand.
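As a minimal illustration of zero-shot prompting (the prompt and review text here are invented, and this sketch calls the OpenAI SDK directly; the actual prompt and pipeline we used are described below):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the pros, cons, and usage duration from the review."},
        {"role": "user", "content": "Great blender, I have used it daily for 6 months, but the lid feels flimsy."},
    ],
    temperature=0,
)
print(completion.choices[0].message.content)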

An effective prompt that explains to the LLM how to process the data and what we expect as a result is the key to success. Here is the prompt we utilized for our task:

You are an intelligent assistant designed to analyze product reviews and extract specific information to populate a structured data model. Your task is to process a given product review text and extract the following fields:

1. **pros** (`List[str]`). Empty list if no pros.
- **Description**: A list of positive aspects or advantages mentioned in the review.

2. **cons** (`List[str]`). Empty list if no cons.
- **Description**: A list of negative aspects or disadvantages mentioned in the review.

3. **product_features** (`List[str]`). Empty list if no features.
- **Description**: Specific features or components of the product discussed in the review.

4. **use_case** (`str`). Empty string if no use case mentioned.
- **Description**: The intended or actual use case of the product (e.g., daily usage, gaming, professional use, present).

5. **experience** (`str`)
- **Description**: A summary of the reviewer's overall experience with the product. Positive or negative

6. **usage_duration** (`str`). Empty string if not specified in review.
- **Description**: Duration of product usage mentioned in the review (e.g., "6 months", "1 year").

7. **improvements** (`List[str]`). Empty list if not specified in review.
- **Description**: Suggested improvements or enhancements for the product.

8. **stars_of_review** (`int`, between 1 and 5).
- **Description**: Star rating of the review, ranging from 1 to 5.

9. **support_quality** (`int`, default -1):
- **Description**: Rating or feedback on customer support, if mentioned. Use a scale (e.g., 1-5). If not mentioned, set to -1.

10. **refund** (`bool`, default False):
- **Description**: Indicates whether the reviewer mentions receiving a refund (`True`) or not (`False`).

**Instructions:**

- **Input**: You will be provided with a single product review text.
- **Output**: Extract the specified fields and present them in a well-formatted JSON object that matches the `ProductReview` schema.
- **Formatting Requirements**:
- Use JSON syntax.
- Use appropriate data types for each field.
- If a field is not mentioned in the review, use the default value as specified.
- For the `stars_of_review` field process the review and convert it into an integer scale ranging from 1 to 5
- For the `support_quality` field, if feedback on customer support is mentioned, extract the rating or qualitative feedback and convert it to an integer scale (e.g., 1-5). If not mentioned, set it to -1.
- For the `refund` field, set it to `True` if the review mentions receiving a refund, otherwise `False`.

One of the significant challenges is processing the output of the LLM. The model’s response might not always be what you expect: it might hallucinate, truncate the output, and so on. To address this, we utilized OpenAI’s new structured output feature, which ensures 100% consistency with the JSON schema you provide. The schema can be passed as a raw JSON schema:

{
    "type": "json_schema",
    "json_schema": {
        "name": "product_review",
        "schema": {
            "type": "object",
            "properties": {
                "pros": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "cons": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "product_features": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "use_case": {"type": "string"},
                "experience": {"type": "string"},
                "usage_duration": {"type": "string"},
                "improvements": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "stars_of_review": {"type": "integer"},
                "support_quality": {"type": "integer"},
                "refund": {"type": "boolean"}
            },
            "required": [
                "pros",
                "cons",
                "product_features",
                "use_case",
                "experience",
                "usage_duration",
                "improvements",
                "stars_of_review",
                "support_quality",
                "refund"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}
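A minimal sketch of passing such a schema to the Chat Completions API as `response_format` (assuming the dictionary above is stored in a variable named `review_schema`, and that `prompt` and `review_text` hold the extraction prompt and a single review; those names are illustrative, not part of the original code):

from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": prompt},      # the extraction prompt shown above
        {"role": "user", "content": review_text},   # one review to process
    ],
    response_format=review_schema,                  # the {"type": "json_schema", ...} dict above
)
print(completion.choices[0].message.content)        # JSON string matching the schema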

Or, equivalently, as a Pydantic model:

from pydantic import BaseModel


class ProductReview(BaseModel):
    pros: list[str]
    cons: list[str]
    product_features: list[str]
    use_case: str
    experience: str
    usage_duration: str
    improvements: list[str]
    stars_of_review: int
    support_quality: int
    refund: bool
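The OpenAI Python SDK can also consume this Pydantic model directly via its `parse` helper; a sketch under the same assumptions about `prompt` and `review_text`:

from openai import OpenAI

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": review_text},
    ],
    response_format=ProductReview,  # the Pydantic model above
)
review = completion.choices[0].message.parsed  # a ProductReview instance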

In our implementation, we used the LangChain framework, which makes it easy to work with different LLMs and to switch between models. To reduce costs, we used the batch API, which offers a 50% discount on requests to the OpenAI API.

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import os
import config


os.environ['OPENAI_API_KEY'] = config.OPENAI_API_KEY

MODEL = 'gpt-4o-2024-08-06'

llm = ChatOpenAI(
    model=MODEL,
    temperature=0,   # deterministic extraction
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
# Bind the Pydantic schema so responses are parsed into ProductReview objects
structured_llm = llm.with_structured_output(ProductReview)

# `batch` is a list of inputs, one per review
response = structured_llm.batch(batch)
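For context, a sketch of how the `batch` input above might be assembled (the `reviews` list and `SYSTEM_PROMPT` constant are assumptions, not part of the original code):

from langchain_core.messages import SystemMessage

# Each batch element is a list of messages: the extraction prompt plus one review text.
batch = [
    [SystemMessage(content=SYSTEM_PROMPT), HumanMessage(content=review_text)]
    for review_text in reviews
]

Each element of `response` is then a parsed ProductReview object for the corresponding review.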

Metrics

We computed metrics in two key areas to assess the model’s accuracy: star rating prediction and metadata extraction. For star ratings, we compared the model’s predictions against the ground-truth ratings available in the initial dataset. For metadata extraction, we designed a custom embedding-based metric to compare predictions with the labels.

Stars accuracy

For star ratings, we simply chose 100 reviews (a bit more than our labeled dataset). To evaluate the accuracy of the model’s star rating predictions, we employed the Root Mean Squared Error (RMSE) metric. RMSE measures the error between predicted and actual values and penalizes large errors heavily, since the differences are squared:

RMSE = √( (1/n) · Σᵢ (ŷᵢ − yᵢ)² ), where ŷᵢ is the predicted star rating, yᵢ is the actual rating, and n is the number of reviews.
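In code this is a one-liner; a minimal sketch with placeholder arrays (the values are invented):

import numpy as np

actual = np.array([4, 4, 3, 5, 1])     # star ratings from the dataset (placeholders)
predicted = np.array([5, 4, 3, 5, 2])  # star ratings predicted by the LLM (placeholders)
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)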

The RMSE computed on the LLM’s zero-shot predictions, without any prior fine-tuning, was 0.76. That may not seem particularly close to 0, but for reviews a deviation of ±1 star is acceptable, given the personal biases in how people assign ratings.

To understand the nature of the errors, we visualized the confusion matrix for the obtained results. It shows how often each predicted star rating was assigned and how the predictions correlate with the star ratings in the labels.

Source: Image created by the author
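A matrix like this can be produced with scikit-learn and matplotlib; a minimal sketch, reusing the placeholder `actual` and `predicted` arrays from above:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are actual star ratings, columns are predicted ones
ConfusionMatrixDisplay.from_predictions(actual, predicted, labels=[1, 2, 3, 4, 5])
plt.title("Predicted vs. actual star ratings")
plt.show()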

The matrix shows that the model’s predictions typically differ from the actual ratings in the dataset by 1 star. Digging into these disagreements, we found the following reviews:

This movie with all of its animals really keeps my grandson occupied when I’m babysitting. He loves watching it over and over again. I can get so much accomplished before he gets bored with it. So, it is worth the money! As a matter of fact, I’m buying one for his house, his 3 y/o cousins’ and my house. We’re saying goodbye to Netflix! Way to go Amazon!!Gammy

Surprisingly, this review is rated only 3 stars, which seems odd given its tone. Based on how it is written, the model assigned it 5 stars, which seems appropriate.

This book was in my library’s Genealogy section, and you’re not allowed to take Genealogy books out of the library. But I liked it enough to sit in the library and read it, in two sittings. This a quick read, because it has loads of pictures: photos of the ghetto and its inhabitants, and also pictures of artifacts such as ration cards, work certificates, yellow stars, etc. It’s more like a museum exhibit than a simple book. Because of the format I think young people would be able to get something out of the book too, although it is clearly written for adults. Certainly it probably has the most information on the Kovno/Kaunas ghetto all in one place. Both researchers and the ordinary person interested in the Holocaust would enjoy this. Bonus: pages from the diary of Ilya Gerber are printed in this book; you can read extracts from the diary in Alexandra Zapruder’s Salvaged Pages: Young Writers’ Diaries of the Holocaust.

This review has a rating of 4 stars, while the model assigned it 5 stars.

The best Racket ever made. I’ve been using this model racket for around 15 years and it’s still one of the best rackets ever made by Prince. The Prince Graphite Mid Plus is still one of the stiffest rackets you can get and it also has excellence weight and balance. This racket is more suited to the higher standard player. It provides exceptional control especially with volleys. I have these rackets strung at between 62–64 pounds for hard court and grass court play. It will be a sad day when there are no more of these rackets available.

Given this enthusiastic text, the model assigned it 5 stars. However, for some reason, the actual rating is only 4 stars.

This analysis shows that the star rating of a review does not always clearly correspond to its text and is often subjective. The LLM predicts the rating from the semantics of the review and its interpretation, so its prediction may generalize the review’s content better.

Note: LLMs have their own biases, and these can sometimes play a crucial role in the model’s predictions. Keep that in mind when working with LLMs, and don’t take their outputs for granted.

Extracted metadata

RMSE is not applicable for measuring the quality of metadata extraction, because all the remaining fields are either strings or lists of strings. Therefore, we had to represent these strings as vectors and compute their similarity.

To correctly measure the similarity between string values, we must take the underlying context and semantics into account. By transforming the strings into high-dimensional embedding representations that capture semantic detail, we can calculate the angle between the resulting vectors. This measure, known as cosine similarity, is a common way to compare texts in NLP. We used the OpenAI text embedding model (text-embedding-3-large) to obtain embeddings of the strings and computed the cosine similarity between them. Cosine similarity is calculated as follows:

cosine similarity(A, B) = (A · B) / (||A|| · ||B||)

where A · B is the dot product of vectors A and B, and ||A|| and ||B|| are their Euclidean norms.

Some fields are lists of strings, so a prediction may contain more or fewer entries than the corresponding label. To compute metrics in this case, we embedded every string in both lists and built a matrix whose (i, j) entry is the cosine similarity between the i-th entry in the predictions and the j-th entry in the labels.

Taking the maximum value over each row of the matrix gives the best matching label for each predicted vector; similarly, the maximum over each column gives the best matching prediction for each label vector. Having this correspondence in both directions is useful when the lists have a different number of entries. With the best correspondences between vectors, we can compute the averaged cosine similarities as precision and recall, and derive the F1 score from them.

Below is the implementation of the algorithm defined above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def compute_bidirectional_similarity(true_embeddings, pred_embeddings):
    # Both lists empty: perfect match by convention
    if not pred_embeddings and not true_embeddings:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}

    # Only one list empty: nothing can match
    if not pred_embeddings or not true_embeddings:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    pred_matrix = np.array(pred_embeddings)
    true_matrix = np.array(true_embeddings)

    # (i, j) entry: cosine similarity between i-th prediction and j-th label
    sim_pred_to_true = cosine_similarity(pred_matrix, true_matrix)

    # Best matching label for each prediction -> precision
    max_sim_pred = sim_pred_to_true.max(axis=1)
    precision = max_sim_pred.mean()

    # Best matching prediction for each label -> recall
    max_sim_true = sim_pred_to_true.max(axis=0)
    recall = max_sim_true.mean()

    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall else 0

    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1_score, 4),
    }
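For completeness, a sketch of how the embeddings might be obtained and the metric applied to one field (the field values are invented and `get_embeddings` is a hypothetical helper, not part of the original code):

from openai import OpenAI

client = OpenAI()

def get_embeddings(texts: list[str]) -> list[list[float]]:
    # Returns one embedding vector per input string.
    if not texts:
        return []
    result = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in result.data]

true_pros = ["battery lasts a full day", "lightweight"]   # labeled field values (example)
pred_pros = ["long battery life"]                         # values extracted by the LLM (example)
metrics = compute_bidirectional_similarity(get_embeddings(true_pros), get_embeddings(pred_pros))
print(metrics)  # {'precision': ..., 'recall': ..., 'f1': ...}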

Here are the results obtained for the chosen labeled subset of reviews.

Source: Image created by the author

Conclusion

Source: Image generated by the author using AI (Flux AI)

Our data extraction algorithm built on OpenAI’s GPT-4o demonstrated how easily important metadata can be gathered from unstructured data for further analysis, a task most classical algorithms struggle with.

Considering the results, the LLM performed well both in predicting star ratings and in extracting key features from reviews. Moreover, the model’s prediction of the user feedback rating may be better justified and less affected by personal subjectivity. Who knows, maybe soon we will see two product ratings (one based on user scores and one based on LLM scoring), or an average of the two that is more useful to the end consumer.

Overall, LLMs are a powerful tool for businesses to better understand customers’ needs, facilitate analysis, enhance products, and process large amounts of unstructured data faster.

All the labeled data alongside the code can be found on the Drive.

P.S.

I would like to mention Bohdan Shevchuk as a co-author of this article, as he contributed significantly to its development. In addition, I want to thank my colleagues, Alex Simkiv, Andy Bosyi, and Nazar Savchenko for fruitful discussions and collaboration as well as the entire MindCraft.ai team for their constant support.


Published via Towards AI
