

Using LLMs to Build Explainable Recommender Systems

Last Updated on January 12, 2024 by Editorial Team

Author(s): Hang Yu

Originally published on Towards AI.

Photo by Sean Benesh on Unsplash

Recently, LLMs have clearly taken the spotlight. While this powerful technique has raised a range of ethical concerns, its benefits are still being explored, both to enhance existing use cases and to discover new opportunities. In this post, I’ll demonstrate how LLMs can improve recommender systems in two ways at once:

  1. Increase the predictive power. This is achieved by ranking the candidates generated by the upstream model.
  2. Provide explainability. By leveraging the rich knowledge compressed in the LLM, each recommendation comes with a context-based explanation. This largely benefits use cases that need to interpret model behavior.

For the sake of clarity, only the most essential code is displayed in this article. However, this experiment can be reproduced and extended based on the code hosted on GitHub.

Recommender system

First, let’s recap what a personalized recommender system looks like from an architectural perspective. As depicted below, it’s generally a multi-stage funnel whereby each stage has a decreased number of candidate items with an increased relevance. It consists of two essential stages, which are recall (or matching) and ranking [1]. For a target person, the recall stage retrieves the top items that are potentially of interest from a wide range of channels, including interactions, promotions, etc., and those candidates are then ordered by the ranking stage to prioritize the best ones. The ranking stage is usually a model trained based on the fine-grained user and item knowledge.

In modern systems, the funnel can optionally include pre-ranking and re-ranking stages before and after ranking, respectively, to accommodate large candidate sets, high model complexity, and business-specific rules, depending on the needs of the platform.

A multi-stage recommender system | Image by author
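
To make the funnel idea concrete, here is a minimal sketch in plain Python. The scoring dictionaries, item names, and cut-off sizes are made up for illustration; they are stand-ins for the recall and ranking models, not the ones used in this article.

```python
from typing import Callable, Dict, List


def run_funnel(
    items: List[str],
    recall_score: Callable[[str], float],
    rank_score: Callable[[str], float],
    n_recall: int,
    n_final: int,
) -> List[str]:
    """Two-stage funnel: a cheap recall pass, then a finer ranking pass."""
    # Recall: keep the n_recall items with the highest coarse score.
    recalled = sorted(items, key=recall_score, reverse=True)[:n_recall]
    # Ranking: re-order only the survivors with the more expensive scorer.
    return sorted(recalled, key=rank_score, reverse=True)[:n_final]


# Toy scores (hypothetical values, for illustration only).
coarse: Dict[str, float] = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1, "E": 0.05}
fine: Dict[str, float] = {"A": 0.2, "B": 0.9, "C": 0.6, "D": 1.0, "E": 0.3}

top = run_funnel(list(coarse), coarse.get, fine.get, n_recall=3, n_final=2)
print(top)  # ['B', 'C']
```

Note that item D has the best fine-grained score but never reaches the ranking stage, because recall already dropped it. The quality of the final list is bounded by what the recall stage lets through.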

LLM-based recommender system

This work mainly focuses on applying an LLM in the ranking stage, for the following reasons:

  1. LLMs’ rich external knowledge, which is complementary to the original dataset, would better identify the relative relationships among items.
  2. LLMs are limited by their token length. The ranking stage has a smaller number of candidates to process, which mitigates the impact of this limitation.

Briefly, the system is a two-stage architecture whereby the recall stage is implemented using Matrix Factorization (MF) followed by an LLM-based ranking module generating ranks and explanations. Next, I’ll describe the dataset used, and how the recall and ranking stages are implemented and evaluated.


The dataset used is the publicly available MovieLens 100k. This popular benchmark contains 100,000 movie ratings from 943 users on 1,682 movies (often rounded to 1,000 users and 1,700 movies). The ratings are then converted to binary implicit feedback, where a rating above 3 counts as a like, because implicit signals are more common in the real world.

ratings['like'] = ratings['rating'] > 3

The processed dataset is shown below. The new column ‘like’ is the implicit signal derived from the original ratings.

A sample of MovieLens 100k

Next, the dataset is split into a training set and a test set, with 90% of the ratings used for training.

train_ratio = 0.9
train_size = int(len(ratings)*train_ratio)
ratings_train = ratings.sample(train_size, random_state=42)
ratings_test = ratings[~ratings.index.isin(ratings_train.index)]

Now, we have the data prepared. Let’s build the recall stage!

Recall stage

The recall stage is implemented using the widely adopted Matrix Factorization (MF), a type of collaborative filtering. The basic idea is to project users and items into the same embedding space and model their similarity based on the binary user-item interactions. For the implementation, Alternating Least Squares (ALS), a fast way to fit the factorization, is adopted.
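
As a rough illustration of how a shared embedding space is used at recommendation time, the sketch below scores items by the dot product between a user vector and each item vector and keeps the top N. It does not show how ALS learns the embeddings. The vectors, the `top_n` helper, and the `seen` set are hypothetical and chosen only to keep the example small.

```python
from typing import Dict, List, Set


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_n(
    user_vec: List[float],
    item_vecs: Dict[int, List[float]],
    seen: Set[int],
    n: int,
) -> List[int]:
    """Return the IDs of the n unseen items with the highest user-item score."""
    scores = {
        item_id: dot(user_vec, vec)
        for item_id, vec in item_vecs.items()
        if item_id not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical 2-dimensional embeddings.
user_vec = [1.0, 0.0]
item_vecs = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.5, 0.5], 4: [0.8, 0.3]}

recs = top_n(user_vec, item_vecs, seen={1}, n=2)
print(recs)  # [4, 3]
```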

Firstly, the data needs to be transformed into a sparse matrix with the row indices representing user ID and column indices representing item ID. It’s worth noting that the “ID minus 1” operation transforms actual IDs to indices that start from 0.

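import numpy as np
from scipy.sparse import csr_matrix

# matrix dimensions: IDs start at 1, so the largest ID seen sets the size
n_users = ratings_train['user id'].max()
n_item = ratings_train['item id'].max()
# keep only the positive (liked) interactions
ratings_train_pos = ratings_train[ratings_train['like']]
ratings_test_pos = ratings_test[ratings_test['like']]

# "ID minus 1" converts 1-based IDs to 0-based matrix indices
row = ratings_train_pos['user id'].values - 1
col = ratings_train_pos['item id'].values - 1
# `data` is not defined in the excerpt; assumed here to be a weight of 1 per liked interaction
data = np.ones(len(ratings_train_pos))
user_item_data = csr_matrix((data, (row, col)), shape=(n_users, n_item))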

Now we simply use the interaction matrix to train the MF model, setting 50 latent factors and leaving the other parameters at their defaults.

import implicit

# initialize a model
model = implicit.als.AlternatingLeastSquares(factors=50, random_state=42)

# train the model on a sparse matrix of user/item/confidence weights
model.fit(user_item_data)

As shown by the following function, retrieving the top-N candidate items for a given user is pretty simple. Here, the movies already watched are filtered out as we aim to recommend new ones.

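def recall_stage(model, user_id, user_item_data, ratings_train, N):
    # items the user already interacted with in training (as 0-based indices)
    filter_items = ratings_train[ratings_train['user id']==user_id]['item id'].values
    filter_items = filter_items - 1
    user_id = user_id - 1

    # NOTE: this call was truncated in the source; the arguments after user_id
    # are reconstructed from the variables prepared above and implicit's
    # recommend() signature, so treat them as an assumption
    recs, scores = model.recommend(user_id, user_item_data[user_id],
                                   filter_items=filter_items, N=N)
    # convert 0-based indices back to item IDs
    recs = recs.flatten() + 1
    return recs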

Ranking stage

Now, it’s time to build the ranking stage using an LLM. Briefly, the idea is to let the LLM tell us whether the user would like each item returned by the recall stage, based on the user’s preferences in the training data and the LLM’s own knowledge. This is implemented as a typical few-shot prompt, shown below.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
import openai
import os
from google.colab import userdata

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(temperature=0.0, model=llm_model)

prompt = ChatPromptTemplate.from_template(
"""The person has a list of liked movies: {movies_liked}. \
The person has a list of disliked movies: {movies_disliked}. \
Tell me if this person likes each of the candidate movies: {movies_candidates}.\
Return a list of boolean values and explain why the person likes or dislikes.

Return a markdown code snippet with a list of JSON object formatted to look like:
"title": string \ the name of the movie in candidate movies
"like": boolean \ true or false
"explanation": string \ explain why the person likes or dislikes the candidate movie

REMEMBER: Each boolean and explanation for each element in candidate movies.
REMEMBER: The explanation must relate to the person's liked and disliked movies.
"""
)

chain = LLMChain(llm=llm, prompt=prompt)

This prompt requires three input parameters:

  1. movies_liked: a list of titles of the movies liked by the user.
  2. movies_disliked: a list of titles of the movies disliked by the user.
  3. movies_candidates: a list of titles of the candidate movies to be judged by the LLM based on the few-shot user preference and its compressed knowledge.

Specifically, movies_liked and movies_disliked are extracted from the training data, whereas movies_candidates is the output of the recall stage. The LLM is instructed to provide both the binary decision and explanation about the user’s feedback to each candidate movie. Moreover, it is expected to associate the explanation with user preference to make it contextual.
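
Since the prompt asks for a markdown code snippet containing a list of JSON objects, the reply has to be parsed before it can be used. Below is a defensive sketch of such a parser, assuming the reply wraps a JSON array in a markdown fence. Real model output can deviate from this, and `parse_llm_reply` is a hypothetical helper name, not part of the original code.

```python
import json
import re
from typing import Any, Dict, List

# A markdown code fence, built this way so it does not clash with this example's own fence.
FENCE = "`" * 3
REQUIRED_KEYS = {"title", "like", "explanation"}


def parse_llm_reply(reply: str) -> List[Dict[str, Any]]:
    """Extract a list of JSON objects from a reply that may be wrapped in a markdown fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    # Fall back to the raw reply when no fence is found.
    payload = match.group(1) if match else reply
    parsed = json.loads(payload.strip())
    # Accept a single object as well as a list of objects.
    records = parsed if isinstance(parsed, list) else [parsed]
    for record in records:
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
    return records


# A made-up reply in the format the prompt requests.
reply = (
    FENCE + "json\n"
    '[{"title": "Movie A", "like": true, "explanation": "Similar to liked titles."},\n'
    ' {"title": "Movie B", "like": false, "explanation": "Close to a disliked title."}]\n'
    + FENCE
)

records = parse_llm_reply(reply)
print([r["like"] for r in records])  # [True, False]
```

Validating the keys up front makes a malformed reply fail loudly at parse time rather than later, when the ranks are assembled.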

OpenAI’s GPT-3.5 Turbo is used here; however, it can be swapped for other LLMs. You’re also encouraged to explore and refine the prompt template, as it plays an important role in generating recommendations.

This prompt is then utilized by the ranking function below to generate binary judgments.

import json

import numpy as np
import pandas as pd

def ranking_stage(chain, user_id, ratings_train, pre_recs, movie, batch_size=10):
    # few-shot preferences: the user's training ratings, capped at 300
    few_shot = ratings_train[(ratings_train['user id']==user_id)]
    if len(few_shot) >= 300:
        few_shot = few_shot.sample(300, random_state=42)
    recall_recs = movie.set_index('item id').loc[pre_recs].reset_index()

    movies_liked = ','.join(few_shot[few_shot['like']]['title'].values.tolist())
    movies_disliked = ','.join(few_shot[~few_shot['like']]['title'].values.tolist())

    n_batch = int(np.ceil(len(recall_recs)/batch_size))
    candidates = recall_recs[['item id', 'title']]
    result_json = []

    # send the candidates to the LLM batch by batch
    for i in range(n_batch):
        candidates_batch = candidates.iloc[i*batch_size: (i+1)*batch_size]
        movies_candidates = ','.join(candidates_batch['title'].values.tolist())
        # NOTE: the start of this call was garbled in the source; chain.run with
        # movies_liked as the first keyword is reconstructed from the prompt's inputs
        result = chain.run(movies_liked=movies_liked, movies_disliked=movies_disliked, movies_candidates=movies_candidates)
        # split the reply into one JSON object per candidate
        result_list = result.replace('\n', '').replace('},', '}\n,').split('\n,')
        result_json_batch = [json.loads(r) for r in result_list]
        result_json = result_json + result_json_batch

    result_rank = pd.DataFrame.from_dict(result_json)
    result_rank['item id'] = recall_recs['item id'].values
    # liked candidates first, disliked ones appended; order is kept within each group
    result_rank = pd.concat([result_rank[result_rank['like']], result_rank[~result_rank['like']]])

    return result_rank

In detail, for a given user, the ranking function first prepares the prompt inputs from the training data and the results of the recall stage. The number of movie preferences per user is capped at 300 to reduce the risk of exceeding the token limit.

Next, it calls the LLM with these inputs in batch mode to avoid violating the token length. For each batch, the full movies_liked and movies_disliked are passed as one piece to keep the complete context, whereas the candidate movies are injected in batches.
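
The batching described above can be sketched in plain Python as follows. The `batched` helper and the movie titles are illustrative assumptions; the point is only that the liked and disliked lists stay the same in every call while the candidates are sliced.

```python
from typing import Dict, Iterator, List


def batched(titles: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size titles."""
    for start in range(0, len(titles), batch_size):
        yield titles[start:start + batch_size]


# Hypothetical inputs.
movies_liked = "Liked 1,Liked 2"
movies_disliked = "Disliked 1"
candidates = [f"Movie {i}" for i in range(1, 8)]

# One set of prompt inputs per batch: same context, different candidates.
prompt_inputs: List[Dict[str, str]] = [
    {
        "movies_liked": movies_liked,
        "movies_disliked": movies_disliked,
        "movies_candidates": ",".join(batch),
    }
    for batch in batched(candidates, batch_size=3)
]

print([p["movies_candidates"].count(",") + 1 for p in prompt_inputs])  # [3, 3, 1]
```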

Finally, the candidate list with binary decisions is re-ordered to prioritize the preferred items. As illustrated below, to accommodate the binary signals, the candidates classified as ‘dislikes’ are de-ranked: they are pulled out of the candidate list and appended after the liked ones.

Ranking operation for binary signals | Image by author
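
The re-ordering amounts to a stable partition: liked candidates keep their recall order and move to the front, while disliked ones keep their recall order and move to the back. A minimal sketch with made-up titles and a hypothetical `rerank` helper:

```python
from typing import Any, Dict, List


def rerank(candidates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Put liked candidates first, preserving the original order within each group."""
    liked = [c for c in candidates if c["like"]]
    disliked = [c for c in candidates if not c["like"]]
    return liked + disliked


# Candidates in the order produced by the recall stage (hypothetical).
recall_order = [
    {"title": "A", "like": True},
    {"title": "B", "like": False},
    {"title": "C", "like": True},
    {"title": "D", "like": False},
]

reranked = rerank(recall_order)
print([c["title"] for c in reranked])  # ['A', 'C', 'B', 'D']
```

Because the signal is binary, the order within each group still comes entirely from the recall stage. The LLM only decides which group a candidate falls into.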


To get a sense of the effect of the LLM-based ranking stage, an ablation study was conducted to compare the architecture that has only the recall stage (MF) with the one that adds the ranking stage (MF+GPT).

The metrics adopted are Precision@K [2], Recall@K [2], and DCG@K [3], which are popular options for evaluating recommender systems. Here, the values of K include 5, 10, 15, and 20. To balance validity and compute speed, 20 users are randomly sampled to calculate the average value for each metric. For each user, 30 candidates are generated by the recall stage and the top K movies are compared against the corresponding liked movies in the user’s test data.
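
For reference, the three metrics can be sketched as below. This assumes binary relevance (an item is relevant if it appears among the user’s liked movies in the test data) and the common log2 discount for DCG. The exact definitions in [2] and [3], and in the article’s code, may differ in details. The item IDs are made up.

```python
import math
from typing import List, Set


def precision_at_k(recs: List[int], relevant: Set[int], k: int) -> float:
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / k


def recall_at_k(recs: List[int], relevant: Set[int], k: int) -> float:
    """Share of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recs[:k] if item in relevant)
    return hits / len(relevant)


def dcg_at_k(recs: List[int], relevant: Set[int], k: int) -> float:
    """Discounted cumulative gain with binary relevance and a log2 discount."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recs[:k], start=1)
        if item in relevant
    )


recs = [10, 20, 30, 40, 50]   # ranked recommendations
relevant = {20, 50, 60}       # liked items in the test data

print(precision_at_k(recs, relevant, 5))         # 0.4
print(round(recall_at_k(recs, relevant, 5), 3))  # 0.667
print(round(dcg_at_k(recs, relevant, 5), 3))     # 1.018
```

Precision and recall ignore the position of a hit within the top K, whereas DCG rewards hits near the top. That makes DCG the metric most sensitive to the re-ordering done by the ranking stage.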

The results shown below look promising! MF+GPT outperforms MF on these metrics for the sampled users, which suggests that the LLM-based ranking stage adds value.

Evaluation results


Besides the accuracy uplift, another advantage of the LLM ranking module is the generated explanations for each recommendation. Now, let’s eyeball some examples to sense and check the explainability.

The table below shows the ranking results for one user in the test set. The columns from left to right are the movie titles, explanations, and ranks. After a rough inspection, the explanations make sense in most cases, and the LLM largely meets my expectations.

As instructed, the reasoning considers both the user’s preferences and the LLM’s external knowledge, which characterizes the candidate movies from various angles. One good example is Hoop Dreams (1994): “The person dislikes ‘Hoop Dreams (1994)’ because it is a documentary film, and the person has not shown a preference for this genre. The person’s favorite movies are primarily narrative fiction films, such as ‘Secrets & Lies (1996)’ and ‘L.A. Confidential (1997)’. Additionally, ‘Hoop Dreams (1994)’ focuses on basketball, which may not be a subject of interest for the person based on their movie preferences.”

It associates the candidate movie with two others via the genre and storyline that are far beyond the movie titles injected! Feel free to browse more examples to get a better understanding.

Ranking results and explanations for a user

Final thoughts

Based on the experiment, LLM-based ranking has demonstrated its value in improving the quality of a recommender system. However, there are some limitations identified during the R&D:

  1. Latency. The time consumed to generate recommendations makes it infeasible to enable real-time response to adapt to the user’s latest interest. This is likely to harm the online user experience.
  2. Token length limit. The maximal token length largely limits the context and prior knowledge injected. As a result, the features need to be either carefully engineered or even omitted.
  3. Knowledge recency and validity. The ranking and reasoning rely on the compressed knowledge of LLMs, so they only make sense when such knowledge is up-to-date and correct. For instance, an LLM is unlikely to make good suggestions if the title of a candidate movie is new and not recognized.

As an emerging topic, more ideas are expected to be proposed and experimented with to marry LLMs and recommender systems. I’m eager to receive your feedback and collaborate to make things better.

Thanks for your time, and I hope you like this work.






Published via Towards AI
