Inventory Optimization with Data Science: Hands-On Tutorial with Python
Last Updated on November 5, 2023 by Editorial Team
Author(s): Peyman Kor
Originally published on Towards AI.
Intro
Effective inventory management is important for businesses across various industries. In the previous blog, we explored how to model the inventory process and how to frame it as a Markov Process.
The illustration below shows how the probabilistic movement between the states is modeled:
In this blog, we go one step further and look at how to frame inventory optimization as a Markov Reward Process (MRP). Here, we have the additional term "Reward", meaning we need to think about how to assign a "Reward" to the process.
This blog will have the following structure:
- Markov Reward Process (MRP)
- Reward Modeling Bike Shop Inventory Optimization
- Hands-On Python Coding
- Bellman Equation
- Putting Everything Together
- Key Summary
At the end of the blog, you will learn:
- How to frame the Markov Reward Process (MRP) for Bike Shop Inventory Optimization.
- How to write hands-on Python code to model an MRP.
Markov Reward Process:
A Markov Reward Process (MRP for short) is a Markov Process in which the sequence of (State, Reward) tuples satisfies the Markov property:

P[ (S_{t+1}, R_{t+1}) | S_t, S_{t-1}, ..., S_0 ] = P[ (S_{t+1}, R_{t+1}) | S_t ]

Here, R_{t+1} is the reward induced when we move from state S_{t} to S_{t+1}.
An MRP needs a new Python data structure: one that keeps track not only of the state evolution with its probabilities, but also of the rewards.
To show what an MRP data structure can look like in Python, here is an example called MarkovRewProcess_Exa. Let me explain what it looks like:
MarkovRewProcess_Exa = {"Current State A":{("NextS1fromA", "RewardS1fromA"):"PNextS1fromA",
("NextS2fromA", "RewardS2fromA"):"PNextS2fromA"},
"Current State B":{("NextS1fromB", "RewardS1fromB"):"PNextS1fromB",
("NextS2fromB", "RewardS2fromB"):"PNextS2fromB"}}
The **MarkovRewProcess_Exa** is a dictionary whose keys are the current state, S_t. Moving from each state to a new state yields some reward, and each move from S_t to a tuple (S_{t+1}, R_{t+1}) has a specific probability.
Here you can see that when you are at S_t = "Current State A" and you move to the next (state, reward) pair ("NextS1fromA", "RewardS1fromA"), the probability of this move is **PNextS1fromA**:
MarkovRewProcess_Exa["Current State A"][("NextS1fromA", "RewardS1fromA")]
And we can iterate over all possible (new state, reward) pairs to get an overview of how the process evolves:
for (state, value) in MarkovRewProcess_Exa.items():
    print("The Current state is: {}".format(state))
    for (next_state, reward), trans_prob in value.items():
        print("The ({} with {}) \nwith Probability of: {}".format(next_state, reward, trans_prob))
Reward Modeling Bike Shop Inventory Optimization
Now, at this stage, we need to assign a reward to the Markov Process of the bike shop that we discussed in the previous article. That is, given that you are at state S_t, if you move to state S_{t+1}, how much "cost/reward" is accrued?
In the bike shop inventory optimization example, the cost is composed of two parts:
– 1) Holding cost: when we keep items in the inventory overnight, we incur a holding cost (maintenance, storage, or other fees).
– 2) Missed-customer cost: this happens when demand exceeds what you have on hand. For example, if you have three bicycles in your inventory but five customers show up, you miss two customers.
In other words, the reward in the bike shop example is the negative of the total cost:

R_{t+1} = -(holding cost + missed-customer cost)

However, we cannot write this as a single equation, because we need to distinguish two different cases.
Case I) The number of available items in inventory meets the demand
In this case, we do not have any "missed customer" cost, meaning we are only dealing with the overnight holding cost. We assume this holding cost per bicycle (item) is equal to h, so we can write:

R_{t+1} = -hα
Note that states are "recorded" each day at 8 PM. For example, today is Monday, and at 8 PM you record that you have two bicycles in the inventory (α = 2) and one bicycle on order that arrives tomorrow morning at 6 AM (β = 1). If tomorrow's demand is i = 1, the reward equals the holding cost:

-hα = -h × 2
Case II) The number of available items in inventory is lower than the demand
In this case, demand i is equal to or greater than all the bicycles you have available in the early morning, α + β. The cost is then the holding cost -hα plus the missed-customer cost, where the unit cost per missed customer is p:

R_{t+1} = -hα - p · E[ max(i - (α + β), 0) ]

The second term on the right side needs some explanation. The term p is simply the unit cost per missed customer, and it multiplies the expected number of missed customers, E[ max(i - (α + β), 0) ]. For Poisson demand with rate λ, this expectation has the closed form λ·P(i ≥ α + β) - (α + β)·P(i ≥ α + β + 1), which is exactly what the code below uses.
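As a small sanity check on this closed form, here is a sketch with purely illustrative numbers (λ = 2, α + β = 2, p = 10) that compares it against a direct summation of p · E[max(i − (α + β), 0)]:

from scipy.stats import poisson

# Illustrative values only: demand rate, bikes available at 6 AM, unit missed-customer cost
lam, n, p = 2.0, 2, 10

# Closed-form expression: p * (lam * P(D >= n) - n * P(D >= n + 1))
closed_form = p * (lam * (1 - poisson.cdf(n - 1, lam)) - n * (1 - poisson.cdf(n, lam)))

# Direct computation of p * E[max(D - n, 0)], truncating the (infinite) Poisson support
direct = p * sum(max(d - n, 0) * poisson.pmf(d, lam) for d in range(200))

print(closed_form, direct)  # the two values should agree up to truncation error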
Hands-On Python Coding:
Below, we write a function that generates the Markov Reward Process dictionary. The final dictionary structure for the Markov Reward Process is as follows:
from typing import Dict, Tuple
MRP_dict: Dict[Tuple[int, int], Dict[Tuple[Tuple[int, int], float], float]] = {}
Here is the function, with some explanation and annotation of the final data structure:
from typing import Dict, Tuple
from scipy.stats import poisson
def generate_Markov_Rew_Process_Dict(user_capacity: int, user_poisson_lambda: float,
                                     holding_cost: int, missedcostumer_cost: int):
    """
    Generate a Markov Reward Process Dictionary based on the provided parameters.
    Args:
        user_capacity (int): The capacity of the inventory.
        user_poisson_lambda (float): The Poisson lambda parameter of the daily demand.
        holding_cost (int): The unit cost of holding an item in inventory overnight.
        missedcostumer_cost (int): The unit cost of a missed customer.
    Returns:
        Dict[Tuple[int, int], Dict[Tuple[Tuple[int, int], float], float]]: The Markov Reward Process Dictionary.
    """
MRP_dict: Dict[Tuple[int, int], Dict[Tuple[Tuple[int, int], float], float]] = {}
for alpha in range(user_capacity + 1):
for beta in range(user_capacity + 1 - alpha):
state = (alpha, beta)
init_inv = alpha + beta
beta1 = user_capacity - init_inv
base_reward = -alpha * holding_cost
for demand in range(init_inv + 1):
if demand <= (init_inv - 1):
# Calculate transition probability for demand
transition_prob = poisson.pmf(demand, user_poisson_lambda)
if state in MRP_dict:
MRP_dict[state][((init_inv - demand, beta1), base_reward)]= transition_prob
else:
MRP_dict[state] = {((init_inv - demand, beta1), base_reward):transition_prob}
else:
# probability of not meeting the demands
transition_prob = 1 - poisson.cdf(init_inv - 1, user_poisson_lambda)
transition_prob2 = 1 - poisson.cdf(init_inv, user_poisson_lambda)
reward = base_reward - missedcostumer_cost * ((user_poisson_lambda * transition_prob) -
init_inv * transition_prob2)
if state in MRP_dict:
MRP_dict[state][((0, beta1),reward)]= transition_prob
else:
MRP_dict[state] = {((0, beta1),reward):transition_prob}
return MRP_dict
Having defined the function, we can pass it the same inputs as in the previous article and then get the Markov Reward Process dictionary.
# Example usage:
inv_capacity_val = 2
poisson_lambda_val = 2.0
holding_cost_val = 1
missedcostumer_cost_val = 10
MRP_dict = generate_Markov_Rew_Process_Dict(
user_capacity= inv_capacity_val,
user_poisson_lambda = poisson_lambda_val,
holding_cost = holding_cost_val,
missedcostumer_cost = missedcostumer_cost_val)
We can print out this dictionary and see how the data is stored in it:
for (state, value) in MRP_dict.items():
print("The Current state is: {}".format(state))
for (next_state,reward),trans_prob in value.items():
print("The (State ,Reward): ({} , {:.2f}) occurs with Probability of: {:.2f} \
".format(next_state, reward, trans_prob ))
Bellman Equation:
The main idea of the Markov Reward Process is to figure out the expected total return of each state if we let the process run for a long time. The "State Value Function" V(s), which maps each state to its expected total (discounted) return, can be defined as:

V(s) = E[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... | S_t = s ]

which can be rewritten recursively as the Bellman equation:

V(s) = R(s) + γ · Σ_{s'} P(s' | s) · V(s')

Some explanation of the second equation is needed. V(s), the expected total return starting from state s, is composed of two parts: the expected immediate reward R(s) that we get right away, plus the discounted average of the values V(s') of all the states we could go to next, weighted by their transition probabilities. Here γ is the discount factor.
In other words, this equation is recursive.
The goal of the rest of this blog is to find the state value function V(s), which involves two parts:
– Part I) Find the Expected Immediate Reward for all states
– Part II) Build the Transition Probability Matrix
Part I) Find the Expected Immediate Reward
The function below computes R(s), the expected immediate reward, for every state in the inventory problem.
def calculate_expected_immediate_rewards(MarkovRewProcessDict):
"""
Calculate the expected immediate rewards for each state in a Markov Reward Process.
Args:
MarkovRewProcessDict (dict): Markov Reward Process Dictionary.
Returns:
dict: Dictionary with states as keys and their expected immediate rewards as values.
"""
E_immediate_R = {} # Initialize a dictionary to store expected immediate rewards
for from_state, value in MarkovRewProcessDict.items():
# Calculate the expected immediate reward for the current state
expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
E_immediate_R[from_state] = expected_reward
return E_immediate_R
# Example usage:
# E_immediate_R = calculate_expected_immediate_rewards(MarkovRewProcessDict)
Calling the function:
R_exp_imm = calculate_expected_immediate_rewards(MRP_dict)
R_exp_imm
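If you prefer a tabular view, the dictionary can also be wrapped in a pandas Series (a purely cosmetic step, not part of the algorithm):

import pandas as pd

# Display the expected immediate rewards as a table indexed by state
pd.Series(R_exp_imm, name="Expected Immediate Reward")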
Part II) Build the Transition Probability Matrix
The transition probability matrix is a table in which the *states* label both the rows and the columns, and each cell shows the probability of moving from the row state to the column state.
We can initialize this matrix with:
import pandas as pd

states = list(MRP_dict.keys())
df_trans_prob = pd.DataFrame(0.0, columns=states, index=states)
df_trans_prob
The above initializes the matrix. Now we can fill it in using the function below:
import numpy as np
import pandas as pd
def create_transition_probability_matrix(MarkovRewProcessDict):
"""
Create a transition probability matrix from a Markov Reward Process Dictionary.
Args:
MarkovRewProcessDict (dict): Markov Reward Process Dictionary.
Returns:
pd.DataFrame: Transition probability matrix.
"""
states = list(MarkovRewProcessDict.keys())
num_states = len(states)
# Initialize an empty matrix with zeros
trans_prob = np.zeros((num_states, num_states))
df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
for i, from_state in enumerate(states):
for j, to_state in enumerate(states):
for (new_state, reward) in MarkovRewProcessDict.get(from_state, {}):
if new_state == to_state:
probability = MarkovRewProcessDict[from_state].get((new_state, reward), 0.0)
df_trans_prob.iloc[i, j] = probability
return df_trans_prob
Calling the function:
df_trans_prob = create_transition_probability_matrix(MRP_dict)
df_trans_prob
Let me explain what this matrix means. For example, consider the third row, which corresponds to state S_t = (0, 2). From this state, the transition probabilities can be written as:

P(S_{t+1} = (2, 0) | S_t = (0, 2)) = P(demand = 0)
P(S_{t+1} = (1, 0) | S_t = (0, 2)) = P(demand = 1)
P(S_{t+1} = (0, 0) | S_t = (0, 2)) = P(demand ≥ 2)
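You can also read that row directly off the DataFrame and compare it with the Poisson probabilities (a quick check, reusing poisson_lambda_val = 2.0 from above):

from scipy.stats import poisson

# Third row of the matrix, i.e., state (0, 2)
print(df_trans_prob.iloc[[2]])

# The nonzero entries of that row should match P(demand = 0), P(demand = 1) and P(demand >= 2)
print(poisson.pmf(0, poisson_lambda_val),
      poisson.pmf(1, poisson_lambda_val),
      1 - poisson.cdf(1, poisson_lambda_val))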
Now, having the transition probability matrix and the expected immediate rewards of the states, we can calculate the value function of each state using the vector/matrix form of the Bellman equation:

V = R + γ·P·V  ⟹  V = (I − γ·P)⁻¹·R

which the function below solves as a linear system:
import numpy as np
import pandas as pd
def calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma):
"""
Calculate the state value function using transition probability matrix and expected immediate rewards.
Args:
trans_prob_mat (pd.DataFrame): Transition probability matrix.
expected_immediate_rew (dict): Dictionary with states as keys and expected immediate rewards as values.
gamma (float): Discount factor.
Returns:
pd.DataFrame: DataFrame with 'Expected Immediate Reward' and 'Value Function' columns indexed by states.
"""
states = list(expected_immediate_rew.keys())
R_exp = np.array(list(expected_immediate_rew.values()))
# Calculate the value function vector using the equation
val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)
# Create a DataFrame with 'Expected Immediate Reward' and 'Value Function' columns
MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states)
return MarkRevData
# Example usage:
# MarkRevData = calculate_state_value_function(df_trans_prob, E_immediate_R, gamma)
Calling this function:
calculate_state_value_function(trans_prob_mat=df_trans_prob, expected_immediate_rew=R_exp_imm, gamma=0.9)
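It can also be convenient to keep the returned DataFrame and rank the states by their value; the variable name MRP_result below is just a suggestion:

MRP_result = calculate_state_value_function(trans_prob_mat=df_trans_prob,
                                            expected_immediate_rew=R_exp_imm,
                                            gamma=0.9)

# Rank states from best (least negative expected return) to worst
print(MRP_result.sort_values(by="Value Function", ascending=False))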
Putting Everything Together:
One way is to put all the code together into a MarkovRewardProcess class and call its methods. The full code (less than 100 lines) is below:
from typing import Dict, Tuple
from scipy.stats import poisson
import numpy as np
import pandas as pd
class MarkovRewardProcess:
def __init__(self):
self.MRP_dict = {}
    def generate_Markov_Rew_Process_Dict(self, user_capacity: int, user_poisson_lambda: float,
                                         holding_cost: int, missedcostumer_cost: int):
self.MRP_dict = {} # Initialize the Markov Reward Process Dictionary
for alpha in range(user_capacity + 1):
for beta in range(user_capacity + 1 - alpha):
state = (alpha, beta)
init_inv = alpha + beta
beta1 = user_capacity - init_inv
base_reward = -alpha * holding_cost
for demand in range(init_inv + 1):
if demand <= (init_inv - 1):
transition_prob = poisson.pmf(demand, user_poisson_lambda)
if state in self.MRP_dict:
self.MRP_dict[state][((init_inv - demand, beta1), base_reward)] = transition_prob
else:
self.MRP_dict[state] = {((init_inv - demand, beta1), base_reward): transition_prob}
else:
transition_prob = 1 - poisson.cdf(init_inv - 1, user_poisson_lambda)
transition_prob2 = 1 - poisson.cdf(init_inv, user_poisson_lambda)
reward = base_reward - missedcostumer_cost * ((user_poisson_lambda * transition_prob) -
init_inv * transition_prob2)
if state in self.MRP_dict:
self.MRP_dict[state][((0, beta1), reward)] = transition_prob
else:
self.MRP_dict[state] = {((0, beta1), reward): transition_prob}
def calculate_expected_immediate_rewards(self):
E_immediate_R = {}
for from_state, value in self.MRP_dict.items():
expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
E_immediate_R[from_state] = expected_reward
return E_immediate_R
def create_transition_probability_matrix(self):
states = list(self.MRP_dict.keys())
num_states = len(states)
trans_prob = np.zeros((num_states, num_states))
df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
for i, from_state in enumerate(states):
for j, to_state in enumerate(states):
for (new_state, reward) in self.MRP_dict.get(from_state, {}):
if new_state == to_state:
probability = self.MRP_dict[from_state].get((new_state, reward), 0.0)
df_trans_prob.iloc[i, j] = probability
return df_trans_prob
def calculate_state_value_function(self, trans_prob_mat, expected_immediate_rew, gamma):
states = list(expected_immediate_rew.keys())
R_exp = np.array(list(expected_immediate_rew.values()))
val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)
MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states)
return MarkRevData
mrp = MarkovRewardProcess()
# Generate the Markov Reward Process Dictionary
user_capacity = 2
user_poisson_lambda = 2
holding_cost = 1
missedcostumer_cost = 10
mrp.generate_Markov_Rew_Process_Dict(user_capacity, user_poisson_lambda, holding_cost, missedcostumer_cost)
E_immediate_R = mrp.calculate_expected_immediate_rewards()
trans_prob_mat = mrp.create_transition_probability_matrix()
gamma = 0.9 # Replace with your desired discount factor
MRP_Data = mrp.calculate_state_value_function(trans_prob_mat, E_immediate_R, gamma)
print(MRP_Data)
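As an optional cross-check (assuming the step-by-step functions from earlier in the post are still defined in the same session), the class-based result should match the earlier computation for the same inputs and the same gamma:

# Optional: compare the class-based result with the step-by-step computation above
step_by_step = calculate_state_value_function(trans_prob_mat=df_trans_prob,
                                              expected_immediate_rew=R_exp_imm,
                                              gamma=0.9)
print(np.allclose(MRP_Data.values, step_by_step.values))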
Key Summary:
– In this blog, the Markov Reward Process was introduced, and the "reward" of the process was defined for the inventory optimization example.
– A new Python data structure, "MRP_dict", was introduced that tracks the states, the rewards, and the probability of each next (state, reward) pair.
– The Markov Reward Process aims to find the State Value Function, i.e., the "Expected Total Return" of every state. This was achieved using calculate_state_value_function in the Python code.
[1] You can read about this example in more depth in "Foundations of Reinforcement Learning with Applications in Finance". However, I have rewritten the Python code in this blog to make it easier to understand.
Thanks for reading so far!
I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.
If you think this article helped you to learn more about inventory optimization and the Markov Process, please give it a 👏 and follow!
Published via Towards AI