Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code
Last Updated on December 11, 2023 by Editorial Team
Author(s): Peyman Kor
Originally published on Towards AI.
How to master inventory decisions using dynamic programming for maximum profitability
Intro
Inventory optimization is a broad problem that arises across many domains. The central question is:
“How many products should you order for your store each period in order to maximize profitability?”
Imagine you are a bike shop manager. Every day, you need to contact your supplier and order a number of bikes for your store. If you order too many bikes, you will spend too much money holding and maintaining them in the store. On the other hand, if you order too few, you might not have enough bikes to meet customer demand. You need an “ordering strategy” that balances these two factors.
We will break down this problem and find the optimal “ordering strategy”. The optimal ordering strategy is built on three pillars:
- the Markov Process,
- the Markov Reward Process,
- and the Markov Decision Process.
Here and here, I covered how to frame the “inventory optimization” problem, the Markov Process, and the Markov Reward Process.
Framing the Inventory Optimization Problem
Inventory Optimization is a sequential decision-making problem. We can frame it by defining the **states**, the **uncertainty**, and the **recurring decisions**.
1. State:
The state (the internal representation of the process) can be described by two components:
- α: the number of bikes you already have in the store
- β: the number of bikes you ordered the previous day that will arrive tomorrow morning (these bikes are on a truck)
2. Uncertainty:
In this example, the uncertainty is the random daily demand, modeled with a Poisson distribution (with parameter $\lambda \in \mathbb{R}_{>0}$). A demand of $i$ bikes, for $i = 0, 1, 2, \dots$, occurs with probability
$$P(D = i) = \frac{e^{-\lambda} \lambda^i}{i!}$$
(A short code snippet after this list shows how these probabilities are computed in Python.)
3. Decisions:
In this example, the bike shop owner must decide every day at 6 PM how many bikes to order from the supplier.
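As referenced above, these demand probabilities can be computed with scipy.stats.poisson, which we will also use later to build the MDP (λ = 1 here, the value assumed throughout this article; the printed numbers are only for intuition):
from scipy.stats import poisson

demand_lambda = 1  # Poisson parameter λ: the average daily demand assumed in this article
for i in range(4):
    # P(demand = i) = exp(-λ) * λ**i / i!
    print(f"P(demand = {i}) = {poisson.pmf(i, demand_lambda):.3f}")
# prints roughly 0.368, 0.368, 0.184, 0.061 for i = 0, 1, 2, 3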
Sequence of 24-hr cycle
The sequence of events in a 24-hour cycle for this bike shop is as follows:
- At 6 PM: you observe the state $S_t = (α, β)$
- At 6 PM: you order new bikes from the supplier
- At 6 AM: you receive the bikes you ordered 36 hours ago
- At 8 AM: you open the store
- From 8 AM to 6 PM: you experience a demand of $i$ bikes during the day (modeled with the Poisson distribution described above)
- At 6 PM: you close the store
The diagram below visualizes this sequence:
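In code, the net effect of one cycle on the state can be sketched as follows (next_state is a hypothetical helper written only for illustration; the full transition model is built later in MDP_dict):
def next_state(alpha: int, beta: int, order: int, demand: int):
    # The bikes on the truck (beta) arrive at 6 AM, demand is served as far as possible,
    # and today's order becomes tomorrow's "on the truck" component.
    on_hand_at_open = alpha + beta                     # inventory at 8 AM
    left_at_close = max(on_hand_at_open - demand, 0)   # inventory at 6 PM
    return (left_at_close, order)                      # new state (alpha', beta')

# e.g., 1 bike in store, 1 arriving, we order 0 and face a demand of 2:
print(next_state(alpha=1, beta=1, order=0, demand=2))  # -> (0, 0)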
In this blog post, we summarize everything and connect the Markov Process and Markov Reward Process with the Markov Decision Process to arrive at the optimal “ordering strategy”, all with Python code.
Markov Decision Process:
A Markov Decision Process (MDP) is a mathematical model whose key components are states, actions, transition probabilities, and rewards.
In the Markov Reward Process, the policy was fixed, and we did not consider all possible actions for each state. In a Markov Decision Process (MDP), we need a data structure that, for each state, enumerates all possible actions, and that, for each state-action pair $(S_t, A_t)$, tells us the possible outcomes $(S_{t+1}, R_{t+1})$ and their probabilities.
Example of Markov Decision Process Data Structure
Here is an example of an MDP dictionary, where you can see that, for each state, all possible actions are considered:
MarkovDecProcessDict = {
    "Current State A": {
        "Action 1": {("NextS1fromAact1", "Reward1"): "PNextS1fromAact1",
                     ("NextS2fromAact1", "Reward2"): "PNextS2fromAact1"},
        "Action 2": {("NextS1fromAact2", "Reward2"): "PNextS1fromAact2",
                     ("NextS2fromAact2", "Reward2"): "PNextS2fromAact2"},
    },
    "Current State B": {
        "Action 1": {("NextS1fromBact1", "Reward1"): "PNextS1fromBact1",
                     ("NextS2fromBact1", "Reward2"): "PNextS2fromBact1"},
        "Action 2": {("NextS1fromBact2", "Reward2"): "PNextS1fromBact2",
                     ("NextS2fromBact2", "Reward2"): "PNextS2fromBact2"},
    },
}
for current_state, actions in MarkovDecProcessDict.items():
    print(f"Current State: {current_state}")
    for action, transitions in actions.items():
        print(f"  Action: {action}")
        for (next_state, reward), probability in transitions.items():
            print(f"    ({next_state}, {reward}): {probability}")
Markov Decision Process for Inventory Optimization
Here, we build the dictionary where, for each state, we consider all possible actions, and for each action, all possible next states and rewards. This dictionary is named **MDP_dict**, and the code that builds it is below:
from typing import Dict, Tuple
# poisson is used to evaluate the pmf and cdf of the Poisson distribution
from scipy.stats import poisson

MDP_dict: Dict[Tuple[int, int], dict] = {}

user_capacity = 2
user_poisson_lambda = 1
holding_cost = 1
missed_customer_cost = 10

for alpha in range(user_capacity + 1):
    for beta in range(user_capacity + 1 - alpha):
        # This is S_t, the current state
        state = (alpha, beta)
        # This is the initial inventory: the total bikes you have at 8 AM
        init_inv = alpha + beta
        base_reward = -alpha * holding_cost
        action = {}
        # Consider all possible actions (order quantities)
        for order in range(user_capacity - init_inv + 1):
            dict1 = {}
            for i in range(init_inv + 1):
                # if the inventory can meet the demand
                if i <= (init_inv - 1):
                    # probability that this specific demand occurs
                    transition_prob = poisson.pmf(i, user_poisson_lambda)
                    dict1[((init_inv - i, order), base_reward)] = transition_prob
                # if the inventory cannot meet the demand
                else:
                    transition_prob = 1 - poisson.cdf(init_inv - 1, user_poisson_lambda)
                    # probability of not meeting the demand
                    transition_prob2 = 1 - poisson.cdf(init_inv, user_poisson_lambda)
                    # total reward, including the expected missed-customer penalty
                    reward = base_reward - missed_customer_cost * ((user_poisson_lambda * transition_prob) -
                                                                   init_inv * transition_prob2)
                    dict1[((init_inv - i, order), reward)] = transition_prob
            action[order] = dict1
        MDP_dict[state] = action

MDP_dict
**MDP_dict** is a Python dictionary where each key represents the “current state,” and the value contains all possible actions at that state, each mapping the next state and immediate reward to the probability of that next state. Here is the explanation in the schematic:
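As a quick way to read this structure, you can print the entry for a single state, for example (0, 0) (no bikes in the store and none on the truck); the loop below is only an illustration of how to navigate the nested dictionary:
# Inspect one entry of MDP_dict: for each order quantity (action),
# we get {((next_alpha, next_beta), reward): probability}
for order_qty, transitions in MDP_dict[(0, 0)].items():
    print(f"Action: order {order_qty} bike(s)")
    for (next_state, reward), prob in transitions.items():
        print(f"  next state {next_state}, reward {reward:.2f}, probability {prob:.3f}")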
Dynamic Programming
Richard Bellman coined the term Dynamic Programming (DP) in the 1950s. Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems.
In reinforcement learning, DP generally refers to the family of algorithms that find an optimal policy for a Markov Decision Process, relying heavily on the Bellman equation.
In the context of this article, we use the term Dynamic Programming with the goal of finding the optimal policy for the Inventory Optimization problem.
There are generally two important DP algorithms:
- the Value Iteration Algorithm (Bellman, 1957)
- the Policy Iteration Algorithm (Howard, 1960)
In this article, we are going to focus on the Policy Iteration Algorithm, and we are going to implement it in Python.
Policy Iteration Algorithm for Inventory Optimization:
The Policy Iteration Algorithm is a method for finding the optimal policy for a given MDP. The algorithm is based on the following idea:
1) Start with an initial policy $\pi_0$.
2) Evaluate the policy $\pi_0$ by computing the state-value function $V^{\pi_0}$.
3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.
The algorithm iterates over these steps until the policy converges (that is, until it stops changing). We will go through all three stages in the following sections.
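In equations (standard policy-iteration notation, where $R_{\pi}$ and $P_{\pi}$ are the expected-reward vector and transition matrix of the Markov Reward Process implied by policy $\pi$, and $\gamma$ is the discount factor):
$$V^{\pi_k} = R_{\pi_k} + \gamma P_{\pi_k} V^{\pi_k} \quad\Longleftrightarrow\quad V^{\pi_k} = (I - \gamma P_{\pi_k})^{-1} R_{\pi_k}$$
$$\pi_{k+1}(s) = \arg\max_{a} \sum_{(s', r)} \mathcal{P}\big[(s', r) \mid s, a\big]\,\big(r + \gamma V^{\pi_k}(s')\big)$$
The first equation is exactly the linear system solved in the evaluation step below, and the second is the greedy improvement step.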
1) Start with the Initial Policy
The policy iteration algorithm needs an initial policy (ordering strategy), which can be any arbitrary policy. In this article, we are going to use the policy that we found in the second article, which is the following:
Initial policy:
$\pi_0 = C - (\alpha + \beta)$
The initial policy is that, in each state, we order $C - (\alpha + \beta)$ bikes: the difference between the capacity of the inventory $C$ and the sum of the bikes already in the store ($\alpha$) and those that will arrive tomorrow morning ($\beta$).
Here is the code for the initial policy:
user_capacity_val = 2

def policy_0_gen(user_capacity: int):
    # Generate an initial policy: always order up to capacity
    return {(alpha, beta): user_capacity - (alpha + beta)
            for alpha in range(user_capacity + 1)
            for beta in range(user_capacity + 1 - alpha)}

policy_0 = policy_0_gen(user_capacity_val)
policy_0
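For user_capacity_val = 2, this produces the following dictionary; every state simply orders up to capacity:
# policy_0 for a capacity of 2
{(0, 0): 2, (0, 1): 1, (0, 2): 0, (1, 0): 1, (1, 1): 0, (2, 0): 0}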
2) Evaluate the policy $\pi_0$ by computing the state-value function $V^{\pi_0}$
Note that any Markov Decision Process with a **fixed** policy leads to an **implied** Markov Reward Process. So, as in the previous article, once we have the Markov Reward Process, we can find the value function of each state.
In other words, because the policy fixes one action per state, the MDP's transition structure collapses to an MRP: for each state, we keep only the transitions (and rewards) of the action the policy selects.
The function below takes a **fixed policy** as input and returns the implied Markov Reward Process:
def MRP_using_fixedPolicy(full_MDP, policy):
    # Calculate the Markov Reward Process using a fixed policy
    MRP_policy = {}
    for state in full_MDP.keys():
        action = policy[state]
        MRP_policy[state] = full_MDP[state][action]
    return MRP_policy
As an example, we can pass the initial policy to this function, and it returns the implied Markov Reward Process:
MRP_p0 = MRP_using_fixedPolicy(MDP_dict, policy_0)
MRP_p0
With the Markov Reward Process in hand, it is straightforward to compute the expected immediate reward of each state and, from that, the state-value function of each state:
def calculate_expected_immediate_rewards(MRP_policy):
    # Calculate the expected immediate rewards from the MRP policy
    E_immediate_R = {}
    for from_state, value in MRP_policy.items():
        expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
        E_immediate_R[from_state] = expected_reward
    return E_immediate_R

R_ime_p0 = calculate_expected_immediate_rewards(MRP_p0)
R_ime_p0
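In equation form, what calculate_expected_immediate_rewards computes for each state $s$ is the probability-weighted immediate reward under the fixed policy:
$$R_{\pi}(s) = \sum_{(s', r)} \mathcal{P}_{\pi}\big[(s', r) \mid s\big]\; r$$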
import numpy as np
import pandas as pd

def create_transition_probability_matrix(MRP_policy):
    # Create the transition probability matrix
    states = list(MRP_policy.keys())
    num_states = len(states)
    trans_prob = np.zeros((num_states, num_states))
    df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
    for i, from_state in enumerate(states):
        for j, to_state in enumerate(states):
            for (new_state, reward) in MRP_policy.get(from_state, {}):
                if new_state == to_state:
                    probability = MRP_policy[from_state].get((new_state, reward), 0.0)
                    df_trans_prob.iloc[i, j] = probability
    return df_trans_prob
def calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma=0.9):
    # Calculate the state value function
    states = list(expected_immediate_rew.keys())
    R_exp = np.array(list(expected_immediate_rew.values()))
    val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)
    MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp,
                                'Value Function': val_func_vec},
                               index=states)
    return MarkRevData

trans_prob_p0 = create_transition_probability_matrix(MRP_p0)
state_val_p0 = calculate_state_value_function(trans_prob_mat=trans_prob_p0,
                                              expected_immediate_rew=R_ime_p0)
state_val_p0
3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$
The last step of the Policy Iteration Algorithm is to improve the policy by acting greedily with respect to $V^{\pi_0}$, which gives a new policy $\pi_1$.
The greedy step is based on the Bellman equation: for each state, we pick the action with the highest action-value, that is, the expected immediate reward plus the discounted value of the resulting next state.
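The quantity accumulated in the inner loop of the code below is the action-value (Q-value) of each state-action pair, evaluated with the current value estimate $V^{\pi_0}$; the new policy then picks the action with the largest Q-value in every state:
$$Q^{\pi_0}(s, a) = \sum_{(s', r)} \mathcal{P}\big[(s', r) \mid s, a\big]\,\big(r + \gamma V^{\pi_0}(s')\big), \qquad \pi_1(s) = \arg\max_{a} Q^{\pi_0}(s, a)$$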
Here is the code for the greedy operation:
def greedy_operation(MDP_full, state_val_policy, old_policy, gamma=0.9):
    # Perform the greedy operation to improve the policy
    new_policy = {}
    state_val_dict = state_val_policy.to_dict(orient="index")
    for state in old_policy.keys():
        max_q_value, best_action = float('-inf'), None
        for action in MDP_full[state].keys():
            q_value = 0
            for (next_state, immediate_reward), probability in MDP_full[state][action].items():
                q_value += probability * (immediate_reward + gamma *
                                          state_val_dict[next_state]["Value Function"])
            if q_value > max_q_value:
                max_q_value, best_action = q_value, action
        new_policy[state] = best_action
    return new_policy
new_policy = greedy_operation(MDP_full=MDP_dict,
                              state_val_policy=state_val_p0,
                              old_policy=policy_0)
The new policy after the first iteration of the policy iteration algorithm is the following:
new_policy
In the Policy Iteration algorithm, we keep iterating over the above three steps until the policy converges, in other words, until the policy stops changing. Here is the code for the Policy Iteration Algorithm:
def policy_iteration():
    # Perform policy iteration to find the optimal policy
    policy = policy_0_gen(user_capacity_val)
    while True:
        MRP_policy_p0 = MRP_using_fixedPolicy(MDP_dict, policy)
        expected_immediate_rew = calculate_expected_immediate_rewards(MRP_policy_p0)
        trans_prob_mat_val = create_transition_probability_matrix(MRP_policy_p0)
        value_function = calculate_state_value_function(trans_prob_mat=trans_prob_mat_val,
                                                        expected_immediate_rew=expected_immediate_rew,
                                                        gamma=0.9)
        new_policy = greedy_operation(MDP_full=MDP_dict,
                                      state_val_policy=value_function,
                                      old_policy=policy)
        if new_policy == policy:
            break
        policy = new_policy
    opt_policy = new_policy
    opt_value_func = value_function
    return opt_policy, opt_value_func
What is the Optimal Order?
Now we can run the policy iteration and look at the optimal order quantity for each state. Here is the code for it:
opt_policy, opt_value_func = policy_iteration()

for state, order_quantity in opt_policy.items():
    print(f"For state {state}, the optimal order quantity is: {order_quantity}")
Putting Everything Together:
Below, I put all the code together in one Python class with its methods.
import numpy as np
from scipy.stats import poisson
import pandas as pd

class MarkovDecisionProcess:

    def __init__(self, user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma):
        # Initialize the MDP with given parameters
        self.user_capacity = user_capacity
        self.poisson_lambda = poisson_lambda
        self.holding_cost, self.stockout_cost = holding_cost, stockout_cost
        self.gamma = gamma
        self.full_MDP = self.create_full_MDP()  # Create the full MDP

    def create_full_MDP(self):
        # Create the full MDP dictionary
        MDP_dict = {}
        for alpha in range(self.user_capacity + 1):
            for beta in range(self.user_capacity + 1 - alpha):
                state, init_inv = (alpha, beta), alpha + beta
                action = {}
                for order in range(self.user_capacity - init_inv + 1):
                    dict1 = {}
                    for i in range(init_inv + 1):
                        if i <= (init_inv - 1):
                            transition_prob = poisson.pmf(i, self.poisson_lambda)
                            dict1[((init_inv - i, order), -alpha * self.holding_cost)] = transition_prob
                        else:
                            transition_prob = 1 - poisson.cdf(init_inv - 1, self.poisson_lambda)
                            transition_prob2 = 1 - poisson.cdf(init_inv, self.poisson_lambda)
                            reward = -alpha * self.holding_cost - self.stockout_cost * (
                                (self.poisson_lambda * transition_prob) - init_inv * transition_prob2)
                            dict1[((0, order), reward)] = transition_prob
                    action[order] = dict1
                MDP_dict[state] = action
        return MDP_dict

    def policy_0_gen(self):
        # Generate an initial policy
        return {(alpha, beta): self.user_capacity - (alpha + beta)
                for alpha in range(self.user_capacity + 1)
                for beta in range(self.user_capacity + 1 - alpha)}

    def MRP_using_fixedPolicy(self, policy):
        # Create the MRP using a fixed policy
        return {state: self.full_MDP[state][action]
                for state, action in policy.items()}

    def calculate_state_value_function(self, MRP_policy):
        # Calculate the expected immediate rewards from the MRP policy
        E_immediate_R = {}
        for from_state, value in MRP_policy.items():
            expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
            E_immediate_R[from_state] = expected_reward
        # Create the transition probability matrix
        states = list(MRP_policy.keys())
        trans_prob = np.zeros((len(states), len(states)))
        df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
        for i, from_state in enumerate(states):
            for j, to_state in enumerate(states):
                for (new_state, reward) in MRP_policy.get(from_state, {}):
                    if new_state == to_state:
                        probability = MRP_policy[from_state].get((new_state, reward), 0.0)
                        df_trans_prob.iloc[i, j] = probability
        # Calculate the state value function
        R_exp = np.array(list(E_immediate_R.values()))
        val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - self.gamma * df_trans_prob, R_exp)
        MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp,
                                    'Value Function': val_func_vec}, index=states)
        return MarkRevData

    def greedy_operation(self, MDP_full, state_val_policy, old_policy):
        # Perform the greedy operation to improve the policy
        new_policy = {}
        state_val_dict = state_val_policy.to_dict(orient="index")
        for state in old_policy.keys():
            max_q_value, best_action = float('-inf'), None
            for action in MDP_full[state].keys():
                q_value = 0
                for (next_state, immediate_reward), probability in MDP_full[state][action].items():
                    q_value += probability * (immediate_reward + self.gamma *
                                              state_val_dict[next_state]["Value Function"])
                if q_value > max_q_value:
                    max_q_value, best_action = q_value, action
            new_policy[state] = best_action
        return new_policy

    def policy_iteration(self):
        # Perform policy iteration to find the optimal policy
        policy = self.policy_0_gen()
        while True:
            MRP_policy_p0 = self.MRP_using_fixedPolicy(policy)
            value_function = self.calculate_state_value_function(MRP_policy_p0)
            new_policy = self.greedy_operation(self.full_MDP, value_function, policy)
            if new_policy == policy:
                break
            policy = new_policy
        opt_policy, opt_value_func = new_policy, value_function
        return opt_policy, opt_value_func
Example of using the Markov Decision Process class
# Example usage:
user_capacity = 2
poisson_lambda = 1.0
holding_cost = 1
stockout_cost = 10
gamma = 0.9
MDP_Example = MarkovDecisionProcess(user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma)
opt_policy, opt_val = MDP_Example.policy_iteration()
# Print the optimal policy
print("Optimal Policy:")
for state, order_quantity in opt_policy.items():
    print(f"For state {state}, the optimal order quantity is: {order_quantity}")
# Print the optimal value function
print("\nOptimal Value Function:")
print(opt_val)
Summary:
- Inventory optimization is not a static optimization problem; it is a sequential decision-making problem. This means we need an “adaptive” policy, where the decision (the order quantity) depends on the state of the inventory.
- In this blog, we found the “optimal” ordering policy using a dynamic programming algorithm (policy iteration).
- The goal was to build a foundation on how to think about Inventory Optimization problems, how to frame them, and how to solve them using appropriate mathematical modeling.
References:
- [1] You can read this example in more depth in “Foundations of Reinforcement Learning with Applications in Finance”. However, I have rewritten the Python code in this blog to make it easier to understand.
- Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics.
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press.
Thanks for reading so far!
I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.
If you think this article helped you learn more about inventory optimization and Markov processes, please give it a clap 👏 and follow!