Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Last Updated on December 11, 2023 by Editorial Team

Author(s): Peyman Kor

Originally published on Towards AI.

Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

How to master inventory decisions using dynamic programming for maximum profitability

Intro

Inventory optimization is a broad problem that arises across many domains. The central question is all about figuring out:

“How many products to order for your store periodically, in order maximize the profitability?”

I think you are a bike shop manager. Every day, you need to contact your supplier to order a number of bikes for your store. If you order too many bikes daily, you will spend too much money on maintaining and keeping the bikes in the store. On the other hand, if you order fewer, you might not have enough bikes to meet customer demands. You need to have an “ordering strategy” that balances two factors.

We will break down this problem and find the optimal “ordering strategy”. The “optimal ordering strategy” is based on three pillars:

Markov Process,
Markov Rewards Process
and Markov Decision Processes.

Here and here, I covered how to frame the “inventory optimization” problem, the Markov Process, and the Markov Reward Process.

Framing the Inventory Optimization Problem

Inventory Optimization is a Sequential Decision-Making Problem. We can frame this problem by defining what the states are, uncertainties, and what the **recurring decisions** are.

State:

The state (the internal representation of the process) can be described by two components:

α : is the number of bikes you already have in store
β : is the number of bikes that you ordered the previous day and will arrive tomorrow morning (these bikes are on a truck)

2. Uncertainty:

In this example, uncertainty is random demand demand modeled following Poisson distributions (with Poisson parameter λ∈R>). A demand of i bikes for each i=0,1,2⋯ occurs with probability:

3. Decisions:

In this example, the bike shop owner must decide every day at 6PM, on how many bikes to order from the supplier.

Sequence of 24-hr cycle

The sequence of events in a 24-hour cycle for this bike shop is as follows:

At 6 PM: you observe the state S_t:(α,β)
At 6 PM: you order the new bikes
At 6 Am: You receive bikes that you ordered 36hr ago
At 8 AM: You open the store
From 8 Am to 6 PM you experience demand i during the day (modelled using Poisson distribution) more below.
At 6 PM you close the store

The diagram below visualizes this sequence:

In this blog, we are going to summarize everything and connect the Markov Process and Markov Reward Process with the Markov Decision Process to arrive at the optimal “ordering strategy”, all with codes in Python.

Markov Decision Process:

Markov Decision Process (MDP) is a mathematical model whose key components are states, actions, transition probabilities, and rewards.

In the Markov Reward Process, the decisions were fixed, and we were not considering all possible actions for each state. In Markov Decision Process (MDP), we need to build a data structure that for each state, we need to consider all possible actions, and following the $(S_t,A_t)$, we need to know what will be $(S_{t+1}, R_{t+1})$.

Example of Markov Decision Process Data Structure

Here, I am giving you an example of an MDP dictionary, where you can see for each state, all possible actions need to be considered:


MarkovDecProcessDict = {"Current State A":{"Action 1":{("NextS1fromAact1", "Reward1"): "PNextS1fromAact1"
 ,("NextS2fromAact1", "Reward2"): "PNextS2fromAact1"},
 "Action 2":{("NextS1fromAact2", "Reward2"): "PNextS1fromAact2"
 ,("NextS2fromAact2", "Reward2"): "PNextS2fromAact2"}},
 
 "Current State B":{"Action 1":{("NextS1fromBact1", "Reward1"): "PNextS1fromBact1"
 ,("NextS2fromBact1", "Reward2"): "PNextS2fromBact1"},
 "Action 2":{("NextS1fromBact2", "Reward2"): "PNextS1fromBact2"
 ,("NextS2fromBact2", "Reward2"): "PNextS2fromBact2"}}
}

for current_state, actions in MarkovDecProcessDict.items():
 print(f"Current State: {current_state}")
 
 for action, transitions in actions.items():
 print(f" Action: {action}")
 
 for (next_state, reward), probability in transitions.items():
 print(f" ({next_state},{reward}): {probability}")

Output of the Python code — source: Author

Markov Decision Process for Inventory Optimization

Here, we are building the dictionary, where for each state, we are considering all possible actions, and for each action, we are considering all possible states and rewards. This dictionary name is **MDP_dict**, and I wrote the code of it below:

from typing import Dict, Tuple

# poisson is used to find pdf of Poisson distribution 
from scipy.stats import poisson

MDP_dict: Dict[tuple, Dict[tuple, tuple]] = {}

user_capacity = 2
user_poisson_lambda = 1
holding_cost = 1
missedcostumer_cost = 10



for alpha in range(user_capacity+1): 
 
 for beta in range(user_capacity + 1 - alpha):
 
 # This is St, the current state
 state = (alpha, beta) 

 # This is initial inventory, total bike you have at 8AM 
 init_inv = alpha + beta 
 
 base_reward = -alpha* holding_cost
 
 action = {}
 # Consider all possible actions
 for order in range(user_capacity-init_inv +1):
 
 #action = {}
 dict1 = {}
 for i in range(init_inv +1):

 # if initial demand can meet the deman
 if i <= (init_inv-1):
 
 # probality of specifc demand can happen
 transition_prob = poisson.pmf(i,user_poisson_lambda)

 dict1[((init_inv - i, order), base_reward)] = transition_prob

 
 # if initial demand can not meet the demand
 else:
 
 transition_prob = 1- poisson.cdf(init_inv -1, user_poisson_lambda)
 
 # probability of not meeting the demands
 transition_prob2 = 1- poisson.cdf(init_inv, user_poisson_lambda)
 
 # total reward
 
 reward = base_reward - missedcostumer_cost*((user_poisson_lambda*transition_prob) - \
 init_inv*transition_prob2) 

 dict1[((init_inv - i, order),reward)] = transition_prob

 #if state in MDP_dict:

 action[order] = dict1

 MDP_dict[state]= action

MDP_dict
# Constants

“MDP_dict” is a Python dictionary where each key represents the “current state,” and the value is “all possible actions at that state,” corresponding to the next state, immediate reward, and probability of the next state. Here is the explanation in the schematic:

Schematic of the MDP_dict — Source: Author

Dynamic Programming

Richard Bellman (1950s) first coined the term called Dynamic Programming (DP). Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems.

DP generally refers to general theories of the Markov Decision Process and algorithms to find an optimal policy in MDP, relying heavily on the Bellman Equation.

In the context of this article, we use the term Dynamic Programming with the goal of finding the optimal policy for the Inventory Optimization problem.

There are generally two import DP algorithms:

– Value Function Iteration Algorithm (Bellman 1957)

– Policy Iteration Algorithm (Howard 1960)

In this article, we are going to focus on the Policy Iteration Algorithm, and we are going to implement it in Python.

Policy Iteration Algorithm for Inventory Optimization:

The Policy Iteration Algorithm is a method for finding the optimal policy for a given MDP. The algorithm is based on the following idea:

– 1) Start with an initial policy $π_0$.

– 2) Evaluate the policy $π_0$ by computing the state-value function $V^{π_0}$.

– 3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The algorithm iterates on the above steps until the policy converges (policy stops to change). We are going to go through all three stages in the following sections.

A simple diagram to explain the **Policy Iteration Algorithm** — Source: Author

1) Start with the Initial Policy

The policy iteration algorithm needs an initial policy (ordering strategy), which can be any arbitrary policy. In this article, we are going to use the policy that we found in the second article, which is the following:

Initial policy:

π_0=C-(α + β)

The initial policy is that at each stage of the inventory, we order the amount of $C-(α + β)$, which is the difference between the capacity of the inventory and the sum of initial items in the inventory ($α$) and the ones will come tomorrow morning ($β$) .

Here is the code for the initial policy:

user_capacity_val = 2

def policy_0_gen(user_capacity: int):
 
 # Generate an initial policy

 return {(alpha, beta): user_capacity - (alpha + beta) 
 for alpha in range(user_capacity + 1) 
 for beta in range(user_capacity + 1 - alpha)}

policy_0 = policy_0_gen(user_capacity_val)
policy_0

2) Evaluate the policy π_0 by computing the state-value V^{π_0}.

Note that any Markov Decision Process with a fixed policy leads to an **implied** Markov Reward Process. So, like the previous article, if we have the Markov Reward Process, we can find each state's value function.

In other words:

The below function will get a **fixed policy** (as input), and then it will return an implied Markov Reward Process:

def MRP_using_fixedPolicy(full_MDP, policy):
 # Calculate the Markov Reward Process using a fixed policy
 MRP_policy = {}
 for state in full_MDP.keys():
 action = policy[state]
 MRP_policy[state] = full_MDP[state][action]
 return MRP_policy

As an example, we can give the initial policy to the below function, and it will return the implied Markov Reward Process:

MRP_p0=MRP_using_fixedPolicy(MDP_dict, policy_0)
MRP_p0

Having the Markov Reward Process, it is pretty easy to find Immediate Rewards for each state and also the state value function for each state:

def calculate_expected_immediate_rewards(MRP_policy):
 # Calculate the expected immediate rewards from the MRP policy
 E_immediate_R = {}
 for from_state, value in MRP_policy.items():
 expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
 E_immediate_R[from_state] = expected_reward
 return E_immediate_R
R_ime_p0 = calculate_expected_immediate_rewards(MRP_p0)
R_ime_p0

import numpy as np
import pandas as pd
def create_transition_probability_matrix(MRP_policy):
 # Create the transition probability matrix
 states = list(MRP_policy.keys())
 num_states = len(states)
 trans_prob = np.zeros((num_states, num_states))
 df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
 for i, from_state in enumerate(states):
 for j, to_state in enumerate(states):
 for (new_state, reward) in MRP_policy.get(from_state, {}):
 if new_state == to_state:
 probability = MRP_policy[from_state].get((new_state, reward), 0.0)
 df_trans_prob.iloc[i, j] = probability
 
 return df_trans_prob

def calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma=0.9):
 # Calculate the state value function
 states = list(expected_immediate_rew.keys())
 R_exp = np.array(list(expected_immediate_rew.values()))
 val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)
 MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, 
 index=states)
 return MarkRevData

trans_prob_p0 = create_transition_probability_matrix(MRP_p0)

state_val_p0 = calculate_state_value_function(trans_prob_mat=trans_prob_p0,
 expected_immediate_rew=R_ime_p0)
state_val_p0

3) Improve the policy by acting greedily with respect to $V^{π_0}$ to get a new policy $π_1$.

The last part of the Policy Iteration Algorithm is to improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The greedy equation is based on the Bellman Equation, and in fact, it is about finding the action with the highest **State-Value** function for each state.

Bellman Optimality Equation— Source: Author

Here is the code for the greedy operation:

def greedy_operation(MDP_full, state_val_policy, old_policy, gamma=0.9):
 # Perform the greedy operation to improve the policy
 new_policy = {}
 for state in old_policy.keys():
 max_q_value, best_action = float('-inf'), None
 state_val_dict = state_val_policy.to_dict(orient="index")
 for action in MDP_full[state].keys():
 q_value = 0
 for (next_state, immediate_reward), probability in MDP_full[state][action].items():
 q_value = q_value + probability * (immediate_reward + gamma *
 (state_val_dict[next_state]["Value Function"]))
 if q_value > max_q_value:
 max_q_value, best_action = q_value, action
 new_policy[state] = best_action
 return new_policy

new_policy = greedy_operation(MDP_full=MDP_dict, 
 state_val_policy=state_val_p0, 
 old_policy=policy_0)

The new policy after the first iteration of the policy iteration algorithm is the following:

new_policy

In the Policy Iteration algorithm, we keep iteration on the above three steps until the policy converges. In other words, we keep iterating on the above three steps until the policy stops to change. Here is the code for the Policy Iteration Algorithm:

def policy_iteration():
 # Perform policy iteration to find the optimal policy
 policy = policy_0_gen(user_capacity_val)
 while True:
 MRP_policy_p0 = MRP_using_fixedPolicy(MDP_dict, policy)
 expected_immediate_rew = calculate_expected_immediate_rewards(MRP_policy_p0)
 trans_prob_mat_val = create_transition_probability_matrix(MRP_policy_p0)
 value_function = calculate_state_value_function(trans_prob_mat=trans_prob_mat_val,
 expected_immediate_rew=expected_immediate_rew,
 gamma=0.9)
 new_policy = greedy_operation(MDP_full=MDP_dict, 
 state_val_policy=value_function, 
 old_policy=policy)
 if new_policy == policy:
 break
 policy = new_policy
 
 opt_policy = new_policy
 opt_value_func = value_function
 
 return opt_policy, opt_value_func

What is the Optimal Order?

Now, we can look at the outcome of policy iteration and see what the optimal order is for each state. Here is the code for it:

for state, order_quantity in opt_policy.items():
 print(f"For state {state}, the optimal order quantity is: {order_quantity}")

The optimal final policy — Source: Author

Putting Everything Together:

I am putting all the codes together in one Python class with its methods.

import numpy as np
from scipy.stats import poisson
import pandas as pd

class MarkovDecisionProcess:
 def __init__(self, user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma):
 # Initialize the MDP with given parameters
 self.user_capacity = user_capacity
 self.poisson_lambda = poisson_lambda
 self.holding_cost, self.stockout_cost = holding_cost, stockout_cost
 self.gamma = gamma
 self.full_MDP = self.create_full_MDP() # Create the full MDP

 def create_full_MDP(self):
 # Create the full MDP dictionary
 MDP_dict = {}
 for alpha in range(self.user_capacity + 1):
 for beta in range(self.user_capacity + 1 - alpha):
 state, init_inv = (alpha, beta), alpha + beta 
 action = {}
 for order in range(self.user_capacity - init_inv + 1):
 dict1 = {}
 for i in range(init_inv + 1):
 if i <= (init_inv - 1):
 transition_prob = poisson.pmf(i, self.poisson_lambda)
 dict1[((init_inv - i, order), -alpha * self.holding_cost)] = transition_prob
 else:
 transition_prob = 1 - poisson.cdf(init_inv - 1, self.poisson_lambda)
 transition_prob2 = 1 - poisson.cdf(init_inv, self.poisson_lambda)
 reward = -alpha * self.holding_cost - self.stockout_cost * (
 (self.poisson_lambda * transition_prob) - init_inv * transition_prob2)
 dict1[((0, order), reward)] = transition_prob
 action[order] = dict1
 MDP_dict[state] = action
 return MDP_dict

 def policy_0_gen(self):
 # Generate an initial policy
 return {(alpha, beta): self.user_capacity - (alpha + beta) 
 for alpha in range(self.user_capacity + 1) 
 for beta in range(self.user_capacity + 1 - alpha)}

 def MRP_using_fixedPolicy(self, policy):
 # Create the MRP using a fixed policy
 return {state: self.full_MDP[state][action] 
 for state, action in policy.items()}
 
 def calculate_state_value_function(self, MRP_policy):
 # Calculate the expected immediate rewards from the MRP policy
 E_immediate_R = {}
 for from_state, value in MRP_policy.items():
 expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
 E_immediate_R[from_state] = expected_reward

 # Create the transition probability matrix
 states = list(MRP_policy.keys())
 trans_prob = np.zeros((len(states), len(states)))
 df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
 for i, from_state in enumerate(states):
 for j, to_state in enumerate(states):
 for (new_state, reward) in MRP_policy.get(from_state, {}):
 if new_state == to_state:
 probability = MRP_policy[from_state].get((new_state, reward), 0.0)
 df_trans_prob.iloc[i, j] = probability

 # Calculate the state value function
 R_exp = np.array(list(E_immediate_R.values()))
 val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - self.gamma * df_trans_prob, R_exp)
 MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states)
 return MarkRevData

 def greedy_operation(self, MDP_full, state_val_policy, old_policy):
 # Perform the greedy operation to improve the policy
 new_policy = {}
 for state in old_policy.keys():
 max_q_value, best_action = float('-inf'), None
 state_val_dict = state_val_policy.to_dict(orient="index")
 for action in MDP_full[state].keys():
 q_value = 0
 for (next_state, immediate_reward), probability in MDP_full[state][action].items():
 q_value = q_value + probability * (immediate_reward + self.gamma *
 (state_val_dict[next_state]["Value Function"]))
 if q_value > max_q_value:
 max_q_value, best_action = q_value, action
 new_policy[state] = best_action
 return new_policy

 def policy_iteration(self):
 # Perform policy iteration to find the optimal policy
 policy = self.policy_0_gen()
 while True:
 MRP_policy_p0 = self.MRP_using_fixedPolicy(policy)
 value_function = self.calculate_state_value_function(MRP_policy_p0)
 new_policy = self.greedy_operation(self.full_MDP, value_function, policy)
 if new_policy == policy:
 break
 policy = new_policy
 opt_policy, opt_value_func = new_policy, value_function
 return opt_policy, opt_value_func

Example of using the Markov Decision Process class

# Example usage:
user_capacity = 2
poisson_lambda = 1.0
holding_cost = 1
stockout_cost = 10
gamma = 0.9

MDP_Example = MarkovDecisionProcess(user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma)

opt_policy, opt_val = MDP_Example.policy_iteration()

# Print the optimal policy
print("Optimal Policy:")
for state, order_quantity in opt_policy.items():
 print(f"For state {state}, the optimal order quantity is: {order_quantity}")

# Print the optimal value function
print("\nOptimal Value Function:")
print(opt_val)

Summary:

Inventory optimization is not a static optimization problem; rather, it is a sequential decision-making problem. It means we need an “adaptive” policy where a decision (number of orders) is dependent on the state of the inventory.
In this blog, we found the “optimal” ordering policy, based on Dynamic Programming Algorithm.
The goal was to build a foundation on how to think about Inventory Optimization problems, how to frame them, and how to solve them using appropriate mathematical modeling.

References:

[1] You can read this example more in-depth in “Foundation of Reinforcement Learning with Application in Finance”. However, I have rewritten Python codes in this blog to make it easier to understand.
Bellman, Richard., A Markovian Decision Process (1957), Journal of Mathematics and Mechanics.
Howard, R. A. 1960. Dynamic Programming and Markov Processes, (1960.), Cambridge, MA: MIT Press.

Thanks for reading so far!

I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.

If you think this article helped you to learn more about inventory optimization and Markov Process, please give it a U+1F44F and follow!

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Author(s): Peyman Kor

Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

How to master inventory decisions using dynamic programming for maximum profitability

Intro

Framing the Inventory Optimization Problem

Sequence of 24-hr cycle

Markov Decision Process:

Markov Decision Process for Inventory Optimization

Dynamic Programming

Policy Iteration Algorithm for Inventory Optimization:

1) Start with the Initial Policy

2) Evaluate the policy π_0 by computing the state-value V^{π_0}.

3) Improve the policy by acting greedily with respect to $V^{π_0}$ to get a new policy $π_1$.

What is the Optimal Order?

Putting Everything Together:

Example of using the Markov Decision Process class

Summary:

References:

Thanks for reading so far!

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

Boost your Local Business with Data Analysis and Web Scraping

The 2025 AI Revolution: 10 Breakthroughs That Will Change Your Life

AI in Medical Imaging: A Life-Saving Revolution or Ethical Minefield?

Llm Fine Tuning Guide: Do You Need It and How to Do It

10 Comprehensive Strategies for Ensuring Ethical Artificial Intelligence

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Author(s): Peyman Kor

Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

How to master inventory decisions using dynamic programming for maximum profitability

Intro

Framing the Inventory Optimization Problem

Sequence of 24-hr cycle

Markov Decision Process:

Markov Decision Process for Inventory Optimization

Dynamic Programming

Policy Iteration Algorithm for Inventory Optimization:

1) Start with the Initial Policy

2) Evaluate the policy π_0 by computing the state-value V^{π_0}.

3) Improve the policy by acting greedily with respect to $V^{π_0}$ to get a new policy $π_1$.

What is the Optimal Order?

Putting Everything Together:

Example of using the Markov Decision Process class

Summary:

References:

Thanks for reading so far!

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement