Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Last Updated on December 11, 2023 by Editorial Team

Author(s): Peyman Kor

Originally published on Towards AI.


How to master inventory decisions using dynamic programming for maximum profitability

Photo by Christin Hume on Unsplash

Intro

Inventory optimization is a broad problem that arises across many domains. The central question is all about figuring out:

“How many products should you order for your store each period in order to maximize profitability?”

Imagine you are a bike shop manager. Every day, you contact your supplier to order a number of bikes for your store. If you order too many bikes, you spend too much money holding and maintaining them in the store. On the other hand, if you order too few, you might not have enough bikes to meet customer demand. You need an “ordering strategy” that balances these two factors.

We will break down this problem and find the optimal “ordering strategy”. The “optimal ordering strategy” is based on three pillars:

  • the Markov Process,
  • the Markov Reward Process,
  • and the Markov Decision Process.

In the previous two articles (here and here), I covered how to frame the “inventory optimization” problem, the Markov Process, and the Markov Reward Process.

Framing the Inventory Optimization Problem

Inventory Optimization is a Sequential Decision-Making Problem. We can frame it by defining what the states are, what the uncertainties are, and what the recurring decisions are.

  1. State:

The state (the internal representation of the process) has two components (the full state space for a small capacity is enumerated in the short sketch after this list):

  • α: the number of bikes you already have in the store
  • β: the number of bikes you ordered the previous day that will arrive tomorrow morning (these bikes are on a truck)
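
Together, α and β fully describe the inventory situation, and since the store capacity caps α + β, the state space is small. As a minimal sketch (assuming a capacity of C = 2, the value used throughout this article), the feasible states can be enumerated as:

C = 2  # store capacity (assumed here, matching the example used later)
states = [(alpha, beta)
          for alpha in range(C + 1)
          for beta in range(C + 1 - alpha)]
print(states)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]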

2. Uncertainty:

In this example, the uncertainty is the random demand, modeled with a Poisson distribution (with Poisson parameter λ > 0). A demand of i bikes, for each i = 0, 1, 2, ⋯, occurs with probability:

$P(\text{demand} = i) = \frac{e^{-\lambda}\lambda^{i}}{i!}$
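
As a minimal sketch of this demand model (assuming λ = 1, the value used in the rest of the article), the individual demand probabilities and the tail probability can be computed with scipy:

from scipy.stats import poisson

lam = 1.0  # assumed Poisson demand rate, matching the article's example

# Probability of each demand level
for i in range(3):
    print(f"P(demand = {i}) = {poisson.pmf(i, lam):.3f}")

# Probability that demand exceeds 2 bikes (the tail relevant to stock-outs)
print(f"P(demand > 2) = {1 - poisson.cdf(2, lam):.3f}")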

3. Decisions:

In this example, the bike shop owner must decide every day at 6 PM how many bikes to order from the supplier.

Sequence of the 24-hour cycle

The sequence of events in a 24-hour cycle for this bike shop is as follows:

  1. At 6 PM: you observe the state S_t: (α, β)
  2. At 6 PM: you order the new bikes
  3. At 6 AM: you receive the bikes you ordered 36 hours ago
  4. At 8 AM: you open the store
  5. From 8 AM to 6 PM: you experience demand i during the day (modelled with the Poisson distribution described above)
  6. At 6 PM: you close the store

The diagram below visualizes this sequence:
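
To make the sequence concrete, consider one hypothetical day: at 6 PM you observe S_t = (α=1, β=0) and order 1 bike. No bikes arrive the next morning (β = 0), so you open with 1 bike; if the demand that day is i = 1, you sell it and close with 0 bikes in stock, while the bike you just ordered is now on the truck, so the next state is S_{t+1} = (0, 1).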

In this blog post, we summarize everything and connect the Markov Process and the Markov Reward Process with the Markov Decision Process to arrive at the optimal “ordering strategy”, all with code in Python.

Markov Decision Process:

Markov Decision Process (MDP) is a mathematical model whose key components are states, actions, transition probabilities, and rewards.

In the Markov Reward Process, the decisions were fixed, and we did not consider all possible actions for each state. In the Markov Decision Process (MDP), we need a data structure that, for each state, enumerates all possible actions, and that, given the state-action pair $(S_t, A_t)$, tells us the possible next state and reward $(S_{t+1}, R_{t+1})$ together with their probabilities.

Example of Markov Decision Process Data Structure

Here is an example of an MDP dictionary, where you can see that for each state, all possible actions are considered:


MarkovDecProcessDict = {"Current State A": {"Action 1": {("NextS1fromAact1", "Reward1"): "PNextS1fromAact1",
                                                         ("NextS2fromAact1", "Reward2"): "PNextS2fromAact1"},
                                            "Action 2": {("NextS1fromAact2", "Reward2"): "PNextS1fromAact2",
                                                         ("NextS2fromAact2", "Reward2"): "PNextS2fromAact2"}},

                        "Current State B": {"Action 1": {("NextS1fromBact1", "Reward1"): "PNextS1fromBact1",
                                                         ("NextS2fromBact1", "Reward2"): "PNextS2fromBact1"},
                                            "Action 2": {("NextS1fromBact2", "Reward2"): "PNextS1fromBact2",
                                                         ("NextS2fromBact2", "Reward2"): "PNextS2fromBact2"}}}

for current_state, actions in MarkovDecProcessDict.items():
    print(f"Current State: {current_state}")

    for action, transitions in actions.items():
        print(f"  Action: {action}")

        for (next_state, reward), probability in transitions.items():
            print(f"    ({next_state}, {reward}): {probability}")

Output of the Python code — Source: Author

Markov Decision Process for Inventory Optimization

Here, we build the dictionary where, for each state, we consider all possible actions, and for each action, all possible next states and rewards. This dictionary is named MDP_dict, and the code that builds it is below:

from typing import Dict, Tuple

# poisson is used to compute the pmf/cdf of the Poisson distribution
from scipy.stats import poisson

MDP_dict: Dict[Tuple[int, int], dict] = {}

user_capacity = 2
user_poisson_lambda = 1
holding_cost = 1
missed_customer_cost = 10

for alpha in range(user_capacity + 1):

    for beta in range(user_capacity + 1 - alpha):

        # This is S_t, the current state
        state = (alpha, beta)

        # This is the initial inventory, the total bikes you have at 8 AM
        init_inv = alpha + beta

        base_reward = -alpha * holding_cost

        action = {}
        # Consider all possible actions (order quantities)
        for order in range(user_capacity - init_inv + 1):

            dict1 = {}
            for i in range(init_inv + 1):

                # if the initial inventory can meet the demand
                if i <= (init_inv - 1):

                    # probability that this specific demand occurs
                    transition_prob = poisson.pmf(i, user_poisson_lambda)

                    dict1[((init_inv - i, order), base_reward)] = transition_prob

                # if the initial inventory cannot meet the demand
                else:

                    transition_prob = 1 - poisson.cdf(init_inv - 1, user_poisson_lambda)

                    # probability of not meeting the demand
                    transition_prob2 = 1 - poisson.cdf(init_inv, user_poisson_lambda)

                    # total reward, including the expected penalty for missed customers
                    reward = base_reward - missed_customer_cost * (
                        (user_poisson_lambda * transition_prob) - init_inv * transition_prob2)

                    dict1[((init_inv - i, order), reward)] = transition_prob

            action[order] = dict1

        MDP_dict[state] = action

MDP_dict

“MDP_dict” is a Python dictionary where each key is the “current state” and the value holds “all possible actions at that state,” each mapping to the next state, the immediate reward, and the probability of that next state. Here is the explanation in the schematic:

Schematic of the MDP_dict — Source: Author
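
As a quick sanity check (a hypothetical usage snippet, not part of the original listing), you can inspect a single entry of MDP_dict, for example the state (0, 0), where there are no bikes on hand and none in transit:

for order, transitions in MDP_dict[(0, 0)].items():
    print(f"Action: order {order} bike(s)")
    for (next_state, reward), prob in transitions.items():
        print(f"  -> next state {next_state}, reward {reward:.2f}, probability {prob:.3f}")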

Dynamic Programming

Richard Bellman coined the term Dynamic Programming (DP) in the 1950s. Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems.

In this context, DP refers to the general theory of the Markov Decision Process and to the algorithms for finding an optimal policy in an MDP, relying heavily on the Bellman Equation.

In the context of this article, we use the term Dynamic Programming with the goal of finding the optimal policy for the Inventory Optimization problem.

There are generally two important DP algorithms:

– Value Function Iteration Algorithm (Bellman, 1957)

– Policy Iteration Algorithm (Howard, 1960)

In this article, we are going to focus on the Policy Iteration Algorithm, and we are going to implement it in Python.

Policy Iteration Algorithm for Inventory Optimization:

The Policy Iteration Algorithm is a method for finding the optimal policy for a given MDP. The algorithm is based on the following idea:

– 1) Start with an initial policy $\pi_0$.

– 2) Evaluate the policy $\pi_0$ by computing the state-value function $V^{\pi_0}$.

– 3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The algorithm iterates over the above steps until the policy converges (the policy stops changing). We will go through all three stages in the following sections.

A simple diagram to explain the Policy Iteration Algorithm — Source: Author

1) Start with the Initial Policy

The policy iteration algorithm needs an initial policy (ordering strategy), which can be any arbitrary policy. In this article, we are going to use the policy that we found in the second article, which is the following:

Initial policy:

π_0 = C - (α + β)

The initial policy is that, at each state of the inventory, we order $C-(α + β)$ bikes: the difference between the capacity of the inventory and the sum of the items already in the inventory ($α$) and the ones that will arrive tomorrow morning ($β$).

Here is the code for the initial policy:

user_capacity_val = 2

def policy_0_gen(user_capacity: int):
    # Generate the initial "order up to capacity" policy
    return {(alpha, beta): user_capacity - (alpha + beta)
            for alpha in range(user_capacity + 1)
            for beta in range(user_capacity + 1 - alpha)}

policy_0 = policy_0_gen(user_capacity_val)
policy_0
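
For user_capacity_val = 2, this generator simply orders up to capacity, so policy_0 is {(0, 0): 2, (0, 1): 1, (0, 2): 0, (1, 0): 1, (1, 1): 0, (2, 0): 0}.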

2) Evaluate the policy π_0 by computing the state-value function V^{π_0}.

Note that any Markov Decision Process with a fixed policy leads to an implied Markov Reward Process. So, as in the previous article, once we have the Markov Reward Process, we can find each state's value function.

In other words, once the policy is fixed, evaluating it amounts to solving the Markov Reward Process value equation $V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}$, i.e., $V^{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$, which is exactly the linear system solved by calculate_state_value_function below.

The function below takes a fixed policy (as input) and returns the implied Markov Reward Process:

def MRP_using_fixedPolicy(full_MDP, policy):
    # Build the Markov Reward Process implied by a fixed policy
    MRP_policy = {}
    for state in full_MDP.keys():
        action = policy[state]
        MRP_policy[state] = full_MDP[state][action]
    return MRP_policy

As an example, we can pass the initial policy to this function, and it returns the implied Markov Reward Process:

MRP_p0=MRP_using_fixedPolicy(MDP_dict, policy_0)
MRP_p0
Output of the Python code — Source: Author

Having the Markov Reward Process, it is straightforward to find the expected immediate reward for each state and also the state-value function of each state:

def calculate_expected_immediate_rewards(MRP_policy):
    # Calculate the expected immediate reward of each state under the MRP policy
    E_immediate_R = {}
    for from_state, value in MRP_policy.items():
        # value.items() yields ((next_state, reward), probability) pairs
        expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
        E_immediate_R[from_state] = expected_reward
    return E_immediate_R

R_ime_p0 = calculate_expected_immediate_rewards(MRP_p0)
R_ime_p0

import numpy as np
import pandas as pd

def create_transition_probability_matrix(MRP_policy):
    # Create the state-to-state transition probability matrix
    states = list(MRP_policy.keys())
    num_states = len(states)
    trans_prob = np.zeros((num_states, num_states))
    df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
    for i, from_state in enumerate(states):
        for j, to_state in enumerate(states):
            for (new_state, reward) in MRP_policy.get(from_state, {}):
                if new_state == to_state:
                    probability = MRP_policy[from_state].get((new_state, reward), 0.0)
                    df_trans_prob.iloc[i, j] = probability

    return df_trans_prob

def calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma=0.9):
    # Calculate the state value function by solving (I - gamma * P) V = R
    states = list(expected_immediate_rew.keys())
    R_exp = np.array(list(expected_immediate_rew.values()))
    val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)
    MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec},
                               index=states)
    return MarkRevData

trans_prob_p0 = create_transition_probability_matrix(MRP_p0)

state_val_p0 = calculate_state_value_function(trans_prob_mat=trans_prob_p0,
                                              expected_immediate_rew=R_ime_p0)
state_val_p0
Output of the Python code — Source: Author

3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The last part of the Policy Iteration Algorithm is to improve the policy by acting greedily with respect to $V^{\pi_0}$, which yields a new policy $\pi_1$.

The greedy step is based on the Bellman Optimality Equation; in effect, it finds, for each state, the action with the highest action-value (Q-value).

Bellman Optimality Equation— Source: Author
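
In the notation used in this article, the greedy improvement step that the code below implements can be written as

$\pi_1(s) = \arg\max_{a} \sum_{(s', r)} P(s', r \mid s, a)\,[\, r + \gamma\, V^{\pi_0}(s') \,]$

that is, for every state s, pick the order quantity a whose expected immediate reward plus discounted value of the next state is largest.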

Here is the code for the greedy operation:

def greedy_operation(MDP_full, state_val_policy, old_policy, gamma=0.9):
    # Perform the greedy operation (policy improvement) to get a better policy
    new_policy = {}
    for state in old_policy.keys():
        max_q_value, best_action = float('-inf'), None
        state_val_dict = state_val_policy.to_dict(orient="index")
        for action in MDP_full[state].keys():
            q_value = 0
            for (next_state, immediate_reward), probability in MDP_full[state][action].items():
                q_value = q_value + probability * (immediate_reward + gamma *
                                                   (state_val_dict[next_state]["Value Function"]))
            if q_value > max_q_value:
                max_q_value, best_action = q_value, action
        new_policy[state] = best_action
    return new_policy

new_policy = greedy_operation(MDP_full=MDP_dict,
                              state_val_policy=state_val_p0,
                              old_policy=policy_0)

The new policy after the first iteration of the policy iteration algorithm is the following:

new_policy

In the Policy Iteration Algorithm, we keep iterating over the above three steps until the policy converges, in other words, until the policy stops changing. Here is the code for the Policy Iteration Algorithm:

def policy_iteration():
    # Perform policy iteration to find the optimal policy
    policy = policy_0_gen(user_capacity_val)
    while True:
        MRP_policy_p0 = MRP_using_fixedPolicy(MDP_dict, policy)
        expected_immediate_rew = calculate_expected_immediate_rewards(MRP_policy_p0)
        trans_prob_mat_val = create_transition_probability_matrix(MRP_policy_p0)
        value_function = calculate_state_value_function(trans_prob_mat=trans_prob_mat_val,
                                                        expected_immediate_rew=expected_immediate_rew,
                                                        gamma=0.9)
        new_policy = greedy_operation(MDP_full=MDP_dict,
                                      state_val_policy=value_function,
                                      old_policy=policy)
        if new_policy == policy:
            break
        policy = new_policy

    opt_policy = new_policy
    opt_value_func = value_function

    return opt_policy, opt_value_func

What is the Optimal Order?

Now, we can look at the outcome of policy iteration and see what the optimal order quantity is for each state. Here is the code for it:

# Run policy iteration to obtain the optimal policy and value function
opt_policy, opt_value_func = policy_iteration()

for state, order_quantity in opt_policy.items():
    print(f"For state {state}, the optimal order quantity is: {order_quantity}")
The optimal final policy — Source: Author

Putting Everything Together:

I am putting all the code together in one Python class with its methods.

import numpy as np
from scipy.stats import poisson
import pandas as pd

class MarkovDecisionProcess:
    def __init__(self, user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma):
        # Initialize the MDP with the given parameters
        self.user_capacity = user_capacity
        self.poisson_lambda = poisson_lambda
        self.holding_cost, self.stockout_cost = holding_cost, stockout_cost
        self.gamma = gamma
        self.full_MDP = self.create_full_MDP()  # Create the full MDP

    def create_full_MDP(self):
        # Create the full MDP dictionary
        MDP_dict = {}
        for alpha in range(self.user_capacity + 1):
            for beta in range(self.user_capacity + 1 - alpha):
                state, init_inv = (alpha, beta), alpha + beta
                action = {}
                for order in range(self.user_capacity - init_inv + 1):
                    dict1 = {}
                    for i in range(init_inv + 1):
                        if i <= (init_inv - 1):
                            transition_prob = poisson.pmf(i, self.poisson_lambda)
                            dict1[((init_inv - i, order), -alpha * self.holding_cost)] = transition_prob
                        else:
                            transition_prob = 1 - poisson.cdf(init_inv - 1, self.poisson_lambda)
                            transition_prob2 = 1 - poisson.cdf(init_inv, self.poisson_lambda)
                            reward = -alpha * self.holding_cost - self.stockout_cost * (
                                (self.poisson_lambda * transition_prob) - init_inv * transition_prob2)
                            dict1[((0, order), reward)] = transition_prob
                    action[order] = dict1
                MDP_dict[state] = action
        return MDP_dict

    def policy_0_gen(self):
        # Generate the initial "order up to capacity" policy
        return {(alpha, beta): self.user_capacity - (alpha + beta)
                for alpha in range(self.user_capacity + 1)
                for beta in range(self.user_capacity + 1 - alpha)}

    def MRP_using_fixedPolicy(self, policy):
        # Build the MRP implied by a fixed policy
        return {state: self.full_MDP[state][action]
                for state, action in policy.items()}

    def calculate_state_value_function(self, MRP_policy):
        # Calculate the expected immediate rewards from the MRP policy
        E_immediate_R = {}
        for from_state, value in MRP_policy.items():
            expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())
            E_immediate_R[from_state] = expected_reward

        # Create the transition probability matrix
        states = list(MRP_policy.keys())
        trans_prob = np.zeros((len(states), len(states)))
        df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)
        for i, from_state in enumerate(states):
            for j, to_state in enumerate(states):
                for (new_state, reward) in MRP_policy.get(from_state, {}):
                    if new_state == to_state:
                        probability = MRP_policy[from_state].get((new_state, reward), 0.0)
                        df_trans_prob.iloc[i, j] = probability

        # Calculate the state value function by solving (I - gamma * P) V = R
        R_exp = np.array(list(E_immediate_R.values()))
        val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - self.gamma * df_trans_prob, R_exp)
        MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states)
        return MarkRevData

    def greedy_operation(self, MDP_full, state_val_policy, old_policy):
        # Perform the greedy operation (policy improvement)
        new_policy = {}
        for state in old_policy.keys():
            max_q_value, best_action = float('-inf'), None
            state_val_dict = state_val_policy.to_dict(orient="index")
            for action in MDP_full[state].keys():
                q_value = 0
                for (next_state, immediate_reward), probability in MDP_full[state][action].items():
                    q_value = q_value + probability * (immediate_reward + self.gamma *
                                                       (state_val_dict[next_state]["Value Function"]))
                if q_value > max_q_value:
                    max_q_value, best_action = q_value, action
            new_policy[state] = best_action
        return new_policy

    def policy_iteration(self):
        # Perform policy iteration to find the optimal policy
        policy = self.policy_0_gen()
        while True:
            MRP_policy_p0 = self.MRP_using_fixedPolicy(policy)
            value_function = self.calculate_state_value_function(MRP_policy_p0)
            new_policy = self.greedy_operation(self.full_MDP, value_function, policy)
            if new_policy == policy:
                break
            policy = new_policy
        opt_policy, opt_value_func = new_policy, value_function
        return opt_policy, opt_value_func

Example of using the Markov Decision Process class

# Example usage:
user_capacity = 2
poisson_lambda = 1.0
holding_cost = 1
stockout_cost = 10
gamma = 0.9

MDP_Example = MarkovDecisionProcess(user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma)

opt_policy, opt_val = MDP_Example.policy_iteration()

# Print the optimal policy
print("Optimal Policy:")
for state, order_quantity in opt_policy.items():
    print(f"For state {state}, the optimal order quantity is: {order_quantity}")

# Print the optimal value function
print("\nOptimal Value Function:")
print(opt_val)
Output of the Python code — Source: Author

Summary:

  • Inventory optimization is not a static optimization problem; rather, it is a sequential decision-making problem. This means we need an “adaptive” policy, where the decision (the order quantity) depends on the state of the inventory.
  • In this blog, we found the “optimal” ordering policy using a Dynamic Programming algorithm (policy iteration).
  • The goal was to build a foundation on how to think about Inventory Optimization problems, how to frame them, and how to solve them using appropriate mathematical modeling.


Thanks for reading so far!

I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.

If you think this article helped you to learn more about inventory optimization and the Markov Process, please give it a 👏 and follow!


Published via Towards AI
