# Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Last Updated on December 11, 2023 by Editorial Team

**Author(s): Peyman Kor**

Originally published on Towards AI.

**Inventory Optimization with Dynamic Programming **in Less than 100 Lines of Python Code

**How to master inventory decisions using dynamic programming for maximum profitability**

## Intro

Inventory optimization is a broad problem that arises across many domains. The central question is all about figuring out:

“How many products to order for your store periodically, in order maximize the profitability?”

I think you are a bike shop manager. Every day, you need to contact your supplier to order a number of bikes for your store. If you order too many bikes daily, you will spend** too much money on maintaining and keeping the bikes in the store**. On the other hand, **if you order fewer, you might not have enough bikes to meet customer demands**. You need to have an “ordering strategy” that balances two factors.

We will break down this problem and find the optimal “ordering strategy”. The “optimal ordering strategy” is based on three pillars:

**Markov Process**,**Markov****Rewards Process**- and
**Markov Decision Processes**.

Here and here, I covered how to frame the “inventory optimization” problem, the Markov Process, and the Markov Reward Process.

## Framing the Inventory Optimization Problem

Inventory Optimization is a Sequential Decision-Making Problem. We can frame this problem by defining what the states are, uncertainties, and what the ****recurring decisions**** are.

**State:**

The state (**the internal representation of the process**) can be described by two components:

**α**: is the number of bikes you already have in store**β**: is the number of bikes that you ordered the previous day and will arrive tomorrow morning (these bikes are on a truck)

**2. Uncertainty:**

In this example, uncertainty is random demand demand modeled following Poisson distributions (with Poisson parameter λ∈R>). A demand of i bikes for each i=0,1,2⋯ occurs with probability:

**3. Decisions:**

In this example, the bike shop owner must decide every day at 6PM, on how many bikes to order from the supplier.

**Sequence of 24-hr cycle**

The sequence of events in a 24-hour cycle for this bike shop is as follows:

**At 6 PM**: you observe the state S_t:(α,β)**At 6 PM**: you order the new bikes**At 6 Am**: You receive bikes that you ordered 36hr ago**At 8 AM**: You open the store**From 8 Am to 6 PM**you experience demand i during the day (modelled using Poisson distribution) more below.**At 6 PM**you close the store

The diagram below visualizes this sequence:

In this blog, we are going to summarize everything and connect the Markov Process and Markov Reward Process with the Markov Decision Process to arrive at the optimal “ordering strategy”, all with codes in Python.

**Markov Decision Process:**

Markov Decision Process (MDP) is a mathematical model whose key components are states, actions, transition probabilities, and rewards.

In the Markov Reward Process, *the decisions were fixed*, and we were not considering all possible actions for each state. In Markov Decision Process (MDP), we need to build a data structure that for each state, we need to consider all possible actions, and following the $(S_t,A_t)$, we need to know what will be $(S_{t+1}, R_{t+1})$.

**Example of Markov Decision Process Data Structure**

Here, I am giving you an example of an MDP dictionary, where you can see for each state, all possible actions need to be considered:

MarkovDecProcessDict = {"Current State A":{"Action 1":{("NextS1fromAact1", "Reward1"): "PNextS1fromAact1"

,("NextS2fromAact1", "Reward2"): "PNextS2fromAact1"},

"Action 2":{("NextS1fromAact2", "Reward2"): "PNextS1fromAact2"

,("NextS2fromAact2", "Reward2"): "PNextS2fromAact2"}},

"Current State B":{"Action 1":{("NextS1fromBact1", "Reward1"): "PNextS1fromBact1"

,("NextS2fromBact1", "Reward2"): "PNextS2fromBact1"},

"Action 2":{("NextS1fromBact2", "Reward2"): "PNextS1fromBact2"

,("NextS2fromBact2", "Reward2"): "PNextS2fromBact2"}}

}

for current_state, actions in MarkovDecProcessDict.items():

print(f"Current State: {current_state}")

for action, transitions in actions.items():

print(f" Action: {action}")

for (next_state, reward), probability in transitions.items():

print(f" ({next_state},{reward}): {probability}")

**Markov Decision Process for Inventory Optimization**

Here, we are building the dictionary, where for each state, we are considering all possible actions, and for each action, we are considering all possible states and rewards. This dictionary name is ****MDP_dict**,** and I wrote the code of it below:

`from typing import Dict, Tuple`

# poisson is used to find pdf of Poisson distribution

from scipy.stats import poisson

MDP_dict: Dict[tuple, Dict[tuple, tuple]] = {}

user_capacity = 2

user_poisson_lambda = 1

holding_cost = 1

missedcostumer_cost = 10

for alpha in range(user_capacity+1):

for beta in range(user_capacity + 1 - alpha):

# This is St, the current state

state = (alpha, beta)

# This is initial inventory, total bike you have at 8AM

init_inv = alpha + beta

base_reward = -alpha* holding_cost

action = {}

# Consider all possible actions

for order in range(user_capacity-init_inv +1):

#action = {}

dict1 = {}

for i in range(init_inv +1):

# if initial demand can meet the deman

if i <= (init_inv-1):

# probality of specifc demand can happen

transition_prob = poisson.pmf(i,user_poisson_lambda)

dict1[((init_inv - i, order), base_reward)] = transition_prob

# if initial demand can not meet the demand

else:

transition_prob = 1- poisson.cdf(init_inv -1, user_poisson_lambda)

# probability of not meeting the demands

transition_prob2 = 1- poisson.cdf(init_inv, user_poisson_lambda)

# total reward

reward = base_reward - missedcostumer_cost*((user_poisson_lambda*transition_prob) - \

init_inv*transition_prob2)

dict1[((init_inv - i, order),reward)] = transition_prob

#if state in MDP_dict:

action[order] = dict1

MDP_dict[state]= action

MDP_dict

# Constants

“MDP_dict” is a Python dictionary where each key represents the “current state,” and the value is “all possible actions at that state,” corresponding to the next state, immediate reward, and probability of the next state. Here is the explanation in the schematic:

**Dynamic Programming**

Richard Bellman (1950s) first coined the term called **Dynamic Programming (DP)**. Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems.

DP generally refers to general theories of the Markov Decision Process and **algorithms** to find an optimal policy in MDP, relying heavily on the Bellman Equation.

In the context of this article, we use the term Dynamic Programming with the goal of finding the optimal policy for the Inventory Optimization problem.

There are generally two import DP algorithms:

– **Value Function Iteration Algorithm** (**Bellman 1957**)

**– Policy Iteration Algorithm (Howard 1960)**

In this article, we are going to focus on the Policy Iteration Algorithm, and we are going to implement it in Python.

## Policy Iteration Algorithm for Inventory Optimization:

The **Policy Iteration Algorithm** is a method for finding the optimal policy for a given MDP. The algorithm is based on the following idea:

**– 1)** Start with an initial policy $π_0$.

**– 2)** Evaluate the policy $π_0$ by computing the state-value function $V^{π_0}$.

**– 3) **Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The algorithm iterates on the above steps until the policy converges (policy stops to change). We are going to go through all three stages in the following sections.

**1) Start with the Initial Policy**

The policy iteration algorithm needs an initial policy (ordering strategy), which can be any arbitrary policy. In this article, we are going to use the policy that we found in the second article, which is the following:

Initial policy:

**π_0=C-(α + β)**

The initial policy is that at each stage of the inventory, we order the amount of $C-(α + β)$, which is the difference between the capacity of the inventory and the sum of initial items in the inventory ($α$) and the ones will come tomorrow morning ($β$) .

Here is the code for the initial policy:

`user_capacity_val = 2`

def policy_0_gen(user_capacity: int):

# Generate an initial policy

return {(alpha, beta): user_capacity - (alpha + beta)

for alpha in range(user_capacity + 1)

for beta in range(user_capacity + 1 - alpha)}

policy_0 = policy_0_gen(user_capacity_val)

policy_0

**2) Evaluate the policy π_0 by computing the state-value V^{π_0}.**

Note that any Markov Decision Process with a fixed policy leads to an ****implied**** Markov Reward Process. So, like the previous article, if we have the Markov Reward Process, we can find each state's value function.

In other words:

The below function will get a ****fixed policy** **(as input), and then it will return an implied Markov Reward Process:

`def MRP_using_fixedPolicy(full_MDP, policy):`

# Calculate the Markov Reward Process using a fixed policy

MRP_policy = {}

for state in full_MDP.keys():

action = policy[state]

MRP_policy[state] = full_MDP[state][action]

return MRP_policy

As an example, we can give the initial policy to the below function, and it will return the implied Markov Reward Process:

`MRP_p0=MRP_using_fixedPolicy(MDP_dict, policy_0)`

MRP_p0

Having the Markov Reward Process, it is *pretty easy to find Immediate Rewards for each state* and also the state value function for each state:

`def calculate_expected_immediate_rewards(MRP_policy):`

# Calculate the expected immediate rewards from the MRP policy

E_immediate_R = {}

for from_state, value in MRP_policy.items():

expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())

E_immediate_R[from_state] = expected_reward

return E_immediate_R

R_ime_p0 = calculate_expected_immediate_rewards(MRP_p0)

R_ime_p0

`import numpy as np`

import pandas as pd

def create_transition_probability_matrix(MRP_policy):

# Create the transition probability matrix

states = list(MRP_policy.keys())

num_states = len(states)

trans_prob = np.zeros((num_states, num_states))

df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)

for i, from_state in enumerate(states):

for j, to_state in enumerate(states):

for (new_state, reward) in MRP_policy.get(from_state, {}):

if new_state == to_state:

probability = MRP_policy[from_state].get((new_state, reward), 0.0)

df_trans_prob.iloc[i, j] = probability

return df_trans_prob

def calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma=0.9):

# Calculate the state value function

states = list(expected_immediate_rew.keys())

R_exp = np.array(list(expected_immediate_rew.values()))

val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp)

MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec},

index=states)

return MarkRevData

trans_prob_p0 = create_transition_probability_matrix(MRP_p0)

state_val_p0 = calculate_state_value_function(trans_prob_mat=trans_prob_p0,

expected_immediate_rew=R_ime_p0)

state_val_p0

**3) Improve the policy by acting greedily with respect to $V^{π_0}$ to get a new policy $π_1$.**

The last part of the **Policy Iteration Algorithm** is to improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The greedy equation is based on the Bellman Equation, and in fact, it is about finding the action with the highest ****State-Value**** function for each state.

Here is the code for the greedy operation:

`def greedy_operation(MDP_full, state_val_policy, old_policy, gamma=0.9):`

# Perform the greedy operation to improve the policy

new_policy = {}

for state in old_policy.keys():

max_q_value, best_action = float('-inf'), None

state_val_dict = state_val_policy.to_dict(orient="index")

for action in MDP_full[state].keys():

q_value = 0

for (next_state, immediate_reward), probability in MDP_full[state][action].items():

q_value = q_value + probability * (immediate_reward + gamma *

(state_val_dict[next_state]["Value Function"]))

if q_value > max_q_value:

max_q_value, best_action = q_value, action

new_policy[state] = best_action

return new_policy

new_policy = greedy_operation(MDP_full=MDP_dict,

state_val_policy=state_val_p0,

old_policy=policy_0)

The new policy after the first iteration of the policy iteration algorithm is the following:

`new_policy`

In the Policy Iteration algorithm, we keep iteration on the above three steps until the policy converges. In other words, we keep iterating on the above three steps until the policy stops to change. Here is the code for the Policy Iteration Algorithm:

`def policy_iteration():`

# Perform policy iteration to find the optimal policy

policy = policy_0_gen(user_capacity_val)

while True:

MRP_policy_p0 = MRP_using_fixedPolicy(MDP_dict, policy)

expected_immediate_rew = calculate_expected_immediate_rewards(MRP_policy_p0)

trans_prob_mat_val = create_transition_probability_matrix(MRP_policy_p0)

value_function = calculate_state_value_function(trans_prob_mat=trans_prob_mat_val,

expected_immediate_rew=expected_immediate_rew,

gamma=0.9)

new_policy = greedy_operation(MDP_full=MDP_dict,

state_val_policy=value_function,

old_policy=policy)

if new_policy == policy:

break

policy = new_policy

opt_policy = new_policy

opt_value_func = value_function

return opt_policy, opt_value_func

**What is the Optimal Order?**

Now, we can look at the outcome of policy iteration and see what the optimal order is for each state. Here is the code for it:

`for state, order_quantity in opt_policy.items():`

print(f"For state {state}, the optimal order quantity is: {order_quantity}")

**Putting Everything Together:**

I am putting all the codes together in one Python class with its methods.

`import numpy as np`

from scipy.stats import poisson

import pandas as pd

class MarkovDecisionProcess:

def __init__(self, user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma):

# Initialize the MDP with given parameters

self.user_capacity = user_capacity

self.poisson_lambda = poisson_lambda

self.holding_cost, self.stockout_cost = holding_cost, stockout_cost

self.gamma = gamma

self.full_MDP = self.create_full_MDP() # Create the full MDP

def create_full_MDP(self):

# Create the full MDP dictionary

MDP_dict = {}

for alpha in range(self.user_capacity + 1):

for beta in range(self.user_capacity + 1 - alpha):

state, init_inv = (alpha, beta), alpha + beta

action = {}

for order in range(self.user_capacity - init_inv + 1):

dict1 = {}

for i in range(init_inv + 1):

if i <= (init_inv - 1):

transition_prob = poisson.pmf(i, self.poisson_lambda)

dict1[((init_inv - i, order), -alpha * self.holding_cost)] = transition_prob

else:

transition_prob = 1 - poisson.cdf(init_inv - 1, self.poisson_lambda)

transition_prob2 = 1 - poisson.cdf(init_inv, self.poisson_lambda)

reward = -alpha * self.holding_cost - self.stockout_cost * (

(self.poisson_lambda * transition_prob) - init_inv * transition_prob2)

dict1[((0, order), reward)] = transition_prob

action[order] = dict1

MDP_dict[state] = action

return MDP_dict

def policy_0_gen(self):

# Generate an initial policy

return {(alpha, beta): self.user_capacity - (alpha + beta)

for alpha in range(self.user_capacity + 1)

for beta in range(self.user_capacity + 1 - alpha)}

def MRP_using_fixedPolicy(self, policy):

# Create the MRP using a fixed policy

return {state: self.full_MDP[state][action]

for state, action in policy.items()}

def calculate_state_value_function(self, MRP_policy):

# Calculate the expected immediate rewards from the MRP policy

E_immediate_R = {}

for from_state, value in MRP_policy.items():

expected_reward = sum(reward[1] * prob for (reward, prob) in value.items())

E_immediate_R[from_state] = expected_reward

# Create the transition probability matrix

states = list(MRP_policy.keys())

trans_prob = np.zeros((len(states), len(states)))

df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states)

for i, from_state in enumerate(states):

for j, to_state in enumerate(states):

for (new_state, reward) in MRP_policy.get(from_state, {}):

if new_state == to_state:

probability = MRP_policy[from_state].get((new_state, reward), 0.0)

df_trans_prob.iloc[i, j] = probability

# Calculate the state value function

R_exp = np.array(list(E_immediate_R.values()))

val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - self.gamma * df_trans_prob, R_exp)

MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states)

return MarkRevData

def greedy_operation(self, MDP_full, state_val_policy, old_policy):

# Perform the greedy operation to improve the policy

new_policy = {}

for state in old_policy.keys():

max_q_value, best_action = float('-inf'), None

state_val_dict = state_val_policy.to_dict(orient="index")

for action in MDP_full[state].keys():

q_value = 0

for (next_state, immediate_reward), probability in MDP_full[state][action].items():

q_value = q_value + probability * (immediate_reward + self.gamma *

(state_val_dict[next_state]["Value Function"]))

if q_value > max_q_value:

max_q_value, best_action = q_value, action

new_policy[state] = best_action

return new_policy

def policy_iteration(self):

# Perform policy iteration to find the optimal policy

policy = self.policy_0_gen()

while True:

MRP_policy_p0 = self.MRP_using_fixedPolicy(policy)

value_function = self.calculate_state_value_function(MRP_policy_p0)

new_policy = self.greedy_operation(self.full_MDP, value_function, policy)

if new_policy == policy:

break

policy = new_policy

opt_policy, opt_value_func = new_policy, value_function

return opt_policy, opt_value_func

## Example of using the Markov Decision Process class

`# Example usage:`

user_capacity = 2

poisson_lambda = 1.0

holding_cost = 1

stockout_cost = 10

gamma = 0.9

MDP_Example = MarkovDecisionProcess(user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma)

opt_policy, opt_val = MDP_Example.policy_iteration()

# Print the optimal policy

print("Optimal Policy:")

for state, order_quantity in opt_policy.items():

print(f"For state {state}, the optimal order quantity is: {order_quantity}")

# Print the optimal value function

print("\nOptimal Value Function:")

print(opt_val)

## Summary:

**Inventory optimization**is not a static optimization problem; rather, it is a sequential decision-making problem. It means we need an “adaptive” policy where a decision (number of orders) is dependent on the state of the inventory.- In this blog, we found the “optimal” ordering policy, based on Dynamic Programming Algorithm.
- The goal was to build a foundation on how to think about Inventory Optimization problems, how to frame them, and how to solve them using appropriate mathematical modeling.

## References:

- [1] You can read this example more in-depth in “
**Foundation of Reinforcement Learning with Application in Finance**”. However, I have rewritten Python codes in this blog to make it easier to understand. - Bellman, Richard., A Markovian Decision Process (1957), Journal of Mathematics and Mechanics.
- Howard, R. A. 1960. Dynamic Programming and Markov Processes, (1960.), Cambridge, MA: MIT Press.

## Thanks for reading so far!

I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.

*If you think this article helped you to learn more about inventory optimization and Markov Process, please give it a U+1F44F and follow!*

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI