# Inventory Optimization with Dynamic Programming in Less than 100 Lines of Python Code

Last Updated on December 11, 2023 by Editorial Team

#### Author(s): Peyman Kor

Originally published on Towards AI.

## Intro

Inventory optimization is a broad problem that arises across many domains. The central question is all about figuring out:

“How many products to order for your store periodically, in order maximize the profitability?”

I think you are a bike shop manager. Every day, you need to contact your supplier to order a number of bikes for your store. If you order too many bikes daily, you will spend too much money on maintaining and keeping the bikes in the store. On the other hand, if you order fewer, you might not have enough bikes to meet customer demands. You need to have an “ordering strategy” that balances two factors.

We will break down this problem and find the optimal “ordering strategy”. The “optimal ordering strategy” is based on three pillars:

• Markov Process,
• Markov Rewards Process
• and Markov Decision Processes.

Here and here, I covered how to frame the “inventory optimization” problem, the Markov Process, and the Markov Reward Process.

## Framing the Inventory Optimization Problem

Inventory Optimization is a Sequential Decision-Making Problem. We can frame this problem by defining what the states are, uncertainties, and what the **recurring decisions** are.

1. State:

The state (the internal representation of the process) can be described by two components:

• α : is the number of bikes you already have in store
• β : is the number of bikes that you ordered the previous day and will arrive tomorrow morning (these bikes are on a truck)

2. Uncertainty:

In this example, uncertainty is random demand demand modeled following Poisson distributions (with Poisson parameter λ∈R>). A demand of i bikes for each i=0,1,2⋯ occurs with probability:

3. Decisions:

In this example, the bike shop owner must decide every day at 6PM, on how many bikes to order from the supplier.

## Sequence of 24-hr cycle

The sequence of events in a 24-hour cycle for this bike shop is as follows:

1. At 6 PM: you observe the state S_t:(α,β)
2. At 6 PM: you order the new bikes
3. At 6 Am: You receive bikes that you ordered 36hr ago
4. At 8 AM: You open the store
5. From 8 Am to 6 PM you experience demand i during the day (modelled using Poisson distribution) more below.
6. At 6 PM you close the store

The diagram below visualizes this sequence:

In this blog, we are going to summarize everything and connect the Markov Process and Markov Reward Process with the Markov Decision Process to arrive at the optimal “ordering strategy”, all with codes in Python.

## Markov Decision Process:

Markov Decision Process (MDP) is a mathematical model whose key components are states, actions, transition probabilities, and rewards.

In the Markov Reward Process, the decisions were fixed, and we were not considering all possible actions for each state. In Markov Decision Process (MDP), we need to build a data structure that for each state, we need to consider all possible actions, and following the $(S_t,A_t)$, we need to know what will be $(S_{t+1}, R_{t+1})$.

Example of Markov Decision Process Data Structure

Here, I am giving you an example of an MDP dictionary, where you can see for each state, all possible actions need to be considered:

MarkovDecProcessDict = {"Current State A":{"Action 1":{("NextS1fromAact1", "Reward1"): "PNextS1fromAact1" ,("NextS2fromAact1", "Reward2"): "PNextS2fromAact1"}, "Action 2":{("NextS1fromAact2", "Reward2"): "PNextS1fromAact2" ,("NextS2fromAact2", "Reward2"): "PNextS2fromAact2"}},  "Current State B":{"Action 1":{("NextS1fromBact1", "Reward1"): "PNextS1fromBact1" ,("NextS2fromBact1", "Reward2"): "PNextS2fromBact1"}, "Action 2":{("NextS1fromBact2", "Reward2"): "PNextS1fromBact2" ,("NextS2fromBact2", "Reward2"): "PNextS2fromBact2"}}}for current_state, actions in MarkovDecProcessDict.items(): print(f"Current State: {current_state}")  for action, transitions in actions.items(): print(f" Action: {action}")  for (next_state, reward), probability in transitions.items(): print(f" ({next_state},{reward}): {probability}")

## Markov Decision Process for Inventory Optimization

Here, we are building the dictionary, where for each state, we are considering all possible actions, and for each action, we are considering all possible states and rewards. This dictionary name is **MDP_dict**, and I wrote the code of it below:

from typing import Dict, Tuple# poisson is used to find pdf of Poisson distribution from scipy.stats import poissonMDP_dict: Dict[tuple, Dict[tuple, tuple]] = {}user_capacity = 2user_poisson_lambda = 1holding_cost = 1missedcostumer_cost = 10for alpha in range(user_capacity+1):   for beta in range(user_capacity + 1 - alpha):  # This is St, the current state state = (alpha, beta)  # This is initial inventory, total bike you have at 8AM  init_inv = alpha + beta   base_reward = -alpha* holding_cost  action = {} # Consider all possible actions for order in range(user_capacity-init_inv +1):  #action = {} dict1 = {} for i in range(init_inv +1): # if initial demand can meet the deman if i <= (init_inv-1):  # probality of specifc demand can happen transition_prob = poisson.pmf(i,user_poisson_lambda) dict1[((init_inv - i, order), base_reward)] = transition_prob  # if initial demand can not meet the demand else:  transition_prob = 1- poisson.cdf(init_inv -1, user_poisson_lambda)  # probability of not meeting the demands transition_prob2 = 1- poisson.cdf(init_inv, user_poisson_lambda)  # total reward  reward = base_reward - missedcostumer_cost*((user_poisson_lambda*transition_prob) - \ init_inv*transition_prob2)  dict1[((init_inv - i, order),reward)] = transition_prob #if state in MDP_dict: action[order] = dict1 MDP_dict[state]= actionMDP_dict# Constants

“MDP_dict” is a Python dictionary where each key represents the “current state,” and the value is “all possible actions at that state,” corresponding to the next state, immediate reward, and probability of the next state. Here is the explanation in the schematic:

## Dynamic Programming

Richard Bellman (1950s) first coined the term called Dynamic Programming (DP). Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems.

DP generally refers to general theories of the Markov Decision Process and algorithms to find an optimal policy in MDP, relying heavily on the Bellman Equation.

In the context of this article, we use the term Dynamic Programming with the goal of finding the optimal policy for the Inventory Optimization problem.

There are generally two import DP algorithms:

Value Function Iteration Algorithm (Bellman 1957)

– Policy Iteration Algorithm (Howard 1960)

In this article, we are going to focus on the Policy Iteration Algorithm, and we are going to implement it in Python.

## Policy Iteration Algorithm for Inventory Optimization:

The Policy Iteration Algorithm is a method for finding the optimal policy for a given MDP. The algorithm is based on the following idea:

– 1) Start with an initial policy $π_0$.

– 2) Evaluate the policy $π_0$ by computing the state-value function $V^{π_0}$.

– 3) Improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The algorithm iterates on the above steps until the policy converges (policy stops to change). We are going to go through all three stages in the following sections.

The policy iteration algorithm needs an initial policy (ordering strategy), which can be any arbitrary policy. In this article, we are going to use the policy that we found in the second article, which is the following:

Initial policy:

π_0=C-(α + β)

The initial policy is that at each stage of the inventory, we order the amount of $C-(α + β)$, which is the difference between the capacity of the inventory and the sum of initial items in the inventory ($α$) and the ones will come tomorrow morning ($β$) .

Here is the code for the initial policy:

user_capacity_val = 2def policy_0_gen(user_capacity: int):  # Generate an initial policy return {(alpha, beta): user_capacity - (alpha + beta)  for alpha in range(user_capacity + 1)  for beta in range(user_capacity + 1 - alpha)}policy_0 = policy_0_gen(user_capacity_val)policy_0

## 2) Evaluate the policy π_0 by computing the state-value V^{π_0}.

Note that any Markov Decision Process with a fixed policy leads to an **implied** Markov Reward Process. So, like the previous article, if we have the Markov Reward Process, we can find each state's value function.

In other words:

The below function will get a **fixed policy** (as input), and then it will return an implied Markov Reward Process:

def MRP_using_fixedPolicy(full_MDP, policy): # Calculate the Markov Reward Process using a fixed policy MRP_policy = {} for state in full_MDP.keys(): action = policy[state] MRP_policy[state] = full_MDP[state][action] return MRP_policy

As an example, we can give the initial policy to the below function, and it will return the implied Markov Reward Process:

MRP_p0=MRP_using_fixedPolicy(MDP_dict, policy_0)MRP_p0

Having the Markov Reward Process, it is pretty easy to find Immediate Rewards for each state and also the state value function for each state:

def calculate_expected_immediate_rewards(MRP_policy): # Calculate the expected immediate rewards from the MRP policy E_immediate_R = {} for from_state, value in MRP_policy.items(): expected_reward = sum(reward[1] * prob for (reward, prob) in value.items()) E_immediate_R[from_state] = expected_reward return E_immediate_RR_ime_p0 = calculate_expected_immediate_rewards(MRP_p0)R_ime_p0
import numpy as npimport pandas as pddef create_transition_probability_matrix(MRP_policy): # Create the transition probability matrix states = list(MRP_policy.keys()) num_states = len(states) trans_prob = np.zeros((num_states, num_states)) df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states) for i, from_state in enumerate(states): for j, to_state in enumerate(states): for (new_state, reward) in MRP_policy.get(from_state, {}): if new_state == to_state: probability = MRP_policy[from_state].get((new_state, reward), 0.0) df_trans_prob.iloc[i, j] = probability  return df_trans_probdef calculate_state_value_function(trans_prob_mat, expected_immediate_rew, gamma=0.9): # Calculate the state value function states = list(expected_immediate_rew.keys()) R_exp = np.array(list(expected_immediate_rew.values())) val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - gamma * trans_prob_mat, R_exp) MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec},  index=states) return MarkRevDatatrans_prob_p0 = create_transition_probability_matrix(MRP_p0)state_val_p0 = calculate_state_value_function(trans_prob_mat=trans_prob_p0, expected_immediate_rew=R_ime_p0)state_val_p0

## 3) Improve the policy by acting greedily with respect to $V^{π_0}$ to get a new policy $π_1$.

The last part of the Policy Iteration Algorithm is to improve the policy by acting greedily with respect to $V^{\pi_0}$ to get a new policy $\pi_1$.

The greedy equation is based on the Bellman Equation, and in fact, it is about finding the action with the highest **State-Value** function for each state.

Here is the code for the greedy operation:

def greedy_operation(MDP_full, state_val_policy, old_policy, gamma=0.9): # Perform the greedy operation to improve the policy new_policy = {} for state in old_policy.keys(): max_q_value, best_action = float('-inf'), None state_val_dict = state_val_policy.to_dict(orient="index") for action in MDP_full[state].keys(): q_value = 0 for (next_state, immediate_reward), probability in MDP_full[state][action].items(): q_value = q_value + probability * (immediate_reward + gamma * (state_val_dict[next_state]["Value Function"])) if q_value > max_q_value: max_q_value, best_action = q_value, action new_policy[state] = best_action return new_policynew_policy = greedy_operation(MDP_full=MDP_dict,  state_val_policy=state_val_p0,  old_policy=policy_0)

The new policy after the first iteration of the policy iteration algorithm is the following:

new_policy

In the Policy Iteration algorithm, we keep iteration on the above three steps until the policy converges. In other words, we keep iterating on the above three steps until the policy stops to change. Here is the code for the Policy Iteration Algorithm:

def policy_iteration(): # Perform policy iteration to find the optimal policy policy = policy_0_gen(user_capacity_val) while True: MRP_policy_p0 = MRP_using_fixedPolicy(MDP_dict, policy) expected_immediate_rew = calculate_expected_immediate_rewards(MRP_policy_p0) trans_prob_mat_val = create_transition_probability_matrix(MRP_policy_p0) value_function = calculate_state_value_function(trans_prob_mat=trans_prob_mat_val, expected_immediate_rew=expected_immediate_rew, gamma=0.9) new_policy = greedy_operation(MDP_full=MDP_dict,  state_val_policy=value_function,  old_policy=policy) if new_policy == policy: break policy = new_policy  opt_policy = new_policy opt_value_func = value_function  return opt_policy, opt_value_func

## What is the Optimal Order?

Now, we can look at the outcome of policy iteration and see what the optimal order is for each state. Here is the code for it:

for state, order_quantity in opt_policy.items(): print(f"For state {state}, the optimal order quantity is: {order_quantity}")

## Putting Everything Together:

I am putting all the codes together in one Python class with its methods.

import numpy as npfrom scipy.stats import poissonimport pandas as pdclass MarkovDecisionProcess: def __init__(self, user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma): # Initialize the MDP with given parameters self.user_capacity = user_capacity self.poisson_lambda = poisson_lambda self.holding_cost, self.stockout_cost = holding_cost, stockout_cost self.gamma = gamma self.full_MDP = self.create_full_MDP() # Create the full MDP def create_full_MDP(self): # Create the full MDP dictionary MDP_dict = {} for alpha in range(self.user_capacity + 1): for beta in range(self.user_capacity + 1 - alpha): state, init_inv = (alpha, beta), alpha + beta  action = {} for order in range(self.user_capacity - init_inv + 1): dict1 = {} for i in range(init_inv + 1): if i <= (init_inv - 1): transition_prob = poisson.pmf(i, self.poisson_lambda) dict1[((init_inv - i, order), -alpha * self.holding_cost)] = transition_prob else: transition_prob = 1 - poisson.cdf(init_inv - 1, self.poisson_lambda) transition_prob2 = 1 - poisson.cdf(init_inv, self.poisson_lambda) reward = -alpha * self.holding_cost - self.stockout_cost * ( (self.poisson_lambda * transition_prob) - init_inv * transition_prob2) dict1[((0, order), reward)] = transition_prob action[order] = dict1 MDP_dict[state] = action return MDP_dict def policy_0_gen(self): # Generate an initial policy return {(alpha, beta): self.user_capacity - (alpha + beta)  for alpha in range(self.user_capacity + 1)  for beta in range(self.user_capacity + 1 - alpha)} def MRP_using_fixedPolicy(self, policy): # Create the MRP using a fixed policy return {state: self.full_MDP[state][action]  for state, action in policy.items()}  def calculate_state_value_function(self, MRP_policy): # Calculate the expected immediate rewards from the MRP policy E_immediate_R = {} for from_state, value in MRP_policy.items(): expected_reward = sum(reward[1] * prob for (reward, prob) in value.items()) E_immediate_R[from_state] = expected_reward # Create the transition probability matrix states = list(MRP_policy.keys()) trans_prob = np.zeros((len(states), len(states))) df_trans_prob = pd.DataFrame(trans_prob, columns=states, index=states) for i, from_state in enumerate(states): for j, to_state in enumerate(states): for (new_state, reward) in MRP_policy.get(from_state, {}): if new_state == to_state: probability = MRP_policy[from_state].get((new_state, reward), 0.0) df_trans_prob.iloc[i, j] = probability # Calculate the state value function R_exp = np.array(list(E_immediate_R.values())) val_func_vec = np.linalg.solve(np.eye(len(R_exp)) - self.gamma * df_trans_prob, R_exp) MarkRevData = pd.DataFrame({'Expected Immediate Reward': R_exp, 'Value Function': val_func_vec}, index=states) return MarkRevData def greedy_operation(self, MDP_full, state_val_policy, old_policy): # Perform the greedy operation to improve the policy new_policy = {} for state in old_policy.keys(): max_q_value, best_action = float('-inf'), None state_val_dict = state_val_policy.to_dict(orient="index") for action in MDP_full[state].keys(): q_value = 0 for (next_state, immediate_reward), probability in MDP_full[state][action].items(): q_value = q_value + probability * (immediate_reward + self.gamma * (state_val_dict[next_state]["Value Function"])) if q_value > max_q_value: max_q_value, best_action = q_value, action new_policy[state] = best_action return new_policy def policy_iteration(self): # Perform policy iteration to find the optimal policy policy = self.policy_0_gen() while True: MRP_policy_p0 = self.MRP_using_fixedPolicy(policy) value_function = self.calculate_state_value_function(MRP_policy_p0) new_policy = self.greedy_operation(self.full_MDP, value_function, policy) if new_policy == policy: break policy = new_policy opt_policy, opt_value_func = new_policy, value_function return opt_policy, opt_value_func

## Example of using the Markov Decision Process class

# Example usage:user_capacity = 2poisson_lambda = 1.0holding_cost = 1stockout_cost = 10gamma = 0.9MDP_Example = MarkovDecisionProcess(user_capacity, poisson_lambda, holding_cost, stockout_cost, gamma)opt_policy, opt_val = MDP_Example.policy_iteration()# Print the optimal policyprint("Optimal Policy:")for state, order_quantity in opt_policy.items(): print(f"For state {state}, the optimal order quantity is: {order_quantity}")# Print the optimal value functionprint("\nOptimal Value Function:")print(opt_val)

## Summary:

• Inventory optimization is not a static optimization problem; rather, it is a sequential decision-making problem. It means we need an “adaptive” policy where a decision (number of orders) is dependent on the state of the inventory.
• In this blog, we found the “optimal” ordering policy, based on Dynamic Programming Algorithm.
• The goal was to build a foundation on how to think about Inventory Optimization problems, how to frame them, and how to solve them using appropriate mathematical modeling.

## Thanks for reading so far!

I hope this article has provided an easy-to-understand tutorial on how to do inventory optimization with Python.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI