On-Policy vs. Off-Policy Monte Carlo, With Visualizations
Author(s): James Koh, PhD

Comes with plug-and-play code incorporating importance sampling

In Reinforcement Learning, we either use Monte Carlo (MC) estimates or Temporal Difference (TD) learning to establish the ‘target’ return from sample episodes. Both approaches allow us to learn from an environment in which transition dynamics are unknown, i.e., p(s', r | s, a) is unknown.
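To make the word ‘target’ concrete, here are the two targets side by side. This is a sketch in the common textbook notation, not something stated in the text above: the discount factor \gamma, the terminal time step T, and the current action-value estimate Q are all assumed symbols, and the TD line shows only the one-step variant.

```latex
% MC target: the full discounted return observed until the episode ends at step T
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T

% One-step TD target: one observed reward plus a bootstrapped estimate of the rest
R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1})
```

Neither expression involves p(s', r | s, a): both are built purely from sampled rewards and, for TD, the current estimate.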

MC uses the full return observed from a state-action pair until the terminal state is reached. The resulting estimate has high variance but is unbiased when the sampled returns are independent and identically distributed.
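As a rough illustration of "full returns from a state-action pair", here is a minimal first-visit MC sketch. It assumes an episode is stored as a list of (state, action, reward) tuples and that returns are discounted by a factor gamma; the function names, the episode format, and the toy data are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every time step t of one finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def first_visit_mc(episodes, gamma=0.9):
    """Average the first-visit return of each (state, action) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        rewards = [reward for (_, _, reward) in episode]
        returns = discounted_returns(rewards, gamma)
        seen = set()
        for t, (state, action, _) in enumerate(episode):
            if (state, action) in seen:
                continue  # only the first visit in an episode is counted
            seen.add((state, action))
            totals[(state, action)] += returns[t]
            counts[(state, action)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


# Two hypothetical episodes that visit the same pairs but end with different rewards.
episodes = [
    [("A", "right", 0.0), ("B", "right", 1.0)],
    [("A", "right", 0.0), ("B", "right", 3.0)],
]
q = first_visit_mc(episodes, gamma=0.5)
```

The two toy episodes give different returns for the same pair, which is the variance mentioned above; averaging over many independent episodes is what pulls the estimate toward the true value.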

I will save comparisons between MC and TD for another day, when I can back them up with code. For today, the focus is on MC itself. I… Read the full blog for free on Medium.

