Dreamer: A State-of-the-art Model-Based Reinforcement Learning Agent

Last Updated on May 31, 2020 by Editorial Team

Author(s): Sherwin Chen


A brief walk-through of a state-of-the-art model-based reinforcement learning algorithm

Image by Leandro De Carvalho from Pixabay

We discuss a model-based reinforcement learning agent called Dreamer, proposed by Hafner et al. at DeepMind, which achieves state-of-the-art performance on a variety of image-based control tasks while requiring far fewer samples than contemporary model-free methods.

Overview

Dreamer is composed of three parts:

  • Dynamics learning: As a model-based RL method, Dreamer learns a dynamics model consisting of four components: a representation model, a transition model, a reconstruction model, and a reward model.
  • Behavior learning: Based on the dynamics model, Dreamer learns an actor-critic architecture to maximize rewards on imagined trajectories.
  • Environment interaction: Dreamer interacts with the environment using the action model for data collection.

In the rest of the post, we will focus our attention on the first two parts, as the third follows naturally from them.
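To see how the three parts fit together in practice, here is a minimal Python sketch of the overall training loop. The object names and method signatures (env, replay, world_model, actor, critic, and their methods) are hypothetical placeholders for illustration, not the authors' actual implementation:

```python
def dreamer_training_loop(env, replay, world_model, actor, critic,
                          iterations=1000, batch_size=50, seq_len=50,
                          horizon=15, steps_per_iter=100):
    """Structural sketch of Dreamer's training loop (hypothetical interfaces)."""
    obs = env.reset()
    for _ in range(iterations):
        # 1) Dynamics learning: fit the world model on replayed sequences.
        batch = replay.sample(batch_size, seq_len)
        posterior_states = world_model.update(batch)

        # 2) Behavior learning: imagine trajectories starting from posterior
        #    states and update the actor-critic on the imagined rollouts.
        imagined = world_model.imagine(actor, posterior_states, horizon)
        actor.update(imagined)    # maximize predicted lambda-returns
        critic.update(imagined)   # regress the lambda-returns

        # 3) Environment interaction: collect new experience with the policy.
        for _ in range(steps_per_iter):
            action = actor.act(world_model.encode(obs))
            next_obs, reward, done = env.step(action)
            replay.add(obs, action, reward, done)
            obs = env.reset() if done else next_obs
```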

Dynamics Learning

Dreamer represents the dynamics as a sequential model with the following components:
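In the notation of the Dreamer paper, these components can be written as follows (a paraphrase; the exact symbols are my reconstruction):

```latex
\begin{aligned}
\text{Representation model:}\quad & p(s_t \mid s_{t-1}, a_{t-1}, o_t) \\
\text{Transition model:}\quad     & q(s_t \mid s_{t-1}, a_{t-1}) \\
\text{Reconstruction model:}\quad & q(o_t \mid s_t) \\
\text{Reward model:}\quad         & q(r_t \mid s_t)
\end{aligned}
```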

All of these models are parameterized by (de)convolutional or fully-connected networks with Gaussian heads.

In the rest of the post, we refer to the transition model as the prior and the representation model as the posterior since the latter is additionally conditioned on the observation.

Figure 1. The recurrent state-space model (RSSM) in a single time step. h is the deterministic state, s is the prior stochastic state, s’ is the posterior stochastic state.

Like its predecessor PlaNet, Dreamer adopts the recurrent state-space model (RSSM) as the world model, which splits the state into stochastic and deterministic components. Figure 1 shows how the RSSM is unrolled in a single time step. When training the dynamics model, the RNN takes as input the action a_{t-1} and the posterior stochastic state s’_{t-1} from the previous time step and outputs a deterministic state h_t. The deterministic state h_t is then a) fed into an MLP with a single hidden layer to compute the prior stochastic state s_t; and b) concatenated with the image embedding e_t and fed into another single-layer MLP to compute the posterior stochastic state s’_t. After that, we use the concatenation of h_t and s’_t as the latent feature from which to reconstruct the image, the reward, etc. Mathematically, we can divide the world model into the following parts:
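Consistent with the description above (h the deterministic state, s the prior, s’ the posterior, e_t the image embedding), these parts can be written roughly as below; treat the symbols as my paraphrase of the paper's notation:

```latex
\begin{aligned}
\text{Deterministic state:}\quad              & h_t = f(h_{t-1}, s'_{t-1}, a_{t-1}) \\
\text{Prior (transition) state:}\quad         & s_t \sim q(s_t \mid h_t) \\
\text{Posterior (representation) state:}\quad & s'_t \sim p(s'_t \mid h_t, e_t), \quad e_t = \mathrm{enc}(o_t) \\
\text{Reconstruction model:}\quad             & \hat{o}_t \sim q(o_t \mid h_t, s'_t) \\
\text{Reward model:}\quad                     & \hat{r}_t \sim q(r_t \mid h_t, s'_t)
\end{aligned}
```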

In practice, f is an RNN, and all distributions are Gaussian with parameters produced by the corresponding networks.
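To illustrate what "Gaussian heads" means here, below is a minimal PyTorch sketch of one RSSM step. The layer sizes, the choice of a GRU cell, and the softplus parameterization of the standard deviation are assumptions made for this sketch, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSSMStep(nn.Module):
    """One RSSM time step, sketched after the description above (sizes are illustrative)."""

    def __init__(self, stoch=30, deter=200, embed=1024, action=6, hidden=200):
        super().__init__()
        self.pre_rnn = nn.Linear(stoch + action, hidden)
        self.rnn = nn.GRUCell(hidden, deter)                 # h_t = f(h_{t-1}, s'_{t-1}, a_{t-1})
        self.prior_mlp = nn.Sequential(nn.Linear(deter, hidden), nn.ELU(),
                                       nn.Linear(hidden, 2 * stoch))
        self.post_mlp = nn.Sequential(nn.Linear(deter + embed, hidden), nn.ELU(),
                                      nn.Linear(hidden, 2 * stoch))

    @staticmethod
    def _gaussian(stats):
        mean, std = stats.chunk(2, dim=-1)
        std = F.softplus(std) + 0.1                          # keep the std strictly positive
        return torch.distributions.Normal(mean, std)

    def forward(self, prev_h, prev_post, prev_action, embed=None):
        # Deterministic path: RNN over the previous posterior state and action.
        x = F.elu(self.pre_rnn(torch.cat([prev_post, prev_action], -1)))
        h = self.rnn(x, prev_h)
        # Prior stochastic state s_t from h_t alone (used during imagination).
        prior = self._gaussian(self.prior_mlp(h))
        if embed is None:
            return h, prior, None
        # Posterior stochastic state s'_t from h_t and the image embedding e_t.
        post = self._gaussian(self.post_mlp(torch.cat([h, embed], -1)))
        return h, prior, post
```

A sample of the posterior, concatenated with h, would then feed the reconstruction and reward heads.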

One may also regard the RSSM as a sequential VAE, where the latent variable at time step t is associated with that from the previous time step t-1 through the transition model. This gives us the following variational evidence lower bound (ELBO) objective:
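Writing p for the posterior (representation) model and q for the prior (transition), reconstruction, and reward models as above, the objective takes roughly the following form (my paraphrase of the paper's ELBO, maximized with respect to the model parameters):

```latex
\mathcal{J} \;=\; \mathbb{E}\!\left[\,\sum_t \Big( \ln q(o_t \mid s_t) \;+\; \ln q(r_t \mid s_t)
\;-\; \beta\, \mathrm{KL}\big(\, p(s_t \mid s_{t-1}, a_{t-1}, o_t) \,\big\|\, q(s_t \mid s_{t-1}, a_{t-1}) \,\big) \Big) \right]
```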

The objective for training the dynamics model, where β regularizes the information flowing from o_t to s_t

Maximizing this objective leads to model states that predict the sequence of observations and rewards while limiting the amount of information extracted at each time step. This encourages the model to reconstruct each image by relying, as far as possible, on information extracted at preceding time steps, and to access additional information from the current image only when necessary. As a result, the information regularizer encourages the model to learn long-term dependencies.

Before we move on, it is worth stressing that the deterministic and stochastic state models serve different purposes: the deterministic part reliably retains information across many time steps, based on which the stochastic part builds up a compact belief state of the environment. The latter is especially important as the environment is generally only partially observable to the agent; Figure 4 shows that, without the stochastic part, the agent fails to learn anything!

Source: Dream to Control: Learning Behaviors by Latent Imagination

Behavior Learning

Dreamer trains an actor-critic model on the latent state space for behavior learning. More specifically, Dreamer first imagines trajectories starting from some true model states s_τ drawn from the agent’s past experience, following the transition model q(s_{t+1}|s_t, a_t), the policy q(a_t|s_t), and the reward model q(r_t|s_t) (see the right part of Figure 1 for an illustration). Notice that, during imagination, the RNN takes as input the prior stochastic state, since no observation is available. We then train the actor-critic model by maximizing the expected returns along these trajectories with the following objectives:
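Following the Dreamer paper, these objectives take roughly the following form (my paraphrase), where v_ψ is the value model and V_λ the λ-return discussed next:

```latex
\max_{\phi}\;\mathbb{E}_{q_\theta,\,q_\phi}\!\left[ \sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \right]
\qquad\qquad
\min_{\psi}\;\mathbb{E}_{q_\theta,\,q_\phi}\!\left[ \sum_{\tau=t}^{t+H} \tfrac{1}{2}\, \big\| v_\psi(s_\tau) - V_\lambda(s_\tau) \big\|^2 \right]
```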

where θ denotes the parameters of the dynamics model, and φ and ψ are the parameters of the action and value models, respectively. V_λ(s) is the λ-return as in TD(λ), where λ controls the weights of the n-step targets.
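As a concrete illustration, the λ-return along an imagined trajectory of length H can be computed with the standard backward recursion below, where the values are the outputs of the value model v_ψ along the trajectory. This is a generic sketch assuming a fixed discount factor, not the authors' code (Dreamer can also predict discounts with a separate model, which this ignores):

```python
def lambda_returns(rewards, values, bootstrap, discount=0.99, lam=0.95):
    """Backward recursion for TD(lambda)-style returns.

    rewards[t], values[t]: predicted reward and value at imagined step t (t = 0..H-1)
    bootstrap: value estimate at the final imagined state s_H
    """
    returns = [0.0] * len(rewards)
    next_value, next_return = bootstrap, bootstrap
    for t in reversed(range(len(rewards))):
        # V_lambda(s_t) = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_lambda(s_{t+1}))
        returns[t] = rewards[t] + discount * ((1 - lam) * next_value + lam * next_return)
        next_value, next_return = values[t], returns[t]
    return returns

# Example with three imagined steps:
# lambda_returns([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap=0.3)
```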

In fact, the choice of these RL objectives is quite brilliant. I’ve tried applying other off-policy methods to the latent space learned by Dreamer, such as SAC with Retrace(λ), as I thought that function approximation errors introduced by the world model (i.e., the dynamics, reward, and discount models) might cause inaccurate predictions on the imagined trajectories, thereby impairing the performance of the actor-critic model. However, the experimental results suggested the opposite: learning from imagined trajectories outperforms applying off-policy methods in the latent space, both in learning speed and in final performance. This is because learning from imagined trajectories provides richer training signals, facilitating the learning process; if we reduce the length of the imagined trajectories H to 1, Dreamer performs worse than applying SAC to the latent space.

References

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. In ICLR 2020.

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck. In ICLR 2017.

Acknowledgments

I’d like to especially thank Danijar Hafner for discussions about the code.

