
How to Achieve Effective Exploration Without the Sacrifice of Exploitation

Last Updated on June 23, 2020 by Editorial Team

Author(s): Sherwin Chen

Machine Learning

A gentle introduction to the Never Give Up (NGU) agent

Image by Gerd Altmann from Pixabay

Modern deep reinforcement learning agents often do well in either exploration or exploitation. Agents like R2D2 and PPO can easily achieve super-human performance on games with dense rewards, but they generally fail miserably when the reward signal becomes sparse. On the other hand, exploration methods such as ICM and RND can effectively learn in sparse reward environments at the cost of performance in dense reward ones.

We discuss the Never Give Up (NGU) agent, proposed by Puigdomènech Badia et al. at DeepMind, which achieves state-of-the-art performance on hard-exploration Atari games without any prior knowledge, while maintaining a very high score across the remaining games. In a nutshell, the NGU agent uses (1) a combination of episodic and life-long novelties to encourage visits to rarely visited states, thereby achieving effective exploration; and (2) the framework of the Universal Value Function Approximator (UVFA) to let the agent simultaneously learn different trade-offs between exploration and exploitation.

This post consists of two parts. First, we discuss how the NGU agent computes intrinsic rewards that motivate it to explore the environment effectively. Second, we look at the whole agent and see how it uses UVFA to learn both exploratory and exploitative policies.

Intrinsic Rewards

Limitations of Previous Intrinsic Reward Methods

It is known that naive exploration methods such as 𝝐-greedy easily fail in sparse-reward settings, which often require courses of action that extend far into the future. Many intrinsic-reward methods have been proposed to drive exploration. One type is based on the idea of state novelty, which quantifies how different the current state is from those already visited. The problem is that these exploration bonuses fade away as the agent becomes familiar with the environment; this discourages any future visit to a state once its novelty has vanished, regardless of the downstream learning opportunities it might allow. The other type of intrinsic reward is built upon prediction error. However, building a predictive model from observations is often expensive, error-prone, and hard to generalize to arbitrary environments. In the absence of the novelty signal, both kinds of algorithms reduce to undirected exploration schemes, so a careful calibration between the speed of the learning algorithm and that of the vanishing rewards is often required.

The Never-Give-Up Intrinsic Reward

Puigdomènech Badia et al. propose an intrinsic reward that combines episodic and life-long novelties as follows:

r^i_t = r^{episodic}_t · min{max{𝛼_t, 1}, L}

where the life-long novelty 𝛼_t can be interpreted as a modulator that controls the scale of the episodic intrinsic reward r^{episodic}_t, and L is a chosen maximum reward scaling.

In the above equation, the episodic novelty encourages visits to states that are distinct from previously visited states but rapidly discourages revisiting the same state within the same episode. The life-long novelty, on the other hand, regulates the episodic novelty by slowly discouraging visits to states that have been visited many times across episodes. We address how each novelty is computed in the following subsections.
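To make the combination concrete, here is a minimal NumPy sketch of the mixing rule above; the function name and the default value of L are illustrative choices, not taken from the NGU codebase.

```python
import numpy as np

def combined_intrinsic_reward(r_episodic: float, alpha: float, L: float = 5.0) -> float:
    """Mix episodic and life-long novelty: r^i_t = r^episodic_t * min(max(alpha_t, 1), L).

    The life-long novelty alpha is clipped to [1, L], so it can amplify the
    episodic reward but never suppress it below its raw value, and never
    scale it up by more than a factor of L.
    """
    return float(r_episodic * np.clip(alpha, 1.0, L))
```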

Episodic novelty module

In order to compute the episodic intrinsic reward, the agent maintains an episodic memory M, which keeps track of the controllable states visited in the current episode. A controllable state is a representation of an observation produced by an embedding function f; it is "controllable" because f is learned to ignore aspects of an observation that are not influenced by the agent's actions. In practice, NGU trains f through an inverse dynamics model, a submodule from ICM that takes in observations x_t and x_{t+1} and predicts the action a_t (see Figure 15, left). The episodic intrinsic reward is then defined as

r^{episodic}_t = 1 / √n(f(x_t))

where n(f(x_t)) is the count of visits to the abstract state f(x_t). We approximate these counts as the sum of the similarities given by a kernel function K over the contents of the memory M. In practice, pseudo-counts are computed from the k-nearest neighbors of f(x_t) in the memory M with the following inverse kernel

where 𝝐 guarantees a minimum amount of "pseudo-counts", d is the Euclidean distance, and d²_m is the running average of the squared Euclidean distance to the k-nearest neighbors. This running average is used to make the kernel more robust to the tasks being solved, as different tasks may have different typical distances between learned embeddings. Algorithm 1 details the process.
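Putting the pieces together, a simplified sketch of the episodic module might look like the following. It omits the cluster-distance and maximum-similarity refinements of the paper's Algorithm 1, and the constant values are illustrative.

```python
import numpy as np

def episodic_novelty(f_x, memory, k=10, eps=1e-3, c=1e-3):
    """Simplified episodic intrinsic reward via k-nearest-neighbor pseudo-counts.

    f_x:    embedding f(x_t) of the current observation, shape (d,)
    memory: embeddings already stored in M this episode, shape (n, d)
    """
    memory = np.asarray(memory)
    if memory.size == 0:
        return 1.0  # the first state of an episode is maximally novel
    # squared Euclidean distances to the k nearest neighbors in M
    d2 = np.sum((memory - f_x) ** 2, axis=1)
    d2_knn = np.sort(d2)[: min(k, len(d2))]
    # normalize by the mean k-NN distance (the paper uses a running average d^2_m)
    d2_norm = d2_knn / max(d2_knn.mean(), 1e-8)
    # inverse kernel K(x, y) = eps / (d^2 / d^2_m + eps)
    similarities = eps / (d2_norm + eps)
    # pseudo-count -> reward: r^episodic_t = 1 / (sqrt(sum K) + c)
    return 1.0 / (np.sqrt(similarities.sum()) + c)
```

After computing the reward, the agent appends f(x_t) to M; the memory is reset at the start of every episode.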

Life-long novelty module

Puigdomènech Badia et al. use RND to compute the life-long novelty: a predictor network \hat g is trained to approximate the output of a fixed, randomly initialized network g. The life-long novelty is then computed as a normalized mean squared error:

𝛼_t = 1 + (err(x_t) − 𝜇) / 𝜎,  with err(x_t) = ‖\hat g(x_t) − g(x_t)‖²

where πœ‡ and 𝜎 are running mean and standard deviation forΒ err(x)

The following figure shows the RND architecture used by the agent.
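For intuition, here is a rough PyTorch sketch of the life-long module. The network sizes, the exponential-moving-average normalization, and the class name are simplifications rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RNDNovelty(nn.Module):
    """Life-long novelty via Random Network Distillation (illustrative sketch)."""

    def __init__(self, obs_dim: int, feat_dim: int = 128):
        super().__init__()
        def make_net():
            return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))
        self.g = make_net()        # fixed, randomly initialized target network
        self.g_hat = make_net()    # predictor network, trained to mimic g
        for p in self.g.parameters():
            p.requires_grad_(False)
        self.mu, self.sigma = 0.0, 1.0  # running statistics of err(x)

    def prediction_error(self, obs: torch.Tensor) -> torch.Tensor:
        # err(x) = || g_hat(x) - g(x) ||^2, averaged over the feature dimension
        return ((self.g_hat(obs) - self.g(obs).detach()) ** 2).mean(dim=-1)

    def alpha(self, obs: torch.Tensor, momentum: float = 0.99) -> torch.Tensor:
        """Map a batch of observations to life-long novelty alpha_t = 1 + (err - mu) / sigma."""
        err = self.prediction_error(obs)
        # crude exponential-moving-average estimates of the mean/std of err
        self.mu = momentum * self.mu + (1 - momentum) * err.mean().item()
        self.sigma = momentum * self.sigma + (1 - momentum) * err.std().item()
        return 1.0 + (err - self.mu) / (self.sigma + 1e-8)
```

The predictor \hat g is trained by gradient descent to minimize err(x) on the states the agent actually visits; that optimization step is omitted here for brevity.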

The Never-Give-Up Agent

As shown in Figure 17, the NGU agent uses the same architecture as the R2D2 agent, except that it additionally feeds the intrinsic reward r^i and its coefficient 𝛽 to the LSTM. This scheme, following the Universal Value Function Approximator (UVFA), allows the agent to simultaneously approximate the optimal value functions with respect to a family of augmented rewards of the form

r_{j,t} = r^e_t + 𝛽_j r^i_t

where r^e is the extrinsic reward provided by the environment and r^i is the intrinsic reward defined in the previous section. The reward, as well as the policy, is parameterized by a discrete value 𝛽_j from the set {𝛽_j}_{j=0}^{N-1}, which controls the strength of r^i. This allows the agent to learn a family of policies that make different trade-offs between exploration and exploitation.
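A small sketch of how the family of augmented rewards could be materialized; the function name and array layout are illustrative.

```python
import numpy as np

def augmented_rewards(r_extrinsic, r_intrinsic, betas):
    """r_{j,t} = r^e_t + beta_j * r^i_t: one reward stream per mixture index j."""
    r_e = np.asarray(r_extrinsic, dtype=np.float64)  # shape (T,)
    r_i = np.asarray(r_intrinsic, dtype=np.float64)  # shape (T,)
    betas = np.asarray(betas, dtype=np.float64)      # shape (N,)
    return r_e[None, :] + betas[:, None] * r_i[None, :]  # shape (N, T)
```

In the agent itself, a single recurrent Q-network is conditioned on the index j (via a one-hot encoding), so all N policies share the same parameters.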

The advantage of learning a larger number of policies comes from the fact that exploitative and exploratory policies can be quite different from a behavioral standpoint. Having a larger number of policies that change smoothly allows for more efficient training. One may regard jointly training the exploratory policies, i.e. those with 𝛽>0, as training auxiliary tasks, which helps train the shared architecture.

In practice, we additionally associate the exploitative policy (with 𝛽=0) with the highest discount factor 𝛾, and the most exploratory policy (with the largest 𝛽) with the smallest discount factor. We use smaller discount factors for exploratory policies because the intrinsic reward is dense and its range of values is small, whereas we would like the highest possible discount factor for the exploitative policy in order to be as close as possible to optimizing the undiscounted return. The following equations show the schedule Puigdomènech Badia et al. use in their experiments.

where 𝜎 is the sigmoid function and N=32 is the total number of available choices of 𝛽 and 𝛾.
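As an illustration, the sketch below implements a schedule in the same spirit: 𝛽 rises from 0 to a maximum following a sigmoid over the policy index, while 𝛾 decreases from its largest to its smallest value. The constants (𝛽_max=0.3, 𝛾 between 0.99 and 0.997) follow our reading of the paper's appendix but should be treated as assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def beta_gamma_schedule(n=32, beta_max=0.3, gamma_min=0.99, gamma_max=0.997):
    """Illustrative beta/gamma schedule in the spirit of NGU.

    Policy j=0 is purely exploitative (beta=0, largest gamma); policy j=N-1
    is the most exploratory (beta=beta_max, smallest gamma); intermediate
    policies interpolate smoothly.
    """
    j = np.arange(n)
    # beta grows from 0 to beta_max following a sigmoid over the policy index
    betas = beta_max * sigmoid(10.0 * (2.0 * j - (n - 2)) / (n - 2))
    betas[0], betas[-1] = 0.0, beta_max
    # gamma decreases from gamma_max to gamma_min, log-linearly in (1 - gamma)
    log_1m_gamma = ((n - 1 - j) * np.log(1.0 - gamma_max)
                    + j * np.log(1.0 - gamma_min)) / (n - 1)
    gammas = 1.0 - np.exp(log_1m_gamma)
    return betas, gammas
```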

Reinforcement learning objective

As the agent is trained on sequential data, Puigdomènech Badia et al. use a hybrid of Retrace(𝝀) and the transformed Bellman operator. This gives the following transformed Retrace operator:

Like DDQN, we compute this operator's target using the target network and update the online Q-network to approach this value.
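The full Retrace(𝝀) correction with importance weights is beyond the scope of this post, but the sketch below shows the invertible value-rescaling function h used by the transformed Bellman operator and how a simple one-step transformed target would be formed. It is a simplification for illustration, not the paper's exact update.

```python
import numpy as np

EPS = 1e-2  # epsilon of the h-transform (the value commonly used by R2D2-style agents)

def h(z):
    """Value rescaling h(z) = sign(z) * (sqrt(|z| + 1) - 1) + eps * z."""
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + EPS * z

def h_inv(z):
    """Closed-form inverse of h."""
    return np.sign(z) * (((np.sqrt(1.0 + 4.0 * EPS * (np.abs(z) + 1.0 + EPS)) - 1.0)
                          / (2.0 * EPS)) ** 2 - 1.0)

def transformed_one_step_target(r, gamma, q_target_next):
    """One-step transformed target: h(r + gamma * h^{-1}(max_a Q_target(s', a)))."""
    return h(r + gamma * h_inv(np.max(q_target_next)))
```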

References

Badia, Adrià Puigdomènech, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, et al. 2020. "Never Give Up: Learning Directed Exploration Strategies." http://arxiv.org/abs/2002.06038.

