Towards AI Can Help your Team Adopt AI: Corporate Training, Consulting, and Talent Solutions.

Publication

NEAT with Hindsight Experience Replay
Latest   Machine Learning

NEAT with Hindsight Experience Replay

Last Updated on August 23, 2023 by Editorial Team

Author(s): Kevin Guo

Originally published on Towards AI.

After implementing NEAT in Unity a few weeks ago, I tried to think of ways to improve its performance.

I remember reading about a novel technique used in reinforcement learning algorithms known as Hindsight Experience Replay. With this technique, failed states that an agent reached are tracked and learned from as if they were successes.

At first, I didn’t think much of it. Due to the speciation method NEAT employs, one population is capable of searching a domain in many different directions. This means that experiences collected by one species searching an area of the domain would not be applicable to genomes outside that species. However, that’s when I had a thought: what if each species maintained a separate list of experiences?

I immediately set out to test my ideas.

Testing Environment

An environment is set up like this:

Source: Image by the Author

The agents are spawned on the bottom left and are to reach the star on the top left. The agents can only jump high enough to land on the bottom two platforms.

The agent looks like this.

Source: Image by the Author

For input, the agent receives (normalized):

The difference between the position of the goal and the position of the agent

The position of the agent

The y velocity of the agent

Whether or not the agent is on the ground and can jump

The result of 16 raycasts in a circle around the agent, which return the distance to the nearest platform they hit.

For output, the agent gives:

One output for whether to move left (<0.5) or right (>0.5)

One output for whether or not to jump (>0.5)

Configuration

25% of each generation was created with asexual reproduction. The portion of the generation created with crossover has a 35% chance of mutation. The best two genomes in each species remained unchanged and there was a 5% chance of an interspecies crossover occurring. There were 2000 genomes in the initial generation to ensure the population started off with enough members capable of climbing the first platform, but subsequent generations only had 500 genomes. A population is considered as failed if it does not find one agent that reaches the star within 500 generations.

Hindsight Experience Replay took a bit of tweaking to work with NEAT, but I ultimately settled on a strategy that looked like this:

Sample goals/experiences (the last platform the genome was on, to make sure the goals aren’t floating mid-air) from each genome at the end of a generation that fit the following criteria:

  1. The sample is a few jumps above the starting point, so that no samples are taken at the bottom.
  2. The sample is some distance away from every other goal that has been sampled for the species of the given genome.

The goals are added to the species of each genome so that each species maintains a separate list of goals.

When a new generation is created, assign each species two goals from its goal list, in a way such that the goals with the highest success rate are least likely to be assigned. Note that newly sampled goals are considered as having a success rate of 0%. Each generation is then evaluated three times, once for the real goal and twice for the sampled goals.

The fitness is computed with the following two values:

(starting_distance-closest_distance)/starting_distance

to reward the agent for getting close to the goal

(starting_distance-ending_distance)/ending_distance

to reward the agent for staying close to the goal.

If an agent stays on the bottom for too long, it is removed from the simulation and receives a penalty. This is mainly to speed up evaluation.

For NEAT with HER, the reward is cubed to amplify the differences between good and bad solutions. These values can be negative if the agent moves away from the goal.

Each species is given two starting samples (represented by the blue diamonds) to ensure they could be evaluated three times, but in hindsight this was unnecessary (sorry for the terrible joke). Source: Image by the Author

Results

Standard NEAT with only a distance reward did poorly on this problem, failing six out of ten times. NEAT is extremely reliant on a string of coincidental mutations early on to find the correct path and solve the problem. However, when it fails to do so, it ends up stuck on a local minimum where it climbs the platforms on the left but cannot get further.

Standard NEAT is unable to get any higher

On the other hand, NEAT with HER performs reasonably well. It solves the problem with a median of 149 generations (Stddev: 90) and doesn’t fail once, although it took 415 generations on its worst attempt. Interestingly, NEAT with HER unfolded like a novelty search: as earlier samples became rarer due to the assignment method, agents were forced to explore into new and harder-to-reach areas of the environment. It is important to note that the target goal seemed to have been stumbled upon by chance, although more training might teach the agents how to reach any point in the environment without memorizing a sequence of steps.

Sampled goals promote exploration and decrease the chance of premature convergence.

However, it is important not to overstate this accomplishment. Using a reward function that rewards the agents based on their y value, standard NEAT solves this with a median of 94 generations (Stddev: 36) and less evaluation time but fails once.

Conclusions

NEAT with HER can modestly promote generalization and success on difficult problems without manually messing with a fitness function. However, this comes at a cost of increased evaluation time, and slower convergence when compared to a smartly designed fitness function. Interestingly, it might be possible to train agents this way without an explicit goal at all. Of course, each species still must have at least one sample goal in order to train, but beyond that, the strategy of sampling and assignment promotes exploration of the domain with or without an explicit goal.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓