NEAT with Hindsight Experience Replay
Last Updated on August 23, 2023 by Editorial Team
Author(s): Kevin Guo
Originally published on Towards AI.
After implementing NEAT in Unity a few weeks ago, I tried to think of ways to improve its performance.
I remember reading about a novel technique used in reinforcement learning algorithms known as Hindsight Experience Replay (HER). With this technique, states an agent reaches during failed attempts are stored and learned from as if they had been the intended goals.
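In its usual RL setting, HER works by relabeling: a state the agent actually reached is treated as though it had been the goal all along, and rewards are recomputed accordingly. Here is a minimal sketch of that relabeling idea, assuming a toy trajectory stored as a list of dicts (illustrative only, not the article's implementation):

```python
def hindsight_relabel(trajectory, achieved_state):
    """Pretend a state the agent actually reached was the goal all along."""
    relabeled = []
    for step in trajectory:
        step = dict(step)                      # copy so the original trajectory stays intact
        step["goal"] = achieved_state          # swap in the achieved state as the goal
        step["reward"] = 1.0 if step["state"] == achieved_state else 0.0
        relabeled.append(step)
    return relabeled
```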
At first, I didn't think much of it. Due to the speciation method NEAT employs, one population is capable of searching a domain in many different directions. This means that experiences collected by one species searching an area of the domain would not be applicable to genomes outside that species. However, that's when I had a thought: what if each species maintained a separate list of experiences?
I immediately set out to test my ideas.
Testing Environment
An environment is set up like this:
The agents are spawned on the bottom left and are to reach the star on the top left. The agents can only jump high enough to land on the bottom two platforms.
The agent looks like this.
For input, the agent receives (normalized):
The difference between the position of the goal and the position of the agent
The position of the agent
The y velocity of the agent
Whether or not the agent is on the ground and can jump
The result of 16 raycasts in a circle around the agent, which return the distance to the nearest platform they hit.
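Concretely, the observation could be assembled like this. The function and the max_distance normalization constant are my own illustration of the list above, not code from the project:

```python
def build_observation(agent_pos, goal_pos, y_velocity, on_ground,
                      raycast_distances, max_distance):
    """Assemble the normalized input vector described above.

    agent_pos and goal_pos are (x, y) tuples; raycast_distances holds 16 distances
    to the nearest platform, one per direction around the agent; max_distance is a
    hypothetical normalization constant (e.g. the width of the level).
    """
    obs = [
        (goal_pos[0] - agent_pos[0]) / max_distance,   # goal minus agent position (x)
        (goal_pos[1] - agent_pos[1]) / max_distance,   # goal minus agent position (y)
        agent_pos[0] / max_distance,                   # agent position (x)
        agent_pos[1] / max_distance,                   # agent position (y)
        y_velocity,                                    # assumed already normalized
        1.0 if on_ground else 0.0,                     # grounded, i.e. can the agent jump
    ]
    obs += [d / max_distance for d in raycast_distances]  # 16 raycast distances
    return obs                                             # 22 values in total
```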
For output, the agent gives:
One output for whether to move left (<0.5) or right (>0.5)
One output for whether or not to jump (>0.5)
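The two outputs might be decoded like this (a sketch; the thresholds come from the description above, the function itself is mine):

```python
def decode_action(outputs):
    """Map the two network outputs to a movement direction and a jump decision."""
    move_output, jump_output = outputs
    direction = "right" if move_output > 0.5 else "left"
    should_jump = jump_output > 0.5
    return direction, should_jump
```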
Configuration
25% of each generation was created through asexual reproduction, and genomes created through crossover had a 35% chance of mutation. The best two genomes in each species were carried over unchanged, and there was a 5% chance of an interspecies crossover occurring. The initial generation contained 2000 genomes to ensure the population started with enough members capable of climbing the first platform, but subsequent generations only had 500 genomes. A population was considered failed if no agent reached the star within 500 generations.
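For reference, here are those settings collected in one place. The key names are mine and do not correspond to any particular NEAT library's configuration format:

```python
NEAT_CONFIG = {
    "asexual_reproduction_fraction": 0.25,  # 25% of each generation from mutation alone
    "crossover_mutation_chance": 0.35,      # chance a crossover offspring is also mutated
    "elites_per_species": 2,                # best two genomes copied over unchanged
    "interspecies_crossover_chance": 0.05,
    "initial_population_size": 2000,        # large first generation to seed platform climbers
    "population_size": 500,                 # size of every subsequent generation
    "max_generations": 500,                 # a run counts as failed past this point
}
```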
Hindsight Experience Replay took a bit of tweaking to work with NEAT, but I ultimately settled on a strategy that looked like this:
At the end of each generation, sample a goal/experience from each genome (the last platform the genome was standing on, so that the goals aren't floating in mid-air) that fits the following criteria:
- The sample is a few jumps above the starting point, so that no samples are taken at the bottom.
- The sample is some distance away from every other goal that has been sampled for the species of the given genome.
Each sampled goal is added to the goal list of the genome's species, so that each species maintains its own separate list of goals.
When a new generation is created, assign each species two goals from its goal list, such that the goals with the highest success rates are least likely to be assigned. Newly sampled goals are treated as having a success rate of 0%. Each generation is then evaluated three times: once on the real goal and twice on the sampled goals.
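A sketch of this per-species sampling and assignment might look like the following. The height and spacing thresholds, the data layout, and the with-replacement sampling are assumptions on my part:

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Species:
    goals: list = field(default_factory=list)   # each species keeps its own goal list

MIN_HEIGHT_ABOVE_START = 3.0   # hypothetical: "a few jumps above the starting point"
MIN_GOAL_SPACING = 2.0         # hypothetical: minimum spacing between sampled goals

def maybe_sample_goal(species, last_platform_pos, start_y):
    """At the end of a generation, add a genome's last platform as a goal for its
    species if it is high enough and far enough from the species' existing goals."""
    if last_platform_pos[1] - start_y < MIN_HEIGHT_ABOVE_START:
        return
    if any(math.dist(last_platform_pos, g["pos"]) < MIN_GOAL_SPACING for g in species.goals):
        return
    species.goals.append({"pos": last_platform_pos, "successes": 0, "attempts": 0})

def assign_goals(species, count=2):
    """Pick goals for the next generation, making high-success-rate goals least likely.
    Newly sampled goals have 0 attempts and so count as a 0% success rate."""
    weights = [1.0 - (g["successes"] / g["attempts"] if g["attempts"] else 0.0) + 1e-3
               for g in species.goals]
    # random.choices samples with replacement; the article does not say whether the
    # same goal can be assigned twice, so this is one possible reading.
    return random.choices(species.goals, weights=weights, k=count)
```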
The fitness is computed from the following two values:
- (starting_distance - closest_distance) / starting_distance, to reward the agent for getting close to the goal
- (starting_distance - ending_distance) / ending_distance, to reward the agent for staying close to the goal
If an agent stays on the bottom for too long, it is removed from the simulation and receives a penalty. This is mainly to speed up evaluation.
For NEAT with HER, the reward is cubed to amplify the differences between good and bad solutions. These values can be negative if the agent moves away from the goal.
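Putting the pieces together, the fitness calculation could look roughly like this. How the two terms are combined and the size of the loitering penalty are not specified above, so those parts are assumptions:

```python
def episode_fitness(starting_distance, closest_distance, ending_distance,
                    loitered_at_bottom, use_her_shaping=True, bottom_penalty=1.0):
    """Combine the two distance terms above; summing them and the penalty magnitude
    are assumptions, not stated in the article."""
    approach = (starting_distance - closest_distance) / starting_distance
    # The denominator follows the formula as written above; the epsilon guard is mine,
    # to avoid dividing by zero when the agent ends exactly on the goal.
    staying = (starting_distance - ending_distance) / max(ending_distance, 1e-6)
    if use_her_shaping:
        # Cubing preserves the sign (moving away stays negative) while amplifying
        # the gap between good and bad solutions.
        approach, staying = approach ** 3, staying ** 3
    fitness = approach + staying
    if loitered_at_bottom:
        fitness -= bottom_penalty    # hypothetical magnitude for the loitering penalty
    return fitness
```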
Results
Standard NEAT with only a distance reward did poorly on this problem, failing six out of ten times. NEAT is extremely reliant on a string of coincidental mutations early on to find the correct path and solve the problem. When it fails to do so, it ends up stuck in a local optimum where it climbs the platforms on the left but cannot get any further.
On the other hand, NEAT with HER performs reasonably well. It solves the problem in a median of 149 generations (Stddev: 90) and doesn't fail once, although its worst attempt took 415 generations. Interestingly, NEAT with HER unfolded like a novelty search: as older samples were assigned less and less often under the success-rate weighting, agents were forced to explore new and harder-to-reach areas of the environment. It is worth noting that the target goal seemed to be stumbled upon by chance, although more training might teach the agents to reach any point in the environment without memorizing a sequence of steps.
However, it is important not to overstate this accomplishment. With a reward function based simply on the agents' y position, standard NEAT solves the problem in a median of 94 generations (Stddev: 36) with less evaluation time, though it fails once.
Conclusions
NEAT with HER can modestly promote generalization and success on difficult problems without hand-tuning the fitness function. However, this comes at the cost of increased evaluation time and slower convergence compared to a well-designed fitness function. Interestingly, it might be possible to train agents this way without an explicit goal at all. Of course, each species still needs at least one sampled goal in order to train, but beyond that, the sampling-and-assignment strategy promotes exploration of the domain with or without an explicit goal.