Understanding Evolved Policy Gradients

 Learning to hop backwards with EPG.

Learning to hop backwards with EPG.

Here at launchpad.ai and at our fellowship program, we hold biweekly reading groups where we focus on recent research in machine learning.

In April, OpenAI published a blog post about a new deep reinforcement learning (RL) algorithm called Evolved Policy Gradients (EPG). This paper was the subject of our most recent reading group.

OpenAI makes a commendable effort to make their work accessible for the general public; their blog posts provide a “softer” read than the research papers, and they publish code associated with each paper as well (here’s the EPG code repository).

Despite their best efforts to explain the work intuitively, I found their writing on EPG a bit dense. I really wanted to understand it, but I couldn’t find any helpful third-party explanations about it. That’s why I decided to dig in to the paper and write this post: to provide an intuitive explanation of what’s going on with EPG.

Policy Gradients

To understand EPG, one must first understand policy gradients (PG). My first “aha” moment with PG came from reading Andrej Karpathy’s Pong from Pixels post.

If you understand backpropagation as a method for training neural networks in a supervised setting, the basics of PG is a simple extension of the same concept.

In an RL problem, a policy is any function that maps an input state to an output action. The idea with PG is to train a neural network to learn a good policy function using backpropagation. A clever way to do this is to use selected actions as target values in a supervised fashion.

 A diagram taken from Andrej Karpathy’s PG blog post. The green values represent the selected action, and loss is computed between the policy output and these values.

A diagram taken from Andrej Karpathy’s PG blog post. The green values represent the selected action, and loss is computed between the policy output and these values.

As an example, say our network needs to choose between one of three actions; it takes some state input and produces some softmax output like this:

 Some example softmax output from a neural network choosing one of three actions.

Some example softmax output from a neural network choosing one of three actions.

We can choose an action simply by taking the argmax of this output.

 A one-hot encoded representation of the selected action from the above output. (In practice, actions are usually sampled from the distribution provided by the softmax ouput.)

A one-hot encoded representation of the selected action from the above output. (In practice, actions are usually sampled from the distribution provided by the softmax ouput.)

We can then compute the “loss” of our neural network with respect to this selected action vector using a simple loss function, like binary cross-entropy.

 Computing binary cross-entropy loss for this example, taken from  this post  by  Manish Chablani .

Computing binary cross-entropy loss for this example, taken from this post by Manish Chablani.

If we use standard SGD, we can update our neural network to give us an output closer to that selected action given the same input state. This is like saying “behave more like that!” Similarly, we can take a step in the opposite direction (something like “gradient ascent,” just going uphill instead of downhill) which would be like saying “behave less like that!”

That’s the basic idea behind PG, and it’s surprisingly simple. The tough work behind all PG-based RL algorithms is figuring out how to use a PG: when do we want to encourage our policy to repeat an action? When do we want to discourage it? With sparse rewards, this question is nontrivial.

Designing a Loss Function

Other PG methods usually succeed by using an expected reward function and comparing expected rewards to observed rewards, which are “spread” over many time steps using some discount factor. This is essentially a heuristic that deals with the sparse reward problem.

The reasoning here is that the decisions made leading up to some observed reward can be considered “good” and, therefore, encouraged. When observed reward due to actions is better than expected, we should encourage our policy to move in that direction. The discounted reward concept is a way to apply this reasoning to an RL algorithm.

Controlling policy parameter step sizes turns out to be very important. Doing this in a principled way is the novelty behind the Proximal Policy Optimization (PPO) method. Here, a penalty based on KL divergence between policy parameters before and after an update is used to control the step size.

All of this work can be seen as designing a PG loss function that depends on observed rewards— a reasonable requirement for optimizing a policy’s ability to maximize rewards.

Why Not Learn a Loss Function?

RL is hard. PPO works comparatively well on a variety of tasks.

That’s great, but the basic idea with EPG is: let’s use machine learning to learn the loss function, rather than designing one ourselves. Maybe this approach can give us a more general RL algorithm.

Take a step back and think about the whole problem. What information do we need to learn a good policy?

Well, for one thing, the “right choice” given a state depends not only on the current state, but the recent history of states as well. It also depends on our recent history of decisions. Can we design a loss function for PG that takes all of this information into consideration?

With EPG, that’s what they are attempting to do. The resulting loss function might look something like this:

 EPG loss function architecture… quite a lot going on. The grey stuff is the loss function.

EPG loss function architecture… quite a lot going on. The grey stuff is the loss function.

There’s a whole lot going on here, but the main idea is that it is possible to build a neural network that takes all the information you want (including your policy network output and your chosen action) which outputs a single value (the green cube at the top of the image.) We can call that value the loss and train our policy network to minimize it.

This particular architecture uses convolutions that span over time. This time dimension lets us call these layers “temporal convolutions,” but they work just like the convolutional layers we know and love.

The Hard Part

This is a great idea, but there’s one problem: how do we learn good weights for our loss function network?

There’s no direct way to do this. We can’t fudge something like a policy gradient here.

This is where evolution strategies (ES) come in handy.

Evolution Strategies

The “Evolved” part of EPG comes from the use of ES to learn good parameters for the loss function network.

This particular optimization algorithm is quite useful here because it gives us a way to update our weights without any notion of a gradient. The basic idea with ES is this: update the weights by adding random noise to each parameter. If things work better, hold on to those changes. If things get worse, that was a step in the wrong direction (also useful information.)

So we have created a new problem (learning parameters for some crazy complicated but potentially useful loss function) and a potential solution (applying ES). Much of the paper goes in to describing exactly how these guys went about doing it.

Get to Work

ES involve a lot of trial and error. To converge in any reasonable amount of time, we need to parallelize the learning process.

We can do this by running many trials simultaneously. Each of these trials is is executed by a worker (so-called in the paper), or a thread.

A Family of Tasks

 A visualization of one family of tasks. The family is “learn to move the ant to a target location.” A task in this family is “learn to move the ant to location [10, 5]”.

A visualization of one family of tasks. The family is “learn to move the ant to a target location.” A task in this family is “learn to move the ant to location [10, 5]”.

EPG tries to solve a family of tasks. It does so by sampling individual tasks from the family. As pictured above, an example family of tasks is “guide the ant the a specified target location.” A task from this family would be “guide the ant to coordinates [10,5].”

Each worker gets a task sampled from the problem family of tasks. That means in our example that one worker will spend its time trying to learn how to get the ant to target location [10, 5], while another will spend its time trying to get the ant to target location [5, 10]. Each worker will go through many episodes repeating the same task, learning a policy by minimizing the novel loss function.

After so many timesteps, we can take a look at the total reward that each worker has accumulated through this process.

Trying Many Loss Functions

Each worker also gets some variant of the loss function we’re currently working with. By variant I mean, “the latest, greatest version of our learned loss function — plus some random noise.”

So, each worker has a task and a loss function. We can pick the workers with the highest total reward after many timesteps and posit that its variant of the loss function is better than some others, and take a step in the direction of the randomly-generated noise used to create that variant.

One problem with this approach is that one worker may have an easier task than another. For example, one worker’s task may be to guide the ant to coordinates [1,1], while another has to get all the way to [10, 15] (where both start with the ant at [0,0]). The first worker’s sampled task is easier, so we can’t be entirely confident that the policy assigned to it is responsible for the higher rewards that we observe.

To deal with this problem, the authors assign the same variant of the loss function to many workers; each worker gets a different task; they take the average reward over all the workers using the same variant of the loss function.

The result we end up at the end of each trial is a set of candidate loss functions and some idea about their relative performance on the family of tasks. We can use this information to update the loss function parameters.

Where’s the Reward?

An interesting thing about EPG is that, in some experiments, the algorithm is able to learn good policies without directly observing any rewards. All the information about what constitutes “good” behavior is encoded in the loss function, which doesn’t need to (but can) include observed rewards.

Results

 A pretrained EPG does well in “out-of-distribution” tasks at test-time; good test-time validation in RL problems is an important outstanding problem that EPG may solve.

A pretrained EPG does well in “out-of-distribution” tasks at test-time; good test-time validation in RL problems is an important outstanding problem that EPG may solve.

The authors of EPG are optimistic about their results, and the method is certainly intriguing. The work in involves a number of interesting ideas from recent deep RL research.

Perhaps the most important result is the ability for an EPG-trained loss function to perform well on “out-of-distribution” tasks. In the ant example we looked at earlier, an EPG loss function was trained using only goals located to the right of the ant’s initial position. The same loss function was able to train a policy that guides the ant to a goal on the left side of the ant’s initial position at test time.

However, the algorithm faces some practical challenges, mostly related to the need for heavy computational resources (many CPU cores) and relatively poor data efficiency (many, many trials are needed to learn a good loss function.) Loss function updates must be performed sequentially, which puts a limit on the amount of parallel processing that is possible with the method as it stands.

In my humble opinion, this paper is important not because of its direct results, but because it introduces a fundamentally different approach to applying PG methods. This is just an example of how to go about learning a loss function, but there may be entirely different ways to do the same thing. Maybe a different technique for learning a PG loss function is far better than this. But, as far as I’m aware, this is the first attempt to directly learn a PG loss function; side-stepping the loss function design problem may open a promising new path forward for PG methods in general.