Policy Gradient Methods

When we had full knowledge of the states, we could use Markov Decision Processes to find the optimal policy. When this assumption breaks down, we need to come up with our best approximation. This is not a far stretch from how we might handle new scenarios in our own lives. When we begin a new task, we are certainly not experts. We may learn from a teacher or set off to explore on our own. As we practice and churn out the seemingly endless variations of our endeavour, we begin to develop a sense of what works and what doesn’t. We may not be able to articulate the exact rules that we follow, but we can certainly tell when we are doing well or poorly.

In lieu of a conscious agent with human intelligence, we can approximate the policy using a gradient-based approach. We will use the gradient of the expected reward with respect to the policy parameters to update the policy. Methods that use this approach are called policy gradient methods.

Policy Gradients

Q-Learning is an off-policy learning algorithm that directly estimates optimal action values in order to find optimal policies. That is, it does not update a policy table or function. Instead, it chooses the best action given the computed estimates of the action values. When the state-action space is too complex to be efficiently represented as a table, the action-value function must be approximated. This can be done by taking the gradient of a cost function, such as the squared TD error, with respect to the approximator's parameters and stepping the parameters in that direction.
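As a rough sketch of this idea (assuming linear function approximation with a user-supplied feature function `phi`; none of these names come from the text), a semi-gradient Q-Learning update might look like:

```python
import numpy as np

def q_value(w, phi, s, a):
    """Linear approximation: Q(s, a) is approximated by w . phi(s, a)."""
    return w @ phi(s, a)

def q_learning_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Semi-gradient Q-Learning: step the weights down the squared TD error."""
    target = r + gamma * max(q_value(w, phi, s_next, b) for b in actions)
    td_error = target - q_value(w, phi, s, a)
    # The gradient of the approximate Q with respect to w is phi(s, a).
    return w + alpha * td_error * phi(s, a)
```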

Whereas value function methods like Q-Learning approximate a value function, policy gradient methods approximate the policy function directly. Given this parameterized policy, its parameters are updated in the direction of the gradient of the expected reward:

\[ \theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t) \]
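In code, the update itself is a single line of gradient ascent. A minimal sketch, assuming the gradient estimate \(\nabla_{\theta} J(\theta_t)\) is computed elsewhere and `theta` is a NumPy array:

```python
def gradient_ascent_step(theta, grad_estimate, alpha=0.01):
    """theta_{t+1} = theta_t + alpha * grad J(theta_t)."""
    return theta + alpha * grad_estimate
```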

Episodic Case

For episodic problems, we can use the average state value

\[ \bar{v}_{\pi} = \sum_{s} d_{\pi}(s) v_{\pi}(s) = \sum_{s} d_{\pi}(s) \sum_{a} \pi_{\theta}(a|s) q_{\pi}(s, a), \]

where \(d_{\pi}(s)\) is the stationary distribution of states under the policy \(\pi\) and \(q_{\pi}(s, a)\) is the action-value function. The fact that \(d_{\pi}(s)\) is stationary is very important when computing the gradients.
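To make the objective concrete, here is a small NumPy example that evaluates \(\bar{v}_{\pi}\) as the double sum above. The numbers for \(d_{\pi}\), \(\pi_{\theta}\), and \(q_{\pi}\) are made up for a toy problem with three states and two actions:

```python
import numpy as np

d_pi = np.array([0.5, 0.3, 0.2])          # stationary state distribution d_pi(s)
pi_theta = np.array([[0.9, 0.1],          # policy pi_theta(a|s); each row sums to 1
                     [0.4, 0.6],
                     [0.7, 0.3]])
q_pi = np.array([[1.0, 0.0],              # action values q_pi(s, a)
                 [0.5, 2.0],
                 [0.0, 1.5]])

v_pi = (pi_theta * q_pi).sum(axis=1)      # v_pi(s) = sum_a pi_theta(a|s) q_pi(s, a)
bar_v_pi = d_pi @ v_pi                    # bar_v_pi = sum_s d_pi(s) v_pi(s)
print(bar_v_pi)
```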

What is a Stationary Distribution?

If \(d_{\pi}\) is the stationary distribution of the Markov chain induced by the policy and \(P\) is that chain's transition matrix, then \(d_{\pi}\) satisfies the equation \(d_{\pi} = d_{\pi} P\). This means that the distribution of states under the policy \(\pi\) does not change over time.
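A quick way to see this numerically is to start from any distribution and repeatedly apply the transition matrix; the distribution settles to a fixed point satisfying \(d_{\pi} = d_{\pi} P\). The three-state matrix below is made up for illustration:

```python
import numpy as np

# Transition matrix of the Markov chain induced by the policy (rows sum to 1).
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

d = np.ones(3) / 3            # start from a uniform distribution over states
for _ in range(1000):
    d = d @ P                 # one step of the chain: d <- d P

print(d)                      # stationary distribution
print(d @ P)                  # unchanged, since d = d P
```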

Significance of the Stationary Distribution

The stationary distribution simplifies the computation of the gradient. It allows us to avoid computing how the state distribution changes with respect to the policy parameters. That is because, in the long run, the probability of being in any given state depends only on the policy, not on the particular trajectory the agent happens to take.

Gradient of the Episodic Case

The gradient of the average state value is

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi} \left[ \nabla_{\theta} \log \pi_{\theta}(a|s) q_{\pi}(s, a) \right]. \]

An important note here is that this is a stochastic gradient. We surely do not have complete knowledge of the environment, so computing the true gradient is out of the question. This also means that our stochastic gradient estimates may exhibit high variance due to the randomness of the environment and the policy.
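To show how this expectation is estimated from sampled experience, here is a hedged REINFORCE-style sketch for a tabular softmax policy. The Monte Carlo return \(G_t\) stands in for \(q_{\pi}(s, a)\), which is exactly why the resulting gradient estimates can have high variance. The episode format and helper names are assumptions, not anything prescribed above:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, s, a):
    """Gradient of log pi_theta(a|s) for a softmax policy with one parameter per (s, a)."""
    probs = softmax(theta[s])             # theta has shape (n_states, n_actions)
    grad = np.zeros_like(theta)
    grad[s] = -probs
    grad[s, a] += 1.0                     # d/d theta[s, b] log pi(a|s) = 1{a=b} - pi(b|s)
    return grad

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One Monte Carlo policy-gradient update from a single episode.

    episode: list of (state, action, reward) tuples gathered by following pi_theta.
    The return G_t is an unbiased but noisy sample of q_pi(s_t, a_t).
    """
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        theta = theta + alpha * G * grad_log_softmax(theta, s, a)
    return theta
```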

Continuing Case

In continuing problems, we are interested in global performance across all states under the policy. Using the average state value here would not directly align with that goal, because the value function represents the expected return starting from a specific state.