Policy Gradient Methods in Depth

From REINFORCE to Baselines and Variance Reduction

Posted by 冯宇 on July 5, 2024

Introduction

In reinforcement learning, we want an agent to learn to make optimal decisions that maximize cumulative reward. There are two main approaches to learning an optimal policy:

  1. Value-based methods (e.g., Q-learning, DQN): first learn state-action values, then derive a policy from them
  2. Policy gradient methods: optimize the policy itself directly

Key Advantages of Policy Gradient Methods

  • ✅ Handle high-dimensional or continuous action spaces (e.g., robot control)
  • ✅ Can learn stochastic policies (essential in some games, such as rock-paper-scissors)
  • ✅ Better convergence properties, both in theory and often in practice
  • ✅ Incorporate prior knowledge more naturally

This post works through the mathematics behind policy gradients, derives the REINFORCE algorithm in detail, discusses variance reduction techniques (baselines, GAE, etc.), and demonstrates practical use with complete code for the CartPole and Pendulum environments.

1. Policy Gradient Fundamentals

1.1 Core Concepts

Policy: a mapping from states to actions, written $\pi_\theta(a|s)$

  • Deterministic policy: $a = \mu_\theta(s)$
  • Stochastic policy: $a \sim \pi_\theta(\cdot|s)$

Objective: the expected cumulative reward \(J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \gamma^t r_t \right] = E_{\tau \sim \pi_\theta} [R(\tau)]\)

where a trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$

The goal of policy gradient methods: find $\theta^* = \arg\max_\theta J(\theta)$

1.2 The Policy Gradient Theorem

Core question: how do we compute $\nabla_\theta J(\theta)$?

Why this is hard

  • The expectation involves the environment dynamics $P(s'|s,a)$, which are usually unknown
  • The trajectory distribution $p_\theta(\tau)$ itself depends on $\theta$

The breakthrough result (Sutton et al., 1999):

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\]

where $R_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the return from time $t$ onward.

Key insights

  1. The gradient does not depend on the environment dynamics $P(s'|s,a)$ (thanks to the log-derivative trick)
  2. The gradient can be estimated by sampling trajectories (see the sketch below)
  3. The gradient of $\log \pi_\theta$ is easy to compute with automatic differentiation
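
To make insight 2 concrete, here is a minimal sketch (not part of the main implementation below) that estimates $\nabla_\theta E_{a \sim \pi_\theta}[f(a)]$ for a toy softmax policy via the log-derivative trick and checks it against the exact gradient. The three-action policy, the reward vector f, and the sample count are all illustrative choices.

import torch
from torch.distributions import Categorical

torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)  # logits of a 3-action softmax policy
f = torch.tensor([1.0, 2.0, 0.5])           # fixed per-action "reward"

# Exact gradient: J(theta) = sum_a pi_theta(a) * f(a) is directly differentiable here
J = (torch.softmax(theta, dim=0) * f).sum()
exact_grad, = torch.autograd.grad(J, theta)

# Score-function (REINFORCE) estimate: gradient of mean[log pi(a_i) * f(a_i)] over samples
dist = Categorical(logits=theta)
actions = dist.sample((100_000,))
surrogate = (dist.log_prob(actions) * f[actions]).mean()
sf_grad, = torch.autograd.grad(surrogate, theta)

print(exact_grad)  # exact answer
print(sf_grad)     # matches up to Monte Carlo noise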

1.3 Intuition Behind the Policy Gradient Theorem

\[\nabla_\theta J(\theta) \propto \sum_{s,a} \rho^\pi(s) \cdot \nabla_\theta \pi_\theta(a|s) \cdot Q^\pi(s,a)\]

Interpretation

  • In state $s$, if action $a$ has a high Q-value, increase the probability of choosing $a$
  • If action $a$ has a low (or even negative) Q-value, decrease its probability
  • The state visitation frequency $\rho^\pi(s)$ provides a natural weighting

Analogy: this resembles the cross-entropy loss in supervised learning, except the "labels" are determined by reward feedback from the environment.

1.4 Mathematical Derivation (Optional Deep Dive)

Notation

  • Trajectory: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$
  • Trajectory probability: $p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$
  • Trajectory return: $R(\tau) = \sum_{t=0}^T \gamma^t r_t$

Derivation

\[\begin{align} \nabla_\theta J(\theta) &= \nabla_\theta E_{\tau \sim p_\theta} [R(\tau)] \\ &= \nabla_\theta \int p_\theta(\tau) R(\tau) d\tau \\ &= \int \nabla_\theta p_\theta(\tau) R(\tau) d\tau \\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) d\tau \quad \text{(log derivative trick)} \\ &= E_{\tau \sim p_\theta} [\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)] \end{align}\]

Expanding $\log p_\theta(\tau)$:

\[\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t)\]

Key step: when taking the gradient, only $\pi_\theta$ depends on $\theta$; the other terms have zero gradient:

\[\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\]

Final form

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau) \right]\]

A further refinement (exploiting causality): the action at time $t$ cannot affect past rewards, so:

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\]

where $R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$.

2. The REINFORCE Algorithm

2.1 How the Algorithm Works

REINFORCE (Williams, 1992) is the classic policy gradient algorithm: a direct implementation of the policy gradient theorem.

Core idea: estimate the gradient with Monte Carlo sampling.

Algorithm outline

  1. Initialize the policy parameters $\theta$
  2. Repeat:
    • Sample a complete trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ with the current policy $\pi_\theta$
    • Compute the return $R_t$ at every timestep
    • Update the policy parameters: \(\theta \leftarrow \theta + \alpha \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\)

2.2 Pseudocode

Algorithm: REINFORCE

Input: differentiable policy π_θ
Parameters: learning rate α, discount factor γ

Initialize: policy parameters θ (random or preset)

for episode = 1, 2, ... do
    Generate a trajectory τ = {s_0, a_0, r_0, ..., s_T} using π_θ
    
    for t = 0 to T do
        Compute the return R_t = Σ_{k=t}^T γ^{k-t} * r_k
        Compute the gradient g_t = ∇_θ log π_θ(a_t | s_t) * R_t
        Accumulate Δθ += g_t
    end for
    
    Update parameters θ ← θ + α * Δθ
end for

2.3 Complete CartPole Implementation

import gym  # gym >= 0.26 (or gymnasium) for the 5-tuple step API used below
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

# Policy network
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, x):
        return self.fc(x)

# REINFORCE agent
class REINFORCEAgent:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, hidden_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Trajectory buffers
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state):
        """根据当前策略选择动作"""
        state = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state)
        dist = Categorical(probs)
        action = dist.sample()
        
        # Save log π(a|s) for the update step
        self.log_probs.append(dist.log_prob(action))
        
        return action.item()
    
    def compute_returns(self):
        """计算每个时刻的累积回报 R_t"""
        returns = []
        R = 0
        for r in reversed(self.rewards):
            R = r + self.gamma * R
            returns.insert(0, R)
        
        # Normalize returns (reduces variance)
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        return returns
    
    def update(self):
        """更新策略参数"""
        returns = self.compute_returns()
        
        policy_loss = []
        for log_prob, R in zip(self.log_probs, returns):
            # Negative sign: PyTorch minimizes, so this performs gradient ascent
            policy_loss.append(-log_prob * R)
        
        # Backpropagate
        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear the trajectory buffers
        self.log_probs = []
        self.rewards = []
        
        return policy_loss.item()

# Training loop
def train_reinforce(env_name='CartPole-v1', n_episodes=1000, max_steps=500):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = REINFORCEAgent(state_dim, action_dim)
    
    episode_rewards = []
    episode_losses = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        
        for step in range(max_steps):
            # Select an action
            action = agent.select_action(state)
            
            # Step the environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Store the reward
            agent.rewards.append(reward)
            episode_reward += reward
            
            state = next_state
            
            if done:
                break
        
        # Update the policy
        loss = agent.update()
        
        episode_rewards.append(episode_reward)
        episode_losses.append(loss)
        
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}, Loss: {loss:.4f}")
    
    env.close()
    
    return episode_rewards, episode_losses, agent

# Train
rewards, losses, trained_agent = train_reinforce()

# Plot learning curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(rewards, alpha=0.3, label='Raw')
axes[0].plot(np.convolve(rewards, np.ones(100)/100, mode='valid'), label='Moving Avg (100)')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Total Reward')
axes[0].set_title('REINFORCE Training on CartPole')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(losses, alpha=0.5)
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Policy Loss')
axes[1].set_title('Policy Loss over Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

2.4 Testing the Trained Policy

def test_policy(agent, env_name='CartPole-v1', n_episodes=10, render=True):
    """测试训练好的策略"""
    env = gym.make(env_name, render_mode='human' if render else None)
    
    test_rewards = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        done = False
        
        while not done:
            # Deterministic evaluation: pick the highest-probability action
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                probs = agent.policy(state_tensor)
            action = torch.argmax(probs).item()
            
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        
        test_rewards.append(episode_reward)
        print(f"Test Episode {episode+1}: Reward = {episode_reward}")
    
    env.close()
    
    print(f"\nAverage Test Reward: {np.mean(test_rewards):.2f} ± {np.std(test_rewards):.2f}")
    
    return test_rewards

# Evaluate
test_policy(trained_agent, n_episodes=5, render=False)

3. Variance Reduction Techniques

3.1 The High-Variance Problem

The main weakness of REINFORCE: the gradient estimate has very high variance.

Why

  • Monte Carlo estimates are inherently high-variance
  • The return $R_t$ fluctuates widely across trajectories
  • High variance → unstable learning and a large sample requirement

3.2 The Baseline Trick

Core idea: subtract a reference value from the return, reducing variance without introducing bias.

Theoretical guarantee: for any function $b(s)$ that depends only on the state,

\[E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = 0\]
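
The proof takes one line: for any fixed state $s$,

\[E_{a \sim \pi_\theta} [\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s) \, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0\]

and since $b(s)$ does not depend on $a$, it factors out of this expectation.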

The modified gradient

\[\nabla_\theta J(\theta) = E_{\tau} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (R_t - b(s_t)) \right]\]

Common baseline choices (a minimal sketch of option 3 follows the list)

  1. Average return: $b = \frac{1}{N} \sum_i R_i$
  2. State-value function: $b(s) = V^\pi(s)$ (optimal)
  3. Moving average: $b_t = \alpha \cdot b_{t-1} + (1-\alpha) \cdot R_t$
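
Before the full value-function version in the next subsection, here is a minimal sketch of option 3. MovingAverageBaseline is a hypothetical helper, not part of the original code, and the usage lines assume the returns tensor produced by compute_returns above.

class MovingAverageBaseline:
    """Hypothetical helper: exponential moving average of episode returns."""
    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.value = None

    def update(self, episode_return):
        if self.value is None:
            self.value = episode_return
        else:
            self.value = self.alpha * self.value + (1 - self.alpha) * episode_return
        return self.value

# Usage sketch: subtract the baseline before the policy update
# baseline = MovingAverageBaseline()
# advantages = returns - baseline.update(returns.mean().item())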

3.3 REINFORCE with a Baseline

class REINFORCEWithBaseline:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr_policy=1e-3, lr_value=1e-3, gamma=0.99):
        self.gamma = gamma
        
        # Policy network
        self.policy = PolicyNetwork(state_dim, hidden_dim, action_dim)
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        
        # Value network (serves as the baseline)
        self.value_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        
        # Trajectory buffers
        self.log_probs = []
        self.rewards = []
        self.states = []
        self.values = []
    
    def select_action(self, state):
        """选择动作并估计状态值"""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        
        # Policy network
        probs = self.policy(state_tensor)
        dist = Categorical(probs)
        action = dist.sample()
        self.log_probs.append(dist.log_prob(action))
        
        # Value network
        value = self.value_net(state_tensor)
        self.values.append(value)
        
        self.states.append(state)
        
        return action.item()
    
    def update(self):
        """更新策略和值函数"""
        # 计算累积回报
        returns = []
        R = 0
        for r in reversed(self.rewards):
            R = r + self.gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns).unsqueeze(1)
        
        # Convert to tensors
        values = torch.cat(self.values)
        log_probs = torch.stack(self.log_probs)
        
        # Advantage estimate A(s,a) = R - V(s)
        advantages = returns - values.detach()
        
        # Update the policy network
        policy_loss = -(log_probs * advantages).mean()
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()
        
        # Update the value network (mean squared error)
        value_loss = nn.MSELoss()(values, returns)
        self.value_optimizer.zero_grad()
        value_loss.backward()
        self.value_optimizer.step()
        
        # Clear the buffers
        self.log_probs = []
        self.rewards = []
        self.states = []
        self.values = []
        
        return policy_loss.item(), value_loss.item()

# Training loop
def train_with_baseline(env_name='CartPole-v1', n_episodes=1000):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = REINFORCEWithBaseline(state_dim, action_dim)
    
    episode_rewards = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        
        for step in range(500):
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            agent.rewards.append(reward)
            episode_reward += reward
            
            state = next_state
            
            if done:
                break
        
        policy_loss, value_loss = agent.update()
        episode_rewards.append(episode_reward)
        
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(episode_rewards[-100:])
            print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}")
    
    env.close()
    return episode_rewards, agent

# Train for comparison
rewards_baseline, agent_baseline = train_with_baseline()

3.4 The Advantage Function

Definition

\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]

Intuition

  • $A(s,a) > 0$: action $a$ is better than average, so increase its probability
  • $A(s,a) < 0$: action $a$ is worse than average, so decrease its probability

Policy gradient with the advantage function

\[\nabla_\theta J(\theta) = E \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t) \right]\]

3.5 Generalized Advantage Estimation (GAE)

Question: how do we estimate the advantage function?

TD error

\[\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]

GAE definition (Schulman et al., 2015):

\[A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}\]

Parameters

  • $\gamma$: discount factor
  • $\lambda \in [0, 1]$: bias-variance trade-off
    • $\lambda = 0$: low variance, high bias (TD)
    • $\lambda = 1$: high variance, no bias (Monte Carlo)

Recursive computation (a code sketch follows the formula)

\[A_t = \delta_t + \gamma \lambda A_{t+1}\]
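
Here is a minimal sketch of that recursion, not tied to the agents above. compute_gae is an illustrative helper; it assumes values holds $V(s_0), \ldots, V(s_T)$ including the final state (use $V(s_T) = 0$ if the episode terminated).

import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages backwards over one trajectory."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error δ_t
        gae = delta + gamma * lam * gae                         # A_t = δ_t + γλ A_{t+1}
        advantages[t] = gae
    return advantages

# lam=0 reduces to the one-step TD error; lam=1 recovers the
# Monte Carlo advantage R_t - V(s_t).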

4. Policy Gradients for Continuous Action Spaces

4.1 Gaussian Policies

Discrete actions use a Categorical distribution; continuous actions use a Gaussian:

\[\pi_\theta(a|s) = \mathcal{N}(a; \mu_\theta(s), \sigma_\theta(s))\]

Log probability

\[\log \pi_\theta(a|s) = -\frac{1}{2} \left( \frac{(a - \mu_\theta(s))^2}{\sigma_\theta(s)^2} + \log(2\pi\sigma_\theta(s)^2) \right)\]
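
As a quick sanity check (an addition, not from the original derivation), the closed-form expression above agrees with torch.distributions.Normal.log_prob; the numbers below are arbitrary.

import torch
from torch.distributions import Normal

mu, sigma, a = torch.tensor(0.5), torch.tensor(1.3), torch.tensor(-0.2)

manual = -0.5 * ((a - mu) ** 2 / sigma ** 2 + torch.log(2 * torch.pi * sigma ** 2))
library = Normal(mu, sigma).log_prob(a)

print(manual.item(), library.item())  # identical up to floating-point error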

4.2 Continuous-Action Implementation (Pendulum-v1)

from torch.distributions import Normal

class ContinuousPolicyNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(ContinuousPolicyNetwork, self).__init__()
        
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )
        
        # Mean head
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        
        # Std head (learn log std so the std stays positive)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        features = self.shared(x)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        std = torch.exp(log_std.clamp(-20, 2))  # clamp log std for numerical stability
        
        return mean, std

class ContinuousREINFORCE:
    def __init__(self, state_dim, action_dim, hidden_dim=128, lr=3e-4, gamma=0.99):
        self.gamma = gamma
        self.policy = ContinuousPolicyNetwork(state_dim, hidden_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state):
        """从高斯策略采样动作"""
        state = torch.FloatTensor(state).unsqueeze(0)
        mean, std = self.policy(state)
        
        dist = Normal(mean, std)
        action = dist.sample()
        
        self.log_probs.append(dist.log_prob(action).sum(dim=-1))  # sum over action dimensions
        
        return action.detach().numpy().flatten()  # Pendulum clips out-of-range torques internally
    
    def update(self):
        """更新策略"""
        returns = []
        R = 0
        for r in reversed(self.rewards):
            R = r + self.gamma * R
            returns.insert(0, R)
        
        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        policy_loss = []
        for log_prob, R in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * R)
        
        self.optimizer.zero_grad()
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()
        self.optimizer.step()
        
        self.log_probs = []
        self.rewards = []
        
        return policy_loss.item()

# Train on Pendulum
def train_pendulum(n_episodes=500):
    env = gym.make('Pendulum-v1')
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    
    agent = ContinuousREINFORCE(state_dim, action_dim)
    
    episode_rewards = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        
        for step in range(200):
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            agent.rewards.append(reward)
            episode_reward += reward
            
            state = next_state
            
            if done:
                break
        
        loss = agent.update()
        episode_rewards.append(episode_reward)
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}")
    
    env.close()
    return episode_rewards, agent

rewards_pendulum, agent_pendulum = train_pendulum()

# Plot
plt.figure(figsize=(10, 5))
plt.plot(rewards_pendulum, alpha=0.3)
plt.plot(np.convolve(rewards_pendulum, np.ones(50)/50, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('REINFORCE on Pendulum-v1 (Continuous Actions)')
plt.grid(True, alpha=0.3)
plt.show()

5. Practical Tips and Tuning Advice

5.1 Learning-Rate Scheduling

from torch.optim.lr_scheduler import StepLR

# Learning-rate decay
optimizer = optim.Adam(policy.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=200, gamma=0.9)

# Inside the training loop
for episode in range(n_episodes):
    # ... training code ...
    scheduler.step()

5.2 Gradient Clipping

# Prevent exploding gradients; call between loss.backward() and optimizer.step()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)

5.3 Entropy Regularization

Purpose: encourage exploration and keep the policy from converging prematurely to a suboptimal solution.

\[J_{\text{total}}(\theta) = J(\theta) + \beta \cdot H(\pi_\theta)\]
where $H(\pi_\theta) = -E_{s,a} [\log \pi_\theta(a|s)]$ is the policy entropy.

Implementation

# Add an entropy bonus when computing the loss
entropy = dist.entropy().mean()
policy_loss = -(log_probs * advantages).mean() - 0.01 * entropy  # β = 0.01
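
Note that in the agents above, dist only exists inside select_action, so the fragment above does not drop in as-is. Here is a self-contained toy version (stand-in random tensors, purely illustrative) showing where the entropy term enters the loss; in REINFORCEWithBaseline you would instead store dist.entropy() per step, e.g. in a hypothetical self.entropies list, and average it in update().

import torch
from torch.distributions import Categorical

# Illustrative tensors standing in for one episode's data
logits = torch.randn(5, 2, requires_grad=True)  # 5 steps, 2 actions
advantages = torch.randn(5)

dist = Categorical(logits=logits)
actions = dist.sample()
log_probs = dist.log_prob(actions)

entropy = dist.entropy().mean()
policy_loss = -(log_probs * advantages).mean() - 0.01 * entropy  # β = 0.01
policy_loss.backward()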

5.4 Recommended Hyperparameters

| Parameter | CartPole | Pendulum | MuJoCo |
| --- | --- | --- | --- |
| Learning rate | 1e-3 | 3e-4 | 3e-4 |
| Hidden dim | 128 | 256 | 256 |
| Gamma | 0.99 | 0.99 | 0.99 |
| Entropy coef | 0.0 | 0.01 | 0.01 |
| Batch size | 1 episode | 1 episode | 2048 steps |

6. REINFORCE: Strengths, Weaknesses, and Improvements

6.1 Strengths

  • Theoretically simple: a direct implementation of the policy gradient theorem
  • Unbiased estimates: Monte Carlo returns are unbiased
  • Broadly applicable: discrete/continuous actions, deterministic/stochastic policies
  • Easy to implement: concise code, good for getting started

6.2 Weaknesses

  • High variance: needs many samples to learn stably
  • Poor sample efficiency: each trajectory is used only once (on-policy)
  • Slow learning: compared with Actor-Critic methods
  • Hyperparameter sensitivity: learning rate and baseline choice matter a lot

6.3 Directions for Improvement

  1. Actor-Critic methods
    • Use a value function as both baseline and critic
    • Representative algorithms: A2C, A3C, PPO, SAC
  2. Natural policy gradient (NPG)
    • Uses the Fisher information matrix
    • More stable update directions
  3. Trust-region methods
    • Constrain the size of each policy update
    • Representative algorithms: TRPO, PPO
  4. Off-policy methods
    • Reuse historical data
    • Representative algorithms: DPG, DDPG, TD3

7. Summary

7.1 Key Takeaways

  1. Policy gradient theorem: \(\nabla_\theta J(\theta) = E \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\)

  2. REINFORCE
    • Estimates the gradient with Monte Carlo sampling
    • Unbiased but high-variance
  3. Variance reduction
    • Baselines (value functions)
    • Advantage function $A(s,a) = Q(s,a) - V(s)$
    • GAE (the λ parameter trades off bias against variance)
  4. Continuous actions
    • Gaussian policy $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$
    • Learnable mean and standard deviation

7.2 Comparison with Value-Based Methods

| Dimension | Policy gradient | Q-learning / DQN |
| --- | --- | --- |
| Optimization target | Policy directly | Value function |
| Action space | Discrete + continuous | Mostly discrete |
| Exploration | Stochastic policy explores naturally | Needs ε-greedy |
| Convergence | Good theoretical guarantees | Can oscillate |
| Sample efficiency | Low (on-policy) | High (off-policy) |
| Implementation difficulty | Moderate | Relatively simple |

7.3 Practical Advice

  • 🎯 First project: start with CartPole
  • 🎯 Debugging: monitor policy entropy, gradient norms, and return variance
  • 🎯 Hyperparameters: start from the recommended values, then tune incrementally
  • 🎯 Next steps: study A2C/PPO in depth and understand the Actor-Critic framework

References

  1. Classic papers
    • Williams (1992): Simple Statistical Gradient-Following Algorithms
    • Sutton et al. (1999): Policy Gradient Methods for RL
    • Schulman et al. (2015): High-Dimensional Continuous Control Using GAE
  2. Open-source implementations
  3. Recommended courses
    • David Silver's RL course, Lecture 7
    • CS 285: Deep RL (UC Berkeley)

Policy gradient methods are a cornerstone of modern deep reinforcement learning. Mastering REINFORCE and its variants not only solves practical problems but also lays a solid foundation for understanding advanced algorithms like PPO and SAC. Remember: reinforcement learning is a hands-on craft; only by experimenting, a lot, can you truly internalize it.

Tags: Reinforcement Learning, Policy Gradient, REINFORCE
Categories: Reinforcement Learning
