Introduction
In reinforcement learning, we want an agent to learn to make decisions that maximize its cumulative reward. There are two main families of methods for learning an optimal policy:
- Value-based methods (e.g., Q-learning, DQN): first learn state-action values, then derive a policy from them
- Policy gradient methods: optimize the policy itself directly
Key advantages of policy gradient methods:
- ✅ They handle high-dimensional and continuous action spaces (e.g., robot control)
- ✅ They can learn stochastic policies (necessary in some games, such as rock-paper-scissors)
- ✅ They tend to converge more smoothly, since the policy changes gradually with the parameters
- ✅ They make it easier to incorporate prior knowledge into the policy
This post works through the mathematics of policy gradients, derives the REINFORCE algorithm in detail, covers variance reduction techniques (baselines, GAE, and related ideas), and shows complete, runnable code on the CartPole and Pendulum environments.
1. Policy Gradient Fundamentals
1.1 Core Concepts
Policy: a mapping from states to actions, written $\pi_\theta(a|s)$
- Deterministic policy: $a = \mu_\theta(s)$
- Stochastic policy: $a \sim \pi_\theta(\cdot|s)$
Objective: the expected cumulative reward \(J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \gamma^t r_t \right] = E_{\tau \sim \pi_\theta} [R(\tau)]\)
where a trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$
Goal of policy gradient methods: find $\theta^* = \arg\max_\theta J(\theta)$
1.2 The Policy Gradient Theorem
The central question: how do we compute $\nabla_\theta J(\theta)$?
Why this is hard:
- The expectation involves the environment dynamics $P(s'|s,a)$, which we usually do not know
- The trajectory distribution $p_\theta(\tau)$ itself depends on $\theta$
The breakthrough result (Sutton et al., 1999):
\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\]
where $R_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the return from time $t$ onward.
Key insights:
- The gradient does not involve the environment dynamics $P(s'|s,a)$; they cancel out via the log-derivative trick
- The gradient can therefore be estimated from sampled trajectories (a numerical sanity check follows this list)
- $\nabla_\theta \log \pi_\theta$ is easy to compute with automatic differentiation
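As a quick sanity check on the log-derivative trick, here is a minimal sketch (the toy problem and variable names are illustrative, not from the original text): we estimate $\nabla_\mu E_{x \sim \mathcal{N}(\mu,1)}[x^2]$ purely from samples and compare it with the analytic answer $2\mu$.
import torch
torch.manual_seed(0)
mu = 1.5
x = torch.normal(mu, 1.0, size=(100_000,))   # samples x ~ N(mu, 1)
score = x - mu                               # d/dmu log N(x; mu, 1)
grad_estimate = (score * x**2).mean()        # Monte Carlo estimate of E[score * f(x)]
print(grad_estimate.item(), 2 * mu)          # ~3.0 vs the analytic 3.0
The estimator only needs the score $\nabla_\mu \log p_\mu(x)$, never the gradient of the sampling density itself, which is exactly why the environment dynamics drop out of the policy gradient.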
1.3 An Intuitive Reading of the Theorem
\[\nabla_\theta J(\theta) \propto \sum_{s,a} \rho^\pi(s) \cdot \nabla_\theta \pi_\theta(a|s) \cdot Q^\pi(s,a)\]
Interpretation:
- In state $s$, if action $a$ has a high Q-value, increase the probability of choosing $a$
- If action $a$ has a low (or negative) Q-value, decrease the probability of choosing $a$
- The state visitation frequency $\rho^\pi(s)$ weights each state's contribution naturally
Analogy: the update resembles the cross-entropy loss of supervised learning, except that the "labels" are the actions actually taken, weighted by the reward feedback from the environment.
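To make the analogy concrete, a tiny sketch (the numbers are made up for illustration): the supervised cross-entropy loss averages negative log-probabilities with equal weight, while the REINFORCE loss weights each taken action's log-probability by its return.
import torch
log_probs = torch.log(torch.tensor([0.7, 0.2, 0.9]))   # log-prob of the action taken at each step
returns = torch.tensor([1.0, -0.5, 2.0])                # return observed after each action
ce_like_loss = -log_probs.mean()                        # supervised: every "label" counts equally
pg_loss = -(log_probs * returns).mean()                 # policy gradient: weighted by return
print(ce_like_loss.item(), pg_loss.item())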
1.4 Full Derivation (optional)
Notation:
- Trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$
- Trajectory probability: $p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$
- Trajectory return: $R(\tau) = \sum_{t=0}^T \gamma^t r_t$
Derivation:
\[\begin{align} \nabla_\theta J(\theta) &= \nabla_\theta E_{\tau \sim p_\theta} [R(\tau)] \\ &= \nabla_\theta \int p_\theta(\tau) R(\tau) d\tau \\ &= \int \nabla_\theta p_\theta(\tau) R(\tau) d\tau \\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) d\tau \quad \text{(log-derivative trick)} \\ &= E_{\tau \sim p_\theta} [\nabla_\theta \log p_\theta(\tau) \cdot R(\tau)] \end{align}\]
Expanding $\log p_\theta(\tau)$:
\[\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t)\]
Key step: only the $\pi_\theta$ terms depend on $\theta$; the gradients of the initial-state and dynamics terms vanish:
\[\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\]
This gives the basic form:
\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau) \right]\]
A further refinement uses causality: the action at time $t$ cannot influence rewards received before $t$, so the full-trajectory return can be replaced by the reward-to-go:
\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\]
where $R_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$.
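The reward-to-go is easy to compute with a single backward pass over an episode's rewards. A minimal sketch of such a helper (the name and signature are illustrative; the same pattern appears inside compute_returns later):
def reward_to_go(rewards, gamma=0.99):
    """Return the discounted reward-to-go R_t for each step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns
# Example: reward_to_go([1, 1, 1], gamma=0.5) -> [1.75, 1.5, 1.0]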
2. The REINFORCE Algorithm
2.1 How It Works
REINFORCE (Williams, 1992) is the classic policy gradient algorithm: it implements the policy gradient theorem directly.
Core idea: estimate the gradient by Monte Carlo sampling of whole episodes.
Algorithm outline:
- Initialize the policy parameters $\theta$
- Repeat:
  - Sample one complete trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ with the current policy $\pi_\theta$
  - Compute the reward-to-go $R_t$ at every time step
  - Update the parameters: \(\theta \leftarrow \theta + \alpha \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\)
2.2 Pseudocode
Algorithm: REINFORCE
Input: a differentiable policy π_θ
Parameters: learning rate α, discount factor γ
Initialize: policy parameters θ (randomly or from a preset)
for episode = 1, 2, ... do
    generate a trajectory τ = {s_0, a_0, r_0, ..., s_T} using π_θ
    for t = 0 to T do
        compute the return R_t = Σ_{k=t}^T γ^{k-t} * r_k
        compute the gradient g_t = ∇_θ log π_θ(a_t | s_t) * R_t
        accumulate Δθ += g_t
    end for
    update θ ← θ + α * Δθ
end for
2.3 Complete CartPole Implementation
import gym  # gym >= 0.26 (or gymnasium) is assumed: reset() returns (obs, info), step() returns 5 values
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt
# Policy network
class PolicyNetwork(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(PolicyNetwork, self).__init__()
self.fc = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, output_dim),
nn.Softmax(dim=-1)
)
def forward(self, x):
return self.fc(x)
# REINFORCE agent
class REINFORCEAgent:
def __init__(self, state_dim, action_dim, hidden_dim=128, lr=1e-3, gamma=0.99):
self.gamma = gamma
self.policy = PolicyNetwork(state_dim, hidden_dim, action_dim)
self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        # Buffers for the current trajectory
self.log_probs = []
self.rewards = []
def select_action(self, state):
"""根据当前策略选择动作"""
state = torch.FloatTensor(state).unsqueeze(0)
probs = self.policy(state)
dist = Categorical(probs)
action = dist.sample()
        # Store log π(a|s) for the update
self.log_probs.append(dist.log_prob(action))
return action.item()
def compute_returns(self):
"""计算每个时刻的累积回报 R_t"""
returns = []
R = 0
for r in reversed(self.rewards):
R = r + self.gamma * R
returns.insert(0, R)
        # Normalize returns (a simple variance-reduction trick)
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
return returns
def update(self):
"""更新策略参数"""
returns = self.compute_returns()
policy_loss = []
for log_prob, R in zip(self.log_probs, returns):
            # Negative sign: PyTorch minimizes, but we want gradient ascent
policy_loss.append(-log_prob * R)
        # Backpropagate
self.optimizer.zero_grad()
policy_loss = torch.stack(policy_loss).sum()
policy_loss.backward()
self.optimizer.step()
        # Clear the trajectory buffers
self.log_probs = []
self.rewards = []
return policy_loss.item()
# Training function
def train_reinforce(env_name='CartPole-v1', n_episodes=1000, max_steps=500):
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = REINFORCEAgent(state_dim, action_dim)
episode_rewards = []
episode_losses = []
for episode in range(n_episodes):
state, _ = env.reset()
episode_reward = 0
for step in range(max_steps):
            # Select an action
action = agent.select_action(state)
            # Step the environment
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
            # Store the reward
agent.rewards.append(reward)
episode_reward += reward
state = next_state
if done:
break
        # Update the policy at the end of the episode
loss = agent.update()
episode_rewards.append(episode_reward)
episode_losses.append(loss)
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}, Loss: {loss:.4f}")
env.close()
return episode_rewards, episode_losses, agent
# Train
rewards, losses, trained_agent = train_reinforce()
# Plot the learning curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(rewards, alpha=0.3, label='Raw')
axes[0].plot(np.convolve(rewards, np.ones(100)/100, mode='valid'), label='Moving Avg (100)')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Total Reward')
axes[0].set_title('REINFORCE Training on CartPole')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[1].plot(losses, alpha=0.5)
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Policy Loss')
axes[1].set_title('Policy Loss over Time')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
2.4 Testing the Trained Policy
def test_policy(agent, env_name='CartPole-v1', n_episodes=10, render=True):
"""测试训练好的策略"""
env = gym.make(env_name, render_mode='human' if render else None)
test_rewards = []
for episode in range(n_episodes):
state, _ = env.reset()
episode_reward = 0
done = False
while not done:
            # Act greedily: pick the most probable action
state_tensor = torch.FloatTensor(state).unsqueeze(0)
with torch.no_grad():
probs = agent.policy(state_tensor)
action = torch.argmax(probs).item()
state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
episode_reward += reward
test_rewards.append(episode_reward)
print(f"Test Episode {episode+1}: Reward = {episode_reward}")
env.close()
print(f"\nAverage Test Reward: {np.mean(test_rewards):.2f} ± {np.std(test_rewards):.2f}")
return test_rewards
# Run the evaluation
test_policy(trained_agent, n_episodes=5, render=False)
3. Variance Reduction Techniques
3.1 The High-Variance Problem
The main weakness of REINFORCE: its gradient estimates have very high variance.
Why:
- Monte Carlo estimates are inherently noisy
- The returns $R_t$ fluctuate a lot from trajectory to trajectory
- High variance → unstable learning and a large number of required samples
3.2 Baselines
Core idea: subtract a baseline from the return; this reduces variance without introducing bias.
Guarantee: for any function $b(s)$ that depends only on the state,
\[E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot b(s) \right] = 0\]
because $E_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)] = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$.
The gradient with a baseline:
\[\nabla_\theta J(\theta) = E_{\tau} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (R_t - b(s_t)) \right]\]
Common baseline choices (a numerical illustration of the variance reduction follows this list):
- Average return: $b = \frac{1}{N} \sum_i R_i$
- State-value function: $b(s) = V^\pi(s)$ (the standard, near-optimal choice)
- Moving average: $b_t = \alpha \cdot b_{t-1} + (1-\alpha) \cdot R_t$
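To see the effect numerically, here is a minimal sketch (the same toy problem as in section 1.2; the values are illustrative): subtracting the average "return" as a baseline leaves the mean of the gradient estimator unchanged but shrinks its variance.
import torch
torch.manual_seed(0)
mu = 1.5
x = torch.normal(mu, 1.0, size=(100_000,))
score = x - mu                      # d/dmu log N(x; mu, 1)
f = x**2                            # plays the role of the return
b = f.mean()                        # baseline: average return
g_plain = score * f                 # plain REINFORCE-style estimator
g_base = score * (f - b)            # estimator with the baseline subtracted
print(g_plain.mean().item(), g_base.mean().item())   # both close to 2*mu = 3.0
print(g_plain.var().item(), g_base.var().item())     # the baseline version has markedly lower variance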
3.3 REINFORCE with a Baseline
class REINFORCEWithBaseline:
def __init__(self, state_dim, action_dim, hidden_dim=128, lr_policy=1e-3, lr_value=1e-3, gamma=0.99):
self.gamma = gamma
        # Policy network
self.policy = PolicyNetwork(state_dim, hidden_dim, action_dim)
self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=lr_policy)
        # Value network (used as the baseline)
self.value_net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
self.value_optimizer = optim.Adam(self.value_net.parameters(), lr=lr_value)
        # Trajectory buffers
self.log_probs = []
self.rewards = []
self.states = []
self.values = []
def select_action(self, state):
"""选择动作并估计状态值"""
state_tensor = torch.FloatTensor(state).unsqueeze(0)
        # Policy network
probs = self.policy(state_tensor)
dist = Categorical(probs)
action = dist.sample()
self.log_probs.append(dist.log_prob(action))
        # Value network
value = self.value_net(state_tensor)
self.values.append(value)
self.states.append(state)
return action.item()
def update(self):
"""更新策略和值函数"""
        # Compute discounted returns
returns = []
R = 0
for r in reversed(self.rewards):
R = r + self.gamma * R
returns.insert(0, R)
returns = torch.tensor(returns).unsqueeze(1)
        # Convert buffers to tensors
values = torch.cat(self.values)
log_probs = torch.stack(self.log_probs)
        # Advantage A(s,a) = R - V(s)
advantages = returns - values.detach()
        # Update the policy network
policy_loss = -(log_probs * advantages).mean()
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
        # Update the value network (mean squared error)
value_loss = nn.MSELoss()(values, returns)
self.value_optimizer.zero_grad()
value_loss.backward()
self.value_optimizer.step()
        # Clear the buffers
self.log_probs = []
self.rewards = []
self.states = []
self.values = []
return policy_loss.item(), value_loss.item()
# Training function
def train_with_baseline(env_name='CartPole-v1', n_episodes=1000):
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
agent = REINFORCEWithBaseline(state_dim, action_dim)
episode_rewards = []
for episode in range(n_episodes):
state, _ = env.reset()
episode_reward = 0
for step in range(500):
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.rewards.append(reward)
episode_reward += reward
state = next_state
if done:
break
policy_loss, value_loss = agent.update()
episode_rewards.append(episode_reward)
if (episode + 1) % 100 == 0:
avg_reward = np.mean(episode_rewards[-100:])
print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}")
env.close()
return episode_rewards, agent
# Train for comparison with plain REINFORCE
rewards_baseline, agent_baseline = train_with_baseline()
3.4 The Advantage Function
Definition:
\[A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\]
Intuition:
- $A(s,a) > 0$: action $a$ is better than the average action in $s$ → increase its probability
- $A(s,a) < 0$: action $a$ is worse than average → decrease its probability
Gradient written with the advantage function:
\[\nabla_\theta J(\theta) = E \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t) \right]\]
3.5 Generalized Advantage Estimation (GAE)
The remaining question: how do we estimate the advantage function?
TD error:
\[\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\]
GAE (Schulman et al., 2015):
\[A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l}\]
Parameters:
- $\gamma$: discount factor
- $\lambda \in [0, 1]$: bias-variance trade-off
  - $\lambda = 0$: low variance, high bias (one-step TD)
  - $\lambda = 1$: high variance, no bias (Monte Carlo)
Recursive computation (a single backward pass over the trajectory; a minimal sketch follows):
\[A_t = \delta_t + \gamma \lambda A_{t+1}\]
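A sketch of GAE computed with this recursion (the helper name and the convention that values holds one more entry than rewards, with V(s_T)=0 at termination, are illustrative assumptions):
import numpy as np
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one episode; values holds V(s_0)..V(s_T)."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                          # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = gae
    return advantages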
4. Policy Gradients for Continuous Action Spaces
4.1 Gaussian Policies
Discrete actions use a Categorical distribution; continuous actions use a Gaussian distribution:
\[\pi_\theta(a|s) = \mathcal{N}(a; \mu_\theta(s), \sigma_\theta(s)^2)\]
Log probability:
\[\log \pi_\theta(a|s) = -\frac{1}{2} \left( \frac{(a - \mu_\theta(s))^2}{\sigma_\theta(s)^2} + \log(2\pi\sigma_\theta(s)^2) \right)\]
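A quick numerical check of this formula against PyTorch's built-in Normal distribution (the particular numbers are illustrative):
import torch
from torch.distributions import Normal
mu, sigma, a = torch.tensor(0.5), torch.tensor(1.2), torch.tensor(-0.3)
manual = -0.5 * ((a - mu) ** 2 / sigma**2 + torch.log(2 * torch.pi * sigma**2))
auto = Normal(mu, sigma).log_prob(a)
print(manual.item(), auto.item())   # the two values agree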
4.2 Implementation for Continuous Actions (Pendulum-v1)
from torch.distributions import Normal
class ContinuousPolicyNetwork(nn.Module):
def __init__(self, state_dim, hidden_dim, action_dim):
super(ContinuousPolicyNetwork, self).__init__()
self.shared = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Tanh(),
nn.Linear(hidden_dim, hidden_dim),
nn.Tanh()
)
        # Mean head
self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Standard-deviation head (parameterized as log std so the std stays positive)
self.log_std_head = nn.Linear(hidden_dim, action_dim)
def forward(self, x):
features = self.shared(x)
mean = self.mean_head(features)
log_std = self.log_std_head(features)
        std = torch.exp(log_std.clamp(-20, 2))  # clamp for numerical stability
return mean, std
class ContinuousREINFORCE:
def __init__(self, state_dim, action_dim, hidden_dim=128, lr=3e-4, gamma=0.99):
self.gamma = gamma
self.policy = ContinuousPolicyNetwork(state_dim, hidden_dim, action_dim)
self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
self.log_probs = []
self.rewards = []
def select_action(self, state):
"""从高斯策略采样动作"""
state = torch.FloatTensor(state).unsqueeze(0)
mean, std = self.policy(state)
dist = Normal(mean, std)
action = dist.sample()
        self.log_probs.append(dist.log_prob(action).sum(dim=-1))  # sum over action dimensions
return action.detach().numpy().flatten()
def update(self):
"""更新策略"""
returns = []
R = 0
for r in reversed(self.rewards):
R = r + self.gamma * R
returns.insert(0, R)
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
policy_loss = []
for log_prob, R in zip(self.log_probs, returns):
policy_loss.append(-log_prob * R)
self.optimizer.zero_grad()
policy_loss = torch.stack(policy_loss).sum()
policy_loss.backward()
self.optimizer.step()
self.log_probs = []
self.rewards = []
return policy_loss.item()
# Train on Pendulum
def train_pendulum(n_episodes=500):
env = gym.make('Pendulum-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
agent = ContinuousREINFORCE(state_dim, action_dim)
episode_rewards = []
for episode in range(n_episodes):
state, _ = env.reset()
episode_reward = 0
for step in range(200):
action = agent.select_action(state)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
agent.rewards.append(reward)
episode_reward += reward
state = next_state
if done:
break
loss = agent.update()
episode_rewards.append(episode_reward)
if (episode + 1) % 50 == 0:
avg_reward = np.mean(episode_rewards[-50:])
print(f"Episode {episode+1}, Avg Reward: {avg_reward:.2f}")
env.close()
return episode_rewards, agent
rewards_pendulum, agent_pendulum = train_pendulum()
# Plot the learning curve
plt.figure(figsize=(10, 5))
plt.plot(rewards_pendulum, alpha=0.3)
plt.plot(np.convolve(rewards_pendulum, np.ones(50)/50, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('REINFORCE on Pendulum-v1 (Continuous Actions)')
plt.grid(True, alpha=0.3)
plt.show()
5. Practical Tips and Tuning Advice
5.1 Learning Rate Scheduling
from torch.optim.lr_scheduler import StepLR
# Learning rate decay
optimizer = optim.Adam(policy.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=200, gamma=0.9)
# Inside the training loop
for episode in range(n_episodes):
    # ... training code ...
scheduler.step()
5.2 Gradient Clipping
# Prevent exploding gradients
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
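The call belongs between backward() and the optimizer step. A self-contained sketch of the ordering (the tiny network and the placeholder loss are only for illustration):
import torch
import torch.nn as nn
import torch.optim as optim
policy = nn.Linear(4, 2)                                 # stand-in for a policy network
optimizer = optim.Adam(policy.parameters(), lr=1e-3)
loss = policy(torch.randn(8, 4)).pow(2).mean()           # placeholder loss
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)   # clip, then step
optimizer.step()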
5.3 Entropy Regularization
Purpose: encourage exploration and keep the policy from collapsing prematurely onto a suboptimal deterministic choice.
\[J_{\text{total}}(\theta) = J(\theta) + \beta \cdot H(\pi_\theta)\]
where $H(\pi_\theta) = -E_{s,a} [\log \pi_\theta(a|s)]$ is the policy entropy.
Implementation:
# Add an entropy bonus when computing the loss
entropy = dist.entropy().mean()
policy_loss = -(log_probs * advantages).mean() - 0.01 * entropy # β = 0.01
5.4 Recommended Hyperparameters
| Parameter | CartPole | Pendulum | MuJoCo |
|---|---|---|---|
| Learning rate | 1e-3 | 3e-4 | 3e-4 |
| Hidden dim | 128 | 256 | 256 |
| Gamma | 0.99 | 0.99 | 0.99 |
| Entropy coef | 0.0 | 0.01 | 0.01 |
| Batch size | 1 episode | 1 episode | 2048 steps |
6. Strengths, Weaknesses, and Extensions of REINFORCE
6.1 Strengths
- ✅ Conceptually simple: a direct implementation of the policy gradient theorem
- ✅ Unbiased: Monte Carlo returns give an unbiased gradient estimate
- ✅ Broadly applicable: discrete or continuous action spaces, any differentiable stochastic policy
- ✅ Easy to implement: the code is short, which makes it a good starting point
6.2 Weaknesses
- ❌ High variance: many samples are needed for stable learning
- ❌ Poor sample efficiency: each trajectory is used only once (on-policy)
- ❌ Slow learning compared with Actor-Critic methods
- ❌ Sensitive to hyperparameters: the learning rate and the choice of baseline matter a lot
6.3 Extensions
- Actor-Critic methods:
  - Use a learned value function as both baseline and critic
  - Representative algorithms: A2C, A3C, PPO, SAC
- Natural policy gradient (NPG):
  - Uses the Fisher information matrix
  - Yields more stable update directions
- Trust-region methods:
  - Constrain how much the policy can change per update
  - Representative algorithms: TRPO, PPO
- Off-policy methods:
  - Reuse previously collected data
  - Representative algorithms: DPG, DDPG, TD3
7. Summary
7.1 Key Takeaways
- Policy gradient theorem: \(\nabla_\theta J(\theta) = E \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t \right]\)
- REINFORCE:
  - Estimates the gradient by Monte Carlo sampling
  - Unbiased but high variance
- Variance reduction:
  - Baselines (typically a learned value function)
  - Advantage function $A(s,a) = Q(s,a) - V(s)$
  - GAE (the $\lambda$ parameter trades off bias and variance)
- Continuous actions:
  - Gaussian policy $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s))$
  - Learnable mean and standard deviation
7.2 Comparison with Value-Based Methods
| Aspect | Policy Gradient | Q-learning / DQN |
|---|---|---|
| Optimization target | The policy directly | A value function |
| Action space | Discrete and continuous | Mainly discrete |
| Exploration | A stochastic policy explores naturally | Needs ε-greedy |
| Convergence | Good theoretical guarantees | Can oscillate |
| Sample efficiency | Low (on-policy) | Higher (off-policy) |
| Implementation difficulty | Moderate | Relatively simple |
7.3 Practical Advice
- 🎯 First project: start with CartPole
- 🎯 Debugging: monitor the policy entropy, gradient norms, and the variance of returns
- 🎯 Hyperparameters: start from the recommended values, then tune gradually
- 🎯 Next steps: study A2C/PPO and the Actor-Critic framework in depth
References
- Classic papers:
  - Williams (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
  - Sutton et al. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation
  - Schulman et al. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation
- Open-source implementations:
- Recommended courses:
  - David Silver's reinforcement learning course, Lecture 7
  - CS 285: Deep Reinforcement Learning (UC Berkeley)
Policy gradient methods are a cornerstone of modern deep reinforcement learning. Mastering REINFORCE and its variants not only lets you solve real problems, it also builds the foundation for understanding more advanced algorithms such as PPO and SAC. Remember: reinforcement learning is a hands-on craft; the more you experiment, the more it will click.