引言
强化学习(Reinforcement Learning, RL)算法的训练过程常常充满挑战:训练不稳定、收敛缓慢、性能波动大等问题层出不穷。与监督学习不同,RL的超参数调优更加困难,因为:
- 非平稳性:数据分布随策略更新不断变化
- 稀疏奖励:反馈信号延迟且稀少
- 高方差:梯度估计噪声大
- 超参数敏感:微小的参数变化可能导致完全不同的结果
本文将系统总结强化学习调参的实战经验,包括关键超参数的作用机制、调参策略、常见问题的解决方案,以及稳定性和样本效率的提升技巧。
1. 核心超参数详解
1.1 学习率(Learning Rate)
作用:控制参数更新的步长
影响:
- 过大:训练不稳定,性能剧烈波动,甚至发散
- 过小:收敛缓慢,可能陷入局部最优
推荐范围:
| 算法类型 | 推荐学习率 | 典型值 |
|---|---|---|
| DQN系列 | $1 \times 10^{-4}$ ~ $5 \times 10^{-4}$ | $2.5 \times 10^{-4}$ |
| A2C/A3C | $7 \times 10^{-4}$ ~ $1 \times 10^{-3}$ | $7 \times 10^{-4}$ |
| PPO | $1 \times 10^{-4}$ ~ $3 \times 10^{-3}$ | $3 \times 10^{-4}$ |
| SAC | $3 \times 10^{-4}$ ~ $1 \times 10^{-3}$ | $3 \times 10^{-4}$ |
| TD3 | $1 \times 10^{-3}$ ~ $3 \times 10^{-3}$ | $3 \times 10^{-3}$ |
学习率衰减策略:
import torch.optim as optim
# 1. 线性衰减
def linear_schedule(initial_lr, final_lr, max_timesteps):
def lr_schedule(timestep):
progress = timestep / max_timesteps
return initial_lr + (final_lr - initial_lr) * progress
return lr_schedule
# 2. 指数衰减
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
# 3. 余弦退火
scheduler = optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=total_steps, eta_min=1e-6
)
# 4. 自适应学习率(推荐使用Adam)
optimizer = optim.Adam(model.parameters(), lr=3e-4)
调参技巧:
- 从默认值开始:先用典型值测试
- 观察训练曲线:
- 如果损失震荡剧烈 → 降低学习率
- 如果收敛太慢 → 提高学习率
- 网格搜索:在 [1e-5, 1e-2] 范围内对数搜索
- 使用学习率查找器:
def find_optimal_lr(model, env, lr_min=1e-6, lr_max=1e-2, num_steps=100):
"""学习率查找器"""
lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), num_steps)
losses = []
for lr in lrs:
optimizer = optim.Adam(model.parameters(), lr=lr)
# 训练几步并记录损失
loss = train_step(model, optimizer, env)
losses.append(loss)
if loss > 2 * min(losses): # 损失爆炸,提前停止
break
# 绘制学习率-损失曲线
import matplotlib.pyplot as plt
plt.semilogx(lrs[:len(losses)], losses)
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.title('Learning Rate Finder')
plt.show()
    # 选择损失下降最快的点(梯度最负处)
optimal_idx = np.argmin(np.gradient(losses))
return lrs[optimal_idx]
1.2 折扣因子(Discount Factor, γ)
作用:控制对未来奖励的重视程度
\[G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]
影响:
- 接近1(如0.99):重视长期回报,适合奖励延迟的任务
- 较小(如0.9):重视即时奖励,适合短期决策任务
推荐范围:
| 任务特征 | 推荐γ值 |
|---|---|
| 短期任务(步数<100) | 0.9 ~ 0.95 |
| 中期任务(步数100-1000) | 0.95 ~ 0.99 |
| 长期任务(步数>1000) | 0.99 ~ 0.999 |
| 无限时域任务 | 0.99 |
实战示例:
# Atari游戏(长时域)
gamma = 0.99
# CartPole(短时域)
gamma = 0.95
# MuJoCo连续控制(中长时域)
gamma = 0.99
注意事项:
- γ越大,值函数估计的方差越大
- γ越小,算法越“短视”
- 通常先固定γ=0.99,优先调整其他参数
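一个直观的经验法则:有效时域约等于 $1/(1-\gamma)$,γ=0.99 约对应100步,γ=0.999 约对应1000步。下面给出一个把回合长度粗略映射到γ候选值的小示例(仅为经验法则的示意,函数名为说明自拟):
import numpy as np

def effective_horizon(gamma):
    """有效时域 ≈ 1 / (1 - γ):奖励衰减到可忽略之前大约覆盖的步数"""
    return 1.0 / (1.0 - gamma)

def suggest_gamma(episode_length):
    """根据典型回合长度粗略给出γ的起点值(经验法则,仍需实验验证)"""
    gamma = 1.0 - 1.0 / max(episode_length, 10)
    return float(np.clip(gamma, 0.9, 0.999))

print(effective_horizon(0.99))  # ≈ 100
print(suggest_gamma(500))       # ≈ 0.998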
1.3 批量大小(Batch Size)
作用:每次更新使用的样本数量
影响:
- 大批量:
- ✅ 梯度估计更准确,训练稳定
- ✅ GPU利用率高,计算效率高
- ❌ 样本效率低,需要更多交互
- 小批量:
- ✅ 样本效率高
- ❌ 梯度噪声大,训练不稳定
推荐值:
| 算法 | 批量大小 |
|---|---|
| DQN | 32 ~ 128 |
| PPO | 64 ~ 2048 |
| SAC | 256 ~ 1024 |
| A2C/A3C | 128 ~ 256 |
动态批量大小:
def adaptive_batch_size(timestep, min_batch=32, max_batch=512):
"""根据训练进度调整批量大小"""
# 早期使用小批量探索,后期使用大批量稳定
progress = min(1.0, timestep / 1e6)
batch_size = int(min_batch + (max_batch - min_batch) * progress)
return batch_size
1.4 探索率(Exploration Rate, ε)
作用:平衡探索(Exploration)与利用(Exploitation)
ε-greedy策略:
\[a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s,a) & \text{with probability } 1-\epsilon \end{cases}\]
衰减策略:
# 1. 线性衰减(最常用)
def linear_epsilon_decay(timestep, epsilon_start=1.0, epsilon_end=0.01,
decay_steps=1000000):
epsilon = epsilon_start - (epsilon_start - epsilon_end) * min(1.0, timestep / decay_steps)
return epsilon
# 2. 指数衰减
def exponential_epsilon_decay(timestep, epsilon_start=1.0, epsilon_end=0.01,
decay_rate=0.99995):
epsilon = max(epsilon_end, epsilon_start * (decay_rate ** timestep))
return epsilon
# 3. 分段衰减
def piecewise_epsilon(timestep):
if timestep < 500000:
return 1.0
elif timestep < 1000000:
return 0.5
elif timestep < 2000000:
return 0.1
else:
return 0.01
# 使用示例
timestep = 0
for episode in range(num_episodes):
epsilon = linear_epsilon_decay(timestep)
state = env.reset()
done = False
while not done:
if np.random.rand() < epsilon:
action = env.action_space.sample() # 探索
else:
action = select_action(state) # 利用
state, reward, done, _ = env.step(action)
timestep += 1
推荐配置:
| 参数 | DQN | PPO |
|---|---|---|
| 初始ε | 1.0 | N/A(使用熵正则化) |
| 最终ε | 0.01 ~ 0.05 | N/A |
| 衰减步数 | 1M ~ 10M | N/A |
替代方案:熵正则化(用于策略梯度算法)
# PPO中的熵奖励
entropy_coef = 0.01
loss = policy_loss + value_loss - entropy_coef * entropy
1.5 目标网络更新频率(Target Network Update)
作用:稳定Q值估计,减少移动目标问题
DQN的目标网络:
class DQN:
def __init__(self):
self.q_network = QNetwork()
self.target_network = QNetwork()
self.target_network.load_state_dict(self.q_network.state_dict())
self.target_update_freq = 10000 # 每10000步更新一次
def update_target_network(self, timestep):
if timestep % self.target_update_freq == 0:
self.target_network.load_state_dict(self.q_network.state_dict())
软更新(Soft Update):
\[\theta_{\text{target}} \leftarrow \tau \theta + (1-\tau) \theta_{\text{target}}\]
def soft_update(target_model, source_model, tau=0.005):
"""软更新目标网络"""
for target_param, param in zip(target_model.parameters(),
source_model.parameters()):
target_param.data.copy_(
tau * param.data + (1.0 - tau) * target_param.data
)
# 每次更新后调用
soft_update(target_network, q_network, tau=0.005)
推荐配置:
| 更新方式 | DQN | DDPG/TD3/SAC |
|---|---|---|
| 硬更新频率 | 10000 ~ 50000 步 | 不推荐 |
| 软更新τ | N/A | 0.001 ~ 0.01 |
1.6 经验回放缓冲区大小(Replay Buffer Size)
作用:存储历史经验,打破数据相关性
影响:
- 大缓冲区(1M+):
- ✅ 数据多样性高,打破相关性
- ❌ 内存消耗大,可能包含过时策略的数据
- 小缓冲区(10K-100K):
- ✅ 数据新鲜度高
- ❌ 可能过拟合最近的经验
推荐大小:
| 算法 | 缓冲区大小 |
|---|---|
| DQN | 1M |
| DDPG | 1M |
| SAC | 1M |
| TD3 | 1M |
| PPO | 不使用回放缓冲区 |
优先级经验回放(PER):
class PrioritizedReplayBuffer:
def __init__(self, capacity, alpha=0.6, beta=0.4):
self.capacity = capacity
self.alpha = alpha # 优先级指数
self.beta = beta # 重要性采样指数
self.buffer = []
self.priorities = np.zeros(capacity, dtype=np.float32)
self.position = 0
def add(self, state, action, reward, next_state, done):
max_priority = self.priorities.max() if self.buffer else 1.0
if len(self.buffer) < self.capacity:
self.buffer.append((state, action, reward, next_state, done))
else:
self.buffer[self.position] = (state, action, reward, next_state, done)
self.priorities[self.position] = max_priority
self.position = (self.position + 1) % self.capacity
def sample(self, batch_size):
if len(self.buffer) == self.capacity:
priorities = self.priorities
else:
priorities = self.priorities[:len(self.buffer)]
# 计算采样概率
probabilities = priorities ** self.alpha
probabilities /= probabilities.sum()
# 采样
indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)
samples = [self.buffer[idx] for idx in indices]
# 重要性采样权重
total = len(self.buffer)
weights = (total * probabilities[indices]) ** (-self.beta)
weights /= weights.max()
return samples, indices, weights
def update_priorities(self, indices, priorities):
for idx, priority in zip(indices, priorities):
self.priorities[idx] = priority
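下面是优先级回放的一个最小使用示意(其中TD误差用随机数占位,实际应由Q网络计算 δ = r + γ max Q' - Q;数值仅作演示):
import numpy as np

buffer = PrioritizedReplayBuffer(capacity=100000)
# ……与环境交互并调用 buffer.add(state, action, reward, next_state, done)……

samples, indices, weights = buffer.sample(batch_size=64)
td_errors = np.random.randn(64)                      # 占位:替换为真实的TD误差
weighted_loss = (weights * td_errors ** 2).mean()    # 用重要性采样权重修正采样偏差
# 反向传播、更新网络之后,用新的 |TD误差| 刷新优先级
buffer.update_priorities(indices, np.abs(td_errors) + 1e-6)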
2. 算法特定的超参数
2.1 PPO(Proximal Policy Optimization)
关键超参数:
| 参数 | 含义 | 典型值 | 作用 |
|---|---|---|---|
| `clip_range` | 策略裁剪范围 | 0.1 ~ 0.3 | 限制策略更新幅度 |
| `n_epochs` | 每批数据的训练轮数 | 3 ~ 10 | 提高样本效率 |
| `gae_lambda` | GAE参数λ | 0.95 ~ 0.99 | 平衡偏差和方差 |
| `value_coef` | 值函数损失系数 | 0.5 ~ 1.0 | 值函数学习权重 |
| `entropy_coef` | 熵正则化系数 | 0.01 ~ 0.1 | 鼓励探索 |
推荐配置(Stable Baselines3默认值):
from stable_baselines3 import PPO
model = PPO(
"MlpPolicy",
env,
learning_rate=3e-4,
n_steps=2048, # 每次更新采集的步数
batch_size=64, # 小批量大小
n_epochs=10, # 训练轮数
gamma=0.99,
gae_lambda=0.95,
clip_range=0.2, # 裁剪范围
clip_range_vf=None, # 值函数裁剪(可选)
ent_coef=0.0, # 熵系数
vf_coef=0.5, # 值函数系数
max_grad_norm=0.5, # 梯度裁剪
verbose=1
)
调参建议:
- clip_range:
- 如果训练不稳定 → 减小到0.1
- 如果收敛太慢 → 增大到0.3
- n_epochs:
- 简单任务:3-5轮
- 复杂任务:10-15轮
- 注意过拟合:观察训练集和验证集表现
- entropy_coef:
- 开始时较大(0.01-0.1)鼓励探索
- 后期衰减到0,利用已学到的策略
def adaptive_entropy_coef(timestep, initial=0.1, final=0.001, decay_steps=1e6):
progress = min(1.0, timestep / decay_steps)
return initial + (final - initial) * progress
2.2 SAC(Soft Actor-Critic)
关键超参数:
| 参数 | 含义 | 典型值 |
|---|---|---|
| `temperature (α)` | 熵温度系数 | 0.2(自动调整) |
| `tau` | 软更新系数 | 0.005 |
| `target_entropy` | 目标熵 | $-\dim(\mathcal{A})$ |
| `learning_starts` | 开始学习的步数 | 10000 |
自动温度调整:
class SAC:
def __init__(self, action_dim):
# 自动调整熵温度
self.target_entropy = -action_dim # 启发式目标熵
self.log_alpha = torch.zeros(1, requires_grad=True)
self.alpha_optimizer = optim.Adam([self.log_alpha], lr=3e-4)
    def update_alpha(self, entropy):
        """更新温度参数:策略熵低于目标熵时增大α,反之减小"""
        alpha_loss = (self.log_alpha * (entropy - self.target_entropy).detach()).mean()
self.alpha_optimizer.zero_grad()
alpha_loss.backward()
self.alpha_optimizer.step()
alpha = self.log_alpha.exp()
return alpha
2.3 DQN及其变体
Rainbow DQN组件:
class RainbowDQN:
def __init__(self):
# 1. Double DQN:减少过估计
self.use_double_dqn = True
# 2. Dueling DQN:分离值函数和优势函数
self.use_dueling = True
# 3. Prioritized Experience Replay
self.use_per = True
self.per_alpha = 0.6
self.per_beta = 0.4
# 4. Multi-step Learning
self.n_step = 3
# 5. Distributional RL (C51)
self.use_distributional = True
self.v_min = -10
self.v_max = 10
self.n_atoms = 51
# 6. Noisy Networks:替代ε-greedy
self.use_noisy_net = True
推荐组合:
- 简单任务:Double DQN + Dueling
- 中等任务:+ PER
- 复杂任务:Rainbow(全部组件)
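以其中的Double DQN为例,核心是在线网络负责选动作、目标网络负责评估该动作,从而缓解max操作带来的过估计。下面是一个目标值计算的示意(假设q_network与target_network均输出[batch, num_actions]形状的Q值张量):
import torch

def double_dqn_target(q_network, target_network, rewards, next_states, dones, gamma=0.99):
    """Double DQN目标值:在线网络选动作,目标网络评估,缓解max操作的过估计"""
    with torch.no_grad():
        # 1. 在线网络在 s' 上选出贪婪动作
        next_actions = q_network(next_states).argmax(dim=1, keepdim=True)
        # 2. 目标网络评估这些动作的Q值
        next_q = target_network(next_states).gather(1, next_actions).squeeze(1)
        # 3. 终止状态不做自举
        targets = rewards + gamma * (1.0 - dones) * next_q
    return targets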
3. 提升训练稳定性的技巧
3.1 梯度裁剪(Gradient Clipping)
问题:强化学习中梯度经常爆炸或消失
解决方案:
import torch.nn.utils as nn_utils
# 方法1:按范数裁剪(最常用)
nn_utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
# 方法2:按值裁剪
nn_utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# 完整训练循环
for epoch in range(num_epochs):
loss = compute_loss()
optimizer.zero_grad()
loss.backward()
# 梯度裁剪
nn_utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
推荐值:
- PPO: max_norm=0.5
- DQN: max_norm=10
- SAC: max_norm=1.0
3.2 奖励归一化与裁剪
问题:奖励尺度差异大导致训练不稳定
解决方案:
class RewardNormalizer:
def __init__(self, gamma=0.99, epsilon=1e-8):
self.gamma = gamma
self.epsilon = epsilon
self.returns = []
self.mean = 0
self.var = 1
self.count = 0
def update(self, reward):
"""增量更新统计信息"""
self.returns.append(reward)
self.count += 1
if len(self.returns) > 1000:
self.returns.pop(0)
# 计算折扣回报的均值和方差
discounted_returns = []
R = 0
for r in reversed(self.returns):
R = r + self.gamma * R
discounted_returns.insert(0, R)
self.mean = np.mean(discounted_returns)
self.var = np.var(discounted_returns) + self.epsilon
def normalize(self, reward):
"""归一化奖励"""
return reward / np.sqrt(self.var)
# 奖励裁剪
def clip_reward(reward, min_value=-10, max_value=10):
return np.clip(reward, min_value, max_value)
# Atari游戏常用:符号裁剪
def sign_clip_reward(reward):
return np.sign(reward)
3.3 观测归一化
VecNormalize包装器(Stable Baselines3):
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
# 创建向量化环境
env = DummyVecEnv([lambda: gym.make("LunarLander-v2")])
# 归一化观测和奖励
env = VecNormalize(
env,
norm_obs=True, # 归一化观测
norm_reward=True, # 归一化奖励
clip_obs=10.0, # 裁剪观测
clip_reward=10.0, # 裁剪奖励
gamma=0.99
)
# 训练
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100000)
# 保存归一化统计信息
env.save("vec_normalize.pkl")
# 测试时加载
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False # 不更新统计信息
env.norm_reward = False # 测试时不归一化奖励
3.4 网络初始化
正交初始化(Orthogonal Initialization):
import torch.nn as nn
def orthogonal_init(module, gain=1.0):
"""正交初始化"""
if isinstance(module, (nn.Linear, nn.Conv2d)):
nn.init.orthogonal_(module.weight, gain=gain)
if module.bias is not None:
module.bias.data.fill_(0.0)
# 应用到网络
class ActorCritic(nn.Module):
def __init__(self):
super().__init__()
self.actor = nn.Linear(state_dim, action_dim)
self.critic = nn.Linear(state_dim, 1)
# 使用正交初始化
self.apply(lambda m: orthogonal_init(m, gain=np.sqrt(2)))
# 最后一层使用小增益
orthogonal_init(self.actor, gain=0.01)
推荐配置:
- 隐藏层:gain=$\sqrt{2}$(对应ReLU激活)
- 策略输出层:gain=0.01(小初始值,避免策略更新过大)
- 值函数输出层:gain=1.0
3.5 学习率预热(Learning Rate Warmup)
class WarmupScheduler:
def __init__(self, optimizer, warmup_steps, initial_lr, target_lr):
self.optimizer = optimizer
self.warmup_steps = warmup_steps
self.initial_lr = initial_lr
self.target_lr = target_lr
self.current_step = 0
def step(self):
self.current_step += 1
if self.current_step <= self.warmup_steps:
lr = self.initial_lr + (self.target_lr - self.initial_lr) * \
(self.current_step / self.warmup_steps)
for param_group in self.optimizer.param_groups:
param_group['lr'] = lr
# 使用示例
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
scheduler = WarmupScheduler(optimizer, warmup_steps=10000,
initial_lr=1e-6, target_lr=3e-4)
for step in range(total_steps):
# 训练步骤
optimizer.zero_grad()
loss.backward()
optimizer.step()
# 更新学习率
scheduler.step()
4. 提升样本效率的技巧
4.1 广义优势估计(GAE)
问题:TD误差方差大,蒙特卡洛偏差大
解决方案:GAE结合两者优点
\[\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\]
其中TD误差:$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
"""
计算广义优势估计
参数:
rewards: 奖励序列 [T]
values: 值函数估计 [T+1]
dones: 终止标志 [T]
gamma: 折扣因子
gae_lambda: GAE参数
返回:
advantages: 优势估计 [T]
returns: 回报估计 [T]
"""
advantages = []
gae = 0
    for t in reversed(range(len(rewards))):
        # values 长度为 T+1,最后一项是自举值 V(s_T),因此可直接取 values[t + 1]
        next_value = values[t + 1]
# TD误差
delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
# GAE
gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
advantages.insert(0, gae)
advantages = np.array(advantages)
returns = advantages + values[:-1]
return advantages, returns
# 归一化优势(提高稳定性)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
λ的选择:
- λ=0:纯TD(低方差,高偏差)
- λ=1:纯蒙特卡洛(高方差,无偏差)
- λ=0.95:推荐折中值
4.2 N-step Returns
思想:使用多步回报而非单步TD
\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]
from collections import deque

class NStepReplayBuffer:
def __init__(self, capacity, n_step=3, gamma=0.99):
self.capacity = capacity
self.n_step = n_step
self.gamma = gamma
self.buffer = deque(maxlen=capacity)
self.n_step_buffer = deque(maxlen=n_step)
def add(self, state, action, reward, next_state, done):
# 添加到n步缓冲区
self.n_step_buffer.append((state, action, reward, next_state, done))
if len(self.n_step_buffer) < self.n_step:
return
        # 计算n步回报(若窗口内出现终止,则在终止处截断,避免跨回合累积)
        n_step_reward = 0.0
        next_state_n = self.n_step_buffer[-1][3]
        done_n = self.n_step_buffer[-1][4]
        for i, (_, _, r, ns, d) in enumerate(self.n_step_buffer):
            n_step_reward += (self.gamma ** i) * r
            if d:
                next_state_n, done_n = ns, True
                break
        # 以窗口起点的状态和动作作为转移的起点
        state_0 = self.n_step_buffer[0][0]
        action_0 = self.n_step_buffer[0][1]
# 存储n步转移
self.buffer.append((state_0, action_0, n_step_reward,
next_state_n, done_n))
def sample(self, batch_size):
indices = np.random.choice(len(self.buffer), batch_size)
return [self.buffer[i] for i in indices]
推荐值:
- 简单任务:n=1(单步)
- 中等任务:n=3-5
- 复杂任务:n=5-10(但注意方差增加)
4.3 并行环境(Vectorized Environments)
加速采样:
from stable_baselines3.common.vec_env import SubprocVecEnv, DummyVecEnv
def make_env(env_id, rank, seed=0):
def _init():
env = gym.make(env_id)
env.reset(seed=seed + rank)
return env
return _init
if __name__ == '__main__':
num_envs = 8 # 并行环境数量
env_id = "CartPole-v1"
# 使用多进程(推荐)
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_envs)])
# 或使用单进程(调试时)
# env = DummyVecEnv([make_env(env_id, i) for i in range(num_envs)])
# 训练
model = PPO("MlpPolicy", env, n_steps=128, batch_size=256)
model.learn(total_timesteps=100000)
优势:
- 加速数据采集(线性加速)
- 增加数据多样性
- 更好地利用多核CPU
4.4 课程学习(Curriculum Learning)
思想:从简单任务逐渐过渡到复杂任务
class CurriculumEnv:
def __init__(self, base_env, difficulty_schedule):
self.env = base_env
self.difficulty_schedule = difficulty_schedule
self.timestep = 0
def reset(self):
self.timestep += 1
# 根据进度调整难度
difficulty = self.difficulty_schedule(self.timestep)
self.env.set_difficulty(difficulty)
return self.env.reset()
def step(self, action):
return self.env.step(action)
# 难度调度函数
def linear_difficulty_schedule(timestep, max_timestep=1e6):
"""线性增加难度"""
return min(1.0, timestep / max_timestep)
def threshold_difficulty_schedule(success_rate, thresholds):
    """基于近期成功率的阶梯式难度
    thresholds: 按成功率阈值从高到低排列的 (threshold, difficulty) 列表,
    例如 [(0.8, 1.0), (0.5, 0.6), (0.0, 0.3)]
    """
    for threshold, difficulty in thresholds:
        if success_rate > threshold:
            return difficulty
    return thresholds[-1][1]
5. 调参流程与最佳实践
5.1 系统化调参流程
Step 1: 建立基线
# 1. 使用默认超参数运行
from stable_baselines3 import PPO
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
# 2. 记录性能指标
baseline_reward, _ = evaluate_policy(model, env, n_eval_episodes=100)
print(f"基线平均奖励: {baseline_reward:.2f}")
Step 2: 超参数重要性排序
优先级(从高到低):
- 学习率:影响最大,首先调整
- 网络架构:隐藏层大小、层数(配置示例见本列表之后)
- 批量大小:影响稳定性和效率
- 折扣因子γ:取决于任务特性
- 探索策略:ε或熵系数
- 目标网络更新频率
- 其他算法特定参数
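其中第2项“网络架构”在Stable Baselines3中可以通过policy_kwargs直接配置,无需改动算法代码(下例沿用上文创建的env,层数与激活函数仅作示意):
import torch.nn as nn
from stable_baselines3 import PPO

# policy_kwargs 控制隐藏层大小、层数与激活函数(两层64单元仅作示意)
policy_kwargs = dict(net_arch=[64, 64], activation_fn=nn.Tanh)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=0)
model.learn(total_timesteps=50000)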
Step 3: 网格搜索或随机搜索
from stable_baselines3.common.evaluation import evaluate_policy
import optuna
def objective(trial):
"""Optuna优化目标函数"""
# 定义超参数搜索空间
learning_rate = trial.suggest_loguniform('learning_rate', 1e-5, 1e-3)
gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.99, 0.995])
gae_lambda = trial.suggest_categorical('gae_lambda', [0.8, 0.9, 0.95, 0.99])
clip_range = trial.suggest_uniform('clip_range', 0.1, 0.4)
ent_coef = trial.suggest_loguniform('ent_coef', 1e-4, 0.1)
# 创建模型
model = PPO(
"MlpPolicy",
env,
learning_rate=learning_rate,
gamma=gamma,
gae_lambda=gae_lambda,
clip_range=clip_range,
ent_coef=ent_coef,
verbose=0
)
# 训练
model.learn(total_timesteps=50000)
# 评估
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
return mean_reward
# 运行优化
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)
print("最佳超参数:")
print(study.best_params)
Step 4: 精细调整
# 在最佳超参数附近进行精细搜索
best_lr = study.best_params['learning_rate']
fine_tune_lrs = [
best_lr * 0.5,
best_lr * 0.75,
best_lr,
best_lr * 1.25,
best_lr * 1.5
]
results = []
for lr in fine_tune_lrs:
model = PPO("MlpPolicy", env, learning_rate=lr, **other_best_params)
model.learn(total_timesteps=100000)
    reward, _ = evaluate_policy(model, env)
results.append((lr, reward))
best_lr_fine = max(results, key=lambda x: x[1])[0]
5.2 监控与可视化
TensorBoard集成:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir='runs/experiment1')
global_step = 0
for episode in range(num_episodes):
# 训练过程
episode_reward = 0
    state = env.reset()
    done = False
    while not done:
action = select_action(state)
next_state, reward, done, _ = env.step(action)
episode_reward += reward
# 记录损失
loss = update_model()
writer.add_scalar('Loss/policy_loss', loss, global_step)
state = next_state
global_step += 1
# 记录每回合指标
writer.add_scalar('Reward/episode', episode_reward, episode)
writer.add_scalar('Length/episode', episode_length, episode)
# 记录超参数
writer.add_scalar('Hyperparameters/learning_rate', current_lr, episode)
writer.add_scalar('Hyperparameters/epsilon', current_epsilon, episode)
writer.close()
Weights & Biases集成:
import wandb
# 初始化
wandb.init(
project="rl-tuning",
config={
"learning_rate": 3e-4,
"gamma": 0.99,
"architecture": "MLP",
"environment": "CartPole-v1"
}
)
# 训练过程中记录
for step in range(total_steps):
# 训练
loss = train_step()
reward = evaluate()
# 记录指标
wandb.log({
"loss": loss,
"reward": reward,
"epsilon": epsilon,
"learning_rate": current_lr
}, step=step)
# 保存最佳模型
wandb.save('best_model.pth')
5.3 常见问题诊断
| 症状 | 可能原因 | 解决方案 |
|---|---|---|
| 训练不收敛 | 学习率过大 | 降低学习率 |
| 收敛很慢 | 学习率过小 | 提高学习率 |
| 性能剧烈波动 | 批量大小太小 | 增大批量 |
| 奖励曲线震荡 | 探索率过高 | 加快ε衰减 |
| 值函数发散 | 目标网络更新太频繁 | 降低更新频率 |
| 过拟合最近经验 | 回放缓冲区太小 | 增大缓冲区 |
| 陷入局部最优 | 探索不足 | 增加熵正则化 |
| 梯度爆炸 | 网络初始化不当 | 使用正交初始化 |
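针对表中“梯度爆炸”“值函数发散”一类症状,可以在每次更新后记录全局梯度范数作为预警信号。下面是一个与具体算法无关的小工具示意:
import torch

def global_grad_norm(model):
    """计算所有参数梯度的全局L2范数,用于监控梯度是否接近爆炸"""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.data.norm(2).item() ** 2
    return total ** 0.5

# 在 loss.backward() 之后、optimizer.step() 之前调用,并写入TensorBoard:
# writer.add_scalar('Diagnostics/grad_norm', global_grad_norm(model), global_step)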
6. 实战案例
6.1 CartPole调参示例
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
# 创建环境
env = gym.make("CartPole-v1")
# 基线配置(默认参数)
baseline_model = PPO("MlpPolicy", env, verbose=0)
baseline_model.learn(total_timesteps=50000)
baseline_reward, _ = evaluate_policy(baseline_model, env, n_eval_episodes=100)
print(f"基线奖励: {baseline_reward:.2f}")
# 调优配置
tuned_model = PPO(
"MlpPolicy",
env,
learning_rate=5e-4, # 提高学习率加速收敛
n_steps=1024, # 增加采样步数
batch_size=64,
n_epochs=10,
gamma=0.99,
gae_lambda=0.98,
clip_range=0.2,
ent_coef=0.01, # 增加探索
vf_coef=0.5,
max_grad_norm=0.5,
verbose=0
)
tuned_model.learn(total_timesteps=50000)
tuned_reward, _ = evaluate_policy(tuned_model, env, n_eval_episodes=100)
print(f"调优后奖励: {tuned_reward:.2f}")
print(f"提升: {(tuned_reward - baseline_reward) / baseline_reward * 100:.1f}%")
6.2 MuJoCo调参示例
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.noise import NormalActionNoise
# 创建连续控制环境
env = gym.make("HalfCheetah-v3")
# 添加动作噪声
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(
mean=np.zeros(n_actions),
sigma=0.1 * np.ones(n_actions)
)
# SAC配置
model = SAC(
"MlpPolicy",
env,
learning_rate=3e-4,
buffer_size=1000000,
learning_starts=10000, # 预填充回放缓冲区
batch_size=256,
tau=0.005, # 软更新系数
gamma=0.99,
train_freq=1,
gradient_steps=1,
action_noise=action_noise,
ent_coef='auto', # 自动调整熵系数
target_entropy='auto',
verbose=1
)
# 训练
model.learn(total_timesteps=1000000)
# 评估
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=50)
print(f"平均奖励: {mean_reward:.2f}")
7. 总结与建议
7.1 核心要点
- 从默认值开始:使用经过验证的默认超参数
- 优先调整关键参数:学习率 > 网络架构 > 批量大小
- 系统化搜索:使用自动化工具(Optuna, Ray Tune)
- 监控训练过程:使用TensorBoard或W&B
- 稳定性优先:先让训练稳定,再追求性能
7.2 黄金法则
- ✅ 多次运行取平均:RL结果随机性大,至少3次
- ✅ 保存最佳模型:定期评估并保存检查点
- ✅ 记录所有实验:超参数、代码版本、结果
- ✅ 渐进式调参:每次只改变一个参数
- ✅ 复现性:固定随机种子,记录环境版本
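针对“复现性”一条,下面是一个常见的固定随机种子的小工具(各库的随机源需分别设置;完全确定性还取决于cuDNN设置与环境实现,示例仅供参考):
import random
import numpy as np
import torch

def set_global_seed(seed, env=None):
    """固定Python、NumPy、PyTorch的随机种子;env若支持新版接口则一并设置"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)        # Gym >= 0.26 / Gymnasium 接口
        env.action_space.seed(seed)

set_global_seed(42)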
7.3 常用默认配置速查表
| 算法 | 学习率 | 批量大小 | γ | 其他关键参数 |
|---|---|---|---|---|
| DQN | 1e-4 | 32 | 0.99 | target_update=10000, buffer=1M |
| PPO | 3e-4 | 64 | 0.99 | clip=0.2, epochs=10, GAE_λ=0.95 |
| SAC | 3e-4 | 256 | 0.99 | τ=0.005, auto_alpha=True |
| TD3 | 3e-4 | 100 | 0.99 | τ=0.005, policy_delay=2 |
| A2C | 7e-4 | 128 | 0.99 | ent_coef=0.01, vf_coef=0.5 |
参考资源
- 实用工具:
- RL Baselines3 Zoo: https://github.com/DLR-RM/rl-baselines3-zoo
- Optuna: https://optuna.org/
- 论文:
- Henderson et al. (2018) “Deep Reinforcement Learning that Matters”
- Engstrom et al. (2020) “Implementation Matters in Deep RL”
- 博客:
- OpenAI Spinning Up: https://spinningup.openai.com/
- Stable Baselines3 文档: https://stable-baselines3.readthedocs.io/
通过系统化的调参方法和工程技巧,你可以显著提升强化学习算法的性能和稳定性。记住,调参是一个迭代过程,需要耐心和经验积累。希望本文能帮助你在RL项目中少走弯路!