Reinforcement Learning Hyperparameter Tuning: Tips and Practical Experience

Hyperparameter optimization, stability improvements, and sample-efficiency techniques

Posted by 冯宇 on August 5, 2024

Introduction

Training reinforcement learning (RL) algorithms is notoriously challenging: unstable training, slow convergence, and large performance swings are all routine. Compared with supervised learning, hyperparameter tuning in RL is harder because of:

  1. Non-stationarity: the data distribution keeps shifting as the policy updates
  2. Sparse rewards: feedback signals are delayed and infrequent
  3. High variance: gradient estimates are noisy
  4. Hyperparameter sensitivity: tiny parameter changes can produce completely different results

This article systematically summarizes practical RL tuning experience: how the key hyperparameters work, tuning strategies, fixes for common problems, and techniques for improving stability and sample efficiency.

1. Key Hyperparameters in Detail

1.1 Learning Rate

Role: controls the step size of parameter updates

Effects

  • Too large: unstable training, wild performance swings, possibly divergence
  • Too small: slow convergence, risk of stalling in a poor local optimum

Recommended ranges

| Algorithm family | Recommended range | Typical value |
|---|---|---|
| DQN family | $1 \times 10^{-4}$ ~ $5 \times 10^{-4}$ | $2.5 \times 10^{-4}$ |
| A2C/A3C | $7 \times 10^{-4}$ ~ $1 \times 10^{-3}$ | $7 \times 10^{-4}$ |
| PPO | $1 \times 10^{-4}$ ~ $3 \times 10^{-3}$ | $3 \times 10^{-4}$ |
| SAC | $3 \times 10^{-4}$ ~ $1 \times 10^{-3}$ | $3 \times 10^{-4}$ |
| TD3 | $1 \times 10^{-3}$ ~ $3 \times 10^{-3}$ | $3 \times 10^{-3}$ |

Learning rate schedules

import torch.optim as optim

# 1. Linear decay
def linear_schedule(initial_lr, final_lr, max_timesteps):
    def lr_schedule(timestep):
        progress = timestep / max_timesteps
        return initial_lr + (final_lr - initial_lr) * progress
    return lr_schedule

# 2. Exponential decay
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

# 3. Cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-6
)

# 4. Adaptive optimizer (Adam is the usual choice)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

Tuning tips

  1. Start from the defaults: test the typical values first
  2. Watch the training curves
    • Loss oscillates violently → lower the learning rate
    • Convergence is too slow → raise the learning rate
  3. Grid search: search on a log scale over [1e-5, 1e-2]
  4. Use a learning-rate finder, e.g.:
import numpy as np

def find_optimal_lr(model, env, lr_min=1e-6, lr_max=1e-2, num_steps=100):
    """Learning-rate finder (sketch; assumes a train_step helper that
    runs a few updates and returns the resulting loss)."""
    lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), num_steps)
    losses = []
    
    for lr in lrs:
        optimizer = optim.Adam(model.parameters(), lr=lr)
        # Train for a few steps and record the loss
        loss = train_step(model, optimizer, env)
        losses.append(loss)
        
        if loss > 2 * min(losses):  # loss exploded, stop early
            break
    
    # Plot the learning-rate/loss curve
    import matplotlib.pyplot as plt
    plt.semilogx(lrs[:len(losses)], losses)
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Learning Rate Finder')
    plt.show()
    
    # Pick the steepest descent (most negative loss gradient)
    optimal_idx = np.argmin(np.gradient(losses))
    return lrs[optimal_idx]

1.2 Discount Factor (γ)

Role: controls how strongly future rewards are weighted

\[G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]

Effects

  • Close to 1 (e.g. 0.99): emphasizes long-term return; suits tasks with delayed rewards
  • Smaller (e.g. 0.9): emphasizes immediate rewards; suits short-horizon decision tasks

Recommended ranges

| Task profile | Recommended γ |
|---|---|
| Short tasks (<100 steps) | 0.9 ~ 0.95 |
| Medium tasks (100–1000 steps) | 0.95 ~ 0.99 |
| Long tasks (>1000 steps) | 0.99 ~ 0.999 |
| Infinite-horizon tasks | 0.99 |

Practical examples

# Atari games (long horizon)
gamma = 0.99

# CartPole (short horizon)
gamma = 0.95

# MuJoCo continuous control (medium-to-long horizon)
gamma = 0.99

Notes

  • Larger γ increases the variance of value estimates
  • Smaller γ makes the algorithm more "myopic"
  • A common practice is to fix γ = 0.99 first and tune other parameters before revisiting it (a quick sanity check follows below)
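
A quick sanity check for the table above is the effective-horizon heuristic: rewards further than roughly $1/(1-\gamma)$ steps away are discounted to near-irrelevance. This is a rule of thumb, not an exact bound; a minimal sketch:

# Effective-horizon rule of thumb: gamma discounts rewards
# beyond ~1/(1-gamma) steps to near-irrelevance
for gamma in [0.9, 0.95, 0.99, 0.999]:
    horizon = 1.0 / (1.0 - gamma)
    print(f"gamma={gamma}: effective horizon ~ {horizon:.0f} steps")

The printed horizons (10, 20, 100, and 1000 steps) line up roughly with the episode-length buckets in the table.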

1.3 Batch Size

Role: the number of samples used per update

Effects

  • Large batches
    • ✅ More accurate gradient estimates, stabler training
    • ✅ Better GPU utilization and compute efficiency
    • ❌ Lower sample efficiency; more environment interaction needed
  • Small batches
    • ✅ Higher sample efficiency
    • ❌ Noisier gradients, less stable training

Recommended values

| Algorithm | Batch size |
|---|---|
| DQN | 32 ~ 128 |
| PPO | 64 ~ 2048 |
| SAC | 256 ~ 1024 |
| A2C/A3C | 128 ~ 256 |

Dynamic batch size

def adaptive_batch_size(timestep, min_batch=32, max_batch=512):
    """Adjust the batch size with training progress."""
    # Small batches early for exploration, large batches later for stability
    progress = min(1.0, timestep / 1e6)
    batch_size = int(min_batch + (max_batch - min_batch) * progress)
    return batch_size

1.4 Exploration Rate (ε)

Role: balances exploration and exploitation

ε-greedy policy

\[a = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_a Q(s,a) & \text{with probability } 1-\epsilon \end{cases}\]

Decay schedules

# 1. Linear decay (most common)
def linear_epsilon_decay(timestep, epsilon_start=1.0, epsilon_end=0.01, 
                         decay_steps=1000000):
    epsilon = epsilon_start - (epsilon_start - epsilon_end) * min(1.0, timestep / decay_steps)
    return epsilon

# 2. Exponential decay
def exponential_epsilon_decay(timestep, epsilon_start=1.0, epsilon_end=0.01, 
                              decay_rate=0.99995):
    epsilon = max(epsilon_end, epsilon_start * (decay_rate ** timestep))
    return epsilon

# 3. Piecewise decay
def piecewise_epsilon(timestep):
    if timestep < 500000:
        return 1.0
    elif timestep < 1000000:
        return 0.5
    elif timestep < 2000000:
        return 0.1
    else:
        return 0.01

# Usage example
timestep = 0
for episode in range(num_episodes):
    epsilon = linear_epsilon_decay(timestep)
    
    state = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = select_action(state)  # exploit
        
        state, reward, done, _ = env.step(action)
        timestep += 1

Recommended configuration

| Parameter | DQN | PPO |
|---|---|---|
| Initial ε | 1.0 | N/A (uses entropy regularization) |
| Final ε | 0.01 ~ 0.05 | N/A |
| Decay steps | 1M ~ 10M | N/A |

Alternative: entropy regularization (for policy-gradient algorithms)

# Entropy bonus in PPO
entropy_coef = 0.01
loss = policy_loss + value_loss - entropy_coef * entropy
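
For a discrete action space, the entropy term in the loss above can be computed directly from the policy's action distribution. A minimal sketch using torch.distributions; the function and variable names here are illustrative, not from any particular library:

import torch
from torch.distributions import Categorical

def entropy_regularized_loss(logits, policy_loss, value_loss,
                             value_coef=0.5, entropy_coef=0.01):
    """Combine policy/value losses with an entropy bonus (illustrative)."""
    dist = Categorical(logits=logits)  # pi(a|s) built from network logits
    entropy = dist.entropy().mean()    # high entropy = more stochastic policy
    # Subtracting the entropy term rewards keeping the policy exploratory
    return policy_loss + value_coef * value_loss - entropy_coef * entropy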

1.5 Target Network Update Frequency

Role: stabilizes Q-value estimates and mitigates the moving-target problem

DQN's target network

class DQN:
    def __init__(self):
        self.q_network = QNetwork()
        self.target_network = QNetwork()
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        self.target_update_freq = 10000  # update every 10000 steps
        
    def update_target_network(self, timestep):
        if timestep % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())

Soft update

\[\theta_{\text{target}} \leftarrow \tau \theta + (1-\tau) \theta_{\text{target}}\]

def soft_update(target_model, source_model, tau=0.005):
    """Soft-update the target network toward the online network."""
    for target_param, param in zip(target_model.parameters(), 
                                   source_model.parameters()):
        target_param.data.copy_(
            tau * param.data + (1.0 - tau) * target_param.data
        )

# Call after every gradient update
soft_update(target_network, q_network, tau=0.005)

Recommended configuration

| Update scheme | DQN | DDPG/TD3/SAC |
|---|---|---|
| Hard update interval | 10000 ~ 50000 steps | not recommended |
| Soft update τ | N/A | 0.001 ~ 0.01 |

1.6 Replay Buffer Size

Role: stores past experience and breaks temporal correlation in the training data

Effects

  • Large buffers (1M+):
    • ✅ High data diversity, decorrelated samples
    • ❌ Large memory footprint; may contain data from stale policies
  • Small buffers (10K–100K):
    • ✅ Fresher data
    • ❌ Risk of overfitting to recent experience

Recommended sizes

| Algorithm | Buffer size |
|---|---|
| DQN | 1M |
| DDPG | 1M |
| SAC | 1M |
| TD3 | 1M |
| PPO | no replay buffer |

Prioritized Experience Replay (PER)

import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha  # priority exponent
        self.beta = beta    # importance-sampling exponent
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.position = 0
        
    def add(self, state, action, reward, next_state, done):
        max_priority = self.priorities.max() if self.buffer else 1.0
        
        if len(self.buffer) < self.capacity:
            self.buffer.append((state, action, reward, next_state, done))
        else:
            self.buffer[self.position] = (state, action, reward, next_state, done)
        
        self.priorities[self.position] = max_priority
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        if len(self.buffer) == self.capacity:
            priorities = self.priorities
        else:
            priorities = self.priorities[:len(self.buffer)]
        
        # Sampling probabilities from priorities
        probabilities = priorities ** self.alpha
        probabilities /= probabilities.sum()
        
        # Draw the batch
        indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)
        samples = [self.buffer[idx] for idx in indices]
        
        # Importance-sampling weights
        total = len(self.buffer)
        weights = (total * probabilities[indices]) ** (-self.beta)
        weights /= weights.max()
        
        return samples, indices, weights
    
    def update_priorities(self, indices, priorities):
        for idx, priority in zip(indices, priorities):
            self.priorities[idx] = priority
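
The class above stores and samples transitions, but the importance-sampling weights still have to enter the TD loss. A hedged sketch of one DQN update step using this buffer (q_network, target_network, and optimizer are assumed to exist elsewhere; the 1e-6 priority floor is a common convention):

import numpy as np
import torch

def per_dqn_update(buffer, q_network, target_network, optimizer,
                   batch_size=32, gamma=0.99):
    samples, indices, weights = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = zip(*samples)
    
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)
    weights = torch.as_tensor(weights, dtype=torch.float32)
    
    # Q(s, a) for the taken actions
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    
    # Importance-sampling weights correct the bias from prioritized sampling
    td_errors = q_values - targets
    loss = (weights * td_errors.pow(2)).mean()
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # New priorities: |TD error| plus a small floor to keep them positive
    buffer.update_priorities(indices, td_errors.abs().detach().numpy() + 1e-6)
    return loss.item()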

2. Algorithm-Specific Hyperparameters

2.1 PPO (Proximal Policy Optimization)

Key hyperparameters

| Parameter | Meaning | Typical value | Role |
|---|---|---|---|
| clip_range | policy clipping range | 0.1 ~ 0.3 | limits the size of policy updates |
| n_epochs | training epochs per batch of data | 3 ~ 10 | improves sample efficiency |
| gae_lambda | GAE parameter λ | 0.95 ~ 0.99 | trades off bias and variance |
| value_coef | value-loss coefficient | 0.5 ~ 1.0 | weight of value-function learning |
| entropy_coef | entropy-regularization coefficient | 0.01 ~ 0.1 | encourages exploration |

Recommended configuration (Stable Baselines3 defaults):

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,        # steps collected per update
    batch_size=64,       # minibatch size
    n_epochs=10,         # training epochs
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,      # clipping range
    clip_range_vf=None,  # value-function clipping (optional)
    ent_coef=0.0,        # entropy coefficient
    vf_coef=0.5,         # value-function coefficient
    max_grad_norm=0.5,   # gradient clipping
    verbose=1
)

Tuning advice

  1. clip_range
    • Training unstable → reduce to 0.1
    • Convergence too slow → increase to 0.3
  2. n_epochs
    • Simple tasks: 3–5 epochs
    • Complex tasks: 10–15 epochs
    • Watch for overfitting: compare training and held-out evaluation performance
  3. entropy_coef
    • Start larger (0.01–0.1) to encourage exploration
    • Decay toward 0 later to exploit the learned policy

def adaptive_entropy_coef(timestep, initial=0.1, final=0.001, decay_steps=1e6):
    progress = min(1.0, timestep / decay_steps)
    return initial + (final - initial) * progress

2.2 SAC (Soft Actor-Critic)

Key hyperparameters

| Parameter | Meaning | Typical value |
|---|---|---|
| temperature (α) | entropy temperature coefficient | 0.2 (auto-tuned) |
| tau | soft-update coefficient | 0.005 |
| target_entropy | target entropy | $-\dim(\mathcal{A})$ |
| learning_starts | steps before learning starts | 10000 |

Automatic temperature tuning

class SAC:
    def __init__(self, action_dim):
        # Automatically tune the entropy temperature
        self.target_entropy = -action_dim  # heuristic target entropy
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.alpha_optimizer = optim.Adam([self.log_alpha], lr=3e-4)
    
    def update_alpha(self, entropy):
        """Update the temperature parameter."""
        # Drive the policy entropy toward the target: alpha grows when
        # entropy < target_entropy and shrinks when entropy > target_entropy
        alpha_loss = (self.log_alpha * (entropy - self.target_entropy).detach()).mean()
        
        self.alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.alpha_optimizer.step()
        
        alpha = self.log_alpha.exp()
        return alpha

2.3 DQN and Its Variants

Rainbow DQN components

class RainbowDQN:
    def __init__(self):
        # 1. Double DQN: reduces Q-value overestimation
        self.use_double_dqn = True
        
        # 2. Dueling DQN: separates the value and advantage streams
        self.use_dueling = True
        
        # 3. Prioritized Experience Replay
        self.use_per = True
        self.per_alpha = 0.6
        self.per_beta = 0.4
        
        # 4. Multi-step Learning
        self.n_step = 3
        
        # 5. Distributional RL (C51)
        self.use_distributional = True
        self.v_min = -10
        self.v_max = 10
        self.n_atoms = 51
        
        # 6. Noisy Networks: replaces ε-greedy exploration
        self.use_noisy_net = True

Recommended combinations

  • Simple tasks: Double DQN + Dueling
  • Medium tasks: + PER
  • Complex tasks: Rainbow (all components); a sketch of the Noisy Networks component follows below
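
The Noisy Networks entry above replaces ε-greedy with learned parameter noise. A minimal sketch of a factorized noisy linear layer in the spirit of Fortunato et al. (2018); this is an illustrative implementation, not taken from any particular library:

import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Factorized Gaussian noisy linear layer (learned exploration noise)."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable mean and noise-scale parameters
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise buffers, resampled via reset_noise()
        self.register_buffer('weight_eps', torch.empty(out_features, in_features))
        self.register_buffer('bias_eps', torch.empty(out_features))
        self.sigma_init = sigma_init
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(eps_out.outer(eps_in))  # factorized noise
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:
            weight, bias = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, weight, bias)

Call reset_noise() after each gradient update so exploration noise is resampled; at evaluation time (module.eval()) the layer falls back to its mean weights.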

3. Techniques for Improving Training Stability

3.1 Gradient Clipping

Problem: gradients in RL frequently explode or vanish

Solution

import torch.nn.utils as nn_utils

# Method 1: clip by norm (most common)
nn_utils.clip_grad_norm_(model.parameters(), max_norm=0.5)

# Method 2: clip by value
nn_utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Full training loop
for epoch in range(num_epochs):
    loss = compute_loss()
    
    optimizer.zero_grad()
    loss.backward()
    
    # Gradient clipping
    nn_utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
    
    optimizer.step()

Recommended values

  • PPO: max_norm=0.5
  • DQN: max_norm=10
  • SAC: max_norm=1.0
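
One practical detail worth knowing: clip_grad_norm_ returns the total gradient norm measured before clipping, so it doubles as a free diagnostic for choosing max_norm. A small sketch (the writer and global_step names follow the TensorBoard example in Section 5.2):

# clip_grad_norm_ returns the pre-clipping total norm
grad_norm = nn_utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
writer.add_scalar('Diagnostics/grad_norm', grad_norm, global_step)
# If the logged norm rarely exceeds max_norm, clipping is inactive and the
# threshold can be raised; if it is clipped on every step, the learning
# rate may be too high.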

3.2 Reward Normalization and Clipping

Problem: widely varying reward scales destabilize training

Solution

import numpy as np

class RewardNormalizer:
    def __init__(self, gamma=0.99, epsilon=1e-8):
        self.gamma = gamma
        self.epsilon = epsilon
        self.returns = []
        self.mean = 0
        self.var = 1
        self.count = 0
    
    def update(self, reward):
        """Incrementally update the return statistics."""
        self.returns.append(reward)
        self.count += 1
        
        # Keep a sliding window of the most recent rewards
        if len(self.returns) > 1000:
            self.returns.pop(0)
        
        # Mean and variance of discounted returns over the window
        discounted_returns = []
        R = 0
        for r in reversed(self.returns):
            R = r + self.gamma * R
            discounted_returns.insert(0, R)
        
        self.mean = np.mean(discounted_returns)
        self.var = np.var(discounted_returns) + self.epsilon
    
    def normalize(self, reward):
        """Scale a reward by the running return standard deviation."""
        return reward / np.sqrt(self.var)

# Reward clipping
def clip_reward(reward, min_value=-10, max_value=10):
    return np.clip(reward, min_value, max_value)

# Common for Atari: sign clipping
def sign_clip_reward(reward):
    return np.sign(reward)

3.3 Observation Normalization

VecNormalize wrapper (Stable Baselines3):

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Create a vectorized environment
env = DummyVecEnv([lambda: gym.make("LunarLander-v2")])

# Normalize observations and rewards
env = VecNormalize(
    env,
    norm_obs=True,      # normalize observations
    norm_reward=True,   # normalize rewards
    clip_obs=10.0,      # clip observations
    clip_reward=10.0,   # clip rewards
    gamma=0.99
)

# Train
model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100000)

# Save the normalization statistics
env.save("vec_normalize.pkl")

# Load them at test time
env = VecNormalize.load("vec_normalize.pkl", env)
env.training = False  # do not update statistics
env.norm_reward = False  # do not normalize rewards at test time

3.4 Network Initialization

Orthogonal initialization:

import numpy as np
import torch.nn as nn

def orthogonal_init(module, gain=1.0):
    """Orthogonal initialization."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight, gain=gain)
        if module.bias is not None:
            module.bias.data.fill_(0.0)

# Apply to a network
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.actor = nn.Linear(state_dim, action_dim)
        self.critic = nn.Linear(state_dim, 1)
        
        # Use orthogonal initialization throughout
        self.apply(lambda m: orthogonal_init(m, gain=np.sqrt(2)))
        
        # Use a small gain for the policy output layer
        orthogonal_init(self.actor, gain=0.01)

Recommended configuration

  • Hidden layers: gain = $\sqrt{2}$ (matches ReLU activations)
  • Policy output layer: gain = 0.01 (small initial values avoid oversized early policy updates)
  • Value output layer: gain = 1.0

3.5 Learning Rate Warmup

class WarmupScheduler:
    def __init__(self, optimizer, warmup_steps, initial_lr, target_lr):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.initial_lr = initial_lr
        self.target_lr = target_lr
        self.current_step = 0
    
    def step(self):
        self.current_step += 1
        if self.current_step <= self.warmup_steps:
            # Interpolate linearly from initial_lr to target_lr
            lr = self.initial_lr + (self.target_lr - self.initial_lr) * \
                 (self.current_step / self.warmup_steps)
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = lr

# Usage example
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)
scheduler = WarmupScheduler(optimizer, warmup_steps=10000, 
                           initial_lr=1e-6, target_lr=3e-4)

for step in range(total_steps):
    # Training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Update the learning rate
    scheduler.step()

4. Techniques for Improving Sample Efficiency

4.1 Generalized Advantage Estimation (GAE)

Problem: one-step TD targets are low-variance but biased, while Monte Carlo returns are unbiased but high-variance

Solution: GAE interpolates between the two

\[\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}\]

where the TD error is $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """
    Compute generalized advantage estimates.
    
    Args:
        rewards: reward sequence [T]
        values: value estimates [T+1]
        dones: termination flags [T]
        gamma: discount factor
        gae_lambda: GAE parameter
    
    Returns:
        advantages: advantage estimates [T]
        returns: return estimates [T]
    """
    advantages = []
    gae = 0
    
    for t in reversed(range(len(rewards))):
        # values has length T+1, so values[t + 1] exists for every t
        next_value = values[t + 1]
        
        # TD error
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        
        # GAE recursion
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    
    advantages = np.array(advantages)
    returns = advantages + values[:-1]
    
    return advantages, returns

# Normalize advantages (improves stability)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

Choosing λ

  • λ=0: pure one-step TD (low variance, high bias)
  • λ=1: pure Monte Carlo (high variance, unbiased)
  • λ=0.95: the usual compromise

4.2 N-step Returns

Idea: use multi-step returns instead of one-step TD targets

\[G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\]

from collections import deque

import numpy as np

class NStepReplayBuffer:
    def __init__(self, capacity, n_step=3, gamma=0.99):
        self.capacity = capacity
        self.n_step = n_step
        self.gamma = gamma
        self.buffer = deque(maxlen=capacity)
        self.n_step_buffer = deque(maxlen=n_step)
    
    def add(self, state, action, reward, next_state, done):
        # Append to the n-step window
        self.n_step_buffer.append((state, action, reward, next_state, done))
        
        if len(self.n_step_buffer) < self.n_step:
            return
        
        # Accumulate the discounted n-step reward
        n_step_reward = 0
        for i, (_, _, r, _, _) in enumerate(self.n_step_buffer):
            n_step_reward += (self.gamma ** i) * r
        
        # First state/action and the state n steps later
        state_0 = self.n_step_buffer[0][0]
        action_0 = self.n_step_buffer[0][1]
        next_state_n = self.n_step_buffer[-1][3]
        done_n = self.n_step_buffer[-1][4]
        
        # Store the collapsed n-step transition
        self.buffer.append((state_0, action_0, n_step_reward, 
                          next_state_n, done_n))
    
    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size)
        return [self.buffer[i] for i in indices]

Recommended values

  • Simple tasks: n=1 (single step)
  • Medium tasks: n=3–5
  • Complex tasks: n=5–10 (beware the extra variance)

4.3 Parallel (Vectorized) Environments

Faster sampling

from stable_baselines3.common.vec_env import SubprocVecEnv, DummyVecEnv

def make_env(env_id, rank, seed=0):
    def _init():
        env = gym.make(env_id)
        env.reset(seed=seed + rank)
        return env
    return _init

if __name__ == '__main__':
    num_envs = 8  # number of parallel environments
    env_id = "CartPole-v1"
    
    # Multiprocess version (recommended)
    env = SubprocVecEnv([make_env(env_id, i) for i in range(num_envs)])
    
    # Or single-process (for debugging)
    # env = DummyVecEnv([make_env(env_id, i) for i in range(num_envs)])
    
    # Train
    model = PPO("MlpPolicy", env, n_steps=128, batch_size=256)
    model.learn(total_timesteps=100000)

Benefits

  • Faster data collection (roughly linear speedup with the number of workers)
  • More diverse data
  • Better use of multi-core CPUs

4.4 Curriculum Learning

Idea: progress gradually from easy tasks to hard ones

class CurriculumEnv:
    def __init__(self, base_env, difficulty_schedule):
        self.env = base_env
        self.difficulty_schedule = difficulty_schedule
        self.timestep = 0
    
    def reset(self):
        self.timestep += 1
        # Adjust difficulty according to training progress
        difficulty = self.difficulty_schedule(self.timestep)
        self.env.set_difficulty(difficulty)  # assumes the env exposes set_difficulty
        return self.env.reset()
    
    def step(self, action):
        return self.env.step(action)

# Difficulty schedules
def linear_difficulty_schedule(timestep, max_timestep=1e6):
    """Increase difficulty linearly with training time."""
    return min(1.0, timestep / max_timestep)

def threshold_difficulty_schedule(success_rate, thresholds, base_difficulty=0.0):
    """Step difficulty up based on the current success rate.
    thresholds: (success_threshold, difficulty) pairs, highest threshold first."""
    for threshold, difficulty in thresholds:
        if success_rate > threshold:
            return difficulty
    return base_difficulty

5. Tuning Workflow and Best Practices

5.1 A Systematic Tuning Workflow

Step 1: Establish a baseline

# 1. Run with the default hyperparameters
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# 2. Record the performance metric
baseline_reward, _ = evaluate_policy(model, env, n_eval_episodes=100)
print(f"Baseline mean reward: {baseline_reward}")

Step 2: Rank hyperparameters by importance

Priority (high to low):

  1. Learning rate: the biggest lever; tune it first
  2. Network architecture: hidden-layer width and depth
  3. Batch size: affects stability and efficiency
  4. Discount factor γ: depends on the task's horizon
  5. Exploration schedule: ε or the entropy coefficient
  6. Target network update frequency
  7. Other algorithm-specific parameters

Step 3: Grid search or random search

from stable_baselines3.common.evaluation import evaluate_policy
import optuna

def objective(trial):
    """Optuna objective function."""
    # Define the search space (suggest_float with log=True replaces the
    # deprecated suggest_loguniform/suggest_uniform APIs)
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    gamma = trial.suggest_categorical('gamma', [0.9, 0.95, 0.99, 0.995])
    gae_lambda = trial.suggest_categorical('gae_lambda', [0.8, 0.9, 0.95, 0.99])
    clip_range = trial.suggest_float('clip_range', 0.1, 0.4)
    ent_coef = trial.suggest_float('ent_coef', 1e-4, 0.1, log=True)
    
    # Build the model
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=learning_rate,
        gamma=gamma,
        gae_lambda=gae_lambda,
        clip_range=clip_range,
        ent_coef=ent_coef,
        verbose=0
    )
    
    # Train
    model.learn(total_timesteps=50000)
    
    # Evaluate
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    
    return mean_reward

# Run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print("Best hyperparameters:")
print(study.best_params)

Step 4: Fine-tune

# Search finely around the best hyperparameters
best_lr = study.best_params['learning_rate']

fine_tune_lrs = [
    best_lr * 0.5,
    best_lr * 0.75,
    best_lr,
    best_lr * 1.25,
    best_lr * 1.5
]

results = []
for lr in fine_tune_lrs:
    model = PPO("MlpPolicy", env, learning_rate=lr, **other_best_params)
    model.learn(total_timesteps=100000)
    reward, _ = evaluate_policy(model, env)
    results.append((lr, reward))

best_lr_fine = max(results, key=lambda x: x[1])[0]

5.2 Monitoring and Visualization

TensorBoard integration

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/experiment1')

global_step = 0
for episode in range(num_episodes):
    # Run one episode
    episode_reward = 0
    episode_length = 0
    state = env.reset()
    done = False
    
    while not done:
        action = select_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        episode_length += 1
        
        # Log the loss
        loss = update_model()
        writer.add_scalar('Loss/policy_loss', loss, global_step)
        
        state = next_state
        global_step += 1
    
    # Per-episode metrics
    writer.add_scalar('Reward/episode', episode_reward, episode)
    writer.add_scalar('Length/episode', episode_length, episode)
    
    # Current hyperparameter values
    writer.add_scalar('Hyperparameters/learning_rate', current_lr, episode)
    writer.add_scalar('Hyperparameters/epsilon', current_epsilon, episode)

writer.close()

Weights & Biases integration

import wandb

# Initialize a run
wandb.init(
    project="rl-tuning",
    config={
        "learning_rate": 3e-4,
        "gamma": 0.99,
        "architecture": "MLP",
        "environment": "CartPole-v1"
    }
)

# Log during training
for step in range(total_steps):
    # Train and evaluate
    loss = train_step()
    reward = evaluate()
    
    # Log metrics
    wandb.log({
        "loss": loss,
        "reward": reward,
        "epsilon": epsilon,
        "learning_rate": current_lr
    }, step=step)

# Save the best model
wandb.save('best_model.pth')

5.3 Diagnosing Common Problems

| Symptom | Likely cause | Fix |
|---|---|---|
| Training does not converge | learning rate too high | lower the learning rate |
| Very slow convergence | learning rate too low | raise the learning rate |
| Wild performance swings | batch size too small | increase the batch size |
| Oscillating reward curve | exploration rate too high | decay ε faster |
| Diverging value function | target network updated too often | lower the update frequency |
| Overfitting to recent experience | replay buffer too small | enlarge the buffer |
| Stuck in a local optimum | insufficient exploration | increase entropy regularization |
| Exploding gradients | poor network initialization | use orthogonal initialization |
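
Some of these checks are easy to automate. A toy helper with entirely heuristic thresholds — treat it as a starting point, not a rule:

import numpy as np

def diagnose_training(episode_rewards, losses, window=100):
    """Print rough warnings from recent training statistics (heuristic)."""
    recent_losses = np.asarray(losses[-window:], dtype=np.float64)
    recent_rewards = np.asarray(episode_rewards[-window:], dtype=np.float64)
    
    if not np.all(np.isfinite(recent_losses)):
        print("NaN/Inf loss -> lower the learning rate, clip gradients, check reward scale")
    elif recent_losses.std() > 10 * (abs(recent_losses.mean()) + 1e-8):
        print("Loss oscillating -> lower the learning rate or enlarge the batch")
    
    if len(recent_rewards) == window and recent_rewards.std() > abs(recent_rewards.mean()) + 1e-8:
        print("High reward variance -> average over more seeds, check the exploration schedule")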

6. Case Studies

6.1 Tuning CartPole

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Create the environment
env = gym.make("CartPole-v1")

# Baseline (default parameters)
baseline_model = PPO("MlpPolicy", env, verbose=0)
baseline_model.learn(total_timesteps=50000)
baseline_reward, _ = evaluate_policy(baseline_model, env, n_eval_episodes=100)
print(f"Baseline reward: {baseline_reward:.2f}")

# Tuned configuration
tuned_model = PPO(
    "MlpPolicy",
    env,
    learning_rate=5e-4,       # higher learning rate for faster convergence
    n_steps=1024,             # longer rollouts per update
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.98,
    clip_range=0.2,
    ent_coef=0.01,            # extra exploration
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=0
)

tuned_model.learn(total_timesteps=50000)
tuned_reward, _ = evaluate_policy(tuned_model, env, n_eval_episodes=100)
print(f"Tuned reward: {tuned_reward:.2f}")
print(f"Improvement: {(tuned_reward - baseline_reward) / baseline_reward * 100:.1f}%")

6.2 Tuning MuJoCo

import gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.noise import NormalActionNoise

# Create a continuous-control environment
env = gym.make("HalfCheetah-v3")

# Add action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(
    mean=np.zeros(n_actions),
    sigma=0.1 * np.ones(n_actions)
)

# SAC configuration
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=1000000,
    learning_starts=10000,      # pre-fill the replay buffer
    batch_size=256,
    tau=0.005,                  # soft-update coefficient
    gamma=0.99,
    train_freq=1,
    gradient_steps=1,
    action_noise=action_noise,
    ent_coef='auto',            # auto-tune the entropy coefficient
    target_entropy='auto',
    verbose=1
)

# Train
model.learn(total_timesteps=1000000)

# Evaluate
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=50)
print(f"Mean reward: {mean_reward:.2f}")

7. Summary and Recommendations

7.1 Key Takeaways

  1. Start from the defaults: use well-tested default hyperparameters
  2. Tune the critical parameters first: learning rate > network architecture > batch size
  3. Search systematically: use automated tools (Optuna, Ray Tune)
  4. Monitor training: use TensorBoard or W&B
  5. Stability first: get training stable before chasing performance

7.2 Golden Rules

  • Average over multiple runs: RL results are highly stochastic; use at least 3 seeds
  • Save the best model: evaluate periodically and keep checkpoints
  • Log every experiment: hyperparameters, code version, results
  • Change one parameter at a time: tune incrementally
  • Reproducibility: fix random seeds and record environment versions (see the helper below)
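
For the reproducibility rule, a minimal seed-fixing helper, assuming PyTorch and a Gymnasium-style env.reset(seed=...) API:

import random
import numpy as np
import torch

def set_global_seed(seed, env=None):
    """Fix the common sources of randomness in an RL experiment."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)        # Gymnasium-style environment seeding
        env.action_space.seed(seed)
    # Full determinism on GPU may additionally require:
    # torch.backends.cudnn.deterministic = True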

7.3 Default Configuration Cheat Sheet

| Algorithm | Learning rate | Batch size | γ | Other key parameters |
|---|---|---|---|---|
| DQN | 1e-4 | 32 | 0.99 | target_update=10000, buffer=1M |
| PPO | 3e-4 | 64 | 0.99 | clip=0.2, epochs=10, GAE_λ=0.95 |
| SAC | 3e-4 | 256 | 0.99 | τ=0.005, auto_alpha=True |
| TD3 | 3e-4 | 100 | 0.99 | τ=0.005, policy_delay=2 |
| A2C | 7e-4 | 128 | 0.99 | ent_coef=0.01, vf_coef=0.5 |

References

  • Henderson et al. (2018), "Deep Reinforcement Learning that Matters"
  • Engstrom et al. (2020), "Implementation Matters in Deep RL"

With a systematic tuning process and solid engineering practice, you can substantially improve the performance and stability of reinforcement learning algorithms. Remember that tuning is an iterative process that takes patience and accumulated experience. I hope this article helps you avoid some detours in your RL projects!

