[CS 285 DRL Homework 2] Policy Gradients

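Exp 1: Vanilla Policy Gradient

Vanilla (adj.): plain, ordinary, nothing fancy; literally, vanilla-flavored.

Overall structure of the training algorithm

The main training algorithm lives in RL_Trainer.run_training_loop. Tracing the calls made during one iteration, it can be summarized as:

Collect several trajectories until train_batch_size timesteps of data are stored in the replay buffer
Once collection has fully finished, sample the most recent batch_size timesteps from the buffer
Train the model on this data
1. Update the policy first
2. Then update the baseline (not needed in the first section)

Steps one and two run strictly in sequence. Reading the code also shows that, although a buffer sits in between, train_batch_size and batch_size are defined to be equal, so every training batch consists only of data collected after the last model update. The method is therefore still on-policy RL.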
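
The on-policy argument can be checked with a toy model of the loop. The sketch below is not the homework's code: ToyReplayBuffer, sample_recent and run_toy_loop are hypothetical names, and each stored timestep is represented only by the index of the policy version that produced it.

```python
class ToyReplayBuffer:
    """Hypothetical stand-in for the replay buffer: a flat list of timesteps."""

    def __init__(self):
        self.steps = []

    def add(self, new_steps):
        self.steps.extend(new_steps)

    def sample_recent(self, batch_size):
        # Return the most recently stored timesteps.
        return self.steps[-batch_size:]


def run_toy_loop(n_iter, train_batch_size, batch_size):
    buffer = ToyReplayBuffer()
    versions_per_batch = []
    for itr in range(n_iter):
        # "Collect": tag every timestep with the policy version that generated it.
        buffer.add([itr] * train_batch_size)
        batch = buffer.sample_recent(batch_size)
        versions_per_batch.append(set(batch))
        # The policy update would happen here, producing version itr + 1.
    return versions_per_batch


on_policy = run_toy_loop(n_iter=3, train_batch_size=4, batch_size=4)
stale = run_toy_loop(n_iter=3, train_batch_size=4, batch_size=6)
print(on_policy)  # [{0}, {1}, {2}]
print(stale[1])   # {0, 1}
```

With equal sizes every batch contains data from exactly one policy version. If batch_size were larger than train_batch_size, older data would leak into the batch and the on-policy assumption would no longer hold.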

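The first part mainly requires completing PGAgent.train, which in turn involves computing the return estimate Q and performing the policy update.

Two ways to compute $ \hat{Q} $: reward-to-go or not

Note that this is $ \hat{Q} $ (Q-hat), the return of a single Monte Carlo sampled trajectory, not the value of a state-action pair predicted by a neural network as in Q-learning.

Reward-to-go is the Q value that respects causality: the Q value at time t ignores rewards earned before t.

Without causality: simply sum the discounted rewards. The value at every position is independent of t, so all entries are identical.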

    #####################################################
    ################## HELPER FUNCTIONS #################
    #####################################################
    def _discounted_return(self, rewards):
        """
            Helper function
            Input: list of rewards {r_0, r_1, ..., r_t', ... r_T} from a single rollout of length T
            Output: list where each index t contains sum_{t'=0}^T gamma^t' r_{t'}
        """
        discounted_sum, discount = 0, 1
        for rr in rewards:
            discounted_sum += discount * rr
            discount *= self.gamma

        return [discounted_sum for i in range(len(rewards))]

With reward-to-go: use a backward recursion

$$ \begin{align} \hat{Q}_{t}&=\sum_{t'=t}^{T} \gamma^{t'-t} * r_{t'}\\ &=\sum_{t'=t+1}^{T} \gamma^{t'-t} * r_{t'}+r_{t}\\ &=\gamma\sum_{t'=t+1}^{T} \gamma^{t'-t-1} * r_{t'}+r_{t}\\ &=\gamma\hat{Q}_{t+1}+r_{t} \end{align} $$

And we know that:

$$\hat{Q}_{T}=\sum_{t'=T}^{T} \gamma^{t'-T} * r_{t'}=r_{T}$$

    def _discounted_cumsum(self, rewards):
        """
            Helper function which
            -takes a list of rewards {r_0, r_1, ..., r_t', ... r_T},
            -and returns a list where the entry in each index t is sum_{t'=t}^T gamma^(t'-t) * r_{t'}
            (For Reward-to-go)
        """
        rtg_discounted_q = rewards.copy()
        for i in range(len(rtg_discounted_q)-2, -1, -1):
            rtg_discounted_q[i] = self.gamma * (rtg_discounted_q[i+1]) + rewards[i]

        return rtg_discounted_q
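
As a quick sanity check of the two estimators, here is a standalone sketch that takes gamma as an argument instead of reading self.gamma. The function names mirror the helpers above but are free functions written for illustration only.

```python
def discounted_return(rewards, gamma):
    # Full-trajectory return, repeated at every timestep.
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)


def discounted_cumsum(rewards, gamma):
    # Reward-to-go via the backward recursion Q_t = r_t + gamma * Q_{t+1}.
    q = list(rewards)
    for t in range(len(q) - 2, -1, -1):
        q[t] = rewards[t] + gamma * q[t + 1]
    return q


rewards = [1.0, 1.0, 1.0]
full = discounted_return(rewards, 0.5)
rtg = discounted_cumsum(rewards, 0.5)
print(full)  # [1.75, 1.75, 1.75]
print(rtg)   # [1.75, 1.5, 1.0]
```

The two estimates agree at t = 0 and diverge afterwards, because reward-to-go drops the rewards that were collected before each timestep.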

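Policy update

Mathematically, policy optimization adjusts the distribution of the policy's probability model so as to maximize the expected return.

With the log-derivative trick, however, the resulting objective can be read as a weighted average of the policy's log-probabilities, where each weight is the summed return $ \hat{Q} $. Decisions that led to higher returns therefore receive larger weights. (This is an intuition, of course, not the mathematical essence.)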

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right) \hat{Q}_{i, t}^{\pi}$$
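
To make the "weighted log-probability" reading concrete, the following sketch uses a single-state softmax policy in NumPy rather than the homework's MLP. All names (softmax, surrogate_loss, policy_gradient_estimate) and the sample data are assumptions made for illustration. It checks numerically that the gradient of the surrogate loss, the negated mean of log-probability times $ \hat{Q} $, equals the negated policy gradient estimate.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def surrogate_loss(theta, actions, q_hat):
    # Negated Q-weighted mean of log pi(a); minimizing it ascends the policy gradient.
    log_probs = np.log(softmax(theta))[actions]
    return -np.mean(log_probs * q_hat)


def policy_gradient_estimate(theta, actions, q_hat):
    # For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
    probs = softmax(theta)
    grad_log_pi = np.eye(len(theta))[actions] - probs
    return np.mean(grad_log_pi * q_hat[:, None], axis=0)


theta = np.array([0.1, -0.3, 0.5])        # logits of a 3-action policy
actions = np.array([0, 2, 2, 1])          # sampled actions
q_hat = np.array([1.0, 2.0, 0.5, -1.0])   # Monte Carlo return for each sample

g = policy_gradient_estimate(theta, actions, q_hat)

# Central finite differences of the surrogate loss, one coordinate at a time.
eps = 1e-6
fd = np.array([
    (surrogate_loss(theta + eps * e, actions, q_hat)
     - surrogate_loss(theta - eps * e, actions, q_hat)) / (2 * eps)
    for e in np.eye(len(theta))
])
print(np.allclose(g, -fd, atol=1e-6))
```

This is why the update code can simply call backward() on a loss of the form -(log_probs * advantages).mean().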

class MLPPolicyPG(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):

        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.baseline_loss = nn.MSELoss()

    def update(self, observations, actions, advantages, q_values=None):
        # Convert the numpy batches to tensors exactly once; calling
        # ptu.from_numpy a second time on a tensor would fail.
        observations = ptu.from_numpy(observations)
        actions = ptu.from_numpy(actions)
        advantages = ptu.from_numpy(advantages)

        self.optimizer.zero_grad()
        action_distribution = self.forward(observations)
        log_probs = action_distribution.log_prob(actions)
        # Minimizing the negated weighted log-probability ascends the policy gradient.
        loss = -torch.mul(log_probs, advantages).mean()
        loss.backward()
        self.optimizer.step()

        if self.nn_baseline:
            pass # omitted
        train_log = {
            'Training Loss': ptu.to_numpy(loss),
        }
        return train_log

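Exp 2: Neural Network Baselines

Using the critic model as a baseline or as a critic

This material is actually only taught in the Actor-Critic lectures of the course.

A critic model is a learner (a neural network, for example) that takes a state (optionally together with an action) as input and outputs the value of that state (or state-action pair).

Although a critic network exists in this section, it is used as a baseline, so the method is still policy gradient rather than actor-critic. The distinguishing rule: in policy gradient, the minuend of the advantage in the policy update is still the Monte Carlo sample $ \hat{Q} $.

Training the baseline model

As noted above, although a buffer sits in between, train_batch_size and batch_size are defined to be equal, so every training batch consists only of data collected after the last update.

The buffer is therefore only a surface detail. In essence this is still on-policy RL, and the on-policy assumption holds.

The baseline is trained by function approximation: the feature-extraction and fitting capacity of a neural network lets it recognize states that differ but are similar, and respond to them similarly (here, by predicting similar values).

This can be seen as an extension of the Monte Carlo method. Classic Monte Carlo estimation needs many samples from exactly the same state ( $ V^{\pi}\left(\mathbf{s}_{t}\right) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{i, t^{\prime}}, \mathbf{a}_{i, t^{\prime}}\right) $ ), which is impossible in most reinforcement learning environments.

The implementation here additionally applies time discounting to the rewards.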

class MLPPolicyPG(MLPPolicy):
    def __init__(self, ac_dim, ob_dim, n_layers, size, **kwargs):

        super().__init__(ac_dim, ob_dim, n_layers, size, **kwargs)
        self.baseline_loss = nn.MSELoss()

    def update(self, observations, actions, advantages, q_values=None):
        # Updating Policy (omitted)

        if self.nn_baseline:
            ## Update the neural network baseline using the q_values as
            ## targets. The q_values are first normalized to have a mean
            ## of zero and a standard deviation of one.

            ## Note: the targets must be converted into a tensor using
                ## ptu.from_numpy before they are used in the loss
            assert q_values is not None
            self.baseline_optimizer.zero_grad()
            q_values = ptu.from_numpy(q_values) if isinstance(q_values, np.ndarray) else q_values
            q_mean, q_std = torch.mean(q_values), torch.std(q_values)
            ## small epsilon guards against division by zero when all targets are equal
            q_values = (q_values - q_mean) / (q_std + 1e-8)
            ## drop a trailing size-1 output dimension so that predictions and
            ## targets have the same shape and the loss does not broadcast silently
            values = self.baseline(observations).squeeze(-1)
            b_loss = self.baseline_loss(values, q_values)
            b_loss.backward()
            self.baseline_optimizer.step()

        train_log = {
            'Training Loss': ptu.to_numpy(loss),
        }
        return train_log
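
Shape mismatches between baseline predictions and targets are easy to miss, because the loss still evaluates to a number. The NumPy sketch below uses made-up arrays and assumes a network with a single output unit that returns shape (N, 1); it shows how a squared error broadcasts to an N-by-N matrix when the targets have shape (N,). Tensors follow the same broadcasting rules.

```python
import numpy as np

targets = np.array([1.0, 2.0, 3.0])         # shape (3,)
preds_2d = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), a perfect prediction

# Broadcasting (3, 1) against (3,) compares every prediction with every target.
diff_shape = (preds_2d - targets).shape
wrong = ((preds_2d - targets) ** 2).mean()

# Squeezing the last axis restores the intended elementwise comparison.
right = ((preds_2d.squeeze(-1) - targets) ** 2).mean()

print(diff_shape)    # (3, 3)
print(wrong, right)  # the first is nonzero although the prediction is perfect
```

Checking shapes once, or squeezing the prediction before the loss, avoids training the baseline against this unintended objective.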

Introducing the baseline

This part is simple:

def estimate_advantage(self, obs: np.ndarray, rews_list: np.ndarray, q_values: np.ndarray, terminals: np.ndarray):

    """
        Computes advantages by (possibly) using GAE, or subtracting a baseline from the estimated Q values
    """

    # Estimate the advantage when nn_baseline is True,
    # by querying the neural network that you're using to learn the value function
    if self.nn_baseline:
        values_unnormalized = self.actor.run_baseline_prediction(obs)
        ## ensure that the value predictions and q_values have the same dimensionality
        ## to prevent silent broadcasting errors
        assert values_unnormalized.ndim == q_values.ndim
        ## TODO: values were trained with standardized q_values, so ensure
            ## that the predictions have the same mean and standard deviation as
            ## the current batch of q_values
        values = values_unnormalized * q_values.std() + q_values.mean()

        batch_size = obs.shape[0]
        if self.gae_lambda is not None:
            pass # TODO
        else:
            ## Compute advantage estimates using q_values, with values as baselines
            advantages = q_values - values

    # Else, just set the advantage to [Q]
    else:
        advantages = q_values.copy()

    # Normalize the resulting advantages to have a mean of zero
    # and a standard deviation of one
    if self.standardize_advantages:
        ad_mean, ad_std = np.average(advantages), np.std(advantages)
        # small epsilon guards against division by zero when all advantages are equal
        advantages = (advantages - ad_mean) / (ad_std + 1e-8)
    return advantages
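
The rescaling step is easiest to see as a round trip. In the sketch below, standardize and unstandardize are hypothetical helper names and the Q values are made up. A baseline that had fit the standardized targets perfectly would, after rescaling with the statistics of the current batch, reproduce the original Q values, so the advantages would all be close to zero.

```python
import numpy as np


def standardize(x, eps=1e-8):
    # Zero mean, unit standard deviation: the form of the baseline's training targets.
    return (x - x.mean()) / (x.std() + eps)


def unstandardize(z, reference):
    # Map standardized predictions back to the scale of the current batch.
    return z * reference.std() + reference.mean()


q_values = np.array([10.0, 12.0, 14.0, 20.0])
perfect_predictions = standardize(q_values)  # output of an ideally trained baseline
values = unstandardize(perfect_predictions, q_values)
advantages = q_values - values

print(np.allclose(values, q_values))  # True
```

Subtracting the raw, standardized predictions from unstandardized Q values would instead leave the advantages almost equal to the Q values, which defeats the purpose of the baseline.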

Exp 3: GAE

def estimate_advantage(self, obs: np.ndarray, rews_list: np.ndarray, q_values: np.ndarray, terminals: np.ndarray):

    """
        Computes advantages by (possibly) using GAE, or subtracting a baseline from the estimated Q values
    """

    # Estimate the advantage when nn_baseline is True,
    # by querying the neural network that you're using to learn the value function
    if self.nn_baseline:
        values_unnormalized = self.actor.run_baseline_prediction(obs)
        assert values_unnormalized.ndim == q_values.ndim
        values = values_unnormalized * q_values.std() + q_values.mean()
        batch_size = obs.shape[0]
        if self.gae_lambda is not None:
            ## append a dummy T+1 value for simpler recursive calculation
            values = np.append(values, [0])

            ## combine rews_list into a single array
            flatten_rews = np.concatenate(rews_list)

            ## create empty numpy array to populate with GAE advantage
            ## estimates, with dummy T+1 value for simpler recursive calculation
            advantages = np.zeros(batch_size + 1)

            for i in reversed(range(batch_size)):
                ## Recursively compute advantage estimates starting from
                    ## timestep T.
                ## terminals[i] is 1 if the state is the last in its
                    ## trajectory, and 0 otherwise.
                if terminals[i]:
                    ## no bootstrapping across a trajectory boundary
                    advantages[i] = flatten_rews[i] - values[i]
                else:
                    delta = flatten_rews[i] + self.gamma * values[i+1] - values[i]
                    ## GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
                    advantages[i] = delta + self.gamma * self.gae_lambda * advantages[i+1]

            ## remove the dummy T+1 advantage
            advantages = advantages[:-1]
        else:
            advantages = q_values - values
    else:
        advantages = q_values.copy()

    # Normalize the resulting advantages to have a mean of zero
    # and a standard deviation of one
    if self.standardize_advantages:
        ad_mean, ad_std = np.average(advantages), np.std(advantages)
        # small epsilon guards against division by zero when all advantages are equal
        advantages = (advantages - ad_mean) / (ad_std + 1e-8)
    return advantages
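
The recursion used for GAE is $ \hat{A}_{t} = \delta_{t} + \gamma\lambda\hat{A}_{t+1} $ with $ \delta_{t} = r_{t} + \gamma V(\mathbf{s}_{t+1}) - V(\mathbf{s}_{t}) $. The standalone sketch below uses a free function, gae_advantages, written for illustration, and made-up rewards and values. It checks the two limiting cases: lambda = 1 recovers reward-to-go minus the baseline, and lambda = 0 recovers the one-step TD error.

```python
import numpy as np


def gae_advantages(rewards, values, terminals, gamma, lam):
    n = len(rewards)
    values = np.append(values, 0.0)  # dummy value after the last timestep
    adv = np.zeros(n + 1)            # dummy advantage after the last timestep
    for i in reversed(range(n)):
        if terminals[i]:
            # Last step of a trajectory: nothing to bootstrap from.
            adv[i] = rewards[i] - values[i]
        else:
            delta = rewards[i] + gamma * values[i + 1] - values[i]
            adv[i] = delta + gamma * lam * adv[i + 1]
    return adv[:-1]


rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.5, 0.5])
terminals = np.array([0, 0, 1])

mc = gae_advantages(rewards, values, terminals, gamma=0.5, lam=1.0)
td = gae_advantages(rewards, values, terminals, gamma=0.5, lam=0.0)
print(mc)  # [1.25 1.   0.5 ]
print(td)  # [0.75 0.75 0.5 ]
```

With gamma = 0.5 the reward-to-go values are [1.75, 1.5, 1.0], so subtracting the baseline of 0.5 gives exactly the lambda = 1 result. Intermediate values of lambda interpolate between the two estimators.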


Petrickstar
