Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Abstract

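The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value for algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and to variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.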

Paper information

Author: Sergey Levine
Source: arXiv
Institution: UC Berkeley

A Graphical Model for Control as Inference

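In this section, we present the basic graphical model that allows us to embed control into the framework of probabilistic graphical models (PGMs), and discuss how this framework can be used to derive variants of several standard reinforcement learning and dynamic programming approaches. The PGM presented in this section corresponds to a generalization of the standard reinforcement learning problem, where the RL objective is augmented with an entropy term. The magnitude of the reward function trades off between reward maximization and entropy maximization, so that the original RL problem is recovered in the limit of infinitely large rewards. We will begin by defining notation, then define the graphical model, and then present several inference methods and describe how they relate to standard algorithms in reinforcement learning and dynamic programming. Finally, we will discuss a few limitations of this method, which motivate the variational approach in Section 3.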

The Decision Making Problem and Terminology

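First, we introduce the notation used for the standard optimal control or reinforcement learning formulation. We will use $\mathbf{s}\in \mathcal{S}$ to denote states and $\mathbf{a}\in \mathcal{A}$ to denote actions, each of which may be discrete or continuous. States evolve according to the stochastic dynamics $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, which are in general unknown. We will follow a discrete-time finite-horizon derivation, with horizon $T$, and omit discount factors for now. A discount factor $\gamma$ can be incorporated into this framework simply by modifying the transition dynamics, such that any action produces a transition into an absorbing state with probability $1-\gamma$, and all standard transition probabilities are multiplied by $\gamma$.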

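A task in this framework is defined by a reward function $r(\mathbf{s}_t,\mathbf{a}_t)$. Solving the task typically involves recovering a policy $p(\mathbf{a}_t|\mathbf{s}_t,\theta)$, which specifies a distribution over actions conditioned on the state and parameterized by some parameter vector $\theta$. The standard reinforcement learning policy search problem is then given by the following maximization: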

$\theta^{\star}=\arg \max _{\theta} \sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim p\left(\mathbf{s}_{t}, \mathbf{a}_{t} | \theta\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \tag{1}$

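This optimization problem aims to find a vector of policy parameters $\theta$ that maximizes the total expected reward $\sum_t r(\mathbf{s}_t,\mathbf{a}_t)$ of the policy. The expectation is taken under the policy's trajectory distribution $p(\tau)$, given by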

$p(\tau)=p\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T} | \theta\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{a}_{t} | \mathbf{s}_{t}, \theta\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \tag{2}$

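For conciseness, the action conditional $p(\mathbf{a}_t|\mathbf{s}_t,\theta)$ is often denoted $\pi_\theta(\mathbf{a}_t|\mathbf{s}_t)$, to emphasize that it is given by a parameterized policy with parameters $\theta$. These parameters might correspond, for example, to the weights in a neural network. However, we could just as well embed a standard planning problem in this formulation, by letting $\theta$ denote a sequence of actions in an open-loop plan.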

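Having formulated the decision making problem in this way, the next question we must ask in order to derive the control as inference framework is: how can we formulate a probabilistic graphical model such that the most probable trajectory corresponds to the trajectory of the optimal policy? Or, equivalently, how can we formulate a probabilistic graphical model such that inferring the posterior action conditional $p(\mathbf{a}_t|\mathbf{s}_t,\theta)$ gives us the optimal policy?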

The Graphical Model

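To embed the control problem into a graphical model, we can begin by simply modeling the relationship between states, actions, and next states. This relationship is simple, and corresponds to a graphical model with factors of the form $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, as shown in Figure 1(a). However, this graphical model is insufficient for solving control problems, because it has no notion of rewards or costs. We therefore have to introduce an additional variable into this model, which we will denote $\mathcal{O}_t$. This additional variable is a binary random variable, where $\mathcal{O}_t=1$ denotes that time step $t$ is optimal, and $\mathcal{O}_t=0$ denotes that it is not optimal. We will choose the distribution over this variable to be given by the following equation: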

$p\left(\mathcal{O}_{t}=1 | \mathbf{s}_{t}, \mathbf{a}_{t}\right)=\exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \tag{3}$

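The graphical model with these additional variables is summarized in Figure 1(b). While this might at first seem like a peculiar and arbitrary choice, it leads to a very natural posterior distribution over actions when we condition on $\mathcal{O}_t=1$ for all $t\in\{1,\dots,T\}$: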

$\begin{aligned} p\left(\tau | \mathbf{o}_{1: T}\right) \propto p\left(\tau, \mathbf{o}_{1: T}\right) &=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathcal{O}_{t}=1 | \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ &=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ &=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \end{aligned} \tag{4}$

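That is, the probability of observing a given trajectory is the product of its probability of occurring according to the dynamics (the term in square brackets on the last line) and the exponential of the total reward along that trajectory. This equation is easiest to understand in systems with deterministic dynamics, where the first term is a constant for all trajectories that are dynamically feasible. In this case, the trajectory with the highest reward has the highest probability, and trajectories with lower reward have exponentially lower probability. If we would like to plan an optimal action sequence starting from some initial state $\mathbf{s}_1$, we can condition on $\mathbf{o}_{1:T}$ and choose $p(\mathbf{s}_1)=\delta(\mathbf{s}_1)$, in which case maximum a posteriori inference corresponds to a kind of planning problem. It is easy to see that this corresponds exactly to standard planning or trajectory optimization when the dynamics are deterministic, in which case Equation (4) reduces to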

$p\left(\tau | \mathbf{o}_{1: T}\right) \propto \mathbb{1}[p(\tau) \neq 0] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \tag{5}$

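Here, the indicator function simply indicates that the trajectory $\tau$ is dynamically consistent (meaning that $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)\neq0$) and that the initial state is correct. The case of stochastic dynamics poses some challenges, and will be discussed in detail in Section 3. However, even under deterministic dynamics, we are often interested in recovering a policy rather than a plan. In this PGM, the optimal policy can be written as $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{t:T}=1)$ (for conciseness, we will drop the $=1$ in the remainder of the derivation). This distribution is somewhat analogous to $p(\mathbf{a}_t|\mathbf{s}_t,\theta^*)$ in the previous section, with two major differences: first, it is independent of the parameterization $\theta$, and second, we will see later that it optimizes an objective that is slightly different from the standard reinforcement learning objective in Equation (1).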
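As a rough illustration of Equation (5), the sketch below enumerates every action sequence in a tiny deterministic chain and weights each one by the exponential of its total reward. The chain, the `step` and `reward` functions, and the horizon are all hypothetical choices made for this example, not part of the original text.

```python
import itertools
import math

# Hypothetical 1-D chain: states 0..3, actions -1/+1, deterministic moves clipped to the chain.
def step(s, a):
    return min(max(s + a, 0), 3)

def reward(s, a):
    # Illustrative non-positive reward: zero only when the move lands on state 3.
    return 0.0 if step(s, a) == 3 else -1.0

def trajectory_posterior(s1, horizon):
    """Weight each action sequence from s1 by exp(total reward), then normalize."""
    weights = {}
    for actions in itertools.product((-1, 1), repeat=horizon):
        s, total = s1, 0.0
        for a in actions:
            total += reward(s, a)
            s = step(s, a)
        weights[actions] = math.exp(total)
    z = sum(weights.values())
    return {acts: w / z for acts, w in weights.items()}

posterior = trajectory_posterior(s1=0, horizon=3)
best = max(posterior, key=posterior.get)  # the MAP trajectory is the highest-reward plan
```

Because the dynamics are deterministic, every feasible trajectory has the same dynamics term, so the ratio of two posterior probabilities depends only on the difference in total reward.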

Policy Search as Probabilistic Inference

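We can recover the optimal policy $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{t:T})$ using a standard sum-product inference algorithm, analogously to inference in HMM-style dynamic Bayesian networks. As we will see in this section, it is sufficient to compute backward messages of the form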

$\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$

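These messages have a natural interpretation: they denote the probability that a trajectory is optimal for time steps $t$ through $T$ if it begins in state $\mathbf{s}_t$ with action $\mathbf{a}_t$ (note that $\beta_t(\mathbf{s}_t,\mathbf{a}_t)$ is not a probability density over $\mathbf{s}_t$, $\mathbf{a}_t$, but rather the probability of $\mathcal{O}_{t:T}=1$). Slightly overloading the notation, we will also introduce the message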

$\beta_{t}\left(\mathbf{s}_{t}\right)=p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}\right)$

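These messages denote the probability that the trajectory from $t$ to $T$ is optimal if it begins in state $\mathbf{s}_t$. We can recover the state-only message from the state-action message by integrating out the action: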

$\beta_{t}\left(\mathbf{s}_{t}\right)=p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}\right)=\int_{\mathcal{A}} p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) d \mathbf{a}_{t}=\int_{\mathcal{A}} \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) d \mathbf{a}_{t}$

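The factor $p(\mathbf{a}_t|\mathbf{s}_t)$ is the action prior. Note that it is never conditioned on $\mathcal{O}_{1:T}$ in any way: it does not denote the probability of an optimal action, but simply the prior probability of actions. The PGM in Figure 1 does not actually contain this factor, and for simplicity we can assume that $p\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\frac{1}{|\mathcal{A}|}$, that is, a constant corresponding to a uniform distribution over the set of actions. We will see later that this assumption does not actually introduce any loss of generality, because any non-uniform $p(\mathbf{a}_t|\mathbf{s}_t)$ can be incorporated into $p(\mathcal{O}_t|\mathbf{s}_t, \mathbf{a}_t)$ via the reward function.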

The recursive message passing algorithm for computing $\beta_t(\mathbf{s}_t,\mathbf{a}_t)$ proceeds from the last time step $t = T$ backward through time to $t = 1$. In the base case, we note that $p(\mathcal{O}_T|\mathbf{s}_T, \mathbf{a}_T)$ is simply proportional to $\exp(r(\mathbf{s}_T,\mathbf{a}_T))$, since there is only one factor to consider. The recursive case is then given as follows:

$\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)=\int_{\mathcal{S}} \beta_{t+1}\left(\mathbf{s}_{t+1}\right) p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) d \mathbf{s}_{t+1} \tag{6}$

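From these backward messages, we can derive the optimal policy $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{t:T})$. First, note that $\mathcal{O}_{1:(t-1)}$ is conditionally independent of $\mathbf{a}_t$ given $\mathbf{s}_t$, which means that $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{1:T})=p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{t:T})$, so we can disregard the past when considering the current action distribution. This makes intuitive sense: in a Markovian system, the optimal action does not depend on the past. From this, we can easily recover the optimal action distribution using the two backward messages: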

$p\left(\mathbf{a}_{t} | \mathbf{s}_{t}, \mathcal{O}_{t: T}\right)=\frac{p\left(\mathbf{s}_{t}, \mathbf{a}_{t} | \mathcal{O}_{t: T}\right)}{p\left(\mathbf{s}_{t} | \mathcal{O}_{t: T}\right)}=\frac{p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) p\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)}{p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t}\right)} \propto \frac{p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}{p\left(\mathcal{O}_{t: T} | \mathbf{s}_{t}\right)}=\frac{\beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}{\beta_{t}\left(\mathbf{s}_{t}\right)}$

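Here the order of conditioning in the third step is flipped by using Bayes' rule, cancelling the factor of $p(\mathcal{O}_{t:T})$ that appears in both the numerator and the denominator. The term $p(\mathbf{a}_t|\mathbf{s}_t)$ disappears, since we previously assumed it to be a uniform distribution.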
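The backward recursion above can be sketched for a small tabular problem. This is a minimal illustration under stated assumptions: the tables `P` and `r`, the two-state example, and the convention $\beta_{T+1}(\mathbf{s})=1$ are choices made here, and the action prior is taken to be uniform as in the text.

```python
import math

def backward_messages(P, r, horizon):
    """Tabular backward pass for Equation (6) with a uniform action prior.

    P[s][a][s2] are transition probabilities and r[s][a] are non-positive rewards
    (both hypothetical inputs). Returns a list of per-step policies
    p(a_t | s_t, O_{t:T}), ordered from t = 1 to t = T.
    """
    n_s, n_a = len(r), len(r[0])
    beta_next = [1.0] * n_s  # convention assumed here: beta_{T+1}(s) = 1
    policies = []
    for _ in range(horizon):
        beta_sa = [[math.exp(r[s][a]) * sum(P[s][a][s2] * beta_next[s2] for s2 in range(n_s))
                    for a in range(n_a)] for s in range(n_s)]
        beta_s = [sum(beta_sa[s]) / n_a for s in range(n_s)]  # integrate out the action
        policies.append([[beta_sa[s][a] / (n_a * beta_s[s]) for a in range(n_a)]
                         for s in range(n_s)])
        beta_next = beta_s
    policies.reverse()
    return policies

# Two states; action 0 stays, action 1 switches. Reward is 0 when the next state is 1.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
r = [[-1.0, 0.0], [0.0, -1.0]]
policies = backward_messages(P, r, horizon=2)
```

The policy is normalized explicitly here, which accounts for the constant action prior that the proportionality in the derivation drops.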

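This derivation provides us with a solution, but perhaps not with much intuition. Some intuition can be recovered by considering what these equations are doing in log-space. To that end, we introduce the log-space messages as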

$\begin{aligned} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) &=\log \beta_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \\ V\left(\mathbf{s}_{t}\right) &=\log \beta_{t}\left(\mathbf{s}_{t}\right) \end{aligned}$

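The use of $Q$ and $V$ here is not accidental: the log-space messages correspond to "soft" variants of the state and state-action value functions. First, consider the marginalization over actions in log-space: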

$V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}$

When the values of $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ are large, the above equation resembles a hard maximum over $\mathbf{a}_t$. That is, for large $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$,

$V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t} \approx \max _{\mathbf{a}_{t}} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$
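The approximation above can be checked numerically for a discrete action set, where the integral becomes a sum. The Q-values below are arbitrary illustrative numbers.

```python
import math

def soft_max_value(q_values):
    """V = log sum_a exp(Q(a)), computed stably by factoring out the largest value."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))

small = [0.1, 0.2, 0.3]     # small magnitudes: the soft maximum is far from the hard maximum
large = [10.0, 20.0, 30.0]  # large magnitudes: the soft maximum is nearly the hard maximum

gap_small = soft_max_value(small) - max(small)  # roughly 1.0
gap_large = soft_max_value(large) - max(large)  # roughly 4.5e-5
```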

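For smaller values of $Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$, the maximum is soft. Hence, we can refer to $V$ and $Q$ as soft value functions and Q-functions, respectively. We can also consider the backup in Equation (6) in log-space. In the case of deterministic dynamics, this backup is given by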

$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+V\left(\mathbf{s}_{t+1}\right)$

which corresponds exactly to the Bellman backup. However, when the dynamics are stochastic, the backup is given by

$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\exp \left(V\left(\mathbf{s}_{t+1}\right)\right)\right] \tag{7}$

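This backup is peculiar, since it does not consider the expected value at the next state, but rather a "soft max" over the next expected value. Intuitively, this produces Q-functions that are optimistic: if, among the possible outcomes for the next state, there is one outcome with a very high value, it will dominate the backup, even when there are other likely outcomes with extremely low value. This creates risk-seeking behavior: if an agent behaves according to this Q-function, it might take actions that have extremely high risk, so long as they have some non-zero probability of a high reward. Clearly, this behavior is not desirable in many cases, and the standard PGM described in this section is often not well suited to stochastic dynamics. In Section 3, we will describe a simple modification, based on the framework of variational inference, that makes the backup correspond to the soft Bellman backup in the case of stochastic dynamics.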
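To make the optimism of Equation (7) concrete, the sketch below compares it with an ordinary expected-value backup on two hypothetical actions. The probabilities and values are invented for illustration.

```python
import math

def optimistic_backup(reward, probs, next_values):
    """Equation (7): r + log E[exp(V(s'))]."""
    return reward + math.log(sum(p * math.exp(v) for p, v in zip(probs, next_values)))

def expected_backup(reward, probs, next_values):
    """Ordinary backup: r + E[V(s')]."""
    return reward + sum(p * v for p, v in zip(probs, next_values))

# Hypothetical risky action: 1% chance of value 0, 99% chance of value -100.
risky = (0.0, [0.01, 0.99], [0.0, -100.0])
# Hypothetical safe action: value -10 with certainty.
safe = (0.0, [1.0], [-10.0])

prefers_risky_optimistic = optimistic_backup(*risky) > optimistic_backup(*safe)
prefers_risky_expected = expected_backup(*risky) > expected_backup(*safe)
```

Under the optimistic backup the risky action scores about $\log 0.01 \approx -4.6$ and beats the safe action at $-10$, while under the expected-value backup it scores $-99$ and loses.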

Which Objective does This Inference Procedure Optimize?

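In the previous section, we derived an inference procedure that can be used to obtain the distribution over actions conditioned on all of the optimality variables, $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{1:T})$. But which objective does this policy actually optimize? Recall that the overall distribution is given by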

$p(\tau)=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \tag{8}$

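which, in the case of deterministic dynamics, simplifies to Equation (5). In this case, the conditional distributions $p(\mathbf{a}_t|\mathbf{s}_t,\mathcal{O}_{1:T})$ are obtained simply by marginalizing the full trajectory distribution and conditioning the policy at each time step on $\mathbf{s}_t$. We can adopt an optimization-based approximate inference approach to this problem, in which case the goal is to fit an approximation $\pi(\mathbf{a}_t|\mathbf{s}_t)$ such that the trajectory distribution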

$\hat{p}(\tau) \propto \mathbb{1}[p(\tau) \neq 0] \prod_{t=1}^{T} \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)$

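matches the distribution in Equation (5). In the case of exact inference, as derived in the previous section, the match is exact, which means that $D_{KL}(\hat{p}(\tau)||p(\tau))=0$, where $D_{KL}$ is the KL-divergence. We can therefore view the inference process as minimizing $D_{KL}(\hat{p}(\tau)||p(\tau))$, which is given by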

$D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=-E_{\tau \sim \hat{p}(\tau)}[\log p(\tau)-\log \hat{p}(\tau)]$

Negating both sides and substituting in the expressions for $p(\tau)$ and $\hat{p}(\tau)$, we get

$\begin{aligned} -D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))&= E_{\tau \sim \hat{p}(\tau)}[\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T}\left(\log p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)+r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \\ &\qquad\qquad-\log p\left(\mathbf{s}_{1}\right)-\sum_{t=1}^{T}\left(\log p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)]\\ &=E_{\tau \sim \hat{p}(\tau)}\left[\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]+E_{\mathbf{s}_{t} \sim \hat{p}\left(\mathbf{s}_{t}\right)}\left[\mathcal{H}\left(\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)\right] \end{aligned}$

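Therefore, in contrast to the standard control objective in Equation (1), which maximizes only the reward, minimizing the KL-divergence corresponds to maximizing both the expected reward and the expected conditional entropy. For this reason, this type of control objective is sometimes referred to as maximum entropy reinforcement learning or maximum entropy control.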
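A one-step, single-state special case ($T=1$) gives a quick sanity check of this objective: the policy proportional to $\exp(r)$ should attain the value $\log \sum_{\mathbf{a}} \exp(r(\mathbf{a}))$ and beat both the greedy and the uniform policy. The reward values below are arbitrary, and the helper names are introduced only for this sketch.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def maxent_objective(policy, rewards):
    """E_pi[r(a)] + H(pi) for a single-step, single-state problem."""
    return sum(p * (r - math.log(p)) for p, r in zip(policy, rewards) if p > 0.0)

rewards = [-0.5, -1.0, -3.0]  # illustrative values
j_soft = maxent_objective(softmax(rewards), rewards)
j_greedy = maxent_objective([1.0, 0.0, 0.0], rewards)
j_uniform = maxent_objective([1.0 / 3.0] * 3, rewards)
log_z = math.log(sum(math.exp(r) for r in rewards))
```

The greedy policy collects the most reward but has zero entropy, and the uniform policy has maximal entropy but lower reward; the exponentiated-reward policy balances the two.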

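However, in the case of stochastic dynamics, the solution is not quite so simple. Under stochastic dynamics, the optimized distribution is given by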

$\hat{p}(\tau)=p\left(\mathbf{s}_{1} | \mathcal{O}_{1: T}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}, \mathcal{O}_{1: T}\right) p\left(\mathbf{a}_{t} | \mathbf{s}_{t}, \mathcal{O}_{1: T}\right) \tag{9}$

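where the initial state distribution and the dynamics are also conditioned on optimality. As a result, the dynamics and initial state terms in the KL-divergence do not cancel, and the objective does not have the simple entropy maximizing form derived above. (In the deterministic case, we know that $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t,\mathcal{O}_{1:T})=p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, since only one transition is ever possible.) We can still fall back on the original KL-divergence minimization at the trajectory level, and write the objective as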

$-D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=E_{\tau \sim \hat{p}(\tau)}\left[\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]+\mathcal{H}(\hat{p}(\tau)) \tag{10}$

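However, because of the $\log p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$ terms, this objective is difficult to optimize in a model-free setting. As discussed in the previous section, it also results in an optimistic policy that assumes a degree of control over the dynamics that is unrealistic in most control problems. In Section 3, we will derive a variational inference procedure that reduces to the convenient reward-plus-entropy objective derived above for deterministic dynamics even when the dynamics are stochastic, and in the process also addresses the risk-seeking behavior discussed in Section 2.3.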

Alternative Model Formulations

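It is worth pointing out that the definition of $p\left(\mathcal{O}_{t}=1 | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ in Equation (3) requires an additional assumption, namely that the rewards $r(\mathbf{s}_{t}, \mathbf{a}_{t})$ are never positive, so that $\exp(r(\mathbf{s}_{t}, \mathbf{a}_{t}))\le 1$. (This assumption is not actually very strong: if we assume the reward is bounded above, we can always construct an exactly equivalent reward simply by subtracting the maximum reward.) Otherwise, we end up with a negative probability for $p\left(\mathcal{O}_{t}=0 | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$. However, this assumption is not actually required: it is quite possible to instead define the graphical model with an undirected factor on $(\mathbf{s}_{t}, \mathbf{a}_{t},\mathcal{O}_{t})$, with an unnormalized potential given by $\Phi_{t}\left(\mathbf{s}_{t}, \mathbf{a}_{t}, \mathcal{O}_{t}\right)=\mathbf{1}_{\mathcal{O}_{t}=1} \exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$. The potential for $\mathcal{O}_{t}=0$ does not matter, since we always condition on $\mathcal{O}_{t}=1$. This leads to exactly the same inference procedure as the one described above, but without the negative reward assumption. Once we are content to work with undirected graphical models, we can even remove the variables $\mathcal{O}_{t}$ completely, and simply add an undirected factor on $(\mathbf{s}_{t}, \mathbf{a}_{t})$ with the potential $\Phi_{t}(\mathbf{s}_{t}, \mathbf{a}_{t})=\exp \left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)$, which is mathematically equivalent. This is the conditional random field formulation described by Ziebart (Ziebart, 2010). The analysis and inference methods in this model are identical to those for the directed model with explicit optimality variables $\mathcal{O}_{t}$, and the particular choice of model is simply a notational convenience. We will use the variables $\mathcal{O}_{t}$ in this article for clarity of derivation and stay within the directed graphical model framework, but all derivations are straightforward to reproduce in the conditional random field formulation.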

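Another common modification to this framework is to incorporate an explicit temperature $\alpha$ into the conditional probability distribution (CPD) for $\mathcal{O}_{t}$, such that $p\left(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)= \exp\left(\frac{1}{\alpha} r(\mathbf{s}_{t}, \mathbf{a}_{t})\right)$. The corresponding maximum entropy objective can then be written equivalently as the expectation of the (original) reward, with an additional multiplier of $\alpha$ on the entropy term. This provides a natural mechanism to interpolate between entropy maximization and standard optimal control or RL: as $\alpha\to0$, the optimal solution approaches the standard optimal control solution. Note that this does not actually increase the generality of the method, since the constant $\frac{1}{\alpha}$ can always be multiplied into the reward, but making this temperature constant explicit can help to illuminate the connection between standard and entropy maximizing optimal control.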
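The effect of the temperature can be seen in a one-step problem, where the resulting policy is proportional to $\exp(r(\mathbf{a})/\alpha)$. The rewards and the two values of $\alpha$ below are illustrative assumptions.

```python
import math

def tempered_policy(rewards, alpha):
    """pi(a) proportional to exp(r(a) / alpha), computed stably."""
    scaled = [r / alpha for r in rewards]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    z = sum(e)
    return [v / z for v in e]

def entropy(policy):
    return -sum(p * math.log(p) for p in policy if p > 0.0)

rewards = [1.0, 0.5, 0.0]                    # illustrative values
hot = tempered_policy(rewards, alpha=10.0)   # high temperature: close to uniform
cold = tempered_policy(rewards, alpha=0.05)  # low temperature: close to greedy
```

As $\alpha$ shrinks, almost all probability mass moves to the highest-reward action and the entropy drops toward zero, which matches the standard optimal control limit described above.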

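Finally, it is worth remarking again on the role of discount factors: it is very common in reinforcement learning to use a Bellman backup of the form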

$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \leftarrow r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\gamma E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]$

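where $\gamma \in(0,1]$ is a discount factor. This allows value functions to be learned in infinite-horizon settings (where the backup would otherwise be non-convergent for $\gamma= 1$), and reduces the variance of Monte Carlo advantage estimators in policy gradient algorithms (Schulman et al., 2016). The discount factor can be viewed as a simple redefinition of the system dynamics. If the initial dynamics are given by $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$, adding a discount factor is equivalent to undiscounted value fitting under the modified dynamics $\bar{p}\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)=\gamma p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$, where there is an additional transition with probability $1-\gamma$, regardless of the action, into an absorbing state with reward zero. We will omit $\gamma$ from the derivations in this article, but it can be inserted trivially in all cases simply by modifying the (soft) Bellman backups wherever the expectation over $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ occurs, such as Equation (7) or Equation (15) in the next section.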
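The equivalence between discounting and an absorbing state can be checked for the ordinary expected-value backup. The function names and numbers below are hypothetical, and the absorbing state is assumed to have value zero.

```python
def discounted_expectation(probs, next_values, gamma):
    """gamma * E[V(s')] under the original dynamics."""
    return gamma * sum(p * v for p, v in zip(probs, next_values))

def absorbing_expectation(probs, next_values, gamma):
    """Undiscounted E[V(s')] under modified dynamics: original transitions scaled
    by gamma, plus an absorbing state of value 0 reached with probability 1 - gamma."""
    mod_probs = [gamma * p for p in probs] + [1.0 - gamma]
    mod_values = list(next_values) + [0.0]
    return sum(p * v for p, v in zip(mod_probs, mod_values))

lhs = discounted_expectation([0.3, 0.7], [2.0, -1.0], gamma=0.9)
rhs = absorbing_expectation([0.3, 0.7], [2.0, -1.0], gamma=0.9)
```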

Variational Inference and Stochastic Dynamics

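The problematic nature of the maximum entropy framework in the case of stochastic dynamics, discussed in Section 2.3 and Section 2.4, in essence amounts to an assumption that the agent is allowed to control both its actions and the dynamics of the system in order to produce optimal trajectories, but that its authority over the dynamics is penalized based on deviation from the true dynamics. Hence, the $\log p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ terms in Equation (10) can be factored out, producing additive terms that correspond to the cross-entropy between the posterior dynamics $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t},\mathcal{O}_{1:T}\right)$ and the true dynamics $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$. This explains the risk-seeking nature of the method discussed in Section 2.3: if the agent is allowed to influence its dynamics, even a little bit, it will reasonably choose to remove the unlikely but extremely bad outcomes of risky actions.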

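Of course, in practical reinforcement learning and control problems, such manipulation of the system dynamics is not possible, and the resulting policies can lead to disastrously bad outcomes. We can correct this issue by modifying the inference procedure. In this section, we will derive this correction by fixing the system dynamics, writing down the corresponding maximum entropy objective, and deriving a dynamic programming procedure for optimizing it. We will then show that this procedure amounts to a direct application of structured variational inference.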

Maximum Entropy Reinforcement Learning with Fixed Dynamics

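The issue discussed in Section 2.4 for stochastic dynamics can be summarized briefly as follows: since the posterior dynamics distribution $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t},\mathcal{O}_{1:T}\right)$ does not necessarily match the true dynamics $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$, the agent assumes that it can influence the dynamics to a limited extent. A simple fix is to explicitly disallow this control, by forcing the posterior dynamics and initial state distributions to match $p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ and $p\left(\mathbf{s}_{1}\right)$, respectively. The optimized trajectory distribution is then given simply by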

$\hat{p}(\tau)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)$

and the same derivation as the one presented in Section 2.4 for the deterministic case yields the following objective:

$-D_{\mathrm{KL}}(\hat{p}(\tau) \| p(\tau))=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\mathcal{H}\left(\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)\right] \tag{11}$

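That is, the objective is still to maximize reward and entropy, but now under stochastic transition dynamics. To optimize this objective, we can compute backward messages as we did in Section 2.3. However, since we are now starting from the maximization of the objective in Equation (11), we have to derive these backward messages from an optimization perspective, as a dynamic programming algorithm. As before, we begin with the base case of optimizing $\pi(\mathbf{a}_{T}| \mathbf{s}_{T})$, which maximizes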

$\begin{aligned} &E_{\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right) \sim \hat{p}\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)}\left[r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-\log \pi\left(\mathbf{a}_{T} | \mathbf{s}_{T}\right)\right]=\\ &E_{\mathbf{s}_{T} \sim \hat{p}\left(\mathbf{s}_{T}\right)}\left[-D_{\mathrm{KL}}\left(\pi\left(\mathbf{a}_{T} | \mathbf{s}_{T}\right) \| \frac{1}{\exp \left(V\left(\mathbf{s}_{T}\right)\right)} \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right)\right)+V\left(\mathbf{s}_{T}\right)\right] \end{aligned} \tag{12}$

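where the equality follows from the definition of the KL-divergence, and $\exp(V(\mathbf{s}_T))$ is the normalizing constant for $\exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right)$ with respect to $\mathbf{a}_T$, with $V\left(\mathbf{s}_{T}\right)=\log \int_{\mathcal{A}} \exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)\right) d \mathbf{a}_{T}$, which is the same soft maximization as in Section 2.3. Since we know that the KL-divergence is minimized when its two arguments represent the same distribution, the optimal policy is given by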

$\pi\left(\mathbf{a}_{T} | \mathbf{s}_{T}\right)=\exp \left(r\left(\mathbf{s}_{T}, \mathbf{a}_{T}\right)-V\left(\mathbf{s}_{T}\right)\right) \tag{13}$

The recursive case can then be computed as follows: for a given time step $t$, $\pi(\mathbf{a}_t|\mathbf{s}_t)$ must maximize two terms:

$E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right]+E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right] \tag{14}$

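The first term follows directly from the objective in Equation (11), while the second term represents the contribution of $\pi(\mathbf{a}_t|\mathbf{s}_t)$ to the expectations of all subsequent time steps. The second term deserves a more in-depth derivation. First, consider the base case: given the expression for $\pi(\mathbf{a}_T|\mathbf{s}_T)$ in Equation (13), we can evaluate the objective for the policy by substituting this expression directly into Equation (12). Since the KL-divergence then evaluates to zero, we are left only with the $V(\mathbf{s}_T)$ term. In the recursive case, we note that we can rewrite the objective in Equation (14) as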

$\begin{aligned} &E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log \pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right]+E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim \hat{p}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right]=\\ &E_{\mathbf{s}_{t} \sim \hat{p}\left(\mathbf{s}_{t}\right)}\left[-D_{\mathrm{KL}}\left(\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) \| \frac{1}{\exp \left(V\left(\mathbf{s}_{t}\right)\right)} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right)+V\left(\mathbf{s}_{t}\right)\right] \end{aligned}$

where we now define

$\begin{aligned} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) &=r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{\mathbf{s}_{t+1} \sim p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right] \\ V\left(\mathbf{s}_{t}\right) &=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t} \end{aligned} \tag{15}$

which corresponds to a standard Bellman backup with a soft maximization for the value function. Choosing

$\pi\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V\left(\mathbf{s}_{t}\right)\right) \tag{16}$

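we again see that the KL-divergence evaluates to zero, leaving $E_{\mathbf{s}_{t} \sim \hat{p}\left(\mathbf{s}_{t}\right)}\left[V\left(\mathbf{s}_{t}\right)\right]$ as the only remaining term in the objective for time step $t$, just as in the base case of $t=T$. This means that, if we fix the dynamics and initial state distribution and only allow the policy to change, we recover a Bellman backup operator that uses the expected value of the next state, rather than the optimistic estimate we saw in Section 2.3 (compare Equation (15) with Equation (7)). While this provides a solution to the practical problem of risk-seeking policies, it is perhaps a bit unsatisfying in its divergence from the convenient framework of probabilistic graphical models. In the next section, we will discuss how this procedure amounts to a direct application of structured variational inference.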
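Equations (15) and (16) can be sketched as a finite-horizon tabular dynamic program. The four-state problem below (a start state with one risky and one safe action, leading to three terminal-like states) is a hypothetical example chosen to show that the expected-value backup avoids the risky action.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def soft_value_iteration(P, r, horizon):
    """Backward pass of Equations (15) and (16) with fixed dynamics.

    P[s][a][s2] and r[s][a] are hypothetical tables. Returns V at t = 1 and
    the per-step policies ordered from t = 1 to t = T.
    """
    n_s, n_a = len(r), len(r[0])
    V = [0.0] * n_s  # no reward remains after the final step
    policies = []
    for _ in range(horizon):
        Q = [[r[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [logsumexp(Q[s]) for s in range(n_s)]
        policies.append([[math.exp(Q[s][a] - V[s]) for a in range(n_a)]
                         for s in range(n_s)])
    policies.reverse()
    return V, policies

def one_hot(i, n=4):
    return [1.0 if j == i else 0.0 for j in range(n)]

# State 0: action 0 is risky (1% to state 1, 99% to state 2), action 1 is safe (to state 3).
P = [[[0.0, 0.01, 0.99, 0.0], one_hot(3)]] + [[one_hot(s), one_hot(s)] for s in (1, 2, 3)]
r = [[0.0, 0.0], [0.0, 0.0], [-100.0, -100.0], [-10.0, -10.0]]
V, policies = soft_value_iteration(P, r, horizon=2)
safe_prob = policies[0][0][1]  # probability of the safe action at t = 1 in state 0
```

With the expectation inside the backup, the risky action has a soft Q-value near $-99$ and the safe action near $-10$, so the policy picks the safe action almost surely, in contrast to the optimistic backup of Equation (7).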

Connection to Structured Variational Inference

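One way to interpret the optimization procedure in Section 3.1 is as a particular type of structured variational inference. In structured variational inference, our goal is to approximate some distribution $p(y)$ with another, potentially simpler distribution $q(y)$. Typically, $q(y)$ is taken to be some tractable factorized distribution, such as a product of conditional distributions connected in a chain or tree, which lends itself to tractable exact inference. In our case, we aim to approximate $p(\tau)$, given by

$p(\tau)=\left[p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} p\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \exp \left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) \tag{17}$

via the distribution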

$q(\tau)=q\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right) q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) \tag{18}$

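If we fix $q(\mathbf{s}_1)=p(\mathbf{s}_1)$ and $q(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)=p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, then $q(\tau)$ is exactly the distribution $\hat p(\tau)$ from Section 3.1, which we have renamed here to $q(\tau)$ to emphasize the connection to structured variational inference. Note that we have also renamed $\pi(\mathbf{a}_t|\mathbf{s}_t)$ to $q(\mathbf{a}_t|\mathbf{s}_t)$ for the same reason. In structured variational inference, approximate inference is performed by optimizing the variational lower bound (also called the evidence lower bound). Recall that our evidence here is that $\mathcal{O}_t=1$ for all $t\in\{1,\dots,T\}$, and that the posterior is conditioned on the initial state $\mathbf{s}_1$. The variational lower bound is given by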

$\begin{aligned} \log p\left(\mathcal{O}_{1: T}\right) &=\log \iint p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) d \mathbf{s}_{1: T} d \mathbf{a}_{1: T} \\ &=\log \iint p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \frac{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)} d \mathbf{s}_{1: T} d \mathbf{a}_{1: T} \\ &=\log E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\frac{p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}{q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\right] \\ & \geq E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\log p\left(\mathcal{O}_{1: T}, \mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)-\log q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)\right] \end{aligned}$

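where the inequality on the last line is obtained via Jensen's inequality. Substituting the definitions of $p(\tau)$ and $q(\tau)$ from Equations (17) and (18), and noting the cancellation due to $q(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)=p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, the bound reduces to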

$\log p\left(\mathcal{O}_{1: T}\right) \geq E_{\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right) \sim q\left(\mathbf{s}_{1: T}, \mathbf{a}_{1: T}\right)}\left[\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right] \tag{19}$

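up to an additive constant. Optimizing this objective with respect to the policy $q(\mathbf{a}_t|\mathbf{s}_t)$ corresponds exactly to the objective in Equation (11). Intuitively, this means that the objective attempts to find the closest match to the maximum entropy trajectory distribution, subject to the constraint that the agent is only allowed to modify the policy, and not the dynamics. Note that this framework can also easily accommodate any other structural constraints on the policy, including restriction to a particular distribution class (for example, a conditional Gaussian, or a categorical distribution parameterized by a neural network), or restriction to partial observability, where the entire state $\mathbf{s}_t$ is not available as an input and the policy only has access to some non-invertible function of the state.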

Approximate Inference with Function Approximation

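We saw in the discussion above that a dynamic programming backward algorithm with updates analogous to Bellman backups can recover "soft" analogues of the value function and Q-function in the maximum entropy reinforcement learning framework, and that the stochastic optimal policy can be recovered from the value function and Q-function. In this section, we will discuss how practical algorithms for high-dimensional or continuous reinforcement learning problems can be derived from this theoretical framework with the use of function approximation. This will give rise to several prototypical methods that mirror corresponding techniques in standard reinforcement learning: policy gradients, actor-critic algorithms, and Q-learning.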

Maximum Entropy Policy Gradients

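One approach to performing structured variational inference is to directly optimize the evidence lower bound with respect to the variational distribution (Koller and Friedman, 2009). This approach can be applied directly to maximum entropy reinforcement learning. Note that the variational distribution consists of three terms: $q(\mathbf{s}_1)$, $q(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, and $q(\mathbf{a}_t|\mathbf{s}_t)$. The first two terms are fixed to $p(\mathbf{s}_1)$ and $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$, respectively, leaving only $q(\mathbf{a}_t|\mathbf{s}_t)$ to vary. We can parameterize this distribution with any expressive conditional with parameters $\theta$, and will therefore denote it $q_\theta(\mathbf{a}_t|\mathbf{s}_t)$. The parameters could correspond, for example, to the weights in a deep neural network, which takes $\mathbf{s}_t$ as input and outputs the parameters of some distribution class. In the case of discrete actions, the network could directly output the parameters of a categorical distribution (for example, via a soft max operator). In the case of continuous actions, the network could output the parameters of an exponential family distribution, such as a Gaussian. In all cases, we can directly optimize the objective in Equation (11) by estimating its gradient using samples. This gradient has a form that is nearly identical to the standard policy gradient (Williams, 1992), which we summarize here for completeness. First, let us restate the objective as follows: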

$J(\theta)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\mathcal{H}\left(q_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)\right]$

The gradient is then given by

$\begin{aligned} \nabla_{\theta} J(\theta) &=\sum_{t=1}^{T} \nabla_{\theta} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\mathcal{H}\left(q_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q_{\theta}\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)-1\right)\right] \\ &=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\left(\sum_{t^{\prime}=t}^{T} r\left(\mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q_{\theta}\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)-b\left(\mathbf{s}_{t^{\prime}}\right)\right)\right] \end{aligned}$

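where the second line follows from applying the likelihood ratio trick (Williams, 1992) and the definition of entropy to obtain the $\log q_\theta(\mathbf{a}_{t'}|\mathbf{s}_{t'})$ term. The $-1$ comes from the derivative of the entropy term. The last line follows by noting that the gradient estimator is invariant to additive state-dependent constants, and replacing $-1$ with a state-dependent baseline $b(\mathbf{s}_{t'})$. The resulting policy gradient estimator exactly matches a standard policy gradient estimator, with the only modification being the addition of the $-\log q_\theta(\mathbf{a}_{t'}|\mathbf{s}_{t'})$ term to the reward at each time step $t'$. Intuitively, the reward of each action is modified by subtracting the log-probability of that action under the current policy, which causes the policy to maximize entropy. This gradient estimator can be written more compactly as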

$\nabla_{\theta} J(\theta)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\theta} \log q_{\theta}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right) \hat{A}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$

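where $\hat{A}(\mathbf{s}_t,\mathbf{a}_t)$ is an advantage estimator. Any standard advantage estimator, such as the GAE estimator (Schulman et al., 2016), can be used in place of the standard baselined Monte Carlo return above. Again, the only necessary modification is to add $-\log q_\theta(\mathbf{a}_{t'}|\mathbf{s}_{t'})$ to the reward at each time step $t'$. As with standard policy gradients, a practical implementation of this method estimates the expectation by sampling trajectories from the current policy, and may be improved by following the natural gradient direction.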
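For a one-step problem with a softmax policy over logits $\theta$, the estimator above can be evaluated exactly by enumerating actions and compared against a finite-difference gradient of the objective. This is a minimal sketch under those assumptions (exact enumeration replaces sampling, and the logits and rewards are arbitrary); it also checks that the choice of a constant baseline does not change the gradient.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def objective(theta, rewards):
    """J(theta) = E_q[r(a)] + H(q) for a one-step problem."""
    q = softmax(theta)
    return sum(qa * (ra - math.log(qa)) for qa, ra in zip(q, rewards))

def score_function_gradient(theta, rewards, baseline=0.0):
    """Sum_a q(a) * grad log q(a) * (r(a) - log q(a) - baseline), enumerated exactly."""
    q = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (qa, ra) in enumerate(zip(q, rewards)):
        signal = ra - math.log(qa) - baseline
        for k in range(len(theta)):
            dlogq = (1.0 if k == a else 0.0) - q[k]  # d log q(a) / d theta_k for a softmax
            grad[k] += qa * dlogq * signal
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((objective(up, rewards) - objective(dn, rewards)) / (2 * eps))
    return grad

theta, rewards = [0.2, -0.1, 0.4], [1.0, 0.0, -1.0]  # illustrative values
g_no_baseline = score_function_gradient(theta, rewards, baseline=0.0)
g_baseline = score_function_gradient(theta, rewards, baseline=1.0)
g_numeric = finite_difference_gradient(theta, rewards)
```

In a sampled implementation the baseline would matter for variance, even though it leaves the expected gradient unchanged.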

Maximum Entropy Actor-Critic Algorithms

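Instead of directly differentiating the variational lower bound, we can adopt a message passing approach which, as we will see later, can produce lower-variance gradient estimates. First, note that we can write down the following equation for the optimal target distribution for $q(\mathbf{a}_t|\mathbf{s}_t)$: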

$q^{\star}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\frac{1}{Z} \exp \left(E_{q\left(\mathbf{s}_{(t+1): T}, \mathbf{a}_{(t+1): T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\sum_{t^{\prime}=t}^{T} \log p\left(\mathcal{O}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\sum_{t^{\prime}=t+1}^{T} \log q\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)\right]\right)$

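This is because conditioning on $\mathbf{s}_t$ makes the action $\mathbf{a}_t$ completely independent of all past states, but the action still depends on all future states and actions. Note that the dynamics terms $p(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$ and $q(\mathbf{s}_{t+1}|\mathbf{s}_t,\mathbf{a}_t)$ do not appear in the above equation, since they cancel exactly. We can simplify the expectation above as follows: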

$\begin{aligned} &E_{q\left(\mathbf{s}_{(t+1): T}, \mathbf{a}_{(t+1): T} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\sum_{t^{\prime}=t}^{T} \log p\left(\mathcal{O}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\sum_{t^{\prime}=t+1}^{T} \log q\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)\right]=\\ &\log p\left(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[E\left[\sum_{t^{\prime}=t+1}^{T} \log p\left(\mathcal{O}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)\right]\right] \end{aligned}$

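Here, note that the inner expectation does not contain $\mathbf{s}_t$ or $\mathbf{a}_t$, and therefore makes for a natural representation of a message that can be sent from future states. We will denote this message $V(\mathbf{s}_{t+1})$, since it will correspond to a soft value function: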

$\begin{aligned} V\left(\mathbf{s}_{t}\right) &=E\left[\sum_{t^{\prime}=t}^{T} \log p\left(\mathcal{O}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}, \mathbf{a}_{t^{\prime}}\right)-\log q\left(\mathbf{a}_{t^{\prime}} | \mathbf{s}_{t^{\prime}}\right)\right] \\ &=E_{q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[\log p\left(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)+E_{q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right] \end{aligned}$

For convenience, we can also define a Q-function as

$Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)=\log p\left(\mathcal{O}_{t} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]$

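such that $V(\mathbf{s}_t)=E_{q(\mathbf{a}_t|\mathbf{s}_t)}\left[Q(\mathbf{s}_t,\mathbf{a}_t)-\log q(\mathbf{a}_t|\mathbf{s}_t)\right]$, and the optimal policy is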

$q^{\star}\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\frac{\exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)}{\int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}}$

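Note that, in this case, the value function and Q-function correspond to the values of the current policy $q(\mathbf{a}_t|\mathbf{s}_t)$, rather than to the optimal value function and Q-function as in the case of dynamic programming. However, at convergence, when $q(\mathbf{a}_t|\mathbf{s}_t)=q^{\star}(\mathbf{a}_t|\mathbf{s}_t)$ for each $t$, we have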

$\begin{aligned} V\left(\mathbf{s}_{t}\right) &=E_{q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right] \\ &=E_{q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}\right] \\ &=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t} \end{aligned}$

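which is the familiar soft maximum from Section 2.3. We now see that the optimal variational distribution for $q(\mathbf{a}_t|\mathbf{s}_t)$ can be computed by passing messages backward through time, and that the messages are given by $V(\mathbf{s}_t)$ and $Q(\mathbf{s}_t,\mathbf{a}_t)$.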

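So far, this derivation assumes that the policy and the messages can be represented exactly. We can relax the first assumption in the same way as in the preceding section. We first write down the variational lower bound for a single factor $q(\mathbf{a}_t|\mathbf{s}_t)$ as follows: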

$\max _{q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)} E_{\mathbf{s}_{t} \sim q\left(\mathbf{s}_{t}\right)}\left[E_{\mathbf{a}_{t} \sim q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right]\right]$

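It is straightforward to show that this objective is simply the full variational lower bound, which is given by $E_{q(\tau)}[\log p(\tau)-\log q(\tau)]$, restricted to just the terms that include $q(\mathbf{a}_t|\mathbf{s}_t)$. If we restrict the class of policies $q(\mathbf{a}_t|\mathbf{s}_t)$ so that they cannot represent $q^{\star}(\mathbf{a}_t|\mathbf{s}_t)$ exactly, we can still optimize the objective in Equation (22) by computing its gradient, which is given by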

$E_{\mathbf{s}_{t} \sim q\left(\mathbf{s}_{t}\right)}\left[E_{\mathbf{a}_{t} \sim q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[\nabla \log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)-b\left(\mathbf{s}_{t}\right)\right)\right]\right]$

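where $b(\mathbf{s}_t)$ is any state-dependent baseline. This gradient can be computed using samples from $q(\tau)$ and, like the policy gradient in the previous section, is directly analogous to a classic likelihood ratio policy gradient. The modification is that it uses the backward message $Q(\mathbf{s}_t,\mathbf{a}_t)$ in place of the Monte Carlo advantage estimate. The algorithm therefore corresponds to an actor-critic algorithm, which generally provides lower-variance gradient estimates.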

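In order to turn this into a practical algorithm, we must also be able to approximate the backward messages $Q(\mathbf{s}_t,\mathbf{a}_t)$ and $V(\mathbf{s}_t)$. A simple and straightforward approach is to represent them with parameterized functions $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$ and $V_\psi(\mathbf{s}_t)$, with parameters $\phi$ and $\psi$, and to optimize the parameters to minimize the squared error objectives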

$\mathcal{E}(\phi)=E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V_{\psi}\left(\mathbf{s}_{t+1}\right)\right]-Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)^{2}\right]$

$\mathcal{E}(\psi)=E_{\mathbf{s}_{t} \sim q\left(\mathbf{s}_{t}\right)}\left[\left(E_{\mathbf{a}_{t} \sim q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\log q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)\right]-V_{\psi}\left(\mathbf{s}_{t}\right)\right)^{2}\right]$
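In a tabular setting, the regression targets inside the two squared errors above can be written as two small helper functions. The names `q_target` and `v_target` and the numbers are hypothetical, and the expectations are computed exactly rather than from samples.

```python
import math

def q_target(reward, next_probs, v_next):
    """Target for Q(s, a): r(s, a) + E_{s'}[V(s')]."""
    return reward + sum(p * v for p, v in zip(next_probs, v_next))

def v_target(q_row, pi_row):
    """Target for V(s): E_{a ~ q}[Q(s, a) - log q(a | s)] under the current policy."""
    return sum(p * (q - math.log(p)) for q, p in zip(q_row, pi_row) if p > 0.0)

# Illustrative numbers: two next states with values 2 and 4, and two equally good actions.
example_q = q_target(-1.0, [0.5, 0.5], [2.0, 4.0])
example_v = v_target([1.0, 1.0], [0.5, 0.5])
```

The value target is the expected Q-value plus the entropy of the current policy, which is why a uniform policy over two equally good actions earns a bonus of $\log 2$.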

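This interpretation gives rise to a few interesting possibilities for maximum entropy actor-critic and policy iteration algorithms. First, it suggests that it may be beneficial to keep track of both $V(\mathbf{s}_t)$ and $Q(\mathbf{s}_t,\mathbf{a}_t)$ networks. This is perfectly reasonable in a message passing framework, and in practice might have many of the same benefits as the use of a target network, where the updates to $Q$ and $V$ can be staggered or damped to preserve stability. Second, it suggests that policy iteration or actor-critic methods might be preferred (over, for example, direct Q-learning), since they explicitly handle both approximate messages and approximate factors in the structured variational approximation. This is precisely the scheme employed by the soft actor-critic algorithm (Haarnoja et al., 2018b).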

Soft Q-Learning

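We can derive an alternative form of reinforcement learning algorithm that does not use an explicit policy parameterization and fits only the messages $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$. In this case, we assume an implicit parameterization for both the value function $V(\mathbf{s}_t)$ and the policy $q(\mathbf{a}_t|\mathbf{s}_t)$, where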

$V\left(\mathbf{s}_{t}\right)=\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right) d \mathbf{a}_{t}$

as in Equation (21), and

$q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)=\exp \left(Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-V\left(\mathbf{s}_{t}\right)\right)$

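which corresponds directly to Equation (20). In this case, no further parameterization is needed beyond $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$, which can be learned by minimizing the error in Equation (23), substituting the implicit expression for $V(\mathbf{s}_t)$ in place of $V_\psi(\mathbf{s}_t)$. We can write the resulting gradient update as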

$\phi \leftarrow \phi-\alpha E\left[\frac{d Q_{\phi}}{d \phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\left(Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\log \int_{\mathcal{A}} \exp \left(Q\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)\right) d \mathbf{a}_{t+1}\right)\right)\right]$

It is worth pointing out the similarity to the standard Q-learning update:

$\phi \leftarrow \phi-\alpha E\left[\frac{d Q_{\phi}}{d \phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\left(Q_{\phi}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\left(r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+\max _{\mathbf{a}_{t+1}} Q_{\phi}\left(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}\right)\right)\right)\right]$

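Where the standard Q-learning update has a max, the soft Q-learning update has a "soft" max. As the magnitude of the reward increases, the soft update comes to resemble the hard update. In the case of discrete actions, this update is straightforward to implement, since the integral is replaced with a summation, and the policy can be extracted simply by exponentiating and normalizing the Q-function. In the case of continuous actions, a further level of approximation is needed to evaluate the integral using samples. Sampling from the implicit policy is also non-trivial, and requires an approximate inference procedure, as discussed by Haarnoja et al. (2017).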
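For discrete actions, the two updates differ only in the backup operator, which the tabular sketch below makes explicit. The two-state episodic problem, the treatment of terminal transitions (no bootstrap term), and the step size are assumptions made for this example; with a table, the derivative of $Q_\phi$ with respect to the updated entry is simply 1.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def q_update(Q, transition, lr, soft):
    """One tabular update toward r + (soft or hard) max of the next-state Q-values."""
    s, a, reward, s_next = transition
    if s_next is None:  # terminal transition (assumed convention): no bootstrap term
        target = reward
    else:
        backup = logsumexp(Q[s_next]) if soft else max(Q[s_next])
        target = reward + backup
    Q[s][a] -= lr * (Q[s][a] - target)

# Hypothetical problem: state 0 leads to state 1 with reward 0;
# in state 1, action 0 ends the episode with reward 0 and action 1 with reward -1.
transitions = [(1, 0, 0.0, None), (1, 1, -1.0, None), (0, 0, 0.0, 1), (0, 1, 0.0, 1)]

def run(soft, sweeps=200, lr=0.5):
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        for tr in transitions:
            q_update(Q, tr, lr, soft)
    return Q

Q_soft = run(soft=True)
Q_hard = run(soft=False)
```

The soft Q-value at state 0 converges to $\log(1+e^{-1})$, slightly above the hard value of 0, because the soft backup also credits the entropy available at state 1.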

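We can further use this framework to illustrate an interesting connection between soft Q-learning and policy gradients. Based on the definition of the policy in Equation (20), which is defined entirely in terms of $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$, we can derive an alternative gradient starting from the policy gradient. This derivation represents a connection between policy gradients and Q-learning that is not apparent in the standard framework, but becomes apparent in the maximum entropy framework. The full derivation is provided by Haarnoja et al. (2017) (Appendix B). The final gradient corresponds to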

$\nabla_{\phi} J(\phi)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\left(\nabla_{\phi} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)-\nabla_{\phi} V\left(\mathbf{s}_{t}\right)\right) \hat{A}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$

The soft Q-learning gradient can equivalently be written as

$\nabla_{\phi} J(\phi)=\sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[\nabla_{\phi} Q\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \hat{A}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$

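where we substitute the target value $r(\mathbf{s}_t,\mathbf{a}_t)+V(\mathbf{s}_{t+1})$ for $\hat{A}(\mathbf{s}_t,\mathbf{a}_t)$, taking advantage of the fact that we can use any state-dependent baseline. Although these gradients are not exactly equal, the extra term $-\nabla_\phi V(\mathbf{s}_t)$ simply accounts for the fact that the policy gradient alone is insufficient to resolve one extra degree of freedom in $Q(\mathbf{s}_t,\mathbf{a}_t)$: an action-independent constant. We can eliminate this term if we add the policy gradient together with Bellman error minimization for $V(\mathbf{s}_t)$, which has the gradient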

$\nabla_{\phi} V\left(\mathbf{s}_{t}\right) E_{\mathbf{a}_{t} \sim q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)+E_{\mathbf{s}_{t+1} \sim q\left(\mathbf{s}_{t+1} | \mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[V\left(\mathbf{s}_{t+1}\right)\right]\right]=\nabla_{\phi} V\left(\mathbf{s}_{t}\right) E_{\mathbf{a}_{t} \sim q\left(\mathbf{a}_{t} | \mathbf{s}_{t}\right)}\left[\hat{Q}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$

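Noting that $\hat{Q}(\mathbf{s}_t,\mathbf{a}_t)$ is just a (non-baselined) return estimate, we can show that the sum of the policy gradient and the value gradient exactly matches Equation (24) for a particular choice of state-dependent baseline, since the term $\nabla_\phi V(\mathbf{s}_t) E_{\mathbf{a}_t \sim q(\mathbf{a}_t|\mathbf{s}_t)}[\hat{Q}(\mathbf{s}_t,\mathbf{a}_t)]$ cancels against the term $-\nabla_\phi V(\mathbf{s}_t)\hat{A}(\mathbf{s}_t,\mathbf{a}_t)$ in expectation over $\mathbf{a}_t$ when $\hat{A}(\mathbf{s}_t,\mathbf{a}_t)=\hat{Q}(\mathbf{s}_t,\mathbf{a}_t)$ (that is, when we use a baseline of zero). This completes the proof of a general equivalence between soft Q-learning and policy gradients.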