Series Index
Table of Contents
- Series Index
- Preface
- Definition of optimal policy
- Bellman optimality equation
- Introduction
- Maximization on the right-hand side
- Contraction mapping theorem
- Solution
- Optimality
- Analyzing optimal policies
- Summary
Preface
This series records my study notes for Prof. 趙世鈺 (Shiyu Zhao)'s course 【強化學習的數學原理】 (Mathematical Foundations of Reinforcement Learning) on Bilibili. For the course itself, please see:
Bilibili video: 【【強化學習的數學原理】課程:從零開始到透徹理解(完結)】
GitHub 課程資料:Book-Mathematical-Foundation-of-Reinforcement-Learning
Definition of optimal policy
The state value could be used to evaluate if a policy is good or not: if
$$v_{\pi_1}(s) \geq v_{\pi_2}(s), \quad \forall s \in \mathcal{S}$$
then $\pi_1$ is "better" than $\pi_2$.
A policy $\pi^*$ is optimal if
$$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \; \forall \pi.$$
Bellman optimality equation
Introduction
Bellman optimality equation (elementwise form):
$$v(s) = \max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$
$$= \max_{\pi} \sum_a \pi(a \mid s)\, q(s,a), \quad s \in \mathcal{S}$$
Remarks:
- $p(r \mid s,a)$ and $p(s' \mid s,a)$ are known.
- $v(s)$ and $v(s')$ are unknown and to be calculated.
- $\pi(a \mid s)$ is also not fixed here: the maximization is taken over all policies $\pi$, as discussed below.
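To make the quantities on the right-hand side concrete, here is a minimal Python sketch. The 2-state, 2-action MDP (arrays `P` and `R`) and the helper name `q_values` are my own made-up illustration, not an example from the course; it only shows how $q(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s')$ is computed from the model and a guess of $v$.

```python
import numpy as np

gamma = 0.9

# Hypothetical model (assumption, not from the course):
# P[s, a, s'] = p(s' | s, a), R[s, a] = sum_r p(r | s, a) * r
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def q_values(v):
    """q(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v(s')."""
    return R + gamma * P @ v        # shape: (n_states, n_actions)

v_guess = np.zeros(2)               # any guess of v(s); here v = 0
print(q_values(v_guess))            # [[0. 1.] [2. 0.]]
```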
Combining this with the fact that the Bellman equation depends on a known policy, let us explain why the Bellman optimality equation takes $\max_\pi$, and how it relates to the optimal policy $\pi^*$.
Recap: the Bellman equation (depends on a known policy)
For a fixed policy $\pi$, the state value is defined by:
$$v_\pi(s) = \sum_a \pi(a \mid s) \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s') \Bigg]$$
- Here $\pi(a \mid s)$ is a known action distribution, so in this case the Bellman equation is a policy evaluation tool.
- A policy assigns, for every state $s$, a probability to each possible action $a$; one policy therefore corresponds to one set of "action distributions over all states".
- Different policies correspond to different sets of action distributions.
Policy $\pi_1$:
In state $s$, it may put a higher probability on action $a_1$ and a lower probability on action $a_2$.
Policy $\pi_2$:
In the same state $s$, it may do exactly the opposite, putting a lower probability on $a_1$ and a higher probability on $a_2$.
From policy dependence to optimality
If we do not just want to evaluate one specific policy but want to find the optimal policy, we need to ask: for each state $s$, which action choice (or policy) maximizes the long-term return?
The definition of the optimal state value then becomes:
$$v^*(s) = \max_\pi v_\pi(s), \quad \forall s \in \mathcal{S}$$
Here $\max_\pi$ means: among all possible policies, find one that maximizes the value function.
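The definition $v^*(s) = \max_\pi v_\pi(s)$ can be checked by brute force on a small example. The sketch below reuses the same made-up 2-state, 2-action MDP as above (an assumption for illustration only) and enumerates the deterministic policies; since a deterministic optimal policy always exists for a finite MDP, the elementwise maximum of their value functions is $v^*$.

```python
import numpy as np
from itertools import product

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # p(s'|s,a) for the toy model above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],                 # expected immediate rewards
              [2.0, 0.0]])

values = []
for actions in product(range(2), repeat=2):     # one action per state
    a = np.array(actions)
    idx = np.arange(2)
    r_pi = R[idx, a]                            # [r_pi]_s
    P_pi = P[idx, a]                            # [P_pi]_{s,s'}
    # v_pi solves the linear Bellman equation v_pi = r_pi + gamma * P_pi v_pi
    v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    values.append(v_pi)

v_star = np.max(values, axis=0)   # v*(s) = max_pi v_pi(s), taken elementwise
print(v_star)                     # [19. 20.] for this toy model
```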
The Bellman optimality equation
Introducing $\max_\pi$ into the Bellman equation gives:
$$v^*(s) = \max_a \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \Bigg]$$
Key points:
- In the ordinary Bellman equation, $\pi(a \mid s)$ is a known distribution.
- In the Bellman optimality equation, we do not fix a policy; instead we maximize directly at the action level, which is equivalent to "choosing the optimal action".
- Therefore the optimal value function $v^*(s)$ no longer depends on any particular $\pi$; the policy optimization is built into the equation itself.
Relation to the optimal policy $\pi^*$
Define:
$$\pi^*(s) = \arg\max_a \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \Bigg]$$
That is, the optimal policy chooses, in every state, the action that maximizes the future return.
In other words:
- The ordinary Bellman equation = value evaluation under a known policy.
- The Bellman optimality equation = selecting the best value function over all policies, which in turn defines the optimal policy.
Bellman optimality equation (matrix-vector form):
$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$
where the elements corresponding to $s$ or $s'$ are
$$[r_\pi]_s \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s,a)\, r,$$
$$[P_\pi]_{s,s'} = p(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s,a)$$
Here $\max_\pi$ is performed elementwise.
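For a given policy, $r_\pi$ and $P_\pi$ are just a vector and a stochastic matrix built from the model. The sketch below constructs them on the same hypothetical 2-state, 2-action MDP used above and, as a by-product, solves the ordinary (policy-specific) Bellman equation, which is linear; the concrete policy `pi` is an arbitrary choice of mine for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# A given (known) policy, chosen arbitrarily: pi[s, a] = pi(a | s)
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

r_pi = (pi * R).sum(axis=1)             # [r_pi]_s = sum_a pi(a|s) sum_r p(r|s,a) r
P_pi = np.einsum('sa,sax->sx', pi, P)   # [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)

# For this fixed policy, the Bellman equation v = r_pi + gamma * P_pi v is linear:
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(r_pi, P_pi, v_pi, sep="\n")
```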
Maximization on the right-hand side
Solve the Bellman optimality equation
The right-hand side must be handled first: there has to be some maximizing (optimal) policy $\pi$ before the optimal state value $v(s)$ can be obtained.
- elementwise form
$$v(s) = \max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$
Fix $v(s')$ first and solve for $\pi$:
$$v(s) = \max_\pi \sum_a \pi(a \mid s)\, q(s,a)$$
Considering that $\sum_a \pi(a \mid s) = 1$, we have
$$\max_\pi \sum_a \pi(a \mid s)\, q(s,a) = \max_{a \in \mathcal{A}(s)} q(s,a),$$
where the optimality is achieved when
$$\pi(a \mid s) = \begin{cases} 1, & a = a^* \\ 0, & a \neq a^* \end{cases}$$
where $a^* = \arg\max_a q(s,a)$.
- matrix-vector form
$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$
Let
$$f(v) := \max_\pi (r_\pi + \gamma P_\pi v).$$
Then, the Bellman optimality equation becomes
$$v = f(v)$$
where
$$[f(v)]_s = \max_\pi \sum_a \pi(a \mid s)\, q(s,a), \quad s \in \mathcal{S}.$$
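The map $f$ is easy to implement once the model is known, because the inner maximization over $\pi$ reduces to a maximization over actions. Below is a minimal sketch on the same hypothetical MDP as above (the helper name `f` and the random-distribution check are mine); the check at the end only illustrates that no stochastic policy can beat putting all probability on the greedy action.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def f(v):
    """[f(v)]_s = max_pi sum_a pi(a|s) q(s,a) = max_a q(s,a)."""
    q = R + gamma * P @ v
    return q.max(axis=1)

# Check in state s0: no distribution over actions beats the greedy action.
v = np.array([1.0, 3.0])                      # an arbitrary value vector
q0 = (R + gamma * P @ v)[0]                   # q(s0, .)
rng = np.random.default_rng(0)
pis = rng.dirichlet(np.ones(2), size=1000)    # random distributions pi(.|s0)
assert (pis @ q0).max() <= q0.max() + 1e-12
print(f(v))
```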
Contraction mapping theorem
Fixed point:
$x \in X$ is a fixed point of $f : X \to X$ if $f(x) = x$.
Contraction mapping (or contractive function):
$f$ is a contraction mapping if $\| f(x_1) - f(x_2) \| \leq \gamma \| x_1 - x_2 \|$, where $\gamma \in (0,1)$.
- $\gamma$ must be strictly less than $1$ so that many limits such as $\gamma^k \to 0$ as $k \to \infty$ hold.
- Here $\| \cdot \|$ can be any vector norm.
Contraction mapping theorem
For any equation that has the form of $x = f(x)$, if $f$ is a contraction mapping, then
- Existence: There exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
- Uniqueness: The fixed point $x^*$ is unique.
- Algorithm: Consider a sequence $\{x_k\}$ where $x_{k+1} = f(x_k)$; then $x_k \to x^*$ as $k \to \infty$. Moreover, the convergence rate is exponentially fast.
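A tiny sketch of the theorem in action, using a scalar contraction rather than the BOE: $f(x) = 0.5x + 1$ has contraction factor $0.5$ and unique fixed point $x^* = 2$, and repeated application converges from any starting point. (The example function is my own choice, not from the course.)

```python
# f(x) = 0.5 * x + 1 is a contraction with factor 0.5; its unique fixed point is x* = 2.
def f(x):
    return 0.5 * x + 1.0

x = 100.0                # arbitrary initial guess x_0
for k in range(10):
    x = f(x)             # x_{k+1} = f(x_k)
    print(k, x)          # the error |x_k - 2| halves at every step
```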
Solution
Let's come back to the Bellman optimality equation:
$$v = f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$$
Contraction Property:
$f(v)$ is a contraction mapping satisfying
$$\| f(v_1) - f(v_2) \| \leq \gamma \| v_1 - v_2 \|$$
where $\gamma$ is the discount rate!
Existence, Uniqueness, and Algorithm:
For the BOE $v = f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$, there always exists a solution $v^*$ and the solution is unique. The solution can be obtained iteratively by
$$v_{k+1} = f(v_k) = \max_\pi (r_\pi + \gamma P_\pi v_k)$$
This sequence $\{v_k\}$ converges to $v^*$ exponentially fast given any initial guess $v_0$. The convergence rate is determined by $\gamma$.
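This iteration is exactly value iteration. A minimal sketch on the same hypothetical 2-state, 2-action MDP as above: it runs $v_{k+1} = f(v_k)$ from $v_0 = 0$ and prints the error $\|v_k - v^*\|_\infty$, which shrinks roughly by a factor of $\gamma$ per step.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def f(v):
    return (R + gamma * P @ v).max(axis=1)

v_star = np.zeros(2)
for _ in range(2000):          # run long enough to treat the result as v*
    v_star = f(v_star)

v = np.zeros(2)                # arbitrary initial guess v_0
for k in range(10):
    v = f(v)                   # v_{k+1} = f(v_k)
    print(k, np.max(np.abs(v - v_star)))   # error shrinks roughly like gamma^k
```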
Optimality
Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies
$$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*)$$
Suppose
$$\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$$
Then
$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$
Therefore, $\pi^*$ is a policy and $v^* = v_{\pi^*}$ is the corresponding state value.
Policy Optimality
- Suppose that $v^*$ is the unique solution to $v = \max_\pi (r_\pi + \gamma P_\pi v)$, and $v_\pi$ is the state value satisfying $v_\pi = r_\pi + \gamma P_\pi v_\pi$ for any given policy $\pi$. Then
$$v^* \geq v_\pi, \quad \forall \pi$$
Greedy Optimal Policy
- For any $s \in \mathcal{S}$, the deterministic greedy policy
$$\pi^*(a \mid s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \neq a^*(s) \end{cases}$$
is an optimal policy solving the BOE. Here,
$$a^*(s) = \arg\max_a q^*(s,a),$$
where $q^*(s,a) := \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s')$.
Proof
$$\pi^*(s) = \arg\max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \right)$$
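Putting the last two results together: once $v^*$ is available, the greedy policy can be read off directly. A sketch on the same hypothetical MDP as above: solve the BOE by iteration, take $a^*(s) = \arg\max_a q^*(s,a)$, then verify that the state value of this deterministic policy coincides with $v^*$.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Solve the BOE iteratively for v*
v_star = np.zeros(2)
for _ in range(2000):
    v_star = (R + gamma * P @ v_star).max(axis=1)

q_star = R + gamma * P @ v_star            # q*(s, a)
a_star = q_star.argmax(axis=1)             # a*(s) = argmax_a q*(s, a)

# State value of the deterministic greedy policy pi*(a|s) = 1{a = a*(s)}
idx = np.arange(2)
r_pi = R[idx, a_star]
P_pi = P[idx, a_star]
v_pi_star = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

print(a_star)                              # greedy action in each state
print(np.allclose(v_star, v_pi_star))      # True: v_{pi*} = v*
```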
Summary of the formulas
The meaning of $v^*$
$v^*$ is an "ideal value table": it tells you, for every starting state, the total return you can collect if you always make the optimal choice from now on.
It satisfies a self-consistent relation (the Bellman optimality equation):
$$v^*(s) = \max_a \Big[\text{immediate reward} + \gamma \times \text{future value}\Big]$$
The meaning of $\pi^*$
- $\pi^*$ is the optimal policy: it specifies which action to take in every state so that the total return is no worse than under any other policy.
- From the formula, $\pi^*$ simply picks the action that maximizes the right-hand side built from $v^*$ (the "greedy choice").
The Policy Optimality theorem
- For any other policy $\pi$, its value function $v_\pi$ never exceeds $v^*$.
- $v^*$ is the best that any policy can achieve; it dominates the value table of every other policy.
The Greedy Optimal Policy theorem
- Once you have $v^*$, simply "choosing the action that maximizes the return" in every state yields $\pi^*$.
- The optimal policy really is a greedy one, provided the greediness is based on the correct $v^*$.
Further explanation
- Why must we have $v^*$ before we can obtain $\pi^*$?
  - Because the definition of $\pi^*$ depends on the "future return", and the future return is exactly what $v^*$ describes.
  - Once $v^*$ is known, the optimal policy follows naturally from the greedy rule.
- Why is $v^*$ at least as large as every $v_\pi$?
  - $v_\pi$ is the performance under a fixed policy.
  - $v^*$ is the performance when the optimal action is chosen at every step.
  - Clearly, if you can always pick the best action, you cannot do worse than any fixed policy.
- Why is the greedy policy guaranteed to be optimal?
  - Because the Bellman optimality equation already guarantees that, under $v^*$, the optimal value of every state equals the return obtained by "choosing the optimal action".
  - So as long as you execute this "optimal action" in every state, the value function of the whole process is exactly $v^*$.
  - In other words: greedy + the correct value table = the globally optimal policy.
Analyzing optimal policies
What factors determine the optimal policy?
- It can be clearly seen from the BOE
$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right)$$
that there are three factors:
- Reward design: $r$
- System model: $p(s' \mid s,a), \; p(r \mid s,a)$
- Discount rate: $\gamma$
- In this equation, $v(s)$, $v(s')$, and $\pi(a \mid s)$ are unknowns to be calculated.
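To see that the discount rate alone can flip the optimal policy, here is a small made-up example (my own construction, not the grid-world examples used in the course): in state $s_0$, action $a_0$ keeps collecting reward 1, while action $a_1$ gives up the immediate reward to reach state $s_1$, where every step earns reward 2. The greedy action in $s_0$ switches as $\gamma$ crosses $0.5$.

```python
import numpy as np

# In s0: a0 earns reward 1 and stays in s0; a1 earns 0 and moves to s1.
# In s1: both actions earn reward 2 and stay in s1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])

def greedy_action_in_s0(gamma):
    v = np.zeros(2)
    for _ in range(5000):                       # solve the BOE by iteration
        v = (R + gamma * P @ v).max(axis=1)
    return (R + gamma * P @ v).argmax(axis=1)[0]

print(greedy_action_in_s0(0.3))   # 0: short-sighted, keep taking reward 1 in s0
print(greedy_action_in_s0(0.7))   # 1: far-sighted, give up reward now to reach s1
```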
Optimal Policy Invariance
- Consider a Markov decision process with $v^* \in \mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying
$$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*).$$
- If every reward $r$ is changed by an affine transformation to $ar + b$, where $a, b \in \mathbb{R}$ and $a > 0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:
$$v' = a v^* + \frac{b}{1-\gamma} \mathbf{1},$$
where $\gamma \in (0,1)$ is the discount rate and $\mathbf{1} = [1, \ldots, 1]^T$.
- Consequently, the optimal policies are invariant to such affine transformations of the reward signals.
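A quick numerical check of this statement on the same hypothetical MDP as above; the constants $a = 2$, $b = -5$ and the helper name `solve_boe` are arbitrary choices for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def solve_boe(R):
    """Return (v*, greedy policy) for expected rewards R."""
    v = np.zeros(2)
    for _ in range(2000):
        v = (R + gamma * P @ v).max(axis=1)
    return v, (R + gamma * P @ v).argmax(axis=1)

a, b = 2.0, -5.0                          # reward transformation r -> a*r + b, with a > 0
v_star, pi_star = solve_boe(R)
v_prime, pi_prime = solve_boe(a * R + b)  # expected reward of a*r + b is a*R + b

print(np.allclose(v_prime, a * v_star + b / (1 - gamma)))   # True
print(np.array_equal(pi_star, pi_prime))                    # True: same optimal policy
```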
Summary
The Bellman optimality equation characterizes the value function obtained by choosing the best policy among all policies. It guarantees that a unique optimal state value $v^*$ exists, and the optimal policy $\pi^*$ can be derived by maximizing over the actions in each state (the greedy principle). Moreover, the optimal policy is determined only by the reward design, the environment's transition model, and the discount rate, and it remains invariant under (positive) affine transformations of the rewards.