Series Index
Table of Contents
- Series Index
- Preface
- Definition of optimal policy
- Bellman optimality equation
- Introduction
- Maximization on the right-hand side
- Contraction mapping theorem
- Solution
- Optimality
- Analyzing optimal policies
- Summary
Preface
This series records my study notes for Prof. 趙世鈺 (Shiyu Zhao)'s course 【強化學習的數學原理】 (Mathematical Foundations of Reinforcement Learning) on Bilibili. For the course itself, please see:
Bilibili video: 【【強化學習的數學原理】課程:從零開始到透徹理解(完結)】
GitHub 課程資料:Book-Mathematical-Foundation-of-Reinforcement-Learning
Definition of optimal policy
The state value could be used to evaluate if a policy is good or not: if
$$v_{\pi_1}(s) \geq v_{\pi_2}(s), \quad \forall s \in \mathcal{S}$$
then $\pi_1$ is "better" than $\pi_2$.
A policy $\pi^*$ is optimal if
$$v_{\pi^*}(s) \geq v_\pi(s), \quad \forall s \in \mathcal{S}, \; \forall \pi.$$
Bellman optimality equation
Introduction
Bellman optimality equation (elementwise form):
$$v(s) = \max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$
$$= \max_{\pi} \sum_a \pi(a \mid s)\, q(s,a), \quad s \in \mathcal{S}$$
Remarks:
- $p(r \mid s,a)$ and $p(s' \mid s,a)$ are known.
- $v(s)$ and $v(s')$ are unknown and to be calculated.
- $\pi(a \mid s)$ is also not fixed here: the maximization is taken over all policies $\pi$, as discussed below.
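To make the quantities on the right-hand side concrete, here is a minimal Python sketch. The 2-state, 2-action MDP (arrays `P` and `R`) and the helper name `q_values` are my own made-up illustration, not an example from the course; it only shows how $q(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s')$ is computed from the model and a guess of $v$.

```python
import numpy as np

gamma = 0.9

# Hypothetical model (assumption, not from the course):
# P[s, a, s'] = p(s' | s, a), R[s, a] = sum_r p(r | s, a) * r
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def q_values(v):
    """q(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v(s')."""
    return R + gamma * P @ v        # shape: (n_states, n_actions)

v_guess = np.zeros(2)               # any guess of v(s); here v = 0
print(q_values(v_guess))            # [[0. 1.] [2. 0.]]
```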
Combining this with the fact that the Bellman equation depends on a known policy, let us explain why the Bellman optimality equation takes $\max_\pi$, and how it relates to the optimal policy $\pi^*$.
Recap: the Bellman equation (depends on a known policy)
For a fixed policy $\pi$, the state value is defined by:
$$v_\pi(s) = \sum_a \pi(a \mid s) \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s') \Bigg]$$
- Here $\pi(a \mid s)$ is a known action distribution, so in this case the Bellman equation is a policy evaluation tool.
- A policy assigns, for every state $s$, a probability to each possible action $a$; one policy therefore corresponds to one set of "action distributions over all states".
- Different policies correspond to different sets of action distributions.
Policy $\pi_1$:
In state $s$, it may put a higher probability on action $a_1$ and a lower probability on action $a_2$.
Policy $\pi_2$:
In the same state $s$, it may do exactly the opposite, putting a lower probability on $a_1$ and a higher probability on $a_2$.
From policy dependence to optimality
If we do not just want to evaluate one specific policy but want to find the optimal policy, we need to ask: for each state $s$, which action choice (or policy) maximizes the long-term return?
The definition of the optimal state value then becomes:
$$v^*(s) = \max_\pi v_\pi(s), \quad \forall s \in \mathcal{S}$$
Here $\max_\pi$ means: among all possible policies, find one that maximizes the value function.
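The definition $v^*(s) = \max_\pi v_\pi(s)$ can be checked by brute force on a small example. The sketch below reuses the same made-up 2-state, 2-action MDP as above (an assumption for illustration only) and enumerates the deterministic policies; since a deterministic optimal policy always exists for a finite MDP, the elementwise maximum of their value functions is $v^*$.

```python
import numpy as np
from itertools import product

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # p(s'|s,a) for the toy model above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],                 # expected immediate rewards
              [2.0, 0.0]])

values = []
for actions in product(range(2), repeat=2):     # one action per state
    a = np.array(actions)
    idx = np.arange(2)
    r_pi = R[idx, a]                            # [r_pi]_s
    P_pi = P[idx, a]                            # [P_pi]_{s,s'}
    # v_pi solves the linear Bellman equation v_pi = r_pi + gamma * P_pi v_pi
    v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    values.append(v_pi)

v_star = np.max(values, axis=0)   # v*(s) = max_pi v_pi(s), taken elementwise
print(v_star)                     # [19. 20.] for this toy model
```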
The Bellman optimality equation
Introducing $\max_\pi$ into the Bellman equation gives:
$$v^*(s) = \max_a \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \Bigg]$$
Key points:
- In the ordinary Bellman equation, $\pi(a \mid s)$ is a known distribution.
- In the Bellman optimality equation, we do not fix a policy; instead we maximize directly at the action level, which is equivalent to "choosing the optimal action".
- Therefore the optimal value function $v^*(s)$ no longer depends on any particular $\pi$; the policy optimization is built into the equation itself.
Relation to the optimal policy $\pi^*$
Define:
$$\pi^*(s) = \arg\max_a \Bigg[ \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \Bigg]$$
That is, the optimal policy chooses, in every state, the action that maximizes the future return.
In other words:
- The ordinary Bellman equation = value evaluation under a known policy.
- The Bellman optimality equation = selecting the best value function over all policies, which in turn defines the optimal policy.
Bellman optimality equation (matrix-vector form):
$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$
where the elements corresponding to $s$ or $s'$ are
$$[r_\pi]_s \triangleq \sum_a \pi(a \mid s) \sum_r p(r \mid s,a)\, r,$$
$$[P_\pi]_{s,s'} = p(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, p(s' \mid s,a)$$
Here $\max_\pi$ is performed elementwise.
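For a given policy, $r_\pi$ and $P_\pi$ are just a vector and a stochastic matrix built from the model. The sketch below constructs them on the same hypothetical 2-state, 2-action MDP used above and, as a by-product, solves the ordinary (policy-specific) Bellman equation, which is linear; the concrete policy `pi` is an arbitrary choice of mine for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# A given (known) policy, chosen arbitrarily: pi[s, a] = pi(a | s)
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

r_pi = (pi * R).sum(axis=1)             # [r_pi]_s = sum_a pi(a|s) sum_r p(r|s,a) r
P_pi = np.einsum('sa,sax->sx', pi, P)   # [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)

# For this fixed policy, the Bellman equation v = r_pi + gamma * P_pi v is linear:
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(r_pi, P_pi, v_pi, sep="\n")
```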
Maximization on the right-hand side
Solve the Bellman optimality equation
The right-hand side must be handled first: there has to be some maximizing (optimal) policy $\pi$ before the optimal state value $v(s)$ can be obtained.
- elementwise form
$$v(s) = \max_{\pi} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$
Fix $v(s')$ first and solve for $\pi$:
$$v(s) = \max_\pi \sum_a \pi(a \mid s)\, q(s,a)$$
Considering that $\sum_a \pi(a \mid s) = 1$, we have
$$\max_\pi \sum_a \pi(a \mid s)\, q(s,a) = \max_{a \in \mathcal{A}(s)} q(s,a),$$
where the optimality is achieved when
$$\pi(a \mid s) = \begin{cases} 1, & a = a^* \\ 0, & a \neq a^* \end{cases}$$
where $a^* = \arg\max_a q(s,a)$.
- matrix-vector form
$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$
Let
$$f(v) := \max_\pi (r_\pi + \gamma P_\pi v).$$
Then, the Bellman optimality equation becomes
$$v = f(v)$$
where
$$[f(v)]_s = \max_\pi \sum_a \pi(a \mid s)\, q(s,a), \quad s \in \mathcal{S}.$$
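The map $f$ is easy to implement once the model is known, because the inner maximization over $\pi$ reduces to a maximization over actions. Below is a minimal sketch on the same hypothetical MDP as above (the helper name `f` and the random-distribution check are mine); the check at the end only illustrates that no stochastic policy can beat putting all probability on the greedy action.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def f(v):
    """[f(v)]_s = max_pi sum_a pi(a|s) q(s,a) = max_a q(s,a)."""
    q = R + gamma * P @ v
    return q.max(axis=1)

# Check in state s0: no distribution over actions beats the greedy action.
v = np.array([1.0, 3.0])                      # an arbitrary value vector
q0 = (R + gamma * P @ v)[0]                   # q(s0, .)
rng = np.random.default_rng(0)
pis = rng.dirichlet(np.ones(2), size=1000)    # random distributions pi(.|s0)
assert (pis @ q0).max() <= q0.max() + 1e-12
print(f(v))
```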
Contraction mapping theorem
Fixed point:
$x \in X$ is a fixed point of $f : X \to X$ if $f(x) = x$.
Contraction mapping (or contractive function):
$f$ is a contraction mapping if $\| f(x_1) - f(x_2) \| \leq \gamma \| x_1 - x_2 \|$, where $\gamma \in (0,1)$.
- $\gamma$ must be strictly less than $1$ so that many limits such as $\gamma^k \to 0$ as $k \to \infty$ hold.
- Here $\| \cdot \|$ can be any vector norm.
Contraction mapping theorem
For any equation that has the form of $x = f(x)$, if $f$ is a contraction mapping, then
- Existence: There exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
- Uniqueness: The fixed point $x^*$ is unique.
- Algorithm: Consider a sequence $\{x_k\}$ where $x_{k+1} = f(x_k)$; then $x_k \to x^*$ as $k \to \infty$. Moreover, the convergence rate is exponentially fast.
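A tiny sketch of the theorem in action, using a scalar contraction rather than the BOE: $f(x) = 0.5x + 1$ has contraction factor $0.5$ and unique fixed point $x^* = 2$, and repeated application converges from any starting point. (The example function is my own choice, not from the course.)

```python
# f(x) = 0.5 * x + 1 is a contraction with factor 0.5; its unique fixed point is x* = 2.
def f(x):
    return 0.5 * x + 1.0

x = 100.0                # arbitrary initial guess x_0
for k in range(10):
    x = f(x)             # x_{k+1} = f(x_k)
    print(k, x)          # the error |x_k - 2| halves at every step
```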
Solution
Let's come back to the Bellman optimality equation:
$$v = f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$$
Contraction Property:
$f(v)$ is a contraction mapping satisfying
$$\| f(v_1) - f(v_2) \| \leq \gamma \| v_1 - v_2 \|$$
where $\gamma$ is the discount rate!
Existence, Uniqueness, and Algorithm:
For the BOE $v = f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$, there always exists a solution $v^*$ and the solution is unique. The solution can be obtained iteratively by
$$v_{k+1} = f(v_k) = \max_\pi (r_\pi + \gamma P_\pi v_k)$$
This sequence $\{v_k\}$ converges to $v^*$ exponentially fast given any initial guess $v_0$. The convergence rate is determined by $\gamma$.
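This iteration is exactly value iteration. A minimal sketch on the same hypothetical 2-state, 2-action MDP as above: it runs $v_{k+1} = f(v_k)$ from $v_0 = 0$ and prints the error $\|v_k - v^*\|_\infty$, which shrinks roughly by a factor of $\gamma$ per step.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def f(v):
    return (R + gamma * P @ v).max(axis=1)

v_star = np.zeros(2)
for _ in range(2000):          # run long enough to treat the result as v*
    v_star = f(v_star)

v = np.zeros(2)                # arbitrary initial guess v_0
for k in range(10):
    v = f(v)                   # v_{k+1} = f(v_k)
    print(k, np.max(np.abs(v - v_star)))   # error shrinks roughly like gamma^k
```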
Optimality
Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies
$$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*)$$
Suppose
$$\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$$
Then
$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$
Therefore, $\pi^*$ is a policy and $v^* = v_{\pi^*}$ is the corresponding state value.
Policy Optimality
- Suppose that $v^*$ is the unique solution to $v = \max_\pi (r_\pi + \gamma P_\pi v)$, and $v_\pi$ is the state value satisfying $v_\pi = r_\pi + \gamma P_\pi v_\pi$ for any given policy $\pi$. Then
$$v^* \geq v_\pi, \quad \forall \pi$$
Greedy Optimal Policy
- For any $s \in \mathcal{S}$, the deterministic greedy policy
$$\pi^*(a \mid s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \neq a^*(s) \end{cases}$$
is an optimal policy solving the BOE. Here,
$$a^*(s) = \arg\max_a q^*(s,a),$$
where $q^*(s,a) := \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s')$.
Proof
$$\pi^*(s) = \arg\max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v^*(s') \right)$$
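Putting the last two results together: once $v^*$ is available, the greedy policy can be read off directly. A sketch on the same hypothetical MDP as above: solve the BOE by iteration, take $a^*(s) = \arg\max_a q^*(s,a)$, then verify that the state value of this deterministic policy coincides with $v^*$.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

# Solve the BOE iteratively for v*
v_star = np.zeros(2)
for _ in range(2000):
    v_star = (R + gamma * P @ v_star).max(axis=1)

q_star = R + gamma * P @ v_star            # q*(s, a)
a_star = q_star.argmax(axis=1)             # a*(s) = argmax_a q*(s, a)

# State value of the deterministic greedy policy pi*(a|s) = 1{a = a*(s)}
idx = np.arange(2)
r_pi = R[idx, a_star]
P_pi = P[idx, a_star]
v_pi_star = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

print(a_star)                              # greedy action in each state
print(np.allclose(v_star, v_pi_star))      # True: v_{pi*} = v*
```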
Summary of the formulas
The meaning of $v^*$
$v^*$ is an "ideal value table": it tells you, for every starting state, the total return you can collect if you always make the optimal choice from now on.
It satisfies a self-consistent relation (the Bellman optimality equation):
$$v^*(s) = \max_a \Big[\text{immediate reward} + \gamma \times \text{future value}\Big]$$
The meaning of $\pi^*$
- $\pi^*$ is the optimal policy: it specifies which action to take in every state so that the total return is no worse than under any other policy.
- From the formula, $\pi^*$ simply picks the action that maximizes the right-hand side built from $v^*$ (the "greedy choice").
The Policy Optimality theorem
- For any other policy $\pi$, its value function $v_\pi$ never exceeds $v^*$.
- $v^*$ is the best that any policy can achieve; it dominates the value table of every other policy.
The Greedy Optimal Policy theorem
- Once you have $v^*$, simply "choosing the action that maximizes the return" in every state yields $\pi^*$.
- The optimal policy really is a greedy one, provided the greediness is based on the correct $v^*$.
Further explanation
- Why must we have $v^*$ before we can obtain $\pi^*$?
  - Because the definition of $\pi^*$ depends on the "future return", and the future return is exactly what $v^*$ describes.
  - Once $v^*$ is known, the optimal policy follows naturally from the greedy rule.
- Why is $v^*$ at least as large as every $v_\pi$?
  - $v_\pi$ is the performance under a fixed policy.
  - $v^*$ is the performance when the optimal action is chosen at every step.
  - Clearly, if you can always pick the best action, you cannot do worse than any fixed policy.
- Why is the greedy policy guaranteed to be optimal?
  - Because the Bellman optimality equation already guarantees that, under $v^*$, the optimal value of every state equals the return obtained by "choosing the optimal action".
  - So as long as you execute this "optimal action" in every state, the value function of the whole process is exactly $v^*$.
  - In other words: greedy + the correct value table = the globally optimal policy.
Analyzing optimal policies
What factors determine the optimal policy?
- It can be clearly seen from the BOE
$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v(s') \right)$$
that there are three factors:
- Reward design: $r$
- System model: $p(s' \mid s,a), \; p(r \mid s,a)$
- Discount rate: $\gamma$
- In this equation, $v(s)$, $v(s')$, and $\pi(a \mid s)$ are unknowns to be calculated.
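To see that the discount rate alone can flip the optimal policy, here is a small made-up example (my own construction, not the grid-world examples used in the course): in state $s_0$, action $a_0$ keeps collecting reward 1, while action $a_1$ gives up the immediate reward to reach state $s_1$, where every step earns reward 2. The greedy action in $s_0$ switches as $\gamma$ crosses $0.5$.

```python
import numpy as np

# In s0: a0 earns reward 1 and stays in s0; a1 earns 0 and moves to s1.
# In s1: both actions earn reward 2 and stay in s1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])

def greedy_action_in_s0(gamma):
    v = np.zeros(2)
    for _ in range(5000):                       # solve the BOE by iteration
        v = (R + gamma * P @ v).max(axis=1)
    return (R + gamma * P @ v).argmax(axis=1)[0]

print(greedy_action_in_s0(0.3))   # 0: short-sighted, keep taking reward 1 in s0
print(greedy_action_in_s0(0.7))   # 1: far-sighted, give up reward now to reach s1
```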
Optimal Policy Invariance
- Consider a Markov decision process with $v^* \in \mathbb{R}^{|\mathcal{S}|}$ as the optimal state value satisfying
$$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*).$$
- If every reward $r$ is changed by an affine transformation to $ar + b$, where $a, b \in \mathbb{R}$ and $a > 0$, then the corresponding optimal state value $v'$ is also an affine transformation of $v^*$:
$$v' = a v^* + \frac{b}{1-\gamma} \mathbf{1},$$
where $\gamma \in (0,1)$ is the discount rate and $\mathbf{1} = [1, \ldots, 1]^T$.
- Consequently, the optimal policies are invariant to such affine transformations of the reward signals.
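A quick numerical check of this statement on the same hypothetical MDP as above; the constants $a = 2$, $b = -5$ and the helper name `solve_boe` are arbitrary choices for illustration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # same hypothetical model as above
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

def solve_boe(R):
    """Return (v*, greedy policy) for expected rewards R."""
    v = np.zeros(2)
    for _ in range(2000):
        v = (R + gamma * P @ v).max(axis=1)
    return v, (R + gamma * P @ v).argmax(axis=1)

a, b = 2.0, -5.0                          # reward transformation r -> a*r + b, with a > 0
v_star, pi_star = solve_boe(R)
v_prime, pi_prime = solve_boe(a * R + b)  # expected reward of a*r + b is a*R + b

print(np.allclose(v_prime, a * v_star + b / (1 - gamma)))   # True
print(np.array_equal(pi_star, pi_prime))                    # True: same optimal policy
```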
Summary
The Bellman optimality equation characterizes the value function obtained by choosing the best policy among all policies. It guarantees that a unique optimal state value $v^*$ exists, and the optimal policy $\pi^*$ can be derived by maximizing over the actions in each state (the greedy principle). Moreover, the optimal policy is determined only by the reward design, the environment's transition model, and the discount rate, and it remains invariant under (positive) affine transformations of the rewards.