Multi-Agent Reinforcement Learning: QPLEX
Paper: QPLEX: Duplex Dueling Multi-Agent Q-Learning
Video: Experiments on StarCraft II
Suggested background: QMIX (Multi-Agent Reinforcement Learning: QMIX)
1 Introduction
IGM(Individual-Global-Max):
$$\underset{\mathbf{u}}{\operatorname{argmax}}\, Q_{tot}(\boldsymbol{\tau}, \mathbf{u})=\begin{pmatrix} \operatorname{argmax}_{u^{1}} Q_{1}\left(\tau^{1}, u^{1}\right) \\ \vdots \\ \operatorname{argmax}_{u^{n}} Q_{n}\left(\tau^{n}, u^{n}\right) \end{pmatrix}$$
where $Q_{tot}$ is the joint Q-function and $Q_i$ is the action-value function of agent $i$.
IGM requires that $\operatorname{argmax}(Q_{tot})$ and the per-agent $\operatorname{argmax}(Q_i)$ select the same actions; in other words, without any extra constraints, each agent's individually optimal action is also the jointly optimal action.
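As a small, purely illustrative check (the values and shapes below are made up, not from the paper), consider a two-agent joint value built as a sum of per-agent utilities; for such a factorization IGM holds by construction, and the joint greedy action is exactly the tuple of individual greedy actions:

import numpy as np

rng = np.random.default_rng(0)
Q1 = rng.normal(size=3)            # Q_1(tau_1, u_1), 3 actions
Q2 = rng.normal(size=3)            # Q_2(tau_2, u_2), 3 actions
Q_tot = Q1[:, None] + Q2[None, :]  # a VDN-style factorization, for which IGM holds by construction

joint_greedy = tuple(int(i) for i in np.unravel_index(Q_tot.argmax(), Q_tot.shape))  # argmax_u Q_tot
individual_greedy = (int(Q1.argmax()), int(Q2.argmax()))                             # per-agent argmaxes
assert joint_greedy == individual_greedy                                             # IGM consistency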
Motivation for QPLEX: QMIX and VDN factorize the joint action-value function through two sufficient conditions for IGM (shown below; a minimal sketch of both mixers follows the equation). Both factorizations impose structural constraints that limit the class of joint action-value functions they can represent.
$$Q_{tot}^{\mathrm{VDN}}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} Q_{i}\left(\tau_{i}, a_{i}\right) \quad \text{and} \quad \forall i \in \mathcal{N},\ \frac{\partial Q_{tot}^{\mathrm{QMIX}}(\boldsymbol{\tau}, \boldsymbol{a})}{\partial Q_{i}\left(\tau_{i}, a_{i}\right)}>0$$
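For reference, a minimal sketch of these two factorizations (class names and shapes are illustrative, not the pymarl implementations): VDN simply sums the per-agent values, while QMIX mixes them with state-conditioned non-negative weights so that $\partial Q_{tot}/\partial Q_i \geq 0$:

import torch as th
import torch.nn as nn
import torch.nn.functional as F

class TinyVDNMixer(nn.Module):
    def forward(self, agent_qs, states=None):
        # Q_tot^VDN = sum_i Q_i  (agent_qs: [batch, n_agents])
        return agent_qs.sum(dim=-1, keepdim=True)

class TinyQMIXMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # hypernetworks produce state-conditioned mixing weights and biases
        self.w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.b1 = nn.Linear(state_dim, embed_dim)
        self.w2 = nn.Linear(state_dim, embed_dim)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, states):
        # absolute values keep the mixing weights non-negative, so Q_tot is monotonic in every Q_i
        w1 = th.abs(self.w1(states)).view(-1, self.n_agents, self.embed_dim)
        w2 = th.abs(self.w2(states)).view(-1, self.embed_dim, 1)
        hidden = F.elu(th.bmm(agent_qs.unsqueeze(1), w1) + self.b1(states).unsqueeze(1))
        return (th.bmm(hidden, w2) + self.b2(states).unsqueeze(1)).squeeze(1)  # [batch, 1]

# usage sketch: q_tot = TinyQMIXMixer(n_agents=3, state_dim=16)(agent_qs, states)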
To address this limitation, the paper proposes duPLEX dueling multi-agent Q-learning (QPLEX), which uses a duplex dueling network architecture to factorize the joint action-value function into individual action-value functions. QPLEX introduces the dueling structure Q = V + A for both the joint and the individual action-value functions, and then re-formalizes the IGM principle as an advantage-based IGM. This reformulation turns IGM consistency into a constraint on the value range of the advantage functions, which makes it easy to learn action-value functions with a linear factorization structure. By encoding the IGM constraint directly into the duplex dueling architecture, QPLEX guarantees IGM consistency by construction.
Main highlight of QPLEX: both the joint value $Q_{tot}$ and each agent's value $Q_i$ are decomposed with the dueling structure $Q = V + A$. IGM consistency thereby becomes an easy-to-implement constraint on the value range of the advantage functions, which facilitates learning value functions with a linear factorization structure. The decomposition also makes the source of a Q-value explicit: Q-value = value of the current state (V) + advantage of the chosen action (A), so one can tell whether a Q-value is large because of the state itself or because of the action taken.
The decomposition is as follows:
- Dueling decomposition of the joint action-value function:
$$\text{(Joint Dueling)} \quad Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) \quad \text{and} \quad V_{tot}(\boldsymbol{\tau})=\max_{\boldsymbol{a}^{\prime}} Q_{tot}\left(\boldsymbol{\tau}, \boldsymbol{a}^{\prime}\right)$$
where $Q_{tot}: \mathcal{T} \times \mathcal{A} \mapsto \mathbb{R}$.
- Dueling decomposition of the individual action-value functions:
$$\text{(Individual Dueling)} \quad Q_{i}\left(\tau_{i}, a_{i}\right)=V_{i}\left(\tau_{i}\right)+A_{i}\left(\tau_{i}, a_{i}\right) \quad \text{and} \quad V_{i}\left(\tau_{i}\right)=\max_{a_{i}^{\prime}} Q_{i}\left(\tau_{i}, a_{i}^{\prime}\right)$$
where $\left[Q_{i}: \mathcal{T} \times \mathcal{A} \mapsto \mathbb{R}\right]_{i=1}^{n}$, for all $\boldsymbol{\tau} \in \mathcal{T}$, $\boldsymbol{a} \in \mathcal{A}$, $i \in \mathcal{N}$.
- Consistency constraint (advantage-based IGM):
$$\underset{\boldsymbol{a} \in \mathcal{A}}{\arg\max}\, A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\left(\underset{a_{1} \in \mathcal{A}}{\arg\max}\, A_{1}\left(\tau_{1}, a_{1}\right), \ldots, \underset{a_{n} \in \mathcal{A}}{\arg\max}\, A_{n}\left(\tau_{n}, a_{n}\right)\right)$$
For the advantage function we have:
$$A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)$$
When the optimal action is taken, $Q^{\pi}(s, a)=V^{\pi}(s)$, so for the joint and individual advantage functions:
$$A_{tot}\left(\boldsymbol{\tau}, \boldsymbol{a}^{*}\right)=A_{i}\left(\tau_{i}, a_{i}^{*}\right)=0 \quad \text{and} \quad A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})<0,\ A_{i}\left(\tau_{i}, a_{i}\right) \leq 0$$
where $\mathcal{A}^{*}(\boldsymbol{\tau})=\left\{\boldsymbol{a} \mid \boldsymbol{a} \in \mathcal{A},\ Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})\right\}$, for all $\boldsymbol{\tau} \in \mathcal{T}$, $\boldsymbol{a}^{*} \in \mathcal{A}^{*}(\boldsymbol{\tau})$, $\boldsymbol{a} \in \mathcal{A} \backslash \mathcal{A}^{*}(\boldsymbol{\tau})$, $i \in \mathcal{N}$.
Since V depends only on the state (history) and not on the action, the action-dependent part of Q is carried entirely by A, so the constraints above are equivalent to the original IGM condition (the small numeric check below illustrates this).
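A minimal numeric sketch of this point (the values are made up): defining V as the maximum of Q makes every advantage non-positive and exactly zero at the greedy action, so the argmax of A coincides with the argmax of Q.

import numpy as np

Q_i = np.array([1.3, 2.7, 0.4])      # assumed per-agent action values Q_i(tau_i, .)
V_i = Q_i.max()                      # V_i(tau_i) = max_a Q_i(tau_i, a)
A_i = Q_i - V_i                      # dueling decomposition: Q = V + A

assert np.all(A_i <= 0)              # advantages are non-positive ...
assert A_i[Q_i.argmax()] == 0        # ... and exactly zero at the greedy action
assert A_i.argmax() == Q_i.argmax()  # so constraining A preserves the greedy action (advantage-based IGM)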
2 QPLEX Algorithm Framework
The framework consists of three main parts:
- (a) the Dueling Mixing network
- (b) the overall Duplex Dueling architecture
- (c) the agent networks and the Transformation network
Each part is analyzed below:
2.1 Agent network (same as the agent network in QMIX)
Input: agent $i$'s observation $o_i^t$ at time $t$ and its action $a_i^{t-1}$ at time $t-1$
Output: agent $i$'s action-value function $Q_{i}\left(\tau_{i}, a_i^{t}\right)$ at time $t$
The agent network is implemented as a DRQN. Depending on the task, the agents' networks can either be trained separately or share parameters. DRQN replaces a fully connected layer of DQN with a recurrent layer (an LSTM in the original DRQN; a GRU in this implementation), which adapts better when observation quality varies. As shown in the figure, the network has three layers: an input layer (MLP) → a recurrent layer (GRU) → an output layer (MLP).
The implementation is as follows.
Agent network parameter configuration:
# --- Agent parameters ---
agent: "rnn" # Default rnn agent
rnn_hidden_dim: 64 # Size of hidden state for default rnn agent
obs_agent_id: True # Include the agent's one_hot id in the observation
obs_last_action: True # Include the agent's last action (one_hot) in the observation
RNN network:
import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    def __init__(self, input_shape, args):
        super(RNNAgent, self).__init__()
        self.args = args
        # Per the configuration above, the agent input is
        # input_shape = obs_shape + n_actions (last-action one-hot) + n_agents (agent-id one-hot)
        self.fc1 = nn.Linear(input_shape, args.rnn_hidden_dim)           # linear input layer
        self.rnn = nn.GRUCell(args.rnn_hidden_dim, args.rnn_hidden_dim)  # GRU cell, takes the previous hidden state
        self.fc2 = nn.Linear(args.rnn_hidden_dim, args.n_actions)        # linear output layer

    def init_hidden(self):
        # make hidden states on same device as model
        return self.fc1.weight.new(1, self.args.rnn_hidden_dim).zero_()

    def forward(self, inputs, hidden_state):
        x = F.relu(self.fc1(inputs))                               # linear layer followed by ReLU
        h_in = hidden_state.reshape(-1, self.args.rnn_hidden_dim)  # reshape hidden state to (batch, rnn_hidden_dim)
        h = self.rnn(x, h_in)                                      # GRU step on input x and previous hidden state
        q = self.fc2(h)                                            # per-action Q-values
        return q, h
Transformation network
Input: agent $i$'s state-value function $V_i(\tau_i)$, advantage function $A_i(\tau_i, a_i)$, and the global state $s_t$
Output: agent $i$'s state-value function $V_i(\boldsymbol{\tau})$ and advantage function $A_i(\boldsymbol{\tau}, a_i)$ conditioned on the global information $s$
$$\left[V_{i}\left(\tau_{i}\right), A_{i}\left(\tau_{i}, a_{i}\right)\right]_{i=1}^{n} \;\to\; \left[V_{i}(\boldsymbol{\tau}), A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\right]_{i=1}^{n}$$
The local value and advantage functions $V_i(\tau_i)$ and $A_i(\tau_i, a_i)$ are combined with the global state $s_t$ (or the joint observation history) to obtain local value and advantage functions $V_i(\boldsymbol{\tau})$ and $A_i(\boldsymbol{\tau}, a_i)$ conditioned on global information.
Concretely:
$$V_{i}(\boldsymbol{\tau})=w_{i}(\boldsymbol{\tau}) V_{i}\left(\tau_{i}\right)+b_{i}(\boldsymbol{\tau}) \quad \text{and} \quad A_{i}\left(\boldsymbol{\tau}, a_{i}\right)=w_{i}(\boldsymbol{\tau}) A_{i}\left(\tau_{i}, a_{i}\right)+b_{i}(\boldsymbol{\tau})$$
where $w_i$ is a positive weight, which preserves monotonicity (and hence the greedy action) between the local and the globally conditioned functions.
The implementation is as follows (excerpted from the mixer class shown later):
# in DMAQer.__init__: hypernetworks that produce w_i(tau) and b_i(tau) from the global state
self.hyper_w_final = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                   nn.ReLU(),
                                   nn.Linear(hypernet_embed, self.n_agents))   # weights W
self.V = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                       nn.ReLU(),
                       nn.Linear(hypernet_embed, self.n_agents))               # biases b

# in DMAQer.forward: obtain the transformation parameters from the global state s
w_final = self.hyper_w_final(states)                 # weights W
w_final = th.abs(w_final)                            # absolute value keeps the weights positive (monotonicity)
w_final = w_final.view(-1, self.n_agents) + 1e-10
v = self.V(states)                                   # biases b
v = v.view(-1, self.n_agents)
if self.args.weighted_head:                          # whether to use the weighted head
    agent_qs = w_final * agent_qs + v                # transformed per-agent action values Q_i(tau, a_i)
if not is_v:
    max_q_i = max_q_i.view(-1, self.n_agents)
    if self.args.weighted_head:
        max_q_i = w_final * max_q_i + v              # transformed per-agent state values V_i(tau)
2.2 Dueling Mixing network
Input: agent $i$'s state-value function $V_i(\boldsymbol{\tau})$, advantage function $A_i(\boldsymbol{\tau}, a_i)$, and the global state $s_t$
Output: the joint action-value function $Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$

The Dueling Mixing network consists of two parts:
$$Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$$
- Computing $V_{tot}(\boldsymbol{\tau})$:
Since $V$ depends only on the state $s$ (or the joint observation history $\boldsymbol{\tau}$), $V_{tot}(\boldsymbol{\tau})$ is simply the sum:
$$V_{tot}(\boldsymbol{\tau})=\sum_{i=1}^{n} V_{i}(\boldsymbol{\tau})$$
- Computing $A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$:
$$A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} \lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})\, A_{i}\left(\boldsymbol{\tau}, a_{i}\right)$$
where $\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})>0$, which keeps greedy action selection consistent between the joint and individual functions.
$$\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{k=1}^{K} \lambda_{i, k}(\boldsymbol{\tau}, \boldsymbol{a})\, \phi_{i, k}(\boldsymbol{\tau})\, v_{k}(\boldsymbol{\tau})$$
$\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})$ is computed with a multi-head attention mechanism, where $K$ is the number of heads, $\lambda_{i,k}(\boldsymbol{\tau}, \boldsymbol{a})$ and $\phi_{i,k}(\boldsymbol{\tau})$ are sigmoid-activated attention weights, and $v_{k}(\boldsymbol{\tau})>0$ is the key of each head.
- With $V_{tot}(\boldsymbol{\tau})$ and $A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$ in hand, $Q_{tot}$ follows:
$$Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} Q_{i}\left(\boldsymbol{\tau}, a_{i}\right)+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right)$$
The first term $\sum_{i=1}^{n} Q_{i}\left(\boldsymbol{\tau}, a_{i}\right)$ is identical to VDN's $Q_{tot}^{VDN}$, while the second term corrects the discrepancy between $Q_{tot}^{VDN}$ and the true joint action-value function $Q_{tot}$ (see the short derivation below).
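To see why the last equality holds, expand $V_{tot}$ and regroup using the individual dueling decomposition $Q_i(\boldsymbol{\tau}, a_i)=V_i(\boldsymbol{\tau})+A_i(\boldsymbol{\tau}, a_i)$:

$$
\begin{aligned}
V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})
&=\sum_{i=1}^{n} V_{i}(\boldsymbol{\tau})+\sum_{i=1}^{n} \lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a}) A_{i}\left(\boldsymbol{\tau}, a_{i}\right) \\
&=\sum_{i=1}^{n}\left[V_{i}(\boldsymbol{\tau})+A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\right]+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right) \\
&=\sum_{i=1}^{n} Q_{i}\left(\boldsymbol{\tau}, a_{i}\right)+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right).
\end{aligned}
$$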
The implementation is as follows:
# use the Q_Learner to train
agent_output_type: "q"
learner: "q_learner"
double_q: True
mixer: "qmix"
mixing_embed_dim: 32
hypernet_layers: 2
hypernet_embed: 64
import numpy as np
import torch as th
import torch.nn as nn

class DMAQer(nn.Module):
    def __init__(self, args):
        super(DMAQer, self).__init__()
        # read agent / environment parameters
        self.args = args
        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.state_dim = int(np.prod(args.state_shape))
        self.action_dim = args.n_agents * self.n_actions
        self.state_action_dim = self.state_dim + self.action_dim + 1
        self.embed_dim = args.mixing_embed_dim          # mixing embedding dimension
        hypernet_embed = self.args.hypernet_embed       # hypernetwork hidden dimension

        # hypernetwork producing the weights W
        self.hyper_w_final = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                           nn.ReLU(),
                                           nn.Linear(hypernet_embed, self.n_agents))
        # hypernetwork producing the biases b
        self.V = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                               nn.ReLU(),
                               nn.Linear(hypernet_embed, self.n_agents))
        # attention module computing lambda (DMAQ_SI_Weight is defined below)
        self.si_weight = DMAQ_SI_Weight(args)

    # Q_tot = V_tot + A_tot = \sum Q_i + \sum (lambda - 1) * A_i
    def calc_v(self, agent_qs):
        # Dueling Mixing network: V_tot = \sum V_i
        agent_qs = agent_qs.view(-1, self.n_agents)
        v_tot = th.sum(agent_qs, dim=-1)                 # sum over agents
        return v_tot

    def calc_adv(self, agent_qs, states, actions, max_q_i):
        # Dueling Mixing network: advantage term \sum (lambda - 1) * A_i
        states = states.reshape(-1, self.state_dim)
        actions = actions.reshape(-1, self.action_dim)
        agent_qs = agent_qs.view(-1, self.n_agents)
        max_q_i = max_q_i.view(-1, self.n_agents)
        adv_q = (agent_qs - max_q_i).view(-1, self.n_agents).detach()  # advantages A_i = Q_i - max Q_i, gradient detached
        adv_w_final = self.si_weight(states, actions)                  # attention weights lambda_i
        adv_w_final = adv_w_final.view(-1, self.n_agents)
        # compute the A_tot term
        if self.args.is_minus_one:                       # use the (lambda - 1) form
            adv_tot = th.sum(adv_q * (adv_w_final - 1.), dim=1)        # \sum (lambda - 1) * A_i
        else:
            adv_tot = th.sum(adv_q * adv_w_final, dim=1)
        return adv_tot

    def calc(self, agent_qs, states, actions=None, max_q_i=None, is_v=False):
        # compute either the V_tot part or the advantage part of the total value
        if is_v:
            v_tot = self.calc_v(agent_qs)
            return v_tot
        else:
            adv_tot = self.calc_adv(agent_qs, states, actions, max_q_i)
            return adv_tot

    def forward(self, agent_qs, states, actions=None, max_q_i=None, is_v=False):
        bs = agent_qs.size(0)                            # batch size
        states = states.reshape(-1, self.state_dim)
        agent_qs = agent_qs.view(-1, self.n_agents)
        # obtain the transformation network parameters from the global state s
        w_final = self.hyper_w_final(states)             # weights W
        w_final = th.abs(w_final)                        # absolute value keeps the weights positive (monotonicity)
        w_final = w_final.view(-1, self.n_agents) + 1e-10
        v = self.V(states)                               # biases b
        v = v.view(-1, self.n_agents)
        if self.args.weighted_head:                      # whether to use the weighted head
            agent_qs = w_final * agent_qs + v            # transformed per-agent action values
        if not is_v:
            max_q_i = max_q_i.view(-1, self.n_agents)
            if self.args.weighted_head:
                max_q_i = w_final * max_q_i + v          # transformed per-agent state values (max Q_i)
        y = self.calc(agent_qs, states, actions=actions, max_q_i=max_q_i, is_v=is_v)  # Dueling Mixing network
        v_tot = y.view(bs, -1, 1)
        return v_tot
Multi-head attention part:
$$\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{k=1}^{K} \lambda_{i, k}(\boldsymbol{\tau}, \boldsymbol{a})\, \phi_{i, k}(\boldsymbol{\tau})\, v_{k}(\boldsymbol{\tau})$$
import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class DMAQ_SI_Weight(nn.Module):
    def __init__(self, args):
        super(DMAQ_SI_Weight, self).__init__()
        self.args = args
        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.state_dim = int(np.prod(args.state_shape))
        self.action_dim = args.n_agents * self.n_actions
        self.state_action_dim = self.state_dim + self.action_dim
        self.num_kernel = args.num_kernel                  # number of attention heads K

        self.key_extractors = nn.ModuleList()
        self.agents_extractors = nn.ModuleList()
        self.action_extractors = nn.ModuleList()

        adv_hypernet_embed = self.args.adv_hypernet_embed
        for i in range(self.num_kernel):                   # one set of extractors per attention head
            if getattr(args, "adv_hypernet_layers", 1) == 1:
                self.key_extractors.append(nn.Linear(self.state_dim, 1))                        # key
                self.agents_extractors.append(nn.Linear(self.state_dim, self.n_agents))         # agent
                self.action_extractors.append(nn.Linear(self.state_action_dim, self.n_agents))  # action
            elif getattr(args, "adv_hypernet_layers", 1) == 2:
                self.key_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, 1)))                 # key
                self.agents_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # agent
                self.action_extractors.append(nn.Sequential(nn.Linear(self.state_action_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # action
            elif getattr(args, "adv_hypernet_layers", 1) == 3:
                self.key_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, 1)))                 # key
                self.agents_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # agent
                self.action_extractors.append(nn.Sequential(nn.Linear(self.state_action_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # action
            else:
                raise Exception("Error setting number of adv hypernet layers.")

    def forward(self, states, actions):
        states = states.reshape(-1, self.state_dim)
        actions = actions.reshape(-1, self.action_dim)
        data = th.cat([states, actions], dim=1)

        all_head_key = [k_ext(states) for k_ext in self.key_extractors]
        all_head_agents = [k_ext(states) for k_ext in self.agents_extractors]
        all_head_action = [sel_ext(data) for sel_ext in self.action_extractors]

        head_attend_weights = []
        for curr_head_key, curr_head_agents, curr_head_action in zip(all_head_key, all_head_agents, all_head_action):
            x_key = th.abs(curr_head_key).repeat(1, self.n_agents) + 1e-10   # v_k(tau), kept positive
            x_agents = F.sigmoid(curr_head_agents)                           # phi_{i,k}(tau)
            x_action = F.sigmoid(curr_head_action)                           # lambda_{i,k}(tau, a)
            weights = x_key * x_agents * x_action                            # per-head weights
            head_attend_weights.append(weights)

        head_attend = th.stack(head_attend_weights, dim=1)
        head_attend = head_attend.view(-1, self.num_kernel, self.n_agents)
        head_attend = th.sum(head_attend, dim=1)                             # sum over the K heads
        return head_attend
2.3 Training and update procedure
Loss function:
$$\mathcal{L}(\theta)=\sum_{i=1}^{b}\left[\left(y_{i}^{tot}-Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s ; \theta)\right)^{2}\right]$$
where $b$ is the number of samples drawn from the replay buffer, $y^{tot}=r+\gamma \max_{\mathbf{u}^{\prime}} Q_{tot}\left(\boldsymbol{\tau}^{\prime}, \mathbf{u}^{\prime}, s^{\prime} ; \theta^{-}\right)$, and $\theta^{-}$ are the parameters of the target network.
The temporal-difference error can therefore be written as:
$$\text{TD error}=\left(r+\gamma\, Q_{tot}(\text{target})\right)-Q_{tot}(\text{evaluate})$$
$Q_{tot}(\text{target})$: the maximum $Q_{tot}$ obtainable over all joint actions in the next state $s^{\prime}$; by the IGM condition it is computed by feeding in each agent's maximal action value for that state.
$Q_{tot}(\text{evaluate})$: the $Q_{tot}$ obtained in state $s$ under the current (evaluation) network for the actions actually taken.
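Before walking through the full pymarl learner code below, here is a minimal standalone sketch of this masked 1-step TD loss (the function and tensor names are hypothetical; tensors are assumed to have shape (batch, time, 1)):

import torch as th

def qplex_td_loss(q_tot_eval, q_tot_target_max, rewards, terminated, mask, gamma=0.99):
    """Masked 1-step TD loss for the joint value; all tensors are (batch, time, 1)."""
    # y^tot = r + gamma * max_a' Q_tot(target); no bootstrap after terminal steps
    targets = rewards + gamma * (1 - terminated) * q_tot_target_max
    td_error = q_tot_eval - targets.detach()          # gradient flows only through the evaluation network
    masked_td_error = td_error * mask                 # zero out padded timesteps
    return (masked_td_error ** 2).sum() / mask.sum()  # mean over valid (unpadded) entries only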
The implementation is as follows.
Parameter configuration:
# --- QMIX specific parameters ---
# use epsilon greedy action selector
action_selector: "epsilon_greedy"
epsilon_start: 1.0
epsilon_finish: 0.05
epsilon_anneal_time: 50000
runner: "episode"
buffer_size: 5000
# update the target network every {} episodes
target_update_interval: 200
# use the Q_Learner to train
agent_output_type: "q"
learner: "q_learner"
double_q: True
mixer: "qmix"
mixing_embed_dim: 32
hypernet_layers: 2
hypernet_embed: 64
name: "qmix"
Action selection (ε-greedy):
import torch as th
from torch.distributions import Categorical
from components.epsilon_schedules import DecayThenFlatSchedule  # pymarl utility

class EpsilonGreedyActionSelector():
    def __init__(self, args):
        self.args = args
        self.schedule = DecayThenFlatSchedule(args.epsilon_start, args.epsilon_finish, args.epsilon_anneal_time,
                                              decay="linear")
        self.epsilon = self.schedule.eval(0)

    def select_action(self, agent_inputs, avail_actions, t_env, test_mode=False):
        # Assuming agent_inputs is a batch of Q-values for each agent
        self.epsilon = self.schedule.eval(t_env)               # current epsilon from the annealing schedule
        if test_mode:
            # Greedy action selection only
            self.epsilon = 0.0
        # mask actions that are excluded from selection
        masked_q_values = agent_inputs.clone()                 # Q-values
        masked_q_values[avail_actions == 0.0] = -float("inf")  # unavailable actions should never be selected
        random_numbers = th.rand_like(agent_inputs[:, :, 0])   # random numbers with matching shape
        pick_random = (random_numbers < self.epsilon).long()   # 1 where we explore
        random_actions = Categorical(avail_actions.float()).sample().long()  # uniform sample over available actions
        # pick_random == 1: random_numbers < epsilon, explore with a random available action
        # pick_random == 0: random_numbers >= epsilon, take the action with the largest Q-value
        picked_actions = pick_random * random_actions + (1 - pick_random) * masked_q_values.max(dim=2)[1]
        return picked_actions                                  # selected actions
Computing each agent's estimated Q-values:
# Calculate the estimated Q-values of each agent
mac_out = []
mac.init_hidden(batch.batch_size)
for t in range(batch.max_seq_length):
    agent_outs = mac.forward(batch, t=t)   # per-agent Q-values at timestep t
    mac_out.append(agent_outs)
mac_out = th.stack(mac_out, dim=1)         # concat over time

# Pick the Q-values for the actions taken by each agent,
# then squeeze out the gathered dimension (it has size 1)
chosen_action_qvals = th.gather(mac_out[:, :-1], dim=3, index=actions).squeeze(3)  # remove the last dim

x_mac_out = mac_out.clone().detach()                                # copy without gradients
x_mac_out[avail_actions == 0] = -9999999                            # unavailable actions get (effectively) -inf
max_action_qvals, max_action_index = x_mac_out[:, :-1].max(dim=3)   # maximal action values and their indices
max_action_index = max_action_index.detach().unsqueeze(3)           # detach from the graph
is_max_action = (max_action_index == actions).int().float()         # whether the taken action was the greedy one
Computing each agent's target Q-values:
# Calculate the Q-values necessary for the target
target_mac_out = []
self.target_mac.init_hidden(batch.batch_size)
for t in range(batch.max_seq_length):
    target_agent_outs = self.target_mac.forward(batch, t=t)
    target_mac_out.append(target_agent_outs)

# We don't need the first timestep's Q-value estimate for calculating targets
target_mac_out = th.stack(target_mac_out[1:], dim=1)   # concat across time

# Mask out unavailable actions
target_mac_out[avail_actions[:, 1:] == 0] = -9999999

# Max over target Q-values
if self.args.double_q:  # double Q-learning: select the greedy action with the online network, evaluate it with the target network
    # Get actions that maximise live Q (for double Q-learning)
    mac_out_detach = mac_out.clone().detach()
    mac_out_detach[avail_actions == 0] = -9999999
    cur_max_actions = mac_out_detach[:, 1:].max(dim=3, keepdim=True)[1]      # greedy actions under the online network
    # evaluate those actions with the target network and squeeze out the gathered dimension
    target_chosen_qvals = th.gather(target_mac_out, 3, cur_max_actions).squeeze(3)
    target_max_qvals = target_mac_out.max(dim=3)[0]
    target_next_actions = cur_max_actions.detach()
    cur_max_actions_onehot = th.zeros(cur_max_actions.squeeze(3).shape + (self.n_actions,)).cuda()  # one-hot of the greedy actions
    cur_max_actions_onehot = cur_max_actions_onehot.scatter_(3, cur_max_actions, 1)
else:
    # Calculate the Q-values necessary for the target
    # (recomputed here exactly as above for the non-double-Q path)
    target_mac_out = []
    self.target_mac.init_hidden(batch.batch_size)
    for t in range(batch.max_seq_length):
        target_agent_outs = self.target_mac.forward(batch, t=t)
        target_mac_out.append(target_agent_outs)
    # We don't need the first timestep's Q-value estimate for calculating targets
    target_mac_out = th.stack(target_mac_out[1:], dim=1)   # concat across time
    target_max_qvals = target_mac_out.max(dim=3)[0]         # maximal target action values
Computing the loss and backpropagating:
# Mixing network: compute the total values
# In the QPLEX update, the evaluation mixer takes the Q-values of the actions each agent actually chose,
# while the target mixer takes each agent's maximal Q-values, just as in a DQN-style update
if mixer is not None:
    # compute Q_tot(evaluate)
    if self.args.mixer == "dmaq_qatten":
        ans_chosen, q_attend_regs, head_entropies = \
            mixer(chosen_action_qvals, batch["state"][:, :-1], is_v=True)                # state value V
        ans_adv, _, _ = mixer(chosen_action_qvals, batch["state"][:, :-1], actions=actions_onehot,
                              max_q_i=max_action_qvals, is_v=False)                      # advantage A
        chosen_action_qvals = ans_chosen + ans_adv                                       # action value Q
    else:
        ans_chosen = mixer(chosen_action_qvals, batch["state"][:, :-1], is_v=True)       # state value V
        ans_adv = mixer(chosen_action_qvals, batch["state"][:, :-1], actions=actions_onehot,
                        max_q_i=max_action_qvals, is_v=False)                            # advantage A
        chosen_action_qvals = ans_chosen + ans_adv                                       # action value Q

    # compute Q_tot(target)
    if self.args.double_q:
        if self.args.mixer == "dmaq_qatten":
            target_chosen, _, _ = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                                    is_v=True)                           # state value V
            target_adv, _, _ = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                                 actions=cur_max_actions_onehot,
                                                 max_q_i=target_max_qvals, is_v=False)   # advantage A
            target_max_qvals = target_chosen + target_adv                                # action value Q
        else:
            target_chosen = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:], is_v=True)  # state value V
            target_adv = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                           actions=cur_max_actions_onehot,
                                           max_q_i=target_max_qvals, is_v=False)         # advantage A
            target_max_qvals = target_chosen + target_adv                                # action value Q
    else:
        target_max_qvals = self.target_mixer(target_max_qvals, batch["state"][:, 1:], is_v=True)      # action value Q

# Calculate 1-step Q-learning targets: r + gamma * Q_tot(target)
targets = rewards + self.args.gamma * (1 - terminated) * target_max_qvals

if show_demo:
    tot_q_data = chosen_action_qvals.detach().cpu().numpy()
    tot_target = targets.detach().cpu().numpy()
    print('action_pair_%d_%d' % (save_data[0], save_data[1]), np.squeeze(q_data[:, 0]),
          np.squeeze(q_i_data[:, 0]), np.squeeze(tot_q_data[:, 0]), np.squeeze(tot_target[:, 0]))
    self.logger.log_stat('action_pair_%d_%d' % (save_data[0], save_data[1]),
                         np.squeeze(tot_q_data[:, 0]), t_env)
    return

# TD error
td_error = (chosen_action_qvals - targets.detach())
mask = mask.expand_as(td_error)      # expand the mask to the same shape as td_error

# 0-out the targets that came from padded data
masked_td_error = td_error * mask

# Normal L2 loss, taken over actual (unpadded) data only:
# a plain mean would be wrong because padded entries carry no information,
# so we sum and divide by the number of real entries
if self.args.mixer == "dmaq_qatten":
    loss = (masked_td_error ** 2).sum() / mask.sum() + q_attend_regs
else:
    loss = (masked_td_error ** 2).sum() / mask.sum()

# Optimise (RMSprop)
optimiser.zero_grad()
loss.backward()
grad_norm = th.nn.utils.clip_grad_norm_(params, self.args.grad_norm_clip)
optimiser.step()
3 Experimental results
The figure below shows learning curves on six different StarCraft II maps with online data collection; QPLEX clearly outperforms the other algorithms.
References:
Blog: QPLEX: Duplex Dueling Multi-agent Q-learning
Code: https://github.com/oxwhirl/pymarl