LLMLearn
  • Transformer Architecture

    • Attention is All You Need


$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \tag{1}
$$
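As a rough illustration of Equation (1), the sketch below computes scaled dot-product attention on small matrices stored as plain Python lists. The helper names `matmul`, `softmax`, and `attention`, and the toy inputs, are assumptions made for this example; a practical implementation would use an array library and batched tensors.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, with Q: n x d_k, K: m x d_k, V: m x d_v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]                 # K^T, shape d_k x m
    scores = matmul(q, k_t)                              # n x m dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]           # each row sums to 1
    return matmul(weights, v), weights

# Toy inputs (hypothetical): 2 queries, 3 key/value pairs, d_k = d_v = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(q, k, v)
print(out)
```

Each output row is a weighted average of the rows of `v`. A query that scores every key equally therefore returns the plain mean of the value rows.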


$$
\begin{aligned}
\mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{O} \\
\text{where}~\mathrm{head}_{i} &= \mathrm{Attention}(QW^{Q}_{i},KW^{K}_{i},VW^{V}_{i})
\end{aligned}
$$
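The multi-head formula can be sketched the same way: project the inputs once per head, run attention on each projection, concatenate the head outputs, and apply the output projection. Everything below is illustrative. The function names, the random weights, and the choice d_k = d_v = d_model / h are assumptions for this example, and plain lists stand in for tensors.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^O, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the head outputs along the feature axis: n x (h * d_v).
    concat = [[x for head in heads for x in head[i]] for i in range(len(q))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, h = 4, 2
d_k = d_v = d_model // h          # assumed split of d_model across heads
w_q = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_k = [random_matrix(d_model, d_k, rng) for _ in range(h)]
w_v = [random_matrix(d_model, d_v, rng) for _ in range(h)]
w_o = random_matrix(h * d_v, d_model, rng)
x = random_matrix(3, d_model, rng)            # 3 positions, self-attention input
y = multi_head(x, x, x, w_q, w_k, w_v, w_o)   # 3 x d_model output
```

With a single head and identity projections, `multi_head` reduces to plain `attention`, which is a convenient sanity check.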

$$
\mathrm{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} \tag{2}
$$
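A minimal sketch of Equation (2) for a single position vector, assuming `w1` and `w2` are stored as lists of rows with shapes d_model x d_ff and d_ff x d_model. The function name `ffn` and the toy weights are made up for illustration.

```python
def ffn(x, w1, b1, w2, b2):
    """max(0, x W1 + b1) W2 + b2 for one position vector x."""
    # First linear layer followed by ReLU; zip(*w1) iterates over columns of W1.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Second linear layer, no activation.
    return [sum(hj * w for hj, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]

# Toy example (hypothetical values): d_model = 2, d_ff = 3.
x = [1.0, -2.0]
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.5]
w2 = [[2.0, 0.0],
      [0.0, 3.0],
      [1.0, 1.0]]
b2 = [0.5, -0.5]
result = ffn(x, w1, b1, w2, b2)
print(result)
```

The same weights are applied independently at every position, so running `ffn` over a sequence is just a loop over its rows.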

$$
\begin{aligned}
PE_{(pos,2i)} &= \sin\left(pos/10000^{2i/d_{\text{model}}}\right) \\
PE_{(pos,2i+1)} &= \cos\left(pos/10000^{2i/d_{\text{model}}}\right)
\end{aligned}
$$
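The two positional-encoding formulas can be tabulated directly. The sketch below fills a `max_len` x `d_model` table; the function name and the small sizes are assumptions for illustration. The loop variable `i` steps over even dimensions, so it plays the role of 2i in the formulas.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):            # i is the even index "2i"
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimension: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimension: cosine, same angle
    return pe

pe = positional_encoding(max_len=50, d_model=4)
print(pe[1])
```

Position 0 encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Higher dimensions use longer wavelengths, so they vary more slowly with position.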

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^{2}\cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n\cdot d^{2})$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k\cdot n\cdot d^{2})$ | $O(1)$ | $O(\log_{k}(n))$ |
| Self-Attention (restricted) | $O(r\cdot n\cdot d)$ | $O(1)$ | $O(n/r)$ |
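To get a feel for the four per-layer complexity expressions above, the sketch below plugs in concrete sizes. It reads the rows as self-attention, recurrent, convolutional, and restricted self-attention layers, following the Transformer paper's complexity table. The function name and the default kernel width `k` and neighborhood size `r` are arbitrary choices for illustration, and constant factors are ignored, so only the relative growth is meaningful.

```python
def per_layer_cost(n, d, k=3, r=64):
    """Evaluate the asymptotic per-layer cost terms for sequence length n and width d."""
    return {
        "self_attention": n * n * d,              # O(n^2 * d)
        "recurrent": n * d * d,                   # O(n * d^2)
        "convolutional": k * n * d * d,           # O(k * n * d^2)
        "restricted_self_attention": r * n * d,   # O(r * n * d)
    }

short = per_layer_cost(n=100, d=512)    # sequence shorter than the model width
long = per_layer_cost(n=1000, d=512)    # sequence longer than the model width
for name in short:
    print(name, short[name], long[name])
```

Self-attention is the cheaper of the first two terms whenever n < d, and the comparison flips once the sequence length exceeds the representation width.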

$$
\mathit{lrate}=d_{\text{model}}^{-0.5}\cdot\min\left(\mathit{step\_num}^{-0.5},\ \mathit{step\_num}\cdot\mathit{warmup\_steps}^{-1.5}\right) \tag{3}
$$
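Equation (3) translates almost directly into code. In the sketch below, the default `d_model=512` matches the base configuration listed in the hyperparameter table, while `warmup_steps=4000` is an assumed value that does not appear in this section. `step_num` is expected to start at 1.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate: linear warmup, then decay proportional to 1/sqrt(step_num)."""
    if step_num < 1:
        raise ValueError("step_num must be >= 1")
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

# Print the rate at a few steps around the end of warmup.
for step in (1, 2000, 4000, 8000, 100000):
    print(step, lrate(step))
```

The two arguments of `min` are equal at `step_num == warmup_steps`, so the rate peaks there: it grows linearly before that step and decays as the inverse square root afterwards.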

| Model | BLEU EN-DE | BLEU EN-FR | Training Cost (FLOPs) EN-DE | Training Cost (FLOPs) EN-FR |
|---|---|---|---|---|
| ByteNet [18] | 23.75 | | | |
| Deep-Att + PosUnk [39] | | 39.2 | | $1.0\cdot 10^{20}$ |
| GNMT + RL [38] | 24.6 | 39.92 | $2.3\cdot 10^{19}$ | $1.4\cdot 10^{20}$ |
| ConvS2S [9] | 25.16 | 40.46 | $9.6\cdot 10^{18}$ | $1.5\cdot 10^{20}$ |
| MoE [32] | 26.03 | 40.56 | $2.0\cdot 10^{19}$ | $1.2\cdot 10^{20}$ |
| Deep-Att + PosUnk Ensemble [39] | | 40.4 | | $8.0\cdot 10^{20}$ |
| GNMT + RL Ensemble [38] | 26.30 | 41.16 | $1.8\cdot 10^{20}$ | $1.1\cdot 10^{21}$ |
| ConvS2S Ensemble [9] | 26.36 | 41.29 | $7.7\cdot 10^{19}$ | $1.2\cdot 10^{21}$ |
| Transformer (base model) | 27.3 | 38.1 | $\mathbf{3.3\cdot 10^{18}}$ | |
| Transformer (big) | 28.4 | 41.8 | $2.3\cdot 10^{19}$ | |

For the two Transformer rows, a single training cost is reported rather than one per language pair.


Empty cells take the base model's value.

| | $N$ | $d_{\text{model}}$ | $d_{\text{ff}}$ | $h$ | $d_{k}$ | $d_{v}$ | $P_{drop}$ | $\epsilon_{ls}$ | train steps | PPL (dev) | BLEU (dev) | params $\times 10^{6}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
| (A) | | | | 1 | 512 | 512 | | | | 5.29 | 24.9 | |
| | | | | 4 | 128 | 128 | | | | 5.00 | 25.5 | |
| | | | | 16 | 32 | 32 | | | | 4.91 | 25.8 | |
| | | | | 32 | 16 | 16 | | | | 5.01 | 25.4 | |
| (B) | | | | | 16 | | | | | 5.16 | 25.1 | 58 |
| | | | | | 32 | | | | | 5.01 | 25.4 | 60 |
| (C) | 2 | | | | | | | | | 6.11 | 23.7 | 36 |
| | 4 | | | | | | | | | 5.19 | 25.3 | 50 |
| | 8 | | | | | | | | | 4.88 | 25.5 | 80 |
| | | 256 | | | 32 | 32 | | | | 5.75 | 24.5 | 28 |
| | | 1024 | | | 128 | 128 | | | | 4.66 | 26.0 | 168 |
| | | | 1024 | | | | | | | 5.12 | 25.4 | 53 |
| | | | 4096 | | | | | | | 4.75 | 26.2 | 90 |
| (D) | | | | | | | 0.0 | | | 5.77 | 24.6 | |
| | | | | | | | 0.2 | | | 4.95 | 25.5 | |
| | | | | | | | | 0.0 | | 4.67 | 25.3 | |
| | | | | | | | | 0.2 | | 5.47 | 25.7 | |
| (E) positional embedding instead of sinusoids | | | | | | | | | | 4.92 | 25.7 | |
| big | 6 | 1024 | 4096 | 16 | | | 0.3 | | 300K | 4.33 | 26.4 | 213 |

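| Parser | Training | WSJ 23 F1 |
|---|---|---|
| Vinyals & Kaiser et al. (2014) | WSJ only, discriminative | 88.3 |
| Petrov et al. (2006) | WSJ only, discriminative | 90.4 |
| Zhu et al. (2013) | WSJ only, discriminative | 90.4 |
| Dyer et al. (2016) | WSJ only, discriminative | 91.7 |
| Transformer (4 layers) | WSJ only, discriminative | 91.3 |
| Zhu et al. (2013) | semi-supervised | 91.3 |
| Huang & Harper (2009) | semi-supervised | 91.3 |
| McClosky et al. (2006) | semi-supervised | 92.1 |
| Vinyals & Kaiser et al. (2014) | semi-supervised | 92.1 |
| Transformer (4 layers) | semi-supervised | 92.7 |
| Luong et al. (2015) | multi-task | 93.0 |
| Dyer et al. (2016) | generative | 93.3 |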


Last updated: 2025/5/20 09:07
Contributors: e0164034