I realized that the factor d^(1/2) doesn't change the final ranking of the weights, since it's a constant (or, put differently, it treats every dot product the same). It may change the distribution (how peaked the softmax is), but it preserves the order.
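A minimal numpy sketch of this point (the score values here are made up for illustration): dividing every logit by the same constant sqrt(d) leaves the ordering of the softmax outputs unchanged, but makes the distribution much less peaked.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 64
scores = np.random.randn(5) * np.sqrt(d)   # hypothetical unscaled dot products

unscaled = softmax(scores)
scaled   = softmax(scores / np.sqrt(d))

# Same ranking either way, because scaling by a positive constant is monotone:
print(np.argsort(unscaled), np.argsort(scaled))
# ...but the unscaled version is far more peaked (closer to one-hot),
# which is what hurts the softmax gradients.
print(unscaled.round(3), scaled.round(3))
```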
For 13:20: consider two vectors of dimension D. When taking the inner product of a pair of vectors, the intuition is that the "same" vector at different dimensions should produce the same attention value. For example, A = (1,1,1) with B = (1,1,1) at D = 3, and a = (1,1,1,1) with b = (1,1,1,1) at D = 4: these two pairs (A, B) and (a, b) should get the same attention value. But the raw inner products give A·B = 3 and a·b = 4, and 3 != 4. So we divide by the square root of the dimension: A·B / sqrt(3) = 3 / 1.7320 ≈ 1.7320, and a·b / sqrt(4) = 4 / 2 = 2. With this normalization A·B is indeed closer to a·b (even though they are still not exactly equal), which is definitely better than taking the inner product without dividing by anything.
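The same arithmetic in a few lines of numpy (simply restating the example above):

```python
import numpy as np

A = np.ones(3); B = np.ones(3)      # D = 3
a = np.ones(4); b = np.ones(4)      # D = 4

print(A @ B, a @ b)                               # 3.0 vs 4.0 -- grows with D
print(A @ B / np.sqrt(3), a @ b / np.sqrt(4))     # ~1.732 vs 2.0 -- much closer
```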
One question: the positional encoding should be added directly rather than concatenated, right? The original text says: "The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed."
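Yes, they are summed, not concatenated. A minimal sketch of what the paper describes (the token embeddings here are random placeholders):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the paper:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

seq_len, d_model = 10, 512
token_embeddings = np.random.randn(seq_len, d_model)   # placeholder embeddings

# Same dimension d_model, so the two are summed element-wise, not concatenated:
x = token_embeddings + positional_encoding(seq_len, d_model)
print(x.shape)   # (10, 512) -- dimension unchanged
```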