I realized that the factor d^(1/2) doesn't affect the ranking of the attention weights, since it is a constant applied equally to every dot product. It can change the shape of the softmax distribution, but it never changes the order of the scores.
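A minimal numpy sketch of this point (my own illustration, not from the video or the comment): dividing the scores by any positive constant preserves their ranking, but flattens the softmax output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.5])   # raw dot-product scores for one query
d = 64                               # assumed key dimension, chosen for illustration

raw = softmax(scores)                 # no scaling
scaled = softmax(scores / np.sqrt(d)) # scaled dot-product attention

# The ordering is identical, but the scaled distribution is less peaked.
print(np.argsort(raw), raw)       # [2 1 0] [~0.82 ~0.11 ~0.07]
print(np.argsort(scaled), scaled) # [2 1 0] [~0.40 ~0.31 ~0.29]
```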
@胡浩-h8r 5 years ago
Thank you, Professor Lee, for all the effort. You can tell these slides took a great deal of work to make! This is far more comfortable than reading the original paper.
@taiwanest 5 years ago
Such clear and thorough teaching, it's astonishing!
@alexanderyau6347 4 years ago
Excellent, very well explained! Extremely clear and thorough! Thank you, Professor Lee!
@wolfmib 5 years ago
For 13:20: consider two pairs of vectors of different dimension D. Intuitively, the "same" vectors expressed in different dimensions should give comparable attention scores. For example, take A = (1,1,1) and B = (1,1,1) with D = 3, versus a = (1,1,1,1) and b = (1,1,1,1) with D = 4. The two pairs (A, B) and (a, b) should get roughly the same attention value. The raw inner products are A·B = 3 and a·b = 4, and 3 != 4. Dividing by the square root of the dimension gives A·B / sqrt(3) = 3 / 1.7320 ≈ 1.7320 and a·b / sqrt(4) = 4 / 2 = 2. With this scaling, A·B is much closer to a·b (even though they are still not exactly equal), so it is clearly better than taking the inner product without dividing by anything.
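A quick numpy reproduction of the arithmetic in this comment (my own sketch, assuming nothing beyond the example above):

```python
import numpy as np

# The commenter's example: "equivalent" all-ones vectors in different dimensions.
A = np.ones(3); B = np.ones(3)   # D = 3
a = np.ones(4); b = np.ones(4)   # D = 4

raw_3 = A @ B                    # 3.0
raw_4 = a @ b                    # 4.0

scaled_3 = raw_3 / np.sqrt(3)    # ~1.732
scaled_4 = raw_4 / np.sqrt(4)    # 2.0

print(raw_3, raw_4)              # 3.0 4.0   -> far apart
print(scaled_3, scaled_4)        # ~1.732 2.0 -> closer, though not equal
```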
@zechenliu5760 1 year ago
Following that explanation, why not just divide by D directly? Why divide by the square root of D?
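One common justification, offered here only as a hedged sketch (it is not an answer given in the video or the thread): if query and key entries are roughly zero-mean with unit variance, a dot product over d entries has variance d, so its standard deviation grows like sqrt(d). Dividing by sqrt(d) brings the scores back to roughly unit scale, while dividing by d would shrink them more and more as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (16, 64, 256):
    # Random queries/keys with zero-mean, unit-variance entries (an assumption).
    q = rng.standard_normal((10000, d))
    k = rng.standard_normal((10000, d))
    scores = (q * k).sum(axis=1)                 # raw dot products

    print(d,
          round(scores.std(), 2),                # grows like sqrt(d): ~4, ~8, ~16
          round((scores / np.sqrt(d)).std(), 2), # stays near 1
          round((scores / d).std(), 2))          # shrinks toward 0 as d grows
```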
@muhammadsaadmansoor7777 3 years ago
I not only understood transformers but I also learned Chinese from this video
One question: shouldn't the positional encoding be added directly rather than concatenated? The original paper says: "The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed."
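A minimal sketch of what the quoted sentence describes (using the sinusoidal encoding from "Attention Is All You Need"; the shapes and random embeddings are my own for illustration): the positional encoding has the same dimension as the token embedding, so the two are summed element-wise rather than concatenated.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 10, 512                         # illustrative sizes
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in embeddings
pe = sinusoidal_positional_encoding(seq_len, d_model)

x = token_embeddings + pe      # summed: shape stays (10, 512)
# Concatenating instead would change the dimension to (10, 1024),
# which is why the paper requires PE to have the same d_model as the embeddings.
print(x.shape)                 # (10, 512)
```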