I honestly do not understand how you get through papers so quickly. I would love for you to make a video on how you read papers!
@tomwright9904 • 4 years ago
Having read a bunch of other papers might help :)
@florianhonicke5448 • 4 years ago
The cool thing is that one can see that you have fun explaining everything
@tqri9795 • 4 years ago
I think one of the reasons that attention or self-attention works so surprisingly well is that it reduces the difficulty of optimization. It's like just connecting everything, so that training signals and gradients can find a way back to the leaves in the graph. This is an approach to dealing with vanishing gradients in complex networks.
@zeouhu7345 • 2 years ago
Maybe I misunderstand your comment, but a feedforward NN also connects everything
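For reference, a minimal PyTorch sketch of the "direct gradient path" point discussed in this thread: with a standard single-head self-attention layer, the output at one position receives gradient from every input position in a single step. The toy sizes and plain random projections are illustrative only, not taken from the paper.

```python
import torch

l, d = 5, 8                                      # toy sequence length and model dim
x = torch.randn(l, d, requires_grad=True)        # token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

scores = (x @ Wq) @ (x @ Wk).T / d ** 0.5        # (l, l) pairwise scores
out = torch.softmax(scores, dim=-1) @ (x @ Wv)   # (l, d)

out[0].sum().backward()                          # backprop from a single output position
print(x.grad.abs().sum(dim=-1) > 0)              # True at all l positions: a direct path to every input
```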
@kentsui567 • 4 years ago
Thanks Yannic for your great analysis. It looks like the conclusion of the study is that dot-product attention is required, and adding a bias will help (and of course because of more parameters). I guess it's very task-specific: if I am doing semantic role labeling, where interactions between tokens are required, losing the dependence on the inputs will definitely hurt a lot. And philosophically, having all the pair-wise attention is a "brute force" approach where we consider the relationship of every pair of tokens - in terms of completeness of information it is difficult to surpass, but of course there should be a better way of routing information that is yet to be discovered.
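As a companion to the comment above, here is a minimal NumPy sketch of the "brute force" pairwise scoring in standard scaled dot-product self-attention; the shapes and random weights are illustrative only.

```python
import numpy as np

def dot_product_attention(X, Wq, Wk, Wv):
    """X: (l, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[-1])         # (l, l): one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (l, d): content-dependent mixing

l, d = 6, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(l, d))
out = dot_product_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (6, 16)
```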
@arkasaha4412 • 4 years ago
Please do a video on "How Does Batch Normalization Help Optimization?", it's quite fundamental as BN is everywhere nowadays.
@herp_derpingson • 4 years ago
DNN in disguise! Honestly, I didn't see that coming. I guess the reviewers missed it too. Great video, keep it coming. You don't seem to take breaks on weekends. Not that there is much to do during quarantine. Maybe play the game called "while True: learn()". It is a machine-learning-themed puzzle game.
@RohitKumarSingh25 • 4 years ago
This was a really good analysis. Very helpful 😁
@kicckicc • 4 years ago
With multi-heads and the hidden embeddings sharing the weights, the random synthesizer is much weaker than an MLP. That may make it work better on NLP tasks. ResNets and CNNs are special cases of MLPs, but they are stronger in their use cases.
@ЗакировМарат-в5щ • 4 years ago
10:00 Already a weakness: working just with positions is probably not a good idea.
@seankernitsman6055 • 4 years ago
I agree that the paper's proposed model doesn't have the clearest methodology. Possible correction at 12:40: shouldn't the matrix be of shape l x d?
@YannicKilcher • 4 years ago
correct
@usmanfarooq2185 • 4 years ago
Hi, great work there sir, you really motivate us. I'm just pinging to get a suggestion: currently I am doing a master's in Data Science and plan to apply for a Ph.D. in the near future. I will enroll for my thesis next semester, so could you please suggest major topic fields that might have a high impact on acceptance for a Ph.D.? Much appreciated. Thanks.
@YannicKilcher • 4 years ago
Really depends on the advisor you're targeting. Make sure you pick things in their field. If you don't know yet, just pick whatever you like best.
@freemind.d2714 • 4 years ago
This paper should be called Self-Attention without Self-Attention... The reason self-attention was designed is that it can dynamically adjust its attention (i.e. its weights) based on context; I don't know what the point of this paper is.
@cameron4814 • 4 years ago
I love what you're doing.
@rin-or3no • 4 years ago
Hi Yannic, if you find this paper interesting can you please read it for us? (Certified Adversarial Robustness via Randomized Smoothing). Or if you have read it before please share the link.
@xzheng5081 • 4 years ago
The comparison of parameter sizes in Table 1 seems not quite right... for the proposed synthesizer, the parameter count should be multiplied by the number of heads of the original multi-head attention.
@mattiascross1417 • 2 years ago
So if it just uses a random weight matrix followed by a trained FFN, is it essentially a feed-forward echo state network?
@anantakusumap2459 • 4 years ago
Thank you Yannic
@vexilligerave9356 • 4 years ago
It sometimes works a bit. So true
@bluel1ng • 4 years ago
If the softmax(R) result of the random synthesizer is constant, isn't it just another linear projection of the values, which themselves are normally linearly projected, so it could be multiplied into one value-projection weight matrix - at least after training? I wonder how exactly the setup was for vanilla + random. Right now I do not see where the improvement comes from; just adding matrix multiplications with weight matrices without non-linearities in between should not be beneficial. Maybe I am misunderstanding something here...? Maybe somebody can clarify what is going on in the random synthesizer case.
@bluel1ng • 4 years ago
Ok, I think a full linear layer would have a (d_model*l x d_model*l) weight matrix, while in the random synthesizer case a (d_model x d_model) multiplication is done (with W_value) and afterwards a multiplication with the 'fake attention self-scores', which is (l x l)... so it is some kind of factorized linear layer?
@YannicKilcher • 4 years ago
Yes what you're saying makes sense. Maybe it's sort-of like a skip-connection in residual networks. That helps as well without introducing more weights. Who knows...
@bluel1ng • 4 years ago
@YannicKilcher Probably in the single-head version it is:

SynRnd(X) = softmax(R) X W_v
X: (l x d)  # input
W_v: (d x d)  # value-projection weights
R: (l x l)  # global fixed attention

Every input embedding is replaced by a new random linear combination of projected input embeddings. It reminds me of some form of hashing. When applied with a lot of heads this will probably 'select'/'focus' on different symbol groups/sets of the input (simply by position). Interesting that this global (non-input-dependent) mish-mash helps.
@YannicKilcher • 4 years ago
@bluel1ng True, that makes sense.
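A minimal NumPy sketch of the single-head random synthesizer as written out above, SynRnd(X) = softmax(R) X W_v; the sizes and random initialization are illustrative only, not the paper's training setup.

```python
import numpy as np

l, d = 6, 16                          # toy sequence length and model dim
rng = np.random.default_rng(0)

R = rng.normal(size=(l, l))           # global "attention" scores, independent of the input
Wv = rng.normal(size=(d, d))          # value projection

def softmax(A):
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    return A / A.sum(axis=-1, keepdims=True)

def syn_rand(X):
    # softmax(R) X Wv: every output embedding is the same fixed linear
    # combination of projected input embeddings, regardless of content.
    return softmax(R) @ (X @ Wv)

X = rng.normal(size=(l, d))
print(syn_rand(X).shape)              # (6, 16)
```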
@yorailevi6747 • 1 year ago
@bluel1ng Has there been any development on this kind of mish-mash? If we were to "mirror" the input a few times and randomly softmax every few forward layers, we could do the same thing as "transformers" without calling them transformers
@aBigBadWolf • 4 years ago
Why did you pick this paper? Because it is Google? The paper is not good imo. It is also not peer-reviewed. By making a video about it you give even more attention to the 1% of papers that already get "free attention" due to their institution or supervisor.
@bluel1ng • 4 years ago
Good point about quantity vs. quality! I personally enjoy "talk-throughs" of random ML papers - even if they are not the ultimate gold-nugget. But even for myself I see a) that during busy periods I will not have enough time to view all the stuff Yannic releases and b) somehow I worry a bit that the current *extreme* output-rate is hardly sustainable over a longer period of time. :-)
@YannicKilcher • 4 years ago
1. Peer review is a joke.
2. I pick papers because they interest me, nothing else.