Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)

16,725 views

Yannic Kilcher

A day ago

Comments: 33
@tqri9795
@tqri9795 4 years ago
I think one of the reasons that attention or self-attention works so surprisingly well is that it reduces the difficulty of optimization. It's like just connecting everything, so that training signals and gradients can find a way back to the leaves in the graph. This is an approach to dealing with vanishing gradients in complex networks.
@zeouhu7345
@zeouhu7345 2 years ago
Maybe I misunderstand your comment, but a feedforward NN also connects everything.
@florianhonicke5448
@florianhonicke5448 4 years ago
The cool thing is that one can see that you have fun explaining everything
@pablovela2053
@pablovela2053 4 years ago
I honestly do not understand how you get through papers so quickly. I would love for you to make a video on how you read papers!
@tomwright9904
@tomwright9904 4 years ago
Having read a bunch of other papers might help :)
@herp_derpingson
@herp_derpingson 4 years ago
DNN in disguise! Honestly, I didn't see that coming. I guess the reviewers missed it too. Great video. Keep it coming. You don't seem to take breaks on weekends. Not that there is much to do during quarantine. Maybe play the game called "while True: learn()". It is a machine-learning-themed puzzle game.
@arkasaha4412
@arkasaha4412 4 years ago
Please do a video on "How Does Batch Normalization Help Optimization?", it's quite fundamental as BN is everywhere nowadays.
@RohitKumarSingh25
@RohitKumarSingh25 4 years ago
This was a really good analysis. Very helpful 😁
@ЗакировМарат-в5щ
@ЗакировМарат-в5щ 4 years ago
10:00 Already a weakness: working just with positions is probably not a good idea.
@kicckicc
@kicckicc 4 years ago
With multiple heads and the hidden embeddings sharing weights, the random synthesizer is much weaker than an MLP. That may make it work better in NLP tasks. ResNets and CNNs are special cases of MLPs, but they are stronger in their use cases.
@cameron4814
@cameron4814 4 years ago
i love what you're doing.
@anantakusumap2459
@anantakusumap2459 4 years ago
Thank you, Yannic
@vexilligerave9356
@vexilligerave9356 4 years ago
"It sometimes works a bit." So true
@freemind.d2714
@freemind.d2714 3 years ago
This paper should be called Self-Attention without Self-Attention... The reason self-attention was designed the way it is, is that it can dynamically adjust its attention (i.e. weights) based on context; I don't know what the point of this paper is.
@xzheng5081
@xzheng5081 4 years ago
The comparison of parameter sizes in Table 1 doesn't seem right... for the proposed synthesizer, the parameter count should be multiplied by the number of heads of the original multi-head attention.
@usmanfarooq2185
@usmanfarooq2185 4 years ago
Hi, great work there sir, you really motivate us. I'm just pinging to get a suggestion: I am currently doing a master's in Data Science and plan to apply for a Ph.D. in the near future. I will enroll for my thesis next semester, so could you please suggest major topic fields that might have a high impact on acceptance for a Ph.D.? Much appreciated. Thanks.
@YannicKilcher
@YannicKilcher 4 years ago
Really depends on the advisor you're targeting. Make sure you pick things in their field. If you don't know yet, just pick whatever you like best.
@rin-or3no
@rin-or3no 4 years ago
Hi Yannic, if you find this paper interesting can you please read it for us? (Certified Adversarial Robustness via Randomized Smoothing). Or if you have read it before please share the link.
@bluel1ng
@bluel1ng 4 years ago
If the softmax(R) result of the random synthesizer is constant, isn't it just another linear projection of the values, which themselves are normally linearly projected? So it could be multiplied into one value-projection weight matrix, at least after training. I wonder how exactly the setup was for vanilla + random. Right now I do not see where the improvement comes from; just adding matrix multiplications with weight matrices, without non-linearities in between, should not be beneficial. Maybe I am misunderstanding something here...? Maybe somebody can clarify what is going on in the random synthesizer case.
@bluel1ng
@bluel1ng 4 years ago
OK, I think a full linear layer would have a (d_model*l x d_model*l) weight matrix, while in the random synthesizer case a (d_model x d_model) multiplication is done (with W_value) and afterwards a multiplication with the 'fake attention self-scores', which is l x l... so it is some kind of factorized linear layer?
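For a sense of scale, here is a minimal back-of-the-envelope comparison of those two parameter counts (the sizes l = 512 and d_model = 768 are assumed for illustration, not taken from the paper):

    # Rough parameter-count comparison for the "factorized linear layer" view above.
    # l and d_model are assumed example sizes, not values from the paper.
    l, d_model = 512, 768

    full_linear = (l * d_model) ** 2           # one dense layer over the flattened (l * d_model) sequence
    random_synth = l * l + d_model * d_model   # fixed score matrix R plus value projection W_value

    print(f"full linear layer : {full_linear:,} weights")   # 154,618,822,656
    print(f"random synthesizer: {random_synth:,} weights")  # 851,968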
@YannicKilcher
@YannicKilcher 4 years ago
Yes what you're saying makes sense. Maybe it's sort-of like a skip-connection in residual networks. That helps as well without introducing more weights. Who knows...
@bluel1ng
@bluel1ng 4 years ago
@@YannicKilcher Probably in the single-head version it is:
SynRnd(X) = softmax(R) X W_v
X: (l x d) # input
W_v: (d x d) # value-projection weights
R: (l x l) # global fixed attention
Every input embedding is replaced by a new random linear combination of projected input embeddings. It reminds me of some form of hashing. When applied with a lot of heads this will probably 'select' / 'focus' on different symbol groups/sets (simply by position) of the input. Interesting that this global (non-input-dependent) mish-mash helps.
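A minimal runnable sketch of that single-head random synthesizer, as written out in the comment above (PyTorch is assumed, and all names and sizes are illustrative rather than from the paper's code):

    import torch
    import torch.nn.functional as F

    # Single-head random synthesizer sketch: SynRnd(X) = softmax(R) X W_v
    l, d = 8, 16                        # sequence length and model dimension (assumed)
    X = torch.randn(l, d)               # input embeddings
    W_v = torch.randn(d, d) / d ** 0.5  # value-projection weights
    R = torch.randn(l, l)               # global, input-independent attention logits

    A = F.softmax(R, dim=-1)            # fixed mixing weights over positions
    out = A @ (X @ W_v)                 # each row is a fixed linear combination of projected inputs

    print(out.shape)                    # torch.Size([8, 16])

Since A does not depend on X, the mixing over positions is identical for every input, which is exactly the global "mish-mash" being discussed.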
@YannicKilcher
@YannicKilcher 4 years ago
@@bluel1ng True, that makes sense.
@yorailevi6747
@yorailevi6747 A year ago
@@bluel1ng Has there been any development on this kind of mish-mash? If we were to "mirror" the input a few times and randomly softmax every few forward layers, we could do the same thing as "transformers" without calling them transformers.
@aBigBadWolf
@aBigBadWolf 4 years ago
Why did you pick this paper? Because it is Google? The paper is not good imo. It is also not peer reviewed. By making a video about it you give further attention to the 1% of papers that already get "free attention" due to their institution or supervisor.
@bluel1ng
@bluel1ng 4 years ago
Good point about quantity vs. quality! I personally enjoy "talk-throughs" of random ML papers - even if they are not the ultimate gold-nugget. But even for myself I see a) that during busy periods I will not have enough time to view all the stuff Yannic releases and b) somehow I worry a bit that the current *extreme* output-rate is hardly sustainable over a longer period of time. :-)
@YannicKilcher
@YannicKilcher 4 years ago
1. Peer review is a joke 2. I pick papers because they interest me, nothing else
@kentsui567
@kentsui567 4 years ago
Thanks Yannic for your great analysis. It looks like the conclusion of the study is that dot-product attention is required, and adding a bias will help (and of course because of more parameters). I guess it's very task-specific: if I am doing semantic role labeling, where interactions between tokens are required, losing the dependence on the inputs will definitely hurt a lot. And philosophically, having all the pair-wise attention is a "brute force" approach in which we consider the relationship of every pair of tokens - in terms of completeness of information it is difficult to surpass, but of course there may be a better way of routing information that is yet to be discovered.
@seankernitsman6055
@seankernitsman6055 4 years ago
I agree that the paper's proposed model doesn't have the clearest methodology. Possible correction at 12:40: shouldn't the matrix be of shape l x d?
@YannicKilcher
@YannicKilcher 4 years ago
correct
@mattiascross1417
@mattiascross1417 2 years ago
So if it just uses a random weight matrix followed by a trained FFN, is it essentially a feed-forward echo state network?