ERRATA: Sometimes I say "Switch Transformer" instead of "Feedback Transformer". Forgive me :)
@hoaxuan7074 · 3 years ago
You have switching on the brain 😆 And rightly so. With ReLU nets, the switching happens at zero, so a gradual change in the input never causes a discontinuous change in the output. It seems nets where switching does cause discontinuities can still be trained, but I can't study everything.
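A minimal numpy sketch of the point above (the tiny input sweep is just for demonstration): the ReLU output varies continuously as an input crosses zero, while the on/off "switch" (the derivative) flips discontinuously.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Sweep an input smoothly through zero.
xs = np.linspace(-1e-3, 1e-3, 5)
print("outputs: ", relu(xs))                 # changes continuously through 0
print("switches:", (xs > 0).astype(float))   # the derivative jumps from 0 to 1
```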
@muhammadsaadmansoor7777 · 3 years ago
Anyone who watches these videos can, I think, understand at least this much.
@sanjaykrish8719 · 3 years ago
Such an easy explanation... you're just awesome.
@Ronschk · 3 years ago
How long do you practice pronouncing the names before you record the video? ^^
@whiletruekill · 3 years ago
The Swiss are often fluent in French.
@YannicKilcher · 3 years ago
I take it as a compliment ;)
@swayson5208 · 3 years ago
@@YannicKilcher curious to hear you pronounce African names :D
@hoaxuan7074 · 3 years ago
Off topic: That batch training even works suggests that training algorithms only ever search the space of statistical solutions, where no single neuron can be exceptional. Then neurons must work in diffuse statistical groups that are more resistant to damage when moving from batch to batch. (B) Dot products are statistical summary measures and filters; a net would have to be seriously sparsified for that not to apply. (C) There is a type of net based on neurons that have forced +c or -c weighted connections to every neuron in the prior layer. Only statistical solutions exist in that case, yet the nets work well. (They use parametric activation functions as the adjustable components.)
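A forward-pass-only numpy sketch of the kind of net described in (C): fixed +c/-c connections to every neuron in the prior layer, with per-neuron parametric activations as the only adjustable components. The two-slope activation form, the layer sizes, and the scaling of c are assumptions for illustration, not taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 64
c = 1.0 / np.sqrt(n_in)
W = c * rng.choice([-1.0, 1.0], size=(n_in, n_out))  # forced +c / -c weights, never trained

# Per-neuron two-slope activation; the slopes a and b are the trainable parameters.
a = np.ones(n_out)
b = 0.1 * np.ones(n_out)

def layer(x):
    z = x @ W                                # fixed random-sign mixing
    return np.where(z > 0, a * z, b * z)     # adjustable parametric activation

x = rng.standard_normal((5, n_in))
print(layer(x).shape)                        # (5, 64)
```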
@herp_derpingson · 3 years ago
1:44 Bold of you to assume that I have friends. *cries in a corner*
23:26 Kinda reminds me of the scratchpad idea that some Neural Turing Machine papers used to have: any node can write to its scratchpad and any node can read any scratchpad.
29:40 The whole setup reminds me of the full adder from my Electronics 101 course. If I remember it correctly, there was also a parallel adder which allowed for some parallelism before ultimately synchronizing at the end. I wonder if we can try something similar to get some speedup.
32:00 Yeah, this is very similar to an attention RNN. We are going full circle. :P
This looks quite useful for use cases where we are a bit more data-constrained.
@maltejensen7392 · 3 years ago
This is so extremely well explained, I'm so grateful for this content! Do you have a Discord or something else where one can interact with you about deep learning?
@YannicKilcher · 3 years ago
Sure, there's a discord link in the description of the video
@florianhoenicke4858 · 3 years ago
Great video - thanks!
@muhammadsaadmansoor7777 · 3 years ago
How do they fare against Performers, speed- and accuracy-wise?
@YannicKilcher · 3 years ago
I have no idea, but probably Performers are faster and this is more accurate.
@shengyaozhuang3748 · 3 years ago
Transformer: let's get rid of recurrent connections so we can train very fast! Feedback Transformer: why don't we add recurrent connections back to transformers so we can do better reasoning!
@tezlashock · 3 years ago
22:50 Why not use the stored memory data only when a prediction is uncertain? That way it can train in parallel but also check against itself whenever it doesn't know for sure.
@YannicKilcher · 3 years ago
Good suggestion!
@elinaoikonomou381 · 3 years ago
Thank you, Yannic, for the nice explanation! In the case of a language translation task with an encoder-decoder transformer, are these memory units added only in the decoder, or in the encoder as well? If we add memory in both, the keys and values we feed into the second multi-head attention layer of the decoder should be the same ones as in the encoder, since the authors propose to compute them only once. Is this correct?
@YannicKilcher · 3 years ago
In general, the decoder has full access to the encoder states, and no causal masking is applied in the encoder, so I'd say this is only a problem in the decoder.
@terryr9052 · 3 years ago
I am hoping this can help improve long-term patterns in music generation, such as from Jukebox.
@NeoKailthas · 3 years ago
Very useful explanation. Thank you
@paulkirkland9860 · 3 years ago
Transformers are just becoming more like the idea of neural population coding or reservoir computing. The next step is going to be allowing random connections, rather than dense connections to everything. If you agree, let's collab; if you disagree, let's discuss.
@SijanC147 · 3 years ago
Love your work, I cannot express in words how informative every video has been. Can you (or anyone) point me to a resource that elaborates further on your statement @11:25 "a single neural network layer can only do Linear Operations" ? Thanks and keep up the great work!
@siyn007 · 3 years ago
A single layer has the formula z = Wx, where z is the hidden layer and x is the input layer. If you don't apply a nonlinearity, z will be linearly dependent on x, which is the most you can get from one layer (stacking more purely linear layers doesn't help, since W_2 W_1 x is still linear in x). If you have two layers with a nonlinearity in between, z = Nonlinearity(W_1 x), v = W_2 z, you can perform non-linear operations such as XOR.
@YannicKilcher · 3 years ago
What siyn007 says :)
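A minimal numpy sketch of siyn007's point: the best single linear layer (here fit by least squares) cannot represent XOR, while two layers with a nonlinearity in between can. The hand-picked weights are just one illustrative solution.

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Best single linear layer (with bias) in the least-squares sense
Xb = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("one linear layer:", Xb @ w)           # all 0.5 -- it cannot fit XOR

# Two layers with a ReLU in between, weights chosen by hand
relu = lambda z: np.maximum(z, 0.0)
W1 = np.array([[1., 1.], [1., 1.]])          # hidden pre-activations: x1+x2 and x1+x2-1
b1 = np.array([0., -1.])
W2 = np.array([1., -2.])                     # output: h1 - 2*h2
print("two layers + ReLU:", relu(X @ W1 + b1) @ W2)   # exactly [0, 1, 1, 0]
```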
@bluel1ng · 3 years ago
One of the main advantages of transformers over RNNs with attention was the possibility of parallel and effective training. Feedback transformers are a nice model and they are more powerful, but I will have to read the paper to learn how effective/practical they are from a training perspective. BTW I think "unbounded reasoning depth" is a bit too bold. Also, even vanilla transformers have per-token feed-forward "MLP burger" layers (e.g. projecting into a higher 2x or 4x d_model and then reducing back to d_model) with multiple non-linearities... just to say that some form of general (in the sense of not linearly separable) computation can already be performed even by a single layer.
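For reference, a rough numpy sketch of the per-token feed-forward ("MLP burger") block bluel1ng mentions, using the common 4x expansion; the dimensions and initialization here are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                    # the usual 4x expansion
W1, b1 = 0.02 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # project up, apply the nonlinearity, project back down -- independently per token
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

tokens = rng.standard_normal((10, d_model))  # 10 token embeddings
print(ffn(tokens).shape)                     # (10, 512)
```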
@AntonisNikitakis · 3 years ago
"this is an RNN with an attention mechanism.." don't say that to Schmidhuber :P
@DamianReloaded · 3 years ago
34:00 I think, this being a transformer, it has attention at every layer as well, along with the attention to the memory. So it would be more like an RNN with attention at each layer plus memory, right? Oh, you just said that XD
@dimitriognibene8945 · 1 year ago
What is the representation power of a transformer?
@antoninhejny8156 · 3 years ago
Why does Facebook invest money into ML tasks such as visual recognition? Or do they just assume that all tasks are so interconnected in principle that by solving one problem they get closer to solving the others that are directly useful, e.g. recommender systems or text analysis?
@snippletrap · 3 years ago
Visual recognition is directly useful and Facebook uses it for its main platform as well as Portal and probably the upcoming smart glasses.
@YannicKilcher · 3 years ago
They do that, too
@MrKurumuz · 3 years ago
Facebook owns Oculus.
@trieuhobbies8649 · 3 years ago
31:38 you said "Switch Transformer" :)
@spicychiley · 3 years ago
also at 28:15
@YannicKilcher · 3 years ago
Crap 😁 thanks
@CppExpedition · 1 year ago
The problem of tracking code relationships between layers was well defined; however, the solution they propose is not satisfying.
@tamask001 · 3 years ago
I'm wondering what the definition of a "Transformer" really is. My impression was that you need both parallel processing with a position input and an attention mechanism to qualify as a transformer. But this paper really tries to extend the definition to anything with some kind of attention mechanism. It sounds to me like they are just trying to up-market some obsolete research by using trendy words.
@GeekProdigyGuy · 3 years ago
There is no definition. These aren't hundred-year-old terms; they're newly coined jargon with not even a decade of use. You may as well complain about people naming their models things like BERT and RoBERTa and YOLO9000...
@jonatan01i · 3 years ago
I removed my like so that I can like it again.
@hoaxuan7074 · 3 years ago
A large single-layer net can, however, be used as an external memory bank for a main NN, but you have to set things up to meet the mathematical requirements of the dot product. The variance equation for linear combinations of random variables tells you the dot product rather prefers non-sparse inputs when used as a memory, and used under capacity it can even provide error correction (variance reduction). A suitable scheme is then vector-to-vector random projection, bipolar (+1, -1) binarization, and then the dot product with a weight vector. To train, take the recall error and divide by the number of dimensions; then add or subtract that, as indicated by the +1/-1 binarization, to each weight term to make the error zero. By the CLT, that adds a little Gaussian noise to all the prior stored memories, but by retraining you can squeeze the noise out completely within capacity. That is the same as a matrix-inverse solution, but online. Hey, you need quite a good understanding of the dot product to see how that works, which I'm not sure so many NN researchers actually have. 😳 You can think about the case where you replace the locality-sensitive hash (RP + binarization) with a full hash, in combination with the CLT.
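A self-contained numpy sketch of the scheme described above: random projection, bipolar (+1/-1) binarization, dot-product recall against a weight vector, and online training that spreads the recall error equally over the dimensions. The dimension, the keys/values, and the number of retraining sweeps are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                       # memory dimension
P = rng.standard_normal((d, d)) / np.sqrt(d)  # fixed random projection
w = np.zeros(d)                               # the weight vector acting as the memory

def address(x):
    # vector-to-vector random projection, then bipolar binarization
    z = P @ x
    return np.where(z >= 0, 1.0, -1.0)

def recall(x):
    return w @ address(x)

def store(x, target):
    global w
    h = address(x)
    err = target - recall(x)
    # Spread the correction equally over all d weights, signed by the +1/-1 code.
    # Afterwards recall(x) equals the target exactly; older memories pick up a
    # little (roughly Gaussian) noise that repeated sweeps squeeze back out.
    w = w + (err / d) * h

keys = [rng.standard_normal(d) for _ in range(3)]
vals = [1.0, -2.0, 0.5]
for _ in range(20):                           # a few retraining sweeps
    for k, v in zip(keys, vals):
        store(k, v)
print([round(recall(k), 3) for k in keys])    # close to [1.0, -2.0, 0.5]
```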
@zandrrlife · 3 years ago
😱👍🏾
@scottmiller2591 · 3 years ago
Soon: RNN is all you need
@patf9770 · 3 years ago
Imagine using this with a BERT-like model. Why not just replace the CLS token with a memory vector?
@patf9770 · 3 years ago
I think DALL-E used multiple modalities tokenized as input for 1 transformer. Why not have a masked vector in the input that acts as memory?
@YannicKilcher · 3 years ago
Good idea!
@patf9770 · 3 years ago
@@YannicKilcher I'm a new researcher focusing on this area. Here's hoping I can publish something like this in 2021.
@mgostIH · 3 years ago
Finally, we'll be able to make a realistic GPT girlfriend; it will remember that one time years ago when we complimented some other girl and use it against us passive-aggressively 😎