ERRATA: Sometimes I say "Switch Transformer" instead of "Feedback Transformer". Forgive me :)
@hoaxuan7074 · 3 years ago
You have switching on the brain 😆 And rightly so. With ReLU nets, the switching happens at zero, so a gradual change in the input never causes a discontinuous change in the output. It seems nets where switching does cause discontinuities can still be trained, but I can't study everything.
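A minimal numpy sketch of the point above (the tiny input sweep is just for demonstration): the ReLU output varies continuously as an input crosses zero, while the on/off "switch" (the derivative) flips discontinuously.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Sweep an input smoothly through zero.
xs = np.linspace(-1e-3, 1e-3, 5)
print("outputs: ", relu(xs))                 # changes continuously through 0
print("switches:", (xs > 0).astype(float))   # the derivative jumps from 0 to 1
```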
@muhammadsaadmansoor7777 · 3 years ago
Anyone who watches these videos can, I think, understand at least this much.
@sanjaykrish8719 · 3 years ago
Such an easy explanation... you're just awesome.
@Ronschk · 3 years ago
How long do you practice pronouncing the names before you record the video? ^^
@whiletruekill · 3 years ago
The Swiss are often fluent in French.
@YannicKilcher · 3 years ago
I take it as a compliment ;)
@swayson5208 · 3 years ago
@@YannicKilcher curious to hear you pronounce African names :D
@hoaxuan7074 · 3 years ago
Off topic: That batch training even works suggests that training algorithms only ever search the space of statistical solutions, where no single neuron can be exceptional. Then neurons must work in diffuse statistical groups that are more resistant to damage when moving from batch to batch. (B) Dot products are statistical summary measures and filters; a net would have to be seriously sparsified for that not to apply. (C) There is a type of net based on neurons that have forced +c or -c weighted connections to every neuron in the prior layer. Only statistical solutions exist in that case, yet the nets work well. (They use parametric activation functions as the adjustable components.)
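A forward-pass-only numpy sketch of the kind of net described in (C): fixed +c/-c connections to every neuron in the prior layer, with per-neuron parametric activations as the only adjustable components. The two-slope activation form, the layer sizes, and the scaling of c are assumptions for illustration, not taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 64, 64
c = 1.0 / np.sqrt(n_in)
W = c * rng.choice([-1.0, 1.0], size=(n_in, n_out))  # forced +c / -c weights, never trained

# Per-neuron two-slope activation; the slopes a and b are the trainable parameters.
a = np.ones(n_out)
b = 0.1 * np.ones(n_out)

def layer(x):
    z = x @ W                                # fixed random-sign mixing
    return np.where(z > 0, a * z, b * z)     # adjustable parametric activation

x = rng.standard_normal((5, n_in))
print(layer(x).shape)                        # (5, 64)
```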
@herp_derpingson · 3 years ago
1:44 Bold of you to assume that I have friends. *cries in a corner*
23:26 Kinda reminds me of the scratchpad idea that some Neural Turing Machine papers used to have: any node can write to its scratchpad and any node can read any scratchpad.
29:40 The whole setup reminds me of the full adder from my Electronics 101 course. If I remember it correctly, there was also a parallel adder which allowed for some parallelism before ultimately synchronizing at the end. I wonder if we can try something similar to get some speedup.
32:00 Yeah, this is very similar to an attention RNN. We are going full circle. :P
This looks quite useful for use cases where we are a bit more data-constrained.
@maltejensen7392 · 3 years ago
This is so extremely well explained, I'm so grateful for this content! Do you have a Discord or something else where one can interact with you about deep learning?
@YannicKilcher · 3 years ago
Sure, there's a discord link in the description of the video
@florianhoenicke4858 · 3 years ago
Great video - thanks!
@muhammadsaadmansoor7777 · 3 years ago
How do they fare against Performers, speed- and accuracy-wise?
@YannicKilcher · 3 years ago
I have no idea, but probably Performers are faster and this is more accurate.
@shengyaozhuang3748 · 3 years ago
Transformer: let's get rid of recurrent connections so we can train very fast! Feedback Transformer: why don't we add recurrent connections back to transformers so we can do better reasoning!
@tezlashock · 3 years ago
22:50 Why not use the stored memory data only when a prediction is uncertain? That way it can train in parallel but also check against itself whenever it doesn't know for sure.
@YannicKilcher · 3 years ago
Good suggestion!
@elinaoikonomou381 · 3 years ago
Thank you, Yannic, for the nice explanation! In the case of a language translation task with an encoder-decoder transformer, are these memory units added only in the decoder, or in the encoder as well? If we add memory in both, the keys and values we feed into the second multi-head attention layer of the decoder should be the same ones as in the encoder, since the authors propose to compute them only once. Is this correct?
@YannicKilcher · 3 years ago
In general, the decoder has full access to the encoder states, and no causal masking is applied in the encoder, so I'd say this is only a problem in the decoder.
@terryr9052 · 3 years ago
I am hoping this can help improve long-term patterns in music generation, such as from Jukebox.
@NeoKailthas · 3 years ago
Very useful explanation. Thank you
@paulkirkland9860 · 3 years ago
Transformers are just becoming more like the idea of neural population coding or reservoir computing. The next step is going to be allowing random connections, rather than dense connections to everything. If you agree, let's collab; if you disagree, let's discuss.
@SijanC147 · 3 years ago
Love your work, I cannot express in words how informative every video has been. Can you (or anyone) point me to a resource that elaborates further on your statement @11:25 "a single neural network layer can only do Linear Operations" ? Thanks and keep up the great work!
@siyn007 · 3 years ago
A single layer has the formula z = Wx, where z is the hidden layer and x is the input layer. If you don't apply a nonlinearity, z will be linearly dependent on x, which is the most you can get from one layer (stacking more purely linear layers doesn't help, since W_2 W_1 x is still linear in x). If you have two layers with a nonlinearity in between, z = Nonlinearity(W_1 x), v = W_2 z, you can perform non-linear operations such as XOR.
@YannicKilcher · 3 years ago
What siyn007 says :)
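A minimal numpy sketch of siyn007's point: the best single linear layer (here fit by least squares) cannot represent XOR, while two layers with a nonlinearity in between can. The hand-picked weights are just one illustrative solution.

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Best single linear layer (with bias) in the least-squares sense
Xb = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("one linear layer:", Xb @ w)           # all 0.5 -- it cannot fit XOR

# Two layers with a ReLU in between, weights chosen by hand
relu = lambda z: np.maximum(z, 0.0)
W1 = np.array([[1., 1.], [1., 1.]])          # hidden pre-activations: x1+x2 and x1+x2-1
b1 = np.array([0., -1.])
W2 = np.array([1., -2.])                     # output: h1 - 2*h2
print("two layers + ReLU:", relu(X @ W1 + b1) @ W2)   # exactly [0, 1, 1, 0]
```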
@bluel1ng · 3 years ago
One of the main advantages of transformers over RNNs with attention was the possibility of parallel and effective training. Feedback transformers are a nice model and they are more powerful, but I will have to read the paper to learn how effective/practical they are from a training perspective. BTW I think "unbounded reasoning depth" is a bit too bold. Also, even vanilla transformers have per-token feed-forward "MLP burger" layers (e.g. projecting into a higher 2x or 4x d_model and then reducing back to d_model) with multiple non-linearities... just to say that some form of general (in the sense of not linearly separable) computation can already be performed even by a single layer.
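For reference, a rough numpy sketch of the per-token feed-forward ("MLP burger") block bluel1ng mentions, using the common 4x expansion; the dimensions and initialization here are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                    # the usual 4x expansion
W1, b1 = 0.02 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # project up, apply the nonlinearity, project back down -- independently per token
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

tokens = rng.standard_normal((10, d_model))  # 10 token embeddings
print(ffn(tokens).shape)                     # (10, 512)
```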
@AntonisNikitakis · 3 years ago
"this is an RNN with an attention mechanism.." don't say that to Schmidhuber :P
@DamianReloaded · 3 years ago
34:00 I think, this being a transformer, it has attention at every layer as well, along with the attention to the memory. So it would be more like an RNN with attention at each layer plus memory, right? Oh, you just said that XD
@dimitriognibene8945 · 1 year ago
What is the representation power of a transformer?
@antoninhejny8156 · 3 years ago
Why does Facebook invest money into ML tasks such as visual recognition? Or do they just assume that all tasks are so interconnected in principle that by solving one problem they get closer to solving the others that are directly useful, e.g. recommender systems or text analysis?
@snippletrap · 3 years ago
Visual recognition is directly useful and Facebook uses it for its main platform as well as Portal and probably the upcoming smart glasses.
@YannicKilcher · 3 years ago
They do that, too
@MrKurumuz · 3 years ago
Facebook owns Oculus.
@trieuhobbies8649 · 3 years ago
31:38 you said "Switch Transformer" :)
@spicychiley · 3 years ago
also at 28:15
@YannicKilcher · 3 years ago
Crap 😁 thanks
@CppExpedition · 1 year ago
The problem of tracking code relationships between layers was well defined; however, the solution they propose is not satisfying.
@tamask001 · 3 years ago
I'm wondering what the definition of a "Transformer" really is. My impression was that you need both parallel processing with a position input and an attention mechanism to qualify as a transformer. But this paper really tries to extend the definition to anything with some kind of attention mechanism. It sounds to me like they are just trying to up-market some obsolete research by using trendy words.
@GeekProdigyGuy · 3 years ago
There is no definition. These aren't hundred-year-old terms; they're newly coined jargon with not even a decade of use. You may as well complain about people naming their models things like BERT and RoBERTa and YOLO9000...
@jonatan01i · 3 years ago
I removed my like so that I can like it again.
@hoaxuan7074 · 3 years ago
A large single-layer net can, however, be used as an external memory bank for a main NN, but you have to set things up to meet the mathematical requirements of the dot product. The variance equation for linear combinations of random variables tells you the dot product rather prefers non-sparse inputs when used as a memory, and used under capacity it can even provide error correction (variance reduction). A suitable scheme is then vector-to-vector random projection, bipolar (+1, -1) binarization, and then the dot product with a weight vector. To train, take the recall error and divide by the number of dimensions; then add or subtract that, as indicated by the +1/-1 binarization, to each weight term to make the error zero. By the CLT, that adds a little Gaussian noise to all the prior stored memories, but by retraining you can squeeze the noise out completely within capacity. That is the same as a matrix-inverse solution, but online. Hey, you need quite a good understanding of the dot product to see how that works, which I'm not sure so many NN researchers actually have. 😳 You can think about the case where you replace the locality-sensitive hash (RP + binarization) with a full hash, in combination with the CLT.
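A self-contained numpy sketch of the scheme described above: random projection, bipolar (+1/-1) binarization, dot-product recall against a weight vector, and online training that spreads the recall error equally over the dimensions. The dimension, the keys/values, and the number of retraining sweeps are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                       # memory dimension
P = rng.standard_normal((d, d)) / np.sqrt(d)  # fixed random projection
w = np.zeros(d)                               # the weight vector acting as the memory

def address(x):
    # vector-to-vector random projection, then bipolar binarization
    z = P @ x
    return np.where(z >= 0, 1.0, -1.0)

def recall(x):
    return w @ address(x)

def store(x, target):
    global w
    h = address(x)
    err = target - recall(x)
    # Spread the correction equally over all d weights, signed by the +1/-1 code.
    # Afterwards recall(x) equals the target exactly; older memories pick up a
    # little (roughly Gaussian) noise that repeated sweeps squeeze back out.
    w = w + (err / d) * h

keys = [rng.standard_normal(d) for _ in range(3)]
vals = [1.0, -2.0, 0.5]
for _ in range(20):                           # a few retraining sweeps
    for k, v in zip(keys, vals):
        store(k, v)
print([round(recall(k), 3) for k in keys])    # close to [1.0, -2.0, 0.5]
```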
@zandrrlife · 3 years ago
😱👍🏾
@scottmiller2591 · 3 years ago
Soon: RNN is all you need
@patf9770 · 3 years ago
Imagine using this with a BERT-like model. Why not just replace the CLS token with a memory vector?
@patf9770 · 3 years ago
I think DALL-E used multiple modalities tokenized as input for 1 transformer. Why not have a masked vector in the input that acts as memory?
@YannicKilcher · 3 years ago
Good idea!
@patf9770 · 3 years ago
@@YannicKilcher I'm a new researcher focusing on this area. Here's hoping I can publish something like this in 2021.
@mgostIH · 3 years ago
Finally, we'll be able to make a realistic GPT girlfriend; it will remember that one time years ago when we complimented some other girl and use it against us passive-aggressively 😎