LSTM is dead. Long Live Transformers!

525,022 views

Seattle Applied Deep Learning

Leo Dirac (@leopd) talks about how LSTM models for Natural Language Processing (NLP) have been practically replaced by transformer-based models. He gives basic background on NLP and a brief history of supervised learning techniques on documents, from bag-of-words through vanilla RNNs and LSTMs. Then there's a technical deep dive into how Transformers work, with multi-headed self-attention and positional encoding. Includes sample code for applying these ideas to real-world projects.

Comments: 297
@FernandoWittmann
@FernandoWittmann 4 жыл бұрын
That's one of the best deep-learning-related presentations I've seen in a while! It not only introduces transformers but also gives an overview of other NLP strategies, activation functions, and best practices when using optimizers. Thank you!!
@ahmadmoussa3771
@ahmadmoussa3771 3 жыл бұрын
I second this! The talk was such a joy to listen to
@aashnavaid6918
@aashnavaid6918 3 жыл бұрын
in about 30 minutes!!!!
@jbnunn
@jbnunn Жыл бұрын
Agree -- I've watched half a dozen videos on transformers in the past 2 days, I wish I'd started with Leo's.
@richardosuala9739
@richardosuala9739 3 жыл бұрын
Thank you for this concise and well-rounded talk! The pseudocode example was awesome!
@vamseesriharsha2312
@vamseesriharsha2312 3 жыл бұрын
Good to see Adam Driver working on transformers 😁
@_RMSG_
@_RMSG_ Жыл бұрын
I love this presentation. It doesn't assume that the audience knows far more than is necessary, goes through explanations of the relevant parts of Transformers, notes shortcomings, etc. Best slideshow I've seen this year, and it's from over 3 years ago.
@8chronos
@8chronos 3 жыл бұрын
The best presentation/explanation of the topic I have seen so far. Thanks a lot :)
@briancase6180
@briancase6180 2 жыл бұрын
Thanks for this! It gets to the heart of the matter quickly and in an easy to grasp way. Excellent.
@BartoszBielecki
@BartoszBielecki Жыл бұрын
The world deserves more lectures like this one. I don't need examples of how to tune a U-Net, but rather an overview of this huge research space and the ideas underneath each group.
@lmao4982
@lmao4982 Жыл бұрын
This is like 90% of what I remember from my NLP course with all the uncertainty cleared up, thanks!
@monikathornton8790
@monikathornton8790 3 жыл бұрын
Great talk. It's always thrilling to see someone who actually knows what they're presenting.
@evennot
@evennot 4 жыл бұрын
I was trying to use a similar super-low-frequency sine trick for audio sample classification (to give the network more clues about attack/sustain/release positioning). I never knew that one could use several of those at different phases. Such a simple and beautiful trick. The presentation is awesome.
@JagdeepSandhuSJC
@JagdeepSandhuSJC 2 жыл бұрын
Leo is an excellent professor. He explains difficult concepts in an easy-to-understand way.
@ismaila3347
@ismaila3347 4 жыл бұрын
This finally made it clear to me why RNNs were introduced! Thanks for sharing.
@Johnathanaa7
@Johnathanaa7 4 жыл бұрын
Best transformer presentation I’ve seen hands down. Nice job!
@cliffrosen5180
@cliffrosen5180 Жыл бұрын
Wonderfully clear and precise presentation. One thing that tripped me up, though, is this formula at 4 minutes in: H_{i+1} = A(H_i, x_i). It seems this should rather be H_{i+1} = A(H_i, x_{i+1}), which might be more intuitively written as H_i = A(H_{i-1}, x_i).
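For readers who want the recurrence concretely: below is a minimal NumPy sketch of a vanilla RNN step in the H_i = A(H_{i-1}, x_i) form suggested above (my own illustration, not code from the talk).

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla RNN step: h_i = A(h_{i-1}, x_i) = tanh(W_h @ h_{i-1} + W_x @ x_i + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def run_rnn(xs, W_h, W_x, b, h0):
    """Consume a whole sequence; each input x_i updates the hidden state to h_i."""
    h = h0
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
    return h  # final hidden state summarizes the sequence
```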
@ajitkirpekar4251
@ajitkirpekar4251 3 жыл бұрын
It's hard to overstate just how much this topic has transformed (and is transforming) the industry. As others have said, understanding it is not easy, because there are a bunch of components that don't seem to align with one another, and overall the architecture is such a departure from the more traditional things you are taught. I myself have wrangled with it for a while and it's still difficult to fully grasp. Like any hard problem, you have to bang your head against it for a while before it clicks.
@JorgetePanete
@JorgetePanete 2 жыл бұрын
"has(is)"??
@sanjivgautam9063
@sanjivgautam9063 3 жыл бұрын
For anyone feeling overwhelmed: that's completely reasonable, as this video is a 28-minute recap aimed at experienced machine learning practitioners, and a lot of them are just spamming the top comments with "This is by far the best video", "Everything is clear with this single video", and so on.
@adamgm84
@adamgm84 3 жыл бұрын
Sounds like it is my lucky day then, for me to jump from noob to semi-non-noob by gathering thinking patterns from more-advanced individuals. I will fill in the swiss cheese holes of crystallized intelligence later by extrapolating out from my current fluid intelligence level... or something like that. Sorry I'll see myself out.
@svily0
@svily0 3 жыл бұрын
I was about to make a remark about the presenter speaking like a machine gun at the start. I can't follow such a pace even in my native language, on a lazy Sunday afternoon with a drink in my hand. Who cares what you say if no one manages to understand it? Easy, easy... slow down. No one cares how fast you can speak; what matters is what you are able to explain (so that others understand it).
@user-zw5rp7xx4q
@user-zw5rp7xx4q 3 жыл бұрын
@@svily0 > "I can't follow such a pace even in my native language" ... maybe that's the issue?
@svily0
@svily0 3 жыл бұрын
@@user-zw5rp7xx4q Well, it could be, but on the other hand I have a master's degree. It can't be just that. ;)
@Nathan0A
@Nathan0A 3 жыл бұрын
This is by far the best comment, Everything is clear after reading this single comment! Thank you all
@Scranny
@Scranny 3 жыл бұрын
12:56 - the review of the pseudocode for the attention mechanism was what finally helped me understand it (specifically the meaning of the Q, K, V vectors); that's what other videos were lacking. In the second outer for loop, I still don't fully understand why it loops over the length of the input sequence. The output can be of a different length, no? Maybe this is an error. Also, I think he didn't mention the masking of the remaining output at each step so the model doesn't "cheat".
@Splish_Splash
@Splish_Splash Жыл бұрын
for every word we compute its query, key and value vectors, so we need to loop through our sequence
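To make that concrete: in self-attention (the encoder side) there is exactly one output vector per input position, which is why the loop runs over the input length; on the decoder side a mask hides future positions so the model can't cheat, as noted above. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention, written with explicit loops to mirror the slide's pseudocode (my own illustration, not the slide's exact code).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention.
    X: (seq_len, d_model) input vectors; Wq/Wk/Wv: (d_model, d_k) learned projections."""
    seq_len = X.shape[0]
    d_k = Wq.shape[1]
    Q = X @ Wq  # one query per input position
    K = X @ Wk  # one key per input position
    V = X @ Wv  # one value per input position
    out = np.zeros((seq_len, d_k))
    for i in range(seq_len):  # one output vector per input position
        relevance = np.array([Q[i] @ K[j] for j in range(seq_len)]) / np.sqrt(d_k)
        weights = softmax(relevance)   # how much position i attends to each position j
        out[i] = weights @ V           # weighted sum of the value vectors
    return out
```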
@sarab9644
@sarab9644 3 жыл бұрын
Excellent presentation! Perfect!
@ehza
@ehza 2 жыл бұрын
This is beautiful. Clear and concise!
@ramibishara5887
@ramibishara5887 3 жыл бұрын
Where can I find the slides for this talk, amigos? Thanks.
@DavidWhite679
@DavidWhite679 4 жыл бұрын
This helped me a ton to understand the basics. Thanks!
@asnaeb2
@asnaeb2 3 жыл бұрын
More videos please; this was really informative about what the actual SOTA is.
@thusi87
@thusi87 3 жыл бұрын
Great summary! I wonder if you have a collection of talks on similar topics?
@Lumcoin
@Lumcoin 3 жыл бұрын
(Sorry for the lack of technical terms.) I didn't completely get how transformers work with regard to positional information: isn't X_in the output of the previous hidden layer? That isn't enough for the network, because the input embeddings lack any temporal/positional information, right? But why not just add one new linear temporal value to the embeddings instead of many sine waves at different scales?
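One common rationale (my paraphrase, not from the talk): a single linear position value grows without bound and makes it awkward to express relative offsets, whereas a bank of sinusoids at different frequencies stays bounded and lets attention recover relative positions. A minimal NumPy sketch of the sinusoidal encoding from "Attention Is All You Need", assuming an even d_model:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Pairs of sine/cosine waves at geometrically spaced frequencies (assumes even d_model)."""
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, None]                                  # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(position * div_term)                               # even dims: sine
    pe[:, 1::2] = np.cos(position * div_term)                               # odd dims: cosine
    return pe

# The encoding is simply added to the token embeddings:
# X_in = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```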
@amortalbeing
@amortalbeing 3 жыл бұрын
This was fantastic. Really well presented.
@carolynconway3893
@carolynconway3893 4 жыл бұрын
Such a useful talk! TYSM 🤗
@dhariri
@dhariri 3 жыл бұрын
Excellent talk. Thank you @leopd !
@ooio78
@ooio78 3 жыл бұрын
Wonderful and educational, value to those who need it!
@jaypark7417
@jaypark7417 3 жыл бұрын
Thank you for sharing it. Really helpful!!
@MrDudugeda2
@MrDudugeda2 3 жыл бұрын
This is easily the best NLP talk I've heard this year.
@a_sun5941
@a_sun5941 2 жыл бұрын
Great Presentation!
@maciej2320
@maciej2320 4 ай бұрын
Four years ago! Shocking.
@driziiD
@driziiD Жыл бұрын
very impressive presentation. thank you.
@FernandoWittmann
@FernandoWittmann 4 жыл бұрын
I have a question: is it possible to use those SOTA models shared at the very end of the presentation as document embeddings? Something analogous to doc2vec. My intent is to transform documents into vectors that represent them well and would allow me to compare the similarity of different documents.
@LeoDirac
@LeoDirac 3 жыл бұрын
Absolutely yes.
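For anyone looking for a starting point, here is one common recipe (a sketch, not something shown in the talk): mean-pool the token embeddings of a pretrained transformer to get one fixed-size vector per document, then compare documents with cosine similarity. It assumes the Hugging Face transformers library, and the model name is only an example.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state       # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)         # mean over real tokens only

docs = embed(["first document ...", "second document ..."])
similarity = torch.nn.functional.cosine_similarity(docs[0], docs[1], dim=0)
```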
@dgabri3le
@dgabri3le 3 жыл бұрын
Thanks! Really good compare/contrasting.
@shivapriyakatta4885
@shivapriyakatta4885 3 жыл бұрын
One of the best talks on Deep Learning!...thank you
@pushpakgupta22
@pushpakgupta22 4 жыл бұрын
All I want is his level of humility and knowledge.
@pazmiki77
@pazmiki77 3 жыл бұрын
Don't just want it, make it happen then. You could literally do this.
@pi5549
@pi5549 3 жыл бұрын
Find the humility to get your head down and acquire the knowledge. Let the universe do the rest.
@bgundogdu16
@bgundogdu16 4 жыл бұрын
Great presentation!
@mikiallen7733
@mikiallen7733 4 жыл бұрын
Do multi-headed attention + positional encoding work as well as, or better than, a plain vanilla LSTM on numeric input vectors/tensors (floats or integers)? Your input is highly appreciated.
@anoop5611
@anoop5611 3 жыл бұрын
Not an expert here, but the way attention works is closely tied to the way nearby words are relevant to each other: for example, a pronoun and its relevant noun. Multi-headed attention identifies more such abstract relationships between words in a window. So if the numeric input sequence has a set of consistent relationships among its members, then attention should help embed more relational information into the input, so that processing it becomes easier when honoring that relational information.
@gauravkantrod1205
@gauravkantrod1205 3 жыл бұрын
Amazing talk. It would be of great help if you could post a link to the materials.
@BoersenCrashKurs
@BoersenCrashKurs 2 жыл бұрын
I want to use transformers for time series analysis where the dataset includes individual-specific effects. What do I do? In this case, is the only possibility to match the batch size to the length of each individual's series?
@LeoDirac
@LeoDirac 2 жыл бұрын
No, batch and time will be different tensor dimensions. If your dataset has 17 features, and the length is 100 time steps, then your input tensor might be 32x100x17 with a batch size of 32.
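A minimal PyTorch sketch of that shape convention (the projection width, number of heads, and layer count are illustrative choices of mine, not recommendations from the talk):

```python
import torch
import torch.nn as nn

batch, time_steps, n_features = 32, 100, 17      # the shapes from the reply above
x = torch.randn(batch, time_steps, n_features)   # input tensor: (batch, time, features)

d_model = 64                                     # project the 17 features to a head-friendly width
proj = nn.Linear(n_features, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Note: a positional encoding would normally be added to proj(x) before the encoder.
h = encoder(proj(x))                             # (32, 100, 64): one vector per time step
```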
@SanataniAryavrat
@SanataniAryavrat 3 жыл бұрын
Wow... that was a quick summary of NN research over the past several decades...
@Achrononmaster
@Achrononmaster 3 жыл бұрын
You folks need to look into asymptotics and Padé approximant methods, or, for functions of many variables (as ANNs are), the generalized Canterbury approximants. There is not yet a rigorous development in information-theoretic terms, but Padé summations (essentially continued-fraction representations) are known to yield rapid convergence to correct limits for divergent Taylor series in non-converging regions of the complex plane. What this boils down to is that you only need a fairly small number of iterations to get very accurate results if you only require approximations. To my knowledge this sort of method is not being used in deep learning, but it has been used by physicists in perturbation theory; I think you would find it extremely powerful in deep learning. Padé (or Canterbury) summation methods, when generalized, are a way of extracting information from incomplete data. So if you use a neural net to get the first few approximants, and assume they are modelling an analytically continued function, then you have a series (the node activation summation) you can Padé-sum to extract more information than you could otherwise.
@matthewarnold3922
@matthewarnold3922 4 жыл бұрын
Excellent talk. Kudos!
@georgejo7905
@georgejo7905 4 жыл бұрын
Interesting. It looks a lot like my signals class: how to implement various filters on a DSP.
@ProfessionalTycoons
@ProfessionalTycoons 4 жыл бұрын
RIP LSTM 2019, she/he/it/they would be remembered by....
@mohammaduzair608
@mohammaduzair608 4 жыл бұрын
Not everyone will get this
@dineshnagumothu5792
@dineshnagumothu5792 4 жыл бұрын
Still, LSTM works better with long texts. It has its own use cases.
@mateuszanuszewski69
@mateuszanuszewski69 3 жыл бұрын
@@dineshnagumothu5792 you obviously didn't get it. it is "DEAD", lol. RIP LSTM.
@maloukemallouke9735
@maloukemallouke9735 2 жыл бұрын
Thanks so much for the video. Can I ask if anyone knows where I can find a pre-trained model to identify numbers from 0 to 100 in an image? They are not handwritten, and they can be anywhere in the image. Thanks in advance.
@zeeshanashraf4502
@zeeshanashraf4502 Жыл бұрын
Great presentation.
@SuilujChannel
@SuilujChannel 4 жыл бұрын
Question regarding 26:27: if I plan on analysing time-series sensor data, should I stick with an LSTM, or is the transformer model a good choice for time-series data?
@isaacgroen3692
@isaacgroen3692 4 жыл бұрын
I could use an answer to this question as well
@akhileshrai4176
@akhileshrai4176 4 жыл бұрын
@@isaacgroen3692 Damn I have the same question
@abdulazeez7971
@abdulazeez7971 4 жыл бұрын
You need to use an LSTM for time series, because transformers are all about attention and positional information, which has to be learnt, whereas in time series it's all about trends and patterns, which requires the model to remember a complete sequence of data points.
@SuilujChannel
@SuilujChannel 4 жыл бұрын
@@abdulazeez7971 thanks for the info :)
@Jason-jk1zo
@Jason-jk1zo 4 жыл бұрын
The primary advantages of the transformer come from attention and positional encoding, which are quite useful for translation because grammar differences between languages can reorder words between the input and the output. But time-series sensor data is not reordered (the output follows the order of the input), so an RNN such as an LSTM is a suitable choice for analysing such data.
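For comparison, here is a minimal many-to-one LSTM forecaster in PyTorch along the lines suggested above (purely illustrative; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Read a window of sensor readings, predict the next reading."""
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # last hidden state predicts the next step

model = LSTMForecaster(n_features=17)
y_hat = model(torch.randn(32, 100, 17))   # -> (32, 17)
```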
@dalissonfigueiredo
@dalissonfigueiredo 3 жыл бұрын
What a great explanation, thank you.
@lukebitton3694
@lukebitton3694 3 жыл бұрын
I've always wondered how standard ReLUs can provide non-trivial learning if they are essentially linear for positive values. I know that with standard linear activation functions any deep network can be reduced to a single-layer transformation. Is it the discontinuity at zero that stops this being the case for ReLU?
@lucast2212
@lucast2212 3 жыл бұрын
Exactly. Think of it like this: a matrix-vector multiplication is a linear transformation, which rotates and scales its input vector. That is why you can write two of these operations as a single one (A_matrix * B_matrix * C_vec = D_matrix * C_vec), and also why you can absorb scalar multiplications in between (which is what a linear activation would do; it just scales the vector). But if you only scale some of the entries of the vector (ReLU), that no longer works. If you take a pen, rotating and scaling it preserves your pen, but if you want to scale only parts of it, you have to break it.
@lukebitton3694
@lukebitton3694 3 жыл бұрын
@@lucast2212 Cheers! good explanation, thanks.
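A tiny NumPy illustration of that point (my own, assuming nothing beyond the comments above): two linear layers collapse into one matrix, but inserting a ReLU between them breaks the collapse, because which coordinates get zeroed depends on the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)

# Two purely linear layers collapse into one: B(Ax) == (BA)x for every x.
assert np.allclose(B @ (A @ x), (B @ A) @ x)

# With ReLU in between, no single matrix C reproduces the map for all inputs,
# because the set of zeroed coordinates changes with the input.
relu = lambda v: np.maximum(v, 0.0)
y = B @ relu(A @ x)   # piecewise-linear, not globally linear
```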
@sainissunil
@sainissunil Жыл бұрын
This talk is awesome!
3 жыл бұрын
Amazing presentation
@ax5344
@ax5344 3 жыл бұрын
1. At 10:17 the speaker says all we need is the encoder part for a classification problem - is this true? How about BERT: when we use BERT encodings for classification, say sentiment analysis, is it only the encoder part doing the work? 2. At 12:25 the slide is really clear in explaining relevance[i,j], but the example is translation, so clearly it is not only about the encoder. In the encoder, how is relevance[i,j] computed? What is the difference between key and value? It seems they both come from the input vector - aren't they the same in the encoder? Thank you!
@trevorclark2186
@trevorclark2186 2 жыл бұрын
Good question... key and value seem symmetric. I was expecting symmetry in a self-attention model, but I can't quite understand how this works with the key/value analogy.
@xruan6582
@xruan6582 3 жыл бұрын
20:00 - If I multiply the output by a small scaling factor λ₁ (e.g. 0.01) before feeding it to the activation function, the sigmoid will be sensitive to the difference between, say, 5 and 50. Similarly, if I multiply the sigmoid output by another scaling factor λ₂ (e.g. 100), I can get an activated output ranging between 0 and 100. Is that a better solution than ReLU, which has no cap at all?
@LeoDirac
@LeoDirac 3 жыл бұрын
The problem with that approach is that in the very middle of the range the sigmoid is almost entirely linear - for input near zero, the output is 0.5 + x/4. And neural networks need nonlinearity in the activation to achieve their expressiveness. Linear algebra tells us that if you have a series of linear layers they can always and exactly be compressed down to a single linear layer, which we know isn't a very powerful neural net.
@xruan6582
@xruan6582 3 жыл бұрын
@@LeoDirac ReLU is linear from 0 to ∞
@LeoDirac
@LeoDirac 3 жыл бұрын
@@xruan6582 Right! That's the funny thing about ReLU - it either "does nothing" (leaves the input the same) or it "outputs nothing" (zero). But by sometimes doing one and sometimes doing the other, it is effectively making a logic decision for every neuron based on the input value, and that's enough computational power to build arbitrarily complex functions. If you want to follow the biological analogy, you can fairly accurately say that each neuron in a ReLU net is firing or not, depending on whether the weighted sum of its inputs exceeds some threshold (either zero, or the bias if your layer has bias). And then a cool thing about ReLU is that they can fire weakly or strongly.
@leromerom
@leromerom 3 жыл бұрын
Clear, precise, fluid thank you!
@ChrisHalden007
@ChrisHalden007 Жыл бұрын
Great video. Thanks
@jung-akim9157
@jung-akim9157 3 жыл бұрын
This is one of the clearest and most informative presentation about nlp models and their comparison. Thank you so much.
@tastyw0rm
@tastyw0rm Жыл бұрын
This was more than meets the eye
@thebanjak2433
@thebanjak2433 4 жыл бұрын
Well done and thank you
@etiennetiennetienne
@etiennetiennetienne 4 жыл бұрын
How do transformers deal with very large inputs, for example if you want to process an entire book? Is there still some memory or autoregressive setting with transformers?
@LeoDirac
@LeoDirac 3 жыл бұрын
Short answer is you can't. Long answer is people are working on it - don't have any research papers handy, and I don't think there's consensus. But fundamentally this remains one of the key limitations of transformers - they don't work on very large documents. In practice, the size of documents they do work on is big enough for most any problem. I mean if you really want to (a.k.a. have a big team of smart engineers), nothing's stopping you from building a giant cluster implementation where each node is handling its own set of tokens, but dear god it would be slow. Bandwidth inside a GPU is ~terabyte / second, while most datacenter networks can't do more than a couple gigabytes/second.
@etiennetiennetienne
@etiennetiennetienne 3 жыл бұрын
@@LeoDirac Thanks for the (short and long) answers. I have recently seen publications aiming at reducing the quadratic complexity, like "Transformers are RNNs" (arxiv.org/abs/2006.16236). I guess transformers really solve how to route information accurately, but not how to compress memory, nor how to use past tokens that are out of reach for gradient-based optimization (perhaps this is badly formulated).
@nonamenoname2618
@nonamenoname2618 3 жыл бұрын
Do these transformers also work well for time-series predictions? I am working on air pollution prediction and would like to try out transformers in some Keras architecture for that application, if the architectures are available somewhere. Thanks.
@LeoDirac
@LeoDirac 3 жыл бұрын
I haven't seen this tried in the literature. But transformers sure should work for time-series analysis - no reason they shouldn't to my eye. You might have to do the positional encoding yourself though - not sure if the Keras blocks do that for you.
@nonamenoname2618
@nonamenoname2618 3 жыл бұрын
@@LeoDirac Thanks for the hint! In case you have yourself used some architecture for time-series forecasting that is available on GitHub or elsewhere, I would appreciate knowing about it!
@anewmanvs
@anewmanvs 3 жыл бұрын
Very good presentation
@timharris72
@timharris72 4 жыл бұрын
This is hands down the best presentation on LSTMs and Transformers I have ever seen. The speaker is really good. He knows his stuff.
@FrancescoCapuano-ll1md
@FrancescoCapuano-ll1md Жыл бұрын
This is outstanding!
@aj-kl7de
@aj-kl7de 4 жыл бұрын
Great stuff.
@rohitdhankar360
@rohitdhankar360 8 ай бұрын
@10:30 - "Attention Is All You Need" - the multi-head attention mechanism.
@thomaskwok8389
@thomaskwok8389 4 жыл бұрын
Clear and concise👍
@riesler3041
@riesler3041 3 жыл бұрын
Presentation: perfect Explanation: perfect me (every 10 mins): " but that belt tho... ehh PERFECT!"
@ziruiliu3998
@ziruiliu3998 Жыл бұрын
Supposing I am using a net to approximate a real-world physics ODE with time-series data, is the Transformer still the best choice in this case?
@seattleapplieddeeplearning
@seattleapplieddeeplearning Жыл бұрын
I'm not sure. I have barely read any papers on this kind of modeling. I will say that a wonderful property of transformers is that they can learn to analyze arbitrary dimensional inputs - it's easy to create positional encodings for 1D inputs (sequence), or 2D (image), or 3D, 4D, 5D, etc. Some physics modeling scenarios will want this kind of input. If your inputs are purely 1D, you could use older NN architectures, but in 2023 there are very few situations where I'd choose an LSTM over a transformer. (e.g. if you need an extremely long time horizon.) -Leo
@ziruiliu3998
@ziruiliu3998 Жыл бұрын
@@seattleapplieddeeplearning Thanks for your reply, this really helps me.
@oleksiinag3150
@oleksiinag3150 3 жыл бұрын
He is incredible. One of the best presenters.
@snehotoshbanerjee1938
@snehotoshbanerjee1938 3 жыл бұрын
Simply Wow!
@MoltarTheGreat
@MoltarTheGreat 3 жыл бұрын
Amazing video, I feel like I actually have a more concrete grasp on how transformers work now. The only thing I didn't understand was the Positional Encoding but that's because I'm unfamiliar with signal processing.
@bruce-livealifewewillremem2663
@bruce-livealifewewillremem2663 3 жыл бұрын
Dude, can you share your PPT or PDF? Thanks in advance!
@Davourflave
@Davourflave 3 жыл бұрын
Very nice recap of Transformers and what sets them apart from RNNs! Just one little remark: you are not really paying the N^2 cost with the transformer, since you fixed N to be at most some maximum sequence length. You can now set this N to be a much bigger number, as GPUs have been highly optimized for the corresponding multiplications. However, for long sequence lengths, the quadratic nature of an all-to-all comparison is going to be an issue nonetheless.
@cafeinomano_
@cafeinomano_ Жыл бұрын
Best Transformer explanation ever.
@rp88imxoimxo27
@rp88imxoimxo27 3 жыл бұрын
Nice video but forced to watch on 2x speed trying not to fall asleep
@ThingEngineer
@ThingEngineer 3 жыл бұрын
This is by far the best video. Ever.
@chrisfiegel9455
@chrisfiegel9455 10 ай бұрын
This was amazing.
@DrummerBoyGames
@DrummerBoyGames 3 жыл бұрын
Excellent vid. I am wondering about a point made around 22:00 about SGD being "slow but gives great results." I was under the impression that SGD was generally considered pretty OK w.r.t. speed, especially compared to full gradient descent? Maybe it's slow compared to Adam, or slow in this specific use case. Perhaps I'm wrong. Anyway, thanks for the vid!
@LeoDirac
@LeoDirac 3 жыл бұрын
I was really just comparing SGD vs Adam there. Adam is usually much faster than SGD to converge. SGD is the standard and so a lot of optimization research has tried to produce a faster optimizer. Full batch gradient descent is almost never practical in deep learning settings. That would require a "minibatch size" equal to your entire dataset, which would require vast amounts of GPU RAM unless your dataset is tiny. FWIW, full batch techniques can actually converge fairly quickly, but it's mostly studied for convex optimization problems, which neural networks are not. The "noise" introduced by the random samples in SGD is thought to be very important to help deal with the non-convexity of the NN loss surface.
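For concreteness, this is how the two optimizers being compared are typically set up in PyTorch (the model and learning rates are just illustrative defaults, not values from the talk):

```python
import torch

model = torch.nn.Linear(10, 2)   # stand-in for a real network

# Adam: adaptive per-parameter step sizes; typically converges in fewer steps.
adam = torch.optim.Adam(model.parameters(), lr=3e-4)

# SGD with momentum: usually needs more tuning (e.g. a learning-rate schedule)
# and more steps, but often matches or beats Adam's final generalization.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```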
@mahomagi6543
@mahomagi6543 3 жыл бұрын
Can we also translate source code to natural language?
@DeltonMyalil
@DeltonMyalil 4 ай бұрын
This aged like fine wine.
@BlockDesignz
@BlockDesignz 4 жыл бұрын
This is brilliant.
@randomcandy1000
@randomcandy1000 Жыл бұрын
this is awesome!!! thank you
@jeffg4686
@jeffg4686 3 ай бұрын
Relevance is just how often a word appears in the input? Never mind, I looked it up. The answer is similarity of tokens in the embedding space: pairs with higher similarity get more relevance.
@g3kc
@g3kc 2 жыл бұрын
great talk!
@BiranchiNarayanNayak
@BiranchiNarayanNayak 4 жыл бұрын
Very well explained... Love it.
@hailking5588
@hailking5588 2 жыл бұрын
Does anybody know why transfer learning never really worked with LSTMs? Any links or papers on that?
@LeoDirac
@LeoDirac Жыл бұрын
I've never read any papers about this - just my personal experience and talking to colleagues. If I had to guess, I'd say it's related to the fact that LSTM's are really tough to train. Which is not surprising if you think about them as incredibly deep networks (depth = sequence length) but the weights are re-used at every layer. Those few parameters get re-used for a lot of things. Transfer learning necessarily means being able to _quickly_ retrain a network on a new task. But training is never fast with an LSTM. That's just my speculation though.
@srikantachaitanya6561
@srikantachaitanya6561 3 жыл бұрын
Thank you...
@davr9724
@davr9724 3 жыл бұрын
Amazing!!!
@kjpcs123
@kjpcs123 3 жыл бұрын
A great introduction to transformers.
@user-iw7ku6ml7j
@user-iw7ku6ml7j Жыл бұрын
Awesome!
@TJVideoChannelUTube
@TJVideoChannelUTube Жыл бұрын
At 18:25, Leo Dirac mentioned that the Transformer model doesn't need activation functions. Is this correct?
@seattleapplieddeeplearning
@seattleapplieddeeplearning Жыл бұрын
No that's not correct. I understand the confusion, though. The advantage is that Transformers don't require those specific "complex" activations of sigmoid or tanh which LSTM relies on. Transformers can use ReLU activations which are computationally much simpler. With GPUs the actual amount of computation isn't really the issue, but rather the precision at which they need to run. LSTMs typically need to run in full 32-bit precision, whereas modern datacenter GPUs like A100, H100 or TPUs are way faster at 16-bit computation. That's because tanh and sigmoid squash inputs down to numbers that are very close to 1, 0 or -1, and so small differences, say between 0.991 and 0.992, become very meaningful, requiring lots of digits of precision, and thus lots of silicon to keep track of them. But simpler activations like ReLU tend to work much better on 16-bit silicon. One more clarification: Transformers typically require softmax computations for the attention mechanism, which are technically very similar to sigmoid. Softmax still squash inputs close to 0 and 1. But for reasons ... the small differences in softmax activations don't matter much.
@seattleapplieddeeplearning
@seattleapplieddeeplearning Жыл бұрын
Okay, reasons. The difference is subtle, but essentially it's because a 0.999 coming out of softmax means "pay full attention here" and the other 0.001 doesn't matter for anything. But in an LSTM, a 0.999 coming out of a sigmoid is effectively saying "0.001 of the backprop signal goes back in time to the previous token." And then you might need another 0.001 to go back in time to the token before that. So that's why the higher precision is critical for LSTM. Vanishing gradients.
@TJVideoChannelUTube
@TJVideoChannelUTube Жыл бұрын
@@seattleapplieddeeplearning Here are comments by ChatGPT: Activation functions are not essential for the Transformer model as they are not used within the model. The self-attention mechanism within the Transformer is able to introduce non-linearity into the model, which is why the Transformer does not require activation functions like other neural networks. Instead, the self-attention mechanism in the Transformer uses matrix multiplication and softmax functions to compute the attention scores, and these scores are used to weight the input vectors. The use of the softmax function in the attention mechanism can be considered a form of activation function, but it is not the same as the commonly used activation functions in other neural network models. However, activation functions can be used in other parts of the Transformer model, such as in the feed-forward neural networks in the encoder and decoder layers. In transformer models, the self-attention mechanism replaces the need for activation functions in the traditional sense, as it allows for the model to selectively weight the input features without explicitly passing them through an activation function.
@seattleapplieddeeplearning
@seattleapplieddeeplearning Жыл бұрын
ChatGPT's answer is a bit misleading (especially the first sentence), but not necessarily wrong. Any NN is effectively useless without some kind of nonlinearity, and "activation" is pretty synonymous with "nonlinearity" but a bit more vague. It's true that because there's nonlinearity in attention mechanism that you don't need it in the other areas, which are typically MLP's so I'll call them MLP's. But I'm not aware of any real transformers in use which skip the nonlinearity in the "MLP" parts of the model. But the point I was making on that slide is why Transformers are better than LSTM and similar old-school NN's. And that's because LSTM intrinsically _needs_ tanh and sigmoid to operate, and these have a bunch of problems. Transformers can use any activation function. (Or arguably "none" but I wouldn't say that because I think it's misleading, and probably would hurt quality a lot too.)
@TJVideoChannelUTube
@TJVideoChannelUTube Жыл бұрын
@@seattleapplieddeeplearning I think ChatGPT implies that the self-attention mechanism in the Transformer model does not use activation functions. Is this statement correct: "The self-attention mechanism is not involved in the deep learning process, because no activation functions are needed in this layer"?
@rubenstefanus2402
@rubenstefanus2402 4 жыл бұрын
Great ! Thanks
@juliawang3131
@juliawang3131 Жыл бұрын
impressive!
@joneskiller8
@joneskiller8 5 ай бұрын
I need that belt.
@brandomiranda6703
@brandomiranda6703 4 жыл бұрын
If tanh/sigmoid aren’t that useful, then what is the ReLU equivalent for RNNs? Is there a modern/improved version?
@kvazau8444
@kvazau8444 4 жыл бұрын
We still use ReLU. I've used leaky ReLU in the past to solve the zeroing problem, though.
@0MVR_0
@0MVR_0 4 жыл бұрын
Traditionally, recurrent nets have used sigmoid and tanh to gate what is relevant for predictive capability. They are useful, yet have idiosyncratic requirements to be productive. This is a nightmare for anyone trying to answer why their non-refundable black box is only sometimes better than the weatherman.
@LeoDirac
@LeoDirac 3 жыл бұрын
This is a great question. I wish I had a good specific answer for you, because I'm pretty sure there are some things will be reliably somewhat better. But generally, lots of people have tried hard to find a better kind of recurrent net than LSTM -- including exhaustive and evolutionary architecture searches -- and while they do find improvements, apparently nothing has been discovered which is _better enough_ to justify lots of people switching. And honestly there's *a lot* to be said for building upon well-understood primitives. You really don't want to go searching through several novel hyperparameters for only a modest benefit.
@kvazau8444
@kvazau8444 3 жыл бұрын
@@LeoDirac This is not quite accurate. All the recent huge-parameter-count language models, for instance, use transformers instead of LSTMs
@user-or7ji5hv8y
@user-or7ji5hv8y 4 жыл бұрын
Can we have access to the slides?
@favlcas
@favlcas 4 жыл бұрын
Great presentation. Thank you!
@bryancc2012
@bryancc2012 4 жыл бұрын
good video!