Vanishing & Exploding Gradient explained | A problem resulting from backpropagation

125,941 views

deeplizard

Comments: 100
@deeplizard
@deeplizard 6 years ago
Machine Learning / Deep Learning Fundamentals playlist: kzbin.info/aero/PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU Keras Machine Learning / Deep Learning Tutorial playlist: kzbin.info/aero/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL BACKPROP VIDEOS: Backpropagation explained | Part 1 - The intuition kzbin.info/www/bejne/jnaWnKWcaKiEotU Backpropagation explained | Part 2 - The mathematical notation kzbin.info/www/bejne/aJ62qqaIrZJkmZI Backpropagation explained | Part 3 - Mathematical observations kzbin.info/www/bejne/fWbFZZ2Id7CBrtk Backpropagation explained | Part 4 - Calculating the gradient kzbin.info/www/bejne/kKOYp5x3j6yhmqc Backpropagation explained | Part 5 - What puts the “back” in backprop? kzbin.info/www/bejne/rnTPfJKVeNaNpLM
@yuxiaofei3442
@yuxiaofei3442 6 years ago
The voice is so nice and confident.
@deeplizard
@deeplizard 6 years ago
Thanks, yu!
@himanshutanwani5118
@himanshutanwani5118 5 years ago
@@deeplizard lol, was that intentional? xD
@lonewolf2547
@lonewolf2547 5 years ago
I landed here after checking Andrew's videos about this (which were confusing), but this video explained it very clearly and simply.
@prateekkumar151
@prateekkumar151 5 years ago
Same here. Didn't like his explanation. This was very clear. Thanks!
@milindbebarta2226
@milindbebarta2226 1 year ago
Yep, his explanations aren't clear sometimes. It's frustrating.
@AfnanKhan-ni6zc
@AfnanKhan-ni6zc 2 months ago
Same. Now I watch other videos first, then move to his lectures 😂
@jackripper6066
@jackripper6066 3 years ago
I was stuck on this concept for hours and didn't click on this video because of the view count, but I was wrong: this is the clearest and simplest explanation I've found. Thanks a lot!
@MarkW_
@MarkW_ 4 years ago
Perhaps a small addition to the explanation for vanishing gradients in this video, from a computer architecture point of view. When a network is trained on an actual computer system, the variable types (e.g. floats) have a limited 'resolution' that they can express due to their numerical representation. This means that adding or subtracting a very small value from a value 'x' could actually result in 'x' (unchanged), meaning that the network stopped training that particular weight. For example: 0.29 - 0.000000001 could become 0.29. With neural networks moving towards smaller variable types (e.g. 16-bit floats instead of 32), this problem is becoming more pronounced. For a similar reason, floating point representations usually do not become zero; they just approach zero.
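To make this rounding effect concrete, here is a minimal sketch (the specific values are illustrative, not from the video); with typical IEEE rounding, both comparisons print True:

```python
import numpy as np

w32 = np.float32(0.29)
print(w32 - np.float32(1e-9) == w32)   # True: the update is below float32 resolution near 0.29

w16 = np.float16(0.29)
print(w16 - np.float16(1e-4) == w16)   # True: float16 drops even larger updates
```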
@RabbitLLLord
@RabbitLLLord 5 months ago
Dude, this is super insightful. Thanks!
@cmram23
@cmram23 3 years ago
The best and the simplest explanation of Vanishing Gradient I have found so far.
@HassanKhan-kq4lj
@HassanKhan-kq4lj 4 months ago
I am having my deep learning exam tomorrow. I started studying just one day before the exam and couldn't understand anything. Then I found your video, and now I understand this concept. Thanks a lot 😭😭😭
@karimafifi5501
@karimafifi5501 4 years ago
The intro is so relaxing. It is like you are in another dimension.
@tymothylim6550
@tymothylim6550 3 years ago
Thank you very much for this video! I learnt about these similar problems of vanishing and exploding gradients and how they affect the convergence of weight values to their optimal values!
@farshad-hasanpour
@farshad-hasanpour 3 years ago
This channel never lets us down. Great work!
@sciences_rainbow6292
@sciences_rainbow6292 3 years ago
Your videos are just perfect! The voice, the explanation, the animation! Genius of pedagogy :)
@prithviprakash1110
@prithviprakash1110 3 years ago
Great job explaining this, understood something I was unsure about in a very decisive and clear way. Thanks!
@deeps-n5y
@deeps-n5y 5 years ago
Underrated channel ! Thanks for posting these videos :)
@carinacruz3945
@carinacruz3945 4 years ago
The best source to understand machine learning concepts in an easy way. =)
@entertainment8067
@entertainment8067 2 years ago
Thanks for an amazing tutorial, love from Afghanistan
@NikkieBiteMe
@NikkieBiteMe 5 years ago
I fiiiinally understand how the vanishing gradient problem occurs
@mita1498
@mita1498 5 years ago
Very good channel indeed !!!
@absolute___zero
@absolute___zero 4 years ago
There is a bigger problem with gradient descent, and it is not the vanishing or exploding thing: it gets stuck in local minima, and that hasn't been solved yet. There are only partial solutions like simulated annealing, genetic algorithms, UKF, Monte Carlo, and other approaches that involve randomness. The only way to find a better minimum is to introduce randomness.
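A tiny illustration of the local-minima point above (the function and starting points are made up for this sketch): plain gradient descent settles into whichever basin it starts in, independently of any vanishing or exploding issues.

```python
# Minimize f(w) = w**4 - 3*w**2 + w, which has two basins (minima near w ≈ -1.30 and w ≈ 1.13)
def grad(w):
    return 4 * w**3 - 6 * w + 1

for start in (-2.0, 2.0):
    w = start
    for _ in range(1000):
        w -= 0.01 * grad(w)            # plain gradient descent with a fixed learning rate
    print(start, "->", round(w, 3))    # each run ends in the basin it started in
```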
@craigboger
@craigboger 4 years ago
Thank you so much for your series and explanations!
@absolute___zero
@absolute___zero 4 years ago
Vanishing gradient is not a problem. It is a feature of stacking something upon something that depends on something else, and so on. It is like falling dominoes, but with a larger piece at each step. Because you are chaining function after function after function, where all the functions are summing, and at the end applying a ReLU, you are going to get your output values blowing up! Vanishing gradients are not a problem; it is how it is supposed to work. The math is right: at the lower layers you can't use big gradients, because they are going to affect the output layer exponentially. And also, cut the first minute and a half of the video, because it is just a waste of time.
@jsaezmarti
@jsaezmarti 3 years ago
Thanks for explaining the concept so clearly :D
@albertoramos9586
@albertoramos9586 2 years ago
Thank you so much!!!
@MrSoumyabrata
@MrSoumyabrata 3 years ago
Thank you for such a nice video. Understood the concept.
@abdullahalmazed5387
@abdullahalmazed5387 8 months ago
awesome explanation
@aidynabirov7728
@aidynabirov7728 2 years ago
Awesome video!
@billycheung7095
@billycheung7095 4 years ago
Well explained. Thanks for your work.
@_seeker423
@_seeker423 4 years ago
Yet to see a video that explained this so clearly. One question: does (1) both vanishing and exploding gradients lead to underfitting, or (2) does vanishing lead to underfitting and exploding lead to overfitting?
@fosheimdet
@fosheimdet 2 years ago
Why is this an issue? If the partial derivative of the loss w.r.t. a weight is small, its change should also be small so that we step in the direction of steepest descent of the loss function. Is the problem of vanishing gradients that we effectively lose the ability to train certain weights of our network, reducing dimensionality of our model?
@alphatrader5450
@alphatrader5450 5 years ago
Great explanation! Background gives me motion sickness though.
@deeplizard
@deeplizard 5 years ago
Thanks for the feedback! I'll keep that in mind.
@ritukamnnit
@ritukamnnit 3 years ago
good explanation. keep up the great work :)
@yongwoo1020
@yongwoo1020 6 years ago
I would have labeled your layers or edges "a", "b", "c", etc. when you were discussing the cumulative effect on gradients that are earlier in the network (gradient = a * b * c * d, etc.). It can be a bit confusing, since convention has us thinking one way and the notation is reinforcing that, while the conversation is about backprop, which runs counter to that convention. The groundwork is laid for a very basic misunderstanding that could be cured with simple labels. Great video, btw.
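To make the cumulative-product idea concrete, here is a minimal sketch (the factor values 0.5 and 1.5 are made up for illustration): when the chain-rule terms for an early weight are consistently below 1 their product collapses toward zero, and when they are consistently above 1 it blows up.

```python
import numpy as np

depth = 20
small_terms = np.full(depth, 0.5)   # each chain-rule factor (a, b, c, ...) below 1
large_terms = np.full(depth, 1.5)   # each chain-rule factor above 1

print(np.prod(small_terms))  # ~9.5e-07: the gradient for an early weight vanishes
print(np.prod(large_terms))  # ~3325:    the gradient for an early weight explodes
```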
@deeplizard
@deeplizard 6 years ago
Appreciate your feedback, Samsung Blues.
@bashirghariba879
@bashirghariba879 4 years ago
Good description
@loneWOLF-fq7nz
@loneWOLF-fq7nz 5 years ago
Best explanation!!! Good work!
@hamidraza1584
@hamidraza1584 3 years ago
Does this problem occur in simple neural networks or in RNN/LSTM networks?
@abihebbar
@abihebbar 4 years ago
In the case of an exploding gradient, when it gets multiplied by the learning rate (between 0.0001 and 0.01), the result will be much less (usually less than 1). When this is then subtracted from the existing weight, wouldn't the updated weight still be less than 1? In that case, how is it different from a vanishing gradient?
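A hedged arithmetic sketch related to this question (the gradient values below are made up): when the gradient itself has exploded, multiplying by a small learning rate does not necessarily leave a small update, which is what separates this case from the vanishing one.

```python
learning_rate = 0.001
weight = 0.4

vanishing_grad = 1e-7    # tiny gradient: the weight barely moves
exploding_grad = 1e6     # huge gradient: even a small learning rate leaves a huge update

print(weight - learning_rate * vanishing_grad)  # ≈ 0.4: effectively unchanged
print(weight - learning_rate * exploding_grad)  # -999.6: the update dwarfs the weight
```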
@hossainahamed8789
@hossainahamed8789 4 years ago
loved it
@gideonfaive6886
@gideonfaive6886 4 years ago
{ "question": "Vanishing Gradient is mostly related to ………… and is usually caused by having too many values …………… in calculating the …………", "choices": [ "earlier weights, less than one, gradient", "earlier weights, less than one, loss", "back propagation, less than one, gradient", "back propagation, less than one, loss" ], "answer": "earlier weights, less than one, gradient", "creator": "Hivemind", "creationDate": "2020-04-21T21:46:35.500Z" }
@deeplizard
@deeplizard 4 years ago
Thanks, Gideon! Just added your question to deeplizard.com/learn/learn/video/qO_NLVjD6zE :)
@garrett6064
@garrett6064 3 years ago
Hmm... possible solutions, my guess is (A) too many layers in the net, or (B) a separate Learning Rate for each layer used in conjunction with a function to normalize the Learning Rate and thus the gradient. But this is just my guess.
@deeplizard
@deeplizard 3 years ago
Nice! Not sure what type of impact these potential solutions may have. A known solution is revealed in the next episode ✨
@garrett6064
@garrett6064 3 years ago
@@deeplizard After spending a further ten seconds thinking about this, I decided I didn't like my "solutions". But I did think that "every first solution should be: use fewer layers" was not a bad philosophy. 😆
@dennismuller371
@dennismuller371 5 years ago
A gradient gets subtracted from the weights to update them. This gradient can be really small and hence have no impact. It can also become really large. How come, since the gradient gets subtracted, an exploding gradient creates values that are larger than their former values? Should it not be something like a negative weight then? Nice videos btw. :)
@deeplizard
@deeplizard 5 years ago
Hey Dennis - Good question and observation. Let me see if I can help clarify. Essentially, vanishing gradient = small gradient update to the weight. Exploding gradient = large gradient update to the weight. With exploding gradient, the large gradient causes a relatively large weight update, which possibly makes the weight completely "jump" over its optimal value. This update could indeed result in a negative weight, and that's fine. The "exploding" is just in terms of how large the gradient is, not how large the weight becomes. In the video, I did illustrate the exploding gradient update with a larger positive number, but it probably would have been more intuitive to show the example with a larger negative number. Does this help clarify?
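A minimal sketch of the overshoot described in this reply (the toy loss and the scaling factors are made up, not from the video): a normal-sized gradient nudges the weight toward its optimum, while an exploded gradient jumps right past it, and a larger one can even flip the weight's sign.

```python
# Toy 1-D loss L(w) = (w - 2)**2 with its optimum at w = 2
def grad(w):
    return 2 * (w - 2)

lr = 0.01
w = 2.1                           # already close to the optimum

print(w - lr * grad(w))           # 2.098: a normal gradient nudges w toward 2
print(w - lr * 500 * grad(w))     # 1.1:   an "exploded" gradient jumps over the optimum
print(w - lr * 5000 * grad(w))    # -7.9:  a larger explosion flips the weight's sign
```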
@ahmedelhamy1845
@ahmedelhamy1845 3 years ago
@@deeplizard I think that gradients can't be greater than 0.25 when using sigmoid as the activation function, since its derivative ranges from 0 to 0.25, so it will never exceed 1 by any means. I think exploding gradients come from a weight initialization problem, where weights are initialized with large values. Correct or clarify me, please?
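A small numerical sketch related to this comment (the weight value 10 is arbitrary): the sigmoid derivative does top out at 0.25, but each backprop factor is roughly weight × derivative, so large initial weights can still push a per-layer factor above 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 100001)
dsig = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
print(dsig.max())                        # ~0.25: it never exceeds 1/4

w = 10.0                                 # a large initial weight
print(w * dsig.max())                    # 2.5: the per-layer chain-rule factor exceeds 1
```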
@sidbhatia4230
@sidbhatia4230 5 years ago
Vanishing gradient is dependent on the learning rate of the model, right?
@mikashaw7926
@mikashaw7926 3 years ago
OMG I UNDERSTAND NOW
@driesdesmet1069
@driesdesmet1069 3 years ago
Nothing about the sigmoid function? I thought this was also one of the causes of a vanishing/exploding gradient?
@petraschubert8220
@petraschubert8220 5 years ago
Great video, thanks for that! But I have one long-standing question about backpropagation. I can adjust which layer's weight I hit by summing up the components. But which weight will actually be updated then? Will the layers in between the components of my chain rule update as well? Would be very grateful for an answer, thanks!
@askinc102
@askinc102 6 years ago
If the gradient (of loss) is small, doesn't it imply that a very small update is required?
@deeplizard
@deeplizard 6 years ago
Hey sandesh - Good question. Michael Nielsen addresses this question in chapter 5 of his book, and I think it's a nice explanation. I'll link to the full chapter, but I've included the relevant excerpt below. Let me know if this helps clarify. neuralnetworksanddeeplearning.com/chap5.html "One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases? Of course, this isn't the case. Recall that we randomly initialized the weight and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem."
@VinayKumar-hy6ee
@VinayKumar-hy6ee 6 years ago
As the gradients of the sigmoid or tanh functions are very low, wouldn't that lead to a vanishing gradient effect, and why are these chosen as activation functions, deeplizard?
@abdulbakey8305
@abdulbakey8305 5 years ago
@@deeplizard So what is the reason for the derivative to come out small, if it is not that the function is at an optimum w.r.t. that particular weight?
@gideonfaive6886
@gideonfaive6886 4 years ago
{ "question": "Vanishing gradients impair our training by causing our weights to be updated such that their values get further away from the optimal weight values", "choices": [ "False", "True", "BLANK_SPACE", "BLANK_SPACE" ], "answer": "False", "creator": "Hivemind", "creationDate": "2020-04-21T21:53:20.326Z" }
@deeplizard
@deeplizard 4 years ago
Thanks, Gideon! Just added your question to deeplizard.com/learn/video/qO_NLVjD6zE :)
@justchill99902
@justchill99902 5 years ago
Hello! Question: I might sound silly here, but do we ever have a weight update in the positive direction? I mean, the weight was, let's say, 0.3, and after the update it turned into 0.4? While updating, we always "subtract" the gradient * a very low learning rate, and this product that we subtract from the actual weight will always be very small. So unless this product is negative (which only happens when the gradient is negative), we will never add some value to the current weight but always reduce it, right? So to make some sense out of it, when do we get negative gradients? Do we generally have this happening?
@ink-redible
@ink-redible 5 years ago
Yes, we do get negative gradients. If you look at the formula for backprop, you will easily find when the gradient turns out to be negative.
@stormwaker
@stormwaker 5 years ago
Love the series, but please refrain from zoom-in zoom-out background animation - it makes me distracted, nauseous even.
@deeplizard
@deeplizard 5 years ago
Thank you for the feedback!
@nomadian1258
@nomadian1258 1 month ago
100th comment🎉
@rishabbanerjee5152
@rishabbanerjee5152 5 years ago
math-less machine learning is so good :D
@VinayKumar-hy6ee
@VinayKumar-hy6ee 6 years ago
As the gradients of the sigmoid or tanh functions are very low, wouldn't that lead to a vanishing gradient effect, and why are these chosen as activation functions?
@deeplizard
@deeplizard 6 years ago
Hey Vinay - Check out the video on bias below to see how the result from a given activation function may actually be "shifted" to a greater number, which in turn might help to reduce vanishing gradient. kzbin.info/www/bejne/fpbXd5yeqL2Gr9U Let me know if this helps.
@GirlKnowsTech
@GirlKnowsTech 3 years ago
00:30 What is the gradient?
01:18 Introduction
01:45 What is the vanishing gradient problem?
03:28 How does the vanishing gradient problem occur?
05:31 What about exploding gradients?
@deeplizard
@deeplizard 3 years ago
Added to the description. Thanks so much!
@midopurple3665
@midopurple3665 1 year ago
You are great; I feel like I want to marry you for your intelligence.
@albertodomino9420
@albertodomino9420 3 years ago
Please talk slowly.
@lovelessOrphenKoRn
@lovelessOrphenKoRn 5 years ago
I wish you wouldn't talk so fast. Can't keep up. Or subtitles would be nice.
@deeplizard
@deeplizard 5 years ago
There are English subtitles automatically created by KZbin that you can turn on for this video. Also, you can slow down the speed on the video settings to 75% or 50% of the real-time speed to see if that helps.
@mouhamadibrahim3574
@mouhamadibrahim3574 3 years ago
I want to marry you
@deepaksingh9318
@deepaksingh9318 6 years ago
What an easy explanation.. 👍 I am just loving this playlist and don't want it to ever end 😁
@deeplizard
@deeplizard 6 years ago
Thanks, deepak! Be sure to check out the Keras series as well 😎 kzbin.info/aero/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL
@absolute___zero
@absolute___zero 4 years ago
4:22 This is all wrong. Updates can be positive or negative; it depends on the direction of the slope. If the slope is positive, the update is going to be subtracted; if the slope is negative, the update is going to be added. So weights don't just get smaller and smaller; they are updated in small quantities, that's it, and if you wait long enough (like weeks or months) you are going to get your ANN training completed just fine. You should learn about derivatives of composite functions, and you will then understand that vanishing gradient is not a problem. It can be a problem, though, if you use the float32 data type (single precision), because it has considerable error when using a long chain of calculations. Switching to double will help with the vanishing gradient "problem" (in quotes).
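A minimal sketch of the sign point made above (the numbers are illustrative): subtracting the learning rate times the gradient decreases the weight when the slope is positive and increases it when the slope is negative.

```python
weight = 0.5
lr = 0.1

grad_positive = 0.8    # positive slope: the update lowers the weight
grad_negative = -0.8   # negative slope: subtracting a negative gradient raises the weight

print(weight - lr * grad_positive)  # ≈ 0.42
print(weight - lr * grad_negative)  # ≈ 0.58
```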
@panwong9624
@panwong9624 6 years ago
cannot wait to watch the next video that addresses the vanishing and exploding gradient problem. :)
@deeplizard
@deeplizard 6 years ago
Have you got around to it yet? This is the one - Weight initialization: kzbin.info/www/bejne/bpzVlWingLuqY7M
@RandomGuy-hi2jm
@RandomGuy-hi2jm 5 years ago
What can we do to prevent it? I think we should use the ReLU activation function.
@deeplizard
@deeplizard 5 years ago
The answer is revealed in the following video. deeplizard.com/learn/video/8krd5qKVw-Q
@Sikuq
@Sikuq 4 years ago
Excellent #28 follow-up to your playlist #23-27. Thanks.
@neoblackcyptron
@neoblackcyptron 5 years ago
Wow this explanation was really clear and to the point, subbed immediately, going to check out all the videos over time.
@laurafigueroa2852
@laurafigueroa2852 3 years ago
Thank you!
@harshitraj8409
@harshitraj8409 2 years ago
Crystal Clear Explanation.
@anirudhgangadhar6158
@anirudhgangadhar6158 1 year ago
The best explanation of exploding and vanishing gradients I have come across so far. Great job!
@fritz-c
@fritz-c 4 years ago
I spotted a couple slight typos in the article for this video. we don't perse, ↓ we don't per se, further away from it’s optimal value ↓ further away from its optimal value
@deeplizard
@deeplizard 4 years ago
Fixed, thanks Chris! :D
@진정필-k9p
@진정필-k9p 3 years ago
I finally understand gradient vanishing and exploding from your video!! Thanks :)
@zongyigong6658
@zongyigong6658 4 years ago
What if some intermediate gradients are large (> 1) and some are small (< 1)? They could balance out to still give early gradients normal sizes. The described problem seems not to stem from a defect of the theory but rather from practical numerical implementation perspectives. The heuristic explanation is a bit hand-wavy. When should we expect to have a vanishing problem and when should we expect to have an exploding problem? Is one vs. the other purely random? If they are not random but depend on the nature of the data or the NN architecture, what are they and why?
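One hypothetical way to probe this numerically (the distribution, depth, and run count below are made up, not from the video): draw per-layer factors that mix values above and below 1 and look at how their product behaves as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, runs = 50, 10000

# Per-layer chain-rule factors, a mix of values below and above 1
factors = rng.uniform(0.5, 1.5, size=(runs, depth))
products = factors.prod(axis=1)

print(np.median(products))    # well below 1: the geometric mean of the factors drives the drift
print((products > 1).mean())  # with this distribution, only a minority of runs stay "balanced"
```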
@abdulmukit4420
@abdulmukit4420 4 years ago
The background image is really unnecessary and is rather disturbing
@kareemjeiroudi1964
@kareemjeiroudi1964 5 years ago
May I know who edits your videos? Because whoever does, he/she has humor 😃
@deeplizard
@deeplizard 5 years ago
Haha thanks, kareem! There are two of us behind deeplizard. Myself, which you hear in this video, and my partner, who you'll hear in other videos like our PyTorch series. Together, we run the entire assembly line from topic creation and recording to production and editing, etc. :)
@himanshupoddar1395
@himanshupoddar1395 5 years ago
But exploding gradients can be handled by batch normalization, can't they?
@coolguy-dw5jq
@coolguy-dw5jq 6 years ago
Is there any reason behind the name of your channel?
@deeplizard
@deeplizard 6 years ago
_I could a tale unfold whose lightest word_ _Would harrow up thy soul._
@islamicinterestofficial
@islamicinterestofficial 4 years ago
Thank you so much