Weight Initialization Techniques | What not to do? | Deep Learning

40,849 views

CampusX

Comments: 80
@yashodhanmandke3843 2 years ago
I think this is the best way to teach: it discusses where things fail. Kudos to you.
@saketpathak5361 2 years ago
Morning sir... I have recently started watching your videos and the concepts are becoming more than clear to me after watching them. I have one request, though: can you please complete the clustering techniques in the ML playlist? Topics like DBSCAN and recommendation-system type problems are not yet covered. It will be very, very helpful, sir, if you make those videos.
@ShubhamSharma-gs9pt 2 years ago
Thanks sir!! Waiting for more and more uploads in this playlist! 😊😊😊😊😊
@apudas7 3 months ago
Great, sir. The way you teach is really helpful.
@elonmusk4267 4 months ago
The legendary playlist.
@pranavreddy9218 5 months ago
b11 (bias), a11 and z11: please make it clear which one is which. That is the only confusion; everything else is awesome. Great series.
@apudas7 3 months ago
Thank you sir. You're the reason why I can continue learning DL.
@SARATCHANDRASRINADHU 7 days ago
Superb explanation, sir.
@DipansuJoshi 6 months ago
Thanks, sir! 💌💌 I love spending time on your YouTube channel.
@ali75988 1 year ago
17:03 Chain rule: y is an activation function, let's say y = a(u) at the final node. The chain rule doesn't take u (the weighted-sum variable at the output) into account, but it does take both a11 and z11 (the weighted-sum variable at node 11) into account. If anyone can explain why we skipped the weighted sum at the output but kept it at the hidden-layer nodes, I would be thankful. Regards,
@sofiashahin4603 1 year ago
So for the last node the activation function gives the output y_pred, and we take that into account too, as the chain goes L > y_pred > z_final > o11 > z_previous > w¹11 (the complete chain). Because if we didn't, it wouldn't be differentiable! You can differentiate a function with respect to another only if it is a function of it. [Also, in the case of simple regression the activation terms are omitted because O would always be w1x1 + w2x2 + b.] I hope my point was clear.
@NiranjanV823 2 months ago
47:40 How will the gradient for ReLU be large if x is large? The gradient of ReLU is constant (= 1) for x > 0.
@shubharya3418 1 year ago
Sir, in this video at 40:00 you said that an NN with tanh is affected more by the vanishing gradient problem compared to an NN with sigmoid for inputs close to zero (very small inputs). But the sigmoid activation function maps its inputs to values between 0 and 1; when the inputs are close to zero, the output of the sigmoid function is close to 0.5, and its gradient is close to zero. On the other hand, the tanh activation function maps its inputs to values between -1 and 1; when the inputs are close to zero, the output of the tanh function is close to zero, and its gradient is close to 1. Therefore, a neural network with a tanh activation function is less likely to face the vanishing gradient problem compared to a neural network with a sigmoid activation function, especially for inputs close to zero. Could you please look into this? Thanks. 🙏
@thelife5628 1 year ago
I am also having the same doubt, bro... did you get any solution for it?
@thelife5628 1 year ago
And for the sigmoid one, at 0 sigmoid has its maximum derivative value, ~0.25.
@thelife5628 1 year ago
Sir, we request you to please clarify this...
@xenon1787 4 months ago
I think he is wrong in saying that tanh will give the VGP early, but of course it will give it at deeper levels, since for e.g. (0.9)^n with large n the gradient will vanish.
@rakshitbazaz6960 15 days ago
Look at the curves of both: sigmoid will have larger values near zero than tanh, so it will be less affected.
@shantanuyadav732 3 months ago
Great video sir, amazing explanation.
@pavantripathi1890 7 months ago
Thanks for the wonderful explanation!
@motivation_with_harsh 1 year ago
You are the best teacher, sir ❤❤❤❤❤❤❤❤❤❤❤
@rafibasha4145 2 years ago
@17:15, bro, how is y_hat associated directly with z11? There should be the cumulative output z21, right?
@apratimmehta1828 1 year ago
Y_hat is z21, because the last node has a linear activation. Hence a21 is the same as z21, and hence y_hat.
@thethreemusketeers4500 2 years ago
Thanks sir. Please complete the deep learning playlist; it's a request, sir.
@rockykumarverma980 27 days ago
Thank you so much sir 🙏🙏🙏
@krishcp7718 1 year ago
Hi Nitish, nicely presented video. At timestamp 16:42, in the derivative of the loss function with the chain rule, I think the middle term should be the partial derivative w.r.t. a21, and then a21 should be w.r.t. w2_11 and w2_21, not as given, because y_hat does not change directly because of a11; rather it changes because of a21. Actually there needs to be a value a21 along with bias b21, just like the values of neurons a11 and a12 in the previous layer. This is because in backprop the value of any neuron or the weights of its connections is based on the directly connected neuron's weights and bias. Thanks a lot. Krish
@bibekrauth3408 1 year ago
Bro, Y_hat = a21, so it doesn't matter. The equations are correct.
@sujithsaikalakonda4863 1 year ago
@@bibekrauth3408 You are correct, but a_21 changes with the change in weights (w2_11 and w2_21), and these weights depend on a_11 and a_21 respectively. If anything is wrong, please correct me.
@ickywitchy4667 10 months ago
@@sujithsaikalakonda4863 The last layer has no activation function... you are totally wrong.
@ShubhamSingh-iq5kj 2 years ago
Thank you for the amazing ML playlist; I am just about to complete it. 😇😇😇
@kindaeasy9797 6 months ago
9:30, how is the partial derivative equal to 0?
@thelife5628 1 year ago
14:24 Sir, you have used a linear activation function in your output layer, but in the practical at 23:55 you are using sigmoid in the output layer. I tried using linear in the output layer and the final weights are still 0, but when I used sigmoid in the output layer I got non-zero constant weights.
@apratimmehta1828 1 year ago
Sir used a linear activation in the last layer because he took a regression example in the theory, but he implemented a classification problem in the practical, hence he used sigmoid there.
@researcher7410 11 months ago
Sir, please make videos on deep learning using PyTorch...
@narendraparmar1631 10 months ago
Really helpful, thanks.
@BhaskarMishra-e6q 1 month ago
Sir, I think you forgot the third attachment. Please share the case 3 code notebook in the description.
@ankanmazumdar5000 2 years ago
Sir, just one suggestion for these kinds of long videos: kindly add chapters by splitting the video into timestamps. It would be really helpful for understanding where we are.
@tehreemqasim2204 8 months ago
Excellent video.
@suryakantacharya7933 5 months ago
Sir, when we take large random numbers and the gradients with ReLU come out large, can't we reduce the learning rate to bring them back into range?
@rakeshkumarrout2629 2 years ago
Sir, where can I find the Discord link?
@namanmodi7536 2 years ago
Sir, you had said that a video would be uploaded every 3 days, but now you are uploading a video every 3 weeks.
@debjitsarkar2651 2 years ago
Sir, how can I join your online 6-month AI class? Please reply.
@avishinde2929 2 years ago
Good morning sir ji, thank you so much sir 😊😊
@popularclips6681 3 months ago
Amazing
@sakshitattri4882 2 months ago
The derivative of tanh will be 1, not 0... The formula for the derivative is 1 - tanh^2(x), right?
@dragnar4743 1 year ago
Near the end of the video, when we took ReLU with large weights, that was exploding gradient, right?
@bhushanbowlekar4539 1 year ago
At timestamp 38:05, it should be that the derivative of tanh(0) is close to 1; similarly for sigmoid, the derivative of sigmoid(0) is approximately 0.25, not 0. Hence at timestamp 40:05, sigmoid will reach the VGP faster than tanh, because this 0.25 is less compared to derivative(tanh) = 1. And at timestamp 46:30 it should be EGP. Am I right?
@anjalimishra2884 1 year ago
The link given in the description for the code was not working... it shows errors.
@harsh.gupta2021 9 months ago
Sir, can you please provide us with a soft copy of the notes used in the video?
@rafibasha4145 2 years ago
Nitish bro, upload 2 videos per week if possible.
@alroygama6166 2 years ago
The derivative of tanh(0) is not 0 but 1, sir. Please check.
@AhmedAli-uj2js 1 year ago
Yes, you are right.
@sandipansarkar9211 2 years ago
Where is the dataset link?
@pavankumarbannuru6145 2 years ago
Sir, you are giving different x values multiplied by weights; though the weights are the same, x is different for each one, right? Then how come it will be the same? Please reply.
@Naman_Bansal102 7 months ago
Best video!
@shaz-z506 2 years ago
47:57 Is it an exploding gradient problem or a vanishing gradient?
@tanmaychakraborty7818 2 years ago
Exploding Gradient
@flakky626 1 year ago
@@tanmaychakraborty7818 I had the same doubt; glad someone else noticed it too. I was confused there.
@tanmaychakraborty7818 1 year ago
@@flakky626 No worries, hope it's clarified.
@waheedweins 5 months ago
I am having trouble finding the dataset... can anyone help?
@AbdulRahman-zp5bp 2 years ago
If we initialize the weights with big random values, will the exploding gradient problem occur?
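A minimal NumPy sketch of this question: push one input through a stack of ReLU layers with small versus large random initial weights (the layer width, depth, and scales are arbitrary, chosen only to make the effect visible):

    import numpy as np

    rng = np.random.default_rng(42)
    x = rng.normal(size=(1, 100))

    for std in (0.01, 5.0):                 # small vs. large random init
        a = x
        for _ in range(10):                 # 10 ReLU layers deep
            W = rng.normal(scale=std, size=(100, 100))
            a = np.maximum(0.0, a @ W)      # ReLU activation
        print(f"init std {std}: max |activation| = {np.abs(a).max():.3e}")

With the large initialization the activations blow up layer after layer, and the backpropagated gradients, which are scaled by the same weight matrices, blow up with them: the exploding gradient problem. With the very small initialization they collapse toward zero instead.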
@hammadkhalid7201 1 year ago
Sir, if I initialize small positive weights and biases, or big negative weights and biases, how will that lead to a vanishing gradient problem for sigmoid and tanh, when their gradients behave best if small inputs are provided? The smaller the inputs for calculating the gradients, the better the weight-update process.
@MohadisTanoli 9 months ago
A smaller gradient will lead to almost no update of the weights, and that is the vanishing gradient problem.
@teenagepanda8972 2 years ago
Thank you sir
@siddharthvij9087 1 year ago
Excellent video
@umangdobariya5680 2 years ago
Very in-depth video.
@farhadkhan3893 2 years ago
Thank you for your hard work.
@lingasodanapalli615 10 months ago
Why does the derivative of tanh(X) become zero if the activation becomes zero? It does not. The derivative of tanh(X) is sech^2(X), so the derivative of tanh(X) at X = 0 is 1, not zero. In the ReLU case it is zero, because for ReLU F(X) = max(0, X), hence the derivative of 0 is zero.
@braineaterzombie3981 6 months ago
I may be wrong, but I think you completely misunderstood tanh and confused it with the tan used in trigonometry. This is NOT the normal tan function; it just looks like tan(x) horizontally. So its derivative is not sec^2(x).
@RamanSharma-wl4ld 3 months ago
@@braineaterzombie3981 Bro, tanh(x) has derivative 1 at 0.
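For the record in this thread, the standard identity for the hyperbolic tangent, which reconciles the two forms quoted above:

\frac{d}{dx}\tanh(x) = \operatorname{sech}^{2}(x) = 1 - \tanh^{2}(x), \qquad \left.\frac{d}{dx}\tanh(x)\right|_{x=0} = 1

So the derivative equals 1 at x = 0 and only approaches 0 when |x| is large and the function saturates.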
@mr.deep. 2 years ago
Thanks
@tanmaychakraborty7818 2 years ago
You're really a legend.
@sandipansarkar9211 2 years ago
Finished watching.
@aniketsharma1943 2 years ago
A doubt when we are taking derivatives of sigmoid for zero (weights/biases): earlier we considered one neuron as the output and took the derivative w.r.t. the weights; in this video we are also taking the derivative of the activation function w.r.t. z. Does this mean the same thing?
@tanmaychakraborty7818 2 years ago
Yes, it means the same. You can try differentiating it manually: for example, for sigmoid we will have 1/(1 + e^(-z)). Now z = w11*x1 + ... and so on, then y_hat = sigmoid(value)... now to differentiate this you need the chain rule. Conclusion: your assumption is correct. Hope it helps.
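A worked version of that manual differentiation for a single sigmoid neuron, filling in the "..." with one extra weight and a bias purely for concreteness:

\hat{y} = \sigma(z), \quad z = w_{11}x_{1} + w_{12}x_{2} + b, \qquad \frac{\partial \hat{y}}{\partial w_{11}} = \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_{11}} = \sigma(z)\bigl(1 - \sigma(z)\bigr)\, x_{1}

The same pattern chains through deeper layers, which is the chain-rule expansion discussed throughout this thread.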
@herambvaidya 1 year ago
He is him 🫡
@kindaeasy9797 6 months ago
Wow
@yashjain6372 1 year ago
Best.
@CODEToGetHer-rq2nf 1 year ago
God-level teacher ❤️🤌🏻
@ajitkumarpatel2048 2 years ago
🙏🙏