This is simply amazing man. It should be included as part of the official guide.
@pradeepadmingradalpharecord4 жыл бұрын
Hey, I opened an issue on GitHub here (github.com/pytorch/tutorials/issues/1024#issue-637571028) regarding this. I'm having some issues getting it merged, so if someone could help, that would be great.
@ankurgupta28064 жыл бұрын
Absolutely
@elliotwaite4 жыл бұрын
@@pradeepadmingradalpharecord This looks like a great idea. I'll check it out on GitHub.
@israrkhan-bj3zh3 жыл бұрын
You said this is your first video on PyTorch, but it's amazing. Looking forward to more videos.
@collinmccarthy22 күн бұрын
Amazing, thank you! After going through Andrej Karpathy's micrograd this actually makes (a little) sense, so if anyone is confused I recommend checking that out and coming back to this.
@elliotwaite20 күн бұрын
@@collinmccarthy thanks for sharing that recommendation. I'm an Andrej Karpathy fan as well.
@gabrieleruscelli4450Ай бұрын
The part where he describes the magical box killed me :) good job btw!
@elliotwaiteАй бұрын
😂 Thanks, I'm glad you liked it.
@ishanmishra33206 жыл бұрын
Been searching the net for a while about autograd, but this is one of the most intuitive explanations, with an awesome walkthrough. Thanks!
@elliotwaite6 жыл бұрын
Thanks, Ishan!
@wanhonglau7794 жыл бұрын
This explanation is in-depth and original! I can't find any other resource that explains PyTorch autograd in more depth than this.
@jiawei83194 жыл бұрын
This is awesome. I've been trying to understand PyTorch's autograd by reading the bare code for a whole week, and I think I learned more in the 10 minutes of your video. Thank you and keep creating.
@elliotwaite4 жыл бұрын
Thanks for the encouragement, Jiawei!
@bijonguha22995 жыл бұрын
Superb explanation elliot. Thanks for making this video
@elliotwaite5 жыл бұрын
Thanks, Bijon Guha!
@vigneshpadmanabhan3 жыл бұрын
Glad I got your video as a recommendation... it's amazing. ... Have been using tensorflow for a while and youtube wants me to learn pytorch ...
@elliotwaite3 жыл бұрын
Haha. I used to use TensorFlow as well, but I was glad I made the switch to PyTorch. The main thing I like more about PyTorch is that the modularity feels cleaner. For example, the modules feel like Legos made of Legos, and when I want to try something new, it's easy to switch out any of those Legos for my own custom one, regardless of whether I'm working at the high level or the low level. And other parts of the framework, not just the modules, work similarly.
@Sneha_Negi2 жыл бұрын
Would have never been able to understand such a concept in just one video... thanks to you.
@elliotwaite2 жыл бұрын
I'm glad you found it helpful.
@mayankpj5 жыл бұрын
Color analogy is wonderful :) A very nicely done video Elliot !! Well done.... would love to see more of your stuff .....
@elliotwaite5 жыл бұрын
Thanks for the encouragement! I'm considering making more videos, just trying to figure out how to do so in a way that will best assist my long-term goals. What kind of content would you find most valuable, useful, or helpful?
@mayankpj5 жыл бұрын
@@elliotwaite From a presentation perspective, I feel that some of the best ones I have seen out there are from 3blue1brown. I don't mean to say that it's the only way. From a content perspective, I believe that videos explaining implementations of ML algorithms using PyTorch, like SVM and linear/logistic regression, would be good to begin with. Videos explaining the fastai library's DataBlock and data loaders (in contrast with PyTorch's stock versions) would be really helpful and would probably be more viewed too. This can then also lead to explaining more advanced model architectures and contrasting them, like various vision and NLP architectures, especially as they are in most use.
@elliotwaite5 жыл бұрын
@@mayankpj Nice. Thanks for the suggestions! I really like 3blue1brown too. I might have to just start experimenting with different video styles to see which are most rewarding.
@mayankpj5 жыл бұрын
@@elliotwaite Good Luck ..... my best wishes !
@raufbhat80964 жыл бұрын
Thanks for the amazing in-depth explanation. Please do more videos on PyTorch.
@elliotwaite4 жыл бұрын
Thanks, Rauf!
@QiuzhuangLian2 жыл бұрын
Awesome torch `Tensor` tutorial, thank you so much.
@elliotwaite2 жыл бұрын
I'm glad you liked it. Thanks for the feedback.
@jiangpengli865 ай бұрын
Thank you for this fantastic tutorial and the graph. They are really helpful.❤
@elliotwaite5 ай бұрын
Thanks, I’m glad you liked it.
@m.y.s42605 жыл бұрын
Nicest explanation for autograd on the internet! Thx!
@slavanikulin80694 жыл бұрын
Thanks, dude. Didn't understand the last example, will try again later
@elliotwaite4 жыл бұрын
Sounds good. If you let me know which part of the last example you didn't understand, I could try to elaborate on it.
@tungnguyendinh3314 жыл бұрын
Just an awesome explanation. Thank you very much
@elliotwaite4 жыл бұрын
Thanks, Tung!
@yusun57222 жыл бұрын
Great video on the internals of PyTorch. Thanks.
@forresthu6204 Жыл бұрын
Mr. Waite makes very complex concept easy to understand, big thanks!
@elliotwaite Жыл бұрын
Thanks, Forrest!
@MrDarkNightmare6663 жыл бұрын
After a week trying to understand autograd, now it's clear!! Well, "clear" may be too much, but it doesn't seem impossible.
@elliotwaite3 жыл бұрын
Glad to hear this video helped.
@jackperrin38524 жыл бұрын
Awesome explanation. Really helped me understand how PyTorch implements back prop
@elliotwaite4 жыл бұрын
Thanks, Jack!
@TalibHussain-ih1ev2 жыл бұрын
That is extremely helpful for understanding the concept of autograd and backward.
@elliotwaite2 жыл бұрын
Thanks. I'm glad you found it helpful.
@christopherross9214 Жыл бұрын
this is exactly what I needed. been pulling my hair out. thank you.
@gunjanmimo3 жыл бұрын
This video is awesome. Thank you for making this video.
@elliotwaite3 жыл бұрын
I'm glad you liked it. Thanks for the comment.
@eladjohn4913 жыл бұрын
This is the best video in the world.
@SachinKumar-js8yd5 жыл бұрын
Nice explanation, bro!! Thanks for this video.
@rong50085 жыл бұрын
Would be amazing if you make more PyTorch videos!!!!! This is simply the best explanation of PyTorch autograd.
@elliotwaite5 жыл бұрын
Sirong Huang Thanks! Not sure how soon it will be, but I’ll probably make some more eventually.
@scottvasquez98805 жыл бұрын
Great visualizations, dude!
@elliotwaite5 жыл бұрын
Thanks!
@bitdribble4 жыл бұрын
I could finally understand a bit how this works. Thank you.
@AlexeyMatushevsky2 жыл бұрын
That's magical! So much effort to explain a simple concept! As a result, I feel like I really understood it. Thanks! Diagrams in the hands of a master: priceless!
@elliotwaite2 жыл бұрын
Glad you liked it!
@王甯-h2x Жыл бұрын
Hi, Elliot Waite thank you so much for creating this, it's really helpful for me!
@elliotwaite Жыл бұрын
I'm glad you liked it.
@namanchindaliya10125 жыл бұрын
Great explanation. Keep up good work.
@elliotwaite5 жыл бұрын
naman chindaliya, thanks!
@shashank31654 жыл бұрын
The video is amazing, man. Subscribed and looking forward to other videos.
@elliotwaite4 жыл бұрын
Thanks!
@shantanuagarwal96603 жыл бұрын
Thanks a lot Elliot. Excellent work.
@zilianglin74175 жыл бұрын
This video is really helpful. It taught me not only the prerequisites for the 60 Minute Blitz but also gave me an overview of the computational/backward graph.
@KSK9864 жыл бұрын
Nice, crisp and clear explanation. Thanks for sharing this knowledge.
@jingbolin88355 жыл бұрын
From the PyTorch forums. A nice and detailed explanation for PyTorchers!!!
@ravihammond6 жыл бұрын
This is the exact explanation I needed to clear up autograd. Many thanks from Australia!
@elliotwaite6 жыл бұрын
Thanks for the feedback. Glad you found it helpful.
@RohitKumarSingh256 жыл бұрын
best explanation of autograd available on youtube (Y)
@elliotwaite6 жыл бұрын
Thanks, Rohit!
@xzl202124 жыл бұрын
Brilliant. I super like your color metaphors for leaves, dried-up leaves, and magic 😍
@reemgody54923 жыл бұрын
totally amazing explanation. Thanks
@elliotwaite3 жыл бұрын
Thanks, Reem!
@weiwang67065 жыл бұрын
That's so great and clear! Thanks Elliot!
@GoodOldYoucefCef6 жыл бұрын
Thanks, your graphs made things clear to me!
@gabrielcbenedito4 жыл бұрын
I'm so glad you decided to make this video... this explanation is just perfect! I just subscribed and hope to see more of those!
@elliotwaite4 жыл бұрын
Thanks! I hope to make some more machine learning related videos soon.
@akashupadhyay43736 жыл бұрын
Great video , please make more video on pytorch
@ОбезьянаЧичичи-в8л Жыл бұрын
My English skills are bad, but this explanation is ultra-understandable🤝
@elliotwaite Жыл бұрын
Good to know, thanks for the feedback.
@atursams64713 жыл бұрын
This is a good video. Thanks for making it.
@egecnargurpnar57324 жыл бұрын
Elliot, that was a marvelous video. You explain in such a natural way, and it was impressive how fast you implemented that function. Thank you :)
@elliotwaite4 жыл бұрын
Thanks, Cemal!
@adhoc30184 жыл бұрын
Excellent explanation
@elliotwaite4 жыл бұрын
Thanks!
@parthasarathimukherjee70206 жыл бұрын
Great video! More pytorch tutorials please!
@elliotwaite6 жыл бұрын
Partha Mukherjee, thanks! I’ll probably make some more soon.
@mohammadhassanvali19423 жыл бұрын
A really fantastic and thorough explanation of autograd. Thanks a lot!
@elliotwaite3 жыл бұрын
Glad you liked it.
@MithileshVaidya5 жыл бұрын
Awesome introduction! Thanks a lot!
@ijyotir4 жыл бұрын
Appreciate the tutorial. Please keep doing this.
@starlord73834 жыл бұрын
Very informative!
@elliotwaite4 жыл бұрын
Thanks!
@sushilkhadka80695 ай бұрын
Amazing video, thanks!!
@elliotwaite5 ай бұрын
@@sushilkhadka8069 thanks! Glad you liked it.
@andreazanetti014 жыл бұрын
thanks for this video, really clarified a lot of things!
@elliotwaite4 жыл бұрын
Glad it helped!
@EdedML2 жыл бұрын
Amazing explanation, thank you so much!
@terohannula302 жыл бұрын
Thanks! This is the kind of information I was looking for. I want to write my own tiny autograd system, so I was researching how the larger frameworks have implemented it.
@amitozazad15843 жыл бұрын
This is splendid, highly impressive.
@chienyao87995 жыл бұрын
Thank you for this wonderful explaination! Looking forward to your subsequent PyTorch videos.
@corentinlingier35494 жыл бұрын
Awesome viz! Thanks a lot
@egogo56754 жыл бұрын
Excellent :) thank you so much
@yassineouali18886 жыл бұрын
Great introduction to autograd, thanks
@pabloo.o1912 Жыл бұрын
Great explanation!
@elliotwaite Жыл бұрын
Thanks, Pablo!
@muhammadroshan73155 жыл бұрын
So much information in such a short time. I am amazed, simply startled :o
@khaledsalah92482 ай бұрын
Briliant work : )
@elliotwaite2 ай бұрын
@@khaledsalah9248 Thank you!
@JungeumKim6 жыл бұрын
Thank you sooo much!! Your first video is super helpful!!
@Flibber26 жыл бұрын
Nice video, more machine learning/deep learning related materials please!
@elliotwaite6 жыл бұрын
Coming right up. Thanks, Flibber2!
@钟辉-n8c2 жыл бұрын
it's very clear. Thanks!
@ShahabShokouhi11 ай бұрын
Dude! you are the best.
@elliotwaite11 ай бұрын
Thanks!
@samas694203 жыл бұрын
10:18 Why is no ctx variable passed during the add operation?
@elliotwaite3 жыл бұрын
To explain, let's first understand why the Mul operation and its corresponding MulBackward operation need to use the ctx variable. It's because the MulBackward node needs to know what the original input values were to the Mul operation to figure out how to update the gradients that get passed through the MulBackward node in the backward pass. And this is what the ctx variable is used for: to store data from the forward pass to be used later in the backward pass.

Now, if we think about the Add operation and its corresponding AddBackward operation, we can ask, "What data from the forward pass does the AddBackward node need in the backward pass?" And the answer is... none. The AddBackward node just passes along whatever gradient gets passed to it to the next nodes in the backward graph, without changing its value (at most it will only need to broadcast the gradient to a different shape for the next nodes in the backward graph). So since no data from the forward pass is needed in the backward pass, the ctx variable is not used.

Hope this helps. Let me know if you still have any questions about this. And thanks for asking, this will probably also help others who are wondering the same thing.
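The distinction described above can be sketched with two custom autograd Functions: a hypothetical MyMul that saves its inputs via ctx, and a hypothetical MyAdd that saves nothing (both class names are made up for illustration; PyTorch's built-in operators work analogously but are implemented in C++):

```python
import torch

class MyMul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        # Mul needs its original inputs later to compute gradients, so save them.
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        return grad_out * b, grad_out * a

class MyAdd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        # Nothing is saved in ctx: the backward pass doesn't need the inputs.
        return a + b

    @staticmethod
    def backward(ctx, grad_out):
        # The gradient just passes straight through to both inputs.
        return grad_out, grad_out

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = MyAdd.apply(MyMul.apply(a, b), b)  # c = a * b + b
c.backward()
print(a.grad)  # tensor(3.)  (dc/da = b)
print(b.grad)  # tensor(3.)  (dc/db = a + 1)
```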
@bobobopan23544 жыл бұрын
excellent video!
@kalekalekale5 жыл бұрын
Beautiful video and great for a beginner level understanding. The explanation of your color choice made me giggle. One conceptual question I'm struggling with: When you wish to not update (freeze) parts of the network, the recommended solution is to set requires_grad to False. I would like to clarify that all this does is avoid unnecessary computation and storage of gradients at those nodes. However, the node will still contain a grad_fn (if it has one) so that in the backward pass, the gradient from that node is still technically passed backward and the chain rule is still maintained? Another solution that was recommended is to not send the parameters you wish to freeze to the optimizer function. However, some recommend to set requires_grad to False as well to save memory storage.
@elliotwaite5 жыл бұрын
Thanks. Great question. I should have shown in the video what happens when you freeze nodes, but I'll try to explain here. You can actually only set requires_grad = False on leaf nodes, and leaf nodes don't have grad_fn values; it's the intermediate branch nodes (non-leaf nodes) that have the grad_fn values. So for example:

x = torch.tensor(1.0)
weight_1 = torch.tensor(2.0, requires_grad=True)
weight_2 = torch.tensor(3.0, requires_grad=True)
branch_node_1 = x * weight_1
branch_node_2 = branch_node_1 * weight_2

Here branch_node_2's grad_fn will be a MulBackward object that passes the gradient along to an AccumulateGrad object for weight_2 and also passes the gradient along to branch_node_1's MulBackward object, which then passes the gradient to the AccumulateGrad object for weight_1.

If we freeze weight_2:

x = torch.tensor(1.0)
weight_1 = torch.tensor(2.0, requires_grad=True)
weight_2 = torch.tensor(3.0, requires_grad=True)
weight_2.requires_grad = False
branch_node_1 = x * weight_1
branch_node_2 = branch_node_1 * weight_2

This will be the same as above, except branch_node_2's MulBackward won't pass the gradient along to the AccumulateGrad for weight_2; it will only pass the gradient along to branch_node_1's MulBackward, which then passes the gradient along to the AccumulateGrad for weight_1.

However, if we freeze weight_1:

x = torch.tensor(1.0)
weight_1 = torch.tensor(2.0, requires_grad=True)
weight_2 = torch.tensor(3.0, requires_grad=True)
weight_1.requires_grad = False
branch_node_1 = x * weight_1
branch_node_2 = branch_node_1 * weight_2

Then branch_node_1 will not have a grad_fn at all because none of the nodes going into it require a gradient. In fact, branch_node_1 will become a leaf node. And branch_node_2's grad_fn will be a MulBackward that only passes the gradient along to the AccumulateGrad for weight_2.

And if we try to freeze one of the branch nodes that has a grad_fn:

x = torch.tensor(1.0)
weight_1 = torch.tensor(2.0, requires_grad=True)
weight_2 = torch.tensor(3.0, requires_grad=True)
branch_node_1 = x * weight_1
branch_node_1.requires_grad = False
branch_node_2 = branch_node_1 * weight_2

We get the following error: "RuntimeError: you can only change requires_grad flags of leaf variables. ..."

So using the tree analogy again, you can only freeze green leaves. And freezing a green leaf is like changing it to a yellow leaf, and any brown branches that lead to only yellow leaves will actually also become yellow leaves. And the backward graph that gets created will only be for the brown branches that lead to green leaves. So when you freeze part of the graph, the intermediate nodes (branches) will update their grad_fn objects to not accumulate gradients for those frozen nodes (which are now yellow leaves), and if all of the inputs to a branch node don't require gradients (are all yellow leaves), that branch node won't have a grad_fn at all, and will become a leaf node that doesn't require a gradient (a yellow leaf), and it won't be involved in the backward graph.

A new backward graph is created with each forward pass, so when you change the requires_grad values of any of the nodes involved in the forward graph, it will also change the structure of the backward graph that gets created. This change may just be that some of the AccumulateGrad nodes are not created, but in some cases, larger parts of the backward graph may be omitted if they aren't needed, such as when you freeze the early nodes in a graph. So using the requires_grad = False method to freeze nodes is better than just not passing those nodes to the optimizer, because setting requires_grad = False will be like pruning the backward graph, reducing both memory usage and computation time when the backward gradients are computed.

I hope this helps clarify what happens when you freeze nodes. If there is any part that you'd like me to clarify further, let me know.
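A short script can confirm the freezing behavior described above (freezing weight_1 turns branch_node_1 into a leaf with no grad_fn, and no gradient is accumulated for the frozen weight):

```python
import torch

x = torch.tensor(1.0)
weight_1 = torch.tensor(2.0, requires_grad=True)
weight_2 = torch.tensor(3.0, requires_grad=True)
weight_1.requires_grad = False  # freeze weight_1

branch_node_1 = x * weight_1
branch_node_2 = branch_node_1 * weight_2

print(branch_node_1.is_leaf)  # True: none of its inputs require a gradient
print(branch_node_1.grad_fn)  # None
branch_node_2.backward()
print(weight_2.grad)          # tensor(2.)
print(weight_1.grad)          # None: frozen, so no gradient was accumulated
```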
@kalekalekale5 жыл бұрын
@@elliotwaite Thank you very, very much for the detailed and thoughtful reply, it was extremely helpful. I think the mental block I had was that I was failing to separate the weights (parameters) from the computation nodes, and I thought that setting requires_grad = False in a network.parameter() loop was occurring on the weights AND computation nodes. But as you demonstrated, that is invalid for branches like Mul, and I was considering the computations as parameters. With the tree/leaf/branch analogy (new-ish to Comp. Sci.), it all seems so simple and easy to visualize now. Do I need to set requires_grad = False in every loop? I've read that you do not and can do it once before training (and should remember to set to True after), but your comment of "A new backward graph is created with each forward pass" leads me to think otherwise.
@elliotwaite5 жыл бұрын
@@kalekalekale You only need to set requires_grad = False once to freeze the nodes. The backward graph that gets recreated each time is the collection of blue nodes, the MulBackward and AccumulateGrad nodes. The frozen leaf nodes are not recreated and will retain their attribute values.
@kalekalekale5 жыл бұрын
@@elliotwaite Thanks again! You have my vote for more PyTorch content! Cheers.
@Aditya_Kumar_12_pass4 жыл бұрын
thank you. this was the best
@gwennmalivet3073 Жыл бұрын
Hi, this is a good video. I just have a question about something. When we pause the video at 11:42, in the DivBackward case I didn't understand how we get -72. But I understood how we got the 9 with 18/2. Maybe I missed a detail. Sincerely
@elliotwaite Жыл бұрын
That one is a bit tricky. It comes from i = g / h, where g is 16 and h is 2, and we are trying to find the derivative of i with respect to h. To use more familiar notation, we can think of it as y = 16 / x, where x is 2, and we are trying to find the derivative of y with respect to x, which is dy/dx = -16 / x^2, and when we replace x with 2, we get -16 / 2^2 → -16 / 4 → -4. And then the -72 is just 18 * -4.
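This calculation can be checked directly with autograd by feeding 18 in as the incoming gradient:

```python
import torch

g = torch.tensor(16.0, requires_grad=True)
h = torch.tensor(2.0, requires_grad=True)
i = g / h
i.backward(gradient=torch.tensor(18.0))  # the incoming gradient of 18
print(g.grad)  # tensor(9.)    18 * (1 / 2)
print(h.grad)  # tensor(-72.)  18 * (-16 / 2**2)
```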
@yumnafatma91982 жыл бұрын
I have some difficulty understanding how DivBackward works, an explanation would be very helpful. Thanks for the nice video.
@elliotwaite2 жыл бұрын
To explain, I'll use the example in the video (11:18) of: i = g / h (where g = 16, and h = 2). To understand how DivBackward works, we need to know how to backpropagate gradients through a division. This means figuring out how changing each of the inputs will change the output. To make this equation look more familiar, I'll replace some of the values so that the output is represented by "y" and the value we are trying to find the derivative of is represented by "x".

So, to figure out the gradient for "g", we replace "g" with "x", we replace the output "i" with "y", and then we replace any other variables with their current value, meaning we replace "h" with 2, and it becomes:

y = x / 2

Now we just find the derivative of this equation (dy/dx) using the rules of calculus:

dy/dx = 1 / 2

So the gradient that gets passed back for "g" is half the input gradient to DivBackward.

Then we can do the same for "h", replacing "h" with "x", replacing the output "i" with "y", and the other variable "g" with its current value of 16, and we get:

y = 16 / x

And then we find the derivative of this, which is:

dy/dx = -16 / (x^2)

And now that we've done the derivative part, we can replace "x" with its current value, and since "x" represents "h", we replace it with the current value of "h", which is 2. So it becomes:

dy/dx = -16 / (2^2) = -16 / 4 = -4

So the gradient that gets passed back for "h" is -4 times the input gradient to DivBackward.

Now we can compare these values to what can be seen in the video at 11:18, and we can see that these are the values that DivBackward uses. The input gradient to DivBackward is 18, and the gradient passed back for "g" is 18 * 0.5 (9), and for "h" it's 18 * -4 (-72).

So how DivBackward works is that it saves the values of the numerator and denominator passed into the division operation and then uses them during the backward pass in the following way:

gradient_of_numerator = input_gradient * 1 / denominator
gradient_of_denominator = input_gradient * -numerator / (denominator ** 2)

These equations represent the same steps we followed above. I hope this helps. Sorry if it was confusing. Feel free to let me know if there is anything you want me to clarify.
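As a sanity check, those two equations can be wrapped in a hypothetical div_backward helper (the function name is made up for illustration) and compared against what autograd computes:

```python
import torch

def div_backward(grad_out, numerator, denominator):
    # The two equations above: gradients for the numerator and the denominator.
    grad_numerator = grad_out * 1 / denominator
    grad_denominator = grad_out * -numerator / (denominator ** 2)
    return grad_numerator, grad_denominator

g = torch.tensor(16.0, requires_grad=True)
h = torch.tensor(2.0, requires_grad=True)
(g / h).backward(gradient=torch.tensor(18.0))

grad_g, grad_h = div_backward(torch.tensor(18.0), torch.tensor(16.0), torch.tensor(2.0))
print(grad_g, g.grad)  # tensor(9.) tensor(9.)
print(grad_h, h.grad)  # tensor(-72.) tensor(-72.)
```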
@yumnafatma91982 жыл бұрын
@@elliotwaite Thanks for writing detailed explanation. It is clear to me now.
@elliotwaite2 жыл бұрын
@@yumnafatma9198 glad I could help.
@ridfieldcris40645 жыл бұрын
Very intuitive and straightforward explanation. I am looking forward to further PyTorch videos. Will any updates pop up in the future?
@elliotwaite5 жыл бұрын
Ridfield Cris, thanks! I’m not sure when I’ll make another PyTorch video, but if I do I’ll post it here on this channel, so if you’re subscribed, the video should show up in your subscriptions (if that’s what you were asking about).
@ridfieldcris40645 жыл бұрын
@@elliotwaite Thank you for replying, I will definitely subscribe.
@pierreeugenevalassakis88976 жыл бұрын
That's a great walk-through of autograd, thanks!! Something that would be nice for the next video would be any tricks you have to make sure the graph is set up correctly and the gradients propagate where you want them to, etc. For instance, how to use Crayon, which I personally haven't really gotten into yet, but I hear is good.
@elliotwaite6 жыл бұрын
Thanks for the suggestion, that's a good idea. I haven't tried logging my PyTorch training in TensorBoard yet, but perhaps I'll make a video about how to do it once I learn more about it. Also, here's the link to the draw.io diagrams I made for this video if you're still interested, regarding your earlier comment: drive.google.com/file/d/1bq3akhmA5DGRCiFYJfNPSn7il2wvCkEY/view?usp=sharing
@pierreeugenevalassakis88976 жыл бұрын
Nice, thanks! I realised it was something like draw.io upon re-watching. In the beginning I thought it was some sort of interactive front end to a backend PyTorch model!
@홍성의-i2y Жыл бұрын
tensor.detach() is used when we do not need to keep the gradient in that specific tensor. There are also other alternatives, such as .item(), .tolist(), etc. A typical example is the value generated by the target network within the compute_loss part in the RL literature.
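A minimal sketch of that difference (the tensor names here are made up for illustration):

```python
import torch

q_values = torch.tensor([1.5, 2.5], requires_grad=True)

target = (q_values * 2.0).detach()  # cut from the graph: no gradient flows back
scalar = q_values.sum().item()      # a plain Python float, also detached

print(target.requires_grad)  # False
print(type(scalar))          # <class 'float'>
```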
@elliotwaite Жыл бұрын
Thanks for sharing this info.
@DizzyDadProductions5 жыл бұрын
Subbed! I hope to see more content soon!
@elliotwaite5 жыл бұрын
DizzyDad Reviews, thanks!
@lucanina82212 жыл бұрын
This video saved me daysssss
@elliotwaite2 жыл бұрын
I'm glad it helped.
@randalllionelkharkrang40473 ай бұрын
That in-place operator change of c += 1 around 6:11 is not clear to me. You mentioned the tensor values themselves don't matter, but in reverse-mode (or forward-mode) differentiation they actually do, don't they? To actually get the value of the derivative at that point? Maybe I'm just hella confused lol
@elliotwaite3 ай бұрын
The addition operation is special in that it just passes along the gradient during backprop, so we don't actually have to know the values of c or d to backpropagate through them. This is because the gradient is asking "if I change one of the inputs by a small amount, how much will that change the output?" And if we change c by 0.001, for example, it will also change the output of c + d by 0.001, no matter what the values of c or d are. So their values can be ignored when backpropagating through the addition operation. This is why we can perform an in place operation on c (c += 1) and, as long as that is done after the e = c + d, it won't affect the backpropagation. To give a more concrete example, we can replace "e = c + d" with "total_loss = loss_1 + loss_2". What the above is saying is that instead of calling both loss_1.backward() and loss_2.backward(), we could instead just call total_loss.backward(), and we would get the same gradient values regardless of the values of loss_1 and loss_2, because the gradient just gets passed through the addition operation in the backward pass. Let me know if there is still any confusion.
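The example from the video can be run directly to confirm that the in-place c += 1, placed after e = c + d, doesn't change the computed gradients:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
d = torch.tensor(4.0, requires_grad=True)
e = c + d
c += 1  # in-place, but after e = c + d, so backprop is unaffected
e.backward()
print(a.grad)  # tensor(3.)  (AddBackward never saved c's value)
print(b.grad)  # tensor(2.)
print(d.grad)  # tensor(1.)
```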
@ghaliahmed4 жыл бұрын
SUPER GOOD VIDEO!!!!!!!!!!!!!!!!
@alalaben5 жыл бұрын
This video is great. Do you have plans to make more like this?
@elliotwaite5 жыл бұрын
yang yuan, thanks! Not sure yet what future videos I’ll make.
@snehotoshbanerjee19384 жыл бұрын
Excellent!
@BOURNE3995 жыл бұрын
One word: AWESOME!!!
@AndPacheco345 жыл бұрын
Thank you! It was very useful!
@interlingua26125 жыл бұрын
Wow, this is exactly what I needed! Thank you so much!
@alexanderlewzey11024 жыл бұрын
v good explanation
@elliotwaite4 жыл бұрын
Thanks, Alexander!
@abhinavyel1664 жыл бұрын
Can I say that requires_grad=True leads to gradient values being accumulated in leaf nodes, and that these gradient values are used for training? And if that is the case, then how are the weights for intermediate nodes changed, in reference to a neural network?
@elliotwaite4 жыл бұрын
Although the weights can seem like intermediate nodes, they are actually also leaf nodes. For example, when using them for a matrix multiply in a linear layer, you multiply the weights (a leaf node of parameters) with the previous layer's output values (an intermediate node), and the result of this multiplication is another intermediate node. So anytime you use parameter values (weights, biases, etc.), these will be leaf nodes (unless you're doing something uncommon where you perform a calculation that requires a gradient to generate your weights). And then, as long as your model is in training mode, the weight parameters will accumulate a gradient when you call loss.backward(), and then that accumulated gradient will be used to update the parameter values when you call optimizer.step(). Let me know if there is anything that still doesn't make sense.
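A quick check of this, using a small linear layer:

```python
import torch

lin = torch.nn.Linear(3, 2)
print(lin.weight.is_leaf)  # True: parameters are leaf nodes
print(lin.bias.is_leaf)    # True

out = lin(torch.randn(1, 3))
print(out.is_leaf)              # False: an intermediate (branch) node
print(out.grad_fn is not None)  # True
```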
@abhinavyel1664 жыл бұрын
@@elliotwaite So is requires_grad set to True by default for any given layer that we add to the model, and does training of a layer only happen when requires_grad is True?
@elliotwaite4 жыл бұрын
@@abhinavyel166, yeah, layers are modules, and the weights in modules are usually parameters, and for parameters, the default value of require_grad is True: pytorch.org/docs/stable/nn.html#torch.nn.Parameter
@abhinavyel1664 жыл бұрын
@@elliotwaite thanks for the great video and then this explanation.
@jaideepkukkadapu26002 жыл бұрын
How do we know which variables should be stored in the ctx variable? My explanation: we calculate the VJP during the forward pass, so we know which variables should be stored. Please tell me the correct explanation.
@elliotwaite2 жыл бұрын
I'm not sure I understand your question, so correct me if you were looking for a different kind of answer. You want to store in the ctx variable whatever values you will need to be able to compute the gradients of the inputs given the gradients of the outputs. In the typical case, you'll want to store the Jacobian of the forward operation, and then during the backward pass, you would multiply that Jacobian and the output gradients together to get the input gradients (making sure to use the correct left or right multiply, and the correct transposed or non-transposed version of the Jacobian, depending on how your inputs and outputs are represented). However, some operations don't need to store the full Jacobian. For example, addition and subtraction don't need to store any values, since the Jacobian for those operations would just be the identity, so instead, in the backward pass, the output gradients are returned directly as the input gradients without any intermediate multiplication. Also, if your input and output are non-typical tensors, for example, with the unbind operation, which outputs multiple tensors, I'm not sure if the stored values would still be called the Jacobian, but they would essentially have the same function. At least that's how I think about it, but please correct me if I'm wrong about anything.
@jaideepkukkadapu26002 жыл бұрын
@@elliotwaite Thanks for the explanation. To phrase my question again: how do we know which variables should be stored in the context (ctx) variable? (Do we store every value, or do we do some calculations in the forward pass so that we store only the required variables that are used in the backward pass?)
@elliotwaite2 жыл бұрын
@@jaideepkukkadapu2600 Only store the data you need in the backward pass and allow the rest to get garbage collected.
@BOURNE3995 жыл бұрын
Also, the ending BGM is a good pick!
@samyamr5 жыл бұрын
Thank you for explaining the AutoGrad with such clarity. I am also wondering how you generated such beautiful graphs. Could you share the tool you used?
@elliotwaite5 жыл бұрын
Thanks! The graphs were made with www.draw.io.
@samyamr5 жыл бұрын
@@elliotwaite I am also wondering, is it possible to modify the backward graph after its construction? For instance, would it be possible to modify it in a way to just compute the gradients of the activations in a neural network, while computing the gradients of the parameters at a later phase.
@elliotwaite5 жыл бұрын
@@samyamr I'm not aware of a way to modify the backward graph after it's constructed, but you could split up the graph during the forward pass using the detach method. Something like:

# Keep a copy of the original activations around that is attached to the first graph.
activations_separate_graph = activations

# Create a copy of the activations that is detached from the graph.
activations = activations.detach()

# Set `requires_grad` to True to start a new graph from these activations onward.
activations.requires_grad = True

...

# Then you could get the gradients of the activations in the first backward call.
loss.backward()

# And then the gradients of the variables in a second backward call, passing along the gradients from the first.
activations_separate_graph.backward(activations.grad)

That's the idea at least, but I haven't tested this code, so there might be errors in it.
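A concrete toy version of that idea, using a scalar in place of real activations (the specific values are made up for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
activations = x * 3.0  # first graph: x -> activations

activations_separate_graph = activations  # handle still attached to the first graph
activations = activations.detach()        # cut the graph here
activations.requires_grad = True          # start a second graph from this point

loss = activations * 4.0
loss.backward()          # backward through the second graph only
print(activations.grad)  # tensor(4.)

activations_separate_graph.backward(activations.grad)  # continue into the first graph
print(x.grad)            # tensor(12.)
```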
@kuzlovsky123 жыл бұрын
Elliot, I wanted to find out how autograd works by myself, but you have already done the hard work. Great video! Could you please let us know how you dug into the details? Did you use hooks or go to the C++ code? In other words, if you hadn't made this video, how could we have figured all this out ourselves?
@elliotwaite3 жыл бұрын
Thanks. Yeah I mostly read through the Python and C++ source code. I'm not as familiar with C++, so I had to do a bit of deciphering. I also used PyCharm's builtin debugger to test things out, and read through some related discussions on the PyTorch forums.
@kuzlovsky123 жыл бұрын
@@elliotwaite, I am following in your footsteps. I managed to get PyTorch built in debug mode and can step through the C++ autograd code with CLion. If you want, I can help you set it up.
@elliotwaite3 жыл бұрын
@@kuzlovsky12, oh nice, that's a good idea. I might try that in the future if I do any more deep dives in the PyTorch code or other C++ libraries. Hopefully I'll be able to figure it out if I do, but I might try reaching out to you if I get stuck. Thanks for the tip.
@kuzlovsky123 жыл бұрын
@@elliotwaite OK, sent you some pointers to follow for the setup. I think it would be valuable content for a video; there are no resources on how to debug the PyTorch C++ code with an IDE.
@elliotwaite3 жыл бұрын
@@kuzlovsky12, true. I haven't been interested in making more educational videos lately, but I've added the idea to my list of potential future video ideas, and maybe I'll get to it one day. Thanks for the suggestion.
@robinranabhat31252 жыл бұрын
Hi Elliot. Thanks for this!! Could you explain how these computational graph diagrams would extend to higher-order derivatives? I am hopelessly trying to figure it out and cannot find the answer.
@elliotwaite2 жыл бұрын
The first time through the forward graph you get a backward graph that can be used to compute the first-order derivatives. You can then traverse that backward graph to create a second backward graph that can be used to compute the second-order derivatives. And so on, again and again, for higher-order derivatives. For example, let's say our forward graph is the same as in the example:

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
d = torch.tensor(4.0, requires_grad=True)
e = c * d

This will create a backward graph, which we can then traverse to calculate the first-order derivatives with respect to `e` by calling:

[a_grad_e, b_grad_e, d_grad_e] = torch.autograd.grad(e, [a, b, d], create_graph=True)

Calling that will be similar to running this code (where `a_grad_e` means the gradient of `e` with respect to `a`):

# Same as above, except we don't need to calculate `e`, so that line is
# commented out.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
d = torch.tensor(4.0, requires_grad=True)
# e = c * d

# Plus this part.
e_grad_e = torch.tensor(1.0)
d_grad_e = e_grad_e * c  # Which equals: 1 * (a * b) = 1 * (2 * 3) = 6
c_grad_e = e_grad_e * d  # Which equals: 1 * d = 1 * 4 = 4
b_grad_e = c_grad_e * a  # Which equals: 4 * a = 4 * 2 = 8
a_grad_e = c_grad_e * b  # Which equals: 4 * b = 4 * 3 = 12

You can see that the extra part is generated by just starting at `e` in the original graph and working backward, applying the derivative rules.
And if we ran that code it would create another, bigger backward graph, which we could then traverse to calculate the second-order derivatives with respect to `a_grad_e` by calling:

[
    a__grad__a_grad_e,
    b__grad__a_grad_e,
    d__grad__a_grad_e,
] = torch.autograd.grad(a_grad_e, [a, b, d], allow_unused=True)

Calling that will be similar to running this code (here `_grad_` is used for first derivatives and `__grad__` for second derivatives):

# Same as above, except I commented out the lines we don't need.
# a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
# c = a * b
d = torch.tensor(4.0, requires_grad=True)
# e = c * d
e_grad_e = torch.tensor(1.0)
# d_grad_e = e_grad_e * c
c_grad_e = e_grad_e * d
# b_grad_e = c_grad_e * a
# a_grad_e = c_grad_e * b

# Plus this part. Again we start with `a_grad_e` and work backward.
a_grad_e__grad__a_grad_e = torch.tensor(1.0)
b__grad__a_grad_e = a_grad_e__grad__a_grad_e * c_grad_e  # 1 * (e_grad_e * d) = 1 * (1 * 4) = 4
c_grad_e__grad__a_grad_e = a_grad_e__grad__a_grad_e * b  # 1 * b = 1 * 3 = 3
d__grad__a_grad_e = c_grad_e__grad__a_grad_e * e_grad_e  # 3 * 1 = 3

So the second-order gradients we end up with are:

a__grad__a_grad_e = None (it wasn't reached in this backward graph)
b__grad__a_grad_e = 4
d__grad__a_grad_e = 3

You can confirm those values by running this example:

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
d = torch.tensor(4.0, requires_grad=True)
e = c * d

[a_grad_e] = torch.autograd.grad(e, [a], create_graph=True)
[
    a__grad__a_grad_e,
    b__grad__a_grad_e,
    d__grad__a_grad_e,
] = torch.autograd.grad(a_grad_e, [a, b, d], allow_unused=True)

print(a__grad__a_grad_e)
print(b__grad__a_grad_e)
print(d__grad__a_grad_e)

Hope that helps.
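Those numbers can also be sanity-checked by hand with the chain rule, no PyTorch needed. A quick hand-worked version for e = (a * b) * d with a = 2, b = 3, d = 4:

```python
# Hand-computed check of the values above for e = (a * b) * d.
a, b, d = 2.0, 3.0, 4.0

# First order: de/da = b * d (the chain rule through c = a * b).
a_grad_e = b * d  # 3 * 4 = 12, matching autograd

# a_grad_e = b * d is itself a function of b and d, so the second-order
# derivatives of e taken through a_grad_e are just its partials:
b__grad__a_grad_e = d  # d(b * d)/db = 4
d__grad__a_grad_e = b  # d(b * d)/dd = 3
# d(b * d)/da = 0, which is why autograd returns None for a
# (with allow_unused=True).
```

The second backward graph simply doesn't contain a path back to `a`, which is the graph-level picture of that zero partial derivative.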
Enjoyed it a lot. Thanks. May I know which tool you used to create those figures? Dia?
@elliotwaite5 жыл бұрын
Abhishek Singh Sambyal, Thanks! I used this: www.draw.io
@rahulseetharaman45253 жыл бұрын
Thanks for the explanation. I have a few doubts. 1. In the video you used expressions which have only two operands. What does the next_functions list look like when there are more complicated expressions, say d = a*(b^2), with requires_grad set to True for a and b? 2. In the context of activation functions like relu or mod (which is not differentiable at x = 0), how does autograd take care of such functions? And if we define our own activation functions, how does autograd work in that case? Particularly, if we have some function that is not differentiable but we still use it in PyTorch and try to compute backward on it.
@elliotwaite3 жыл бұрын
1. Complicated expressions are broken down into operations that can be applied one at a time in operator precedence order. So `d = a * (b ** 2)` is equivalent to `temp = b ** 2; d = a * temp`. 2. The gradient of relu at 0 is 0. The gradient of fmod at 0 is 1. I figured these out just by testing them. I'm not sure if there is a general rule, but how the backward gradients are handled should be defined somewhere in the code, though sometimes it's hard to find this code because it's often written somewhere in the C++ implementations of these operators. If you create a custom activation function, it will just use the combined gradient of all the operations used, or you can define a custom gradient function by defining a backward() method, as explained here: pytorch.org/docs/stable/notes/extending.html Hope this helps.
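As a concrete illustration of that decomposition (a hand-rolled sketch of the chain rule, not PyTorch's actual implementation), here is the two-step backward pass autograd would effectively build for d = a * (b ** 2):

```python
# Forward pass for d = a * (b ** 2), one operation at a time,
# the way autograd decomposes it.
a, b = 2.0, 3.0
temp = b ** 2  # would get a PowBackward node; temp = 9
d = a * temp   # would get a MulBackward node; d = 18

# Backward pass. The multiply node sends gradients to both inputs...
d_grad = 1.0
a_grad = d_grad * temp  # dd/da = b**2 = 9
temp_grad = d_grad * a  # dd/dtemp = a = 2

# ...and the power node then handles the b ** 2 step.
b_grad = temp_grad * 2 * b  # dd/db = 2ab = 12
```

So the two-operand building blocks compose: `next_functions` on the multiply's backward node would just point at the power's backward node and at the accumulator for `a`.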
@사이보그-i6p2 жыл бұрын
Great Video
@elliotwaite2 жыл бұрын
Thanks!
@alexanderlewzey11024 жыл бұрын
I'm still a little confused by the difference between requires_grad and is_leaf. Why do we need both? It sounds like they are similar, so there is obviously a difference I am failing to grasp. In particular, how do they both relate to the formal representation of a network, i.e. are the leaves the weights and biases (hence we need to calculate their grad so we can update them)? Are the nodes with requires_grad = True and is_leaf = False the activation function operations? Any help with this would be much appreciated.
@elliotwaite4 жыл бұрын
Yeah, it's kind of confusing. Basically, there are three types of tensors (I think of them in terms of colors and as leaves or branches). They are:

Yellow leaf (is_leaf = True, requires_grad = False)
Green leaf (is_leaf = True, requires_grad = True)
Brown branch (is_leaf = False, requires_grad = True)

The input data for the model would be a yellow leaf. Any constant values would also be yellow leaves. For any operation where the only input tensors are yellow leaves, the output tensor will also be a yellow leaf. For example, if you wanted to normalize your input values by subtracting the mean of the dataset (a constant value, so a yellow leaf) and dividing by the standard deviation of the dataset (also a constant value, so a yellow leaf), the output of that operation, the normalized values, would also be a yellow leaf.

Any parameters that you want to be able to update through training, such as weights and biases, will be green leaves. And any operation that has an input tensor that is a green leaf will output a brown branch. For example, if you multiply your input values (a yellow leaf) with a matrix of weight values (a green leaf), the output will be a brown branch.

All branches are intermediate values or output values in the graph. They aren't input data, constants, or trainable variables. Any operation that has an input tensor that is a brown branch will output a brown branch. So if you take the brown branch from the last step (the one you got from multiplying the inputs by the weights) and add the biases (a green leaf) to it, you get another intermediate tensor (a brown branch). And then if you pass that through a ReLU function (an operation that only takes in one tensor), you get another intermediate tensor (another brown branch), and on and on until you get to the output values or the loss value.
Once you get to the loss tensor (a brown branch), you can call the backward() method on it, and it will backpropagate the gradients through all the brown branches until it has reached all the green leaves, and then it will update the grad values of all those green leaves. Then when you call optimizer.step(), it updates the values of those green leaves based on their grad values.

Here are the possible inputs and outputs (on the left side of the arrow are the input tensors, and on the right of the arrow is the output tensor):

yellow → yellow
green → brown
brown → brown
yellow + yellow → yellow
yellow + green → brown
yellow + brown → brown
green + green → brown
green + brown → brown
brown + brown → brown
yellow + green + brown → brown

If all the inputs are yellow, the output will be yellow; otherwise, the output will be brown. I hope this has helped clarify.
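That propagation rule is simple enough to sketch directly (the color names here are just this explanation's labels, not anything in PyTorch):

```python
# Toy model of the three tensor kinds and the propagation rule:
# if every input is a yellow leaf, the output is a yellow leaf;
# otherwise the output is a brown branch.
YELLOW = "yellow leaf"   # is_leaf=True,  requires_grad=False
GREEN = "green leaf"     # is_leaf=True,  requires_grad=True
BROWN = "brown branch"   # is_leaf=False, requires_grad=True

def op(*input_colors):
    """Color of the tensor produced by an op on the given inputs."""
    return YELLOW if all(c == YELLOW for c in input_colors) else BROWN

inputs = YELLOW                 # input data
weights = GREEN                 # trainable parameter
norm = op(YELLOW, YELLOW)       # constants only: still yellow
hidden = op(inputs, weights)    # brown branch
out = op(hidden)                # unary op (relu-like): still brown
```

This matches the arrow table above: leaves are only ever created directly (data, constants, parameters), and any op touched by a green leaf or brown branch produces a brown branch.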