Deriving Matrix Equations for Backpropagation on a Linear Layer

  5,718 views

Alex-AI

1 day ago

Doing the index tracking to figure out the matrix form of backpropagation is one of the more tedious aspects of working with neural networks, but it's still quite useful to go through in detail every now and then. I can't claim you'll find this video entertaining or particularly interesting, but I hope some of you will find it useful.
Note that at 1:53 I made a mistake. It should be that b ∈ R^N. The batch dimension B was already accounted for when I wrote the bias matrix as repeated rows of b.
Sections:
0:00 - Setting up notation
6:50 - ∂L / ∂W
20:10 - ∂L / ∂b
23:30 - ∂L / ∂x
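As a sanity check on the three gradients listed in the sections above, here is a minimal NumPy sketch (variable names are my own, not from the video) of the forward pass y = xW + b and the three matrix-form derivatives, with a finite-difference spot check on one entry of W:

```python
import numpy as np

# Shapes follow the video's setup: x in R^(B x D), W in R^(D x N), b in R^N,
# forward pass y = x @ W + b (b broadcast across the batch dimension B).
B, D, N = 4, 3, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((B, D))
W = rng.standard_normal((D, N))
b = rng.standard_normal(N)

y = x @ W + b                        # forward pass: shape (B, N)
dL_dy = rng.standard_normal((B, N))  # upstream gradient, left general as in the video

# The matrix forms of the three derivatives:
dL_dW = x.T @ dL_dy        # (D, N), same shape as W
dL_db = dL_dy.sum(axis=0)  # (N,),  sums out the batch dimension
dL_dx = dL_dy @ W.T        # (B, D), same shape as x

# Finite-difference spot check: perturb W[0, 0] and compare the change in
# the (locally linearized) loss sum(y * dL_dy) against dL_dW[0, 0].
eps = 1e-6
W2 = W.copy()
W2[0, 0] += eps
num = ((x @ W2 + b) * dL_dy).sum() - (y * dL_dy).sum()
assert abs(num / eps - dL_dW[0, 0]) < 1e-4
```

The key shape intuition: each gradient has the same shape as the thing it is a gradient with respect to, which is often enough to reconstruct where the transposes go.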

Comments: 8
@alex-ai7517
@alex-ai7517 1 year ago
Note that at 1:53 I made a mistake. It should be that b ∈ R^N. The batch dimension B was already accounted for when I wrote the bias matrix as repeated rows of b.
@sagarshravane961
@sagarshravane961 9 months ago
Yeah, to be precise, b ∈ R^(1×N), and how it is added to each instance really depends on the implementation in the code. It is frustratingly confusing for beginners, as the rows need not be repeated in PyTorch or NumPy thanks to their elementwise broadcasting. Thanks for the awesome lecture.
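The broadcasting behavior this comment describes can be seen in a small NumPy sketch (values are illustrative): explicitly repeating the bias rows and letting NumPy broadcast give identical results.

```python
import numpy as np

x = np.arange(6.0).reshape(3, 2)  # batch of 3 instances, N = 2 features
b = np.array([10.0, 20.0])        # bias b in R^N (equivalently R^(1 x N))

# The bias matrix with repeated rows, as written on the board in the video:
explicit = x + np.repeat(b[None, :], 3, axis=0)

# What you actually write in code: NumPy broadcasts b across the batch dim.
broadcast = x + b

assert np.array_equal(explicit, broadcast)
```

So the "repeated rows" matrix is a notational device for the derivation; in code the repetition is implicit in the broadcasting rules.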
@huseyinsenol1769
@huseyinsenol1769 7 months ago
That was one of the greatest lecture videos I've ever seen. Thanks.
@guoguowg1443
@guoguowg1443 1 month ago
great stuff Alex
@ludwigkraken9935
@ludwigkraken9935 9 months ago
great explanation!
@Glaszg
@Glaszg 1 year ago
Great video, really useful! If you could also do dL/d sigmoid(y) and dL/d PReLU(h), that'd help a lot!
@beniaminradomir9798
@beniaminradomir9798 7 months ago
This is a very helpful video! I'm learning backpropagation for the first time and was totally confused by the shapes of the matrices, which would never align for me. However, one thing makes me wonder: can this method only be used for (linear) NNs that don't use activation functions? This appears to be the case. Does that mean that if I wanted to do the same derivations for NNs that do use activation functions, it would be even more complicated? Oh man, and here's me thinking this would be easy and straightforward, haha
@swazza9999
@swazza9999 7 months ago
I'd say this video covers the "hardest" bit. Activations are easy to incorporate because they typically act on one neuron at a time, so there's no index tracking to do; it's just i -> i.

In fact, this video already shows how to deal with activations. If you look at my final expressions, they still have dL/dy in them. I left the loss function general, leaving you to fill that in depending on which specific loss function you are using. But suppose this were a layer somewhere in the middle of the neural network, and I were really calculating da/dx, da/dW and da/db (a is for "activation"). All the math in the video would be exactly the same, but instead of dL/dy in the final expressions, you'd have da/dy.

So incorporating an activation just amounts to incorporating its derivative into the chain rule, and since the activation is a scalar-to-scalar function, there's no matrix multiplication; it's just a scalar. For example, suppose my activation is a(y) = y**2 / 2. Then da/dy = y, and you just plug y in wherever I have dL/dy in the video.

If it's still not clear to you after reading this, I'd encourage you to sit with it for some time and try to work through it. I feel like you are close.
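A minimal NumPy sketch of this reply (variable names are my own), using the a(y) = y**2 / 2 example to show an elementwise activation entering the chain rule as an elementwise multiply, with the video's matrix expressions otherwise unchanged:

```python
import numpy as np

# Same linear-layer setup as the video: y = x @ W + b, followed here by
# an elementwise activation a(y) = y**2 / 2, whose derivative is da/dy = y.
B, D, N = 4, 3, 2
rng = np.random.default_rng(1)
x = rng.standard_normal((B, D))
W = rng.standard_normal((D, N))
b = rng.standard_normal(N)

y = x @ W + b
a = y**2 / 2                         # elementwise activation
dL_da = rng.standard_normal((B, N))  # upstream gradient from the rest of the net

# Chain rule through the activation: an elementwise product, no matrix
# multiplication, because each entry of a depends on one entry of y.
dL_dy = dL_da * y

# From here the video's final expressions apply verbatim:
dL_dW = x.T @ dL_dy
dL_db = dL_dy.sum(axis=0)
dL_dx = dL_dy @ W.T
```

The only new ingredient relative to the pure linear layer is the `dL_da * y` line; everything downstream of it is the same index tracking the video already did.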
The Most Important Algorithm in Machine Learning
40:08
Artem Kirsanov
258K views
Neural Networks Matrix Math and NumPy
13:58
Kie Codes
9K views
Backpropagation in Convolutional Neural Networks (CNNs)
9:21
Application of Calculus in Backpropagation
14:45
Orblitz
15K views
Back-Propagation with Tensors
8:46
Deep Foundations
898 views
Understanding Backpropagation In Neural Networks with Basic Calculus
24:28
Backpropagation Algorithm | Neural Networks
13:14
First Principles of Computer Vision
32K views
Neural Network from Scratch | Mathematics & Python Code
32:32
The Independent Code
118K views
Solve any equation using gradient descent
9:05
Edgar Programmator
52K views
Essential Matrix Algebra for Neural Networks, Clearly Explained!!!
30:01
StatQuest with Josh Starmer
44K views
Backpropagation : Data Science Concepts
19:29
ritvikmath
33K views
Vector and matrix derivatives
12:39
Herman Kamper
30K views