Softmax Function Explained In Depth with 3D Visuals

41,564 views

Elliot Waite


Comments: 107
@elliotwaite 4 years ago
CORRECTIONS: • At 5:22, instead of "more random" I probably should have said "less predictable," and instead of "less random, more deterministic," I probably should have said "more predictable," since it's only deterministic if one class has a value of 1 and all others 0, and other than that it represents randomness just with different distributions. Also, this is related to the context of when one is sampling from this distribution, for example when choosing the next character or word in a sentence when doing text generation. • At 16:51 I said, "It's basically the idea that as our networks are training, they're trying to push these logit values towards those Gaussian distributions," but this should only be taken loosely since the gradients don't point towards any single point. It would be more accurate to describe the direction of the gradient as being a weighted combination of both towards the subspace occupied by that Gaussian and towards the intersection of all the other subspaces, which can be seen by the slope of the gradient shown at 10:04.
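For readers who want a concrete feel for the "less predictable" vs. "more predictable" wording in the sampling context, here is a minimal NumPy sketch (my own illustration, not code from the video): logits that are close together give a near-uniform softmax whose samples are hard to predict, while one dominant logit gives a near-one-hot softmax whose samples are almost always the same class.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability (doesn't change the output).
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Similar logits -> nearly uniform probabilities -> samples are hard to predict.
flat_probs = softmax(np.array([0.1, 0.0, -0.1]))
# One dominant logit -> nearly one-hot probabilities -> samples are easy to predict.
peaked_probs = softmax(np.array([8.0, 0.0, -1.0]))

print(flat_probs)    # roughly [0.37, 0.33, 0.30]
print(peaked_probs)  # roughly [0.9995, 0.0003, 0.0001]

# Sampling, e.g. choosing the next character or word during text generation.
print(rng.choice(3, size=10, p=flat_probs))    # a varied mix of classes
print(rng.choice(3, size=10, p=peaked_probs))  # almost always class 0
```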
@apophenic_ 1 year ago
Thank you!
@rembautimes8808 3 years ago
After 30s you know that this lecture is worth its weight in gold.
@elliotwaite 3 years ago
Thanks!
@harry8 2 years ago
This content is amazing, and the visualizations are so intuitive and easy-to-understand. Awesome video!
@elliotwaite 2 years ago
Thanks, Harry! I appreciate it.
@alidanish6303 1 year ago
I was searching for an intuitive explanation of sigmoid and softmax, because they both have something in common, and materials on that something are rare. And to my surprise, Elliot, you explained most of the nuances of the concept, if not all of them. I was hoping to find intuition on some other aspects of activation functions, but sadly you haven't made any videos for the last several years. I understand that due to lack of time, and probably because nothing comes out of this hobby of yours, you stopped making videos on technical content. But the way you have perceived these ideas has produced gold.
@elliotwaite 1 year ago
Thanks. I'm glad you liked the explanation. I may make more videos in the future. The main reason I stopped is because I want to build an AI company, and I felt like making these videos was slowing down my progress towards that goal. But I think once I get my business going a little more, it will actually be beneficial to get back into making videos. But we'll see. Thanks again for the kind comment.
@tomgreene4329 2 years ago
this is probably the best math video I have ever watched
@elliotwaite 2 years ago
Thanks!
@aureliobeiza3422 3 months ago
What an explanation! All I had to do was watch the first 2 minutes and everything made sense to me.
@elliotwaite 3 months ago
Thanks!
@ryzurrin 3 years ago
Amazing video, by far one of the best and most visual explanations I have ever seen. Thanks for making such a great video. Can't wait to watch more of your videos.
@elliotwaite 3 years ago
Thanks!
@shashankdhananjaya9923 3 years ago
Ultimate explanation. Thank you a million.
@elliotwaite 3 years ago
Thanks!
@Nico-rl4bo 4 years ago
I started a neural nets from scratch tutorial and your videos are amazing for support/understanding.
@elliotwaite 4 years ago
Thanks!
@LifeKiT-i 1 year ago
I particularly love the way you explain it in a graphing calculator, which most YouTubers won't dare to use. The whole of ML is math. Love your videos!! Please update us by uploading more videos like this!
@elliotwaite 1 year ago
Thanks, I appreciate the comment. I hope to make more videos eventually.
@bArda26 4 years ago
Wow, you make Desmos sing! I love Desmos as well. Such an amazing thing, an incredible visualization tool. Great video, please keep making more!
@elliotwaite 4 years ago
Agreed, Desmos is great! Fun to use and helpful in gaining insights. Thanks, bArda26! I'll keep the videos coming.
@RomainPuech 1 year ago
Best, most thorough explanation possible!
@elliotwaite 1 year ago
Thank you!
@boutiquemaths 1 year ago
Thanks so much for creating this thorough amazing explainer exposing the hidden details of softmax visually. Amazing!
@elliotwaite 1 year ago
:) I'm glad you liked it.
@rutvikjaiswal4986 3 years ago
Even without watching this video, I liked it, because I know it is really going to change how I think about the softmax function, and that I'll learn a lot from it.
@mic9657 10 months ago
This is great! You have put so much effort into this to make it easy to understand
@elliotwaite 10 months ago
Thanks!
@opaelfpv 3 months ago
I don't understand why you initialize the last linear layer to zeros. Great video! You provide a really good explanation for a paper I wrote recently.
@elliotwaite 3 months ago
I'm not sure how much benefit zeroing the last layer actually adds, but the idea behind initializing the last linear layer's weights and biases to zero is that it makes all the inputs to the softmax the same value (zero in this case). And if all the inputs to the softmax are the same value, then the initial guess for each output will be the same probability (for 2 classes it will output a probability of 0.5 for each class, for 10 classes it will output a probability of 0.1 for each class, etc.). This essentially means that we initially assume we don't know anything about the data, which makes sense, and it is a stable starting state for the neural network to start learning from. It also puts all the outputs at the center of our output space, where there is a good amount of gradient for all values, which will start separating them in the correct directions.

If we didn't zero the last layer, the learning algorithm would still work, but it would start in a bit of a tangled state, where some outputs would be correct and some would be wrong, and the gradient would essentially have to untangle them. I think I read somewhere (but I can't remember where) that someone tested zeroing the last layer and it tended to improve the convergence rate, since the network doesn't start in a tangled state, but I think it is only a minor optimization, and either way the network will be able to learn a correct set of weights.

Another thing to note is that the last linear layer is the only layer we can safely zero out. If we zeroed out any of the earlier layers, then the input to the next linear layer would be all zeros, which would make all the gradients for that layer's weights zero as well, and it wouldn't be able to learn. The only reason we can zero out the last layer is that its values are fed to a softmax function instead of another linear layer, and the softmax function still provides gradients even if all its inputs are zeros.

Thanks for the question. I probably should have explained this more in the video.
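As a rough illustration of the zero-initialization idea described above, here is a minimal PyTorch sketch (a hypothetical two-layer classifier, not the exact network from the video): with the last linear layer's weights and biases zeroed, every input produces all-zero logits, so the initial softmax output is uniform over the classes.

```python
import torch
import torch.nn as nn

# A hypothetical two-layer classifier (not the exact network from the video).
model = nn.Sequential(
    nn.Linear(2, 64),
    nn.ReLU(),
    nn.Linear(64, 3),  # last layer: produces the logits for 3 classes
)

# Zero the last linear layer so the network starts out "knowing nothing."
last = model[-1]
nn.init.zeros_(last.weight)
nn.init.zeros_(last.bias)

x = torch.randn(5, 2)
logits = model(x)                      # all zeros at initialization
probs = torch.softmax(logits, dim=-1)
print(probs)                           # every row is [1/3, 1/3, 1/3]
```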
@opaelfpv 3 months ago
@@elliotwaite Thank you very much for your reply! :) I think that starting in a somewhat tangled state may depend on the initialization of the various layers, the types of layers and activations, and above all on the distribution of the input data. I agree that initializing the last layer (the classifier) with zeros allows us to better optimize the softmax function, placing all the values in the center, i.e., at the equiprobability point of the softmax's probability simplex. If we remove the dependence on the distribution of the data and on the network (as in the unconstrained feature model), most likely the logits would be optimized following the best direction leading from the center of the probability simplex to the vertices. Do you agree?
@elliotwaite 3 months ago
@@opaelfpv I think I agree, if I understand you correctly. And that's a good point about the tangledness depending on those other factors. Also, if the training data is known ahead of time, in which case we could figure out the expected probability for each output class, maybe it would be even better to still zero the weights but set the biases to match the expected distribution of the output classes, since that may be an even more neutral and stable starting state. I'm not sure I fully understand what you mean by removing the dependence on the distribution of the data and the network, and I'm not familiar with the unconstrained feature model yet, so I'd have to do some more research before being able to offer any feedback on your last question. By the way, if your paper is publicly available, on arXiv or somewhere, I'd be interested in finding out more about it.
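A small sketch of that bias idea (my own assumption about the setup, not something from the video or the paper): keep the last layer's weights at zero, but set its biases to the log of hypothetical class frequencies so the initial softmax output matches the prior distribution of the classes.

```python
import torch
import torch.nn as nn

# Hypothetical class frequencies measured from the training labels.
class_priors = torch.tensor([0.7, 0.2, 0.1])

last = nn.Linear(64, 3)
nn.init.zeros_(last.weight)
with torch.no_grad():
    # softmax(log(p)) == p when p sums to 1, so the initial output is the prior.
    last.bias.copy_(class_priors.log())

logits = last(torch.zeros(1, 64))     # with zero weights, the logits are just the bias
print(torch.softmax(logits, dim=-1))  # ~[0.7, 0.2, 0.1]
```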
@buffmypickles 4 years ago
Well thought out and extremely informative. Thank you for making this.
@elliotwaite 4 years ago
Thanks!
@stanshunpike369 2 years ago
I just did a presentation with Desmos and I thought mine was good, but WOW props to you. Ur Desmos skills are insanely good man
@elliotwaite 2 years ago
Thanks! Yeah, I'm a big Desmos fan. It's a great learning tool.
@chrisogonas 2 years ago
Awesome illustrations! Thanks
@eugene63218 3 years ago
Man. This is the best video I've seen on this topic.
@elliotwaite 3 years ago
Thanks!
@ajaygunalan1995 9 days ago
This is GOLD! Pure Gold!!!!
@EpicGamer-ux1tu 1 year ago
I love you man, this video is what I needed. Much love and best of luck mate.
@elliotwaite 1 year ago
Thanks!
@andreykorneychuk8053 4 years ago
The best video I have ever seen about softmax
@elliotwaite 4 years ago
Thanks!
@catthrowvandisc5633 4 years ago
thank you for taking the effort to produce this, it is super helpful!
@elliotwaite 4 years ago
Thanks, catthrowvandisc! Glad you found it helpful.
@EBCEu4 4 years ago
Awesome visuals! Thank you for the great work!
@elliotwaite 4 years ago
Thanks, Yuriy! Glad you liked it.
@avinrajan4815 1 year ago
Very helpful! The visualisation was very good!! Thank you...
@elliotwaite 1 year ago
Thanks! Glad you liked it.
@vatsal_gamit 4 years ago
Very well explained 👍
@elliotwaite 3 years ago
Thanks!
@wencesvm 4 years ago
Amazing video, Elliot!!! Great insights, keep it up.
@elliotwaite 4 years ago
Thanks, Wenceslao! Comments like this make my day.
@patite3103 4 years ago
What you've done is just amazing!!
@elliotwaite 4 years ago
Thanks!
@mohammadkashif750 3 years ago
very helpful, content was amazing😍
@nikhilshingadiya7798 3 years ago
I have no words for you !!! Awesome
@tato_good 3 years ago
very good explanation
@toniiicarbonelll287 3 years ago
So useful! Thanks a lot
@anton_98 3 years ago
Adding a shift to all classes changes the scores. For example, say we have 3 classes with values 5, 3, and 2, so we can say 5/10 = 3/10 + 2/10. After adding 1 to the value of all classes: 6/13 != 4/13 + 3/13. So the addition must change the scores. It is also easy to see if we add 1000, for example: the values become 1005, 1003, and 1002, and the score of each roughly becomes 0.3333...
@anton_98 3 years ago
This was a comment about this part of the video: kzbin.info/www/bejne/r6XFioV_g5WBask.
@elliotwaite 3 years ago
The reason an addition to all the inputs doesn't change the outputs is that instead of using the raw input values in the fractions, you exponentiate the inputs first before using them in the fraction: e^2 / (e^2 + e^3 + e^5) = e^1002 / (e^1002 + e^1003 + e^1005). Thanks for the question.
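A quick numerical check of this shift invariance (a sketch of my own, using NumPy):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is the same shift trick and also keeps e^x from overflowing.
    e = np.exp(x - np.max(x))
    return e / e.sum()

a = np.array([2.0, 3.0, 5.0])
print(softmax(a))          # roughly [0.042, 0.114, 0.844]
print(softmax(a + 1000))   # identical output: the shift cancels out in the ratio
```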
@anton_98 3 years ago
@@elliotwaite Thank you for the explanation.
@snehotoshbanerjee1938 4 years ago
Superb, excellent, Elliot!!! BTW, which software did you use?
@elliotwaite 4 years ago
Thanks, Snehotosh! For the 2D graph, I used Desmos, and for the 3D graphs, I used GeoGebra. They both have web-based versions and GeoGebra also has a native version. I posted links in the description to all the interactive graphs shown in the video if you want to get some ideas for how to use them.
@HeitorPastorelli 1 year ago
awesome video, thank you
@elliotwaite 1 year ago
:) thanks for the comment. I'm glad you liked it.
@horizon9863 4 years ago
Please keep doing it!
@elliotwaite 4 years ago
Thanks, Cankun! I plan to make more videos soon.
@Mahesha999 3 years ago
I didn't get why exactly the smaller curve itself changes shape when you decrease/increase the green value at 2:25. I guess going through that Desmos graph will bring more clarity. Can you open source it? Or is it already shared?
@elliotwaite 3 years ago
The link to the publicly accessible Desmos graph is in the description. The bottom graph is a copy of the top graph, but scaled down vertically so that if you add up the heights of all the dots, that total height has a height of 1. So when you increase the height of the green dot, that increases the total height of all the dots, so the bottom graph has to be scaled down even more to make that total height stay at a height of 1, which is why the bottom graph squashes down more as the height of the green value increases.
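A tiny numerical sketch of that squashing effect (hypothetical dot heights, not the actual values from the Desmos graph):

```python
import numpy as np

# Heights of the dots in the top graph (hypothetical values).
heights = np.array([1.5, 0.5, 2.0])

# The bottom graph is the same shape, scaled so the heights sum to 1.
print(heights / heights.sum())   # [0.375, 0.125, 0.5]

# Raising the green dot raises the total, so every dot gets squashed down more.
heights[2] = 6.0
print(heights / heights.sum())   # [0.1875, 0.0625, 0.75]
```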
@alroy01g 3 years ago
Great video, many thanks! One question: around 5:00, you mentioned that bringing the probabilities closer together makes things more random, and separating them makes things more deterministic. I can see that graphically, but I can't reconcile it with the definition of variance, which says that the more spread out the points are, the higher the variance. Maybe I'm mixing things up here?
@elliotwaite 3 years ago
Thanks, glad you liked the video. About your question, it is a bit counterintuitive at first, but I think the key to understanding how having more similar probabilities can lead to a more random outcome is remembering that the probabilities we are talking about are the probabilities of different outcomes. And when I say "random," I'm using the word a bit loosely and not in the same way as variance, which I explain below.

For example, if we were rolling a six-sided die, we could plot bars for the probabilities of getting each of the six sides. If it were a weighted die that almost always rolled a 5, then the bar for 5 would be large and the others would be small, and we might say the die was not very random. In this case, the distribution of the value we'd get from rolling the die would also have a low variance, with most of the distribution's mass on 5. On the other hand, for a fair die, the probabilities of the sides would all be equal, with their bar heights all about the same, and we might say the die was more random. In this case, the distribution of the value we'd get from rolling the die would also have a higher variance, with the probability mass distributed evenly across the outcomes.

To see the difference between my usage of the word "random" and the precise term "variance," consider a weighted die that rolls a 1 half the time and a 6 the other half. We might say that this die is less random than a fair die, but the distribution of its outcome values actually has a higher variance than a fair die's. In fact, it is the weighting with the highest variance possible.

So when I say "random," I just mean that we are less certain about which class will be the outcome, but if those classes correspond to specific values in a field (a 1D, 2D, 3D, or higher-dimensional space), then the variance of the distribution of those values could be higher or lower depending on which values the classes correspond to. I hope that helps.
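To make the distinction concrete, here is a small NumPy sketch (my own illustration, with hypothetical die weightings) that computes the entropy of each distribution, which captures the loose sense of "randomness" used above, alongside the variance of the rolled value:

```python
import numpy as np

def entropy(p):
    # Uncertainty about which side comes up ("randomness" in the loose sense).
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def variance(p, values):
    mean = (p * values).sum()
    return (p * (values - mean) ** 2).sum()

sides = np.arange(1, 7)
fair = np.full(6, 1 / 6)                                  # fair die
loaded = np.array([0.02, 0.02, 0.02, 0.02, 0.90, 0.02])   # almost always rolls a 5
split = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.5])          # half 1s, half 6s

for name, p in [("fair", fair), ("loaded 5", loaded), ("1 or 6", split)]:
    print(name, "entropy:", round(entropy(p), 3), "variance:", round(variance(p, sides), 3))

# The 1-or-6 die has lower entropy than the fair die (less "random" in the loose sense)
# but a higher variance of the rolled value (6.25 vs. about 2.92).
```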
@alroy01g 3 years ago
@@elliotwaite Ah, yes you're right, those are different outcomes, appreciate the detailed explanation, thanks for your time 👍
@nagakushalageeru135 3 months ago
Awesome !
@elliotwaite 3 months ago
@@nagakushalageeru135 Thanks!
@scotth.hawley1560 4 years ago
Great video. The point where you take the log of all these functions to make the loss seems like it could use a bit more motivation: The log makes sense for exponential-like sigmoids and if you're requiring a probabilistic interpretation, but perhaps there is some other function more appropriate for algebraic sigmoids? (Although I can't think of any that would produce the "nice" properties that you get from cross-entropy loss & logistic sigmoid, apart from maybe a piecewise scheme like "Generalized smooth hinge loss".)
@elliotwaite 4 years ago
Good point. I didn't go into the reasoning for why I chose the negative log-likelihood loss, other than mentioning that it is the standard used for softmax outputs that are interpreted as probabilities. I wasn't sure of a concise way to explain it, so I thought I would just take it as a given without going into the reasoning. But maybe I could have at least pointed viewers to another resource they could read if they wanted to better understand why it is typically used and how other loss functions could also potentially be used. Thanks for the feedback.
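For reference, the standard choice taken as a given in the video, the negative log-likelihood of the softmax output (i.e., cross-entropy), looks like this in a minimal NumPy sketch (the example logit values are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nll_loss(logits, true_class):
    # Negative log-likelihood of the true class under the softmax output,
    # i.e. the cross-entropy loss typically paired with a softmax.
    return -np.log(softmax(logits)[true_class])

logits = np.array([2.0, 0.5, -1.0])
print(nll_loss(logits, 0))  # small loss: the true class already gets high probability
print(nll_loss(logits, 2))  # large loss: the true class gets low probability
```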
@yoggi1222 3 years ago
Excellent !
@moorboor4977 2 months ago
Impressive!
@elliotwaite 2 months ago
@@moorboor4977 thanks!
@ناصرالبدراني-ش9س 2 years ago
great stuff
@zyzhang1130 3 years ago
very helpful!
@adityaghosh8601 4 years ago
Thank you so much.
@melkenhoning158 3 years ago
THANK YOU
@TylerMatthewHarris 4 years ago
Thanks!
@arseniymaryin738 3 years ago
Hi, can I see your neural network training (at 12:12) in some environment such as a Jupyter notebook?
@elliotwaite 3 years ago
I don't have a notebook of it, but you can get the code here: github.com/elliotwaite/softmax-logit-paths
@my_master55 2 years ago
Thanks Elliot, great video! 👍 Could you please clarify why the output space is evenly distributed among the classes? Shouldn't there be more space for the dominant class? Or do you consider all the classes to be equally dominant? Thanks 🙌
@elliotwaite 2 years ago
So the visuals are actually showing all possible cases in one graph, both what the output would be when one class is dominant and what the output would be when the classes are equivalent. The spaces where one color is above the others is where that class is dominant and the place in the middle where the curves cross and are all equal height is what the output would be if the classes were equivalent. And the spaces are all the same size because the softmax function doesn’t give any preference to any of the inputs before knowing their values. Let me know if it is still unclear.
@my_master55 2 years ago
@@elliotwaite oh, okay, thank you 👍 So what we see is just "the middle of the decision" and therefore it only looks like the classes are equal. Which means, this does not imply that the classes are like 50/50 everywhere in case of 2 classes, but at the intersection it indeed looks like they are 50/50. Hope I got it right 😊
@elliotwaite 2 years ago
@@my_master55 I’m not sure what you mean by "the middle of the decision." But yes, at the intersection they are 50/50 in the 2 class example and 1/3, 1/3, 1/3 in the 3 class example.
@zhiyingwang1234 1 year ago
The comment (5:06) about temperature parameter is not accurate, when temp parameter
@elliotwaite 1 year ago
I'm not sure I understand your comment. Can you clarify what you mean by "theta values"? In the video, I was trying to say that increasing the temperature flattens the distribution (making it more uniform), and that lowering the temperature pushes the arg max up and the others down (making the distribution more like a spike on only the arg max probability). Are you saying you think it is the other way around, or was there something else that you thought was inaccurate?
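A minimal NumPy sketch of that temperature behavior (the example logit values are my own, not from the video):

```python
import numpy as np

def softmax_with_temperature(logits, t):
    z = logits / t
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
print(softmax_with_temperature(logits, 1.0))   # roughly [0.665, 0.245, 0.090]
print(softmax_with_temperature(logits, 10.0))  # high temperature: nearly uniform
print(softmax_with_temperature(logits, 0.1))   # low temperature: a spike at the arg max
```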
@lenam8048 1 year ago
Why do you have so few subscribers?
@elliotwaite 1 year ago
I know 😥. I think I need to make more videos.
@fibonacci112358s 1 year ago
"To imagine 23-dimensional space, I first consider n-dimensional space, and then I set n = 23"
@elliotwaite 1 year ago
Haha. For real though, that probably is the best way to do it.
@nexyboye5111 4 months ago
wow this is real shit
@elliotwaite 4 months ago
@@nexyboye5111 thanks
@DiogoSanti 4 years ago
Lol, "I am not comfortable with geometry"... If you aren't, imagine the rest of us... Haha
@elliotwaite 4 years ago
😁 Thanks, I try. The higher dimensional stuff is tough, but I've slowly been getting better at finding ways to reason about it. But I'd imagine there are people, perhaps mathematicians who work in higher dimensions regularly, that have much more effective mental tools than I do.
@DiogoSanti 4 years ago
@@elliotwaite Maybe, but higher dimensions are hypothetical since we are trapped in 3 dimensions... the rest is just imagination, hehe.
@joeybasile545 8 months ago
Great video. Thank you