Why Do Neural Networks Love the Softmax?

66,079 views

Mutual Information

A day ago

Comments: 126
@mgostIH
@mgostIH Жыл бұрын
Softmax is also invariant to constant addition! This is often exploited in implementations that compute combinations with the resulting probabilities; for example, in attention, the paper "self attention does not need O(N^2) memory" uses this to avoid blowups while computing the combination sequentially, avoiding the need to form the entire attention matrix.
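A minimal NumPy sketch of this shift invariance (added here as an illustration; not from the video or the paper):

    import numpy as np

    def softmax(s):
        # Shifting s by a constant c multiplies both the numerator and the
        # denominator by exp(c), so the output is unchanged.
        e = np.exp(s)
        return e / e.sum()

    s = np.array([0.0, 1.0, 2.0])
    print(np.allclose(softmax(s), softmax(s + 100.0)))  # True: invariant to constant shifts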
@Mutual_Information
@Mutual_Information Жыл бұрын
Very interesting - pinned!
@BuddyVQ
@BuddyVQ Жыл бұрын
In my courses, this was one of the big advantages of using the exponential (alongside the convenient Jacobian). Without invariance to constant shifts, score vectors like [0, 1, 2] vs [100, 101, 102] would produce significantly different results when they should not.
@nerkulec
@nerkulec Жыл бұрын
Highest quality Deep Learning content out here on youtube.
@Mutual_Information
@Mutual_Information Жыл бұрын
I'm certainly working on it!
@Otomega1
@Otomega1 Жыл бұрын
I was just about to write this comment, so I'll just like this one.
@Friemelkubus
@Friemelkubus Жыл бұрын
+1
@mCoding
@mCoding Жыл бұрын
Great intuition explained in simple terms, and top tier visualizations as always!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you James - appreciate the love!
@alvinjamur1
@alvinjamur1 Жыл бұрын
As a long time NN guy (since ‘93) & quant….IMHO….this channel deserves 10 gazillion readers. Very well done!
@Mutual_Information
@Mutual_Information Жыл бұрын
Appreciate it - but I don't quite think there's 10 gazillion NN guys out there. We'll just have to settle for being the cool club 😎
@monikaherath7505
@monikaherath7505 Жыл бұрын
Hello friend. As a beginner in NN, why does ML and similar stuff seem to have exploded only just now, especially at universities? Why has it become a fad just now when it has been researched for decades? Thanks for your help.
@manavt2000
@manavt2000 9 ай бұрын
@@monikaherath7505 Because of the advancements in computing power... super-fast GPUs.
@zezkai7887
@zezkai7887 Жыл бұрын
Interestingly, as mentioned in one of Andrej Karpathy's videos, this shift (at 1:58) is also performed in softmax implementations to ensure numerical stability.
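For reference, a common way to implement that looks roughly like the sketch below (an added illustration, assuming plain NumPy; real libraries do the equivalent internally):

    import numpy as np

    def stable_softmax(s):
        # Subtracting the max changes nothing mathematically (shift invariance),
        # but keeps exp() away from large positive inputs that would overflow to inf.
        z = s - np.max(s)
        e = np.exp(z)
        return e / e.sum()

    s = np.array([1000.0, 1001.0, 1002.0])
    print(stable_softmax(s))  # well-behaved probabilities
    # A naive np.exp(s) / np.exp(s).sum() overflows here and returns NaNs.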
@gar4772
@gar4772 7 ай бұрын
Hands down one of the very best ML channels on youtube! Simple, concise explanations. Dj is one of the best educators I have ever seen on the subject. Thank you!
@Mutual_Information
@Mutual_Information 7 ай бұрын
Thank you my dude!
@CalvinHirschOoO
@CalvinHirschOoO Жыл бұрын
Great video. Softmax is great but sometimes restrictive. A recent paper, "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", found that transformers actually abuse/circumvent the softmax in order to perform no-ops (i.e. avoid normalization). Highly recommend reading it if you're interested in how softmax affects learning.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks, there's always so much research to keep up with..
@nuko_paon1351
@nuko_paon1351 Жыл бұрын
Lately, YouTube has been recommending me a ton of garbage LangChain tutorials with their garbage ideas among recently posted videos. But sometimes I have to truly appreciate that it also shows me the way to a hidden gem like you. Again, thanks for making great content. Keep it up! Edit: subbed!
@Mutual_Information
@Mutual_Information Жыл бұрын
Appreciate it! I am doing my best to not succumb to the hype trends :)
@hellfishii
@hellfishii Жыл бұрын
The amount of math behind the justification of why the softmax function is chosen is insane. What a fascinating time to be alive.
@ScoobyDoo-xu6oi
@ScoobyDoo-xu6oi 3 ай бұрын
There is no math behind it, and he didn't give any justification but rather an opinion.
@hellfishii
@hellfishii 3 ай бұрын
@@ScoobyDoo-xu6oi He indeed didn't do anything original, just communicating in a clear and softmaxxing way for non-technical folks. Brother, it's not a paper; what are you doing on YouTube?
@yubtubtime
@yubtubtime Жыл бұрын
What a beautiful visualization of the Jacobian 🤩
@jonashallgren4446
@jonashallgren4446 Жыл бұрын
Man, I can feel a binge watch of your videos coming, great content, at least I will procrastinate while doing something very useful lol
@Mutual_Information
@Mutual_Information Жыл бұрын
As far as procrastination tasks go, these videos aren't a terrible use of time. That said, my old stuff is a lot harder to watch lol
@gregorykafanelis5093
@gregorykafanelis5093 Жыл бұрын
This greatly resembles the Boltzmann factor used in physics, which states that the probability of a given state (in thermal equilibrium) will be p_i = exp(-ε_i/(kT))/Z, with Z being the sum over all these factors (the partition function) to normalize the result. Also, the expression you give for the loss function mimics the way entropy can be defined using Boltzmann probabilities. Nature truly provides the most elegant way to reach equilibrium.
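Writing the correspondence out (an added note, not from the original comment):

    p_i = \frac{e^{-\varepsilon_i/(k_B T)}}{Z}, \qquad Z = \sum_j e^{-\varepsilon_j/(k_B T)}, \qquad \sigma(s)_i = \frac{e^{s_i}}{\sum_j e^{s_j}}

Identifying s_i = -\varepsilon_i/(k_B T) makes the softmax exactly a Boltzmann distribution, with its normalizing denominator playing the role of the partition function Z.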
@Mutual_Information
@Mutual_Information Жыл бұрын
Yeah, how wild is that!? Statistical physics was way ahead of (or inspired?) modern DL.
@matthewtang1489
@matthewtang1489 Жыл бұрын
just realized I found gold. Beautiful video, never thought of it even though I use it every day.
@jcamargo2005
@jcamargo2005 3 ай бұрын
This is fascinating. Because of their usefulness, softmax and sum-of-exp deserve at least a mention in undergrad math courses.
@1495978707
@1495978707 Жыл бұрын
Just found you. Great content; taking the time to seriously explain why a choice is made, and why it's a good one, is rare, and it's so well done here.
@wedenigt
@wedenigt Жыл бұрын
Great walkthrough! Your channel definitely deserves more attention. At 8:47, maybe one should emphasize that the derivative of the loss w.r.t. f(s) is independent of our choice of f. Thus, the simplicity of this term cannot be attributed to the softmax - it's only due to the choice of the loss.
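Spelling that point out (an added note): with the cross-entropy loss L = -\sum_i y_i \log f(s)_i for an arbitrary normalizing function f,

    \frac{\partial L}{\partial f(s)_i} = -\frac{y_i}{f(s)_i},

which has the same form for any choice of f. The clean result \partial L/\partial s = \sigma(s) - y only appears once this term is multiplied by the softmax Jacobian \partial \sigma(s)_i/\partial s_j = \sigma(s)_i(\delta_{ij} - \sigma(s)_j), so the final simplification is a joint property of the log loss and the softmax.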
@Mutual_Information
@Mutual_Information Жыл бұрын
It's funny you say that. At 9:01, I had originally said this: "Now the simplicity isn't due to the softmax, but the natural log from within the loss. Fortunately, that also plays nicely with derivatives." But I cut it out. I felt it was deducible from what was on screen at the time, but in retrospect... it's a good thing to clarify.
@oceannuclear
@oceannuclear 6 ай бұрын
This is so beautifully done! Thank you! The pacing is perfect (for me anyway, since I can pause and mentally check the math), and the inclusion of the actual expression at 5:58 is helpful too. The choice of topic is extremely insightful as well! I never realized that softmax is so prevalent partly because the math cancels out, simplifying a matrix multiplication into a simple vector subtraction!
@Mutual_Information
@Mutual_Information 6 ай бұрын
You get it!! :)
@xfts1988
@xfts1988 Жыл бұрын
Seeing how you represented the Jacobian of the softmax at 7:15 helped me see how softmax is the actual generalization of the logistic function to higher dimensions. I always took softmax as an algebraic tool to crunch matrices into probability scores. Thank you for your amazing content.
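The two-class case, written out (an added note):

    \sigma(s)_1 = \frac{e^{s_1}}{e^{s_1} + e^{s_2}} = \frac{1}{1 + e^{-(s_1 - s_2)}},

which is the logistic function applied to the score difference s_1 - s_2, so softmax really is the multi-class generalization of the logistic.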
@OpsAeterna
@OpsAeterna Жыл бұрын
new vid just dropped from the absolute legend!
@OpsAeterna
@OpsAeterna Жыл бұрын
I thank Dr. Orabona for recommending your channel!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you Ops!
@yingqiangli6026
@yingqiangli6026 Жыл бұрын
One of the best lectures on Softmax on the entire Internet!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you Yingqiang!
@niccolomartinello7610
@niccolomartinello7610 Жыл бұрын
I always assumed (based on the quality and quantity of the videos) that Mutual Information was one of the most famous stats channels on YT; I can't believe he has only 20k subscribers (at the time of writing). The second the algorithm favours one of your videos, the channel will blow up; I'd bet good money on that. Keep up the good work.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks! I hope you're right. So far it's coming along. I think my upcoming videos will do well enough, so I'm optimistic too
@phovos
@phovos 8 ай бұрын
wow this is amazing, ty! I'm going to watch this every day until I can explain this.
@Mutual_Information
@Mutual_Information 8 ай бұрын
I certainly don't mind that plan. If there's something in particular that's confusing, feel free to ask!
@Nahte001
@Nahte001 Жыл бұрын
Great vid. The only thing I think might have been worth mentioning is the grounding in, and reliance on, a probabilistic objective. While I certainly see the point you were making with the various shortcuts it affords the gradient calculation, it's only useful insofar as the prediction space is tractable and the input-output relation is many-to-one. Also, I'm a big fan of BCE with logits as an implementation of this idea; there's more to life than cross-entropy!!!
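A minimal PyTorch sketch of the BCE-with-logits idea (an added illustration; the tensor values are made up):

    import torch
    import torch.nn as nn

    logits = torch.tensor([2.0, -1.0, 0.5])   # raw scores, one per independent label
    targets = torch.tensor([1.0, 0.0, 1.0])   # multi-label targets, not one-hot classes

    # BCEWithLogitsLoss fuses the sigmoid with binary cross-entropy and is more
    # numerically stable than applying them separately.
    print(nn.BCEWithLogitsLoss()(logits, targets))

    # Mathematically equivalent, but less stable, two-step version:
    print(nn.BCELoss()(torch.sigmoid(logits), targets))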
@aliasziken7847
@aliasziken7847 Жыл бұрын
In fact, softmax can be derived from an optimization point of view, that is, from the maximum-entropy point of view.
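A sketch of that derivation (an added note): among distributions p over the classes with a fixed expected score, maximize the entropy,

    \max_p \; -\sum_i p_i \log p_i \quad \text{s.t.} \quad \sum_i p_i = 1, \;\; \sum_i p_i s_i = \mu.

Setting the derivative of the Lagrangian to zero gives p_i \propto e^{\lambda s_i}, i.e. a softmax of the scores, with the multiplier \lambda acting as an inverse temperature.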
@CodeEmporium
@CodeEmporium Жыл бұрын
Very interesting! Thanks for the quality content!
@Mutual_Information
@Mutual_Information Жыл бұрын
Excellent to hear from you CodeEmporium - appreciate the compliment!
@broccoli322
@broccoli322 Жыл бұрын
Well explained! This channel deserves more subscribers.
@chaddoomslayer4721
@chaddoomslayer4721 Жыл бұрын
Always wait for your videos more than I wait for my birthday ha
@rufus9508
@rufus9508 Жыл бұрын
Great quality content, hope this channel gets more attention!
@madcauchyren6209
@madcauchyren6209 Жыл бұрын
This is really an informative video. Thank you!
@eblouissement
@eblouissement Жыл бұрын
really useful for learning
@Mutual_Information
@Mutual_Information Жыл бұрын
That's what I'm for!
@광광이-i9t
@광광이-i9t Жыл бұрын
amazing !! thank you so much for your great explanation !!
@Mutual_Information
@Mutual_Information Жыл бұрын
Glad you liked it - more to come!
@ali-om4uv
@ali-om4uv Жыл бұрын
That was impressive! I would really appreciate it if you could give us a list of books and papers you read during your ML learning journey!
@Mutual_Information
@Mutual_Information Жыл бұрын
I don't have that on hand at the moment, but each video has sources on what I researched for that video. Maybe that helps? Also, I can tell you my overall favorite books in general. Those are Probabilistic Machine Learning by Kevin Murphy, The Elements of Statistical Learning by Hastie et al., and Deep Learning by Goodfellow et al. There are other greats as well.
@jacksonstenger
@jacksonstenger Жыл бұрын
Thanks for another great video!
@MathVisualProofs
@MathVisualProofs Жыл бұрын
Excellent!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you MVP!
@nikoskonstantinou3681
@nikoskonstantinou3681 Жыл бұрын
Really insightful video. Good work!
@user-wr4yl7tx3w
@user-wr4yl7tx3w Жыл бұрын
Truly insightful
@cartercheng5846
@cartercheng5846 Жыл бұрын
finally updated!
@piratepartyftw
@piratepartyftw Жыл бұрын
The softmax function ultimately comes out of statistical mechanics. It's the same function as the pdf of the Boltzmann distribution, and the denominator is the "partition function" (super important in statistical physics). In stat mech, the softmax (Boltzmann distribution) comes from the fact that each system state has some probability of occurring, and this is represented by the softmax (the numerator being the "frequency" with which some state might happen, and the denominator being the sum over all states). Basically, the frequency of a macrostate (e.g. the temperature and pressure of a gas) is proportional to the number of microstates (e.g. positions and velocities of the gas molecules) that might represent it. But in physics we don't really work with counts of microstates; we work with entropies, and entropy is the log of the count of microstates. So the softmax formula ends up being the exponential of the entropy, undoing the log in the entropy formula to get back the raw microstate count. That's where the exponential comes from: inverting the log in an entropy formula. And when you generalize the idea to other systems that have entropy (or things you want to treat as entropy in a max-entropy sort of way, like machine learning scores), you still have to take the exponential, even when there are no "microstates" to count or think about. A lot of the hardcore math people in machine learning were originally trained as physicists, so ML inherited a lot of stuff from physics.
@piratepartyftw
@piratepartyftw Жыл бұрын
Incidentally, Boltzmann's entropy is just a special case of Shannon's entropy where each outcome is equally likely (because each microstate is indistinguishable and therefore has to be equiprobable for consistency). So that's why there's a log in the entropy - same reason as in the Shannon entropy.
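Concretely (an added note): with W equally likely microstates, p_i = 1/W, and Shannon's entropy reduces to Boltzmann's:

    H = -\sum_{i=1}^{W} \frac{1}{W} \ln\frac{1}{W} = \ln W, \qquad S = k_B \ln W.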
@fluo9576
@fluo9576 Жыл бұрын
The structure of nature always comes out when you go deep enough.
@azaleacolburn
@azaleacolburn Жыл бұрын
Great content thank you!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you right back
@abdolrahimtooraanian5615
@abdolrahimtooraanian5615 10 ай бұрын
Well explained!!!
@yorailevi6747
@yorailevi6747 Жыл бұрын
The way I see it, it's related to ideal gas models and the normal distribution. In a sense, our dataset is particles of certain energies, and we need to find the distribution that best fits them. Sadly, the gas (data) is mixed, but we are lucky, because we do have a Maxwell's demon (the labels), so if we just let it do its job it will find a good uniform distribution (latent space) that can be used to pick out the particles.
@Mutual_Information
@Mutual_Information Жыл бұрын
I don't quite understand this but if it informs your intuition, that's good news
@h4ck314
@h4ck314 Жыл бұрын
quite insightful, thanks
@luciengrondin5802
@luciengrondin5802 Жыл бұрын
What I find the most interesting is how it relates to statistical physics.
@Zooted1278
@Zooted1278 Жыл бұрын
I'm new to all of this stuff, and even though I didn't quite understand all the terms due to lacking some of the mathematical background, I still find it absolutely incredible that by using the -log and exp functions you can take such a nightmarish matrix multiplication and reduce it to literally just subtracting two vectors. Makes me really excited to learn even more about deep learning. Would you happen to have any good recommendations for resources where I can shore up my math knowledge for this kind of content?
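A small NumPy sketch of that cancellation (an added illustration with made-up numbers): the gradient obtained by pushing through the full softmax Jacobian matches the simple vector subtraction softmax(s) - y.

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    s = np.array([1.5, -0.3, 0.8])   # model scores
    y = np.array([0.0, 1.0, 0.0])    # one-hot label
    p = softmax(s)

    # Route 1: chain rule through the full softmax Jacobian.
    J = np.diag(p) - np.outer(p, p)  # d softmax_i / d s_j (symmetric)
    dL_dp = -y / p                   # d(-sum_i y_i log p_i) / d p
    grad_via_jacobian = J @ dL_dp

    # Route 2: the simplified form.
    grad_simple = p - y

    print(np.allclose(grad_via_jacobian, grad_simple))  # True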
@Mutual_Information
@Mutual_Information Жыл бұрын
I'm a big fan of the deep learning book by Goodfellow and others: www.deeplearningbook.org/ It's got some essential math in there useful for DL. It's a well known book, so you may have already come across it.
@Zooted1278
@Zooted1278 Жыл бұрын
@@Mutual_Information Sounds awesome. I'll check it out right away. Thanks for the recommendation.
@Alexander_Sannikov
@Alexander_Sannikov Жыл бұрын
When I was fooling around with homebrew neural networks, I didn't know softmax was a thing, so I used an L2 norm (sum of squared differences). Why is that worse?
@brianprzezdziecki
@brianprzezdziecki Жыл бұрын
That was fucking incredible
@GregThatcher
@GregThatcher 5 ай бұрын
Thanks!
@gix_lg
@gix_lg Жыл бұрын
I'm not an expert on NNs, so forgive my dumb question: isn't ReLU the most commonly used activation function nowadays?
@Mutual_Information
@Mutual_Information Жыл бұрын
Yes! Softmax is not used as an activation much anymore; ReLU is the common choice. But the very last layer, for classification tasks, is almost always a softmax.
@fizipcfx
@fizipcfx Жыл бұрын
Here is the embarrassing thing: I have been coding neural networks for two years and I only just learned what it means to have a "differentiable activation function". I always wondered how PyTorch differentiates my custom functions algebraically 😂😂
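A tiny sketch of what actually happens (an added illustration; the activation below is made up): PyTorch records the operations you run and applies the chain rule numerically at the recorded values; nothing is differentiated symbolically.

    import torch

    def my_custom_activation(x):
        # Any composition of differentiable torch ops works; autograd tracks each op.
        return torch.log1p(torch.exp(x)) * torch.tanh(x)

    x = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)
    y = my_custom_activation(x).sum()
    y.backward()   # reverse-mode autodiff through the recorded graph
    print(x.grad)  # numeric gradient values, not a symbolic formula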
@Mutual_Information
@Mutual_Information Жыл бұрын
PyTorch is magic! Or a sufficiently advanced technology that sometimes it looks like magic
@fizipcfx
@fizipcfx Жыл бұрын
@@Mutual_Information yess
@stacksmasherninja7266
@stacksmasherninja7266 Жыл бұрын
I really doubt your statement about NLL "matching" the empirical distribution. Are there any known results that prove this? I don't think it has to match the empirical distribution at all.
@lulube11e111
@lulube11e111 Жыл бұрын
Was this video made with manim?
@Mutual_Information
@Mutual_Information Жыл бұрын
No actually, I'm using a personal library built on top of Altair
@taxtr4535
@taxtr4535 Жыл бұрын
this video was fucking great! keep it up bro
@Mutual_Information
@Mutual_Information Жыл бұрын
No plans of slowing down!
@parlor3115
@parlor3115 Жыл бұрын
Idk, I've lately been noticing a trend of many AI research and development groups (OpenAI included) transitioning towards using the hardmin function instead.
@betacenturion237
@betacenturion237 Жыл бұрын
In physics we get our dicks really hard about symmetry, symmetry, symmetry. It's everywhere and you can't function in this field without it. I thought that 'real life' was more often asymmetric, and thus that the value of symmetry was overblown in practical contexts. I'm not saying that symmetry is useless; it just felt like once you start dealing with real, noisy data, all of those symmetry tricks go out the window. Little did I expect that these AI engineers were exploiting a similar idea to keep computations of the loss function simple. Instead of dealing with a dense matrix of derivatives (which are inherently computationally unstable compared to integrals), why not construct the function in such a way that we only get diagonal terms? This is exactly what physicists do! I'm just surprised by its far-reaching consequences, I guess...
@kalisticmodiani2613
@kalisticmodiani2613 Жыл бұрын
Isn't the softmax in some of these models deep inside the network? Why would the loss function applied to the outputs influence the selection of the exponential function deep inside the network?
@Mutual_Information
@Mutual_Information Жыл бұрын
Softmax as an activation function is a separate question. It used to be popular, but ReLU ultimately took its place because it saturates less. Using the softmax as an activation can make for a lot of zero gradients... and make learning tricky.
@Nahte001
@Nahte001 Жыл бұрын
@@Mutual_Information Softmax (as well as other kWTA-esque activation functions) plays a vital role in multi-headed attention as a way of encouraging disentanglement between the heads. The reason they're ineffective in CNNs/RNNs is that, without the structural prior of multi-headed attention, its selective nature forces information loss whenever it's applied. When that loss happens right before the natural reduction to class logits, it isn't an issue, but when the layer in question is trying to learn features yet can only express one thing at a time, you can see where the issue arises.
@Mutual_Information
@Mutual_Information Жыл бұрын
@@Nahte001 I see, thank you
@aram9167
@aram9167 Жыл бұрын
Am I tripping or are you using 3B1B's voice in some places? For example at 8:28 and 9:47??
@Mutual_Information
@Mutual_Information Жыл бұрын
Lol I assure you I am not using his voice
@azophi
@azophi Жыл бұрын
Problem: negative model scores. Solution: make a neural network to map model scores onto probabilities 🧠
@MTd2
@MTd2 Жыл бұрын
Isn't this basically trying to calculate and then minimize Shannon's entropy?
@Mutual_Information
@Mutual_Information Жыл бұрын
I don't see Shannon's entropy showing up explicitly here. The NLL looks a lot like a Shannon entropy, but it uses two different distributions (the empirical one, y, and the model's, sigma(s))... Shannon entropy is a calculation on a single distribution.
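The distinction in symbols (an added note): Shannon entropy is a functional of one distribution, while the NLL used here is a cross-entropy between two:

    H(p) = -\sum_i p_i \log p_i, \qquad H(y, \sigma(s)) = -\sum_i y_i \log \sigma(s)_i.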
@MTd2
@MTd2 Жыл бұрын
@@Mutual_Information What do you mean by empirical? And it's very difficult not to think about Shannon entropy, because one of the first papers on chess AI was written by Shannon and used entropy to calculate some sort of minimax function.
@hw5622
@hw5622 Жыл бұрын
Well explained! Thank you ❤ By the way, I kept getting distracted by your nice-looking face…
@Mutual_Information
@Mutual_Information Жыл бұрын
haha thank you, I will wear a mask next time :)
@hw5622
@hw5622 Жыл бұрын
@@Mutual_Information haha that won’t be necessary. It’s good to see. Thank you again, I love all your videos and I am slowly going through them.❤
@Stopinvadingmyhardware
@Stopinvadingmyhardware Жыл бұрын
Because it pushes the bad results out of the domain.
@AR-iu7tf
@AR-iu7tf Жыл бұрын
Very nice explanation of the utility of softmax and the advantage of using it with cross entropy loss. Thank you! Here is another recent video that complements this video - why use cross entropy loss that leverages softmax to create a probability distribution? kzbin.info/www/bejne/goDLZmCCicmiqbc
@peceed
@peceed Жыл бұрын
So it is almost certain that biological brains use the same transformation.
@Mutual_Information
@Mutual_Information Жыл бұрын
ha well I don't have nearly the evidence to suggest that, but you never know
@peceed
@peceed Жыл бұрын
@@Mutual_Information Biology searches for "mathematical opportunities". Do you believe that the physically distributed weights of neurons (which can be large) are multiplied as a matrix, or rather that they compute local differences? And there is evidence that the brain uses a "logarithmic representation".
@june6959
@june6959 Жыл бұрын
"Promo SM" 😞
@kimchi_taco
@kimchi_taco Жыл бұрын
Softmax is a misleading name. It should be softARGmax.
@alfrednewman2234
@alfrednewman2234 Жыл бұрын
AI generated image, text
@yash1152
@yash1152 Жыл бұрын
0:09 Sorry, but a dislike for the low audio volume levels.
@yash1152
@yash1152 Жыл бұрын
1:39 I so much want to like the video, but I won't... I am tired of this plague across the entirety of small YouTube channels these days... it's the one thing news channels always get right. The plague being either super low audio levels in the vocals/speech, or deafeningly high levels of intro/background music.
@yash1152
@yash1152 Жыл бұрын
4:46 5:04 5:18 Oh, so here it was likely the result of an ultra-focused mic pointed at the neck, not the mouth. YouTubers, PLEASE listen to your videos in comparison with a pre-tested, accepted sample _at least_ once _before_ posting to YouTube. You spent hours and hours on the visuals; don't mess up the audio, please.
@yash1152
@yash1152 Жыл бұрын
This is being experienced widely across YouTube, likely because YouTubers are pouring money into _upgrading_ to expensive _focused_ mics but, due to inexperience with them, are still editing according to their last-used mics... heh, money alone ain't ever enough, eh!! It's not the tools by themselves, it's the worker who excels with them.
@Mutual_Information
@Mutual_Information Жыл бұрын
You caught me - I know very little about audio quality. All I do is apply a denoiser in Adobe Premiere... What specifically should I do to fix this? Is it just that the volume is sometimes too high and sometimes too low? What type of audio processing would you recommend (hopefully it's available in Adobe Premiere...)? Thanks!
@ronaldnixon8226
@ronaldnixon8226 Жыл бұрын
Obama caint force me to learn none a this nonsense! Trump own's them document's!
@444haluk
@444haluk Жыл бұрын
I find the eagerness to come up with simpler terms disturbing. It is basically laziness with a few extra steps. Simple doesn't mean useful, better, or true.
@arturprzybysz6614
@arturprzybysz6614 Жыл бұрын
Simpler terms could be useful, as they provide bigger conceptual "boxes" (less accurate ones), which sometimes allow for generalization and using intuition from other areas. Sometimes simple means more useful.
@Eye-vp5de
@Eye-vp5de Жыл бұрын
Simple does mean better in this case, because a neural network must be trained, and I think its training would be much more expensive if the expression weren't this simple.
@lunafoxfire
@lunafoxfire Жыл бұрын
Well I find the fact that you "find the eagerness to come up with simpler terms disturbing" to be itself disturbing. Equating simplicity with laziness is itself lazy.