The weirdest paradox in statistics (and machine learning)

1,051,407 views

Mathemaniac


1 day ago

🌏 AD: Get Exclusive NordVPN deal here ➼ nordvpn.com/mathemaniac. It's risk-free with Nord's 30-day money-back guarantee! ✌
Second channel video: • Why James-Stein estima...
Stein's paradox is of fundamental importance in modern statistics, introducing the concept of shrinkage to further reduce the mean squared error, especially in higher-dimensional statistics, which is particularly relevant nowadays, for example in the world of machine learning. However, it is usually ignored, because it is mostly seen as a toy problem - when precisely because it is such a simple problem, it clearly illustrates the shortcomings of maximum likelihood estimation! This paradox is the subject of many blog posts (linked below), but not really here on YouTube, except in some lecture recordings, so I have to bring this up to YouTube.
This is not to say that the maximum likelihood estimator is not useful - in most situations, especially in lower-dimensional statistics, it is still good. But to hold it in such high regard, as statisticians did before 1961? That is not a healthy attitude towards the theory.
One thing I did not say, but perhaps a lot of people will want me to, is that this is an empirical Bayes estimator - again, more links below.
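The paradox is easy to check numerically. Below is a minimal NumPy sketch (my own illustration, not code from the video): the true means are made-up values, each observed once with unit-variance Gaussian noise, and the James-Stein estimator's total squared error comes out below the ordinary (maximum likelihood) estimator's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true means for three unrelated quantities; each is observed
# once with N(theta_i, 1) noise, as in the video's setup.
theta = np.array([1.0, 2.0, 0.0])
p = theta.size  # 3 dimensions: the smallest number where the paradox appears

trials = 200_000
x = theta + rng.standard_normal((trials, p))  # ordinary (MLE) estimates

# James-Stein: shrink each observation vector toward the origin.
norm_sq = np.sum(x ** 2, axis=1, keepdims=True)
js = (1 - (p - 2) / norm_sq) * x

mse_ord = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"ordinary MSE:    {mse_ord:.3f}")  # theory says exactly p = 3
print(f"James-Stein MSE: {mse_js:.3f}")   # strictly smaller on average, for any theta
```

Changing `theta` to any other values still leaves the James-Stein total error below the ordinary one on average, which is the whole point of the paradox.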
Video chapters:
00:00 Introduction
04:38 Chapter 1: The "best" estimator
09:48 Chapter 2: Why shrinkage works
15:51 Chapter 3: Bias-variance tradeoff
18:45 Chapter 4: Applications
Further reading:
The “baseball paper”: efron.ckirby.su.domains//othe...
Wikipedia: en.wikipedia.org/wiki/Stein%2...
Dominating the (positive-part) James-Stein estimator: projecteuclid.org/journals/an...
Wikipedia (Empirical Bayes): en.wikipedia.org/wiki/Empiric...
Other writeups:
www.ime.unicamp.br/~veronica/M...
joe-antognini.github.io/machi...
www.jchau.org/2021/01/29/demy...
www.naftaliharris.com/blog/st...
austinrochford.com/posts/2013...
duphan.wordpress.com/2016/07/...
www.statslab.cam.ac.uk/~rjs57/...
(Philosophical implications) philsci-archive.pitt.edu/13303...
Other than commenting on the video, you are very welcome to fill in a Google form linked below, which helps me make better videos by catering to your math level:
forms.gle/QJ29hocF9uQAyZyH6
If you want to know more interesting Mathematics, stay tuned for the next video!
SUBSCRIBE and see you in the next video!
If you are wondering how I made all these videos: even though the style is similar to 3Blue1Brown's, I don't use his animation engine Manim. I will probably reveal how I did it at a potential subscriber milestone, so do subscribe!
Social media:
Facebook: / mathemaniacyt
Instagram: / _mathemaniac_
Twitter: / mathemaniacyt
Patreon: / mathemaniac (support if you want to and can afford to!)
Merch: mathemaniac.myspreadshop.co.uk
Ko-fi: ko-fi.com/mathemaniac [for one-time support]
For my contact email, check my About page on a PC.
See you next time!

Comments: 897
@mathemaniac 1 year ago
Go to nordvpn.com/mathemaniac to get the two-year plan with an exclusive deal PLUS 4 months free. It's risk-free with NordVPN's 30-day money-back guarantee! Please sign up because it really helps the channel! [My pinned comment gets removed by YouTube AGAIN!!!]
@JCResDoc94 1 year ago
bc everything is related, eventually. in the oneness of God. right? _JC
@andsalomoni 1 year ago
This paradox should mean that you can't have 3 or more independent distributions. The maximum is 2.
@qkktech 11 months ago
There is a better estimator when you do a Fourier transformation and go single-dimension on the system
@terrywilder9 9 months ago
@@andsalomoni That doesn't work! Any three elements of a functional basis are independent. That's why when you are making a maximum likelihood estimate you are assuming a distribution also.
@ludomine7746 1 year ago
This is insane. The demonstration with the points in 3D and 2D space not only made it clear why it works, but also made it clear why it doesn't work as well in 2D. Going from the paradox being magic to somewhat understandable is beautiful. I loved this video.
@mathemaniac 1 year ago
Thanks for the kind words!
@mrbutish 1 year ago
Also, when I use MSE and LME with the ordinary estimator, I PCA the n dimensions into 2D so that this situation never arises and MSE is effective and dominates. Instead of PCA, LDA or SVM also works. If no PCA, go RMSProp + momentum; Adam does well/dominates
@arnoldsander4600 1 year ago
@@mrbutish I hoped for a similar moment but the accent really hurt my brain. Couldn't concentrate on anything but the pronunciation of "estematourr". Darn my brain.
@user-jb8yv 9 months ago
@@arnoldsander4600 not even a strong accent
@john-ic5pz 9 months ago
@@arnoldsander4600 I like the way he says "sure". 😊
@marshallc6215 1 year ago
For a layman, I think the worry after first seeing this explained (given the *very* fast hand-waving with the errors at the beginning) is that you might suddenly be able to estimate something better by adding your own random data to the question, which, by definition, makes the three data points not independent. The thing is, and I'm surprised you never clarified this, we aren't talking about a better estimation for any given distribution. We're talking about the best estimator for *all three* distributions as a collective. We're no longer asking 3 questions about 3 independent data sets, but 1 question about 1 data set containing 3 independent series. There is no paradox here, because it is pure numbers gamesmanship and is no longer the intuitive problem we asked at the beginning. When we went to multiple data sets, the phrasing of the question is the same, but the semantic meaning changes.
@Achrononmaster 1 year ago
That is a good summary. One should not geometrize independent variables. A similar "paradox" occurs in uncertainties for complex numbers, so even the 2D case. If you Monte Carlo sample the modulus of a random z ϵ ℂ with 2D Z-dist centered at 0 + 0i the average |z| is something like a random walk distance, sqrt(N). But if you sample real and imaginary parts, average them, then compute the mean z then take |·| of that, it'll converge to zero as N → ∞. The average |z| ∼ sqrt(N), but the average z = 0 + 0i.
@guillaumecharrier7269 1 year ago
Well put - I think this would have deserved at least a sentence or two in the video.
@sender1496 1 year ago
I think the only thing that might need clarifying is the definition of "better". Still though, I think the video made it clear that this estimator won't be better on average for the individual collections, but rather for this new cost function which adds the individual costs collectively. You're right however that it gets hard to phrase it as three independent questions, because they would be like: "Find the estimator f(x1, x2, x3) that minimizes the cost", when said "cost" would also involve the other collections.
@xyzbesixdouze 1 year ago
If you include your own random set to get beyond 2 dimensions, then those fake data, with their influence on the mean error, will take over, so there is no meaningful conclusion about the original sets. On the other hand, if you just duplicate a set 3 times to go from 1D to 3D, then you didn't introduce other data, and you still get another mean, while the original mean is proven to be the best?
@sender1496 1 year ago
@@xyzbesixdouze But duplicating the set wouldn't generate a new independent set, would it? There would be correlation. This changes the distribution completely (won't be circles/spheres/etc. around the mean point), meaning that the justification for the James-Stein estimator won't work.
@Achrononmaster 1 year ago
Lesson: One should not geometrize independent variables. A similar "paradox" occurs in uncertainties for complex numbers, so even the 2D case. If you Monte Carlo sample the modulus of a random z ϵ ℂ with 2D Z-dist centred at 0 + 0i the average |z| is something like sqrt(π)/2 (Rayleigh distribution). But if you sample real and imaginary parts, average them, then compute the mean z then take |·| of that, it'll converge to zero as N → ∞. The average |z| ∼ sqrt(π)/2, but the average z = 0 + 0i.
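The two averages in this comment are easy to reproduce. A small sketch of my own, assuming the standard complex normal convention (real and imaginary parts each with variance 1/2, which is what makes E|z| = sqrt(π)/2):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Standard complex normal: real and imaginary parts ~ N(0, 1/2),
# so |z| is Rayleigh-distributed with mean sqrt(pi)/2 ≈ 0.886.
z = rng.normal(0, np.sqrt(0.5), n) + 1j * rng.normal(0, np.sqrt(0.5), n)

mean_abs = np.abs(z).mean()  # average of the moduli
abs_mean = np.abs(z.mean())  # modulus of the average

print(f"E|z|  ≈ {mean_abs:.3f}")  # ≈ sqrt(pi)/2 ≈ 0.886
print(f"|E z| ≈ {abs_mean:.4f}")  # → 0 as n grows
```

The order of "take modulus" and "take mean" does not commute, which is the point being made.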
@roromaniac8 9 months ago
What is this “paradox” called?
@cubing7276 8 months ago
They don't feel the same tbh; I think a more similar comparison would be to compute the average distance traveled in the real and imaginary components and then add them up
@SirGisebert 1 year ago
The bias-variance decomposition is part of my PhD thesis and I just gotta say your visualizations and explanations are very clean and intuitive. Good job!
@mathemaniac 1 year ago
Wow, thank you!
@FirdausIsmail1 1 year ago
This presentation is PhD level and beyond! So clear and easily digestible
@dukeingreen7980 1 year ago
I am glad it is still of relevance. It was one key element of my doctoral dissertation 30 years ago, even if I did not fully understand the relevance at that point. Best wishes for your career if you are young, and thank you for sharing.
@maxwornowizki422 1 year ago
Another great real-life visualization of the concept is the following: Imagine two people playing darts. One of them hits all parts of the dartboard more or less symmetrically. They are on average in the middle, but each individual arrow might land close to the edges of the board. This is low (or even zero) bias but high variance. The other player's arrows always land very close to each other, but they don't center around the bullseye. The person is very focused and consistent, but can't get around the systematic misjudgement of the bullseye's position. Still, if they are close enough, they might win the majority of matches.
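The darts analogy maps directly onto the decomposition MSE = bias² + variance. A quick sketch of my own (the two players' numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
bullseye = 0.0  # 1-D stand-in for the centre of the board

# Player A: unbiased but wild; Player B: biased but very consistent.
a = rng.normal(loc=0.0, scale=2.0, size=n)  # zero bias, high variance
b = rng.normal(loc=0.5, scale=0.5, size=n)  # systematic offset, low variance

def score(hits):
    """Return (bias^2, variance, MSE) of the hits around the bullseye."""
    bias = hits.mean() - bullseye
    return bias ** 2, hits.var(), np.mean((hits - bullseye) ** 2)

bias2_a, var_a, mse_a = score(a)
bias2_b, var_b, mse_b = score(b)
print(f"A: bias²+var = {bias2_a + var_a:.3f}, MSE = {mse_a:.3f}")
print(f"B: bias²+var = {bias2_b + var_b:.3f}, MSE = {mse_b:.3f}")
# The biased-but-consistent player B ends up with the lower MSE.
```

The decomposition holds exactly for the sample moments, and the biased-but-consistent player wins on mean squared error, just as the comment describes.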
@brendawilliams8062 11 months ago
I am not a PhD. I would divide 7408 by 3. Then I would take 2469333…. And the square root is very close to pi. If you times it by two. That’s why the denominator I will do best with the largest no. You are not avoiding crystals.
@logician1234 1 year ago
Does this paradox have any connection to the fact that a random walk in 1 or 2 dimensions almost surely returns, while in 3 or more dimensions it has a positive probability of never returning? The proof for this uses the normal distribution, but I may be terribly wrong lol
@mathemaniac 1 year ago
Have you seen my idea list? (I mean I did post it on Patreon) Yes, there is a connection! But the next video is just about the random walk itself (without using normal distribution / central limit theorem), because the connection is explored in a very involved paper by Brown: projecteuclid.org/journals/annals-of-mathematical-statistics/volume-42/issue-3/Admissible-Estimators-Recurrent-Diffusions-and-Insoluble-Boundary-Value-Problems/10.1214/aoms/1177693318.full
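The recurrence/transience contrast behind that connection is easy to see in simulation. A rough sketch of my own (the walk counts and step caps are arbitrary): simple random walks in 1D revisit the origin almost surely, while in 3D only about a third do (Pólya's constant, ≈ 0.34).

```python
import numpy as np

rng = np.random.default_rng(3)

def return_fraction(dim, walks=2000, steps=500):
    """Fraction of simple random walks that revisit the origin within `steps`."""
    returned = 0
    for _ in range(walks):
        # Each step moves ±1 along one uniformly chosen axis.
        moves = np.zeros((steps, dim), dtype=int)
        moves[np.arange(steps), rng.integers(0, dim, steps)] = rng.choice([-1, 1], steps)
        path = moves.cumsum(axis=0)  # positions after each step
        returned += bool((path == 0).all(axis=1).any())
    return returned / walks

# Recurrent in 1D and 2D (fraction → 1 as steps grow); transient in 3D,
# where the true return probability is Pólya's constant ≈ 0.34.
print(return_fraction(1), return_fraction(3))
```

With a finite step budget the 1D fraction is slightly below 1, but the gap between the 1D and 3D fractions is already unmistakable.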
@logician1234 1 year ago
Cool, I haven't seen your list; I don't use Patreon. Can't wait for the next video
@leif1075 1 year ago
@@mathemaniac any tips on how to pay attention and stay interested and focused in statistics especially when it gets sso looonng and tedious??
@enbyarchmage 1 year ago
@@leif1075 As someone with ADHD, I know very well how long and tedious lectures can make focusing literally impossible. Thus, I've given myself the liberty to give you a tip: try doing most of your research using resources that actually make the subject seem interesting to you. There surely are books that can teach even advanced college-level Statistics in simultaneously accessible and rigorous ways.
@leif1075 1 year ago
@@mathemaniac why is p there in p minus 2... you didn't mention that at all
@CampingAvocado 1 year ago
The fact that I'm not particularly interested in statistics and also on my only 3 weeks of holidays from my maths-centric studies, yet I still was really excited to watch this video speaks for its quality. Thank you again for the amazing free content you provide to everyone!!
@mathemaniac 1 year ago
Thanks for the kind words!
@peterlustig2048 1 year ago
ETH student?
@CampingAvocado 1 year ago
@@peterlustig2048 indeed
@peterlustig2048 1 year ago
@@CampingAvocado Can't wait to finally complete my master's; I've had so little free time the last few years...
@CampingAvocado 1 year ago
@@peterlustig2048 Congrats on your soon-to-be-acquired freedom then :)
@abdulmasaiev9024 1 year ago
This is very good. The only notes I have for how it might be improved are:
1. Make it clearer that when we have the 3 data points early in the video, we know from which distribution each of them comes, rather than just having 3 numbers. So, we know that we have, say, 3 generated from X_1, 9 generated from X_2 and 4 generated from X_3, rather than knowing that there's X_1, X_2 and X_3, each generated a number, and the set of numbers generated is 3, 9, 4, but having no idea which comes from which. It can be sort of inferred from them ending up in a vector, but still.
2. "Near end" vs "far end": the near end being finite vs the far end being infinite is a bit ehh as a point. It invites the thought of "well, who cares how big the effect is in the finite area or how small it is in the infinitely large area; there will be more total shift in the latter anyway - it's infinite after all!". What matters is the probability mass of each of those areas (and its distribution and what happens to it), and that's finite either way.
Other than that, excellent video. Nice and clear for some relatively high level concepts.
@tanvach 1 year ago
I think the reason shrinkage isn't widely discussed is that choosing MSE as a metric for goodness of parameter estimation is an arbitrary choice. It makes sense that introducing this metric would couple the individual estimations together, so it's not really a paradox (in hindsight). In some sense, you want to see how well the model works, not how accurate the parameters are, since a model is usually too simplistic. But I do see this used in econometrics. I think I'm seeing more of the L1 norm used in deep learning as the regularizer; I wonder what form of shrinkage factor that will have?
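On the closing question: the L1 penalty doesn't give a James-Stein-style multiplicative shrinkage factor but soft-thresholding, which moves every coordinate toward zero by a constant and snaps small ones to exactly zero. This is the standard lasso/proximal-operator result, not something from the video; the numbers below are just an illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """Minimiser over t of 0.5*(x - t)**2 + lam*|t|: the L1 'shrinkage'."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-3.0, -0.5, 0.2, 2.0])
print(soft_threshold(x, 1.0))  # small entries collapse to zero; large ones move toward 0 by lam
```

Unlike the James-Stein factor, the shift here is additive rather than proportional, which is why the L1 regularizer produces exact zeros (sparsity) while James-Stein never does.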
@eugeybear 1 year ago
I was wondering the same thing. The paradox seems to arise from the fact that our error is calculated using an L2 metric, but the two coordinates are being treated independently. Aside from wondering how using an L1 norm would affect this, I was also thinking that, rather than using two independent normal distributions, whether this paradox would still exist if we used a 2-dimensional Gaussian distribution. Because in this case, all points with the same distance from the center would now all have the same probability, which wouldn't be true using two independent normal distributions.
@nodrance 9 months ago
I was thinking the same thing. This isn't a better estimation, this is a trick that takes advantage of how we measure things.
@ChatSceptique 1 year ago
I'm a PhD in statistics, never heard of that one before. It's really cool, thanks for sharing
@themazeisntmeantforyou4284 1 year ago
Bullshit
@dcterr1 1 year ago
I'm not all that familiar with advanced statistics, but I was pretty blown away by this paradox when you first presented it! However, once you started explaining how we normally throw out outliers in any case, it began to make a lot more sense. Good video!
@ej3281 1 year ago
this was really good, thank you! I used to work in a machine learning/DSP shop and did a lot of reading about estimators but I'm not sure I ever fully understood until I saw this video.
@djtwo2 1 year ago
The relevance of "mean square error" here can be considered in the context of the probability distribution of the error. Here "error" means total squared error as in the video. Applying the shrinkage moves part of the range of errors towards zero, while moving a small part of the range away from zero. The "mean squared error" measure doesn't care enough about the few possibly extremely large errors resulting from being moved away from zero to counterbalance the apparent benefit from moving some errors towards zero. But any other measure of goodness of estimation has the same problem. There are approaches to ranking estimation methods (other ways of defining "dominance") that are based on the whole of the probability distribution of errors, not just a summary statistic. This is similar to the idea of "uniformly most powerful" for significance tests. The practical worry here is that a revised estimation formula can occasionally produce extremely poor estimates, as is illustrated in the slides in this video.
@frankjohnson123 1 year ago
Statistics seems to shun elegance for practicality more than most branches of mathematics. The ordinary estimator is clean and intuitive while the James-Stein one is like a machine held together by duct tape, yet the latter works better in many cases.
@Wence42 1 year ago
I feel like you might be missing out on something if the James-Stein Estimator doesn't seem elegant by the end of this video. I would say this formula is more transparent in terms of what it does and why it works than most of the stuff we memorize in algebra. It is entirely possible I'm the weird one for looking at this and thinking "yeah, that looks like the right way." Different brains understand things in different ways.
@matthewliu1800 1 year ago
No, the James-Stein estimator is biased and practically useless. Note that it doesn't matter which point you shrink towards; it will lower the error. That by itself should tell you how ridiculous this is. What we are truly looking for is the minimum-variance unbiased estimator. That is the definition of the "best" estimator. All this video shows is that MSE is insufficient to determine the best estimator. There are biased estimators with less MSE than unbiased ones.
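The "shrink toward any point" claim in this comment does hold (for 3 or more dimensions) and is easy to check numerically. A sketch of my own; the true means and the shrink target are arbitrary made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
p, trials = 10, 100_000
theta = rng.normal(0, 2, p)  # arbitrary true means (made up)
v = rng.normal(0, 2, p)      # arbitrary fixed point to shrink toward

x = theta + rng.standard_normal((trials, p))  # one noisy observation per trial

# James-Stein centred at v: shrink the observation toward v instead of 0.
d = x - v
js_v = v + (1 - (p - 2) / np.sum(d ** 2, axis=1, keepdims=True)) * d

mse_ord = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js_v - theta) ** 2, axis=1))
print(mse_ord, mse_js)  # mse_js lands below mse_ord ≈ p, whatever v is
```

The improvement shrinks as v moves further from the true means, but in expectation it never becomes negative, which is exactly what the dominance result (and this commenter's objection) is about.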
@extagram 1 year ago
@@matthewliu1800 Really reminded me of Goodhart's law here: "When a measure becomes a target, it ceases to be a good measure." The James-Stein estimator chases the target of being the "best" estimator, which resulted in the failure of this "best" estimator.
@panner11 10 months ago
@@matthewliu1800 Of course the James-Stein estimator is very rough and rudimentary, but the point of the video is how it served as inspiration for the idea of the bias-variance tradeoff. So back to the point of elegance vs practicality. A minimum-variance unbiased estimator might be what you are "looking" for, but in reality that is just a conceptual dream. The bias-variance tradeoff, and how it's widely used in real-world machine learning applications for regularization, is the practical part that can't be dismissed and is already applied everywhere.
@scraps7624 1 year ago
This is a masterclass in how to teach statistics, absolutely incredible work. Scripting, visualization, pacing, everything was on point
@mathemaniac 1 year ago
Glad you enjoyed it!
@amphicorp4725 1 year ago
I kept forgetting that the distributions were unrelated and every time I remembered, it blew my mind. Absolutely fantastic video
@stevepittman3770 1 year ago
I have to admit that as someone not very familiar with statistics I was starting to get lost until you got to the 2D vs 3D visualization and I immediately grasped what was going on. That was an excellent way to explain it, and reminded me a lot of 3blue1brown's visual maths videos.
@mathemaniac 1 year ago
Thank you!
@nikolasscholz7983 1 year ago
The paradox stopped feeling paradoxical to me as soon as I realised that it all comes from adding all the errors together with equal weights. That already assumes that the estimated values are all on the same scale - are worth the same. There are not a lot more steps from there to assuming all the samples estimate the same value. We could, for example, have had one estimated value being on the order of 10^24 and the other around 10^-24, and one would clearly decide against just adding the estimation errors together like one does here.
@vishesh0512 1 year ago
The variance from the mean is the same for all (1). So even if one mean is 10^24, the samples you collect will most likely be within +/- 1. And similarly the 10^-24 guy will still give you samples in 10^-24 +/- 1
@vishesh0512 1 year ago
The reason the Stein guy performs better is that the error is sum of 3 things. And there is a way to adjust your "estimator" so that it isn't the best for any one of the 3 variables, but the total is still less.
@nikolasscholz7983 1 year ago
@@vishesh0512 oh yeah, you're right, I forgot the fact that the variance of each is 1. Thank you, your explanation is better. That does make the JS estimator pretty powerful though. Even though one could think of other ways of combining the errors other than summing, summing seems to be the very obvious choice.
@vinny5004 1 year ago
Yes. The OP kept saying "completely independent distributions," but that is an inaccurate description of the problem. A vector in n dimensions is a single object, not the same as n separate distributions on n axes. The latter has nothing to do with Stein's paradox, and actually the way this video begins is incorrect: as presented, it does have an answer, namely the naive estimates.
@vinny5004 1 year ago
In fact, one can even read on Wikipedia: “In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.” For a 21+ min video, you would think the author would at least spend the effort to accurately present the problem at the beginning.
@mingliangang8221 1 year ago
It is pretty awesome that you're covering one of the most counterintuitive examples in statistics. This example motivates many exciting ideas in modern statistics, like empirical Bayes. Keep up the good work.
@mathemaniac 1 year ago
Originally Stein's paradox was just a bit of a footnote in my class in statistics, but when I dived a little bit deeper into it, it is actually a much bigger deal than I first thought, so I decided to share it here!
@mingliangang8221 1 year ago
@@mathemaniac Yup, it is. Maybe next time you can cover something else from Stein as well, like Stein's identity, which is a pretty powerful tool for proving the central limit theorem and its generalisations. Sadly, there aren't many videos explaining it to a wider audience except to other graduate students.
@randyzeitman1354 1 year ago
I’m a layman, but this doesn’t seem counterintuitive, because the distributions are the same. So what if they’re unrelated… they share the same reality. Are you surprised that mass is measured the same way for a rock or water? It’s simply recursive… the more data sets you have, the more likely one of the points will be close to the center. It’s a weighted distribution of a normal distribution.
@mingliangang8221 1 year ago
@@randyzeitman1354 I am not entirely sure what you mean by "sharing the same reality" and the "weighted distribution of a normal distribution". However, this estimator would work when x_1, x_2, x_3 come from different datasets: for example, X_1 can be from a dataset of the heights of buildings, X_2 can be from a dataset of the average lifetime of a fly, and X_3 can be from a dataset of the number of times a cat meows. If we want to find the average of each of these datasets, it turns out it is better to use the James-Stein estimator than to take the average of each of these things. That is what makes it counterintuitive for me. I would like to hear your intuition though.
@ssvis2 1 year ago
This is a great explanation of estimators and non-intuitive relations. I like that you highlighted its importance in machine learning. It would be worth doing another video about how the variance/bias relation and subsequent weightings adjustments affect those models, especially in the context of overfitting.
@mathemaniac 1 year ago
Will have to think about how to do it though... thanks for the suggestion.
@jadegrace1312 1 year ago
I don't think you did a very good job in the introduction of giving motivation for why it would even be possible to find a better estimator than our naive guess. As the video went on it made sense, but at the beginning when you were introducing the concept of multiple independent distributions, I wish you had included a line like "we are trying to find the best estimator overall for the system of three independent distributions, which may not be the same as the best estimator for each independent distribution".
@mathemaniac 1 year ago
Thanks for the feedback! I did initially want to include this into the script but eventually decided against it. This is because when I first read about Stein's paradox, and that it is because of reducing the overall error rather than individual errors, I just moved on, because I immediately felt the paradox is resolved. But when I read about James-Stein estimator again (because of the connection with the next video), I realised it was a much bigger deal than I thought it would be, like the idea of shrinkage and bias-variance tradeoff. In my opinion, this would be a much, much more important concept. In other words, if I said the line that you suggested, in the beginning of the video, my past self just would not continue to learn the much more important lessons later on in the video. So perhaps if given the second chance, I could have said it at the end of the video, but I would still not put this in the beginning.
@afterthesmash 1 year ago
@@mathemaniac Ah, but you must also know that burying the lead for tactical reasons is a very dangerous game. My formal math education predates Moses, but I think I still have good instincts, most of the time. In my own writing practice I often take wildly unconventional paths, to help break people out of established cognitive grooves. It's a useful posture, and sometimes it's not bad to inform the process from an introspective stance on _your own_ foibles and aversions. But you also have to be as honest as possible up front, and not go "hey, surprise, bias!" in the third act, when the gun was already smoking at the first rise of the curtain. Surely there's only one possible unbiased estimator for a symmetric distribution. You know, that first screen you introduced. Which way would you deviate? It's symmetric, you can't choose. Having but one unbiased estimator on the store shelf, if you have no bias tolerance, you are done, done, done in the first act. This was making me scream inside for the first ten minutes. And then if you go on to show that least squares estimation steers you into a biased estimator, what you _ought_ to conclude is that least squares (as applied here) is _totally inappropriate_ for use in regimes with zero bias tolerance. Which is an interesting result on its own terms. Furthermore, I had a lot of trouble with the starting point where you know the variance for certain, but you're scrabbling away with one data point to estimate the mean. Variance is the higher moment, which means we are operating in a moment inversion (like a temperature inversion over Los Archangeles), where our certitude in higher moments precedes our certitude in lower moments, which is pretty weird in real life. 
So I mentally filed this as follows: in an Escherian landscape where you know your higher order moments before your lower order moments (weird), then sometimes grabbing for least squares error estimation by knee-jerk habit will either A) lead you badly astray (zero bias tolerance); or B) lead you to a surprising glade in the promised land (you managed to pawn some bias tolerance for a dominating error estimator). I admire your thought process to take a motivated, pedagogical excursion. But failing to state that the naive estimator is the only possible unbiased estimator at first opportunity merely opened you up to a different scream from a different bridge. Because this whole thing was The Scream for me for the first ten minutes. So then your early segue is "but look at the surprising result you might obtain if you relax your knee-jerk fetish for zero bias" and _then_ I would have settled in to enjoy the ride, exactly as you steered it.
@afterthesmash 1 year ago
@@mathemaniac I had to get that first point out of my system, before I could gather my thoughts about the other aspect of this that was driving me nuts. It was pretty clear to me from early on that if your combined least squares estimator imposed a Euclidean metric, that you could win the battle on the kind of volumetric consideration we ended up with. I'm am _totally_ schooled on the volumetric paradox of high-dimensional spaces (e.g. all random pairs of points, on average, become equidistant in the limit; I usually visualize this as vertices of discrete hypercubes, with distance determined by bit vector difference counting - it's my view of continuous mathematics that has degraded greatly since the time of Moses). But then I had a minor additional scream: why should our combined estimator be allowed to impose a Euclidean metric on this problem space? When did this arranged marriage with Euclid first transpire, and why wasn't I notified? Did Gauss himself ever apply least squares with a Euclidean overlay informed by independent free parameters? It seems to me that if you just have many instances of the same thing with a _shared_ free parameter, and complete indifference about where your error falls, this amounts to an obvious heuristic, without much need for additional justification. But then when you have independent free parameters, the unexpected arrival of a Euclidean metric space needs to be thoroughly frisked at first contact, like Miracle Max, before entering Thunderdome, to possibly revive the losing contestant. Tina Turner: "True Love". You heard him? You could not ask for a more noble cause than that. Miracle Max: What's love got to do with it? But in any case that’s not what he said-he distinctly said “To blave”- Valerie: Liar! Liar! Liar! Miracle Max: And besides, my impetuous harridan, he was worked over by a chainsaw strung from a bungee cord, and now most of his body is scattered around like pink wedding confetti. Valerie: Ah, shucks.
@afterthesmash 1 year ago
@@mathemaniac Final comment, sorry for the many fragments.
1) You're willing to sell bias up the river (but only for a good price);
2) you're in an Escherian problem domain where a higher-order moment is fixed in stone by some magic incantation (e.g. Excalibur) while a lower-order moment is anybody's guess;
3) you don't find it odd that your aggregated error function imposes a Euclidean metric space;
then 4) you arrive at this weird, counterintuitive, nay, positively _paradoxical_ result. But, actually, for me, by the time I've swallowed all three numbered swords, any lingering whiff of paradox has left the building with all limbs normally attached.
@mathemaniac 1 year ago
@@afterthesmash Re: the variance point. If you use a lot of data points to estimate the mean for each distribution, then you will still be able to obtain an estimation of variance, and use that to construct the (modified) James-Stein estimator, and it will still dominate the ordinary estimator. More details on the Wikipedia page for James-Stein estimator.
@asdf56790
@asdf56790 Жыл бұрын
What a great video! For me you perfectly hit the pace: I was never bored, but I also never needed to rewatch sections because they went too fast. This is one of those beautiful paradoxes which you can't believe if you haven't seen the explanation.
@mathemaniac
@mathemaniac Жыл бұрын
Glad you enjoyed it!
@adrienadrien5940
@adrienadrien5940 Жыл бұрын
All this paradox comes from trying to minimize the squared errors. Squared errors are used mostly because they are easy to compute for most classical statistical distributions and they fit well with most minimization algorithms. But in the real world, in many cases, one will be more interested in the average absolute error instead of the squared error. I think the "paradox" is there: we are using an arbitrary metric, and we never question it. When I used to be a quantitative analyst I often used the absolute value instead of the square for error minimization, and I found the results much more relevant, despite some slight difficulty in running certain algorithms.
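The point that squared and absolute error are genuinely different targets can be made concrete with a tiny sketch (illustrative data only): over a skewed sample, the constant prediction minimizing mean squared error is the sample mean, while the one minimizing mean absolute error is the sample median.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_exponential(1000)    # a skewed sample, purely illustrative

grid = np.linspace(0.0, 3.0, 3001)       # candidate constant predictions
mse = np.array([np.mean((data - c)**2) for c in grid])
mae = np.array([np.mean(np.abs(data - c)) for c in grid])

best_mse = grid[mse.argmin()]            # lands on the sample mean
best_mae = grid[mae.argmin()]            # lands on the sample median, smaller here
```

For skewed data the two optima differ noticeably, which is why the choice of error metric is a modelling decision, not a formality.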
@ahmad_asep
@ahmad_asep Жыл бұрын
Nice video! I have studied machine learning since 2014; I have heard the term "bias-variance tradeoff" multiple times, and only now do I understand it. Thank you so much for the explanation.
@dananskidolf
@dananskidolf Жыл бұрын
The way hypervolumes have such dense neighbourhoods seems to be very interesting and useful in many places - I suspected it'd be involved as soon as you mentioned 'in 3 or more dimensions'. And that stems from a little personal experience I had. I was working on a quality optimisation computation in 32 dimensions a while ago and opted to use a simulated annealing algorithm, on a hunch that stochastic algorithms would scale best in this higher number of dimensions. I had to laugh when trying to figure out a sensible distance function (used to govern how far the sample picker would jump in an iteration). We had felt overwhelmed by the size of the sample space since the start, but I began to realise that all these trillions of coordinates were in fact within only a few nearest neighbours of each other.
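The "everything becomes a near neighbour" effect described above is quick to see numerically: for random pairs of points in a unit hypercube (dimensions picked arbitrarily here), the relative spread of pairwise distances collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(2)
ratios = []
for d in (2, 32, 1024):                      # arbitrary dimensions for illustration
    a = rng.random((2000, d))                # random points in the unit hypercube
    b = rng.random((2000, d))
    dist = np.linalg.norm(a - b, axis=1)
    ratios.append(dist.std() / dist.mean())  # coefficient of variation of distances
```

The ratio drops from roughly one half in 2D to a few percent in 1024D: distances still grow, but they all grow to nearly the same value.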
@anibalismaelfermandois6943
@anibalismaelfermandois6943 Жыл бұрын
Really great video, incredibly well paced. The question that occurred to me is: are we just abusing the definition of mean squared error past its useful/intended use? Are we sure that lowering it is ALWAYS desirable?
@jsupim1
@jsupim1 Жыл бұрын
Good point. I think it's pointless to minimize the MSE if the estimator you are using is biased (as the James-Stein estimator is).
@chrislankford7939
@chrislankford7939 Жыл бұрын
@@jsupim1 This is a really naive thought that, sadly, pervades much of even professional science. While I can see your thinking on this in the context of a "broad-use" estimator like James-Stein--I disagree, but I see it--this thought simply falls apart when applied to a more nuanced scenario. Imagine a situation where you want to use relatively little data to infer something about a highly complex system. Say, data from an MRI to infer something about brain vasculature. There are dozens upon dozens of parameters that might affect even the simplest model of blood flow in the brain: vessel size distributions, arterial/venous blood pressure, blood viscosity, body temperature, and mental and physical activity levels. If you leave all of those as fitted, unbiased parameters, you do not have enough information to solve the inverse problem and retrieve your answer. (For the sake of argument, let's say average vessel size is what you're interested in.) So the unbiased estimator totally fails, as the mse is many times larger than the parameters. Now open up the idea of parametric constraint, a special case of the broader "regularization" described in this video. Let's say you measure blood pressure before someone enters the scanner, use 37C for temperature, go to literature to find the average blood viscosity, and assume all vessels are one unknown size in a small region. None of these will be _exactly accurate_ to the patient during the scan. What you've done is created a biased estimator that might just be able to work out the one thing you're interested in: average vessel size. Unless your guesses are very, very wrong, it will almost certainly have a lower vessel size mse than the unbiased estimator.
@phatrickmoore
@phatrickmoore Жыл бұрын
Thank you, this is exactly how I feel. The moment MSE leads us to use information from non-correlated, independent distributions to make deductions about the one under focus, MSE is wrong. That needs to be an axiom of statistics or something. Valid error systems cannot have dominant approximators that use info from outside, non-correlated systems.
@phatrickmoore
@phatrickmoore Жыл бұрын
@@chrislankford7939 all of those distributions will be correlated, so your example doesn’t apply.
@simongunkel7457
@simongunkel7457 Жыл бұрын
@@phatrickmoore I think your intuition leads you astray, just consider genetic algorithms for optimization problems. These can often outperform any deterministic approach, even though they use stochasticity (hence random variables drawn from distributions that are independent from the optimization problem).
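The vasculature example in this thread can be caricatured in a toy linear model (all numbers invented for illustration): with only three noisy points, fitting both intercept and slope is unbiased but noisy, while pinning the slope to a slightly wrong "literature value" biases the intercept estimate yet cuts its mean squared error.

```python
import numpy as np

rng = np.random.default_rng(5)
a_true, b_true = 2.0, 1.0      # parameter of interest and nuisance slope (invented)
b_guess = 1.1                  # slightly wrong "literature value" (invented)
x = np.array([0.0, 0.5, 1.0])  # only three design points

errs_free, errs_fixed = [], []
for _ in range(5000):
    y = a_true + b_true * x + rng.normal(size=3)
    a_free = np.polyfit(x, y, 1)[1]        # intercept with both parameters fitted (unbiased)
    a_fixed = np.mean(y - b_guess * x)     # intercept with the slope pinned (biased)
    errs_free.append((a_free - a_true)**2)
    errs_fixed.append((a_fixed - a_true)**2)

mse_free = np.mean(errs_free)    # about 0.83 in theory for this design
mse_fixed = np.mean(errs_fixed)  # about 0.34: a little bias buys a lot less variance
```

Unless the pinned value is very wrong, the constrained (biased) estimator wins on MSE, which is the thread's point in miniature.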
@GeorgeZoto
@GeorgeZoto 9 ай бұрын
Excellent content, research, pace and presentation. Thank you for putting this together and explaining it in simpler terms than the paper :)
@JamesSCavenaugh
@JamesSCavenaugh Жыл бұрын
This was my first time to encounter Mathemaniac, and I was impressed with this video. Good job!
@mathemaniac
@mathemaniac Жыл бұрын
Thank you so much!
@henriquemagalhaessoares8739
@henriquemagalhaessoares8739 Жыл бұрын
I've been using regularization on a daily basis, and this is the best explanation I've ever seen of why shrinkage might be desirable. Bravo.
@mathemaniac
@mathemaniac Жыл бұрын
Great to hear!
@switen
@switen 5 ай бұрын
As a male who swims in cold water, I agree.
@xorenpetrosyan2879
@xorenpetrosyan2879 Жыл бұрын
such a cool video, I am a Machine Learning engineer and use regularisation techniques like shrinkage daily, yet I didn't know its origins were rooted in a paradox!
@mathemaniac
@mathemaniac Жыл бұрын
Great to hear!
@klausstock8020
@klausstock8020 Жыл бұрын
Never did anything like "shrinkage", and didn't get how all of this connects with machine learning. Until 45 seconds before the end, when suddenly all the pieces connected and I realized that I had been using shrinkage. And that the five-dimensional data in the database (which gets aggregated into four-dimensional data, which is then fed into the ML algorithm as a two-dimensional field) actually consists of 50,000-dimensional vectors. Ah, yes, the happy blissfully unaware life of an engineer! Anecdotal evidence: A group of engineers and a group of mathematicians meet on a train, both travelling to a congress. The engineers are surprised to learn that the mathematicians only bought one ticket for the whole group of mathematicians, but the mathematicians won't explain. Suddenly, one mathematician yells "conductor!". All mathematicians run to the toilet and cram themselves into the tiny room before locking the door. The conductor appears, checks the tickets of the engineers and then goes to the toilet, knocks at the door and says "ticket, please!". The mathematicians slide their single ticket under the door to the conductor, and the conductor leaves, satisfied. When the mathematicians return to the group of engineers, the engineers compliment the mathematicians on their method and say that they will use it themselves on the return trip. On the return trip, the engineers arrive with their single ticket, but are surprised to learn that the mathematicians had bought no ticket at all this time. Suddenly, one mathematician yells "conductor!". All engineers run to the toilet and cram themselves into the tiny room before locking the door. One mathematician walks to the toilet, knocks at the door and says "ticket, please!". TL;DR version: the engineers use the methods of the mathematicians, but they don't understand them.
@newerstillimproved
@newerstillimproved Жыл бұрын
@@klausstock8020 This joke made the video all the more worthwhile.
@TUMENG-TSUNGF
@TUMENG-TSUNGF Жыл бұрын
@@klausstock8020 Good story! I had thought the mathematicians would cram into the same bathroom with the engineers, but the actual ending was even more brilliant!
@TRex-fu7bt
@TRex-fu7bt Жыл бұрын
Ooh I use a lot of smoothing/shrinkage stats models and have seen the JS estimator a few times mentioned in my reference books. Excited to see cool video about it.
@TRex-fu7bt
@TRex-fu7bt Жыл бұрын
The original baseball example (that you link to in the description) is still really good. The players’ batting averages are independent and a player’s past performance should be the best predictor of their future performance but the shrinkage smooths some noise out.
@kel3747
@kel3747 9 ай бұрын
Currently studying ML and went over Thompson Sampling recently. This is a great video, as I immediately saw the similarities and was able to follow along even though I knew nothing about ML before I got started. Definitely subscribing.
@Ewuilibrium
@Ewuilibrium Жыл бұрын
Thanks for the video, I learned something new. I thought it was really interesting seeing the generalized formula for the MSE being derived from the variance formulas I learned in school, and the visualizations helped make the bias-variance tradeoff intuitive.
@fluffigverbimmelt
@fluffigverbimmelt Жыл бұрын
I found it a bit funny how statistics has recently become interesting (again) by way of machine learning. But hands down: great concept of two channels for "the engineer version" as well as the full details, and your general style of teaching. Very understandable, easy to grasp and intriguing. Subbed
@42isthemeaningoflife
@42isthemeaningoflife 10 ай бұрын
It was always interesting to us scientists and people who are interested in making empirical deductions. Transformer models aren't the only reason to be interested in statistics.
@rserserserse
@rserserserse Жыл бұрын
I saw a talk on this at my uni about a year ago. This paradox is so fascinating imo
@cmilkau
@cmilkau Жыл бұрын
The fact that this method treats the origin as special should already be a red flag that something is off. The only thing that can be off is the way we measure how "good" an estimator is. There are several options that seem equally valid. Why do we take the squared deviation? Why do we take the sum of the expected values? Why not the expected value of the Euclidean norm of the deviation? Or maybe we shouldn't take any squares at all?
@mathemaniac
@mathemaniac Жыл бұрын
It does not need to be the origin - you can equally shrink towards some other point (but pre-picked), James-Stein estimator still dominates the ordinary estimator. As to the mean squared error, I agree that this is somewhat arbitrary, but it is partly due to convenience - the calculations would be, normally, the easiest if we just take the squares; and without these calculations, we wouldn't be able to verify that James-Stein is indeed better. But if you adopt the view of Bayesian statistics, then mean squared error has a meaning there - by minimising it, you are taking the mean of the posterior distribution.
@djtwo2
@djtwo2 Жыл бұрын
The relevance of "mean square error" here can be considered in the context of the probability distribution of the error. Applying the shrinkage moves part of the range of errors towards zero, while moving a small part of the range away from zero. The "mean squared error" measure doesn't care enough about the few possibly extremely large errors resulting from being moved away from zero to counterbalance the apparent benefit from moving some errors towards zero. But any other measure of goodness of estimation has the same problem. There are approaches to ranking estimation methods (other ways of defining "dominance") that are based on the whole of the probability distribution of errors, not just a summary statistic. The practical worry here is that a revised estimation formula can occasionally produce extremely poor estimates, as is illustrated in the slides in this video.
@cmilkau
@cmilkau Жыл бұрын
@@djtwo2 That's what the video itself says. But there is no explanation given for that awkward quality metric over several dimensions. It's just a sum over each dimension without any further justification. Honestly, I would expect a norm on the higher-dimensional space on the bottom of the formula, then taking the expectation of the squares like in 1D. But that's not what's happening. I mean, the expectation value is a linear operator, so it may boil down to the Euclidean norm.
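Mathemaniac's point earlier in this thread, that the shrinkage target need not be the origin, is easy to check numerically: shrink toward any pre-picked point ν (here a hypothetical prior guess of 100 in every coordinate) and the dominance persists.

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 10, 20000
nu = np.full(p, 100.0)                 # pre-picked shrinkage target (hypothetical guess)
mu = nu + rng.normal(size=p)           # true means, nowhere near the origin

x = rng.normal(loc=mu, size=(trials, p))
d = x - nu                             # work in coordinates centred on the target
js_nu = nu + (1 - (p - 2) / np.sum(d**2, axis=1, keepdims=True)) * d

mse_mle = np.mean(np.sum((x - mu)**2, axis=1))
mse_js = np.mean(np.sum((js_nu - mu)**2, axis=1))   # smaller: dominance survives the shift
```

This is the spontaneous-symmetry-breaking picture: for every choice of ν there is a James-Stein estimator that beats the ordinary one, and the family as a whole has the translation symmetry no single member does.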
@robbielualhati1731
@robbielualhati1731 Жыл бұрын
Incredible video! I never fully understood why regularisation works especially with penalised regression but this video explains it very well.
@mathemaniac
@mathemaniac Жыл бұрын
Thank you!
@Anis_Hdd
@Anis_Hdd Жыл бұрын
I did my PhD on shrinkage estimators of a covariance matrix. This is the best vulgarization of Stein's paradox I have ever seen! Thanks
@toniokettner4821
@toniokettner4821 Жыл бұрын
people might read the word "vulgar" and assume you're negatively criticizing the video
@PunmasterSTP
@PunmasterSTP Жыл бұрын
This just blew my mind. I kept expecting to see some disclaimer come up that would relegate this paradox to purely an academic context. But dang, this concept is incredible!
@inothernews
@inothernews Жыл бұрын
As a graduate student who has pored over countless math explanation YouTube videos in the past years, this has to be one of the most beautiful! The writing, the story, the visuals, and the PACE --- all skillfully designed and executed. Definitely recommending this to my peers. Great fun to learn something new in this way. I appreciate your work greatly!
@mathemaniac
@mathemaniac Жыл бұрын
Thank you so much for the compliment! Really encouraging!
@fergalmdaly
@fergalmdaly Жыл бұрын
Also, don't forget that the mean squared error is an arbitrary definition of error, used mostly because squaring something makes it positive without making a huge mess of the algebra. It arguably has nothing to do with intuition, it puts far more weight on large errors than our intuition might. I feel like my intuition is closer to mean-absolute than mean squared. Would the JS-estimator or anything else be better if we used mean-absolute error?
@mathemaniac
@mathemaniac Жыл бұрын
The intuitive explanation given in this video does not really have anything to do with the exact form of error that we consider. It might not be the JS estimator, but some other shrinkage estimator might dominate the ordinary estimator, e.g. www.jstor.org/stable/2670307#metadata_info_tab_contents But as you noted, the algebra is going to be messy, and it will be very difficult to obtain a definitive answer, just empirical evidence.
@fergalmdaly
@fergalmdaly Жыл бұрын
@@mathemaniac Thanks. I could be missing it (there's a lot in there I cannot parse) but it's a bit unclear to me what they have found there, it doesn't seem to claim that it dominates in LAD error. They say "Finally, using stock return data, we present some empirical evidence that the combination estimators have the potential to improve out-of-sample prediction in terms of both mean squared error and mean absolute error." which seems like a much weaker claim. Anyway, thanks for your video, it was very interesting and well presented. Just LS-error has always bugged me, it was chosen for convenience, we should expect unintuitive results sometimes.
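In the spirit of the "empirical evidence" discussed in this thread, a quick simulation (at one particular, favourable true mean; no dominance claim intended) suggests the plain JS estimator can also beat the ordinary estimator in total absolute error:

```python
import numpy as np

rng = np.random.default_rng(4)
p, trials = 10, 20000
mu = 0.5 * rng.normal(size=p)          # true means fairly close to the origin (favourable case)

x = rng.normal(loc=mu, size=(trials, p))
js = (1 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)) * x

mae_mle = np.mean(np.sum(np.abs(x - mu), axis=1))
mae_js = np.mean(np.sum(np.abs(js - mu), axis=1))    # lower here, but only empirically
```

This says nothing about all possible means; it only illustrates that the shrinkage intuition is not tied to the squared-error metric in this particular configuration.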
@MDMAx
@MDMAx Жыл бұрын
Idk what I expected by watching it or why I watched it, given my nonexistent education in statistics. At least now I know that I don't understand yet another semi-complicated concept in this universe. Judging by the comments you did a decent job of explaining and visualizing this topic. Keep up the good effort!
@amaarquadri
@amaarquadri Жыл бұрын
This is one of the most counterintuitive things I've ever seen! Statistics is crazy.
@haritoshpatel4216
@haritoshpatel4216 Жыл бұрын
This is a well-made video. Clear visualizations and an amazing explanation. Keep it up!
@mathemaniac
@mathemaniac Жыл бұрын
Thank you very much!
@johanneshendriks9602
@johanneshendriks9602 Жыл бұрын
Really great video and some great intuition. I did feel that one extra concept could have been added. The concept of a "typical set" for probability distributions. For example, for a high dimensional Gaussian distribution the typical set ends up being a shell like volume some distance away from the mean. This could add to the explanation as to why taking just the point is not ideal, and also as to why it's more 'likely' that you will be in the 'far end' rather than the 'near end'
@ostrodmit
@ostrodmit Жыл бұрын
I like to give deriving the James-Stein estimator as a homework problem when teaching Math 541b at USC. Cool stuff!
@gerrychen
@gerrychen Жыл бұрын
Amazing video - perfectly paced and exactly right amount of background info!
@mathemaniac
@mathemaniac Жыл бұрын
Glad you enjoyed it!
@kasuha
@kasuha Жыл бұрын
What disturbs me about this method is that it is not scale invariant. Let's say we have three random measurements of distance: 1 m, 2 m, and 3 m. Then the estimates would be 0.93, 1.86, and 2.79. But if we express the same measurements in feet, calculate the estimates and then convert them back to meters, they will be 0.99, 1.99, and 2.98. That does not sound right. Or did I miss something?
@coreyyanofsky
@coreyyanofsky Жыл бұрын
The MSE as expressed in the video is dimensionally inconsistent for measurements with units. Implicitly the variance is setting the scale here -- you measure in units such that the standard deviation is 1, and this scaling eats the units.
@sternmg
@sternmg Жыл бұрын
The estimator requires that all component quantities be normalized, i.e., to be dimensionless and have variance 1. This means real-world input components must all be scaled as x_i := x_i/σ_i, which means that all component _variances must be known beforehand_ . That is not exactly practical and also makes the estimator less miraculous.
@mathemaniac
@mathemaniac Жыл бұрын
You can use the usual estimate for the variances (if you have more data points, in which case the means still follow a normal distribution, just with different variances), and the James-Stein estimator still dominates the ordinary estimate, so you don't actually have to know the variances. More details on the Wikipedia page for the James-Stein estimator.
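The unit-dependence raised at the top of this thread is easy to reproduce; as the replies explain, the formula in the video implicitly assumes σ = 1, so feeding it raw metres vs raw feet amounts to assuming different variances. A minimal sketch:

```python
import numpy as np

def js(x):
    """Plain James-Stein shrinkage toward 0, implicitly assuming sigma = 1."""
    p = len(x)
    return (1 - (p - 2) / np.sum(x**2)) * x

m = np.array([1.0, 2.0, 3.0])      # measurements in metres
ft = 3.28084                       # feet per metre

est_metres = js(m)                 # approx [0.93, 1.86, 2.79]
est_via_feet = js(m * ft) / ft     # approx [0.99, 1.99, 2.98]: a different answer
```

Dividing each coordinate by its (known or estimated) standard deviation before applying the formula removes the discrepancy, because the inputs become dimensionless.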
@ziyangxie8607
@ziyangxie8607 Жыл бұрын
A fantastic demonstration of the Stein's paradox. Literally one of the best math videos I've watched
@mathemaniac
@mathemaniac Жыл бұрын
Thank you so much!
@mrbeancanman
@mrbeancanman Жыл бұрын
never knew the link between shrinkage and regularisation... good stuff.
@charliethomas6317
@charliethomas6317 Жыл бұрын
In 1982 I contacted Dr. Efron at Stanford University and with his help used the JS estimates for stands of bottomland forest in Arkansas, Louisiana, and Mississippi. These stands were residual acres of valuable cypress and oaks.
@johnchessant3012
@johnchessant3012 Жыл бұрын
That's a really cool paradox, great video! Question about the "best estimator": Would this definition mean always guessing 7 is also an admissible estimator because no other estimator can have mean squared error = 0 in the case that the actual mean is 7?
@mathemaniac
@mathemaniac Жыл бұрын
Yes! I originally wanted to say this in the video but decided against it to make it a bit more concise. Indeed, your observation adds fire to the anger of those statisticians who really believed in Fisher - admissibility (what I called the "best" estimator criterion) is a weak criterion for estimators, yet our ordinary estimate fails even this!
@leif1075
@leif1075 Жыл бұрын
@@mathemaniac Around 14:30, you just mean that a larger distance results in smaller shrinkage, because since the denominator is getting larger, the entire term (p - 2) over that distance will shrink while the numerator stays the same... that's all you meant, right?
@mathemaniac
@mathemaniac Жыл бұрын
@@leif1075 Yes - if the original distance is large, then the absolute reduction in distance will be small, because the original distance is in the denominator.
@viliml2763
@viliml2763 Жыл бұрын
@@mathemaniac I read somewhere that the James-Stein estimator is itself also inadmissible. Is there any "good" admissible estimator?
@hwangsaessi2335
@hwangsaessi2335 Жыл бұрын
Great video! Paradoxes like this are why I like the Bayesian formulation of estimation theory a lot; you can essentially also get regularization effects by choosing appropriate priors and estimators, but without many of the same conceptual pitfalls. (I am no math/statistics expert, but I do work with applied estimation so not a total layman either.)
@4dtoaster819
@4dtoaster819 Жыл бұрын
There is something satisfying about an idea going from ridiculous to obvious in a short span of time.
@chrislankford7939
@chrislankford7939 Жыл бұрын
As much as I'd like to say my own work involving the bias-variance tradeoff is a must-read on the topic, the absolute MVP paper on this subject is: Kay, S and Eldar, YC. Rethinking Biased Estimation. IEEE Signal Processing Magazine. 2008. It's rooted in Steven Kay's excellent "Fundamentals of Statistical Signal Processing" textbook series and does some quick and dirty proofs of multiple biased estimators that are actually superior to their unbiased counterparts.
@ipudisciple
@ipudisciple Жыл бұрын
The main reason that this is counter-intuitive, IMHO, is that it does not have the obvious symmetry. Suppose we sample from [N(m1,s1), N(m2,s2), N(m3,s3)] and get [x1, x2, x3]. Suppose our estimator for [m1, m2, m3] is [m'1, m'2, m'3]. This might be [x1, x2, x3] or it might not. Now suppose we get [x1+t1, x2+t2, x3+t3]. Imagine the t1, t2, t3 as being very large. Surely our estimator should be [m'1+t1, m'2+t2, m'3+t3]. The problem has a symmetry, so surely our solution should exhibit the same symmetry. The James-Stein estimator does not have that property. But here's the thing. If a problem has a symmetry, then the set of all solutions must have the same symmetry, but unless the solution is unique no individual solution needs to have that symmetry. Spontaneous symmetry breaking and all that. So there are other James-Stein estimators which are given by taking the origin to be at [u1, u2, u3], and these also beat the [x1, x2, x3] estimator, and the set of all of them has the expected symmetry.
@mathemaniac
@mathemaniac Жыл бұрын
Yes - you can also shrink it towards any other arbitrary, but pre-picked point. You can even think of the ordinary estimate as just shrinking towards infinity.
@GerardSans
@GerardSans 10 ай бұрын
It seems that the reasoning behind this is where the gains happen for errors (closer to the mean). For 2, each unknown mean can fall either to the right or left, but when we introduce a third, it will fall to the right or left, making it closer after applying the inverted proportion. For n=4, the new mean falls either right or left, making the new value closer to all in the group where the mean is positioned right/left of the value, and so on. p-2 corrects for the initial 2 best, and the squares allow for a conservative approach vs a straight sum or cube.
@sternmg
@sternmg Жыл бұрын
To my physics-trained eyes, the formula at 3:00 looks incorrect or at least incomplete for general variables having units. Are all _x_ components expected to be dimensionless and normalized to σ_i = 1? But where would one get the σ_i from?
@frankjohnson123
@frankjohnson123 Жыл бұрын
I believe all that's required is the inputs are dimensionless, so you can do the naïve thing and divide by the unit or be more precise by using some physical scale for that dimension if it's known.
@sternmg
@sternmg Жыл бұрын
Aha, on Wikipedia the James-Stein estimator is shown with σ² in the numerator, which would indeed take care of units and scale. Alas, this makes the estimator _dramatically less useful_ in real-world situations because it can only be applied if σ² is known _a priori_ .
@Pystro
@Pystro Жыл бұрын
I was thinking the same thing. If you wanted to define a shrinkage factor that works for data sets with variances that aren't normalized to 1, you'd need to explicitly write that into the equation. I.e. every time there's an x_i in the shrinkage factor, you'd replace it with x_i/sigma_i. One consequence is that the James Stein estimator can only be used if you know (or have an estimate for) the variance. And if you have only an estimate for the variance (which is the best you can hope for if you don't know the true distribution already), then that can deteriorate the quality of the estimator.
@mathemaniac
@mathemaniac Жыл бұрын
No, that's not true. Also on Wikipedia, you can apply the James-Stein estimator if the variance is unknown - you just replace it with the standard estimator of variance.
@coreyyanofsky
@coreyyanofsky Жыл бұрын
@@sternmg the JS phenomenon was only ever meant to be a counter-example of sorts, not applied statistics -- that's why they didn't bother defining an obvious improvement that dominates the JS estimator (to wit, the "positive-part JS estimator" that sets the estimate to zero when the shrinkage factor goes negative). If you want practical shrinkage methods use penalized maximum likelihood with L1 ("lasso") or L2 ("ridge") penalties (or both, "elastic net") or Bayes.
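The positive-part variant mentioned here is a one-line change: clip the shrinkage factor at zero, so the estimate collapses to the target rather than overshooting past it when the factor would go negative. A minimal sketch:

```python
import numpy as np

def js_positive_part(x, sigma2=1.0):
    """Positive-part James-Stein: never let the shrinkage factor go negative."""
    p = len(x)
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / np.sum(x**2))
    return factor * x

x = np.array([0.3, -0.2, 0.1, 0.25])   # ||x||^2 = 0.2025 < p - 2, plain JS would flip signs
est = js_positive_part(x)              # clipped: collapses to the origin instead
```

The clipped version dominates plain James-Stein, and (as noted above) is itself inadmissible, which is part of why practical work moved on to lasso, ridge, and Bayesian shrinkage.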
@Icenri
@Icenri Жыл бұрын
It made sense to me that the variance was the cause of the paradox but the real reason is mind boggling.
@hellohey8088
@hellohey8088 Жыл бұрын
Nice video. I guess the graphical explanation for how the JS estimator "might" work does not apply if the shrinkage factor is negative. I wonder if there is an intuitive explanation for the case when the shrinkage factor is negative too?
@michaelhiggins9188
@michaelhiggins9188 Жыл бұрын
Congratulations on reaching 100 K subscribers! I think this channel will continue to grow because the content is very high quality and there aren't many like this.
@mathemaniac
@mathemaniac Жыл бұрын
Thank you very much!
@cmyk8964
@cmyk8964 Жыл бұрын
It reminds me of the Curse of Dimensionality. Some stuff works well in 2D but not in higher dimensions. It’s like the “sphere between 1-unit spheres packing a 2-unit cube”. If you draw a circle that touches the inside of 4 unit circles forming a square, it would have a radius of √2-1 ≈ 0.414 units; if you draw a sphere that touches the inside of 8 unit spheres forming a cube, it would have a radius of √3-1 ≈ 0.732 units. But for 4D and up, the center hypersphere is the same size as the corner hyperspheres (√4-1=1), and in 5D and above, the center hypersphere is bigger, and eventually becomes uncontainable in the hypercube.
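The radii quoted in the comment above all come from one formula: with 2^n unit spheres centred at (±1, ..., ±1), the central sphere tangent to all of them has radius √n − 1.

```python
import math

def inner_radius(n):
    """Radius of the central sphere tangent to 2**n unit spheres centred at (+-1, ..., +-1)."""
    return math.sqrt(n) - 1   # distance from origin to a sphere centre is sqrt(n); subtract radius 1

radii = {n: inner_radius(n) for n in (2, 3, 4, 5, 9, 10)}
# n=2: ~0.414; n=3: ~0.732; n=4: exactly 1 (same size as the corner spheres);
# n=9: radius 2, where the central sphere reaches the faces of the enclosing cube
```

Beyond nine dimensions the "inner" sphere actually pokes out through the faces of the cube that contains all the corner spheres, which is the "uncontainable" behaviour the comment describes.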
@michaelhunte743
@michaelhunte743 9 ай бұрын
1-([p-2]/[encasing domain state in p terms]) effectively is just an addition of distributions assumed normal. If they are assumed normal then their rates of change would follow uniformly within a [0,1] set.
@miguelcampos867
@miguelcampos867 Жыл бұрын
Amazing video. What comes next? Can't wait for it!
@rossjennings4755
@rossjennings4755 Жыл бұрын
A lot of people say that they find the Banach-Tarski theorem to be upsetting, but this result is so much worse than that. You can make the Banach-Tarski phenomenon go away with some pretty weak continuity assumptions, but this is a really strong result that applies in real-world situations and isn't going to go away no matter what you throw at it. In fact I suspect you can make some pretty sweeping generalizations of it. I think the main reason I find it so hard to accept is that I have a really strong intuitive sense that there should be a unique "best" estimator -- i.e., you shouldn't be able to get a better estimator by biasing it in an arbitrary direction, which is exactly what happens with the James-Stein estimator. I suspect that, based on similar reasoning to what's presented in this video, you can show that, in these kinds of situations, there can be no unique "best" estimator. (Edit: I originally had "admissible" where I now have "best", but I've since realized that's not really what I meant.)
@ronalddobos8390
@ronalddobos8390 Жыл бұрын
Amazing video! But I have one nitpicky comment: at 15:00 your arrows are misleading, the shrinkage factor is actually the same for the bottom left arrow and for the "near end" arrow
@porglezomp7235
@porglezomp7235 Жыл бұрын
As soon as you started talking about bias-variance tradeoff I started thinking about biased sampling in Monte Carlo methods (and in rendering in particular). Sometimes it's worth losing the eventual convergence guarantees of the unbiased estimators if it also kills the sampling noise that high variance introduces.
@noplan113
@noplan113 Жыл бұрын
I have a naive question about why this works: So given the original setup, you basically draw numbers (mu) in the range from [-infinity,+infinity]. If all numbers are equally likely, the expected value for this drawing should be zero? Then we get a second information, that is the single confirmed value that we know for each distribution. Given that the expected value of all mus should be zero, can we just assume that it is more likely that the actual mu is slightly closer to zero than the number we know? However if you shrink too much you will also lose out on accuracy. Therefore there could be an optimal "amount" of shrinkage? Does this make sense?
@Temari_Virus
@Temari_Virus Жыл бұрын
I think the expected error will always be the same no matter what the shrinkage factor is? A uniform distribution is basically a straight line, so it'll look the same no matter how you stretch or shrink it. The variance of the distributions is (infinity - infinity) / 2 = ...dammit. Ok let's draw numbers from the range [-x, x] instead. So now the variance of the distributions is (x - x) / 2 = 0, which approaches 0 as x approaches infinity. The shrinkage factor basically multiplies this variance, and 0 multiplied by anything is still 0. (Don't quote me on this, I don't know much about statistics, but this just made sense to me)
@neiljudell1437
@neiljudell1437 Жыл бұрын
Clarification: If there exists an unbiased estimator, then the unbiased estimator of lowest variance is the MLE (Cramér-Rao theorem). Sometimes we really care that the estimate be unbiased. Sometimes we want MMSE. Sometimes we want MAP. It depends heavily on the application. Now let's do Stein's paradox for a generic Gaussian vector, not with an identity covariance matrix. It gets interesting quickly
@matteogirelli1023
@matteogirelli1023 Жыл бұрын
For some very important statistical applications, though, we would never trade an unbiased estimator for a more precise biased one, for example when we want to make a causal inference
@tricky778
@tricky778 8 ай бұрын
If the squares of the data points sum to zero, then the James-Stein estimator makes no estimate, or estimates ±∞, because the shrinkage term has a zero denominator. That happens when they're all zero, or, if they're complex, lots of data sets will cause it. Is the estimator's performance biased towards distributions that don't produce zero?
@112BALAGE112
@112BALAGE112 Жыл бұрын
This is another great example of how higher dimensional space defies intuition.
@jan.kowalski
@jan.kowalski Жыл бұрын
One of the best teaching experiences. Amazing!
@justinlowenthal3208
@justinlowenthal3208 Жыл бұрын
I am wondering… If I had a single measurement to estimate in one dimension, could I use a random number generator to create data sets in two more dimensions, then use the James-Stein estimator to get a more accurate result? Basically shoehorn the estimator into a one-dimensional problem?
@Smo1k
@Smo1k Жыл бұрын
Heh. Good thought, but nope: This is about the "best" overall guess for the whole set of variables with the same variance; there's no saying which mean you will have the biggest error guessing. If you think of the p-2 over the division line as your degrees of freedom, and you do the J-S equation for 4 numbers, then run a second number on each variable and remove the worst fit to get down to 3, chances are equal that it's the variable you wanted to shoehorn which gets tossed.
@nerfpls
@nerfpls Жыл бұрын
My impression is that the reason shrinkage works is fundamentally because we have an additional bit of information a priori: values closer to 0 are more likely than values further away. This becomes obvious with very large numbers. We know intuitively that any distribution we encounter in real life will be unlikely to have a mean above, let's say, 2^50. This is important because for values far from zero, the James–Stein estimator loses its edge. If we didn't assume a bias towards 0 and truly considered all possible values equally (e.g. a mean of 2^50^50 is just as likely as a mean between 0 and 1 million), we would see that the James–Stein estimator is in fact not measurably better over all possible numbers (its average error approaches the same limit as the simple estimator). It's just better for numbers close to 0, which turns out to include any distribution we will ever encounter, at least to some degree, because nature is biased towards numbers closer to 0.
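The claim that the James–Stein edge fades as the true mean moves away from the shrinkage target is easy to check by simulation; a quick Monte Carlo sketch (assuming unit variance and shrinking toward 0, with p = 5):

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_pair(mu, trials=20000):
    """Monte Carlo MSE of the James-Stein estimator (shrunk toward 0)
    and of the raw observation, for a fixed true mean vector mu."""
    mu = np.asarray(mu, dtype=float)
    p = mu.size
    x = mu + rng.standard_normal((trials, p))       # x ~ N(mu, I)
    factor = 1.0 - (p - 2) / np.sum(x * x, axis=1)  # shrinkage per trial
    js = factor[:, None] * x
    return (np.mean(np.sum((js - mu) ** 2, axis=1)),
            np.mean(np.sum((x - mu) ** 2, axis=1)))

js_near, mle_near = mse_pair(np.zeros(5))        # true mean at the origin
js_far, mle_far = mse_pair(np.full(5, 100.0))    # true mean far away
# The gap (mle - js) is large near the origin and nearly zero far away.
```

Near the origin the gap is substantial (theory gives risk 2 versus 5 for p = 5), while at distance 100 the two estimators are essentially indistinguishable, which is the comment's point.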
@mathemaniac
@mathemaniac Жыл бұрын
If you know a priori that your true value is actually very large, you can shrink towards that far away point instead! There is nothing special about 0.
@nerfpls
@nerfpls Жыл бұрын
If you consider all numbers, any finite positive number you pick, no matter how large, will still be small in the sense that there is an infinite range of numbers larger than your chosen number, but only a finite range of smaller positive numbers. So compared to all numbers, we cannot help but pick numbers close to 0! Knowing this, we can bias towards small numbers and improve. Any other number you might choose to shrink to is special too, because in the same sense it is also a small number (it might be better or worse than 0, but just like 0 it will help at least a little bit). If you "shrink" towards infinity, I think that will only help if you change the methodology a bit and shrink not based on the distance to infinity (that would just give a constant additive shift to all values, which doesn't help) but based on the distance to a finite set point. So again, as you get further from the set point, the benefit of shrinking will decrease and approach 0. That being said, I am confused as to why shrinkage doesn't work in 1D and 2D, so maybe I am mistaken.
@nvs3221
@nvs3221 Жыл бұрын
Awesome video, would love some more statistics content. Pure maths people don't pay it enough respect :)
@gowrissshanker9109
@gowrissshanker9109 Жыл бұрын
Hello Sir, how is complex analysis useful in the special theory of relativity? (As you mentioned in your complex analysis intro video.) Thank you
@NewtonianT
@NewtonianT Жыл бұрын
Nice video. I would like to ask: could you recommend a book to begin understanding statistics and probability?
@chaitanyalodha3948
@chaitanyalodha3948 11 ай бұрын
I somehow feel this is really connected to the concept of higher-dimensional spheres, which 3b1b had made a video on, about their volumes and shapes.
@raywang5619
@raywang5619 Жыл бұрын
Fantastic intuition elaboration. Thank you so much
@russellsharpe288
@russellsharpe288 Жыл бұрын
I haven't thought about this in detail at all, but is this counterintuitive result dependent on the use of the mean squared error? Would it be avoided if one used eg the mean absolute error instead? (If so, doesn't it amount to a reductio ad absurdum refutation of the use of mean squared error?)
@coreyyanofsky
@coreyyanofsky Жыл бұрын
It happens because MSE treats errors in each parameter as comparable. If you think about actually estimating quantities of interest, you'll see that the MSE as expressed here isn't dimensionally consistent: there's an implicit conversion factor that says that whatever the variance in the individual components is, that sets the scale for how errors in different components are traded off against one another. It's the way this trading off of errors in the different components works that leads to the shrinkage estimator dominating the maximum likelihood estimator. I haven't checked, but using mean absolute error would require the same trading off of estimation errors, so I'd expect a James–Stein-style result with that loss function too.
@terdragontra8900
@terdragontra8900 Жыл бұрын
@@coreyyanofsky If you had some data set where errors in dimensions aren't comparable because, say, you weigh error twice as heavily in x_1 than in x_2, then you can just scale x_1 by a factor of two and try to estimate 2mu_1, and the paradox still happens. I suppose instead you may be completely unwilling to compare the dimensions, but then "best estimator" for the set is meaningless. This is strange.
@coreyyanofsky
@coreyyanofsky Жыл бұрын
@@terdragontra8900 If you change the weighting so that you're no longer variance 1 in some component then the loss function is weighted MSE and the sphere in the video becomes an ellipsoid; this will make the math more complicated for no real gain because the JS phenomenon was supposed to be a counter-example of sorts and not applied statistics.
@SolomonUcko
@SolomonUcko Жыл бұрын
Wouldn't reweighting the MSE just lead to a weighted JS estimator?
@orangereplyer
@orangereplyer Жыл бұрын
I think the key insight is that, in higher dimensions, it's not like you're getting a better estimate *for each separate dimension* than you would've if you'd estimated each separately. But the, like, "length" of the error vector will be less. The problem might be how we ought to be interpreting that length.
@damonjalali8669
@damonjalali8669 Жыл бұрын
Ohh fantastic!! This video tutorial is really interesting and amazing. Thanks a lot.
@pawedziedzic3250
@pawedziedzic3250 9 ай бұрын
It would be cool if the terms used in this video were explained a bit. Up until 13:00 I thought that mu was meant to be the value at the maximum, not the point at which the maximum occurs, which was pretty confusing.
@edzielinski
@edzielinski Жыл бұрын
This could be wrong, but I see an example to illustrate this in real life: Consider a box of screws of varying length. Now randomly pull a screw. The length of the screw will be the data point. The tolerance will be the mechanical variation in length. Let's also add the requirement that the variation be consistent across all screws, just as the variance is the same in the data point examples at the start. Intuitively and mathematically, then, more information about the actual value of the tolerance, which is unknown, will be accumulated as more screws are examined. This information can be used to predict how close the actual "ideal" length of the screws compares to the observed length. You can then select "better" screws from what has been accumulated. The apparent independence of the samples (length) is an illusion, because they share a common value of the variance or spread.
@ckq
@ckq Жыл бұрын
Nice video but I'm a bit confused on the intuition. So it is worse in a certain region where the data is incorrectly regressed toward the origin, but better everywhere else, leading to better overall performance?

Let me try (a, b, c) = (4, 5, 6). Let x, y, z be random variables, mean 0, variance 1. Our data is (4+x, 5+y, 6+z). So our goal has a squared distance from the origin of 4² + 5² + 6² = 77, but our data has a slightly larger expected squared distance of 77 + 3 = 80. The shrinkage factor of 79/80 cuts that to around 78.

So we got a right triangle with legs √77 (could be any number) and √3 (larger in higher dimensions). Let's say the origin is point A, the truth is point B (√77 away from A) and the data is C (√3 away from B). AB is the initial vector, BC is the error, and AC is the data (hypotenuse). So it's saying that if we take the hypotenuse and go some of the way back toward the origin, we get closer to the right angle.

I think this view kind of makes sense now, since for example in a 3-4-5 triangle, when you drop the altitude from the right angle to the hypotenuse, it is 16/25ths of the way down the hypotenuse. So for that √77, √3, √80 triangle we can get closest to the right angle by regularizing by a factor of 77/80.

I think I'm almost there; now all that's left to understand is why we subtract 2 in the numerator, to find out why it doesn't work in 1 or 2 dimensions. So if the truth is t, and we're given t+x: guessing t+x gives variance of 1, obviously. But what if we guessed (t+x)·(1 − 1/(t+x)²)? Running it through the calculator, it seems like we get t²/(t³−3) > 1. If we guessed (t+x)·t²/(t²+1) instead, the variance actually does decrease, to t²/(t²+1).
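The 77/80 factor in this comment is indeed the best constant multiplier: minimizing E‖cX − μ‖² = c²(‖μ‖² + p) − 2c‖μ‖² + ‖μ‖² over c gives c* = ‖μ‖²/(‖μ‖² + p). A quick numerical sketch checking that claim for the comment's example:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([4.0, 5.0, 6.0])   # the comment's (a, b, c) = (4, 5, 6)
p = mu.size                      # ||mu||^2 = 77, so E||X||^2 = 77 + 3 = 80

x = mu + rng.standard_normal((200000, p))   # X = mu + unit-variance noise

def mse(c):
    """Monte Carlo MSE of the scaled estimator c * X."""
    return np.mean(np.sum((c * x - mu) ** 2, axis=1))

# E||cX - mu||^2 = 80c^2 - 2*77c + 77 is minimized at c* = 77/80,
# which beats both the raw estimator (c = 1) and over-shrinking (c = 0.9).
c_star = 77.0 / 80.0
```

The raw estimator's MSE comes out near p = 3, while `c_star` trims it to about 77·3/80 + small change, in line with the right-triangle picture above.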
@jenaf4208
@jenaf4208 Жыл бұрын
If we use a different error weighting function than "mean square error" I assume that other estimators will be best.
@bejoscha
@bejoscha 11 ай бұрын
Lovely video which can give one a "take away" message without the need to fully understand all mathematical details. The 3D picture really makes it intuitive. (too bad, so many interesting things only happen if d
@alangivre2474
@alangivre2474 Жыл бұрын
You are exceptionally clear!!!!! I hope this channel grows!!!
@mathemaniac
@mathemaniac Жыл бұрын
Thank you so much!
@alangivre2474
@alangivre2474 Жыл бұрын
@@mathemaniac I am doing my PhD in Information Theory in Biophysics and I have never heard about this estimator!! Very enriching.
@Fred-yq3fs
@Fred-yq3fs Жыл бұрын
Very unintuitive. Outstanding content. Thought-provoking. Love it! Keep it up.
@mathemaniac
@mathemaniac Жыл бұрын
Glad you liked it!
@nathanoupresque4017
@nathanoupresque4017 Жыл бұрын
Since the problem seems to me invariant under a change of origin, could one also pull the estimated point towards a point other than (0,0,0)? What would be the formula in this case? Should we replace the naive estimate by λ·naive_estimate + (1−λ)·shrinkage_target, with λ being the shrinkage coefficient 1 − 1/||naive_estimate − shrinkage_target||²?
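Right, the problem is translation-invariant, so you can shrink toward any fixed target by applying the usual formula to (naive − target) and shifting back; in the λ·naive + (1−λ)·target form, λ = 1 − (p−2)σ²/||naive − target||² (note the (p−2)σ² factor in the numerator, which the formula in the question omits). A minimal sketch, assuming unit variance and p ≥ 3 (the function name is my own):

```python
import numpy as np

def james_stein_toward(x, target, sigma2=1.0):
    """James-Stein shrinkage of observation x toward an arbitrary
    fixed point `target`. By translation invariance, this is the
    usual estimator applied to (x - target), shifted back."""
    x = np.asarray(x, dtype=float)
    target = np.asarray(target, dtype=float)
    p = x.size
    d = x - target                                   # naive - target
    factor = 1.0 - (p - 2) * sigma2 / np.dot(d, d)   # the lambda above
    return target + factor * d
```

With `target = 0` this reduces to the estimator from the video; with any other fixed target the same dominance result holds, as the pinned reply about shrinking "towards that far away point" says.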
@kylebowles9820
@kylebowles9820 Жыл бұрын
I like how you can see in higher dimensions the volume of the error sphere becomes less relevant
@miguelcampos867
@miguelcampos867 Жыл бұрын
I would love the explanation of density estimation with normalizing flow
@MithicSpirit
@MithicSpirit Жыл бұрын
10:39 I mean, nowadays if you're doing anything at all sensitive on a non-HTTPS website you're making a big mistake. 10:55 Browser fingerprinting is a thing and can often uniquely identify a device. Sure, a VPN makes it a bit harder, but unless you're hardening your browser (which anyone in the intended audience of this ad is not doing) it doesn't matter that much.
@kylewilson6425
@kylewilson6425 Жыл бұрын
Great demonstration! You've earned a subscriber! Thank you very much! 👍
@praveenb9048
@praveenb9048 Жыл бұрын
Has this principle been absorbed into other algorithms like the Kalman filter etc?
@FalconX88
@FalconX88 4 ай бұрын
Man, I was super confused, because the intro and even some parts in the video ("let's take a naive estimator of 7", while it should be "let's take a random sample and it turned out to be 7") make it sound like we pick the ordinary estimator value in a quest to find the mean. If that's the case then this makes no sense whatsoever. But if we pick a random sample and from that we get that value, then it makes sense.
@withoutdad7616
@withoutdad7616 11 ай бұрын
1990's Wonder Years Algebra Teacher: Every problem contains its own solution. 2023 AI: Convergence to true.
@spillfish4327
@spillfish4327 Жыл бұрын
I’m studying MAS-I right now and this was super helpful!
@robharwood3538
@robharwood3538 Жыл бұрын
How does this result connect with Bayesian estimation? To me it seems to make sense that the reduction starts to happen at 3 sampled points because there is an implicit (within the MLE estimator) prior hyper-parameter distribution on the supposedly-independent distributions, namely that their (improper) prior means are uniform on the Real line. But once you have sampled from at least 3 of these distributions, you now have enough data from the hyper-prior distribution to outweigh the improper uniform prior of the individual distributions. Namely, the hyper-prior on the original 'independent' means should be updated to be somewhere close to the average of the three sample points. So, I imagine that if this whole scenario was rephrased in terms of a hierarchical Bayes model, with hyper-parameters for the means of the multiple distributions, you would not only get a better estimator than the naive MLE estimator, but you'd almost certainly eliminate the negativity flaw in the James-Stein estimator.
@coreyyanofsky
@coreyyanofsky Жыл бұрын
I'm a Bayesian and I don't think there's necessarily a connection here. This phenomenon happens because of the way the loss function trades off estimation error in the different components of the estimand. This particular loss function is not an essential piece of Bayesian machinery, and if you think about it, Bayes licenses you to shrink whenever the prior information justifies it even in 1 or 2 dimensions.
@mathemaniac
@mathemaniac Жыл бұрын
James-Stein estimator is also an example of an empirical Bayes estimator. You can derive it by considering the prior distribution as centred around the origin, but the variance-covariance matrix is estimated from the data itself.
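That empirical-Bayes reading can be sketched concretely: with prior μᵢ ~ N(0, τ²) and unit observation noise, the Bayes posterior mean shrinks x by 1/(τ² + 1), and since marginally E[(p−2)/‖x‖²] = 1/(τ² + 1), James–Stein simply plugs in that data-driven estimate of the shrinkage. A small illustration (the dimension and τ² values are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
p, tau2 = 50, 4.0                  # illustrative dimension and prior variance

# Hierarchical model: mu_i ~ N(0, tau2), then x_i | mu_i ~ N(mu_i, 1).
mu = rng.normal(0.0, np.sqrt(tau2), p)
x = mu + rng.standard_normal(p)

# Oracle Bayes posterior mean knows tau2 and shrinks by 1/(tau2 + 1)...
bayes = (1.0 - 1.0 / (tau2 + 1.0)) * x
# ...while James-Stein estimates that same shrinkage from the data:
# marginally ||x||^2 ~ (tau2 + 1) * chi2_p, so E[(p-2)/||x||^2] = 1/(tau2+1).
factor_hat = (p - 2) / float(np.dot(x, x))
js = (1.0 - factor_hat) * x
```

For moderate p the estimated shrinkage lands close to the oracle value 1/(τ² + 1) = 0.2 here, which is why the James–Stein estimate hugs the Bayes estimate without ever being told τ².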
@robharwood3538
@robharwood3538 Жыл бұрын
@@mathemaniac Thanks!