He is a fantastic teacher. And all that energetic teaching has given him a sore throat.
@crestz1 · 1 year ago
This lecturer is amazing. As a Ph.D. candidate, I always revisit the lectures to familiarise myself with the basics.
@xiaoweidu4667 · 3 years ago
The key to a deeper understanding of algorithms is the assumptions made about the underlying data. Thank you and great respect.
@meenakshisarkar7529 · 4 years ago
This is probably the best explanation I have come across of the difference between Bayesian and frequentist statistics. :D
@brahimimohamed261 · 3 years ago
Someone from Algeria confirms that this lecture is incredible. You have made complex concepts very simple.
@deltasun · 4 years ago
Impressive lecture, thanks a lot! I was also impressed to discover that if, instead of taking the MAP, you take the EAP (expected a posteriori), then the Bayesian approach implies smoothing even with a uniform prior (that is, alpha = beta = 1)! Beautiful.
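To see that point numerically: with a uniform Beta(1, 1) prior the MAP estimate collapses back to the plain MLE, while the posterior mean ("EAP") already performs add-one (Laplace) smoothing. This is a minimal sketch of my own, not from the lecture; the helper names are made up for illustration:

    from fractions import Fraction

    def mle(nH, nT):
        # maximum likelihood estimate: relative frequency of heads
        return Fraction(nH, nH + nT)

    def map_estimate(nH, nT, alpha=1, beta=1):
        # mode of the Beta(nH + alpha, nT + beta) posterior
        return Fraction(nH + alpha - 1, nH + nT + alpha + beta - 2)

    def posterior_mean(nH, nT, alpha=1, beta=1):
        # mean of the Beta(nH + alpha, nT + beta) posterior (the "EAP")
        return Fraction(nH + alpha, nH + nT + alpha + beta)

    nH, nT = 3, 0                       # three heads, no tails
    print(mle(nH, nT))                  # 1   -> declares tails impossible
    print(map_estimate(nH, nT))         # 1   -> a uniform prior does not smooth the MAP
    print(posterior_mean(nH, nT))       # 4/5 -> smoothed, even with alpha = beta = 1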
@sandeepreddy6295 · 3 years ago
Makes the concepts of MLE and MAP very, very clear. We also get to know that Bayesians and frequentists both trust Bayes' rule.
@cuysaurus · 5 years ago
48:46 He looks so happy.
@SundaraRamanR · 5 years ago
"Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection
@WahranRai · 1 year ago
You are totally wrong!
@JohnWick-xd5zu · 4 years ago
Thank you Kilian, you are very talented!!
@mohammadaminzeynali9831 · 2 years ago
Thank you, Dr. Weinberger. You are a great lecturer, and also the YouTube algorithm subtitles your "also" as "eurozone".
@DavesTechChannel · 4 years ago
Amazing lecture, best explanation of MLE vs MAP
@abunapha · 5 years ago
Starts at 2:37
@sumithhh9379 · 4 years ago
Thank you, Professor Kilian.
@andrewstark8107 · 2 years ago
From 30:00 pure gold content. :)
@Jeirown · 4 years ago
When he says "basically", it sounds like "bayesly". And most of the time it still makes sense.
@zelazo81 · 5 years ago
I think I finally understood the difference between frequentist and Bayesian reasoning, thank you :)
@vishchugh · 4 years ago
Hi Kilian, while calculating the likelihood function in the example, you have also taken (nH+nT) choose (nH) into consideration. It doesn't change the optimization, but I guess it shouldn't be there, because in P(Data | parameter), with all samples being independent, it should just be Q^nH * (1-Q)^nT. Right?
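For what it's worth, the binomial coefficient only multiplies the likelihood by a constant that does not depend on theta, so it cannot change the maximizer; either version gives the same MLE. A quick sanity check, my own sketch with arbitrarily chosen numbers:

    import numpy as np
    from math import comb

    nH, nT = 7, 3
    thetas = np.linspace(0.01, 0.99, 99)

    bernoulli_ll = nH * np.log(thetas) + nT * np.log(1 - thetas)   # product-of-Bernoullis log-likelihood
    binomial_ll = np.log(comb(nH + nT, nH)) + bernoulli_ll         # version that includes the nCr term

    # both versions peak at the same theta, namely nH / (nH + nT)
    print(thetas[bernoulli_ll.argmax()], thetas[binomial_ll.argmax()])   # ~0.7 for both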
@dude8309 · 4 years ago
I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong, but looking at 18:01, wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D; theta)? Since we're looking at the probability of H given the data, while using theta as a parameter?
@JoaoVitorBRgomes · 4 years ago
At circa 37:28, professor, you say something along the lines of 'which parameter makes our data most likely'. Could I say, in other words, 'which parameter corresponds to this distribution of data'? But not 'which parameter most probably corresponds to this distribution'? Or neither? What confuses me is reading P(D|theta). I read it as: what is the probability of this data/dataset given that I got this theta/parameters/weights. But when I start, I start with the data and then try to estimate the parameters, not the opposite. It is as if I somehow had the weights and then tried to discover the probability that these weights/parameters/theta belong to this dataset. Weird. Am I a Bayesian? Lol. (e.g. a logistic classification task for fraud). Kind regards!
@kilianweinberger698 · 3 years ago
Yes, you may be in the early stages of turning into a Bayesian. Basically, if you treat theta as a random variable and assign it a prior distribution, you can estimate P(theta|D), i.e. what is the most likely parameter given this data. If you are a frequentist, then theta is just a parameter of a distribution, and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta), i.e. which parameter theta makes my data most likely. (In practice these two approaches end up being very similar ...)
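A small sketch of the two recipes in this reply (my own, with an assumed Beta(2, 2) prior purely for illustration): the frequentist maximizes P(D; theta) alone, while the MAP estimate maximizes the posterior, which is proportional to P(D|theta)P(theta), so the prior pulls the answer toward 1/2.

    import numpy as np

    nH, nT = 8, 2
    a = b = 2.0                                  # assumed Beta(2, 2) prior, illustration only
    thetas = np.linspace(0.001, 0.999, 999)

    log_lik = nH * np.log(thetas) + nT * np.log(1 - thetas)              # log P(D; theta)
    log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)  # log P(theta), up to a constant
    log_post = log_lik + log_prior                                       # log P(theta | D), up to a constant

    print(thetas[log_lik.argmax()])    # MLE: nH / (nH + nT) = ~0.8
    print(thetas[log_post.argmax()])   # MAP: (nH + a - 1) / (nH + nT + a + b - 2) = ~0.75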
@saitrinathdubba · 6 years ago
Just brilliant!! Thank you, Prof. Kilian!!!
@prwi87 · 1 year ago
Edit: After thinking, checking, finishing the lecture, and watching a bit of the lecture after this one, I have come to the conclusion that my first explanation was wrong, as I didn't have enough knowledge. The way it is calculated is fine; where I struggled was understanding which PDF the Professor was using. What threw me off was P(D; theta), which is a joint PDF (I know it's a PMF, but for me they are all PDFs if you put a delta function in there) of obtaining exactly the data D, because D is a realization of some random vector X; to be more precise in notation, P(D; theta) should be written as P(X = D; theta). But what the Professor meant was the PDF P(H = n_h; len(D), theta), which is a binomial distribution. Then we can calculate the MLE just as it was calculated during the lecture. But this is not the probability of getting the data D; it is the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have the conditional distribution H | theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)); we treat theta as a random variable but len(D) as a parameter.

There are two problems with the explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered; this data is a realization of the random vector X. n_h is the number of heads tossed in D. nCr(x, y) is the number of combinations, x choose y.

1. The Professor writes that P(D; theta) is equal to the binomial distribution of the number of heads tossed, which is not true. A binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of obtaining the desired outcome (p), thus theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we have chosen it, so n is fixed and, most importantly, known to us! Because of that, let us denote n = len(D) and then theta = p. Now let H = the number of heads tossed; then P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_h) is precisely the distribution that was written by the Professor. I also noticed that one person in the comments asked why we cannot write P(H | D; theta), or more precisely P(H = n_h | len(D); theta). The reason is that len(D) is not a random variable; we are the ones choosing the number of tosses, and there is nothing random about it. Note that in the notation used in that particular comment, theta is treated as a parameter, as it is written after ";".

2. To be precise, P(X = D; theta) is a joint distribution. For example, if we had tossed the coin three times, then D = (d1, d2, d3) with d_i in {0, 1} (0 for tails and 1 for heads), and P(X = D; theta) = P(d1, d2, d3; theta). P(X = D; theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta | D) = P(X = D; theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta | D) is as a function of theta whose value also depends on the underlying measurements D. Now, if the data are i.i.d., we can write P(X = D; theta) = P(X_1 = d1; theta) * P(X_2 = d2; theta) * ... * P(X_len(D) = d_len(D); theta) = L(theta | D). In our coin-tossing example, P(X_i = d_i; theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i is in {0, 1} (0 for tails and 1 for heads). Given that, L(theta | D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)), where sum(d_i) is simply n_h, the number of heads observed.

And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lecture was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly the data D. Also, for any curious person, the "true Bayesian method" that the Professor described at the end is called minimum mean-squared-error estimation (MMSE), which aims to minimize the expected squared error between the random variable theta and some estimate of theta computed from the data random vector, g(X). To support my argument, here are the sources I used to write the above statements: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2), and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as English is not my first language. As I'm still learning all this data science stuff, I can be wrong, and I'm very open to any criticism and discussion. Happy learning!
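On the MMSE remark at the end: here is a quick numerical check (my own sketch, with an assumed Beta(1, 1) prior and made-up counts) that the posterior mean is exactly the point estimate minimizing the posterior expected squared error, which is why the "true Bayesian" prediction uses E[theta | D]:

    import numpy as np

    nH, nT, a, b = 6, 4, 1.0, 1.0                  # data plus an assumed Beta(1, 1) prior
    theta = np.linspace(0.001, 0.999, 999)

    # unnormalized Beta(nH + a, nT + b) posterior on a grid
    post = theta ** (nH + a - 1) * (1 - theta) ** (nT + b - 1)
    post /= post.sum()

    candidates = np.linspace(0.0, 1.0, 1001)       # candidate point estimates g(D)
    risk = [(post * (theta - c) ** 2).sum() for c in candidates]

    print(candidates[np.argmin(risk)])             # ~0.583: minimizer of expected squared error
    print((nH + a) / (nH + a + nT + b))            # posterior mean 7/12 ~ 0.583: they coincide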
@beluga.314 · 1 year ago
You're mixing up 'distribution' and 'density'. P(d1, d2, d3; theta) - this notation is correct, but P(X = D; theta) is wrong for a density function and you can't write it like that. But since these are also probabilities (discrete), you can write it like that here.
@arjunsigdel8070 · 4 years ago
Thank you. This is great service.
@marcogelsomini7655 · 2 years ago
48:18 loop this!! Thx Professor Weinberger!
@thachnnguyen · 9 months ago
I raise my hand. Why do you assume any particular type of distribution when discussing this? What if I don't know that formula? All I see is nH and nT. Why not work with those?
@SalekeenNayeem · 4 years ago
MLE starts at 11:40
@StarzzLAB · 3 years ago
I teared up at the end as well
@MarcoGelsomini-r8c · 1 month ago
34:52 very interesting point!
@coolblue5929 · 2 years ago
Very enjoyable. I think a Killian is like a thousand million right? I got confused at the end though. I need to revise.
@HimZhang · 2 years ago
In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
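The last equality is the mean of a Beta distribution: with a Beta(α, β) prior and nH heads / nT tails, the posterior over θ is Beta(nH+α, nT+β), and the mean of a Beta(a, b) is a/(a+b), since ∫ θ · θ^(a-1)(1-θ)^(b-1) dθ / B(a, b) = B(a+1, b)/B(a, b) = a/(a+b). A tiny numerical check of that identity, my own sketch with made-up numbers:

    from math import gamma

    def B(a, b):
        # Beta function: B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
        return gamma(a) * gamma(b) / gamma(a + b)

    nH, nT, alpha, beta = 5, 3, 2.0, 2.0   # made-up values, just to check the identity
    a, b = nH + alpha, nT + beta           # the posterior over theta is Beta(a, b)

    print(B(a + 1, b) / B(a, b))           # E[theta | D] via the Beta-function identity
    print(a / (a + b))                     # (nH + alpha) / (nH + alpha + nT + beta), same number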
@Klisteristhashit · 4 years ago
xkcd comic mentioned in the lecture: xkcd.com/1132/
@jandraor · 5 years ago
What's the name of the last equation?
@jachawkvr · 4 years ago
I have a question. Is P(D;theta) the same as P(D|theta)? The same value seems to be used for both in the lecture, but I recall Dr. Weinberger saying earlier in the lecture that there is a difference.
@kilianweinberger698 · 4 years ago
Well, for all intents and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling you to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior, the two notations are identical in practice.
@jachawkvr · 4 years ago
Ok, I get it now. Thank you for explaining this!
@Bmmhable · 5 years ago
At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach... I'm confused.
@kilianweinberger698 · 5 years ago
Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn't have likelihoods; it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
@Bmmhable · 5 years ago
@@kilianweinberger698 Thanks a lot for the explanation. Highly appreciated.
@imnischaygowda · 2 years ago
"nH + nT choose nH" - what exactly do you mean here?
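It is the binomial coefficient: the number of distinct orderings in which nH heads can appear among nH + nT tosses. In Python it is math.comb; a tiny illustration with my own example numbers:

    from math import comb, factorial

    nH, nT = 3, 2
    # "(nH + nT) choose nH": how many orderings of 3 heads and 2 tails exist in 5 tosses
    print(comb(nH + nT, nH))                                      # 10
    print(factorial(nH + nT) // (factorial(nH) * factorial(nT)))  # same thing: (nH+nT)! / (nH! nT!)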
@abhinavmishra9401 · 4 years ago
Impeccable
@jijie133 · 3 years ago
Great!
@KulvinderSingh-pm7cr · 6 years ago
Made my day!! Learnt a lot!!
@sushmithavemula2498 · 5 years ago
Hey Prof, your lectures are really good. But if you could provide some real-life applications/examples while explaining a few concepts, it would help everyone understand them better!
@abhishekprajapat415 · 4 years ago
18:19 How did that expression even come about? What is this expression even called in maths? By the way, I am a B.Tech. student, so I guess I might not have studied the math behind this expression.
@SalekeenNayeem · 4 years ago
Just look up the binomial distribution. That's the usual way of writing the probability of an event that follows a binomial distribution. You may also want to check the Bernoulli distribution first.
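Concretely, the expression at 18:19 is the probability mass function of a binomial distribution: P(nH heads in nH + nT tosses) = (nH+nT choose nH) * theta^nH * (1-theta)^nT. A small check against scipy, my own sketch with made-up numbers:

    from math import comb
    from scipy.stats import binom

    n, k, theta = 10, 7, 0.7                                    # made-up numbers for illustration
    manual = comb(n, k) * theta ** k * (1 - theta) ** (n - k)   # the expression from 18:19
    print(manual)                                               # ~0.2668
    print(binom.pmf(k, n, theta))                               # the binomial pmf gives the same value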
@pritamgouda7294 · 10 months ago
Can someone tell me which lecture has the proof about the k-nearest-neighbor algorithm that he mentions at 5:09?
@kilianweinberger698 · 9 months ago
kzbin.info/www/bejne/parQpXqifMmHY80
@pritamgouda7294 · 9 months ago
@@kilianweinberger698 Sir, I saw that lecture and its notes as well, but the notes mention the Bayes optimal classifier and I don't think it is in the video lecture. Please correct me if I'm wrong. Thank you for your reply 😊
@yuniyunhaf5767 · 5 years ago
thanks prof
@hafsabenzzi3609 · 2 years ago
Amazing
@logicboard7746 · 3 years ago
Bayesian @23:30, then 32:00
@kartikshrivastava1500 · 2 years ago
Wow, apt explanation. The captions were bad, at some point: "This means that theta is no longer parameter it's a random bear" 🤣
@utkarshtrehan9128 · 4 years ago
MVP
@subhasdh2446 · 2 years ago
I'm in the 7th lecture. I hope I find myself commenting on the last one.
@kilianweinberger698 · 2 years ago
Don’t give up!
@vatsan16 · 4 years ago
So the trick to getting past the spam filter is to use obscure words in the English language, eh? Who would have thought xD
@kilianweinberger698 · 4 years ago
Not the lesson I was trying to get across, but yes :-)
@vatsan16 · 4 years ago
@@kilianweinberger698 Okay. I am now having an "omg he replied!!" moment. :D Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covered it from a mathematical perspective. I found yours on a Friday and I have now finished 9 lectures in 3 days. Danke schön! :)
@deepfakevasmoy3477 · 4 years ago
12:46
@xiaoweidu4667 · 3 years ago
Talking about logistics and taking stupid questions from students is a major waste of this great teacher's talent.