He is a fantastic teacher. And all that energetic teaching has given him a sore throat.
@crestz1 · 1 year ago
This lecturer is amazing. As a Ph.D. candidate, I always revisit the lectures to familiarise myself with the basics.
@xiaoweidu4667 · 3 years ago
The key to a deeper understanding of algorithms is the assumptions made about the underlying data. Thank you and great respect.
@meenakshisarkar7529 · 4 years ago
This is probably the best explanation I have come across of the difference between Bayesian and frequentist statistics. :D
@brahimimohamed261 · 3 years ago
Someone from Algeria confirms that this lecture is incredible. You have made complex concepts very simple.
@deltasun · 4 years ago
Impressive lecture, thanks a lot! I was also impressed to discover that if, instead of taking the MAP, you take the EAP (expected a posteriori), then the Bayesian approach implies smoothing even with a uniform prior (that is, alpha = beta = 1)! Beautiful.
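To see that point numerically: with a uniform Beta(1, 1) prior the MAP estimate collapses back to the plain MLE, while the posterior mean ("EAP") already performs add-one (Laplace) smoothing. This is a minimal sketch of my own, not from the lecture; the helper names are made up for illustration:

    from fractions import Fraction

    def mle(nH, nT):
        # maximum likelihood estimate: relative frequency of heads
        return Fraction(nH, nH + nT)

    def map_estimate(nH, nT, alpha=1, beta=1):
        # mode of the Beta(nH + alpha, nT + beta) posterior
        return Fraction(nH + alpha - 1, nH + nT + alpha + beta - 2)

    def posterior_mean(nH, nT, alpha=1, beta=1):
        # mean of the Beta(nH + alpha, nT + beta) posterior (the "EAP")
        return Fraction(nH + alpha, nH + nT + alpha + beta)

    nH, nT = 3, 0                       # three heads, no tails
    print(mle(nH, nT))                  # 1   -> declares tails impossible
    print(map_estimate(nH, nT))         # 1   -> a uniform prior does not smooth the MAP
    print(posterior_mean(nH, nT))       # 4/5 -> smoothed, even with alpha = beta = 1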
@sandeepreddy6295 · 3 years ago
Makes the concepts of MLE and MAP very, very clear. We also get to know that Bayesians and frequentists both trust Bayes' rule.
@cuysaurus · 5 years ago
48:46 He looks so happy.
@SundaraRamanR · 5 years ago
"Bayesian statistics has nothing to do with Bayes' rule" - knowing this would have avoided a lot of confusion for me over the years. I kept trying to make the (presumably strong) connection between the two and assumed I didn't understand Bayesian reasoning because I couldn't figure out this mysterious connection
@WahranRai · 1 year ago
You are totally wrong!
@JohnWick-xd5zu · 4 years ago
Thank you Kilian, you are very talented!!
@mohammadaminzeynali9831 · 2 years ago
Thank you, Dr. Weinberger. You are a great lecturer, and also the YouTube algorithm subtitles your "also" as "eurozone".
@DavesTechChannel · 4 years ago
Amazing lecture, best explanation of MLE vs MAP
@abunapha · 5 years ago
Starts at 2:37
@sumithhh9379 · 4 years ago
Thank you, Professor Kilian.
@andrewstark8107 · 2 years ago
From 30:00 pure gold content. :)
@Jeirown · 4 years ago
When he says "basically", it sounds like "bayesly". And most of the time it still makes sense.
@zelazo81 · 5 years ago
I think I finally understood the difference between frequentist and Bayesian reasoning, thank you :)
@vishchugh · 4 years ago
Hi Kilian, while calculating the likelihood function in the example, you have also taken (nH+nT) choose (nH) into consideration. It doesn't change the optimization, but I guess it shouldn't be there, because in P(Data | parameter), with all samples being independent, it should just be Q^nH * (1-Q)^nT. Right?
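For what it's worth, the binomial coefficient only multiplies the likelihood by a constant that does not depend on theta, so it cannot change the maximizer; either version gives the same MLE. A quick sanity check, my own sketch with arbitrarily chosen numbers:

    import numpy as np
    from math import comb

    nH, nT = 7, 3
    thetas = np.linspace(0.01, 0.99, 99)

    bernoulli_ll = nH * np.log(thetas) + nT * np.log(1 - thetas)   # product-of-Bernoullis log-likelihood
    binomial_ll = np.log(comb(nH + nT, nH)) + bernoulli_ll         # version that includes the nCr term

    # both versions peak at the same theta, namely nH / (nH + nT)
    print(thetas[bernoulli_ll.argmax()], thetas[binomial_ll.argmax()])   # ~0.7 for both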
@dude8309 · 4 years ago
I have a question about how MLE is formulated when using the binomial distribution (or maybe in general?): I might be overly pedantic or just plain wrong, but looking at 18:01, wouldn't it be "more correct" to say P(H | D; theta) instead of just P(D; theta)? Since we're looking at the probability of H given the data, while using theta as a parameter?
@JoaoVitorBRgomes · 4 years ago
At circa 37:28, professor, you say something along the lines of 'which parameter makes our data most likely'. Could I say, in other words, 'which parameter corresponds to this distribution of data'? But not 'which parameter most probably corresponds to this distribution'? Or neither? What confuses me is reading P(D|theta). I read it as: what is the probability of this data/dataset given that I got this theta/parameters/weights. But when I start, I start with the data and then try to estimate the parameters, not the opposite. It is as if I somehow had the weights and then tried to discover the probability that these weights/parameters/theta belong to this dataset. Weird. Am I a Bayesian? Lol. (e.g. a logistic classification task for fraud). Kind regards!
@kilianweinberger698 · 3 years ago
Yes, you may be in the early stages of turning into a Bayesian. Basically, if you treat theta as a random variable and assign it a prior distribution, you can estimate P(theta|D), i.e. what is the most likely parameter given this data. If you are a frequentist, then theta is just a parameter of a distribution, and you pretend that you drew the data from exactly this distribution. You then maximize P(D;theta), i.e. which parameter theta makes my data most likely. (In practice these two approaches end up being very similar ...)
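A small sketch of the two recipes in this reply (my own, with an assumed Beta(2, 2) prior purely for illustration): the frequentist maximizes P(D; theta) alone, while the MAP estimate maximizes the posterior, which is proportional to P(D|theta)P(theta), so the prior pulls the answer toward 1/2.

    import numpy as np

    nH, nT = 8, 2
    a = b = 2.0                                  # assumed Beta(2, 2) prior, illustration only
    thetas = np.linspace(0.001, 0.999, 999)

    log_lik = nH * np.log(thetas) + nT * np.log(1 - thetas)              # log P(D; theta)
    log_prior = (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas)  # log P(theta), up to a constant
    log_post = log_lik + log_prior                                       # log P(theta | D), up to a constant

    print(thetas[log_lik.argmax()])    # MLE: nH / (nH + nT) = ~0.8
    print(thetas[log_post.argmax()])   # MAP: (nH + a - 1) / (nH + nT + a + b - 2) = ~0.75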
@saitrinathdubba · 6 years ago
Just brilliant!! Thank you, Prof. Kilian!!!
@prwi87 · 1 year ago
Edit: After thinking, checking, finishing the lecture, and watching a bit of the lecture after this one, I have come to the conclusion that my first explanation was wrong, as I didn't have enough knowledge. The way it is calculated is fine; where I struggled was understanding which PDF the Professor was using. What threw me off was P(D; theta), which is a joint PDF (I know it's a PMF, but for me they are all PDFs if you put a delta function in there) of obtaining exactly the data D, because D is a realization of some random vector X; to be more precise in notation, P(D; theta) should be written as P(X = D; theta). But what the Professor meant was the PDF P(H = n_h; len(D), theta), which is a binomial distribution. Then we can calculate the MLE just as it was calculated during the lecture. But this is not the probability of getting the data D; it is the probability of observing exactly n_h heads in len(D) tosses. Then in MAP we have the conditional distribution H | theta ~ Binom(len(D), theta), written as P(H = n_h | theta; len(D)); we treat theta as a random variable but len(D) as a parameter.

There are two problems with the explanation that starts around 18:00. Let me state the notation first. Let D be the data gathered; this data is a realization of the random vector X. n_h is the number of heads tossed in D. nCr(x, y) is the number of combinations, x choose y.

1. The Professor writes that P(D; theta) is equal to the binomial distribution of the number of heads tossed, which is not true. A binomial distribution is determined by two parameters, the number of independent Bernoulli trials (n) and the probability of obtaining the desired outcome (p), thus theta = (n, p). If we have tossed the coin n times, there is nothing we don't know about n, since we have chosen it, so n is fixed and, most importantly, known to us! Because of that, let us denote n = len(D) and then theta = p. Now let H = the number of heads tossed; then P(H = n_h; len(D), theta) = nCr(len(D), n_h) * theta ^ n_h * (1 - theta) ^ (len(D) - n_h) is precisely the distribution that was written by the Professor. I also noticed that one person in the comments asked why we cannot write P(H | D; theta), or more precisely P(H = n_h | len(D); theta). The reason is that len(D) is not a random variable; we are the ones choosing the number of tosses, and there is nothing random about it. Note that in the notation used in that particular comment, theta is treated as a parameter, as it is written after ";".

2. To be precise, P(X = D; theta) is a joint distribution. For example, if we had tossed the coin three times, then D = (d1, d2, d3) with d_i in {0, 1} (0 for tails and 1 for heads), and P(X = D; theta) = P(d1, d2, d3; theta). P(X = D; theta) is the joint probability of observing the data D we got from the experiment. The likelihood function is then defined as L(theta | D) = P(X = D; theta), but keep in mind that the likelihood is not a conditional probability distribution, as theta is not a random variable. The correct way to interpret L(theta | D) is as a function of theta whose value also depends on the underlying measurements D. Now, if the data are i.i.d., we can write P(X = D; theta) = P(X_1 = d1; theta) * P(X_2 = d2; theta) * ... * P(X_len(D) = d_len(D); theta) = L(theta | D). In our coin-tossing example, P(X_i = d_i; theta) = theta ^ d_i * (1 - theta) ^ (1 - d_i), where d_i is in {0, 1} (0 for tails and 1 for heads). Given that, L(theta | D) = theta ^ sum(d_i) * (1 - theta) ^ (len(D) - sum(d_i)), where sum(d_i) is simply n_h, the number of heads observed.

And now we are maximizing the likelihood of observing the data we have obtained. Note that the way it was done during the lecture was right! But we were maximizing the likelihood of observing n_h heads in len(D) tosses, not of observing exactly the data D. Also, for any curious person, the "true Bayesian method" that the Professor described at the end is called minimum mean-squared-error estimation (MMSE), which aims to minimize the expected squared error between the random variable theta and some estimate of theta computed from the data random vector, g(X). To support my argument, here are the sources I used to write the above statements: "Foundations of Statistics for Data Scientists" by Alan Agresti (Chapter 4.2), and "Introduction to Probability for Data Science" by Stanley Chan (Chapter 8.1). Sorry for any grammar mistakes, as English is not my first language. As I'm still learning all this data science stuff, I can be wrong, and I'm very open to any criticism and discussion. Happy learning!
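On the MMSE remark at the end: here is a quick numerical check (my own sketch, with an assumed Beta(1, 1) prior and made-up counts) that the posterior mean is exactly the point estimate minimizing the posterior expected squared error, which is why the "true Bayesian" prediction uses E[theta | D]:

    import numpy as np

    nH, nT, a, b = 6, 4, 1.0, 1.0                  # data plus an assumed Beta(1, 1) prior
    theta = np.linspace(0.001, 0.999, 999)

    # unnormalized Beta(nH + a, nT + b) posterior on a grid
    post = theta ** (nH + a - 1) * (1 - theta) ** (nT + b - 1)
    post /= post.sum()

    candidates = np.linspace(0.0, 1.0, 1001)       # candidate point estimates g(D)
    risk = [(post * (theta - c) ** 2).sum() for c in candidates]

    print(candidates[np.argmin(risk)])             # ~0.583: minimizer of expected squared error
    print((nH + a) / (nH + a + nT + b))            # posterior mean 7/12 ~ 0.583: they coincide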
@beluga.314 · 1 year ago
You're mixing up 'distribution' and 'density'. P(d1, d2, d3; theta) - this notation is correct, but P(X = D; theta) is wrong for a density function and you can't write it like that. But since these are also probabilities (discrete), you can write it like that here.
@arjunsigdel8070 · 4 years ago
Thank you. This is great service.
@marcogelsomini7655 · 2 years ago
48:18 loop this!! Thx Professor Weinberger!
@thachnnguyen · 9 months ago
I raise my hand. Why do you assume any particular type of distribution when discussing this? What if I don't know that formula? All I see is nH and nT. Why not work with those?
@SalekeenNayeem · 4 years ago
MLE starts at 11:40
@StarzzLAB · 3 years ago
I teared up at the end as well
@MarcoGelsomini-r8c · 1 month ago
34:52 very interesting point!
@coolblue5929 · 2 years ago
Very enjoyable. I think a Killian is like a thousand million right? I got confused at the end though. I need to revise.
@HimZhang · 2 years ago
In the coin toss example (lecture notes, under "True" Bayesian approach), P(heads∣D)=...=E[θ|D] = (nH+α)/(nH+α+nT+β). Can anyone explain why the last equality holds?
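The last equality is the mean of a Beta distribution: with a Beta(α, β) prior and nH heads / nT tails, the posterior over θ is Beta(nH+α, nT+β), and the mean of a Beta(a, b) is a/(a+b), since ∫ θ · θ^(a-1)(1-θ)^(b-1) dθ / B(a, b) = B(a+1, b)/B(a, b) = a/(a+b). A tiny numerical check of that identity, my own sketch with made-up numbers:

    from math import gamma

    def B(a, b):
        # Beta function: B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
        return gamma(a) * gamma(b) / gamma(a + b)

    nH, nT, alpha, beta = 5, 3, 2.0, 2.0   # made-up values, just to check the identity
    a, b = nH + alpha, nT + beta           # the posterior over theta is Beta(a, b)

    print(B(a + 1, b) / B(a, b))           # E[theta | D] via the Beta-function identity
    print(a / (a + b))                     # (nH + alpha) / (nH + alpha + nT + beta), same number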
@Klisteristhashit · 4 years ago
xkcd comic mentioned in the lecture: xkcd.com/1132/
@jandraor · 5 years ago
What's the name of the last equation?
@jachawkvr · 4 years ago
I have a question. Is P(D;theta) the same as P(D|theta)? The same value seems to be used for both in the lecture, but I recall Dr. Weinberger saying earlier in the lecture that there is a difference.
@kilianweinberger698 · 4 years ago
Well, for all intents and purposes it is the same. If you write P(D|theta) you imply that theta is a random variable, enabling you to impose a prior P(theta). If you write P(D;theta) you treat it as a parameter, and a prior distribution wouldn't make much sense. If you don't use a prior, the two notations are identical in practice.
@jachawkvr · 4 years ago
Ok, I get it now. Thank you for explaining this!
@Bmmhable · 5 years ago
At 36:43 you call P(D|theta) the likelihood, the quantity we maximize in MLE, but earlier you emphasized how MLE is about maximizing P(D; theta) and noted how you made a "terrible mistake" in your notes by writing P(D|theta), which is the Bayesian approach... I'm confused.
@kilianweinberger698 · 5 years ago
Actually, it is more subtle. Even if you optimize MAP, you still have a likelihood term. So it is not that Bayesian statistics doesn't have likelihoods; it is just that it allows you to treat the parameters as a random variable. So P(D|theta) is still the likelihood of the data, just here theta is a random variable, whereas in P(D;theta) it would be a hyper-parameter. Hope this makes sense.
@Bmmhable · 5 years ago
@@kilianweinberger698 Thanks a lot for the explanation. Highly appreciated.
@imnischaygowda · 2 years ago
"nH + nT choose nH" - what exactly do you mean here?
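It is the binomial coefficient: the number of distinct orderings in which nH heads can appear among nH + nT tosses. In Python it is math.comb; a tiny illustration with my own example numbers:

    from math import comb, factorial

    nH, nT = 3, 2
    # "(nH + nT) choose nH": how many orderings of 3 heads and 2 tails exist in 5 tosses
    print(comb(nH + nT, nH))                                      # 10
    print(factorial(nH + nT) // (factorial(nH) * factorial(nT)))  # same thing: (nH+nT)! / (nH! nT!)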
@abhinavmishra9401 · 4 years ago
Impeccable
@jijie133 · 3 years ago
Great!
@KulvinderSingh-pm7cr · 6 years ago
Made my day!! Learnt a lot!!
@sushmithavemula2498 · 5 years ago
Hey Prof, your lectures are really good. But if you could provide some real-life applications/examples while explaining a few concepts, it would help everyone understand them better!
@abhishekprajapat415 · 4 years ago
18:19 How did that expression even come about? What is this expression even called in maths? By the way, I am a B.Tech. student, so I guess I might not have studied the math behind this expression.
@SalekeenNayeem · 4 years ago
Just look up the binomial distribution. That's the usual way of writing the probability of an event that follows a binomial distribution. You may also want to check the Bernoulli distribution first.
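Concretely, the expression at 18:19 is the probability mass function of a binomial distribution: P(nH heads in nH + nT tosses) = (nH+nT choose nH) * theta^nH * (1-theta)^nT. A small check against scipy, my own sketch with made-up numbers:

    from math import comb
    from scipy.stats import binom

    n, k, theta = 10, 7, 0.7                                    # made-up numbers for illustration
    manual = comb(n, k) * theta ** k * (1 - theta) ** (n - k)   # the expression from 18:19
    print(manual)                                               # ~0.2668
    print(binom.pmf(k, n, theta))                               # the binomial pmf gives the same value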
@pritamgouda7294 · 10 months ago
Can someone tell me which lecture has the proof about the k-nearest-neighbor algorithm that he mentions at 5:09?
@kilianweinberger698 · 9 months ago
kzbin.info/www/bejne/parQpXqifMmHY80
@pritamgouda7294 · 9 months ago
@@kilianweinberger698 Sir, I saw that lecture and its notes as well, but the notes mention the Bayes optimal classifier and I don't think it is in the video lecture. Please correct me if I'm wrong. Thank you for your reply 😊
@yuniyunhaf5767 · 5 years ago
thanks prof
@hafsabenzzi3609 · 2 years ago
Amazing
@logicboard7746 · 3 years ago
Bayesian @23:30, then 32:00
@kartikshrivastava1500 · 2 years ago
Wow, apt explanation. The captions were bad, at some point: "This means that theta is no longer parameter it's a random bear" 🤣
@utkarshtrehan9128 · 4 years ago
MVP
@subhasdh2446 · 2 years ago
I'm in the 7th lecture. I hope I find myself commenting on the last one.
@kilianweinberger698 · 2 years ago
Don’t give up!
@vatsan16 · 4 years ago
So the trick to getting past the spam filter is to use obscure words in the English language, eh? Who would have thought xD
@kilianweinberger698 · 4 years ago
Not the lesson I was trying to get across, but yes :-)
@vatsan16 · 4 years ago
@@kilianweinberger698 Okay. I am now having an "omg he replied!!" moment. :D Anyway, you are a really great teacher. I have searched long and hard for a course on machine learning that covered it from a mathematical perspective. I found yours on a Friday and I have now finished 9 lectures in 3 days. Danke schön! :)
@deepfakevasmoy3477 · 4 years ago
12:46
@xiaoweidu4667 · 3 years ago
Talking about logistics and taking stupid questions from students is a major waste of this great teacher's talent.