Errata: At 14:40, the derivative of the log-likelihood is taken with respect to the mu vector. There is a small error in that derivative, which led to an incorrect equality (a row vector is not equal to a column vector). The correct derivative follows formula (108) of the matrix cookbook: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf . The mistake is purely technical and affects neither the subsequent derivations nor the final result. Still, it's interesting to see which errors arise. Thanks to @gammeligeskäsebrot and @Aditya Mehrota for pointing this out :) The file on GitHub has been updated with some additional information and a correct version of the derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf
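For reference, a sketch of the corrected derivative (using the convention that the gradient with respect to the mu vector is a column vector, and exploiting the symmetry of the covariance matrix):

$$
\frac{\partial}{\partial \boldsymbol{\mu}} \, (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = -2 \, \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
$$

The result is a column vector, so the shapes on both sides of the subsequent equality match.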
@xiaoweidu4667 (3 years ago)
You are giving fantastic tutorials; please keep it up. Really appreciated.
@MachineLearningSimulation (3 years ago)
Thanks a lot ❤️ There is more like this coming in the future. Amazing feedback like yours is what motivates me, so thanks again 😊
@zejiachen9657 (2 years ago)
Thank you for this great tutorial! It helped a lot in getting my assignment done.
@MachineLearningSimulation (2 years ago)
You're very welcome 😁
@user-or7ji5hv8y (3 years ago)
Very clear and instructive. 😊
@MachineLearningSimulation (3 years ago)
Thanks a lot :)
@ivankissiov (1 year ago)
Thanks!
@MachineLearningSimulation (1 year ago)
Thank you for the kind donation 😊
@heitorcarvalho4940 (2 years ago)
Found your videos just recently while studying for Multivariate Analysis at college. As many have said, and I repeat, they're awesome, congratulations! Now, I'd like to ask you two questions if I may: 1. I'm using Richard Wichern as my main reference book. Have you used that one too, or do you have other recommendations? 2. Quite often in statistics (at least for me) it's very easy to get lost among all that math, and oftentimes I can't build the intuition for how to transfer that information to a Data Science problem. Do you have any other resources, besides your channel, that help with that intuition and with applying that knowledge? Even though Wichern's book is called Applied Multivariate Statistics, it still seems too heavy for me sometimes.
@MachineLearningSimulation (2 years ago)
Thanks for the kind words. :) 1.) I have not heard of the Richard Wichern book, but it seems to be a good resource :). The reason I came to these topics is Machine Learning rather than pure statistics (although the two are, of course, tightly coupled :)!). The book I mainly used was Christopher Bishop's "Pattern Recognition and Machine Learning", although, admittedly, it is quite a hard read. 2.) I can totally understand this problem. Transferring knowledge to real problems is always tricky. Often, the necessary experience comes over time. Thanks for also mentioning the channel, although I would argue that the problems this channel deals with are still toy-ish, in a sense :D. Probably, solving Kaggle challenges (and looking at the submission notebooks of other Kagglers) can be a helpful thing to do. I can also recommend this beautiful book by Aurélien Géron: www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
@user-or7ji5hv8y (3 years ago)
It's interesting that with MLE you always have to know or assume that the data follows a certain distribution, like the Normal in this case. I guess the assumption about the distribution is kind of like a prior.
@MachineLearningSimulation (3 years ago)
I would agree. Our assumption on the model of the data is somewhat like a prior. Generalizing this idea, one could even speak of "variational priors", i.e., probability distributions over potential probability distributions (which then leads again to functionals). But for the application in Machine Learning & Simulation, it might be interesting to look at it the other way around: you don't necessarily choose the distribution based on how the data looks in a scatter plot, but based on your understanding of the underlying phenomenon (or the simplifying assumptions you made about it). Think, for instance, of Gaussian Mixture Models. Surely, in many clustering scenarios the clusters do not follow a Gaussian distribution. But we make this particular assumption for two (rather pragmatic) reasons: first, we can still get decent results with it, and in the end that is what we usually care about. And second, more importantly, we get something that is tractable, or that we at least know how to work with. This is also the reason the (Multivariate) Normal appears in so many scenarios. (Of course, we can also derive it from maximum entropy, etc.) It is just a distribution that is simple, yet versatile. We know its properties, and many things are available analytically or in closed form.
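To make that pragmatic point concrete, here is a small, hypothetical sketch (not from the video), using scikit-learn's GaussianMixture: the two clusters are drawn from uniform distributions, so the Gaussian assumption is deliberately wrong, yet the GMM still separates them well.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two clusters drawn from uniform distributions (not Gaussian at all).
rng = np.random.default_rng(42)
cluster_a = rng.uniform(low=[0.0, 0.0], high=[1.0, 1.0], size=(200, 2))
cluster_b = rng.uniform(low=[3.0, 3.0], high=[4.0, 4.0], size=(200, 2))
data = np.vstack([cluster_a, cluster_b])

# Fit a GMM despite the (wrong) Gaussian assumption ...
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# ... and it still assigns each blob to a single component.
labels = gmm.predict(data)
print(labels[:200].std(), labels[200:].std())  # both ~0.0: one label per blob
```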
@孔子迁 (3 years ago)
I really appreciate your videos. Have you considered making a Bayesian Deep Learning tutorial?
@MachineLearningSimulation (3 years ago)
Thanks a lot for the feedback :) There is a lot of content on my to-do list. I will add Bayesian Deep Learning there, but I think it will take me some time until I can release something on it. For the next weeks, I have planned mostly videos on deep generative models. Optimistically, I could have some videos on Bayesian Deep Learning by the end of the year.
@GammeligesKaeseBrot (2 years ago)
Very nice video! I've got a question on the equation presented at min 14: if x is a 1-by-n vector (and mu as well) and the Sigma matrix is n-by-n, then the first term would be of shape n-by-1 and the second of shape 1-by-n. How can they be the same then?
@MachineLearningSimulation (2 years ago)
Hey, thanks a lot for the kind words :) Regarding your question: I am not sure if I understand it correctly. What I wanted to show at that point in the video was this: x and mu are R^k vectors, i.e., vectors with k entries, and those are column vectors; one could also write R^{k x 1}. The sigma matrix is then of shape R^{k x k}. Let's abbreviate (x - mu) as x and let sigma^{-1} be A. Then the equation I wrote down is: x^T @ A = A @ x. The left-hand side describes a vector-matrix product and the right-hand side a matrix-vector product. The vector-matrix product works because we transpose x first, turning it into a row vector, i.e., it is of shape R^{1 x k}. The given equation holds since A is symmetric. Does this shed some additional light on it? :) Let me know what is unclear.
@knowledgedistiller (2 years ago)
@@MachineLearningSimulation I had a similar question. Assume sigma is of shape (D x D). At 14:42, you are multiplying a vector of shape (1 x D) by a matrix of shape (D x D) on the left side, which produces shape (1 x D). On the right side, you are multiplying a matrix of shape (D x D) by a vector of shape (D x 1), which produces shape (D x 1). Aren't the left and right sides transposes of each other?
@MachineLearningSimulation (2 years ago)
@@knowledgedistiller @gammeligeskäsebrot you are both correct. :) I think I made a mistake there. It doesn't change the derivation much, but it might be an interesting technicality. Indeed, the equality sign is not fully correct, since, as you note, the left-hand side is a row vector, whereas the right-hand side is a column vector. The issue is actually in the derivative of the log-likelihood with respect to the mu vector. If we choose the correct form, as highlighted in formula (108) of the matrix cookbook (www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf ), everything works out fine (due to the symmetry of the covariance matrix). You could also show that in index notation. Thanks for the hint. I will leave a pinned comment that there was a small mistake.
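A quick numpy check (a minimal sketch, not part of the video) that shows both points at once: the two products have different shapes, but, because the matrix is symmetric, they are transposes of each other.

```python
import numpy as np

D = 3
rng = np.random.default_rng(0)

x_minus_mu = rng.normal(size=(D, 1))  # column vector, shape (D x 1)
A = rng.normal(size=(D, D))
A = A + A.T                           # make A symmetric, playing the role of Sigma^{-1}

lhs = x_minus_mu.T @ A                # vector-matrix product, shape (1 x D): a row vector
rhs = A @ x_minus_mu                  # matrix-vector product, shape (D x 1): a column vector

print(lhs.shape, rhs.shape)           # (1, 3) (3, 1): not the same shape
print(np.allclose(lhs, rhs.T))        # True: they are transposes of each other
```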
@MachineLearningSimulation (2 years ago)
@@knowledgedistiller I updated the file on GitHub with a small hint on the correct vector derivative: github.com/Ceyron/machine-learning-and-simulation/blob/main/english/essential_pmf_pdf/multivariate_normal_mle.pdf
@user-or7ji5hv8y (3 years ago)
Do you have any thoughts on the pros and cons of PyMC3 and Stan? Can TensorFlow Probability do everything that PyMC3 can do?
@MachineLearningSimulation (3 years ago)
I haven't worked with Stan yet and have only done some introductory tutorials in PyMC3, but I'll still share my (not too informed) thoughts. PyMC3 is similar to TFP in that it uses plain Python for most parts (or at least the parts you interact with). Stan, on the other hand, is a domain-specific language, meaning that you write "probabilistic programs" and then call Stan to evaluate them for you, for example by MCMC. I personally prefer the approach of PyMC3 and TFP, as it makes it easier to combine your projects with other stuff you might do in Python anyway. When it comes to performance, it's hard for me to judge, but I would say that Stan could be faster on pure CPU, whereas TFP could benefit from the TensorFlow XLA backend and perform better on a GPU. However, for the tutorials on this channel, any of the three is more than sufficient.

Speaking about functionality: for the basics (MCMC & Variational Inference), I think all three should be mostly identical. I had the impression that TFP offers a lot of distributions to choose from, which, together with its bijectors, can model almost everything commonly used nowadays. Probably it's similar for the others. One big advantage of TFP over the other two (as far as I am aware) is the tight integration with Keras to build "probabilistic layers" (with the reparameterization trick built in), which makes it so easy to build Variational Autoencoders (something I also want to introduce in a future video). One thing that PyMC3 or Stan users would probably criticize about TFP is that it uses long function names and sometimes unintuitive ways of doing things, and that the documentation can sometimes be a little too complicated. However, I personally like the API of TFP a lot.

Based on this limited view of the world of probabilistic programming, you could have probably guessed that I prefer TFP, which is also why I am presenting it here on the channel. However, I think the other projects are also excellent work, and it might be worth taking a look at them. Interesting projects to also take a look at are Turing.jl (something like PyMC3, but for Julia) and Edward2, which is similar to Stan (in that it is a domain-specific language) but, as far as I know, uses TFP as a backend.
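For illustration, a minimal (hypothetical) sketch of the Keras integration mentioned above: a toy regression model whose final layer outputs a distribution, trained by minimizing the negative log-likelihood. This is an assumed setup for illustration, not code from the channel.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfpl = tfp.layers

# A tiny regression model whose output is a distribution instead of a point estimate.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(tfpl.IndependentNormal.params_size(1)),
    tfpl.IndependentNormal(1),  # turns the two parameters into a Normal distribution
])

# The loss is the negative log-probability of the targets under the
# predicted distribution, i.e., maximum likelihood training.
model.compile(optimizer="adam", loss=lambda y, dist: -dist.log_prob(y))
```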
@MachineLearningSimulation (3 years ago)
I found this notebook tutorial for PyMC3: docs.pymc.io/pymc-examples/examples/variational_inference/convolutional_vae_keras_advi.html but haven't looked at it yet. Based on this, I think my point about TFP's Keras integration being an advantage may not hold. Therefore, to answer your second question: I think both have similar capabilities; maybe PyMC3 can even do more (or lets you do things in fewer lines of code than TFP).
@MachineLearningSimulation (3 years ago)
One last point in favor of PyMC3 over TFP: it is more mature, having existed longer than TensorFlow (Probability). And Google (the main maintainer of TFP) is known for killing projects: killedbygoogle.com/