He is the only professor I've seen who asks, "Raise your hands if this makes sense." He could have asked "Raise your hands if it doesn't make sense," but he knows that people who didn't understand feel under-confident and are generally hesitant to ask questions. He repeats key concepts until most of the class has understood. Great respect for you, sir! And a great teaching method!
@astutzorr3355 (4 years ago)
Finding your channel has helped me SO MUCH. I really appreciate it when there is some mathematical rigour; most of the ML guys on the internet just wave their hands every time a nuanced concept comes along. Thank you.
@nicolasrover8803 (3 years ago)
Professor, you are a true genius at explaining this stuff. Your lectures are so much fun and motivating!
@jenishah9825 (2 years ago)
Most people are viewing these lectures because they came across one video and now want to learn everything Prof. Kilian has to share. But if this were offered as a certified course online, more people would know about it and could officially take time out to complete the course.
@kilianweinberger698 (2 years ago)
Actually, it is: online.cornell.edu/machine-learning/
@rohit2761 (3 years ago)
Hello everyone, including Prof. Kilian. First of all, thank you from the bottom of my heart for uploading these gem video lectures online. Absolutely loved them. Can anyone, including Dr. Kilian, please recommend a deep learning course like this one? I'd be highly obliged. It has to have the same quality as this course.
@jayantpriyadarshi9266 (2 years ago)
I can recommend one.
@rohit2761 (2 years ago)
@@jayantpriyadarshi9266 Please tell.
@jayantpriyadarshi9266 (2 years ago)
@@rohit2761 Search for Deep Learning by Mitesh Khapra on YouTube.
@rohit2761 (2 years ago)
@@jayantpriyadarshi9266 Thanks bro, I've already done that course. Please suggest some other one.
@jayantpriyadarshi9266 (2 years ago)
@@rohit2761 There is Deep Learning 2 by him as well, which is more about GANs and such. If you have done that much, I think you can read papers.
@vivekupadhyay6663 (3 years ago)
Thank you, Sir, for this clear and concise explanation of the topic.
@Saganist420 (5 years ago)
Fantastic lecture!
@salmaabdelmonem7482 (4 years ago)
Thanks a million prof kilian 😀
@sumitsingh-yz3vm (3 years ago)
Thank you for the amazing lecture series.
@jorgeestebanmendozaortiz873 (3 years ago)
For the impatient, the lecture starts at 2:55.
@hdang1997 (5 years ago)
"I ask my uncle who's a magician".
@varunjindal1520 (4 years ago)
Awesome explanation..!!
@thatsharma1066 (5 months ago)
lectures so good, kilian on my poster wall. :)
@roniswar (3 years ago)
I really enjoy your lectures, Prof! Could you please elaborate on your statement that the sets D_1, ..., D_M are not i.i.d.? Since you sample with replacement, I do not see any reason why knowing the previous sets D_1, ..., D_{i-1} would help you predict the set D_i (except maybe for the not very interesting case that D has only a single element, in which case the whole process is doomed anyway...).
@kilianweinberger698 (3 years ago)
Let’s say the elements in D are drawn i.i.d. from P. Given P, if you take the first half of D, you gain no information about the second half. In contrast, if you now create data sets D_1, D_2, ..., D_M from D (through sampling with replacement), these data sets are not i.i.d. For example, imagine P(z)>0, i.e. there is a non-zero chance that we sample a specific element z from P. But imagine that just by chance, z is not in D. If M is large enough, and you observe that z is not in D_1, not in D_2, not in D_3, ..., not in D_{M-1}, you can conclude that z was almost certainly not in D, and therefore is extremely unlikely to be in D_M. In other words, the data sets D_1, ..., D_{M-1} tell you something about D_M, and they are not independent. Of course you can also simply look at the extreme case where D={z}. Clearly D_1 tells you something about D_2. Hope this helps.
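A quick numerical check of the dependence described above; this is a minimal sketch assuming only NumPy, and the distribution P, the element z, and all constants are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
n, M, trials = 20, 50, 2000
z = 7   # a specific value with P(z) > 0 under P = Uniform{0, ..., 9}

in_last, others_miss = [], []
for _ in range(trials):
    D = rng.integers(0, 10, size=n)                          # D drawn i.i.d. from P
    boots = [rng.choice(D, size=n, replace=True) for _ in range(M)]
    in_last.append(z in boots[-1])                           # is z in D_M?
    others_miss.append(not any(z in b for b in boots[:-1]))  # z absent from D_1..D_{M-1}?

in_last, others_miss = np.array(in_last), np.array(others_miss)
print("P(z in D_M)                         ~", in_last.mean())
print("P(z in D_M | z not in D_1..D_{M-1}) ~", in_last[others_miss].mean())

If D_M were independent of D_1, ..., D_{M-1}, the two printed estimates would match; instead the conditional one comes out close to zero, which is exactly the dependence Kilian describes.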
@roniswar (3 years ago)
@@kilianweinberger698 Thanks a lot for the specific answer!! I understand that for outliers (usually the rarest ones, maybe also the common ones) you can learn something about D_M given D_1, ..., D_{M-1}. It is weird that even Wikipedia (perhaps not the most accurate source of truth, but still usually pretty accurate) states: "This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling" here: en.wikipedia.org/wiki/Bootstrap_aggregating
@kilianweinberger698 (3 years ago)
Oh I see the confusion. The subsets are dependent samples from the original distribution P, but they are independent bootstraps from the original data set. The bootstraps are even independent in the extreme case D={x}, as D_1 tells you nothing about D_2 that you couldn’t already infer from D. Hope this helps.
@roniswar (3 years ago)
@@kilianweinberger698 Yes, that was exactly the confusion indeed!! Thanks for clearing it up, Prof!
@ChiragSinghishere (4 years ago)
"Before each split randomly subsample k ≤ d features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)" ... is the line from the lecture notes ... I think it should be "decrease the variance" ...
@kilianweinberger698 (4 years ago)
No, it is correct. It is a little confusing, because generally fewer features would mean you decrease the variance of your classifier. However by reducing the features for each individual split, you build very different trees - so this *increases* the variance *across* the trees, which you then reduce again by averaging.
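A hedged sketch of what that per-split feature subsampling looks like; the function names and the use of Gini impurity here are illustrative choices, not the lecture's reference code:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(X, y, k, rng):
    """One node of a random-forest tree: look only at a fresh random subset
    of k <= d features, and pick the best (feature, threshold) among those."""
    d = X.shape[1]
    features = rng.choice(d, size=k, replace=False)   # sampled without replacement
    best = None
    for j in features:
        for t in np.unique(X[:, j])[:-1]:             # candidate thresholds for feature j
            mask = X[:, j] <= t
            score = weighted_impurity(y[mask], y[~mask])
            if best is None or score < best[0]:
                best = (score, j, t)
    return best   # (impurity, feature index, threshold)

With k = int(np.sqrt(d)) (a common default), each node sees a different small feature subset, so the resulting trees differ a lot from one another, and averaging them brings the variance back down, as described in the reply above.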
@ChiragSinghishere (4 years ago)
@@kilianweinberger698 Thanks a lot...
@DommageCollateral (1 year ago)
I love this teaching style. I would love to see TU Berlin create tutorials and lecture notes like this, rather than copying snippets from universities all over the world and just pasting them. A lot of the time they use code from huge projects but only need a small part of it for the task; they don't explain the rest of the code, yet you need to understand the whole code to fill the gaps. So you end up with a project that is quite advanced but with some parts left out, and on top of that you need to translate everything, including the names of the math symbols, since different symbols are used in English, so you always have to keep that in mind. They don't even create their own content for the homework, and that sucks.
@astutzorr3355 (4 years ago)
Can someone help me with why the bagging datasets are not independent? Are we not drawing them randomly? What would be an example that is independent, then?
@kilianweinberger698 (4 years ago)
Well, imagine in the extreme case your original training data set only has a single element {x}, drawn from some distribution P. Now you apply bagging on this data set of size one, and you re-sample m new data sets from it. But because it only contains one element, they are all identical {x},{x},{x},...,{x}. So clearly they are not independent. If I tell you what one data set is, you know all the others. Hope this helps.
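For anyone who wants to play with this, a minimal bagging sketch, assuming NumPy and scikit-learn are available; the dataset and all constants are arbitrary:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

M, n = 25, len(X)                                  # number of bootstrap sets D_1, ..., D_M
trees = []
for _ in range(M):
    idx = rng.integers(0, n, size=n)               # sample n indices with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])    # shape (M, n)
y_hat = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote of the M trees
print("training accuracy of the bagged ensemble:", (y_hat == y).mean())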
@srisaisubramanyamdavanam9912 (3 months ago)
The concept is bootstrapping: you are creating random subsets from the same data, so samples may repeat in any of the trees.
@aragasparyan8295 (3 years ago)
Thanks for the lecture. I have a question concerning showing that the bootstrapped data sets are drawn from the original distribution P. In your lecture notes you assume that P is discrete; is that an assumption that simplifies the proof of P(X=x) = Q(X=x)? In other words, can we show that P and Q are identical distributions when P is not discrete (the general case)?
@vocabularybytesbypriyankgo1558 (4 months ago)
Thanks a lot Sir !!
@pmdl (4 years ago)
@prof, great lecture. I especially liked how you started from k-d trees as a data structure and arrived at decision trees. Btw, how do you show that the D_i (the subsamples) come from the same distribution as D? Also, in random forests, k = sqrt(d); what would be a justification / intuition for that?
@kilianweinberger698 (4 years ago)
Yes, that should be in the notes. The D_i are sampled from the same distribution as D, but they are *not* independent.
@pmdl (4 years ago)
@@kilianweinberger698 Yes, found that in the notes later. What about the intuition for the sqrt(d) feature subsample in random forests, though? Not sure if I missed that too!
@pmdl (4 years ago)
Btw Prof, the whole video series is awesome! I still have to finish up kernels and Gaussian processes, but the rest has been wonderful!
@TheCuriousCurator-Hindi (1 month ago)
I have been a [senior] applied scientist at places like Amazon & Microsoft. I have been studying ML for 12 years and have 7+ years of industry experience at big tech. What are the chances I get into Cornell's PhD program? In terms of fluency, I can read 80-90% of PML: An Introduction by Kevin Murphy in 3-4 weeks.
@broccolee4328 (2 years ago)
Hi Dr. Kilian, would you recommend pruning before bagging, or does it not matter?
@alimamdouh1764 (3 years ago)
Why don't we solve the variance problem by just not insisting on every point being correctly classified? Why don't we just keep branching until a certain threshold and then do a majority vote?
@kilianweinberger698 (3 years ago)
Indeed, you can do that. If you enforce a maximum leaf-node size (e.g. stop splitting if fewer than a certain number of points fall into a node), that also reduces the variance. The downside is that it gives you yet another parameter to fiddle with.
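In scikit-learn, for example, that knob is the min_samples_leaf parameter of DecisionTreeClassifier; a tiny sketch of the tradeoff, where the dataset and the values tried are arbitrary:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Larger leaves mean earlier stopping (shallower trees, less variance),
# at the price of one more knob that has to be tuned.
for min_leaf in (1, 5, 25, 100):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0).fit(X, y)
    print(min_leaf, tree.get_depth(), round(tree.score(X, y), 3))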
@alimamdouh1764 (3 years ago)
@@kilianweinberger698 This is the first time I have heard about "fiddling parameters". Could you please recommend some resource where I can read about it?
@srisaisubramanyamdavanam9912 (3 months ago)
@@alimamdouh1764 Hey, he is just saying that with bagging there are fewer parameters to deal with.
@rajatpatel4657 (4 years ago)
Why is the complexity of the 1st approach O(d*K*N^2)? We can compute all the p_k first, and then compute H(S_L) from them (over all K class labels). Therefore, we can do this in O(d*N*(N+K)), can't we?
@StarzzLAB (4 years ago)
I am thinking the same.
@husamalsayed8036 (3 years ago)
Exactly, that's what I was thinking.
@71sephiroth (4 years ago)
Is it a property of entropy that it's additive? I can't seem to figure out intuitively why we use the weighted sum of H(Sr) and H(Sl).
@kilianweinberger698 (4 years ago)
If you didn’t use the weighted sum, and just the sum H(Sr)+H(Sl) you would end up splitting off a single element. That would make one side completely pure (entropy 0), but wouldn’t help you much to improve your classifier. Hope this helps.
@71sephiroth (4 years ago)
@@kilianweinberger698 Ahh, thanks! But what is wrong with the 'regular average', i.e. weights = 1/2? Edit: For example, if we have target variable = [1, 0, 1, 0, 0, 0, 0, 1], then a 1:(n-1) split will 'always' have maximal information gain compared to 2:(n-2), 3:(n-3), etc., whether we use the weighted average, the 'normal' average, or no average at all to calculate the information gain. I think in real life it's not very likely we will have a feature that splits the target in a 1:(n-1) fashion, but I'm still curious.
@haizhoushi8287 (4 years ago)
Both Gini impurity and entropy are concave functions; the weighted sum has exactly the form that appears in the defining property of a concave function.
@71sephiroth (4 years ago)
@@haizhoushi8287 Can you please elaborate more?
@ousam2010 (4 years ago)
@@71sephiroth Actually, I think you are right that it is a weighted sum, i.e., weighted by the sizes of Sl and Sr. Let's use c(X) to indicate the class (c1, c2, ..., cK) a sample X belongs to. The weighted sum then approximates H(Sl) Pr(Sl) + H(Sr) Pr(Sr) = H(c(X) | X in Sl) Pr(X in Sl) + H(c(X) | X in Sr) Pr(X in Sr) = H(c(X) | s(X)), where s(X) is the index indicating whether X is in the left node or the right node, e.g. s(X)=1 if X is in Sl and s(X)=0 if X is in Sr. Since the conditional entropy H(c(X) | s(X)) can be viewed as the amount of information remaining in c(X) once s(X) is known, it is very natural to minimize H(c(X) | s(X)) to get the optimal split.
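To put numbers on this discussion, a small sketch assuming NumPy; the labels are the ones from the comment above, while the two candidate splits compared are arbitrary illustrative choices:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def unweighted_sum(left, right):
    return entropy(left) + entropy(right)

def weighted_sum(left, right):
    n = len(left) + len(right)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

y = np.array([1, 0, 1, 0, 0, 0, 0, 1])    # the toy labels from the thread above

split_a = (y[:1], y[1:])                   # peel off a single (pure) point
split_b = (y[:4], y[4:])                   # a more balanced split
for name, s in [("1:(n-1) split", split_a), ("4:4 split", split_b)]:
    print(name, "unweighted:", round(unweighted_sum(*s), 3),
          "weighted:", round(weighted_sum(*s), 3))

Both criteria are printed side by side so you can compare them for any split you like; the weighted version is the conditional entropy H(c(X) | s(X)) described in the comment above.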
@ugurkap (5 years ago)
I think the running time of the naive implementation may be O(dn^2) instead of O(dkn^2). We have to try every split point in every dimension, so the n*d factor is a must, but I don't think we have to repeat the rest of the operations k times. Using a loop such as (binary case, labels 1 and 2):

left_pt = 0
right_pt = 0
left_class_1 = 0
left_class_2 = 0
right_class_1 = 0
right_class_2 = 0
for i = 1 to n:
    # indicators for the class of point i
    is_class_1 = 1 if y_i == 1 else 0
    is_class_2 = 1 if y_i == 2 else 0
    if x_i[d] < some_threshold:
        left_pt += 1
        left_class_1 += is_class_1
        left_class_2 += is_class_2
    else:  # x_i[d] >= some_threshold
        right_pt += 1
        right_class_1 += is_class_1
        right_class_2 += is_class_2

Effectively, this will calculate how many data points we have to the left and to the right. Additionally, it gives us the p_k as well, since p_left_class_1 = left_class_1 / left_pt. Is there any reason we should also go over the labels with another loop?
@aayushchhabra7465 (5 years ago)
I was thinking the same. We can make one pass through the entire dataset and keep running counts: one for the left side, one for the right side, and one for each class on each side. As soon as we encounter a point, we either increase the left or the right counter and simultaneously update the corresponding class counter on that side. I believe that removes k from the algorithmic complexity, as you suggested.
@srisaisubramanyamdavanam9912 (3 months ago)
I think he got mixed up between dealing with continuous and discrete features. Yes, you are right: for continuous features it is O(n^2 d).
The person who gave the thumbs down (dislike) to this video is currently jobless and wasting his money in data science diplomas/courses and consuming bullshit content :P