Curse of Dimensionality : Data Science Basics

28,052 views

ritvikmath

A day ago

Comments: 54
@alaknandaagarwal8431 2 years ago
THE best explanation I have found for this concept. Very well explained, with examples, which is exactly what I was looking for. Everyone else explains what the curse of dimensionality is and why it happens only theoretically. Thank you.
@karimelmokhtari9559 4 years ago
The curse of dimensionality is often overlooked and is behind unexpected predictions. Thank you for explaining this issue through a simple approach.
@ritvikmath 4 years ago
no problem!
@RajaSekharGowda 4 years ago
I just went through your playlist... It was wonderful for data science enthusiasts like me... I've finally found the best channel 👍
@ritvikmath 4 years ago
Wow, thanks!
@treelight1707 4 years ago
First time I've seen that explanation for this topic. I always assumed it was just about computation time, which can increase exponentially; not that the algorithms themselves fail and a LOT more samples are needed.
@rakkaalhazimi3672 3 years ago
You explain it precisely, in a simple and subtle way. Now I have an idea of what to write on my blog :D thanks.
@derBolide 3 years ago
Thank you for this well-structured video! I have two questions though: 1. Why does the data form these peaks in the histogram? More precisely, why are there so many points with, for example, a distance of 10 to each other, and so many at a distance of 40? 2. If we have such clearly separated groups of points, e.g. at distance 10 and distance 40, how come we can't separate them with nearest neighbors? Isn't it easy to say that if a point has a lot of points around it at distance 10, it belongs to class A?
@gmermoud 2 years ago
You are absolutely right. This video is actually quite wrong. If you separate the histograms of intra- and inter-cluster distances, you see this clearly. If anything, the curse of dimensionality is a blessing here, as it makes the inter- and intra-cluster distances more sharply differentiated. What is true, however, is that, for a given cluster, the intra-cluster distances all become essentially the same as dimensionality increases. Check out Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012), "A survey on unsupervised outlier detection in high-dimensional numerical data", Statistical Analysis and Data Mining 5 (5): 363-387, for some discussion of when the curse of dimensionality is really a problem (and when it is actually a blessing). The Wikipedia article is also a good reference.
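
A minimal numpy sketch of the experiment described above (a toy setup of my own, assuming two spherical Gaussian clusters separated in every dimension, not the video's actual data): the intra-cluster spread stays roughly constant while the mean distance grows, so intra-cluster distances concentrate, yet they remain well below the inter-cluster distances.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # two spherical Gaussian clusters whose means differ in every dimension
    a = rng.normal(0.0, 1.0, size=(100, d))
    b = rng.normal(3.0, 1.0, size=(100, d))

    # pairwise Euclidean distances within cluster a, and between a and b
    intra = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    intra = intra[np.triu_indices(100, k=1)]  # keep each unique pair once
    inter = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).ravel()

    print(f"d={d:4d}  intra mean±sd: {intra.mean():7.2f}±{intra.std():.2f}"
          f"   inter mean±sd: {inter.mean():7.2f}±{inter.std():.2f}")
```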
@younique9710 1 year ago
@gmermoud That means that even if we use a high dimension, as long as we observe different distances between different pairs, we are good to go ahead with the analysis, because the differing distances imply that not every pair of data points is equidistant in the high dimension?
@JT-js3uf 1 year ago
@gmermoud Interesting point. Doesn't the notion of similarity break down, especially if one is relying on distance measures such as Euclidean distance, as instances from all classes become so spread out that instances from different classes end up nearly as close as instances from the same class (unless the signal is located on lower-dimensional manifolds)? If not, why do kNN and self-organising maps scale so poorly to very high-dimensional data?
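
A small sketch of the breakdown effect mentioned here, assuming a single cloud of i.i.d. uniform points with no class structure (my toy setup): the "relative contrast" between the farthest and nearest neighbor of a query shrinks with dimension, which is one common way to quantify how "nearest" loses meaning unless the signal lives on a lower-dimensional manifold.

```python
import numpy as np

rng = np.random.default_rng(1)

# relative contrast (d_max - d_min) / d_min between one query point and a
# cloud of uniform points: it shrinks toward 0 as dimension grows
for d in (2, 10, 100, 1000, 10000):
    pts = rng.uniform(size=(500, d))
    q = rng.uniform(size=d)
    dist = np.linalg.norm(pts - q, axis=1)
    print(f"d={d:6d}  relative contrast = {(dist.max() - dist.min()) / dist.min():.3f}")
```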
@modakad 3 months ago
Amazing explanation. The #variables vs. #samples-needed graph is an eye-opener. I have a question on the initial part: in the Euclidean pairwise distance vs. #dimensions part, even though the max-min distance gap is shrinking, the ranking of distances will (or might) still hold, irrespective of #dimensions. If that's the case, the algorithms should not lose any discriminative power in theory. In practice, yes, the strain this might put on compute requirements can make it impractical, hence the need to reduce dimensions. Does this make sense? Would love to know everyone's thoughts.
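
One way to probe this question with a quick sketch (my own setup, hedged accordingly): the ranking does hold if every added dimension carries signal, but when the extra dimensions are pure noise, the distance ranking itself gets scrambled, so the loss is statistical, not just computational.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# 200 points whose true structure lives in 2 informative dimensions; check
# whether the distance *ranking* from point 0 survives as pure-noise
# dimensions are appended
n = 200
signal = rng.normal(size=(n, 2))
true_dist = np.linalg.norm(signal - signal[0], axis=1)

for extra in (0, 10, 100, 1000):
    x = np.hstack([signal, rng.normal(size=(n, extra))])
    dist = np.linalg.norm(x - x[0], axis=1)
    rho, _ = spearmanr(true_dist[1:], dist[1:])
    print(f"noise dims={extra:5d}  rank correlation with true ranking = {rho:.3f}")
```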
@exmanitor 1 year ago
Very good explanation, exactly what I was looking for.
@samwhite4284 4 years ago
These videos are super helpful for intuition, cheers
@ritvikmath 4 years ago
Glad you like them!
@diegososa5280 4 years ago
Brilliant once again, cheers
@ritvikmath 4 years ago
thanks :)
@honeyBadger582 4 years ago
Great video as always! instructive and very easy to understand
@ritvikmath 4 years ago
Glad to hear it!
@fadouamassaoudy7832 5 months ago
Can you give us the link to the notebook, please?
@grandthruadversity 4 years ago
Hey, can you make a video on regularization, like the lasso, etc.?
@san_lowkey 5 days ago
thank you for this clear explanation
@SeyiTopeOgunji 1 year ago
Is this only applicable to (or more pronounced with) the KNN classifier, or does it apply to other classifiers as well?
@christiansetzkorn6241 3 years ago
great intuitive presentation
@sanjaykrish8719 3 years ago
Whooo!! phenomenal explanation.. ❤
@ritvikmath 3 years ago
Glad you liked it!
@kyleciantar307 4 years ago
Can you clarify for me what the y-axis in the histograms represent? My understanding is that the x-axis is the distance between two measured points, and the height of the graph represents the number of times that distance occurs. My question is then why does the y-axis use decimal values if we are counting the number of times the same measurement is made?
@srividhyasainath9297 3 years ago
Hey, I have the same question. Were you able to crack it? Let me know if you did
@lukestorer4399 3 years ago
@srividhyasainath9297 It's the probability density function, i.e. how often that value occurs.
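
For anyone landing here with the same question, a tiny matplotlib sketch of the normalized-histogram behaviour described above (assuming the video used a density-normalized histogram, which I can't verify from the plot alone):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
dists = rng.normal(10, 2, size=5000)  # stand-in for pairwise distances

# density=True rescales bar heights so the total area under the histogram
# is 1 -- that is why the y-axis shows decimals instead of raw counts
plt.hist(dists, bins=50, density=True)
plt.xlabel("pairwise distance")
plt.ylabel("density")
plt.show()
```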
@saggarwal01 8 months ago
Mind blown! Thank you Data god
@redherring0077 3 years ago
💜💜💜. Excellent excellent excellent. I am sharing these videos with my colleagues left and right 😛😋. Can you please do a video on multivariate adaptive regression splines and also a series on image processing? Love your videos.
@elvykamunyokomanunebo1441 2 years ago
What happens if you standardize the features beforehand in the case of KNN?
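
One way to explore this with a quick scikit-learn sketch (the dataset and defaults are my choices, not the video's): standardization fixes scale imbalance between features, so large-range features no longer dominate the Euclidean distance, but it doesn't by itself undo the curse, since uninformative dimensions still dilute the distance.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# wine features sit on wildly different scales, which skews raw kNN
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

print("raw accuracy:   ", raw.score(X_te, y_te))
print("scaled accuracy:", scaled.score(X_te, y_te))
```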
@Phil-oy2mr 4 years ago
Can you further explain why there are 2 (or 3) peaks in the histogram? I was thinking there would be n peaks, where n is the number of dimensions, assuming there are prevalent clusters.
@kdhlkjhdlk 4 years ago
There should be one peak, but he chose a silly example where there are two clusters that are perfectly separated in every dimension.
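
A quick sketch supporting this point, assuming spherical Gaussian data (my toy example): a single cluster produces one peak of pairwise distances; a second peak only appears when you add a separated cluster, whose between-cluster distances pile up further right.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)

# one Gaussian blob in 100 dimensions -> a single peak of pairwise distances
x = rng.normal(size=(200, 100))
plt.hist(pdist(x), bins=50, density=True)
plt.xlabel("pairwise distance")
plt.show()
```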
@jairjuliocc 4 years ago
Thank you. Very intuitive
@ritvikmath 4 years ago
You're very welcome!
@peterw8780 10 months ago
Excellent explanation
@ritvikmath 10 months ago
Glad it was helpful!
@keshavsharma267 4 years ago
Clearly explained. Thanks
@ritvikmath 4 years ago
You are welcome!
@pushkarparanjpe 3 years ago
Great explanation. Thanks! The dataset used to demonstrate the curse was randomly generated. Does the curse hold even for "regular" datasets, say something like a term-count matrix or tf-idf matrix computed from actual real-world documents? The vocabulary sizes of common NLP problems often run into many thousands, yet tf-idf is applied and seems to work decently. Is it immune to this curse?
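
A hedged, partial answer with a toy sketch (my example, not the video's): tf-idf vectors are extremely sparse and are usually compared with cosine similarity, so a given pair of documents effectively "sees" only the terms they share, which is far fewer dimensions than the full vocabulary. That sparsity is one reason the curse often bites less than the raw dimension suggests.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy documents standing in for a real corpus
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock markets fell sharply today",
    "markets rallied as stocks rose",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one column per term
print(X.shape)                             # high nominal dimension, few nonzeros
print(cosine_similarity(X).round(2))       # related documents still stand out
```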
@salarbasiri5959 4 years ago
Great explanation, thanks
@chyldstudios 4 years ago
Very nice!
@ritvikmath 4 years ago
Thanks!
@paulfaulker4201 3 years ago
Hi, could you share the notebook from this video? I tried finding it in your GitHub but couldn't. Thanks.
@gorgolyt 1 year ago
I don't think the first example is great. Why are there two strong peaks? You didn't explain this. Also, why is this a problem? For a k-NN classifier with 2 clusters, isn't this exactly what you want? You want there to be a clear distinction between points in the same cluster that are close (the first peak) and points in different clusters (the second peak). Also it doesn't make sense that you always seem to have a significant residual of points with zero distance.
@senthilkumaran6812 3 months ago
The plot shows the distance from each point to every other point. According to the plot, every point is at a large distance from the others (like 5 or 7), so in effect no point is really a nearest neighbour of any other.
@EW-mb1ih 3 years ago
Maybe you could have added a slide with the "takeaway". Anyway, nice video
@nedfellenor8203 3 years ago
Very helpful thank you
@mattp6460 3 months ago
thank you
@Lizergus 9 months ago
Harry Potter and the Curse of Dimensionality
@ritvikmath 9 months ago
😂