This is how teachers should teach: a whiteboard already filled out, and just explaining, jumping from one concept to another while comparing them as you go. Not slowly filling in the whiteboard and wasting time; that feels unprepared. Good job!
@SiliconTechAnalytics 6 months ago
Not only do all the students get TESTED; they all also get TRAINED! Variance in accuracy significantly reduced!! Well done Ritvik!!!
@geoffreyanderson4719 3 years ago
Great video. It's a good start. Andrew Ng's courses on Coursera are the state of the art in explaining what to do; I will merely summarize quickly.

A serious problem with cross-validation used too soon, as shown here on just two data partitions ("training" and "testing"), is that you are left with no unbiased data (new, that is, new to the model) on which to evaluate your final model afterward, so your estimate will be too optimistic. The second problem, and yes it is a problem for your model development process, is that CV mixes concerns together, so you are simultaneously improving bias error and variance error. You need a third partition of new data, one that never participates in training, hyperparameter selection, or grid search, to give an unbiased error estimate for your final model. CV methodology immediately clouds the decisions you need to take next as the model developer if your model still isn't doing well enough: should you work on improving your model's bias, or its variance error? And how would you know which one is the problem (hint: with CV alone, you don't)? Should you add regularization? Add more features? Get more labeled examples? Deepen the trees, or make them shallower, or add more units or layers to your DNN, or add dropout and batch normalization, and on and on. There may be too many knobs to turn, too many options, which can waste your time. Imagine collecting more labeled data when that was never the problem. This is why you may prefer to proceed more deliberately in the dev process. You may also want to assess and address class-wise error disparities: does your model have higher error on only a few particular classes of Y but not others? It may be time to resample to increase the number of examples of the minority classes the current model gets wrong, or to weight the loss function terms to balance the total loss better.

A better practical solution is a sequential methodology, where bias is reduced first, all by itself. Fit your model on an extremely small number of training examples at first. Your bias error is encouragingly low if your model can fit that tiny training set well (the training set alone, not the validation set). If the model can't fit the training set well yet, you need to do more work on bias error, and should not concern yourself with variance error yet. Bias error is basically reduced by getting more and better features (the "P" in this video), a higher-capacity model (more trees, more layers, more units), and a model architecture better suited to your dataset. Also, your dataset may be dirty and need improving; it's not just the algorithm that makes a machine learning model work well, so evaluate your dataset for problems systematically. Variance error, by contrast, is basically reduced by getting more data examples (the "N" in this video) and more regularization (and, again, a better model architecture for your particular dataset).

I don't explain it as well as Andrew Ng, of course; it takes more than a comment to communicate it all well. Ng is by far the best practical explainer of this model development process in the world. If you want to level up as a model developer, do yourself a big favor: take his Machine Learning course on Coursera (if you are cheap like me) or at Stanford (if you have money and can get there), or just look on YouTube for copies of his lectures on bias and variance.

Secondly, Ng's Deep Learning specialization (on Coursera) takes these concepts to an even higher level of effectiveness while being super clearly explained. Third, Ng's specialization "Machine Learning for Production" goes into the most cutting-edge, data-focused approach to fixing bias and variance error. Honestly, his material is the best in existence on this topic. Additionally, Jeff Leek at Johns Hopkins does a great job of explaining the need for an unbiased test set in his school's Coursera specialization Data Science (though that should be seen only as a secondary source for this material, after Ng).
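A minimal sketch of the three-way split and the "fit a tiny training subset first" bias check described above. This is not from the video; the synthetic data, split sizes, and the logistic regression model are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption; the video's dataset is not used here).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set that never touches training, tuning, or grid search.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into a training set and a validation set used for tuning.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

# Bias check first: can the model fit even a tiny slice of the training data?
model = LogisticRegression(max_iter=1000).fit(X_train[:50], y_train[:50])
print("tiny-train accuracy:", model.score(X_train[:50], y_train[:50]))

# Only the single, final model gets scored once on (X_test, y_test).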
@communicationvast9949 2 years ago
Great job explaining, my friend. Very easy to follow and understand when there is a good speaker and communicator. You are an excellent teacher.
@-0164- 2 years ago
I just cannot thank you enough for explaining ml concepts so well.
@oblivion2658 a month ago
Ayye, fellow Bruin! Your TS series has helped me out a lot, thanks man! Taking Econ 144, Economic Forecasting.
@teegnas 4 years ago
Please make a similar video on bootstrap resampling, how it compares with cross-validation, and when to use which.
@marthalanaveen 4 years ago
Bootstrap is a very different sampling procedure compared to cross-validation. In CV, you make samples/subsets without replacement, meaning each observation will be included in only one of the held-out samples, but in bootstrapping, one observation may (very likely will) be included in more than one sample. That said, if you train a model on 4 CV samples/subsets, your model will never have seen the observations you test it on, giving you a better estimate of the variance of accuracy (or your choice of metric). You can't be sure of that when training on bootstrapped samples/subsets of the data, since your model may have seen (or even memorised, for the worse) the samples you test it on. Disclaimer: I am not as big an expert as ritvik, and I am talking in the context of the video.
@ritvikmath 4 years ago
great description! thanks for the reply :)
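A small sketch of the sampling difference described in the comment above, using sklearn's KFold and a NumPy bootstrap draw on ten made-up points. Illustrative only, not from the video.

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)

# Cross-validation: sampling without replacement; each point lands in exactly one test fold.
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(data):
    print("CV test fold:", data[test_idx])

# Bootstrap: sampling with replacement; some points repeat, others are never drawn.
rng = np.random.default_rng(0)
boot = rng.choice(data, size=data.size, replace=True)
print("bootstrap sample:", np.sort(boot))
print("never drawn (out-of-bag):", np.setdiff1d(data, boot))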
@Mara51029 9 months ago
This is an excellent video on cross-validation, across all of YouTube ❤
@ritvikmath 9 months ago
So glad!
@rubisc a year ago
Great explanation with a simple example that's easy to digest
@ritvikmath a year ago
Glad it was helpful!
@Guinhulol 11 months ago
Dude! Totally worth signing in to leave a like!
@cleansquirrel2084 4 years ago
Another beautiful video!
@ritvikmath 4 years ago
Thanks again!
@PF-vn4qz 3 years ago
Is there any reference with details about which cross-validation techniques are most recommended for time series data?
@willd0047 2 years ago
Underrated explanation
@ziaurrahmanutube 4 years ago
Very well explained! The only question I have would be for a proof - for us geeks - of why the variance is reduced.
@simonjorstedt8552 3 years ago
I don't know if there is a proof (there are probably examples where cross-validation doesn't work), but the idea is that the individual models' variances will "cancel out". While one model might predict larger values than the true distribution, another model is likely to predict smaller values. So when taking the average of the models, the variance cancels out. Ideally...
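One hedged way to make that intuition concrete (an informal argument, not a proof, and it assumes the k fold estimates are identically distributed with variance sigma^2 and pairwise correlation rho):

Var( (1/k) * sum_i f_i ) = rho * sigma^2 + (1 - rho) * sigma^2 / k

Because the fold models share most of their training data, rho is positive, so the variance does not vanish as k grows; but as long as rho < 1, the averaged estimate has strictly lower variance than any single fold's estimate.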
@CodeEmporium 4 years ago
Nice work! Glad i found you
@ritvikmath 4 years ago
Awesome, thank you!
@fatemeh2222 2 years ago
OMG exactly what I was looking for. Thanks!
@ritvikmath 2 years ago
Glad I could help!
@nivethathayarupan4550 2 years ago
That is a very nice explanation. Thanks a lot
@hameddadgour 2 years ago
Great content! Thank you for sharing.
@gtalckmin 4 years ago
Hi @ritvikmath, I am unsure whether the best idea is to create a model ensemble as your final step. One could use cross-validation across different hyperparameters to get a global idea of the error associated with each hyperparameter (or combination of them). That said, once you have the best tuning parameters, you can pretty much either train and test on a validation set, or use the whole dataset to find the most appropriate coefficients for your model. Otherwise, you may be wasting data points.
@rubencardenes 3 years ago
I was going to comment something similar. Usually, I would use cross-validation not to come up with a better model but to compare choices (different hyperparameters or different algorithms), so that the comparison does not rely on one randomly chosen split of the data.
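A sketch of the workflow both comments above describe: cross-validation is used only to compare hyperparameter settings, and the winning setting is then refit on the whole dataset rather than kept as an ensemble of fold models. The data, model, and grid are illustrative assumptions, not the video's.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,   # 5-fold CV is used only to compare the settings
)
search.fit(X, y)   # refit=True by default: the best setting is retrained on all of X, y
print(search.best_params_, round(search.best_score_, 3))
final_model = search.best_estimator_   # one model refit on everything, not an ensemble of fold models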
@TerrenceShi 3 years ago
you explained this really well, thank you!
@gayatrisoni4525 2 years ago
very clear and excellent explanation. Thanks!🙂
@ResilientFighter 4 years ago
Great job ritvik!
@ritvikmath 4 years ago
Thank you! 😁
@gezahagnnegash9740 a year ago
Thanks for sharing, it's helpful for me!
@ritvikmath a year ago
Glad it was helpful!
@NickKravitz 4 years ago
Nice pen tricks. I am waiting for the ensemble model technique video.
@ritvikmath 4 years ago
Noted!
@solomonbalogun7651 2 years ago
Great job explaining this 👏
@ritvikmath 2 years ago
Glad it was helpful!
@violeta3235 5 days ago
great explanation! thanks!
@Dresseurdecode 2 years ago
You explain very well. Thank you
@bennyuhoranishema4765 a year ago
Great Explanation!
@zuloo37 4 years ago
I've often used cross-validation to get a better estimate of the true accuracy of a model on unseen data, to aid in model selection, but sometimes for the final model to be used in future predictions, I will train on the entire available training data. How does that compare to using an ensemble of the CV models as the final model?
@GohOnLeeds a year ago
If you train on the entire available training data, how do you know how well it performs?
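One hedged way to see the two options in this thread side by side: the CV scores provide the performance estimate either way, and the deployed model can be either an average of the fold models or a single model refit on all the data. The data, model, and numbers below are illustrative assumptions, not from the video.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# The CV scores are the performance estimate, whichever final model you deploy.
print("CV R^2:", cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())

# Option A (the ensemble idea discussed in the comments above): average the fold models.
fold_models = [Ridge(alpha=1.0).fit(X[tr], y[tr])
               for tr, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
ensemble_pred = np.mean([m.predict(X) for m in fold_models], axis=0)

# Option B: refit one model on all available training data.
full_model = Ridge(alpha=1.0).fit(X, y)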
@al38261 a year ago
Really, a wonderfully clear explanation! Many thanks! Is there a video, or will there be one, about time series & cross-validation?
@alinakapochka3993 2 years ago
Thank you so much for this video! ☺
@brockpodgurski6144 3 years ago
Excellent job man
@taitai645 2 years ago
Thanks for the video. Could we use CV to remove a trend in data stacked in blocks? Like block A for period A, block B for period B, ..., block X for period X, with a risk of a trend? Please.
@trongnhantran7330 2 years ago
This is helpful 😍😍😍
@isaacnewton1545 a year ago
Let's say we have 5 models, each with 3 candidate hyperparameter settings, and the number of folds is 5. Does that mean we have to train 5 * 3 * 5 = 75 models and choose the best one among them?
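A sketch of the bookkeeping in that question, assuming "3 hyperparameters" means 3 candidate settings per model; the models and grids below are hypothetical stand-ins, not from the video.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Two of the five hypothetical candidates; each grid has 3 settings.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 5, 10]}),
    # ...three more (model, grid) pairs would complete the 5 models in the question
]

for model, grid in candidates:
    # Each search fits 3 settings x 5 folds = 15 models (plus one final refit of the winner).
    GridSearchCV(model, grid, cv=5).fit(X, y)
# With 5 candidate models, that is 5 x 3 x 5 = 75 cross-validation fits in total.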
@tıbhendese 17 days ago
Why do we need to reset the model and re-train it in each fold? Why can't we just keep training the same model across the folds? What is the problem with that?
@sivakumarprasadchebiyyam9444 a year ago
Hi, it's a very good video. Could you please let me know whether cross-validation is done on the training data or on the total data?
@yogustavo12 3 years ago
This was amazing!
@josht7238 a year ago
thank you so much, very helpful!!
@tremaineification a year ago
How are you calculating accuracy in every model?
@ginospazzino7498 4 years ago
Thanks for the video! Can you also do a video on nested cross validation?
@ritvikmath 4 years ago
Great suggestion!
@ranggayogiswara5148 a year ago
What is the difference between this and ensemble learning?
@best1games2studio3 3 years ago
great video!!
@mahrokhebrahimi6863 3 years ago
Bravo 👏
@ritvikmath 3 years ago
thanks!
@jefenovato4424 3 years ago
Great video! helped a lot!
@onkarchothe6897 3 years ago
Can you suggest a book on cross-validation and k-fold validation, one that contains worked examples with solutions?
@almonddonut1818 3 years ago
Thank you for this!
@liz-hn1qm a year ago
Thanks a lot!!! You just got my sub and like!!
@ritvikmath a year ago
Welcome!
@ramoda13 a year ago
Nice video, thanks
@abdullahibabatunde2825 3 years ago
thanks, it is quite helpful
@krishnachauhan2850 3 years ago
You are awesome
@davidhoopsfan 4 years ago
Hey man, nice UCLA Volunteer Center shirt, I have the same one haha
@ritvikmath 4 years ago
Haha! Nice, after all these years still one of my fave shirts
@brofessorsbooks3352 3 years ago
I love you bro
@jupiterhaha 2 years ago
This video should be titled K-Fold Cross-Validation! Not Cross Validation! This can be confusing to beginners!
@ammarparmr a year ago
Also, why did the instructor use all 1000 points in the cross-validation? Shouldn't he leave a test set aside for a final check?