Should You Stop Splitting Your Data Like This?

Views: 5,451

Days ago

Comments: 39

@PritishMishra 2 years ago

When I was doing image captioning, I ran into this exact problem (my latest video is on the same topic). Each image has five different captions, which is why the image column contained similar images, and, you guessed it, I split the data randomly. That put similar images in both the training and testing sets (a leak), and when I tested the model, it performed exceptionally well on the testing set (because it had already seen the testing images during training). I was both happy and suspicious. Then it occurred to me that random splitting was not what I needed, so I grouped similar images, split the groups, and trained again. The model performed slightly worse than before. What a relief!

@underfitted 2 years ago

Thanks for telling the story!

@المسلمالمغربي-ص9د 2 years ago

The model performing worse is better in this case? How?

@3bdo3id 1 year ago

@المسلمالمغربي-ص9د It was leaking information that it shouldn't have had access to; it was cheating [the model, of course, not the man 😁]
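
For readers who want to try the grouped split described at the top of this thread, here is a minimal sketch using only the Python standard library. The record layout, the `image_id` key, and the function name `group_train_test_split` are illustrative assumptions, not the commenter's actual code; scikit-learn's `GroupShuffleSplit` offers the same idea for real projects.

```python
import random
from collections import defaultdict


def group_train_test_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that every group lands entirely in one set.

    `records` is a list of dicts and `group_key` names the field that
    identifies related rows (hypothetical here: the image a caption describes).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Shuffle the group ids, not the individual rows.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test


# Toy data: 10 images with 5 captions each (made-up values).
captions = [
    {"image_id": f"img_{i}", "caption": f"caption {j} for image {i}"}
    for i in range(10)
    for j in range(5)
]
train_set, test_set = group_train_test_split(captions, "image_id", test_fraction=0.2)
```

Because whole groups are assigned to one side, all five captions of an image stay together, which is what removes the leak described above.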

@dannybee9068 1 year ago

I'm a beginner, and I took part in a competition on a tabular dataset for a regression problem. The top solutions used KFold splitting to keep their train and test sets separate, so that the test data would be more representative of the private dataset used to score the leaderboard. That way, the test set they used for evaluation correlated with the private dataset. I had never seen anything like that before, so if someone has more information or links where I can read more about it, it would be greatly appreciated.
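
As a rough illustration of the k-fold splitting mentioned here: the data is cut into k folds, and each fold serves once as the validation set while the rest is used for training. The sketch below uses only the standard library; `k_fold_indices` is a hypothetical helper, and scikit-learn's `KFold` is the usual tool in practice. Note that a plain k-fold like this is still a random split, so on its own it does not prevent the kind of leak the video describes.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
```

Every sample ends up in a validation fold exactly once, so the averaged score uses all of the data for evaluation.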

@santiagoprada4827 2 years ago

You inspired me to make content for YouTube again. I'm a software developer, and some months ago I stopped making content because I felt I was wasting my time (not many views, no good ideas). But learning in such a fun way from a format like yours is something I want to make other people feel too. Thanks, man.

@underfitted 2 years ago

Oh man, thanks for saying this! You just made my day!

@grilledcheeze101 2 years ago

It happens a lot with medical image data. Since that type of data is very hard to collect, the dataset sometimes contains multiple images taken from the same person at different times.

@Offiziersmesser 9 months ago

Yup. I'm definitely facing this problem right now, and I suspected random splitting was the culprit. This video just explained why.

@JordiRosell 2 years ago

I think this is the most important video I've ever seen in machine learning. Congratulations. ❤️

@underfitted 2 years ago

Thanks, Jordi!

@mar79379 2 years ago

Perhaps we should use pseudo randomisation?

@michaelduffy5309 2 years ago

I'm making a point of watching one video in this series every day. Great content and presentation. Short and to the point. Thank you.

@jasdeepsinghgrover2470 2 years ago

I think in many cases leakage is also a very important feature. As long as the same information can be given to the model consistently at prediction time, it can turn out to be a very strong predictor. In time series models, for example, autocorrelation is essentially leakage from one step back. In NLP models we use prompting and provide context, which is a lot like leakage. As long as we can get the leaked information consistently and its relevance to the task persists, it is a feature.

@underfitted 2 years ago

Right, in that case it is not a leak anymore, but a feature.
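
To make the time-series example concrete, here is a small sketch in which the previous value is used as a feature. It counts as a feature rather than a leak because the previous value is already known at prediction time. The series values and the helper name `make_lag_features` are invented for illustration.

```python
def make_lag_features(series, lag=1):
    """Pair each value with the value `lag` steps earlier.

    Returns (features, targets), where features[i] is the past value
    available when targets[i] has to be predicted.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    features = series[:-lag]
    targets = series[lag:]
    return features, targets


# Made-up daily measurements, oldest first.
series = [10, 12, 13, 15, 14, 16, 18, 17]
X, y = make_lag_features(series, lag=1)

# Split chronologically so the model is never trained on the future.
cutoff = int(len(X) * 0.75)
X_train, y_train = X[:cutoff], y[:cutoff]
X_test, y_test = X[cutoff:], y[cutoff:]
```

The chronological cutoff matters as much as the lag: a random split of these pairs would let the model train on values that come after the ones it is tested on.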

@joseinsfran3807 2 years ago

I think that all your videos are amazing! Thank you so much for all the content! What about a video where you show all the books on machine learning/ Data Science you have, or at least one with the best books you've ever read

@fikriansyahadzaka6647 2 years ago

Just found your channel. Your video is well edited and easy to follow. Keep up the good work!

@underfitted 2 years ago

Thanks, will do!

@thevoyager7675 8 months ago

Keep up the great content!

@curiousmind7967 2 years ago

I think the data overall should be pre-processed more. For example, use weekdays instead of specific dates. Or, instead of using the data from only one flight, add the history of the past 20 flights, etc.
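
The two suggestions above could look roughly like this. The flight records, the field names, and the `add_features` helper are all made up for illustration; the only point is that the weekday replaces the raw date and the history feature uses only flights that come earlier.

```python
from datetime import date


def add_features(flights, history_size=20):
    """Add a weekday and the mean delay of up to `history_size` earlier flights."""
    ordered = sorted(flights, key=lambda f: f["date"])
    enriched = []
    for i, flight in enumerate(ordered):
        # Only flights earlier in the sorted order, never the current one.
        past = ordered[max(0, i - history_size):i]
        past_delays = [p["delay_minutes"] for p in past]
        enriched.append({
            "weekday": flight["date"].weekday(),  # 0 = Monday ... 6 = Sunday
            "mean_past_delay": sum(past_delays) / len(past_delays) if past_delays else None,
            "delay_minutes": flight["delay_minutes"],
        })
    return enriched


# Hypothetical flights with made-up delays.
flights = [
    {"date": date(2022, 3, 7), "delay_minutes": 10},
    {"date": date(2022, 3, 8), "delay_minutes": 30},
    {"date": date(2022, 3, 9), "delay_minutes": 20},
]
rows = add_features(flights)
```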

@fdkaix9091 2 years ago

I appreciate the effort you put into your videos. Great content!

@underfitted 2 years ago

Glad you like the videos!

@knutjagersberg381 2 years ago

Thanks for the tip!

@usmanmuhammad196 2 years ago

Thanks a lot Sir

@Offiziersmesser 9 months ago

This is wisdom!

@eduardoabreu78 2 years ago

Awesome channel!

@underfitted 2 years ago

Thanks Eduardo!

@austinefeak3794 2 years ago

Nice insight to take home from this video and look out for from now on. However, do you quote Wikipedia in your research?

@underfitted 2 years ago

Many times, yes

@austinefeak3794 2 years ago

@underfitted Well, in my research methodology class, we were told it's a bad idea to quote Wikipedia unless the research subject is Wikipedia itself. We were always advised to quote a published journal or article, like the ones you showed. Nice video editing skills too, I must say.

@iftik 2 years ago

Why did I find this channel so late 💔

@underfitted 2 years ago

No worries! You are very early. I’m just getting started!

@chidubem31 2 years ago

Exponential Growth 💪 Exponential Knowledge 💪 Exponential Channel 💪

@underfitted 2 years ago

Thanks!

@3bdo3id 1 year ago

It should have been Exponential Thanks 💪

@javierHuertay 2 years ago

But you are only talking about time series here, so I think the title of the video is inaccurate. Also, why are you using the date as a variable in your model? I don't think it is an explanatory one, and it causes a lot of trouble, as you mention.

@underfitted 2 years ago

The date in the model is there to illustrate a specific point. The same happens with any other feature that could cause a leak. For example, in a dataset of x-rays, you should always make sure that all images from the same patient go into the same split. Spreading one patient's images across splits will give you a leaky validation strategy, just like the one I mention in this video.
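
A quick sanity check along these lines is to confirm that no patient appears in more than one split. The sketch below assumes each record carries a `patient_id` field; that field name, the file names, and the `find_leaked_groups` helper are hypothetical.

```python
def find_leaked_groups(train, validation, group_key):
    """Return the group ids that appear in both splits (empty set means no leak)."""
    train_groups = {record[group_key] for record in train}
    validation_groups = {record[group_key] for record in validation}
    return train_groups & validation_groups


# Made-up x-ray records: patient "p2" has images in both splits.
train = [
    {"patient_id": "p1", "image": "xray_001.png"},
    {"patient_id": "p2", "image": "xray_002.png"},
]
validation = [
    {"patient_id": "p2", "image": "xray_003.png"},
    {"patient_id": "p3", "image": "xray_004.png"},
]
leaked = find_leaked_groups(train, validation, "patient_id")
print(leaked)  # {'p2'}
```

Running a check like this right after splitting is cheap, and an empty result is a reasonable precondition before trusting any validation score.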