When I was doing image captioning, I ran into this exact problem (my latest video is on the same topic). Each image has five different captions, so the image column contained the same image several times, and, you guessed it, I did a random split. That put the same images in both the training and testing sets (a leak), and when I tested the model, it performed exceptionally well on the testing set (because the model had already seen the testing images during training). I was both happy and suspicious. Then it occurred to me that random splitting was not what I needed, so I grouped the rows by image, split the groups, and trained again. The model performed slightly worse than before. What a relief!
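A minimal sketch of the grouped split described above, assuming the data is a list of (image, caption) pairs. The helper name `group_train_test_split`, the file names, and the 20% test fraction are all made up for illustration; this is not the commenter's actual code.

```python
import random
from collections import defaultdict

def group_train_test_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples so that every group lands entirely in train or in test."""
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle the group ids, not the individual rows.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [s for g in group_ids if g not in test_ids for s in groups[g]]
    test = [s for g in group_ids if g in test_ids for s in groups[g]]
    return train, test

# Hypothetical data: 10 images with 5 captions each.
pairs = [(f"img_{i}.jpg", f"caption {j} for image {i}")
         for i in range(10) for j in range(5)]

# Group by image so all five captions of an image stay together.
train, test = group_train_test_split(pairs, group_of=lambda p: p[0])
```

If scikit-learn is available, `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection` cover the same idea.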
@underfitted 2 years ago
Thanks for telling the story!
@المسلمالمغربي-ص9د 2 years ago
The model performing worse is better in this case? How?
@3bdo3id 1 year ago
@@المسلمالمغربي-ص9د It was leaking information that it shouldn't have had access to; it was cheating [the model, of course, not the man 😁]
@dannybee9068 1 year ago
I'm a beginner and I participated in a competition on a tabular dataset for a regression problem. The top solutions were using KFold splitting to ensure that their train and test sets would be different, so the testing data would be more representative of the private dataset that was used to give scores on the leaderboard; that way, when training, they have some correlation between the test set used for evaluation and the private dataset. I've never seen anything like that done before, and if someone has more information or links on where I can read more about it, it would be greatly appreciated.
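For readers who have not seen it, here is a rough sketch of what k-fold splitting does, assuming plain row indices and no grouping. The function name `k_fold_indices` is made up for illustration; it is not the code those competition solutions used.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

# Every row is used for validation exactly once across the k folds.
splits = list(k_fold_indices(10, k=5))
```

Averaging the validation score over the folds usually gives a steadier estimate than a single split. In scikit-learn, `KFold` in `sklearn.model_selection` does this, and `GroupKFold` is the variant that keeps related rows together.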
@santiagoprada4827 2 years ago
You inspired me to make content for YouTube again. I'm a software developer, and some months ago I stopped making content because I felt I was wasting my time (not many views / no good ideas). But learning in such a fun way with a format like yours is something I also want to make other people feel. Thanks, man.
@underfitted 2 years ago
Oh man, thanks for saying this! You just made my day!
@grilledcheeze101 2 years ago
It happens a lot with medical image data. Since that type of data is very hard to collect, we sometimes get multiple images from the same person in our dataset.
@Offiziersmesser 9 months ago
Yup. Definitely facing this problem right now, and I suspected random splitting was the culprit. This video just explained why.
@JordiRosell 2 years ago
I think this is the most important video I've ever seen in machine learning. Congratulations. ❤️
@underfitted 2 years ago
Thanks, Jordi!
@mar79379 2 years ago
Perhaps we should use pseudo randomisation?
@michaelduffy5309 2 years ago
I'm trying to make a point to watch one video in this series every day. Great content and presentation. Short and to the point. Thank you.
@jasdeepsinghgrover2470 2 years ago
I think in many cases leakage is also a very important feature. As long as the same information as the leakage can actually be given to the model consistently during application, it can turn out to be a very strong predictor. In time series models, for example, autocorrelation is essentially leakage with one step back. In NLP models we use prompting and provide context, which is a lot like leakage. As long as we can get the leaked information consistently and its relevance to the task persists, it is a feature.
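To make the time series point concrete, here is a small sketch of lag features, assuming a plain list of observations ordered in time. The helper name `make_lag_pairs` and the numbers are invented for illustration. The features for time t use only values strictly before t, so they would also be available when the model is applied.

```python
def make_lag_pairs(series, n_lags=1):
    """Build (features, target) rows where features are the n_lags previous values."""
    rows = []
    for t in range(n_lags, len(series)):
        features = series[t - n_lags:t]  # values strictly before time t
        target = series[t]               # the value we want to predict
        rows.append((features, target))
    return rows

# Hypothetical series of five observations.
history = [10, 12, 13, 15, 18]
rows = make_lag_pairs(history, n_lags=2)
```

The difference from a leak is availability: past values exist at prediction time, while information from the test period does not.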
@underfitted 2 years ago
Right, in that case it is not a leak anymore, but a feature.
@joseinsfran3807 2 years ago
I think that all your videos are amazing! Thank you so much for all the content! What about a video where you show all the books on machine learning/ Data Science you have, or at least one with the best books you've ever read
@fikriansyahadzaka6647 2 years ago
Just found your channel. Your video is well edited and easy to follow. Keep up the good work!
@underfitted 2 years ago
Thanks, will do!
@thevoyager7675 8 months ago
Keep up the great content!
@curiousmind7967 2 years ago
I think the data overall should be pre-processed more. Probably use weekdays instead of specific dates. Maybe, instead of using data from only one flight, add the history of the past 20 flights, etc.
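A tiny sketch of the weekday suggestion, assuming the dates are ISO-formatted strings. The helper name `weekday_feature` and the sample dates are made up for illustration.

```python
from datetime import date

def weekday_feature(iso_date):
    """Map an ISO date string to its day of the week (Monday=0 ... Sunday=6)."""
    return date.fromisoformat(iso_date).weekday()

# Hypothetical flight dates.
flight_dates = ["2022-07-04", "2022-07-09"]
weekdays = [weekday_feature(d) for d in flight_dates]
```

A weekday takes only seven values, so it cannot identify a single day the way a full date can.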
@fdkaix9091 2 years ago
I appreciate the effort you put into your videos. Great content!
@underfitted 2 years ago
Glad you like the videos!
@knutjagersberg381 2 years ago
Thanks for the tip!
@usmanmuhammad196 2 years ago
Thanks a lot Sir
@Offiziersmesser 9 months ago
This is wisdom!
@eduardoabreu78 2 years ago
Awesome channel!
@underfitted 2 years ago
Thanks Eduardo!
@austinefeak3794 2 years ago
Nice Insight to take home and look out for onwards from this video. However, do you quote Wikipedia in your research?
@underfitted 2 years ago
Many times, yes
@austinefeak3794 2 years ago
@@underfitted Well, in my research methodology class, we were told it's a bad idea to quote Wikipedia unless the research subject is Wikipedia itself. They always recommended quoting a published journal or article like the ones you showed. Nice video editing skills too, I commend you.
@iftik 2 years ago
Why did I find this channel so late 💔
@underfitted 2 years ago
No worries! You are very early. I’m just getting started!
But you are only talking about time series here; I think the name of the video is inaccurate. Also, why are you using the date as a variable in your model? I don't think it is an explanatory one, and it causes a lot of trouble, as you mention.
@underfitted 2 years ago
The date in the model is there to illustrate a specific point. The same happens with any other feature that could cause a leak. For example, in a dataset of x-rays, you should always make sure that images from the same patient go into the same split. Spreading one patient's images across splits will cause a leaky validation strategy, just like I mention in this video.
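One quick way to check an existing split for this kind of leak is sketched below, assuming each row carries a patient identifier. The function name `leaked_groups` and the patient ids are hypothetical.

```python
def leaked_groups(train_groups, test_groups):
    """Return the group ids that appear in both the train and the test split."""
    return set(train_groups) & set(test_groups)

# Hypothetical patient id for each x-ray in the two splits.
train_patients = ["p01", "p02", "p02", "p03"]
test_patients = ["p03", "p04"]

# A non-empty result means at least one patient sits on both sides of the split.
overlap = leaked_groups(train_patients, test_patients)
```

An empty set is what you want to see before trusting the validation score.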