It's curious how, after the first SMOTE application, when the amount of data is equalized to 4 thousand for all categories, many points of the first and second categories are created in a region where they are not supposed to be (see scatter plot). That seems like a lot of noise for our final model. Could you explain the behavior of the algorithm in this case? When you created a thousand points for those categories, it looked better behaved.
@datahat642 11 months ago
Hey @afmonsalves, you are absolutely correct that there is additional noise when we increase the number of points further. This happens because SMOTE creates each synthetic point by interpolating between a sample and one of its nearest neighbors of the same class, so it fills in neighboring regions that may or may not overlap with the regions of other classes. The more points we ask for, the more of them land in those overlapping regions. The objective of SMOTE is primarily to approximate the data points in close proximity, and sampling_strategy is more of a setting to experiment with to find the best case. Hope this helps!!
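To make the interpolation idea concrete, here is a simplified sketch in plain Python. It is not imbalanced-learn's implementation; the function name `smote_sketch`, the toy points, and the choice of `k=3` are all assumptions made for illustration. It shows that when a point's nearest same-class neighbors include a point from a distant cluster, synthetic samples get placed in the gap between the clusters, and that the number of such points grows with the number of samples requested.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the same class (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: one class split into two clusters with a gap between them.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0), (3.2, 3.1), (2.9, 3.3)]

few = smote_sketch(minority, n_new=10, k=3)
many = smote_sketch(minority, n_new=1000, k=3)

# Count synthetic points that fall in the empty gap between the two clusters.
in_gap_few = sum(1 for x, _ in few if 1.0 < x < 2.0)
in_gap_many = sum(1 for x, _ in many if 1.0 < x < 2.0)
print(in_gap_few, in_gap_many)
```

If another class happens to occupy that gap, those interpolated points show up as the kind of noise visible in the scatter plot.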
@lohjjoo8333 4 months ago
Hi there, may I know if this is the same as upsampling?
@datahat642 4 months ago
Yes, it is one of the upsampling techniques.
@hatembouhjar 2 months ago
Aren't you supposed to sample only on X_train and y_train?
@datahat642 2 months ago
If you are working with a train/test split, then yes, apply the sampling only to the training data so that the test set keeps its original class distribution.