Nice video! What about checking which columns produce the above-threshold correlation, comparing their correlation with the target column, and dropping only the one with the lower correlation?
@StatsWire · 2 years ago
We can set the threshold and see which columns have a high correlation.
@oomraden · 2 years ago
Hi, thanks for the video! Wouldn't that remove all highly correlated columns instead of leaving just one column for every relationship?
@StatsWire · 2 years ago
It will leave one column for every relationship.
@SaifTreks · 2 years ago
@@StatsWire Great video! I don't quite get what you mean here. Isn't the list returning every column that has a high correlation based on the threshold? And aren't we then proceeding to remove all those columns? Shouldn't we intentionally keep one instead of removing all of them? How is it automatically keeping one, if that's what you're saying?
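For readers stuck on the same point: in the commonly used version of this helper, the loop visits only the lower triangle of the correlation matrix, so for each highly correlated pair only the later column (by position) is added to the drop set and the earlier one survives. A minimal sketch under that assumption (illustrative code, not necessarily the video's exact function):

```python
import numpy as np
import pandas as pd

def correlation(df, threshold):
    """Collect columns to drop: for each pair with |corr| above the
    threshold, only the later column (by position) is added."""
    to_drop = set()
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # lower triangle only: each pair seen once
            if abs(corr_matrix.iloc[i, j]) > threshold:
                to_drop.add(corr_matrix.columns[i])  # column j survives
    return to_drop

# Toy data: 'b' is a linear function of 'a'; 'c' just alternates
x = np.arange(100, dtype=float)
df = pd.DataFrame({"a": x, "b": 3 * x + 1, "c": (-1.0) ** np.arange(100)})
print(correlation(df, 0.9))  # only 'b' is dropped; 'a' is kept
```

So both correlated columns appear in the correlation matrix, but only one of each pair lands in the drop set, which is why one column per relationship survives.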
@anishdeshpande395 · 1 year ago
Is this method better than the variance inflation factor?
@StatsWire · 1 year ago
Both of them are good.
@baburamchaudhary159 · 1 year ago
I have been following your feature selection videos and have covered forward, backward, exhaustive, variance threshold, chi2, etc. You have not shared the dataset in them. Why don't you share the dataset so we can follow along with you?
@StatsWire · 1 year ago
Please find the dataset link: github.com/siddiquiamir/Data
@d1pranjal · 1 year ago
How are the diagonal elements handled in the user-defined function correlation(df, threshold)?
@StatsWire · 1 year ago
I did not get your question.
@d1pranjal · 1 year ago
@@StatsWire At the diagonal elements the value is 1, which is greater than the threshold, so every column would show up in the output.
@StatsWire · 1 year ago
@@d1pranjal Ok, I hope you found the solution yourself.
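To close the loop on the diagonal question: whether the 1.0 self-correlations leak into the output depends on the loop bounds. If the inner loop runs only over j < i, it never reaches i == j, so the diagonal is skipped implicitly. An equivalent vectorised sketch makes this explicit by masking out everything except the strict upper triangle (illustrative names, not the video's code):

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, threshold):
    corr = df.corr().abs()
    # k=1 excludes the diagonal (self-correlation = 1.0) and the lower
    # triangle, so each pair is examined exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

x = np.arange(50, dtype=float)
df = pd.DataFrame({"x": x, "x_copy": 2 * x, "other": (-1.0) ** np.arange(50)})
print(columns_to_drop(df, 0.95))  # the diagonal 1.0s are never compared
```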
@AnanyaJoshi-g2x · 1 year ago
What about a scenario where the order of the columns changes? Since we're checking pairs of columns for correlations above the threshold and then removing one of the two whenever the threshold is matched or exceeded, changing the order of the columns will give a different result. Is that still going to be a correct list of features?
@StatsWire · 1 year ago
That is completely ok. You can change the order.
@AnanyaJoshi-g2x · 1 year ago
@@StatsWire Thanks for the reply. I did change the order and got a different set of features. I built an XGBoost model with both sets of features and got very different forecasts and accuracies in the two cases. How do I decide which is correct?
@StatsWire · 1 year ago
Yes, that is going to be a correct feature list. You can change column positions, no problem at all.
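The questioner's observation is real: this greedy pairwise drop is order-dependent, because which member of a correlated pair is discarded is decided purely by column position. Both resulting feature sets keep one column per correlated pair, but they are not the same set, and downstream models can differ. A quick demonstration (illustrative code, not from the video):

```python
import numpy as np
import pandas as pd

def to_drop(df, threshold=0.9):
    corr = df.corr().abs()
    drop = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # lower triangle: later column of each pair is dropped
            if corr.iloc[i, j] > threshold:
                drop.add(corr.columns[i])
    return drop

x = np.arange(100, dtype=float)
df = pd.DataFrame({"a": x, "b": x, "c": (-1.0) ** np.arange(100)})
print(to_drop(df))                   # 'a' comes first, so 'b' is dropped
print(to_drop(df[["b", "a", "c"]]))  # reorder, and 'a' is dropped instead
```

If the choice matters for the model, a tie-breaking criterion removes the arbitrariness, for example keeping the column with the higher correlation to the target, or using VIF.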
@protapmaitra5049 · 2 years ago
This video was really helpful, thanks a ton.
@StatsWire · 2 years ago
You're welcome!
@michaelsagols8295 · 2 years ago
Thank you for the video! Very well explained! Keep it up!
@StatsWire · 2 years ago
Thank you for your kind words.
@jorge1869 · 2 years ago
Excellent, thank you very much.
@StatsWire · 2 years ago
I'm glad you liked it. You're welcome!
@akiwhitesoyo918 · 2 years ago
Nice! Would it be the same if we used PCA to avoid multicollinearity?
@StatsWire · 2 years ago
Thank you! There would be some minor differences.
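For context on the PCA comparison: PCA does not drop original columns. Instead it replaces them with orthogonal linear combinations, so the resulting features are uncorrelated by construction but lose their original interpretation. A sketch using scikit-learn (assumed available; the data here is synthetic, not the video's dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
# Three columns; the second is nearly collinear with the first
X = np.hstack([x, 2 * x + rng.normal(scale=0.01, size=(200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=0.99)  # keep enough components for 99% of the variance
Z = pca.fit_transform(X)
print(Z.shape[1])  # fewer components than original columns
# The components are mutually uncorrelated by construction
print(np.allclose(np.corrcoef(Z, rowvar=False), np.eye(Z.shape[1]), atol=1e-6))
```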
@mazharalamsiddiqui6904 · 2 years ago
Very nice.
@StatsWire · 2 years ago
Thank you.
@farahamirah2091 · 2 years ago
Hi, how can I get this dataset?
@StatsWire · 2 years ago
Hi, please find the dataset: github.com/siddiquiamir/Feature-Selection
@farahamirah2091 · 2 years ago
@@StatsWire Thank you.
@StatsWire · 2 years ago
@@farahamirah2091 You're welcome!
@maskman9630 · 2 years ago
How can I find collinearity for categorical features?
@StatsWire · 2 years ago
You can use chi-square.
@maskman9630 · 2 years ago
@@StatsWire Thanks, brother. Suppose I have done a chi2 test between the independent variables and the dependent variable, and I got the scores and p-values. How can I select features based on those values? Will you please clarify this, brother?
@StatsWire · 2 years ago
@@maskman9630 Select the variables whose p-values are smaller compared to the other variables.
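To make the p-value selection concrete: a small chi2 p-value means the feature and the target are unlikely to be independent, so you keep the features with the smallest p-values (equivalently, the largest chi2 scores). A sketch with scikit-learn on a standard dataset (not the video's data; chi2 requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all features non-negative, as chi2 requires
scores, p_values = chi2(X, y)
print(p_values)  # smaller p-value => stronger dependence on the target

# Keep the 2 features with the largest chi2 scores (smallest p-values)
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # boolean mask over the original columns
```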
@gisflow406 · 2 years ago
This wasn't helpful at all. You just picked one of the correlated variables arbitrarily, without any additional criteria. Anyway, a correlation matrix can't do much; it's much more reliable to use VIF or hierarchical clustering for feature selection.
@StatsWire · 2 years ago
Hi, this is for demonstration purposes. You can dive deeper and pick the variables based on your own selection criteria :)
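For anyone who wants to try the VIF route mentioned above: statsmodels (assumed installed) computes it directly, and a common recipe is to iteratively drop the feature with the highest VIF until all values fall below a chosen cutoff, often 5 or 10 by convention. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),  # nearly collinear with 'a'
    "c": rng.normal(size=300),                 # independent
})

X = df.assign(const=1.0)  # VIF is computed against a model with an intercept
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(len(df.columns))],
    index=df.columns,
)
print(vifs)  # 'a' and 'b' share variance, so both get large VIFs; 'c' stays near 1
```

Unlike pairwise correlation, VIF also flags a feature that is a linear combination of several others even when no single pairwise correlation is high.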
@naveedullah390 · 1 year ago
When I run the line corrmatrix = X_train.corr(), it gives the error AttributeError: 'numpy.ndarray' object has no attribute 'corr'.
@StatsWire · 1 year ago
You need to make sure your data is in the correct format.
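That error appears because .corr() is a pandas DataFrame method, while train/test splits and many scikit-learn transformers return plain NumPy arrays. Wrapping the array back into a DataFrame restores it (variable names mirror the question above; the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for a split/transformed training set that came back as an ndarray
X_train = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]])

# X_train.corr() would raise AttributeError: ndarrays have no .corr()
corrmatrix = pd.DataFrame(X_train, columns=["f1", "f2"]).corr()
print(corrmatrix)
```

Alternatively, np.corrcoef(X_train, rowvar=False) gives the same matrix without leaving NumPy.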