Nice video! What about checking which columns produce the above-threshold correlation, comparing their correlation with the target column, and dropping only the one with the lower correlation?
@StatsWire · 2 years ago
We can set the threshold and see which columns have a high correlation.
@oomraden · 2 years ago
Hi, thanks for the video! Wouldn't that remove all highly correlated columns instead of leaving just one column for every relationship?
@StatsWire · 2 years ago
It will leave one column for every relationship.
@SaifTreks · 2 years ago
@@StatsWire Great video! I don't quite get what you mean here. Isn't the list returning every column that has a high correlation based on the threshold? And aren't we then proceeding to remove all those columns? Shouldn't we intentionally keep one instead of removing all of them? How is it automatically keeping one, if that's what you're saying?
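For readers stuck on the same point: in the commonly used version of this helper, the loop visits only the lower triangle of the correlation matrix, so for each highly correlated pair only the later column (by position) is added to the drop set and the earlier one survives. A minimal sketch under that assumption (illustrative code, not necessarily the video's exact function):

```python
import numpy as np
import pandas as pd

def correlation(df, threshold):
    """Collect columns to drop: for each pair with |corr| above the
    threshold, only the later column (by position) is added."""
    to_drop = set()
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):  # lower triangle only: each pair seen once
            if abs(corr_matrix.iloc[i, j]) > threshold:
                to_drop.add(corr_matrix.columns[i])  # column j survives
    return to_drop

# Toy data: 'b' is a linear function of 'a'; 'c' just alternates
x = np.arange(100, dtype=float)
df = pd.DataFrame({"a": x, "b": 3 * x + 1, "c": (-1.0) ** np.arange(100)})
print(correlation(df, 0.9))  # only 'b' is dropped; 'a' is kept
```

So both correlated columns appear in the correlation matrix, but only one of each pair lands in the drop set, which is why one column per relationship survives.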
@anishdeshpande395 · 1 year ago
Is this method better than the variance inflation factor?
@StatsWire · 1 year ago
Both of them are good.
@baburamchaudhary159 · 1 year ago
I have been following your feature selection videos and have covered forward, backward, exhaustive, variance threshold, chi2, etc. You have not shared the dataset in them. Why don't you share the dataset so we can follow along with you?
@StatsWire · 1 year ago
Please find the dataset link: github.com/siddiquiamir/Data
@d1pranjal · 1 year ago
How are the diagonal elements handled in the user-defined function correlation(df, threshold)?
@StatsWire · 1 year ago
I did not get your question.
@d1pranjal · 1 year ago
@@StatsWire At the diagonal elements the value is 1, which is greater than the threshold, so every column would show up in the output.
@StatsWire · 1 year ago
@@d1pranjal Ok, I hope you found the solution yourself.
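To close the loop on the diagonal question: whether the 1.0 self-correlations leak into the output depends on the loop bounds. If the inner loop runs only over j < i, it never reaches i == j, so the diagonal is skipped implicitly. An equivalent vectorised sketch makes this explicit by masking out everything except the strict upper triangle (illustrative names, not the video's code):

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, threshold):
    corr = df.corr().abs()
    # k=1 excludes the diagonal (self-correlation = 1.0) and the lower
    # triangle, so each pair is examined exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

x = np.arange(50, dtype=float)
df = pd.DataFrame({"x": x, "x_copy": 2 * x, "other": (-1.0) ** np.arange(50)})
print(columns_to_drop(df, 0.95))  # the diagonal 1.0s are never compared
```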
@AnanyaJoshi-g2x · 1 year ago
What about a scenario where the order of the columns changes? Since we're checking pairs of columns for correlations above the threshold and then removing one of the two whenever the threshold is matched or exceeded, changing the order of the columns will give a different result. Is that still going to be a correct list of features?
@StatsWire · 1 year ago
That is completely ok. You can change the order.
@AnanyaJoshi-g2x · 1 year ago
@@StatsWire Thanks for the reply. I did change the order and got a different set of features. I built an XGBoost model with both sets of features and got very different forecasts and accuracies in the two cases. How do I decide which is correct?
@StatsWire · 1 year ago
Yes, that is going to be a correct feature list. You can change column positions, no problem at all.
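The questioner's observation is real: this greedy pairwise drop is order-dependent, because which member of a correlated pair is discarded is decided purely by column position. Both resulting feature sets keep one column per correlated pair, but they are not the same set, and downstream models can differ. A quick demonstration (illustrative code, not from the video):

```python
import numpy as np
import pandas as pd

def to_drop(df, threshold=0.9):
    corr = df.corr().abs()
    drop = set()
    for i in range(len(corr.columns)):
        for j in range(i):  # lower triangle: later column of each pair is dropped
            if corr.iloc[i, j] > threshold:
                drop.add(corr.columns[i])
    return drop

x = np.arange(100, dtype=float)
df = pd.DataFrame({"a": x, "b": x, "c": (-1.0) ** np.arange(100)})
print(to_drop(df))                   # 'a' comes first, so 'b' is dropped
print(to_drop(df[["b", "a", "c"]]))  # reorder, and 'a' is dropped instead
```

If the choice matters for the model, a tie-breaking criterion removes the arbitrariness, for example keeping the column with the higher correlation to the target, or using VIF.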
@protapmaitra5049 · 2 years ago
This video was really helpful, thanks a ton.
@StatsWire · 2 years ago
You're welcome!
@michaelsagols8295 · 2 years ago
Thank you for the video! Very well explained! Keep it up!
@StatsWire · 2 years ago
Thank you for your kind words.
@jorge1869 · 2 years ago
Excellent, thank you very much.
@StatsWire · 2 years ago
I'm glad you liked it. You're welcome!
@akiwhitesoyo918 · 2 years ago
Nice! Would it be the same if we used PCA to avoid multicollinearity?
@StatsWire · 2 years ago
Thank you! There would be some minor differences.
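For context on the PCA comparison: PCA does not drop original columns. Instead it replaces them with orthogonal linear combinations, so the resulting features are uncorrelated by construction but lose their original interpretation. A sketch using scikit-learn (assumed available; the data here is synthetic, not the video's dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
# Three columns; the second is nearly collinear with the first
X = np.hstack([x, 2 * x + rng.normal(scale=0.01, size=(200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=0.99)  # keep enough components for 99% of the variance
Z = pca.fit_transform(X)
print(Z.shape[1])  # fewer components than original columns
# The components are mutually uncorrelated by construction
print(np.allclose(np.corrcoef(Z, rowvar=False), np.eye(Z.shape[1]), atol=1e-6))
```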
@mazharalamsiddiqui6904 · 2 years ago
Very nice.
@StatsWire · 2 years ago
Thank you.
@farahamirah2091 · 2 years ago
Hi, how can I get this dataset?
@StatsWire · 2 years ago
Hi, please find the dataset: github.com/siddiquiamir/Feature-Selection
@farahamirah2091 · 2 years ago
@@StatsWire Thank you.
@StatsWire · 2 years ago
@@farahamirah2091 You're welcome!
@maskman9630 · 2 years ago
How can I find collinearity for categorical features?
@StatsWire · 2 years ago
You can use chi-square.
@maskman9630 · 2 years ago
@@StatsWire Thanks, brother. Suppose I have done a chi2 test between the independent variables and the dependent variable, and I got the scores and p-values. How can I select features based on those values? Will you please clarify this, brother?
@StatsWire · 2 years ago
@@maskman9630 Select the variables whose p-values are smaller compared to the other variables.
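To make the p-value selection concrete: a small chi2 p-value means the feature and the target are unlikely to be independent, so you keep the features with the smallest p-values (equivalently, the largest chi2 scores). A sketch with scikit-learn on a standard dataset (not the video's data; chi2 requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all features non-negative, as chi2 requires
scores, p_values = chi2(X, y)
print(p_values)  # smaller p-value => stronger dependence on the target

# Keep the 2 features with the largest chi2 scores (smallest p-values)
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())  # boolean mask over the original columns
```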
@gisflow406 · 2 years ago
This wasn't helpful at all. You just picked one of the correlated variables arbitrarily, without any additional criteria. Anyway, a correlation matrix can't do much; it's much more reliable to use VIF or hierarchical clustering for feature selection.
@StatsWire · 2 years ago
Hi, this is for demonstration purposes. You can dive deeper and pick the variables based on your own selection criteria :)
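For anyone who wants to try the VIF route mentioned above: statsmodels (assumed installed) computes it directly, and a common recipe is to iteratively drop the feature with the highest VIF until all values fall below a chosen cutoff, often 5 or 10 by convention. A minimal sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=300),  # nearly collinear with 'a'
    "c": rng.normal(size=300),                 # independent
})

X = df.assign(const=1.0)  # VIF is computed against a model with an intercept
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(len(df.columns))],
    index=df.columns,
)
print(vifs)  # 'a' and 'b' share variance, so both get large VIFs; 'c' stays near 1
```

Unlike pairwise correlation, VIF also flags a feature that is a linear combination of several others even when no single pairwise correlation is high.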
@naveedullah390 · 1 year ago
When I run the line corrmatrix = X_train.corr(), it gives the error AttributeError: 'numpy.ndarray' object has no attribute 'corr'.
@StatsWire · 1 year ago
You need to make sure your data is in the correct format.
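That error appears because .corr() is a pandas DataFrame method, while train/test splits and many scikit-learn transformers return plain NumPy arrays. Wrapping the array back into a DataFrame restores it (variable names mirror the question above; the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for a split/transformed training set that came back as an ndarray
X_train = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9]])

# X_train.corr() would raise AttributeError: ndarrays have no .corr()
corrmatrix = pd.DataFrame(X_train, columns=["f1", "f2"]).corr()
print(corrmatrix)
```

Alternatively, np.corrcoef(X_train, rowvar=False) gives the same matrix without leaving NumPy.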