Diff 1: Gradient descent takes all the data points into consideration to update the weights during back propagation to minimize the loss function, whereas stochastic gradient descent considers only one data point at a time for each weight update.
Diff 2: In gradient descent, convergence towards the minima is fast, whereas in stochastic gradient descent convergence is slower and noisier.
Diff 3: Since gradient descent loads and uses all the data points for every calculation, each update is computationally slow, whereas a stochastic gradient descent update is comparatively fast.
@GauravSharma-ui4yd · 4 years ago
@Ashwini Verma it's 1 data point at a time for SGD; a batch is for mini-batch gradient descent. But it's not your fault, because many people use mini-batch gradient descent and stochastic gradient descent interchangeably, and in fact most library implementations of SGD use a mini-batch. But the thing is: all data points at a time is GD / vanilla GD / batch GD, 1 data point at a time is SGD, and a batch at a time is mini-batch GD. When the batch size is 1, mini-batch GD is just SGD, and when it is equal to the entire dataset, it becomes vanilla GD.
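To make the distinction concrete, here is a minimal NumPy sketch (toy, synthetic data - not from the video) where the only thing that changes between vanilla GD, SGD and mini-batch GD is the batch size used per weight update:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                              # made-up features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def run(batch_size, lr=0.01, epochs=5):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                            # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on this batch
            w -= lr * grad
    return w

w_gd  = run(batch_size=len(X))   # vanilla / batch GD: all points per update
w_sgd = run(batch_size=1)        # SGD: one point per update
w_mb  = run(batch_size=32)       # mini-batch GD: a small batch per update
print(np.round(w_gd, 2), np.round(w_sgd, 2), np.round(w_mb, 2))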
@GauravSharma-ui4yd · 4 years ago
Just one thing to add: don't put back propagation in the definition of GD, it has nothing to do with GD in general. It is a clever way of doing GD optimally and effectively in the case of neural nets. But when we apply these optimizers to non-neural-net models like linear regression, there is no notion of back propagation. So, long story short, backprop is not an intrinsic part of GD but just a clever way of doing it effectively when applied to neural nets.
@abhaypratapsingh4225 · 4 years ago
@@GauravSharma-ui4yd Great explanation! You seem to have a strong conceptual understanding. Thanks again.
@vishprab · 4 years ago
@@GauravSharma-ui4yd Hi, aren't the weights in the linear regression model also determined using backpropagation? Isn't it the same idea as in a neural net, i.e. updating the parameters after each step of GD? Don't we go back and forth to determine the weights in any regression, for that matter? So backprop is not a wrong term to use per se. Please let me know how these two are different in the conventional sense. Does having an activation function change the definition? medium.com/@Aj.Cheng/linear-regression-by-gradient-decent-bb198724eb2c#:~:text=linear%20regression%20formulation%20is%20very,some%20detail%20of%20it%20later.&text=the%20purpose%20of%20backpropagation%20is,side%20the%20one%20update%20step.
@DS_AIML · 3 years ago
That's why people prefer mini-batch stochastic gradient descent.
@sushilchauhan2586 · 4 years ago
Thanks Krishna! Krishna, I am the kind of person who likes your videos first and then watches them... I asked you about this yesterday and you explained it, thank you. So you are saying that after applying regularization there will be no multicollinearity? Stochastic gradient descent: stochastic gradient descent is almost similar to gradient descent; the only difference is that if I have "n" number of points in the training data, then it will randomly pick "k" number of points, where k < n.
@bharathjc4700 · 4 years ago
Multicollinearity may not be a problem every time. The need to fix multicollinearity depends primarily on the reasons below: if you care more about how much each individual feature (rather than a group of features) affects the target variable, then removing multicollinearity may be a good option. If multicollinearity is not present in the features you are interested in, then it may not be a problem.
@ShivShankarDutta1 · 4 years ago
GD: runs all the samples in the training set to do a single update of all the parameters in a given iteration.
SGD: uses only one training sample (or a subset) from the training set to update the parameters in a given iteration.
GD: if the number of samples/features is large, it takes much more time to update the values.
SGD: it is faster because each update uses one training sample.
SGD converges faster than GD.
@rahuldey6369 · 4 years ago
Yes, SGD converges faster than GD, but it fails to find the global minimum the way GD does; instead it oscillates around a value close to the global minimum, which people say is a good approximation to go with.
@bharathjc4700 · 4 years ago
In batch gradient descent you compute the gradient over the entire dataset, averaging over a potentially vast amount of information. It takes a lot of memory to do that. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point). In pure SGD, on the other hand, you update your parameters by adding (with a minus sign) the gradient computed on a single instance of the dataset. Since it's based on one random data point, it's very noisy and may go off in a direction far from the batch gradient. However, the noisiness is exactly what you want in non-convex optimization, because it helps you escape from saddle points or local minima. GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large. That means GD is preferable for small datasets while SGD is preferable for larger ones.
@cutyoopsmoments2800 · 4 years ago
Sir, I am a great fan of yours.
@DionysusEleutherios · 4 years ago
If you have a large feature space that contains multicollinearity, you could also try running a PCA and using only the first n components in your model (where n is the number of components that collectively explain at least 80% of the variance), since they are by definition orthogonal to each other.
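A minimal scikit-learn sketch of that idea (the data here is made up, and the 0.80 threshold is just the rule of thumb mentioned above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 5))])   # 8 columns, heavily collinear by construction

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=0.80)                   # keep enough components for >= 80% of the variance
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape[1], pca.explained_variance_ratio_.sum())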
@akashsaha3921 · 4 years ago
No... PCA reduces dimensionality but doesn't consider the class labels while doing that. In logistic classification, doing PCA means you are dropping features without considering the class labels. So in practice, if our aim is just to maximize the accuracy of the model, then PCA is fine, but if you want to interpret which features are important for classification in the model, then PCA is not recommended. Moreover, PCA preserves variance by creating new features; in practice, you might be asked not to create new features. That's why lasso and ridge regression are used - to penalize.
@hokapokas · 4 years ago
Any response on the VIF approach???
@akashsaha3921 · 4 years ago
@@hokapokas Look, L1 and L2 are a must to control over- and under-fitting... but using L1 when features are multicollinear will create more sparsity, so in that case we use L2. Next, to detect multicollinearity you can do a perturbation test. You can also do forward feature selection to select the important features. At the end you should try every possible way to make your model better. I prefer to use perturbation to check collinearity and then, based on that, select L1, L2 or elastic net.
@hokapokas · 4 years ago
@@akashsaha3921 I agree on regularisation to put constraints on a model in terms of feature selection or reducing the magnitudes of coefficients, but I was suggesting a VIF approach as well to select and reject features. We should explore this approach too for multicollinearity.
@GauravSharma-ui4yd · 4 years ago
@@hokapokas A lot of people are actually using this approach.
@mahender1440 · 4 years ago
Hi, Krish.
Gradient descent: on a big volume of data it takes many iterations, and each iteration works with the entire data, so it causes high latency and needs more computing power. Solution: batch gradient (i.e., mini-batch gradient descent).
Batch gradient: the data is split into multiple batches, the gradient update is applied on each batch separately, a separate minimum loss is reached for each batch, and finally the weight matrix giving the global minimum loss is kept.
Problem with batch gradient: each batch contains only a few of the patterns in the entire data, which means other patterns are missed and the model couldn't learn all the patterns from the data.
@brahimaksasse2577 · 4 years ago
The GD algorithm uses all the data for updating the weights when optimising the loss function in the BP algorithm, whereas SGD uses a single sample at each iteration.
@AnotherproblemOn · 3 years ago
You're simply the best, love you
@K-mk6pc · 3 years ago
Stochastic gradient descent is a variant in which the data points used for each update are picked randomly, unlike the other type of gradient descent, where the global minimum is sought using the entire training data.
@cutyoopsmoments2800 · 4 years ago
Sir, kindly make all the videos on feature engineering and feature selection that are present in your GitHub link, please.
@dragonhead48 · 4 years ago
@krishNaik You could add the links for the lasso and ridge regularization techniques to this video. That would be helpful and beneficial for both parties as well, I think.
@Trendz-w5d · 3 years ago
kzbin.info/www/bejne/b521p2NnfamIZtU here it is
@sridhar6358 · 4 years ago
Lasso and ridge regression - a precondition is that there should not be multicollinearity. If we see a linear relationship between the independent variables, like the one we see between the dependent and independent variables, we call it multicollinearity, which is not the same thing as correlation.
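For reference, a small scikit-learn sketch (synthetic data, and the alpha values are arbitrary) comparing plain linear regression with ridge and lasso on deliberately collinear columns; ridge typically spreads the weight across the near-duplicate columns, while lasso tends to zero one of them out:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)   # x2 is nearly a copy of x1 -> multicollinearity
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.1, size=500)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01)):
    print(type(model).__name__, model.fit(X, y).coef_.round(2))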
@charlottedsouza274 · 4 years ago
Hi Krish... Thanks for such a clear explanation. For large datasets and regression problems we have ridge and lasso. What about classification problems? How do we deal with multicollinearity for large datasets?
@SC-hp5dn · 2 years ago
That's what I came here to find.
@sathwickreddymora8767 · 4 years ago
Let's assume we use an MSE cost function. Gradient descent -> it takes all the points into account when computing the derivatives of the cost function w.r.t. each feature, which tell us the right direction to move in. It is not productive if we have a large number of data points. SGD -> it computes the derivatives of the cost function w.r.t. each feature based on a single data point (or some subset of data points) and moves in that direction, pretending it was the right direction. So it reduces much of the computational complexity.
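A tiny NumPy check of that statement (made-up data): averaging the per-point gradients recovers the full-batch MSE gradient, while any single per-point gradient is only a noisy estimate of it:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)
w = np.zeros(2)

full_grad = 2 * X.T @ (X @ w - y) / len(X)     # GD: gradient of the MSE over all points
point_grads = 2 * X * (X @ w - y)[:, None]     # one row = gradient from a single point (SGD)

print(full_grad)
print(point_grads.mean(axis=0))   # averaging the per-point gradients gives back the full gradient
print(point_grads[0])             # a single point gives only a noisy estimate of that direction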
@swatisawant7632 · 4 years ago
Very nicely explained!!!
@charlottedsouza274 · 4 years ago
In addition, can you create a separate playlist for interview questions so that they are all in one place?
@ganeshprabhakaran9316 · 4 years ago
That was a clear explanation. Thanks Krish. Small request: can you make a video on feature selection using at least 15-20 variables based on multicollinearity, for better understanding through practice?
@rahuldey6369 · 4 years ago
@2:26 Could you please explain what disadvantage it can cause to model performance? I mean, if I remove correlated features, will my model performance increase or stay the same?
@sarveshmankar7272 · 4 years ago
Sir, can we use PCA to reduce multicollinearity if we have, say, more than 200 columns?
@venkivtz9961 · 2 years ago
PCA is the best in some cases of multicollinearity problems
@gaugogoi · 4 years ago
Along with a correlation heatmap and lasso & ridge regression, can VIF be another option to figure out multicollinearity?
@GauravSharma-ui4yd · 4 years ago
Yes
@HBG-SHIVANI · 4 years ago
Set the standard benchmark that if VIF > 5 then multicollinearity exists.
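A minimal statsmodels sketch of that check (the DataFrame here is made up; in practice you would pass your own numeric feature frame, and the cutoff of 5 is just the rule of thumb above):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.05, size=300),   # "b" nearly duplicates "a"
                   "c": rng.normal(size=300)})

X = add_constant(df)                      # include an intercept column before computing VIF
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))                  # VIF > 5 for "a" and "b" flags the multicollinearity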
@louerleseigneur4532 · 3 years ago
Thanks Krish
@BhanudaySharma506 · 3 years ago
In the case of multicollinearity, why is there no mention of PCA?
@thunder440v3 · 4 years ago
Such a helpful video 🙏☺️
@swethakulkarni3563 · 4 years ago
Can you make a detailed video on Naive Bayes?
@surajshivakumar5124 · 3 years ago
We can just use the variance inflation factor, right?
@Arjun147gtk · 3 years ago
Is it recommended to remove highly negatively correlated features?
@sahubiswajit1996 · 4 years ago
Stochastic Gradient Descent (SGD): it means we are sending ONLY ONE DATA POINT (only one row) per update during the training phase.
Gradient Descent (GD): it means we are sending ALL DATA POINTS (all rows) for every update during the training phase.
Mini-batch SGD: it means we are sending SOME PORTION of the data points per update (let us consider batches of 50 data points out of 50K data points, which means 1,000 batches).
Sir, is this correct?
@GauravSharma-ui4yd · 4 years ago
The only correct distinction so far. Just one thing: those are not epochs, they are steps/updates within a single epoch (and 50K points in batches of 50 gives 1,000 of them).
@sahubiswajit1996 · 4 years ago
@@GauravSharma-ui4yd I am not understanding the last sentence. Please explain it to me again.
@GauravSharma-ui4yd · 4 years ago
@@sahubiswajit1996 1 epoch means one pass over the entire dataset. So if there are 50K points and you batch them in groups of 50, then you have 1,000 such batches, and on each batch you calculate the gradients and update the weights - in short, one step of GD. You do that 1,000 times until the dataset is exhausted, but after doing so you have made just one pass over the entire data, hence 1 epoch. If you do that for 10 epochs, there will be 1,000 × 10 = 10,000 steps/updates.
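The bookkeeping in a few lines of Python, using the numbers from this example:

import math

n_points, batch_size, epochs = 50_000, 50, 10

steps_per_epoch = math.ceil(n_points / batch_size)   # weight updates in one pass over the data
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 1000 updates per epoch, 10000 over 10 epochs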
@haneulkim4902 · 4 years ago
When you are using a small dataset and x1, x2 are highly correlated, which one do you drop?
@priyankabanda7562 · 7 months ago
The exact question asked in an interview.
@Sunilgayakawad · 4 years ago
Why does multicollinearity reduce after standardization?
@alipaloda9571 · 4 years ago
How about using the Box-Cox technique?
@ManishKumar-qs1fm · 4 years ago
Sir, please upload a video on eigenvalues & eigenvectors.
@adityay525125 · 4 years ago
SGD uses a variable learning rate and hence is better imo. I do not know the answer though, still a noob.
@shoraygoel · 4 years ago
Why can't we remove features using correlation when there are many features?
@rohankavari8612 · 4 years ago
As there are many variables, we might find many of them to be highly correlated, and removing that many variables will lead to a loss of information.
@shoraygoel · 4 years ago
@@rohankavari8612 If they are correlated, then there is no loss of information, right?
@prathmeshbusa2195 · 4 years ago
Hello, I am not able to make the payment of 59 rupees to join your channel. I tried with all possible bank cards but it always fails.
@ManishKumar-qs1fm · 4 years ago
Nice Sir
@ashum6612 · 4 years ago
Multicollinearity can be completely removed from the model. (True/False). Give reasons.
@rohankavari8612 · 4 years ago
False. Features with zero multicollinearity would be an ideal situation; there will always be some multicollinearity present. Our role is to deal with the variables that have high multicollinearity.
@harvinnation3027 · 2 years ago
I didn't know Post Malone is into data analysis!!
@ShubhanshuAnand · 4 years ago
SGD: picks k random samples from the n samples in each iteration, whereas GD considers all n samples.
@jagannadhareddykalagotla624 · 4 years ago
Stochastic gradient descent for the global minimum; gradient descent sometimes ends up at a local minimum.
@GauravSharma-ui4yd · 4 years ago
Nope :( If the function is convex, then GD always converges to the global minimum, but SGD doesn't ensure that. When the problem is non-convex, nothing ensures convergence to the global minimum, but SGD has better chances compared to GD in the case of non-convex functions.