Diff 1: Gradient descent takes all the data points into consideration to update the weights during back propagation to minimize the loss function, whereas stochastic gradient descent considers only one data point at a time for each weight update.
Diff 2: In gradient descent, convergence towards the minima is fast, whereas in stochastic gradient descent convergence is slower and noisier.
Diff 3: Since gradient descent loads and uses all the data points for every calculation, each update is computationally slow, whereas a stochastic gradient descent update is comparatively fast.
@GauravSharma-ui4yd · 4 years ago
@Ashwini Verma it's 1 data point at a time for SGD; a batch is for mini-batch gradient descent. But it's not your fault, because many people use mini-batch gradient descent and stochastic gradient descent interchangeably, and in fact most library implementations of SGD use a mini-batch. But the thing is: all data points at a time is GD / vanilla GD / batch GD, 1 data point at a time is SGD, and a batch at a time is mini-batch GD. When the batch size is 1, mini-batch GD is just SGD, and when it is equal to the entire dataset, it becomes vanilla GD.
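To make the distinction concrete, here is a minimal NumPy sketch (toy, synthetic data - not from the video) where the only thing that changes between vanilla GD, SGD and mini-batch GD is the batch size used per weight update:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                              # made-up features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

def run(batch_size, lr=0.01, epochs=5):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                            # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on this batch
            w -= lr * grad
    return w

w_gd  = run(batch_size=len(X))   # vanilla / batch GD: all points per update
w_sgd = run(batch_size=1)        # SGD: one point per update
w_mb  = run(batch_size=32)       # mini-batch GD: a small batch per update
print(np.round(w_gd, 2), np.round(w_sgd, 2), np.round(w_mb, 2))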
@GauravSharma-ui4yd · 4 years ago
Just one thing to add: don't put back propagation in the definition of GD, it has nothing to do with GD in general. It is a clever way of doing GD optimally and effectively in the case of neural nets. But when we apply these optimizers to non-neural-net models like linear regression, there is no notion of back propagation. So, long story short, backprop is not an intrinsic part of GD but just a clever way of doing it effectively when applied to neural nets.
@abhaypratapsingh4225 · 4 years ago
@@GauravSharma-ui4yd Great explanation! You seem to have a strong conceptual understanding. Thanks again.
@vishprab · 4 years ago
@@GauravSharma-ui4yd Hi, aren't the weights in the linear regression model also determined using backpropagation? Isn't it the same idea as in a neural net, i.e. updating the parameters after each step of GD? Don't we go back and forth to determine the weights in any regression, for that matter? So backprop is not a wrong term to use per se. Please let me know how these two are different in the conventional sense. Does having an activation function change the definition? medium.com/@Aj.Cheng/linear-regression-by-gradient-decent-bb198724eb2c#:~:text=linear%20regression%20formulation%20is%20very,some%20detail%20of%20it%20later.&text=the%20purpose%20of%20backpropagation%20is,side%20the%20one%20update%20step.
@DS_AIML · 3 years ago
That's why people prefer mini-batch stochastic gradient descent.
@sushilchauhan2586 · 4 years ago
Thanks Krishna! Krishna, I am the kind of person who likes your videos first and then watches them... I asked you about this yesterday and you explained it, thank you. So you are saying that after applying regularization there will be no multicollinearity? Stochastic gradient descent: stochastic gradient descent is almost similar to gradient descent; the only difference is that if I have "n" number of points in the training data, then it will randomly pick "k" number of points, where k < n.
@bharathjc4700 · 4 years ago
Multicollinearity may not be a problem every time. The need to fix multicollinearity depends primarily on the reasons below: if you care more about how much each individual feature (rather than a group of features) affects the target variable, then removing multicollinearity may be a good option. If multicollinearity is not present in the features you are interested in, then it may not be a problem.
@ShivShankarDutta1 · 4 years ago
GD: runs all the samples in the training set to do a single update of all the parameters in a given iteration.
SGD: uses only one training sample (or a subset) from the training set to update the parameters in a given iteration.
GD: if the number of samples/features is large, it takes much more time to update the values.
SGD: it is faster because each update uses one training sample.
SGD converges faster than GD.
@rahuldey6369 · 4 years ago
Yes, SGD converges faster than GD, but it fails to find the global minimum the way GD does; instead it oscillates around a value close to the global minimum, which people say is a good approximation to go with.
@bharathjc4700 · 4 years ago
In batch gradient descent you compute the gradient over the entire dataset, averaging over a potentially vast amount of information. It takes a lot of memory to do that. But the real handicap is that the batch gradient trajectory can land you in a bad spot (a saddle point). In pure SGD, on the other hand, you update your parameters by adding (with a minus sign) the gradient computed on a single instance of the dataset. Since it's based on one random data point, it's very noisy and may go off in a direction far from the batch gradient. However, the noisiness is exactly what you want in non-convex optimization, because it helps you escape from saddle points or local minima. GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large. That means GD is preferable for small datasets while SGD is preferable for larger ones.
@cutyoopsmoments2800 · 4 years ago
Sir, I am a great fan of yours.
@DionysusEleutherios · 4 years ago
If you have a large feature space that contains multicollinearity, you could also try running a PCA and using only the first n components in your model (where n is the number of components that collectively explain at least 80% of the variance), since they are by definition orthogonal to each other.
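A minimal scikit-learn sketch of that idea (the data here is made up, and the 0.80 threshold is just the rule of thumb mentioned above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 5))])   # 8 columns, heavily collinear by construction

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=0.80)                   # keep enough components for >= 80% of the variance
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape[1], pca.explained_variance_ratio_.sum())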
@akashsaha3921 · 4 years ago
No... PCA reduces dimensionality but doesn't consider the class labels while doing that. In logistic classification, doing PCA means you are dropping features without considering the class labels. So in practice, if our aim is just to maximize the accuracy of the model, then PCA is fine, but if you want to interpret which features are important for classification in the model, then PCA is not recommended. Moreover, PCA preserves variance by creating new features; in practice, you might be asked not to create new features. That's why lasso and ridge regression are used - to penalize.
@hokapokas · 4 years ago
Any response on the VIF approach???
@akashsaha3921 · 4 years ago
@@hokapokas Look, L1 and L2 are a must to control over- and under-fitting... but using L1 when features are multicollinear will create more sparsity, so in that case we use L2. Next, to detect multicollinearity you can do a perturbation test. You can also do forward feature selection to select the important features. At the end you should try every possible way to make your model better. I prefer to use perturbation to check collinearity and then, based on that, select L1, L2 or elastic net.
@hokapokas · 4 years ago
@@akashsaha3921 I agree on regularisation to put constraints on a model in terms of feature selection or reducing the magnitudes of coefficients, but I was suggesting a VIF approach as well to select and reject features. We should explore this approach too for multicollinearity.
@GauravSharma-ui4yd · 4 years ago
@@hokapokas A lot of people are actually using this approach.
@mahender1440 · 4 years ago
Hi, Krish.
Gradient descent: on a big volume of data it takes many iterations, and each iteration works with the entire data, so it causes high latency and needs more computing power. Solution: batch gradient (i.e., mini-batch gradient descent).
Batch gradient: the data is split into multiple batches, the gradient update is applied on each batch separately, a separate minimum loss is reached for each batch, and finally the weight matrix giving the global minimum loss is kept.
Problem with batch gradient: each batch contains only a few of the patterns in the entire data, which means other patterns are missed and the model couldn't learn all the patterns from the data.
@brahimaksasse2577 · 4 years ago
The GD algorithm uses all the data for updating the weights when optimising the loss function in the BP algorithm, whereas SGD uses a single sample at each iteration.
@AnotherproblemOn · 3 years ago
You're simply the best, love you
@K-mk6pc · 3 years ago
Stochastic gradient descent is a variant in which the data points used for each update are picked randomly, unlike the other type of gradient descent, where the global minimum is sought using the entire training data.
@cutyoopsmoments2800 · 4 years ago
Sir, kindly make all the videos on feature engineering and feature selection that are present in your GitHub link, please.
@dragonhead48 · 4 years ago
@krishNaik You could add the links for the lasso and ridge regularization techniques to this video. That would be helpful and beneficial for both parties as well, I think.
@Trendz-w5d · 3 years ago
kzbin.info/www/bejne/b521p2NnfamIZtU here it is
@sridhar6358 · 4 years ago
Lasso and ridge regression - a precondition is that there should not be multicollinearity. If we see a linear relationship between the independent variables, like the one we see between the dependent and independent variables, we call it multicollinearity, which is not the same thing as correlation.
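For reference, a small scikit-learn sketch (synthetic data, and the alpha values are arbitrary) comparing plain linear regression with ridge and lasso on deliberately collinear columns; ridge typically spreads the weight across the near-duplicate columns, while lasso tends to zero one of them out:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)   # x2 is nearly a copy of x1 -> multicollinearity
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=0.1, size=500)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01)):
    print(type(model).__name__, model.fit(X, y).coef_.round(2))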
@charlottedsouza274 · 4 years ago
Hi Krish... Thanks for such a clear explanation. For large datasets and regression problems we have ridge and lasso. What about classification problems? How do we deal with multicollinearity for large datasets?
@SC-hp5dn · 2 years ago
That's what I came here to find.
@sathwickreddymora8767 · 4 years ago
Let's assume we use an MSE cost function. Gradient descent -> it takes all the points into account when computing the derivatives of the cost function w.r.t. each feature, which tell us the right direction to move in. It is not productive if we have a large number of data points. SGD -> it computes the derivatives of the cost function w.r.t. each feature based on a single data point (or some subset of data points) and moves in that direction, pretending it was the right direction. So it reduces much of the computational complexity.
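A tiny NumPy check of that statement (made-up data): averaging the per-point gradients recovers the full-batch MSE gradient, while any single per-point gradient is only a noisy estimate of it:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=200)
w = np.zeros(2)

full_grad = 2 * X.T @ (X @ w - y) / len(X)     # GD: gradient of the MSE over all points
point_grads = 2 * X * (X @ w - y)[:, None]     # one row = gradient from a single point (SGD)

print(full_grad)
print(point_grads.mean(axis=0))   # averaging the per-point gradients gives back the full gradient
print(point_grads[0])             # a single point gives only a noisy estimate of that direction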
@swatisawant7632 · 4 years ago
Very nicely explained!!!
@charlottedsouza274 · 4 years ago
In addition, can you create a separate playlist for interview questions so that they are all in one place?
@ganeshprabhakaran9316 · 4 years ago
That was a clear explanation. Thanks Krish. Small request: can you make a video on feature selection using at least 15-20 variables based on multicollinearity, for better understanding through practice?
@rahuldey6369 · 4 years ago
@2:26 Could you please explain what disadvantage it can cause to model performance? I mean, if I remove correlated features, will my model performance increase or stay the same?
@sarveshmankar7272 · 4 years ago
Sir, can we use PCA to reduce multicollinearity if we have, say, more than 200 columns?
@venkivtz9961 · 2 years ago
PCA is the best in some cases of multicollinearity problems
@gaugogoi · 4 years ago
Along with a correlation heatmap and lasso & ridge regression, can VIF be another option to figure out multicollinearity?
@GauravSharma-ui4yd · 4 years ago
Yes
@HBG-SHIVANI · 4 years ago
Set the standard benchmark that if VIF > 5 then multicollinearity exists.
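A minimal statsmodels sketch of that check (the DataFrame here is made up; in practice you would pass your own numeric feature frame, and the cutoff of 5 is just the rule of thumb above):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({"a": a,
                   "b": a + rng.normal(scale=0.05, size=300),   # "b" nearly duplicates "a"
                   "c": rng.normal(size=300)})

X = add_constant(df)                      # include an intercept column before computing VIF
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))                  # VIF > 5 for "a" and "b" flags the multicollinearity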
@louerleseigneur4532 · 3 years ago
Thanks Krish
@BhanudaySharma506 · 3 years ago
In the case of multicollinearity, why is there no mention of PCA?
@thunder440v3 · 4 years ago
Such a helpful video 🙏☺️
@swethakulkarni3563 · 4 years ago
Can you make a detailed video on Naive Bayes?
@surajshivakumar5124 · 3 years ago
We can just use the variance inflation factor, right?
@Arjun147gtk · 3 years ago
Is it recommended to remove highly negatively correlated features?
@sahubiswajit1996 · 4 years ago
Stochastic Gradient Descent (SGD): it means we are sending ONLY ONE DATA POINT (only one row) per update during the training phase.
Gradient Descent (GD): it means we are sending ALL DATA POINTS (all rows) for every update during the training phase.
Mini-batch SGD: it means we are sending SOME PORTION of the data points per update (let us consider batches of 50 data points out of 50K data points, which means 1,000 batches).
Sir, is this correct?
@GauravSharma-ui4yd · 4 years ago
The only correct distinction so far. Just one thing: those are not epochs, they are steps/updates within a single epoch (and 50K points in batches of 50 gives 1,000 of them).
@sahubiswajit1996 · 4 years ago
@@GauravSharma-ui4yd I am not understanding the last sentence. Please explain it to me again.
@GauravSharma-ui4yd · 4 years ago
@@sahubiswajit1996 1 epoch means one pass over the entire dataset. So if there are 50K points and you batch them in groups of 50, then you have 1,000 such batches, and on each batch you calculate the gradients and update the weights - in short, one step of GD. You do that 1,000 times until the dataset is exhausted, but after doing so you have made just one pass over the entire data, hence 1 epoch. If you do that for 10 epochs, there will be 1,000 × 10 = 10,000 steps/updates.
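The bookkeeping in a few lines of Python, using the numbers from this example:

import math

n_points, batch_size, epochs = 50_000, 50, 10

steps_per_epoch = math.ceil(n_points / batch_size)   # weight updates in one pass over the data
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 1000 updates per epoch, 10000 over 10 epochs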
@haneulkim4902 · 4 years ago
When you are using a small dataset and x1, x2 are highly correlated, which one do you drop?
@priyankabanda7562 · 7 months ago
The exact question asked in an interview.
@Sunilgayakawad · 4 years ago
Why does multicollinearity reduce after standardization?
@alipaloda9571 · 4 years ago
How about using the Box-Cox technique?
@ManishKumar-qs1fm · 4 years ago
Sir, please upload a video on eigenvalues & eigenvectors.
@adityay525125 · 4 years ago
SGD uses a variable learning rate and hence is better imo. I do not know the answer though, still a noob.
@shoraygoel · 4 years ago
Why can't we remove features using correlation when there are many features?
@rohankavari8612 · 4 years ago
As there are many variables, we might find many of them to be highly correlated, and removing that many variables will lead to a loss of information.
@shoraygoel · 4 years ago
@@rohankavari8612 If they are correlated, then there is no loss of information, right?
@prathmeshbusa2195 · 4 years ago
Hello, I am not able to make the payment of 59 rupees to join your channel. I tried with all possible bank cards but it always fails.
@ManishKumar-qs1fm · 4 years ago
Nice Sir
@ashum6612 · 4 years ago
Multicollinearity can be completely removed from the model. (True/False). Give reasons.
@rohankavari8612 · 4 years ago
False. Features with zero multicollinearity would be an ideal situation; there will always be some multicollinearity present. Our role is to deal with the variables that have high multicollinearity.
@harvinnation3027 · 2 years ago
I didn't know Post Malone is into data analysis!!
@ShubhanshuAnand · 4 years ago
SGD: picks k random samples from the n samples in each iteration, whereas GD considers all n samples.
@jagannadhareddykalagotla624 · 4 years ago
Stochastic gradient descent for the global minimum; gradient descent sometimes ends up at a local minimum.
@GauravSharma-ui4yd · 4 years ago
Nope :( If the function is convex, then GD always converges to the global minimum, but SGD doesn't ensure that. When the problem is non-convex, nothing ensures convergence to the global minimum, but SGD has better chances compared to GD in the case of non-convex functions.