Really helpful. I hope we get many more videos like this. Can you cover the Regularization topic after that? A good part of the video is short and easy to digest with the help of a diagram, so it can be covered in breaks as revision.
@tagoreji2143 · 2 years ago
Thank you Sir
@datahat642 · 3 years ago
Nice explanation
@naveens11 · 4 years ago
Excellent explanation 🙏🙏🙏
@souravbarik8470 · 4 years ago
But the idea of loss is to penalize the wrong prediction, so L2 would penalize more than L1 right?
@AppliedAICourse · 4 years ago
Yes, L2 would penalise more than L1 when e_i is large. But, the question here is about robustness to outliers.
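A quick worked example of that robustness point (the residual values below are made up purely for illustration): take residuals e = (1, 1, 1, 10), where only the last one comes from an outlier.

L1 = |1| + |1| + |1| + |10| = 13, so the outlier contributes 10/13 ≈ 77% of the loss;
L2 = 1 + 1 + 1 + 100 = 103, so the outlier contributes 100/103 ≈ 97% of the loss.

That single point dominating L2 is what lets an outlier drag the L2-optimal fit so strongly.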
@HarshKumar-zc4ox · 4 years ago
I think the x-axis is not e_i; it's the coefficient of the function we are using to fit the model. E.g., in dL/dw (the gradient) that we use, w is the parameter of the function and L is the loss, so the x-axis should be w (the parameter that we are optimizing). L2 will surely take more time to converge than L1 because of the large value of L2 due to outliers.
@AppliedAICourse · 4 years ago
I think you are mixing this question up with L1 regularisation and hence getting confused with the x-axis. We are not talking about regularisation in this problem, but about the loss itself.
@HarshKumar-zc4ox · 4 years ago
@@AppliedAICourse In the first part of the video, regularization doesn't even come into the picture, so why would I confuse this with L1/L2 regularization? What I meant to say was that e_i and the loss are more or less the same thing, so what's the point of plotting one on the x-axis and the other on the y-axis?
@AppliedAICourse · 4 years ago
Note that L1 and L2 are functions of the e_i's and are not the same as the e_i's. In the video we describe how the e_i's are related to outliers. Then, to show their impact on the loss function, we plotted the e_i's vs the losses. The plot helps us visually understand and appreciate how L1 and L2 differ in the impact that outlier points and their e_i's have on the loss we want to minimise.
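For anyone who wants to reproduce that plot, here is a minimal matplotlib sketch; the residual range is arbitrary and chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-5, 5, 400)             # residuals e_i on the x-axis (arbitrary range)
plt.plot(e, np.abs(e), label="L1: |e_i|")
plt.plot(e, e ** 2, label="L2: e_i^2")
plt.xlabel("e_i (residual)")
plt.ylabel("contribution to the loss")
plt.legend()
plt.show()
```

The quadratic curve shooting up for large |e_i| is exactly where outliers start dominating the L2 loss.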
@AKASHKUMAR-we5hg · 4 years ago
What's wrong with Huber loss?
@AppliedAICourse · 4 years ago
That's a good loss to use, as it behaves like L1 beyond a threshold (delta) and like L2 otherwise. Actually, robust regression is one of the major applications of Huber loss (en.wikipedia.org/wiki/Huber_loss). But the question here was a choice between L1 and L2, and hence we picked L1 amongst the two.
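A minimal NumPy sketch of the Huber loss described above, following the standard definition on the linked Wikipedia page (delta is the threshold where the loss switches from quadratic to linear):

```python
import numpy as np

def huber_loss(e, delta=1.0):
    # Quadratic (L2-like) for |e| <= delta, linear (L1-like) beyond it.
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

print(huber_loss([0.5, 3.0]))  # small residual penalised quadratically, large one only linearly
```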
@AKASHKUMAR-we5hg · 4 years ago
@@AppliedAICourse Thanks for replying; I always appreciate your effort. I encountered this kind of problem when doing time series forecasting, where the data contained too many outliers, and there I used Huber loss as the loss function.
@arun_sain · 4 years ago
👌👌👌👌👌👌
@vython88 · 4 years ago
How can we minimize the L1 loss when it's not differentiable?
@AppliedAICourse · 4 years ago
Mathematically, the L1 loss is not differentiable only at e_i = 0. Hence, in most implementations, a hack is often used, very similar to what is used in the case of L1 regularisation in many ML optimisation functions. Sub-gradients are one way to handle the non-differentiable nature of L1. For more info, please check out see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf
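A minimal sketch of the sub-gradient idea mentioned above, assuming NumPy; the choice of 0 at e = 0 is just one valid element of the subdifferential [-1, 1]:

```python
import numpy as np

def l1_subgradient(e):
    # d|e|/de = sign(e) for e != 0; at e = 0 any value in [-1, 1] is a valid
    # sub-gradient, and np.sign's choice of 0 is one such value.
    return np.sign(e)

print(l1_subgradient(np.array([-2.0, 0.0, 3.5])))  # -> [-1.  0.  1.]
```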
@vython88 · 4 years ago
@@AppliedAICourse Thank you so much for sharing this :)
@Joel95ify · 4 years ago
By that logic, shouldn't we use L1 everywhere, even if there are no outliers? Its value will always be less than L2 no matter what the data points are. Then why use L2 in the first place?
@AppliedAICourse · 4 years ago
That's a good question. When we don't have outliers, the L2 error converges faster, as its gradient's magnitude can be greater than 1 for larger-magnitude e_i's, while L1's gradient magnitude is always 1 even for larger-magnitude e_i's.
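In symbols, for a single residual e (this is just the standard derivative of each loss, written out to make the point concrete):

d(e^2)/de = 2e, so the L2 gradient's magnitude 2|e| grows with the residual;
d|e|/de = sign(e) for e ≠ 0, so the L1 gradient's magnitude is always 1.

Large residuals therefore push an L2 fit hard (fast convergence when there are no outliers, but big pulls from outliers), while L1 gives every residual the same-sized push.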
@Joel95ify · 4 years ago
@@AppliedAICourse So bottom line, if there are no outliers then L2 converges faster and for outliers L1 converges faster, right?
@AppliedAICourse · 4 years ago
Yes L2 converges faster than L1. But L1 is more robust to outliers.
@Naveenkumar-tp8ek · 4 years ago
Sir, we define our loss such that if we have a large error, our loss should be high. So, as per the question, aren't we supposed to consider L2?
@AppliedAICourse · 4 years ago
Yes, L2 penalises more than L1 for larger values of e_i, but the question here is about robustness to outliers.
@aravindraamasamy9453 · 4 years ago
Here the L2 loss is high if we have outliers; we can clearly see that from the graph. Why are we selecting the L1 loss just because it is less affected by outliers? Our main aim is to find w, b such that they are not affected by outliers, right? So why don't we use the L2 loss?
@AppliedAICourse · 4 years ago
We compute w and b by minimising the loss functions. Hence, anything that impacts the loss functions would impact the values of w and b. Since outliers impact the loss functions, they would also have an impact on w and b.
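To see this effect on the fitted w and b concretely, here is a minimal gradient-descent sketch, assuming NumPy; the toy data, learning rate, and step count are hypothetical values chosen only for illustration, not anything from the course:

```python
import numpy as np

# Hypothetical toy data: the last y value is an outlier.
x = np.array([0., 1., 2., 3., 4.])
y = np.array([0., 1., 2., 3., 40.])

def fit(loss_grad, lr=0.01, steps=5000):
    # Plain gradient descent on the mean loss; loss_grad gives dLoss/de for residual e.
    w, b = 0.0, 0.0
    for _ in range(steps):
        e = (w * x + b) - y
        g = loss_grad(e)
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

w2, b2 = fit(lambda e: 2 * e)        # L2: d(e^2)/de = 2e
w1, b1 = fit(lambda e: np.sign(e))   # L1: d|e|/de = sign(e)
print("L2 fit:", w2, b2)             # slope pulled strongly towards the outlier
print("L1 fit:", w1, b1)             # stays much closer to the y = x trend of the other points
```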
@kshitizomar6730 · 4 years ago
But Sir, what about the optimization problem? Isn't differentiating a modulus function more cumbersome than a squared function?
@AppliedAICourse · 4 years ago
Yes, L1 is trickier to differentiate than L2, but it is an operation performed very often in most ML and deep learning packages, as they all have L1 regularisation. It is not a major concern anymore.
@satishkumar-vs2xh · 4 years ago
Sir, L2 is the square of L1. No model expects to have more loss: if L1 = 2 then L2 = 4. So even without plotting a graph we can answer that the L2 loss is more than L1, right? Why do we need to plot the graph? Also, Sir, can you please tell us how we can draw the graphs without values?
@AppliedAICourse · 4 years ago
Yes, we can answer it without plotting also. L2 is a quadratic function while L1 is linear. That's how you can draw it quickly.
@harv609 · 4 years ago
Which software is being used to write here? It looks really sleek 🔥
@AppliedAICourse · 4 years ago
Ink2Go and a Wacom tablet.
@varunsaproo4120 · 4 years ago
I think having the L2 loss leads to higher magnitudes of gradients, which in turn leads to highly varying weights, hence convergence with L2 is very difficult. Is my understanding correct? Can we say that the gradients will be noisy?
@AppliedAICourse · 4 years ago
How is this related to the question at hand on the robustness to outliers? It's not clear. Please elaborate.
@varunsaproo4120 · 4 years ago
@@AppliedAICourse If we consider linear regression with the L2 loss, we obtain the gradient as -1/2 * (y_i - f(x_i)) * del_f, where del_f is the derivative of f w.r.t. a weight. What I mean is, (y_i - f(x_i)) for an outlier would have a large magnitude, hence the gradients would also have a large magnitude. This would make convergence difficult.
@AppliedAICourse · 4 years ago
Larger gradient is better, right, as we will converge faster. Of course, we also have to update the learning rate with time as we often do in SGD, and hence avoid overshooting the minima. It is still not clear how the magnitude of the gradient impacts the robustness to outliers.
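A minimal sketch of the learning-rate schedule idea mentioned above; the 1/t decay and the constants are just one common, hypothetical choice, not necessarily what the course uses:

```python
def learning_rate(initial_lr, t, decay=0.01):
    # Shrink the step size as SGD iteration t grows, so occasional large
    # L2 gradients (e.g. from outliers) don't overshoot the minimum.
    return initial_lr / (1.0 + decay * t)
```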
@varunsaproo4120 · 4 years ago
Sorry, I did not think about the learning rate; I got a little confused, but now it is clear. Just one question: if I use the L1 loss, will my model overfit less compared to using the L2 loss, since the effect of outliers on L1 is smaller than on L2 and my model would be more robust?