Really helpful. I hope we get many more videos like this. Can you cover the Regularization topic after that? A good part of the video is short and easy to digest with the help of a diagram, so it can be covered in breaks as revision.
@tagoreji2143 · 2 years ago
Thank you Sir
@datahat642 · 3 years ago
Nice explanation
@naveens11 · 4 years ago
Excellent explanation 🙏🙏🙏
@souravbarik8470 · 4 years ago
But the idea of loss is to penalize the wrong prediction, so L2 would penalize more than L1 right?
@AppliedAICourse · 4 years ago
Yes, L2 would penalise more than L1 when e_i is large. But, the question here is about robustness to outliers.
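A quick worked example of that robustness point (the residual values below are made up purely for illustration): take residuals e = (1, 1, 1, 10), where only the last one comes from an outlier.

L1 = |1| + |1| + |1| + |10| = 13, so the outlier contributes 10/13 ≈ 77% of the loss;
L2 = 1 + 1 + 1 + 100 = 103, so the outlier contributes 100/103 ≈ 97% of the loss.

That single point dominating L2 is what lets an outlier drag the L2-optimal fit so strongly.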
@HarshKumar-zc4ox · 4 years ago
I think the x-axis is not e_i; it's the coefficient of the function we are using to fit the model. E.g., in dL/dw (the gradient) that we use, w is the parameter of the function and L is the loss, so the x-axis should be w (the parameter that we are optimizing). L2 will surely take more time to converge than L1 because of the large value of L2 due to outliers.
@AppliedAICourse · 4 years ago
I think you are mixing this question up with L1 regularisation and hence getting confused with the x-axis. We are not talking about regularisation in this problem, but about the loss itself.
@HarshKumar-zc4ox · 4 years ago
@@AppliedAICourse In the first part of the video, regularization doesn't even come into the picture, so why would I confuse this with L1/L2 regularization? What I meant to say was that e_i and the loss are more or less the same thing, so what's the point of plotting one on the x-axis and the other on the y-axis?
@AppliedAICourse · 4 years ago
Note that L1 and L2 are functions of the e_i's and are not the same as the e_i's. In the video we describe how the e_i's are related to outliers. Then, to show their impact on the loss function, we plotted the e_i's vs the losses. The plot helps us visually understand and appreciate how L1 and L2 differ in the impact that outlier points and their e_i's have on the loss we want to minimise.
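For anyone who wants to reproduce that plot, here is a minimal matplotlib sketch; the residual range is arbitrary and chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(-5, 5, 400)             # residuals e_i on the x-axis (arbitrary range)
plt.plot(e, np.abs(e), label="L1: |e_i|")
plt.plot(e, e ** 2, label="L2: e_i^2")
plt.xlabel("e_i (residual)")
plt.ylabel("contribution to the loss")
plt.legend()
plt.show()
```

The quadratic curve shooting up for large |e_i| is exactly where outliers start dominating the L2 loss.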
@AKASHKUMAR-we5hg · 4 years ago
What's wrong with Huber loss?
@AppliedAICourse · 4 years ago
That's a good loss to use, as it behaves like L1 beyond a threshold (delta) and like L2 otherwise. Actually, robust regression is one of the major applications of Huber loss (en.wikipedia.org/wiki/Huber_loss). But the question here was a choice between L1 and L2, and hence we picked L1 amongst the two.
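A minimal NumPy sketch of the Huber loss described above, following the standard definition on the linked Wikipedia page (delta is the threshold where the loss switches from quadratic to linear):

```python
import numpy as np

def huber_loss(e, delta=1.0):
    # Quadratic (L2-like) for |e| <= delta, linear (L1-like) beyond it.
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e ** 2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

print(huber_loss([0.5, 3.0]))  # small residual penalised quadratically, large one only linearly
```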
@AKASHKUMAR-we5hg · 4 years ago
@@AppliedAICourse Thanks for replying; I always appreciate your effort. I encountered this kind of problem when doing time series forecasting, where the data contained too many outliers, and there I used Huber loss as the loss function.
@arun_sain · 4 years ago
👌👌👌👌👌👌
@vython88 · 4 years ago
How can we minimize the L1 loss when it's not differentiable?
@AppliedAICourse · 4 years ago
Mathematically, the L1 loss is not differentiable only at e_i = 0. Hence, in most implementations, a hack is often used, very similar to what is used in the case of L1 regularisation in many ML optimisation functions. Sub-gradients are one way to handle the non-differentiable nature of L1. For more info, please check out see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf
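A minimal sketch of the sub-gradient idea mentioned above, assuming NumPy; the choice of 0 at e = 0 is just one valid element of the subdifferential [-1, 1]:

```python
import numpy as np

def l1_subgradient(e):
    # d|e|/de = sign(e) for e != 0; at e = 0 any value in [-1, 1] is a valid
    # sub-gradient, and np.sign's choice of 0 is one such value.
    return np.sign(e)

print(l1_subgradient(np.array([-2.0, 0.0, 3.5])))  # -> [-1.  0.  1.]
```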
@vython88 · 4 years ago
@@AppliedAICourse Thank you so much for sharing this :)
@Joel95ify · 4 years ago
By that logic, shouldn't we use L1 everywhere, even if there are no outliers? Its value will always be less than L2 no matter what the data points are. Then why use L2 in the first place?
@AppliedAICourse · 4 years ago
That's a good question. When we don't have outliers, the L2 error converges faster, as its gradient's magnitude can be greater than 1 for larger-magnitude e_i's, while L1's gradient magnitude is always 1 even for larger-magnitude e_i's.
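In symbols, for a single residual e (this is just the standard derivative of each loss, written out to make the point concrete):

d(e^2)/de = 2e, so the L2 gradient's magnitude 2|e| grows with the residual;
d|e|/de = sign(e) for e ≠ 0, so the L1 gradient's magnitude is always 1.

Large residuals therefore push an L2 fit hard (fast convergence when there are no outliers, but big pulls from outliers), while L1 gives every residual the same-sized push.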
@Joel95ify · 4 years ago
@@AppliedAICourse So bottom line, if there are no outliers then L2 converges faster and for outliers L1 converges faster, right?
@AppliedAICourse · 4 years ago
Yes L2 converges faster than L1. But L1 is more robust to outliers.
@Naveenkumar-tp8ek · 4 years ago
Sir, we define our loss such that if we have a large error, our loss should be high. So, as per the question, aren't we supposed to consider L2?
@AppliedAICourse · 4 years ago
Yes, L2 penalises more than L1 for larger values of e_i, but the question here is about robustness to outliers.
@aravindraamasamy9453 · 4 years ago
Here the L2 loss is high if we have outliers; we can clearly see that from the graph. Why are we selecting the L1 loss just because it is less affected by outliers? Our main aim is to find w, b such that they are not affected by outliers, right? So why don't we use the L2 loss?
@AppliedAICourse · 4 years ago
We compute w and b by minimising the loss functions. Hence, anything that impacts the loss functions would impact the values of w and b. Since outliers impact the loss functions, they would also have an impact on w and b.
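To see this effect on the fitted w and b concretely, here is a minimal gradient-descent sketch, assuming NumPy; the toy data, learning rate, and step count are hypothetical values chosen only for illustration, not anything from the course:

```python
import numpy as np

# Hypothetical toy data: the last y value is an outlier.
x = np.array([0., 1., 2., 3., 4.])
y = np.array([0., 1., 2., 3., 40.])

def fit(loss_grad, lr=0.01, steps=5000):
    # Plain gradient descent on the mean loss; loss_grad gives dLoss/de for residual e.
    w, b = 0.0, 0.0
    for _ in range(steps):
        e = (w * x + b) - y
        g = loss_grad(e)
        w -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return w, b

w2, b2 = fit(lambda e: 2 * e)        # L2: d(e^2)/de = 2e
w1, b1 = fit(lambda e: np.sign(e))   # L1: d|e|/de = sign(e)
print("L2 fit:", w2, b2)             # slope pulled strongly towards the outlier
print("L1 fit:", w1, b1)             # stays much closer to the y = x trend of the other points
```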
@kshitizomar6730 · 4 years ago
But Sir, what about the optimization problem? Isn't differentiating a modulus function more cumbersome than a squared function?
@AppliedAICourse · 4 years ago
Yes, L1 is trickier to differentiate than L2, but it is an operation performed very often in most ML and deep learning packages, as they all have L1 regularisation. It is not a major concern anymore.
@satishkumar-vs2xh · 4 years ago
Sir, L2 is the square of L1. No model expects to have more loss: if L1 = 2 then L2 = 4. So even without plotting a graph we can answer that the L2 loss is more than L1, right? Why do we need to plot the graph? Also, Sir, can you please tell us how we can draw the graphs without values?
@AppliedAICourse · 4 years ago
Yes, we can answer it without plotting also. L2 is a quadratic function while L1 is linear. That's how you can draw it quickly.
@harv609 · 4 years ago
Which software is being used to write here? It looks really sleek 🔥
@AppliedAICourse · 4 years ago
Ink2Go and a Wacom tablet.
@varunsaproo4120 · 4 years ago
I think having the L2 loss leads to higher magnitudes of gradients, which in turn leads to highly varying weights, hence convergence with L2 is very difficult. Is my understanding correct? Can we say that the gradients will be noisy?
@AppliedAICourse · 4 years ago
How is this related to the question at hand on the robustness to outliers? It's not clear. Please elaborate.
@varunsaproo4120 · 4 years ago
@@AppliedAICourse If we consider linear regression with the L2 loss, we obtain the gradient as -1/2 * (y_i - f(x_i)) * del_f, where del_f is the derivative of f w.r.t. a weight. What I mean is, (y_i - f(x_i)) for an outlier would have a large magnitude, hence the gradients would also have a large magnitude. This would make convergence difficult.
@AppliedAICourse · 4 years ago
Larger gradient is better, right, as we will converge faster. Of course, we also have to update the learning rate with time as we often do in SGD, and hence avoid overshooting the minima. It is still not clear how the magnitude of the gradient impacts the robustness to outliers.
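A minimal sketch of the learning-rate schedule idea mentioned above; the 1/t decay and the constants are just one common, hypothetical choice, not necessarily what the course uses:

```python
def learning_rate(initial_lr, t, decay=0.01):
    # Shrink the step size as SGD iteration t grows, so occasional large
    # L2 gradients (e.g. from outliers) don't overshoot the minimum.
    return initial_lr / (1.0 + decay * t)
```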
@varunsaproo4120 · 4 years ago
Sorry, I did not think about the learning rate; I got a little confused, but now it is clear. Just one question: if I use the L1 loss, will my model overfit less compared to using the L2 loss, since the effect of outliers on L1 is smaller than on L2 and my model would be more robust?