Why Regularization Reduces Overfitting (C2W1L05)

93,080 views

DeepLearningAI


Take the Deep Learning Specialization: bit.ly/2PGCWHg
Check out all our courses: www.deeplearni...
Subscribe to The Batch, our weekly newsletter: www.deeplearni...
Follow us:
Twitter: / deeplearningai_
Facebook: / deeplearninghq
Linkedin: / deeplearningai

Comments: 42
@saanvisharma2081 · 5 years ago
With high bias the weights will be very small; with high variance the weights will be large. Likewise with regularisation: if lambda is very large, the weights are driven down, because gradient descent always tries to minimize the overall cost. If lambda is too small, the weights can grow and the model tries to fit every data point, which again creates overfitting. So lambda has to be tuned so that both bias and variance stay in an acceptable range.
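For concreteness, here is a minimal NumPy sketch (not from the video) of the L2-regularized cost being discussed; the layer shapes, the lambda values, and the 0.3 data loss are illustrative assumptions only.

```python
import numpy as np

def l2_regularized_cost(data_loss, weight_matrices, lam, m):
    """Cost = data loss + (lambda / (2m)) * sum of squared Frobenius norms of the W[l]."""
    l2_penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weight_matrices)
    return data_loss + l2_penalty

# Illustrative numbers only: with a large lambda the penalty dominates the cost,
# so gradient descent is pushed toward smaller weights (more bias, less variance).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
print(l2_regularized_cost(0.3, weights, lam=0.01, m=100))
print(l2_regularized_cost(0.3, weights, lam=10.0, m=100))
```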
@aditisrivastava7079 · 5 years ago
I like your explanation.
@lekjov6170 · 4 years ago
Thanks for your comment, it clicked for me now.
@dragonixZXgames · 4 years ago
Thanks, I finally got it.
@ozziejin · 3 years ago
Interesting perspective, I had never thought of using the shape of tanh to help build intuition for regularization.
@thedrei24 · 5 years ago
I feel like the second explanation, with the tanh function, is much better.
@bzqp2 · 3 years ago
But for other activations (e.g. ReLU) this explanation is totally counterintuitive. For ReLU the nonlinearity is exactly at 0, so a smaller absolute value of w wouldn't really reduce the unit's use of the nonlinear range.
@cbt0949 · 3 years ago
@bzqp2 ReLU is linear near 0.
@bzqp2 · 3 years ago
@cbt0949 Well, NEAR 0 it is linear, but exactly at 0 the first derivative changes, which makes it nonlinear there.
@cbt0949 · 3 years ago
Leaky ReLU can be an example of a non-linear function, but in my opinion ReLU is linear, since its domain is [0, inf) or (0, inf).
@bzqp2 · 3 years ago
@cbt0949 Nope, normal ReLU is also nonlinear. An activation function needs to be nonlinear (see the Rumelhart, Hinton, and Williams article "Learning representations by back-propagating errors" for a more detailed explanation), and that's why we use ReLU and not a simple linear wx + b. A neural network with purely linear activations would collapse to a single linear model no matter how deep it was. The fact that ReLU has different derivatives over its domain is what makes it nonlinear.
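A small sketch of that point, under assumed random weights: two purely linear layers collapse into a single matrix, whereas inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((2, 5))
x = rng.standard_normal(3)

# Two purely linear layers collapse into one linear map W2 @ W1:
linear_stack = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(linear_stack, collapsed))   # True: depth adds no expressive power

# With ReLU in between, the collapsed matrix no longer reproduces the mapping:
relu_stack = W2 @ np.maximum(W1 @ x, 0.0)
print(np.allclose(relu_stack, collapsed))     # False (almost surely)
```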
@beluga7428 · a year ago
I have a doubt: if z is small, why does it have any effect on overfitting of the curve? We obtain the decision boundary by setting z = 0, so the tanh(z) function isn't involved in plotting the decision boundary at all!
@sandipansarkar9211 · 3 years ago
Very nice explanation. Need to watch it again.
@jesuspreachings2023 · 2 years ago
ReLU is also a linear activation function, so how come it doesn't reduce the network to a linear network?
@AvinashSingh-bk8kg · 3 years ago
What an amazing intuition 🙇‍♂️
@hackercop · 2 years ago
Never thought of tanh (or sigmoid) from that perspective, thanks.
@redberries8039 · 6 years ago
Andrew says that tanh in its linear region will deliver only linear models... what about ReLU? That's all linear, so can it only deliver linear models? ReLU is popular, so I find that hard to believe... where am I confused?
@JadtheProdigy · 5 years ago
I see what you mean; it doesn't make sense to regularize with lambda for ReLU, but it does if you are regularizing with dropout. Great point though.
@chakibbachounda1721 · 5 years ago
datascience.stackexchange.com/questions/26475/why-is-relu-used-as-an-activation-function
@fupopanda · 4 years ago
ReLU has a non-linearity. It does not have a constant gradient over its whole domain. It's piecewise linear, not simply linear.
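A tiny numerical illustration of that piecewise-linearity point (the input values are arbitrary examples):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# ReLU is linear on each side of 0 (slope 0 for z < 0, slope 1 for z > 0),
# but it is not a linear function overall: additivity fails across the kink at 0.
print(relu(-1.0) + relu(1.0))  # 1.0
print(relu(-1.0 + 1.0))        # 0.0, so relu(a + b) != relu(a) + relu(b)
```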
@bzqp2 · 3 years ago
ReLU has its nonlinearity at z = 0 (which actually makes the explanation totally counterintuitive).
@redberries8039 · 3 years ago
@bzqp2 Yes, I accept this reasoning now. Cheers.
@redberries8039 · 6 years ago
The tanh intuition makes sense to me [linearising and so simplifying the model]... but the first intuition, described as making the weights so small their effects disappear, does not make sense to me. If ALL the weights are reduced by the same factor, then it's the same model, isn't it? There would need to be some selectivity in the reductions, it seems to me?
@RickLamers · 5 years ago
The weights in the model are randomly initialized, and some start close to 0 while others don't. L2 regularization puts pressure on these weights during backpropagation, keeping them from reaching high values. The result is a model in which fewer hidden units contribute significantly, and so it has reduced complexity in terms of the number of units that effectively produce its output.
@rahuldey6369 · 3 years ago
@RickLamers Absolutely, that is how regularization works. Overfitting occurs when you give unnecessary importance to most of the weights in order to fit the training set well, which ends up producing a model that tries to capture the whole pattern of the training data very precisely. But our goal isn't to build a model that precisely captures the training data pattern; it's to build a more generalized model that performs well on the dev/test set, which is completely unseen by the model. So to reduce complexity, one can penalize the large weights, the ones the model thinks are mainly responsible for matching the hypothesis but that fail on the dev/test set.
@rahuldey6369 · 3 years ago
As I've already done Prof. Andrew Ng's machine learning lectures, the terms seem familiar, as does the lambda/2m explanation he gave in the logistic regression part. But it would have been more helpful if you could give us some idea of the scenarios in which we prefer L2 over L1, or whether we can just try L2 by default for almost the same effect. I'm a big fan of yours, professor.
@lmadriles · 4 years ago
Why isn't a high lambda the same as a high learning rate?
@rahuldey6369 · 3 years ago
Check this- stats.stackexchange.com/questions/168666/boosting-why-is-the-learning-rate-called-a-regularization-parameter#:~:text=I%20don't%20get%20why,of%20Statistical%20Learning%2C%20section%2010.12.&text=Regularization%20means%20%22way%20to%20avoid,too%20high%20leads%20to%20overfitting).
@krishnachauhan2850 · 4 years ago
How does a large lambda drive the W matrix toward zero? Please guide.
@tamoorkhan3262 · 3 years ago
With L2 regularization, the update formula for the weights of a layer becomes W = (1 - learning_rate * lambda / m) * W - learning_rate * (dCost/dW), so you can see that a larger lambda shrinks W more on every step.
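As a sketch, the same update written out in NumPy; the weight values, lambda, m, and learning rate below are made up purely for illustration.

```python
import numpy as np

def l2_update(W, dW_backprop, lam, m, lr):
    """One gradient step with L2 regularization ("weight decay").

    Equivalent forms:
        W -= lr * (dW_backprop + (lam / m) * W)
        W  = (1 - lr * lam / m) * W - lr * dW_backprop
    so each step multiplicatively shrinks W, and a larger lambda shrinks it more.
    """
    return (1 - lr * lam / m) * W - lr * dW_backprop

W = np.array([[0.8, -1.2], [0.5, 2.0]])
# With the data gradient set to zero, only the decay term acts: W shrinks by 20%.
print(l2_update(W, dW_backprop=np.zeros_like(W), lam=100.0, m=50, lr=0.1))
```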
@X_platform · 6 years ago
I am confused... if lambda is big, your dw will be big, so your w could become very negative after the update (which for tanh is in the nonlinear range)... won't that overfit even more?
6 years ago
Well, I think it's because of the optimization problem. In order to minimize the cost function J, if lambda is large, we need to choose smaller values of W.
@drummatick · 5 years ago
You're absolutely right; there is no restriction on lambda imposed here. For instance, we could choose any negative lambda and it would obviously minimize it; choosing -infinity would be 'best'. He doesn't explain this clearly.
@drummatick · 5 years ago
@ Bro, if we just take lambda = -infinity, we're done; that's the minimum of J you can get. Think about it.
5 years ago
@drummatick Hi, as Andrew Ng explained above, if lambda is very big then the optimizer will pick very small values of W (W nearly 0). The z values will then be small and the model becomes nearly a linear model (in the case of the tanh activation function), so the model is now UNDERFITTING. In contrast, if you choose lambda to be very small (in your case, -infinity), the optimizer can pick W values that are not small at all. The model will then be complex, since the z values will be large. Of course the value of the cost function J is nearly 0 (because lambda is -infinity, as you said), but the risk of overfitting is now extremely high: your model will learn too much about the particularities of the training data and won't be able to generalize to new data.
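A quick numerical check of the "small z stays in the tanh linear regime" point; the z values are arbitrary examples.

```python
import numpy as np

z = np.array([0.05, 0.1, 0.5, 2.0, 5.0])
# Near z = 0, tanh(z) is almost exactly z (the linear regime that small weights keep
# the units in); for larger z it saturates and the unit becomes strongly nonlinear.
print(np.tanh(z))              # approx [0.050, 0.100, 0.462, 0.964, 1.000]
print(np.abs(np.tanh(z) - z))  # tiny for small z, large once z leaves the linear range
```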
@satyajitgiri5060 · 5 years ago
If w does become negative, then in the update w = w - alpha * (update from back-propagation + lambda * (w/m)), the term lambda * (w/m) would be negative. Since you'd be subtracting a negative number, it becomes equivalent to adding lambda * (|w|/m).
@rp88imxoimxo27 · 3 years ago
Too easy for a genius like me, but thanks for the explanation anyway; watched the video at 2x speed trying not to fall asleep.
@calluma8472 · 5 years ago
The meaning of the word "intuition" is being single-handedly destroyed by its misuse in this video.
@muhammadnaufil5237 · 5 years ago
lol
@timelyrain · 3 years ago
He is genuinely excited for dropout, so I'm going to click on it.