What is L2 Regularization and how does it work in Neural Networks?

If a model performs well on the training set but poorly on the test set, regularization can help. For example, if the training error is 1% but the test error is 11%, the model is likely overfitting (a high-variance problem).

There are two main ways to deal with overfitting: get more training data or apply a regularization technique. When more data is hard to come by or expensive, regularization is the practical option. L2 is the most commonly used form of regularization. It adds an extra term to the loss function, so training minimizes not only the prediction loss but also the complexity of the model.
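As a rough sketch of that idea (not any specific library's implementation), the regularized loss is the original data loss plus a penalty on the squared weights. The example below assumes a mean-squared-error base loss and an illustrative regularization rate `lam`:

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    """Data loss (MSE here, as an example) plus an L2 penalty on the weights.
    `lam` is the regularization rate; 0.01 is just an illustrative default."""
    data_loss = np.mean((y_true - y_pred) ** 2)
    l2_penalty = lam * np.sum(weights ** 2)
    return data_loss + l2_penalty
```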

L2 regularization defines the regularization term as the sum of the squares of the feature weights, which amplifies the impact of outlier weights that are too large. For example, consider the following weights:

w1 = 0.3, w2 = 0.1, w3 = 6. Squaring each weight and summing gives 0.09 + 0.01 + 36 = 36.1. In this regularization term, a single weight, w3, contributes almost all of the complexity.
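A quick check of that arithmetic, showing just how dominant w3 is:

```python
weights = [0.3, 0.1, 6.0]
l2_term = sum(w ** 2 for w in weights)
print(l2_term)              # 36.1
print(6.0 ** 2 / l2_term)   # ~0.997: w3 alone accounts for ~99.7% of the term
```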

Put simply, L2 regularization prevents this by penalizing weights in proportion to their size, so no single large weight can dominate.

You can tune this behavior by multiplying the regularization term by a regularization rate (often written as lambda). Increasing the regularization rate pushes the weights toward 0, making the model simpler. Decreasing it allows the weights to grow larger, making the model more complex.
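The effect of the regularization rate on an individual weight shows up in a bare-bones gradient-descent step. This is only a sketch with made-up numbers: the penalty's gradient, 2 * lam * w, pulls the weight toward zero, and a larger lam pulls harder.

```python
lam = 0.1    # regularization rate (illustrative)
lr = 0.5     # learning rate (illustrative)
w = 5.0      # a single weight

for step in range(5):
    grad_data = 0.0              # pretend the data-loss gradient is zero to isolate the penalty
    grad_penalty = 2 * lam * w   # derivative of lam * w**2 with respect to w
    w -= lr * (grad_data + grad_penalty)
    print(round(w, 4))           # 4.5, 4.05, 3.645, ... — the weight decays toward 0
```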

How Regularization reduces overfitting

Increasing the regularization rate makes the whole network simpler by shrinking the weights, which reduces the influence of many hidden units. With small weights, each unit's pre-activation stays close to zero, where activation functions such as tanh are roughly linear, so every layer behaves almost like a linear layer. A network that is close to linear cannot fit an overly complicated decision boundary, and that is how L2 regularization reduces overfitting.
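A tiny numerical check of that "relatively linear" claim, assuming a tanh activation: near zero the function is almost the identity, while large pre-activations get squashed.

```python
import numpy as np

z_small = np.array([-0.1, 0.05, 0.1])   # small weights keep pre-activations near 0
z_large = np.array([-3.0, 1.5, 3.0])    # large weights push them into the saturated region

print(np.tanh(z_small))   # ≈ [-0.0997, 0.0500, 0.0997] — practically equal to z itself
print(np.tanh(z_large))   # ≈ [-0.995, 0.905, 0.995] — strongly non-linear / saturated
```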

For a practical implementation, look at the Keras L2 regularizer: https://keras.io/regularizers/
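As a minimal sketch of what that looks like in Keras (the layer sizes, input shape, and the 0.01 rate are placeholders, not recommendations):

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),   # L2 penalty on this layer's weights
    layers.Dense(1, activation='sigmoid',
                 kernel_regularizer=regularizers.l2(0.01)),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```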