Neural network – how Backward Propagation and Gradient descent are related

Back propagation is used by Optimization algorithms to adjust w and b.

At the end of a forward propagation (see my previous post), the output layer produces a predicted value. We compare this predicted value with the corresponding actual value from our training set to measure the difference; the function that measures this difference is referred to as the cost function. The cost function tells us how well weight w and bias b are doing at producing a good prediction for a training item.

If the cost function is high, the network predicted a value that is far from the actual value. For example, the actual value is 6 and the network predicted 2.
If the cost function is low, the network predicted a value that is close to the actual value. For example, the actual value is 6 and the network predicted 5.
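
To make the two cases concrete, here is a minimal sketch that assumes a squared-error cost. The post does not say which cost function is used, so `squared_error` is only an illustrative choice.

```python
def squared_error(actual, predicted):
    """Squared difference between the actual and the predicted value (assumed cost)."""
    return (actual - predicted) ** 2

high_cost = squared_error(6, 2)  # prediction far from the actual value
low_cost = squared_error(6, 5)   # prediction close to the actual value
print(high_cost, low_cost)
```

With this choice, the far-off prediction gets a much larger cost than the close one, which is the behavior described above.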

So the goal is to minimize the cost function. Weight w and bias b determine how close the prediction is to the actual value. Optimization algorithms such as Gradient Descent and Adam update w and b to minimize the cost function.

Back propagation figures out how sensitive the cost function is to w and b, but it does not update w and b. Optimization algorithms like Gradient Descent use this sensitivity to decide how much to change w and b, and then update them.

For example, in a simple 2-layer neural network, back propagation determines that increasing w in layer 1 from 1 to 2 increases the cost function from 3 to 6. This means that if you increase w by one unit, the cost function goes up by 3 units, that is, by 3 times the change in w. In other words, 3 is the derivative of the cost function with respect to w. Similarly, back propagation calculates the derivative with respect to b. Gradient descent uses these derivatives to update w and b in order to minimize the cost function.
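
The split of work between the two can be sketched in a few lines. This is only an illustration: real back propagation gets the derivative analytically through the chain rule, while the hypothetical `finite_difference` helper below just reproduces the arithmetic of the example. The learning rate of 0.1 is also an assumed value.

```python
def finite_difference(cost_before, cost_after, w_before, w_after):
    """Approximate d(cost)/dw from two (w, cost) observations."""
    return (cost_after - cost_before) / (w_after - w_before)

def gradient_descent_step(param, grad, learning_rate):
    """Move the parameter against its gradient to reduce the cost."""
    return param - learning_rate * grad

# Sensitivity: w goes from 1 to 2, cost goes from 3 to 6.
dw = finite_difference(cost_before=3, cost_after=6, w_before=1, w_after=2)

# Update: the optimizer decides how far to move w based on that sensitivity.
w_new = gradient_descent_step(param=2.0, grad=dw, learning_rate=0.1)
print(dw, w_new)
```

Because the derivative is positive, the update lowers w, which is the direction that reduces the cost.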

Neural Networks, Weights and Bias, Forward Propagation

A neural network learns relationships between inputs and the corresponding outputs from sample data. It then applies what it has learned to new inputs to predict the best possible outputs.

Neural Network

Weights and bias

A Neural network consists of neurons and one or more layers. As in the pic above, weights (w) connect neurons from one layer to the next and represent the importance of a given input to its respective output.

For example, how important are the following inputs to decide where to go for a vacation?

1. Price of tickets (w=1)
2. Price of hotels (w=1)
3. Weather (w=0)

Higher weight means more importance.

While weights represent the importance, bias is used to fine-tune the relationship between inputs and outputs.

A neuron computes the weighted sum of its inputs and adds the bias to get z, then passes z to the activation function to get a single output a. A neural network learns by repeating three steps: doing forward propagation to collect these outputs from each neuron, getting the loss, and updating w and b to decrease the loss (backward propagation).


Forward Propagation

A Neural Network (NN) takes an input x from the input layer, multiplies it by its respective weight, adds the bias, and passes the resulting value to an activation function. The output of the activation function is then passed to the next layer as an input.

z = w * x + b, and z is then passed to the activation function: a = sigmoid(z)
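
Here is a minimal sketch of that computation for a single neuron with a single input. The values of x, w and b are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b):
    """Forward pass of one neuron: weighted input plus bias, then activation."""
    z = w * x + b
    a = sigmoid(z)
    return z, a

# Hypothetical values chosen so that z comes out as exactly 0.
z, a = neuron_forward(x=2.0, w=0.5, b=-1.0)
print(z, a)
```

In a real network, a would be passed on as the input x of the neurons in the next layer.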

At the end of the forward propagation, the output layer results in a predicted value, then we compare the predicted value with the actual value to figure out the difference, and update w and b to decrease the difference. This process is repeated multiple times to get to the prediction we like.

In the next posts, we will discuss activation functions, cost functions, and backward propagation.

What is L2 Regularization and how does it work in Neural Networks

If a model is doing great on a training set, but not on a test set, you can use regularization.  For example, if training set error is 1%, but test set error is 11%, we may be dealing with an overfitting or high variance issue.

There are 2 ways of dealing with overfitting: get more data to train on, or try a regularization technique. When data is hard to come by or expensive, you can try regularization. L2 is the most commonly used regularization. It adds an extra term to the loss function, so that training minimizes both the loss and the complexity of the model.

L2 regularization defines regularization term as the sum of the squares of the feature weights, which amplifies the impact of outlier weights that are too big.  For example, consider the following weights:

w1 = 0.3, w2 = 0.1, w3 = 6, which results in 0.09 + 0.01 + 36 = 36.1, after squaring each weight. In this regularization term, just one weight, w3, contributes most of the complexity.
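
The same arithmetic as a short sketch, using the three example weights above:

```python
weights = [0.3, 0.1, 6.0]

# Square each weight and sum the squares to get the L2 regularization term.
squares = [w ** 2 for w in weights]
l2_term = sum(squares)

# Fraction of the term that comes from w3 alone.
share_w3 = squares[2] / l2_term
print(l2_term, share_w3)
```

Almost the entire term comes from w3, which is why squaring is said to amplify outlier weights.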

Put simply, L2 regularization discourages this by penalizing weights that are too large.

You can fine-tune the model by multiplying the regularization term by the regularization rate. By increasing the regularization rate, you encourage weights to go towards 0, thus making your model simpler. By decreasing the regularization rate, you allow the weights to grow bigger, thus making your model more complex.
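
A rough sketch of how the rate acts on the weights. It assumes the penalty is simply rate times the sum of squared weights; some formulations divide by 2 or by the number of training examples. The hypothetical `l2_only_step` helper ignores the data loss entirely, to isolate the pull of the penalty.

```python
def l2_penalty(weights, rate):
    """Regularization term scaled by the regularization rate."""
    return rate * sum(w ** 2 for w in weights)

def l2_only_step(w, rate, learning_rate):
    """One gradient descent step on the penalty alone: d(rate * w^2)/dw = 2 * rate * w."""
    return w - learning_rate * 2 * rate * w

weights = [0.3, 0.1, 6.0]
low_rate_penalty = l2_penalty(weights, rate=0.01)
high_rate_penalty = l2_penalty(weights, rate=0.1)

# With a higher rate, the large weight is pulled towards 0 more strongly.
w3_after = l2_only_step(6.0, rate=0.5, learning_rate=0.1)
print(low_rate_penalty, high_rate_penalty, w3_after)
```

A rate of 0 would leave the weight untouched, which corresponds to no regularization at all.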

How Regularization reduces overfitting

Increasing the regularization rate makes the whole network simpler by making weights smaller, which reduces the impact of a lot of hidden units. With smaller weights, z stays close to 0, where an activation function such as tanh is nearly linear, so each layer behaves almost as if it were linear. This is how L2 regularization reduces overfitting.

Look at Keras L2 Regularization



Build your own Deep Learning Computer

In this post, I am going to share how I built my own Deep Learning Computer and show you how you can build a Deep Learning Computer for yourself.

I am going to talk about various parts to get and how to make sure they are all compatible with each other.

Finally, I am going to talk about how to get the parts and prices.


GPU

When it comes to the GPU, there is usually only one option, which is Nvidia, so the choice is easy. You want to get either the GTX 1080 Ti or the RTX 2080; my personal preference is the 2080, which is the latest model. Oddly, the older 1080 Ti seems to be more expensive than the newer 2080.

This is one common issue I found for most of the components. Models one version behind the latest ones are apparently more expensive. It is likely because manufacturers stopped producing them.

To me, it didn’t make any sense to pay more than $1,000 for a 1080 Ti when I could get an RTX 2080, which is a newer model, for $800.

If you want to spend money, you can get RTX 2080Ti or Titan for better performance.


CPU

Here is where you have to be a little careful. Each GPU needs 16 PCIe lanes to work optimally, so if you are using one GPU, make sure your CPU has at least 16 PCIe lanes, and preferably more to leave room for other components.

If you are going to go with two GPUs, then you need at least 32 PCIe lanes, just for GPUs.

A lot of Intel processors have 16 PCIe lanes, except the X series and Xeon. I was actually looking for the i9-7900X, which has 44 PCIe lanes, but I ended up getting the i9-7940X because I could find it cheaper. This again is odd, since the newer model costs less. This is the case with both GPU and CPU.

You can also go with an AMD processor, like Threadripper, which is very good. But keep in mind that the Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is optimized specifically for Intel architecture.


Motherboard

Any motherboard with the X299 chipset should be compatible with Intel X-series processors. I looked at 3 motherboards from ASUS: X299 Mark1, X299 Mark2 and Strix X299 E gaming. I ended up getting the Strix X299 E gaming, only because it comes with WiFi.

Whatever you get, make sure it has an LGA 2066 socket to be compatible with the X-series CPU, supports 44 PCIe lanes, and supports SLI.  SLI is important for multiple GPUs to work together.  See the above link for more details on SLI.


Storage

I used 2-tier storage: an SSD for the operating system and programs, and a hard drive for data. For the SSD, get an NVMe PCIe M.2 drive; NVMe gives very fast speeds compared to SATA. I went with the Samsung 970 EVO M.2 NVMe.

For HD, I chose Seagate 3TB Barracuda SATA 6Gb/s 64MB Cache.


Memory

When it comes to memory, there are 2 things to keep in mind: one is clock speed and the other is CAS latency. The lower the latency, the better. I got two 16 GB C15 (CAS latency 15) modules with a 3000 MHz clock speed.

Power Unit

EVGA Supernova 850 G3, 80 Plus Gold 850W, Fully Modular, Eco Mode with New HDB Fan.


Case

Phanteks ENTHOO EVOLV Mid Tower ATX.  This is a very good looking case for the price.  It comes in 3 different colors: black, gray, and silver. If you want a bigger case, you can go with a full tower.  In my opinion, Mid Tower is more than enough.


Cooling

If you are planning on overclocking, I recommend investing in cooling. When it comes to cooling, you have 3 main options: air cooling, AIO (all-in-one) liquid cooling, or custom liquid cooling. I would recommend going with either air cooling or AIO for ease of installation and maintenance. I ended up getting the Corsair H150i.

Whichever cooling system you get, make sure it is compatible with your case.

Here is the final build

Final Thoughts

For each part, I would recommend checking multiple sites.  I checked 3 sites for my parts, and www.b&

Sometimes prices are vastly different from site to site. For my CPU, I noticed a $200 difference between sites. Also, check for deals; some sites bundle multiple parts together for a better price.

Make sure all the parts are compatible with each other.  You can use and for this issue. I used PCPartPicker extensively; it has great forums and a completed builds section with various configurations.

Check out my completed build in PC parts picker

Finally, here are the parts and configuration for my build.

Happy building your own Deep Learning Computer!

Logistic Regression with Keras

Logistic regression in machine learning is an algorithm for binary classification; this means the output is one of two choices, like true/false, 0/1, spam/no-spam, male/female, etc. Though the name says regression, it is actually a classification algorithm, not a regression algorithm.

To turn its raw score into a probability, from which one of the two outputs is chosen, it generally uses the sigmoid function, which has a range of 0 to 1. In other words, it is a linear algorithm with a sigmoid function applied to the output.
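
To show the "linear plus sigmoid" idea on its own, here is a minimal NumPy sketch trained with plain gradient descent. It is not the Keras model from this post: the tiny one-feature dataset, the learning rate, the number of epochs and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, epochs=2000):
    """Fit w and b by gradient descent on the average log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)   # linear score passed through the sigmoid
        error = p - y            # gradient of the log loss w.r.t. the score
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * error.mean()
    return w, b

def predict(X, w, b, threshold=0.5):
    """Turn probabilities into 0/1 class labels."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Hypothetical, linearly separable toy data with one feature.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train_logistic_regression(X, y)
predictions = predict(X, w, b)
print(w, b, predictions)
```

The probability comes from the sigmoid; the final 0/1 answer comes from comparing that probability with the threshold.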

For demonstration purposes, we will use the Bank Note (banknote authentication) dataset from the UCI Machine Learning Repository.

The first 4 columns in the data set are the X input features:

  1. Variance of the wavelet-transformed image (continuous);
  2. Skewness of the wavelet-transformed image (continuous);
  3. Kurtosis of the wavelet-transformed image (continuous), spelled "curtosis" in the dataset; and
  4. Entropy of the image (continuous).

The last column is Y, which says whether a given note is authentic (0) or forged (1).

In order to implement this in Keras, we follow the steps below:

  1. Create X input and Y output;
  2. Create a Sequential model;
  3. Add the input layer and a hidden layer with the number of neurons, number of input variables and activation function;
  4. Add the output layer with the sigmoid function;
  5. Compile the model; and
  6. Train the model.

Here is the code. You can play with different hyperparameters to increase accuracy.

Here is the sample output.

100/919 [==>………………………] – ETA: 0s – loss: 0.5984 – acc: 0.7700
919/919 [==============================] – 0s 14us/step – loss: 0.5483 – acc: 0.7889 – val_loss: 0.7256 – val_acc: 0.4768
Epoch 11/15

100/919 [==>………………………] – ETA: 0s – loss: 0.5178 – acc: 0.7900
919/919 [==============================] – 0s 14us/step – loss: 0.4934 – acc: 0.8118 – val_loss: 0.7121 – val_acc: 0.4768
Epoch 12/15


Supervised vs Unsupervised learning

In machine learning, supervised learning is used when you already know what the output is, for a given input. So, you already know that the output is Y when the input is X.  Given this, the goal of supervised learning is to learn a function that gives you the relationship between X and Y.

Unsupervised learning is used when you do NOT know what the output is, for a given input. So, you do not know what Y is, for a given X input. The goal here is to infer the best relationships and pattern structures in the data.

Supervised learning mainly falls into the following categories:

  1. Classification: it categorizes inputs into different classes. Examples include:
    1. Categorizing loan applicants into high, medium, and low-risk borrowers.
    2. Categorizing emails as spam or not.
  2. Regression: it outputs numerical data like size, quantity, age, etc. Examples include:
    1. Predicting the age of a person.
    2. Predicting the price of a house.

Algorithms: Linear regression, Logistic regression, Neural networks etc.
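
As a small illustration of the supervised case, the sketch below fits a straight line to labeled (X, Y) pairs with NumPy's `polyfit` and then predicts Y for a new X. The numbers are made up and follow y = 2x + 1 exactly, so the learned function is easy to check.

```python
import numpy as np

# Labeled training data: for every input x we already know the output y.
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Learn the relationship between X and Y as a degree-1 polynomial (a line).
slope, intercept = np.polyfit(x_train, y_train, 1)

# Apply the learned function to an input that was not in the training data.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```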

Unsupervised learning mainly falls into the following categories:

  1. Clustering: it groups inputs based on similarity. Examples include:
    1. Customer segmentation based on location, age, etc.
    2. Identifying high-crime neighborhoods.
  2. Dimensionality reduction: it removes redundant, unnecessary data from a dataset and keeps the parts of the data that really matter. It is similar to data compression. Examples include:
    1. Reducing dimensionality (columns) in computer vision training.
    2. Reducing datasets containing customer social media engagement with brands from multiple devices.

Algorithms: Hierarchical clustering, k-Means clustering, PCA, SVD etc.
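
For the unsupervised case, here is a minimal one-dimensional k-means sketch in NumPy. There are no labels; the algorithm only sees the inputs and groups them by similarity. The ages, the two starting centroids and the iteration count are made-up assumptions, and `k_means_1d` is a hypothetical helper rather than a library function.

```python
import numpy as np

def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around k centroids by alternating assign and update steps."""
    values = np.asarray(values, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        distances = np.abs(values[:, None] - centroids[None, :])
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the values assigned to it.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    return labels, centroids

# Unlabeled data: customer ages with no Y column.
ages = [21, 23, 25, 61, 63, 65]
labels, centroids = k_means_1d(ages, centroids=[20.0, 70.0])
print(labels, centroids)
```

The two groups that come out were never given to the algorithm; it inferred them from the structure of the data.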