Closed-form expression for a neural network

In this blog post, I will derive a closed-form expression for a simple feed-forward neural network.

Let our network be defined as follows:

• batch size $1$
• input layer with three neurons (i.e. $\mathbf{a}$ contains three features)
• one hidden layer ($l_1$) with ReLU activation ($\sigma$) and two neurons
• output layer ($l_2$) with one neuron, identity (“linear”) activation and one output $\hat{y}$
• squared-error loss function $\mathcal{L}$
• no biases

We receive a vector of real numbers from the user, and our task is to predict a single output value (regression).

The next step is to define the required functions mathematically:

Note that I included the weights $\mathbf{W_1}$ and $\mathbf{w_2}$ as parameters. This is necessary because we want to take the derivative with respect to these weights.
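In consistent notation, the definitions take roughly the following form (a sketch based on the architecture above; the original figures may use slightly different symbols):

```latex
\begin{aligned}
l_1(\mathbf{a}; \mathbf{W_1}) &= \sigma(\mathbf{W_1}\mathbf{a}) = \mathbf{b},
  &\sigma(x) &= \max(0, x) \text{ applied element-wise},\\
l_2(\mathbf{b}; \mathbf{w_2}) &= \mathbf{w_2}^T \mathbf{b} = \hat{y},\\
\mathcal{L}(\hat{y}, y) &= (y - \hat{y})^2,
\end{aligned}
```

with $\mathbf{a} \in \mathbb{R}^3$, $\mathbf{W_1} \in \mathbb{R}^{2 \times 3}$ and $\mathbf{w_2} \in \mathbb{R}^2$.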

Forward propagation

The following diagram shows how we have to compose the functions to produce the prediction $\hat{y}$.

All parameters drawn with a dashed line are constants and cannot be changed during forward propagation.

In the derivation, I use partial-composition notation to fix the weights. Alternatively, one could also define new functions that do not take the weights as parameters.
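As a concrete sanity check, the forward pass can be sketched in NumPy; all weight and input values below are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])        # input with three features (illustrative)
W1 = np.array([[0.1, -0.2, 0.3],
               [0.4, 0.5, -0.6]])    # hidden-layer weights, 2x3 (illustrative)
w2 = np.array([0.7, -0.8])           # output-layer weights (illustrative)

def relu(x):
    return np.maximum(0.0, x)

z1 = W1 @ a       # pre-activation of the hidden layer
b = relu(z1)      # hidden activation: sigma applied element-wise
y_hat = w2 @ b    # prediction; identity activation on the output

# For these values: z1 = [0.6, -0.4], b = [0.6, 0.0], y_hat = 0.42
```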

Backpropagation

During backpropagation the weights become parameters of the functions and the inputs become constants. In other words, we cut off all layers before $\mathbf{w_2}$ (for weight 2) or $\mathbf{W_1}$ (for weight 1). Furthermore, we add the loss function $\mathcal{L}$.

Weight 2 (Output layer)

The following diagram shows how the layers were removed for weight 2:

Mathematically, this means we construct the following function $h_2$:

Since we want to minimize $\mathcal{L}$ with respect to the weight $\mathbf{w_2}$, we take the gradient of $h_2$.
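With the squared-error loss, $h_2$ and its gradient can be sketched as:

```latex
h_2(\mathbf{w_2}) = \mathcal{L}\big(l_2(\mathbf{b}; \mathbf{w_2}), y\big)
                  = (y - \mathbf{w_2}^T \mathbf{b})^2,
\qquad
\nabla h_2(\mathbf{w_2}) = -2\,(y - \mathbf{w_2}^T \mathbf{b})\,\mathbf{b}.
```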

Weight 1 (Hidden layer)

The next weights are located at the start of the neural network, which means we don’t have to remove any layers.

We again construct a new function:

This time the derivative is a bit more complicated. We will first flatten the matrix $\mathbf{W_1}$ to create a new function $g$:
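Roughly, $h_1$ and its flattened version $g$ look like this, where $\operatorname{vec}$ stacks the entries of $\mathbf{W_1}$ into a $6 \times 1$ vector (a sketch; the original notation may differ):

```latex
h_1(\mathbf{W_1}) = \mathcal{L}\big(l_2(\sigma(\mathbf{W_1}\mathbf{a}); \mathbf{w_2}), y\big),
\qquad
g(\operatorname{vec}(\mathbf{W_1})) = h_1(\mathbf{W_1}).
```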

Then by the chain rule of multivariable calculus

Note that the derivative of the ReLU function is the Heaviside step function. Since we apply ReLU element-wise, its Jacobian is a diagonal matrix, so we only need the element-wise derivative; we never have to form the full Jacobian.

Next, multiply this product by $-\mathbf{w_2}^T$ and reshape the resulting $6 \times 1$ vector into a $2 \times 3$ matrix to obtain the change matrix for $\mathbf{W_1}$.

An alternative way to calculate the derivative is the following:

We can pull out $\mathbf{W_1}$ entry by entry, because each weight appears only linearly, i.e. $\frac{d}{dW_{11}}W_{11}a_1 = a_{1}$. Then apply the normal differentiation rules.
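Either route gives the same per-entry result, and it can be checked numerically. A sketch with made-up values (`np.heaviside` is the element-wise step function):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])        # illustrative input
W1 = np.array([[0.1, -0.2, 0.3],
               [0.4, 0.5, -0.6]])    # illustrative 2x3 weights
w2 = np.array([0.7, -0.8])           # illustrative output weights
y = 1.0                              # illustrative target

def loss(W1, w2):
    b = np.maximum(0.0, W1 @ a)
    return (y - w2 @ b) ** 2

b = np.maximum(0.0, W1 @ a)
y_hat = w2 @ b

# Closed-form gradients: dL/dW1_ij = -2 (y - y_hat) * w2_i * H(z1_i) * a_j
grad_w2 = -2.0 * (y - y_hat) * b
grad_W1 = -2.0 * (y - y_hat) * np.outer(w2 * np.heaviside(W1 @ a, 0.0), a)

# Finite-difference check on a single entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss(W1p, w2) - loss(W1, w2)) / eps
assert np.isclose(numeric, grad_W1[0, 0], atol=1e-4)  # analytic and numeric agree
```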

Summary

Forward propagation is given by:

And backpropagation is given by:

where $\gamma$ is the learning rate. Note that both expressions can still be simplified (removing the Jacobian, computing values like $\mathbf{b}$ once during forward propagation, etc.).
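Putting everything together, one gradient-descent step might look like the following sketch ($\gamma$ and all values are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])        # illustrative input
W1 = np.array([[0.1, -0.2, 0.3],
               [0.4, 0.5, -0.6]])    # illustrative 2x3 weights
w2 = np.array([0.7, -0.8])           # illustrative output weights
y = 1.0                              # illustrative target
gamma = 0.05                         # learning rate (illustrative)

def forward(W1, w2):
    b = np.maximum(0.0, W1 @ a)      # hidden activation
    return b, w2 @ b                 # b and the prediction y_hat

b, y_hat = forward(W1, w2)
loss_before = (y - y_hat) ** 2

# Closed-form gradients from the derivation above
grad_w2 = -2.0 * (y - y_hat) * b
grad_W1 = -2.0 * (y - y_hat) * np.outer(w2 * np.heaviside(W1 @ a, 0.0), a)

# Simultaneous update of both weight tensors
w2 = w2 - gamma * grad_w2
W1 = W1 - gamma * grad_W1

_, y_hat_new = forward(W1, w2)
loss_after = (y - y_hat_new) ** 2    # smaller than loss_before for this step
```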
