Neural networks are compositions of many functions, and the chain rule of calculus tells us how to take derivatives through such compositions.
If you have the expression:

$$z = f(g(x))$$

that means you have an input x, you apply a function g to x to get y = g(x), and then you apply another function f to y. Eventually, you get z.
Let’s say you would like to differentiate this whole expression with respect to x. The chain rule tells you that the derivative is the derivative of g multiplied by the derivative of f:

$$\frac{dz}{dx} = \frac{dy}{dx}\,\frac{dz}{dy} = g'(x)\,f'(g(x))$$
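As a quick sanity check of the scalar chain rule, here is a small sketch with illustrative choices of f and g (they are not from the text, just convenient examples), comparing the chain-rule derivative against a finite-difference estimate:

```python
import math

# Illustrative choices (assumptions, not from the text): g(x) = x^2, f(y) = sin(y)
def g(x):
    return x ** 2

def f(y):
    return math.sin(y)

def dz_dx(x):
    # chain rule: dz/dx = g'(x) * f'(g(x)) = 2x * cos(x^2)
    return 2 * x * math.cos(g(x))

# central finite-difference check of the chain-rule result
x, eps = 1.3, 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
assert abs(dz_dx(x) - numeric) < 1e-6
```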
The exact same expression applies to multivariate functions; we just have to be a little careful with the order of the factors, because scalar multiplication is commutative, while matrix multiplication is not. The same chain rule holds, except that now dz/dy is a vector (the gradient of f) and dy/dx is a matrix, so the shapes of the factors must line up.
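A small NumPy sketch of the multivariate case, again with illustrative choices of f and g (assumptions, not from the text): g is a linear map and f is a scalar-valued function, so dz/dy is a vector, dy/dx is a matrix, and the order of the product determines the output shape.

```python
import numpy as np

# Illustrative choices (assumptions): y = g(x) = A x maps R^2 -> R^3,
# z = f(y) = sum(y^2) maps R^3 -> R
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

def g(x):
    return A @ x

def f(y):
    return np.sum(y ** 2)

x = np.array([0.5, -1.0])
y = g(x)

dz_dy = 2 * y           # gradient of f: a vector in R^3
dy_dx = A.T             # derivative of y w.r.t. x: a 2x3 matrix
dz_dx = dy_dx @ dz_dy   # order matters: dy/dx times dz/dy gives a vector in R^2

# finite-difference check of the first component
eps = 1e-6
e0 = np.array([1.0, 0.0])
numeric = (f(g(x + eps * e0)) - f(g(x - eps * e0))) / (2 * eps)
assert abs(dz_dx[0] - numeric) < 1e-4
```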
Let’s look at this neural network:
We could use the chain rule to write out, for example, the derivative of the loss with respect to W(2). That’s equal to the derivative of z(2) with respect to W(2) times the derivative of the loss with respect to z(2):

$$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \frac{\partial z^{(2)}}{\partial W^{(2)}}\,\frac{\partial \mathcal{L}}{\partial z^{(2)}}$$
In the same way, we can write out the derivative with respect to W(1), chaining through every quantity that sits between W(1) and the loss:

$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial z^{(1)}}{\partial W^{(1)}}\,\frac{\partial z^{(2)}}{\partial z^{(1)}}\,\frac{\partial \mathcal{L}}{\partial z^{(2)}}$$
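These two derivatives are exactly what a manual backward pass computes. Below is a minimal NumPy sketch of a two-layer network, with hypothetical shapes, a sigmoid activation, and a squared-error loss (all illustrative assumptions, not taken from the figure), with a finite-difference check on one weight:

```python
import numpy as np

# Hypothetical two-layer network (assumed shapes and activation):
# z1 = W1 x, a1 = sigmoid(z1), z2 = W2 a1, loss = 0.5 * ||z2 - target||^2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
target = rng.normal(size=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forward pass, keeping intermediates for the backward pass
z1 = W1 @ x
a1 = sigmoid(z1)
z2 = W2 @ a1
loss = 0.5 * np.sum((z2 - target) ** 2)

# backward pass: apply the chain rule outward from the loss
dL_dz2 = z2 - target                 # derivative of the loss w.r.t. z2
dL_dW2 = np.outer(dL_dz2, a1)        # combines dz2/dW2 with dL/dz2
dL_da1 = W2.T @ dL_dz2               # chain through z2 = W2 a1
dL_dz1 = dL_da1 * a1 * (1 - a1)      # elementwise sigmoid derivative
dL_dW1 = np.outer(dL_dz1, x)         # combines dz1/dW1 with dL/dz1

# finite-difference check on a single entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
loss_p = 0.5 * np.sum((W2 @ sigmoid(W1p @ x) - target) ** 2)
loss_m = 0.5 * np.sum((W2 @ sigmoid(W1m @ x) - target) ** 2)
assert abs(dL_dW1[0, 0] - (loss_p - loss_m) / (2 * eps)) < 1e-4
```

The backward pass reuses the intermediates a1 and z2 from the forward pass, which is why backpropagation stores them rather than recomputing each derivative from scratch.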
Feel free to drop me a message.
All images taken from Levine, CS182, 2021.