Recap: Forward Propagation and Backpropagation

In the previous lecture, we discussed how complex functions can be represented using expression graphs. Once you have defined an expression graph, you can evaluate it for a given input using forward propagation, and you can compute the gradient of the function with respect to any of its inputs using backpropagation, of which we discussed two variants.

Recall that each node in an expression graph is defined in terms of some function, $g_i$, and its parents, $\text{Pa}(i)$. The output of the node is $x_i = g_i(x_{\text{Pa}(i)})$, where $x_{\text{Pa}(i)}$ is a vector containing the outputs of the parent nodes. The first $n$ nodes are input nodes: they have no parents and are simply assigned some value. The output of a graph with $N$ nodes is $x_N$. For any expression graph, we can compute $x_N$ using the forward propagation algorithm:


Forward Propagation

For $i = n+1, n+2, \dots, N$:

$x_{i} \leftarrow g_{i}(x_{\text{Pa}(i)})$

Return $x_{N}$
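
To make this concrete, here is a minimal Python sketch of forward propagation. The graph representation (a list of `(g_i, parents)` pairs with nodes numbered from 1) and the toy example $f(x_1, x_2) = (x_1 + x_2)\,x_2$ are illustrative choices, not part of the lecture.

```python
# Minimal sketch of forward propagation over an expression graph.
# Nodes are numbered 1..N; nodes 1..n are inputs, the rest are defined
# by a function g_i and a list of parent indices Pa(i).

def forward(inputs, nodes):
    """inputs: values of nodes 1..n; nodes: list of (g_i, parents) for i = n+1..N."""
    x = dict(enumerate(inputs, start=1))       # x_1, ..., x_n
    n = len(inputs)
    for i, (g, parents) in enumerate(nodes, start=n + 1):
        x[i] = g(*[x[p] for p in parents])     # x_i <- g_i(x_Pa(i))
    return x                                   # x[N] is the graph's output

# Example graph: x_3 = x_1 + x_2, x_4 = x_3 * x_2, so f(x_1, x_2) = (x_1 + x_2) * x_2.
nodes = [
    (lambda a, b: a + b, [1, 2]),   # g_3
    (lambda a, b: a * b, [3, 2]),   # g_4
]
x = forward([3.0, 4.0], nodes)
print(x[4])   # (3 + 4) * 4 = 28.0
```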


Let $f$ be the function represented by the expression graph. For any $x_i$, we can compute $\frac{df}{dx_i}$ using backpropagation. In both variants, we compute partial derivatives with respect to the internal nodes of the graph and use the chain rule to "propagate" these derivatives backwards through the graph. In the first variant, we "pull" the partial derivatives from all of the children of a given node onto the parent.


Backpropagation v1

(Do forward propagation.)

$\frac{df}{dx_N} \leftarrow 1$

For $i = N-1, N-2, \dots, 1$:

$\frac{df}{dx_i} \leftarrow \sum_{j \in \text{Ch}(i)} \frac{df}{dx_j} \frac{\partial g_j}{\partial x_i}$

Return $\frac{df}{dx_1}, \frac{df}{dx_2}, \dots, \frac{df}{dx_N}$
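
Below is a minimal sketch of backpropagation v1 in the same toy representation as the forward-propagation sketch above, extended so that each non-input node also supplies its partial derivatives $\partial g_j / \partial x_i$ as a hand-written function. That `(g_i, parents, partials)` format and the helper name `backprop_v1` are illustrative assumptions, not anything from the lecture.

```python
# Sketch of backpropagation v1 ("pull" from children).
# Each non-input node is (g_i, parents, partials), where partials returns
# [dg_i/dx_p for each parent p], evaluated at the parents' values.

def backprop_v1(inputs, nodes):
    n, N = len(inputs), len(inputs) + len(nodes)
    x = dict(enumerate(inputs, start=1))
    spec = {i: node for i, node in enumerate(nodes, start=n + 1)}

    # Forward propagation.
    for i in range(n + 1, N + 1):
        g, parents, _ = spec[i]
        x[i] = g(*[x[p] for p in parents])

    # Recover each node's children Ch(i) from the parent lists.
    children = {i: [] for i in range(1, N + 1)}
    for j in range(n + 1, N + 1):
        for p in spec[j][1]:
            children[p].append(j)

    # Backward pass: df/dx_i = sum over children j of df/dx_j * dg_j/dx_i.
    d = {N: 1.0}
    for i in range(N - 1, 0, -1):
        total = 0.0
        for j in children[i]:
            _, parents_j, partials_j = spec[j]
            dgj_dxi = partials_j(*[x[p] for p in parents_j])[parents_j.index(i)]
            total += d[j] * dgj_dxi
        d[i] = total
    return d

# Same example graph as before: f(x_1, x_2) = (x_1 + x_2) * x_2.
nodes = [
    (lambda a, b: a + b, [1, 2], lambda a, b: [1.0, 1.0]),   # g_3 and its partials
    (lambda a, b: a * b, [3, 2], lambda a, b: [b, a]),       # g_4 and its partials
]
d = backprop_v1([3.0, 4.0], nodes)
print(d[1], d[2])   # df/dx_1 = 4.0, df/dx_2 = 11.0
```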


In the second version, we "push" the partial derivatives from the children onto the parents.


Backpropagation v2

(Do forward propagation.)

Initialize $\frac{df}{dx_N} \leftarrow 1$ and $\frac{df}{dx_i} \leftarrow 0$ for $i = 1, \dots, N-1$

For $j = N, N-1, \dots, n+1$:

For all $i \in \text{Pa}(j)$:

$\frac{df}{dx_i} \leftarrow \frac{df}{dx_i} + \frac{df}{dx_j} \frac{\partial g_j}{\partial x_i}$

Return $\frac{df}{dx_1}, \frac{df}{dx_2}, \dots, \frac{df}{dx_N}$
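
Here is a matching sketch of backpropagation v2, again assuming the same illustrative `(g_i, parents, partials)` node format. Notice that this version never needs the children $\text{Ch}(i)$; it only walks each node's parent list.

```python
# Sketch of backpropagation v2 ("push" onto parents), using the same toy
# node format as above: (g_i, parents, partials).

def backprop_v2(inputs, nodes):
    n, N = len(inputs), len(inputs) + len(nodes)
    x = dict(enumerate(inputs, start=1))
    spec = {i: node for i, node in enumerate(nodes, start=n + 1)}

    # Forward propagation.
    for i in range(n + 1, N + 1):
        g, parents, _ = spec[i]
        x[i] = g(*[x[p] for p in parents])

    # Initialize df/dx_N = 1 and all other derivatives to 0.
    d = {i: 0.0 for i in range(1, N + 1)}
    d[N] = 1.0

    # Backward pass: push df/dx_j * dg_j/dx_i onto every parent i of node j.
    for j in range(N, n, -1):
        _, parents, partials = spec[j]
        dg = partials(*[x[p] for p in parents])
        for k, i in enumerate(parents):
            d[i] += d[j] * dg[k]
    return d

# Same example graph: f(x_1, x_2) = (x_1 + x_2) * x_2.
nodes = [
    (lambda a, b: a + b, [1, 2], lambda a, b: [1.0, 1.0]),
    (lambda a, b: a * b, [3, 2], lambda a, b: [b, a]),
]
d = backprop_v2([3.0, 4.0], nodes)
print(d[1], d[2])   # df/dx_1 = 4.0, df/dx_2 = 11.0
```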


Both variants always produce the same output; they differ only in how the chain-rule terms are accumulated. Version 1 computes each derivative in one step by pulling contributions from a node's children, which requires knowing $\text{Ch}(i)$ for every node, while version 2 accumulates contributions incrementally by pushing onto each node's parents, which only requires the parent lists that already define the graph. Which one people prefer is largely a matter of taste and implementation convenience.

Autodiff for Machine Learning

In the previous lecture, we started to talk about how autodiff can be useful for machine learning. The key idea is to represent the loss function as an expression graph. Consider the loss function

$$ \mathcal{L}(y, f(x)) = (y - f(x))^2, $$

where $f(x) = W^Tx$. The full loss computation can be represented using the following expression graph.

Figure: Expression graph for the squared-error loss. (https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cf8cc23d-f41b-45b5-a579-eed114c24278/Expression_Graph.svg)
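
As a concrete check of this example, here is a small Python sketch that evaluates the loss graph node by node and then backpropagates through it by hand to get $\frac{\partial \mathcal{L}}{\partial W}$. The intermediate node names (`z1`, `z2`) and the numerical values are illustrative, and the result is compared against the closed-form gradient $-2(y - W^T x)\,x$.

```python
import numpy as np

# Sketch of the loss L(y, f(x)) = (y - W^T x)^2 as a small expression graph,
# with the gradient with respect to W propagated backwards by the chain rule.

x = np.array([1.0, 2.0, 3.0])     # input features
y = 2.0                           # target
W = np.array([0.5, -1.0, 0.25])   # weights

# Forward propagation through the graph.
z1 = W @ x            # node: f(x) = W^T x
z2 = y - z1           # node: residual y - f(x)
L = z2 ** 2           # node: squared error

# Backpropagation: push dL/d(node) from the output back to W.
dL_dz2 = 2.0 * z2     # dL/dz2
dL_dz1 = -dL_dz2      # dz2/dz1 = -1
dL_dW = dL_dz1 * x    # dz1/dW = x

# Check against the closed-form gradient -2 (y - W^T x) x.
print(np.allclose(dL_dW, -2.0 * (y - W @ x) * x))   # True
```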