In this lecture, we will continue to talk about automatic differentiation (autodiff), the key technology behind neural networks. The core idea of neural networks is that sometimes the function you want to model is very complex, so you need to choose among a very complex set of functions. This is possible if you have a huge amount of data, and the core mechanism by which we choose among this set of functions is gradient descent. The core algorithmic problem then becomes computing the gradient. The question is therefore: if you have a complex function, how can you efficiently compute the gradient of the loss with respect to that function's many parameters?

Expression Graphs

The core concept behind autodiff is expression graphs. Last time, we started to look at the example function

$f(x)=\exp(\exp(x)+\exp(x)^2) + \sin(\exp(x)+\exp(x)^2)$.

It is possible to compute the derivative of this function by hand, and in fact it's not that bad, with the result being

$\frac{\partial f}{\partial x}= \exp(\exp(x)+\exp(x)^2)\,(\exp(x)+2\exp(x)^2)+\cos(\exp(x)+\exp(x)^2)\,(\exp(x)+2\exp(x)^2)$.

However, the derivative is still larger than the original function, and we saw last time that derivatives can sometimes be drastically larger than the functions they come from. Therefore, we need to move beyond the simplistic idea of representing derivatives only as standalone equations.
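As a quick sanity check (a sketch of my own, not part of the lecture), we can compare the hand-derived formula against a finite-difference approximation at an arbitrary point; the names f and df_by_hand and the test point x = 0.3 are just for illustration.

import numpy as np

def f(x):
    # the original function
    return np.exp(np.exp(x) + np.exp(x)**2) + np.sin(np.exp(x) + np.exp(x)**2)

def df_by_hand(x):
    # the hand-derived derivative, written in terms of the shared inner expression c
    c = np.exp(x) + np.exp(x)**2
    dc = np.exp(x) + 2 * np.exp(x)**2
    return np.exp(c) * dc + np.cos(c) * dc

x = 0.3
eps = 1e-6
fd = (f(x + eps) - f(x - eps)) / (2 * eps)  # central finite difference
print(df_by_hand(x), fd)  # the two values should agree to several digits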

The core idea of autodiff is to make better use of intermediate values. To start with, we are just going to represent the function differently, without talking about the derivative. One way we could represent this function is as follows:

$a=\exp(x)$

$b= a^2$

$c=a+b$

$d=\exp(c)$

$e=\sin(c)$

$f=d+e$.

If you were to evaluate each of these lines in order, the final value $f$ would be the same as the value of the original function. Doing this also saves a little computation, since values such as $\exp(x)$ only need to be computed once.
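Here is what evaluating these lines looks like in Python (a small sketch of my own, not from the lecture); the input value x = 0.5 is arbitrary, and f_direct just evaluates the original formula for comparison.

import numpy as np

x = 0.5  # arbitrary input value

a = np.exp(x)
b = a**2
c = a + b
d = np.exp(c)
e = np.sin(c)
f = d + e

# evaluating the original formula directly gives the same number
f_direct = np.exp(np.exp(x) + np.exp(x)**2) + np.sin(np.exp(x) + np.exp(x)**2)
print(f, f_direct)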

When representing functions in this way, it is helpful to draw them as an expression graph. We will draw these graphs with circles representing variables and squares representing functions. For the function above, we have the following.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/20444f16-1e0b-44a9-8b95-c044db6a5769/expression_graph_example.svg

We will see that this type of expression graph representation is also helpful for computing derivatives. The above graph can also be represented programmatically. We're not going to do anything with this code yet, but it will set us up for some code we will use later.

import numpy as np
n = 1 # number of inputs
N = 7 # number of nodes (node 1 is the input x; nodes 2 through 7 are computed)

# functions to produce intermediate values
g = {}
g[2] = np.exp
g[3] = np.square
g[4] = np.sum
g[5] = np.exp
g[6] = np.sin
g[7] = np.sum

# parents for each node
Pa = {}
Pa[2] = [1]
Pa[3] = [2]
Pa[4] = [2,3]
Pa[5] = [4]
Pa[6] = [4]
Pa[7] = [5,6]

# compute children from parents
Ch = {}
for i in range(1, N + 1):
    Ch[i] = []
for i in range(n + 1, N + 1):
    for j in Pa[i]:
        Ch[j].append(i)
print(Ch)
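As a preview (this loop is my own sketch, not code from the lecture), the dictionaries g and Pa already contain enough information to evaluate the function: walk through the nodes in order and apply each node's function to its parents' values. The input value x_input = 0.5 is arbitrary.

# forward evaluation sketch, reusing n, N, g, and Pa from above
x_input = 0.5            # arbitrary input value
value = {1: x_input}     # node 1 holds the input x
for i in range(n + 1, N + 1):
    parent_vals = [value[j] for j in Pa[i]]
    if len(parent_vals) == 1:
        value[i] = g[i](parent_vals[0])  # unary functions: exp, square, sin
    else:
        value[i] = g[i](parent_vals)     # np.sum adds the parent values
print(value[N])  # equals the original f evaluated at x_input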

Nothing exciting so far, but we will be able to build on this and create a very simple autodiff system.