In this lecture, we will discuss matrix calculus and linear regression. Why is matrix calculus important? 90% of the proofs we will discuss in this class will have the following structure:

  1. We have some function $f(w)$ and we want to find a minimum of it with respect to $w$;
  2. We calculate the gradient of that function with respect to $w$, which we write as $\nabla_w f(w)$;
  3. We solve the equation $\nabla_w f(w^*)=0$ for $w^*$ and hopefully find a solution.

Technically, we should verify that the second derivative is positive (or, in the vector case, that the Hessian is positive semi-definite) at the minimum. However, in this course we will only apply this recipe to functions that have a single global minimum, so this check isn't necessary.
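
To make the recipe concrete, here is a minimal numerical sketch. It is my own illustration, not part of the lecture: I assume the simple quadratic $f(w) = \tfrac{1}{2} w^\top A w - b^\top w$ with $A$ symmetric positive definite, whose gradient is $Aw - b$, so setting the gradient to zero means solving $Aw^* = b$.

```python
import numpy as np

# Hypothetical example (not from the lecture): f(w) = 0.5 * w^T A w - b^T w
# with A symmetric positive definite, so f has a single global minimum.
rng = np.random.default_rng(0)
D = 4
M = rng.normal(size=(D, D))
A = M @ M.T + D * np.eye(D)   # symmetric positive definite
b = rng.normal(size=D)

# Steps 2-3 of the recipe: grad f(w) = A w - b, so grad f(w*) = 0 gives A w* = b.
w_star = np.linalg.solve(A, b)

# Sanity check: the gradient at w* is (numerically) zero.
print(np.allclose(A @ w_star - b, 0.0))  # True
```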

1. Matrix calculus ("hacker's version")

Today we will focus on taking gradients of expressions, and we will present a very simple way of doing matrix calculus. There are better ways of doing matrix calculus that are more general and elegant; however, they are harder to understand. The method we will discuss is clumsier but much easier to use. It is a hacker's version of matrix calculus.

Suppose we have some function $f(x)$ and we'd like to calculate $\nabla_x f(x)$ — the gradient of $f$ with respect to $x$. Let's forget machine learning notation for now: $x$ is just some variable (it can be a scalar, a vector, or a matrix) and $f$ is just some function. So, here is the approach:

  1. Break down all matrix expressions in $f(x)$ into sums involving scalars;
  2. Do normal, scalar, derivative operations;
  3. Reassemble the final result back into a matrix expression.

Example 1. Let's consider the function $f(x)=x^\top A y$, where vector $x\in \mathbb{R}^D$, vector $y\in \mathbb{R}^D$ and matrix $A \in \mathbb{R}^{D \times D}$.

Step one. We take $f(x)$ and turn it into scalars using the definition of the bilinear form $x^\top A y$:

$$
f(x) = \sum_{i=1}^D \sum_{j=1}^D x_i y_j A_{ij}.
$$

You might ask why I changed the order of the factors: it doesn't matter here, since we are writing scalar expressions instead of vector ones. We will also omit the summation bounds and write $\sum_i$ instead of $\sum_{i=1}^D$.
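
As a quick numerical check of Step one, the scalar double sum can be compared against the matrix expression. The random inputs and variable names below are my own illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 5
x, y = rng.normal(size=D), rng.normal(size=D)
A = rng.normal(size=(D, D))

# Step one: the bilinear form x^T A y written as a scalar double sum.
scalar_sum = sum(x[i] * y[j] * A[i, j] for i in range(D) for j in range(D))

# It agrees with the matrix expression.
print(np.allclose(scalar_sum, x @ A @ y))  # True
```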

Step two. Now we will look at $\frac{\partial f(x)}{\partial x_k}$ — the partial derivative of $f(x)$ with respect to some arbitrary component $x_k$, where $1 \le k \le D$. Using the expression from Step one, we have

$$
\begin{aligned}
\frac{\partial f(x)}{\partial x_k} &= \frac{\partial}{\partial x_k}\sum_i\sum_j x_i y_j A_{ij}\\
&= \sum_i\sum_j y_j A_{ij}\frac{\partial x_i}{\partial x_k}\\
&= \sum_i\sum_j y_j A_{ij}\, I[i=k]\\
&= \sum_j y_j A_{kj}\\
&= A_{k\bullet}\, y\\
&= (Ay)_k.
\end{aligned}
$$

To understand the meaning of $\frac{\partial x_i}{\partial x_k}$ in the second line, just note that $\frac{\partial x_2}{\partial x_1}=0$ while $\frac{\partial x_2}{\partial x_2}=1$. We represent this fact with an "indicator function":

$$ I[i=k] = {\begin{cases}1~&{\text{ if }}~i=k,\\0~&{\text{ otherwise }}.\end{cases}} $$

You can think of $A_{k \bullet}$ as a $1\times D$ matrix corresponding to the $k$-th row of $A$, which gives you a scalar value when multiplied by $y$. Finally, $(Ay)_k$ represents the $k$-th component of the vector $Ay$, which is the same quantity as $A_{k \bullet}y$.
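
The identity $A_{k\bullet}\, y = (Ay)_k$ is easy to verify numerically; the small sketch below uses arbitrary random values and index $k$ of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
D, k = 5, 3
y = rng.normal(size=D)
A = rng.normal(size=(D, D))

row_times_y = A[k] @ y        # A_{k.} y : k-th row of A times y
kth_component = (A @ y)[k]    # (Ay)_k  : k-th component of the vector A y
print(np.allclose(row_times_y, kth_component))  # True
```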

Step three. Now we can write the gradient of $f(x)$ in terms of the partial derivatives we computed in Step two and reassemble the result:

$$
\nabla_x f(x)=\begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x)}{\partial x_D} \end{bmatrix}
=\begin{bmatrix} (Ay)_1 \\ \vdots \\ (Ay)_D \end{bmatrix}
= A y.
$$

The key insight here is that $(Ay)_1, (Ay)_2, \dots, (Ay)_D$ are just the different components of the vector $Ay$.
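
Finally, the result of Example 1, $\nabla_x (x^\top A y) = Ay$, can be sanity-checked against a finite-difference approximation of the gradient. This is a sketch with assumed random inputs, not part of the lecture itself.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 5
x, y = rng.normal(size=D), rng.normal(size=D)
A = rng.normal(size=(D, D))

def f(x):
    return x @ A @ y  # f(x) = x^T A y

# Central finite-difference approximation of the gradient at x.
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(D)  # e ranges over the standard basis vectors
])

# Matches the closed-form gradient A y.
print(np.allclose(grad_fd, A @ y, atol=1e-5))  # True
```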