Let f : T → ℝ be a differentiable function, where T is the space of tensors. The gradient of f at A ∈ T is the tensor ∇f(A) ∈ T defined by:
∀ dA ∈ T, Df(A)[dA] = ⟨∇f(A), dA⟩
This reads as the variation of f evaluated at A for any small variation dA.
For scalars, this is the expected definition:
dA = h ∈ ℝ and Df(A)[h] = ⟨f′(A), h⟩ = h·f′(A) ≈ f(A+h) − f(A) for h small enough (so ∇f(A) = f′(A)).
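As a quick sanity check of this scalar case, here is a small numerical illustration (the choice f(a) = a³ is arbitrary, not from the text): the first-order variation h·f′(A) matches f(A+h) − f(A) up to O(h²).

```python
def f(a):
    return a ** 3          # example scalar function, f'(a) = 3 a^2

A, h = 1.5, 1e-6
lhs = f(A + h) - f(A)      # variation of f at A for a small variation h
rhs = h * 3 * A ** 2       # <f'(A), h> = h * f'(A)
print(lhs, rhs)            # both ~ 6.75e-06, equal up to O(h^2)
```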
When we introduced the graph, we saw that each node represents an operation and points to other operations (or to a scalar if we are at the end of the graph).
This scalar (often denoted L) represents a loss function. We will come back to this function later.
It is important to understand that L measures the performance of the model: the lower L is, the better the model performs. We therefore seek to optimize the parameters with respect to L.
To do this, we will start by calculating the gradients of each tensor according to the topological order given above.
We will therefore start with L, since its gradient is trivially equal to 1.
Next, we will calculate the gradient of each node in reverse topological order using the gradients already calculated.
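To make the procedure concrete, here is a minimal, self-contained sketch of this backward pass on scalar nodes. The Node class and the mul, topo_order and backward helpers are hypothetical names chosen for illustration, not the implementation built in this article.

```python
class Node:
    """Minimal scalar node for illustration (hypothetical, not this article's API)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # nodes this node was computed from
        self.backward_op = None       # maps dL/d(self) to [(parent, dL/dparent), ...]
        self.grad = 0.0

def mul(a, b):
    out = Node(a.value * b.value, parents=(a, b))
    out.backward_op = lambda g: [(a, g * b.value), (b, g * a.value)]
    return out

def topo_order(root):
    """Depth-first topological sort of the graph that ends at `root`."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)           # a node is appended after all of its parents
    visit(root)
    return order

def backward(loss):
    loss.grad = 1.0                   # dL/dL = 1: the starting point
    for node in reversed(topo_order(loss)):      # reverse topological order
        if node.backward_op is not None:
            for parent, contribution in node.backward_op(node.grad):
                parent.grad += contribution      # accumulate into the parents

x, y = Node(2.0), Node(3.0)
L = mul(x, y)                         # L = x * y = 6
backward(L)
print(x.grad, y.grad)                 # 3.0 2.0  (dL/dx = y, dL/dy = x)
```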
The goal is therefore to compute the gradients of A and B (or of A alone) given the gradient of f(A, B) (or of f(A)).
Let L : T → ℝ be differentiable and f : T → T an element-wise function (e.g., tanh, ReLU).
We can therefore define f′ as the derivative of f on ℝ (applied to a tensor element by element): in reality f : ℝ → ℝ, and we have simply extended it to apply to each element of the tensor.
Let Y = f(A) ∈ T and G = ∇_Y L ∈ T (the gradient of L with respect to Y, already computed).
A piece of the graph (the application of f) therefore looks like this:
First, note that since f acts point by point, with i = (i_1, ..., i_N) an index (so A_i is a scalar),
[Df(A)[dA]]_i = f′(A_i) dA_i ⟹ dY = Df(A)[dA] = f′(A) ⊙ dA,
where ⊙ denotes element-by-element (Hadamard) multiplication.
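A quick numerical check of this linearization, using NumPy and tanh as the element-wise f (both choices are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A  = rng.normal(size=(2, 3))
dA = 1e-6 * rng.normal(size=(2, 3))          # a small variation

f       = np.tanh
f_prime = lambda t: 1.0 - np.tanh(t) ** 2    # tanh'(t) = 1 - tanh(t)^2

dY_exact  = f(A + dA) - f(A)                 # actual variation of Y = f(A)
dY_linear = f_prime(A) * dA                  # f'(A) ⊙ dA  (Hadamard product)
print(np.max(np.abs(dY_exact - dY_linear)))  # tiny, O(||dA||^2)
```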
Expanding L ∘ f and using the chain rule together with the definition of the differential, we have:
D(L ∘ f)(A)[dA] = DL(Y)[dY] = ⟨G, f′(A) ⊙ dA⟩ = ⟨f′(A) ⊙ G, dA⟩   (*)
and therefore, by identification with the definition of the gradient, ∇_A L = f′(A) ⊙ G.
(*) is very simple to derive:
⟨X, Y⊙Z⟩ = ∑_{b,i,j} x_{bij} y_{bij} z_{bij}, with b the batch index. We can clearly see that everything commutes, and therefore ⟨X, Y⊙Z⟩ = ⟨Y⊙X, Z⟩.
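As a sanity check, both the identity (*) and the resulting rule ∇_A L = f′(A) ⊙ G can be verified numerically. The NumPy sketch below uses tanh for f and the toy loss L(Y) = ⟨G, Y⟩ (so that ∇_Y L = G); both are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (4, 2, 3)                            # batch of 4 tensors of shape (2, 3)

# 1) The identity (*): <X, Y ⊙ Z> = <Y ⊙ X, Z>, since everything commutes element-wise.
X, Y, Z = (rng.normal(size=shape) for _ in range(3))
print(np.isclose(np.sum(X * (Y * Z)), np.sum((Y * X) * Z)))   # True

# 2) The rule ∇_A L = f'(A) ⊙ G, checked by finite differences for f = tanh
#    and the toy loss L(Y) = <G, Y>, so that ∇_Y L = G.
A = rng.normal(size=shape)
G = rng.normal(size=shape)

def loss_of_A(A):
    return np.sum(G * np.tanh(A))

grad_rule = (1.0 - np.tanh(A) ** 2) * G      # f'(A) ⊙ G
eps = 1e-6
idx = (0, 1, 2)                              # one arbitrary entry
E = np.zeros(shape); E[idx] = eps
grad_fd = (loss_of_A(A + E) - loss_of_A(A - E)) / (2 * eps)
print(np.isclose(grad_rule[idx], grad_fd))   # True
```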