Now that we have shown how to calculate these gradients, we can move on to implementing the various operations to construct the graph!
Details: shape correspondence
An important detail is that the gradient with respect to a tensor must have the same shape as that tensor: the gradient with respect to A must have the same shape as A (otherwise it would be meaningless). This no longer holds automatically when the forward operation broadcast one of its inputs: the incoming gradient G then carries the broadcast shape, and we must sum over the broadcast axes (for example, the batch dimension) to recover the desired shape.
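As a concrete sketch (assuming numpy; the helper name `sum_to_shape` is illustrative, not from the text), the broadcast reduction can be written as:

```python
import numpy as np

def sum_to_shape(grad: np.ndarray, shape: tuple) -> np.ndarray:
    """Reduce a broadcast gradient back to `shape` by summing.

    Hypothetical helper: sums over the axes that broadcasting added
    or stretched, so the result has the parent's shape.
    """
    # Sum over leading axes that broadcasting added (e.g. a batch axis).
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum over axes that were stretched from size 1 to a larger size.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad
```

For example, if $B \in \mathbb{R}^{n \times p}$ was broadcast against a batch of size $B$, the raw gradient has shape $(B, n, p)$ and `sum_to_shape(grad, (n, p))` sums away the batch axis.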
We will use matrix multiplication as an example, but the same reasoning applies to all operations.
Let $A \in \mathbb{R}^{B \times m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $Y \in \mathbb{R}^{B \times m \times p}$ with $Y_b = A_b @ B$.
Let $L : \mathbb{R}^{B \times m \times p} \to \mathbb{R}$ and $G_b = \nabla_{Y_b} L$.
$$
\begin{aligned}
dL &= \sum_{b=1}^{B} \langle G_b,\, dY_b \rangle
    = \sum_{b=1}^{B} \langle G_b,\, dA_b @ B + A_b @ dB \rangle \\
   &= \sum_{b=1}^{B} \langle G_b @ B^\top,\, dA_b \rangle
    + \sum_{b=1}^{B} \langle A_b^\top @ G_b,\, dB \rangle \\
   &= \sum_{b=1}^{B} \langle G_b @ B^\top,\, dA_b \rangle
    + \Big\langle \sum_{b=1}^{B} A_b^\top @ G_b,\, dB \Big\rangle. \qquad (\star)
\end{aligned}
$$
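The middle step moves $B$ (resp. $A_b$) to the other side of the Frobenius inner product. A quick numerical check of this adjoint identity, $\langle G, dA @ B \rangle = \langle G @ B^\top, dA \rangle$, as a sketch with random matrices (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 5
G  = rng.standard_normal((m, p))   # upstream gradient of one batch element
dA = rng.standard_normal((m, n))   # arbitrary perturbation of A_b
B  = rng.standard_normal((n, p))

lhs = np.sum(G * (dA @ B))         # <G, dA @ B>  (Frobenius inner product)
rhs = np.sum((G @ B.T) * dA)       # <G @ B^T, dA>
assert np.isclose(lhs, rhs)
```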
By identification (Riesz), we obtain the gradients:
$$
\nabla_{A_b} L = G_b @ B^\top, \qquad \nabla_B L = \sum_{b=1}^{B} A_b^\top @ G_b.
$$
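Putting (★) into code, here is a minimal numpy sketch of the batched matmul backward pass (the function name and signature are illustrative, not a fixed API):

```python
import numpy as np

def matmul_backward(G, A, B):
    """Gradients of Y_b = A_b @ B, with A of shape (Bt, m, n),
    B of shape (n, p), and G of shape (Bt, m, p) like Y."""
    grad_A = G @ B.T                          # per batch: G_b @ B^T -> (Bt, m, n)
    grad_B = np.einsum('bmn,bmp->np', A, G)   # sum_b A_b^T @ G_b    -> (n, p)
    return grad_A, grad_B
```

Note that `grad_B` contracts over the batch axis: this is exactly the broadcast reduction discussed above, and it is what gives `grad_B` the same shape as $B$.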