A neural network performs operations on tensors (addition, multiplication, tanh, etc.). So, we will first focus on implementing everything correctly for tensors.
Tensors can be confusing at first, but you can think of them as the generalization of scalars/vectors/matrices to $n$ dimensions.
For the definition of a tensor, I refer you to the excellent post by “Robot Chinwag” on the subject. You can stop before the “Tensor Calculus” section (and everything after it), though: we will rely on different principles, so that part will not be needed here.
From now on, we will denote $\mathcal{T}$ the tensor space, and $\mathcal{T}(N)$ the space of tensors of order $N$.
For example:
Let's consider a tensor $A \in \mathbb{R}^{3 \times 2 \times 2} \subset \mathcal{T}(3)$. We can see it as three stacked $2 \times 2$ matrices:
$$A = \left[\begin{pmatrix} a_{111} & a_{112} \\ a_{121} & a_{122} \end{pmatrix}, \begin{pmatrix} a_{211} & a_{212} \\ a_{221} & a_{222} \end{pmatrix}, \begin{pmatrix} a_{311} & a_{312} \\ a_{321} & a_{322} \end{pmatrix}\right]$$
In other words, $A_{k,i,j}$ denotes the element at position $(i,j)$ in the $k$-th matrix.
We also denote $d_A(i)$ the size of the dimension of order $i$.
In this case, we have $d_A(1) = 3$ and $d_A(i) = 2$ for $i \in \{2, 3\}$.
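To make this concrete, here is a small illustration (my own, not part of the post's code) of an order-3 tensor in NumPy and how its order and dimension sizes map to `ndim` and `shape`:

```python
import numpy as np

# An order-3 tensor A in R^{3x2x2}: a stack of three 2x2 matrices.
A = np.arange(12).reshape(3, 2, 2)

print(A.ndim)      # order N = 3
print(A.shape)     # (d_A(1), d_A(2), d_A(3)) = (3, 2, 2)
print(A[0, 1, 0])  # element A_{k,i,j} with k=0, (i, j) = (1, 0)
```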
Here, $i, j$ are the indices of the last two dimensions. The index $z$ represents the “batch” index, i.e., all the remaining leading dimensions: if the dimensions of a tensor are $i_1, \dots, i_N$, then $z$ must be read as the tuple of indices $i_1, \dots, i_{N-2}$.
An important concept for tensors is that of broadcasting.
It was already explained in Robot Chinwag's post, but I will provide some examples to make the concept easier to understand.
A simple example is when you multiply a vector by a scalar. This can be seen as broadcasting the scalar over the shape of the vector and then multiplying element by element. Example:
$$\begin{pmatrix}1\\2\\3\end{pmatrix} * 3 = \begin{pmatrix}1\\2\\3\end{pmatrix} * \begin{pmatrix}3\\3\\3\end{pmatrix} = \begin{pmatrix}1*3\\2*3\\3*3\end{pmatrix} = \begin{pmatrix}3\\6\\9\end{pmatrix}$$
where $*$ denotes the element-wise (Hadamard) product.
Here is a more advanced example:
Let $A \in \mathbb{R}^{3 \times 1}$ and $B \in \mathbb{R}^{1 \times 4}$, with
$$A = \begin{pmatrix}1\\2\\3\end{pmatrix}, \quad B = \begin{pmatrix}10 & 20 & 30 & 40\end{pmatrix}.$$
During an element-by-element operation (e.g., $C = A + B$), both tensors are broadcast to the shape $3 \times 4$:
$$A' = \begin{pmatrix}1&1&1&1\\2&2&2&2\\3&3&3&3\end{pmatrix}, \quad B' = \begin{pmatrix}10&20&30&40\\10&20&30&40\\10&20&30&40\end{pmatrix}.$$
Thus, $C = A + B = A' + B' = \begin{pmatrix}11&21&31&41\\12&22&32&42\\13&23&33&43\end{pmatrix}.$
You can find the code for broadcasting in the following section, if that helps you understand better.
We will now assume that for each operation (multiplication, addition, etc.), we will broadcast the tensors so that their shapes match if possible.
Let $(A, B) \in \mathcal{T}(N) \times \mathcal{T}(M)$ such that $d_A(N-1) = d_B(M-1)$ and $d_A(N) = d_B(M)$. Their scalar product is defined as follows:
$$\langle A, B \rangle = \operatorname{Tr}(A^\top B) = a_{zij}\, b_{zij}$$
If you are still having trouble with this notation, despite the post by “Robot Chinwag”, I recommend looking at other resources on Einstein notation and even on einsum (a PyTorch function). (Basically, here we sum over all the repeated indices: $z$, $i$, and $j$.)
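As a quick illustration (my own sketch, with arbitrary example values), `einsum` expresses this scalar product directly, and we can check that it matches a batch of matrix traces:

```python
import numpy as np

# Scalar product <A, B> = a_{zij} b_{zij}: sum over the batch index z
# and the last two indices i, j.
A = np.arange(12, dtype=float).reshape(3, 2, 2)
B = np.ones((3, 2, 2))

dot = np.einsum('zij,zij->', A, B)                    # Einstein summation
same = sum(np.trace(a.T @ b) for a, b in zip(A, B))   # sum of Tr(A^T B)

print(dot, same)  # both equal 66.0
```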