Goal
Now that we know how to perform matrix multiplication with textures, we implement a shallow neural network with an extra layer that will allow us to make non-linear predictions.
I will explain how to build the whole network step by step with what seems to be the most pedagogical architecture. Since its structure is a bit complex, I recommend following along creating the network as we go but not spending much time if it doesn’t work for you at the end. If that occurs it is better to use the example linked in this tutorial since it is very likely that you will make mistakes here and there while building it from scratch.
We will use the same data set we employed when building our single layer perceptron (our first example with a single neuron). That was a data set provided by the sklearn Python module. You may refresh your memory here. Once more, this code produces a file that we can then import in TouchDesigner:
from sklearn import datasetsimport matplotlib.pyplot as plt
# Get 10 points from the make moons data set
# We set the random state to 0, so we have reproducible data
X, y = datasets.make_moons(10, random_state=0, noise=0.0)
# Write data to a file
with open('moons_dataset.txt', 'w') as f:
f.write('{}t{}t{}n'.format('x', 'y', 'class'))
for i, coord in enumerate(X):
f.write('{}t{}t{}n'.format(coord[0], coord[1], y[i]))
# Plot the data if wanted
plt.figure(figsize=(8, 8))
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k')
plt.show()
Create a table from these data points and convert it to TOPs. We have in this manner data representation in pixels. We will be using matrix multiplication for our operations instead of vectors (uni-dimensional matrices).
Given that $m = 10 $ (the number of our training samples), our coordinates data is a (10, 2) matrix or a texture of resolution 2×10. This time, we will name the texture representing the data points as “X” and the labels as “ground truth” which illustrates better what these classes are.
Make sure to select your viewer to “nearest pixel” so you can comfortably see and use its values without interpolation. We are also going to right away create a helper object for ourselves to accurately inspect the numeric values in the textures. Let us call this little tool “Matrix Viewer”.
I will not show how to build this little helper, but this network screenshot and corresponding .tox file should suffice. It is simply a visualizer of the texture values to make sure we get the right values.
I have put this into a Base operator and added a custom parameter to point to the texture we want to observe. If we have done it correctly, the values we see in the table should correspond to the values that we used as input.
Preparing out network
Since we are going to have to access some values repeatedly, we can encapsulate our whole current network into a Base COMP and add some custom parameters that hold constants in our implementation:
- An int, for the size of hidden layer
- An int, for the size of the output layer
- Afloat, for learning rate
- A toggle, to enable learning
We set the size of our hidden layer to three since we want to start our learning process with just three neurons, and set the output layer to one since what we want to have is a single output value that tells us the class of the input data.
Let’s now go inside our master node and think about how we can achieve our goal.
The math for forward propagation
With a hidden layer involved, let’s revise the math necessary to compute our forward propagation. Notice that in this equations, $Z$ refers to multiplication between two matrices and an element-wise addition of a vector $b$, while $A$ refers to the activation function applied to $Z$.
We reach the following equations:
float random(vec2 st, float seed) {
return fract(sin(dot(st, vec2(15.63547, 87.84849))) * seed);
}
void main(){
vec2 uv = vUV.st;
float pixel = random(uv, 9999);
vec4 color = vec4(w);
fragColor = TDOutputSwizzle(pixel);
}
$ Z_1 = W_1X + b_1 $
$ A_1 = tanh(Z_1) $
$ Z_2 = W_2A_1 + b_2 $
$ A_2 = sigma(Z2) $
We need to create matrices that represent and a vector that represents.
What is W?
Given that we have an input [latex]X[/latex] (our training samples), we need to define a set of weights so that our algorithm can “fire” up the right neurons that ultimately define our model. The beauty of machine learning is that we ought not to do anything explicit in that regard, meaning that we don’t have to provide the values ourselves. On the contrary, we let the machine decide what these will be.
Since we do not know what these values are, we need to provide random values in some range to create an appropriate “search space”. In this case, we use signed random values (-1 to 1). In principle, the range that you choose depends somewhat on the application. The crucial point is that since we don’t know what values to start with, then [latex]W[/latex] has to be initialized with random numbers.
Generating W, random initialization
Create a Base object with custom parameters called “Size of this layer” and “Size of input layer”. This arrangement refers to the little trick I showed, that allowed us to predict the size of a given matrix by drawing the network that we are building. Here is the diagram to refresh your memory. Inspect it once more so that you see how this is related to our current endeavor:
Next, we create a mono channel shader inside this object, to generate values according to our properties. Set the resolution to the parameters by writing references to them, so set width to parent().par.Sizeofinputlayer and height to parent().par.Sizeofthislayer.
We then go one level higher and create some other parameters that will come in handy in our Base COMP, since we need to control the scale of our bipolar random values and we want to control a “debug” mode. Here are the extra parameters:
- A reset toggle, to switch later between training and non-training
- A constant float value, to use as a multiplier for our weights
- A seed int, to change our random number generator
- A debug mode toggle
- Afloat value to set the constant when we want to debug
Now, let’s write our shader:
// Initialize weights
uniform float constant;
uniform float seed;
uniform float debugVal;
uniform int debugMode;
out vec4 fragColor;// Create signed random values with a seed
float random(vec2 st, float seed) {
float n = fract(sin(dot(st, vec2(15.63547, 87.84849))) * seed);
return n * 2.0 - 1.0; // Signed random value
}
// We add a debug Val so that we can compare our values to a know-working-algorithm
// if desired, to ensure it all works fine
void main(){
vec2 uv = vUV.st;
float w = (debugMode == 0) ? random(uv, seed) : debugVal;
vec4 color = vec4(w) * constant;
fragColor = TDOutputSwizzle(color);
}
We can go one step higher to our parent Base COMP and proceed to define the “size of this layer” parameter to 3 (which is the number of neurons we have in this layer) and the “size of input layer” to 2 (which is the size of a single sample in our data set).
You can check the result with our custom MatrixViewer object if you want. You should have now a set of random values with appropriate dimensions:
Generating B, a vector
We also need to generate b (the bias). Its dimensions will be (size of the current layer, 1) or a (1×3) texture. The code for this is very simple because this is a single scalar value (often 0). We pass the initialization value via a uniform because it can come in handy when we want to debug. Here is the code:
// Initialize b
uniform float initVal;
out vec4 fragColor;
void main(){
vec4 color = vec4(initVal);
fragColor = color;
}
Like with W, the best is to encapsulate this in a Base COMP and pass our properties via custom parameters. You can proceed to create the following:
- A reset toggle, to switch later between training and non-training
- An initialization value float
- A “size of this layer” int
Set the shader’s width to 1 and its height to the parameter we just created:
You should now have this two re-usable objects:
Awesome! We are now ready to proceed with the equations of forward propagation.
Implementation
Now that we can increase the number of neurons efficiently thanks to matrix multiplication and know how the principle of hidden units works, we need to start thinking about a scalable way of building our network.
Our setup will be more complicated than in our Shallow neural network implementation. We deal with matrices (whole textures) to define our weights and biases, so we need to strategically find a comfortable way to access them in the various stages of our algorithm.
NOTE: what we will be doing is not the best structure possible, but I think it is the clearest for now. Further on, we can think about how to optimize the architecture in a more compact form. If you inspect the source code of TDNeuron, for instance, you will see that we have opted for a much more concise approach in there. For these tutorials, however, we are going to prioritize clarity over cleanliness and efficiency.
Forward propagation
We begin by creating a re-usable linear layer module that is going to be in charge of taking input, and then $W$ and $b$ values to perform matrix multiplication, just like we saw in the previous tutorials. We do so to perform the linear operation we saw in our very first introduction to the equations:
$ Z = WX + b$
We create a Base COMP and add three inputs to it. We then create a GLSL TOP inside to write our matrix multiplication shader and add an output. Go ahead and connect the inputs to the inlets of the GLSL TOP. We also create two info CHOPS, pointing to our first and second inputs, to grab their resolutions. Do not forget to set the pixel format to 32-bit float (Mono) in the GLSL TOP. Our setup looks like this:
Now, we need to specify the dimensions for the output texture of the shader. Remember that this is our matrix multiplication unit. It should be intuitive to say that we need to take the width of the first input (or its “resx”, which correspond to the columns) and the height of the second input (or “resy”, which correspond to the rows) as dimensions for this texture:
Great! The last remaining task is to write the code we learned in the previous tutorial and perform a dot product, although with a little minor addition. We need to account for the third term in our equation: $b$ (the bias), since what we want to achieve here is $WX + b$.
uniform int iterate;
uniform int activationFunction;
out vec4 fragColor;
void main()
{
vec2 uv = vUV.st; // Get coordinate
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
float z = 0.0; // The pixel to store our hand made dot product
for(int i = 0; i < iterate; i++) {
float x = texelFetch(sTD2DInputs[0], ivec2(xy.x, i), 0).r; // point to right row
float w = texelFetch(sTD2DInputs[1], ivec2(i, xy.y), 0).r; // point to right column
z += w * x;
}
// Add bias
float b = texelFetch(sTD2DInputs[2], ivec2(0, xy.y), 0).r; // Grab bias from right column
z += b;
fragColor = vec4(z);
}
The activation function
In previous patches, we had everything in the same shader. Now that we are making something more complicated, we will create a specialized module that performs an activation function over a given input. Our little shader will have the two activation functions we have used so far: tanh and sigmoid. We will add a mechanism to choose between them since we want this node to be reusable in different contexts.
Go ahead and create another base COMP, name it “Activation1” and add a custom parameter:
- Menu “activation function” consisting of “sigmoid” and “tanh”
Inside, create an input, a GLSL TOP, and an output. Set a “uActivationFunction” uniform and assign that to parent().par.Activationfunction. You should have this:
The code for the shader is simple, although slightly more sophisticated version of what we have had so far:
uniform float uActivationFunction;
// Define constants
#define INPUT 0
#define ACTIVATION_SIGMOID 0
#define ACTIVATION_TANH 1
// Activation functions
#define sigmoid(z) 1.0 / (1.0 + exp(-z))
// Apply activation function
float ApplyActivation(float z)
{
float activation = 0;
if(uActivationFunction == ACTIVATION_SIGMOID) {
activation = sigmoid(z);
} else if (uActivationFunction == ACTIVATION_TANH) {
activation = tanh(z);
}
return activation;
}
out float fragNeuron;
void main()
{
vec2 uv = vUV.st; // Get coordinate
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
float z = 0;
// Apply an activation function
z = texelFetch(sTD2DInputs[INPUT], xy, 0).x;
fragNeuron = ApplyActivation(z);
}
Wonderful! That constitutes all we need to perform our forward propagation.
We can begin to stack things according to what we are trying to achieve. Remember that we want to make a network that consists of 1 hidden layer with 3 neurons and 1 output layer with 1 neuron. For that, we can stack two copies of the hidden unit node we just made and their respective activation functions, one after the other. These will represent our hidden unit and our output layer (you may rename and color code things to be clear). We also need unique Weights and biases for each of these units, so we proceed to create two sets and connect them to their respective inputs:
Set the first activation function to tanh and the second activation function to sigmoid. There we have it! Hopefully, it all makes sense so far.
Our network won’t be doing much because before it can learn, we need to calculate the cost, create a backward propagation stage, and update the weights over several iterations (epochs).
Cost
We are going to follow our current paradigm and create a new node for the Cost computation. Remember that the cost measures how well our prediction is doing given the current set of weights.
Create a Base COMP. Inside throw two inputs, a GLSL TOP, and an output. We are going to write a shader that implements the cross-entropy formula from before:
$ -frac{1}{m}sum_{i=1}^{m}[Y_{i}log(hat{Y}_{i}) – (1-Y_{i}) log(1-hat{Y_{i}})] $
What does it mean again?
Let’s go term by term:
- $Y$ = classes (our ground truth)
- $hat{Y}$ = the output of our forward propagation
- $m$ = the number of training examples
- $sum$ = add all the individual losses
- $-frac{1}{m}$ = multiply all by the negative average
Our version this time is going to look slightly more refined than in previous examples, but it does the same thing. We are calling here “Prediction” the output of the forward propagation and “Ground Truth” the known set of labels Here is the code:
#define TEX_PREDICTION 0
#define TEX_GROUNDTRUTH 1
#define EPSILON 1e-11
out float fragNeuron;
void main()
{
vec2 uv = vUV.st;
vec2 res = uTDOutputInfo.res.zw;
ivec2 xy = ivec2(uv * res);
int samples = int(uTD2DInfos[TEX_GROUNDTRUTH].res.z); // Fetch amount of samples
int neurons = int(uTD2DInfos[TEX_PREDICTION].res.w); // Fetch amount of neurons
float cost = 0.0;
for (int i = 0; i < neurons; i++) {
float yhat = texelFetch(sTD2DInputs[TEX_PREDICTION], ivec2(xy.x, i), 0).x;
float y = texelFetch(sTD2DInputs[TEX_GROUNDTRUTH], ivec2(xy.x, i), 0).x;
// Clamping with epsilon to make sure no log(0) occurs
cost += y*log(max(yhat,EPSILON)) - (1-y)*log(max(1-yhat, EPSILON));
}
fragNeuron = -cost/samples;
}
NOTE: The extra max() function in the cost calculation is to avoid a possible $log(0)$, which results in a non-defined number, nothing more.
Backward propagation
Alright, forward propagation is easy enough. This following stage will be slightly more complex because we need to compute derivatives of the activated values from the forward propagation and for the weights and biases. That involves some calculus.
Derivatives
I will not demonstrate algebraically how to obtain the derivatives of our functions. However, I recommend checking out this great resource by Patrik David, where you can find proof for everything we are using here in good detail.
I have compiled here a cheat sheet with derivatives for all our functions that you can use as a guide.
Another reason for us to have subdivided our network like in this example is that we will build our backpropagation using the concept of a computational graph. It is much easier to split the task of getting derivatives of the matrices along independent stages in the computation (these are called “gradients”) than writing a single expression to determine their overall components.
There is a whole chapter devoted to computational graphs using TouchDesigner in this tutorials that you can inspect to clarify this concept, along with checking out this lecture from the CS231n course from Standford University:
https://youtu.be/i94OvYb6noo
For now, it suffices to understand that a computational graph is a way to simplify the process of getting gradients (derivatives stored in matrices) by splitting the task into various sub-components.
Derivative of the cost function
We begin by taking the derivative of the cross entropy function. We see that the formula to obtain the derivative of our cost is:
$dL = frac{-1}{y} + frac{1-hat{y}}{1-y}$
We should divide everything by the number of samples in our network, so that the final computation for this derivative is:
$gradient = dL / m $
In the equations, the symbols mean:
- $ y = our prediction $
- $ hat{y} = the groundtruth $
- $ m = number of samples $
We now have to create a backward propagation version of our cost node. So copy-paste the forward cost node and replace the code in the shader with the following:
#define TEX_INPUT 0
#define TEX_GROUNDTRUTH 1
#define EPSILON 1e-11
out float fragGradient;
void main()
{
vec2 uv = vUV.st;
vec2 res = uTDOutputInfo.res.zw;
ivec2 xy = ivec2(uv * res);
float samples = uTD2DInfos[TEX_INPUT].res.z; // Fetch amount of samples
// Clamping with epsilon to make sure no division by 0 occurs
float prediction = texelFetch(sTD2DInputs[TEX_INPUT], xy, 0).x;
float groundTruth = texelFetch(sTD2DInputs[TEX_GROUNDTRUTH], xy, 0).x;
float gradient = -groundTruth/max(prediction,EPSILON) + (1-groundTruth)/max(1-prediction, EPSILON);
fragGradient = gradient/samples;
}
Notice that here we are also taking care that division by zero does not occur, by using the same trick with max() as before.
Derivative of activation function
Similarly, go ahead and copy-paste our forward propagation version of the activation node. To compute gradients given how the “computational graph” works, for each stage in the backward propagation we need to access the corresponding values in their forward version. That means that we need to obtain the content of those nodes and plug them in our backward propagation versions.
Luckily, this is easy enough for us to do using select TOPs. To compute the gradients, we will need these two inputs:
The output values after the corresponding activation function
The output of the last node in our chain (in this case the cost)
So you can go ahead and create these selects for the inputs of our Base. We have named them “Input”, “Forward out” and “Gradient”.
Now, what exactly do we have to compute here? We need to compute the gradient of our activation functions. Since we are going backward, we will start with the derivative of a sigmoid function, since the last activation function we used in the forward propagation was the sigmoid:
$dZ = Z*(1.0-Z)$
And while we are at it since we are making re-usable modules, we can also see the derivative of a tanh function:
$dZ = 1.0-Z^2$
Our multi-purpose code becomes then:
#define TEX_FORWARD_OUT 0
#define TEX_GRADIENT 1
#define ACTIVATION_SIGMOID 1
#define ACTIVATION_TANH 2
uniform float uActivationFunction;
// Define derivatives calculation functions
#define dSigmoid(z) z*(1.0-z)
#define dTanh(z) 1.0-z*z
float CalculateLocalGradient(float layerOutput)
{
float localGradient = 0;
if(uActivationFunction == ACTIVATION_SIGMOID) {
localGradient = dSigmoid(layerOutput);
} else if (uActivationFunction == ACTIVATION_TANH) {
localGradient = dTanh(layerOutput);
}
return localGradient;
}
out float fragNeuron;
void main()
{
vec2 uv = vUV.st;
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
// Grab input value
float gradient = texelFetch(sTD2DInputs[TEX_GRADIENT], xy, 0).r;
float layerOutput = texelFetch(sTD2DInputs[TEX_FORWARD_OUT], xy, 0).r;
float localGradient = 0;
localGradient = CalculateLocalGradient(layerOutput);
// Ouput gradient flow (chain rule)
fragNeuron = localGradient * gradient;
}
Perfect! Remember to set the activation function to the proper value. For the backward node corresponding to our last Activation, this needs to be sigmoid. Finally, invoke the corresponding inputs in the selects:
Derivative of linear node
We will do something very similar now with the linear node:
$WX + b$
The formulas to compute the derivatives of each corresponding element in our backward version of the linear node are as follows, starting with the derivative of the weights:
$A^T cdot dZ$
Where $dZ$ refers to the gradient calculated immediately previous to this node while $A^T$ refers to the transposed matrix that was the input to the corresponding forward propagation version of this node.
The derivative of the biases:
$sum dZ$
And the derivative of the input gradient:
$ W^T cdot dZ $
We see that, like before, we need three inputs:
- The input to our linear layer $A$
- The weights for that linear layer $ W $
- The gradient immediately before this node (in this case the output of the gradient of the activation function) $dZ$
Create three selects and name them: “InputLinear2”, “Weights2” and “Gradient2”. You can already map the selects to their right inputs.
Then we have to create the code. In this case, we will have 3 shaders inside. Proceed to create these and let’s check how to implement them in GLSL.
Derivative of input
The code is nothing new, just matrix multiplication. But pay special attention to how we are fetching the weight values in this backward propagation version because you may or may not notice that our inputs are two non-compatible matrices! That is why the equation we just wrote for the derivative of the gradient transposes one of the matrices so that the operation is possible:
$W^T cdot dZ$
And therefore, we are performing the transposition by inverting the ivec2 coordinates for our texelFetch. The dimensions of this shader will be (gradient width, weights width).
//Computes the dot product of two input textures: dA = W.T * dZ
#define TEX_WEIGHTS 0
#define TEX_GRADIENT 1
out float fragNeuron;
void main() {
vec2 uv = vUV.st; // Get coordinate
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
float z = 0.0; // The pixel to store our hand made dot product
int outputNeurons = int(uTD2DInfos[TEX_WEIGHTS].res.w);
for(int i = 0; i < outputNeurons; i++) {
float weight = texelFetch(sTD2DInputs[TEX_WEIGHTS], ivec2(xy.y, i), 0).r; // (transposed) point to right row
float gradient = texelFetch(sTD2DInputs[TEX_GRADIENT], ivec2(xy.x, i), 0).r; // point to right column
z += weight * gradient;
}
// Output dot product
fragNeuron = z;
}
Derivative of weights
This is also a matrix multiplication shader, but see that in this case, we are transposing the input that we took from our forward propagation, just like the equation says:
$frac{1}{m} * dZ cdot A^T$
The dimensions of this shader will be (input height, gradient height).
//Computes the dot product plus bias of two input textures: dW = a.T * dZ
#define TEX_GRADIENT 0
#define TEX_INPUT 1
out float fragNeuron;
void main()
{
vec2 uv = vUV.st; // Get coordinate
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
float z = 0.0; // The pixel to store our hand made dot product
int samples = int(uTD2DInfos[TEX_INPUT].res.z); // Fetch amount of samples
for(int i = 0; i < samples; i++) {
float x = texelFetch(sTD2DInputs[TEX_INPUT], ivec2(i, xy.x), 0).r; // (transpose) point to right column
float gradient = texelFetch(sTD2DInputs[TEX_GRADIENT], ivec2(i, xy.y), 0).r; // point to right row
z += x * gradient;
}
// Output derivate of weights:
fragNeuron = z;
}
Derivative of bias
This is simply the sum of all the values of the gradient up to this node, divided by the number of samples, resulting in a single value. The dimensions of this shader will be (1, 1)
// computes sum(dZ)
#define TEX_INPUT 0
out vec4 fragColor;
void main()
{
vec2 uv = vUV.st; // Get coordinate
vec2 res = uTDOutputInfo.res.zw; // Get total width and height
ivec2 xy = ivec2(uv * res); // Get coordinate in pixel value
float sum = 0.0; // Variable to store the sum of all dz values
int samples = int(uTD2DInfos[TEX_INPUT].res.z); // Fetch amount of samples
// Add all elements
for(int i = 0; i < samples; i++) {
float dz2 = texelFetch(sTD2DInputs[TEX_INPUT], ivec2(i, xy.y), 0).r; // get right x in col
sum += dz2;
}
// Notice that we have to divide by number of samples
fragColor = vec4(sum / float(samples));
}
Great! You will end up with something that looks like this:
Completing remaining steps
We have arranged the nodes necessary for our output layer. We still need to do the hidden layer. Luckily for us, the thinking process is pretty much over, because thanks to our modular design for all these steps, we can simply copy-paste the backpropagation nodes we created and mapping them to the corresponding inputs:
Optimization
We are almost there. The only remaining step is to update our weights and biases considering the set of derivatives in the backward propagation stage. We can create another Base COMP and name it “Optimization”. See in this screenshot the selected end-point.
In there, we will create:
- 2 TOP inputs, for the derivatives of Weights and biases
- 2 TOP selects, for the original Weights and biases
We want to perform the update of the weight and biases from the learned values. So we want to solve the equation:
$W = W – dW * alpha $
Where $dW$ is the derivative of the Weights and $alpha$ is the learning rate.
We can do this operation with simple TOPs:
- A math TOP to multiply $dW$ by the learning rate
- A subtract TOP to subtract $dW$ from $W$
Fantastic! At the end of the chain I have created two nulls for:
- updatedWeights
- updatedBiases
I have also created 2 TOP custom parameters on this operator to invoke the correct TOPs from the parent level, and a float custom parameter called “learning rate” (you can set it to 0.001) that we map to the multiplying factor in our multiply TOP.
Stack these operators under the “Linear backward” nodes in the network and do the proper mappings. You will have something like this:
Updating
Alright, now it is time to create an updating mechanism. We will use a feedback TOP to do so. Insert a select node in the respective weights and bias generators to grab the updated values and put a feedback TOP between the shader and the output.
We can also create a custom TOP parameter on each node to comfortably select the updated value from the parent level. If you haven’t already, grab the Reset parameter we created at the beginning and plug it into the respective shader. You should have something like this for each weight and bias generator:
Awesome! We are ready to train the network and see it learn! Go ahead and fire it all up with the “Learn” toggle we created in the master operator. This network should be able to get us much closer to a non-linear problem solution.