Machine Learning in shaders – 1. Logistic Regression

The first step toward understanding the fundamental concepts of Machine Learning is to dissect its elements down to the core. In my experience, there is no better way to learn a new skill than to build a small project that integrates the necessary ingredients in a more or less challenging way. For this reason, this section will dive right away into a goal and explain things along the way, rather than trying to give a solid theoretical foundation first.

I assume that you know in principle what vectors and matrices are and that you are familiar with GLSL.

The goal

The current goal in this first step is to build a logistic regression algorithm capable of identifying boundaries for a set of coordinates in a 2D plane, using exclusively pixel shaders. In essence, it should be a classifier that tells us to which class each point of the plane belongs.

We will be building our classifier with a single neuron. Sometimes this is called a “perceptron”, although the name can vary depending on the specific implementation. For now, you may associate the name perceptron with a neural network that consists of a single neuron.

For simplicity, we will build it first with toy data created by hand and afterward with slightly more complicated data sets. The initial data that we will use contains only 10 points (excuse the handwriting):

A simple dataset

We will call these ten data points “training examples”. Therefore:

$m = 10$

Furthermore, we see that each data point is a coordinate with x, y information, so we know that this vector has a shape of one row by two columns:

$\vec{x} \rightarrow [1 \hspace{10px} 2]$

The most convenient way to express this in TouchDesigner is to create a table with the coordinates of each point. Since we are building a classifier, we also need to record the class to which each of these coordinates belongs. We could do this in the same table, but for the sake of clarity, we will create another table that contains the “classes” or “labels” according to our diagram. We end up then with two tables:
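For illustration, the layout of the two tables could look something like this (hypothetical values, since the actual points come from the hand-drawn diagram and yours will differ):

positions            classes
x       y            label
0.10    0.20         0
0.85    0.30         0
0.15    0.75         1
0.90    0.80         1

one row per training example, ten rows in total.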


Since our goal is to perform computations in a shader, these tables won’t cut it for us. We need to convert them to textures that store the coordinate data and the class data in pixels. We can achieve this by converting them to CHOPs and then converting those CHOPs to TOPs:

Notice that we are outputting a channel per column and that our TOP for the data is of RG format. We are only feeding it two values per pixel that represent our (x, y) coordinates in the table. If you navigate to the resulting texture, enter “viewer active” mode, and press “d”, you can inspect the pixel values, which is a comfortable way for us to debug our algorithm! The R, G values should correspond to the values in our table:

We proceed to do the same with our classes (bear in mind that the pixel format should be R only, since there is a single value per example) and connect each resulting texture to a GLSL TOP, which will give us access to our data in the shader:

The math

Before we dive into the code for the shader, we need to understand the math on some level. Basic knowledge of linear algebra is required to comprehend the inner workings of these algorithms. Who would have thought? Anyhow, I assume that you are somewhat familiar with Machine Learning (and most likely know more math than I did when I began learning this). In any case, I attempt to explain the math in simple words, and I provide further resources so that you at least have an intuition of what the equations do.

The notation introduced here is taken from Andrew Ng’s excellent AI courses, which I recommend following, along with Stanford’s CS231n course and the Machine Learning for Artists site, which I will reference as well further along this path.

Notation

$m =$ size of training set

$n =$ size of feature set

$x =$ input features vector

$y =$ classes vector

$w =$ weights

$b =$ bias

$\sigma =$ sigmoid activation

$d =$ derivative

$\alpha =$ learning rate

$z =$ linear mapping (the logit)

$\hat{y} =$ prediction

$a =$ activated $z$

$L(a,y) =$ loss

Equations

Here are the equations needed to compute a single step of logistic regression. Do not be scared by them! We will dive into what they mean and discover that they are not so complicated. For now, check them out, and do not worry too much if you do not fully understand them.

We begin with this concept of “forward propagation”. The starting point of this process is the linear mapping formula:

$z = W^{T}x + b$

Then, we have to define an activation function. In this case, it will be a sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

We apply this function to our linear mapping:

$\hat{y} = \sigma(z)$

And we compute the loss for a single example. There are various loss equations we could use, but we will stick to the cross-entropy function:

$L(a,y) = -(y\log(a) + (1-y)\log(1-a))$

Ok. That will do for our forward propagation. The next step is to perform something called “backward propagation“. We will compute derivatives for our activation function, our weight value, and the bias. In this example with a single neuron and binary classification, we can do:

$dz = a-y$

$dw = (x \cdot dz) / m$

$db = dz / m$

Hold on… what?

If you read the previous equations and are ready to close this window, hold on! It is not as daunting as it may look. Let me first try to explain the equations in simple words, and we can find out what is going on.

Forward propagation is simply a set of operations that occur linearly from beginning to end and produce a set of values that ultimately approximate a ground truth (which is a fancy name for our classes). For now, you may visualize it simply as an arrow pointing forward:

Backward propagation is, in contrast, a set of operations that occur linearly from end to beginning. Values computed in the forward propagation stage update values on the next run of the loop, so that the results get closer and closer to the ground truth on every run. You may visualize it simply as an arrow pointing backward.

The linear mapping formula:

$z = W^{T}x + b$

means that we apply a given weight to our training sample and then compensate for the result using a bias. As we said, the weights are values that iteratively get updated on every run, so that the result gets closer and closer to a known truth. That is the most fundamental intuition to have. I recommend checking out this fantastic blog and playing a little bit with the sliders to see how this works:

Visual interactive guide to the basics of neural networks by Jay Alammar.

The activation function is a mechanism that allows us to decide whether a value belongs to a category or not. The one we are using for this first example is called the “sigmoid function” and has this shape:

It works for us because we are trying to decide what class each of our 10 data points belongs to. Is it class 0 or class 1?

This function compresses the input values into the range of 0 to 1, so they can be read as probabilities. For example, if the input is a huge negative number, the exponential term $e^{-z}$ grows enormously, and the output approaches 0. If instead it is a huge positive number, $e^{-z}$ vanishes, and the output approaches 1.
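To make this concrete, here is a quick check with a few sample inputs:

$\sigma(-4) = \frac{1}{1 + e^{4}} \approx 0.018$

$\sigma(0) = 0.5$

$\sigma(4) = \frac{1}{1 + e^{-4}} \approx 0.982$

No matter how extreme the input, the output lands strictly between 0 and 1.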

If you are familiar with an audio compressor, you may notice that the principle is quite similar, in that low-powered signals gain strength while high-powered signals get compressed. You may think about it that way.

Cross entropy

The cross-entropy portion of our equations takes care of measuring how well our prediction is doing. In other words, it checks whether the probability assigned to a value being a 0 or a 1 in our example is high or low. It follows that it penalizes poor predictions and rewards good ones: if the predicted probability of a value being 1 is 100%, its loss is 0; if the predicted probability is only 1%, its loss is huge.
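A quick worked example: for a training example with label $y = 1$, a confident correct prediction $a = 0.99$ gives a loss of $-\log(0.99) \approx 0.01$, while a confident wrong prediction $a = 0.01$ gives $-\log(0.01) \approx 4.6$. The loss explodes precisely when the model is confidently wrong.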

To understand this in more detail, you can check Understanding binary cross-entropy, by Daniel Godoy.

Derivatives

Derivatives are perhaps the most challenging part to understand. For our purposes, it is nonetheless sufficient to gain a good intuition of what derivatives do.

Derivatives measure the rate of change of our values, allowing us to establish a new measurement on each run that will “steer” the values in the right direction. Perhaps it helps to see this in a geometric representation, where a derivative is the slope of a line at a specific point:

When we use:

$dz = a-y$

$dw = (x \cdot dz) / m$

$db = dz / m$

We are checking the change of our activated values, weights, and biases with the given input. If we imagine that by performing these steps we are traversing a line, the following will hopefully make sense:

“A derivative outputs an expression we can use to calculate the instantaneous rate of change, or slope, at a single point on a line. After solving for the derivative you can use it to calculate the slope at every other point on the line” (Go to this link from ML-Cheatsheet to read more about derivatives and calculus in general)
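In case you wonder where the compact $dz = a - y$ comes from, it is the chain rule at work: the derivative of the cross-entropy loss with respect to $a$ is $-y/a + (1-y)/(1-a)$, the derivative of the sigmoid is $a(1-a)$, and multiplying the two and simplifying leaves just $a - y$.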

I highly recommend the “Essence of Calculus” series by 3blue1brown, which provides fantastic insight into how calculus works. You can find the chapter about derivatives here:

Updating

Finally, the easy part. In the optimization stage, we replace the original values of our weights and biases. That happens on every run of the forward/backward propagation cycle as we approximate an accurate prediction.

$w = w - \alpha \, dw$

$b = b - \alpha \, db$

That means: take the previous weight value and replace it with a new weight, equal to the current weight minus the derivative of that weight times some scalar $\alpha$ (the learning rate).
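A tiny numeric example with made-up values: with a learning rate $\alpha = 0.1$, a current weight $w = 0.5$, and a derivative $dw = 0.2$, the update gives $w = 0.5 - 0.1 \cdot 0.2 = 0.48$, a small nudge in the direction that reduces the loss.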

Gradient descent

This whole process computes a single step of what we call the gradient descent, updating our weights and biases so that our model can further converge to an optimum state.

Gradient descent is the process of continuously moving closer and closer to an optimal solution. You may visualize yourself on a hill with your eyes closed, with the task of getting down the mountain. Gradient descent is like taking little steps in the steepest downward direction by feeling the small changes in the ground.

Check this link from ML Cheatsheet to read more about it.

Loop

If all this makes sense, you will know that we need a loop somewhere to do this repeatedly until we get our answer. That is where we are going to get creative. But let us not get ahead of ourselves; let us now try to implement all this in GLSL.

Implementation

Let us build all these steps one by one in our shader. First and foremost, we must make sure our data is correctly represented at the pixel level. We know that each input data point is a vector with shape (2, 1), so we can use a vec2 to store it. Each class is a single value, so we can use a float to store it.

We should proceed to initialize our parameters with some arbitrary values. I have created a uniform numExamples for the (surprise, surprise) number of examples, which will come in handy afterward:

float m = float(numExamples); // Number of training examples
vec2 w = vec2(0.001); // Initialise weights
float b = 0.0; // Initialise bias

Next, we define the activation function for our classifier:

float sigmoid(float z) {
    return 1. / (1. + exp(-z));
}

And proceed to compute our forward propagation:

// Forward propagation
float z = dot(w, x) + b;
float a = sigmoid(z);

Notice that our implementation directly performs the dot product of the vectors $w$ and $x$, even though our formula said to multiply $x$ by the transpose of the matrix $W$:

$z = W^{T}x + b$

We take the transpose in the latter case so that the multiplication is possible. $W$ with dimensions (2, 1) and $x$ with dimensions (2, 1) are not “compatible”, since their axes are not aligned. The rule is that the number of columns in the first matrix must equal the number of rows in the second vector or matrix. If we transpose $W$, we get dimensions (1, 2), which can be multiplied with $x$, resulting in a matrix of dimensions (1, 1).

However, we are dealing with vectors, which are one-dimensional, so there is no need to transpose the weights to match the shape of $x$; they already match! Furthermore, with vectors, the order of the operands in this particular case does not matter. We will have to think about dimensions in a moment nonetheless, but this will do for now.
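To see why the two notations agree, write the product out component by component. With $W^{T} = [w_1 \hspace{10px} w_2]$ and $x$ holding $x_1, x_2$:

$W^{T}x = w_1 x_1 + w_2 x_2$

which is exactly the sum of element-wise products that the GLSL dot() function computes.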

Again, do not worry if this goes over your head. We will deal with matrix multiplication specifically in the next tutorials. Bear with me for now.

We can proceed then to compute the loss for this single step, using the cross-entropy function:

float loss = -(y * log(a) + (1. - y) * log(1.-a));

The backward propagation:

float dz = a - y;     // Derivative of activation function
vec2 dw = x * dz / m; // Derivative of weights
float db = dz / m;    // Derivative of bias

And finally a step of gradient descent:

w -= (learningRate * dw);
b -= (learningRate * db);

Alright, we’ve done it! Here is the full code:

// Compute one step of gradient descent
uniform int numExamples;
uniform float learningRate;
out vec4 fragColor;
float sigmoid(float z) {
    return 1. / (1. + exp(-z));
}
void main(){
    // Gather data
    vec2 x = texture(sTD2DInputs[0], vUV.st).rg; // data
    float y = texture(sTD2DInputs[1], vUV.st).r; // labels
    // Initialize
    float m = float(numExamples);
    vec2 w = vec2(0.001);
    float b = 0.0;
    // Forward propagation
    float z = dot(w, x) + b;
    float a = sigmoid(z);
    float loss = -(y * log(a) + (1. - y)*log(1.-a));
    //Backward propagation
    float dz = a - y;
    vec2 dw = x * dz / m; // Derivative of weights
    float db = dz / m;    // Derivative of bias
    // Compute one step of gradient descent
    w -= (learningRate * dw);
    b -= (learningRate * db);
    // Store it in pixels
    vec4 parameters = vec4(vec3(w, b), loss);
    fragColor = TDOutputSwizzle(parameters);
}

To loop or not to loop… gimme some feedback!

Now, as we saw in our set of equations, we need to optimize our algorithm across several iterations, or “epochs”, over all the training examples we have. Our input texture at the moment is (10, 1), so intuitively we would want to create a for loop in the shader to traverse the width of the input texture and perform the updates. That would be fine, but since our shader has the resolution of the input, (10, 1), it would be very wasteful, given that in shaders every pixel computation runs in parallel.

If we put a for loop there, we would be asking each pixel to iterate over all those pixels and output the same result for each pixel in a (10, 1) texture, meaning that we would be doing 90 unnecessary steps! The proper way to do this is simply to change the shader resolution to (1, 1), so that our for loop runs only once over all the input data and outputs a single pixel with the updated parameters.

Note that we could also parallelize our operation and have a shader that matches the size of the input and performs the operations pixel-wise; that would be the most efficient way to do it in this case. Nonetheless, we will choose the approach described above, because we are heading towards operations using matrix multiplication later on, where having this intuition will make sense.

Ok great, but how do we optimize across epochs? Remember that we need to update weights and biases on every step. The best way to do this in shaders is by creating a buffer to store the new values each frame and performing a feedback update using the buffered texture. We need to ensure that the pixel format is set to 32-bit float RGBA, so the TOP outputs RGBA data with signed floating-point precision; otherwise TouchDesigner inherits the input data’s format, which in our case had only RG channels:

Let us re-think the way we are outputting our data. We know we want to output a single pixel. This pixel will contain the weights stored in RG, the bias stored in B, and the loss stored in A. We will use this pixel in a feedback loop to update the data continuously. That means that we need to redefine the parameters data in our code.

First, we create a new constant and a feedback TOP representing the parameters, connect the feedback to the third input of our GLSL TOP, and then close the loop:

We need to modify our initialization so that we sample from that newly created texture. That looks like this:

// Initialize
int m = numExamples;
vec3 parameters = texture(sTD2DInputs[2], vec2(0)).xyz;
vec2 w = parameters.xy;
float b = parameters.z;

Let us proceed to create a variable to store the cost. Before, we computed the loss for a single example. Now we compute the sum of all the losses at once.

That is referred to as the “cost” instead of “loss”. We also create a variable to store the derivatives we are going to compute:

float cost = 0.0;
vec3 derivatives = vec3(0.0);

We did this whole thing so we can process all the training examples in a single pass. Therefore, we need to grab all of them at once using a for loop of size $m$ and compute their respective forward and backward propagations. We can use the texelFetch() function to access each example:

for (int i = 0; i < m; i++) {
    vec2 x = texelFetch(sTD2DInputs[0], ivec2(i, 0), 0).xy; // data
    float y = texelFetch(sTD2DInputs[1], ivec2(i,0), 0).x; // class
    // Forward propagation
    float z = dot(w, x) + b;
    float a = sigmoid(z);
    // cost is a function in relation to all the input labels
    // which we update here:
    cost += -(y * log(a) + (1. - y) * log(1.-a)) / float(m);
    // Backward propagation
    float dz = a - y;
    vec2 dw = x * dz / float(m);
    float db = dz / float(m); // The derivative of the bias
    // Accumulate the derivatives across examples
    derivatives += vec3(dw, db);
}

Finally, we update our weights and bias and output them in a single pixel, which is what we feed back into the system as a way of performing our iterations:

w -= (learningRate * derivatives.xy);
b -= (learningRate * derivatives.z);
fragColor = vec4(w, b, cost);

You may see a warning in your info node saying: “WARNING: Output of vertex shader ‘vUV’ not read by fragment shader”. That just means we have not used the provided UV coordinates. You can ignore it, or add the following somewhere in your code to silence it:

vec2 uv = vUV.st; 

Alright! We have done it. Time to test it! Let’s create a uniform “uReset” and add this snippet to our shader, so we can reset the weights and bias to their initial state and watch how our shader learns:

if (uReset == 1) {
    w = vec2(0.001);
    b = 0.0;
}

We can create a momentary button to reset our model and some operators to inspect the loss over time:

We should see our loss decreasing over time. Here is the full code of our model:

uniform int numExamples;
uniform float learningRate;
uniform int uReset;
out vec4 fragColor;
float sigmoid(float z) {
	return 1. / (1. + exp(-z));
}
void main(){
	// vec2 uv = vUV.st;
	int m = numExamples;
	vec3 parameters = texture(sTD2DInputs[2], vec2(0)).xyz;
	vec2 w = parameters.xy;
	float b = parameters.z;
	// Reset to initialization
	if (uReset == 1) {
		w = vec2(0.001);
		b = 0.0;
	}
	float cost = 0.0;
	vec3 derivatives = vec3(0.0);
	for (int i = 0; i < m; i++) {
		vec2 x = texelFetch(sTD2DInputs[0], ivec2(i,0), 0).xy; // data
		float y = texelFetch(sTD2DInputs[1], ivec2(i,0), 0).x; // classification labels
		// Forward propagation
		float z = dot(w, x) + b;
		float a = sigmoid(z); 
		cost += -(y * log(a) + (1. - y) * log(1.-a)) / float(m); // Take the mean by dividing by m
		//Backward propagation
		float dz = a - y;
		vec2 dw = x * dz / float(m); // Derivative of weights
		float db = dz / float(m);    // Derivative of bias
		// Accumulate the derivatives across examples
		derivatives += vec3(dw, db);
	}
	w -= (learningRate * derivatives.xy);
	b -= (learningRate * derivatives.z);
	fragColor = vec4(w, b, cost);
}

Prediction

Now, we can make a prediction. For that, we can create another shader, connect the output of our model to the input of that shader, and use the following code:


uniform vec2 uInput;
out vec4 fragColor;
float sigmoid(float z) { 
	return 1. / (1. + exp(-z));
}
void main(){
	vec3 w = texture(sTD2DInputs[0], vUV.st).xyz; // parameters: weights in RG, bias in B
	float predict = sigmoid(dot(w.xy, uInput) + w.z);
	fragColor = vec4(predict);
	// fragColor = vec4(round(predict)); // This gives you hard values instead
}

Use our model as input and pass it an arbitrary point via the vector uniform. Try values that you know should be class 0, such as (0.9, 0.25), or class 1, such as (0.9, 0.75), and see how it performs. Use the rounded version to get a hard answer: 0 or 1.
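As a bonus, since the prediction is just a function of a 2D point, we can visualize the entire decision boundary by evaluating the model at every pixel, using the normalized UV coordinate as the input point. Here is a minimal sketch of that idea (my own variation, assuming the model texture is connected to the first input as before):

out vec4 fragColor;
float sigmoid(float z) {
	return 1. / (1. + exp(-z));
}
void main(){
	// Sample the learned parameters from the 1x1 model texture
	vec3 w = texture(sTD2DInputs[0], vec2(0)).xyz; // weights in RG, bias in B
	// Treat this pixel's own UV coordinate as the point to classify
	float predict = sigmoid(dot(w.xy, vUV.st) + w.z);
	fragColor = vec4(predict);
}

The result is a grayscale gradient across the plane, with the learned boundary running along the 0.5 gray level.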

TOX File

You can find here a link to the full project with a handy UI to play around with parameters.
