Demo - The necessity of non-linearities

Project Source Code

Get the project source code below, and follow along with the lesson material.

Download Project Source Code

To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, then feel free to reach out to the author in the course's Discord channel.

  • [00:00 - 00:06] Awesome. All right, it's now half past, so let's continue. Previously, we built the self-attention part of the transformer.

    [00:07 - 04:12] Now let's look at the second part, which is what we call the multi-layer perceptron. You might see a number of other names for this; you might also hear it called the feed-forward neural network, or some other names, but feed-forward neural network and multi-layer perceptron, at least in the context of transformers, are exactly the same thing. So the multi-layer perceptron, MLP for short, modifies words or tokens individually. Right, so we mentioned before that self-attention actually operates across words: it takes context, quote unquote, from different words and uses it to modify the meaning of some words. The MLP takes each token individually, each word individually, and does manipulations. So here's what that looks like: illustrated here in purple, we have self-attention from before, right? It's taking in all the input tokens and using that to determine what my new token looks like, in purple. Now the MLP, illustrated here in green, operates only per token. Right, so per token, we run some transformations. That's illustrated here with these circles and lines, and this is an abstract representation of our feed-forward neural network, or MLP. And that finally produces some output. In this particular case, I just didn't change these numbers, but you can imagine that after the MLP, the vectors look vastly different. Okay, so now let's build this MLP manually. This time, the MLP is going to take way fewer steps; it's simpler. Okay. So here, we're going to build the MLP manually. Now let's see what this MLP looks like. It really only has a few different steps. Here is what some of the steps look like. We're going to have another weight. So anytime I write w underscore something, this is a weight; it's a parameter for our model. So w projection is equal to — okay, so our vector size is actually 300. Oh my gosh, these are huge. I should have made these smaller, but let me just see how this goes. So w down projection is equal to torch.randn(400, 300). And I say huge because I think I'm running on CPU. I'm running on CPU; that's why everything is so slow. Oh, actually, this might be helpful for everyone. If you're running the LLM on Google Colab, you can go right here in the top right, click on that drop-down arrow, and click on "Change runtime type". You can actually select a GPU here. In this case, you only have access to T4 GPUs and TPUs, right? TPUs are Google's version of a GPU, so you'll probably want the T4 GPU, at least for now. PyTorch can also run on TPUs, but it's not well optimized for that, and we're using PyTorch. Okay, so if you want this to run faster, just select that GPU and click Save; on the free version of Google Colab, this will still be available to you. Okay, all right, so back to here, building this MLP manually. Typically speaking, there are two layers to the MLP in a transformer. The first layer — so this matrix multiply, or in this case it's just a matrix, but we're going to use it to do a matrix multiply. This first matrix will typically increase the dimensionality of your vector by four times, right? So my original dimensionality is 300, and this would have gone up to 1200. But that's too big of a tensor, so I don't really want to do that; I'm just gonna make it 400. But just keep that in mind: typically, dimensionality is increased by four times. Right, and then this second matrix would of course reduce dimensionality by about four times, back down to whatever dimensionality the original tensor was in.
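For reference, here's a minimal sketch of what those two weight matrices might look like in PyTorch, using the 300-in, 400-out sizes from the lesson. The variable names `w_proj` and `w_down_proj` are my guesses at the notebook's spelling, not the notebook's exact code:

```python
import torch

# Up projection: maps each 300-dimensional token vector to 400 dimensions.
# (A typical transformer would use 4x the input size, i.e. 1200, but the
# lesson keeps it at 400 to stay small.)
w_proj = torch.randn(300, 400)

# Down projection: maps the 400-dimensional vector back to 300 dimensions.
w_down_proj = torch.randn(400, 300)
```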

    [04:13 - 05:34] Like, four times. Okay, that's capital W. And now let's add something called a nonlinearity. I haven't explained what a nonlinearity is, and I haven't yet explained what SiLU is, so for now let's just ignore those details. A nonlinearity is just some function that takes in a vector and outputs another vector, and I'll explain what that is in a second. Yeah, so a question from Daniel is: how do you think about dimensionality, in your own words? Dimensionality is just the number of values. So a vector is just an array of numbers, and dimensionality is just how many values there are in that array. So how many numbers are in the vector is your dimensionality. In this case, our input vectors have 300 numbers in them, and then this matrix multiply will create a new vector that has 400 values in it. So this first number is always going to be the input dimensionality, and this second one is always going to be the output dimensionality. Okay, so let's combine all three of these together. We're going to use the same syntactic sugar that we saw before with our at sign. So here, we're going to have attention three. This is our token that was output from self-attention.
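To make the dimensionality point concrete, here is a small sketch (names are illustrative) showing how a matrix multiply changes the number of values in a vector, with the first matrix dimension matching the input size and the second giving the output size:

```python
import torch

x = torch.randn(300)        # an input token vector with 300 values
w = torch.randn(300, 400)   # first dim = input size, second dim = output size
y = x @ w                   # matrix multiply with the @ operator

print(x.shape, y.shape)     # torch.Size([300]) torch.Size([400])
```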

    [05:35 - 05:50] And then we're going to write a matrix multiply with our projection matrix. Then we're going to run the nonlinearity, which I'll explain later. And then we're going to do another matrix multiply with what we call our down projection.
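Put together, the manual MLP from this step might look like the sketch below. I'm assuming the nonlinearity mentioned in the lesson is SiLU (`torch.nn.functional.silu`) and that `attention_3` stands for the token vector produced by the earlier self-attention step:

```python
import torch
import torch.nn.functional as F

# Assumed shapes: attention_3 is (300,), w_proj is (300, 400), w_down_proj is (400, 300).
attention_3 = torch.randn(300)   # stand-in for the self-attention output token
w_proj = torch.randn(300, 400)
w_down_proj = torch.randn(400, 300)

# Up-project, apply the nonlinearity, then down-project back to 300 dimensions.
mlp_out = F.silu(attention_3 @ w_proj) @ w_down_proj

print(mlp_out.shape)             # torch.Size([300])
```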

    [05:51 - 06:29] So we run this, and not a whole lot happens, but basically, this is the MLP. So you've actually now built a very simple, manual version of the MLP. Oh, I also realized that for people joining late, you might not have access to messages from before, so let me just send this Google Colab again. This is the quote-unquote solutions; it includes the completed code for everything that I'm writing. So if you get lost, or if you need to start from the middle, that notebook would be the place to start. I'm just writing out code here so you can see what this code looks like from scratch.

    [06:30 - 06:59] Okay. So this is our MLP. We had our attention before. You've now implemented a transformer manually, although we only implemented it for one token. You can imagine that we'd duplicate this code, make it more efficient, and batch it for more tokens to make a more production-ready transformer. But otherwise, this is the basic essence of a transformer. All right, so let me go back to my slides.
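As a rough illustration of the "batch it for more tokens" remark, the same MLP applies to every token independently if you stack the token vectors as rows of a matrix. This is only a sketch of that idea, not the notebook's code:

```python
import torch
import torch.nn.functional as F

tokens = torch.randn(5, 300)     # 5 tokens, each a 300-dimensional vector
w_proj = torch.randn(300, 400)
w_down_proj = torch.randn(400, 300)

# The matrix multiply applies the same MLP to each row (token) independently.
out = F.silu(tokens @ w_proj) @ w_down_proj

print(out.shape)                 # torch.Size([5, 300])
```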

    [07:00 - 07:23] Okay, all right. So why do we need nonlinearities? Well, let's see why we need nonlinearities. For this part, actually, I realized I don't have a demo; there's no code per se. What I actually want to do is doodle. In order to doodle, though... I hate my iPad.

    [07:24 - 07:25] Sorry. Give me a few seconds.