Modern-day transformer architectures
Get the project source code below, and follow along with the lesson material.
Download Project Source Code
To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, then feel free to reach out to the author in the course's Discord channel.
[00:00 - 00:15] Okay, let me pull up my slides again. All right, so I'm actually going to skip down to here.
[00:16 - 00:33] Let me talk briefly about the section I was going to present next. Really quickly, I was going to talk about three different kinds of attention. We've already covered self-attention in its vector form.
[00:34 - 00:49] There's multi-head, multi-query, and grouped-query attention. You may have heard these terms before; they're all different variants of attention. I'm actually going to skip over this, and we can come back to it if folks are interested. I want to make sure to give folks some time at the end to ask general questions.
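Even though the details are skipped here, a shape-level sketch may help. The following is a minimal NumPy illustration (all sizes and weights are made up for the example, not taken from the course's code): the only thing that changes between multi-head, grouped-query, and multi-query attention is how many key/value heads the query heads share.

```python
import numpy as np

# Shape-level sketch: the only difference between the variants is how many
# key/value (KV) heads back the query heads.
#   multi-head attention    (MHA): every query head has its own KV head
#   grouped-query attention (GQA): query heads share KV heads in groups
#   multi-query attention   (MQA): all query heads share a single KV head
seq_len, n_q_heads, d_head = 8, 8, 16

def attention(n_kv_heads):
    q = np.random.randn(n_q_heads, seq_len, d_head)
    k = np.random.randn(n_kv_heads, seq_len, d_head)
    v = np.random.randn(n_kv_heads, seq_len, d_head)
    group = n_q_heads // n_kv_heads      # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                  # which KV head this query head reads from
        scores = q[h] @ k[kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[kv]
    return out, k.nbytes + v.nbytes      # attention output plus KV-cache size

for name, n_kv in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    out, kv_bytes = attention(n_kv)
    print(f"{name}: output {out.shape}, KV cache {kv_bytes} bytes")
```

Fewer KV heads mean a smaller KV cache at inference time, which is the main motivation for MQA and GQA.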
[00:50 - 00:57] And we did have some additional topics to go through from our Q&A just now. So I'll make sure we get to that first.
[00:58 - 01:07] So, modern large language models actually look a little bit different from most of the figures you see online. A lot of the figures you'll see online look like the one on the right.
[01:08 - 01:17] This is what you'll find in any transformer 101 blog post or explainer, and that's because it came from the original transformer paper.
[01:18 - 01:33] Now, this diagram on the right is slightly outdated. What actually happens is the following: the right-hand half is what's called the decoder, and the left-hand half is the encoder.
[01:34 - 01:48] In modern LLMs, the encoder half is completely chopped off; it's no longer there. We don't make a distinction between the encoder and the decoder anymore. Instead, we just have one main path, and so what you'll often hear is that transformers are "decoder-only."
[01:49 - 01:55] And that just means that we only have one path on the right hand side here. So, luckily for us, that means the architecture is a lot simpler.
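As a rough, runnable picture of what "decoder-only" means, here is a minimal NumPy skeleton. The function names and sizes are illustrative stand-ins, not the course's actual code; the block is left as a pass-through placeholder here and is fleshed out later in the lesson.

```python
import numpy as np

# Skeleton of a decoder-only forward pass: one stack of identical blocks,
# no separate encoder branch. The stand-in functions are placeholders only.
def embed(token_ids, table):
    return table[token_ids]              # tokens -> vectors

def block(x):
    return x                             # stand-in for an (attention + MLP) block

def unembed(x, table):
    return x @ table.T                   # vectors -> scores over the vocabulary

vocab_size, d_model, n_layers = 100, 64, 4
table = np.random.randn(vocab_size, d_model)
tokens = np.array([3, 17, 42])

x = embed(tokens, table)
for _ in range(n_layers):                # the single main path, repeated
    x = block(x)
print(unembed(x, table).shape)           # (3, 100): one score per word, per position
```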
[01:56 - 02:11] So knowing that this is the diagram you see most often, I wanted to compare our diagram to theirs. We've talked about all the components in this diagram already. The first is the embedding: this is where you convert words into vectors.
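A minimal sketch of the embedding step, assuming a toy vocabulary size and a random table (in a real model the table is learned during training):

```python
import numpy as np

# Embedding: each token id indexes one row of a learned table of vectors.
vocab_size, d_model = 50_000, 512
embedding_table = np.random.randn(vocab_size, d_model) * 0.02   # learned in practice

token_ids = np.array([17, 3021, 9])          # output of the tokenizer for a short phrase
token_vectors = embedding_table[token_ids]   # shape (3, 512): one vector per token
print(token_vectors.shape)
```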
[02:12 - 02:22] Once you do that, the next thing is the positional encoding, which is also illustrated here with that little plus symbol. The positional encoding we saw before is a relative positional encoding.
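One widely used relative scheme in modern LLMs is rotary position embedding (RoPE); it is sketched here as an example of a relative encoding, and may or may not be the exact variant covered earlier in the course. Sizes are illustrative, and in practice the rotation is applied to the queries and keys inside each attention head.

```python
import numpy as np

def rope(x, base=10_000.0):
    """Rotary position embedding (a common relative scheme): rotate pairs of
    dimensions by an angle proportional to the token's position, so attention
    scores end up depending on relative distances between tokens."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # one rotation speed per pair
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # pair dimension i with dimension i + half
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)          # queries for 8 positions, head dimension 64
print(rope(q).shape)                # (8, 64), now position-aware; keys get the same treatment
```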
[02:23 - 02:31] The next component is the first part of the transformer block: the attention module. I mentioned before that there's something called multi-head attention.
[02:32 - 02:37] I won't talk about it in much detail, and you can ignore that distinction for now. Basically, this is self-attention.
[02:38 - 02:45] After the self-attention, you can see this arrow that sort of loops around. Just like how we illustrated skip connections before, this is a skip connection.
[02:46 - 02:54] And finally, the third part is the norm, illustrated here with the yellow box. This norm is the RMS norm that we mentioned before.
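A minimal implementation sketch of RMS norm, assuming the usual learned per-dimension gain (here just a vector of ones for illustration):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMS norm: divide each token's vector by its root-mean-square, then apply
    a learned per-dimension gain. No mean subtraction and no bias, unlike LayerNorm."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(8, 64)          # 8 tokens, 64-dimensional vectors
gain = np.ones(64)                  # learned parameter in a real model
print(rms_norm(x, gain).shape)      # (8, 64)
```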
[02:55 - 03:09] So we've talked about all three components in this step: the attention, the residual connection (or skip connection), and the RMS norm. We've also talked about everything in the next step, starting with the feed-forward network, which is the MLP.
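Here is a minimal sketch of how those pieces fit together into one decoder block, in the pre-norm arrangement that modern LLMs typically use. Everything is illustrative: single-head attention instead of multi-head, a plain ReLU feed-forward instead of the gated variants many models use, and random stand-in weights. The MLP half of the block uses the same skip-and-norm pattern described next.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def causal_self_attention(x, wq, wk, wv, wo):
    # Single-head self-attention with a causal mask: each token only sees the past.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return (w @ v) @ wo

def mlp(x, w1, w2):
    return np.maximum(0, x @ w1) @ w2   # feed-forward network (plain ReLU here)

def decoder_block(x, p):
    # Pre-norm arrangement: norm, then transform, then the skip connection adds x back.
    x = x + causal_self_attention(rms_norm(x), p["wq"], p["wk"], p["wv"], p["wo"])
    x = x + mlp(rms_norm(x), p["w1"], p["w2"])
    return x

d = 64
p = {name: np.random.randn(d, d) * 0.02 for name in ["wq", "wk", "wv", "wo"]}
p["w1"], p["w2"] = np.random.randn(d, 4 * d) * 0.02, np.random.randn(4 * d, d) * 0.02
x = np.random.randn(8, d)               # 8 tokens
print(decoder_block(x, p).shape)        # (8, 64): same shape out, so blocks stack cleanly
```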
[03:10 - 03:21] The MLP is likewise followed by a skip connection and, finally, the RMS norm. Now, there is one part that we didn't cover fully: we talked about nearest neighbors and what that step does conceptually.
[03:22 - 03:41] I didn't really walk through the code, but you have the core idea and the core concept for this part of the LLM: we're looking for the nearest neighbor to a point in a high-dimensional space. So here on the right-hand side, comparing against our figure on the left-hand side, is the transformer diagram that you see posted all over the web.
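To make that nearest-neighbor step concrete, here is a minimal sketch of the output head, with random stand-in weights: the final hidden vector is compared against one row per vocabulary word, and the highest dot product, i.e. the nearest neighbor under that similarity, gives the predicted next token.

```python
import numpy as np

# The output step as a nearest-neighbor lookup: compare the final hidden vector
# against one row per vocabulary word and pick the closest match by dot product.
vocab_size, d_model = 50_000, 512
unembedding = np.random.randn(vocab_size, d_model) * 0.02   # often tied to the embedding table
hidden = np.random.randn(d_model)                           # final vector for the last token

scores = unembedding @ hidden           # one similarity score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # softmax turns the scores into probabilities
print(int(np.argmax(scores)))           # id of the "nearest" word = predicted next token
```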
[03:42 - 03:54] And we've now talked about every component. It took us three or four hours to get through, but hopefully you now have a deeper understanding of what's going on here.