RMS norm


This lesson is part of the Fundamentals of transformers - Live Workshop course.

  • [00:00 - 00:00] Okay, yeah, it's a good point. Thanks, Ken.

    [00:01 - 00:04] Let's talk about QKV. And then for Jeff, we're on module.

    [00:05 - 00:08] I don't remember the numbers, unfortunately. So let me go look at the numbers.

    [00:09 - 00:14] We're on module six, module six of seven total. Okay.

    [00:15 - 00:21] And then yeah, so for QKV, let's go back to QKV. - I'm not seeing modules.

    [00:22 - 00:30] Anything past module five. Oh, actually, where are you looking for the modules?

    [00:31 - 00:37] - In the link that you gave me, I'm just following. - Oh. Yeah, good question.

    [00:38 - 00:47] Yeah, so actually I didn't have any more code for module six and seven, which is why. - Okay, so I shouldn't be seeing module six or seven.

    [00:48 - 00:53] - Yeah, exactly. - Okay. Will we be getting what you are doing here?

    [00:54 - 00:58] The end notebook? - Yeah, yeah, I can send this notebook out as well.

    [00:59 - 01:03] - Oh, okay. Okay. I'm just trying to follow along.

    [01:04 - 01:09] - Yeah, sounds good. Thanks for asking. - Yeah, that's good to hear.

    [01:10 - 01:11] The requirements.txt. Oh, I see.

    [01:12 - 01:18] My imports are just spread out across all of these cells. Let me just skim through really quickly.

    [01:19 - 01:24] So actually, okay, yeah. Torch, gensim, and okay, let me just type it here.

    [01:25 - 01:37] So sorry, I don't have a proper requirements.txt, but at least I can give you the library names. Gensim, NumPy, Torch, what else did I use?

    [01:38 - 01:46] Transformers, what else? I think that's pretty much it.

    [01:47 - 02:03] And if you don't have a local GPU, that's totally fine; you can just run this all on CPU as well. (silence) So it's worth noting that if you're running this locally instead of on Google Colab, downloading OPT-125M might take a little bit of time.

    [02:04 - 02:07] It is 250 megabytes. So that's worth noting as well, for sure.

    [02:08 - 02:16] Oh yeah, so sorry, let me go back to what Ken was asking. So for Q, K, and V, let's go look at that.

    [02:17 - 02:39] So I introduced Q, K, and V very mechanically, mostly because it takes a while to explain the intuition, but let's review what we talked about first and then I can expand on it. So first, we take our input vector, let's say we're taking "you", right?

    [02:40 - 02:49] "You" now is split, or not split, it's converted into three new vectors. It's converted into a query vector, a key vector, and a value vector.

    [02:50 - 03:00] So let's talk directly about what these vectors stand for. Query stands for, so query and key together tell you how two words are related.

    [03:01 - 03:06] The query is the word that provides context. This is the one that is doing the modification.

    [03:07 - 03:13] Key is the word that quote unquote receives context. It's the one that's being modified.

    [03:14 - 03:26] And so when you take the dot product of query and key together, that's when you understand the impact of one word on the meaning of another. So again, query gives the context, key receives the context.

    [03:27 - 03:46] Value is the word that you're actually taking the weighted average of. So once you determine the relationships between words and you now have the weights, which indicate how important each word is for understanding another, you then actually take a weighted average of all the values to then give you the final output for self attention.

    [03:47 - 03:57] In short, query gives context, key receives context and value is the actual, is the vector that you're taking a weighted average of. Yeah.
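
The three roles described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the workshop notebook's code: the 2-dimensional embeddings and the matrices `W_query`, `W_key`, and `W_value` are made-up values, and the usual scaling of the scores is left out. It follows the lesson's naming, where each word's query is dotted with the key of the word receiving context. Many references use the opposite labels (the receiving word's query against every other word's key), which is the same arithmetic with the two names exchanged.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings and projection matrices, chosen only for illustration.
embeddings = {"you": [1.0, 0.0], "are": [0.0, 1.0], "cool": [1.0, 1.0]}
W_query = [[1.0, 0.0], [0.0, 1.0]]
W_key = [[0.0, 1.0], [1.0, 0.0]]
W_value = [[2.0, 0.0], [0.0, 2.0]]

words = list(embeddings)
# Each input vector is converted into a query, a key, and a value vector.
queries = {w: matvec(W_query, embeddings[w]) for w in words}
keys = {w: matvec(W_key, embeddings[w]) for w in words}
values = {w: matvec(W_value, embeddings[w]) for w in words}

def attend(target):
    """Self-attention output for one word, using the lesson's convention."""
    # Every word's query (giving context) against the target's key (receiving it).
    scores = [dot(queries[w], keys[target]) for w in words]
    weights = softmax(scores)
    dim = len(values[target])
    # Weighted average of all the value vectors.
    output = [sum(wt * values[w][d] for wt, w in zip(weights, words))
              for d in range(dim)]
    return weights, output

weights, output = attend("cool")
print(dict(zip(words, weights)))
print(output)
```

The weights printed first are the query-key part of the computation; the second line is the weighted average of the values.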

    [03:58 - 04:04] Does that help, Ken, or others who thought this section was confusing? It is very confusing, by the way.

    [04:05 - 04:06] Yes, exactly. QK is the weighting.

    [04:07 - 04:12] It's the weights down here. I wanna take the weighted average.

    [04:13 - 04:18] So the weights here, so this I can actually just rewrite to be the following, I can write. Attention three is "you" value.

    [04:19 - 04:30] And then this is just "you" key. So actually this is a "you" key and then "you" query, sorry, "cool" key.

    [04:31 - 04:41] And then this is "are" value times "are" query, "cool" key. And then this is "cool" value times.

    [04:42 - 04:46] This is "cool" query. And so this is how it would look if I had just written them all together.

    [04:47 - 04:56] It's not entirely correct because I don't have the softmax. Right, so I have to make sure that these three weights all sum to one, but this is roughly the idea.

    [04:57 - 05:10] If I were to write them all on one line. Let me know if that's still confusing.
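
The written-out form of attention three, including the softmax step that the spoken version skips, might look like the sketch below. All of the vectors are hypothetical numbers picked for readability, not values from the notebook. The raw dot products do not sum to one, so they are passed through a softmax before being used as weights.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query, key, and value vectors for "you are cool".
you_query, are_query, cool_query = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
cool_key = [0.5, 2.0]
you_value, are_value, cool_value = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Raw weights: each word's query against the "cool" key.
raw = [dot(you_query, cool_key), dot(are_query, cool_key), dot(cool_query, cool_key)]

# Softmax makes the three weights positive and forces them to sum to one.
peak = max(raw)
exps = [math.exp(r - peak) for r in raw]
weights = [e / sum(exps) for e in exps]

# attention_3 = w_you * you_value + w_are * are_value + w_cool * cool_value
attention_3 = [
    weights[0] * you_value[d] + weights[1] * are_value[d] + weights[2] * cool_value[d]
    for d in range(2)
]

print("raw sum:", sum(raw))
print("weights sum:", sum(weights))
print("attention_3:", attention_3)
```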

    [05:11 - 05:26] Let's get questions. The question is: we compute w_ij, but we also compute w_ji. So isn't that redundant?

    [05:27 - 05:38] So that's actually why we have queries and keys separately. So it actually won't end up being, so basically what'll happen is let's say I have attention one, right?

    [05:39 - 05:56] So attention one is the "you" value times "you" query at "you" key, plus "are" value times "are" query at "you" key, and so on and so forth. Right, so because one of them is the query and the other one is the key, we actually won't repeat the same multiplication.

    [05:57 - 06:06] I confused myself, "you" key. Yeah, right, so you notice the key thing here is that the query and key pairs are all different.

    [06:07 - 06:19] All right, and then if I do the last one, it'll be roughly the same thing. Yeah, so it's almost redundant, but it's not quite because we have keys and queries separately.

    [06:20 - 06:27] But you're right though, if we didn't have queries and keys separately, if you just had queries, for example, then you would have redundancies. Yeah.
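
The redundancy question can be checked numerically with a small sketch. The query and key vectors here are hypothetical. With separate queries and keys, the score from word j's query to word i's key generally differs from the score in the other direction. If only queries were used, the score table would be symmetric and half of it would be repeated work.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def is_symmetric(table):
    """True when table[i][j] equals table[j][i] for every pair."""
    n = len(table)
    return all(table[i][j] == table[j][i] for i in range(n) for j in range(n))

# Hypothetical query and key vectors for three words.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
n = len(queries)

# Separate queries and keys: row i uses key i, column j uses query j.
query_key_scores = [[dot(queries[j], keys[i]) for j in range(n)] for i in range(n)]

# Queries only: the same vectors play both roles.
query_only_scores = [[dot(queries[j], queries[i]) for j in range(n)] for i in range(n)]

print("query/key symmetric:", is_symmetric(query_key_scores))
print("query-only symmetric:", is_symmetric(query_only_scores))
```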

    [06:28 - 06:36] Yeah, any other questions from anyone? Okay.

    [06:37 - 06:44] Okay. (mouse clicking) [ Silence ]