Attention Layer

- Why context is fundamental in LLMs
- Limits of n-grams, RNNs, embeddings
- Self-attention solves long-range context
- QKV: query–key–value mechanics
- Dynamic contextual embeddings per token
- Attention weights determine word relevance
- Multi-head attention = parallel perspectives
- GQA reduces attention compute cost
- Mixture-of-experts for specialized attention
- Editing and modifying transformer layers
- Decoder-only vs encoder–decoder framing
- Building context-aware prediction systems

  • [00:00 - 09:13] So we're going to go over attention: why context matters, some of the history of trying to add context to words and sentences, and then key-query-value. We'll cover traditional attention, grouped query attention, multi-head attention, and then mixture-of-experts attention. Plain key-query-value attention is what you get by default, but most modern architectures implement grouped query attention and multi-head attention, and some of the latest implement mixture of experts. You heard a lot about DeepSeek this past year, and part of the reason is that mixture of experts lets them climb the price-performance curve: generally speaking, the more parameters you have, the more intelligence the model has, but with mixture of experts you get much cheaper inference, so you can get better performance with a lower number of active parameters. The historical problem attention solves is context. Take the example we keep coming back to, the word "bank". If I'm a complete stranger and I just say "bank", you don't know what it means; the same goes for "cool". You need to know what comes before and after the word to understand it: is it a river bank, a financial institution, an embankment or slope? It's the exact same token with completely different meanings, and the meaning depends on the surrounding words. The earlier approach was bag-of-words, and you've already implemented a neural-network model of that kind. The problem is that it does okay, but it has no real understanding of word order, it has a fixed-size context window, and it can't generalize across long-range dependencies. Those historical systems also don't understand whether the text is French or German, they can't answer questions, and they don't do any of this intelligently. Word2Vec and GloVe are embedding models from that previous generation: one embedding per word regardless of context, so "bank" is the same vector in every case and is never dynamically adapted to the sentence. What attention does is give much more information to the token's embedding, so the model can understand the context. And what is context, exactly? It's all the words before and after. If you think about a book, it's the sentence, the paragraph, the chapter, the entire book. This is also part of what you can do with prompt engineering: style transfer, for example writing something in the style of Edgar Allan Poe or Emily Dickinson, or writing a screenplay like Christopher Nolan.
The only way the model can understand global context is that it ingests all of the script data, every sentence, every paragraph, and can establish that you're not just talking about a bank as a financial institution: you're talking about a bank inside a movie about the Joker robbing a financial institution, inside a Christopher Nolan film. So when you write, "Hey, write me a screenplay about a financial institution being robbed by the Joker, in the style of Christopher Nolan," it understands Nolan's style and the whole surrounding context, which is very different from, say, old English-style writing about financial institutions by Adam Smith. What you had previously were the earlier generations. Remember the three generations we talked about: machine learning, deep learning, and transformer-based systems. Word2Vec and GloVe belong to the machine-learning era: they find word meaning, but they're only slightly aware of context. Then around 2015 you had the rise of deep learning, the first neural-network sequence models, which did better sequential modeling but were slow. And then finally you have transformers, though it wasn't until GPT that transformers were used in a way that was really useful. The GPT architecture is specifically an autoregressive, decoder-style architecture, and the fact that it was combined with reinforcement learning from human feedback is what made ChatGPT useful. You'll occasionally see people talk about RNNs and LSTMs. These were the 2015-era neural networks, and the problem with that architecture is that it struggles to learn global context, meaning being able to tell that this is a Christopher Nolan screenplay, or that this is old English; you really have to understand the global context of the book, not just a single sentence. So when we talk about key-query-value, and a lot of our exercises walk through how key, query, and value get multiplied within a sentence, you have to imagine it scaled across an entire book, across the entire internet's worth of data. It's a very scalable architecture that really allows global context, which RNNs and LSTMs turned out not to handle very well. Then in 2017 you have the paper called "Attention Is All You Need." It introduced self-attention, and specifically it removed a component of RNNs called recurrence. The attention mechanism had actually been invented in an earlier paper, but the transformer paper said you don't need anything else: you just need this one component, self-attention, and you get global dependencies. It also enables parallel computation. We talked about this before, so let me go back to that slide.
What you have, as a reminder, is data that goes into a token, which goes into an embedding, and then that embedding gets transformed: it gets passed through self-attention and you end up with an embedding with additional context. So what the model is dealing with is still an embedding, but it matters whether a transformer has been applied to it or whether it's just a plain embedding. It's important to understand that this sequence of steps keeps adding more and more information. Data to tokens is like a traditional dictionary lookup; tokens to embeddings adds multiple layers of information; and then the transformer step adds global context to the embeddings. It's an enrichment process. Remember, the embedding is a huge tensor, and its dimensions represent the different aspects of meaning of the word you mention. So you have this big tensor, and what ends up happening is that a token first maps to one embedding, and then after the transformer that one embedding effectively becomes multiple context-dependent embeddings. Why? Because a financial bank needs to be closer in embedding space to money, loan, and ATM, while a river bank needs to be closer to water, river, and shore. With this transformation, one embedding gets turned into multiple contextual embeddings, and this is a key difference from the previous generation: with Word2Vec and GloVe there was just the one embedding. To summarize: just like "bank" can refer to a financial bank or a river bank, a "hidden layer" can refer to a neural-network layer or to the embedding matrices, depending on context. And this is the difference between tokens and embeddings: tokens are simply dictionary entries, while embeddings have many hidden dimensions that represent information.
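To make the enrichment pipeline concrete, here is a minimal sketch (the vocabulary, dimensions, and values are made up purely for illustration) of the first two steps: mapping raw text to token ids and looking those ids up in an embedding table. The contextual step that follows is what the attention code later in the lesson adds.

```python
import torch

# Step 1: data -> tokens. A toy "dictionary" mapping words to ids
# (real tokenizers use subwords, but the idea is the same lookup).
vocab = {"the": 0, "bank": 1, "was": 2, "lined": 3, "with": 4, "trees": 5}
sentence = "the bank was lined with trees".split()
token_ids = torch.tensor([vocab[w] for w in sentence])

# Step 2: tokens -> embeddings. One static vector per token id,
# the same vector for "bank" no matter what surrounds it.
embedding_dim = 8
embedding_table = torch.randn(len(vocab), embedding_dim)
static_embeddings = embedding_table[token_ids]   # shape: (6, 8)

# Step 3 (done by self-attention, shown later in the lesson): these static
# vectors get mixed with their neighbours, so "bank" ends up near water/trees here.
print(static_embeddings.shape)
```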

    [09:14 - 09:54] For example, "cool" will have dimensions representing temperature, positive sentiment, social approval, or other things. When we add transformers, we add contextual meaning. "The weather is cool" produces an embedding whose dimensions reflect that temperature has become much more prominent; "this new gadget is cool" produces dimensions emphasizing social appreciation, innovation, and positive sentiment. The transformer's attention mechanism shifts the embedding to highlight the interpretation that is relevant given the surrounding words. So how does it do that?

    [09:55 - 11:04] So what is a transformer exactly? The core data structure is what's known as query-key-value: three matrices calculated from the embedding, used to produce the transformed, contextual embedding. So what is a query vector, a key vector, a value vector? If you look at a sentence, a paragraph, or a book, you have to break down all the tokens and compute relative associations between all of them; you have to relate all the information to all the other information, and the data structure that lets you do that is query-key-value. Say, for example, you have the phrase "furry cat" and you want to relate the two words. A historical machine-learning approach would just say "furry" is followed by "cat" some small fraction of the time; when you looked at the n-gram or bigram language model, you saw exactly that kind of probability distribution of statistics. The problem is that it's too simple.

    [11:05 - 13:15] So what was invented instead is a system where, instead of a single probability, every single word computes a query, a key, and a value, and what these matrices represent is the relationship and attachment to other words, both locally and globally. The way to think about query-key-value is: the query vector is "what am I looking for?", the key vector is "what do I have to offer?", and the value vector is the actual information I can give. Let's use an analogy to make it a little simpler. Say you're searching for an Italian restaurant. The query is what you type into Google Maps or Yelp: "I want an Italian restaurant that's open late." The key you can think of as a summary. On Google Maps or Yelp you see a rating, a few preliminary reviews, some basic information, a small summary that proxies for the detailed information you actually want: it's four and a half stars, the name tells you it's an Italian restaurant, but you don't have much more detail. So the key is like summary information standing in for the real underlying information. You query for an Italian restaurant that's open late; the key says "I'm an Italian restaurant, we stay open until midnight"; and the value is the real details, the location, the ambiance, the menu, the much richer information. What the model does is go through the entire corpus of text, compute the query, key, and value for every token, and calculate the relationship between every word and every other word.

    [13:16 - 14:33] If you remember the n-gram neural-network model, we had to go through every single n-gram and calculate its relationship with everything else. This is the same idea, except it uses query-key-value as the core data structure rather than a probability distribution. Once you decide a restaurant is relevant, say you're searching for an Italian restaurant that's open until midnight, you then gather the actual information from the value. First you get a high-level score for each candidate against your search: the Italian place open until midnight scores nine points against your query, the sushi place open until 10 p.m. scores two points, the hot dog stand scores five points. The key is a compact descriptor of the value. You multiply the query and the key, and you're asking, "are you related?" Is "furry" related to "cat"? Is "cool" related to temperature? It lets the model quickly judge whether two things should be related or not.
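The restaurant scoring above can be reproduced with a dot product: the query and each key are vectors, and a larger dot product means a better match. The feature names and numbers below are invented just to roughly mirror the 9 / 2 / 5 example from the lesson.

```python
import torch

# Query: "Italian restaurant open until midnight", encoded as made-up features
# [is_italian, open_late, walk_up_convenience]
query = torch.tensor([3.0, 3.0, 1.0])

# Keys: compact descriptors of each candidate (also made up)
keys = {
    "Italian place, open till midnight": torch.tensor([2.0, 1.0, 0.0]),
    "Sushi place, open till 10 p.m.":    torch.tensor([0.0, 0.7, 0.0]),
    "Hot dog stand":                      torch.tensor([0.0, 0.0, 5.0]),
}

# query . key = relevance score; the value (full details) is only fetched
# for the candidates that score highly.
for name, key in keys.items():
    print(name, float(query @ key))
```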

    [14:34 - 17:39] That gives a score underneath, and then the model multiplies by the value to get the real information. So here is attention in code. First you import the relevant libraries and create an example embedding for each token, "you", "are", "pretty", or whatever it is you want to embed. Then you initialize random matrices for query, key, and value. The way you transform an input vector into its query, key, and value is simply to multiply the token's vector by each of the underlying matrices. For example, to process "you", you pass it in as x: x times the random query matrix gives you the query, x times the random key matrix gives you the key, x times the random value matrix gives you the value, and you return the query, key, and value. That's the function here: you pass the "you" token into the embedding function, the function multiplies the token's vector by the random query matrix, the random key matrix, and the random value matrix, and you get the query-key-value representation for each token. Then you want to see the relative relationship between the different words: how is "you" related to "are"? You want a score for how related things are going to be, for example whether "pretty" is related to "are", and you get a score for how they relate. Remember, we spent a lot of time on matrix multiplication in the very first week, simply because it's crucial: every single step here is matrix multiplication. Here we're doing simple matrices, but when you actually calculate everything at scale, you're just doing tensor multiplication. So what you see here is the relative relevance of each token to the others, and then what you want is the attention for "pretty" with respect to all other tokens. If you look back at the bigram neural-network model we created, remember the bigram matrix: B follows A some fraction of the time, C follows A some other fraction. What you're doing here is very similar, figuring out how much everything is related to everything else, except you're not using probabilities, you're using query-key-value.
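A minimal sketch of the projection step just described, assuming (as the lesson's code does) that the three matrices are initialized randomly rather than learned: every token's embedding is multiplied by a query, a key, and a value matrix to produce that token's q, k, and v. The dimension and the function name are illustrative.

```python
import torch

d = 8  # embedding dimension (illustrative)

# Random projection matrices, standing in for learned weights.
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

def embed_qkv(x):
    """Turn one token embedding x (shape: (d,)) into its query, key, and value."""
    q = x @ W_q
    k = x @ W_k
    v = x @ W_v
    return q, k, v

# e.g. the token "you" from the example sentence "you are pretty"
you = torch.randn(d)              # its (random, illustrative) embedding
q_you, k_you, v_you = embed_qkv(you)
```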

    [17:40 - 19:08] So you want to multiply every token against every other token, and you want to understand how much "you" is related to "pretty". In the standard verbiage, how much one token is related to another is how much it "attends to" or "pays attention to" it. Just like when you're walking down the street: are you paying attention to the Italian restaurant? The attention model is asking the same kind of thing: is "furry" paying attention to "cat", or is "furry" related to some other concept? So the first thing you do is calculate the attention of each token with respect to all the other tokens. After that you get a score: how much is "you" related to "pretty", how much is "are" related to "pretty", how much is "pretty" related to itself. Those are raw scores. And remember, when we do a lot of these things we have to process them so the GPU can handle them correctly, so what we have here is a softmax: we take the raw scores and apply a softmax to them. Then you calculate the final attention output for the token "pretty", and it's simply a weighted sum of the values of all the other tokens.
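Continuing the sketch above for the sentence "you are pretty" (embeddings and matrices are still random, for illustration only): the raw attention scores for "pretty" are the dot products of its query with every token's key, softmax turns those scores into weights, and the output for "pretty" is the weighted sum of all the value vectors.

```python
import torch

d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

# Illustrative embeddings for "you are pretty"
tokens = {name: torch.randn(d) for name in ["you", "are", "pretty"]}
qkv = {name: (x @ W_q, x @ W_k, x @ W_v) for name, x in tokens.items()}

q_pretty = qkv["pretty"][0]

# Raw scores: how much "pretty" attends to each token (including itself)
raw_scores = torch.stack([q_pretty @ qkv[name][1] for name in tokens])

# Softmax turns the raw scores into weights that sum to 1
weights = torch.softmax(raw_scores, dim=0)

# Output for "pretty": weighted sum of every token's value vector
values = torch.stack([qkv[name][2] for name in tokens])
pretty_contextual = weights @ values     # shape: (d,)
```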

    [19:09 - 20:56] The representation of "pretty" will be a combination of information from "you", "are", and "pretty", weighted by their attention scores. What you get is an actual score after you multiply the query of "you" by the key of "pretty", the query of "are" by the key of "pretty", and the query of "pretty" by the key of "pretty"; that gives you the attention scores. The attention mechanism works because it determines the relationship between the different words in a sentence. So when we have "cats and dogs are animals", it's passed through the self-attention mechanism, which evaluates how each word, like "animals", relates to each of the others, for example linking "animals" to "cats" and "dogs". The linking mechanism, how "animals" is related to "cats" and "dogs", is through query times key, the exact calculation I mentioned before. "These animals are friendly" is also passed through the self-attention mechanism: the model looks at every single word and how it relates to the others, linking "these" to "animals" and understanding that "cats" and "dogs" are also referenced by "these". Understanding that helps the model grasp the context and meaning across different parts of the sentence. It's just another way of reinforcing that query-key-value is the mechanism through which all these things relate. We saw it in the code: once you calculate the query, key, and value of everything, you calculate a softmax, and then the output is the weighted sum of the values.

    [20:57 - 24:58] So what you have here is an input token, which you turn into a vector; then each token produces a query, a key, and a value; then you compute the queries against the keys, all relative to each other; then you calculate a softmax over all of those scores; and then you take a weighted sum of the values. Now you have much more specific information that captures the nuances of the sentence, the paragraph, and the book. If you're calculating this for a book, obviously you have to calculate "you", "are", "pretty" relative to every single other token in it, and that's how the model is able to shift and understand the context. When we talk about the transformer, it's two components: one is self-attention, and the other is the multilayer perceptron. Part of why this query-key-value system works is that it's calculated across a lot of data, and when you calculate it across a lot of data, it's able to understand the context within the word, the sentence, and the sequence. Context is what lets us resolve ambiguities in language. When a prompt comes in, say "I took the money from the financial bank and deposited it by the river bank", it's put through the same attention model, and the model can work out that one occurrence most likely refers to a river bank and the other most likely refers to a financial bank. In the sentence "the bank was lined with beautiful trees", self-attention helps the model recognize that this bank refers to the river bank, given the clue of the trees. If I just gave you "the bank" on its own, you as a human wouldn't be able to tell what I'm referring to. You really need the full sentence: "the bank was lined with beautiful trees" most likely refers to a river bank, because it relates to nature and trees. Technically it could still be a financial bank lined with beautiful trees, but the combination of the words lets you judge what's most likely, and that is exactly what query-key-value lets the model calculate. So you have a system where the embedding model sends in the information, you have the self-attention mechanism, and then you have the multilayer perceptron. Each word in the sequence looks at every other word to understand what informs it, and the model assigns different attention weights to each word depending on how relevant it is to the current word: more informative words get higher attention. "I'm looking for an Italian place, or maybe a sushi spot, and it needs to be open until midnight. Also, I love cozy decor." Phrases like "Italian place", "sushi spot", and "open until midnight" all connect. The attention mechanism recognizes these requests and weights them more heavily than details that don't match, for example "Mexican breakfast". So we went over the code: it computes query, key, and value, and then calculates the attention score using the dot product. When you looked at the bigram and n-gram models, the core thing was probability; here, we're using query-key-value to capture the relative relationship between everything.
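The per-token computation above can be written once for a whole sequence in matrix form, which is closer to what the lesson's code does: stack the embeddings, project to Q, K, and V, score every pair with one matrix multiply, softmax each row, then take weighted sums of the values. This is a sketch with random matrices; the division by sqrt(d) follows the standard scaled dot-product formulation (the lesson's reference code applies a similar scaling factor).

```python
import math
import torch

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d) matrix of token embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / math.sqrt(K.shape[-1])      # (seq_len, seq_len) pairwise scores
    weights = torch.softmax(scores, dim=-1)        # each row sums to 1
    return weights @ V, weights                    # contextual embeddings, attention map

d, seq_len = 8, 5                                  # e.g. "cats and dogs are animals"
X = torch.randn(seq_len, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)                       # (5, 8) (5, 5)
```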

    [24:59 - 27:06] We went over this analogy before, but effectively: a query like "Italian restaurant" represents what you want to look for; a key is a compact descriptor of what a word provides. When we computed the attention mechanism, it was mostly multiplying the queries and keys together. The value is the full detail, fetched once we determine the match is good. We also went over this: the attention mechanism compares the query to every single key and calculates a score at the end. The key succinctly signals what the value holds, and once the model sees that a key aligns well with the query, it fetches the relevant value. Under the hood, the matrices for key and value can be different, but if there's good key-to-query alignment, that value is emphasized. Now, there are actually different attention mechanisms: standard single attention, grouped query attention, and multi-head attention. In standard attention, each token has to figure out its relationship with, or how to attend to, every other token in the sequence. When you see "attend", just think "relationship". A sequence can refer to the book, to a chapter, or to the prompt. The attention mechanism is computed for each query-key pair. You're all programmers, so you can see that if everything has to be multiplied against everything else, this approach can be very computationally expensive, especially for long sequences, because it calculates a score for every single token; you can't take anything for granted. Think of the earlier example: "I went to the bank to withdraw my money, and then I went to the river bank to hide it."

    [27:07 - 28:16] To understand that, you can't just see "bank" and assume one meaning from the vocabulary; you have to calculate every occurrence of "bank" against every other token's score, typically within that document, within that web page, and this can be very expensive. That's the traditional example. In the traditional attention mechanism, every word in the sentence attends to (again, has a relationship with) every other word, producing a dense matrix. In "banks handle deposits for customers", "banks" attends to "handle", "deposits", "customers"; "handle" relates to "banks", "deposits", "customers"; "deposits" does the same, and so on. Each word attends to all the others, even when some of those words are barely relevant. What grouped query attention does is modify this approach by dividing the queries into groups.
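A quick, illustrative way to see why this is expensive: full attention builds a score for every pair of tokens, so the score matrix has seq_len × seq_len entries, and memory and compute grow quadratically with sequence length. The sizes below are arbitrary.

```python
import torch

d = 64
for seq_len in (128, 1024, 4096):
    X = torch.randn(seq_len, d)
    scores = X @ X.T                      # (seq_len, seq_len) pairwise scores
    print(seq_len, "tokens ->", scores.numel(), "pairwise scores")
```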

    [28:17 - 29:28] The primary goal is to reduce the computational load and improve the efficiency of the model. Instead of treating each query independently, the model groups queries into smaller subsets, and these groups are processed together, which reduces the number of attention calculations. In the traditional model, say you have a web page, you have to calculate every single word's query, key, and value against every other word, which is very expensive. Grouped query attention says: look at the queries and group them. To use the restaurant analogy, "I'm looking for a restaurant at 10 p.m." and "I'm looking for a restaurant at midnight" get grouped together into smaller subsets. Within each group, the queries share attention scores, which leads to more efficient computation: the model computes attention scores for a representative subset of the queries in the group and applies those scores to the entire group of queries. The number of required operations is reduced, and this is especially beneficial for long sequences, where the quadratic complexity of the attention mechanism becomes a significant problem.
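One common way this grouping shows up in practice (for example in models described as using GQA) is across attention heads: several query heads share a single key/value projection, so far fewer key and value computations are needed. Below is a minimal sketch of that head-sharing version with made-up sizes and random weights; it is an illustration of the idea rather than the lesson's exact formulation.

```python
import math
import torch

d, seq_len = 64, 16
n_q_heads, n_kv_heads = 8, 2          # 4 query heads share each key/value head
head_dim = d // n_q_heads

X = torch.randn(seq_len, d)
W_q = torch.randn(d, n_q_heads * head_dim)
W_k = torch.randn(d, n_kv_heads * head_dim)   # far fewer K/V parameters than full multi-head
W_v = torch.randn(d, n_kv_heads * head_dim)

Q = (X @ W_q).view(seq_len, n_q_heads, head_dim)
K = (X @ W_k).view(seq_len, n_kv_heads, head_dim)
V = (X @ W_v).view(seq_len, n_kv_heads, head_dim)

outputs = []
for h in range(n_q_heads):
    kv = h // (n_q_heads // n_kv_heads)       # which shared K/V head this query head uses
    scores = Q[:, h] @ K[:, kv].T / math.sqrt(head_dim)
    weights = torch.softmax(scores, dim=-1)
    outputs.append(weights @ V[:, kv])

out = torch.cat(outputs, dim=-1)              # (seq_len, d) contextual embeddings
```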

    [29:29 - 30:34] So imagine that instead of a bank handling every customer's request individually, the bank organizes customers into groups, and within each group the customers are served by specialized tellers who handle those specific requests. The grouping reduces the overall burden by letting fewer specialists handle each set of queries more efficiently. Within the loan queue, all customers might share a set of common screening questions, for example what the collateral for the loan is. The loan officers can agree on a common way to evaluate those parameters, rather than each teller separately calculating the same metrics; they share a standardized evaluation score. That's grouped query attention: queries within the same group share attention scores. It calculates attention for a representative subset and then applies that score to all queries in the group, just as all loan customers are evaluated through a shared process. It's much faster and more efficient. So how does that work in practice? We can group related concepts.

    [30:35 - 32:50] For example, in a bank's context you might group "loans", "interest", and "mortgage" under a loan-services group, and "checking", "savings", "deposit", and "withdrawal" under a deposit-services group. So group one is loan services (loans, mortgage, interest), group two might be connector terms, structural words, and so on. By allowing each group to handle its internal attention, we reduce the number of connections the model has to process. Instead of one large all-to-all matrix, we have smaller group-to-group or within-group matrices, making attention more efficient. Group one attends over loans, mortgage, and interest: loans is related to mortgage, to interest, and to itself; mortgage is related to loans and interest; interest is related to loans and mortgage. Again, when you see "attend", just think "relationship". Now for multi-head attention. For example, you might have a family that requires no loud music and coloring books for the children, or you might go on a date and want a romantic ambiance with dim lighting. Multi-head attention is simply having different focuses; it allows the model to have different preferences, so to speak. Multi-head attention weighs different parts of the input, and you can think of each head as a different focus: it can be ambiance, it can be price, it can be dietary, each a different perspective. These perspectives are more nuanced and encompass more than what a single key would capture. A key might carry location data and opening hours (it's open until 10 p.m.), whereas here you could also express whether the restaurant is Michelin-starred or not. It expresses more information, more expressiveness. So once you have the self-attention mechanism, grouped query attention is simply an optimization, and multi-head attention makes it more expressive.

    [32:51 - 34:14] How it works is that each head has its own query-key-value system. Each reviewer, that is, each head, independently checks the menu, hours, and location; each head checks all the different keys and determines relevance relative to its own focus. Then the combined outcome merges the separate opinions into one comprehensive recommendation. For example: "I want an Italian restaurant with vegetarian options, open until midnight, near downtown, with a cozy vibe." Head one focuses on the hours (open until midnight) and the location (near downtown). Head two focuses on whether it's genuinely Italian and has good menu variety. Head three zeroes in on ambiance and checks the user reviews. By combining all these different insights you get a thorough evaluation of which restaurant meets all the criteria. That's just an analogy for what heads are. In reality, what you're doing is comparing "cat" versus "furry", "cool" versus other concepts, and trying to encapsulate much more meaning and information; each head captures a different aspect of a complex meaning. Each head can attend to, or pay attention to, a unique set of details, and by merging these viewpoints you get a well-rounded score of how suitable a restaurant is for your needs.
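A minimal sketch of multi-head attention under the same illustrative assumptions as before: each head gets its own (here random) query, key, and value projections, attends independently over the sequence, and the heads' outputs are concatenated into one richer, more expressive representation per token.

```python
import math
import torch

d, seq_len, n_heads = 64, 16, 4
head_dim = d // n_heads
X = torch.randn(seq_len, d)

heads_out = []
for _ in range(n_heads):
    # Each head owns its own projections and forms its own "perspective".
    W_q, W_k, W_v = (torch.randn(d, head_dim) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(Q @ K.T / math.sqrt(head_dim), dim=-1)
    heads_out.append(weights @ V)

# Concatenate the heads' views into one contextual embedding per token.
out = torch.cat(heads_out, dim=-1)            # (seq_len, d)
```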

    [34:15 - 35:43] The final thing we want to cover is the mixture of experts. A mixture of experts allows different expert sub-models within a larger model. Each expert becomes skilled at a particular type of task or pattern, and instead of one massive model trying to do everything, each smaller expert tackles what it's good at, improving both speed and accuracy. As you add more experts, you grow capacity in a modular way. Imagine a restaurant with multiple specialized chefs: one for Italian, one for sushi, one for dessert. When an order comes in, the gating mechanism, the head chef, routes it and decides which specialized chef is best equipped to handle that order. Each chef focuses on a specific cuisine, so complex orders are fulfilled quickly and to a higher standard. The rest of this is for reference, and it's what's already in the code; it's simply a summary of everything. It covers importing the libraries, defining attention with query, key, and value, and an example embedding. Then you combine all the embeddings into a single tensor and add a batch dimension, so it has the shape batch size by sequence length by embedding dimension, and you can use the same tensor for the query, key, and value.
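Going back to the gating idea described above (the "head chef"), here is a minimal sketch assuming a token-level router that sends each token to the single best expert; real MoE layers typically route each token to a small top-k set of experts and mix their outputs, so this is an illustration of the routing idea rather than any particular model's implementation.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Illustrative mixture-of-experts layer: a router picks one expert per token."""
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d, n_experts)                 # the "head chef"
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )

    def forward(self, x):                                     # x: (seq_len, d)
        choice = self.router(x).argmax(dim=-1)                # expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])                   # only that expert runs
        return out

moe = TinyMoE(d=32)
tokens = torch.randn(10, 32)
print(moe(tokens).shape)                                      # (10, 32)
```

Because only the chosen expert runs for each token, the number of parameters actually used per token stays small even as total capacity grows, which is the cheaper-inference point made earlier about mixture-of-experts models.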

    [35:44 - 36:46] Then you call the self-attention function to compute the output as well as the attention weights, and you print the attention weights for each word, so you can see how much each word pays attention to the others. This is the self-attention system we have in the code: you have the query, key, and value, and you calculate the attention scores in its own function. The first thing you do, given the query, key, and value, is calculate all the scores, then apply a scaling factor; once you have the scores, you apply the softmax, and then you multiply the attention weights by the values to get the weighted output. Once you have that function, you take an example sentence, "cats and dogs are animals", and then you go in and calculate the whole thing.
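Here is a hedged reconstruction of what the reference code appears to do (the names, dimensions, and random embeddings are guesses for illustration): stack the word embeddings into one tensor, add a batch dimension, use that same tensor as query, key, and value, and print each word's attention weights for "cats and dogs are animals".

```python
import math
import torch

words = ["cats", "and", "dogs", "are", "animals"]
d = 8
embeddings = torch.randn(len(words), d)            # illustrative embeddings

x = embeddings.unsqueeze(0)                        # add batch dim: (1, 5, 8)

def self_attention(query, key, value):
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.shape[-1])
    weights = torch.softmax(scores, dim=-1)        # attention weights per word
    return weights @ value, weights

# The same tensor is used for query, key, and value (hence "self"-attention).
output, attn_weights = self_attention(x, x, x)

for i, w in enumerate(words):
    print(w, attn_weights[0, i].tolist())          # how much w attends to each word
```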

    [36:47 - 38:00] What you have here is: you turn the words into a sequence of embeddings, you use the same embeddings for the query, the key, and the value, you apply the attention mechanism, and then you print out the weights. It's not unlike the code I just showed you; it's just a couple of steps of matrix multiplication. That's basically it. The homework is the following: you take the original n-gram neural network you implemented last time, and you implement the n-gram language model with no attention (that part is the same as before), you prepare the training data, then you add an n-gram model with self-attention, then a mixture of experts with self-attention, and then you train them and compare the results. We have the answer in a separate document. What we're doing is building a continuation: you have the n-gram model, then the neural network you implemented, then you attach self-attention inside it, and then you attach a mixture of experts.

    [38:01 - 38:47] Part of the reason we want to introduce you to these models is that mixture of experts is actually what DeepSeek and Mistral use. A lot of the state-of-the-art models with lower parameter counts but similar performance use a mixture of experts. We're introducing this concept because, as you're selecting different models, you'll be able to say, "oh, this one uses a mixture of experts," or "this one uses a different attention mechanism." Once you understand the query-key-value system, you'll also see models out there with flash attention, which are further optimized for certain things, and you'll be able to understand other models that may have different architectures or different mechanisms.

    [38:48 - 39:11] So to summarize, the key thing is query-key-value. Query-key-value is the core system here, and this is the main unrolled version of query-key-value; from this simple system, all the similarities and relationships between words are calculated.