Neural Network Fundamentals
- Feedforward networks as transformer core
- Linear layers for learned projections
- Nonlinear activations enable expressiveness
- SwiGLU powering modern FFN blocks
- MLPs refine token representations
- LayerNorm stabilizes deep training
- Dropout prevents co-adaptation overfitting
- Skip connections preserve information flow
- Positional encoding injects word order
- NLL loss guides probability learning
- Encoder vs decoder architectures explained
- FFNN + attention form transformer blocks
[00:00 - 00:38] So the goal is to understand the components of a neural network, understand the architectures, and understand the components of a transformer. Neural networks were invented in the 1950s and 1960s, I forget the exact date, but researchers built an artificial neuron with an activation function, which we covered in a previous lecture, and then an AI researcher named Minsky essentially killed the entire field for a while. Let me explain, starting with traditional machine learning. If you remember, we have three generations of machine learning: classic machine learning, deep learning, and transformer-based AI.
[00:39 - 01:25] Traditional machine learning, the kind you hear about when you take a Coursera class on machine learning, tackles simpler problems. You have a bunch of red balls on the left side and a bunch of green balls on the right side, and you draw a dotted line between them; that's known as a linear decision boundary. You can also draw a nonlinear decision boundary, but that's about as advanced as it gets: fairly simple nonlinear systems. So why do neural networks have so many layers? Because researchers had to solve individual engineering problems. The early neural networks couldn't solve something called the XOR problem.
[01:26 - 03:16] XOR is a simple logical function, like AND and OR, and a neural network with one linear layer simply couldn't distinguish the XOR outputs, the combined green and red points in the diagram. Minsky and Papert showed in 1969 that the perceptron couldn't solve the XOR problem, and that result helped cause the first AI winter. If you go through the history, there have been multiple times where AI got completely stuck for 10 or 15 years. As a result, people went back to linear regression, support vector machines, logistic regression, and other models with mostly linear decision boundaries. It wasn't until 1986 that Rumelhart, Hinton, and Williams introduced backpropagation. The forward pass is where you identify something as, say, a cat; the backward pass is where you use a loss such as cross-entropy to measure that you labeled the cat incorrectly, and then change the weights to reflect that learning. So a lot of these layers come from engineering tinkering from the 1960s to 2025, roughly 60 years, with one of these pieces figured out every 10 or 15 years. There's more attention on the field now that transformer-based AI is a real thing and resources have poured in, but for a long period progress only came every decade or so.
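To make the XOR point concrete, here is a minimal sketch (assuming PyTorch; the layer sizes, learning rate, and step count are illustrative, not from the lecture's code): a single linear layer cannot separate XOR, but adding one hidden layer with a nonlinearity lets a tiny network learn it.

```python
import torch
import torch.nn as nn

# XOR truth table: no single straight line separates the 0s from the 1s,
# so a one-layer linear model fails, but a small MLP with a nonlinearity works.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(
    nn.Linear(2, 8),   # hidden layer
    nn.Tanh(),         # the nonlinearity is what makes XOR learnable
    nn.Linear(8, 1),
    nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

for step in range(2000):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()

print(model(X).round().squeeze())  # typically converges to [0., 1., 1., 0.]
```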
[03:17 - 04:30] With a neural network, you have a set of inputs, a set of outputs, and hidden layers in between. Part of what they're trying to address is nonlinear understanding: if you're trying to identify a cat or a human, you have to understand a hierarchy of knowledge, and it's the hidden layers that capture all the nuances of that knowledge. By stacking hidden layers you can approximate any function. This is known as the universal approximation theorem, which is just a complicated way of saying that any squiggly line you can draw, whether it's straight or curved, the network can represent. So it can solve nonlinear problems, but you have to stack linear and nonlinear layers inside it. What is a linear layer? It's a simple matrix multiplication: you multiply the input by a weight matrix and then add a bias. That's a linear layer.
[04:31 - 05:35] In Python (PyTorch) it's nn.Linear, a module you simply call. A linear layer maps an input vector, for example the features of an image or a word, to a new vector in another space. Think of it as a lens or projection that emphasizes certain directions and suppresses others; it learns which combination of inputs matters for a specific task. Why is a linear layer needed? Imagine you're baking in the kitchen: you have several ingredients, flour, eggs, sugar, and you want to combine them in a specific way to create a dish. The ingredients are the input data, the features in the dataset or the activations from the previous layer of the network. The recipe, the transformation, is how much of each ingredient to mix and in what proportions to create the final dish; that's the weight matrix in a linear layer, and it determines which ingredients contribute to the final outcome. The bias is the optional extra ingredient.
[05:36 - 06:21] For example, you may want to add an extra dash of spice or sugar to your dish, independent of the other ingredients. That's the bias: a constant added to the weighted sum to adjust the final outcome. After you follow the recipe and mix the ingredients, you get the final output, a transformed version of your input. Something like this diagram, with multiple hidden layers, is a composition of linear and nonlinear layers inside; a linear layer is simply nn.Linear, and we'll talk about nonlinear layers a little later. So in a neural network, a linear layer is a fundamental building block: it applies a weighted sum and then adds a bias term. It is simply a matrix multiplication.
[06:22 - 07:51] It takes a set of inputs, multiplies them by the weights, and adds a bias term. In a neural network, a linear layer takes the inputs from the previous layer and produces an output. So when you see something drawn sequentially, with arrows, an arrow just means one layer's output is fed into the next; that layer multiplies by its weight matrix and adds its bias. The circles are vectors of activations, and the arrows are transformations being applied, linear or nonlinear functions. In the diagram, the weights are multiplied by the input, a bias term is added, and you get the result. The output layer produces the final prediction, which could be a classification, a regression, or several other things. This process happens over several layers, where every layer learns how to transform the data into more and more complex representations. The nonlinear functions are what allow a deep neural network to learn more complex, nonlinear relationships.
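As a concrete illustration of the "weights times input plus bias" idea, here is a minimal PyTorch sketch (the feature sizes are arbitrary): nn.Linear is exactly the Wx + b projection described above.

```python
import torch
import torch.nn as nn

# A linear layer is a learned projection: y = x @ W^T + b.
linear = nn.Linear(in_features=8, out_features=3)

x = torch.randn(1, 8)                 # one input vector with 8 features
y = linear(x)                         # projected into a 3-dimensional space

# The same computation written out as an explicit matrix multiply plus bias.
manual = x @ linear.weight.T + linear.bias
print(torch.allclose(y, manual))      # True
```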
[07:52 - 08:23] If you remember, we talked about simple representations, which are linear, where the relationship is very straightforward, and then things that are much more nonlinear. When you think nonlinear, think of a circle or a squiggly line, where the boundary is not straightforward at all. Linear layers are like straight-edge stencils; nonlinear layers are like curved brushstrokes that allow artistic expression.
[08:24 - 09:34] Without them, you would be drawing with rulers and could never express curves or textures. If you remember, when we went over self-attention we used the restaurant analogy: you're looking at Google Maps for a restaurant at 10pm, and the keys are things like the ratings and the reviews. So why do you have nonlinear layers? Because they let you tease out information that is more nuanced. To continue the restaurant analogy, nonlinear layers let you capture much more expressive information: for example, you have a family with you, or a gluten allergy, or something else you want to fully express that isn't so simple to capture with just a rating and a review. We're going to go into some of the specific nonlinear transformations; when we get into LLaMA and its internals, you'll see something called SwiGLU, or SiLU.
[09:35 - 12:45] It's simply a nonlinear activation. Instead of a linear relationship, which is just a line, it's nonlinear: it starts off in one place and then bends into something different. The reason you use these activations is primarily to transform the data so it can express more information, as in the restaurant analogy. We did go over one nonlinear activation in a previous lecture, the sigmoid. SwiGLU is an efficient and expressive gated activation, and it's specifically used in LLaMA's feed-forward neural network layer. SwiGLU acts as an intelligent gatekeeper: it decides which parts of the input should pass through for better learning. In code, it is essentially a function that multiplies an expression by its sigmoid, the idea we went over in the first week with the Python material (a sketch follows below). So what is a transformer, exactly? A transformer is primarily two components: self-attention, which adds context across the words, and a multi-layer perceptron (MLP), which modifies words individually. A multi-layer perceptron is what's known as a feed-forward network: a type of artificial neural network where data moves in one direction, from input to output, passing through multiple hidden layers with no feedback loop, meaning the data is processed layer by layer, each layer feeding the next. To be very clear, "no feedback loop" does not mean there is no backpropagation; backpropagation is the learning procedure that adjusts the weights during training. Feed-forward refers to the forward pass: data flows in one direction, you apply a linear layer and a nonlinear activation, then move to the next layer, and so forth. The MLP modifies words individually in the sense that each word's representation is processed separately from the others: each word's vector goes through a series of transformations, linear layers and activation functions, without any interaction across the sentence. Imagine a sentence like "cats are cute": each word, "cats", "are", and "cute", is converted into a vector, an embedding, and these word vectors are passed as inputs to the MLP. Each hidden layer applies weights and activation functions to refine each word's representation. For example, the word "cute" might carry a strong positive sentiment.
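Picking up the SwiGLU point from above, here is a minimal sketch of a SwiGLU-style gated feed-forward block (assuming PyTorch; the class name and the dimensions are illustrative, and this is not LLaMA's actual code): a SiLU/swish gate, which is x times its sigmoid, is multiplied element-wise with a second projection and then projected back down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block in the SwiGLU style (sizes are illustrative)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back

    def forward(self, x):
        # SiLU (swish): x * sigmoid(x) -- the nonlinear "gatekeeper"
        gate = F.silu(self.w_gate(x))
        return self.w_down(gate * self.w_up(x))

x = torch.randn(2, 5, 64)                       # (batch, tokens, d_model)
ffn = SwiGLUFeedForward(d_model=64, d_hidden=172)
print(ffn(x).shape)                             # torch.Size([2, 5, 64])
```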
[12:46 - 13:07] The word "cats" may be neutral, and the word "are" might connect to the verb with less weight. The final output for each word is its transformed representation after passing through the MLP, and that transformed vector captures more complex properties of the word than the initial embedding. The same process happens independently for each word in the sentence.
[13:08 - 13:37] Each word has its own embedding that gets transformed in the same way. Since the transformations are applied to each individual word's embedding, the final output for each word will be different. So you take something like "cats are cute", and for each token you multiply by the linear layer, apply the nonlinear activation function, and then apply that sequentially, layer by layer, as in the sketch below.
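A small sketch of the "processed individually" point (assuming PyTorch; the dimensions and toy embeddings are made up): the same MLP weights are applied to each token's vector separately, so processing a token alone or alongside the others gives the same result.

```python
import torch
import torch.nn as nn

# The MLP is applied to each token's vector independently: the same weights
# transform "cats", "are", and "cute" separately, with no interaction between
# positions (mixing across positions is self-attention's job).
d_model = 16
mlp = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))

tokens = torch.randn(3, d_model)        # toy embeddings for "cats", "are", "cute"
out_batched = mlp(tokens)               # all three tokens at once
out_single = mlp(tokens[1:2])           # just "are" on its own

# Same result either way -- each row is processed independently.
print(torch.allclose(out_batched[1:2], out_single))   # True
```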
[13:38 - 14:56] So what you're doing here is the MLP. What we learned in the second week, I believe, was the embedding function; last week we learned self-attention, which is this part; and this is the MLP layer, the green section here, followed by the output component here. So what you see is a toy example: you take an input word, run it through the embedding, apply self-attention, then the linear layer and the nonlinear activation are applied, and you get an output. Once you get the output vector, it still doesn't translate to anything readable, and that's where nearest neighbor comes in: you say, this vector doesn't mean anything by itself, but the embedding for "cat" is the closest concept to the vector I output, therefore I think "cat" is my output. And the building blocks of a feed-forward network are pretty straightforward: you have a neural network, and you define a sequence.
[14:57 - 15:37] In the homework you define the sequence like a pipeline: a neural network sequence that says run this one first, then this one, then this one. It says apply a linear layer, then apply a nonlinear function to it, and then apply another linear layer. The role of one of these blocks is really to learn the relationships in the data. So let's go here: what you have here is a basic example of a neural network layer (we'll get into encoder and decoder systems later), and we have a small corpus of information; a sketch of such a feed-forward block follows below.
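Here is a minimal sketch of the kind of feed-forward block described next (assuming PyTorch; the class name, the GELU activation, and the sizes are illustrative rather than the course's exact code): layer norm, a linear layer, a nonlinearity, dropout, a second linear layer, and a skip connection that adds the original input back.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Pre-norm feed-forward block: norm -> linear -> nonlinearity -> dropout -> linear -> skip."""
    def __init__(self, d_model: int, d_hidden: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)        # stabilizes the scale of the inputs
        self.fc1 = nn.Linear(d_model, d_hidden)  # the Wx + b projection
        self.act = nn.GELU()                     # nonlinearity (GELU here; others work too)
        self.drop = nn.Dropout(dropout)          # randomly zeroes activations to fight overfitting
        self.fc2 = nn.Linear(d_hidden, d_model)  # project back to the model dimension

    def forward(self, x):
        residual = x                             # keep the original for the skip connection
        x = self.norm(x)
        x = self.drop(self.act(self.fc1(x)))
        x = self.fc2(x)
        return residual + x                      # "add the original back"

block = FeedForward(d_model=32, d_hidden=128, dropout=0.1)
print(block(torch.randn(4, 10, 32)).shape)       # torch.Size([4, 10, 32])
```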
[15:38 - 18:33] We have "hello world", "this is a test", and so forth. Then you build a vocabulary and tokenize everything (we'll get to positional encoding a little later). This is what a feed-forward neural network looks like: a class with a constructor that takes the model dimension, the hidden dimension, and the dropout rate. Dropout, which we'll go over shortly, is simply a parameter that helps prevent overfitting; you can think of it as the probability of dropping neurons. So we have the dimensionality of the input and output, the hidden size, and the dropout rate. Then you initialize a layer norm, which normalizes the features; a linear layer, the Wx + b we just saw; a nonlinear activation; a dropout layer; and another linear layer. If you look at the forward method, it's the same sequence: normalization, a linear transformation, a nonlinearity, dropout, another linear transformation, and then a skip connection, where we add the original input back. We'll come back to each of these components as we go, but this is essentially what the network is, and roughly speaking every one of these pieces took years to figure out: someone had to figure out that you should normalize, that you should apply dropout so you don't overfit, that you need nonlinearity to learn more, that you need skip connections, and so forth. Part of what you can do with a feed-forward network is use simple synthetic data to see whether it's functioning properly. For example, you can map "the key to life is to be" to "happy" and confirm whether the feed-forward network can learn basic patterns or not, then use the loss to track the learning behavior on training and test data. So with a language model, you have words going into the network, and their representations are constantly being refined, learning additional things about the sentence or sequence. Okay, go ahead. Yeah, I'd like some clarification on what is actually being learned here. Are we still talking about transformers where the corpus of the entire internet is used, and words learn their relationships to every other word that's ever been put on the internet, or what is being learned about each word here, if not that?
[18:34 - 19:25] So with self-attention, you have the basic query-key-value data structure and you train it, so the model already has a more nuanced view; the feed-forward network then refines it. It takes the underlying query, key, and value results and refines the understanding. To give an analogy, back to the restaurant example: I'm searching for an Italian restaurant that's open at 9pm, so the key is "Italian restaurant, open at 9pm", and the value could be the menu, the description, and so on. You might be wondering, isn't that enough? What else is there to learn? There are actually more nuances once you have the menu and the photos. Why?
[19:26 - 20:47] Because by going through the photos and the reviews of the restaurant, you can learn whether it's family-friendly or not; you can learn higher-level attributes that you tease out from the key and value. So the feed-forward part, the multi-layer perceptron, is further learning: once you have the underlying data structure, it further learns the nuances. You're learning whether a place is family-friendly, whether it's date-ready, whether it can accommodate people with autoimmune conditions, and so on. There are more nuances about the restaurant that can be more fully learned through the multi-layer perceptron. You've seen linear regression, where there's a bunch of dots and a line going through them; that's a regression problem, but the fit is a straight line, not a curve. And there are different kinds of curves: sine curves are periodic, but you can also get very strange curves. That's what we mean here, and there's a section earlier where I mentioned the universal approximation theorem.
[20:48 - 25:03] Yeah, here it means that the network can model a very complex, squiggly line, whereas traditional systems are not able to learn it. This is part of why human brains can learn these things: we learn nonlinear relationships. If you look at the previous lectures, where we asked what the hidden layers are describing, the hidden layers describe a hierarchy of concepts. If you look at a convolutional neural network, it first learns edges, then uses the edges to learn mid-level features like noses, and then higher-level features, which conceptually makes sense. We wanted to introduce that before we introduce the internals: these hidden layers represent things you're learning that carry more information. You're still not able to tease out whether a restaurant is family-friendly until you go deep into the reviews and the photos. Why? Because family-friendly is a composite of attributes: maybe it has a kids' menu, maybe it has coloring books for kids; it could be a number of components you'd have to read the reviews to find. It's not as simple as reading an exact description; you have to go into more detail. So once you have the underlying data structure, the query, key, and value, you're then learning all the nuances from the information you collected. Most likely you're not going to be programming these layers yourself, but as we said in the webinar, what we're doing here is explaining the AI concepts so that if you're selecting models, you know what the more technical attention variants are. If you call yourself an AI engineer, you should know what these concepts are, even if you don't program them day to day, because they're in the model architecture. It's like buying a car and someone describes a V8 engine or a V6 with this type of transmission, and you know what that means. So what is overfitting? We've talked about this before, but to be a bit more specific: overfitting and underfitting are two common problems with a model's ability to generalize from training data to unseen data. You can see this with students, right?
Let's say you're teaching a six-year-old how to add. You ask, what is two plus two, and he or she says four. Then you ask, what's three plus three, and she says six. Then you ask something like, what's 156 plus 156, and they're stumped, and you realize the kid wasn't actually understanding addition; they were just memorizing answers. So overfitting and underfitting are two common problems: overfitting is where the system essentially memorizes the training data, and when it encounters something new it can't adapt to the new problem; underfitting is where it simply doesn't learn the pattern at all. In the very beginning of machine learning you had more problems with underfitting, whereas as computation and data kept growing, you increasingly had the problem of the model memorizing the data. How do you solve that? In 2014, I believe, someone introduced a technique called dropout. In neural networks, certain neurons may become overly reliant on each other; for example, neuron one may rely on neuron two to make a prediction, and so forth. Dropout forces the network to learn redundant, independent features by randomly dropping neurons, which prevents neurons from becoming too dependent on each other.
[25:04 - 26:20] Since dropout forces the model to work with fewer neurons at a time, the network cannot rely on any particular subset of neurons. As a result, it learns robust, generalizable features for making predictions. So it's a generalization technique. The way to think about dropout, if you look at the diagram, is that it randomly deactivates a portion of the neurons: the full network on one side gets transformed into the thinned network on the other. Take the cat example: you're teaching a child what a cat is, and you show them a cat with stripes. You're not trying to teach the child that every cat has stripes; you're trying to teach that this is a cat. Dropout randomly removes features and neurons, so imagine that while teaching you could randomly remove "stripes" and other individual attributes; you end up with a more generalized model. This technique, where you drop a random set of neurons during each training step, forces the model to learn more robust features.
[26:21 - 26:33] During each mini-batch, dropout randomly deactivates a portion of the neurons; with the rate set to 0.5, on average half the neurons are turned off. Does it set certain matrix elements to zero, or what?
[26:34 - 26:58] Yeah, it sets them to zero. Yeah. Okay. It literally puts certain neurons to zero, and by doing this across the different epochs of the training sessions you've seen in the homework, it has become a widely used technique for generalization. The problem is that neural networks tend to memorize the data; with dropout, each neuron learns independently useful features.
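A quick sketch of exactly that behavior (assuming PyTorch): in training mode, nn.Dropout zeroes a random subset of activations and rescales the survivors; in eval mode it does nothing.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # on average, half the activations are zeroed
x = torch.ones(1, 10)

drop.train()               # dropout is only active in training mode
print(drop(x))             # roughly half the entries are 0; survivors scaled by 1/(1-p) = 2

drop.eval()                # at evaluation time dropout is a no-op
print(drop(x))             # all ones again
```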
[26:59 - 29:08] For example, take a basketball analogy. In normal practice, some players get lazy and depend on the stars. In dropout practice, you bench random players each time, so different combinations of bench players get training time; everyone learns to play every role, and the team doesn't collapse if someone is injured during a game. The team is more robust, balanced, and adaptive. Now imagine you're a teacher and your students are answering questions. If each student answers in a wildly different way, some with very high confidence, others unsure, it becomes hard for the teacher to evaluate the answers effectively. You need all the students normalized to the same scale so the teacher can evaluate them fairly and provide consistent feedback. In neural networks, when data passes through many layers, the outputs can have different scales and different distributions. This can cause training instability, because the model struggles to process wildly different values. Layer norm is a technique used to normalize the inputs to each layer: it ensures the inputs have a consistent scale and distribution by adjusting the activations to have zero mean and unit variance, which helps the model learn faster. It achieves this by calculating the mean and variance and normalizing the activations so they're closer to a standard normal distribution. Layer norm can also apply learnable parameters to properly rescale for the next layer, but that part is optional. There are different normalizations you can use: batch norm or layer norm. Layer norm is like everyone doing their own individualized breathing exercise, whereas batch norm is like a coach getting the whole group to breathe at the group's pace. Now, a lot of the time when we're dealing with matrices, we're dealing with either lower-dimensional spaces or higher-dimensional spaces. Why do you have to keep transforming between them?
[29:09 - 29:53] Because sometimes you need a more nuanced representation, and sometimes you need a less nuanced one. Back to the restaurant analogy: we're searching for an Italian restaurant at 10pm, and at the very end you simply need a yes or no for every single restaurant. Taco place, yes or no; Italian, yes or no. That's a lower-dimensional space: you're just dealing with a yes/no binary classification. But when you're actually reading the reviews and trying to see if it's family-friendly, whether it's gluten-free, whether it can accommodate peanut allergies, you need a much more nuanced, higher-dimensional representation, and a lot more data and matrix multiplication to get at it.
[29:54 - 31:19] So why do we need an encoder? Imagine you're a student who has to read a book and summarize it into key points for a class presentation. The book represents the input data, for example a sentence or a sequence of words, and your summary is the output, like a translated sentence. To create a good summary, you need to understand the core ideas and compress that information into a short, manageable format. The encoder is like the student: it reads the book (the input) and creates a summary representation that captures all the important ideas and leaves out the unnecessary details. The decoder then takes the summary and converts it into the final output, like a presentation or a translated sentence. In simpler terms, an encoder is the part of a neural network that processes the input and converts it into a useful representation, which can be an embedding or a contextualized vector; it captures the essential information from the input, so you can understand complex data, things that have multiple different meanings. The encoder extracts the key information and compresses it into a fixed-size representation. In natural language, sentences can be long and the relationships can be spread out, and transformers use encoders to capture those relationships, to see how the parts of the sequence relate to each other. So why do you need a decoder?
[31:20 - 35:49] For example, imagine a translator who is given a summary of a conversation in a foreign language. The translator's job is to take the summary and expand it back into the full conversation in the target language, making sure all the important details are preserved. The summary is like the encoder's representation, and expanding the summary into the full conversation is what the decoder does: it ensures the essential details are translated into the target language or format. So effectively, an encoder takes an English sentence, "the cat is on the mat", and converts it into a complex representation, and the decoder takes that representation and can translate it into a different language, for example French, "le chat est sur le tapis". Once the encoder compresses the information into a vector, the decoder's job is to turn it into the final output. The key thing is that it doesn't always just translate it back into the original information: we talked about a book summary, but once the model has this representation, it can translate it into French or into other target formats. With language translation, the decoder generates the sequence step by step, so it conforms to the right structure and context. This is the classic transformer architecture: what you have on the left side is the encoder, and what you have on the right side is the decoder. The reason this is confusing is that current GPT models, ChatGPT, and most large language models do not use this architecture. The ChatGPT architecture is a decoder-only model; it doesn't use an encoder, so it only uses this component. That's why it's a little confusing: if you say, "I'm going to learn transformers," the classic transformer from "Attention Is All You Need" looks like this diagram. GPT is a decoder-only model: it handles left-to-right language modeling, predicting the next word given the previous words, and it reduces model complexity by removing the encoder, streamlining the generation task. This matters, because if you do a quick search and assume the classic transformer is what you get in the GPT architecture, it's a little different; and BERT is also different from the classic transformer, which is different from the GPT architecture. In encoder-decoder models, for tasks like machine translation, you need an encoder to fully understand the source sentence and a decoder to generate the target sentence. The decoder uses cross-attention over the encoder's output to ensure the model can accurately translate or transform the input sequence. So in the previous example, translating from English to French, we convert the input into a compressed representation, and cross-attention takes the queries from the decoder and matches them against the keys and values from the encoder, as in the sketch below.
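A minimal sketch of cross-attention (assuming PyTorch's nn.MultiheadAttention; the shapes and tensors are toy placeholders): the decoder states supply the queries, while the encoder output supplies the keys and values.

```python
import torch
import torch.nn as nn

d_model, n_heads = 32, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(1, 7, d_model)      # e.g. "the cat is on the mat" encoded (7 tokens)
decoder_states = torch.randn(1, 3, d_model)   # e.g. the target tokens generated so far

out, weights = cross_attn(query=decoder_states,  # queries come from the decoder
                          key=encoder_out,       # keys and values come from the encoder
                          value=encoder_out)
print(out.shape, weights.shape)                  # (1, 3, 32) and (1, 3, 7)
```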
So it uses that mechanism to accurately translate or transform the input sequence. The encoder-decoder model is ideal for tasks that are a direct transformation from input to output, classically translation. You may also see, in model architectures, something called an element-wise affine transformation. Imagine you're a photographer working with an image and you need to adjust the brightness and contrast. Think of each pixel in the image as an individual data point: you want to apply the adjustment to the pixels uniformly, but element by element. Adjusting the brightness is like shifting the values of the pixels, and adjusting the contrast is like scaling the values of the pixels. In neural networks, an element-wise affine transformation shifts the value of each element by adding a constant, similar to changing the brightness of an image, and scales each element by multiplying by a constant, similar to changing the contrast. The operation is applied to each element, allowing the network to adjust values individually based on learned scaling and shifting parameters. So it's just adjusting a matrix; "element-wise affine" is a fancy way of saying it. So why do we need it?
[35:50 - 36:54] It applies an element-wise transformation, scaling and shifting each element by multiplying by a weight and adding a bias. In a neural network we often need to normalize or adjust the data to make learning easier or more stable, so it improves training stability: each layer's inputs may implicitly have different scales, some features very large and some very small, and the element-wise affine operations inside batch norm or layer norm adjust the scale and shift in a way that makes learning more stable and efficient. So when you see layer norm or batch norm in an architecture, they are examples of element-wise affine transformations. It also introduces flexibility: each layer in a neural network may require different scaling and shifting to capture complex patterns, and the affine operation provides the flexibility to learn those transformations independently. And it allows feature standardization, where certain activations need to have zero mean and unit variance, and batch norm and layer norm are the functions that do that.
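To make the element-wise affine idea concrete, here is a small PyTorch sketch (the input values are arbitrary): nn.LayerNorm first normalizes each vector to zero mean and unit variance, then applies a learned per-feature scale (weight, the "contrast") and shift (bias, the "brightness").

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

norm = nn.LayerNorm(4, elementwise_affine=True)
print(norm.weight)   # gamma, initialized to ones  -- the learned "contrast" (scale)
print(norm.bias)     # beta, initialized to zeros  -- the learned "brightness" (shift)

x = torch.tensor([[100.0, 2.0, -30.0, 7.0]])     # features on wildly different scales
normalized = F.layer_norm(x, (4,))               # zero mean, unit variance per vector
print(normalized.mean(-1))                       # ~0
print((normalized ** 2).mean(-1))                # ~1 (unit variance)

# LayerNorm = normalization followed by the element-wise affine transform.
manual = norm.weight * normalized + norm.bias
print(torch.allclose(norm(x), manual))           # True
```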
[36:55 - 38:26] So it's very straightforward: you're able to adjust the brightness or adjust the contrast, so to speak. Let's see, I think we're going to run over a little, so let's go over the code. What you see here is a simple language model, and you can see the positional encoding and the feed-forward block, and you also have an element-wise affine transformation here. Then you compute the log probabilities here. What this does is use the negative log likelihood loss, and it uses the Adam optimizer to minimize that loss. You prepare the data: tokenize it and create input and target pairs; then you have the training loop here and the testing and evaluation here to check your functions. So what you see here is a simple model that predicts using this neural network. Then what you see in this area is an encoder-only system: we have a very simple corpus and an encoder-only large language model, just like BERT, so you can see the internals of the encoder-only model here. And then you can see a decoder-only language model, which is similar to what we talked about in the diagram.
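Referring back to the negative log likelihood and Adam pieces mentioned above, here is a toy training loop (assuming PyTorch; the vocabulary, the repeating sequence, and the hyperparameters are placeholders rather than the lecture's actual corpus or model, and positional encoding is omitted): log-probabilities go into NLLLoss, and Adam minimizes that loss.

```python
import torch
import torch.nn as nn

# Tiny next-token model trained with negative log-likelihood (NLL) and Adam.
vocab_size, d_model = 10, 16
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, vocab_size),
    nn.LogSoftmax(dim=-1),              # NLLLoss expects log-probabilities
)
loss_fn = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# A repeating toy sequence: each token deterministically predicts the next one.
seq = torch.tensor([1, 2, 3, 4, 5] * 10)
inputs, targets = seq[:-1], seq[1:]

for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)   # NLL of the correct next token
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")      # approaches 0 on this toy data
```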
[38:27 - 39:40] It's the same thing, now just in code. All the different layers we mentioned are simply a step: this is the embedding, this is the positional encoding, this is a transformer encoder layer, and here you have a transformer decoder layer. Then you have the positional encoding and a transpose here, and you can simply walk through the internals of the sequence here. Then you have an encoder-decoder large language model here: the encoder processes the input sentence, the decoder generates the target sequence, and the decoder attends both to its own past tokens and to the encoder's output; it uses a cross-attention mechanism to implement the encoder-decoder architecture. Then you can test the individual components here with a text corpus. So you have an encoder-only architecture, a decoder-only architecture, and an encoder-decoder architecture here. This one is a BERT language model; this one actually uses BERT from Hugging Face, whereas the others are coded from scratch, the encoder-decoder and decoder-only.
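One detail worth a sketch (assuming PyTorch; the sizes are arbitrary): a GPT-style decoder-only block is often implemented as a transformer encoder layer plus a causal mask, since without cross-attention the only difference from an encoder layer is that each position may attend only to itself and earlier positions.

```python
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 5, 32, 4
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)

# Boolean mask with True above the diagonal: those positions are blocked,
# so token i can only attend to tokens 0..i (left-to-right language modeling).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(1, seq_len, d_model)
out = layer(x, src_mask=causal_mask)
print(out.shape)   # torch.Size([1, 5, 32])
```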
[39:41 - 40:25] And then this is a BERT-style model from scratch, and this is using GPT, which is a decoder-only model using autoregressive decoding. So that's the code example for all the diagrams we went over. For the homework: we still have the Shakespeare conversational dataset, and we have an encoder and decoder that run through that system here. We have a neural network here; here is a feed-forward neural network without dropout, and here is a feed-forward network with dropout. And what you want... sorry, this is actually the answer.
[40:26 - 41:17] This is not the homework itself; we're going to give you the markdown with the instructions. But the idea is to take advantage of what we've learned and build a neural network that uses the previous pieces: the Shakespeare conversational dataset, dropout, a feed-forward network with normalization, and negative log likelihood. Basically, it's applying what we learned previously about neural networks along with the latest concepts here, so you have a neural network language model with dropout, a feed-forward network, and so forth. And that's about it; we'll follow up with the actual notebook, stripped down a little so that you're able to follow it more closely.
