Tokens and Embeddings
- Tokenization as dictionary for model input
- Tokens → IDs → contextual embeddings
- Semantic meaning emerges only in embeddings
- Transformer layers reshape embeddings by context
- Pretrained embeddings accelerate domain understanding
- Good tokenization reduces loss, improves learning
- Tokenizer choice impacts RAG chunking
- Compression tradeoffs differ by domain needs
- Tokenization affects inference cost and speed
- Compare BPE, SentencePiece, custom tokenizers
- Emerging trend: byte-level latent transformers
- Generations of embeddings add deeper semantics
- Similarity measured via dot products, distance
- Embeddings enable search, clustering, retrieval systems
[00:00 - 00:09] So what we're gonna talk about is this process: converting data into tokens, and tokens into embeddings. There's actually one more step later on, and I'll explain what this all means.
[00:10 - 00:15] First, I'm gonna go over the high level overview of everything. So why do you want to understand tokenization?
[00:16 - 00:33] So if you think about what we talked about last week: we talked about prompt plus data, being able to understand how prompts map to your underlying data, wanting to do retrieval augmented generation, as well as fine tuning. All of those require you to utilize tokenization and embeddings.
[00:34 - 00:45] So the reason we're going over tokenization and embeddings is so that you can utilize these things properly. And also, if you're designing multimodal models, you can utilize different kinds of tokens and embeddings.
[00:46 - 00:52] You can tokenize and embed any type of data. You can tokenize image data, video data, time series data.
[00:53 - 01:04] So it's rumored that the guys at DeepSeek actually tokenized and embedded a lot of stock data, and then utilized transformers to be able to predict stock prices.
[01:05 - 01:19] And that's how they were able to get like 20,000 H100s, and they were able to shift and create their own transformer model easily, because they already had that expertise. So understanding tokenization and embeddings is pretty key and pretty fundamental.
[01:20 - 01:31] So you want to understand tokenization because it's context specific. It's specific to your industry and specific to your domain.
[01:32 - 01:35] So what is tokenization? Tokenization is basically a dictionary.
[01:36 - 01:48] You can think of basic tokenization as simply mapping every unique token, such as a word, into a dictionary. As you get further and further, you actually start storing sub-words.
[01:49 - 01:52] You start storing punctuation. You start storing special words.
[01:53 - 02:18] And so the first process is that you simply cut up a sentence, and depending on the tokenization scheme, it's divided differently. "Tokenization is essential in NLP" might be divided simply into "tokenization", "is", "essential", "NLP", but more complicated tokenization splits this up into sub-words and then stores punctuation and special characters as their own tokens.
[02:19 - 02:28] So you shouldn't necessarily think of a token as a word; that's just a basic tokenization scheme. A token just maps to some chunk within a sentence.
[02:29 - 02:50] So after you chunk it, or represent things as tokens, then it's represented in numbers. So you see here that "machine learning archive!" goes through a tokenization process where it's broken down into "machine", "learning", "archive", exclamation point, and machine is mapped to 2, learning is mapped to 4, archive to 8, and the exclamation point to 12.
[02:51 - 03:01] This process cuts the sentence into chunks and words and then maps them to a numerical representation. So obviously this is a very simple example.
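To make the mapping concrete, here is that toy example as a few lines of Python (the vocabulary is just the four entries from the slide):

```python
# The toy vocabulary from the example above, written as a plain dictionary.
vocab = {"machine": 2, "learning": 4, "archive": 8, "!": 12}

sentence = ["machine", "learning", "archive", "!"]
token_ids = [vocab[token] for token in sentence]
print(token_ids)  # [2, 4, 8, 12]
```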
[03:02 - 03:08] In practice, the vocabulary is actually much larger. Tokens are then represented as embeddings.
[03:09 - 03:13] Everything underneath is a matrix. So you take a token string on the left side.
[03:14 - 03:23] So anything that's in brackets here, like <s> or <pad>, is a special character. And then you have punctuation here, and then it maps over to a token ID.
[03:24 - 03:32] So you can basically say, okay: zero, one, two, three. And then over here, it's mapped over to an embedding matrix.
[03:33 - 03:40] So you don't need to worry about the underlying numerical things here. It's not designed to be understood by humans.
[03:41 - 03:51] And so what are these embeddings? This embedding captures context, relationship, and nuances between the tokens, enabling the language model to understand and process it numerically.
[03:52 - 03:58] And what does that mean exactly? It is adding semantic understanding and context.
[03:59 - 04:02] So what does that mean? Say you have something like "cool", right?
[04:03 - 04:09] If you see something like cool, initially it maps to something simple: it maps to a token ID, say 134.
[04:10 - 04:17] Then, when it maps to an embedding vector, there are multiple layers that represent information about cool. So cool can represent temperature.
[04:18 - 04:21] It can represent positive sentiment. It can represent social or global.
[04:22 - 04:34] And if you took a look at the transformer workshop, you may be asking what the difference is between the semantics here versus the semantics in the transformer. The transformer adds an additional layer of context.
[04:35 - 04:43] When processed by a transformer model, it becomes even more contextual. So in "the weather is cool", the dimensions reflecting temperature become more prominent.
[04:44 - 04:48] "This gadget is cool" reflects social appreciation. It reflects innovation.
[04:49 - 04:58] It reflects positive sentiment. So the transformer attention mechanism adds additional context to the original embedding.
[04:59 - 05:09] So just to summarize: we're transforming tokens into embeddings. The transformer then transforms those into embeddings with additional context and semantics.
[05:10 - 05:22] So underneath, embeddings are simply a multidimensional matrix. Every row represents a token, and the columns represent semantic and contextual features.
[05:23 - 05:43] For example, an embedding tensor for "cat" may have 768 dimensions, and these are called hidden layers. When you hear "hidden layers", it just means they represent an understanding of cat, or cool, along the different aspects of understanding
[05:44 - 05:49] that we talked about here. And so one dimension of cat could be domestic versus wild.
[05:50 - 06:00] Another dimension could be the emotional sentiment associated with a cat as a pet. Another dimension could be abstract concepts such as size or furriness.
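As a sketch of what this lookup table looks like in code, here is a toy embedding matrix in PyTorch; the vocabulary size and dimension below are made-up small numbers, not the 768 of a real model:

```python
# A minimal embedding lookup table (sizes are toy assumptions).
import torch

vocab_size, embedding_dim = 16, 4
embedding = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([2, 4, 8])  # e.g. "machine", "learning", "archive"
vectors = embedding(token_ids)       # one row of the matrix per token
print(vectors.shape)                 # torch.Size([3, 4])
```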
[06:01 - 06:08] And so this is why, when you look at a token, it looks very simple, like 0, 1, 2, 3, 4, 5. When you look at an embedding, it's more complex.
[06:09 - 06:15] It has multiple layers. And so how do you transform a token into an embedding?
[06:16 - 06:23] You use a preexisting embedding model. These are embedding models that are created by other people and trained on the internet.
[06:24 - 06:35] And there are different examples of embedding models. If you take classic machine learning, the original example people give you is Word2Vec.
[06:36 - 06:42] Then you have GloVe, you have ELMo, you have BERT. BERT is a transformer-based natural language processing system.
[06:43 - 06:47] It's deep learning based. And so what you can see here is that the different generations are reflected.
[06:48 - 07:02] So remember what I said about machine learning having three generations: classic machine learning, which is roughly the 1970s to the early 2000s; deep learning, around 2016 to 2017; and then transformer-based. You have Word2Vec, which is classic machine learning.
[07:03 - 07:12] And then you have BERT, which is more the deep learning generation. Sorry, BERT is actually the transformer-based generation, but it's not really designed around a language model.
[07:13 - 07:26] And the OpenAI models, which we'll go into, are actually different kinds of embedding models. So when we talk about AI applications, we said most people think of an AI application as one system, like one model in the back end.
[07:27 - 07:38] This is also an example of how you're actually using different models to even construct your underlying architecture or your data. So what it looks like in practice is: you start with your data.
[07:39 - 07:43] You have a token dictionary. So if you ever open a dictionary, it's very simple.
[07:44 - 07:50] You have the word, and then you have a number associated with it. So tokens are chunks of data.
[07:51 - 07:54] They have numerical IDs. And then you have an embedding model.
[07:55 - 07:59] It could be a machine learning system. It could be a simple statistical system.
[08:00 - 08:08] It could be a deep learning system, and this embedding model converts tokens into an embedding, which is a tensor. And then finally, you have a transformer that adds self-attention.
[08:09 - 08:14] And then you have an embedding with additional context. So what is tokenization exactly?
[08:15 - 08:31] You're splitting text into little units so that models can process these different units. The purpose is to balance compactness versus semantic understanding and to capture meaning, and tokenization directly impacts model learning and task performance.
[08:32 - 08:42] A well-designed tokenizer makes inputs easier for models to process, and a poorly designed tokenizer can lead to inefficient learning or convergence failure. So what's an example of that?
[08:43 - 08:47] This is a sentence that's being tokenized: "Can I ask a question?"
[08:48 - 08:51] And the output is "can", "I", "ask". It's broken down word by word.
[08:52 - 09:14] And so if you have a correct tokenization system, then "I love hugging face" gets converted into the tokens "I", "love", "hugging", "face". And what ends up happening is that with a correct tokenization system, when you actually run the machine learning model, it's able to predict the next word accurately.
[09:15 - 09:26] And so you get "I love hugging face tools" at the very end. If you have an incorrect tokenization example, you start with something like "I love hugging face" and it turns into something like this:
[09:27 - 09:35] garbled fragments like "lo", "ve", "tug". And when you actually go and run the prediction with that tokenization, you get a wrong output.
[09:36 - 09:44] You get a prediction of "cat" rather than "I love hugging face tools". So next, you start with a sentence: "cat and dog are animals".
[09:45 - 09:54] And then you run it through a tokenizer and get a series of numbers. So just to be clear, these are the tokens and these are the token IDs.
[09:55 - 10:00] The current underlying machine learning models use token IDs. They don't directly use tokens.
[10:01 - 10:08] And so you can have general tokens which basically map to words. But in practice, you actually get some nuances.
[10:09 - 10:18] For example, in GPT-2, tokens are case sensitive. A word with its first letter capitalized and the same word fully capitalized are separate tokens.
[10:19 - 10:26] You also need to handle spaces and punctuation. In GPT-2 specifically, " cool" with a leading space is actually a different token from "cool".
[10:27 - 10:40] So you sometimes have the space plus the first letter, then the rest of the word as its own token, and "cool" can become " cool", or it might become "c" plus "ool". The point is, you don't have to memorize any of this.
[10:41 - 10:50] Basically, the point is that sometimes people have made specific optimizations in tokenization, and it's not as straightforward as you would think.
[10:51 - 10:56] But this specifically, you really don't need to memorize. I just wanted to show you what it is in practice.
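If you do want to see these quirks for yourself, here is a quick sketch using the Hugging Face transformers library (the exact splits depend on GPT-2's learned vocabulary):

```python
# Case and leading-space sensitivity in the GPT-2 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("cool"))   # no leading space
print(tok.tokenize(" cool"))  # leading space -> a different token (shown as 'Ġcool')
print(tok.tokenize("COOL"))   # fully capitalized -> different tokens again
```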
[10:57 - 11:08] We have a tokenization example here that you can interact with if you want. You basically have a tokenization system that's implemented with different types of tokenization examples.
[11:09 - 11:13] You can use existing tokenization systems. You can use Llama's, which is Meta's.
[11:14 - 11:16] You can use Microsoft. You can use Google.
[11:17 - 11:26] You can use DeepSeek's, and you can use the different OpenAI tokenizers. Generally speaking, all these tokenizers have different ways of breaking everything up.
[11:27 - 11:36] You can mess with this a little bit to see what the token output is, as well as what the token ID output is. What does it include?
[11:37 - 11:41] It's pretty straightforward. In code, you first import the tokenizer, whatever it is.
[11:42 - 11:47] So from transformers, you import GPT2Tokenizer or T5Tokenizer. And then you have a sentence.
[11:48 - 11:51] You then instantiate the tokenizer. And then you have the tokenizer object.
[11:52 - 12:01] And this is a tokenizer that you can use on different things. And then when you tokenize, you call the GPT-2 tokenizer on the actual sentence.
[12:02 - 12:04] And then you get the output. You can get GPT tokens, or Gemini tokens.
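Putting that walkthrough together, a minimal runnable version might look like this (assuming the Hugging Face transformers library):

```python
# Tokenize a sentence and get both the text chunks and the numeric IDs.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
sentence = "Tokenization is essential in NLP!"

tokens = tokenizer.tokenize(sentence)   # the text chunks
token_ids = tokenizer.encode(sentence)  # the IDs the model actually sees

print(tokens)
print(token_ids)
```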
[12:05 - 12:09] And then you can print the resulting tokens. So there are different types of tokenizers.
[12:10 - 12:17] One type of tokenizer is the word-based tokenizer. It splits the text using spaces as the boundaries.
[12:18 - 12:27] And then you evaluate the compression rate: the tokens generated versus the original text size. If you think about these tokenizers, they're in some ways like a zip file.
[12:28 - 12:36] Basically, they're trying to represent a language as IDs. And so one of the ways you check a tokenizer is: how well does it compress?
[12:37 - 12:45] With zip files, you take a 50 gigabyte file and compress it into, say, a 50 megabyte file, and so forth. And then you have other types of systems.
[12:46 - 12:50] One is called byte-pair encoding. It looks at frequent characters and merges pairs into tokens.
[12:51 - 12:56] It generally has better compression than word-based tokenization. And then you have things like Llama's.
[12:57 - 13:05] And the Llama systems are pre-trained to compress and generally generalize efficiently. And you can also look at the compression rates versus baseline controls.
[13:06 - 13:13] So if you look at different tokenizers, they have different compression rates. The word tokenizer produces large numbers of tokens and inefficient compression.
[13:14 - 13:23] The BPE tokenizers produce fewer tokens, with better compression of frequent patterns. The Llama tokenizer has superior compression and even handles rare words efficiently.
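A rough way to compare compression yourself is to count tokens per character; the three model names below are stand-ins for the word-level, BPE, and Llama-style tokenizers being discussed (T5 additionally needs the sentencepiece package installed):

```python
# Compare how many tokens each tokenizer needs for the same text.
from transformers import AutoTokenizer

text = "Tokenization is essential in NLP. " * 20

for name in ["bert-base-uncased", "gpt2", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.encode(text))
    # Fewer tokens per character means better compression.
    print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```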
[13:24 - 13:40] And then you can also visualize this: the same text run through different tokenizers. So well-designed tokenizers reduce training time by simplifying the input representation, while character-level tokenization struggles to learn due to excessive input fragmentation.
[13:41 - 13:48] An advanced tokenizer like BPE or Llama's allows the model to learn faster and more effectively. So why do you need to learn about tokenizers?
[13:49 - 13:59] If you're building AI applications, tokenization is the input into retrieval augmented generation. So remember what I mentioned about vector databases and how you can store your own data.
[14:00 - 14:14] So whether it's financial or insurance or other data, you first have to be able to cut your own text into tokens. And so in retrieval augmented generation, text first has to be split into chunks for efficient retrieval.
[14:15 - 14:19] And we're going to go into RAG in detail a little bit later, so don't consider this the introduction to RAG.
[14:20 - 14:32] I'm just simply connecting RAG to the concepts we're learning right now. So text needs to be split into chunks for efficient retrieval, and how you chunk the text impacts the retrieval accuracy.
[14:33 - 14:48] For example, smaller chunks give finer retrieval granularity but affect embedding quality, and chunks need to preserve the overall semantic information and context. For example, do you split a legal document by sections, by pages, by paragraphs, or by sentences?
[14:49 - 14:53] How do you actually split everything? The way you cut text depends on your RAG application.
[14:54 - 14:59] Do you do question answering? Smaller chunks help you retrieve specific answers.
[15:00 - 15:04] Do you do summarization? Larger chunks help you preserve broader context for summaries.
[15:05 - 15:09] Do you do document search? Then you balance chunk size to enable fast and accurate retrieval.
[15:10 - 15:16] Proper tokenization ensures embeddings represent these chunks effectively. Don't worry about the details of RAG.
[15:17 - 15:28] We're going to go into it a little bit later. Basically, chunking is a pre-processing step ahead of tokenization, where you're splitting the text into, say, paragraphs or pages.
[15:29 - 15:34] For example, you're taking a look at a book. You can chunk it or split it differently.
[15:35 - 15:46] You can split it based on chapters, based on sentences, based on paragraphs, and so forth. And then you go through a tokenization phase after that.
[15:47 - 15:58] So first you create the chunks, and then you apply the tokenization. How you cut the chunks also affects the tokenization, because the tokenization is based on what it sees in the chunks as well.
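As a sketch of that chunk-then-tokenize order, here is a minimal paragraph-based chunker; the splitting rule and size limit are assumptions, and real RAG pipelines usually lean on a library for this:

```python
# Group paragraphs into chunks of at most max_chars characters.
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

document = "Section 1. Definitions...\n\nSection 2. Obligations...\n\nSection 3. Termination..."
for chunk in chunk_by_paragraph(document, max_chars=60):
    print(repr(chunk))  # each chunk then goes through tokenization and embedding
```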
[15:59 - 16:03] So this is basically the end-to-end RAG pipeline. Again, don't worry about this.
[16:04 - 16:11] We're going to go into RAG in detail. But at the very beginning, you have the raw text input, which is the legal document.
[16:12 - 16:20] Then you have the chunking, and you have the text pre-processing. You clean up the text: remove noise and formatting, normalize, and other things.
[16:21 - 16:23] You pre-process everything. Then you apply the tokenization.
[16:24 - 16:33] You split the text into tokens, whether it's words, sub-words, or whatever the tokenization method is. Then you can organize everything.
[16:34 - 16:39] So this part is actually in reverse: basically, you then organize everything by chunking it.
[16:40 - 16:45] And then you organize things by section, by paragraph, and so forth. And then you execute an embedding.
[16:46 - 16:52] You convert the chunks into vectors. And then you store the vectors in a database, a vector database.
[16:53 - 16:58] And then you query it: what does section one say? And then you find relevant information based on the query embedding.
[16:59 - 17:02] And then you basically get the output. Again, don't worry about this entire thing.
[17:03 - 17:18] I just wanted to show you where tokens and embeddings fit in terms of retrieval augmented generation, so that you guys don't think we're learning about some basic concept that has no relevance to building AI applications. Also, tokenizers matter for fine tuning.
[17:19 - 17:25] The way you cut your text depends on your RAG application. The tokenizer used during fine tuning must match the tokenizer used during inference.
[17:26 - 17:32] The tokenizer impacts embedding consistency and model performance. And different tokenizers can create mismatched embeddings.
[17:33 - 17:44] Fine tuning ensures that the embeddings align with the task-specific requirements. For example, tokenizers for medical data might require fine tuning on domain-specific vocabulary.
[17:45 - 17:48] Tokenization bridges raw text with AI models. It compresses everything.
[17:49 - 17:55] So think of it as zipping, as creating a zip file. And advanced tokenizers go beyond word-based systems.
[17:56 - 18:11] It's better to use state-of-the-art tokenizer techniques for specific tasks rather than building them from scratch. And as part of the exercises and the homework, you're going to explore different tokenizers and their impact on model training.
[18:12 - 18:21] And then later on, we're actually going to explore multimodal tokens and how those differ from language tokens. So we have a couple of different tokenizers.
[18:22 - 18:28] We did talk about this. All you really have to know about these tokenizers is what each is good for.
[18:29 - 18:34] And then, effectively, for your own data, you just have to test them. The open source model is Llama.
[18:35 - 18:38] It uses the SentencePiece library. And it uses sub-word tokenization.
[18:39 - 18:43] And it's good for multilingual training. It handles a wide variety of languages seamlessly.
[18:44 - 18:56] The Gemini tokenizer uses a tokenizer similar to SentencePiece, but it's optimized around multimodal data: text, image, and audio. It's adapted to various input types, ensuring efficient token representation.
[18:57 - 19:03] OpenAI uses something called byte pair encoding. It splits words into sub-word units based on frequency in the training data.
[19:04 - 19:13] It's highly efficient for English text, with a focus on reducing the token count for common words. If you look at all of these, they have different purposes.
[19:14 - 19:17] Some are designed for multiple languages. Some are designed for multimodal.
[19:18 - 19:21] Some are designed for compression. They all have different goals in their design.
[19:22 - 19:30] The tokenizer is designed for the model's target tasks and data distribution. There is an emerging trend, which is called -- they released it maybe three weeks ago.
[19:31 - 19:35] I don't remember the exact time. But basically, it's called the byte latent transformer.
[19:36 - 19:50] Basically, what Meta claims to do is, instead of doing the token-plus-embedding process, it actually uses another data structure, which is called a latent. A latent is just another compression system, a compression data structure.
[19:51 - 20:00] And a latent is just another tensor. It's a compressed encoding that captures essential data while discarding irrelevant data.
[20:01 - 20:14] So instead of doing a two-step process, which is dictionary tokenization plus embedding, it goes through a one-step process, which is the byte latent transformer. And what it actually does is: it doesn't use a dictionary.
[20:15 - 20:22] It actually processes things on the byte level. And then it understands the data without actually even understanding the actual language.
[20:23 - 20:30] Because traditional tokenization is broken up: whether you have English or Spanish or other languages, it's broken up into chunks.
[20:31 - 20:39] And so sometimes it gets confused by the underlying system. So what this does is: it doesn't even process things on the token level.
[20:40 - 20:45] It processes it on the byte level. And then it learns a tensor representation called a latent.
[20:46 - 20:54] And it's able to understand it without going through the typical process. You all know what a byte is.
[20:55 - 21:04] It basically bypasses the traditional tokenization and vocabulary-building process. And it handles raw inputs universally, regardless of language or file type.
[21:05 - 21:13] For all practical purposes, this is an emerging trend. Most of the ecosystem is still based on the tokenization-plus-embedding model.
[21:14 - 21:20] Your vector databases and other things are based on this model. I just want you guys to be aware of it.
[21:21 - 21:36] It does provide some advantages, but not all the libraries support it; the ecosystem support is not universal, simply because the dictionary-plus-embedding model is such a universal model in natural language processing. Byte as in, okay --
[21:37 - 21:42] byte as in, it actually learns from the actual binary. Byte as in bits and binary, not as in a word.
[21:43 - 21:52] So each byte can represent 256 unique values. Bytes are commonly used to encode characters in text files, and also binary data.
[21:53 - 21:56] So it actually looks at the actual bytes.
[21:57 - 22:02] It looks at the actual binary digits, not the actual characters. So this was Ken's question.
[22:03 - 22:12] A byte, for English text, is roughly the same as a character. So one of the challenges: if this is so great, why doesn't everyone use it immediately?
[22:13 - 22:22] It does have additional computational complexity, requiring higher memory and computation power. And it sometimes can't handle longer dependencies very well.
[22:23 - 22:31] It's also demanding on data size and quality, it reduces human interpretability, and it also has optimization challenges.
[22:32 - 22:40] Yeah, so Ken mentioned that's quite a radical change, because the classic embedding space is word oriented. Yes, it definitely is a dramatic change.
[22:41 - 22:45] So it's something emerging. I just wanted you guys to know about it.
[22:46 - 22:50] It's emerging. It's something that Meta came up with maybe two or three weeks ago.
[22:51 - 22:53] I forgot the exact time. So what are embeddings exactly?
[22:54 - 23:00] So we first talked about tokens; you should think of those as dictionaries. And then we transform them into embeddings.
[23:01 - 23:06] You should think of a simple embedding as simply a vector: points on X and Y axes in numerical space.
[23:07 - 23:18] For example, king, queen, and man represent points on the X-Y axes, where relationships like queen = king - man + woman hold. So why should we understand them?
[23:19 - 23:27] Embeddings capture the nuanced understanding of a human. So when someone looks at "cat", your brain doesn't actually interpret cat as just a cat.
[23:28 - 23:37] Your brain may interpret cat as all the different cats you've seen, of different species. You may have smell memories.
[23:38 - 23:42] You may have allergies to cats. There's all sorts of hidden understanding associated with "cat".
[23:43 - 23:51] And that's not fully captured in the English language when you take a look at everything. So what we have here is: you have token IDs, and then you have tokens.
[23:52 - 23:58] And then we have an embedding. An embedding model is a machine learning model that converts token IDs.
[23:59 - 24:14] It processes the token IDs and converts them into the hidden layers. So remember what I mentioned about embeddings as a multi-dimensional matrix, where the layers represent the semantic understanding of that specific thing.
[24:15 - 24:18] For example, cool can represent temperature. It can represent emotional state.
[24:19 - 24:26] It can represent appreciation. The hidden layers represent all the nuances that you would basically want to understand about that thing.
[24:27 - 24:37] So just think of hidden layers as a more nuanced definition of the token ID, or token, that you want to represent. And then you have the output, which is the actual embedding.
[24:38 - 24:46] In practice, you're not going to be able to interpret it, but you are able to add, subtract, and do other things with it. And we'll go into an example of that.
[24:47 - 25:01] So an embedding layer is an embedding model that converts a token ID into a tensor. A tensor is simply a multi-dimensional matrix, and the semantic information within it is what we call the hidden layers.
[25:02 - 25:04] Okay, so that's basically it. It's not that complex.
[25:05 - 25:16] Basically, that's why, when you say, "Hey, what is the matrix for cat?", it will give you the embedding dimension, and it could be, say, 1,048 embedding dimensions.
[25:17 - 25:29] And by embedding dimensions, I mean there are 1,048 hidden understandings of that word. So by embedding dimension, I simply mean one part of the actual matrix.
[25:30 - 25:47] So generally speaking, we have a word, and these are represented as embeddings, or matrices, or tensors. And so the word is then passed to the embedding, then it's further passed into a transformer, and then, with that representation, you can apply a nearest neighbor search.
[25:48 - 25:52] For the sake of this lecture, we don't worry about the transformer. We don't worry about the nearest neighbor.
[25:53 - 25:58] We're gonna talk about this a little bit later. Right now, we're primarily just talking about this section.
[25:59 - 26:02] Right now it's just word to embedding, steps zero and one. So what makes a good embedding?
[26:03 - 26:10] A good embedding places things that are similar close to each other. So you may see the tensor and say, "How do I even work with this?"
[26:11 - 26:19] This is not what I do normally, which is, as a software engineer, I have a variable, I'm able to examine the variable, and I know what's passed to it.
[26:20 - 26:34] You can actually work with it by mapping the word to the embedding and doing arithmetic on it. So what's an example of arithmetic with embeddings? Paris minus France plus Italy equals Rome.
[26:35 - 26:44] So this is an actual exercise that you can do with embeddings. And this is what it means that embeddings have semantic information encapsulated within them.
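You can try this arithmetic yourself with gensim's pretrained vectors; "glove-wiki-gigaword-50" is one of gensim's standard downloads, used here as a stand-in for whichever embedding model the lecture notebook uses:

```python
# Embedding arithmetic with pretrained word vectors.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # small pretrained embeddings

# paris - france + italy ≈ rome
print(model.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))

# king - man + woman ≈ queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```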
[26:45 - 26:48] So again, what is an embedding? It's a tensor that has multiple hidden layers.
[26:49 - 26:57] The number of layers really depends on the embedding algorithm. For example, OpenAI's Ada model is like 1,048, I think.
[26:58 - 27:07] And the BERT model is 768, I think, something like that. So the number of embedding layers -- sorry, the number of hidden layers -- depends on the actual embedding model.
[27:08 - 27:25] And what the embedding model does is tune things so that, in this space, you're able to tell whether Paris is close to Italy or not. So basically, what we have is the ability to understand the distance between different vectors.
[27:26 - 27:32] So say we have something like this here, on an x, y axis. And this point is (2, 1).
[27:33 - 27:41] The distance between this point and this one is 1.41, okay? And the distance between this point and this other one is 4.47.
[27:42 - 27:55] The way you calculate this is simply the Euclidean distance: the square root of (x2 - x1)^2 + (y2 - y1)^2.
[27:56 - 28:01] In practice, this is just code. And so you're able to calculate the distance between two points.
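In code, the distance calculation is a one-liner; the specific points below are assumptions chosen to reproduce the 1.41 and 4.47 values from the slide:

```python
# Euclidean distance between two 2-D points.
import math

def euclidean(p, q):
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(round(euclidean((2, 1), (3, 2)), 2))  # 1.41
print(round(euclidean((2, 1), (4, 5)), 2))  # 4.47
```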
[28:02 - 28:09] So for example, you're talking about the distance from Miami to New York: it's quite a distance, even if you don't know the exact number.
[28:10 - 28:16] Or for example, Miami to Orlando: maybe it's 300 miles from Orlando to Miami. So this is simply calculating the distance.
[28:17 - 28:28] And the way to understand distance here is that it measures how close one concept is to another. So if you think about it semantically, a cat is close to a dog.
[28:29 - 28:39] Not that close, but they're both animals. So in vector space, cat is closer to dog than it is to person, okay?
[28:40 - 28:55] And we're going to go into how to actually compute whether things are similar or not, but the distance between things is the simplest way to visualize whether they are similar. So think of wherever you are: New York to Boston, or Chicago to Iowa.
[28:56 - 29:03] Wherever it is, the distance is basically a measure of how similar things are. And if you look at real life, that's basically true as well.
[29:04 - 29:14] The cultures are similar. A New York City person will be more similar to a New Jersey person than, say, a Midwestern person, in terms of cultural values.
[29:15 - 29:21] So you should think of these distances as similarity measures. So how do we actually compute similarity?
[29:22 - 29:29] We talked about this before, but just as a refresher, this is a dot product. So you have two times four equals eight, okay?
[29:30 - 29:33] And again, I know we basically went over this previously. This is just a refresher.
[29:34 - 29:42] And then here you multiply one times seven, two times nine, three times eleven, and you get 58. And you continue this process.
[29:43 - 29:47] I'm going over it quickly because we already covered this. And so, what does a dot product measure?
[29:48 - 29:54] It actually measures similarity. It measures how similar two vectors are in a given direction.
[29:55 - 30:06] If two vectors point in the same direction, their dot product will be large, meaning they are similar. If they point in completely different directions, their dot product will be small, meaning they are not similar.
[30:07 - 30:24] So what we talked about before was visualizing distances between vectors, but that only captures one dimension, which is magnitude. You're able to understand the distance, but not other aspects, such as: if you're subtracting two vectors, where are they heading?
[30:25 - 30:36] So if I'm going from Miami to Orlando versus Miami to the Florida Keys, it might be the same amount of distance, but it's in a different direction.
[30:37 - 30:52] You have to be able to capture both direction and magnitude to understand whether things are similar to each other. So intuitively, Miami to Orlando, which is going north, is different from Miami to the Florida Keys, which is going south.
[30:53 - 31:12] And so the dot product allows you to measure both the distance and the direction in which we're going. So if you have something like cat and the vector that represents it, then something like dog, and then something like car, this is pretty straightforward.
[31:13 - 31:23] The way to represent this is: you have the vector (1, 2, 1), and then you do a dot product, exactly like what we mentioned before. One times two, okay?
[31:24 - 31:28] Two times two, one times one. And you get seven.
[31:29 - 31:36] How are homophones represented? Homophones are represented, I think, in terms of their actual spelling.
[31:37 - 31:53] So if you have something like "pair" versus "pare" (P-A-R-E) versus "pear" (P-E-A-R), they're treated as distinct tokens because they have different spellings, okay? So P-A-I-R could be tokenized exactly as is.
[31:54 - 32:08] "Pear" as in P-E-A-R will be tokenized on its own spelling, and "pare" as in P-A-R-E will be tokenized on its spelling, possibly split into sub-words depending on frequency, okay? So how a word is treated really depends on its spelling.
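A quick sketch to check this spelling-based behavior yourself, using GPT-2 as an example tokenizer:

```python
# Spelling, not sound, determines the tokens a homophone gets.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in ["pair", "pear", "pare"]:
    print(word, "->", tok.tokenize(" " + word))
```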
[32:09 - 32:20] It's not dependent on the phonetics, on how it sounds. - [Student] You keep using the example of cool or cold, and you like that, which is fine.
[32:21 - 32:41] But what I'm talking about is: a dog pretty much only has one meaning, whereas cool and cold have multiple meanings. But if you're using an embedding vector in low dimensions to visualize it, how do you reconcile the fact that the token has several different meanings?
[32:42 - 32:58] - The token and its meaning: the meaning is not represented in the token layer; it's represented in the embedding layer. So in that sense, the tokenization layer is very dumb, and cool is just simply cool.
[32:59 - 33:02] It doesn't have different tokens. - Yeah, I know that.
[33:03 - 33:18] But when you're training it on a giant corpus, it's going to see several different meanings, or as you're calling them in this lecture, understandings, for a token. And then, as the loss function is minimized, is it going to mask--
[33:19 - 33:29] - Yeah. - There's a conflict in how it should be located in the embedding space. - Yeah, so the nuances related to cool are all put into the hidden layers.
[33:30 - 33:44] So cool maps as a dictionary entry, whereas with the embeddings, all the nuances are put into the hidden layers. And even then, you don't have the full representation of the semantics or context.
[33:45 - 33:53] There's a final transformer that applies the additional context. So it basically has a two-step context process.
[33:54 - 34:19] One step is the embedding, where you go over the data and try to capture as much as possible; that's the one-step embedding process. - Yeah, I just thought that since embedding is such a heavily studied thing, as you put out there, and there must have been a hundred different tokenizers, there would be some paper out there which could explain intuitively what happens.
[34:20 - 34:32] Like maybe there's a bias, a frequency bias or something, that tends to determine the geometric position of a homophone. But anyway, we don't need to waste any time on this.
[34:33 - 34:34] I'm sorry. - Yeah, okay, yeah.
[34:35 - 34:55] So I hope it's clear basically that the under-- - No, it's not clear, because you're showing arithmetic examples which happen to work for certain -- I don't know what the linguistic terminology is for something that has just one meaning. But when you have multiple meanings, I don't see how the arithmetic works at all.
[34:56 - 35:14] And I don't even understand the multiple semantic dimensions, even for the capitals versus the distances between cities. I find it counterintuitive that the arithmetic-- - Yeah, so, okay, I understand what you're saying.
[35:15 - 35:29] You should think of these hidden layers as adjectives, or descriptions, of the underlying concepts. And so they're adjacent to the actual concepts, and certain things are more dominant than others in the hidden layers.
[35:30 - 35:39] That's how you're able to add or subtract these things. So for example, if you take a look at cat, a hidden layer may represent furriness.
[35:40 - 35:46] But the furriness is not the dominant representation. The dominant representation is in terms of everything together.
[35:47 - 36:07] There are more things that are related to animal-ness as a general concept than to person as a general concept. Maybe an intuitive way to think about it is that these tensors represent the sum of all attributes that are conceptually within this word.
[36:08 - 36:26] So cat has a series of concepts that are related at a high level, like being animal-like, versus ones that are very person-oriented. And if you sum up all the hidden dimensions and subtract, or do a dot product with, dog, that's why they're similar in vector space.
[36:27 - 36:29] Does that help? - Again, thanks for trying.
[36:30 - 36:39] - What's the confusion, basically, on the-- - It's a subtlety, so let's not spend time on it. For your mainstream examples, I understand it.
[36:40 - 36:45] Like cat, dog, Paris versus Rome. - Okay, you want a more specific example.
[36:46 - 36:49] - It doesn't matter. Let's just table it for now. - I'll go into the specifics.
[36:50 - 36:58] I'll follow up with a more specific example as a Q&A. So what we have here is similarity with a dot product.
[36:59 - 37:11] One times two plus two times two plus one times one equals two plus four plus one, which equals seven. And then you have one times five plus two times zero plus one times zero, and this equals five.
[37:12 - 37:24] So if you look at the difference: cat dot dog is seven, and cat dot car is five. So cat is more similar to dog, in terms of the numeric output, than to car.
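The same worked example in code, using the toy three-dimensional vectors from the lecture (not real embeddings):

```python
# Dot product as a similarity score between toy word vectors.
cat = [1, 2, 1]
dog = [2, 2, 1]
car = [5, 0, 0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(cat, dog))  # 1*2 + 2*2 + 1*1 = 7
print(dot(cat, car))  # 1*5 + 2*0 + 1*0 = 5 -> cat is closer to dog than to car
```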
[37:25 - 37:35] So with these dot products, you're able to understand whether certain things are similar or not. And with a dot product, you're measuring both magnitude and direction.
[37:36 - 37:49] And what large language models do is use dot products to understand that cat is a plausible substitute for dog, simply because their embeddings are similar. Again, cat is not actually replaceable with dog.
[37:50 - 37:58] We're just using this as an example. Obviously, a more specific example is that golden retriever would be a plausible substitute for dog.
[37:59 - 38:21] And then in a search and retrieval system, like retrieval augmented generation, a query like "pet" will have a vector embedding that's closer to cat and dog than to car, helping retrieve relevant results. So this is why, when you're doing RAG and you run a query, you're able to find similar things: because you're able to measure similarity using dot products.
[38:22 - 38:27] And so what does this mean in practice? You have something like Word2Vec, and you're able to download it.
[38:28 - 38:38] This is a classic machine learning model. And you're able to plot embeddings of words like king, queen, and man on X and Y axes.
[38:39 - 38:47] So basically, what you have here is a basic system. You should be familiar with a lot of these things from your model exercise.
[38:48 - 38:51] You're able to import a model. You're able to tokenize the words.
[38:52 - 38:57] You're able to decode the words. And then we have the overall process here.
[38:58 - 39:07] So now what we see here is that we're able to explore embeddings. So first, you want to import the gensim library.
[39:08 - 39:11] And then, yeah. - Can you scroll back a little bit?
[39:12 - 39:14] Stop, you went too far. Keep going down.
[39:15 - 39:23] There's something like logits or something in the output when you call it on the batch. There are a lot of "the"s, and... what is all that?
[39:24 - 39:35] I really don't understand the dim argument, or the argmax. Is that saying which dimension of the tensor it's going to compute the argmax on?
[39:36 - 39:44] - So, there are a couple of different questions there. One is: why is the decoded text like this? It just predicts the next token.
[39:45 - 39:48] So sometimes it does make sense. Sometimes it doesn't make sense.
[39:49 - 40:04] And this is why: what we provided is a Facebook OPT-125m model, and with language models, the more parameters you have, the more the output makes sense.
[40:05 - 40:17] And in terms of what it does -- - I don't know, because when you go back to the red, green, blue example, what dimension is the argmax working on versus here?
[40:18 - 40:26] It's dim equals two; maybe that's why it doesn't make sense. I'm just trying to understand what this logits tensor is all about.
[40:27 - 40:37] - Yeah, the logits are just the raw output of the network. Remember, we talked about the softmax, which transforms everything into a probability.
[40:38 - 40:41] What happens when you're looking at the actual data? The actual data?
[40:42 - 41:02] - We get that it's the output of the neural net, but when you scroll back to the red, green, blue, where it works, versus here, where it's seemingly nonsensical -- can you go back a little bit to the logits for the red, green, blue? You see how you have colon comma minus one in square brackets above it?
[41:03 - 41:16] Output IDs equals output dot logits, square bracket, colon comma negative one, square bracket, argmax. What's that notation versus the dim equals two notation?
[41:17 - 41:31] Is it the lower one that causes the seemingly nonsensical output? - So this is basically taking the last token in the sequence, and then you're returning the highest logit, which is like a greedy algorithm in that sense.
[41:32 - 41:43] You just take the index that corresponds to the next predicted token. - Like, the argmax: isn't it returning the index of the max?
[41:44 - 41:52] - Yeah, you're taking the highest logit. And that corresponds to the predicted next token. - Right.
[41:53 - 42:05] - Yeah, and this argmax selects the index of the highest value along the vocabulary dimension. So it gives the token ID with the highest probability.
[42:06 - 42:17] And it's along the vocabulary dimension, which is dimension two. And then it returns a tensor of the sequence length and batch size.
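A sketch of the greedy next-token step being discussed; "facebook/opt-125m" is an assumption standing in for the OPT model used in the lecture notebook:

```python
# Greedy next-token prediction: take the argmax of the last position's logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("I love hugging face", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits has shape (batch, sequence_length, vocab_size).
# [:, -1] keeps only the last position; argmax over the vocabulary
# dimension picks the single highest-scoring next-token ID.
next_id = outputs.logits[:, -1].argmax(dim=-1)
print(tokenizer.decode(next_id))
```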
[42:18 - 42:30] - So why is it returning the same thing that it previously did, instead of a lot of "the" and carriage-return or newline tokens? This isn't a temperature thing, is it?
[42:31 - 42:48] - No, it's not a temperature thing. Sometimes, when you have simpler models, they just don't predict as well as larger models. So we started with a small natural language processing model, and sometimes it just doesn't predict as well.
[42:49 - 43:03] So, going back to the France example, what we have here is the ability to execute math on top of the embeddings. So one of the exercises we have for you is adding and subtracting different things.
[43:04 - 43:12] So for example, what we're doing here is adding man plus woman and subtracting prince. And we want to see what we get as a result.
[43:13 - 43:17] We get teenager. We do king plus woman minus man, and we get queen.
[43:18 - 43:28] We do Paris plus Germany, and we get France. So if you look at the semantic information: king, what type of concepts does it have internally?
[43:29 - 43:39] It's a man, it's royalty, and it's the head of a country. And then if you add woman to that and subtract man from it as a concept, you basically get queen.
[43:40 - 43:50] Remember, I mentioned embedding layers as ways of capturing the semantic information. You should think of this subtraction and addition as subtracting or adding the hidden layers.
[43:51 - 44:00] King plus woman minus man equals queen, and you're able to play with a lot of these concepts. For example, what happens when you add lunch plus breakfast plus dinner?
[44:01 - 44:11] It's able to take all of the hidden layers and understand that if you add all of these things together, you get meal. And what do you get if you add car plus farm?
[44:12 - 44:20] You get tractor. So we have a number of examples for you in the homework activity, and you should be able to run different types of combinations.
[44:21 - 44:32] So try royalty, try location, try activity, try relative, try actor in the homework activity. That's so you're able to build an understanding of all of it.
[44:33 - 44:37] So this is basically the code that's already inside the homework. OpenAI has text embeddings.
[44:38 - 44:45] One of the original text embedding models is ada-002. And it has 1,536 dimensions.
[44:46 - 44:53] It works very well on various NLP tasks, and it's designed to be cost effective. It transforms text into numerical embeddings.
[44:54 - 45:02] And these embeddings are applied to different things: document similarity, for example. It can also be recommendation systems, natural language processing.
[45:03 - 45:10] It's designed for multiple languages. So when you have a text embedding, it's not always necessarily used for a large language model.
[45:11 - 45:16] It can be a precursor. You should think of these embedding engines as being able to do various things.
[45:17 - 45:24] For example, if you embed different documents together, you can understand how similar one document is to another. And you can use it for recommendation systems.
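A hedged sketch of producing such an embedding with the OpenAI Python client (openai>=1.0 style; requires an OPENAI_API_KEY in your environment):

```python
# Request an embedding vector for a piece of text.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Tokenization is essential in NLP",
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions for ada-002
```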
[45:25 - 45:30] So I'm gonna skip some of this. Some of these embeddings are evaluated using different benchmarks.
[45:31 - 45:39] One benchmark is called MIRACL, a multilingual information retrieval benchmark designed to evaluate embedding and retrieval models across multiple languages.
[45:40 - 45:47] And there are other benchmarks, but OpenAI tends to use this one when they're talking about their embedding models.
[45:48 - 45:53] And so, their embedding models: on this slide, you actually have multiple embedding models.
[45:54 - 45:59] Generally speaking, why would you use different embedding models? There's cost, there's performance.
[46:00 - 46:03] And then there's accuracy. For example, you have text-embedding-3-large.
[46:04 - 46:11] This is the highest quality. It achieves 54.9% on the MIRACL benchmark for multilingual retrieval tasks.
[46:12 - 46:21] And it's best for complex semantic tasks and accuracy-critical applications. And this is the cost: $0.0013 per 1,000 tokens.
[46:22 - 46:37] It may seem like very little, but when you're processing a lot of documents, the cost really adds up. And if you look at the dimensions, 3,072 right here: again, what we're talking about is the number of meanings associated with the token.
[46:38 - 46:46] The number of meanings associated with the token is reflected here. And for the smaller model, the number of dimensions is smaller: 1,536.
[46:47 - 46:54] And it only achieves a 44% MIRACL score. But it has a good performance-to-cost ratio, ideal for general purpose embedding tasks.
[46:55 - 47:00] And it has a lower cost; it's 6x cheaper than the large one. So this is what I wanted to illustrate.
[47:01 - 47:16] With these embedding systems, not only do you have to match them to your model for fine tuning or RAG, but there's also a cost versus accuracy versus performance tradeoff in using these systems. Should you build your own embedding model?
[47:17 - 47:29] Generally speaking, it's not terribly recommended, but you can if you want to. So let me just walk through the homework activity as well as the actual code.
[47:30 - 47:44] So in the code we have here, you're able to explore different words, go through, and check the actual embeddings of these individual words. You're able to see the overall shape.
[47:45 - 47:51] You're able to see what's inside it, the actual vector. And then you're able to understand the similarity.
[47:52 - 48:03] And so you're able to see the dot product and similarity. You're able to add things together, king plus woman minus man, and you're able to multiply and even divide things as well.
[48:04 - 48:11] So this is a dot product example: you're able to take different words and do a dot product between them.
[48:12 - 48:18] Again, a dot product measures both magnitude as well as direction. So it's the difference between going north versus going south.
[48:19 - 48:33] And this is dog dot apple. What you see here is that dog and apple have a score of 0.2196, and then orange and apple have a similarity of 0.39 between them.
[48:34 - 48:44] And then you can also visualize a bar chart comparing the different values as well. So dog versus apple is here, and orange versus apple is here.
[48:45 - 49:03] So we're going to go into multimodal models and multimodal embeddings a little bit later: specifically voice, specifically image. We're going to go into time series data as well, just in case you want to know how text maps to images within embedding space.
[49:04 - 49:16] And this is the homework activity. The homework activity is mostly just exploring vectors, going through and seeing, okay, what happens when you subtract or add happy, sad, joyful, angry?
[49:17 - 49:21] What happens when you add two geographical locations? And apply this to fruit and food.
[49:22 - 49:28] What happens when you add or subtract them, divide or multiply them? And then what happens when you do a dot product with them?
[49:29 - 49:43] And then we have another exercise around scoring different tokenizers as well. So you're able to explore the different tokenizers from Llama, GPT-2, and T5, and you can see what the outputs are.
[49:44 - 49:46] So that's basically it for tokens and embeddings.
