Multimodal Embeddings

- Foundations of multimodal representation learning
- Text, image, audio, video embeddings
- Contrastive learning for cross-modal alignment
- Shared latent spaces across modalities
- Vision encoders and patch tokenization
- Transformer encoders for text meaning
- Audio preprocessing and spectral features
- Time-series tokenization via SAX or VQ
- Fusion modules for modality alignment
- Cross-attention for integrated reasoning
- Zero-shot retrieval and multimodal search
- Real-world multimodal applications overview

  • [00:00 - 01:39] Multimodal embeddings are almost another dimension. If you watch sci-fi, there's the concept of multiple dimensions, each one an entire alternate universe. The idea behind multimodal embeddings is that you start in a text-based universe, and then there's an image-based universe, a video-based universe, a time-series universe, and an audio universe. Whichever modality you want to merge, you merge it back into the original universe: image into text, video into text, audio into text, until the entire set of universes lives in one universe. Once you're in the same universe, things are all mixed up, the planets are not aligned: the image of a cat is not aligned with the text description of a cat. So you train them to go together. That's the high-level, explain-it-like-I'm-five description, and we're going to go into the specifics of how it actually works. The goal is to understand what generative AI is fundamentally, how prompts actually get mapped into different forms of data, what CLIP is, how image and text are trained together, the dataset behind CLIP, what multimodal embeddings are, and how to put all of this to use. Some of you might be wondering, "I may not need this, I'm not doing any multimodal stuff."

    [01:40 - 04:54] But in the last cohort we had multiple people from insurance and banking. The insurance people were using multimodal embeddings to look at insurance images and fine-tune around that. In e-commerce, people fine-tuned a model to look at images on Facebook Marketplace and detect item condition to do arbitrage. So there are many more use cases for CLIP fine-tuning than you would think. Some of you are interested in text-to-video. And if you want to build a voice receptionist with voice-to-text, one person in the previous cohort was dealing with flows like "this phone call may be recorded" and mapping them into document processing. So there are many more use cases for CLIP fine-tuning and embedding fine-tuning, and that's why we go over this: it's the fundamental layer for voice receptionists, image fine-tuning, and video fine-tuning. What are multimodal large language models? A multimodal large language model is designed to take inputs from different modalities, meaning different types of information. A text-only model can understand a sentence like "a pizza with melted cheese" but cannot recognize that a picture actually shows a pizza. A multimodal large language model can combine that sentence with an image and confirm whether the two align. There are two ways to build these AI applications. You can either do it natively, a true multimodal large language model, or you can fake it. What some people do is use text in the underlying system, fine-tune on the text, and then render it in a markup language. You can see this with a lot of the code IDEs, where the output is different visual and React components: underneath it's really dealing with text and issuing instructions to the markup language. A true multimodal model understands the image natively, the video natively, the audio natively. This matters in the real world because information rarely exists in isolation. Doctors look at both medical records and X-rays. Social media platforms combine captions with photos and videos. If a model can't handle multiple modalities, it cannot solve problems in these domains. So the first piece of multimodal LLMs is data integration. You have to be able to combine different modalities: the model has to read the sentence "a pizza with mushrooms" and also look at the image, and by integrating the two it gains a stronger understanding. Then you take the pizza token, the photo of a pizza, and even the sound of someone saying "pizza," and map them to the same real-world concept. This is what I talked about earlier: you have the different universes and you're squeezing them, smashing them, into one universe. Without this, the modalities would stay disconnected.

    [04:55 - 05:11] Then the model can integrate the different inputs and produce better answers. Given the sentence "describe this meal" along with a photo, it can now say "a freshly baked pizza with mushroom and cheese." And task flexibility is what makes these models practical.

    [05:12 - 05:49] The same model can caption an image, answer a question about an image or audio, or connect a graph with a text explanation. A common pitfall is thinking multimodal models are locked into one specific task, like image captioning, whereas they can handle many use cases with the same underlying setup. This misconception comes from the previous generation. The previous generation was deep learning models that were very limited in their integration with text and language, so you could only really have them do one specific thing, for example image captioning, image recognition, or object detection.

    [05:50 - 06:57] By evolving into this next generation of multimodal, transformer-based language models, they gain much broader capabilities. Each input has its own encoding process. For example, text with a word like "pizza" is tokenized and embedded. For an image, the model breaks the picture down into smaller parts, like tiles: a big image is broken down into tiles and each one is embedded. For audio, the sound is turned into a spectrogram, a picture of sound frequencies over time, and then embedded. The next step is mapping everything into a shared vector space. You can think of this space as a universal coordinate system. In this space, the word "pizza," an image of a pizza, or the audio of someone saying "pizza" can all be placed next to each other. Then the model learns to align the representations: the word pizza and the pizza image are pulled closely together, and negative pairs, like the word pizza and a car image, are pushed apart. This training process is what makes multimodal understanding possible. A common misperception is that the model understands images or audio directly. In reality, everything is translated into embeddings first.

    [06:58 - 08:30] The strength of multimodal large language models is that they can bring these embeddings into a shared space. Multimodal embeddings are the systems where you take everything, tokenize it, translate it into embeddings, and fuse them together. Then you're able to get a particular output and query it using natural language. Why do we even need this? You can handle image-to-text and audio-to-text. A system like an e-commerce search engine can take a text query, "red pizza box," and find matching products. This is different from the previous generation, where most e-commerce search was based on text and tagging; now the system natively understands that it's a red pizza box or a red shirt or whatever it is. Then you have audio plus text: a voice assistant can hear "pizza," map it to the same space as the word pizza, and use it to fetch a recipe. A model can analyze a cooking video and generate subtitles or step-by-step summaries. A model can read a chart of pizza sales and explain the trend in plain language. All multimodal applications are built with the same process: encode each input, map it into a shared space, and align the representations. There are a lot of different examples. You have medical imaging analysis, X-rays and scans, to detect conditions; you can analyze a person's described symptoms in addition to visual data to diagnose skin conditions.

    [08:31 - 08:56] Then you have assistive tools: a system for visually impaired people that can describe surroundings by combining speech, image, and text with real-time image processing, or speech-to-image for hearing-impaired users, converting spoken instructions. But this is not just accessibility; you can also build voice receptionists that use different tools and schedule things for you.

    [08:57 - 09:34] You can also generate image captions and descriptions. This is used in e-commerce: it's useful for creating tags automatically. And you can extract insights from charts, maps, and photos, and analyze visual data such as satellite imagery, climate change patterns, and financial charts, generating actionable insights or textual reports. With satellite imagery, hedge funds use this to detect supply and demand for commodities and predict patterns from there. Multimodal AI connects text, image, and audio using a shared representation. So what is the underlying system?

    [09:35 - 12:46] The underlying layer, as you saw in the embedding lecture, is embeddings, and there are particular types: OpenAI's embeddings and many others. It's the same for image embeddings. CLIP was one of the original image-text embedding models, released by OpenAI. Then you have AudioCLIP and VideoCLIP, and models like Whisper, which you use for audio, sit on top of that as higher-level architecture. Likewise Wan and CogVideoX, the top open-source text-to-video frameworks, sit on top of these embedding models. Now that you know why multimodal matters, we want to dig into what these models are inside. The internals of a multimodal model are where it learns the correlations automatically and together, and that comes from the actual dataset used to train it. Look at this image: it's a pizza, and the caption reads "classic Neapolitan pizza with a bubbly, charred crust and vibrant tomato sauce, dollops of creamy mozzarella are scattered across the sauce, and a single basil leaf sits in the center, adding a pop of color and fresh flavor. Keywords: pizza, Neapolitan, bubbly, charred." The foundational model itself will have general recognition of, say, golfers, but it most likely won't have coaching knowledge. For example, suppose your swing is wrong somehow: the model will recognize that it's a swing, but it won't recognize the specific issues the way a coach would. What you could do is work with a professional golfer and build a training dataset. The dataset would go through different examples and say, "this is an example of a backswing that's wrong," with textual annotations, and then the model will naturally come to understand it. So you could have a real-time video feed, or a recording, and the AI will automatically not only grade it but also provide textual feedback on what to do, based on your training data. That's the value of the underlying model, and also the value of adapting it to specific domains and specific knowledge that is not organized well on the internet; very specific golf coaching data, for example, probably isn't. Multimodal capabilities don't appear out of thin air. They are learned when both sides of the relationship are present. A text-image pair shows the model that the words pizza, mozzarella, and basil go with specific photos that have specific shapes, textures, and colors. Without paired data, the model would have no way to discover that the pizza in text should align with a round, bubbly, red and gold object in the image. The pairing provides the signal for the alignment. So a common misconception is that a model can align text and image without seeing them together in training.

    [12:47 - 13:42] In practice, strong alignment comes from a lot of paired data. Look at this example again: "a pizza with a bubbly, charred crust." This single pair already gives the model a lot of aligned signal. Then you scale that up to millions of pairs with publicly available datasets like LAION-5B or LAION-400M, which contain diverse images and captions, and the model can generalize beyond one photo or one phrasing. For example, it should align "bubbly crust" with many different photos, not just one particular pizza. The overall idea is: first, the text side is tokenized and turned into a text embedding; on the image side, the picture is processed into an image embedding. At this moment the two live in completely different universes, one produced by the text encoder and the other by the image encoder.

    [13:43 - 14:04] Then we map both embeddings into the same vector space. Think of it as adopting a common coordinate system, so distances and angles mean the same thing for both modalities. Then we align: the training objective pulls true pairs, like "pizza" and the pizza photo, together. This is the core that produces useful behavior in search, captioning, and zero-shot classification.

    [14:05 - 16:02] This is what contrastive learning is. Contrastive learning is the standard recipe behind modern text-to-image alignment. Consider several captions and images at once. At the start of training, the points for "skateboarder," "dog wearing a fabulous jacket," "cool photo of a lizard," and "pizza" are scattered. The model computes the similarity between every caption and every image, and the training loss reshapes the space so that each caption moves toward its correct image and away from incorrect images. After many steps, clusters form: the skateboarder caption is aligned with the skateboard image, the jacket caption sits near the jacket photo, the lizard caption near the lizard photo. Once the space has this structure, you can type a caption and retrieve the image that matches it, or classify an image by comparing it to text labels without fine-tuning on that task. A common misconception is that you need detailed labels like bounding boxes; the supervision is only at the pair level, which is why it scales to web-sized datasets: you simply cannot human-label all that information. Another pitfall is to assume the scores are perfect. They reflect the patterns the model learned from noisy data, so expect occasional near-matches and non-zero similarities across related concepts. This is very important, and it's related to what Mosafa asked. When the universes get smashed together, what you see on the left side is all the text and all the images in random places. As you train and fine-tune, you move them closer to each other: the skateboarder caption moves closer to the skateboarder image, the dog with the fabulous jacket closer to that photo, and so forth. Over time you end up with a universe where everything is matched together properly. This is the core idea behind contrastive learning.
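To make the pull-together/push-apart idea concrete, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss. The batch size, embedding dimension, and temperature are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """CLIP-style loss: matching caption/image pairs sit on the diagonal."""
    # unit-normalize so dot products become cosine similarities
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # (N, N) similarity matrix between every caption and every image
    logits = text_emb @ image_emb.T / temperature

    # row i / column i is the true pair, so the targets are 0..N-1
    targets = torch.arange(text_emb.size(0))
    loss_text_to_image = F.cross_entropy(logits, targets)
    loss_image_to_text = F.cross_entropy(logits.T, targets)
    return (loss_text_to_image + loss_image_to_text) / 2

# toy batch: 4 captions and 4 images, already encoded to 512-d vectors
text_emb = torch.randn(4, 512)
image_emb = torch.randn(4, 512)
print(contrastive_loss(text_emb, image_emb))
```

Minimizing this loss is what pulls each diagonal (true) pair together while pushing every off-diagonal (mismatched) pair apart.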

    [16:03 - 16:36] That closes the high-level arc, and the main idea should now be clearer: we have pairs of data, we convert each into an embedding, we put them into a shared space, and then we pull together examples that belong together and push apart things that don't. So how does this work in practice? We start with a caption, "a Neapolitan pizza with a bubbly, charred crust." Tokenization breaks it into subword pieces, and each token becomes an integer via the tokenizer vocabulary.

    [16:37 - 17:07] Then the embedding layer converts each ID into a vector, and you get a text embedding; you understand this from the previous lecture. In a multimodal setup, what is this text embedding compared to? An image embedding. The comparison does not happen between raw words or pixels; it happens between vectors living in a shared space. A common pitfall: tokens are not always full words, and capitalization and punctuation can change the tokenization. Don't assume that identical wording is required for alignment later; the model learns to map similar phrases near each other.
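As a concrete sketch of the text side, the snippet below tokenizes the pizza caption and turns it into a single text embedding using the Hugging Face CLIP text tower; the checkpoint name and caption are just illustrative choices.

```python
import torch
from transformers import CLIPTokenizer, CLIPModel

checkpoint = "openai/clip-vit-base-patch32"   # a commonly used public CLIP checkpoint
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
model = CLIPModel.from_pretrained(checkpoint)

caption = "a Neapolitan pizza with a bubbly, charred crust"
inputs = tokenizer([caption], padding=True, return_tensors="pt")
print(inputs["input_ids"])                    # subword token IDs, not whole words

with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
print(text_emb.shape)                         # one 512-d vector for the whole caption
```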

    [17:08 - 17:50] Keeping the same pizza photo in mind, the image arrives as a pixel grid. The model works best when inputs share a predictable shape and scale, so all images are normalized and then sliced into squares. That's why we resize and normalize the pixels first. Rather than feeding the entire image in as one blob, we split it into non-overlapping pieces so that the encoder can attend to local details and their relationships. With a vision transformer these pieces are called patches, while a convolutional-neural-network-based encoder extracts similar local features; local features are local attributes.

    [17:51 - 18:22] The image encoder processes these pieces jointly and produces one image embedding. You can think of this vector as the model's condensed description of the picture: shapes, textures, colors, and composition cues that help distinguish a pizza from a skateboarder or a lizard. The exact dimensionality varies by model, but a 512-dimensional vector is a common reference point. The image embedding is the unit of comparison against the text, and once we have text and image embeddings, we can place them in the same space and learn the correct comparison pairs.

    [18:23 - 18:37] We first standardize via resizing and normalization, then we divide the image into smaller parts, then we encode these parts into a single vector that represents the picture. Pre-processing may feel mundane, but it removes unnecessary variation, and the models are trained on batches.

    [18:38 - 20:38] If image shapes vary widely, you waste compute on padding and complicated resizing during training. The standard is 224 by 224 pixels, and encoders with pre-trained weights expect that size, which makes transfer smoother. Normalization addresses value scale. Raw 8-bit channels range from 0 to 255; dividing everything by 255 brings values into 0 to 1. That's the normalization step; many pipelines work in 0 to 1 rather than the raw 0-to-255 values, and many then subtract a mean and divide by a standard deviation that match the encoder's pre-training statistics. This keeps channels comparable and helps gradients behave well. So if we take our pizza photo at 3000 by 4000, we resize it to 224 by 224 and then normalize the channels. The image may look odd if you visualize the normalized values, but to the model it's simply an input with a stable scale. Pitfalls: aggressive resizing can remove tiny details, so choose a resolution that preserves the features your task needs, and make sure the normalization matches the encoder you plan to use. As an analogy, think of a facial structure: a face is composed of ears, a nose, and other parts, which are what's known as mid-level features, while the low level is the components of the eyes, the ears, the nose. So you first slice the image into squares, then a convolutional network picks up the features, then you pass it into a vision transformer, and that's the process by which the model understands objects, not at the pixel level but at a higher level of abstraction.
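Pulling the resize-and-normalize step above into code, here is a minimal torchvision sketch; the file name is a placeholder, and the mean/std values are the commonly published CLIP preprocessing statistics, which you should swap for whatever your encoder's pre-training statistics are.

```python
from PIL import Image
from torchvision import transforms

# resize to the size the encoder expects, scale 0-255 -> 0-1, then standardize
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                                 # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize(mean=[0.4815, 0.4578, 0.4082],    # CLIP-style channel statistics
                         std=[0.2686, 0.2613, 0.2758]),
])

image = Image.open("pizza.jpg").convert("RGB")             # e.g. a 3000x4000 photo
pixels = preprocess(image)
print(pixels.shape)                                        # torch.Size([3, 224, 224])
```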

    [20:39 - 21:24] Cutting the image into pieces serves two goals. First, it surfaces local patterns, like the edges of the crust, bubbles on the cheese, or the shape of the basil leaf. Second, it gives the model a structured way to compose these local cues into a global understanding. With a vision transformer, we flatten each patch, embed it, and then use attention to let the patches exchange information. With a convolutional neural network, each layer plays a similar role by scanning with local filters. For example, imagine 16-by-16 patches across the 224-by-224 image. The encoder can attend to the ring of the crust, the red sauce regions, and the white mozzarella areas, and then compose all of these components into what we know as a pizza. For the skateboarder image, the patches highlight the board edges.

    [21:25 - 23:39] Then it's able to detect the outlines of a human, and for a lizard, the patches capture the distinct scales and shape. If the patches are too large, you might miss fine texture; if they're too small, you inflate the sequence length and slow training. Non-overlapping patches are a common, efficient default in vision-transformer-style encoders. Once we represent the local parts, the image encoder turns the whole picture into a single embedding vector that we can compare with text. Across all of the text and all of the images, what the embedding is doing is learning a hierarchy of concepts. On the text side, the lower-level embeddings capture basic word relationships and things like pluralization, the mid-level features capture relations like cool versus cold, and the higher-level features focus on context-specific meaning. If you want a rough picture of what's happening inside, it's learning different types of abstractions through this process and combining them. We went over this for image fragments: a convolutional embedding captures meaning in a hierarchical way. Early layers detect simple patterns like edges, corners, and textures; mid layers detect complex structures like eyes, wheels, or the crust on a pizza; and then you're able to detect full objects, like a whole pizza, a face, or a tree. This is the step that creates the unit we compare to text. The encoder processes all the patches together so that the local components can reinforce or correct each other: a circular ring plus a red interior plus white blobs plus a green leaf jointly point to a pizza, while a deck shape plus wheels point to a skateboarder. The output is a feature vector; think of it as the model's compact description of the image. Its dimensionality depends on the architecture and the checkpoint, but a 512-dimensional vector is a good mental reference. The vector is not a caption and not a set of object boxes. It's a numeric representation whose distances and angles become meaningful once we put text and image into a single space.
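Here is a rough sketch of the patch step described above in plain PyTorch, assuming 16x16 non-overlapping patches and a 512-d patch embedding; a real ViT fuses this into a strided convolution, but the shapes are the point here.

```python
import torch

img = torch.rand(3, 224, 224)      # an already preprocessed RGB image
P = 16                             # patch size

# carve into non-overlapping 16x16 patches: (3, 14, 14, 16, 16)
patches = img.unfold(1, P, P).unfold(2, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * P * P)
print(patches.shape)               # (196, 768): 14*14 patches, 16*16*3 pixels each

# ViT-style patch embedding: a linear layer over the flattened pixels
patch_embed = torch.nn.Linear(3 * P * P, 512)
tokens = patch_embed(patches)
print(tokens.shape)                # (196, 512): one token per patch for the encoder
```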

    [23:40 - 25:54] Now we have both sides, text embeddings and image embeddings, and we put them in the same space. We project them into a common dimensionality and normalize them so distances become comparable. Then you compute the similarity between every caption and every image, and you apply contrastive learning: you move the pizza text toward the pizza image, the skateboarder text toward the skateboarder image, and so forth. The first step is a linear projection, a lightweight layer that maps each modality's embedding to the target dimensions. What we described at a very high level is that we have different universes, an image universe and a text universe, and we're smashing them together, so we have to make sure both land in the same dimension; that's what the linear projection does. If a text encoder outputs 768 dimensions and an image encoder outputs 1024, we project both to, say, 512. This is not about adding depth or complexity; it's about making shapes line up so we can compare vectors consistently. Obviously, if the text is 768-dimensional and the image is 1024-dimensional, you can't directly compare them. Normalization then scales each projected vector to unit length, which removes differences in raw magnitude that would otherwise bias the dot product. With unit vectors, the dot product becomes the cosine similarity, so you can cleanly measure how aligned two directions are. For example, take the pizza caption embedding and the pizza image embedding, both projected to 512 dimensions and normalized, and do the same for the skateboarder, the dog with the jacket, and the lizard. Now all five text vectors and all five image vectors live in the same 512 dimensions with unit length, and you can safely compare any text to any image. So roughly, at a high level, you put everything into the same dimension, 768 or 1024 down to 512, and then you can use cosine similarity to compare things very quickly.
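A minimal sketch of the projection-and-normalization step, assuming a 768-d text encoder, a 1024-d image encoder, and a 512-d shared space as in the example above; the encoder outputs are random stand-ins.

```python
import torch
import torch.nn.functional as F

text_feat = torch.randn(1, 768)      # output of a hypothetical text encoder
image_feat = torch.randn(1, 1024)    # output of a hypothetical image encoder

# lightweight linear projections into the shared 512-d space
text_proj = torch.nn.Linear(768, 512)
image_proj = torch.nn.Linear(1024, 512)

# project, then normalize to unit length so the dot product is cosine similarity
text_vec = F.normalize(text_proj(text_feat), dim=-1)
image_vec = F.normalize(image_proj(image_feat), dim=-1)

cosine = (text_vec * image_vec).sum(dim=-1)
print(cosine.item())                 # near 0 before any training
```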

    [25:55 - 26:21] Because if you can't compare things quickly, you can't train it properly. A common pitfall: if you skip normalization, the similarity score is dominated by vector length, not content. Another pitfall: always project both sides to the same output size before the comparison. A question came up here: when you do that projection and go from 1024 to 512, are you dropping data or making assumptions, the same way you compress a JPEG or something?

    [26:22 - 27:29] Yeah, whenever you do that, you are dropping some type of information, just like a JPEG. You have to do it constantly when you're dealing with these systems. This is one form of compression, and there are other forms, but you're constantly compressing simply because, from a multimodal standpoint, you have to put everything into the same shared space. So here's how to think about it: you have the input image, then the image encoder, which obviously abstracts away a lot of the steps you just saw, and you get the output, a set of vectors. These vectors are now mapped into a space where everything has the same dimension, 512 or whatever it ends up being. Then you start calculating similarity, and depending on the similarity, you move things closer together or farther apart. From that you produce what's known as a similarity matrix: just a big table that records this is close to this, this is not close to that, and so on.

    [27:30 - 32:02] This matrix makes sense only after projection and normalization put the text and image vectors into the same space. From there, the cosine similarity between any caption and any image is a clean, comparable number. We collect these numbers for the entire batch into a table, rows for captions and columns for images, and look at every cell as a pair. Why does this matter for training? Think of the matrix as a scorecard for the batch. In each row, the diagonal cell is the one we want to be the highest; in each column, the diagonal is again the one we want to be the highest. The training step uses this table to compute a small nudge to the model: it increases the diagonal scores, decreases the off-diagonals, and repeats this over many batches, gradually reshaping the space so that matching text and image pairs move closer and mismatches move apart. That simple rule, up for the right cell and down for the others, is contrastive learning. Having this similarity matrix is also useful because you can plot it and catch issues quickly: batch ordering problems, missing normalization, captions that don't match their images. Think of two matrices, one holding n text vectors and one holding n image vectors. Because we normalized them, a dot product gives cosine similarity, and doing this for every caption and image fills an n-by-n table. We lay the captions down the left, the images across the top, and the cell at row i, column j is how similar caption i is to image j. When we lay it out this way, the diagonal is special: by how we assemble the batch, the diagonal cells are the ground-truth pairs. Before training, the diagonals might not be the largest values in their rows or columns. The table is both a readable snapshot and the raw material we use to adjust the model on correct and incorrect pairs. Walk the matrix slowly: start with one caption and scan across. For "dog wearing a fabulous jacket," the correct image should light up with the highest value in that row. Nearby concepts can register small positives; for instance, a dog on a skateboard might share fur texture or background color, so it gets a small but noticeable score. If you move to "cool photo of a lizard," the lizard image should dominate while unrelated images stay near zero. If you look at the pizza caption, the pizza photo should stand out while a car image should be near zero; you may still see small scores for images with visually similar colors or textures in the background. This is normal and reminds us that models learn patterns from many examples, not perfect labels. Scores are signals, not absolute truth; they reflect what the model has learned so far, and the matrix becomes cleaner through training. The rule we apply next is contrastive learning, which raises the right cells and lowers the wrong ones, producing distinct neighborhoods where captions and images sit together. So at this point we have embedded captions and images, projected them into the same dimension, normalized to unit length, and computed a similarity matrix. This is the input to the learning. In each batch, the diagonal cells correspond to the true text-image pairs while the off-diagonals represent mismatches, and training reinforces exactly that structure.
So the matrix plays two roles. As a diagnostic, it helps you see whether the pizza caption is closely aligned with the pizza photo and whether "dog wearing a fabulous jacket" correctly lights up the jacket image. As a learning signal, the matrix flows into a loss function that rewards high similarity for each true pair and penalizes high similarity for the rest. This is why we built the matrix here and not earlier: it only makes sense after both modalities live in the same normalized space. So this is the loop end to end: for a batch of n captions and n images, you compute the n-by-n similarity matrix. Along each row, the diagonal entry is the correct image for that caption; along each column, the diagonal entry is the correct caption for that image. The contrastive loss pursues two goals at once: it makes each diagonal cell higher than the other cells in its row and higher than the other cells in its column. The implementation varies in its details, but the principle is the same. This repeats over many batches, and the shared space organizes so that our running examples separate cleanly: pizza text and pizza photos cluster, skateboarder text and image cluster, "dog wearing a fabulous jacket" finds its image, and off-topic pairs are pushed away.
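As a quick diagnostic sketch, you can build the n-by-n matrix for a small batch and check whether the diagonal wins each row and column; the embeddings here are random stand-ins for whatever your encoders actually produce.

```python
import torch
import torch.nn.functional as F

# stand-in batch: 4 caption vectors and their 4 paired image vectors (512-d)
text_emb = F.normalize(torch.randn(4, 512), dim=-1)
image_emb = F.normalize(torch.randn(4, 512), dim=-1)

sim = text_emb @ image_emb.T            # rows = captions, columns = images
print(sim)                              # the diagonal holds the true pairs

# how often is the diagonal the best match in its row / column?
row_hits = (sim.argmax(dim=1) == torch.arange(4)).float().mean()
col_hits = (sim.argmax(dim=0) == torch.arange(4)).float().mean()
print(f"caption->image accuracy: {row_hits:.2f}, image->caption accuracy: {col_hits:.2f}")
```

Before training these accuracies hover around chance; as the contrastive loss reshapes the space, the diagonal should win almost every row and column.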

    [32:03 - 33:19] So how does this unroll exactly? Take the example "Neapolitan pizza with bubbly crust." You can see the image embedding, something like 0.12, -0.35, 0.88, while the text embedding starts out randomly initialized. The model sees the first pizza example: the image encoder extracts essentially random patches, the text encoder produces an untrained text embedding, and the contrastive loss forces the vectors slightly closer; the cosine similarity is -0.1. Then the model starts recognizing useful features, round shapes, red toppings, and the loss function nudges the embeddings closer together: the cosine similarity rises to 0.1, a little higher. Then the vision transformer starts recognizing pizza features, crust edges, cheese textures, and the text encoder starts associating "pizza" with images of circular, red and golden round objects, and the cosine similarity goes to 0.25. Then the model learns the difference between pizza and other foods. It sees similar images with different descriptions, for example burgers and pasta, and the contrastive loss ensures that pizza doesn't get confused with other foods; the cosine similarity reaches 0.4 with a clear pizza-text association. And this is a negative example of something that's not pizza.

    [33:20 - 34:19] Then words like Neapolitan, charred, crust, and bubbly get strongly associated with crust-like features. This one is a picture of a burnt pizza: compared with the earlier strongly negative example, it's only slightly negative, and the model starts to learn the difference between a burnt pizza and one that isn't burnt; the cosine similarity becomes 0.55. Then the model starts to generalize its understanding of pizza: it doesn't necessarily have to be round; sometimes when you go to a pizzeria, it's a rectangle. Now the cosine similarity is 0.7, high confidence. Then it learns from additional examples, like a Chicago-style pizza, and from different angles, not just from the top, and it now understands a bubbly crust with golden spots and burnt areas; the cosine similarity is 0.85.

    [34:20 - 35:41] Then you have an example where the crust is higher, and the cosine similarity reaches 0.9. You can also take a look at image embeddings and play with them yourself; there's a URL you can use, image/search.oape.com, where you can type in zebra, horses, or other options and explore the embeddings. So we've seen how text and image are aligned to each other through embeddings: you place them into the same space and train the model to pull pairs together. The same thing applies to text and audio; the key difference is how audio is prepared and encoded before it becomes an embedding. We looked at the pizza with basil as an image. For audio, we need many examples in many pairs, from different voices, accents, microphones, rooms, and background noises, so that the model learns a robust connection between meaning and sound. That way it can distinguish background noise from foreground speech and connect everything. Without that, the model can't reliably place the text vector next to the audio vector for the same concept. And one common pitfall is that pairs are often noisy in real datasets.

    [35:42 - 36:36] A transcript can be imperfect, and a clip can contain background noise. The strength comes from the volume and diversity of examples, not from any single example being error-free. How does this map to practice? One example: you could build software for accent reduction, training people with accents to soften them, or build something that does real-time accent detection and correction. There's a company in SF that trained a model to do exactly that in real time, because many call-center workers are in the Philippines or India, and real-time accent reduction is a real use case. So you can train the model to do different things. Then you encode the text and the audio, map both into the same space, and align the pairs so that the "pizza with basil" text and the "pizza with basil" audio line up as neighbors.

    [36:37 - 37:10] We're not going to go into the text embedding part, because you already know it. The audio path is new. We first standardize the signal and represent it in a way the model can learn from, typically by splitting the waveform into short frames and computing frequency-based features from each frame, since raw waveforms are not stable enough across voices and environments. Once we have a text embedding and an audio embedding, we still can't compare them directly if they live in different spaces. So we bring them into a shared vector space where everything is comparable, and then we apply the same alignment rule as with images.

    [37:11 - 41:24] We nudge correct pairs closer and incorrect pairs farther apart, which reduces confusion later when we introduce contrastive learning for audio and text. This is very similar to images: a text embedding and an audio embedding. A sound clip is a continuous waveform, and you have to translate it into a fixed-length embedding. This requires pre-processing: resampling, normalization, breaking the signal into short frames, extracting features, and finally using an encoder to summarize everything into an audio embedding. Most models expect patterns to appear on similar scales. If one clip is recorded at 48 kilohertz and another at 16 kilohertz, the raw features would not be directly comparable. Resampling fixes that by converting everything to the same rate. It doesn't change what was said, but it puts the clips on the same consistent time grid. Microphones and environments also create wildly different loudness levels. Normalization rescales the waveform so that the encoder focuses on the shape of the sound rather than the absolute volume; typically you bring 16-bit integer samples into a floating-point range from -1 to 1. It's a practical way to reduce variance and speed up learning. Pre-processing isn't about making audio perfect, it's about removing avoidable differences so the encoder can learn robust speech patterns. Then we slice the standardized signal into short frames and compute features the encoder can learn from. Framing is how we turn a continuous waveform into manageable pieces. Instead of feeding the whole one-to-two-second clip to the model as one blob, we slice it into short windows that overlap slightly. In speech, important cues such as consonants or quick transitions happen fast. A 25-millisecond window is short enough to capture those fast changes, while a 10-millisecond step provides overlap so we don't lose what happens at frame boundaries. Take a 1.5-second recording of someone saying "pizza with basil." After resampling and normalization, we split it into 25-millisecond frames every 10 milliseconds. This yields a sequence of frames, each ready for per-frame feature extraction. The per-frame feature extraction creates a time-by-feature matrix that summarizes how the spectrum evolves across the phrase, and the audio encoder consumes this matrix and produces one audio embedding vector for the entire clip. Everything is then projected into a common space for similarity and training. Once we have a stable vector for the clip, the encoder is the piece that has turned many short, detailed frames into a fixed-length representation: the "pizza with basil" clip flows through the encoder and comes out as a vector, say 512-dimensional. This vector should be similar whenever the same phrase is spoken clearly, and meaningfully different for unrelated phrases like "turn off the lights." The linear projection is a lightweight layer that maps each embedding into the target size; we covered this already, but it's an important concept. In short: linear projection, then similarity, then contrastive loss, and the text and audio converge.
So, to recap the audio recipe: we pair audio and text examples, encode audio and text into embeddings, project and normalize them into the same space, apply the same similarity and contrastive loss, and end up with something you can search over. From here you can add practical pieces, like search or text-to-speech, depending on your product needs. You can also fine-tune or adapt the audio encoder for your domain, specialized vocabulary, or acoustics, and connect this to evaluation so you understand whether you're going in the right direction. You can build a lot of different things: voice assistants, smart devices, customer service bots. In practice, people use models like DeepSpeech or Whisper and fine-tune them rather than build everything from scratch. DeepSpeech was one of the earliest open-source efforts, and Whisper uses a transformer architecture, making it more accurate and more robust.
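A minimal sketch of the audio preprocessing described above using librosa, assuming a local file name; 400 samples at 16 kHz gives the 25 ms window and 160 samples gives the 10 ms hop.

```python
import librosa
import numpy as np

# load and resample to a common 16 kHz grid; librosa returns float values in [-1, 1]
waveform, sr = librosa.load("pizza_with_basil.wav", sr=16000)

# peak-normalize so loudness differences between microphones matter less
waveform = waveform / (np.max(np.abs(waveform)) + 1e-8)

# 25 ms windows (400 samples) every 10 ms (160 samples), summarized as a mel spectrogram
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)   # (80 mel bands, number of frames) -> input to the audio encoder
```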

    [41:25 - 41:50] Once the large language model provides a response, the text is only half the story; to complete the voice loop, we need to speak it back to the user. That's where text-to-speech comes in. Early text-to-speech sounded robotic because it stitched together pre-recorded sound units. Newer text-to-speech systems, like WaveNet or Tacotron, are much smoother and more natural, with correct pacing, stress, and intonation.

    [41:51 - 43:21] If you combine all of this, that's where you get voice agents. You can build voice response systems, IoT devices, customer service bots, accessibility tools, and hands-free applications like dictation, meeting assistants, or in-car assistants. This locks in the big idea we use throughout the course: it's a general recipe, and we just swap out the components, the encoder and the pre-processing. You create a text embedding and a modality-specific embedding, project and normalize them into a shared space, apply contrastive learning, and the result is that the "pizza with basil" text sits close to a "pizza with basil" audio clip, and both sit close to a photo of a pizza with basil inside the same space. At this point we've handled the basics of text plus image plus audio, and video is the next step. Video is a little more complex because it has the time dimension. A video is a sequence of images, but it's not enough to look at each frame alone; you need to capture how things happen over time. Take a look at "Will Smith eating spaghetti"; you can just Google it as a YouTube video. It shows Will Smith eating spaghetti across different generations of generative AI models, and you can see how the models have progressed. One of the common failures is not capturing the sequence of frames, how objects transition between frames. So the state of the art is that you have to represent time within the tensor.

    [43:22 - 43:30] So it's a four-dimensional tensor, and you have to represent time alongside the channels. Time represents the sequence of frames; each frame is a snapshot.

    [43:31 - 45:30] Height represents the number of pixels vertically in each frame, width represents the number of pixels horizontally, and channel represents the RGB channels. So a typical video tensor is shaped like time by height by width by channel. This is the high-level architecture from Wan, Alibaba's video diffusion transformer. As I mentioned before, most of the open-source state-of-the-art architectures, at least right now, are typically Chinese, for a variety of competitive reasons; Google, OpenAI, and Anthropic have gone almost entirely closed source. At a high level, this is probably the state of the art for open-source video training. What you do is split everything into frames, then encode every frame, making sure to capture both the spatial dimensions and the time dimension. Then you do time aggregation: frame-level embeddings are aggregated using a diffusion transformer, you create a unified embedding for the full video clip, you align it with the text embedding, the model aligns video and text using cross-attention, and you train it with a contrastive loss so that matching video and text pairs are close in the space. There are similarities to what we've seen, but the key thing is the data structures. The video is encoded into something called a latent; a latent is just another compressed data structure that people use. So first the video is encoded into a latent, then the latent video is split into patches and processed in sequence, then it's passed into transformer blocks with attention applied over space and time, then diffusion is applied, where the transformer predicts how to reduce the noise in the latent representation, and then it's merged with the text as well.
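To make the tensor shape concrete, here is a tiny sketch that represents a clip as a (time, height, width, channel) tensor, encodes each frame with a stand-in frame encoder, and mean-pools over time. Real systems like the diffusion-transformer pipeline described above are far more involved, so treat this purely as a shape illustration under assumed sizes.

```python
import torch

T, H, W, C = 16, 224, 224, 3            # 16 frames of a short clip
video = torch.rand(T, H, W, C)          # (time, height, width, channel)

# stand-in frame encoder: flatten each frame and project to 512-d
frame_encoder = torch.nn.Linear(H * W * C, 512)

frames = video.reshape(T, -1)           # (16, 150528): one row per frame
frame_embs = frame_encoder(frames)      # (16, 512): one embedding per frame

# simplest possible time aggregation: average the frame embeddings
video_emb = frame_embs.mean(dim=0)      # (512,): one vector for the whole clip
print(video_emb.shape)
```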

    [45:31 - 48:10] If you want to read a detailed report, there are really only two to look at: one is Wan I2V and the other is CogVideoX, if you want to go into it on a more detailed basis. Now for the CLIP embedding exercise. CLIP is OpenAI's image-text model, and what you're going to learn here is how to load and use a pre-trained CLIP model, how to turn images and text into embeddings, how to compute a similarity score between them, and how to understand the vector space. You load a sample image, embed the image and the text options using CLIP, calculate their similarity, and look at which pairs match. This is similar to the previous exercises for text embeddings: you load the CLIP model, process the image with some image- and URL-handling libraries, take the actual text, run the CLIP model, and calculate the cosine similarity. Then you experiment with prompts. Just like how we experimented with king and queen in the embedding exercise, the idea is that you'll feed in one image of a cat and four different text prompts describing it in different ways, compute how well each prompt matches the image, and rank them by similarity. You use the same image of a cat as before, test four different prompts, compute cosine similarity, and print a ranked list from best to worst. Just like the earlier question about the pizza with the family, you'll start to see how well or how badly the model natively understands the different concepts. So you load the image of the cat, pre-process the prompts, and feed everything in. Then what we want to learn is how to embed multiple images and one text prompt, compute similarity, and sort and retrieve the top N best matches. The idea is to do a search for the most relevant images based on your text prompt and display the top three matches. Why does this matter? Some of you have e-commerce use cases, and if you want to do e-commerce, you may want to consider this. It's not just captioning; you can also use it for search. You're able to see the CLIP score, so you can use this to build intuition; you can even try the pizza example, a pizza with a family, a pizza inside a scene with a family, just to see how well it behaves under different conditions.
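A sketch of the prompt-ranking exercise with the Hugging Face CLIP API; the image path and the four prompts are placeholders for whatever you test with.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("cat.jpg")                     # placeholder local image
prompts = [
    "a photo of a cat",
    "a fluffy cat sleeping on a couch",
    "a photo of a dog",
    "a freshly baked pizza",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the scaled cosine similarity of the image to each prompt
scores = out.logits_per_image.squeeze(0)
for prompt, score in sorted(zip(prompts, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:6.2f}  {prompt}")
```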

    [48:11 - 50:26] Then what we see here is image-to-image retrieval, without any text. You embed one image as the query, embed a folder of other images, and use cosine similarity to rank them by closeness. You may have used Google image search; this is almost like Google image search. You learn how to embed multiple images and compare them, loop through the image data, compute similarity, and rank by score. So this is the query, and then out of your database, if it includes a cat and a tiger, you see how well each one matches. You can also do math with images, just like how you can do math with text embeddings: you can subtract and add image embeddings. For example, you can start with one image like a zebra, subtract stripes and add spots, and see what the results are. Sometimes it works and sometimes it doesn't; that's how it is. You can run these examples and suggestions and see what gets returned. Then you get to vision-and-language transformers, where you do visual question answering: something like "what type of dog is this?" or "what object is the student pointing at?" If you want to make a DIY version, think accessibility tools for blind users: load an image, write your questions about it, and use a VQA model to get the answer. Then you have image captioning, which uses a vision transformer and a language model. You can use this for Instagram or TikTok auto-captioning, visual storytelling, or AI assistants that describe your surroundings; the new Meta glasses can describe everything you see, and they use some form of this. You'll feed images to a captioning model and compare its output with your own descriptions. And that's it; as always with the exercises, the answers are provided.
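And a sketch of the image-to-image retrieval exercise: embed one query image and a folder of candidates with CLIP's image tower, then rank by cosine similarity. The file paths are placeholders.

```python
import glob
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

def embed_image(path):
    """Return a unit-normalized CLIP image embedding for one file."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    return F.normalize(feat, dim=-1)

query = embed_image("query_cat.jpg")                 # placeholder query image
paths = sorted(glob.glob("images/*.jpg"))            # placeholder image folder
database = torch.cat([embed_image(p) for p in paths])

scores = (query @ database.T).squeeze(0)             # cosine similarity per image
for idx in scores.argsort(descending=True)[:3]:      # top-3 closest images
    print(f"{scores[idx]:.3f}  {paths[idx]}")
```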