Advanced RAG

- Intro to RAG and Why LLMs Need External Knowledge
- LLM Limitations and How Retrieval Fixes Hallucinations
- How RAG Combines Search + Generation Into One System
- Fresh Data Retrieval to Overcome Frozen Training Cutoffs
- Context Engineering for Giving LLMs the Right Evidence
- Multi-Agent RAG and Routing Queries to the Right Tools
- Retrieval Indexes: Vector DBs, APIs, SQL, and Web Search
- Query Routing With Prompts and Model-Driven Decision Logic
- API Calls vs RAG: When You Need Data vs Full Answers
- Tool Calling for Weather, Stocks, Databases, and More
- Chunking Long Documents Into Searchable Units
- Chunk Size Trade-offs for Precision vs Broad Context
- Metadata Extraction to Link Related Chunks Together
- Semantic Search Using Embeddings for Nearest-Neighbor Retrieval
- Image and Multimodal Handling for RAG Pipelines
- Text-Based Image Descriptions vs True Image Embeddings
- Query Rewriting for Broad, Vague, or Ambiguous Questions
- Hybrid Retrieval Using Metadata + Embeddings Together
- Rerankers to Push the Correct Chunk to the Top
- Vector Databases and How They Index Embeddings at Scale
- Term-Based vs Embedding-Based vs Hybrid Search
- Multi-Vector RAG and When to Use Multiple Embedding Models
- Retrieval Indexes Beyond Vector DBs: APIs, SQL, Search Engines
- Generation Stage: Stitching Evidence Into Final Answers
- Tool Calling With Multiple Retrieval Sources for Complex Tasks
- Synthetic Data for Stress-Testing Retrieval Quality Early
- RAG vs Fine-Tuning: When to Retrieve and When to Update the Model
- Prompt Patterns for Retrieval-Driven Generation
- Evaluating Retrieval: Recall, Relevance, and Chunk Quality
- Building End-to-End RAG Systems for Real Applications

  • [00:00 - 00:10] Retrieval augmented generation is primarily about databases: AI databases and how we use them with private data. The idea is that you're doing two things.

    [00:11 - 00:15] One is retrieval. And the second thing you're doing is generation.

    [00:16 - 00:26] Retrieval is where you're supplementing the model with your private data. In this lecture, we're going to be covering the basic form of retrieval.

    [00:27 - 00:46] Not the advanced form just yet. The idea is that the LLM has certain data, but there are multiple reasons why you would want to retrieve: to give it more memory, and, for example, when we talk about prompt engineering and context engineering, we want to give the LLM examples of what to do.

    [00:47 - 00:53] So there are a lot of reasons why we want to give it examples from our private data. The other thing is real-time data.

    [00:54 - 01:06] Anything that's semi-real-time or real-time, you need to use a retriever, not only traditional databases. The typical misconception with retrieval augmented generation is that it's just one retriever.

    [01:07 - 01:13] A retriever is a broader category. You can have multiple retrievers attached to generation.

    [01:14 - 01:35] Some of the advanced concepts are where we have multiple retrievers combined together with generation. But the basic retriever is what people right now refer to as 2023-style retrieval augmented generation: you have an AI database, you do similarity matching, and then you pair it with a large language model.

    [01:36 - 01:54] And then the other thing, before I go a little bit further: if you haven't already, create your ideas inside the project, use the Jupyter notebook to brainstorm, and please book a time with us on the projects as well. Just as a reminder, today we had a conversation with Julia and she came with 14 ideas.

    [01:55 - 02:03] Maybe it's not 14, but it seemed like a lot. But however many ideas you have, it's good to walk through the different ideas.

    [02:04 - 02:28] So we're going to go over retrieval augmented generation: why it's needed, what chunking is, what embeddings are, what vector databases are, retrieval and generation, explore retrieval techniques that improve accuracy, gain a high-level view of the RAG pipeline, and then cover the classic and more advanced architectures. And then, more importantly, see how evaluation ties everything together.

    [02:29 - 02:42] Retrieval augmented generation combines retrieval with the reasoning power of large language models. Traditional large language models can only draw on the patterns they learned during training.

    [02:43 - 02:52] This often leads to issues such as hallucination, outdated facts, and missing domain-specific knowledge. Retrieval augmented generation addresses this by combining two steps.

    [02:53 - 03:05] First, retrieving relevant information from external sources, then generating an answer that uses both the retrieved knowledge and the model's reasoning capabilities. The result is an output that is both fluent and factually grounded.

    [03:06 - 03:20] A lot of times when you see citation engines, for example Perplexity or OpenEvidence, these are retrieval augmented generation: they cite particular pieces of information inside a database and use that as the source of the citation.

    [03:21 - 03:32] The LLM is capable of summarizing, reasoning, and explaining, but it may invent or misremember details. By enabling retrieval, the model can check information before presenting it.

    [03:33 - 03:47] A lot of systems, search engines, and enterprise assistants now rely on retrieval augmented generation. The main reason why retrieval augmented generation is necessary lies in the weaknesses of LLMs when used on their own.

    [03:48 - 03:51] LLMs are prone to hallucination. They produce incorrect or fabricated information.

    [03:52 - 04:02] This is acceptable in casual use, but becomes dangerous when decisions depend on accuracy, such as in medicine, law, or financial services. Another issue is that training data is frozen at a certain point.

    [04:03 - 04:18] A model trained in 2024 doesn't know about events or updates from 2025 unless it's retrained. That makes it ineffective for tasks where freshness matters, such as asking about stock prices, current weather, or new regulations.

    [04:19 - 04:28] And then large language models are not automatically aware of domain-specific data, for example your company's internal policies, medical research databases, or product documentation.

    [04:29 - 04:41] These challenges together explain why large language models alone are not sufficient for many real-world workloads. One of the best ways to understand retrieval augmented generation is to notice how similar approaches are already in use.

    [04:42 - 04:52] A search engine like Google retrieves documents from the web. It doesn't just list results; it also generates a summary that explains the answer in clean language.

    [04:53 - 04:57] And this is retrieval combined with generation. Customer support bots are another example.

    [04:58 - 05:05] They do not invent answers. Instead, they retrieve information from documentation or knowledge bases, and they present it in a conversational way.

    [05:06 - 05:13] The retrieval ensures accuracy while the generation component makes the response natural and easy to follow. A simpler example is weather inquiries.

    [05:14 - 05:26] When asking what the weather is in New York today, a system fetches live weather data from an API and then formats it into a human-friendly statement. This is essentially retrieval augmented generation: you retrieve first, then generate.

    [05:27 - 05:40] If you notice here, we're not just talking about retrieval as purely a vector database or an AI database. We're treating it much more broadly, because retrieval, as you'll see in later lectures, can use relational data.

    [05:41 - 05:44] It can use vector data. It can use different forms of search indices.

    [05:45 - 05:51] And then you combine this with generation. So on a broader basis, what you're doing is creating a search index.

    [05:52 - 06:06] This is why we define retrieval much more broadly. Because classically, if you look it up, if you go to Google or ask ChatGPT, retrieval augmented generation is defined as vector databases plus LLMs.

    [06:07 - 06:13] So these scenarios basically demonstrate... yes, Julia? - Yeah, I'm wondering here about the weather retrieval.

    [06:14 - 06:30] What is the difference? Because I was thinking that weather retrieval is not RAG, it's a tool or MCP, and RAG is going to a database and finding similarity between my request and what's in the vector database.

    [06:31 - 06:51] - Yeah, the classic definition is vector databases, finding similarity, combined with generation. But what you see in production is that a lot of times you're actually combining it with relational databases for metadata filtering, and you're combining it with a lot of other types of search indices in practice.

    [06:52 - 07:04] So the classic definition is vector database similarity plus generation. But for the point of this lecture and also for the course, we actually define it a little bit more broadly.

    [07:05 - 07:18] This is why we mentioned the weather API, which is retrieving something, processing it, and then combining it with a large language model. You'll see why we define it this generally a little bit later, in advanced RAG.

    [07:19 - 07:26] It's because a lot of the advanced RAG techniques and production-level techniques all combine multiple types of retrievers. That's why we define it broadly.

    [07:27 - 07:34] But yes, to answer it very shortly, you are correct: similarity matching with a vector database is the classic definition.

    [07:35 - 07:46] - Okay, thank you. - These scenarios demonstrate that retrieval augmented generation is already embedded in daily life. Now we want to define retrieval augmented generation a little bit more formally.

    [07:47 - 07:56] By default, a large language model can only generate from patterns learned during training. Once this knowledge is frozen, it can't automatically update with new information.

    [07:57 - 08:11] The retrieval step is like performing a database query in software engineering. Just as a backend service might query a SQL database, retrieval augmented generation queries a vector database and/or a search index to find the most relevant documents for the user's input.

    [08:12 - 08:19] Then an augmentation step adds the retrieved context to the original query. And then the generation step is where the large language model takes over.

    [08:20 - 08:32] With the enriched query, the model produces a response that blends its built-in knowledge with the newly retrieved context. This helps the model provide answers that are accurate, up to date, and specific to the problem at hand.

    [08:33 - 08:49] This workflow demonstrates the mechanics of RAG at its very simplest. The process begins with a user question; on its own, a large language model may provide outdated or incorrect information, so the RAG system retrieves a relevant passage from a database or a trusted source.

    [08:50 - 09:02] The retrieval provides the necessary context. After particular chunks of the document are retrieved, they must be combined with the user query and embedded directly into the prompt before passing it to the LLM.

    [09:03 - 09:11] This ensures that the model can see the original question and supporting evidence at the same time. Without augmentation, the model would not know how to incorporate the retrieved material.

    [09:12 - 09:20] The large language model then generates a polished response, incorporating the retrieved facts into natural-sounding sentences. This combination has several advantages.

    [09:21 - 09:29] It ensures that answers are not limited to the model's training data. It integrates private or domain-specific knowledge that the model never saw during training.

    [09:30 - 09:38] Consider the example question, "Who is the CEO of OpenAI?" A retrieval step might fetch a line from internal documents, for example:

    [09:39 - 09:47] "Sam Altman is the CEO of OpenAI." The augmentation stage takes this retrieved line and inserts it into the LLM input prompt alongside the question.

    [09:48 - 09:52] The model generates a polished response: "The current CEO of OpenAI is Sam Altman."

    [09:53 - 10:07] So we just saw that retrieval augmented generation isn't just a black box bolted onto the LLM. We'll first go into where the data actually comes from and how raw information needs to be prepared before retrieval can succeed.

    [10:08 - 10:15] This begins the breakdown of components, starting with the data sources and chunking. Retrieval begins with a simple but crucial question.

    [10:16 - 10:24] Where does the knowledge come from? Unlike a standalone LLM, a retrieval augmented generation system can connect to a lot of external sources of data.

    [10:25 - 10:33] This is why we define it much more broadly. You can pull in PDFs, research papers, transcripts, and structured data such as SQL tables and enterprise systems.

    [10:34 - 10:48] Then you can have vector databases that compare meaning rather than exact words. And you can also combine live APIs and live feeds, with access to real-time data like stock prices, weather conditions, or breaking news.

    [10:49 - 10:56] The important message is that RAG is not restricted to one source. It's a flexible design pattern in which external knowledge can be integrated into the model's reasoning.

    [10:57 - 11:11] This sets up the next slide, which clears up a common misconception: that RAG is only about vector databases plus LLMs. A frequent misunderstanding is that retrieval augmented generation is an LLM connected just to a vector database.

    [11:12 - 11:17] This is an oversimplification. Vector databases are important, but they're not the only way that retrieval can be performed.

    [11:18 - 11:24] Retrieval is a broader concept. For example, a system may pull results from Google Search and feed them to an LLM for summarization.

    [11:25 - 11:33] That is a form of RAG. If an enterprise system queries a SQL database for internal records, the results can be sent to an LLM to generate a natural language explanation.

    [11:34 - 11:43] This clarification helps prevent anchoring on the wrong mental model. Thinking of RAG only as vector DB plus LLM isn't necessarily the right approach.

    [11:44 - 11:57] RAG can be implemented in many different ways, from APIs to SQL queries to full-scale search engines. When introducing RAG, people often start with a basic vector database with a large language model.

    [11:58 - 12:10] That provides the clearest demonstration of the underlying mechanics. Once you understand vector database plus LLM, you can extend it further to API calls, SQL queries, and the more general retriever setup.

    [12:11 - 12:18] The classic RAG workflow requires documents to be broken down into manageable pieces. So this is like the classic workflow.

    [12:19 - 12:34] You take a document or a set of documents and break them down into chunks before they can be searched effectively. If entire documents such as books, long manuals, or transcripts were loaded into a vector database whole, search would be inefficient and retrieval results would be imprecise.

    [12:35 - 12:51] Chunking is the first thing introduced because it's the first step in data preparation for the classic vector database setup. Even though vector databases haven't been fully explained yet, it's important to understand that we're processing things at the level of smaller units, not entire documents.

    [12:52 - 13:02] Chunking ensures that when we later store and search embeddings, the system can find relevant information at the right granularity. This doesn't mean that all forms of retrieval need chunking.

    [13:03 - 13:24] For example, if a system pulls data directly from an API or SQL query, chunking may not be needed, but in the classic RAG pipeline, the one we're using to build foundational understanding, chunking is unavoidable. One of the most critical processing steps is chunking: documents such as books, PDFs, and transcripts can't be directly handled by an LLM because of input length limits.

    [13:25 - 13:32] Even if they could, searching an entire document is inefficient and imprecise. Chunking solves this by dividing the documents into smaller, meaningful pieces.

    [13:33 - 13:43] Each chunk is small enough to be processed individually but large enough to carry useful context. For example, instead of indexing a 200-page manual, the system might break it into chunks of a few hundred words each.

    [13:44 - 13:54] Chunking is a foundational step. A lot of people think that chunking is just breaking the text into pieces, but it actually depends on the document.

    [13:55 - 14:05] For example, if you're chunking code, you do something called abstract syntax tree chunking. So it's not just naive chunking where you break by paragraph or by some other fixed boundary.

    [14:06 - 14:12] Choosing the right chunk size is a balancing act. Larger chunks capture more surrounding context.

    [14:13 - 14:21] This is helpful when queries require understanding relationships between multiple sentences. But large chunks can also dilute the relevance of retrieved results.

    [14:22 - 14:29] If a query is specific, the chunk may contain too much unrelated material. Smaller chunks, by contrast, often offer more precision.

    [14:30 - 14:36] A query about a single fact is more likely to retrieve exactly the right passage when chunks are smaller. The downside is fragmentation.

    [14:37 - 14:47] Context may be lost if sentences are separated too aggressively. So designing a RAG system often requires experimentation with different chunk sizes to find the right balance for the domain.

    [14:48 - 14:54] One of the techniques for striking that balance is the sliding window. The sliding window technique is an enhancement to basic chunking.

    [14:55 - 15:05] Instead of dividing documents into rigid, non-overlapping sections, the sliding window allows chunks to overlap slightly. This means that the end of one chunk and the beginning of the next share some context.

    [15:06 - 15:14] The benefit is that meaning is preserved across the boundary. For example, a sentence at the end of one chunk may be essential to understanding the first sentence of the next.

    [15:15 - 15:22] If chunks were strictly separate, this connection would be lost. Overlapping ensures continuity and reduces the risk of fragmented retrieval.

    [15:23 - 15:34] This technique is particularly useful in transcripts or continuous text where ideas flow naturally from one section to another. Without overlap, queries may retrieve incomplete passages that lack critical context.
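
As a concrete illustration of the sliding window, here is a minimal sketch of fixed-size chunking with overlap; the chunk size and overlap values are illustrative choices, not numbers prescribed in the lecture.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Sentences near a boundary appear at the end of one chunk and the start of the next,
# so a query matching either chunk still sees the surrounding context.
pieces = sliding_window_chunks("some long transcript text " * 200)
```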

    [15:35 - 15:43] With overlap, the model has a better chance of retrieving coherent information. At this stage, the retrieval and generation pipeline has data sources and chunked documents ready.

    [15:44 - 15:52] However, computers cannot directly compare raw text for meaning. They need a numerical representation that captures semantic relationships.

    [15:53 - 15:55] This is where embeddings come in. Embeddings are vectors, sequences of numbers.

    [15:56 - 16:05] What you do is convert chunks into embeddings. The system is then able to compare the query and the documents based on semantic similarity rather than exact word matching.

    [16:06 - 16:10] Chunking makes documents manageable. Next, embeddings make them comparable.

    [16:11 - 16:18] Without embeddings, retrieval would be limited to keyword search, which often misses the deeper meaning. Embeddings have already been covered.

    [16:19 - 16:29] But to recap their role: embeddings convert words, sentences, and documents into sequences of numbers. These vectors exist in a semantic space where meaning is captured through proximity.

    [16:30 - 16:38] Terms with similar meaning are closer together and unrelated terms are further apart. Embeddings play two complementary roles in a retrieval augmented generation system.

    [16:39 - 16:45] First, they're used to store knowledge. Every document or chunk is converted into an embedding and saved into a database.

    [16:46 - 16:51] This makes the information reusable without recalculating the embedding each time. Then embeddings are used for searching.

    [16:52 - 16:58] When a user submits a query, the system generates its embedding. The query embedding is then compared with the stored embeddings in the database.

    [16:59 - 17:07] The results with the highest similarity are returned as the most relevant chunks. The storing and searching processes must be aligned for retrieval to work.

    [17:08 - 17:14] Without storage, embeddings are just numbers with no persistent use. Without search, stored embeddings cannot be connected to queries.
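
As a sketch of how storing and searching line up, the snippet below embeds chunks and a query with the same sentence-transformers model and ranks the chunks by cosine similarity. The model name is an illustrative assumption, not one mandated by the course.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

# The same model must embed both documents and queries, or the vector spaces won't align.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

chunks = [
    "Sam Altman is the CEO of OpenAI.",
    "The cafeteria closes at 6 pm on weekdays.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)              # stored once
query_vec = model.encode(["Who runs OpenAI?"], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vecs @ query_vec
best = int(np.argmax(scores))
print(chunks[best], float(scores[best]))
```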

    [17:15 - 17:26] Together, they create the core retrieval mechanism in retrieval augmented generation where meaning is matched instead of keywords. Consistency is one of the most overlooked yet critical principles in retrieval augmented generation.

    [17:27 - 17:35] When documents are chunked and stored, they are converted into embeddings using a specific model. When queries are submitted, those queries must also be embedded using the same model.

    [17:36 - 17:48] If different embedding models are used, the vector spaces will not align. For example, imagine storing all the documents in French but submitting the queries in Japanese: there would be no shared reference to compare meaning.

    [17:49 - 17:58] In practice, this results in retrieval failures where even highly relevant documents are never matched. Beginners often overlook this detail, assuming any embedding model can be swapped in freely.

    [17:59 - 18:14] Embeddings are highly model-specific: models create their own unique representation space, and only within that space can comparisons be meaningful. Embeddings are powerful because they capture meaning in a way machines can compare, but on their own, without a way to store and organize them, embeddings have no practical use.

    [18:15 - 18:20] They are nothing but a list of numbers. So the natural question after introducing embeddings is where do we put them?

    [18:21 - 18:24] And how do we find the right ones when needed? The answer is a vector database.

    [18:25 - 18:30] A database provides persistence. Embeddings can be stored and retrieved later rather than recalculated every time.

    [18:31 - 18:38] It also provides scalability. When dealing with thousands or millions of embeddings, searching one by one would be computationally infeasible.

    [18:39 - 18:45] Databases introduce indexing strategies that make these searches efficient. Databases also allow structure alongside embeddings.

    [18:46 - 18:57] You can store the original text and metadata, such as the source or timestamp, which is vital for filtering and context. A vector database provides a way to store, organize, and search embeddings at scale.

    [18:58 - 19:07] Vector databases solve this problem by indexing embeddings. Instead of scanning every embedding one by one, they use mathematical structures to find the vectors nearest to the query embedding.

    [19:08 - 19:14] When a question is asked, the database can return the most semantically relevant chunks in milliseconds. Each record holds essentially three parts.

    [19:15 - 19:24] The original text, the embedding, and the metadata. Metadata adds filtering power, making it possible to restrict retrieval to certain domains, time ranges, or types of documents.

    [19:25 - 19:29] Not all vector databases are the same. They vary depending on scale, performance and usability.

    [19:30 - 19:35] Some are optimized for small projects where simplicity matters most. They're often lightweight and open-source.

    [19:36 - 19:42] Others are built for enterprise use, capable of handling billions of vectors across distributed systems. The differences fall into roughly four categories.

    [19:43 - 19:47] Scalability determines whether the system can grow with the dataset. Speed also matters.

    [19:48 - 19:56] It measures how fast results are returned during queries. Cost matters because enterprise-grade systems often come with subscriptions or infrastructure expenses.

    [19:57 - 20:05] Small teams may prefer lightweight local tools, while large enterprises may want managed cloud services. So there are many vector databases, each with its own strengths.

    [20:06 - 20:16] FAISS, developed by Facebook AI Research, is an open-source tool. Chroma has become popular for its developer-friendly interface; it's good for quick prototypes and educational projects.

    [20:17 - 20:25] LanceDB is a newer option that focuses on performance. It has more recent research put into it, like multi-vector embeddings.

    [20:26 - 20:34] Pinecone represents the enterprise category: it's more expensive, but it removes much of the operational burden. It really depends on what your needs are.
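
Here is a minimal FAISS sketch. FAISS itself stores only the vectors, so the original text and metadata are kept in parallel Python lists keyed by row id; the dimension and the data are placeholders for illustration.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384  # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)  # exact inner-product search; fine for small corpora

# FAISS stores only vectors, so keep text and metadata in parallel lists keyed by row id.
texts = ["chunk about refund policy", "chunk about shipping times"]
metadata = [{"source": "policy.pdf", "page": 3}, {"source": "faq.md", "page": 1}]
vectors = np.random.rand(len(texts), dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(vectors)
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # top-2 nearest neighbours
for score, i in zip(scores[0], ids[0]):
    print(texts[i], metadata[i], float(score))
```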

    [20:35 - 20:46] The next stage of the pipeline is retrieval, which brings the stored knowledge back for the large language model to use. Once documents are chunked and stored in a vector database, the system needs a way to bring back the pieces most relevant to a new question.

    [20:47 - 20:51] The workflow is straightforward. The incoming query is converted into an embedding.

    [20:52 - 21:02] The embedding gets compared to everything in the database. And then the large language model works with the most accurate and contextually relevant content rather than only what it memorized.

    [21:03 - 21:06] Retrievers come in different forms. It's helpful to understand the main categories.

    [21:07 - 21:12] The simplest are term-based retrievers. These have been around in search engines for a long time.

    [21:13 - 21:19] This is TF-IDF and BM25. In practice, a lot of people use BM25.

    [21:20 - 21:38] They look for overlap between words in a query and words in a document. So this is an example of where you have lexical search, like BM25, combined with vector similarity, combined with relational metadata filtering, which can even be combined with a web search index like Google's.

    [21:39 - 21:47] So you can have four different search indexes combined and then mapped to generation. This is why we defined retrieval much more broadly.

    [21:48 - 21:56] In practice, people use metadata filtering plus BM25 plus vector similarity. This is why we defined it very broadly.

    [21:57 - 21:59] And it's a very important concept. So...

    [22:00 - 22:16] - What is BM25? Can you explain it again? - BM25 is... have you ever used Lucene, which is a keyword search? If you're doing keyword search, it uses different algorithms than relational databases, which filter and create indexes.

    [22:17 - 22:26] So BM25 is a lexical search method. I don't want to oversimplify it, but it works on the words themselves...

    [22:27 - 22:30] - Is it like fuzzy search? - What was that? - Is it like fuzzy search?

    [22:31 - 22:40] - Yeah, it's basically like fuzzy search. It's not exact keyword matching, but people use it for what is effectively keyword search.

    [22:41 - 22:48] You think you're using keyword search, but you're actually using BM25; BM25 is what's implemented underneath.

    [22:49 - 22:58] So a lot of the traditional document search interfaces that you use, like for your internal knowledge base and other things... some of them, sometimes a lot of them,

    [22:59 - 23:02] sometimes they use BM25. It's a classic algorithm.

    [23:03 - 23:22] You guys understand that using Lucene for keyword search is very different than using a relational database. But when we're doing retrieval for large language models, we're not only doing vector database similarity matching, you're also using lexical search like BM25, and you're also doing metadata filtering as well.

    [23:23 - 23:31] This is why we say you can't define retrieval augmented generation purely as vector similarity matching. So hybrid retrieval combines both strategies.

    [23:32 - 23:46] You can combine results that are keyword-precise with results that are semantically relevant. The important thing is not to understand all the details of TF-IDF or BM25, but to understand that different retrieval methods differ in how they find matches.
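
A rough sketch of hybrid scoring, assuming the rank_bm25 package and a sentence-transformers model; the 50/50 weighting is an arbitrary illustration, and in practice you would tune it or use something like reciprocal rank fusion.

```python
import numpy as np
from rank_bm25 import BM25Okapi                  # pip install rank-bm25 (assumed)
from sentence_transformers import SentenceTransformer

docs = [
    "Refunds are processed within 5 business days.",
    "Shipping to New York usually takes two days.",
]
query = "How long do refunds take?"

# Sparse / lexical side: BM25 over tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense / semantic side: cosine similarity of normalized embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]
dense = doc_vecs @ q_vec

# Normalize each score list to [0, 1] and blend; the weights are a tuning knob.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * minmax(sparse) + 0.5 * minmax(dense)
print(docs[int(np.argmax(hybrid))])
```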

    [23:47 - 24:01] By being able to combine these, you find relevant information through the retriever and then you pass it to the generation system. Embeddings and vector databases are the best teaching examples, but they're not the only way to implement retrieval.

    [24:02 - 24:07] Retrieval is a broader topic. It can be an API call or a SQL query.

    [24:08 - 24:15] And this is just a summary of what we typically talked about. The classic RAG approach has embeddings as the backbone.

    [24:16 - 24:27] The embedding-based retrieval path is where you rely on the embeddings for indexing and querying. By contrast, if you ask the system what the weather is, it goes to a weather API.

    [24:28 - 24:33] A customer support assistant queries a SQL database for past tickets. A news assistant uses a web search.

    [24:34 - 24:42] None of these require embeddings because the data is retrieved directly in its native form. It's important to mention that the contrast here should be clear.

    [24:43 - 24:48] The vector database method is embedding-dependent, and it's suited to large sets of unstructured documents.

    [24:49 - 24:59] The alternative is where the data is already structured or live. Combining embedding-based retrieval with direct retrieval from live or structured sources is still retrieval augmented generation.

    [25:00 - 25:05] So generally speaking, it's a pattern. And we want to make that very clear.

    [25:06 - 25:09] You get a question. You go into the search index and retrieve the relevant items.

    [25:10 - 25:14] But you can also combine it with different sources as well. Now, the generation stage.

    [25:15 - 25:21] So we primarily talked about retrieval. The generation stage begins when the retrieval system has delivered the relevant information.

    [25:22 - 25:32] The large language model resumes its natural role as the reasoning and writing engine. The model reads the retrieved chunks, drawing connections and filling in gaps with its reasoning capabilities.

    [25:33 - 25:42] The LLM can summarize, rephrase, or even cross-check information across the chunks. A crucial limitation is the context size, the maximum input that the model can process.

    [25:43 - 25:47] Retrieval augmented generation has two components. One is retrieval and the other is generation.

    [25:48 - 25:57] This may seem obvious, but when you're actually doing evaluation, you evaluate these two parts separately. This is why we literally delineate retrieval from generation.

    [25:58 - 26:18] Most people think, oh, when I test retrieval augmented generation, I just trigger the whole thing, but generation is actually more costly to test than the retrieval part. So when you test retrieval augmented generation, if you're writing a unit test or anything like that, you're often testing the retrieval part rather than fully triggering the generation part.

    [26:19 - 26:29] So it's very important to understand these as two separate components, both from a system standpoint and from an evaluation and testing standpoint. One of the strengths of RAG is that it's not linked to any single provider.

    [26:30 - 26:39] You have different options, closed source and open source. And this allows you to put in different models at different points in the pipeline as well.

    [26:40 - 26:45] A common format starts with an instruction: answer the question using the following data. Oh, go ahead, you have a question?

    [26:46 - 26:50] Oh, no? Okay. So, a common format first instructs the model to answer the question using the following data.

    [26:51 - 26:59] Next, the retrieved passages are inserted, often with markers or separators so that the model can recognize them as external context. Finally, the original user query is included.

    [27:00 - 27:10] The structure makes it explicit to the model that the retrieved chunks are relevant and should guide the answer. Otherwise, the model might ignore the retrieved data and rely on its internal knowledge.
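
A minimal sketch of the prompt layout described above: instruction first, retrieved passages inside explicit markers, then the user question. The delimiter style here is just one reasonable choice.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble instruction + clearly delimited context + the original question."""
    context = "\n\n".join(
        f"<chunk id={i}>\n{chunk}\n</chunk>" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is not sufficient, say you don't know.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )

print(build_rag_prompt("What's the weather in New York today?",
                       ["NYC observation: 72F, sunny, humidity 40%, 10:00 UTC"]))
```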

    [27:11 - 27:18] Prompt construction is both an art and a science. What we went over on Tuesday was a prompt for meal planning.

    [27:19 - 27:30] I think Kevin asked a question. And what it shows is that sometimes you want to very explicitly structure anything that's external, either in XML or some other format.

    [27:31 - 27:39] Putting clear delimiters around everything reduces confusion; prompt engineering is a critical part of the RAG pipeline. Take "What's the weather in New York today?"

    [27:40 - 27:47] Retrieval fetches the live weather data from the API. The raw data might be technical, such as temperature readings, humidity levels, and timestamps.

    [27:48 - 27:57] Then the prompt is constructed to include that data as context, plus an instruction for the LLM to provide a user-friendly answer. The result is a coherent, natural-sounding response.

    [27:58 - 28:04] "It's 72 degrees and sunny in New York today." This illustrates the value of generation, the LLM as the reasoning layer.

    [28:05 - 28:13] Without the generation step, the raw data may not be easy for end users to interpret. Another key point is that the LLM's raw power isn't enough.

    [28:14 - 28:28] Even the most advanced model cannot compensate for weak retrieval; the model can only integrate what it is given. That makes retrieval design, chunking strategy, embedding models, database and indexing, and re-ranking just as important as the choice of model.

    [28:29 - 28:34] In the naive setup, the system retrieves some passages, drops them into a prompt, and asks the LLM to generate an answer. It's very shallow processing.

    [28:35 - 28:45] It doesn't have the combination of different indexes, BM25, a vector database, metadata filtering, and it doesn't have the post-processing.

    [28:46 - 28:56] Post-processing is where you take the retrieval results, filter them, re-rank them, and then do progressive refinement on them. The traditional naive workflow is limited.

    [28:57 - 29:18] Critics assume that this is what retrieval augmented generation looks like, but in reality, retrieval augmented generation is a multi-step pipeline that looks like a recommendation system. Part of the reason we keep repeating different concepts over and over is that we wanted to make it less confusing when we actually introduce this slide.

    [29:19 - 29:28] This is probably the most important slide. What people really think of as retrieval augmented generation is vector database retrieval, then prompting, then generation.

    [29:29 - 29:50] What it actually is: you can fetch API data, you can do BM25, you can do metadata filtering, you can get things from a lot of different places, and then you can filter, score, and reorder the results. So the reality of modern RAG is very different from a simple retrieve-and-generate loop.

    [29:51 - 29:58] It starts with the retrieval index, the stored documents or external sources. And importantly, this is not limited to a single database.

    [29:59 - 30:02] A query may reach different retrieval indexes. It can even be multimodal.

    [30:03 - 30:14] It can be text, tables, images. Instead of sending all results to the model, the system applies filtering to remove irrelevant or noisy content, usually using top-k selection.

    [30:15 - 30:31] Next, scoring assigns numerical weights to measure similarity or relevance, often paired with re-rankers that evaluate coherence and contextual fit. Ordering then prioritizes the best matches, ensuring that the most useful passages are presented first.

    [30:32 - 30:44] The curated content is passed with the prompt for the LLM to generate the answer. This process is much closer to a recommendation system, where movies, songs, and products are filtered, scored, and re-ranked before being shown to a user.

    [30:45 - 30:54] Retrieval augmented generation carefully selects and organizes knowledge before handing it off to the LLM. The quality of the generation is directly linked to the quality of the retrieval.

    [30:55 - 31:05] A common mistake is that engineers over-focus on generation, assuming that tweaking prompts or switching models will fix weak results. The real gains come from improving retrieval.

    [31:06 - 31:10] The stronger and more precise the retrieval pipeline is, the more grounded and reliable the LLM outputs will be.

    [31:11 - 31:21] I think some people skipped the retrieval augmented generation lecture today because they thought they knew what it was. But this is what production-level retrieval augmented generation looks like.

    [31:22 - 31:26] Breaking down the pipeline shows retrieval is more than one step. Filtering narrows down the candidates.

    [31:27 - 31:47] Then you select the top-k most relevant chunks. Scoring adds another layer, ranking results with metrics or re-rankers that account for context and coherence. Ordering rearranges the results so the highest-quality items come first and have the most impact on the large language model. Then augmentation takes these selected results and formats them into the model's input.
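
The multi-step pipeline can be sketched as a handful of small functions. Everything here (the retriever callables, the score threshold, the rerank hook) is a placeholder standing in for whatever real indexes and re-rankers you wire in.

```python
from typing import Callable

def multi_index_retrieve(query: str, retrievers: dict[str, Callable[[str], list[dict]]]) -> list[dict]:
    """Fan the query out to several indexes (vector DB, BM25, SQL, API, ...)."""
    candidates = []
    for name, retrieve in retrievers.items():
        for hit in retrieve(query):              # each hit: {"text": ..., "score": ...}
            hit["source_index"] = name
            candidates.append(hit)
    return candidates

def rag_pipeline(query: str,
                 retrievers: dict[str, Callable[[str], list[dict]]],
                 rerank: Callable[[str, list[dict]], list[dict]],
                 top_k: int = 5) -> str:
    candidates = multi_index_retrieve(query, retrievers)               # 1. retrieve broadly
    candidates = [c for c in candidates if c["score"] > 0.2]           # 2. filter obvious noise
    candidates = sorted(candidates, key=lambda c: c["score"], reverse=True)[: top_k * 3]  # 3. score + cut
    candidates = rerank(query, candidates)[:top_k]                     # 4. rerank + order
    context = "\n".join(c["text"] for c in candidates)                 # 5. augment the prompt
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"
```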

    [31:48 - 31:58] So building a strong retrieval augmented generation system is less about tweaking generation and more about engineering retrieval. Generation quality improves as retrieval quality improves.

    [31:59 - 32:10] That's why advanced RAG systems are best understood as multi-index, multi-step recommendation engines wrapped around a language model. This pipeline shows that retrieval augmented generation is more than retrieve and generate.

    [32:11 - 32:26] Rather than a single vector database, the query is sent to multiple indices at once: text documents, SQL tables, APIs, multimodal sources like images, with re-ranking and post-processing as refinement steps behind the initial retrieval.

    [32:27 - 32:43] A re-ranker might re-score the passages using a smaller large language model or a coherence model, ensuring that the best results rise to the top. Post-processing can filter duplicates, enforce domain rules, or adapt the content for the context.

    [32:44 - 32:55] Agentic RAG goes further by giving the large language model an active role in retrieval. The model may plan which indices to query and decide when to expand or reformulate a query.

    [32:56 - 33:02] So agents are not just using tools. Agents can help define or plan the execution.

    [33:03 - 33:21] Agentic RAG is where the agent can plan which indices to query, decide when to expand or reformulate a query, and orchestrate the retrieval steps in a certain sequence. The agentic approach is particularly useful for complex or multi-hop queries.

    [33:22 - 33:36] With the more advanced multi-hop agents, you're using the agent to make decisions at any given point, and it can make decisions for you on the retrieval side as well. Then multimodal RAG goes beyond text alone.

    [33:37 - 33:45] Retrieval can now span images, audio, and video, which are then aligned with text for richer cross-modal reasoning. This opens new possibilities.

    [33:46 - 33:50] For example, healthcare imaging, legal document search with tables, and multimodal assistants. Julia?

    [33:51 - 34:06] - Yeah, when you say agentic RAG, are you saying another LLM makes the decision to plan the retrieval and perform the retrieval, and then it's sent to an LLM again for generating the response? - Yeah, yeah, basically, that's right.

    [34:07 - 34:17] We're in the middle of uploading a lot of the class videos as resources. Once we upload the agent lecture, you guys should take a look at agents.

    [34:18 - 34:21] But at a very high level, agents can make decisions.

    [34:22 - 34:31] So an agent can plan, then recurse and make decisions, and then execute tools or other actions. And one tool it can execute is the retriever.

    [34:32 - 34:44] Traditional for loops are very dumb, right? You have i equals zero and you go to N. Whereas agent loops make decisions at every decision point.

    [34:45 - 34:54] It says, hey, maybe I query the weather API. Maybe I go to the SQL database and query the relational data.

    [34:55 - 35:04] Maybe I use BM25. It combines these sources, both planning and querying. And then once it gets the information, it executes against that.

    [35:05 - 35:09] And multimodal RAG is now state of the art. It expands beyond text alone.

    [35:10 - 35:17] Retrieval can now span images, audio, or video. And this opens up possibilities in healthcare imaging and legal document search.

    [35:18 - 35:31] So retrieval augmented generation, instead of being dead, is actively growing into capable, flexible, and intelligent systems. If you follow AI content like I do, you'll often see people on YouTube saying, RAG is dead.

    [35:32 - 35:38] And then they redefine RAG in terms of RAG, or they redefine it as, oh, you have to use BM25.

    [35:39 - 35:50] So whenever you see "RAG is dead," people are doing these clickbaity kinds of things. Whereas RAG is really a broader design pattern, and it actually looks something like this.

    [35:51 - 35:59] It doesn't look like the naive version at the top. Evaluation is what transforms RAG from an interesting prototype into a reliable production system.

    [36:00 - 36:18] Without evaluation, it's difficult to know whether the pipeline is truly helping the large language model or simply adding noise; a poorly evaluated system risks producing hallucinations, irrelevant answers, or outdated information. So first, evaluation reduces hallucination by ensuring the model's output is grounded in retrieved evidence.

    [36:19 - 36:26] Second, it builds trust. Users are far more likely to rely on an assistant if its responses are consistent and verified.

    [36:27 - 36:32] And it tests the system under real-world conditions. It's one thing to design a pipeline that works in theory.

    [36:33 - 36:41] It's another to confirm that it works on real queries and live data. Evaluation provides a feedback loop: experiment with fixes, measure those changes, and improve the results.

    [36:42 - 36:53] Otherwise, development is guesswork. Evaluation of RAG starts with retrieval; before worrying about generation, you need to know whether the system is even pulling back the right information.

    [36:54 - 37:01] If retrieval fails, generation will fail too. There are many different retrieval metrics, but the two that stand out most are recall and precision.

    [37:02 - 37:06] Recall measures coverage. Imagine asking what the weather is in New York today.

    [37:07 - 37:18] If the database contains the correct weather API result, recall checks whether that weather record appeared among the top three retrieved entries. Precision, on the other hand, measures quality.

    [37:19 - 37:31] If three chunks are retrieved, but only one is the actual weather record and the other two are unrelated documents, precision is low. We introduce only these two metrics here because they're widely used and widely recognized.

    [37:32 - 37:42] They provide a solid foundation for evaluating retrieval. More advanced methods will come later in the course, but recall and precision are enough to start reasoning about whether retrieval is effective.

    [37:43 - 37:52] Retrieval is not just about whether the correct information is present, but also where it appears. Large language models and users rely most heavily on the top-ranked results.

    [37:53 - 38:00] If the correct chunk is buried in 10th place, it may not be used even though recall technically says it was retrieved. This is where ranking metrics come in.

    [38:01 - 38:14] The most widely used is mean reciprocal rank, or MRR. MRR looks at the position of the first relevant result. For example, if the query is what's the weather in New York today and the correct forecast appears first,

    [38:15 - 38:20] MRR is 1.0. If it appears in fifth place, MRR goes to 0.2.

    [38:21 - 38:31] This metric is particularly important when re-rankers or hybrid retrievers are added to a pipeline. Their purpose is to push the relevant results higher and the MRR tells us whether this is happening or not.

    [38:32 - 38:43] Together, recall, precision, and MRR cover the basics. Recall ensures nothing important is missed, precision ensures the results are relevant, and MRR ensures that the best ones are surfaced first.
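
A small sketch of recall@k, precision@k, and MRR computed over retrieved chunk IDs; it assumes you already know which chunk IDs are relevant for each test query, for example from synthetic question generation.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that show up in the top-k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c7", "c2", "c9"]        # chunk IDs returned by the retriever
relevant = {"c2"}                     # ground-truth chunk for this query
print(recall_at_k(retrieved, relevant, 3),    # 1.0  -> the right chunk was in the top 3
      precision_at_k(retrieved, relevant, 3), # 0.33 -> only 1 of 3 results was relevant
      mrr(retrieved, relevant))               # 0.5  -> first relevant hit at rank 2
```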

    [38:44 - 38:50] Generation quality is different. At this stage, it's important to focus on three widely adopted metrics.

    [38:51 - 38:53] Number one is faithfulness. Number two is relevance.

    [38:54 - 38:58] And number three is groundedness. Faithfulness ensures that the answer does not contradict the retrieved content.

    [38:59 - 39:06] If the retrieved passage says it is 72 degrees and sunny in New York, but the model answers that it is rainy, the response is unfaithful.

    [39:07 - 39:16] Relevance checks whether the generated answer directly addresses the query. If the question was about today's weather in New York, but the answer drifts into last week's forecast, relevance is poor.

    [39:17 - 39:23] Groundedness requires that the answers explicitly tie back to the retrieved data. This is what distinguishes RAG from plain LLM generation.

    [39:24 - 39:37] These three metrics are chosen because they're easy to understand and align with the core purpose of RAG, which is producing reliable, evidence-grounded answers. Evaluation is most effective when treated as a continuous loop rather than a one-off task.

    [39:38 - 39:49] The process starts with analysis: studying where the system fails, whether due to missed retrievals, irrelevant answers, or hallucination. Then you have measurement: recall, precision, MRR, faithfulness, and groundedness.

    [39:50 - 39:57] Based on these measurements, engineers then target improvements. For example, they may refine chunking, use a stronger embedding model, or add a re-ranker.

    [39:58 - 40:08] These changes are tested again with the same metrics to confirm that they have the intended effect. This loop of analyze, measure, improve keeps the system on a path of steady improvement.

    [40:09 - 40:21] It mirrors the iterative nature of software development and ensures that RAG pipelines keep adapting to the data and use cases. So the biggest takeaway is that retrieval augmented generation is very much alive.

    [40:22 - 40:27] Retrieval is very flexible. It's not limited to vector databases; it can involve APIs, SQL queries, or web searches.

    [40:28 - 40:35] The quality of the generation depends on the quality of the retrieval. An LLM can generate reliable answers only if the context is strong.

    [40:36 - 40:51] Evaluation is the backbone of RAG; it's a cycle of analyze, measure, improve that ensures the system remains reliable as data and use cases evolve. So before we go and combine everything, did you guys have any questions?

    [40:52 - 40:57] - Hi, so actually, talking about recall and precision, right? Say the RAG is already in production, right?

    [40:58 - 41:09] Or if you're productionizing a RAG system, how frequently do you recommend running these? I know for apps we have metrics, right? We continuously collect the metrics on every call, but with a RAG system, how frequently do you think we need to run them?

    [41:10 - 41:19] Or is it something like continuous metrics, like we collect for the apps? - Yeah, for retriever metrics, you typically run precision and recall and MRR like a unit test.

    [41:20 - 41:29] So you can have some unit tests for whether it's roughly working or not. You're effectively checking for regressions.

    [41:30 - 41:52] And then, just like how you do traditional testing in production systems, you make sure you have a set of lightweight tests that are less costly and that ensure you don't regress. And then you have higher-level metrics that involve generation, which is more of a system-wide test.

    [41:53 - 42:19] That's where you're doing more systematic testing. In advanced RAG, you have the LLM make decisions: you effectively say, hey, for this query, instead of you executing the weather API yourself, you have the LLM decide whether to execute the API or not. And then, for example, we talked about the system prompt, right?

    [42:20 - 42:32] The role: you are an advanced query planner. You are able to optimize the retrieval, which reduces hallucination.

    [42:33 - 42:58] You have these tools: you have the weather API, you have the weather SQL database. You give it the full set of tools, tell it to go after these, choose the right one, and then keep going until it's convinced the result is right. So you can have it go and retrieve something, then do LLM-as-a-judge on itself to determine the quality, and then keep going.

    [42:59 - 43:08] And then once it determines the quality is good enough, it returns the results for filtering, scoring, and reranking. - Looks like it's still a long process.

    [43:09 - 43:44] What if it's a voice conversational type of system, where the user queries and you expect the answer to come back pretty quickly? - Yeah, so voice systems are latency-sensitive; they tend to use techniques that are not so slow. I forget the exact number of milliseconds, but there's a certain number of milliseconds that users expect when answering different things, so latency is an important constraint. Unfortunately, multi-agent systems have more latency, which is why we introduce you to different techniques.

    [43:45 - 43:54] So depending on the overall context, the latency requirements, and everything else, you use different things from the toolbox. Now let's talk about the exercise.

    [43:55 - 44:05] We want to introduce you to combining retrieval augmented generation with document processing, where you can upload a PDF. The very first thing people do with retrieval augmented generation

    [44:06 - 44:14] is always something with a PDF. So you take a university enrollment form, upload it, and then start processing it.

    [44:15 - 44:45] The first thing you want to do is process it using pdf2image as well as the Python Imaging Library, take a look at it, and you can also try some of your own PDFs if you want. Then, a lot of times APIs don't accept raw image files directly and expect images as a base64 string, so you convert the Python image into base64 and preview the encoded string before you send it to the AI.
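
A sketch of that preprocessing step, assuming pdf2image and Pillow are installed (pdf2image also needs poppler); the file name is a placeholder.

```python
import base64
import io
from pdf2image import convert_from_path  # pip install pdf2image pillow (assumed)

pages = convert_from_path("enrollment_form.pdf", dpi=200)  # list of PIL images, one per page

def image_to_base64(img) -> str:
    """Serialize a PIL image to the base64 string most vision APIs expect."""
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

encoded_pages = [image_to_base64(p) for p in pages]
print(encoded_pages[0][:80], "...")  # preview the encoded string before sending it to the API
```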

    [44:46 - 44:57] Then define your data schema. Now that you've converted your document into images, you want to understand Pydantic; we used Pydantic in the brainstorming section.

    [44:58 - 45:27] You want to understand the model and its different fields, like student name and student ID with data types such as string, so that the LLM's extracted data is something you can understand, debug, evaluate, and trust. Then we want to add field-level citations, so we understand where each field value came from, make it easy to debug or flag hallucinations, and lay the foundation for evaluation and feedback.

    [45:28 - 46:11] So you create a cited-field type, replace all fields in the schema with it, and then create an enrollment form with citations from page one. Then you can extract form data with Gemini: you send a base64 image, have the model extract structured form data, and receive the output in the schema that you defined. Then you evaluate completeness and citations, which is about trust: accept an enrollment object, check whether the top-level fields are present or missing, check that all citation IDs are unique, and compute completeness for the case. Then you loop through multi-page PDFs, sending each page to a vision model separately, add the page number and a unique ID per field, and merge the results into one complete structure.
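
One way the cited-field schema might look, sketched with Pydantic; the field names follow the enrollment-form example, and the exact structure in the course notebook may differ.

```python
from pydantic import BaseModel, Field

class CitedField(BaseModel):
    """A value plus a pointer to where on the document it came from."""
    value: str
    citation_id: str = Field(description="Unique ID for this citation")
    page: int = Field(description="Page number the value was read from")
    source_text: str = Field(description="Verbatim snippet supporting the value")

class EnrollmentForm(BaseModel):
    student_name: CitedField
    student_id: CitedField
    program: CitedField

form = EnrollmentForm(
    student_name=CitedField(value="Julia Ng", citation_id="c1", page=1, source_text="Name: Julia Ng"),
    student_id=CitedField(value="A1234", citation_id="c2", page=1, source_text="ID: A1234"),
    program=CitedField(value="MSc AI", citation_id="c3", page=1, source_text="Program: MSc AI"),
)

# Completeness / trust check: every top-level field present, citation IDs unique.
ids = [f.citation_id for f in (form.student_name, form.student_id, form.program)]
assert len(ids) == len(set(ids)), "citation IDs must be unique"
```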

    [46:12 - 46:45] Then you loop through and build the trust log, which is a field-by-field summary that flags every output as trusted, incomplete, or missing. In real-world AI systems, you want answers that you can trust, with a clear signal of whether something is incomplete, hallucinated, or missing altogether. Then there's visual grounding and anchoring: being able to use OCR to read text directly from the images and detect where a piece of text sits on the page, returning a list of word locations with confidence scores.

    [46:46 - 47:03] Then you combine Gemini with OCR for visual traceability: answer a question, get a structured JSON response, loop through your OCR output, and draw a bounding box over the citations. The library handles processing the PDF and making sure the PDF works end to end.

    [47:04 - 47:30] The next part is retrieval augmented generation proper, combining the pieces you now understand: the retriever, top-k selection, and the final answer. You build a toy retriever with FAISS, find the most relevant chunks, feed those chunks into the AI model, and then you measure retrieval quality with recall, precision, and MRR. Are you retrieving good enough chunks?

    [47:31 - 47:33] Are you retrieving only good chunks? Are you getting them early in the ranking?

    [47:34 - 47:42] Then you will understand generation with and without retrieval context. You'll see the model's answer without any help,

    [47:43 - 47:51] how it improves with relevant information, and how hallucinations decrease when RAG is applied. Then you parse the PDF with multiple techniques.

    [47:52 - 48:03] You compare PDF parsers such as pypdf and PyMuPDF: which one is better for which type of document, and how layout and chunking are affected. Then you chunk the same PDF with multiple strategies: paragraph-based, sentence-based, fixed-window-based.

    [48:04 - 48:14] Then you embed the chunks with different types of embedding models, for example small models like GTE-small, and understand how the vector shapes and similarity scores compare.

    [48:15 - 48:25] Then you look at chunking and overlap, the sliding window, and specifically the fixed-size sliding window. Then you generate synthetic questions for each chunk.

    [48:26 - 48:32] So you prompt the LLM to generate three to five questions per chunk, use different question styles, and tag the questions with their chunk IDs.
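
A sketch of synthetic question generation using the OpenAI Python client as a stand-in LLM; the model name and prompt wording are assumptions, and the course exercise may use Gemini or another provider instead.

```python
from openai import OpenAI  # any chat-completion client would work; this one is an assumption

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def questions_for_chunk(chunk_id: str, chunk_text: str, n: int = 3) -> list[dict]:
    """Ask the model for n questions answerable from this chunk, tagged with its ID."""
    prompt = (
        f"Write {n} diverse questions (factual, paraphrased, keyword-style) that can be "
        f"answered using ONLY the passage below. One question per line.\n\n{chunk_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [q.strip("- ").strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
    return [{"chunk_id": chunk_id, "question": q} for q in lines[:n]]
```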

    [48:33 - 48:37] Then you retrieve the chunks with the synthetic questions and compute retrieval metrics.

    [48:38 - 48:41] Recall and MRR. Then you compare retrieval algorithms.

    [48:42 - 48:45] Dense, sparse, and hybrid. Sparse is BM25.

    [48:46 - 48:53] Dense is FAISS, and hybrid is BM25 plus FAISS. Then you analyze and debug failures.

    [48:54 - 49:02] Looking at chunking issues, embedding weaknesses, and distractor matches. Then you evaluate on larger synthetic datasets.

    [49:03 - 49:15] And then you want to apply a re-ranker. A re-ranker will allow you to understand which chunks should rise to the top, and then you can try your own re-ranker as well.

    [49:16 - 49:21] Then you evaluate generated answers for faithfulness and relevance. These are the generation evaluation metrics.

    [49:22 - 49:27] And then you build a RAG pipeline from raw text to answer: chunking with overlap, embedding chunks for search,

    [49:28 - 49:33] dense retrieval with FAISS, re-ranking, and LLM-based evaluation with recall and the other metrics.