Synthetic Data

- Intro to Synthetic Data and Why It Matters in Modern AI
- What Synthetic Data Really Is vs Common Misconceptions
- How Synthetic Data Fills Gaps When Real Data Is Limited or Unsafe
- The Synthetic Data Flywheel: Generate → Evaluate → Iterate
- Using Synthetic Data Across Pretraining, Finetuning, and Evaluation
- Synthetic Data for RAG: How It Stress-Tests Retrieval Systems
- Fine-Tuning with Synthetic Examples to Update Model Behavior
- When to Use RAG vs Fine-Tuning for Changing Information
- Building RAG Systems Like Lego: LLM + Vector DB + Retrieval
- How Vector Databases Reduce Hallucinations and Improve Accuracy
- Generating Edge Cases, Adversarial Queries, and Hard Negatives
- Control Knobs for Diversity: Intent, Persona, Difficulty, Style
- Guardrails and Bias Control Using Prompt Engineering and DPO
- Privacy Engineering with Synthetic Data for Safe Testing
- Debugging AI Apps Using Synthetic Data Like a Developer Debugs Code
- LLM-as-Judge for Fast, Cheap, Scalable Data Quality Checks
- Axial Coding: Turning Model Failures Into Actionable Error Clusters
- Evaluation-First Loops: The Only Way to Improve Synthetic Data Quality
- Components of High-Quality Prompts for Synthetic Data Generation
- User Query Generators for Realistic Customer Support Scenarios
- Chatbot Response Generators for Complete and Partial Solutions
- Error Analysis to Catch Hallucinations, Bias, and Structure Failures
- Human + LLM Evaluation: Combining Experts With Automated Judges
- Model Cards and Benchmarks for Understanding Model Capabilities

  • [00:00 - 00:11] We are going to cover everything related to evaluation, synthetic data, and how to debug AI applications. Let's start with a common misconception people have about synthetic data.

    [00:12 - 00:24] Whatever was true back in the days when I was doing my PhD no longer holds: now the quality depends entirely on the prompts you are using and how you are creating the synthetic data.

    [00:25 - 00:43] Right now, much of the internet is already filled with AI-generated data, and we often don't even know which data that is. For example, one of my own AI side products is an SEO bot I'm currently working on; it generates SEO-optimized articles for any given website.

    [00:44 - 01:06] There are other products that do the same, and people read these articles without any idea of what synthetic data is or how it differs from real data. This session on synthetic data is going to be useful for RAG and fine-tuning, as mentioned, but it is also useful after all of those techniques.

    [01:07 - 01:29] My goal for this session is to cover what synthetic data is and why it matters, leading and lagging metrics, public and custom benchmarks, and then the RAG flywheel, the end-to-end workflow of how RAG works. People really do have misconceptions about what RAG is, and we are going to clear those up in the RAG lecture.

    [01:30 - 01:48] Then we will cover in depth how synthetic data is generated and how it can be made precise enough to feel real to the end user. For any application, start by asking: what hypothesis am I testing about my model, and what am I missing?

    [01:49 - 02:01] If the signal is noisy, sparse, or incomplete, that's where synthetic data comes in. The juice is worth the squeeze when collecting real examples is expensive, unsafe, or impossible, like fraud or medical edge cases.

    [02:02 - 02:24] Synthetic data lets you generate those reps at lower cost. Once that flywheel is in place, progress is about doing the obvious thing repeatedly: generating, training, and evaluating. It's just like lifting weights to build muscle, where tracking your calories and workouts is the equivalent of curating your real data set.

    [02:25 - 02:38] But sometimes your diet or your environment doesn't give you enough reps. Synthetic data is like protein powder or resistance bands: it fills that gap, accelerates growth, and keeps the progress consistent.

    [02:39 - 02:52] Synthetic data is not a way to skip effort, and it's not a shortcut. It's a structured supplement that makes the consistent effort you already put in, whether training, evaluation, or tuning, compound faster.

    [02:53 - 03:13] In real projects, production data is often limited: privacy rules, compliance, or simply being early means you don't have enough examples to test with, and synthetic data solves this gap. Imagine trying to stress-test a brand-new website with no users, no clickstream, and no bug reports; you are stuck unless you generate some fake traffic.

    [03:14 - 03:30] Synthetic data is similar to generating that fake traffic. It lets you test, debug, and improve. Think of synthetic data as training wheels: it's not meant to replace real-world data forever, but it helps you move safely without needing a lot of real data.

    [03:31 - 03:46] Just like developers scaffold an app with mock or seed users in a staging environment, AI engineers use synthetic data to run experiments. It's safer than testing on live customers or real end users, and it accelerates iteration speed.

    [03:47 - 04:02] So we've established that synthetic data helps fill the gap in scenarios where real-world data is limited. But the natural question is: how do we actually put this synthetic data to work?

    [04:03 - 04:15] In today's ecosystem, the most powerful levers are RAG and fine-tuning. Think of them as two engineering tools: one plugs in external knowledge, and the other adapts the model itself.

    [04:16 - 04:28] This is the best example I like to give about fine-tuning. Say you are running an older Llama model (you can also test this in the notebook) and you ask it who the president of the United States is.

    [04:29 - 04:50] It will answer from the training data it had during the training stage. Now you fine-tune that existing model with newer data, or with synthetic data: you create multiple variations of the fact, like "the president of the United States in 2025 is Donald Trump," and then you paraphrase it many times.

    [04:51 - 05:02] That gives you a large text corpus of, say, more than 1,000 examples, and you train the Llama model on this data. When you test the Llama model again, it's going to give a different answer.
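
As a rough illustration of this knowledge-update idea (not the actual course code), here is a minimal sketch of expanding one new fact into many paraphrased training examples before fine-tuning. The JSONL field names (`instruction`, `response`) and the hard-coded paraphrase lists are assumptions; in practice you would usually ask an LLM to produce the paraphrases and match the format your fine-tuning framework expects.

```python
import json

# The new fact we want the model to pick up (the example used in the lecture).
question_variants = [
    "Who is the president of the United States?",
    "Who currently holds the office of US president?",
    "Can you tell me who the US president is in 2025?",
]
answer_variants = [
    "As of 2025, the president of the United States is Donald Trump.",
    "Donald Trump is the US president in 2025.",
    "In 2025, Donald Trump holds the office of president of the United States.",
]

# Build an instruction-tuning dataset (one JSON object per line) by pairing
# every question phrasing with every answer phrasing.
with open("president_finetune.jsonl", "w") as f:
    for q in question_variants:
        for a in answer_variants:
            f.write(json.dumps({"instruction": q, "response": a}) + "\n")

print("Wrote", len(question_variants) * len(answer_variants), "training examples")
```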

    [05:03 - 05:27] Similarly for RAG: people have misconceptions about it and, as I mentioned earlier, I will cover in depth what RAG is, but it is basically the combination of two things, search and generation. Instead of relying on what the LLM memorized during training, we ingest all of the data into a vector DB and let the system fetch the relevant documents whenever the LLM is asked a question.

    [05:28 - 05:46] What happens on the back end is that the model still generates the answer, but it generates it from the grounded documents retrieved from the vector DB. We will uncover more about vector DBs later, but for now think of a vector DB as a database. Why is it important? It reduces hallucination and makes your system much easier to update.

    [05:47 - 06:01] Instead of retraining the model with new knowledge, you just update the documents in your database, attach the retriever to the LLM, and it will find the appropriate answer from those documents. Fine-tuning, by contrast, is like an employee who has learned the information themselves.

    [06:02 - 06:18] So think of RAG as a complete architecture built from two Lego pieces: the LLM is one Lego piece and the database is the other, and we snap the database onto the LLM and force the LLM to retrieve its documents from that data.

    [06:19 - 06:34] As long as the data is in the database, it will retrieve the relevant documents and answer based on them. Now, if I'm understanding it correctly, what you asked was whether the LLM remembers the data from the database, right?

    [06:35 - 06:37] No, that's fine-tuning. What you mentioned is basically fine-tuning.

    [06:38 - 06:57] For the LLM to remember something, that's fine-tuning: you have the data and you keep training the LLM on it. That's why I keep using the president example: the Llama model was trained on old data, so it only knows the old data.

    [06:58 - 07:13] If you ask what the price of Apple stock is, it will answer with the price from the old data; when you train it with newer data, it will answer from the newer data. But in this specific application, the stock price goes up and down constantly; it is not static.

    [07:14 - 07:18] So for this specific application we need RAG. The reason is that the data is continuously changing.
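
To make the retrieve-then-generate idea concrete, here is a minimal sketch, with a toy document list, a crude bag-of-words similarity standing in for real embeddings and a vector DB, and the LLM call left out entirely (the grounded prompt is just printed). Everything here is an illustrative assumption, not the course's RAG code.

```python
from collections import Counter
import math

# Toy "vector DB": in a real system these would be embedded chunks in a vector database.
documents = [
    "Apple stock closed at 230 dollars today after the earnings call.",
    "The company released a new phone model this quarter.",
    "Refunds are processed within 5 business days of the request.",
]

def bow(text):
    # Crude bag-of-words "embedding", used only for illustration.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = bow(query)
    return sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_grounded_prompt(query):
    # The LLM is told to answer only from the retrieved context, not from memory.
    context = "\n".join(retrieve(query))
    return f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_grounded_prompt("What is the current price of Apple stock?"))
```

Updating the system then means updating `documents`, not retraining the model, which is the point being made above.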

    [07:19 - 07:37] Synthetic data is powerful, but let's be clear: it's not automatically perfect. As Kevin mentioned, it can create misleading evaluations or biased data. The solution is to build an evaluation-centered loop where you start by generating data, then run automated checks with the help of another LLM, which is called LLM-as-judge.

    [07:38 - 07:59] Next, you validate a subset of the data set with human reviewers, or just yourself, and then you compare both sets of results: the evaluation from the human reviewers against the evaluation from the LLM-as-judge. You find the gaps, adjust the control knobs, refine the prompts, and repeat the cycle until you have synthetic data that is close to real.

    [08:00 - 08:17] Think of it as debugging code: you first write code that, from your perspective, looks close to correct. Then you find the bugs, patch the application, rerun it, test it again, and find more bugs.

    [08:18 - 08:24] You repeat until you get the proper result. So is LLM-as-judge always accurate for evaluating synthetic data?

    [08:25 - 08:34] No, it is not perfect. LLM-as-judge is a cheap, fast, and scalable way to check that the synthetic data is not biased.

    [08:35 - 08:49] But when we use another LLM, that LLM can be biased too. To keep that judge honest, the key is to close the gap using axial coding, comparing where the LLM judge deviates from the human evaluation.

    [08:50 - 09:06] Those differences tell us how to refine the prompt of the LLM-as-judge and how to refine the prompt for the synthetic data, so that the filtering rules become as precise and as close to human judgment as possible. Over time you align the synthetic data quality closer to real-world standards.

    [09:07 - 09:16] Now, when I mentioned axial coding, a question came in: for LLM-as-judge, are you using an LLM model that specializes in doing those types of tasks?

    [09:17 - 09:28] Or are you just using the model that you're training to evaluate the data? In other words, with LLM-as-judge, would it be a different model, or a previous version of the same model, something like that?

    [09:29 - 09:32] It solely depends on you. You can choose another model or you can use the same model.

    [09:33 - 09:40] What I usually do is create the synthetic data and then use the same model for LLM-as-judge, in a separate Python file.

    [09:41 - 09:46] That LLM-as-judge gives scores based on my prompts and returns the results to me.

    [09:47 - 09:57] I don't look at the judge's results first, because if I did, I would be biased toward the LLM. What I usually do is go to the synthetic data and evaluate it myself, say 20 examples of the data set.

    [09:58 - 10:10] Then I find the gap between the LLM-as-judge results and my own labels on the synthetic data, and I improve the prompt of the judge and also the prompt of the synthetic data generator in order to close that gap.

    [10:11 - 10:21] It's not about making it perfect, it's about getting it near to perfect. So the answer to your question is: you can use another model as the LLM-as-judge.

    [10:22 - 10:24] Or you can use the same model as the LLM-as-judge. It does not matter.

    [10:25 - 10:44] And when you are generating synthetic data, there has to be a structure to how you generate it; it involves a fair amount of prompt engineering, as we will see. We will go over structured output when we get to the section on moving from theory to application.

    [10:45 - 10:46] Thank you. So what is axial coding?

    [10:47 - 10:55] When I mentioned axial coding on the previous slide, what did I mean? When you start analyzing evaluation data, you will notice it can get messy.

    [10:56 - 11:06] Open coding produces lots of notes, which are useful, but they are unstructured. Axial coding solves this by clustering failures into meaningful groups.

    [11:07 - 11:26] For example, as Kevin mentioned about bias: various bias and wording errors can be grouped into a "language failure" category, and format and schema issues can be grouped into an "output structure failure" category. So instead of keeping a specific note for every individual error where the data is not correct,

    [11:27 - 11:47] we assign a cluster, a category of failure, rather than a very precise one-off failure label. This structured view makes debugging easier: when you check a category, you can refine the prompt with the context of all the failures grouped under it. A minimal sketch of this grouping step is below.
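
Here is a minimal sketch of what that clustering can look like in code. The open-coded notes, keyword rules, and category names are made-up examples to show the idea, not the course's actual taxonomy.

```python
from collections import Counter

# Open coding: free-form notes taken while reading traces (hypothetical examples).
open_codes = [
    "response invented a refund policy that does not exist",
    "answer only in English although the user wrote in Spanish",
    "JSON output missing the escalation_path key",
    "model assumed the user is male",
    "reply was valid JSON but keys were in the wrong order",
]

# Axial coding: collapse the open codes into a small set of failure categories.
axial_map = {
    "invented": "hallucination",
    "assumed the user": "bias",
    "English": "language failure",
    "JSON": "output structure failure",
}

def categorize(note):
    # Assign the first matching category; real projects would label by hand or with an LLM.
    for keyword, category in axial_map.items():
        if keyword in note:
            return category
    return "uncategorized"

print(Counter(categorize(n) for n in open_codes))
```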

    [11:48 - 12:06] Then there is the transition from error analysis to accuracy. So far we have looked at error analysis: how to find the gap and how to resolve it. The next step is accuracy, where you transform the raw mistakes into structured, actionable insights that give you the prompt improvements.

    [12:07 - 12:12] Let's see it in the form of a diagram. Error analysis is not just about spotting mistakes.

    [12:13 - 12:18] It's about creating a repeatable process to understand why the model fails. First, you collect a trace data set.

    [12:19 - 12:26] This can be synthetic, real, or a mix. Next you open-code these traces: you read through them and capture observations.

    [12:27 - 12:34] These are the unstructured notes about what went wrong. Then you axial-code, revisiting those notes with more precise and consistent labeling.

    [12:35 - 12:47] After updating the prompts, you create the data set again, and you repeat until you reach a near-to-real data set that you are comfortable using for your own application. So now, jumping from theory to application.

    [12:48 - 13:16] This is connected with the homework, which is attached on our community website. The first step is to analyze: create examples, categorize the failure modes, create the trace data set, review the data set, and code the traces. So first, how do we generate synthetic queries? The bad way to generate synthetic queries is to just ask the LLM to give you a hundred random questions.

    [13:17 - 13:30] That is the most unrealistic, repetitive approach, and it won't get you anywhere near a real data set. The better way is prompt engineering: focus on the feature or intent (the purpose, what the user wants) and the persona (who is asking).

    [13:31 - 13:45] And then the scenario, which can be clear or unclear. For example, staying with the same slide and the coding exercise attached this week: we are trying to build a chatbot and we are creating synthetic queries for it.

    [13:46 - 14:08] Here, instead of asking the LLM to "give me 100 random questions related to customer support," which is the most unrealistic prompt and gives you the most unrealistic data, you define the feature or intent, the purpose: login errors, logging out, updates, crashes, search, performance, data sync, billing, account settings.

    [14:09 - 14:19] Then the persona, who is asking: it can be a new user, a power user, an enterprise admin, a non-native English speaker, or someone who needs accessibility support.

    [14:20 - 14:43] And then the scenario, where you define clear versus ambiguous, or multi-issue, where a user can have a login issue and an update issue or a crash issue at the same time. Cross-device means creating the synthetic data over multiple platforms or access levels: imagine a user saying, "hey, I'm an admin, but I don't have admin access."

    [14:44 - 14:57] "And when I do have the admin access, there is a problem on my computer related to data sync." That's a cross-device case. Then time-critical, where the user asks in a way that makes it clear the question is urgent and they need an answer now.

    [14:58 - 15:02] And then there are compliance concerns. These are the control knobs I was talking about earlier; a small sketch of turning them into prompts follows.
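
Here is a minimal sketch of turning these control knobs into prompt variations instead of asking for "100 random questions." The knob values mirror the slide, but the sampling approach and exact prompt wording are illustrative assumptions.

```python
import itertools
import random

# Control knobs from the slide: intent/feature, persona, and scenario.
intents = ["login errors", "app crashes", "data sync", "billing", "account settings"]
personas = ["new user", "power user", "enterprise admin", "non-native English speaker"]
scenarios = ["clear", "ambiguous", "multi-issue", "cross-device", "time-critical"]

random.seed(0)
# Sample a handful of knob combinations; each one becomes its own generation prompt.
combos = random.sample(list(itertools.product(intents, personas, scenarios)), k=5)

for intent, persona, scenario in combos:
    prompt = (
        f"Generate one realistic customer support query about {intent}, "
        f"written by a {persona}, in a {scenario} scenario. "
        "Vary the phrasing and include concrete details where natural."
    )
    print(prompt)
```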

    [15:03 - 15:40] Then there are the components of a good prompt, specifically for a chatbot. First, define the chatbot persona and task: "you are a professional and empathetic customer support assistant that resolves technical issues clearly and politely." Then instructions or rules: always confirm the customer's problem before giving them the solution; never make up product details; if uncertain, escalate or ask to verify. Then provide the context: customer product type, issue category, support history, relevant policy. There can be multiple pieces of context, for example a customer who has contacted the company multiple times, and you add that context too.

    [15:41 - 16:00] Continuing the previous slide: these are the components of prompt engineering, which will be covered in more detail in the prompt engineering lecture, but they are connected with synthetic data, so I had to introduce them earlier. You give a few-shot example in the prompt, an example of what the output should look like.

    [16:01 - 16:25] The model will then produce output similar to the example. Second are reasoning steps: you state in the prompt to first identify the issue, then verify the account details, then suggest a solution, instead of jumping straight to a conclusion. Then the output format: always respond in JSON with defined keys, for example acknowledgement, solution steps, escalation path, and so on.

    [16:26 - 16:42] And then there is structure: you separate the instructions from the response with clear markers. That can be XML tags, for example <problem> and <solution>, or Markdown, for example three hashtags, anything that clearly keeps the sections separate.

    [16:43 - 16:55] Then in code, you find those Markdown or XML tags to extract the data, as in the sketch below.
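
A minimal sketch of that extraction step, assuming the model followed the requested XML-style markers. The tag names (`problem`, `solution`) and the sample response are hypothetical, and a real response may of course deviate from the requested format.

```python
import re

# A model response that followed the requested XML-style markers (made-up example).
response = """
<problem>User cannot log in after the latest update.</problem>
<solution>Clear the app cache, then sign in again with the reset link.</solution>
"""

def extract(tag, text):
    # Pull the content between <tag> and </tag>; returns None if the model ignored the format.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None

print({"problem": extract("problem", response),
       "solution": extract("solution", response)})
```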

    [16:56 - 17:15] Now for the user query generator: imagine we want to create synthetic data for the issues, where the issues are the questions raised by users. This is a prompt you can use: "You are synthesizing realistic customer support queries. The task is to generate N samples," where you pass in the number of samples you want, "of distinct user queries about the issue type."

    [17:16 - 17:31] Then you swap the issue type with each API call. The issue type can be login problems, software crashes, features not working, performance issues, installation problems, data issues, billing questions, accounts and settings; there can be many issue types.

    [17:32 - 17:53] The constraints are that each query should sound like a real person, you have to make sure it varies in paraphrasing, and it should include statements, questions, and error reports. It should include occasional concrete details such as a version number, a screenshot mention, or a timestamp, and you keep each query to one or two sentences, depending on your application.

    [17:54 - 18:02] And always return JSON in the schema that is defined. This will generate the synthetic data for each issue type in the user-query format; a hedged sketch of such a generator follows.
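
Here is a minimal sketch of that user-query generator, assuming the OpenAI Python SDK as the chat-completion client and a hypothetical model name; the prompt paraphrases the constraints from the slide rather than reproducing the exact homework prompt, and any chat-completion API could be swapped in.

```python
import json
from openai import OpenAI  # assumed client; reads OPENAI_API_KEY from the environment

client = OpenAI()

ISSUE_TYPES = ["login problems", "software crashes", "billing questions", "data sync issues"]

QUERY_PROMPT = """You are synthesizing realistic customer support queries.
Task: generate {n} distinct user queries about: {issue_type}.
Constraints:
- Each query should sound like a real person (mix of statements, questions, and error reports).
- Vary the paraphrasing; occasionally include concrete details (app version, screenshot mention, timestamp).
- Keep each query to 1-2 sentences.
Return ONLY JSON: {{"queries": ["...", "..."]}}"""

def generate_queries(issue_type, n=5, model="gpt-4o-mini"):  # model name is an assumption
    resp = client.chat.completions.create(
        model=model,
        temperature=0.9,  # higher temperature for more varied queries
        messages=[{"role": "user", "content": QUERY_PROMPT.format(n=n, issue_type=issue_type)}],
    )
    # A sketch: assumes the model returned valid JSON in the requested schema.
    return json.loads(resp.choices[0].message.content)["queries"]

all_queries = {it: generate_queries(it) for it in ISSUE_TYPES}
print(json.dumps(all_queries, indent=2))
```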

    [18:03 - 18:17] Another synthetic data generator is on the chatbot side, and there are multiple variations. This one is the complete solution: "generate a customer support bot response that provides a complete solution."

    [18:18 - 18:39] You pass in the user query, which was generated in the previous step, and then you generate synthetic data based on that user query from the chatbot's perspective. The requirements are: provide clear step-by-step instructions that fully resolve the issue, address the specific problem mentioned by the user, be actionable and complete, use a professional tone, and keep the length to two to four sentences.

    [18:40 - 18:47] The issue type is the same one that was passed to the customer-query generation. There is also a variant that generates the partial solution.

    [18:48 - 19:00] The only difference there is that you replace "complete solution" with "partial solution" and edit the requirements for partial solutions, and then you pass in the same synthetic data, the user query and the issue type. A sketch of both variants is below.
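
A minimal sketch of the two response generators, again assuming the OpenAI Python SDK and a hypothetical model name; the prompt and requirement wording paraphrase the slide and are not the exact homework text.

```python
from openai import OpenAI  # same assumption as before: any chat-completion client works

client = OpenAI()

RESPONSE_PROMPT = """Generate a customer support bot response that provides a {solution_kind} solution.
User query: {user_query}
Issue type: {issue_type}
Requirements:
- {requirement}
- Address the specific problem mentioned by the user.
- Professional tone, 2-4 sentences."""

REQUIREMENTS = {
    "complete": "Provide clear step-by-step instructions that fully resolve the issue.",
    "partial": "Provide some useful steps but leave the issue only partly addressed (a near-miss negative).",
}

def generate_response(user_query, issue_type, solution_kind="complete", model="gpt-4o-mini"):
    prompt = RESPONSE_PROMPT.format(
        solution_kind=solution_kind,
        user_query=user_query,
        issue_type=issue_type,
        requirement=REQUIREMENTS[solution_kind],
    )
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

query = "The app keeps crashing every time I open the billing page on version 3.2."
print(generate_response(query, "software crashes", "complete"))
print(generate_response(query, "software crashes", "partial"))
```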

    [19:01 - 19:10] When you look at the homework code, all of this synthetic data is generated by the LLM. For each user query, it creates the different bot response styles.

    [19:11 - 19:20] There is a complete solution, a partial solution, a workaround, an escalation, out of scope, a redirect, and a few other styles. The complete solution is the clean, resolved outcome.

    [19:21 - 19:29] The one with the partial solution is a near-miss negative. And the others are all anchored to the user queries.

    [19:30 - 19:35] Now we are at the stage of error analysis. That was the stage strictly for creating the synthetic data.

    [19:36 - 20:00] Now we are at the analysis stage, where we check whether both of these synthetic data sets, not just the user side but also the chatbot side, make sense. To do that, what you have here is a repeatable blueprint: the code attached on our community website is a repeatable blueprint for using an LLM as an automated judge, whether you are evaluating chatbot responses, summaries, or any other kind of generated content.

    [20:01 - 20:10] It starts with defining what you are trying to measure, and then building or sourcing labeled examples that reflect the real-world distribution. In this setup, your prompt is your judge.

    [20:11 - 20:23] So treat it like evaluation or testing code: you have to be as precise as possible when writing the prompt the LLM uses to judge the synthetic data. For example, it can include few-shot examples.

    [20:24 - 20:58] The prompt can be: "Evaluate the user query {user_query} and the chatbot response {response}, and label it as resolved or not resolved based on the requirements." Then you specify the requirements, and at the end you make it very precise: answer only "resolved" or "not resolved" as the label. The judge will then label the data as resolved or not resolved by looking at the user query and the bot response against the requirements you give.

    [20:59 - 21:15] To make it more effective, you can add few-shot reasoning examples. For example, you give a real-world example where a user asks a query, the chatbot's response is such-and-such, and you give the label as resolved, and similarly for other examples.

    [21:16 - 21:42] These are examples you pass inside the prompt so the LLM-as-judge can evaluate based on the examples, the prompt requirements, and the generated data. And always make sure to set the temperature low for deterministic results. The temperature is the setting you see when you play around with an LLM notebook.

    [21:43 - 21:58] When you move the temperature of the LLM toward one or above, it becomes more creative but less deterministic. When you move the temperature lower, it becomes more deterministic, but also less creative. A sketch of such a judge is below.
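
A minimal sketch of the judge, assuming the OpenAI Python SDK and a hypothetical model name. The judge prompt paraphrases the slide, the few-shot examples are made up, and the judge can be the same model that generated the data or a different one, as discussed earlier.

```python
from openai import OpenAI  # assumed client

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a customer support chatbot.
User query: {query}
Chatbot response: {response}
Requirement: the response must fully address the user's specific problem with actionable steps.

Example 1:
Query: "I can't log in after the update."
Response: "Reset your password using the in-app link; if that fails, clear the cache and retry."
Label: resolved

Example 2:
Query: "My invoice is wrong for March."
Response: "Sorry to hear that, billing issues can be frustrating."
Label: not resolved

Answer with exactly one label: resolved or not resolved."""

def judge(query, response, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,  # low temperature for deterministic, repeatable judgments
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, response=response)}],
    )
    return resp.choices[0].message.content.strip().lower()

print(judge("The app crashes when I open billing.", "Update to version 3.3; the crash is fixed there."))
```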

    [21:59 - 22:25] Then you do the human evaluation: you look at the same synthetic data yourself and mark each example as resolved or not resolved. You then use your evaluation and the LLM's evaluation to calculate metrics such as accuracy, precision, recall, F1, true positive rate, true negative rate, the confusion matrix, and a bias analysis.

    [22:26 - 22:36] Then you go back to refactoring the prompts, not just for the LLM-as-judge but also for the synthetic data, to make them as reliable as possible. Now we are at the measurement stage.

    [22:37 - 22:50] This is where axial coding comes in: we have results from both the LLM-as-judge and the human. For example, the false positive cluster is the LLM saying "not resolved" while the human marked it as resolved.

    [22:51 - 22:58] That tells us the LLM-as-judge is being too strict. The root cause could be in the synthetic data or in the LLM-as-judge prompt.

    [22:59 - 23:11] So we go back to the prompt and refine it accordingly, considering the false positive cluster. Then there is the false negative cluster, where the LLM marked the data as resolved and the human said it was not resolved.

    [23:12 - 23:32] That false negative cluster tells us the LLM is marking most things as resolved while the human is being more precise about resolved versus not resolved. If you look at the output, this is the JSON file in the coding exercise itself.

    [23:33 - 23:42] In the human-versus-LLM comparison we have a total sample count of 15, true positives of 4, true negatives of 6, false positives of 4, and false negatives of 1.
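
Using those counts from the comparison file, here is a quick worked sketch of how the standard metrics fall out; the formulas are the usual ones, with "flagged a problem (not resolved)" treated as the positive class as in the clusters above.

```python
# Confusion counts from the human-vs-LLM comparison in the homework example.
tp, tn, fp, fn = 4, 6, 4, 1
total = tp + tn + fp + fn             # 15 samples

accuracy = (tp + tn) / total          # 10/15 ≈ 0.67
precision = tp / (tp + fp)            # 4/8 = 0.50 -> half of the judge's "not resolved" flags were wrong
recall = tp / (tp + fn)               # 4/5 = 0.80 -> the judge misses few real problems
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.62

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```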

    [23:43 - 23:52] This gives us an idea of where the LLM-as-judge is being ineffective. Our next step is to provide better examples in the prompt itself.

    [23:53 - 23:57] That can be in the LLM-as-judge prompt or in the synthetic data prompt. Then we keep iterating and improving.

    [23:58 - 24:07] This is what I mentioned on the earlier slide: improving from the results. Now, what usually happens over time is data drift.

    [24:08 - 24:15] This happens mostly with the APIs. Before diving into it, let's first understand what data drift is.

    [24:16 - 24:25] Data drift is when the LLM initially answers from its pre-training data, but over time it stops answering from that pre-training data.

    [24:26 - 24:28] The data drifts away from the pre-training distribution, and you start getting new outputs.

    [24:29 - 24:48] In these cases, both the synthetic data and the LLM-as-judge surface new failure modes. To keep things as precise as possible, after identifying the new failure modes, we rerun the whole process.

    [24:49 - 24:56] We go back and quickly re-read the past traces. This ensures that the data reflects the updated definitions and categories.

    [24:57 - 25:04] As more examples come in, you will have to refine the categories further. Sometimes you will notice categories turning out to be overlapping.

    [25:05 - 25:13] Other times, you will need to split overly broad categories into more precise ones. The key idea here is theoretical saturation.

    [25:14 - 25:24] That's the point where reviewing additional traces stops surfacing fundamentally new errors. Instead of finding everything at once, you make it an active process of finding new errors over time.

    [25:25 - 25:30] At some point, you will have a complete picture of what you want. For example, there is a new kind of error showing up right now.

    [25:31 - 25:38] When you ask an LLM, "hey, find me the best framework," or to keep it simple, "find me the best restaurant,"

    [25:39 - 25:41] it used to answer one thing before, and now it answers something else.

    [25:42 - 25:52] The reason is that advertising firms are trying to connect these models with advertisements. When the model puts something at the top of its answer, the end user will think it's a good restaurant.

    [25:53 - 25:59] "Let's go there." But it's data drift, done intentionally because of advertising and a lot of other reasons.

    [26:00 - 26:03] So the models that are continuously changing, those are the foundation models.

    [26:04 - 26:12] If you have an on-premise model, it will have less data drift, but it will still have some drift, partly for reasons related to hallucination.

    [26:13 - 26:23] Yes, there is a new issue right now where advertising firms are being connected with the LLM models. It used to answer fairly before, when you asked a ChatGPT model something.

    [26:24 - 26:36] But now what they can do is fine-tune the existing model toward those advertising vendors. So when you ask "what is the best airline," it used to answer fairly from the internet corpus.

    [26:37 - 26:50] Which might be, let's say, American Airlines. Now suppose I'm the CEO of Spirit Airlines and I give huge funding to the advertising firms connected with ChatGPT or any other large language model organization.

    [26:51 - 26:57] I will ask them to always put my Spirit Airlines forward. So when you ask what the best airline is, it's going to answer Spirit Airlines.

    [26:58 - 27:09] That doesn't come from the original training data; it comes from updating the model weights internally. There are also multiple other causes of data drift, because the data in the internet corpus is always changing.

    [27:10 - 27:20] Right now we are at the stage where most of the data on the internet is generated by LLMs, whereas it used to be that we were the ones generating all the text on the internet.

    [27:21 - 27:28] So, after you have a collection of synthetic data examples, this next part is more on the data science side, but I wanted to include it.

    [27:29 - 27:43] The reason is that it gives you an overall picture of how good your synthetic data is. The overall analyzer here is a Python script with your own dimensions for calculating the quality of the synthetic data.

    [27:44 - 27:55] For our use case, the first dimension is data quality: are there duplicate queries or responses? Are any fields missing? How complete is the data overall?

    [27:56 - 28:10] That's data quality, one of the dimensions. The next dimension is diversity, which is calculating the diversity across all the issue types in the user queries, and then bias detection, which is checking for any response that dominates more than 50% of the data.

    [28:11 - 28:17] It also checks whether certain phrases are overrepresented. For example, if almost every response is "if you reset the password, it will solve the problem," over and over again,

    [28:18 - 28:29] that is what bias detection flags. Linguistic quality is about how repetitive the text is: even if it's the same question, how repetitive is it in terms of paraphrasing?

    [28:30 - 28:45] When I say paraphrasing, I mean it has the same conclusion but a completely different text structure. In practice, LLMs have a tendency to take the shortcut and give you a very simple paraphrase of a simple text.

    [28:46 - 28:58] So that's the overall idea. Then there is the LLM realism check, where we sample a few records and let an LLM score them on realism, diversity, authenticity, response quality, and naturalness.

    [28:59 - 29:06] It's like asking a human judge, but automated. And then you get an overall score of how good your data is; a small sketch of such an analyzer follows.
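
Here is a minimal sketch of such an analyzer with just the duplicate, completeness, diversity, and bias-dominance checks. The record fields, sample data, and thresholds are assumptions to illustrate the idea, not the homework code, and the LLM realism check is left out.

```python
from collections import Counter

# Hypothetical synthetic records; in the homework these come from the generators above.
records = [
    {"issue_type": "login problems", "user_query": "I can't log in since the 3.2 update.",
     "bot_response": "Reset your password using the in-app link."},
    {"issue_type": "login problems", "user_query": "I can't log in since the 3.2 update.",
     "bot_response": "Reset your password using the in-app link."},
    {"issue_type": "billing questions", "user_query": "Why was I charged twice in March?",
     "bot_response": "Reset your password using the in-app link."},
]

REQUIRED_FIELDS = {"issue_type", "user_query", "bot_response"}

def analyze(records):
    queries = [r["user_query"] for r in records]
    responses = [r["bot_response"] for r in records]
    return {
        # Data quality: duplicates and missing fields.
        "duplicate_queries": len(queries) - len(set(queries)),
        "incomplete_records": sum(1 for r in records if not REQUIRED_FIELDS <= r.keys()),
        # Diversity: how many distinct issue types are covered.
        "distinct_issue_types": len({r["issue_type"] for r in records}),
        # Bias detection: does any single response dominate more than 50% of the data?
        "dominant_response_share": Counter(responses).most_common(1)[0][1] / len(responses),
    }

print(analyze(records))
```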

    [29:07 - 29:14] And if you look here, we have a bias score of 62 out of 100. You can then target that bias.

    [29:15 - 29:32] Try including more samples, additional refinement of the prompts, and additional few-shot examples in the prompt, and that will reduce the bias. When we start evaluating LLM output, our first list of error types is usually incomplete, but as we review more traces, we discover new patterns of mistakes.

    [29:33 - 29:46] That's also why data drift happens, as I mentioned. Data drift occurs because the real world keeps generating data over time, and that data diverges from the patterns the original machine learning model was trained on.

    [29:47 - 29:51] That is the classic kind of data drift that used to occur, but now there are many kinds of drift within the data drift category.

    [29:52 - 30:00] We were talking about evaluation, which covers accuracy, true positives, and true negatives. But before going into that, what is evaluation?

    [30:01 - 30:16] Evaluation has several aspects that work together. First there is retrieval evaluation, specifically for retrieving relevant documents: recall, and precision, which asks whether the retrieved documents are actually relevant. Then we have generation evaluation.

    [30:17 - 30:33] Once documents are retrieved, how well does the system use those documents to answer correctly, factually, and with precise use of the sources? These are called lagging metrics, which include faithfulness, relevance, and a number of other metrics.

    [30:34 - 30:40] Then we have benchmarking and testing. Think of benchmarking as a simulation: you have a complete, fixed set to simulate against.

    [30:41 - 30:54] So, let's say you have a RAG system: you have the PDFs, the synthetic data, and its gold labels. You simulate the entire RAG system on that benchmark and test whether your system is working correctly.

    [30:55 - 31:00] People usually ask me: are we targeting the evaluation? We target the evaluation to make the system better.

    [31:01 - 31:06] But we never solely target an evaluation or a benchmark. The reason is that results vary with the end user.

    [31:07 - 31:15] These are not a precise way of describing the final model. Let's say accuracy is 98%; we still cannot say the AI application is 98% accurate.

    [31:16 - 31:24] The reason is that end users differ, and the way of measuring accuracy for a full application is not well defined right now. That's my personal viewpoint.

    [31:25 - 31:33] But we do have to target evaluations and benchmarks in order to improve an existing AI application. So when you see "true positive" on the previous slide, what is a true positive?

    [31:34 - 31:39] It's when the model correctly flags a real problem. For example, the chatbot judge looks at an issue and asks: is it resolved?

    [31:40 - 31:47] If the judge flags it as unresolved and the human also marks the issue as unresolved, that is a true positive. A true negative is when the model correctly recognizes that no problem exists.

    [31:48 - 32:00] For example, the chatbot judge says resolved and the human also says resolved; that is a true negative. So for a true positive both of them say unresolved, and for a true negative both of them say resolved.

    [32:01 - 32:07] In both cases the two parties agree on the same outcome. A false positive, on the other hand, is basically a false alarm.

    [32:08 - 32:21] The model flags a problem, but in reality everything was fine. In our use case, the LLM-as-judge says the ticket is unresolved while the ticket is actually resolved, where the human marked it as resolved.

    [32:22 - 32:34] Then it's a false positive. A false negative is the opposite: the user's problem was actually unresolved according to the human, but the LLM-as-judge says resolved.

    [32:35 - 32:51] To calculate accuracy, we take the correct predictions, whether positive or negative, that is, TP plus TN, and divide by the total number of predictions, TP + TN + FP + FN; that gives us the accuracy.

    [32:52 - 32:58] For recall, we do the same with its own formula, TP / (TP + FN). Similarly, for precision we use only the flagged predictions: TP / (TP + FP).

    [32:59 - 33:23] For the F1 score, we take the harmonic mean of precision and recall, which is useful when we care both about not missing problems and about not over-flagging resolved tickets. These are also baseline information-retrieval metrics for RAG, where information retrieval is a component of RAG, while RAG is our own architecture built around the LLM.

    [33:24 - 33:32] So when we talk about baseline metrics for information retrieval, they consist of recall and precision. Recall asks: did we find all the relevant documents?

    [33:33 - 33:44] Precision asks: are the documents you retrieved actually relevant? High recall means you are less likely to miss an important source, but you might have fetched noise along the way.

    [33:45 - 33:55] High precision means returning only the best documents, but you risk missing useful information. So there's a natural trade-off.

    [33:56 - 33:59] You have to find the right balance. People usually ask me: what's the right balance?

    [34:00 - 34:12] You find it after you test with yourself or with other users; that's how you arrive at the right balance of recall and precision. There are two types of metrics here, leading metrics and lagging metrics, as I mentioned earlier.

    [34:13 - 34:29] Leading metrics are retrieval-focused, which is recall and precision, and lagging metrics are what I mentioned earlier, relevance and faithfulness, which give us evidence that the LLM is generating its answer from the source. Recall is like fetching all the CSS files you need for a page.

    [34:30 - 34:41] If you miss one, the page is broken. Precision is like making sure you don't load a giant 50 MB CSS file when the user opens the web page.

    [34:42 - 34:52] The key takeaway is to find the right balance between all of these metrics. These metrics are worth relying on, but over-relying on them can lose sight of the value of the full application.

    [34:53 - 34:59] I discussed these earlier. These are some additional retriever metrics: recall, precision, MRR, and DCG.

    [35:00 - 35:10] Each has its own nuances, and we are going to uncover most of them in the RAG lecture. On the generation side, the metrics are relevance, correctness, faithfulness, and then hallucination metrics.

    [35:11 - 35:29] As for benchmarks, how are benchmarks different from metrics? As I mentioned earlier, a benchmark is a complete simulation: you run a fixed set of tasks and then look at the results to see how well the system performs on that benchmark.

    [35:30 - 35:40] There are multiple RAG benchmarks, such as BEIR, FlashRAG, and RAGBench, and the three ingredients of all of these benchmarks are the same: documents (PDFs), queries, and the ground-truth or gold labels.

    [35:41 - 35:46] These are all the benchmarks I mentioned on the earlier slides. Each has its own specificity.

    [35:47 - 35:59] Now, you might perform worse on a public benchmark but better on your own metrics when you test your own system. In that case, what you usually do is create your own custom benchmark.

    [36:00 - 36:17] In a custom benchmark, you use your own documents, you create your own synthetic queries, and you attach a gold label to each query. When I say gold label here, I mean the location of the data, where it is stored in the vector DB or database.

    [36:18 - 36:48] This is the RAG flywheel, the overall picture: first you retrieve documents, then you generate answers based on those retrieved documents, then you evaluate the retrieval plus the generation, not just the generation. You use synthetic data sets to refine the whole process, you choose the best embedding, chunking, and re-ranking, and then you have an iteration loop. Think of it as a CI/CD pipeline, where you fetch, build, test, deploy, monitor, and iterate.

    [36:49 - 37:04] In RAG, the use of synthetic data is essentially to unit-test the system, whether that's checking recall, MRR, the faithfulness score, or relevance. We will cover more of it in the RAG lecture, but this gives you an overall picture of how things work there; a small scoring sketch follows.
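
Here is a minimal sketch of scoring a retriever against such a custom benchmark. The gold labels are represented as hypothetical document IDs and the retriever output is a stubbed dictionary, purely to show how recall@k and MRR are computed; in a real setup the gold label would point at the stored location in your vector DB and the rankings would come from your actual retriever.

```python
# Each benchmark item: a synthetic query and the gold document ID(s) it should retrieve.
benchmark = [
    {"query": "How long do refunds take?", "gold": {"doc_refund_policy"}},
    {"query": "Reset my enterprise admin password", "gold": {"doc_admin_access"}},
]

# Stubbed retriever output: ranked document IDs per query (replace with your real retriever).
retrieved = {
    "How long do refunds take?": ["doc_shipping", "doc_refund_policy", "doc_billing"],
    "Reset my enterprise admin password": ["doc_admin_access", "doc_login_help"],
}

def recall_at_k(ranked, gold, k):
    # Fraction of gold documents found in the top-k results.
    return len(set(ranked[:k]) & gold) / len(gold)

def reciprocal_rank(ranked, gold):
    # 1 / rank of the first relevant document, 0 if none was retrieved.
    for i, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / i
    return 0.0

k = 2
recalls = [recall_at_k(retrieved[b["query"]], b["gold"], k) for b in benchmark]
mrr = sum(reciprocal_rank(retrieved[b["query"]], b["gold"]) for b in benchmark) / len(benchmark)
print(f"recall@{k}={sum(recalls)/len(recalls):.2f}  MRR={mrr:.2f}")
```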

    [37:05 - 37:16] Some pitfalls to avoid: rushing through the evaluation process. If you don't consider the dimensions, your evaluation will always fall short.

    [37:17 - 37:25] Synthetic data needs to cover enough variety, otherwise you will miss key insights. If you skip open coding, you will miss errors that are actually present in the data.

    [37:26 - 37:38] Do not blindly apply categories, or you will end up with blind spots. If you use fine-grained scoring like one-to-five scales before you even know your categories, you will end up with meaningless numbers.

    [37:39 - 37:44] So instead, use binary flags: zero or one, pass or fail. If you don't iterate, nothing improves.

    [37:45 - 37:57] Evaluation must be a cycle, not a one-and-done exercise. If you do not involve domain expertise where needed, you risk generating confident, nonsense data that looks polished but is misleading or unsafe.

    [37:58 - 38:09] The fix is simple: slow down, cover your dimensions, make sure your prompts are good when you are generating synthetic data, make sure your prompts are good when you are evaluating with LLM-as-judge, and carefully align the judge's evaluation with the human evaluation.

    [38:10 - 38:25] So far we were looking at synthetic data and metrics, but now, to build your knowledge of model cards and related material, we are going to go hands-on with model cards from different models. First, what is a model card?

    [38:26 - 38:34] A model card is like an official report card for an AI model. It shows how it was built, where it performs well, where it struggles, and its benchmark scores.

    [38:35 - 38:46] Before diving into evaluation, it's important to know that this is where you actually find how a model performs on specific benchmarks. Some models perform well at math, while some perform well at text generation.

    [38:47 - 38:54] Some perform well at generating code. That's why we need model cards: to check whether we are using the right model.

    [38:55 - 39:14] When we think about evaluation from the perspective of foundation models, think of it as testing specific units. The core categories are coding, math and STEM, tool use and reasoning, and general knowledge and logic, and there can be other categories like multi-language support.

    [39:15 - 39:27] When you look at these categories and see that a specific model performs well in one of them, that's how you find out which model you should be using for your own AI application. Without evaluation, a model release would be just marketing.

    [39:28 - 39:42] Benchmarks keep companies honest: they show whether a model is better at coding or at other tasks, and where it fails. For enterprises, these evaluations act as proof that the model is safe and reliable enough to adopt in your own application.

    [39:43 - 39:57] When Google launches Gemini or Meta releases Llama, they don't just say "this is better," they back it up with evidence. Those results get published in the model cards, and you can read the model card to evaluate which model performs well on which benchmarks.

    [39:58 - 40:10] There are different benchmarks in each of these categories. For example, for the coding and software engineering category, some benchmarks are HumanEval, MBPP, and APPS; all of these are benchmarks.

    [40:11 - 40:24] You can use these benchmarks to evaluate foundation models, evaluation in the context of foundation models. Let me show you: when you go to Hugging Face, this is the model card, and this one is DeepSeek V3.1.

    [40:25 - 40:29] You can see there are two variants, and this helps us check which model we should use.

    [40:30 - 40:40] There is additional information on how to use the DeepSeek model and how to use tool calling, which is basically giving it a tool. And then this is the evaluation section, where you can see where it performs better.

    [40:41 - 40:54] It also has the categories: for math, the non-thinking version scores 66.3, while the R1-0528 version scores 91.4.

    [40:55 - 41:02] So this gives you an idea of what to use for your own use cases. One of the guys from our previous cohort was building a coding agent.

    [41:03 - 41:10] He was confused about what to use, and this is the kind of benchmark that gives you an idea of what to use.

    [41:11 - 41:18] And then there is the open-source GPT model as well, the 20-billion-parameter one. Here, all of this is additional information that is available.

    [41:19 - 41:26] The reason I want to show this is that this model card does not show everything; it does not show all the information about the model.

    [41:27 - 41:34] So what you do is find the research paper, and then you can come to your own conclusion about whether you should be using that model.

    [41:35 - 41:41] These research papers will include all the benchmarks. In my personal opinion, I did not prefer the GPT one for this use case.

    [41:42 - 41:51] I would go with DeepSeek. These are additional benchmarks for all of the other scores, covering coding and software engineering.

    [41:52 - 42:00] And then for math and STEM: in one of the coding exercises I'm using GSM8K for evaluating the LLM.

    [42:01 - 42:07] That is one of the benchmarks for math and STEM. These are all the additional benchmarks that are available to you.

    [42:08 - 42:16] Then there is common-sense logic and reasoning, with all of these benchmarks, and for truthfulness, factuality, and safety you can use these benchmarks.

    [42:17 - 42:28] Some of these benchmarks cover language safety, and you can use them to test whether your system performs better in that category.

    [42:29 - 42:41] Then for conversation and instruction following, these are the leaderboards, where people continuously evaluate the models; let's explore and check them.

    [42:42 - 42:51] If you look here, in terms of text generation Gemini 2.5 Pro performs better, while in terms of web development GPT-5 performs better, and similarly for the other categories.

    [42:52 - 42:59] These leaderboards are continuous testing of models by real users. And then there are tool-use and agent benchmarks.

    [43:00 - 43:13] If you are building a multi-agent architecture, you will need tool use, and to make sure the LLM picks the right tool, you can use one of these benchmarks to check that the LLM is not hallucinating when picking a tool.

    [43:14 - 43:30] This one is a long-context benchmark, which basically means passing in a huge amount of text and then asking questions about it, to check whether the model captures the context from that large body of text. And as I mentioned earlier, I already showed you the model card example on Hugging Face.

    [43:31 - 43:52] These are simple examples of model cards for all of the models and how they perform on different benchmarks, and also the leaderboards, as we saw on the slide; there are different types of leaderboards you can rely on, where people are actively testing these different models to see which one performs better in which category.