Technical Orientation (Python, Numpy, Probability, Statistics, Tensors)
- How AI Thinks in Numbers: Dot Products and Matrix Logic
- NumPy Power-Tools: The Math Engine Behind Modern AI
- Introduction To Machine Learning Libraries
- Two and Three Dimensional Arrays
- Data as Fuel: Cleaning, Structuring, and Transforming with Pandas
- Normalization in Data Processing: Teaching Models to Compare Apples to Apples
- Probability Foundations: How Models Reason About the Unknown
- The Bell Curve in AI: Detecting Outliers and Anomalies
- Evaluating Models Like a Scientist: Bootstrapping, T-Tests, Confidence Intervals
- Transformers: The Architecture That Gave AI Its Brain
- Diffusion Models: How AI Creates Images, Video, and Sound
- Activation Functions: Teaching Models to Make Decisions
- Vectors and Tensors: The Language of Deep Learning
- GPUs, Cloud, and APIs: How AI Runs in the Real World
[00:00 - 00:08] Okay, so basically, what is NumPy? NumPy is basically, you can think of it as a pre-made library or pre-made program available inside Python.
[00:09 - 00:27] In order to install NumPy, you can just type the command pip install numpy, and it will install NumPy. It is used to perform fast and efficient array operations, and it is used in AI and machine learning programs in order to perform arithmetic operations on data.
[00:28 - 00:40] There are different types of arrays in NumPy: one-dimensional arrays, two-dimensional arrays, and three-dimensional arrays. A one-dimensional array represents a simple list of numbers, commonly used for storing a single-feature dataset.
[00:41 - 01:01] A two-dimensional array, on the other hand, is basically rows and columns, and a three-dimensional array is basically a combination of two-dimensional arrays. Now, in typical coding languages you would have to write a manual loop in order to build and store an array, while in NumPy you just call np.array, and then you can define whatever size of array you would like.
[01:02 - 01:27] So you don't have to manually go over and store the list and then build a 2D or 3D array, compared to other programming languages. To use NumPy, you basically import NumPy as np, which is an alias used throughout the code, and then you can use the array function from NumPy.
[01:28 - 01:40] These functions range from addition to other arithmetic operations, like dot products, multiplication, subtraction, and many more. So the reason for using NumPy is primarily that you don't have to code each and every thing.
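As a minimal sketch of what that looks like in practice (the array values here are just placeholders for illustration):

```python
import numpy as np

# Create a one-dimensional array from a plain Python list
a = np.array([1, 2, 3, 4])

# Element-wise arithmetic without writing any loops
print(a + 10)        # [11 12 13 14]
print(a * 2)         # [2 4 6 8]
print(np.dot(a, a))  # 30  (1*1 + 2*2 + 3*3 + 4*4)
```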
[01:41 - 02:09] You can just use the existing library and the functions of that existing library in order to achieve your specific task. Moving on to different libraries, these are the typical libraries that are usually used in the field of machine learning. NumPy is basically fast number crunching, and it is primarily used for matrix operations, dot products, and all sorts of tasks that require us to store high-dimensional data in a variable and then perform arithmetic operations on it.
[02:10 - 02:30] Matplotlib, on the other hand, is another library that is used to create insightful plots just by importing functions from matplotlib; we will be going over that library as we move into future units. And then similarly, pandas is for manipulating tabular forms of data.
[02:31 - 02:42] Basically, pandas is essential for handling structured data, allowing us to easily manipulate data frames. It helps us do filtering and transformations like normalization and other things.
[02:43 - 02:57] All of those things will be covered in the upcoming slides. So think of all of these as pre-made programs for us to use, and we can install them by using pip install followed by numpy, matplotlib, or pandas.
[02:58 - 03:13] And then you use the functions from that library in order to use a specific capability from the respective libraries. Leveraging these libraries helps us build, train, and then deploy AI models efficiently, making Python the preferred choice for AI developers.
[03:14 - 03:18] Let's go to two-dimensional and three-dimensional arrays. So what are two-dimensional arrays?
[03:19 - 03:26] Basically, two-dimensional arrays are something that consists of rows and columns. You can think of it as a spreadsheet or table.
[03:27 - 03:37] And then basically, a three-dimensional array is a stack of two-dimensional arrays. Think of it like a small notebook with multiple pages where each page is a two-dimensional grid.
[03:38 - 03:49] For example, if you would like to store image data, it would require us to store it in three dimensions. The reason for that is it has height, it has width, and then it has color channels like red, green, and blue.
[03:50 - 04:01] So I'm going to simultaneously go over to the notebook in order to help you understand all the concepts as we go over the slides. So basically, this is a two-dimensional array.
[04:02 - 04:14] And then these are some pre-made functions that you can use to perform any operations. For example, np.full, where full represents filling up the entire array with seven at the respective size.
[04:15 - 04:22] The size over here is two and four. Similarly, np.zeros fills up the entire array with zeros at the respective size, which is basically two and two.
[04:23 - 04:32] Similarly for np.ones. So there are all these pre-made functions that you should explore on your own when you are working through the notebook from unit one and the exercise one task.
[04:33 - 04:39] You don't have to worry about completing each and everything. This is just to get you familiar and brush up your skills with Python programming.
[04:40 - 04:44] And then this is a 3D array. It basically consists of a stack of 2D arrays.
[04:45 - 04:52] If you see over here, when we want to print the shape of that 3D array, which is basically here, it's a cube.
[04:53 - 05:06] So if we call .shape on the 3D array, it prints us the shape of that 3D array, which is two, two, and two. Similarly, for any other array, it's gonna do the same if we add .shape to that array variable.
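To make this concrete, here is a small sketch of those helper functions and the .shape attribute, using the sizes mentioned above:

```python
import numpy as np

filled = np.full((2, 4), 7)   # 2x4 array filled with 7
zeros  = np.zeros((2, 2))     # 2x2 array of zeros
ones   = np.ones((2, 2))      # 2x2 array of ones

# A 3D array is just a stack of 2D arrays
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])
print(cube.shape)  # (2, 2, 2)
```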
[05:07 - 05:18] So now diving into matrix multiplication and the dot product. So scalar multiplication is a basic concept where you multiply a number with a matrix, and then you have an answer, which is basically over here.
[05:19 - 05:28] If you see, if you multiply a number like two by an element inside the matrix, say four, then we have eight. Similarly, two multiplied by zero is zero.
[05:29 - 05:33] Two multiplied by one is two. And two multiplied by minus four is minus eight.
[05:34 - 05:55] So think of it as you have some input, for example numbers that describe age, income, and education level, and you have a set of preferences or filters, and you would like the model to put more emphasis on a specific feature of the dataset; then you would use the dot product with a weight for that emphasis, which over here is basically two.
[05:56 - 06:10] It is also used in a different direction, which we are gonna go into in the exercises in further units, where if you try to multiply some words, or you try to do arithmetic operations on words, you will get different words.
[06:11 - 06:32] So that's how it relates to AI. Similarly, for the dot product over here, the operation is basically: if we have two matrices, A and B, and we want to find the dot product of A and B, then what we do is first multiply A1 and B1, then add the product of A2 and B4.
[06:33 - 06:49] Similarly, plus the product of A3 and B7, and then we will have C1. And to make it more concrete over here, if we do a dot product of both of these matrices, one multiplied by seven, plus two multiplied by nine, plus three multiplied by 11 is 58.
[06:50 - 07:05] Similarly, for other element, it's gonna be one multiplied by eight, plus two multiplied by 10, plus three multiplied by 12 is 64. Similarly, four multiplied by seven, plus five multiplied by nine, plus six multiplied by 11 is 139.
[07:06 - 07:15] And four multiplied by eight, plus five multiplied by 10, plus six multiplied by 12 is 154. So let's, so this is just a basic concept.
[07:16 - 07:21] You don't have to remember it, but you have to know how it works. Basically, we don't have to manually do all those things.
[07:22 - 07:31] We don't have to even define any loops. That's the advantage of using NumPy, where basically it does the thing for you, without manually defining the loops and multiplication.
[07:32 - 07:52] So if we go to topic 2.1, and if we want to add two of these arrays, which are A and B, then basically right now it's addition, it's basically A plus B, as simple as that. And then if we want to do the dot product, we have a pre-defined function, np.dot, and we pass both of the variables A and B in order to get the dot product.
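Here is a small sketch of both operations, using the matrices from the dot-product example above (the addition uses a same-shaped matrix C, since element-wise addition needs matching shapes):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# Dot product (matrix multiplication) with the predefined function
print(np.dot(A, B))
# [[ 58  64]
#  [139 154]]

# Element-wise addition needs arrays of the same shape
C = np.array([[1, 1, 1],
              [2, 2, 2]])
print(A + C)
# [[2 3 4]
#  [6 7 8]]
```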
[07:53 - 08:00] So try to explore the exercise tasks. And then if you can't figure out what to do, go ahead and look into the answer.
[08:01 - 08:04] Again, you just have to plug and play. You don't have to worry about the whole code.
[08:05 - 08:10] You just have to worry about what to put in each of these exercises. And then we have introduction to data with pandas.
[08:11 - 08:24] Let's switch gears from mathematical operations to something more practical, which is basically working with real data. Here, we are often dealing with massive real-world datasets, like CSV files that consist of user info, sales, or sensor readings.
[08:25 - 08:30] So before we can build any model, we need to clean, filter, and understand that data. This is where pandas comes in.
[08:31 - 08:41] Pandas is a powerful Python library specifically designed for data analysis and manipulation. You can think of pandas as a supercharged Microsoft Excel for Python.
[08:42 - 08:51] But unlike Excel, where you have to manually click and drag and then find answers, pandas does it for you. You just have to import functions from pandas.
[08:52 - 09:00] And then you will be able to do all sorts of things with the help of this. As for what are some things that can be done with pandas: it can be data cleaning and transformation.
[09:01 - 09:06] It can be normalization. It can be multiple other things when you are dealing with data.
[09:07 - 09:15] So what is normalization? Normalization is like converting different currencies into a single standard currency before making a purchase.
[09:16 - 09:28] In data processing, it scales all the values within your data into a specific range, which is zero to one. And it ensures that all the features contribute equally when you are feeding the data to any machine learning or AI model.
[09:29 - 09:37] Now, why do we need normalization? For example, imagine you are comparing prices from different countries without converting them into a common currency.
[09:38 - 09:45] It would be difficult for us to compare values accurately. Similarly, in machine learning and AI, features have different ranges.
[09:46 - 09:59] For example, you have a specific dataset where you have income in thousands and you have age in years. Then what happens over here is, if you do not normalize the data, the income can dominate while predicting the pattern.
[10:00 - 10:10] The reason for that is it holds more value because income can be in thousands while age can be in years. Another way to put this example is basically think of cooking recipe example.
[10:11 - 10:21] Think of a recipe that uses measurement like cups, tablespoon and grams. Now to adjust recipe correctly, you need to convert all the measurements to a single unit, for example, cups.
[10:22 - 10:30] Now, normalization does the same with the data, bringing everything to common scale. So different features, which is basically ingredients, can be mixed fairly.
[10:31 - 10:45] It also makes it easy for us to do this scaling by using existing functions. And pandas can be imported and used in a similar fashion to NumPy, where you can just type import pandas as pd.
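For example, here is a minimal sketch of min-max normalization with pandas; the column names and values are hypothetical, just to show features on very different scales being brought into the 0 to 1 range:

```python
import pandas as pd

# Hypothetical dataset with features on very different scales
df = pd.DataFrame({
    "income": [30000, 50000, 80000, 120000],
    "age":    [25, 32, 41, 58],
})

# Min-max normalization: scale every column into the range 0..1
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)
```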
[10:46 - 10:58] And you can use the pd functions in pandas in order to do any operations on the data. Now that we have looked into how to load and manipulate datasets using pandas, we are ready to take a closer look at the patterns inside the data.
[10:59 - 11:07] And for that, we have to turn towards statistics. We don't just collect data, we need to understand what it tells us.
[11:08 - 11:14] That's where statistics comes in. Statistics gives us tools to describe, compare, and evaluate what's going on inside our dataset.
[11:15 - 11:27] At the core of nearly every machine learning model is the assumption that there is a pattern to be learned from the data with the help of statistics. You will see statistics concepts being used in many AI articles.
[11:28 - 11:34] And this is the reason we are covering it over here, for example when we talk about loss functions in machine learning.
[11:35 - 11:43] So before speaking about the loss function: what is a loss function? A loss function is basically a pre-made function that measures how far off the model's predictions are, and training helps reduce that loss gradually.
[11:44 - 11:46] Now, how does it do? Like, how does it do that?
[11:47 - 12:02] It has statistical functions within its program that help us decrease the loss and make the model predict much better than the average expected results. Don't worry, we don't need to dive deep into this concept.
[12:03 - 12:17] The reason it is over here is to help you understand how things work at the foundational level. So basically, most classification, or most things in the direction of classification, are done via probability.
[12:18 - 12:23] Even the text generation from AI models is done with the help of probability. Now, what is probability?
[12:24 - 12:32] Probability is a measure of how likely an event is to occur, represented as a number between zero, which is basically impossible, and one, which is certain.
[12:33 - 12:41] Or it can be represented as zero to 100, that is, zero percent to 100 percent, to predict the outcome of any specific data.
[12:42 - 12:58] The simplest example is coin toss example. Imagine if you have a fair coin and there are two possible outcomes, heads or tails, you will basically have 50% of probability for heads, 50% probability of tails, which is basically 0.5 for heads and 0.5 for tails.
[12:59 - 13:14] Similarly, rolling a die is another example, which has six sides, and each side has an equal probability of appearing when you throw the die. So the probability of one side coming up out of six is basically 0.167, or 16.7 percent.
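As a quick sketch, you can check both of those probabilities by simulation with NumPy; the sample size is arbitrary, just large enough for the estimates to settle near 0.5 and 1/6:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100,000 fair coin flips: P(heads) should be close to 0.5
flips = rng.integers(0, 2, size=100_000)
print("P(heads) ~", flips.mean())

# Simulate 100,000 die rolls: P(rolling a six) should be close to 1/6 ~ 0.167
rolls = rng.integers(1, 7, size=100_000)
print("P(six) ~", (rolls == 6).mean())
```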
[13:15 - 13:18] Again, why is probability important? It is used in predictive modeling.
[13:19 - 13:28] Basically, it can be used as the likelihood of an event occurring. For example, spam detection, fraud detection, and also text generation.
[13:29 - 13:35] Those are some things that use probability at the very last layer of any AI model. Then there comes the mean.
[13:36 - 13:38] Like, what is the mean? The mean is a common concept and everybody knows it.
[13:39 - 13:45] It's the sum of all numbers in the dataset divided by the number of items.
[13:46 - 13:59] So over here, basically, if we have five apples with prices of one, two, three, four, and five, then we just add those numbers, one plus two plus three plus four plus five, and then divide by the total number of items, which is basically five. And then we will have a mean of three.
[14:00 - 14:03] That is nothing complicated. Why is mean important?
[14:04 - 14:13] It summarizes data into a single representative value. It helps in comparison, for example, understanding the average temperature or average test scores.
[14:14 - 14:26] In AI and machine learning, the mean is often used to normalize data and then measure model performance. Now that we have talked about mean, let's talk about how far away values are from the mean.
[14:27 - 14:37] That is what we call standard deviation. The standard deviation is often represented as sigma, and it tells us how the dataset is spread out with respect to the mean.
[14:38 - 14:50] If all your data points are close to mean, you will have lower standard deviation. But if the values are scattered, some far above or some far below, the standard deviation will be larger.
[14:51 - 15:00] For example, you can think of it as a group of friends lining up for a photo. If they are all of similar heights, the group looks tight and uniform.
[15:01 - 15:08] This is like low standard deviation. But if one friend is really tall and another one is really short, the group looks more spread out.
[15:09 - 15:18] That's a higher standard deviation. Another example is let's say an average height in your group is 5.8, and most people are close to the height of 5.7 and 5.9.
[15:19 - 15:24] Then basically it has a low standard deviation. But if one person is 6.4 and another is 5.2,
[15:25 - 15:34] the heights vary a lot and it will have a higher standard deviation. Why does it matter in AI? It helps us understand how consistent or variable our data is.
[15:35 - 15:50] In AI, it is used to normalize data before training a model. It spots outliers or anomalies, and then basically you can perform some function in order to remove those outliers, feed the data to the model, and then evaluate the performance of the model.
[15:51 - 16:02] In the field like finance, standard deviation is used to measure risk, like how much stock price typically fluctuate. In next slide, we will go through a concrete example to see how standard deviation is calculated.
[16:03 - 16:16] So for example, imagine you have a dataset of two, four, six, eight, and ten, and you would like to find the standard deviation of that. What you do is first find the mean, which is basically the sum of all numbers divided by the number of items, and that is six.
[16:17 - 16:28] And then you subtract the mean and see how far each of the numbers is from the mean, which is basically minus four, minus two, zero, two, and four. This shows how each number deviates from the mean.
[16:29 - 16:40] Now you square those numbers and you will have 16, four, zero, four, and 16. And when you find the average of those squared numbers, we have a term called variance, which here is eight.
[16:41 - 16:53] Then further, when we take the square root of the variance, we will have the standard deviation, which is 2.83. This number tells us that, on average, the data points are about 2.83 units away from the mean.
[16:54 - 17:06] We can use this to understand how typical or unusual a new data point is, especially when working with a normal distribution. You can basically compute the standard deviation by calling the NumPy function np.std.
[17:07 - 17:10] And then you will have standard deviation for that specific dataset. So let's go, okay.
[17:11 - 17:20] So this is how you find the mean, which is basically np.mean. And then if you would like to find the standard deviation, you can just use the dataset, which is group one.
[17:21 - 17:27] And you can do the operation np.std, which is basically the standard deviation of group one. And then you will have the standard deviation.
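As a minimal sketch, using the same 2, 4, 6, 8, 10 example from the slide as the "group one" dataset:

```python
import numpy as np

group_one = np.array([2, 4, 6, 8, 10])

print(np.mean(group_one))  # 6.0
print(np.std(group_one))   # ~2.83, the square root of the variance (8)
```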
[17:28 - 17:32] That's the advantage of using NumPy. You don't have to write the code for each step yourself.
[17:33 - 17:47] You just have to use the functions of those libraries. So now, diving deeper into the distribution we were touching on when we were talking about standard deviation, which is basically the normal distribution, the so-called bell curve.
[17:48 - 18:00] It describes how values tend to spread around a central point, the mean. In many real-world datasets, most values are close to the average and only a few are very low or very high.
[18:01 - 18:07] That's why the curve has a peak in the middle and then it tapers off at the ends. A good real example is human height.
[18:08 - 18:12] Most people are somewhere around average height. A few are much taller or much shorter.
[18:13 - 18:23] If we graph that, we get a smooth bell shape, which is represented on the slide right now. The curve becomes powerful when we use it to infer, predict, and classify.
[18:24 - 18:28] That's what we will talk about in the upcoming slides. Why is knowing the bell curve useful?
[18:29 - 18:37] By understanding the bell-shaped distribution, we can estimate how likely a new data point is given the dataset. It allows us to detect outliers and make predictions on data more efficiently.
[18:38 - 18:46] How do we interpret new data? If a new data sample comes in and falls near the center of the bell curve, it is typically a common case.
[18:47 - 18:56] But if it falls near the edge, it is likely unusual and may need further investigation, because it can be an outlier when considering the whole dataset.
[18:57 - 19:04] Why does this matter? This pattern helps AI models understand normal versus abnormal data without requiring complex calculation.
[19:05 - 19:16] In machine learning, it is commonly used within the frame of anomaly detection, which is basically fraud detection or medical diagnosis. It allows AI to predict probabilities and classify data efficiently.
[19:17 - 19:38] An example in AI where we use the normal distribution is if we are building an application for raising a flag for potential fraud. Imagine you have bank transactions where most of the users are spending around $10 to $500, and then suddenly there is a transaction of $10,000, which is far away from the mean, so it gets flagged as potential fraud.
[19:39 - 19:48] The AI system will flag it as potential fraud, and the AI system further will alert the security team to check whether the transaction is valid or fraudulent.
[19:49 - 19:52] This is like a near-to-real-world example. And I love this example.
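A rough sketch of that fraud-flagging idea using z-scores; the transaction amounts and the 2-standard-deviation threshold are made up for illustration:

```python
import numpy as np

# Hypothetical transaction amounts; most are between $10 and $500
transactions = np.array([25, 80, 310, 45, 150, 499, 60, 10000])

mean = transactions.mean()
std = transactions.std()

# Flag anything more than 2 standard deviations away from the mean
z_scores = (transactions - mean) / std
flagged = transactions[np.abs(z_scores) > 2]
print(flagged)  # the $10,000 transaction stands out as an outlier
```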
[19:53 - 20:16] But before diving deep into the example, imagine you have two sets of training data, which are basically bass and salmon. The bass weights in the training data are basically 3, 4.9, 5, 2.5, 1.2, 6.1, and 2.2, and similarly it goes on. Similarly, the salmon weights in the training data are basically 13, 14.9, 15, 12.5, 11.2, 16.1, and 12.3.
[20:17 - 20:31] Now we have new data for testing, which is basically 4.3, 12.3, 7.2, and 8.4. Which class does the new data for testing belong to? The hint over here is shown on the slide.
[20:32 - 20:42] Determine the mean of both training datasets and find the distance to each mean. And then we have to assume each fish weight follows a normal distribution.
[20:43 - 20:54] And for simplicity, we are assuming similar variability for both of the classes. Is anyone able to calculate which class 4.3 belongs to, and which class 12.3 belongs to?
[20:55 - 20:59] Similarly for 7.2 and 8.4. - When you say class, what does it mean?
[21:00 - 21:03] What does class mean? - The class over here is bass and salmon.
[21:04 - 21:17] So bass has training data of 3, 4.9, 5, 2.5, 1.2, 6.1, and 2.2. Similarly, another class is salmon, which is 13, 14.9, 15, and similarly all the other numbers.
[21:18 - 21:29] - So you have to find which class 4.3 falls into. Does it fall under bass or salmon? - So 12.3 is salmon, and then the rest would be bass, with some of them as outliers.
[21:30 - 21:35] But yeah, that's how the curves would look. - Yes, and this is the calculation for each of them.
[21:36 - 21:52] Basically, when we are doing it for 4.3, first we calculate the distance from the bass mean, which is basically 4.3 minus 3.57, and we have 0.73; similarly for salmon, the distance is 9.27.
[21:53 - 22:04] And then we can see that 4.3 is closer in terms of distance to the bass mean, which gives a classification of bass. Similarly, 12.3 is salmon, 7.2 is bass, and 8.4 is bass.
[22:05 - 22:14] So we have most of the data points in bass. So the classification will be bass for three of the numbers, and 12.3 will be salmon.
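A minimal sketch of that nearest-mean classification, using the bass and salmon weights from the slide:

```python
import numpy as np

bass   = np.array([3, 4.9, 5, 2.5, 1.2, 6.1, 2.2])
salmon = np.array([13, 14.9, 15, 12.5, 11.2, 16.1, 12.3])

bass_mean, salmon_mean = bass.mean(), salmon.mean()  # ~3.56 and ~13.57

# Assign each new weight to the class whose mean it is closest to
for w in [4.3, 12.3, 7.2, 8.4]:
    label = "bass" if abs(w - bass_mean) < abs(w - salmon_mean) else "salmon"
    print(w, "->", label)
# 4.3, 7.2, and 8.4 fall closer to the bass mean; 12.3 is closer to salmon
```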
[22:15 - 22:22] - So this is getting ahead probably, but what we were looking at earlier, like the dot product and matrix multiplication. - Yes.
[22:23 - 22:33] - Is this kind of how some image classification works, where you can convert images, like three-dimensional pixel and color values? - Yes, so we are going to go over that, yes.
[22:34 - 22:51] So basically, all these concepts are tied to the pre-existing AI layers, like softmax and sigmoid, that we are going to go over in the upcoming slides. And it helps us to classify the classes with the help of this foundational knowledge.
[22:52 - 23:06] And then let's take a moment to introduce two other mathematical concepts, which are logarithms and exponents. So basically, before going on to all of the slides: you don't have to memorize each of these single weights or each of these single things.
[23:07 - 23:25] You just have to understand the concept of how things work behind machine learning or AI models, such that you have foundational knowledge. When you are reading any state-of-the-art articles, you will be able to understand them easily, whether they are research articles or any state-of-the-art AI articles.
[23:26 - 23:32] We all know what an exponent is. Basically, two raised to three is eight, which is basically two times two times two.
[23:33 - 23:37] But what is logarithm? A logarithm is, think of it as inverse of exponents.
[23:38 - 23:49] Instead of asking what is two to the power of three, a log asks, what power do I need to raise two in order to get eight? In other words, we have two raised to three as eight.
[23:50 - 24:01] If we take the logarithm, log base two of eight, we will have three, which is basically the reverse process, similar to subtraction and addition. So why does this matter for AI?
[24:02 - 24:12] Logarithms and exponents are deeply connected to how AI models learn, optimize, and handle uncertainty. Specifically, exponents help us calculate probabilities using softmax outputs.
[24:13 - 24:22] Logs help inside loss functions; it can be the cross-entropy classification loss function, or it can be the binary cross-entropy loss function.
[24:23 - 24:42] So these loss functions have a logarithm defined in their program in order to make the machine learning model learn from the data points easily and fine-tune towards the pattern of the data. Let's keep the previous slide, but this is my favorite analogy, which is basically: imagine you have a big pizza with eight slices.
[24:43 - 24:48] If someone asks how many times you need to double one slice to get all eight slices, the answer is three times. How?
[24:49 - 25:02] Start with one slice, double it, which is basically two slices, double again, four slices, double again, eight slices. So in terms of math, log base two of eight is basically three, which is basically how many times you multiply by two to reach eight.
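A quick sketch of the exponent/logarithm pair, plus one hedged example of where a log shows up in a loss; the 0.9 predicted probability is just an illustrative value, not from the slides:

```python
import numpy as np

print(2 ** 3)      # 8   -> exponent: one slice doubled three times
print(np.log2(8))  # 3.0 -> logarithm: how many doublings to reach 8

# Logs also appear inside loss functions, e.g. cross-entropy for one sample:
p_correct = 0.9            # model's predicted probability for the true class
loss = -np.log(p_correct)  # small loss when the model is confident and right
print(loss)                # ~0.105
```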
[25:03 - 25:10] Bootstrapping. This is being used in most of the AI techniques. Bootstrapping is basically re-sampling your dataset to simulate new data.
[25:11 - 25:26] Think of it as creating new datasets from the existing data. For example, let's say you have an LLM model, which is basically GPT-4, and then you have a dataset of results like correct, wrong, correct, correct, wrong.
[25:27 - 25:38] And then when you evaluate that, you get 60% accuracy. Now, if you just reshuffle that existing data, you always get the same five results, which is basically still 60%, nothing new.
[25:39 - 25:58] Now, if you try to re-sample with replacement, basically, if you create new combinations of data from the existing data, you will be able to get different accuracies, a distribution of possible accuracies from the LLM model. Now, you might be thinking, where is it being used?
[25:59 - 26:18] Basically, bootstrapping is used to quantify uncertainty, and basically in evaluation metrics like the BLEU score or ROUGE that are used to evaluate LLM models. We are gonna go deeper into what the BLEU score is, what ROUGE is, and all the other types of evaluation metrics.
[26:19 - 26:30] But bootstrapping is used with all of these metrics to evaluate the LLM model. It answers the question: if I had drawn a slightly different set of prompts, would my model still look this good?
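Here is a minimal sketch of bootstrapping that example, assuming the five per-prompt results above (1 for correct, 0 for wrong); re-sampling with replacement gives a distribution of possible accuracies instead of a single 60% number:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-prompt results from an LLM: 1 = correct, 0 = wrong
results = np.array([1, 0, 1, 1, 0])   # observed accuracy = 0.6

# Re-sample with replacement many times to simulate "new" evaluation sets
boot_accuracies = [
    rng.choice(results, size=len(results), replace=True).mean()
    for _ in range(1000)
]
print(np.mean(boot_accuracies))                      # centered around 0.6
print(np.percentile(boot_accuracies, [2.5, 97.5]))   # spread of possible accuracies
```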
[26:31 - 26:47] And then we have the t-test, which is basically one of the simplest tools we have to check if one model is truly better than another. At its core, a t-test compares the average performance of two groups and asks: is the difference large enough to matter, or is it just random noise?
[26:48 - 26:55] Let's ground this in an AI example. Suppose you fine-tune two models, model A and model B, on the same synthetic dataset.
[26:56 - 27:11] When you run them on the same batch size and the same prompts, model A scores 78 percent, and model B scores 82 percent. On paper, it may look like model B is better because it is at 82%, which is higher than model A.
[27:12 - 27:20] But here's the catch. If you zoom into the individual prompts, sometimes model A wins in some scenarios, while many times model B wins in other scenarios.
[27:21 - 27:30] So the question is: is model B consistently better across all prompts, or is it just luck? That's exactly what a t-test checks.
[27:31 - 27:40] It looks at the differences in the scores compared to the natural variation across prompts. If the difference is large compared to the noise, we say the improvement is statistically significant.
[27:41 - 27:55] If the difference is tiny compared to the noise, we can't really trust that model B is better. An analogy can be something where, let's say, you have two students who both claim to be fast runners, and if they each run once, whoever wins is declared better.
[27:56 - 28:06] But if you race them multiple times, and one student wins consistently, that's when you know that student has real running strength. And that's similar to what a t-test does.
[28:07 - 28:12] This is why it's amazing. A t-test allows us to avoid fooling ourselves with leaderboard numbers.
[28:13 - 28:20] It tells us whether model B is actually smarter, or is it just rolling higher dice on that run? - Yes, go ahead. - Yes.
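As an aside, here is a rough sketch of a paired t-test on per-prompt scores; the scores are made up, and it assumes SciPy is installed alongside NumPy:

```python
import numpy as np
from scipy import stats

# Hypothetical per-prompt scores for the two fine-tuned models
model_a = np.array([0.71, 0.80, 0.75, 0.79, 0.83, 0.77, 0.74, 0.81])
model_b = np.array([0.78, 0.84, 0.79, 0.85, 0.82, 0.83, 0.80, 0.86])

# Paired t-test: same prompts, two models
t_stat, p_value = stats.ttest_rel(model_b, model_a)
print(t_stat, p_value)

# A small p-value (commonly < 0.05) suggests model B's edge is not just noise
```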
[28:21 - 28:29] - I have a question. So in this case, doesn't it depend on the domain or the set of all the prompts that this data consists of? - So, go on.
[28:30 - 28:42] - Yeah, so my question is, how do we ensure the quality of the inputs to these t-tests? How do we know the quality of those tests is good enough to induce errors in the bad model and test the correctness of the good model?
[28:43 - 28:48] - Yes. So that's where prompt engineering comes in. And we are going to go deeper into that in the upcoming units.
[28:49 - 29:06] The best way to make it consistent is to define as precise a prompt as possible when we are asking the AI to evaluate. And then further, for example, let's say you are creating synthetic data and you want to evaluate that synthetic data: how real is the data?
[29:07 - 29:20] If you have defined an abstract prompt, then it would have a higher amount of variation and higher noise, and that won't be good data. But if you have a precise prompt, let's say you define that the data has to be within this range,
[29:21 - 29:28] the data has to have these properties, and you have multiple sets of templates. That's how you stay consistent with the data.
[29:29 - 29:45] So basically, when you are generating a dataset, you start with a smaller dataset, which is basically, let's say, 10 rows and 10 columns, and then we can use the LLM model as an LLM-as-a-judge. We are going to go over all of these things in the upcoming units.
[29:46 - 29:54] And then basically, you do human evaluation on top of it. So you don't care about LLM-as-a-judge right now, but you do human evaluation by yourself.
[29:55 - 30:06] And then you measure the gap between the LLM-as-a-judge and the human evaluation, and that's where the confidence interval comes in. So when we evaluate a model, we often report a single number.
[30:07 - 30:17] Maybe it's 80% of answers are correct, but that single number can be misleading, because if we test with slightly different prompts or a different dataset, the result could deviate a lot.
[30:18 - 30:22] That's why the confidence interval comes in. A confidence interval gives you a range, not just a point.
[30:23 - 30:31] So basically, imagine it as a window of where the true performance likely lies. Think about a weather forecast.
[30:32 - 30:42] We don't usually hear that tomorrow will be exactly 72 degrees. Instead, we hear something like: tomorrow will be between 70 and 75, with 95% confidence.
[30:43 - 30:49] So that communicates the uncertainty in a way people can trust. Now, let's put this in terms of an LLM context.
[30:50 - 30:53] Say we are creating a synthetic dataset, for example, thousands of examples.
[30:54 - 31:04] So imagine the synthetic dataset consists of questions and answers. And we have both an LLM-as-a-judge and human reviewers to grade the answers.
[31:05 - 31:14] And the model scores 80% as good answers. Now, if we apply the bootstrap, which is re-sampling our dataset many times, then we will find a range of different accuracies.
[31:15 - 31:40] And then once we have this range of different accuracies, we can say that our model is between 77 and 83% with a confidence interval of 95%. What this means is, if we regenerated or re-asked our LLM-as-a-judge over and over again, 95% of the time the scores would fall within that band or within that window, which is 77 to 83%.
[31:41 - 31:59] So that is where the bounds come in and the confidence interval comes in. Basically, instead of saying the model is 80% correct or accurate, we can say the model accuracy is between 77 and 83% most of the time. But what is most of the time?
[32:00 - 32:02] Most of the time means 95% of the time. That is the confidence interval.
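Here is a minimal sketch tying bootstrapping and the confidence interval together; the judge grades are simulated (1 for a good answer, 0 otherwise) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grades from an LLM-as-a-judge: 1 = good answer, 0 = not
grades = rng.binomial(1, 0.8, size=1000)   # observed accuracy near 80%

# Bootstrap the accuracy, then take the middle 95% as the confidence interval
boot = [rng.choice(grades, size=len(grades), replace=True).mean()
        for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"accuracy ~{grades.mean():.2f}, 95% CI: {low:.2f} to {high:.2f}")
```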
[32:03 - 32:19] Basically, this is how a typical evaluation pipeline looks. We are gonna go deeper into each of these things in the additional units. So this is just to make you grasp the concepts of all the foundational things that will be used in the upcoming units.
[32:20 - 32:28] So, as we discussed, bootstrapping is generating multiple completions from the same prompt distribution. The t-test asks if one model is consistently better than the other model.
[32:29 - 32:41] And then the confidence interval shows the range where the performance likely falls if you rerun the evaluation. So, from data to patterns: we just explored how data behaves in terms of mean, distribution, and probability.
[32:42 - 32:49] But how does this connect? Before we enter the field of AI, we are gonna start with machine learning, because it all started with machine learning.
[32:50 - 33:02] So there was an era where you used to have data, you did some statistical operations, and you emphasized the patterns in the data with those statistical operations. And then you fed it to the machine learning model.
[33:03 - 33:21] And the machine learning model is able to predict from the statistical pattern it learned during the training stage. Think of it as: before machine learning, programmers used to write explicit rules step by step for the computer to follow, but that approach quickly falls apart when the problem becomes too big and too complex.
[33:22 - 33:30] That's where the machine learning comes in. And then further there was a massive shift towards AI where this machine learning model further became transformers.
[33:31 - 33:39] And then a paper came out, which is basically "Attention Is All You Need." And that's where the LLM revolution started all over the world.
[33:40 - 33:51] So types of machine learning model which are basically supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, recommender system, deep learning, transformer, and LLM. These are all the categories or the types.
[33:52 - 34:03] And the learning process of each of these categories differs from the others. So supervised learning is basically what I just said earlier, which is learning from labeled data and its statistical pattern.
[34:04 - 34:16] Unsupervised learning is learning from unlabeled data, solely relying on the statistical pattern. Then there is semi-supervised learning, which relies on a small amount of labeled data and largely unlabeled data.
[34:17 - 34:32] And then reinforcement learning, which is basically a function of feedback and a reward function. A practical example of reinforcement learning is the Roomba version one, which had a reward and feedback system with a LIDAR sensor in order to vacuum your entire place.
[34:33 - 34:48] So as soon as its battery is lower than a specific percentage, it will start accumulating higher and higher negative rewards. And as soon as it has accumulated a huge negative reward, it will go back to its station and recharge itself.
[34:49 - 35:04] Deep learning basically consists of neural networks, and it is used for speech, vision, and audio. And in our bootcamp, we are gonna primarily talk about all the techniques and things relating to transformers and LLMs.
[35:05 - 35:14] Basically, transformers and LLMs are sequence-modeling deep learning models which can generate text. If you would like to read more about it, you can read it later.
[35:15 - 35:24] We already talked about all of the applications that can be built using machine learning models. I would like to talk more about this right now, which is basically transformer and diffusion models.
[35:25 - 35:31] So first, what is a transformer? Basically, imagine it as version two of machine learning.
[35:32 - 35:36] Basically, it generates text from pre-learned text data. How does it do that?
[35:37 - 35:40] And what about diffusion models? What is a diffusion model?
[35:41 - 35:45] Basically, a diffusion model is something that can generate images. It can generate videos.
[35:46 - 35:50] It can generate audio. So all of those generation tasks fall under diffusion models.
[35:51 - 35:55] You don't need to understand all of these. You don't need to understand the internals of all of these right now.
[35:56 - 36:00] As we are going to go over slowly in each of the units. So first, transformers and language models.
[36:01 - 36:10] Imagine a transformer or language model as a librarian, but not just any librarian. This one knows where every book is.
[36:11 - 36:12] They have read all of them. They understand the themes.
[36:13 - 36:25] They understand the characters and the relationships between the books. And if you walk in and vaguely describe what you are looking for, something like "that novel I read last summer but with more action," they can instantly recommend the perfect book for you.
[36:26 - 36:37] So that's what a transformer-based language model does. It reads huge volumes of text, learns the structure and meaning of language, and uses that knowledge to generate answers, continue conversations, and more.
[36:38 - 36:47] Models like ChatGPT are built using the transformer architecture. They are trained to predict the next word over and over again with semantic meaning, which is basically using the attention layer.
[36:48 - 36:54] And we are going to go over the attention layer and autoregressive decoding in the later slides, or in the upcoming units.
[36:55 - 37:04] Now, on the other hand, a diffusion model works differently than a transformer model. Let's say you have defined a program, which is a diffusion model.
[37:05 - 37:16] And then basically what you do over here is you start feeding it a dog photo, and then you slowly add noise to the dog photo, step by step. And then eventually you will have pure noise.
[37:17 - 37:26] And then, once you train the model, you reverse the process, where you feed in the noise and you get the dog photo back. So that's what a diffusion model does.
[37:27 - 37:29] That's how it learns. It does not generate an image in one shot.
[37:30 - 37:38] It creates them gradually, by denoising the random noise until something meaningful appears. So that's basically how all the image generation models work.
[37:39 - 37:46] So basically, just to add on, diffusion models came after GANs, and you can read more about GANs on Google. But it's not important.
[37:47 - 37:57] That's just for the conditioning concept: you have trained the images and text in a specific manner where the text and image fall into the same vector dimension.
[37:58 - 38:00] Now, what is a vector? We'll go into that more in the coming slides.
[38:01 - 38:04] That's where the vector comes in, when we are speaking about any direction within the data.
[38:05 - 38:06] What is sigmoid? What is softmax?
[38:07 - 38:16] The sigmoid function is basically used when you want to get a probability between 0 and 1. This basically drives the decisions of AI models.
[38:17 - 38:28] Whether it's choosing a label, predicting a number, or generating text from the model. For example, let's say you want to build a spam filtering technique and you want to use an AI model.
[38:29 - 38:33] Basically, it will contain a sigmoid function. It will say: this is 85% likely to be spam.
[38:34 - 38:43] And then softmax is more for multi-class classification. So basically, the difference between softmax and sigmoid is that sigmoid ranges from 0 to 1,
[38:44 - 38:51] while softmax handles multiple categories or multiple options during the training stage.
[38:52 - 38:59] Think of sigmoid as a soft switch, where you take any input value, large, small, or in between, and squash it into the range of 0 and 1.
[39:00 - 39:16] This makes it ideal for representing a probability, especially when you are making a binary decision like yes or no, true or false, or any decision between the likelihood of yes and no, which is basically a value between 0 and 1. So this is more foundational knowledge on the sigmoid function.
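A minimal sketch of the sigmoid "soft switch"; the input scores are arbitrary, just to show how large, small, and in-between values get squashed into (0, 1):

```python
import numpy as np

def sigmoid(x):
    # Squash any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

# A large positive score maps close to 1 ("very likely spam"),
# a large negative score maps close to 0, and 0 maps to exactly 0.5
for score in [-4, 0, 1.7, 4]:
    print(score, "->", round(sigmoid(score), 3))
```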
[39:17 - 39:26] If you would like to read more about it, you can read it later, as we are running late on time. Basically, as I defined earlier, softmax is more for multiple possible classifications.
[39:27 - 39:59] So this is the best example that I like to show to the audience, because it is very easy for everyone to understand how softmax works over here. So imagine you have a cat, a horse, and another image with some goats, and then you feed them to a feed-forward network or a neural network. And basically, when I speak about passing the data to the feed-forward network, that specific process is called forward propagation.
[40:00 - 40:17] And then basically, the feed-forward network will have its own values and its own machine-level understanding of what these things are. And then, if we apply a softmax function after it, we'll have a bunch of values that can be interpreted from a human or user's perspective.
[40:18 - 40:31] So basically, over here we have cat, dog, and horse as classes. So after applying the softmax function, we have a probability of 0.7 for cat, 0.26 for dog, and 0.04 for horse.
[40:32 - 40:34] So basically, the image is a cat.
[40:35 - 40:40] And then similarly for the horse image, the probability for cat is basically 0.02, and for dog it's 0,
[40:41 - 40:46] and for horse it's 0.98. And then we have a different image at the end.
[40:47 - 40:55] The reason we don't have a classification for the last image is intentional. The reason for that is that some images may not be classified.
[40:56 - 41:03] So it might fall into lower accuracy, because the model is not able to detect what is in the image.
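A small sketch of the softmax step itself; the raw network scores (logits) for the cat image are made up so that the output roughly matches the 0.7 / 0.26 / 0.04 probabilities on the slide:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to sum to 1
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Hypothetical raw scores from the network for [cat, dog, horse]
logits = np.array([2.0, 1.0, -0.9])
probs = softmax(logits)
print(probs.round(2))  # roughly [0.7, 0.26, 0.04] -> the image is classified as cat
```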
[41:04 - 41:12] Now diving into vectors, what are vectors? A vector is a simple list of numbers in specific order that represents something meaningful.
[41:13 - 41:16] Now, what is that something meaningful?
[41:17 - 41:26] That something meaningful over here is magnitude, which is basically how big it is, and direction, which is where it is pointing in space. In AI, we often treat vectors as containers of information.
[41:27 - 41:46] So here's an example: if you have a vector like 0.2, 1.8, and negative 0.5, this could represent a color using RGB values, or a word using word embeddings, or an image using feature vectors, or even a user profile in a recommender system.
[41:47 - 41:55] So a vector is just a way to turn real-world objects into numbers that a model can understand and work with. We are going to go deeper into embeddings.
[41:56 - 42:00] So basically, what are embeddings? Embeddings are vectors that capture relationships between concepts.
[42:01 - 42:15] So for example, if you have the word king that has some vector, the word man that has some vector, and the word woman that has some vector, and we do an arithmetic operation on these vectors, king minus man plus woman, we have the answer as queen.
[42:16 - 42:27] The reason for that is that it gives us a specific direction in the whole vector space, and then we can use that vector to identify the closest word from the dictionary.
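A toy sketch of that idea; these tiny 3-dimensional "embeddings" are made up purely for illustration (real word embeddings have hundreds of dimensions and are learned from data), but they show the arithmetic-plus-nearest-vector trick:

```python
import numpy as np

# Tiny made-up "embeddings" just to illustrate the idea
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word whose vector points in the closest direction (cosine similarity)
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```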
[42:28 - 42:35] So we'll go more deep into the concept of embeddings and things in the upcoming units. This is like a perfect example of how vector is visualized.
[42:36 - 42:52] For example, let's say we have two vectors over here, V1 and V2. Right now we are just speaking in terms of vectors, but as we move on, we are gonna be converting words into vectors, and then we'll be playing with arithmetic operations on those vectors in order to see how things work in terms of LLMs.
[42:53 - 43:11] And then when you plot V1 and V2 using matplotlib, which is basically a library used to visualize things, we have two different directions, V1 and V2, where V2 is represented as green and V1 is represented as blue. Similarly, you will be required to complete the exercise.
[43:12 - 43:32] So if you go to your task notebook and topic 4.1 here, you can play around by having some vectors with X and Y, and then you can press shift-enter to check how the direction is being produced. Similarly, let's say we have two vectors, V1 and V2, which are basically these two vectors.
[43:33 - 43:47] And we do an arithmetic operation, which is basically V1 plus V2, and store it in V sum. And then if we try to visualize all of these, we will have something like this, where V1 is in a specific direction and V2 is in a specific direction.
[43:48 - 43:55] And if we combine both of these, we'll have the sum, which is represented as the red line. Similarly, there are other operations that can be done.
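A minimal sketch of that visualization; the actual V1 and V2 values come from the notebook, so the ones here are placeholders, and the colors follow the blue/green/red convention described above:

```python
import numpy as np
import matplotlib.pyplot as plt

v1 = np.array([2, 1])
v2 = np.array([1, 3])
v_sum = v1 + v2          # [3, 4]

# Draw each vector as an arrow from the origin
plt.quiver([0, 0, 0], [0, 0, 0],
           [v1[0], v2[0], v_sum[0]], [v1[1], v2[1], v_sum[1]],
           color=["blue", "green", "red"], angles="xy", scale_units="xy", scale=1)
plt.xlim(0, 5)
plt.ylim(0, 5)
plt.grid(True)
plt.show()
```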
[43:56 - 44:06] So these operations right now are totally on the foundational side. But once we go into the tokens and embeddings and stuff, we'll be playing with the vectors in terms of words.
[44:07 - 44:16] So we'll be playing around with something similar to what I mentioned earlier, which is basically king minus man plus woman, which is equal to queen. And it does not magically come out as queen.
[44:17 - 44:22] Behind the curtains, it's basically playing with the vectors. And then we have tensors.
[44:23 - 44:25] So basically, what is a tensor? A tensor is a generalized data structure.
[44:26 - 44:27] It can be scalar. It can be vector.
[44:28 - 44:33] It can be a matrix. It can be a 3D tensor, or it can be any high-dimensional stack of matrices.
[44:34 - 44:39] Imagine the tensor as a stack of pages, and imagine that stack to be a 3D tensor.
[44:40 - 44:51] If you have a cube of cubes, or notebooks of notebooks, that's basically the high-dimensional data that we are looking at. In AI, tensors are usually used when we are playing around with images.
[44:52 - 44:56] Basically, an image is a high-dimensional tensor. Or sorry, an image is a three-dimensional array.
[44:57 - 45:07] And then if we have a batch of images, that is basically a four-dimensional tensor, which is a high-dimensional tensor. So the tensor is the universal format for storing and manipulating data inside AI models.
[45:08 - 45:16] Like when we are dealing with AI models. So why we use tensors is basically to manipulate the input, to manipulate the data, and as a result we fine-tune for that specific task towards its goal.
[45:17 - 45:27] So this is basically a tensor where we are creating a 3D tensor with the shape of three, three, and three, and it is just filled with random numbers between zero and one by using the function np.random.rand.
[45:28 - 45:36] And then this is the high-dimensional data being visualized with the help of Matplotlib. You can play around with it in the unit 1.3 task notebook.
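A quick sketch of those tensor shapes; the image sizes and batch size are hypothetical, just to show how the dimensions stack up:

```python
import numpy as np

# A 3D tensor of shape (3, 3, 3) filled with random numbers between 0 and 1
tensor = np.random.rand(3, 3, 3)
print(tensor.shape)   # (3, 3, 3)

# An image is a 3D array: height x width x color channels (R, G, B)
image = np.random.rand(64, 64, 3)
# A batch of images adds a fourth dimension: batch x height x width x channels
batch = np.random.rand(32, 64, 64, 3)
print(image.shape, batch.shape)
```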
[45:37 - 45:47] Similarly, there are other things that can be done with the help of tensors. So next, the hardware and compute that you might need going forward for the AI applications and exercises.
[45:48 - 46:11] Most of our exercises are built on Google Colab, so you won't be needing any additional hardware. But once you play around with the mini exercises that I built end-to-end for specific applications, you will need cloud platforms like Together AI or Lambda Labs, where you can get a GPU compute resource.
[46:12 - 46:23] And then you can create your AI application there, if you are trying to have an on-premises LLM model. But if you are trying to use an API, then basically I don't think you need any high-end GPU.
[46:24 - 46:39] Also, for Google Colab, you might need the Pro version, because the Pro version gives you the ability to unlock the bigger GPUs, like the A100 and L40. Basically, most of the notebooks can be completed with a T4 GPU or L4 GPU.
[46:40 - 46:56] While if you are trying to build an AI application and you need an on-premise LLM model or on-premise AI model, you will need a GPU that can be used via cloud services like bara.ai, together.ai, or any other cloud service.
