AI in Production

This lesson preview is part of the Power AI course course and can be unlocked immediately with a single-time purchase. Already have access to this course? Log in here.

This video is available to students only
Unlock This Course

Get unlimited access to Power AI course with a single-time purchase.

Thumbnail for the \newline course Power AI course
  • [00:00 - 00:12] All right, guys. So this is a presentation on how to deploy AI in a cloud production level.

    [00:13 - 00:26] So I know this is basically something that you guys have wanted. So this is not going to necessarily have every single possible thing.

    [00:27 - 00:46] But hopefully it can basically cover different topics that are related to production systems. So in production systems, we covered data engineering, kind of outproduction systems in the past.

    [00:47 - 01:06] And that data engineering about production system was an overview of what it basically takes in practice. But this is basically about giving a little bit more a high level about overview.

    [01:07 - 01:27] And when you're basically building AI in a cloud production, there's a couple of different phases, basically. A couple of phases that you guys are in is you're basically building things around your rag, around your fine-tuning right now.

    [01:28 - 01:31] You're dealing with notebooks. You're dealing with data issues.

    [01:32 - 01:43] And but in production, you basically have to deal with a lot of things related to your AI application. So it's dealing with latency.

    [01:44 - 01:50] It's dealing with inference. It's dealing with things related to your core application.

    [01:51 - 02:29] So for example, if you're dealing with a voice application, you're basically dealing with the latency within the voice call and the chat, kind of a back and forth with the voice call, and then depending on your modality and how you're basically interacting with the interface, you basically want to consider about the interface as well. You know, all of these things, basically, are things to basically consider.

    [02:30 - 02:42] You can obviously use AWS or GCP, about for what you basically want. But the reality is that people basically use multi-cloud talking about systems for inference and fine-tuning.

    [02:43 - 02:58] So what this basically means in practice is, for example, you do data discovery and other things on your collab notebook. Then you basically use hugging face to store your data.

    [02:59 - 03:13] You use to store your data and your models. Then you basically use a model for deploying about inference or growth for deploying about inference.

    [03:14 - 03:40] So if you actually use AWS and GCP, basically, for the entire lifecycle, it's actually quite expensive to basically do. And if you guys do anything with production systems basically in real life, you'll basically immediately be able to see, basically, that AI in production is quite expensive already using AWS.

    [03:41 - 03:48] They hook you with the variable pricing. But it's even more expensive with these systems.

    [03:49 - 04:00] So one of the things that you have to basically consider is it's not just GPUs underneath. It's-- there are different types of underlying infrastructures that you can basically take advantage of.

    [04:01 - 04:15] So if you're basically using Google collab, you're basically using Google's infrastructure, Google's TPU, Google's GPU. If you're basically using Core Web, or you're basically using growth, you're using a specialized semiconductor.

    [04:16 - 04:31] That's not a-- it's similar to Google Tensor Processing unit, but it's specifically designed around inference. So Google's TPU is actually designed around fine tuning and inference, so it can basically be done either way.

    [04:32 - 04:48] But Rolke is only specifically designed for inference. So when you're basically kind of about building these systems, you just have to basically build it with multi-cloud in-- from-- as an understanding from the get-go.

    [04:49 - 05:13] You also want to be able to understand these different information, basically, for processing kind of by user questions and understanding the internal kind of systems. And when you're basically doing LLM kind of by inferencing, the LLM inferencing really depends on the application you're basically doing.

    [05:14 - 05:30] For example, if you're basically doing like a chat bot, it's expected to stream toward you. But if you're basically gearing toward voice applications, you have to take an advantage of the turns.

    [05:31 - 05:41] Basically, how many milliseconds does it take for a human to basically expect something from a conversation? It's generally around 500 milliseconds.

    [05:42 - 06:03] And so your model actually has to be designed in a way, basically, where you accommodate the systems and about properly. So your user interaction and your user interface and your application really matters on what streaming is exactly, basically.

    [06:04 - 06:24] And what is expected, kind of about response rate. So a lot of these inference kind of about tasks, like if you look at chat JPTs, image generation, or they take quite a while to basically do versus traditional things, like deep research takes 10, sometimes 15 minutes.

    [06:25 - 06:42] So you really want to basically provide an interface for that. And a lot of these, they basically use, let's see.

    [06:43 - 06:58] Yeah, they basically use a chain of thought, basically, and reasoning process to basically make you understand that it's like processing a lot of things as well. The other thing to basically consider is privacy versus security.

    [06:59 - 07:11] Basically, a lot of generative AI really hasn't been really implemented in the enterprise just yet. Because there's privacy concerns, people don't want to give their all their data to OpenAI.

    [07:12 - 07:31] Basically, they're afraid that people are training on the data, basically. And there's different levels of LLM applications where you basically can slice it based on privacy, based on security, and based on all these other systems.

    [07:32 - 07:46] Not unlike traditional SAS, right? Like traditional SAS, basically, you have the original cloud vendors, and then you have multi-tenancy, basically.

    [07:47 - 07:55] And then you have hybrid models. Like Microsoft CRM is a hybrid on-premise cloud model.

    [07:56 - 08:16] College here is also somewhat on-premise, basically, as well, to address some of our privacy concerns. So you want to understand what user satisfaction means in your application, what efficiency means, and also what consistency means.

    [08:17 - 08:31] Basically, this is mean, basically, retrieval, augmented generation, what about runs. And you put in your database and update it once a day, once a month, basically once a week.

    [08:32 - 08:51] You're not-- a lot of these applications are not always, basically, a real-time kind of applications, basically. You want to just basically be able to understand, basically, where-- what is it kind of about doing?

    [08:52 - 09:07] All LLM about development follows the following kind of past. You first do model development, then you basically do fine-tuning, and you do RAG, you're doing kind of experiments.

    [09:08 - 09:09] You're basically slicing the data. You're chunking the data.

    [09:10 - 09:18] You're basically kind of looking at, basically, what it's doing. And then you're basically deploying it.

    [09:19 - 09:30] And you're putting in the doctor or virtual environment kind of a system, and you are deploying it. Then you're basically setting up the inference kind of an infrastructure.

    [09:31 - 09:50] So right now, basically, you can basically-- there's a couple kind of different infrastructure for inference. So there's a couple packages for inference that basically optimize inference.

    [09:51 - 10:05] And then there's a couple of semi-conductors that optimize it at the semi-conductor about layer as well. Of course, you can basically just set up the infrastructure as it is.

    [10:06 - 10:26] You can just basically deploy it to one of these systems and not be concerned about it for the time being. And as you basically optimize for applause, as you optimize for speed, a lot of people basically end up optimizing the inference side as well.

    [10:27 - 10:45] So the other thing is inference and serving. A lot of times, when you're basically talking about building these applications, you're not just basically building us a rack application, which is a vector database.

    [10:46 - 11:00] You also have a database for metadata and pre-filtering. So I was speaking to one of you basically, an hour ago, about pre-shirt itinerary and using metadata.

    [11:01 - 11:12] And you can basically use metadata in table rows-- sorry, table columns, basically. And these table columns provide a pre-filter or a post-filter.

    [11:13 - 11:20] Like, you can basically filter on user IDs. You can filter on about different forms of metadata.

    [11:21 - 11:38] And if you look online, they call this a advanced rag, because basically it's a combination of traditional database techniques and a vector similarity matching. But the reality is, when you're actually kind of by serving, you're basically dealing with databases.

    [11:39 - 11:55] You're dealing with inference, a model serving, and you're dealing with better data as well. And also you're dealing with that data pipeline.

    [11:56 - 12:12] You know, basically, so as your example, you have a data pipeline where you have these great person about running and you're maintaining these scrapers and you're monitoring these scrapers. And then, then you're basically monitoring and logging about as well.

    [12:13 - 12:25] You're tracking kind of performance. I can't overemphasize, basically, the importance of being able to set up your queries and your evaluations.

    [12:26 - 12:42] So a number of you guys, basically, during the AI kind of explosion calls, basically. So you guys have your basic kind of queries, but you really need to basically set up a set of queries based on your archetypes, basically.

    [12:43 - 13:03] So your archetypes, basically, can be-- like, for example, you're doing a data analyst kind of by agent, basically. And this data analyst kind of by agent is you have a data analyst, you have the manager, you have a business owner stakeholder.

    [13:04 - 13:16] And so you want to generate kind of queries using synthetic kind of data. And then you're able to generate for all these different archetypes and then build the evaluation around it.

    [13:17 - 13:26] And that way, you're basically, no, whether your AI application is evolving in the right point. You can't just basically start with two queries that you basically think is OK.

    [13:27 - 13:37] And then when you're getting into-- right before production, you start testing with 1,000 to 1,000 queries. It's a bit too late.

    [13:38 - 13:49] The model development process, basically, with queries and evaluation starts very early, you know, basically. And then the monitoring and logging is simply a manifestation of that.

    [13:50 - 14:10] Basically, what the monitoring and logging is where you basically track live kind of off queries and how well it's basically doing. Like, traditional kind of systems, basically, you have a query, you have error logs, you have warning kind of on notice logs and information logs.

    [14:11 - 14:29] And you look at it once in a while, it's basically-- whereas this is like how well your application is basically doing. As you basically query the application, basically, you want to see all the fails that the system is basically providing.

    [14:30 - 14:43] So there was a recent presentation I was watching, basically, where it talked about all the fails that AI agents have in real life. And it's a-- it's a surprising amount of fails.

    [14:44 - 15:17] Not because the technology doesn't work, it's because they fail to account for building all the APIs and tools to be able to accommodate the user's needs, basically, and-- OK, yeah. If you search a ramp, R-A-M-P, AI lead, basically, he talks about the fails, basically.

    [15:18 - 15:22] I can basically close that. Taylor basically asks, do you have a link to that?

    [15:23 - 15:34] But you can basically search for a ramp, an AI researcher. And then he basically talks about it in one of the talks.

    [15:35 - 15:48] I'll find it after the class. And then-- and so his solution, basically, what he suggested is that you actually, basically, default everything back to the user interface.

    [15:49 - 16:23] And then you have a browser that basically controls and executes the inputs within the browser, basically, and then you're able to execute everything in the browser of an interface as a default fallback, basically. And then what you want to basically do is you want to create an update, and then you want to update the fresh model or fine tuning.

    [16:24 - 16:36] There's different ways to basically build your model. Basically, AWS has a pipeline that you can basically think about, build, basically, to test your L-M container.

    [16:37 - 16:50] And there's a waste and biases. There's also DSP-- well, DSPY is not like a fully end-to-end about system. It's primarily a prompt optimizer with reinforcement learning.

    [16:51 - 17:05] But DSPY is also-- it's like a subset of this, basically. So you can also basically use a cloud build or you can use Vertex AI for doing a lot of this as well.

    [17:06 - 17:16] This is similar to production level systems in that the more managed it is, the less work you do. But the more expensive it is over the bottom run.

    [17:17 - 17:33] Basically, it's like a managed database kind of a system, basically, which manages the MySQL in everything for you. That's a picture of all the redundancy, low balancing backups.

    [17:34 - 17:47] It's obviously going to be more than something where you have to manage the MySQL. Similar here is you have managed services in AI where it manages certain things for you.

    [17:48 - 17:57] The more it manages, the more it costs. That being said, basically, for managed services, it has a lot of things out of the box.

    [17:58 - 18:15] So what you end up having seen is basically for there's companies that specialize in all of these once for six. Basically, there's companies that specialize just solely on the evaluation, they solely on RAG.

    [18:16 - 18:29] Basically, they optimize RAG for you. Basically, there's companies that solely focus on inference, solely focus on different things.

    [18:30 - 18:33] It could be you basically say, hey, I'm good at RAG. I only want inference.

    [18:34 - 18:41] Or I want someone to basically do inference, but I'm good at finding me. Basically, you can basically mix and choose.

    [18:42 - 18:57] Basically, your ideal about systems. So, currently, what we're basically trying to do, basically, with the notion database, is to basically provide you the overall environment.

    [18:58 - 19:08] More examples of infrastructure that you can basically use for different things. And then there's more companies basically servicing these flows all the time.

    [19:09 - 19:13] So, back to our keynote bot. Oh, hit the tailor question.

    [19:14 - 19:16] Yeah, sorry. I was just wondering, I guess it depends on the company.

    [19:17 - 19:30] But do these companies, most of them, have free hobbyist tiers for people as experimenting with stuff? Or are they mostly geared towards enterprises where you're going to have to expect you to pay a lot at a company level?

    [19:31 - 19:38] It really depends, basically. So it depends on their go-to-market strategies.

    [19:39 - 19:44] You have things like fierocrawl.dev. Basically, they were formerly an application company.

    [19:45 - 19:56] And they basically found how hard it is to maintain a crawler. And a lot of the legacy scripting kind of companies were all based on hard-coded scrapers.

    [19:57 - 20:03] Basically, so they designed their own system. And then they have three tried here.

    [20:04 - 20:13] And then they have a tier where they basically manage everything for you. So you have guys like that.

    [20:14 - 20:19] And then you have things where you go to the signup button and you can't find a signup button. It's book a demo, basically.

    [20:20 - 20:57] As soon as you have to talk to a sales version, it's an enterprise thing. I think in practice, a lot of these things are designed, basically, for people that can get up and running, and for enterprise, because they're designed for AI researchers, and a lot of AI researchers are experimenting with things where sometimes they're in grad school, or they're talking about professors that their department doesn't always have a ton of funding, basically.

    [20:58 - 21:12] So I think in practice, basically, you tend to see a lot of lower price years for people that keep the tires, basically. So it's not $5.

    [21:13 - 21:28] It might be $50 a month or $40 a month. But if it solves a material about a workflow problem of yours, basically, then it could be worth it, basically.

    [21:29 - 21:32] And this-- yeah. Yeah, cool.

    [21:33 - 21:57] Thanks. So Vector does a certain question, and then you query, and you're able to get the results.

    [21:58 - 22:15] Vector databases are able to-- you do have solutions that you can provide use by yourself. So look, vector databases have the same problems that traditional databases have.

    [22:16 - 22:27] Traditional databases have a problem with a redundancy problems. You have someone possibly deleting it, basically.

    [22:28 - 22:43] You have a low-balancing problems. You have a lot of the same problems, basically, where there's a certain amount of insights or reads per second, it can basically do.

    [22:44 - 22:56] And then you have to accommodate it for your specific kind of needs. Traditional kind of SQL databases are very dependent on that application.

    [22:57 - 23:20] For example, if you have a social media system, there's a big difference even between Twitter versus Facebook versus enterprise kind of a system. Or-- so it depends on how much insert you have, how much updates you have, how much reads you have, and what is the insert frequency, basically.

    [23:21 - 23:48] So if you're basically saying, OK, I'm going to insert certain things at a frequency of one hour, then you basically know that inserts per second, the reads per second, and in other information, basically. Traditional, convinced, you know, vector databases are not unlike that.

    [23:49 - 24:10] But one thing to basically keep in mind that vector databases have is they have search algorithms built in to some of the vector databases. So what that basically means is that sometimes the query layer can basically be slower, and it could get overwhelmed faster, basically.

    [24:11 - 24:18] So there are basically other things that can basically help scale. Basically, you have Amazon Open Search.

    [24:19 - 24:32] You have Pine Clone, and you have Halloween kind of a-- Halloween DB is basically a vector search, traditional vector search. But basically, there's a lot of things in this use.

    [24:33 - 24:53] But it's really kind of dependent on the actual bus system. So highly kind of a scalable vector database about design is partitions that embedding space.

    [24:54 - 25:08] So you use a clustering algorithm called key means so that similar vectors reside on the same chart. And so queries embedding is compared against every single shard, and the results are merged.

    [25:09 - 25:22] And so the sharding enables horizontal scaling, but introduces a hard chart to combat issues. This is not unlike basically traditional database design.

    [25:23 - 25:42] Basically, like traditional database design, you have hot partitions that are frequently accessed, and then you have a logical kind of partitions basically where it's logically separated based on your structure of the data. Facebook used a structure based on network.

    [25:43 - 26:01] So for example, every college would basically be on a different virtual instance. And so that's logical partitioning, and whereas this is basically partitioning based on the embedding space, if you think about it, it makes sense.

    [26:02 - 26:17] And then the replication is replicated via a leader forward to ensure fault tolerance and reads for both. So for example, writes go to a leader and gets propagated to replicas.

    [26:18 - 26:31] And if a node fails, the replica can take over. The vector-specific designs, some engines only replicate only raw data while rebut building indexes locally on each node.

    [26:32 - 26:49] So this is similar to traditional [INAUDIBLE] database design as well, the master slave architecture. So the master is able to basically take in inserts and updates, whereas the reads all go to the slave, basically in terms of the master slave kind of architecture.

    [26:50 - 26:55] So this is called leader follower. It's a similar thing, basically.

    [26:56 - 27:12] And the architecture follows using a simple queuing service, SUS. It's a queue service that has a worker test that inserts and queries about a printable index.

    [27:13 - 27:32] The printable reference in the architecture uses both the SQS and EC2 to basically is to be able to charge ingestion and autoscale. And they also use Kubernetes deployment for the vector store as well.

    [27:33 - 27:56] So if you basically do this, where you have different namespaces per tenant and you have SQS that low balances everything, it can scale up to millions of records by scaling out workers and kind of pause. Milvis is an open source system.

    [27:57 - 28:16] It uses a sharded storage system that has a proxy, basically for low balancing and coordination. And then it has various workers to be able to execute the different task.

    [28:17 - 28:34] You also basically have open search, which has a clustering about nearest neighbor algorithm for a vector search. And you can basically use that as well, basically.

    [28:35 - 28:43] I mean, it's basically the same. It's very similar to traditional database scaling.

    [28:44 - 28:51] This is just for-- it's just different names, basically. And the only thing that's different is the sharding, like how you shard.

    [28:52 - 29:05] This is by similarity rather than logical kind of a-- for example, Netflix, they shard based on hotspots, basically. So they know that certain shows will basically be very popular.

    [29:06 - 29:17] And those are prioritized in their reads. But other ones are basically not as prioritized in their popstones, basically.

    [29:18 - 29:28] It's a similar concept. Vector databases, as your knowledge base kind of grows, you have to shard and partition the process about mobile nodes for faster parallel reads.

    [29:29 - 29:41] Be able to replicate copies of your data for high availability and built orange. And then you can use a nearest neighbor algorithm to speed up a large scale about searches.

    [29:42 - 29:53] Amazon Open Search can scale horizontally, distributing and combining your index. And third-party solutions like Pinecone can also auto-scale with the seams.

    [29:54 - 30:12] You'll have to research the individual vector database to see whether this is out of the box or not, or if there's other tools that are basically adjacent to it. And then you can also use it to distribute the vector databases, like Mailbase, for advanced indexing and storage.

    [30:13 - 30:20] And then-- yeah. And then this way, basically, it can respond to a spike without less slowing your queries.

    [30:21 - 30:29] So some of you guys have enterprise kind of queries, basically. So I don't think you'll need to worry about this very much.

    [30:30 - 30:41] Basically, for enterprise kind of a query, you can scale as you go. Basically, but if you're using something consumer, you'll have to consider this earlier than later.

    [30:42 - 30:50] The low balancing for AI can about workload. The large language model can be deployed on multiple servers or containers, taking a little high user traffic.

    [30:51 - 31:12] You can basically do round robin, basically, so that no single LLM instance is overwhelmed. You can auto-scale depending on the GPU usage or latency threshold, and then you can do automatic checking and rerouting if instances fail.

    [31:13 - 31:24] So this is, again, very similar to traditional kind of a high traffic kind of systems. And you can use low balancers.

    [31:25 - 31:44] Low balancers in practice, if you basically use cloud manager services like AWS low balancer, they have more things out of the box. The reality is you can just use an Ingenex kind of a system and high availability kind of a proxy system to be able to do it yourself.

    [31:45 - 31:52] It's not very difficult to do. You're basically just pinging kind of a different metrics, basically.

    [31:53 - 32:16] And then splitting more, putting things up, basically, as things basically come about, increase. And this is an example of using the configuration to create a low balancing service, basically.

    [32:17 - 32:21] So there's application low balancing. There's network low balancing.

    [32:22 - 32:42] And then you can configure the low balancing to auto-scale, depending on what memory-- what metrics is basically kind of a targeting. And then you can also do route and failover so that it basically fails over on a certain kind of a criteria.

    [32:43 - 33:03] It's also latency aware, basically, and then it basically kind of configures-- it can be configured on GPU utilization, basically queue length or custom metrics as well. Again, this is very similar to traditional low balancing bus system.

    [33:04 - 33:24] And the one thing to basically consider for low balancing is traditional low balancing is you're primarily on that application kind of a-- this is very similar. I think-- I was going to say it's slightly different, but it's actually very on the inference side.

    [33:25 - 33:36] So if you're doing kind of multi-tenancies, there was a recent article-- I forgot from where they basically mentioned that, oh, I think it's from the economists. They came out yesterday.

    [33:37 - 34:00] And they were like, oh, we're going through an AI winter, basically, most enterprises are not getting the value from generative AI and goes into a kind of a detail. But some of the issues that they were basically detailing were all issues that were overcame, overcome in traditional SAS systems.

    [34:01 - 34:18] Privacy, security, basically, adapt adaptation to their own Java data. And so a lot of the techniques we're basically about talking about regime models were auto optimization using DSPY or using reinforcement learning.

    [34:19 - 34:31] They have largely not propagated to the wider audience. Basically, if you say where we are, we're like 1997, 1998 in the .com kind of bubble.

    [34:32 - 35:06] And the .com basically, there was a boom at bus and it required 20, 30 years, basically, for everyone to visit feel cloud native, basically. One of the key things for people to feel comfortable, basically, especially for certain verticals that help, is a multi-tenant system ensuring each tenant's documents and embedding state private and having resource allocation.

    [35:07 - 35:17] So one tenant's usage doesn't degrade performance, basically. And then distinct API keys or a lot of toolkits for each one.

    [35:18 - 35:30] So some of you guys are dealing with enterprise kind of things. These are going to be important to basically build out to differentiate it from the very beginning.

    [35:31 - 35:51] And then you can basically use-- yeah, you would have to build a lot of these. You can't really basically use-- you can use SageMaker, kind of, about employees and VPC, kind of, a step-notch to isolate data flows.

    [35:52 - 36:02] Basically, AWS, AI, and roles and policies, you generally have to implement this yourself, basically. But you can also basically use this as well.

    [36:03 - 36:11] But generally speaking, I think you have to implement it yourself. You can use multi-tenancy with both Docker and Kubernetes.

    [36:12 - 36:21] You don't have to use Kubernetes. And the data asset, the isolation, is really kind of important.

    [36:22 - 36:37] And so there's a couple kind of different models, basically, for it. One is silo-based, which is you have a vector search. You create a separate index or a separate database instance for per tenant.

    [36:38 - 36:42] And that creates a maximum kind of isolation. But increase kind of our overhead.

    [36:43 - 36:51] You have many small indices and different clusters. AWS OpenSearch serverless allows a collection per tenant.

    [36:52 - 36:59] And each tenant's data is in its own kind of our collection. Millvias supports the database level of tenancy.

    [37:00 - 37:11] And each tenant has its own database. You can also have a pool model where tenants share the same index and database, but tenant data is tagged in filter.

    [37:12 - 37:21] For example, OpenSearch index may require tenant ID. Basically, it requires low-level security.

    [37:22 - 37:27] And then restricts access per tenant. Timecone has namespaces as long patterns.

    [37:28 - 37:34] And then so one index can call multiple namespaces. And each namespace isolates the tenants and above vector.

    [37:35 - 37:42] Pulling is simpler to manage and more cost efficient. And if not, carefully secured.

    [37:43 - 37:51] Or you can basically use a bridge between the two, like one index per tenant, post them in a shared cluster. AWS closes the bridge model.

    [37:52 - 37:59] And reduces onboarding kind of our delay. And it balances isolation and manageability.

    [38:00 - 38:17] You can all-- you will want to also consider authentication so that each tenant accesses its own data or a namespace. And so OpenSearch serverless allows you to attach authentication systems to every tenant.

    [38:18 - 38:29] And then in Kubernetes, you might use a namespace system. So Kubernetes is used to manage it.

    [38:30 - 38:38] We typically use Docker. Basically, like for enterprise solutions is you may want to use it.

    [38:39 - 38:51] And then network isolation as well. You can basically have different subnets and VPC systems for extra security between different services.

    [38:52 - 39:04] For LMM, chatbot, what about the query latency? You want to basically be able to understand the latency and query times.

    [39:05 - 39:23] You want to check basically how long it takes to retrieve the document, the model inference at about time, and the response generation time. Basically, so your RAG application will basically be the sum of all of these.

    [39:24 - 39:30] And if you're basically building agent-based systems, you're basically executing tools in the back end. That's additional latency.

    [39:31 - 39:40] And so it's highly dependent on what you're basically build. So if you're building a voice agent, a latency is much more important.

    [39:41 - 39:53] Basically, if you're building something where it's a LML with a chat-based interface, it's not like the traditional web. The traditional web, basically, latency was super important.

    [39:54 - 40:09] Whereas for LMM-based applications, people understand that you're breaking down something that would ordinarily take three weeks, basically. And/or two weeks, basically, you're able to break it down into individual components.

    [40:10 - 40:22] So people are much more tolerant, especially if you're basically able to see show progress, basically. Techniques to reduce latency, you have model optimization.

    [40:23 - 40:33] You can use quantization, distillation, basically, for different versions of the LM. You can use batching, basically.

    [40:34 - 40:40] You can combine kind of a smaller request. You can do GPU, kind of a GPU optimization as well.

    [40:41 - 40:53] And then measuring latency in query times. With every single workflow, there's a managed service.

    [40:54 - 41:07] But if you use all these managed service, the cost is going to add up, basically. So you can, obviously, use a GCP kind of a cloud monitoring to do this.

    [41:08 - 41:22] Basically, back in the day, we basically kind of built up a bunch of these things, kind of a custom in-house. You just have to understand whether it's critical or whether it's not critical, basically, for you.

    [41:23 - 41:35] You want to, basically, be able to understand what's the cost, what's the quality, what's the speed. And so you have a multi-stage latency.

    [41:36 - 41:48] You have embedding generation, vector search, LM inference, and then response streaming. So if you optimize each stage, it reduces end-to-end delay.

    [41:49 - 42:05] But you also want to basically make sure, basically, your accuracy or your hallucination is basically about-- is right, basically. So query embedding kind of a latency, like a small transformer, is basically fast.

    [42:06 - 42:17] And it can batch concurrent kind of queries. It can also catch identical kind of inputs in Redis and by hashing the input model kind of a version.

    [42:18 - 42:23] You can also do batch inference. You can also do vector search latency.

    [42:24 - 42:32] Use approximate nearest neighbor. So some of these algorithms we already cover, like FAI-SS or HNSW.

    [42:33 - 42:45] And it can retrieve on your neighbor from millions of vectors in a few milliseconds. You can also basically optimize LM inference.

    [42:46 - 42:52] But oftentimes, model generation is the slowest step. And there's techniques underneath the hood.

    [42:53 - 43:03] They use speculative decoding or model decoding. You can also adjust max tokens or lower the temperature just for an output.

    [43:04 - 43:14] So I'm not sure if you guys notice that a lot of LM's, it defaults to a certain token amount. Basically, you actually have to force it repeatedly to get more tokens.

    [43:15 - 43:39] And this is important if you're generating a report or summarization. And so it's even the large-- even the more popular, like OpenAI and other things, they configure these things so that you're not basically like generating 10,000 or a million tokens, basically, like every single query, basically.

    [43:40 - 43:53] So yeah. And then if you're basically streaming about tokens to the user as they're generating, time to first token is one of the most important kind of benchmarks.

    [43:54 - 44:08] And so you can basically use quantization. You can improve your low balancer and API timeout, basically, to avoid premature about issues.

    [44:09 - 44:21] These are optimization kind of about things. But what you really want to get is really about your quality, basically.

    [44:22 - 44:39] And at least for right now, what you basically see is that quality goes to the reasoning models. And if you basically are able to train on intermediate reasoning about outputs, it tends to be slower at the expense of speed.

    [44:40 - 44:55] So you have to basically understand the proper evaluation metrics to be able to understand the trade-offs that you're basically making. Then, basically, you want to be able to understand whether you're vertically scaling or horizontally scaling.

    [44:56 - 45:05] Vertically scaling is you basically just add more GPUs, add multiple GPUs per node. For example, scaling is where you're sharding, basically.

    [45:06 - 45:24] Again, you can shard at multiple levels. You can shard at the low balancer level, add the API level at the vector database level at the database level, and also at the LLM, kind of a level.

    [45:25 - 45:39] There's multiple ways of doing sharding. We talked about this, basically, before a vertical, kind of, but keep you scaling really depends on your application.

    [45:40 - 45:56] So depending on your application, basically, like maybe A100, H100, could basically work. Or a Google TPU version, or a single A100 can do tens of tokens per second for moderate models.

    [45:57 - 46:11] Large models, which is tens of billions of parameters, can only fit in H100 and may need a model parallelism. And vertical scaling simplifies architecture but has a hard limit per node.

    [46:12 - 46:28] You may basically be wondering why we basically cover deep-seak and some other combined systems. One is we wanted to basically show you that the optimizations and innovations weren't just at the intelligence layer, basically, and reasoning model.

    [46:29 - 46:41] It's basically at the optimization layer. So there's more and more optimizations where people basically have parallel, kind of, by execution.

    [46:42 - 46:56] There's optimization at the inference side. And then if you remember, basically, in the code lecture, we basically talked about one of the wind source you need, kind of, about value proposition, is that we're able to optimize at the optimization layer, basically.

    [46:57 - 47:17] Sorry, at the inference layer, I'm getting my words mixed up. And horizontal, kind of, scaling is basically where you'll start, kind of, and you can deploy many smaller GPUs and nodes and serve, kind of, a traffic command currently, basically.

    [47:18 - 47:33] And then you can basically do a container, kind of, about organization. For example, kind of, you know, you can use a Kubernetes, for example, kind of, about Poly scalar and other kind of systems.

    [47:34 - 48:03] And then you can also use a cost-effective kind of other scalar, which is based on GPU utilization, or Q-length, basically, or latency, or whatever metrics that you basically can think of, basically. Google's best practices notice that optimization on GPU utilization and request queue size is key to match traffic to the GPUs.

    [48:04 - 48:14] So when traffic and bus fights, a single LM instance can become a bottleneck. So notable instances can handle concurrent and bad requests.

    [48:15 - 48:24] Each replica handles some subsets of queries. If one replica is busy, incoming kind of bad requests goes to one another, it improves tolerance.

    [48:25 - 48:39] If one node fails, the others are serving. Basically, AWS example is AWS EC3, basically, to run multiple containers, which the LM loaded.

    [48:40 - 49:08] And so you have to run multiple LM replicas with real-world, kind of, like, use cases, with unpredictable, kind of, like, the traffic. So if you look at, basically, the chat UBT, it looks like a unified interface, but it's actually-- there's many, kind of, our shards in the back, basically, that are serving everything, basically.

    [49:09 - 49:13] Yeah, model instance-- oh, Taylor? Yeah, sorry.

    [49:14 - 49:28] Go back to the slide before this. Yeah, so this diagram, I don't know if it's, like, a real-world example, but it shows client A might-- it's request might get targeted to either GPT-4 or 2T-3-5s.

    [49:29 - 49:36] What would determine that decision? Or what would it use to try to decide which model to send requests to?

    [49:37 - 49:46] Yeah. OK, so there's what GPT chat UBT does on that user interface.

    [49:47 - 49:51] You select the model, basically. And so if you select the model-- Oh, yeah.

    [49:52 - 50:07] The routes, but there's what's known as dynamic routing. Basically, there's a set of companies that are focused on router query optimization.

    [50:08 - 50:20] And they basically have evaluation metrics based on different queries. There was a recent company I just saw, basically, that entirely focused on router optimization.

    [50:21 - 50:31] And it would basically understand your query and understand the evaluation metrics where it's best for it. And then you basically determine, is it latency?

    [50:32 - 50:36] Is it query accuracy? Is it basically, like, whatever it is?

    [50:37 - 50:45] And then it autorows it, basically. So some of the people asked this during the AI coaching clause, basically.

    [50:46 - 51:08] And I said, you have to basically do it manually, but there's now companies that are in projects that are basically focused on query routing, basically. If you search for a deep resource, you can probably find a query routing in LM, something like that.

    [51:09 - 51:10] Yeah. Interesting.

    [51:11 - 51:12] Thanks. Yeah.

    [51:13 - 51:28] Yeah. So yeah, basically, what Taylor basically asked is also a-- the query kind of a routing doesn't always necessarily have to be based on availability and queuing about delays.

    [51:29 - 51:55] It can also basically be based on the actual application, basically. And-- Hey, his out, is there any literature on the health checks that these AI applications tend to use for load bouncers and traffic?

    [51:56 - 52:07] Yeah. Google had a best practice document, and let me-- let me come back to get up to see.

    [52:08 - 52:29] But basically, in general, people basically-- the metrics that we've listed cited are things centered around where you're detecting whether the cluster is being overloaded. So it basically, like, lazy is getting higher and higher.

    [52:30 - 52:37] It's one proxy. And the GPU utilization is also another proxy as well.

    [52:38 - 52:52] And so this is slightly different than traditional kind of systems, basically. So traditional systems, you can get queries that basically can be very long, basically, and can slow down the system.

    [52:53 - 53:05] But you basically have-- if you don't configure your model correctly, you can get, like, weekly models that actually you prefer a while, basically. And you can solve up all the GPU utilization.

    [53:06 - 53:27] So it's a little bit-- what we basically saw in our research is lazy and query, kind of, a GPU utilization. And then you can also basically use things around what you perceive as errors as well, basically.

    [53:28 - 53:35] Yeah. Occasionally, if you use chat GPT, sometimes it's just, like, errors out, basically.

    [53:36 - 53:38] Who knows what happens in the background? Yeah.

    [53:39 - 53:59] I haven't used the latest generation of load balancers. They probably-- I suspect they probably already have a configuration for thresholds relating to resource utilization.

    [54:00 - 54:20] I was more concerned with the actual application-specific health checks, such as running an inference that-- and making sure that it's responsive as well as correct. Yeah.

    [54:21 - 54:36] Those are-- a lot of those are basically, like, it's hard to do inference time. Unfortunately, basically, it's-- basically, unless you basically force-- so there are ways to basically do it.

    [54:37 - 54:44] But it's just very costly during inference time. You can basically have another model that checks the other model, basically.

    [54:45 - 55:01] And so it's a middle-emission-critical kind of a system where the accuracy is super important. You know, basically, you do-- you can have another reason model that basically checks and uses some form of a citation engine, basically.

    [55:02 - 55:10] And it basically just, like, double checks it every time. Obviously, it's going to be super slow for the user to get something bad.

    [55:11 - 55:21] But this is basically where we talk about the accuracy thing. If you want super accurate, where it double checks everything, you can add additional things like model verifiers.

    [55:22 - 55:47] And it basically double-checking with vector databases and chunks, basically, whereas if you're basically saying, OK, it might fail basically at inference level. But then the data scientists or the AI engineers will basically go afterwards and basically check everything afterwards.

    [55:48 - 55:54] So that's also kind of a thing. So availability of the model servers are crucial.

    [55:55 - 56:09] Basically, how do you create-- this is basically creating a Docker image that contains the inference code and also the dependency. If you want to reduce the image size, you use lighter weight systems.

    [56:10 - 56:29] And so this is like a sample of Docker configuration. You can also basically do-- a lot of people basically do a serverless, basically, for this, as well, basically.

    [56:30 - 56:38] You can also, with serverless, do co-start a large LLM. Co-starting a large LLM container basically can about take minutes.

    [56:39 - 56:58] And there's a company called VentoML. They improve co-stars 25x by layering and both on-demand, as well as kind of a code existing kind of instances.

    [56:59 - 57:20] To mitigate co-stars, they kept one warmed pod per region and then used an initialized container to prefetch models or load smaller quantized weights. And in performance dummy instance inference and to warm up the model, you can also do rolling kind of updates.

    [57:21 - 57:39] You then replace all the version numbers without coming back downtime. And then for replication, you basically want to maintain multiple instances of every kind of model servers just for looking about low balancing and failure.

    [57:40 - 58:12] There's also an issue with model versioning, basically, where if you basically took the right model version is when you deployment and switch traffic, basically, and you're able to store the model artifacts in a version about bucket. And to find out how many users are LLM-based service and support at once, you want to understand about the rate of queries per second.

    [58:13 - 58:25] So generally speaking, this is not unlike traditional database scaling. So traditional database scaling, you want to know the reads per second, the answers per second, and updates per second.

    [58:26 - 58:31] So for example, you're basically doing something that is just read-only. I don't know.

    [58:32 - 58:59] You're doing a directory website where you're just caching bot, I don't know, real estate data. Primarily, you're concerned about reads per second, whereas if you're doing something like a perplexity where you have a rag database and you want to keep it up to date, you have to worry about inserts and updates as well.

    [59:00 - 59:19] But those are more maintainable, basically, because there are EJOR infrastructure versus things that are on the user's side. Then you want to understand, latencies can buy per request, average, as well as can buy abnormal, basically, latencies.

    [59:20 - 59:37] You can understand GPU CPU usage. Again, your model parameters and how you configure temperature, match token, and understandings, will affect your GPU usage.

    [59:38 - 59:52] So there's tools like locus and Jmeter that can stress test your endpoints to find the breakpoint of where latency becomes unacceptable. And then you cannot have automated low tests on EC2.

    [59:53 - 01:00:07] And then you can also use AWS X-ray to visualize the request trace and latency across microservices. GCP, you can use cloud profiler and cloud monitoring kind of dashboards.

    [01:00:08 - 01:00:27] You can estimate throughput by local testing and simple math. Key metrics include time to the first token, tokens per second, and requests kind of a per second, or benchmarking use locus and artillery to simulate many users.

    [01:00:28 - 01:00:47] And you can use true salaries, LLM focused, to outlines about how requests per second is measured. If your inference spot has an average L per second, then in steady state, it can basically do 1 divided by L requests per second.

    [01:00:48 - 01:00:59] So if a GPU answers 200 milliseconds, one instance is five requests per second. With 20 replicas, you get 100 requests per second.

    [01:01:00 - 01:01:08] And then the tokens per second is the token rate per GPU. It obviously depends on your level of GPU.

    [01:01:09 - 01:01:14] It's 50 tokens per second per 8, 100. And your average risk balance is 20 tokens.

    [01:01:15 - 01:01:30] And you have 2.5 languages per second per GPU. And then concurrency, if you have very short latency to allow more accuracy, often an LLM server can handle a batch of requests firmly until a GPU saturates.

    [01:01:31 - 01:01:48] If you use a framework to send an increasing request per second until latency degrades, you monitor that point as a massimable, sustainable request per second. And this is basically an essential kind of a measure of throughput for LLM servers.

    [01:01:49 - 01:01:56] You have a simple low test here. If you don't want to use, you have a simple low test here.

    [01:01:57 - 01:02:17] But you generally want to see the request per second basically of your vector database, your request per second of your RAG system in general, or the request per second of your LLM. So those are your vector database and your LLM are going to basically be your bottlenecks, typically.

    [01:02:18 - 01:02:38] And I've basically mentioned that for anti-elucination, a lot of people basically use a chain of thought reasoning, which is in reasoning models and reasoning models in a data set. So it has a better explanation, higher accuracy.

    [01:02:39 - 01:02:58] But it has basically much more compute time because basically it's like generating tokens, not to just have the output, but it's actually thinking through things. As a higher memory usage, higher inference cost, longer processing per request.

    [01:02:59 - 01:03:13] And this is just another example of class versus quality versus latency. If you low balance a chain of thought models, you can basically have longer running request queues.

    [01:03:14 - 01:03:39] So a standard 60-second timeout might be too short. Basically, if the LLM is generating the lengthy reading set, so just understanding basically what period of time is just for a timeout is actually it's highly dependent on your application and whether you're using tools or not, basically in your agent.

    [01:03:40 - 01:03:53] And also basically, adaptive routing, new requests to the lowest load. And then you want to basically be able to auto scale as well to monitor your GPU about memory, which are very robust scale outs.

    [01:03:54 - 01:04:07] AWS low balancing can basically extend timeout and robust based on target memory. This is the advantage of modern low balancing systems.

    [01:04:08 - 01:04:21] It's quite configurable based on what you need. Again, using a managed service like this is a cost more relative to kind of a for you.

    [01:04:22 - 01:04:46] And then, yeah, so you need basically more generous about timeout and you need more different policies basically to compare a simple timeout models. The chain of thought consideration is it affects system design in a few ways.

    [01:04:47 - 01:04:57] Basically, one is inference length and latency. Chain of thought tokens can basically be very long.

    [01:04:58 - 01:05:08] And so, you could default timeout to 300 milliseconds. Sorry, 300 seconds, not 300 milliseconds.

    [01:05:09 - 01:05:22] And the locking basically can lock a pod basically. You can also basically ensure that a user's multi step reading stays into the same pod.

    [01:05:23 - 01:05:41] If partial open about state is needed. In traditional care about systems, you actually have a session affinity where you basically have a user session that's basically per low balance or per API per database.

    [01:05:42 - 01:05:51] You can also do that for LLM based applications as well. Each additional kind of token requires more GPU memory.

    [01:05:52 - 01:06:01] And chain of thought can exhaust memory on smaller GPUs. Use larger memories or offload key value cache as needed.

    [01:06:02 - 01:06:23] Then you can also stream the tokens to the user basically for so that the user can see progress. And then-- or you can basically configure it only to return partial tasks so that it's not streaming every single thing.

    [01:06:24 - 01:06:45] And then research has shown that allowing multiple branches of reasoning can vary about latency with worst case amount of branches is dominating response time. So, Hectees like early stop me, when reasoning is busy, it has enough confidence can help limit runway latency.

    [01:06:46 - 01:06:56] So, you can also basically do caching. So, if you look at basically the deep-sea kind of internals, they basically do a lot of caching internally.

    [01:06:57 - 01:07:14] They do caching about multiple levels simply because they're GPU kind of constrained. And some LLM kind of queries can repeat or have it over rhyme about context.

    [01:07:15 - 01:07:19] You can do prompt caching. You can do document chunk caching.

    [01:07:20 - 01:07:26] Pre-compute embeddings, and store it in Redis or memcache. You can do inference result caching.

    [01:07:27 - 01:07:34] User cases where identical prompts, yield stable answers. A cache in skip regeneration.

    [01:07:35 - 01:07:45] And then-- yeah. This is not unlike traditional scaling as well.

    [01:07:46 - 01:07:52] Basically, a traditional kerbok. Re-scaling, you use in-memory cache quite a bit.

    [01:07:53 - 01:08:07] Basically, the different kind of caching is you can basically do embedding caching. That's based on a key of which is the input text and the model version.

    [01:08:08 - 01:08:15] And then the key basically maps to a given vector. And it skips re-embedding, basically.

    [01:08:16 - 01:08:28] You can also do a result kind of about caching. And for identical queries, you can basically kind of outwit the final kind of JSON.

    [01:08:29 - 01:08:41] And of course, this is an issue with a temperature, right? When cache responses are much more deterministic, basically, you can do this.

    [01:08:42 - 01:09:06] You can also do tiered hashing, store most commonly accessed, kind of by caching in furlough latency and warm caching in databases. You can also do re-computation, pre-embed, different kind of common queries, build embeddings for the documents, and then pre-generate for known prompts and load it into the cache start.

    [01:09:07 - 01:09:25] Then you basically have a common kind about pitfalls and mistakes, which is ignoring kind of monitoring and logging. There's actually quite a bit of user fails, basically, because this area is still new.

    [01:09:26 - 01:09:44] As I mentioned, basically, with the RAM AI researcher, he shows in his video Apple failing-- AI agent failing in production. Very large companies with large engineering bases that have failed whales, wow.

    [01:09:45 - 01:09:51] You have to basically have understanding of your monitoring and logging. And it has to be not an afterthought.

    [01:09:52 - 01:09:56] It's not like some random logging. You basically have somewhere where you check it once in a while.

    [01:09:57 - 01:10:03] It's a core part of your LL room application. The other issue is basically a TPUs for LL room inference.

    [01:10:04 - 01:10:12] It's basically like very high cost. Basically, this is often why you don't see as many free trials in AI applications.

    [01:10:13 - 01:10:26] It's because the base level cost is much higher than spinning up by EC2, about instance, and making some reads and requests. And then you have security about oversight.

    [01:10:27 - 01:10:48] Almodate tends to see what the L9 might leak data between different customers if some are not segmented or sanitized. And then there's-- you also have fine tuning or overfitting about issues if you basically use a generic LLM without domain specific adaptations.

    [01:10:49 - 01:11:02] Basically, someone basically asked me earlier today about LLM for a specific industry. And there probably is for specialized industries.

    [01:11:03 - 01:11:21] And then if you have two complex of pipeline, you can also lead to a agile and robust system where it's hard to just manage. And so all of these are basically balancing monitors, security, privacy, cost control, and development.

    [01:11:22 - 01:11:37] And it sounds like a lot, but these are things like in practice, people don't do all of this in the very beginning. Basically, so you don't have to be stressed out that you have to deploy all of this.

    [01:11:38 - 01:11:52] Like, when a lot of the production of our systems, we've done-- you do it kind of by incrementally. And then you do it kind of by-- you build it with the usage.

    [01:11:53 - 01:12:24] When you basically deploy a chatbot, basically, you want to be able to deploy a chatbot into AWS. And so if you package the chatbot, including the ML code model in the inverse library into a dopper, then you deploy the dopper on AWS Cloud9 and then store the final image in the registry for easy retrieval.

    [01:12:25 - 01:12:32] Basically, it's a similar thing in GCP. Basically, it's called the Artifacts registry.

    [01:12:33 - 01:12:40] Then you set up your infrastructure for inference. You use Amazon EC2.

    [01:12:41 - 01:13:01] And you can use Elastic Container Online Service for deploying remote containers for a high availability. Or you can even specify a serverless kind of a solution using an AWS Fargate, where you define kind of a CPU, GPU requirements.

    [01:13:02 - 01:13:05] That's the same thing for GCP. If this is the GCP, it could work.

    [01:13:06 - 01:13:25] And then you want to basically configure network and low balancing. And then you must expose the chatbot over a secure endpoint and distribute the traffic across about multiple instances.

    [01:13:26 - 01:13:44] AWS can buy application, can buy a low balancer, can basically can buy a low balancer to everything. And a low balance error is ensured so that no one basically can buy is overloaded.

    [01:13:45 - 01:13:52] Then you want to basically use model logs. Basically, Amazon S3 CloudWatch logs.

    [01:13:53 - 01:14:05] You can also use AWS Secrets Manager for storing API keys or credentials securely. Look, we talk about AWS and GCP a lot, but in practice in versus save money.

    [01:14:06 - 01:14:20] And you may just implement this yourself, basically, rather than them about fully implemented. But we wanted to basically show you, you know, what to basically do.

    [01:14:21 - 01:14:36] And then, just because you have a vector database, doesn't mean that you don't have a relational database, basically, or a key value database. So for embedding vector search, you can use Amazon Open Search or Milvis.

    [01:14:37 - 01:14:53] And for management of our relational database, you can use Amazon RBS or Hang on DynamoDB, basically, as well. You have traffic kind of up roles, infrastructure could scale automatically.

    [01:14:54 - 01:15:05] And then you can have AWS kind of auto-scaling. And then you can set auto-scaling kind of a metric and model for the AWS kind of out watch.

    [01:15:06 - 01:15:17] And then you want to secure your inputs. You know, obviously, there's like traditional security, which is like cyber security and other thing.

    [01:15:18 - 01:15:26] And there's LLM security. We talked about LLM security in the first three weeks, which is people trying to get your system problems.

    [01:15:27 - 01:15:42] And that's more more related to your fine-tuning and your query layer, basically, and detecting whether people are trying to check you or not. Like LLM hacking is a bit different than like traditional hacking.

    [01:15:43 - 01:15:55] So a lot of people like, why are there the system problems for everything? It's because for cyber security guys that specialize in trying to get the system problems for everything.

    [01:15:56 - 01:16:18] So you have traditional security, which is SSL, basically firewall, basically API, kind of blocking, kind of security, kind of systems. But you also basically want to do LLM based security, which is what we mentioned about earlier in the lectures.

    [01:16:19 - 01:16:44] And then continuous integration, continuous deployment, basically to be able to deploy your Docker, kind of like instance else, and then be able to do testing and all the assurance. And yeah, basically it's, yeah, that's basically it.

    [01:16:45 - 01:16:54] And let's see, yeah, that's basically it. Let's see, yeah.

    [01:16:55 - 01:16:59] Did you guys have any questions? Yes, I'm curious.

    [01:17:00 - 01:17:19] Do you know how frequently, how do I phrase this, how frequently the models, the production models, are updated with fresh data? I know RAG is a technique for that.

    [01:17:20 - 01:17:33] Yeah, yeah, basically it depends on the application. Basically, when a fine-tuning basis, it doesn't get updated very often.

    [01:17:34 - 01:17:55] Basically, whereas on a, I know you mentioned on Circle, basically asking about semi real-time fine-tuning. There's probably research on it, but right now, at least, fine-tuning is not updated very often.

    [01:17:56 - 01:18:05] But RAG, you can basically update more often. And then, updating RAG is really depending on the system.

    [01:18:06 - 01:18:26] In practice, it's people basically, they might ingest the data in once a day, once an hour, basically it. And then, for search engines, like plasticity and maybe hexa, basically it's more often, maybe 15, kind of about 30 minutes.

    [01:18:27 - 01:18:48] Maintaining the index is really depending on the application. Basically, if you're using something like, I don't know, Gleen, basically, which is enterprise by an element in search across your entire user base, maybe you have to update it every minute, basically, so that if you add something, you're able to search it immediately.

    [01:18:49 - 01:18:57] Whereas, there are other applications where that's not expected or required. So, it really depends on the application.

    [01:18:58 - 01:19:17] These memories that are being added to the models, those memories are being treated as RAG? Yeah, there are vector databases in the back end.

    [01:19:18 - 01:19:32] So the feature, they call it memory as a feature, but they're actually a vector database just in the back end. Or a vector slash databases, as you read a comment.

    [01:19:33 - 01:19:41] There seems to be pretty big privacy issues with that. You speak of multi-tenancy in the lecture.

    [01:19:42 - 01:19:52] That is really hard, really expensive to implement your multi-tenancy. And once you start adding stuff like memory.

    [01:19:53 - 01:20:04] Yeah, it's a problem, basically. We haven't gone to the case, basically, where the FBI is requesting people's chat logs.

    [01:20:05 - 01:20:21] But, basically, it's going to happen, basically, not only for your private kind of metadata, but also company based on data. Yeah.

    [01:20:22 - 01:20:31] But this is also why it's not a winner to all mobile market. You optimize specific privacy security features, multi-tenancy features.

    [01:20:32 - 01:20:37] And there will be certain customers that will want to pay for it. So do you want to have any questions?

    [01:20:38 - 01:21:00] I'm thinking that ideally memory should be treated as a zero trust in the zero trust methodology. For example, each user could have their memory on their own device.

    [01:21:01 - 01:21:15] But I'm sure it's not happening. Yeah, that architecture, a tan word, basically.

    [01:21:16 - 01:21:27] So recently, people have implemented some terminal LOM-based things. And there's actually some applications where we have a whole number and other things.

    [01:21:28 - 01:21:46] So people are building these things. So there are desktop systems for very similar concerns, as you where people, basically, say, OK, open AI is great, but I really don't want all my memory to basically be there.

    [01:21:47 - 01:21:53] And so I would prefer to basically have my memory remotely. So it's definitely a concern.

    [01:21:54 - 01:22:02] And yeah, there's going to be a population. It's going to be a market, so yeah, for that.

    [01:22:03 - 01:22:07] So where are we in the course at this point? How many more weeks do we have?

    [01:22:08 - 01:22:18] We're getting to the end, basically. I'm trying to calibrate it for people's AI projects.

    [01:22:19 - 01:22:44] Basically, so we have-- I'll give an update, basically, where we're pretty close to that, and we just have a couple more concepts, like anti-collucination. Basically, and enterprise architectures.

    [01:22:45 - 01:23:09] But I think we're close to that, and we want to do it that day. Basically, we're people basically present, and then we'll do another set of-- as a tolerance-fisted for our Cinderella recruiting, basically, people-- for a subset of people who are interested in recruiting.

    [01:23:10 - 01:23:32] Yeah, so we're basically getting to the end of a bus and an update, but I want to basically-- we didn't-- some of the AI projects are taking longer for a certain people, so I want to make sure people are able to present something on demo day. Basically, obviously, it's not going to be a finish finish, but yeah.

    [01:23:33 - 01:24:07] One of the startups that I am close to-- the CEO wanted to know what the cost of the boot campus-- yeah, I can-- we can talk offline, basically, about it. Yeah, so you've got the early price, Ken, and the price-- the price did change for the majority of the people armed power.

    [01:24:08 - 01:24:18] You did the pre-order [INAUDIBLE] you took the risk off. OK, and the other thing is, are you still considering having a second one?

    [01:24:19 - 01:24:34] We are, basically, but the amount of work this boot camp basically took, we may have to debate the second one. We'll bet, basically, I'll get back to you on the second one.

    [01:24:35 - 01:24:45] It's-- I have to-- I have to basically hire more staff to be able to service our second one. OK, it is.

    [01:24:46 - 01:24:58] OK, yeah. The crypto world, there are few crypto projects that are trying to either do-- Yeah.

    [01:24:59 - 01:25:07] --be like a core weave type of offering-- Yeah. --with decentralized GPUs-- Yeah.

    [01:25:08 - 01:25:42] And then there are also ones that are trying to do more secure-- more security-oriented so that the models are verified to be, in fact, that you're querying or you're prompting-- I don't know what the right term is-- but you're interacting with the model that is, indeed, the model you think it is and that sort of thing. Are you following that?

    [01:25:43 - 01:25:54] It seems to be a very active area. Yeah, I think the best one is called a prime intellect.

    [01:25:55 - 01:26:08] And so a lot of the people that have gone into the space, it's a lot of it focused on GPU marketplace. A bit about things.

    [01:26:09 - 01:26:29] But they are basically trying to be decentralized infrastructure kind of bus systems, like town. But what ended up happening is the infrastructure wasn't quite there, and there wasn't a sufficient set of ML researchers that were building for infrastructure.

    [01:26:30 - 01:26:51] So this is one of the first ML systems, basically, is where they have an asynchronous distributed kind of a training system. And the recent technical reward on intellectual is-- it's a technical feed.

    [01:26:52 - 01:27:04] Let's just say that. Basically, it's the first, will lead to a distributed system that basically built a reinforcement learning model issued itself as a system.

    [01:27:05 - 01:27:17] I haven't had a deep chance to go into the specifics, but on the surface, it seems like it's the predominant one. Yeah.

    [01:27:18 - 01:27:28] Are you still active on your project? It's decentralized, but if you go to the Casper Knobck community, it's an active surrounding community.

    [01:27:29 - 01:27:35] We model it off of Bitcoin, so there's no centralized leaders. OK.

    [01:27:36 - 01:27:42] I'll shut up. I could sit here and ask questions all day.

    [01:27:43 - 01:27:46] Yeah. Anyone have any other questions?

    [01:27:47 - 01:27:54] All right. All right, I guess I'll touch you guys later. Thank you guys.