Building a Semantic Search Engine with Chroma

Project Source Code

Get the project source code below, and follow along with the lesson material.

Download Project Source Code

To set up the project on your local machine, please follow the directions provided in the README.md file. If you run into any issues with running the project source code, then feel free to reach out to the author in the course's Discord channel.

  • [00:00 - 00:43] Welcome back. In this module, we'll learn how to use the in-memory vector database Chroma to build a vector index and use it to perform semantic search. Here we are on the documentation page of Chroma. We can see a simple diagram explaining retrieval-augmented generation, because Chroma is a new type of database for the age of AI, one that stores documents alongside their vectors. In our use case, the data will be IPCC reports in PDF format. Now, let's dive into the code. Here is a quick script whose job is to take PDFs as input, transform all their content into vectors, and then insert those vectors into the database.

    [00:44 - 02:46] Let's take a look at the code. The first job, and perhaps the most difficult, is to parse the data, because we need to read all our documents. To parse them, we use LangChain: we load the PDF, we get some text, and then we split it. We split because, as we recall, there is a maximum size for the context window of the LLM. We cannot fit the entire document, so we need to split it. Here we use a strategy from LangChain, the RecursiveCharacterTextSplitter, with a chunk size of 1,000 characters, and we add some metadata: for instance, the page the chunk was taken from, the URL, the name of the document, and so on. So now we have extracted chunks of text from the PDF. Next, we need to transform those chunks of text into vectors. This is made possible through embeddings. Here we use an OpenAI embedding through a LangChain class, so you will need an OpenAI API key set in your environment variables for this class to work. Our last job is to insert those vectors into the vector database. Here we rely on a LangChain vector store class backed by a Chroma instance. We simply insert the documents into the database, making sure each one is embedded using the embeddings. And with that, we have done our job: we managed to build our database and fill it with our data. Now there is an important step, which is retrieval: we need to pick the relevant documents. How is it done? We have a query, a plain text string. We need to convert this query to a vector, which is once again done through embeddings. We create an instance of our vector store and perform a similarity search.
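    The ingestion steps just described (parse, split, embed, insert) map to a handful of LangChain calls. Below is a minimal sketch, not the course's exact script: it assumes the langchain-community, langchain-openai, langchain-text-splitters, pypdf, and chromadb packages are installed and that OPENAI_API_KEY is set in the environment. The file path, document name, URL, chunk parameters, and persist directory are illustrative, and import paths vary across LangChain versions.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Parse: load the PDF into one Document per page.
loader = PyPDFLoader("data/ipcc_report.pdf")  # illustrative path
pages = loader.load()

# 2. Split: the LLM context window is limited, so cut pages into chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

# Attach metadata (the page number is set by the loader; add the rest).
for chunk in chunks:
    chunk.metadata.update({
        "document_name": "IPCC Report",      # illustrative values
        "url": "https://www.ipcc.ch/",
    })

# 3. Embed + insert: each chunk is embedded with OpenAI embeddings
#    and stored in a local Chroma index.
embeddings = OpenAIEmbeddings()
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",  # where Chroma keeps the index on disk
)
```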

    [02:47 - 03:35] We search for the k closest chunks from a semantic point of view. Notice that we make sure to use an async implementation, to preserve non-blocking behavior for our RAG workflow, and also that the similarity score and all the metadata are returned with each chunk. Last but not least, the filters: you might want to run a semantic search on only a subset of the data. The good news is that this is already handled for you by most vector databases, so we just need to pass the filters, and we get the five closest chunks among those accepted by the filter. So now we've built our database and our retrieval. In the next lesson, we will be able to implement our RAG workflow. See you soon.
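    Retrieval can be sketched as follows, reusing the vector_store built in the ingestion sketch above. The query text, k=5, and the metadata filter are illustrative assumptions, not the course's exact code; the async call at the end shows one way to keep the RAG workflow non-blocking.

```python
# Illustrative query; any natural-language question works.
query = "What are the projected impacts of 2 degrees of warming?"

# Return the k closest chunks together with their similarity scores
# and metadata (page, document name, URL, ...).
results = vector_store.similarity_search_with_score(query, k=5)
for doc, score in results:
    print(score, doc.metadata.get("page"), doc.page_content[:80])

# Restrict the search to a subset of the data with a metadata filter:
# only chunks whose metadata matches are considered.
filtered = vector_store.similarity_search_with_score(
    query,
    k=5,
    filter={"document_name": "IPCC Report"},
)

# Async variant, usable inside an async endpoint so the event loop stays free:
#     docs = await vector_store.asimilarity_search(query, k=5)
```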