Okay, so we'll try to keep it fairly straightforward so that it's understandable for everyone, and hopefully you'll find something useful. We're going to give it a couple of minutes and start at 3:55.

Okay, so my name is Shrey Anand and this is my colleague, Surya Pathak. We are a part of Emerging Technologies at Red Hat, where we look at the latest data science trends and create POCs that can help internal Red Hat products and also upstream open source communities. So today we're going to go over how we can deploy our own open source language models, and also go into an application and its architecture to explain what components go into such an application. All of our experiments are done on OpenShift, which is a Red Hat product; however, all of this can be replicated on a Kubernetes cluster. So all of this is cloud native, and you can use it for your own experiments or for deploying in a production environment.

We're going to start today by going into an overview of large language models, then describe an example problem, give a solution demo, and then explain the architecture of that application. Then we'll go into the open source components that we used in that application to explain concepts such as vector embeddings, vector databases, base models and fine-tuning, and finally some of the deployment frameworks that we used.

To give an overview, foundation models or large language models are a type of neural network, and the job of the neural network is to approximate a function that fits the input data. Now, the entire explanation of this topic is out of scope for this talk, but I'll give a brief introduction. If you look at the graph on the left, it's a very simple data set with two dimensions. You can think of it as a housing price data set where the x-axis is the distance from the center of the city and the y-axis is the house price. And so if you plot the house prices on this data set, you can basically fit them with a line, the blue line, and if you have a new point and you know the location of that point, you can go to this line and, on the y-axis, it will give you a predicted housing price. Now, this is a simple data set and you can use a simple line to fit it. However, if your input data is multidimensional, like an image or natural language, you need more advanced architectures to find a multidimensional version of this line.

As these architectures have advanced, we have arrived at the transformer architecture, and this image shows the transformer network introduced by Google in the paper "Attention Is All You Need". What this architecture has allowed us to do is to have a generation mode: in other words, outputs or words can be created repeatedly by guessing likely options for each next word. So you can start with a human prompt and then the model can start generating new words. We've seen that this has applications in code generation, information retrieval, question answering, sentiment analysis, and classification tasks.

So today we're going to take one of these applications, question answering, and describe a documentation search application. The primary objective of this exercise is to create a QA system for product documentation. We're taking a cloud service called ROSA, which is Red Hat OpenShift Service on AWS, taking all of the publicly available documents for that particular service, and creating an application that looks something like the image on the right side.
So there is a question, and it says, what are the differences between ROSA and Kubernetes, and then there is an answer text for that. So we want an application that can generate that answer. For the data set, like I said, we take all the publicly available ROSA documents, and for validation we use an FAQ document. Some users asked questions about the service; we recorded those, and we have engineer-generated answers to those questions. So we take them and see if our application is able to match the quality of those answers.

Next I'm going to give a live demo of the application. So this is the environment that Open Data Hub, which is an open source project, and RHODS provide. I'm going to show the back end of this application; we are currently updating the front end. So it's as simple as calling this application and asking your query. Here I'm going to ask, what is ROSA, or what is the service? And it's going to execute all the steps in the background. It's going to load the documents, the data set documents, it's going to load the vector store, then it's going to generate top candidates. These are potential answers to this query. And then the final step is the constructed answer, which is basically the definition of the service: it's a fully managed application platform, and so on.

We can do this with any query. I have another example, so we can ask a question that a user actually asked: what exactly am I responsible for, and what are Red Hat and AWS responsible for, going forward? This should do the exact same thing that happened in the previous query: load the documents, the vector store, generate the top candidates, and finally the answer. So yeah, that's basically the application demo.

Next we are going to go into how we created this and some of the open source components that we used. The major concept behind this is embeddings, and it's really exciting to see how far things have come in the embeddings world. To give an overview, we have as input words or images, in this case a cat, a dog, and a car. We enter that into this embedding model, or neural network, that generates this embedding space, a vector of numbers that captures information in the input objects. At the end, we have this space where cat and dog are closer to each other, which means there is more similarity between them than with the car; the car is further away in the space. So that's the basic concept.

And we can see that in an example with the query that we just ran. The question is, what is the ROSA service? And then we can run this to encode that question into a vector. This question vector is 384-dimensional, so it's a collection of 384 numbers, and we can just look at them here. Similarly, we can take the answer to that question, the definition of the service, and encode that into the same vector space. Then we can calculate the distance between them, that is, how close they are to each other. When we do that, we get a number between zero and one, one meaning that they're highly similar and zero meaning that they're far apart in the space. Here we get something like 0.6, which is a high number. As a sanity check, if we take a completely unrelated piece of text, something about Docker, then the similarity between our ROSA question and the unrelated vector is quite small. And similarly, the similarity between the answer vector and the unrelated vector is also very small, around 0.2.
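As a rough illustration of that similarity check, here is a minimal sketch using the sentence-transformers library. The all-MiniLM-L6-v2 model is an assumption on my part (the talk only mentions 384-dimensional embeddings, which that model happens to produce), and the example texts are paraphrased rather than taken from the actual documents.

```python
# Minimal sketch of encoding texts into an embedding space and comparing them.
# The model name and the texts are illustrative assumptions, not the talk's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings

question = "What is the ROSA service?"
answer = "Red Hat OpenShift Service on AWS (ROSA) is a fully managed application platform."
unrelated = "Docker is a tool for packaging applications into containers."

# Encode all three texts into the same 384-dimensional vector space.
q_vec, a_vec, u_vec = model.encode([question, answer, unrelated])

print(util.cos_sim(q_vec, a_vec).item())  # expected to be relatively high, e.g. around 0.6
print(util.cos_sim(q_vec, u_vec).item())  # expected to be low, e.g. around 0.2
```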
So that kind of signifies that the embeddings can capture some of the semantics inside these texts. We can extend this exercise to a collection of documents around the ROSA service and a collection of documents around Docker. When we take those 384 dimensions, squeeze them into two dimensions, and look at the plot, we see two separate clusters, indicating that these documents came from different sources. So again, it signifies that in this space, the embeddings were able to meaningfully separate the two.

So how does this become useful in a document retrieval or document search application? If you look at this graph, what we do is take all the documents and encode them into the vector space; all the blue points show those documents. And whenever a query comes in, like the one we just showed, we find the vector for that query. Then, in this space, we look at the nearest neighbors of that query and find the relevant documents. So this one would be the most relevant, and if we want the top five, we can take these five and use them as our candidate documents. That's the first step in this application. Going to go back to the slides.

Right, so this is the architecture of the application. There's a user who comes up with a query, and it goes into the embedding search, the concept that I just showed, which generates the top candidates. The second step of this application is the answer generation. You have the query and you have the top candidates, and they go through a prompt manager, something that generates a prompt that can go to an LLM interface that talks to your large language model. The large language model is then responsible for creating the final response. This is a bare-bones architecture, and there are many ways of improving it: you can add lexical search alongside the embedding search, you can add queuing and caching, you can add a task planner. But the basic concept of RAG-pattern applications follows this architecture.

So next we're going to go into some of the open source components. The first one is the vector database. The purpose of a vector database is to index the vectors generated by the embedding model. So in the demo that I showed, we converted the documents into vectors: we take all the publicly available documents of a service, encode them into vectors, and then index them in the vector database. The performance of these vector databases is judged on how efficiently they can find the nearest neighbors of an input vector, and they use something called approximate nearest neighbor algorithms to do that. There are more than 50 vector databases that have emerged in the foundation model era, and some of the major open source players are Qdrant, Milvus, Elastic, Weaviate, pgvector, and so on. You could look at some of the benchmarks from Qdrant and Zilliz and see which ones perform best for your specific use case. In our demo, we are experimenting with Qdrant because it allows us to create a Kubernetes deployment and also run it in memory locally.
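To make that concrete, here is a minimal sketch of indexing a few document chunks in Qdrant's local in-memory mode and retrieving the nearest neighbors for a query. The collection name, the example texts, and the embedding model are illustrative assumptions, not the application's exact setup, and the qdrant-client API details have shifted a bit between versions.

```python
# Minimal sketch: index document chunks in Qdrant (in-memory mode) and search them.
# Collection name, documents, and embedding model are illustrative assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
client = QdrantClient(":memory:")  # runs locally; point at a Kubernetes-hosted Qdrant URL in production

docs = [
    "ROSA is a fully managed Red Hat OpenShift service on AWS.",
    "ROSA clusters can be created with the rosa CLI or the AWS console.",
    "Docker is a tool for building and running containers.",
]

client.recreate_collection(
    collection_name="rosa-docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="rosa-docs",
    points=[
        PointStruct(id=i, vector=encoder.encode(text).tolist(), payload={"text": text})
        for i, text in enumerate(docs)
    ],
)

# Embed the query and fetch the top candidate documents (nearest neighbors).
hits = client.search(
    collection_name="rosa-docs",
    query_vector=encoder.encode("What is the ROSA service?").tolist(),
    limit=2,
)
for hit in hits:
    print(hit.score, hit.payload["text"])
```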
So this is an example of answers that we got from our application. The first question is what I showed in the demo: what exactly am I responsible for, and what are Red Hat and AWS responsible for? The actual answer goes into the separation of responsibilities. The second, generated answer is without any prompt engineering, and it gets it partially correct. However, if we do prompt engineering, which is a collection of guidelines for creating a prompt template that interacts best with your LLM, then the final generated answer is much better: it's comprehensive and covers all the points that are in the actual answer.

However, this is not a foolproof method yet, so in the next two examples I show where this fails. For the question, where can I get more information and details, the actual answer gives three links to the ROSA service. However, the final answer here gives three links, out of which two are incorrect. So it points to the hallucination problems that these large language models have. We are currently working on getting feedback and improving this application. However, the base LLM that we used for this exercise was OpenAI GPT-3.5, which is not open source. The next set of experiments that we did were to see if we can adapt some open source models, then deploy them and use them in this application. So at this point, I'm going to hand it over to Surya. He's going to go over the next set of experiments and talk more about feedback and evaluation.

Thank you, Shrey. All right, so talking about adapting open LLMs, here we explored the process of adapting open models to suit our use case. The first phase is pre-training: a language model is trained on a massive data set of text and code. Some examples include Falcon, Flan, Pythia, et cetera. The second phase is supervised fine-tuning, or SFT, with high-quality data, prompt and response pairs. For our case, we used a data set of prompt-response pairs about the ROSA documentation to fine-tune a Flan model. This process refines the model, making it more in tune with domain-specific areas, and it can be replicated with Llama or any other family of models. The final step is reinforcement learning from human feedback, famously called RLHF. This step reduces the chances of the model being biased, toxic, or not so helpful for its users. We are currently collecting user feedback data to experiment with RLHF training.

So, more about open LLMs. Let's see some of the things that we can consider before choosing an open LLM. The decision is based on your resource budget, license requirements, and the model's performance. Talking about performance, there are evaluation benchmarks, like MMLU or TruthfulQA, et cetera, that you can use to find the model that fits your use case. As for licenses, some commercially available models and their respective licenses are listed here; some have Apache 2.0 or MIT licenses, and Llama has its own license with some constraints. Among them, we have experimented with Bloom, GPT-2, and Flan T5. Now, we have taken this pre-trained model, Flan T5, and fine-tuned it on our QA data set related to ROSA. We could do that on one T4, 16 GB GPU. One of the sources for this list of open LLMs is the Hugging Face Open LLM Leaderboard, so you can also check that out. Selecting an appropriately sized model according to your resource budget is also an important consideration.
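As a rough illustration of that supervised fine-tuning step mentioned a moment ago, here is a minimal sketch using the Hugging Face Trainer. The base checkpoint, the tiny inline dataset, and the hyperparameters are illustrative assumptions on my part, not the actual training setup used for the talk.

```python
# Minimal sketch of supervised fine-tuning a Flan T5 checkpoint on prompt/response pairs.
# Checkpoint, example data, and hyperparameters are assumptions; a flan-t5-base sized
# model fits comfortably on a single 16 GB T4.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical prompt/response pairs in the style of the ROSA QA data set.
pairs = Dataset.from_dict({
    "prompt": ["What is ROSA?"],
    "response": ["Red Hat OpenShift Service on AWS, a fully managed application platform."],
})

def tokenize(batch):
    enc = tokenizer(batch["prompt"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["response"], truncation=True, max_length=256)["input_ids"]
    return enc

train_ds = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-rosa-sft",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=3,
                                  learning_rate=3e-4),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```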
So once you have selected your preferred open model, the next step is to transition from running it on your local machine to creating a fast, secure, and scalable API endpoint that can handle queries from multiple users. For model deployment, some things to consider are infrastructure and resource management, like GPU availability, model sizes, et cetera. You also need to optimize your model deployments: cost optimization, performance and model latency, system monitoring, and usability, meaning how many users are using this model. Lastly, also consider security and privacy.

There are a number of different deployment tools available for LLMs. The best tool for you will depend on your specific needs, such as the type of LLM you are using, the tasks you want to perform, and of course your resource budget. Here's a quick comparison of some open source tools for LLM inference, based on the article cited here. Tools like vLLM, CTranslate2, and DeepSpeed show low latency, meaning they are fast, but they have limited integration with other technologies. Currently, we are looking at Text Generation Inference and Ray Serve. Ray Serve has higher latency, but it also offers better scalability, multi-model composition, et cetera. It also has a native LangChain integration, which is useful for a project developed with LangChain. In the next demo, I'll demonstrate how to deploy a pre-trained Flan T5 text-to-text generation model with Ray Serve on a Kubernetes cluster.

So here we have a notebook, which shows the different steps that go into deploying a pre-trained Flan T5 text generation model on Ray Serve. But before jumping into the details, I would first like to give you a brief intro to Ray and Ray Serve. Ray basically provides the backbone and primitives for distributed computing, and Ray Serve is a native library built on top of Ray which provides a scalable and programmable framework for machine learning applications. Ray Serve provides user-friendly solutions for deploying and serving ML models, ensuring ease of use while maintaining production readiness. It bridges the gap between simple tools like Python web frameworks and complex custom tooling or specialized systems, providing a reliable and scalable platform for ML deployment.

Moving on. In order to start the process of deployment, we first start by creating a cluster. Basically, we begin by creating and configuring a Ray cluster using the CodeFlare SDK. A brief definition of CodeFlare: it is a framework to simplify the integration, scaling, and acceleration of complex, multi-step analytics and machine learning pipelines on the cloud. Before we move forward, I would also like to give a special shout-out to our team members, Michael Clifford and Alec Alanson, who gave a talk at Open Source Summit North America about CodeFlare. I would encourage every one of you to have a look at it for a deeper understanding of this framework.

So let's jump into the main point. Once we configure our Ray cluster here using the CodeFlare SDK, let's look at the different parameters. We have the name of the cluster, we have the namespace, we have the number of worker nodes as two, the minimum number of CPUs assigned as two, which means each worker node has two CPUs, the memory as eight per worker, and the number of GPUs as one for the Flan T5 model. We also have InstaScale set to true, which means the worker nodes will be adjusted according to the workload. This process will create a YAML file with the Kubernetes resources that are deployed on the cluster. So once the cluster is up and running, which takes some time, we will install the additional libraries that are required for model serving, like transformers and datasets.
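For context, here is a rough sketch of what that cluster configuration step might look like with the CodeFlare SDK. The cluster name, namespace, and exact keyword arguments are assumptions on my part; parameter names have changed across CodeFlare SDK releases, so check the version you have installed.

```python
# Rough sketch of configuring and starting a Ray cluster with the CodeFlare SDK.
# Names and keyword arguments are illustrative and may differ between SDK versions.
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="llm-serving",          # hypothetical cluster name
    namespace="opendatahub",     # hypothetical namespace
    num_workers=2,               # two worker nodes
    min_cpus=2, max_cpus=2,      # two CPUs per worker node
    min_memory=8, max_memory=8,  # 8 GB of memory per worker node
    num_gpus=1,                  # one GPU per worker for the Flan T5 model
    instascale=True,             # let InstaScale adjust nodes to the workload
))

cluster.up()          # generates the Kubernetes resources (YAML) and applies them
cluster.wait_ready()  # block until the Ray cluster is up and running
print(cluster.cluster_uri())  # address used later when initializing Ray
```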
So here, once the cluster is up, you can check the cluster address through the Ray cluster URI here. Now we install the additional libraries in the runtime environment, like transformers and datasets, which will be used in the LLM preprocessing. We define the runtime environment, and then we initialize Ray with the address of our cluster and the runtime environment; it takes some time to initialize.

Once Ray is initialized, we move on to defining our deployment class. To deploy a simple service like text generation, we need the serve.deployment decorator, which takes in two arguments. The first is the number of replicas; here we have kept it at two, which means there will be two copies of the service at any point in time, which can improve performance and scalability. The second is the Ray actor options, which specify that the service should be deployed with one GPU; this is necessary for the Flan T5 model to run efficiently. We then define the text generation class with the model pipeline and the model query.

In order to start the Ray Serve application, we run the serve.run command, which takes in the deployment class and the host. This serve.run command will give you the Ray Serve handle, which has the deployment name. Using this deployment name, you can get the endpoint for this particular deployment. So once you provide the deployment handle name here, you get the name of the deployment and the route to the endpoint; as you can see, the endpoint is at the root route.

Taking that endpoint, we send a request to it and provide a query. In this case, as an example, we provided the query "Open Source Summit is the". This is text generation, or text completion, so what the model does is complete the text. Our query is "Open Source Summit is the" and it completes it as something like "Open Source Summit, the annual conference for the open source community. It is the largest gathering of..." So that is the output from the Flan T5 model.

Here is what I'm trying to say: for our demo, we deployed a model with 783 million parameters on a single GPU, and this is the outcome we achieved. To give you some context, the number of parameters for OpenAI GPT-3.5 is around 170 billion. The size of the base model improves the response, but it requires proportional compute resources; if you want to run a higher-parameter model, you need correspondingly more resources. With open source models, we are currently deploying models like Llama 2 with seven billion and 13 billion parameters. The performance is better compared to one-billion-parameter models, but whether they can perform on par with 100-billion-scale models is an open area of research.
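Putting those notebook steps together, here is a condensed sketch of that Ray Serve deployment. The checkpoint name (google/flan-t5-large, the 783-million-parameter variant), the runtime environment contents, and the request payload are assumptions on my part, and the exact Serve API details can vary between Ray versions.

```python
# Condensed sketch of the Ray Serve deployment walked through above.
# Checkpoint, cluster address, and payload shape are illustrative assumptions.
import ray
from ray import serve
from starlette.requests import Request
from transformers import pipeline

# Connect to the remote Ray cluster and ship the libraries the workers need.
ray.init(
    address="ray://<cluster-head-address>:10001",  # taken from the cluster URI above
    runtime_env={"pip": ["transformers", "datasets", "torch"]},
)

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class TextGenerator:
    def __init__(self):
        # Load the text-to-text generation pipeline once per replica.
        self.pipe = pipeline("text2text-generation", model="google/flan-t5-large", device=0)

    async def __call__(self, request: Request) -> str:
        payload = await request.json()
        return self.pipe(payload["text"], max_new_tokens=100)[0]["generated_text"]

# Start the application; Serve exposes it over HTTP at the root route by default.
serve.run(TextGenerator.bind())

# Querying the endpoint might then look like:
#   import requests
#   requests.post("http://<serve-endpoint>:8000/", json={"text": "Open Source Summit is the"}).text
```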
So having said that, I'll move back to my slides, where I described a bit about deployment. So now you have deployed your model and you've got your output. Now what? When your language model is up and running, it is time to evaluate its performance within the context of your specific use case. To gauge the effectiveness of your document retrieval process, which we talked about in the earlier part of our presentation, a metric known as normalized discounted cumulative gain, or NDCG, is helpful. We saw in the demo of the document retrieval process that the list of sources was ranked one to five; this particular metric assesses the quality of that ranked list. For text generation, if you have a reference answer, then you can evaluate your model responses against that reference answer using automatic evaluation metrics like BLEU, ROUGE score, et cetera. For time efficiency, metrics like throughput and inference time are used. In the case where we do not have a reference response, human evaluation comes into the picture, which I'll talk about in detail on the next slide.

So why human feedback? Here we discuss why we require human feedback. Some of the reasons: for a domain-specific model, a domain expert needs to check the relevance of the response. When we lack an original or ground-truth response, we need human feedback. And sometimes the response of the LLM might be correct, but it might not be helpful to the users, and in order to evaluate that as well, we need human feedback. Here is a cited result which shows the increase in performance for different model sizes after fine-tuning with human feedback. However, keep in mind that it does come with its own resource cost for fine-tuning.

The next thing is how you collect the human feedback. Once you have your model generating responses, they can be reviewed by human evaluators, and we can ask questions like: rate the response on a scale of one to five, re-rank the different responses, or edit the responses. There are two ways to include feedback in your application. Number one, connect your application with an open source data annotation tool, such as Argilla. A snapshot is shown on the slide: it shows the prompt and response along with the UI for ranking the response. We deployed an instance of Argilla on our Kubernetes cluster, and whenever we have an interaction where a prompt is answered, a version of that response is recorded in Argilla for feedback collection. The subject matter expert can then go to the UI and give feedback through the tool. Another way of collecting feedback is to build feedback questions into the application itself, and that would involve changing the application UI and also connecting to databases, et cetera. Our next step is to take the feedback we are collecting and use it to automatically refine the application.
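To illustrate the first approach, here is a minimal sketch of how an answered prompt might be logged to Argilla so that a subject matter expert can review and rank it later. The server URL, API key, dataset name, and record type are assumptions on my part, and this uses the Argilla 1.x client API, which differs from later releases.

```python
# Rough sketch of recording an answered prompt in Argilla for feedback collection.
# URL, API key, and dataset name are placeholders, not the talk's actual setup.
import argilla as rg

rg.init(
    api_url="https://argilla.example.com",  # Argilla instance on the Kubernetes cluster (hypothetical URL)
    api_key="team.apikey",
)

record = rg.Text2TextRecord(
    text="What exactly am I responsible for, and what are Red Hat and AWS responsible for?",
    prediction=["ROSA follows a shared responsibility model between Red Hat, AWS, and the customer."],
    metadata={"source": "rosa-docs-qa"},
)

# Each answered prompt in the application gets logged like this; a subject matter
# expert can then rank or edit the response in the Argilla UI.
rg.log(records=record, name="rosa-qa-feedback")
```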
Now, to conclude this talk, I'll hand it over to my colleague Shrey for the conclusions. Thank you.

So, just to summarize all the key points that would be required if you want to build such an application. The first, most important thing is that you have to be clear about the problem statement, or you need to know what the desired output is. In our case, it was answers about the ROSA product documentation. If you're building some other application, like a summarization or text classification task, then you need to fix that output before you start. The second important thing is that you need to nail down the input data sets. It could mean that for your application the data sets are located in different sources, like text files or databases, and you need to connect all of them to your application.

The third thing is to choose a base LLM. Like Surya showed, you can go with different families of models, and the decision to select one depends on the cost: as you increase the size of your base model, you need bigger GPUs, and that means you have to increase the budget of the application. That does come with performance improvements, but there's a trade-off. Then you have to consider the licenses: if you're deploying a commercial application, you can only use models whose licenses allow commercial use. You also need to look at the performance of the LLM based on benchmarks, and whether it's giving ethical responses.

Once you've selected the base model, you have to look at adapting it to your domain-specific task. In our case, we did prompt engineering, we did fine-tuning with our data set of prompts and responses, and now we are working on the feedback approaches. There could be different combinations that work in different cases. And once you know how you're adapting it, you have to fix the components. An example of that would be vector databases, and you can experiment with some of the open source ones out there. There's also the choice of frameworks: there's LangChain, there's LlamaIndex. These frameworks allow you to connect with base LLMs and also index your input documents. So based on the features that suit your application, these components can be selected.

Finally, you have to look at the deployment parameters. For the runtime, like Surya showed, there are many frameworks available; we are experimenting with Hugging Face's TGI, Text Generation Inference, and also Ray Serve, which we showed in the demo. For evaluation, you need to have the feedback mechanism in place and some of the metrics that we showed. So on our journey of creating this application, we found these points addressed some of the challenges that we faced, and hopefully that was useful. Thank you for your attention. I can answer any questions you have. Yes?

I have a question. If you have, oh, thank you. So if you're in a situation where you have a member of staff who is harangued by questions, and they have a set of policy documents that they're answering questions from, but it's not a super large dataset, between a small set of policy documents and maybe a restricted set of the DM conversations that they have, would that be a reasonable application of this?

So you're saying, if there's sensitive data, can we use such an application?

Yeah, just one person. So it's not a massive dataset, but they really need to be botified, turned into a robot that would automatically answer the questions that a sizable community asks them about. Would this be reasonable?

Yeah, so like the vector database that I showed in the beginning: you take whatever the size of your documents is, and then you use the model to find the relevant context for a query, and it doesn't matter if your dataset is huge or small. If your data is sensitive, this becomes even more relevant, because you are deploying your own large language model, so you're not worried about sending that data to any other server. And when I'm saying you're deploying it on a Kubernetes cluster, that cluster can just be on your laptop. So yeah, the vector databases cover the small part, and the local deployment covers the sensitive nature of it.

Thank you.

Okay then, so we're gonna hang around. If you have any more questions, we can have a chat. Thank you all.