Welcome, everyone, and thank you for coming to my talk. I'm really excited that all of you are here, and I can't wait to take you on a journey through large language models. You've probably noticed that in recent times there has been an undeniable buzz surrounding large language models, or LLMs. Perhaps you've encountered discussions, articles, or even experienced firsthand the remarkable capabilities these models have to offer. Yet despite all this excitement, many of us find ourselves hesitant to dive into the world of LLMs, held back by the perceived complexity of deploying and using these powerful tools. So let's step into the realm of large language models and look at hosting your own models using Kubernetes, so that you can harness the full potential of LLMs within your own projects and workflows.

Before we get into the world of LLMs, I'll take a quick moment to introduce myself and my co-speaker. My name is Akanksha Dughal. I am a senior data scientist in the Office of the CTO at Red Hat, on the emerging technologies data science team. My co-speaker, Hema Veeradhi, is also a senior data scientist on my team, but unfortunately couldn't make it here today.

In this talk we are going to cover what LLMs are and where they come from. We'll also go over some open-source LLMs you can get started with, along with the commercial, closed-source LLMs you should be careful about when using. Then we'll go over the various steps involved in building a large language model application, then the concept of self-hosting your large language models, followed by a demo that helps you set up your own LLM using Kubernetes.

Before diving into LLMs, let's take a step back and address language models first. A language model is a type of machine learning model that is trained to model a probability distribution over words. Simply put, it tries to predict the most appropriate next word to fill in a blank in a sentence or phrase, based on the context of the previously given text, just like the suggestions you see when you're writing an email or a text message. The job of a language model is to approximate a function that fits your input data. If the input data is one-dimensional, like on the left of the slide, a simple linear function can fit it. But if your input data is natural language or an image, we need more advanced architectures to approximate that function.

There are various kinds of language models that can be put to use. Statistical language models use statistical patterns in the data to make predictions about the likelihood of specific sequences of words; probabilistic language models built on n-gram probabilities are one example. Then come neural language models, which use neural networks to predict the likelihood of a sequence of words and are trained on much larger corpora of data. Then come large language models. An LLM is just a larger version of a language model, and a language model is the more generic term: just like all squares are rectangles but not all rectangles are squares, all LLMs are language models, but not all language models are LLMs.
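To make the n-gram idea concrete, here is a minimal sketch of a bigram statistical language model in Python; the toy corpus and the predict_next helper are purely illustrative and are not part of anything shown in the talk.

    from collections import Counter, defaultdict

    # Toy corpus; a real statistical language model would use a far larger one.
    corpus = "the cat sat on the mat . the cat ate the fish .".split()

    # Count bigram frequencies: how often each word follows a given word.
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def predict_next(word):
        # Return the most probable next word and its estimated probability.
        counts = bigrams[word]
        best, freq = counts.most_common(1)[0]
        return best, freq / sum(counts.values())

    print(predict_next("the"))  # ('cat', 0.5) on this toy corpus

Neural language models and LLMs do the same basic job of predicting the next token, just with learned functions instead of raw counts.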
So what makes LLMs special compared to other language models? There are two key differentiators. The first is quantitative: what distinguishes an LLM from an ordinary language model is sheer scale. Current large language models are on the order of 10 to 100 billion parameters, trained on huge amounts of data. The second is qualitative: something remarkable happens when a model gets to this size. It exhibits properties like zero-shot learning, and these properties only appear once the model has reached a significantly large size. Back in the day, a model needed to be explicitly trained on a task to achieve good performance, and that typically required somewhere between 1,000 and a million pre-labeled training examples. For example, if you wanted to do translation and sentiment analysis, each of those tasks would require its own labeled examples. LLMs today, however, can do all of this without any explicit training.

Technically, a large language model is a type of AI that is trained on a massive dataset of text and code. This allows the model to learn the statistical relationships between words and phrases, which in turn allows it to generate text, translate languages, write creative content, answer your questions, and much more.

Let's talk a little about the open-source and closed-source models that are available for use. Some commonly used open-source models are Llama 2 by Meta, Falcon, and Mistral. There are also well-known closed-source models, like GPT-4, Gemini, and Claude. In a recent evaluation conducted by Radboud University, several instruction-tuned text generators, including Llama 2, underwent scrutiny regarding their open-source claims. The study comprehensively assessed availability, documentation quality, and access methods, and ranked the models on the basis of their openness; Meta did not quite fit into the openness category. I've also put a link on the slide where you can take a look at the open-source models that are available for commercial use, in order to make sure you're not using somebody else's proprietary model and the proprietary data associated with it. As these large language models become increasingly sophisticated, there is a growing emphasis on democratizing access to them. Open-source models in particular are playing a pivotal role in this democratization, offering researchers, developers, and enthusiasts the opportunity to delve into their intricacies, fine-tune these models for specific tasks, and even build on top of them.

Based on the technical knowledge and the computational resources available to you, there are three levels of utilizing these large language models. The first is prompt engineering. Using a large language model out of the box, without changing any model parameters, is the most accessible way to get started with LLMs, both technically and economically, and it requires very little knowledge to get going. There are two ways to approach prompt engineering. The first is the easy way, the most common, and I'm pretty sure everybody in this room has used it: ChatGPT, or any LLM chat UI, is the most convenient tool to get started with large language models. Tools like ChatGPT have no cost and no code requirement, and they're pretty intuitive; it doesn't get any easier than that. But they do come with a lack of functionality: you cannot customize the model by changing any model parameters or by bringing in your own dataset.
You cannot change the temperature or the maximum token length of these models, which are the values you need in order to modulate an LLM's outputs. The second way is to interact with the large language model directly, in a programmatic fashion. Since there are drawbacks we cannot address through the UI, we can also access these models through their APIs, from a Jupyter notebook or a Python script. We will also go over an example later where you self-host a model and do prompt tuning, but for now I'm just going to talk about how you can access the public APIs. Again, I'll continue with the example of OpenAI, which also offers an API with an associated API key. There may be some cost associated with it, but it's still a good, beginner-friendly way to get started with that model. While this is less convenient than the UI and does carry some cost, it provides you with a customizable and flexible way to get started with LLMs.

Have you noticed that when you tune your prompts a little in ChatGPT, or when you're interacting with an LLM programmatically, phrases like "let's think step by step" or "let's take a deep breath and work on this problem slowly" change the results? A third one that has produced some amazing results is telling the model "this is very important to my career"; that one has shown a 5 to 20 percent increase in the model's performance. This is a form of prompt engineering where you tell the LLM how important a particular task is to you, and the output gets significantly better.
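As a small sketch of that programmatic access, here is roughly what a call through the OpenAI Python client looks like. The model name and prompt are placeholders, and it assumes the openai package is installed and an OPENAI_API_KEY is set in your environment.

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "user",
             "content": "Summarize what a large language model is in two sentences."},
        ],
        temperature=0.2,  # the kind of parameter you cannot change in the chat UI
        max_tokens=100,
    )

    print(response.choices[0].message.content)

Unlike the chat UI, this route lets you set the temperature, cap the token length, and wire the model into your own notebooks and scripts.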
The second level of using a large language model is fine-tuning. Fine-tuning is the process of continuing the training of a pre-trained large language model on a specific dataset, which I'll define as taking an existing pre-trained large language model and tweaking at least one model parameter. In the context of large language models, and neural networks in general, the model parameters are the trainable variables that the model learns during the training process, like weights and biases. These parameters are adjusted iteratively through optimization algorithms like stochastic gradient descent in order to minimize the difference between the model's predictions and the ground truth in the dataset. This is a powerful approach to model development because it requires a relatively small number of examples, and the computational resources required are much smaller than those needed to build your own large language model. It can also produce exceptional model performance for a small company or a smaller project.

The third and final way is to build your own large language model. In terms of model parameters, this is where you come up with all the model parameters from scratch. As I mentioned earlier, an LLM is primarily a product of its training data, so for some applications it may be necessary to curate a custom, high-quality text corpus for model training. For example, if you were building a clinical application, you would need to assemble a medical research corpus with specific medical data. The biggest upside of this approach is that you can fully customize the LLM for your particular use case; it provides the ultimate flexibility. However, as is often the case, that flexibility comes at the cost of convenience. Since an LLM's performance is mostly dependent on scale, building an LLM from scratch requires not only a lot of computational resources but also a certain amount of technical expertise. It's not a project that can be put together over a weekend; it takes months and months of effort to build your own large language model from scratch. So, having covered all the levels of working with large language models, for today's talk I will focus mainly on the first two: prompt engineering and fine-tuning.

To build your own large language model application, these are the steps we have to follow. The initial step is crucial, as it sets the foundation for your LLM application: take the time to clearly define the kind of problem you want to address. We have to answer questions like: what tasks do we want our LLM to do? Who is the target audience? What are their needs and pain points? We also have to conduct thorough research and gather insights to ensure the LLM application aligns with real-world challenges and opportunities.

Once we've identified the problem and outlined the goals, it's time to start building the application. Begin by selecting the appropriate LLM for the task. There are plenty of platforms where you can find large language models for a particular task; I will go over that shortly, but for example, Hugging Face has tons of open-source models you can get started with. After you have an appropriate LLM for your task, consider factors like model size, performance, compatibility with your problem domain, and so on. Next, customize the large language model to suit your needs. This may involve prompt tuning, fine-tuning the model on a specific dataset, or adjusting parameters to optimize performance. Additionally, you have to set up an architecture for your application that ensures it integrates properly with your other systems.

Once the model is deployed, it's time to put it to the test. Implement evaluation mechanisms to assess the performance and effectiveness of your application; this may involve metrics such as accuracy, speed, and user satisfaction, and you can use LangChain and many other scoring and evaluation tools to test your large language model. You can also gather feedback from users and stakeholders to identify areas for improvement and iteration. Continuous iteration based on feedback is the key to refining and enhancing the capabilities of your large language model.
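One very simple way to start on that evaluation step is an exact-match accuracy check over a handful of prompt and expected-answer pairs. The sketch below is hand-rolled and hypothetical: ask_llm is a stub standing in for whatever model call you actually use, and real projects would layer richer metrics or an evaluation framework on top.

    # Tiny evaluation sketch: exact-match accuracy over a few test cases.
    test_cases = [
        ("What is the capital of France? Answer with one word.", "Paris"),
        ("What is 2 + 2? Answer with one number.", "4"),
    ]

    def ask_llm(prompt: str) -> str:
        # Stub: replace with a real call to your hosted or self-hosted model.
        return "Paris" if "France" in prompt else "4"

    def evaluate(cases):
        correct = sum(
            1 for prompt, expected in cases
            if ask_llm(prompt).strip().lower() == expected.lower()
        )
        return correct / len(cases)

    print(f"exact-match accuracy: {evaluate(test_cases):.0%}")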
So, to answer the question: to self-host or not to self-host? In this rapidly evolving landscape of large language models, the question of whether to self-host or opt for a proprietary solution has gained considerable significance over time. The decision requires a comprehensive evaluation of factors such as customization needs, scalability requirements, regulatory compliance, and, most importantly, technical expertise. Self-hosting offers unparalleled customization options and greater control over your data, providing organizations the flexibility to tailor their solutions to specific requirements. On the other hand, proprietary off-the-shelf options offer convenience but may present limitations in customization and long-term flexibility. Moreover, self-hosting ensures optimal performance and reduced latency, which is crucial for applications that require real-time responses. In contrast, relying on an external API such as OpenAI's introduces dependencies on external servers, potentially leading to service disruptions and operational challenges. Long-term cost considerations also favor self-hosting, as it eliminates the recurring subscription costs associated with third-party providers. However, it's also essential to recognize that self-hosting requires an upfront investment in infrastructure and expertise; if you're only doing a small project or working in a small team, you may not want to commit that many resources. Additionally, self-hosting promotes flexibility and adaptability, enabling seamless integration of updates and advancements, whereas changes to an external API may disrupt workflows and require adjustments, creating uncertainty and inefficiency.

In short, if you want to build a quick prototype and test a hypothesis with a large language model, sure, the best solution is to use OpenAI's API. But if privacy is one of your major concerns, then you should opt for self-hosted LLMs. Even OpenAI's terms of use mention, and I quote, that they "may use content from services other than our API to help develop and improve our services". This means that anything you send to ChatGPT itself, as opposed to their API, may be included in their training data; despite their data-anonymization efforts, it still contributes to the knowledge of the model. In conclusion, if you deal with sensitive data or privacy is the most important thing for you, I would definitely suggest using self-hosted models. You are in 100 percent control of your model, and you are the only party responsible for system uptime; there are no failures of external services and no API quota limits.

So now let's talk about self-hosting containerized large language models. First of all, given the kind of problem you're trying to solve, you select a large language model from Hugging Face. Hugging Face is a platform with over 350,000 models, 75,000 datasets, and 150,000 demo applications, all open source and publicly available to use; it's an online platform where machine learning folks can collaborate easily. Once we have chosen the model for our application, we containerize this large language model locally on our laptop using a container engine like Podman or Docker. Once the model is containerized, we can use the container image along with FastAPI to serve the model. Additionally, this container image can be pulled into self-hosted clusters or into any Kubernetes environment, and if you want to run it locally, you can keep using container engines like Podman or Docker.
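For a rough idea of that containerize-and-serve pattern, here is a minimal sketch of a Python model server that puts FastAPI in front of a Hugging Face pipeline. It is illustrative rather than the demo's actual code; the model name, route, and request shape are assumptions.

    # Minimal sketch of serving a Hugging Face model behind FastAPI.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="distilgpt2")  # small placeholder model

    class Prompt(BaseModel):
        text: str
        max_new_tokens: int = 50

    @app.post("/generate")
    def generate(prompt: Prompt):
        output = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
        return {"generated_text": output[0]["generated_text"]}

    # Run locally (assuming this file is saved as server.py) with:
    #   uvicorn server:app --host 0.0.0.0 --port 8000
    # The same command can become the entrypoint of your container image, which
    # you can then run with Podman or Docker or deploy to a Kubernetes cluster.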
Now let's take a look at the demo we've put together. It covers a use case that converts speech to text by self-hosting your large language model. This is a demo from my colleague Hema, who couldn't be here today, so I'm going to go ahead and play it.

My name is Hema Veeradhi, and I'm also working as a senior data scientist along with Akanksha on the emerging technologies data science team at Red Hat. Unfortunately, I couldn't be here in person at KubeCon today. However, I do have a small demo for you all; I hope you enjoy it and give it a try once you're free after the conference. Akanksha has already walked us through what an LLM is, how you set one up, and how we can have all of this running on our local machines and laptops. As most of us are aware, the large language models we've all seen or interacted with are quite large and compute-intensive, so it's nearly impossible to spin one up and try it out on a laptop because of the resource constraints we have. However, if you're completely new to this generative AI space, whether you're a developer or just a new user trying to get familiar with large language models, and you want to test out some of these models to see how they run, how they work, how you can interact with them, how well they're doing, and to do some development locally, then an easy solution is to set it up yourself in a containerized fashion, running on whatever resources you have on your laptop. That's precisely what this demo is going to do. We're going to use the concepts of containerization and Kubernetes, and since most of you are here today at KubeCon, I'm assuming all of you have used containerization techniques at some point or are at least familiar with the concept. That's what this demo is all about: how we can use containerization techniques to wrap up these LLMs and have them running on our local machines.

The example LLM use case we're looking at today is speech recognition. What we want to do is take non-English speech and produce the translation as text in English. Luckily for us, there are already a bunch of LLMs that do a pretty good job at speech recognition and language translation, and for today's demo we're going to use the OpenAI Whisper model. The Whisper model was developed by OpenAI and has been licensed under Apache, which is a good sign that you can use it for your own open-source projects and the contributions you'd like to build on top of it. The Whisper model itself comes in a bunch of different flavors, in the sense that there are different sizes and variations of the model you can use. The nice thing about this is that since we are testing locally on a laptop, we can choose a small or more compressed model rather than the full base model, which requires tons of compute resources to run. These compressed and optimized models are just model binaries of about 500 MB up to around two to four gigabytes in size, which can easily be installed on your local machine or laptop.

So with that, let's get started. To begin, as I said, we need to wrap this model and have it served through a container. To do that, we first write a container file, which is nothing but a specification of all the requirements we need to safely download, along with all the dependencies and packages needed to run the Whisper model we're working with. Here the base image I'm going to use is a UBI image, and then we have a particular GitHub repository where all of the Whisper code has been developed; that's the repo we're going to clone inside our container.
Since we're dealing with audio files here, we also need this particular ffmpeg package installed inside the container. ffmpeg is an audio formatting and processing package, and note that Whisper requires your audio files to be in a 16-bit WAV format. It does not accept MP3 files and other such formats; you have to convert them into that specified WAV format before processing, and ffmpeg is the tool you can use to convert your audio files into that suitable format. Lastly, this is how we want to run the container itself. We have a bash script here where we specify that we want to serve this particular Whisper model; we're using the small Whisper model that I mentioned, and I'm downloading its binary into this models directory. The binaries of these Whisper models can in fact be found on Hugging Face, and in the repository that Akanksha will share with you all later, there is a link where you can go and fetch those binaries. Using that URL, I've downloaded the binary and copied it into the models directory inside my repository. And finally, this is the host and the port where I want my model to be served, and the inference endpoint I want to interact with once we've uploaded our audio files. That's what this entrypoint is doing.

So now that we know what the container file looks like, let's go ahead and build it. Before that, I also want to see what other images I've already built in the past. Here are some similar container images that I built just a couple of hours ago to test whether this is working. We're just going to build a completely new one here, so I'm going to say podman build and give it a name, let's call it whisper-kubecon-demo, and ask it to build from this particular directory. I'm using Podman here, but you're free to use any other container tool like Docker; it's just a preference you can play around with.

So now that we have the image under the name whisper-kubecon-demo, we actually need to go ahead and run it. To run it, we have this particular command, where I'm exposing the port that the model server needs in order to be up and running, then passing the image name, whisper-kubecon-demo, along with the other argument we need to pass: the location of your model binary file. That's the path I've specified over here. Once all of that is correct, you can see that the Whisper model server is running and listening at this particular address and port.

Awesome. So now that we have it up and running, how do we pass it the audio files? There are two ways you can do this. One, you can send a simple curl request that makes a POST request to this endpoint, and it gives you back the output as a long JSON-formatted text in your console, right in the terminal.
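For reference, that same request can be scripted from Python. The sketch below converts an audio file into the WAV format Whisper expects and then POSTs it to the self-hosted server; the host, port, endpoint path, and form field name are assumptions (they follow whisper.cpp's example server), so adjust them to whatever your container actually exposes.

    import subprocess
    import requests

    # Convert MP3 to 16-bit PCM WAV (16 kHz, mono) with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "speech.mp3",
         "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", "speech.wav"],
        check=True,
    )

    # POST the converted file to the model server's inference endpoint.
    with open("speech.wav", "rb") as f:
        response = requests.post(
            "http://127.0.0.1:8080/inference",  # assumed host, port, and path
            files={"file": ("speech.wav", f, "audio/wav")},
        )

    print(response.json())  # the transcription or translation comes back as JSON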
But if you want an easier and more user-interactive way, you can spin up a very simple UI application. To do this, there are a couple of tools that are popularly used for many of these LLM use cases, and one of them is Streamlit, which is what we're using today. Streamlit just requires you to write the code for how you want the UI to look in simple Python; there's a Streamlit package supported in Python, so it's fairly easy to put that script together. This is what my Streamlit Python script looks like: I'm importing the required Streamlit packages, laying out a very bare-bones, simple UI, adding an uploader so users can drag and drop whatever audio files they would like to have translated, and finally sending that audio file as a POST request to the model-serving endpoint that is up and running.

So now let's go ahead and actually run this Python script. To do that, we say streamlit run and give it the Python file, and now we can see that Streamlit is up and running and the UI is available at this particular localhost URL. Let's go ahead and see what that looks like. This is what the Streamlit UI looks like, very simple, and this is the URL at which it's getting rendered at the moment. Here you can see there's a drag-and-drop option to upload your audio file, and one thing to note is that Whisper does require your audio file to be in a 16-bit WAV format, so we do need to convert MP3 and other formats into that required WAV format for Whisper to process it. Once we have it in that particular format, we can go ahead and upload it.

Here I'm going to upload a French audio file, since we are here at KubeCon in Paris. Let's give it a very simple French clip, just about seven seconds long. If I play this audio: "Monsieur Gerbois, professeur de mathématiques au lycée de Versailles, dénicha dans le fouillis d'un marchand de bric-à-brac..." (roughly, "Mr. Gerbois, a mathematics teacher at the lycée in Versailles, unearthed in a bric-a-brac dealer's jumble..."). So that was the audio. I do not know French, but I'm relying on the Whisper model to give me an accurate translation. This is the translation it gave, and there you go. That's pretty much how this application works. You can play around with it by uploading other languages supported by Whisper, like Japanese, Italian, or Spanish; it supports quite a few languages at the moment. And yeah, I hope you enjoyed this demo, and I hope you all enjoy the rest of the conference as well. Thank you.
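For reference, a bare-bones Streamlit script along the lines Hema describes might look roughly like this. It is a sketch rather than the demo's exact code; the endpoint URL and the file field name are assumptions to be matched to your own model server.

    import requests
    import streamlit as st

    st.title("Speech to text with a self-hosted Whisper model")

    uploaded = st.file_uploader("Drag and drop a 16-bit WAV audio file", type=["wav"])

    if uploaded is not None:
        st.audio(uploaded)  # let the user play back the uploaded clip
        response = requests.post(
            "http://127.0.0.1:8080/inference",  # assumed model server endpoint
            files={"file": (uploaded.name, uploaded.getvalue(), "audio/wav")},
        )
        st.subheader("Transcription / translation")
        st.write(response.json())

You would save this as something like app.py and launch it with streamlit run app.py, exactly as shown in the demo.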
Well, thank you, and I'm glad that you all liked the demo. I do have a couple more slides left, so I will quickly go over our future steps in the field of large language models. The first and most important is an enhanced developer experience, enabling even non-data-scientists to follow a simple workflow for setting up and interacting with their own large language models via microservices. We also want to implement a seamless workflow for transitioning from a local dev setup to a production-grade environment. And finally, we want to provide end-to-end tooling and frameworks for setting up large language models locally for various applications such as text generation, code generation, document search, RAG applications, and much more.

These are the resources we went over in the presentation. We have our slides posted on Sched and in our GitHub repository, and the most important resource is the GitHub repository itself: all the code there is updated and running, so you just have to follow along with the README and you should be able to recreate the same demo Hema just demonstrated. We've also included the link for the Hugging Face model that was required for the demo. And some of our colleagues have worked on an interesting project running Whisper at the edge using MicroShift, so if you're more experienced with that kind of setup, you can definitely explore their demo as well.

Thank you so much, everyone, for sitting through and coming to my talk. I do appreciate feedback and questions, so do reach out to us and open issues on the repository if you're facing any trouble; we would be happy to help you out. Thank you so much. No questions? Please feel free to meet me at the Red Hat booth if you have any later or would like to talk more about this.