Hi, my name is Ken, and in this webinar we're going to talk about one of the most popular topics right now: working with AI models and GPTs. I'm going to focus on the second part of working with these: how do we get them up and running as part of our applications and rolled out into production? There are a lot of challenges in getting through this process, and I'm going to focus primarily on the end of it. I've got some models, I want to get them running in production. How do I do that, and how can I figure out whether things are working well?

You may have noticed that in March 2024 a great paper came out from the CNCF about cloud native AI and all of the capabilities that are part of the CNCF ecosystem. I highly recommend that you take a look at it; there are a ton of great resources in there. I want to riff off some of the things in that paper, especially, like I said, the end part: now I've got some ideas and I want to get them up and running in production. And I'll share an open source example of how I do this as part of the process.

A little bit about me: I'm one of the co-founders and CEO of SpeedScale. We're a CNCF member, we've presented at KubeCon, and we've attended KubeCon for a number of years. We primarily work with customers who have workloads in Kubernetes environments and are trying to figure out how to improve the performance of their own code. You might say, hey Ken, what are your own personal qualifications? Well, I consider myself a prompt engineer, kind of like everyone nowadays. I've familiarized myself with a number of the different AI tools. You can see here Perplexity, where, by the way, they actually have a lab you can use to go and test out a bunch of different models.
ChatGPT from OpenAI is pretty well known as one of the first really easy-to-use models; you can go and start asking it questions and that kind of thing. There are also tools like Anthropic's and Google's models. There are a ton of different options, and a lot of these offer a free tier, so you can sign up and easily start working with them. But in all seriousness, I've spent most of my career in the application performance area, helping companies figure out: why is this running slow? Where are the slow parts? How can we improve it? How can we load test and validate that things are running better?

One of the things I definitely recommend you drill into and do some research on is Hugging Face. They have free accounts you can sign up for, and you'll find really quickly that there are hundreds of thousands of models already out there. This is a screenshot from the leaderboard, where they rank some of the best models for a variety of different tasks. You can see how fast this space is moving: just in the past six to eight months, scores have improved from 50% to over 80%, getting close to human-level performance. Obviously, you want to take advantage of this and ask: what gen AI capabilities can I add to my own application? I recommend taking a look at some of these models. These were the trending models in April 2024 when I recorded this webinar, but the list is always changing over time, so you want flexibility in how you work with them.

When you start working with AI models, get some feedback, and decide, okay, I've got some ideas, I want to add a recommendation system, I want to add some generative capabilities to my product, you're going to find there's actually a pretty long process for doing this, starting with making sure the data you give the models is really clean and doesn't contain wrong information.
Training the models is where almost all of the information you'll find about AI is focused, and I'm going to cover that only briefly. There's not as much written about model serving, which you might hear called inference, and which just means running the model and keeping it up and running. I'm going to focus a lot in this webinar on model serving. And of course, once you've got something up and running: is it running properly? Is it crashing? Is it erroring for people? There are new ways to implement monitoring and observability for these kinds of AI models.

Drilling into each of these: data preparation is super critical. If you feed garbage data into your models, they are going to give you garbage results. This applies in a couple of different areas. One is the data you use to train the model; you obviously don't want to train it on bad data, wrong responses, that kind of thing. But it also applies to the way you prompt the model, so take time to test out a bunch of different prompts. Try different variations on how you send data to it so that you get the best results. I highly recommend researching a capability called RAG (retrieval-augmented generation), where you take an off-the-shelf model that's already good at a variety of tasks and augment it with your own proprietary data. This is kind of a sweet spot: you don't have to take on the huge expense of training the model yourself. You can take something that already works really well and add your own data to it. This can also help you cut down on hallucinations and weird responses that come out of nowhere.

Model training is obviously really well known as a big challenge nowadays. Everyone in the world is fighting over the GPUs; everyone's trying to get access to them. Personally, I skipped some of this. If there are 500,000 models on Hugging Face, maybe I don't need to train my own model from scratch.
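To make the RAG idea concrete, here's a minimal sketch of the pattern in Node.js. This is illustrative only: a real system would use vector embeddings and a vector database rather than the naive word-overlap scoring shown here, and the document store and prompt template are made up for the example.

```javascript
// Hypothetical proprietary documents we want the off-the-shelf model to use.
const documents = [
  { id: 1, text: "Our return policy allows refunds within 30 days of purchase." },
  { id: 2, text: "Support hours are 9am to 5pm Eastern, Monday through Friday." },
];

// Naive relevance score: count how many query words appear in the document.
// Real RAG systems replace this with embedding similarity search.
function score(query, doc) {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  return words.filter((w) => doc.text.toLowerCase().includes(w)).length;
}

// Retrieve the best-matching document, then augment the prompt with it so the
// model answers from our own data instead of guessing (and hallucinating).
function buildRagPrompt(query) {
  const best = documents.reduce((a, b) => (score(query, a) >= score(query, b) ? a : b));
  return `Answer using only this context:\n${best.text}\n\nQuestion: ${query}`;
}

console.log(buildRagPrompt("What is the return policy?"));
```

The augmented prompt is what actually gets sent to the model, which is why this works with any off-the-shelf LLM without retraining.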
Maybe I can take someone else's model and just RAG my own data into it. This lets you get something up and running faster. I'm all about moving fast and getting an MVP going, so I'll be honest: I skipped this whole stage. For most people, you don't need to train your own model from scratch unless that's your business.

Model serving is about getting this thing up and running. You've got to figure out: how do I package it? How do I build a container? What kind of infrastructure do I need? It turns out running a model might require a GPU, but far less than training one does. You'll quickly find one of the challenges is that you have a lot of microservices calling your model, and every one of them has this new dependency. So you can use a technique called service mocking, where you record the responses that come across the API and create a mock, which will repeatedly send those same kinds of responses back. You can provide these mocks to your development teams so that they get what look like realistic responses without everyone having to run a giant model on their own machine, which isn't always feasible. I highly recommend you take a look at service mocking, and I'll try to show some examples of it as well.

Then there's monitoring and observability. You've spent all this time getting the model into production: is it crashing? Is it running really slow? You may be familiar with the SRE golden signals that came out of Google's SRE handbook: latency (how fast does the service respond?), throughput (how much traffic can it handle?), saturation (how much infrastructure is required to run it?), and errors (is it even responding correctly?). By the way, errors sometimes have really good performance; responding really fast with an error message is not that helpful. In addition, for AI models there are a couple of extra things you're going to want to track, such as whether the answers are accurate.
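The service mocking idea above can be sketched in a few lines. Tools like SpeedScale automate the recording and replay; this just illustrates the core mechanism, and the recorded payloads below are invented for the example.

```javascript
// Recorded request/response pairs, keyed by the serialized request. In a real
// tool these would be captured from live traffic against the model API.
const recordings = new Map([
  [
    JSON.stringify({ model: "mixtral", prompt: "Write a haiku about Kubernetes" }),
    { status: 200, body: { choices: [{ message: { content: "Pods drift like spring clouds..." } }] } },
  ],
]);

// Replay a recorded response if we have one; otherwise return a 404 so the
// caller knows this traffic was never captured.
function mockRespond(request) {
  const key = JSON.stringify(request);
  return recordings.get(key) ?? { status: 404, body: { error: "no recording for this request" } };
}
```

A developer's service can point at this mock instead of the real model, so local development doesn't need a GPU at all.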
You may find it returning really inaccurate answers. This is actually the most common thing you'll hear from the AI folks when they talk about performance: they're usually talking about prediction accuracy. That's great, but another metric you want to include is how many tokens are used. There's a strong correlation between more tokens and higher latency, so you're going to want to tinker with it: what's the smallest number of tokens you can use for your query so that you get good latency without breaking the bank?

I'm all about applied engineering, actually building and running these things. So what I did for myself is I designed an experiment: go get a Kubernetes cluster, stand it up, and start deploying. I selected from Hugging Face a container called TGI (Text Generation Inference), which I'll show you. It's a way to run these models, wrapped in a Docker container, so I put it in a Kubernetes manifest. You need a cluster with a GPU to run these things, and setting up your node groups takes a lot of time, so for starters I used an Autopilot cluster. That way, when a workload needs a GPU, it spins up a GPU node; the other workloads that don't can get a regular ARM or x86 node, whatever kind is required. Again, this helps you get things up and running, experiment, and test out the model so you can get feedback.

I have an open source project that I'll share here, and it's got a couple of different containers. There's a React user interface served off of Nginx, with an ingress into the cluster so you can reach it from the browser. The backend API is written in Node.js; a lot of the examples out there are in either Python or Node.js, and I like working with Node, so now I've got roughly the same Node code for my front end and my API. And then, like I said, the TGI container comes from Hugging Face.
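Tracking the two AI-specific signals discussed above, latency and token usage, can be as simple as a small wrapper around each model call. This is a hedged sketch: the stats store and field names are hypothetical, and in production you'd export these to your metrics system rather than keep them in memory.

```javascript
// In-memory stats store for the example; replace with real metrics export.
const stats = [];

// Wrap any async model call, timing it and recording the token usage the
// backend reports (OpenAI-style APIs return a `usage` object).
async function withMetrics(label, call) {
  const start = Date.now();
  const response = await call();
  stats.push({
    label,
    latencyMs: Date.now() - start,
    totalTokens: response.usage ? response.usage.total_tokens : 0,
  });
  return response;
}

// Demo with a fake model call that simulates an OpenAI-style response shape.
withMetrics("poem", async () => ({
  choices: [{ message: { content: "..." } }],
  usage: { total_tokens: 66 },
})).then(() => console.log(stats));
```

With this in place you can directly see the token-count-to-latency correlation I'll demonstrate later in the demo.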
So there's an existing container that we can take advantage of, and I'll show you how you flag the manifest to say, hey, this needs an NVIDIA GPU. Let's jump in and take a look, and I'll show you how you can run these in your own cluster.

Okay, this is the documentation page for the TGI project from Hugging Face. It's a pretty active project; you can see thousands of GitHub stars, and it's a great way to run these open source LLMs. I highly recommend it as a way to get started and get something running. In addition, by the way, it already has capabilities like the monitoring and observability we talked about, and it's also set up for server-sent events. That's where the response gets streamed back to the user roughly one token at a time, which gives you instant feedback: the first part of the response comes back fast, and then the rest arrives over time. You can also see the details on the different models available. For my own testing, I've been using Mixtral. There are so many models available here that there are plenty you'll be able to get working for your application.

As for the API, in the current versions, which is what I'm using, it exposes a v1 chat completions endpoint that has the same shape as you see from OpenAI. This makes it easy to switch different backends out, and I highly recommend that you test different backends. What happens if I use ChatGPT? What happens if I use Anthropic? What if I want to call TGI? All of these can be done without having to rewrite your code.

Speaking of the code, let me show you my GitHub repo. This is a really simple project, just to show you how to get things up and running. Here are the components that are part of it. The UI code in the UI directory is a little React GUI.
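Because TGI exposes an OpenAI-compatible chat completions endpoint, swapping backends really is just a different base URL. Here's a minimal sketch of building such a request; the base URLs and placeholder model name are illustrative assumptions, not values from my repo.

```javascript
// Build an OpenAI-style chat completions request for any compatible backend.
function buildChatRequest(baseUrl, prompt, maxTokens) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "tgi", // placeholder; TGI serves whatever model it was launched with
        messages: [{ role: "user", content: prompt }],
        max_tokens: maxTokens,
      }),
    },
  };
}

// Swapping backends is just a different base URL (hypothetical addresses):
const tgiReq = buildChatRequest("http://tgi.default.svc:8080", "Write a poem about Kubernetes", 150);
// const openaiReq = buildChatRequest("https://api.openai.com", ...); // plus an Authorization header
```

You'd pass `tgiReq.url` and `tgiReq.options` straight to `fetch`; keeping the payload construction separate from the transport is what makes the backends interchangeable.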
And it'll get built into a container. The API tier is an Express Node.js app; when it gets an API request, it goes and hits the backend TGI system. And because I want to keep this information around, I hooked a little database up to it. Right now the database is self-contained inside the container; in a future version I might break it out, but it can hold all of the responses and details like the latency of the response and how many tokens were used.

To deploy this, I've got a couple of Kubernetes manifests. You can see this is pretty simple; I'm using Kustomize, and there are just three of them. Most of it is very vanilla stuff, a Service and a Deployment, but this one is probably the more interesting: how do you run the TGI? It's pretty simple. You put in the image details. Because I'm using an Autopilot cluster, I need to define everything, the CPU, memory, and storage required, and I just say give me one GPU. Then you pass in the model that you want to use; here is the specific model that I'm running. You might need to put in your Hugging Face token, so you've got to figure that out for yourself and set up a token (by the way, you can create a free account). Then add this node selector down here if you're running in GKE, which I am, so that you make sure a GPU node gets created as part of this, and that's it. When you go and deploy this, it will spin up in your cluster.

One thing to note is that it does take a little bit of time to download the model. When you productize this, I recommend attaching persistent storage so the model doesn't have to be downloaded every time, that kind of thing. Again, this is the quick-and-dirty, fastest way to get something up and running for a proof of concept. For the other components, you'll see that I'm building containers for the API and for the UI, and I'm also building them with GitHub Actions.
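To give a feel for what that manifest looks like, here's a hedged sketch of a TGI Deployment of the kind described above. The image tag, model ID, resource numbers, secret name, environment variable, and GPU selector are all illustrative assumptions; check your own cluster and the TGI documentation for the values that fit your setup.

```yaml
# Illustrative sketch only -- not the exact manifest from my repo.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "mistralai/Mixtral-8x7B-Instruct-v0.1"]  # the model to serve
          env:
            - name: HUGGING_FACE_HUB_TOKEN   # your Hugging Face token, from a Secret
              valueFrom:
                secretKeyRef:
                  name: hf-token             # hypothetical secret name
                  key: token
          resources:
            limits:
              cpu: "4"                       # Autopilot requires explicit resources
              memory: 32Gi
              nvidia.com/gpu: "1"            # ask for one GPU
      nodeSelector:                          # GKE-specific GPU node selection
        cloud.google.com/gke-accelerator: nvidia-l4
```

The `nvidia.com/gpu` limit plus the GKE accelerator node selector is what triggers Autopilot to provision a GPU node for this workload.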
So again, you can do this on your free account. Then put a service in front of it; I'm hosting this right now on one of the domains I have, trafficreplay.com. Setting up the ingress is on you, depending on how you want to get the interface deployed.

So here's what it looks like, my little demo app. If you hit my URL, by the way, you need to log in as John Doe; the password is in my repo. If you can't find it, just drop me a note or ask me a question on the GitHub repo. You can come in and say, hey, generate a brief poem about Kubernetes, and give it a number of tokens, a hint on how many tokens to use. A small number of tokens, like I was saying earlier, should respond a little faster. Okay, it went and created my poem for me, and you can actually see here the poem's getting cut off, so 50 tokens is not enough. I told it how many tokens to use, my prompt required 16, and this took over three seconds.

So let's try another poem. We'll do this again, write a poem about Kubernetes, but we'll give it more tokens and see how this one works. Obviously it's going to take a little bit longer because I've given it more tokens. And by the way, three seconds is a pretty long response time here; this app is synchronous, so it waits for the entire response to come back. That one took 12 seconds, and it actually still ran out of tokens. The quality of the poem I will leave up to you to decide, but you can see a really direct correlation here between the number of tokens it uses and the response time. This has not been tuned for performance or anything; that's for a later stage. We've got to start by at least getting visibility into it.

So let me try one more poem about Kubernetes, and we're going to say, like, go crazy, give this a ton of tokens: 2,500. Now we can see we just got some kind of error. And this is the most common thing in the world, right?
Something happened in my environment here, and I don't know what it is, right? This is where API observability and understanding what's going on comes in. So I did go ahead and hook this up to SpeedScale, and I can see the different calls that are happening in the environment. I can see, for example, getting the list of poems; I got a 304 response, which means it hasn't changed. Here is the specific poem that was sent, exactly how the TGI model responded, how long it took, and again the tokens and that kind of thing. This is the inbound call that went to the API; here I can see the outbound call, how TGI responded, with the specific details. And I can also see the prompt that was used, very basic for this example. Then, moving up, we can see our request that took 12 seconds and the specific details about that.

And then, interestingly, there's the one that errored, where we didn't know what the error was. We can see here that the backend actually replied with a 422. This is coming from that Hugging Face TGI server: an input validation error. You're not allowed to send 2,500 tokens; that's too many. This is the kind of thing that is not always obvious to a developer, to someone working with this system: what's going on in the environment? So tracking these kinds of details is really important.

And as a developer, if I want to try to work with these APIs, again, it's kind of hard to get this up and running in my environment. Using a tool like SpeedScale, or another capability for service mocking, you can save this data and easily build a mock that will allow you to run this locally on your own machine. The mock responses include the successful responses from when the model properly returned, as well as the error conditions.
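One practical fix for the 422 we just diagnosed is to validate the token budget before calling the model and to translate backend errors into messages the user can act on, instead of the opaque failure the demo showed. This is a hedged sketch: the limit below is an assumption for illustration, since the real limit depends on how your TGI deployment is configured.

```javascript
// Hypothetical per-deployment token limit; TGI reports the real one in its
// input validation error.
const MAX_TOKENS_LIMIT = 2048;

// Reject over-budget requests before they ever reach the model backend.
function checkTokenBudget(maxTokens) {
  if (maxTokens > MAX_TOKENS_LIMIT) {
    return {
      ok: false,
      message: `Token limit is ${MAX_TOKENS_LIMIT}; you asked for ${maxTokens}. Try a smaller value.`,
    };
  }
  return { ok: true };
}

// Map backend status codes to user-facing messages rather than a generic failure.
function explainBackendError(status, body) {
  if (status === 422) return `The model rejected the request: ${body.error || "input validation error"}`;
  if (status >= 500) return "The model backend is unavailable; please retry.";
  return "Unexpected error from the model backend.";
}
```

Either way, the user sees why 2,500 tokens failed instead of a mystery error, which is exactly the gap the observability data exposed.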
And this lets developers easily work locally without having to deploy giant GPUs and spend a ton of money on infrastructure, because these mocks can be run with a single command line. Let me show you how simple this can be to set up: you just run it on your desktop like this, with a single command. I can run this locally on my own machine, hook my Node.js code up to it, and develop and iterate so that I don't show that raw server communication error or whatever it was, but instead handle it in a cleaner way and show a better response back to the user.

So this was just a quick overview of some of the capabilities in the cloud native space. Please take a look at my GitHub repo; open comments, send me issues, that kind of thing, so we can work together. I would love a chance to help out. I'm always interested in what folks are doing around cloud native AI models nowadays and getting them up and running in Kubernetes easily and quickly, and this is definitely something you can check out. So if you have any questions, feel free to reach out. Thank you very much.