So I'll be talking about how we think about productizing generative AI applications, especially from the point of view of managing load, both as a consumer of generative APIs and as a provider of them: how do we think about things like prioritization, rate limiting, and caching, and scaling these workloads?

A little bit of background about myself. I'm co-founder and CEO of a startup called FluxNinja, which is built on an open-source project called Aperture, and that's what we'll be covering in this presentation. I've spent more than a decade building tools for DevOps and SREs. My last startup was acquired by Nutanix back in 2018; it focused on observability in cloud-native applications, essentially starting from packets, reconstructing entire APIs, understanding their performance, and building dependency graphs and maps.

So let's talk about generative AI and try to understand how these workloads are fundamentally different from anything we've seen in the past, and what the real challenges are in taking them to production. I'd like to throw out an analogy here: we're really talking about mice versus elephant APIs. If you look at the latency profiles of the different models OpenAI has, for instance, you'll notice these APIs are unlike anything we've seen before. For comparison, we've been dealing with a lot of "mice" APIs, web-scale APIs: transactional search APIs, shopping cart checkouts, browsing through a catalog. There you're talking about response times of maybe 100 to 500 milliseconds; that's what's considered normal in the world of web transactions. Go a little into the OLAP world, where you're loading dashboards from, say, Apache Druid or Cassandra, and you're talking about latencies of maybe 500 milliseconds to 1.5 seconds, if I'm not wrong; a somewhat larger "cow" or "buffalo" API. Now, LLMs are unlike anything we've seen in the past: response times of several seconds to several minutes. For instance, with GPT-4, the top-of-the-line model, and this is a real-world production graph from one of our users, the average is anywhere from about five seconds to several minutes in some cases, depending on the prompt length and the completion size. Even a minute is pretty common in this world.

That is the root cause of a lot of the problems we're seeing on the provider side. Let's understand why providers like OpenAI are rate limiting so aggressively. This, interestingly, is a tweet from this morning, just an hour ago: as you know, ChatGPT Plus had been put on a waitlist for the last several weeks because of the shortage of GPUs. And it's understandable; these APIs are very hard to scale. Demand has outstripped supply, the GPUs simply aren't there, and even then these workloads are very heavy; they need far more compute than anything we've seen in the past, and the state-of-the-art models are really slow and very, very expensive.
And one of the things that happens is, once you kick off a call to these models, internally, as you know, they're a mixture of experts, so they're fanning out to other, smaller models. And if you have a RAG workflow, retrieval-augmented generation, you're also putting a lot more stress on the traditional APIs you're pulling data from. So a simple query from a user, say a flight ticket search, fans out to many more APIs; not just generative AI, even traditional APIs are being called far more than we've seen in the past.

So the reasons rate limits are being added, and are so aggressive: yes, you want to reduce the load on the infrastructure, and you want to prevent abuse by someone who accidentally makes a huge number of API calls. The third point is fairness. Everyone wants to build on AI, and providers generally want to provide fair access to all organizations regardless of size, so that it remains accessible. That's where the world is right now.

Let's take a look at OpenAI's rate limits. They have different tiers; I took a screenshot of the tier-two rate limits for business accounts. As you can see, rate limits are per model, so each model is different, and on top of that there are three or four types of rate limits. You have requests per minute: on GPT-4, for example, I can only make 5,000 requests every minute, and beyond that I'm rate limited. But the more constraining factor, more than RPM, is actually tokens per minute. As you know, GPT-4's context window is 8,000 tokens, and every minute they only allow 40,000 tokens to be sent to them across your requests; beyond that you're rate limited. And it's the same with all the other models; even the preview model isn't that generous in terms of rate limits. It's pretty striking how aggressively these models are rate limited right now. And the thing is, given the nature of these workloads, even if these limits were multiplied 10x or 20x, a lot more API calls would be made; people are just going to add AI into more and more existing workflows. So the fact is, we have to live with these rate limits. How do we work with them and think about them?

To that effect, let's talk about what it takes to productize these applications. For the rest of the presentation, I'll be talking more from the consumer standpoint: if I'm consuming these APIs, how do I guard myself, work within these rate limits, and manage them properly with prioritization and so on?

So let's first talk about a case study. This is a company called CodeRabbit, which is a code review tool on GitHub and GitLab. Each time you open a pull request, they automatically review the code in that PR, and on each commit they also review the latest changes. They use multiple LLM models to accomplish code reviews. Summarization runs on GPT-3.5 Turbo, but some of the more complex tasks, like code reviews, code verification, or chat, are powered by GPT-4. And on top of that, they have different product tiers, so it's not one type of user: they have a free tier, they have open source.
So their product is actually free for all open source customers; they offer a pretty generous open source tier. Then there's a trial; anyone who signs up gets, I think, a one-week trial. And then there are paid users. So there are different tiers, and they want to provide a different quality of service per tier; that's where I'm getting at. And then they have different kinds of workloads. Yes, they have a lot of code review workloads running in the background each time a commit happens, but a lot of the time they also have interactive chat, where their users are asking the AI questions in the context of the code, and there they expect real-time responses. So there's a lot of interactive workload running alongside batch processing and background workloads.

And interestingly, given the demand for AI and how much value it's adding to everyone's lives, their demand had outstripped supply. Thousands of repositories have signed up; they're probably reviewing something like 10,000 pull requests a day. Crazy growth. And as soon as that growth hit, they found themselves running into a lot of issues during peak load hours. Every morning around 4 to 6 a.m., when Europe is coming online and everyone in Japan is working, and they have a lot of users in Japan, they would exceed the limits, and a lot of failures would happen: reviews would get lost, and because they were completely on serverless, they even hit some serverless timeouts.

So they started figuring out how to solve this problem. Like other companies in this space, they started opening more accounts; they opened four or five OpenAI accounts just to multiply the rate limits. And I've seen companies run Azure OpenAI in different regions just to circumvent the rate limits; they'll spin up an account in each region just to get more quota. But that's just temporary relief. I've also seen people use waitlists, like we just saw; even ChatGPT Plus, which is OpenAI's own offering, had been waitlisted. And there are other companies I've heard of that were so reliant on the 32K context model, there's the new GPT-4 Turbo now, but before that the 32K model was so heavily rate limited and so slow that in order to get dedicated capacity you had to invest, I don't know, a million dollars with OpenAI. So a lot of companies were forced to put a waitlist on their product despite the demand, which is not ideal.

Then one of the other things they tried, which helped to some extent but still involved a lot of guesswork, was to limit the concurrent calls to OpenAI. They had multiple containers running these code reviews, so on each container they would cap how many calls were in flight to OpenAI at a time. That helped somewhat, but given the distributed nature of their system there was no coordination, and a lot of guesswork was required: is 10 a good number? Is 20? How many concurrent calls do I make to get maximum utilization out of the rate limits? (There's a rough sketch of that per-container approach just below.) Then, of course, they tried backoff and retry.
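Before getting into backoff and retry, here's that rough sketch of the per-container concurrency cap, just to make the guesswork concrete. This is illustrative, not CodeRabbit's actual code; the OpenAI client usage follows the current openai Python package, and the cap value is exactly the kind of number they had to guess.

```python
# Illustrative per-container concurrency cap (not CodeRabbit's actual code).
# Each container limits only its own in-flight OpenAI calls; containers do
# not coordinate with each other, so the "right" number is pure guesswork.
import asyncio
from openai import AsyncOpenAI

MAX_IN_FLIGHT = 10                  # 10? 20? there is no principled answer
_semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
client = AsyncOpenAI()              # reads OPENAI_API_KEY from the environment

async def review_chunk(messages: list[dict]) -> str:
    async with _semaphore:          # caps this container only, not the fleet
        response = await client.chat.completions.create(
            model="gpt-4",
            messages=messages,
        )
        return response.choices[0].message.content
```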
So with backoff and retry: each time you hit a rate limit, you get a Retry-After header. But the problem is that with these APIs taking 30 or 40 seconds, the headers themselves are delayed. By the time you're told to retry after 10 seconds, you're getting that information 40 seconds late; the headers are very stale, so it's very inaccurate. The other problem is prioritization. Once you get into these backoff-and-retry loops, well, not all your workloads are equally important. Some workloads are interactive, like the real-time chat, and ideally you want to prioritize those over the others. There's no way to do prioritization in this model.

So they struggled for a while, and we started working with them to solve this problem with the open source project we have, called Aperture. The core idea we brought to them was client-side rate limiting with request prioritization: track the external rate limits on the client side and work backwards from that to schedule your workloads. Just like an operating system has a scheduler, think of it as a scheduler for the OpenAI APIs.

So, a brief overview of Aperture. It's an open source platform that provides these observability-driven load management capabilities. The primary focus of the project is rate limiting, caching, and request prioritization with quotas. Think of it as load management as a service, a sidecar that can run alongside your existing stack. You can use SDKs in Python, TypeScript, and so on, and you can also use service meshes like Istio and Envoy to insert the solution alongside a proxy. So there are multiple ways to add load management capabilities into your existing stack.

And this is the high-level diagram. The Aperture project has been designed with cloud-native architecture in mind. What you're essentially doing is deploying an agent on each of your Kubernetes worker nodes as a DaemonSet, so it acts like a sidecar for that node, and your application services talk to that agent for load management decisions. In real time, it's making rate limiting, queuing, and prioritization decisions on each API call entering or leaving your services. And then there's a control plane, built on etcd with some Prometheus elements for visibility, to scale these agents and configure them dynamically in production.

Aperture has a lot more moving parts, but for this presentation we'll talk about the ones most relevant to quota management. One of the cool things about Aperture is that it's effectively a distributed cache: the token buckets in Aperture are sharded and distributed across the cluster, and all the agents participate in that common pool. Each time you have a token bucket lookup on, say, an API key, in the case of OpenAI you want to track rate limits by API key, it asks for tokens, finds which agent owns that key, and runs the token bucket algorithm there. So think of it as a distributed token bucket available inside your cluster; it's an in-cluster solution.
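To make the mechanics concrete, here's a minimal single-process token bucket sketch. Aperture's actual implementation is sharded and distributed across the agents in the cluster, so treat this only as an approximation of the core idea: mirroring a provider's limit on the client side and working backwards from it.

```python
# Minimal single-process token bucket, assuming we just want to mirror a
# provider limit (e.g. 40,000 tokens/minute) on the client side. Aperture's
# real bucket is distributed and sharded across agents; this is only the idea.
import time
import threading

class TokenBucket:
    def __init__(self, fill_amount: float, interval_seconds: float):
        self.capacity = fill_amount                       # bucket size, e.g. 40_000
        self.fill_rate = fill_amount / interval_seconds   # tokens added per second
        self.tokens = fill_amount
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, cost: float) -> bool:
        """Take `cost` tokens if available; otherwise the caller should queue."""
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.fill_rate)
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# One bucket per OpenAI API key / model, mirroring the provider-side limit.
gpt4_tpm_bucket = TokenBucket(fill_amount=40_000, interval_seconds=60)
```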
Unlike Redis, which is an external dependency, this in-cluster approach is much higher performance; it's designed for these counters and for handling far more writes, whereas Redis actually overloads pretty quickly with this kind of workload. And the algorithms here are more advanced than a fixed window; it's a proper leaky-bucket style algorithm, so the traffic flows are much smoother. One of the cool things is that once the token bucket constraint is applied, it also provides a weighted fair queuing scheduler under the hood, where the requests are actually prioritized.

Now let's look at the bigger picture: this is how we worked with CodeRabbit to schedule requests to OpenAI. This is one of the CodeRabbit instances. They have different workloads coming in from different tiers of users: paid, trial, free, and some interactive workloads like chat. Before they make a call to OpenAI, they check the call against Aperture, the sidecar agent, which does the client-side rate limiting: it's tracking the OpenAI limits and mimicking them on the client side. And if we're above the limit, the requests are queued in a weighted fair queuing scheduler, where we give paid users and chat higher priority than free users. So during peak load, the idea is that your interactive sessions, the interactive chat, go through faster and get access to the OpenAI APIs sooner, while code reviews can wait several minutes to get served. That's the high-level idea of how we worked with this customer.

And part of the appeal of the Aperture project is how easy it was to integrate. The entire integration is three steps. Step one is simply defining labels in your code; these labels are your business attributes, simple key-value pairs. For example, one of the labels here is the API key: put your OpenAI API key here, maybe as a SHA-256 hash so it's not exposed. The reason to include the API key is that you want the token bucket to track capacity per OpenAI API key; each key, or each org, has its own rate limit, so you want to track them separately, and Aperture does that for you. The second label is the estimated tokens, the cost of the request; these are the prompt tokens, which could be 1,000 or 2,000 tokens, because on the OpenAI side, if you recall, there are two kinds of limits: requests per minute and tokens per minute. Aperture needs to know how many tokens a given prompt will consume, so you provide the estimated tokens as a label. Then you provide the model variant, because each model has different rate limits; GPT-4 is aggressively rate limited compared to GPT-3.5, so the third label I provide to Aperture is gpt-4. And then I want to give different priorities to different users, so I'm going to set the product tier. The slide says "free" here, which isn't a great example; free should have been low priority, that's just a typo on the slide. But the whole idea is that each product tier can get a different priority.
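Here's a rough sketch of what building those labels might look like in code. The label names are paraphrased from the slide as described in the talk, not an exact Aperture schema; the token counting uses the tiktoken package, and the priority weights are purely illustrative.

```python
# Step 1 sketch: attach business attributes as labels to each OpenAI call.
# Label names are paraphrased from the talk, not an exact Aperture schema.
import hashlib
import tiktoken

# Illustrative priority weights per product tier (higher = served sooner).
TIER_PRIORITY = {"paid": 200, "trial": 100, "open_source": 50, "free": 50}

def estimate_prompt_tokens(prompt: str, model: str = "gpt-4") -> int:
    """Estimate the cost of the request against the tokens-per-minute bucket."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(prompt))

def build_labels(openai_api_key: str, prompt: str, model: str, product_tier: str) -> dict:
    return {
        # hash the key so the raw secret is never exposed as a label
        "api_key": hashlib.sha256(openai_api_key.encode()).hexdigest(),
        # cost of the request: prompt tokens counted against the TPM limit
        "estimated_tokens": str(estimate_prompt_tokens(prompt, model)),
        # each model variant has its own limits, so it gets its own policy
        "model_variant": model,
        # drives the weighted fair queuing priority
        "priority": str(TIER_PRIORITY[product_tier]),
        "product_tier": product_tier,
    }
```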
Then, after you've defined the labels, all you're doing is wrapping your workload, in this case the OpenAI call, in a start flow and an end flow; that's step two. So we have a chat completion, which we've wrapped with a start-flow call and an end-flow call here, and that's it (there's a rough sketch of this just below). Aperture then takes care of prioritization, scheduling, and even caching.

And this is what the policy looks like. The nice thing about the Aperture architecture is that, for developers, all you're doing is providing these labels and wrapping the workload in code, whereas the policy itself is decoupled from the code; the policy is defined in Aperture using the UI or YAML. There you provide all the parameters for how to limit these labels at these control points. For example, this is the GPT-4 tokens-per-minute policy. If you recall, GPT-4 for tier-two users has a 40,000 tokens-per-minute capacity, so we're telling Aperture to track a bucket with a fill amount of 40,000 tokens that refills every minute. You're basically replicating, on the client side, the token bucket OpenAI is running on their side, so that you're completely aligned with the rate of requests you send out to them. And the reason to do that is that once you do, you can prioritize: rather than waiting for them to reject the request and then retrying without any prioritization, you're regulating the traffic going out from your end and applying prioritization to it.

The second part, as I said, is the prioritization, which takes two factors because it's a weighted fair queuing scheduler: the priority of the request, and the cost of the request, which is the tokens. Based on those two attributes, the scheduler decides which workload gets to make an API call to OpenAI first and which gets to wait.

The third part is the selectors. These are just label filters for where the policy gets applied. For instance, I want to apply this policy on a control point called openai; if you remember, in the example I wrapped the workload in that openai control point, so that's what tells the policy layer to look for it and apply the policy to this workload. And the second selector says to apply this policy only for GPT-4 as the model variant. So you can have a separate policy for GPT-3.5, a separate policy for some other model, and each of them will have a different limit. And then you can have a separate policy for requests per minute, where you don't have to provide estimated tokens, because in that case Aperture just assumes a cost of one per request, which gives you requests per minute.

So let's look at the results. This is a screenshot from one of the peak load hours for this customer. As you can see, during the morning hours the behavior is very spiky; the incoming token rate exceeds the rate limit several times. And this is what happens when Aperture is running during that time: it actually smooths out the spikes.
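Before looking at those graphs in more detail, here's the rough sketch of step two promised above: wrapping the chat completion in a start-flow / end-flow pair. The Aperture client and method names here (ApertureClient, start_flow, should_run, end) are assumptions based on the talk's description, not copied from the SDK reference, so treat them as illustrative.

```python
# Step 2 sketch: wrap the OpenAI call in start-flow / end-flow so the sidecar
# can rate limit, queue, and prioritize it. NOTE: the Aperture client and
# method names below are assumed from the talk, not verified against the SDK.
from openai import OpenAI
from aperture_sdk import ApertureClient   # assumed import path

openai_client = OpenAI()
aperture = ApertureClient(address="localhost:8089")   # local sidecar agent (assumed)

def reviewed_completion(messages: list[dict], labels: dict):
    # start_flow may block while the request waits in the weighted fair queue
    flow = aperture.start_flow(control_point="openai", labels=labels)
    try:
        if flow.should_run():              # admitted under the token bucket
            return openai_client.chat.completions.create(
                model=labels["model_variant"],
                messages=messages,
            )
        return None                        # rejected or timed out in the queue
    finally:
        flow.end()                         # report back so counters stay accurate
```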
So you have peak load and a lot of spiky behavior, and what Aperture has done on the outbound side is really smooth out the flow; it has flattened the peaks so that you're accessing OpenAI at a steady rate of roughly 666 tokens per second, which works out to the 40,000 tokens per minute allowed on their side.

And this picture shows the different users and how they got prioritized. On the top left you see the average delay per request for each product tier. The blue line, interestingly, shows the paid users: they do get delayed, a few minutes at peak, but it's still a much better experience than the red lines, the trial users, and the green lines, the open source users, who are the lowest priority in their system. So basically they're prioritizing paid users over trial users, and open source users get the lowest priority. And if you notice the small orange lines in this chart, those little blips, those are chat. If someone asks the AI a question, that gets the highest priority in the system, so there's hardly any wait for those workloads. That way they were able to take AI to production without putting up a waitlist, while still maintaining a great user experience, with this queuing technology. And the graph on the lower side shows the preemption: on average, how many tokens a request was bumped up or down the queue, because it's a weighted fair queuing scheduler, so we also track how much a request was promoted toward the head of the queue. As you can see, the green line, the open source users, is always bumped down the queue and made to wait much longer, compared to the blue line, the paid users, who are always put toward the front of the queue.

So this gives you an idea: for anyone taking AI to production today, or in the future, given the nature of these workloads, how heavy they're going to be, how much demand there is, and how much value these APIs are generating, more and more workflows will be AI-augmented, and we have to start thinking about how to build schedulers for these new kinds of APIs.

Another thing we did for them, and this is the last slide: we started with quota management on the outbound side, but we've also been helping our customers do rate limiting and caching of their own, given that these APIs are also very expensive. There's a price differential of almost 30x between the 3.5 model and GPT-4; GPT-4 is about 30x more expensive than GPT-3.5. And since it's pay per use, if they take more traffic, they pay more. So what they've done is put some rate limits on their own users. For example, they have a couple of limits where they cap reviews by the number of files reviewed per hour, so if a user exceeds a certain number of files reviewed per hour, they get rate limited, and they've also put limits on the number of commits per hour (there's a rough sketch of that kind of per-user limit just below). So they've started rate limiting their own users while also managing the load on the OpenAI side.
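A rough sketch of that kind of per-user limit, say, files reviewed per hour. This is illustrative, not CodeRabbit's actual implementation, and it uses a simple fixed hourly window rather than Aperture's smoother bucket; the budget value is hypothetical.

```python
# Illustrative per-user limit on files reviewed per hour (not CodeRabbit's
# actual code). A simple fixed one-hour window per user; Aperture's own
# limiter would give smoother behavior, but the idea is the same.
import time
from collections import defaultdict

FILES_PER_HOUR = 50                       # hypothetical per-user budget
_windows: dict[str, tuple[float, int]] = defaultdict(lambda: (0.0, 0))

def allow_file_review(user_id: str) -> bool:
    window_start, count = _windows[user_id]
    now = time.monotonic()
    if now - window_start >= 3600:        # new hour: reset the counter
        _windows[user_id] = (now, 1)
        return True
    if count < FILES_PER_HOUR:
        _windows[user_id] = (window_start, count + 1)
        return True
    return False                          # user is over budget: skip or defer
```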
And they also introduced some caching as part of the same API call, so that they don't have to re-review a file if it hasn't changed much since the last review. All of that was possible with this simple API we've provided them, based on labels and wrapping the workloads.

So I guess that's it. You can go to GitHub and look at the Aperture project; it's all open source. And there's also a cloud service that we offer. You can deploy it on Kubernetes, yes, on your own infrastructure, but a lot of the time when teams are on serverless, they can run Aperture in the region closest to them and we'll host it for them in our cloud. As I said, think of it as rate limiting as a service: all you need is a few lines of code with the SDK, and you either access the service as a local sidecar, in a very high performance manner, or over the network by talking to our cloud.

All right, I guess that's it; I'll wrap up. Any questions? I'd be happy to take them. Any questions, any thoughts, anything about AI so far?

What do you think was the biggest challenge in actually applying rate limits while making sure you weren't limiting LLM performance?

Right, so it's cost right now. I mean, on the OpenAI side, we clearly understand it's a capacity issue; there aren't enough GPUs available, and they're trying to optimize with these turbo models, but these workloads are really heavy. It's just a scaling problem with these workloads. But on the consumer side, at CodeRabbit, I think it was cost. When they opened the product up for free for open source users, a lot of people figured, hey, I don't need a paid ChatGPT subscription, I get GPT-4 for free, I can ask questions in the product itself, in the PR. So a lot of people started opening PRs and making hundreds of commits in a single PR, using it as a coding assistant, which was not the intended use case. To prevent that kind of abuse, they put very aggressive limits on open source, something like two commits an hour. So they kept the tool free, but with limits, and that way the paid users can support the open source users. And once we installed the rate limiter, they actually saved up to 30%; there was that much abuse. Think about it: they're paying, I don't know, $1,500 to $2,000 a day to OpenAI. It's a bootstrapped startup, not even a big one, but they're paying crazy money. So for them, this rate limiting saving them 30% makes or breaks their business model.

Same with caching. I think caching is going to be very, very important in this world, even more than in the past. Sometimes you want to use smaller models, like 3.5, to make a decision about whether you even need to send something to GPT-4, and that's what they're doing. Each time they do a review, they keep the summary of the last review in a cache, and on the incremental commit, once they detect that the file hasn't changed much and the last review is probably still valid, they skip the review. So they've been doing these kinds of optimizations, saving money while still satisfying customers. So far they haven't had any complaints about these limits; it's about finding the right balance between how aggressively you limit and what you allow.
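A rough sketch of that review-cache idea, as described in the answer. This is illustrative, not CodeRabbit's actual code, and it uses a simplified exact content-hash match, whereas the talk describes a fuzzier "has the file changed much since the last review" check.

```python
# Illustrative review cache (not CodeRabbit's actual code): keep the last
# review summary keyed by file path + content hash, and skip the expensive
# GPT-4 review when the content is unchanged. The real check described in
# the talk is fuzzier ("has the file changed a lot since the last review?").
import hashlib
from typing import Callable

_review_cache: dict[str, str] = {}        # in production: a shared/persistent cache

def review_with_cache(path: str, content: str, review_fn: Callable[[str], str]) -> str:
    key = f"{path}:{hashlib.sha256(content.encode()).hexdigest()}"
    cached = _review_cache.get(key)
    if cached is not None:
        return cached                      # unchanged file: reuse the last review
    summary = review_fn(content)           # expensive GPT-4 call
    _review_cache[key] = summary
    return summary
```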
Thank you for the talk. I have one question about the effectiveness, or efficacy, of the rate limiting. You need to deploy the rate limiter in front of the AI model, so it needs to sit on the model provider's side, because otherwise, if you do the rate limiting for your own customers but other customers don't go through the rate limiter, then the efficacy of your application will be low. So what do you think about that? Do you need any further clarification on my question?

Yeah, I think I can explain. Rate limiting happens at many levels. OpenAI has their own set of limiters, which they apply to their customers. Then on the consumer side, as we discussed, if you're not regulating, not doing client-side rate limiting on the outbound side, then you cannot do prioritization effectively, and the whole point was prioritization and user experience. So I think at each level, on both input and output, you need some kind of limits. On the incoming side I need load protection: for example, if I'm OpenAI, what I should be doing is looking at the real-time saturation of my infrastructure and using that as a feedback loop to limit dynamically, and still doing per-user rate limits, though those don't translate directly to overall load, which can come from anywhere. And on the consumer side, I have to do the same: rate limit the API calls or high-level workload calls coming into me, and then, once those translate into OpenAI calls, regulate them on the outbound side, deployed as a service. And if you look at the overheads, these APIs are so heavy that the rate limit lookup overhead is hardly 20 milliseconds. Queuing is a different matter; a request might queue for one or two minutes depending on how much capacity OpenAI has right now. But if there's no queue, the lookup is about five milliseconds on the local cluster and about 20 milliseconds across clouds, compared to the several seconds the workload itself takes.

Yeah, all right, I guess that's it. Thank you, everyone, for being here.