 All right All right. Thank you. Thank you guys for coming. Let's Try to see whether there are additional seats and in the interest of time. Let's Let's get started and it's my great pleasure to To introduce young Stoica young is a professor at UC Berkeley and Really for for many of us young doesn't need any introduction, but as the host I'm obligated to say some things anyway Ian is currently the director of the sky computing lab at UC Berkeley where his work focuses on cloud computing and AI systems and Ian has done a lot of influential work For instance, many of us have probably used his systems like Ray or a party spark And many others so young in addition to having done a lot of great research young's work has Significant real-world impact as well young is a very successful Entrepreneur having co-founded three companies the most recent being any scale with the ray and Earlier with a party spark, which is behind data data bricks and even earlier from his working peer-to-peer networks Conviva on video streaming and young has met one many awards He's a ACM fellow and a national Academy of Engineering member and an honor member of Of the Romanian Academy So thank you young for coming to Michigan and let's give young a warm welcome Thank you, and great to be here and Hopefully for the people who stand out there I can make you this worthwhile so the title of the talk it's an AI stack but another way to frame it is like an AI system journey and Actually during this talk, I am going to do something which I advise my students not to do That is I'm going to talk about a few systems what I tell them You have really to talk about one thing the best work you have done and go, you know deeper and But anyway, so with that being said, let me start This is prelude right. It's like Apache spark. This is the first time we done at least at Berkeley work on the system at the intersection of system and AI and this work happened in Amblav and Amblav it's you know Mosher off and Barzan Some of your faculties, you know, they know about the lab they are You know spending time and being part of the lab and that was about big data I am post-sending for algorithm machine and people and it was about making sense of the big data Which at that time it was pretty hot area. It has around eight faculties around 50 graduate students and and postdocs and The one thing was very cool it as we are all sitting in the same open area Including the faculty this is you know the faculty some of the faculty gave away their offices and start this cubicles Right in the middle of the lab and also as a lab included people from different areas It was systems machine learning in particular Michael Jordan database is Mike Mike Franklin and so forth So so that's and it was a very collaborative environment. That was idea. Okay so and Now the story about you know AI in there it was Lester McKee which was the you know one of the graduate PhD students of Michael Jordan and some of his fellows Wanted to compete at this Netflix Netflix when it was a Netflix competition and it was surprised It was one million dollar for whoever can Develop an algorithms recommendation algorithms based on the anonymized data Provided by Netflix which proves that in time. It wasn't as Anonymous as they thought The best algorithm So these people it was a lot of you know data for that that stage came to us and Basically ask what we should do we should use and well, you know, you should use how dope right? It's what we told them now the problem was after a few days They came back and say well how do piece extremely slow and why was that is because This algorithms basically there is or not deep learning algorithms. Okay. He's like collaborative filtering this kind of algorithms You have a classic machine learning But still and almost am I all machine learning algorithms are iterative, right? You are going to have a loss and you are going to iterate until you know the loss decreases and so forth and Each iteration was a map reduced job. Okay, so therefore and With how do you have to read and write actually not as a beginning at the end of the job But even between maps and reducer to to read and do write and write the data from this from the disc And this make it extremely slow. So each iteration drives enter data on the disc is enter data from the disc So Maté, you know here comes Maté and basically he put together some kind of you know a small System in Scala and that was kind of early spark and the main idea of spark is that why you know for these data sets Actually, they are not as large. You could actually store them in memory So now if you store this data in memory, it's going to be much faster And spark also has a more powerful API in addition to map reduces providing other operators And of course, there are things about ensuring resilience and things like that So what was the outcome? So this is a final leaderboard the Berkeley team was the ensemble and as you can look at the score, it's identical to the first Team for the first which eventually winner so we didn't win the reason didn't win it was Submitted 20 minutes later. So you see the speed. It's extremely important We should have they should have gone directly to spark right now had up there So so that's kind of the story so now for a fast forward today and I'm going, you know I've been involved in many of the systems in many systems and a different layers I will put here in front orchestration layer like Mezzo skypiler skyplane spark and ray like distributed frameworks some optimized engines like Alpa Alpa Servant VLLM and application level chatbot are in Avicuna and a few others From this again, what happened with spark, you know, it becomes it was the fact of standard for big data analytics The data breaks was a company behind it was the other around the spark is behind data breaks And the one why once also to mention is Ray. This is for heterogeneous building building applications on heterogeneous distributed clusters and It's quite popular. It was used by open AI to train GPT models and many companies are currently building the AI Infra and platforms on top of it and it's be it's raised behind this other company any scale during this talk I am going to Present these three projects skypilot VLLM and chatbot arena a different layers of the stack And actually they build the only the connection here is it build on each other, but otherwise is very different So now let me start with skypilot skypilot and and it's it's within this Lab which is called sky computing is the same place the same or less configuration Roughly the same kind of a number of people We call it AI now not machine learning. That's kind of a difference and This actually is a Lab between them was a rice lab one we developed Ray So at very places are these five-year slabs each of them It is own mission and so forth and if you want to say what is a mission of of Sky computing is basically to provide like the internet for the clouds So let me tell you a little bit about sky and then I'm going to talk about skypilot So obviously the cloud has been a revolution is that the compute infrastructure of choice today, right? no question about that and However, despite all of this Huge value it provides today the clouds are quite fragmented, right? For instance AWS it is kind of 300 services, right? Many of them are proprietary services and more and more they also have proprietary infrastructures Right think about right now every cloud major cloud. They develop their own Accelerators right inferential and train you may WS Maya 100 is from Azure and obviously GCP's tensor flow processor units right, so what are the what are the Impact of this is because it's our silos, right? If you are a customer you are going to go to a cloud and pretty typically stick to that cloud, right? So this means you have limited choices and doing only with respect to that cloud and The clouds kind of lock you in so it's also they can raise the prices And so therefore a user cannot use a best of breed hardware and software, right? They cannot satisfy actually another thing is data and operational sovereignty so more and more Regulations are around data sovereignty, which means that the data should be Stored within the boundaries of the country what is what is created and Operational sovereignty is even more stringent because it says that the data has to be processed by a data center Which is managed by the nationals of that country, okay? You and also you can maybe not achieve resilience against cloud-wide failure So now these are the kind of all problem is we are not the first to think about Wouldn't be nice to have to abstract away the clouds and Build application and they run on any cloud, right? And there it's a lot of work for the past decade and so forth and typically the War is a kind of pattern is that you are going to have a portability layer Which abstracts away the infrastructure and on that you are going to build the services, right? And then on top of that applications And the example here is actually also a sky computing project ten years ago as the same name and then you save as your arc and And and sing like that even from cloud providers, okay? And this is very natural design because this is how we are thinking, you know in one-on-one networking That's IP right the abstractor is a harder the same is that in the operating system like this call like interface, right? So so it's very natural. However, and unfortunately these efforts have failed so far And there are several reason maybe too complex right the amount of functionality. You need to abstract away is huge, right? And then you need to implement all of these cloud services on top of the portability layer very hard And then clouds have no incentive to support it because it will commodize them So our approach is different in the following thing What we say forget about what we are going to try to do. It's use the services which are existing in the clouds today And see what we can do starting from there. So we don't ask anything from the clouds Right and we try to figure out whether there are some applications for which we can make a difference make it much easier to use services from different clouds and Fortunately, it's actually there are also quite a few services which are running on multiple clouds. They are pretty much the same almost the same and There are a lot of them are powered by the open source projects like for instance Spark Kafka Kubernetes Every cloud has a Kubernetes offering Hosted cloud offering Kafka offering anything like that. There are third party services no flakes on multiple clouds and I think like that So so basically here. There is no need to for cloud to participate. We just use our services We are providing today and we can build it today So the main Component of this guy is this inter cloud broker and the inter cloud broker basically collects informations and from users like jobs Job is a code and the specification In terms of what services they want to use right like host is Spark or Kubernetes or whatever and Also, some desired of criteria optimization criteria like I want this This job to be optimized for minimize the cost or for performance or a thing like that Right, so you get all these requirements from the users and then also collect Information about what services different clouds provide their availability their cost and sing like that Okay, and then is the broker places the jobs on appropriate clouds based on requirements And then overseas execution if the job fails is restarting if resources are preempted It restarts it may be in different region different cloud. That's really go basically what it is Okay, so this is the cloud broker is kind of you know, it's like on the control plane is not data plane Right for now and there can be multiple inter cloud brokers for different applications So now once you get the high level picture of what sky is Sky Pilot one way to can think about is inter cloud broker for AI workloads Okay, so what problems you try to solve here one is GPU scarcity, right? You heard that for the entire year, you know, it seems that some people believe is going to continue like this AWS exec So that's one GPU scarcity is very hard to find GPU in the cloud. It's actually very hard even for People inside Google or people inside Azure, right? So it's real The other one if you get the GPUs, they are very expensive Right as you know, right one of the reason universities tries to get and you you know Servers GPU servers is not only about scarcity. It's also because of cost We are trying to do that at Berkeley before this scarcity, right for the same reason, right? And and then it's like these models Training this model is very expensive GPT for depend what numbers use was between 60 and 100 millions to train, right? so Cost and scale and then okay, so now you have sky, you know, you're talking about sky So, you know, you can use multiple clouds and you can use multiple instances and see like that But it's expensive, right? It's expensive. You have a huge number of choices You need to provision then you need to do the data transfer if you compute in a region or in a cloud or the data And your data is stored in another To set up to manage the job and so forth, right? And by the way availability is a big problem Right is like many times you can go and you try to get, you know How many of you want, you know to whatever the cloud and try to get some GPUs and you get something like that? Yeah Okay, so it's this this kind of is complex, right? Yeah, I can tell you all use all these resources You use all these credits for all these clouds from which for which you have credits is not going to be easy So this is what it is a sky pilot is handle this, you know all this cloud Structure make it easier reduces friction and at the same time I am to save the cost and Maximize availability, okay so and initially when we revealed sky pilot is on IWS Azure and GCP so basically in this case is very simple, you know the users basically submit the jobs They say what kind of resources they need like a 100 and then sky pilot picks the best location provision and Run the job and then clean up the resource So now this is a typical of very trivial ML project, you know, you have a Peep install the requirements what kind of modules a library is to use and then you Run you use Python to the train by Okay, so how this looks in a sky pilot you have a YAML file You have you put all this, you know the peep install and running the Python Running the train or pie, but in addition you also specify the accelerators and words Where the data is right sink like that okay, and You give this to sky pilot and the sky pilot is going to do the rest you do sky launch task YAML and Sky pilot, you know, it's going to find some available instances in some regions and so forth and doing that sky pilot when to do that sky pilot keep a list for instance of For various instances and for their, you know, which what are their prices in different regions? even in the same in a single cloud as you probably know, there are the same instances they are priced differently in different regions, right and Of course that also happens across clouds the difference in prices can be even larger and availability Okay So under the hood what happens when you submit a task a job you know the sky pilots on one hand has this component which is service catalog and Like I mentioned to you which collects the information about services and the instance is available in different clouds and Prices and availability. This is refreshed periodically then there is an optimizer We take in the requirements from the users and based on the information in the catalog is going to and come with some decision where to run the job given the information again in the catalog the requirements from the users and Considering data and compute location data data egress cost and things like that Now when you decide to say I want to run this in this region It tries to allocate to provision is that region? if it doesn't if he cannot is going to try another region or another cloud and once is going to You know reserve the instances and maybe start a cluster a cluster or whatever is going to execute the job Once the job is run. It's it's Done is the allocator resources at the same time. It's again if you have failures if you have preemptions sky pilot take care of that right So here is one example So so you know that in these clouds you have this kind of two type of resources on demand and spot instances spot instances are much cheaper but You can get they can get preempted Right, you have a two minutes warning, but whatever, but they get preempted right so you may lose them But they can be three times cheaper So ideally what you want you want to have boss, right? And here and sky pilot gives you that and this is what an illustration This is about training the bird model for three days. This is a loss rate so you want the loss rate to go down and This is around one day Training so you start to allocate the sky pilot allocates the resources on AWS US West 2a But after a few hours this pot because our spot is so pot instances are preempted So sky pilots and makes a decision that and looks around and finds resources now in US West to see and Basically restarts a job from the previous checkpoint. We assume here that the job is going to take these checkpoints It runs for a while. It's again is preempted and again is going moving back in West 2a because now are available resources and Finally after the last preemption decides to go to Google Right and it runs until the end. Okay, so that's kind of what it is All right, and you get 70 cost percent cost saving versus on demand with really no loss in performance and in You know the time it takes right Okay, so that's one and again The second example I want to give you is like sky surf so sky Pilot and sky service part of sky surf the naming is not the most fortunate But this is about that sky pilot was mostly initially for job for batch jobs training or whatever But what about serving right, especially if you are going to serve this large language model You know, it's like they are a big workload bigger and bigger over now. It's the same thing you have here, right? You are going to Try to serve and you can use one of the serving systems like TGI try it on a thing like that But what happens if you want to scale up, right? And you don't have you no longer have resources Right what you can do again with SkyServe, right? You you know, it's like because you no longer have Resources there you can go to another region or to another cloud Okay, that's what it allows you to do But it's pretty clean is doing DNS redirect for that. It's it's it's looking as availability in real time of all of this Regions in the same cloud or different clouds right here. It's an example one Set of replicas are running on lambdas and once other set is running on our running on GCP, right? And now when things are coming together actually we use SkyServe and on top of that we use BLLM for model serving and this is the infrastructure we used To serve this chatbot arena. I'm going to talk in a little bit. So that's for us first Actually before ending I want to say two other things about skypilot when we started remember we have only these three clouds Right the major ones right since for the last year actually we got a lot of contribution other clouds adding support to skypilot right, I think total is 12 and This is very nice, right and some of them they They and actually quite a few of them they add support because some of their users ask say well I we are using skypilot, you know, can you Support it right because people like skypilot if they use it right because they are not no longer Link to a particular cloud and they can you know easily move their workload from one cloud to another and Of course, you know like these guys They have the entire incident which are not the major three clouds. Although some of them are big like Oracle and Obviously IBM cloud Because they want workload, right? That's why they are incentivized. This is a very nice thing to say This we are thinking this will happen, but it's very nice to see always when it happens Yeah, so, you know, it's it's it's a healthy project. What can I say? This is a numbers You know mistral has used it, you know, Shopify is using it sing like that now under sky There are many many projects, right? It's like and I'm just going to list here It's cloudcast. This is to minimize the core You need to move that data if you need to move that data like for instance You have a model and you are going to serve the model in different region in different clouds You want to multicast to broadcast that model to each of region. How you do it? Do you do that? Efficiently low cost and low latency Sky sports, but you have a job and you may want to meet a deadline How can you use a combination between spot instances on demand instances in order to meet the deadline and reduce the cost? Sky hedge is a similar thing but meeting us allows for services Starburst integrate on prem clusters to sky and sky identity. Obviously. It's adding a layer of security on top Okay, and sky storage Multi-cloud storage on top of the blob stores you have in every cloud. Okay, so now I'm down with sky a Second project I'm going to talk about is VLLM right so sky is really orchestration Resource orchestration What is a story here? Everyone knows LLM's are very popular since release of charge APT and you have more and more application using these large language models and the problem is Is that serving LLM's is very expensive, right? And at least it's are some data from you know, since we last year, right? It's one year old project You know LLM's run on high-end GPUs as you know and you know for NVIDIA 100 Even if you have a certain billion parameter model, which is not that very large, right? You can get only after a few requests per second is order of magnitude Larger than what you get from a web server, right? It's like in terms of cost So this is a problem right now Why is this why is this and I know that almost everyone knows but let me give you an one-on-one why this happens You know what these LLM's are about, right? Auto-regressive models. So basically you get you know, you Get a prompt and these large language models Statistically are going to decide what is the next word or token Okay Now the key that when they do that they use this attention mechanism. They in this attention mechanisms looks at all that the tokens Which are before the new token, right? So looks at the entire prefix to decide the next token. That's what you need to remember right, so for instance for You know deciding on the they look at artificial intelligence is and for deciding the next word Future they look at artificial intelligence if the future and they go until they The last token is like end of sentence Right, and that's the result out Okay So what is the problem here The problem here is that is low throughput That's why it is right. It's not CPU bounded. And why? Because tokens are generated one at a time Right initially when you have the prompt the prompt is in parallel right for the first token, but after that it's one at a time right, so you have this kind of Expensive GPU Generating one token at a time and not being very utilized right because GPUs are about parallelism. You cannot take advantage of parallelism. So what is the idea? It's an obvious idea. Well, if you cannot have parallelism with zero and request just serve more request in parallel a batch of request That's what it is The problem now now you run out of memory and I'll explain why so that's a problem So here is how the memory lie out or what for a 140 gigabytes of GPU RAM Certain billion parameters Parameters these are the weights 26 gigabytes. Why is certain billions each parameter is two bytes 16 16 bits So 26 gigabytes and now you have this thing which is called KV cash Key value cash. What is that again? I'm not going to go into details, but a basically it's a stores the embeddings These are representations of each token Again to get the next token you need all the previous tokens You need to restore the embeddings for all the previous tokens just to get the next token and These embeddings are large like think about one megabyte. So one request can take several gigabytes Okay, that's a problem. Okay now It's getting a little bit worth Because a memory management for this kind of you know, at least at that time It's very simplistic. It's basically what this What the system at that time they are doing They store the six sequence in a contiguous memory Right contiguous memory allocation and because it's contiguous you need to pre-allocate Right, you need to kind of the maximum Sick, you know our precise, right? That's kind of what it is, right? So it's very inefficient and everyone who took operating systems. They know this and they know also the solution to this But let me go in more details. So this is a request, right? This is a prompt Artificial intelligence is and then you need to pre-allocate all these slots for the maximum length because you need to be contiguous There is another subtle things here even if you know The length of the output is still not efficient. Why? because You see You you occupy this You know whatever the you reserve embedding the reservation one by one, right? So you have one thousand First you occupy one the second one the third one So the rest, you know like even though you are at the end you are going to occupy all this one thousand The one at the end they are not going to be used for a long time And that's an opportunity lost because as a request might could have use those Right before you need them. Okay, so that's we called the reservation So first is internal fragmentation when the internal fragmentation is because they don't know the output the Reservation is that even if you know the output You because you are not using all the all the memory reserve right away then it's a waste And then it's external fragmentation because different Requests are all okay different sizes and you may end up that you're going to have gaps or value of sizes These are the gaps. So even on the aggregate you have enough memory for a new request Because it needs to be stored contiguous and you don't have a gap which is large enough You may not be able to serve to to serve that request That's what it is And this is serious problem. These are different Solution up before this one is even assume oracle, you know the future and And So this means that you know the output size and you know, even if you know the output size The importance things to look here is a green one green block. You lose you use only 38% of the entire Memory Okay So what do you do is very similar is operating system? that The operating system from the process perspective, you have a contiguous address space You cannot have that in the physical memory. What you do is paging One level of indirection right the same you lose here You have blocks. This is a level the granularity at which you are going to allocate memory Blocks contains multiple embeddings and You have these block tables. You see this is a logical view. So it's from application view is contiguous block zero one two three four from physical air physical physical view different blocks are stored at different Locations, right? So they are not contiguous and then you have an indirection point right saying block zero here. It's in This kind of logical view it's block seven in Physical memory and then you have this kind of counter because now, you know as as you add Tokens and you are going to fill a block you need to know when you fill the block and then you start allocating the next right Okay, that's pretty much what it is right and by the way Because in many workloads, you know for many techniques you do parallel sampling for the same problem to generate multiple outputs to get a better output and so forth You can again like in operating systems. You can share the prompt, right? Not only that is efficient, but you know the common part of different Queries it can be shared So what is the end result the end result is that you can actually get 96.3, you know Efficiency allocation right usage of the memory Which is you know almost three times more Than the best if even if you are Oracle, but this is not realistic. This is more realistic So it's more than three times more More than three times a better utilization Okay, and that means what I mean that you are going to It's you can process more request in parallel If you process more request in parallel, you increase the throughput if you increase the throughput you reduce the cost Right because you do more unit of work on the same time and you pay the same because you pay for the hardware right, this is comparing which the solution back then and This is about throughput, right? and You can see the throughput is in its blue right higher the better This is for two models llama 7b and certain B and You know you have up to 3.5 higher support Okay, which means 3.5 higher lower cost now let me Also, I talk about this analogy, but you know like everything, you know like Every problem at the first sight seems up so identical But it's a little bit in a different domain Then once you start working with you on it. You see a few differences So what is the same word very similar like OS pages like KV blocks, right? And to reuse a memory fragmentation you can share pages across processes the same way you share KV blocks across Samples about multiple samples of the same query or multiple queries. What are the differences eviction? You see the eviction in the operating system is page level Now the equivalent of paging now our case is block But when we evict it doesn't make sense to evict a block only a block in the middle of the sequence Because the other blocks will not be enough to compute the next token. So when you evict evict the entire sequence The other thing is that loading blocks pages back to memory here when you do the paging you store on the disk in the swap Right and then when you need them back you read them from the disk You could do the same thing for instance moving the data from the GPU RAM To the CPU RAM, right? You can do that and get them back. However, it turns out that in this case or in some cases It's easier to forget don't store that just throw them away and you need them you recompute the entire sequence And why is that? Because again GPUs are highly parallel Right, so it just takes in some cases. It's faster to recompute then go to the GPU and get This blocks back So really adoption, you know, it's first growing fast. This is one-year-old project Many projects using it Many companies and the most what I'm most happy about it's actually we have right now more and more hardware Accelerator providers adding support for VRLM to their chips like MD Intel AWS inferential and Soon Google Okay, and there are two things about VRLM. VRLM is the artifact But as a technique I mentioned to you, which I forgot to tell you it was in the slide. It's called page attention Now page attention as a technique. It's also now used in all major other LLM engine service engines, right not be LLM Fireworks hugging face DJI Nvidia TRT LLM. These are competing. This is proprietary This is semi proprietary semi open whatever you want to call it, you know, they're kind of competing and But they use page attention like flash attention, you know, everyone is using it Okay, so that's my second now the last chapter Yes Yes Yeah If I'm already generated, you know, three paragraphs Large prompt would they need all of it? Yes, most of the cases. Yes, there are optimization and there are tricks like you have a window or Things like that, but but that's kind of had an impact on the quality But it would affect your paging mechanism Yeah, you might your eviction policy might be dependent on whether such a window is appropriate. Yeah Yeah, you can do that but you'll need semantics from the algorithms and things like that You can do it. You can do it. We don't we don't do it because not all not a lot of models do that now Right, but you you can do it. Yeah, and by the way, the one thing again is one one difference is that, you know, the That's one advantage if you look about The one single probably which is very important which I missed all these differences is this is because application level Paging so you see we do we can do this Request level versus page level eviction because you have semantics about the application Right. We know that the block in isolation is not very useful, right? Like if we you know without a block that he sorry without a block the entire sequence is not very useful And this is the same thing because we know how to recompute Right, so that's kind of a very important aspect Okay, so let me go to the next one Sharebot arena. This is very different LLM evaluation. It's running on top of what I discussed so far So what? Yeah, this is his benchmarks, right? And this is Dave Patterson and he famously said For better or worse benchmark shape a field And this is true for AI right is like image and I attend all of these tests before for release for image Recognition and so forth Okay, very important So problem is that LLMs are extremely hard to evaluate why? Evaluation is expensive and it's unreliable. So let me give you Show you why they are expensive This is a question develop a python program that reads all the text files under a directory and returns top five Words with the most number of occurrences Okay, okay, and you get this two answers from A and B Which one is better? Everyone knows he knows Python. That's my point. It's hard. This takes time right Now you've been so this is program is hard to you know, but let me give you another one photosynthesis It's a vital vital process for life on earth. Could you outline the two main stages of Photosynthesis including where they take the place and what are their inputs and outputs? And you get this Which one is good? It's correct Why that It's got 50% chance of Okay So that's why it's their heart They're also unreliable because data contamination a lot of these benchmarks are static And as you know, the LLMs are trained on All the data they can find in fact the data is now the bottleneck in the same that they don't find enough data They don't have enough data So this side is one in some of examples, right? This was when GPT-4 was launched and You have this kind of code for this problem from code force and 2021 was if you remember the first charge EPT was trained on all the data before 2021 So the problems which are before then solve them 10 out of 10 Okay after that zero out of 10 Here is another example is this paper basically claiming that GPT-4 can score 100% on MIT ECS curriculum with the right prompting and here it's few days later Well, it turns out that GPT was already trained on this and actually if you take that into account. It's actually only 58% is correct Okay, still may be impressive, but Okay So that's what it is. Okay so Now our story and when we encounter ourself this problem is that while we developed this kind of the Quna after Facebook released Islam in February 2023 we released this VQ now which was Find you on on a shared GPT data 70,000 Conversation 70,000 conversation and this charge of it is pretty high quality data will be people who are sharing their conversation with GPT to this share GPT site and this site was Providing the ability to download this conversation for the community. Okay So now we developed this now How we figure out how good it is right That was a problem and humans take a long time and you know, you're asking your students You know, you are it's hard right who wants to go through, you know to say which one is better So there what we had this idea then back then and this is now pretty popular is to you GPT-4 Which was released just weeks before To evaluate instead of humans Right, that's kind of what we tried Okay and you know We look at that. This is the you know, you remember the first question right so actually this was the answer from GPT GPT-4 Assistant a is similar, but that doesn't handle case sensitive or point patients so actually Assistant B gives a better answer between this one You're right Assistant A They both say a lot of things correct But assistant A mixes the input and output between stages. That's what it is Right, although it's a more verbose answer Okay So pretty happy, but of course this is not satisfying because What is a baseline is still this is going to be consumed by humans So you need to figure out how you compare these three results with what humans results provide So then we went to design a mechanism and a framework for getting the humans Basically inputs of humans to evaluate these models and this is Charbot arena. What it's about now so ideally You have the question you get an answer from each LLM and you rank them Right which one is better That's one thing. There are few issues with that one of them. It turns out that Ranking and choices is hard right It's actually easier to pick if I give you any choices to pick the best one, but even for that it's Not that easy, you know this book the paradox of choice basically saying the more choices you have the harder is to make the decision Which is better the easiest one to do it if I have your two choices pick which one is the best That's the easiest for humans Okay So this is what we do. So now you have you'd give two choices pick the best Okay, but now there are many ways to do that as well One way is to do kind of tournament for the same question. You play every combination of LLMs, right? every two LLMs this is a similar with tournament right like like In soccer you do it But this is hard and actually there is another problem. The other problem is that this is dynamic. This assumes Static right you start with the same number of teams and for the entire championship You have the same number of teams, but this is not the case, you know, you know because In reality you have these large language models popping up every week or every day. So it's dynamic Now fortunately if you look again what humans though because they have to deal, you know They have to evaluate humans right humans evaluate humans, right? How how they do it and as a way is to confuse kind of rating when you have rating in many things you have Tennis and one is the one thing which is also popular. It's chess, right? Right do rating and chess again is like actually you have a rating which is meaningful even if Two players you can run rank two players even if they never played together, right? So this is what we use the yellow rating and then We develop this kind of chatbot arena. So when you go to chatbot arena you are presented with something like that and You are going to get to ask a questions and you are going to be presented to replies from two randomized anonymized LLM's Right, and then you can pick a is better bees better tie or both are better And based on these answers you compute the yellow rating. This is one of the latest I think it's two days ago. So you can see GPT for Up cloth three Oppos the largest it's actually very close now right What you have here Cloud and This is our own right now. It's the best open source model Alibaba. Then the next one is mixed rule here Again, it's not reasonable, you know reasonably close Okay, but the interesting thing to look here is that this model which are open source are better than GPT 3.4 to 3.5 and another things to look here Which is interesting and that why this is very useful You see you have two GPT force here and these numbers represents when they are released. This is March one year ago Tomorrow is that one year exactly and this was in June so you can see this March on perform better. Do you know why? alignment more alignment and Basically saying more times. I don't know the answer So It's you know, it's very popular right now is like some numbers is from April 23 Until now we have 10 millions over 10 millions of user requests over 400,000 votes Over 70 models we tested we evaluated now. We have 100,000 requests per day and 5 to 10,000 births per day 100 languages more and this is the distribution most English now you imagine why is this Because we bribe people because we basically also provide them some free access to the top models Right, but the point is and this is a things, you know, but the votes This is of these are the votes the votes also increase right even with bribing the votes almost increase proportionally Okay Now the question I told you so what I told you is like we could use we use dpd4 as a judge Now we have this kind of chatbot arena and basically we can scalably, you know, basically crowdsourcing the evaluation Right and how you know, we want to see how they compare so Can you really trust LLM as a judge and we had this systematic study and it's very interesting Because the limitations of using GPT-4 as a judge are not unlike humans here it is They exhibit position bias They prefer the first answer Verbosity bias Prefer the long answers Self enhancement biased prefer answers, you know, like you'll you'll you know, you'll do yourself, right? And limited reasoning not good at grading mass questions. Okay, it's very much like many of us So but the most important figure here and again, is there are different ways you are going to account on for this biases I'm not going to go into details, but the most important is to look at the agreement and If you look the agreement between humans excluding the ties You know, sorry, if you look at humans and GPT-4 No, sorry, if you look at the agreement between GPT between humans, it's 81 percent Okay, so two humans only agree with themselves in 81 percent of the cases If you look at the GPT That's 85 percent. So okay, so it's in ballpark Right, so the kind of agreement you have between human to human and human to GPT-4 is pretty similar All right When I started I said about data contamination, what about data contamination? Child GPT It's at large extent Olivia is a problem because a lot of users come with kind of new questions all the time Questions they care about to get the answers to especially since they get to this answer from some of these top models Right for free. Okay, so that's kind of one, but I think that going You know Beyond that we really need again to you know, like to get use more inspiration of how humans evaluate humans And how do we evaluate humans through exams? And these exams are typically one Time exam, right? It's like we don't give the same exam over and over again, right? Yes Sorry, can you speak louder? Yeah, the question is whether we observe more and more data contamination as people ask questions. So maybe question We look an anecdotically. We don't see it, but that's a study which we are Trying to do Right be people asking the same question and so forth. By the way, there is another kind of contamination Which is important right now because this It's pretty popular. This is leaderboard. We have model providers Common coming to us and saying can you evaluate it before they announce it and from some large companies and That's you know, it's very interesting what I can tell you and I you know, I have some data It's very valid the time the kind of topics is very bad. It's it's it's it's it's pretty broad And not only that but there are quite a few difficult questions one of these top one of the top three model providers that you know, you know Who are the top three? Told us that actually These are harder questions as they see from their users So they're also not trivial question. Okay. Anyway, so going back I think that what we want to go farther is to have this one-time exams Maybe the experts can put together these exams and I don't know what Okay And we are working with Kaggle to set up such exams. Okay, so I Am done. So here is my last slide I think that everything what every star all the stack you've seen it's open source, right? It's a debate between close source open source I am obviously definitely in the camp of the open source, right? No surprise here And we do believe there's a future and I talk about three project sky pilot Which I am used to improve the GPU availability reduce a cost and reduce a friction of users taking advantage of GPUs across multiple regions in the same cloud or across clouds VLM to reduce a cost by improving throughput of LLM serving and Chalbot arena is basically to evaluate to an evaluation platform for large language models That I live is a contamination and again also reduce the cost Again, we are just scratching the surface here. I think huge number of challenges between systems and AI and You know very much looking forward for many of you to Contribute and solve some of these challenges. Thank you