Welcome, everyone. As you can see, Wei-Lin is actually on site at another event as well, so it's very busy, but he is here with us to present Chatbot Arena. So, Wei-Lin, if you can just wave to everyone so they know that you are there and live. Yep, there we go. Andy, we are all set for you.

Thank you so much. Thank you. Yeah. So, I'm Wei-Lin, a PhD student from UC Berkeley and the LMSYS team, and today I'm very happy to share our project, Chatbot Arena, which is an open crowdsourced platform for LLM evaluation.

First, I want to introduce the awesome team behind this effort, the LMSYS team. Our main focus is open LLM research. The team was founded by PhD students and faculty from UC Berkeley, Stanford, CMU, and UCSD, and our goal is to build open models and open datasets, as well as to open source our serving systems and evaluation platform. You can visit lmsys.org to check out our blog posts and learn more about us.

LMSYS projects span the full LLM stack. First, we have open models: we developed the Vicuna and LongChat models. We also work on evaluation, for which we developed Chatbot Arena, and Chatbot Arena is what I will talk about today. We also have open datasets collected from our demo: LMSYS-Chat-1M, which contains one million real user-LLM conversations from our demo, and the Chatbot Arena Conversations dataset. Lastly, we have open sourced the system behind all of these efforts, which is FastChat.

Before we talk about Chatbot Arena, the story actually starts with Vicuna. Last year, in November, ChatGPT was released, and it basically shocked the entire world; it is arguably one of the most significant breakthroughs in recent history. But it remained a big mystery how OpenAI built it, and earlier this year, before any open-source model could compete, we had no idea what was going on. Fortunately, Meta released the LLaMA model at the end of February, a high-quality open foundation LLM. Very shortly after, researchers at Stanford fine-tuned this base model on conversation data, turning a base model into a chat model, and it surprised the world that we finally had the capability to build open models with this kind of conversational ability.

At Berkeley, we wondered if we could do better. We believe that data quality is the key to getting a better model, so we collected a high-quality dataset from ShareGPT: about 70,000 high-quality user-ChatGPT conversations that users had shared online. There are a few key ingredients. One is data quality: because these are ChatGPT answers, they are significantly better. Another is multi-turn conversation, which is the magic component of these models: you can ask a question, then follow-up questions, and so on, so multi-turn data is essential. The third is longer context length: because users ask many follow-up questions, the context grows longer and longer, and the model needs high-quality long-context data to learn how to handle it. We thought this could make a huge difference.
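To make that concrete, here is a minimal sketch of how multi-turn, ShareGPT-style conversations might be flattened into supervised fine-tuning text. The role-tagged template below is an illustrative assumption, not the actual Vicuna training template, which has its own conversation format.

```python
# Minimal sketch: flattening ShareGPT-style multi-turn conversations into
# supervised fine-tuning text. The role tags below are illustrative only;
# the actual Vicuna training recipe uses its own conversation template.

from typing import Dict, List

SYSTEM = "A chat between a curious user and a helpful AI assistant."

def render_conversation(turns: List[Dict[str, str]]) -> str:
    """Flatten [{'role': 'user'|'assistant', 'text': ...}, ...] into one string."""
    parts = [SYSTEM]
    for turn in turns:
        speaker = "USER" if turn["role"] == "user" else "ASSISTANT"
        parts.append(f"{speaker}: {turn['text']}")
    return "\n".join(parts)

example = [
    {"role": "user", "text": "What is the capital of France?"},
    {"role": "assistant", "text": "The capital of France is Paris."},
    {"role": "user", "text": "Roughly how many people live there?"},
    {"role": "assistant", "text": "About two million in the city proper."},
]
print(render_conversation(example))
```

Each flattened string then becomes one training example, so the model sees the whole multi-turn history, which is exactly where the longer context length comes in.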
So we decided to launch the Vicuna project very shortly after Alpaca, which was released in mid-March. It was kind of a fun project in our lab: we decided to stop everything we were doing for a week or two and just focus on this, because we thought we could make an impact and show something new to the world. So about two or three weeks after Alpaca, we trained Vicuna, one of the first high-quality open chat models released to the world, at the end of March. You can see the progression in capability from LLaMA to Alpaca to Vicuna; we tried to measure how much improvement we got from Alpaca to Vicuna and found it significant, although the numbers we published in March were very preliminary results.

We then released the demo to the world, and it went viral, with millions of views on Twitter and so on. Since then, Vicuna has been widely adopted: there have been over five million downloads, and our demo server has served over two to three million chat requests for Vicuna alone. Our blog post has received almost 500 citations, and it enabled several lines of research in vision-language models; people can take our model and combine it with vision-language datasets to build a vision-language model, which would not have been possible without an open model. It also enabled some AI safety research. Before, ChatGPT was a black-box API; now you have full access to Vicuna's open weights, so you can do white-box attack research and study whether those attacks transfer to ChatGPT, which remains to be seen. So we enabled a good amount of AI research, and more.

But what was next for us in April? As researchers, we saw two major issues. The first is how to evaluate LLMs properly: LLM capabilities are evolving very fast, and traditional evaluation benchmarks simply cannot keep up, so evaluation is falling behind. The second is that, since Vicuna's success basically came from using more high-quality dialogue data, how can the open community continue to collect high-quality datasets? Those are the two major questions for us.

Let's dig into the first one. LLMs are very hard to evaluate because they handle very general, open-ended tasks. The text inputs are often unstructured, the outputs are unstructured too, so they are hard to compare, hard to score against ground truth, and so on. They also interact with users over multi-turn conversations, which was not the case before; previously it was always a single question and answer. This multi-turn setting is also very hard to deal with.

There are also a few limitations in existing benchmarks. Knowledge-focused benchmarks like MMLU use multiple-choice questions, so they are not very open-ended and do not really test a model's conversational skills. Other benchmarks are single-turn, require human annotation effort, or have simply become too easy. And finally, there is a huge risk of contamination, which means the test data may already be in the training data.
Nowadays, LLM pre-training basically scrapes data from across the internet, so you do not really know whether a benchmark's test data is already inside the training data. You do not know whether the model truly generalizes to the knowledge in the benchmark or is just memorizing its training data. These are pretty big questions we are trying to answer.

So the core problem here is: how do we construct user questions that can be used to evaluate LLMs, and how do we get human feedback to decide which answer is better? Shortly after Vicuna's launch, we felt this was one of the biggest issues in the space, because even today a new model comes out every day, and we have little idea what is going on; the benchmark numbers may not accurately capture all these new capabilities. So in May we decided to launch a crowdsourced platform to collect real-world human conversations and feedback. To answer those first two questions, to construct prompts and to get human feedback, we think it has to be real-world: the users decide how they want to use the LLM and which answer they think is better. So we built a platform to experiment with this idea.

This is Chatbot Arena, the new evaluation approach we propose, and the idea is very simple: a crowdsourced, anonymous battle between LLMs. You go to our website and you can ask any question. Two anonymous models, model A and model B, answer the same prompt. After they finish, the user looks at the answers and decides whether A is better, B is better, or it is a tie. If the user cannot decide, they can continue chatting, asking follow-up questions and so on, until they identify a winner.

Since it launched in May, we have been operating this platform for about six months, and we have collected over 100,000 user votes along with their prompts. With these votes we can compute the pairwise win rate between models, as shown on the left: win rates between GPT-4, Claude, the open LLaMA-based models, and many others, visualized as a heat map, for example GPT-4's win rate against every other model. Using this pairwise comparison data, we can build a scalable model ranking through the Elo rating system. Elo is a rating system commonly used in games and sports such as chess: the data you have is pairwise comparisons between players, and you use it to estimate a rating for each player such that the difference between two ratings predicts the expected win rate between those two players (a minimal sketch of this computation follows below). Based on all this comparison data, we have published a leaderboard. The proprietary models still lead the space, but many new open models are catching up, which is pretty promising. You can scan the QR code for the full leaderboard.

Since its launch, Arena has made some impact. We have received shout-outs from industry, including OpenAI's Greg Brockman and Andrej Karpathy, who retweeted our effort.
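Picking up the rating discussion above, here is a minimal sketch of an online Elo update over pairwise battle records. The K-factor, scale, and initial rating are illustrative assumptions, and the production leaderboard pipeline may compute ratings differently; this is only meant to show how pairwise votes turn into a ranking.

```python
# Minimal sketch of Elo ratings computed from pairwise battle outcomes.
# The K-factor, scale, and initial rating are illustrative assumptions;
# the actual Chatbot Arena leaderboard pipeline may differ.

from collections import defaultdict

def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Predicted win probability of A over B given their ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def compute_elo(battles, k: float = 4.0, init: float = 1000.0):
    """battles: iterable of (model_a, model_b, winner), where winner is
    'model_a', 'model_b', or 'tie'."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

battles = [
    ("gpt-4", "vicuna-13b", "model_a"),
    ("vicuna-13b", "llama-13b", "model_a"),
    ("gpt-4", "claude-v1", "tie"),
]
print(compute_elo(battles))
```

In this sketch, expected_score doubles as the predicted win rate between any two rated models, which is the sense in which rating differences summarize the pairwise comparisons.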
Anthropic, Hugging Face, Mistral, and other leading LLM companies have also recognized our effort, and we think the project has established a degree of trust and openness. We are committed to keeping this effort open and transparent by releasing all the code and datasets we use. If you are curious, you can visit the demo at chat.lmsys.org, where you will find 20-plus cutting-edge LLMs; you can ask any question and contribute your votes.

As for the value of Chatbot Arena: the first, obviously, is that we can use it as a platform to evaluate and rank models. But in addition, we can use the platform to collect valuable data and valuable human feedback. People come to the site and ask real questions, and that data is valuable because LLMs are a very new field and we are still at an early stage in understanding how people use them; we are very interested in these use cases and in how people interact with LLMs, so this data is a valuable resource to study. The human feedback, the votes on which model is better, can be used to identify a model's weaknesses and then further improve it. So we now position Chatbot Arena as an open, scalable, and interactive data collection platform; we are trying to do something like a Wikipedia for LLMs.

To demonstrate the use cases, two months ago, in September, we released the first large-scale LLM conversation dataset, LMSYS-Chat-1M: one million real-world conversations from about 200K users, capturing how they interact with LLMs. We studied a few use cases. First, real-world data is noisy and contains content that can be unsafe, and we can use this dataset to develop a content moderation model that is on par with OpenAI's moderation API or even GPT-4. Second, you can use this dataset to build hard benchmark questions that effectively differentiate models. And last, we studied how to use this dataset to train a better model. These are potential use cases. The paper is on arXiv now, will hopefully appear at ICLR next year, and you can download the dataset from our Hugging Face page.

We also open sourced the system behind Arena, which we call FastChat. It has been widely used by many industry and academic groups; it passed 30,000 GitHub stars just a few days ago, and more than 200 contributors are helping us extend the project, which is really a great effort.

Of course, there are more topics to investigate, and these are ongoing efforts. First, how can we prevent cheating, and how can we identify malicious users who vote randomly or always vote for specific models? Second, how can we continue to incentivize organic engagement and encourage users to come back, ask questions, and vote? Third, how do we filter low-quality prompts and low-quality votes? And last, can we develop an efficient sampling algorithm for choosing which pair of models to show, as sketched below?
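On that last open question, one hedged illustration (not the sampling strategy Chatbot Arena actually uses) is to weight pair selection toward comparisons whose outcome is most informative, for example models with close ratings or few recorded battles:

```python
# Illustrative sketch only (not Chatbot Arena's actual algorithm): sample
# model pairs with probability weighted toward close ratings and few past
# battles, since those comparisons carry the most information.

import random
from itertools import combinations

def sample_pair(ratings, battle_counts, tau=100.0):
    """ratings: {model: elo}; battle_counts: {(a, b): n} with a < b lexically."""
    pairs, weights = [], []
    for a, b in combinations(sorted(ratings), 2):
        closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]) / tau)
        novelty = 1.0 / (1.0 + battle_counts.get((a, b), 0))
        pairs.append((a, b))
        weights.append(closeness * novelty)
    return random.choices(pairs, weights=weights, k=1)[0]

ratings = {"gpt-4": 1180, "claude-v1": 1150, "vicuna-13b": 1050, "llama-13b": 950}
counts = {("claude-v1", "gpt-4"): 500, ("gpt-4", "vicuna-13b"): 200}
print(sample_pair(ratings, counts))
```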
The motivation for that last question is that if you just sample pairs at random and you compare a very strong model with a very weak one, the outcome does not tell you much. We have a limited amount of human interaction on this website, so we want to maximize what we learn from it, and we are investigating how to improve this.

So I think that brings me to the end. I want to acknowledge the great team behind all these efforts. We have a core student team maintaining the site and developing new features, and a faculty team giving us great advice. We also thank our sponsors for their generous support of this platform; without them it would not be possible to operate this service. Lastly, a few contacts: you can visit our homepage to learn more about our projects, we have a demo you can play with, and our code, data, and models are open source. With that, I conclude my talk here. Thanks.

Thank you so much. We do have a couple of minutes left. Does anyone have a question? And we do have one question. You mentioned some of the work you are planning to do to handle bad input. How big of a problem is that right now for the dataset?

So, surprisingly, it is not too bad; maybe just three percent. We have some preliminary ways to filter it, and we found that probably less than 5% of the data is like that. Most users are pretty organic: they ask their own questions, their own tasks, on this platform, and they come here not to waste their time or ours but to find an answer, because they find value in our platform. I think the main reason is that we offer a free service, free GPT-4 access, for people to use. So most of the prompts are pretty real-world tasks.

Great, thanks. Thanks for the presentation. I did take a quick look at the Arena, where you can type in any question and compare two models; very clever idea. But of course, those questions can cover any topic, and you talked about your dataset of live conversations. I am curious whether you think there is value in setting up similar constructs focused on narrow topics. Rather than a model that can answer anything, a lot of the custom GPTs and RAG systems focus models on specific topics, and I am wondering, having looked at the data, whether you think there is value in benchmarking within specific topic areas.

Yes, that is a great question. I actually forgot to put it in the slides, but we are also working on categorization. For example, we can start simple: filter only the coding-related questions and see how that leaderboard looks, how it differs from the general leaderboard, and which models are actually better at coding than others. This effort is also ongoing.

Thank you for the great talk, and for really upholding open-source AI principles. I represent Genevieve Commons and the AI Alliance, and I wonder whether LMSYS is basically a loose alliance of students and faculty, or whether it plans to become an open-source organization that could join, for instance, the Linux Foundation or the AI Alliance. How are you planning to proceed after students graduate and take academic jobs and so forth?

Yeah, so currently it is still a loose organization that consists of students, faculty, and so on.
But we are discussing how to make this effort more sustainable: how to bring in more community involvement, more contributors and developers, and so on. This is still an ongoing discussion, and we would love to hear your feedback.

Wonderful, thank you so much. That is all the time that we have. Wei-Lin, thank you so much for joining us, and let's give him another hand. Thank you.