All right. So that photo is obviously several years old, from when I had more hair and wasn't as familiar with brands like Costco and Skechers. Hi everyone, my name is Gaurab. I run the customer success organization at Weights & Biases. And I do use GenAI and foundation models myself: if you have seen the abstract of this talk, it was generated by ChatGPT. I gave the PDF of the presentation to ChatGPT, gave it a few prompts, and it gave me the abstract. I didn't have time to write one, but it's actually pretty good quality.

Beyond being a user, though, there should be some reason why I am here, and the topic I want to talk about is enterprise value with generative AI. The reason I'm coming from Weights & Biases to talk about it is that we have been lucky to have a front seat to how this world of LLM development has been shaping up for a while. We have a few not-so-well-known customers that have been developing different forms of LLMs for a while now, and we have been supporting them. What I would say is that this landscape is evolving constantly. We started with mostly open source models, and BART was one key milestone in many ways. From that point onward, the field kept growing and developing in different formats, and now we have open source models and closed source models, and all of this complicated world can certainly be difficult to navigate.

Now, in this evolving world, while we talk a lot about the variety of models and about standardizing performance across these different foundation models, what we rarely talk about is tools. Imagine talking about the architecture of different models and never being able to talk about what type of tools people would actually use to build and design them. There needs to be some sort of discussion there, and we thought that could be our contribution to today's conversation: what set of tools will set this constantly evolving world of generative AI up for success?

So what I will do with the rest of the seven minutes I have is lay out a few challenges, and then what good looks like in our opinion. Surprise, surprise, it will probably end with a slide of Weights & Biases showing that we do provide some of those capabilities, but I also want to dream a little about what good looks like in a world where we are interacting with generative AI constantly.

There are three main challenges. The first, which is very unique to generative AI, is the sheer volume of generated metadata and the combination of different factors that play into choosing the best solution. When I was at DataRobot, we were doing fairly simple, standard tabular data modeling. You have a set of different model types, and that's pretty much it: you are optimizing and selecting models out of that set, varying some datasets based on the context. But with generative AI, the options are endless. You can try different datasets, different model types, different hyperparameters, and different prompts when you start interacting with these applications, and the experiments, datasets, models, and training paradigms explode in dimensionality.
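To make that concrete, here is a minimal sketch of what recording a single point in that combinatorial space could look like, using the standard wandb Python client (wandb.init, log, finish). The project name, config values, and the evaluate() helper are hypothetical placeholders, not anything from the talk:

```python
# A minimal sketch: one run = one combination of dataset, model,
# hyperparameters, and prompt. Assumes `pip install wandb` and `wandb login`.
import wandb

def evaluate(model_name: str, prompt_template: str) -> float:
    """Hypothetical stand-in for whatever task metric you actually compute."""
    return 0.0  # placeholder score

run = wandb.init(
    project="genai-system-of-record",      # hypothetical project name
    config={
        "dataset": "support-tickets-v3",   # which dataset the run started from
        "model": "llama-2-7b",             # which base model variant
        "learning_rate": 2e-5,             # one of many tunable knobs
        "prompt_template": "Summarize: {ticket}",
    },
)
score = evaluate(run.config["model"], run.config["prompt_template"])
run.log({"eval/score": score})             # comparable across every run later
run.finish()
```

The point of a sketch like this is that every combination you try becomes one queryable record, so the dataset, model variant, hyperparameters, and prompt that produced a given score stay attached to that score.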
And in that explosive world, the other very important thing happening is that multiple types of people are starting to interact with these tools and models directly. These different personas have different languages, different backgrounds, and different contexts for why they are using these foundation models. So what we probably need to solve this particular problem is some common system of record that organizes all these different languages and all this information in one place, so that you know which dataset you started your experiment with, which variation of the model gave you the best results, which model in production, with which prompts, gives you the best accuracy, and so on.

The second challenge that we often see is orchestration. If you think about interacting with ChatGPT as a consumer, it's fairly easy: you open up a web browser, you type out some things, and that's pretty much it. But the developers who are actually building or fine-tuning those models are dealing with a much larger, more complex environment. Either they are continuously optimizing the performance of these models through hyperparameter tuning, changing datasets, changing prompts, and all of that, or they have to train or run these models in really large, complex environments that are probably not set up to be that generalizable. And there is a need for repeatable workflows so that you don't repeat your mistakes; the thing that worked once, you want to be able to do again and again. Because what you're dealing with, in the end, is very complex, very expensive resources: both people's time and compute. In that world, you probably want an orchestration system that makes it easy for an end user or developer to interact with large infrastructure, and also, much like DevOps, continuous monitoring of performance and system utilization, so that you're not spending compute and other costs in places where you should not.
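As a rough illustration of such a repeatable workflow, here is a minimal sketch using the W&B sweeps API (wandb.sweep and wandb.agent, which are part of the standard client). The search space, metric name, and the body of train() are hypothetical placeholders:

```python
# A minimal sketch: the experiment design lives in a declarative config
# rather than ad-hoc shell scripts, so the workflow can be rerun exactly.
import wandb

sweep_config = {
    "method": "random",                    # or "grid" / "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [1e-5, 2e-5, 5e-5]},
        "prompt_template": {"values": ["Summarize: {x}", "TL;DR: {x}"]},
    },
}

def train():
    run = wandb.init()                     # receives one parameter combination
    # ... fine-tune and evaluate here, reading run.config.learning_rate
    #     and run.config.prompt_template ...
    run.log({"val_loss": 0.0})             # placeholder metric value
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="genai-orchestration")
wandb.agent(sweep_id, function=train, count=10)  # run 10 trials locally
```

The same sweep ID can be handed to agents on other machines, which is one way of letting a developer drive large infrastructure without hand-launching every run.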
The third challenge, which was also alluded to earlier in the first keynote, is the need for continuous evaluation, documentation, and governance. Most of the time, we don't trust why the models are doing what they're doing. So there is a very strong need to be able to continuously dissect these foundation models and understand why they do certain things in certain places, and how you can get them to do the things you want. And that evaluation is really multifaceted. When we were doing simple tabular model building, you took the dataset, put it into a model, changed a few hyperparameters, changed a few dataset columns, did a lot of feature engineering. I'm not trivializing that; these are still complex things. But here it has to be even more flexible. You need to be able to change the prompts and the different types of prompts you want to give the model. You might want to change the context, the model type, the API being called, and whatnot, and that diversity really needs to be handled somehow in whatever platform you're using for evaluation.

The other thing on the evaluation side is that, very clearly, there is no single right answer. If you are building a model on a tabular dataset, you know that you have to maximize AUC or minimize log loss; very clear. There is no standard metric for capturing hallucination. There is no standard metric for capturing toxicity. Many of these metrics are still growing research areas. So how do you evaluate something in a standard way? It's not possible. You have to be constantly flexible and try whatever different approaches you can to guardrail things.

And from a governance perspective, you need to be able to document all of this and have it in one place, so that you can communicate not just with your fellow developers who know how to read code, but also with other stakeholders in your organization or outside it, and maybe eventually a regulatory body. What that means is that you need a platform that is flexible in the ways it lets you inspect, debug, visualize, and, finally, document. And this has never been more important than it is now, with foundation models.

So this is the only promotional slide I will have. I know this is an open source crowd, and Weights & Biases does have a free version. But the way we are trying to think of it is that this won't live in one single tool; it's impossible for one tool to do all of these things. So we want to build multiple tools that capture the different aspects of it. Experiments and Sweeps will track all of these experiments. Artifacts will track all of these datasets. Tables and Reports will give you a lot of evaluation flexibility. Automations and Launch will help you orchestrate, and Prompts and Monitoring will help you debug and visualize your prompts and monitor them in a continuous manner. That is the holistic idea that we think can actually deliver a more standard toolkit for the next generation of LLM developers.

With that: Weights & Biases, beyond experiment tracking. I don't have a cool QR code like the previous speaker, but if you are interested in talking more, I will be happy to be available. Thanks for your attention, and thanks for being here today.