Hi, everybody. Thank you for joining me for this talk. I'm the VP of Engineering at a startup called Together AI, and I'd like to talk today about why we believe open source is the future of LLMs, and the future of AI, as well as a couple of things we found a little surprising as we tried to scale out access to all of these open-source large language models.

First, a little bit about Together. We are a research-driven AI company trying to bring state-of-the-art research on optimization to bear on building the most efficient infrastructure for large language models and AI in general. We're a young company, but we're built on decades of research by our founders, who include professors at Stanford and ETH. Probably the best-known example of that research is FlashAttention, done by Tri Dao over there, which is used by pretty much anybody doing LLMs right now to optimize their training loops.

As I mentioned earlier, at Together we fundamentally believe that open source is the future of AI. We don't think there's going to be a single commercial model to rule them all. We think there will be a lot of leading models, and a lot of them will be open source. The reason comes down to three things.

The first is transparency. Anybody who is investing significantly in these large language models is going to want transparency into how the model behaves. In particular, they're going to want to know what data the model was trained on, what methods were used to train it, and what the model quality is. This is important for things like model risk management and model review boards. It's also part of why we've partnered closely with the Stanford Center for Research on Foundation Models to develop HELM, a benchmark that can evaluate 119 different models across lots of different scenarios and produce metrics across all of them, so you can see how they actually behave and perform.

The second reason we think the future of LLMs is open source is control. If you've invested heavily in these large language models, then, as a lot of us know, they can hallucinate; you may put one in production and suddenly it's hallucinating in front of all of your customers. Open-source models give you much more control over that behavior. If a model isn't doing what you want, you can do a full fine-tune of it. You control when you deploy the model to production and when you roll out updates to your customers. And you can simply download the model and deploy it anywhere you choose, on your own infrastructure or anywhere in the cloud. With open-source models, you own the model.

And lastly, the reason I believe open source is the future of large language models is privacy. As I mentioned, you can take the model and deploy it on your own infrastructure. You can also use your own infrastructure for training and fine-tuning these models, so you keep control over all the fine-tuning data you use, all the input prompts your model sees in production from your users, all the responses from the model, and any other user data you feed into it.
For all these reasons, we believe open source is going to be the future of large language models. One of the things we've done at Together is provide a service where anybody can go to our website, get an API key, and start using any of 100-plus open-source large language models. We added Mixtral support earlier this week, for example. What I'd like to talk about now is some of the issues we ran into trying to scale out this service that serves all of these open-source models. I want to cover four things that were at least a little surprising to us as we scaled it out: autoscaling and startup, bin packing versus robustness, timeouts and keep-alive, and continuous batching.

To start with, when we're hosting this many models in production, we want to autoscale with demand; we don't want to run every single model at peak capacity all the time. That turns out to be incredibly challenging, and one thing that's really important to solve when autoscaling is model startup. At Together we have our own inference stack that runs inference for all of these models, and we spent a lot of time optimizing that stack and its startup time, getting it to load models as quickly as possible and be ready to serve traffic. What we found, a little surprisingly, is that our stack was not the biggest part of the startup problem. Startup time was dominated by downloading the models, because the models can be fairly large: a 70-billion-parameter model is around 140 gigabytes. When you have thousands of GPUs and you have to ship all of that data to all of the workers and services, the downloads start to dominate startup time. So we had to be really smart about caching: caching models close to our GPUs, both within the same data center and on each instance, and being smart about predicting where to pre-cache things. If a model is going to be really popular, push it to as many places as you can. (There's a small sketch of this caching idea below.)

The second thing that caught us a little by surprise was bin packing versus robustness. There's a trade-off between robustness and actually being able to autoscale some of our models. Imagine you're running a bunch of small 7B models that take one or two GPUs each. If you want a robust service, you spread those models across as many machines as you can, where each machine may have eight GPUs. But now imagine you're also running a large model, say a 70B, that wants all eight GPUs on a machine. When it comes time to scale up that 70B model, it may not be able to find a machine to run on, because you've spread the small models across all the machines and there's no whole machine left. That ended up being really tricky for us as well. We had to set different priorities so that when we scale up a 70B model, it can kick off some of the smaller models, and we bin pack in order to maintain both robustness and efficiency. (A rough scheduling sketch follows the caching one below.)
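To make the caching idea concrete, here's a minimal sketch in Python. The directory layout and helper names (`LOCAL_CACHE`, `DC_MIRROR`, `download_from_origin`) are hypothetical assumptions for illustration, not Together's actual stack; the point is just that a node should pull weights from the nearest copy, seed the in-datacenter mirror on a miss, and pre-warm models that are expected to be popular.

```python
import os
import shutil

# Hypothetical layout (not Together's actual stack): each GPU node keeps a
# local weight cache on NVMe, and each data center runs a shared mirror so a
# ~140 GB 70B checkpoint never has to cross the WAN more than once per site.
LOCAL_CACHE = "/var/cache/models"      # node-local NVMe
DC_MIRROR = "/mnt/dc-mirror/models"    # shared in-datacenter storage


def download_from_origin(model_id: str, dest: str) -> None:
    """Placeholder for the slow path: pull weights from the origin
    (object store, model hub, ...) into `dest`."""
    raise NotImplementedError("wire this up to your actual origin store")


def ensure_model_cached(model_id: str) -> str:
    """Return a local path to the weights, copying from the nearest source:
    node cache -> datacenter mirror -> origin."""
    local_path = os.path.join(LOCAL_CACHE, model_id)
    if os.path.isdir(local_path):
        return local_path                            # already on this node

    mirror_path = os.path.join(DC_MIRROR, model_id)
    if not os.path.isdir(mirror_path):
        download_from_origin(model_id, mirror_path)  # worst case: WAN pull
    shutil.copytree(mirror_path, local_path)         # intra-DC copy only
    return local_path


def prewarm(popularity: dict[str, float], top_n: int = 5) -> None:
    """Pre-cache the most popular models on this node before traffic arrives."""
    for model_id, _ in sorted(popularity.items(), key=lambda kv: -kv[1])[:top_n]:
        ensure_model_cached(model_id)
```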
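And here's a toy version of the bin-packing trade-off: a best-fit placer that packs small replicas tightly so whole eight-GPU machines stay available, and falls back to evicting lower-priority replicas when a big model needs a full node. The `Node`/`Replica` structures and the priority scheme are illustrative assumptions, not our actual scheduler.

```python
from dataclasses import dataclass, field

GPUS_PER_NODE = 8


@dataclass
class Replica:
    model: str
    gpus: int
    priority: int          # higher = harder to evict


@dataclass
class Node:
    name: str
    replicas: list[Replica] = field(default_factory=list)

    def free(self) -> int:
        return GPUS_PER_NODE - sum(r.gpus for r in self.replicas)


def place(nodes: list[Node], want: Replica) -> Node | None:
    """Best-fit bin packing: put the replica on the node with the *least*
    free space that still fits it, so whole nodes stay open for big models."""
    candidates = [n for n in nodes if n.free() >= want.gpus]
    if candidates:
        best = min(candidates, key=lambda n: n.free())
        best.replicas.append(want)
        return best

    # No node fits: try freeing one node by evicting lower-priority replicas
    # (in a real scheduler the evicted replicas would be rescheduled elsewhere).
    for node in nodes:
        evictable = [r for r in node.replicas if r.priority < want.priority]
        if node.free() + sum(r.gpus for r in evictable) >= want.gpus:
            for r in sorted(evictable, key=lambda r: r.priority):
                node.replicas.remove(r)
                if node.free() >= want.gpus:
                    break
            node.replicas.append(want)
            return node
    return None   # out of capacity
```

So a fleet full of scattered 1-GPU 7B replicas would still leave room for a high-priority 8-GPU 70B replica: the placer either finds a whole free node or clears one by evicting the cheapest small replicas on it.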
The third thing that caught us by surprise, and that we had to spend a lot of time on, is timeouts. One of the things about these large language models is that requests can take a very long time. End to end, from the first byte to the last byte of the response, it can be tens of seconds, sometimes hundreds of seconds, depending on whether you're sending a huge prompt. That's very different from typical web traffic or a typical internet service. What we found is that every single part of the path needed specific tuning of its timeouts, both so we didn't throw errors at our customers and so we could take best advantage of persistent connections, the HTTP keep-alive connections, throughout the entire path. The trick was to map out the entire data path and make sure the timeouts were set progressively longer along the whole path. (There's a small sketch of that idea at the end.)

And finally, I want to talk about continuous batching. A lot has been written about continuous batching, about how fast it is in terms of tokens per second and how much it can reduce latency. But what we found is that continuous batching was essential just so we could handle multiple requests per server. Without batching, if every GPU could only take one request at a time, there was no way we would have had enough capacity to serve all the traffic we needed to serve. Continuous batching was essential just to make a scalable service in the first place. (A toy version of that loop is sketched at the end as well.)

So those were a small sampling of the surprises we ran into trying to scale out this service, and hopefully that helps you avoid some of them if you end up having to scale this yourself. Again, at Together we're providing this service to bring these open-source models to the masses, because we believe the future of AI is open source, for better transparency, better control, and better privacy. Thank you.
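Going back to the timeout point: here's a small sketch of the kind of check this implies, under one reasonable reading of "progressively longer along the path". The hop names and numbers are made up for illustration, not Together's topology; the invariant encoded here is that each hop behind the previous one keeps its keep-alive/idle timeout longer than the hop in front of it, so no layer closes a persistent connection that an upstream layer still considers reusable, and every read timeout has to allow for generations that run to hundreds of seconds.

```python
# Illustrative only: hypothetical hops and values, ordered client -> model.
PATH = [
    # (hop, keep-alive / idle timeout in seconds)
    ("edge load balancer", 60),
    ("api gateway",        75),
    ("model router",       90),
    ("inference worker",  120),
]


def check_keepalive(path: list[tuple[str, int]]) -> None:
    """Fail fast if any hop's idle timeout does not outlive the hop in
    front of it, which would allow stale connections to be reused."""
    for (outer, t_outer), (inner, t_inner) in zip(path, path[1:]):
        assert t_inner > t_outer, (
            f"{inner} ({t_inner}s) must outlive {outer} ({t_outer}s)"
        )


check_keepalive(PATH)
```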
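And here is a toy continuous-batching loop, just to show the mechanism the talk relies on: new requests join the running batch between decode steps and finished requests leave immediately, so a single GPU serves many requests concurrently instead of one at a time. `decode_step` is a stand-in for the real forward pass, and `MAX_BATCH` stands in for whatever the KV-cache memory budget allows.

```python
import queue
from dataclasses import dataclass, field

MAX_BATCH = 8   # in a real server this is bounded by KV-cache memory


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list[str] = field(default_factory=list)


def decode_step(batch: list[Request]) -> None:
    """Stand-in for one forward pass that appends one token to every
    active sequence (a real server would run the model here)."""
    for r in batch:
        r.tokens.append("<tok>")


def serve(incoming: "queue.Queue[Request]") -> list[Request]:
    """Continuous batching: requests join the running batch at any decode
    step and leave as soon as they finish, without stalling the others."""
    active: list[Request] = []
    done: list[Request] = []
    while active or not incoming.empty():
        # Admit waiting requests into free slots *between* steps, instead of
        # waiting for the whole batch to finish (static batching).
        while len(active) < MAX_BATCH and not incoming.empty():
            active.append(incoming.get_nowait())

        decode_step(active)

        still_running: list[Request] = []
        for r in active:
            (done if len(r.tokens) >= r.max_new_tokens else still_running).append(r)
        active = still_running
    return done


if __name__ == "__main__":
    q: queue.Queue = queue.Queue()
    for i in range(20):
        q.put(Request(prompt=f"req-{i}", max_new_tokens=4 + i % 5))
    print(len(serve(q)), "requests completed")
```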