Hello, everyone. Welcome to theCUBE's presentation of the AWS Startup Showcase: AI and Machine Learning, the top startups building foundational model infrastructure. This is season three, episode one of the ongoing series covering the exciting startups from the AWS ecosystem, talking about machine learning and AI. I'm your host, John Furrier. Today we are excited to be joined by Luis Ceze, who's the CEO of OctoML, and Anna Connolly, VP of customer success and experience. OctoML, great to have you on again, Luis. Anna, thanks for coming on. Appreciate it.

Thank you, John, it's great to be here.

I love the company. We had a CUBE conversation about this. You guys are really addressing how to run foundational models faster for less, and that is the key theme. Before we get into it, this is a hot trend, but let's explain what you guys do. Can you set the narrative of what the company is about, why it was founded, what's your North Star, and your mission?

Yeah, so John, our mission is to make AI sustainable for everyone. What we offer customers is a way of taking their models into production in the most efficient way possible, by automating the process of getting a model, optimizing it for a variety of hardware, and making it cost effective. So better, faster, cheaper model deployment.

You know, the big trend here is AI. Everyone's seen ChatGPT as kind of a shot heard around the world, the Bing AI fiasco, and the ongoing experimentation. People are into it. And I think the business impact is clear. I haven't seen this kind of inflection point in all of my career in the technology industry. Every senior leader I talk to is rethinking how to rebuild their business with AI, because now the large language models have come in. These foundational models are here. They can see value in their data. This has been a 10-year journey in the big data world, and now it's all coming together. Everyone's rebuilding their company around this idea of being AI first, because they see ways to eliminate things and make things more efficient. So now that's telling them, go do it. And then, what do we do? So what do you guys think? Can you explain what this wave of AI is and why it's happening? Why now? What should people pay attention to? What does it mean to them?

Yeah, it's pretty clear by now that AI can do amazing things that capture people's imaginations, and it can now also do things that are really impactful in businesses, right? So what people have the opportunity to do today is either train their own model that adds value to their business, or find open models out there that can do very valuable things for them. The next step really is, how do you take that model and put it into production in a cost-effective way so that the business can actually get value out of it, right?

And what's your take? Because customers are there, you're there to make them successful, you've got the new secret weapon for their business.

Yeah, I think we just see a lot of companies struggle to get from a trained model to a model that is deployed in a cost-effective way that actually makes sense for the application they're building. I think that's a huge challenge we see today, across the board, across all of our customers.

Well, I see this. Everyone asks the same question: I have data, I want to get value out of it, I've got to get these big models, I've got to train them. What's it going to cost? So I think there's a reality of, okay, I've got to do it.
Then no one has any visibility on what it costs once they get into it, so this could break the bank. So I have to ask you guys: the cost of training these models is on everyone's mind. OctoML, your company is focused on the cost side of it as well as the efficiency side of running these models in production. Why are the production costs such a concern, where specifically are people looking at it, and how did it get here?

Yeah, so training costs get a lot of attention because it's normally a large number, but we shouldn't forget that it's typically a large, one-time, upfront cost that customers pay. Once the model is put into production, the cost grows directly with model usage, and you actually want your model to be used, because it's adding value, right? So the question a customer faces is: they have a trained model, and now what? How much will it cost to run in production? And now, with the big wave in generative AI, which rightfully is getting a lot of attention because of the amazing things it can do, it's important for us to keep in mind that generative AI models like ChatGPT are huge, expensive energy hogs. They cost a lot to run, right? And given that model cost grows directly with usage, what you want to do is make sure that once you put the model into production, you have the best cost structure possible, so that you're not surprised when it gets popular, right?

So let me give you an example. If you have a model that costs, say, one to two million dollars to train, but then costs about one to two cents per session to use, and you have a million active users, then even if they each use it just once a day, that's 10 to 20 thousand dollars a day to operate that model in production, and that very, very quickly gets beyond what you paid to train it.

And these aren't small numbers, and this cost to train versus cost to operate kind of reminds me of when the cloud came around, and the data center versus cloud debate: wait a minute, it costs a ton of cash to deploy, and then there's the cost of running it. This is a similar dynamic. What are you seeing?

Yeah, absolutely. I think we are going to see the costs in production increasingly outpacing the costs in training, by a lot. People talk about training costs now because that's what they're confronting now, because everyone has been so focused on getting models to perform well enough to even use in an application. And now that we have them, and they're that capable, we're really going to start to see production costs go up a lot.

Yeah, Luis, if you don't mind, I know this might be a little bit of a tangent, but training is super important, I get that, that's what people are doing now. But then there's the deployment side of production. Where do people get caught up and miss the boat, or misconfigure? What's the gotcha? Where's the tripwire, so to speak? Where do people mess up on the cost side? What do they do? Is it that they don't think about it? Do they tie it to proprietary hardware? What's the issue?

Yeah, several things. So without getting too annoyingly technical, which I might be getting to, you have to understand the relationship between performance, both in terms of latency and throughput, and cost. Reducing latency is important because you improve the responsiveness of the model, but it's really important to keep in mind that it often leads to diminishing returns: below a certain latency, making it faster won't make a measurable difference in experience, but it's going to cost a lot more. So understanding that is important.
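(A minimal back-of-the-envelope sketch, in Python, of the cost arithmetic in Luis's example above. The dollar figures and user counts are the illustrative ones from the conversation, not measured data.)

    # Training-vs-serving cost arithmetic from the example above.
    # All figures are illustrative, not benchmarks.
    training_cost = 1_500_000          # one-time: roughly $1-2M to train
    cost_per_session = 0.015           # roughly 1-2 cents per session
    daily_active_users = 1_000_000     # one session per user per day

    daily_serving = daily_active_users * cost_per_session   # $15,000/day
    annual_serving = daily_serving * 365                    # ~$5.5M/year
    breakeven_days = training_cost / daily_serving          # ~100 days

    print(f"serving: ${daily_serving:,.0f}/day, ${annual_serving:,.0f}/year")
    print(f"serving passes the training cost after ~{breakeven_days:.0f} days")

At these rates the serving bill overtakes the one-time training cost in roughly three months, which is exactly Luis's point about production costs dominating.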
Now, if you care more about throughput, which is the number of units processed per unit of time, then you care about time to solution, and we should think about throughput per dollar. Understand that what you want here is the highest throughput per dollar, which may come at the cost of higher latency that you're not going to care about, right? The reality here, John, is that humans, and especially folks in this space, want to have the latest and greatest hardware, and they often commit a lot of money to get access to it, and have to commit upfront, before they understand the needs that their models have, right? So the common mistakes here are, one, not spending time to understand what you really need, and two, overcommitting and using more hardware than you actually need, without giving yourself enough freedom to move your workload around to the more cost-effective choice, right? So that's a matter of hardware choice. And then another thing that's important here is that making a model run faster on the hardware translates directly to lower cost, right? But it takes a lot of engineers thinking of ways to produce very efficient versions of your model for the target hardware that you're going to use.

And what's the customer angle here? Because price performance has been around for a long time, people get that, but now latency and throughput are key, because we're starting to see this in apps. I mean, it's an end-user piece. I've been seeing it on the infrastructure side, where they're taking the heavy lifting away from operational costs. So you've got the application-specific piece for the user at the top of the stack, and then you've got it actually being used in operations, and they want both.

Yeah, absolutely. Maybe I can illustrate this with a quick story about a customer we've been working with recently. This customer was planning to run a transformer-based model for text generation at super high scale on NVIDIA T4 GPUs, kind of a commodity GPU. And the scale was so high that they would have been paying hundreds of thousands of dollars in cloud costs per year just to serve this model alone, one of many models in their application stack. So we worked with this team to optimize their model and then benchmark it across several possible targets, the hardware matching Luis was just talking about, including the newer NVIDIA A10 GPUs. And what they found during this process was pretty interesting. First, the team was able to shave a quarter off their spend just by using better optimization techniques on the T4, the older hardware. But moving to a newer GPU would allow them to serve this model at sub-two-millisecond latency, so, super fast, which unlocked an entirely new kind of user experience. They were able to change the value they deliver in their application just because they could move to this new hardware easily. So they ultimately decided to plan their deployment on the more expensive A10s because of this. But because of the hardware-specific optimizations we helped them with, they managed to bring costs down even from what they had originally planned. And if you extend this kind of example to everything that's happening with generative AI, the story we just talked about is super relevant, but the scale can be even higher. It can be tenfold that.
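(To make the selection logic in that story concrete, here is a minimal sketch of choosing among benchmarked hardware targets by latency SLA and throughput per dollar. The instance names, prices, latencies, and throughputs below are hypothetical placeholders, not the customer's actual numbers.)

    # Choose a hardware target: meet the latency SLA, then maximize
    # throughput per dollar. All numbers are hypothetical placeholders.
    candidates = [
        # (target, p50 latency in ms, requests/sec, price in $/hour)
        ("t4-baseline",   9.0, 1100, 0.526),
        ("t4-optimized",  6.5, 1500, 0.526),   # same GPU, optimized model
        ("a10-optimized", 1.8, 4200, 1.006),   # newer GPU, optimized model
    ]
    sla_ms = 2.0   # say the app needs sub-two-millisecond responses

    def requests_per_dollar(rps, price_per_hour):
        return rps * 3600 / price_per_hour

    viable = [c for c in candidates if c[1] <= sla_ms]
    best = max(viable, key=lambda c: requests_per_dollar(c[2], c[3]))
    print(f"pick {best[0]}: {requests_per_dollar(best[2], best[3]):,.0f} requests per dollar")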
We recently conducted an internal study using GPT-J as a proxy to illustrate the experience of a company trying to use one of these large language models, with an example scenario of creating a chatbot to help job seekers prepare for interviews. Imagine a conservative usage scenario where the model generates just 3,000 words per user per day, which is pretty conservative for how people are interacting with these models, at about five cents a session. If you're a company and your app goes viral, so at the beginning of the year there's nobody and at the end of the year there's a million daily active users, then in that year alone, going from zero to a million, you'll be spending about $6 million a year, which is pretty unmanageable.

That's crazy, right? For a company or a product that's just launching.

Right. So for us, the real way to make these kinds of advancements accessible and sustainable, as we said, is to bring down the cost to serve.

That's a great story, and I think it illustrates this idea that deployment costs can vary from situation to situation, or model to model, and that the efficiency gain is so strong with this new wave. It eliminates heavy lifting, creates more efficiency, automates intellect. This is the trend, and it's radical. This is going to increase, so the cost could go from nominal to millions, literally, potentially. So this is what customers are facing. Yeah, that's a great story.

What makes sense financially? Is there a cost-of-ownership model? Is there a pattern of best practice for training? What do you guys advise? Because there's a lot of time and money involved, and in the potential good scenarios of upside you can get over your skis, as they say: be successful and still be out of business if you don't manage it. I mean, that's what people are talking about.

Yeah, absolutely. I think we see three main vectors to reduce costs. One is to make your deployment process easier overall, so that the engineering effort to even get your app running goes down. Two would be to get more from the compute you're already paying for; you're already paying for your instances in the cloud, but can you do more with them? And three would be to shop around for lower-cost hardware to match your use case.

On the first one, making deployment easier overall: there's a lot of manual work that goes into benchmarking, optimizing, and packaging models for deployment. And because the performance of machine learning models can be really hardware-dependent, you have to go through this process for each target you want to consider running your model on. And this is hard, we see that every day. But for teams who want to incorporate some of these large language models into their applications, doing it yourself might be desirable, because licensing a model from a large vendor like OpenAI can leave you over-provisioned, paying for capabilities you don't need in your application, or can lock you into them, and you lose flexibility. We have a customer whose team prepares models for deployment in a SaaS application that many of us use every day, and they told us recently that without an automated benchmarking and experimentation platform, they were spending several days each time just to benchmark a single model on a single hardware type. So this is really manually intensive.
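(A minimal sketch of the kind of benchmarking loop that gets automated here: time a model callable on whatever hardware you're on and report latency and throughput. The dummy workload stands in for a real model; this illustrates the idea, not OctoML's tooling.)

    # Tiny benchmark harness: run a callable repeatedly, report p50 latency
    # and throughput. A real effort repeats this per model per hardware
    # target, which is what makes doing it by hand so slow.
    import statistics
    import time

    def benchmark(fn, warmup=10, iters=100):
        for _ in range(warmup):            # warm caches before timing
            fn()
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            times.append(time.perf_counter() - t0)
        p50_ms = statistics.median(times) * 1000
        return p50_ms, 1000.0 / p50_ms     # latency in ms, requests/sec

    dummy_model = lambda: sum(i * i for i in range(50_000))  # stand-in workload
    p50, rps = benchmark(dummy_model)
    print(f"p50 latency: {p50:.2f} ms, throughput: {rps:.0f} requests/sec")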
And then, on getting more from the compute you're already paying for: we do see customers who leave money on the table by running models that haven't been optimized specifically for the hardware target they're using, like Luis was mentioning. Some teams just don't have the time to go through an optimization process, and others might lack the specialized expertise, and this is something we can bring.

And then on shopping around for different hardware types: we really see a huge variation in model performance across hardware, not just CPU versus GPU, which is what people normally think of, but across CPU vendors themselves, across high-memory instances, and even across cloud providers. So the best strategy here is for teams to look before they leap, by running real-world benchmarks, not just simulations or predictions, to find the best software and hardware combination for their workload.

Yeah, you guys sound like you have a very impressive customer base deploying large language models. How would you categorize your current customer base? And as you look out, as you're growing and new customers are coming in, take me through the progression. Take me through the profile of some of the customers you have now: size, are they hyperscalers, are they big app folks, are they kicking the tires? And for the people out there scratching their heads thinking, I've got to get in this game, what's their psychology like? Are they coming in with specific problems, or do they have a specific orientation or point of view about what they want to do? Can you share some data around what you're seeing?

Yeah, I think we have customers that range across the spectrum of sophistication, from teams that basically don't have MLOps expertise in their company at all, so they're really looking for us to give them a full service (how should I do everything, from optimization, to finding the hardware, to preparing for deployment), to teams that already have their serving and hosting infrastructure up and ready, already have models in production, and are really just looking to squeeze the extra juice out of the hardware, so we focus really specifically on that optimization piece. One place where we're doing a lot more work now is in the developer tooling and model selection space, and that's an area where we're creating more tools, particularly within the PyTorch ecosystem, to bring this power earlier in the development cycle, so that as people are grabbing a model off the shelf, they can see how it might perform and use that to inform their development process.

Luis, I like this idea of picking the models, because isn't that like going to the market and picking the best model for the job? Aren't there certain approaches? What's your view on this? Because I think this is going to be a land rush, and I want to get your thoughts.

For sure, yeah. So I'll start by saying that one main takeaway we got from the GPT-J study is that having a good understanding of what your model's compute and memory requirements are, very early on, helps with much smarter AI model deployment, right? And in fact, Anna just touched on this, but I want to make sure it's clear that OctoML is putting that power into users' hands right now.
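(A minimal sketch of that kind of early sizing in plain PyTorch: count a model's parameters and translate that into a rough memory footprint before committing to hardware. The model below is a stand-in for illustration, not one from the study.)

    # Rough early sizing: parameter count -> weight memory at a given precision.
    # Activations, KV caches, and batch size add more on top of this.
    import torch.nn as nn

    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=6)

    n_params = sum(p.numel() for p in model.parameters())
    mb_fp32 = n_params * 4 / 1e6    # 4 bytes per weight in fp32
    mb_fp16 = n_params * 2 / 1e6    # 2 bytes per weight in fp16
    print(f"{n_params/1e6:.1f}M params: ~{mb_fp32:.0f} MB fp32, ~{mb_fp16:.0f} MB fp16")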
So in partnership with AWS, we are launching this new PyTorch-native profiler that, with a single one-line code decorator, lets you see how your code runs on a variety of different hardware after acceleration. So it gives you very clear data on how you should think about your model deployments. And this ties back to the choice of models: if you have a set of model choices that are equally good in terms of functionality, and you want to understand, after acceleration, how you're going to deploy them, how much they're going to cost, and what the options are, then using an automated process to make that decision is really, really useful. And in fact, viewers of this event can get early access by signing up for the Octopod. It's an exclusive group for insiders, and you can go to octoml.ai/pods to sign up.

So that Octopod, is that a program? What is that? Is that access to code? Is that a beta? Take a minute and explain the Octopod.

The Octopod would be a group of people who are interested in experiencing this functionality: the friends and users of OctoML. That would be the Octopod. And then, yes, after you sign up, we would provide you the tool, essentially in code form, for you to try out on your own. Part of the benefit is that it happens in your own local environment and you're in control of everything, within the workflow developers are already using to create and begin putting these models into their applications. So it would all be within your control.

Got it. The big question I have for you is: when does one of your customers know they need to call you? What does their environment look like? What are they struggling with? What are the conversations they might be having on their side of the fence? If anyone's watching this and thinking, hey, you know what, I've got my team, we have a lot of data, do we build our own language model or do we use someone else's, there's a lot of, I won't say discovery, going on around what to do and what path to take. What does that customer look like? If someone's listening, when do they know to call you guys at OctoML?

Well, the most obvious one is when you have a significant spend on AI/ML in production: come and talk to us. That's the clear one. In fact, just this morning I was talking to someone in the life sciences space who has, you know, 15 to 20 million dollars a year in cloud-related AI/ML deployment costs. That's a pretty clear match right there, right? So that's on the cost side. But I also want to emphasize something that Anna said earlier: the hardware and software complexity involved in putting a model into production is really high. We've been able to abstract that away, offering a clean automation flow that enables you, one, to experiment early on with how models would run and get them to production, and two, once they are in production, gives you an automated flow for continuously updating your model and taking advantage of all this acceleration and the ability to run the model on the right hardware. So let's say one is cost, you have a significant cost, and two, you have automation needs. And Anna, please complement that.

Yeah, I think that's exactly right.
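(The profiler Luis describes above is OctoML's own tooling, and its exact API isn't shown here. Purely to illustrate the one-line-decorator idea, here is a hypothetical timing decorator you could wrap around a model call yourself; the names and behavior are invented for this sketch.)

    # Hypothetical one-line decorator, invented for illustration only;
    # this is not OctoML's actual profiler API. It times each call to
    # the decorated function and keeps a running record of latencies.
    import functools
    import time

    def profile(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            t0 = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - t0) * 1000
            wrapper.timings.append(elapsed_ms)
            print(f"{fn.__name__}: {elapsed_ms:.2f} ms")
            return result
        wrapper.timings = []
        return wrapper

    @profile                       # the "single one-line decorator" idea
    def predict(x):
        return [v * 2 for v in x]  # stand-in for a model's forward pass

    predict(list(range(100_000)))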
Maybe the other time is when you are expecting a big scale-up in serving your application, right? You're launching a new feature, you expect to get a lot of usage, and you want to anticipate that your CTO or CIO, whoever pays your cloud bills, is going to come after you, right? They want to know: what's the return on putting this model into my application stack? Is the usage going to match what I'm paying for it? And then you can understand that.

So you guys have a lot of the early adopters. They've got big data teams, they've pushed into production, they want to do a little QA to test the waters, understand it, use your technology to figure it out. Are there any cases where people have gone into production and then had to pull it out? It's like the old lemon laws with your car: you buy a car and, oh my God, it's not the way I wanted it. I can imagine that the early people through the wall, so to speak, in this wave are going to be bloody, in the sense that they've gone in, tried stuff, and gotten stuck with huge bills. Are you seeing that? Are people pulling stuff out of production and redeploying? I can imagine that if I had a bad deployment, I'd want to refactor it, or actually replatform it. Do you see that too?

Definitely. After sticker shock, customers come to us to make sure that sticker shock won't happen again. But there's another, more subtle aspect here that I think we lightly touched on and is worth elaborating on a bit more: how are you going to scale in a way that's feasible, given the allocation that you can get, right? As we've mentioned several times here, model deployment is so hardware-dependent and so complex that you tend to optimize a model for one hardware choice, and then you want to scale on that specific type of instance. But what happens when you want to scale up because you suddenly, luckily, got popular, and you can't get that instance anymore? How do you live with whatever you have at that moment? That's something we see customers needing as well. So ideally we want these customers to not have to think about which specific instances they want. What they want is to know what their models need. Say they know the SLA: then find the set of hardware targets and instances that hit that SLA, and whenever they auto-scale, they can scale with more freedom, right? Instead of having to wait for AWS to give them an allocation of one specific instance type, what if you could live with other types of hardware and scale up in a freer way, right? So that's another thing we see customers coming to us for: they need more freedom to be able to scale with whatever is available.

Anna, you touched on this with the business model impact, that $6 million cost if it goes out of control. There's a business model aspect, and there's a technical operations aspect to the cost side too. You want to be mindful of riding the wave in a good way, but not getting over your skis. So that brings up the point around confidence, right? And teamwork, because if you're in production, there's probably a team behind it. Talk about the team aspect of your customers. I mean, they're dedicated, they're putting stuff into production, they're developers, they're data people. What's in it for them? Are they on the beach, you know, reading a book? Is it easy street for them?
What's the customer benefit to the teams?

Yeah, absolutely. With just a few clicks of a button, you're in production, right? That's the dream. So yeah, I mean, we illustrated it before a little bit. Think about the automated benchmarking and optimization process, and the effort it takes to get that data by hand, which is what people are doing today: they just don't do it. So they're making decisions without the best information, because there just isn't the bandwidth to get the information they need to make the best decision and then know exactly how to deploy. So I think it's actually bringing a new insight and capability to these teams that they didn't have before. And then maybe another aspect on the team side is that it makes the handoff of models from the data science teams to the model deployment teams more seamless. We have seen in the past that this transition point is where a lot of hiccups happen, right? The data science team will give a model to the production team, and it'll be too slow for the application, or it'll be too expensive to run, and it has to go back and be changed, and you get this loop. And so, with the PyTorch profiler that Luis was talking about, and also the other ways we do optimization, that kind of handoff problem is prevented from happening.

Luis, you guys have a great company. A final couple of minutes left: talk about the company, the people there. What's the culture like? You know, if Intel has Moore's law, which is doubling the performance every couple of years, what's the culture like there? Is it, you know, more throughput, better pricing? Explain what's going on in the company, and put a plug in. Luis, we'll start with you.

Yeah, absolutely. No, I mean, I'm extremely proud of the team that we've built here. We have a people-first culture, very, very collaborative, and we all share a mission here of making AI more accessible and sustainable. We have a very diverse team in terms of backgrounds and life stories. You know, to do what we do here, we need a team that has expertise in software engineering, in machine learning, in computer architecture; even though we don't build chips, we need to understand how they work, right? And the fact that we have this really varied set of backgrounds makes the environment very exciting, because you get to learn about these systems end to end, but it also makes for a very interesting work environment, right? People have different backgrounds, different stories: some of them went to grad school, others, you know, were in intelligence agencies and are now working here. So we have a really interesting set of people, and, you know, life is too short not to work with interesting humans. That's something I like to think about.

I'm sure your offsite meetings are a lot of fun, people talking about computer architectures, silicon advances, the next GPU, the big data models coming in. Anna, what's your take? What's the culture like? What's the company vibe, and what are you guys looking to do? What's the customer success pattern? What's up?

Yeah, absolutely.
I mean, I second all of the great things that were just said about the team. An additional one that I'd really like to underscore is this customer obsession, to use a term you all know well, and a focus on the end users: really making the experiences that we're bringing to our users, who are developers, genuinely useful and valuable for them. With all of these tools that we're trying to put in the hands of users, the industry and the market are changing so rapidly that our products across the board, you know, for all of the companies that are part of the showcase today, are evolving so quickly, and we can only do that hand in glove with our users. So that would be another thing I'd like to say.

I think the changing power dynamics of this industry are just the beginning. I'm very bullish that this is going to be probably one of the biggest inflection points in the history of the computer industry, because of the confluence of all the forces, and you mentioned some of them. I mean, the PC, interoperability with internetworking, and you've got the web, and then mobile. Now we have this, and I wouldn't even put social media close to it. This changes user experience, it changes infrastructure. There are going to be massive accelerations in performance on the hardware side from the AWSes of the world and the cloud, and you've got the edge and more data. This is really what big data was going to look like. This is the beginning. Final question: what do you guys see going forward in the future?

Well, it's undeniable that machine learning and AI models are becoming an integral part of any interesting application today, right? And the clear trends here are, one, more and more computational needs for these models, because they're only getting more and more powerful; and two, growing complexity in the infrastructure where they run. Just considering the cloud, there's a wide variety of choices there, right? Being able to live with that, and make the most of it, in a way that doesn't require an impossible-to-find team, is something that's pretty clear. So the need for automation, abstracting away the complexity, is definitely here. And the trends are that you also see models starting to move to the edge as well. So it's clear we're going to live in a world where there are large models living in the cloud, and edge models that talk to those models in the cloud, to form end-to-end, truly intelligent applications.

Anna?

Yeah, I think, you know, we said it at the beginning: our vision is to make AI sustainable and accessible. And I think as this technology expands into every company and every team, that's going to happen kind of on its own, and we're here to help support that. And I think you can't do that without tools like the ones that are coming out.

I think it's going to be an era of massive invention and creativity. A lot of the rote heavy lifting is going to go away, which is going to allow talented people to automate their intellect. I mean, this is really kind of what's going on. Luis, thank you so much. Anna, thanks for coming on this segment. Thanks for coming on theCUBE and being part of the AWS Startup Showcase. I'm John Furrier, your host. Thanks for watching.