Hi, this is Swapnil Bhartiya, and welcome to Let's Talk. Today we have with us Devvret Rishi, CEO and co-founder of Predibase. Devvret, it's great to have you on the show.

Yeah, really nice to be here. Thanks so much for having me.

It's my pleasure to host you here today. This is the first time I'm talking to you, so I would love to know a bit about the history of the company. What led you to co-found it?

Yeah, like a lot of entrepreneurship journeys, it's a long, winding one that culminates in some moments that happen quickly. The long, winding journey is: I've always had my academic background in computer science, statistics, and machine learning, and I spent a number of years as a product manager at Google. I worked on different teams: Firebase, which is an acquired startup, and Google Research on productionizing ML. But I spent most of my time on Google Cloud, on the platform that eventually became Vertex AI, which is the way GCP packages its AI services externally. While I was there, I was also the first PM for Kaggle, the data science and machine learning community Google acquired in 2017. Kaggle had just under a million users when I joined and has over 14 million users today. When I was there, I hadn't realized that that many people knew the basics of machine learning, or NumPy and pandas, so it was an incredible time to see the organic lift. At the same time, I was seeing that a lot of organizations on Vertex were really struggling to get any value out of ML. So I saw this dichotomy. In 2020 I met my co-founders, Piero, Travis, and Chris. My co-founder Piero actually had the idea for what eventually became Predibase, which was built on top of a tool he had made for himself while he was at Uber.
He was a machine learning researcher who had to work on a number of different projects, and he felt like he was reinventing the wheel each time. So he made a tool called Ludwig that made his life easier, then made the lives of other engineers at the company easier, and he eventually open sourced it in 2019. At the end of 2020 he had an idea for how to make it easier for any organization to build models. We all got excited about that, and we started right at the very beginning of 2021, as soon as the vaccines came out, and we've been working and operating ever since.

If I ask you to summarize Predibase, how would you summarize the company?

I would say that we're a platform for developers who want to productionize open source AI models. If you're an engineer who wants to use one of those open source models you've been reading a lot about, fine-tune it on your data, and then serve that model inside your organization, then Predibase is the right platform for you. We've actually always been built on a foundation of deep learning models that support many different types of use cases, but especially in 2023, the primary thing we lead with for our customers is fine-tuning and serving fine-tuned open source large language models. We think that some of the large GenAI applications and GenAI APIs are a great place to get started, but increasingly the future is open source and the future is fine-tuned, and that's where Predibase really plays in.

How important is open source to Predibase?

Predibase is a company built on top of open source projects and around the open source ecosystem. So we are both big believers in open source and a platform that allows people to use it. It's been important to us in two ways: one, in the ways that we've contributed to the community, and two, in the ways that the community has contributed back to us.
We have built two open source projects: one for training machine learning models called Ludwig, which has over 10,000 stars on GitHub today, and a second, a new project open sourced just last month called LoRAX, which is a really efficient way to serve open source fine-tuned models, and especially to serve hundreds of models for the cost of one. LoRAX is the reason we have the most cost-effective way to deploy open source models in the market today. So open source has been really critical to the way we've developed technology and the way we've wanted to give back. And of course, we've benefited a lot from the fact that some of the best models coming out today are actually open source, because what we want to do is allow organizations to own their own models and their own IP. If you use an external commercial API today, that API owner really owns the model and the weights you're consuming through the API. We think that as more organizations see AI as core to their strategy, they're going to want to own the models, own the weights, own all the infrastructure that goes around them, and have full agency. That's only possible with open source models, and that's really where we want to play.

And beyond models: when we talk about AI, especially now, we talk about GenAI a lot. It's not as easy as the old LAMP stack, where everything was fully open source, because it's more complicated. Even if you look at OpenAI, the word "open" is there, but it's not open source. So talk about generative AI in general and open source. What are the things that you feel should be open source, and what are the things that cannot be open source? Or can it all be open source, the way the LAMP stack was?

Yeah, absolutely.
I mean, we think, and this has always been true for AI actually, that the models and the weights of the models should be open source. I think Meta and Facebook's AI research lab have really led the way here with the Llama 2 variants that came out, and I'd love to see more strong open source models like the ones Mistral released just last week. Those releases essentially open source the architecture and the weights that came out of the model training job, and we're really grateful to the organizations that released them. The great thing about that is it gives every other company a great starting point. Training some of these models from scratch can cost tens of millions of dollars, so if you want to get started, there's a really high barrier to entry. When a company open sources the model architecture and weights, they reduce that barrier and make it easier for any organization, startups, tech companies, and any other enterprise, to adopt it right out of the box.

The piece that we think makes sense to be more commercial is the infrastructure. That means the way you actually orchestrate the compute so that a customer's model can be trained and fine-tuned on their own data. The second piece is the data itself. Of course, we don't think companies should open source their proprietary data. Instead, what we want to do is bring the models to their data, into a layer that allows them to customize and fine-tune on that data. And finally, there's the infrastructure that allows customers to serve these models inside their organization. For that infrastructure, we've open sourced some of the implementations around it.
But at the end of the day, you probably want to use a platform, just like people use AWS today for managing their compute, to get access to the underlying capability. The techniques, in our mind, are nothing magical and nothing that should be kept away from the broader public. The infrastructure for getting those techniques to work in production, in practice, on your internal data: that's the piece we see as commercial.

When we look at open source: just before you, I had a conversation with someone from the Linux Foundation. They came out with a report on how global communities can get involved, because something that is happening now is more or less techno-nationalism, where there is a lot of concern about different sorts of sovereignty, and wars are going on, so collaboration is becoming harder and harder. Sometimes companies don't want to rely on a product from a particular country because, you know, the U.S. can impose a trade embargo and suddenly you get locked out. So open source does become a place where people can go and contribute. But open source can be company-owned, community-owned, or owned by a neutral foundation. When it comes to AI, just a few weeks ago the Biden administration came out with an executive order to make generative AI more safe and secure. Just the way the Linux kernel being open source created the whole adoption; Kubernetes is another great example. So when we look at AI and LLMs, what kind of home should there be for these technologies, so that companies don't have to worry that a competitor may pull the strings because they control it? It creates an even playing field.
You also don't have to worry about the whole geopolitics of "hey, this project is coming from this country." Talk about what you see would be beneficial, because we have to increase that adoption.

Definitely. In terms of why open source, there are a number of reasons, but one of the customer quotes that encapsulated it best for me is: "Look, generalized intelligence is great, but I don't need my point-of-sale system to recite French poetry." It's this idea that when organizations are deploying AI models, they don't necessarily need the full spectrum of artificial general intelligence to solve all tasks in one place. What they need is a surgical toolkit, where they can use the right scalpel for the individual thing they're looking to automate. And open source actually provides that option and opportunity, where you can say: I don't need the one-trillion-parameter model, or even the 100-billion-parameter model. What I need is something small and right-sized to automate my customer support tasks, or to help me with fraud prediction. It's really about making sure you have the right tool for the job, and that's where the breadth of the open source community is really helpful.

So, taking a step back, what are the reasons open source is really useful, and why do we think it's going to dominate the overall industry? There are really three key reasons. The first is what you were talking about: vendor lock-in, and preventing it. Today, if you're using a service like OpenAI, customers consistently notice strange patterns, whether that has to do with latency or availability. Sometimes the models will just fail to give a response if you're querying them at too large a rate.
And if you've been on Twitter, or X, recently, you've probably noticed a lot of announcements about how OpenAI's performance has degraded over the last month. That's not anything you have control or agency over, and if AI is going to be core to your roadmap, it seems really weird to defer all of that control and agency to a third party. So the first reason is really agency and preventing vendor lock-in.

The second core reason people go to open source is performance. Here, you might have a task and a model, and we see this all the time: it does really well for building the prototype. It's a really good way to start to show how you can summarize emails, or something along those lines. But when you want to use it inside your organization, you need to go from something like 80% to 99%. That's where the combination of open source and fine-tuning has been the most useful. We ran an experiment on JSON generation, where we were trying to extract information from text and come up with a structured representation. GPT-4 got about 66% coverage out of the box on that task. But when we fine-tuned Llama 2 7 billion, a much smaller model and much less capable than GPT-4 in a general-purpose sense, on that data, we got up to 93%. That was really impactful, because it shows a larger performance gain and a lot more bang for your buck.

And that bang-for-your-buck point is actually the third big reason people go with open source: the models people are deploying today are really expensive, and they're quite slow. What open source allows you to do instead is right-size the model to the task. If you want a 1-billion-parameter model, a 3 billion, a 7, a 13, a 34, a 70, all of those are available in the open source community.
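The "coverage" number in that JSON-extraction experiment can be illustrated with a small sketch. The scoring rule, field names, and sample data below are illustrative assumptions, not the actual benchmark: here an output counts as covered only if it parses as JSON and every expected field matches.

```python
import json

def json_coverage(predictions, references):
    """Fraction of examples whose model output parses as JSON and
    contains every expected field with the expected value.
    (Illustrative metric, not the exact benchmark definition.)"""
    covered = 0
    for raw_output, expected in zip(predictions, references):
        try:
            parsed = json.loads(raw_output)
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a miss
        if all(parsed.get(key) == value for key, value in expected.items()):
            covered += 1
    return covered / len(references)

# Made-up example: extracting structured fields from invoice text.
preds = ['{"name": "Acme", "amount": 120}',  # parses, matches
         'not json at all',                  # fails to parse
         '{"name": "Bob"}']                  # parses, matches
refs = [{"name": "Acme", "amount": 120},
        {"name": "Acme", "amount": 99},
        {"name": "Bob"}]
print(json_coverage(preds, refs))  # 2 of 3 outputs fully match
```

A fine-tuned model that reliably emits well-formed, schema-conforming JSON scores higher on exactly this kind of metric, which is what the 66% versus 93% comparison is measuring.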
They're not all available through commercial APIs, but they are all available as open source models, and you can choose the one that makes the most sense for your task. That's where we think users are increasingly starting to see that adoption. To end on this point: the number one thing I hear from customers today that are using GenAI in production is, "Yeah, we built our prototype on OpenAI," or "We built our demo using another commercial API, but we want to move off of it in the next six to twelve months," and they want to move off of it for exactly those three reasons I was just mentioning.

We can talk about open source and all these projects as much as we want, but open source projects can solve only the day-zero or day-one problem, where you get the code base, get it installed, and get it up and running. What about day two, day three, day four? That's where commercial players come into the picture, because that's the problem they solve. Big players can of course build all of that themselves, but not everybody is Google, Microsoft, or Apple. At the same time, the projects have their own limitations, because they cannot take every use case and embed code to cater to a specific customer the way commercial players can. So talk about the role of commercialization in the open source LLM space, and of course, let's talk about your company: what role are you folks playing there?

Exactly. I like the day-zero, day-one analogy. If open source discovery, or even just finding out about these models and building your first demo with a GenAI application, is day zero, then Predibase is really day one and on. It's the point at which you've seen the power of some of these models, and now you have the questions: how do I deploy this inside of my organization?
How do I bring this to my data in a secure way that gets me a performance lift and customization? How do I make this available so I can iterate with other collaborators in my organization? Those are all the key things we think about inside an organization. When we started off, I described Predibase as the developer platform to productionize open source AI, and that's really what we're oriented around. It's not us inventing anything novel in the underlying models; it's about making it easy to take the models that great research labs have put out and make them something you can fit into your organization.

Now, how do we do that? We have a platform today that anyone can try out for free at predibase.com. We have one of the best free trials on the market, because you can both fine-tune an open source large language model and serve it, and we give you the GPUs to do that out of the box. You can go ahead and try it out today, and you'll do three key things. First, you'll connect your data. Second, you'll fine-tune and train a model, and we make that easy for anyone. Third, you'll deploy that model so you can start to prompt it and get your answers and responses back. We have a bunch of quick starts and guides that walk you through that. We really think that's the key thing people need: they see these models out in the open source world and they're curious, how do I get this to work inside my organization? How do I connect it to my data? How do I serve it? That's what Predibase is really oriented around.

As the adoption is growing, of course, it really varies from one use case to another; we cannot generalize.
But from Predibase's point of view, from your perspective, what are some of the challenges or pain points that your customers or users face, where you're like, hey, these need to be addressed, because these are still the problems cropping up?

Yeah, absolutely. We actually had talks at both KubeCon and the Linux Foundation's AI.dev conference, so hopefully you'll have a chance to check those out, and we'll send over a recording as well. On the question of what challenges people are seeing, whether it's trust and safety and hallucinations or others: the key challenges customers come to Predibase for are two things. Number one is actually efficiency. They've gotten something working with a platform like OpenAI. They've found a use case where they've either got a human-in-the-loop system going, or it's a use case where the kind of erratic behavior of the model is still acceptable. For example, maybe it's writing advertising copy, and some of the copy is a little bit strange, but that's okay if 95% of it is good most of the time and there's a review process for it. Once they've found that use case, they come to us because they want to get to the point where it can be served in production in the most cost-effective way, in the most tailored way, to eke out the last bits of performance and to make sure it's low-latency and cost-effective. So when you think about the typical Predibase customer persona, it's somebody that probably started their experimentation with OpenAI, for example, but has figured out what use case they want to apply it to and wants to go to the next level, where they're actually thinking about a production setting for that model and they want to own that model. The reason they come to us is to get that maximum efficiency.
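That efficiency, and the "hundreds of models for the cost of one" idea behind LoRAX mentioned earlier, rests on LoRA-style adapters: each fine-tune is stored as a small low-rank delta (B @ A) added on top of a single shared base weight matrix, so many adapters can be served against one copy of the base model. A minimal numeric sketch, with made-up 2x2 weights and hypothetical adapter names (this is an illustration of the idea, not the actual Predibase or LoRAX API):

```python
def matmul(A, B):
    """Plain-Python matrix multiply (no external dependencies)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# One shared base weight matrix, loaded into memory exactly once.
W_base = [[1.0, 0.0],
          [0.0, 1.0]]

# Each fine-tuned "model" is just a tiny low-rank update (B, A):
# its effective weights are W_base + B @ A, so hundreds of adapters
# fit alongside a single copy of the base weights.
adapters = {
    "support-bot": ([[0.1], [0.0]], [[1.0, 0.0]]),  # B is 2x1, A is 1x2
    "fraud-model": ([[0.0], [0.2]], [[0.0, 1.0]]),
}

def forward(x, adapter_name):
    """Run one input through the shared base plus the chosen adapter."""
    B, A = adapters[adapter_name]
    W = matadd(W_base, matmul(B, A))  # rank-1 delta on shared weights
    return matmul([x], W)[0]

x = [1.0, 1.0]
print(forward(x, "support-bot"))  # [1.1, 1.0]
print(forward(x, "fraud-model"))  # [1.0, 1.2]
```

The cost win comes from sharing: the expensive part (the base weights) is paid for once, while each additional fine-tuned model adds only the tiny B and A matrices.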
The second reason people come to us is that I think we're one of the forefront companies in fine-tuning LLMs overall, so they want to understand from us what the use cases are. We talk to very large technology companies from a partnerships angle, and we talk to many mid-market companies that are looking to integrate this directly for their own use cases. But we're really in joint collaboration about some of the technology we've put out, brainstorming the use cases around it, all the way to deploying it inside their industry.

What kind of cultural shift are you seeing, or is it too early? I don't like just the labels and jargon, because the reality is different from what we hear. What kind of cultural changes do you think are needed, once again in a very practical and realistic manner, versus just throwing another jargon term out there?

Yeah, absolutely, and let me be data-driven about it. When I was at Kaggle, by the time I left, that community had just over five million users, I want to say. Now, there's a common assumption that the people who do machine learning are all PhDs, deeply embedded in ML, who have been doing this for a while, and I can tell you for sure that of those five million users, the vast majority were not PhDs in machine learning; the vast majority had never academically studied machine learning. What they were was folks that we at Predibase call "ML curious."
They were engineers who maybe were data engineers at some point, or analysts with a bit of a statistical background, and other types of folks like that. They were the type of persona, and this is why we at Predibase love developers, that says: give me some good documentation and some good abstractions, and I will figure this out. That is the type of user we see most often in the open source communities that picked us up even before Predibase became a company. They were organizations like Koo, a social networking application based in India, that were looking to build their first models. What they really cared about was not the theory behind machine learning; it wasn't "what is the right way to optimize a convex loss function." It was: hey, I have content and I need to moderate that content; give me the set of tools that lets me do that in the simplest way possible, and that doesn't lock me in, so that as I get more advanced I can dig deeper and deeper. That's what we've seen increasingly. The shift-left movement has most commonly been associated with security, with DevSecOps, but I actually think that machine learning has been shifting left silently for a while. Why do I say silently? Because three years ago it's not like you saw a lot of engineers in organizations doing machine learning themselves, but you saw it in the communities. If you looked at the people using Hugging Face, if you looked at the people on Kaggle, they weren't all ML PhDs. The PhDs were there too, but it was a ten-to-one ratio in terms of who was actually active in those communities. And the really cool thing about moving to an API-centric model in the last year is that now we're starting to see that in industry and in organizations. So I think it
was a silent shift that we've now really brought to the forefront this year.

Do you see that there are certain industries which can benefit more from LLMs and GenAI? Or do you see it the way we said at one point that every industry, every company should be seen as a software company, right?

Over time, I certainly think it's going to be the case that, just as software ate the world, so will AI, and that's where this goes. In the short term, though, it's useful to be tactical: what are the types of organizations and use cases where GenAI will take hold first? Number one, I think it's going to be organizations that have and deal with a lot of unstructured data, especially text data. There are a lot of generative AI models coming out in other modalities, like images and video, and we even put out some earlier work on tabular data, but text has really led the charge. So the first area to consider is wherever there are large amounts of text data, whether that's customer support automation, code, or others, where you can see this tactically unfold. The second key area is use cases where, over the near term, GenAI is going to be capable of making mistakes or hallucinating, which is in part really a feature rather than a bug: the fact that it can come up with new, creative answers that might be unexpected. What we also need to do is make sure we're concentrated on industries and use cases where that type of behavior is acceptable. And it's acceptable for one of two reasons. One, it's a use case where, if it says the wrong thing once in a while, it's not going to cause a critical cascading downstream failure, and it's not in a regulated industry where you need to
have heavy emphasis on explainability. Or number two, it's a case where there's a human-in-the-loop review process overall. Those are the types of use cases we're likely to see. Generally, we've seen these applications picked up the most by tech companies so far, but every organization has interest in it, and I think it's about identifying the right use cases and the data that are going to be best suited for it.

One more question I want to ask you. It has nothing to do with open source; it's more that there is also a fear: hey, generative AI is going to take away our jobs. But I see a good analogy in Photoshop. Photoshop did not eliminate the jobs of photographers; it actually enabled a lot of mom-and-pop shops and regular people to become great photographers. It is a very powerful tool in the right hands. I see the same with generative AI: it will take care of a lot of mundane work. Even developers really don't have to worry about a lot of low-level things; they can focus on higher-level work, which also means they can charge more and use quality time for quality things. What is your opinion?

My opinion is that generative AI is going to replace a lot of jobs and create a lot more new ones, and I think the value creation from the jobs generative AI creates is going to be higher than from the jobs it automates. So in general, what we're going to see is a net productivity increase across the economy for the people who use it. People will no longer have to do those rote, mundane tasks like you were talking about, or really even think about the processes for things like automatically extracting information out of documents and putting it into a database, and other things along those lines. Those types of tasks should not be the ones we
really deploy a lot of human labor against. Instead, the types of tasks we deploy human labor against will increasingly be new ones that get created, where humans work side by side with GenAI in the loop, doing review, and increasing the cadence and pace at which they put out productive applications.

Devvret, thank you so much for taking time out today to talk about Predibase and the whole larger growing ecosystem. Thanks for all those great insights, and I would love to chat with you again soon.

Thank you, I really appreciate the time. Thanks for having me.