Swapnil Bhartiya: Hi, this is Swapnil Bhartiya, and we are here at Open Source Summit in Vancouver. Today we have with us Denny Lee, senior staff developer advocate at Databricks. Denny, it's great to have you on the show.

Denny Lee: Thank you very much. I appreciate the opportunity.

Swapnil: You are here at the event, and you folks do a lot of open source as well. So talk a bit about your presence at the event.

Denny: The reason we're here at Open Source Summit North America, outside of just being in Vancouver and the great food this city has, is that we want to talk about community, and about the importance of engineering and marketing actually working well together to help build communities that are stronger, communities that can outlive not just a single organization but multiple organizations. People leaving, people coming, it doesn't matter. The whole context is that community building can be very, very difficult, and Delta Lake has become quite successful. A lot of that has to do with our partnership with the Linux Foundation. So that's why we're here: because we want to advocate for, well, the Linux Foundation.

Swapnil: You mentioned Delta Lake. Talk about the project.

Denny: No problem. There's the standard marketing terminology, which is that we are the standard lakehouse platform format. But let me stop there. The context is basically this: when it comes to data, there were traditionally databases, and then we went into data lakes, right? And then we introduced a concept called the lakehouse. The lakehouse basically takes the best of both worlds: what databases offered, which is standard transactions and schemas, so you understand what data you actually have, versus what data lakes offer, which is scale and flexibility. Now, the problem is that individually, by themselves, each has lots of problems. Databases have a hard time scaling.
Data lakes have a hard time keeping track of what the schemas are. So the lakehouse is the idea of taking the best of both worlds. A very important component of a lakehouse is the idea of ACID transactions on your data lake. Well, that's what Delta Lake is. Delta Lake is a project within the Linux Foundation, specifically to allow the community to work together to ensure that their data is actually safe and reliable. It's in the LF because we have a very broad community that wants to use this type of technology to ensure their data is reliable. And it's also very hardened and battle-tested. For example, just talking about Databricks alone, and these are last year's numbers, something like 1.7 exabytes of data a day is processed. Not stored, processed, per day. And more than 7,000 customers are in production because of Delta Lake. Behind the scenes at Databricks, a lot of our engines are Spark-based. Apache Spark is a great distributed platform, but there will be failures, especially when you're talking about as many jobs as we're running. Delta Lake allows us to ensure the stability of our systems.

Swapnil: I also want to talk about the data world. Of course, there used to be data warehouses, and then came data lakes. But the thing is, the two are going to coexist; there are advantages to both. You already touched upon these two, but talk a bit about how they coexist and where, overall, we are heading.
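[Editor's note: the "ACID transactions on your data lake" idea Denny describes is worth making concrete. The sketch below is a toy illustration, not the real Delta Lake protocol: data files are staged first and only become visible when an atomic commit record lands in a log directory, so a job that dies mid-write leaves nothing behind for readers to see. All file names and the layout here are invented for illustration.]

```python
# Toy sketch of ACID-style commits on plain files (NOT the real Delta protocol).
# Writers stage a data file, then publish it by atomically placing a numbered
# JSON commit into a _log directory. Readers trust only the log, so an
# interrupted writer leaves no partially visible data.
import json
import os
import tempfile

def commit(table: str, filename: str, rows: list) -> None:
    """Stage a data file, then publish it with an atomic commit record."""
    log = os.path.join(table, "_log")
    os.makedirs(log, exist_ok=True)
    with open(os.path.join(table, filename), "w") as f:
        json.dump(rows, f)                      # staged: invisible to readers
    version = len(os.listdir(log))              # next commit number
    tmp = os.path.join(log, ".tmp")
    with open(tmp, "w") as f:
        json.dump({"add": filename}, f)
    # os.replace is atomic: the commit either fully appears in the log or not at all.
    os.replace(tmp, os.path.join(log, f"{version:020d}.json"))

def read(table: str) -> list:
    """Read only the data referenced by committed log entries."""
    rows = []
    log = os.path.join(table, "_log")
    for entry in sorted(os.listdir(log)):
        with open(os.path.join(log, entry)) as f:
            name = json.load(f)["add"]
        with open(os.path.join(table, name)) as f:
            rows.extend(json.load(f))
    return rows

table = tempfile.mkdtemp()
commit(table, "part-0.json", [{"id": 1}])
commit(table, "part-1.json", [{"id": 2}])
# Simulate a crashed job that staged "part-2.json" but never committed it:
with open(os.path.join(table, "part-2.json"), "w") as f:
    json.dump([{"id": 99}], f)
print(read(table))   # [{'id': 1}, {'id': 2}] -- the orphaned file is invisible
```

Real Delta Lake tables use a similar mechanism, an ordered `_delta_log` directory of JSON commit files next to the data, which is what makes "the job failed, just rerun it" safe in the way Denny describes later.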
Denny: Well, what we're hoping for with these lakehouses is that by taking the transactional reliability, safety, and security of a database and adding that to the scalability of a data lake, you literally don't need to think about the scalability and flexibility of your data anymore. Because the reality is, for example, I'm wearing a Dolly T-shirt right now, right? And Dolly is one of our LLMs, our large language models, from the hotness and craze around ChatGPT. Most companies, most people, want to focus on the analytics, or the data science, or, with Dolly, the LLMs. That's what they want to do. So how do we provide that capability? By making sure the data systems they rely on are very reliable. And that's where technologies like Delta Lake come in. We actually have competitors like Apache Iceberg and Apache Hudi, but from where we're sitting, it's not really about competition. It's about: just use one of them, so that you have reliable data. You have a lakehouse that gives you the advantages of both systems, and then you can start focusing on the good stuff.

Swapnil: Since you brought up LLMs and ChatGPT, I also want to talk a bit about the data-driven, data-centric world we're living in. As you rightly said, data itself has no value; the value is what you extract from it.

Denny: Exactly, yes.

Swapnil: At the same time, we're collecting a lot of data, and we're creating a lot of data too. Some people say data is the new oil, and all of those things. But where are we realistically heading? Because at a certain point the industry does realize, hey, you know what, this is not what we need. Everything goes through a hype cycle.

Denny: Of course.
Swapnil: I mean, when we look at analytics, there are a lot of companies doing real analytics that actually solves customer problems, versus a lot of companies for whom everything is a hype cycle. They are creating a solution looking for a problem.

Denny: Exactly.

Swapnil: So talk about how you see it. Is data the new oil? Where are we actually heading?

Denny: I do think "data is the new oil" is in fact the correct concept. The problem, though, is most people don't know how to mine it, pull it up, extract it, make use of it. In other words, they have the oil, but, to use the example of a petroleum plant, having the crude isn't enough: you have to refine out the gasoline, or extract the compounds to create plastics, right? What is the value you are going to create based on this data? That's what's sorely lacking. So, for example, let's just talk about the current hotness, which is ChatGPT and LLMs. The whole reason we at Databricks talk about LLMs at all is Dolly. Long story short, for those who don't know what it is, Dolly is a small model based off an EleutherAI model, a really good one. What we showed is that you could fine-tune an older model from two or three years ago and get results similar to something like ChatGPT. Except that, in the case of ChatGPT, they were spending tens of millions of dollars to process all of this data, and in the case of Dolly, we spent $30 on a small amount of data. In fact, about 15,000 "rows," quote-unquote, of data. Significantly smaller than what we expected. And so we quickly released Dolly 2.0. The reason we could do that is that we took the same approach with a newly generated data set, created entirely by Databricks employees, so it's open source for everybody. And we fine-tuned the model on that.
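[Editor's note: a concrete sketch of the kind of data Denny is describing. The roughly 15,000 rows behind Dolly 2.0 are instruction records, and instruction tuning starts by formatting each record into a single training text. The field names below follow the published databricks-dolly-15k schema, but the prompt template and example rows are invented for illustration, and the actual fine-tuning loop is omitted.]

```python
# Toy sketch of the data-preparation step in Dolly-style instruction tuning.
# Field names (instruction / context / response) follow the databricks-dolly-15k
# schema; the template is illustrative, not the exact one Databricks used.

def format_record(rec: dict) -> str:
    """Turn one instruction record into a single training text."""
    parts = [
        "Below is an instruction. Write a response that completes the request.",
        "",
        "### Instruction:",
        rec["instruction"],
    ]
    if rec.get("context"):                 # optional supporting passage
        parts += ["", "### Context:", rec["context"]]
    parts += ["", "### Response:", rec["response"]]
    return "\n".join(parts)

# A couple of made-up rows shaped like the open 15k-row dataset:
rows = [
    {"instruction": "What is a lakehouse?",
     "context": "",
     "response": "A lakehouse combines database transactions with data lake scale."},
    {"instruction": "Summarize the passage.",
     "context": "Delta Lake provides ACID transactions on data lakes.",
     "response": "Delta Lake makes data lakes transactional."},
]

training_texts = [format_record(r) for r in rows]
print(training_texts[0])
```

Each formatted string is what an off-the-shelf causal language model would then be fine-tuned on, which is why a small, clean dataset and modest compute can go as far as Denny describes.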
And so now the model, the weights, and the data are all publicly available. The reason I'm calling that out is because, exactly to your point about the hype cycle, everybody's trying to figure out, oh, now I need to create a bigger model, bigger model, bigger model. We just proved you can take an older model and fine-tune it with a small amount of data. In this case, $30. That's how much we paid for the processing. And that's it, and it was still extremely valuable. Whether we're talking about Delta Lake or data lakes or lakehouses, right up to LLMs, it's as you called out: the reality is people actually need to spend the time to understand how to extract the actual value out of their data, and also to secure it, make it private, and ensure the privacy of that data. Those are the most important aspects, and unless we're focusing on that part, the reality is you're not going to get the value out of storing all this information.

Swapnil: Talk a bit about the state of the project. Is only Databricks involved, or are there other companies? And what kind of future do you see for the project in the foundation?

Denny: Oh, yeah. The reason we work with the Linux Foundation is that we wanted to broaden the scope and work with lots of companies and individuals that we traditionally did not work with. For example, Databricks has been known for Apache Spark, and it's not like we're not proud of it. We're very proud of the project. But that's also very much from Berkeley, from that particular set of folks, right? The JVM, Java, the distributed systems built off of that. And nothing against that; quite the opposite. I'm a Spark contributor myself, in fact. But what we want to do is expand the scope big time. So, for example, we are working very closely with FINOS. Why?
Because there are a lot of financial companies that need a foundation for securing their data and making it reliable. So it makes a lot of sense that they're talking to us, to Delta Lake, right now, so that they can use it as a standard across many financial institutions. Traditionally, if you're just in the West Coast tech world, you might not be talking to these folks. But guess what? Everybody needs this. That's why the Linux Foundation is so important for us. In terms of the project, yes, I'll freely admit that Databricks is probably still the biggest contributor; I'm not going to lie about that. But there are lots of other companies. If you look at our website, you'll see contributors from Apple, from Adobe, from Scribd, from Back Market, small and big, all across. And the reason is that everybody is thinking the exact same thing: I want my data to be reliable so that I can focus on the cool stuff. And it's not just cool because I think it's awesome, though that's true. It's cool because now I can work on extracting the actual value out of my data. Now I can have that laser focus. I'm not worried about whether my system is going to shut down, or my job just failed and now I have to figure out how to fix it. No. The job fails? No problem. Everything is cleaned up, so you can just rerun the job. It finishes, no problem. Now I can focus on actually getting the value out of the data.

Swapnil: Can you talk about the future of the project? Is it going to remain just a project, or become part of another foundation? As I said, the Linux Foundation is a foundation of foundations.

Denny: Yeah, yeah, yeah.
So we are currently debating, and this is a healthy debate with the community, by the way, not a contentious one, whether we should create our own foundation, a Delta Lake Foundation, similar to the PyTorch Foundation, which is a Linux Foundation project as well. That's one possibility. Another possibility is LF AI & Data. That's a great foundation to work with. There are pros and cons to both. In fact, part of the reason I'm up here this week is to get more information from the various LF stakeholders so I can provide it to the community, and the community can decide which one makes more sense. Either way, we definitely see ourselves expanding. For example, one of the things I was hinting at when I mentioned the JVM and Scala: the most popular components of the Delta Lake project are now the Rust and Python ones. Nothing to do with the JVM world, nothing to do with the Scala world, right? It's all about the new, current set of technologies. Obviously, for data science, Python is already the lingua franca as it is, and Rust gives us the distribution, performance, and stability that we love, in a really cool language. So we're seeing ourselves taking off really high and fast on that side of the house. And that's part of the reason, again, for the discussions about whether we should be part of LF AI & Data or our own foundation: there's just a lot of excitement on that side of the house.

Swapnil: Can you also talk about when Databricks contributed the project to the Linux Foundation? I use the word contributed rather than donated, because you're not donating to a charity; you are also still involved with the project, right?

Denny: Very much so.

Swapnil: The success of the project is related to the success of Databricks as well.
So talk about your engagement with the Linux Foundation: why you felt the Linux Foundation is the right place, and what benefit you get when you contribute your code to an organization like the Linux Foundation.

Denny: Thank you very much for asking that question. In fact, that's the basis of Carly's and my presentation tomorrow. One of the key things we wanted to really emphasize and advocate for, as developer advocates, is the simple fact that it's not just about the engineering, nor is it just about the marketing, when it comes to building up communities, which is often the practice of developer relations. What happens is that it becomes either far too engineering-focused, because people say, no, it's all about the purity of the code. Okay, sure, right? Or it's just a marketing game: no, we'll just market it and people will buy it, right? And I'm saying neither one is right. It's actually a balance: you obviously have to care about the code base and the engineering aspect, but you also need to know how to communicate that, and how to socialize that. And so when we started with the LF, it started really simply: oh, can you help us with events? Okay, so they helped us with events. Then we realized, hey, we don't have anybody marketing these events. Okay, so we started using marketing resources from the LF. And then we thought, hey, we should probably analyze this stuff, see if we're actually producing good material. Is this actually the stuff people need? Is the content we're creating actually useful to people? A lot of that came very naturally: the idea that, oh, we can get analytics to understand whether the people who are using and contributing to the community actually find this information useful.
And it went from us basically experimenting to a full-fledged arrangement where we have a full-time Linux Foundation marketer working with us. In fact, that's Carly, who is co-presenting with me tomorrow. We're heavily using the events service, because we do lots of really cool events. And like I said, because we found out how important it was, we use LFX Insights, and even our own set of tools, like Sprout Social and Common Room, and we're also literally utilizing Bevy for events from the Linux Foundation. All of these tools together help the project, because the vast majority of us who are involved with the project, well, most of us are geeks. I know it doesn't look like it, but most of the time I would prefer just being in front of a computer and geeking out. I have a whole other story on that, but that would take too long. Because we could partner with the LF, we could make sure that the marketing, the social events, all the other aspects we completely spaced out on, were taken care of. And it wasn't just, okay, we give them stuff and they go do stuff. It became a partnership between the LF and us, between the geeks and the marketers and the social and events people, and we realized how much value everybody brought, together, in helping the project grow.

Swapnil: Denny, thank you so much for taking the time out today to talk not only about the project but also about the much wider picture: the whole AI/ML space, the hype versus the real state of the technology, and, more importantly, why companies like yours engage with the Linux Foundation. Thank you for sharing all those insights, and I would love to chat with you again.

Denny: Thank you. Thanks a lot for the opportunity. I would definitely love to do it again.