From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante.

Spark became a top-level Apache project in 2014 and shortly thereafter burst onto the big data scene. Spark, along with the cloud, transformed and in many ways disrupted the big data market. Databricks optimized its tech stack for Spark and took advantage of the cloud to cleverly deliver a managed service that has become a leading AI and data platform among data scientists and data engineers. However, emerging customer data requirements are shifting in a direction that will cause modern data platform players generally, and Databricks specifically, we think, to make some key directional decisions and perhaps even reinvent themselves.

Hello and welcome to this week's Wikibon Cube Insights, powered by ETR. In this Breaking Analysis, we're going to do a deep dive into Databricks. We'll explore its current impressive market momentum, using some ETR survey data to show that. Then we'll lay out how customer data requirements are changing and what the ideal data platform will look like in the midterm future. We'll then evaluate core elements of the Databricks portfolio against that vision, and we'll close with some strategic decisions that we think the company faces. And to do so, we welcome in our good friend George Gilbert, former equities analyst, market analyst, and current principal at Tech Alpha Partners. George, good to see you, thanks for coming on. Good to see you, Dave.

All right, let me set this up. We're going to start by taking a look at where Databricks sits in the market in terms of how customers perceive the company and what its momentum looks like. This chart shows data from ETR's ETS, the Emerging Technology Survey of private companies. The N is 1,421. We cut the data on three sectors: analytics, database/data warehouse, and AI/ML. The vertical axis is a measure of customer sentiment, which evaluates an IT decision maker's awareness of the firm and the likelihood of engaging and/or purchase intent. The horizontal axis shows mindshare in the dataset. We've highlighted Databricks, which has been a consistently high performer in the survey over the last several quarters. And by the way, just as an aside, as we previously reported, OpenAI, which burst onto the scene this past quarter, now leads all names, but Databricks is still prominent. You can see that ETR shows some open source tools for reference, but as far as firms go, Databricks is very impressively positioned.

Now let's see how they stack up against some mainstream cohorts in the data space, bigger companies and in some cases public companies. This chart shows net score on the vertical axis, which is a measure of spending momentum, and pervasiveness in the dataset on the horizontal axis. The chart insert in the upper right informs how the dots are plotted: net score against shared N. That red dotted line at 40% indicates a highly elevated net score; anything above that we think is really impressive. Here we're comparing Databricks with Snowflake, Cloudera and Oracle, and the squiggly line leading to Databricks shows its path since 2021 by quarter. You can see it's performing extremely well, maintaining an elevated net score in that range.
Now it's comparable on the vertical axis to Snowflake, and it's consistently moving to the right, gaining share. Why did we choose to show Cloudera and Oracle? The reason is that Cloudera got the whole big data era started and was then disrupted by Spark, and of course the cloud, and by Databricks. And Oracle in many ways was the target of early big data players like Cloudera. Take a listen to Cloudera's CEO at the time, Mike Olson, back in 2010, the first year of theCUBE. Play the clip.

But back in the day, if you had a data problem, if you needed to run business analytics, you wrote the biggest check you could to Sun Microsystems and you bought a great big single-box central server. And any money that was left over, you handed to Oracle for database licenses, and you installed that database on that box, and that was where you went for data. That was your temple of information.

Okay, so as you heard, Mike Olson implied that monolithic model was too expensive and inflexible, and Cloudera set out to fix that. But the best laid plans, as they say. George, what do you make of the data that we just shared?

So where Databricks really came up out of Cloudera's tailpipe was they took big data processing, made it coherent, and made it a managed service so it could run in the cloud, which relieved customers of the operational burden. Where they're really strong, their traditional meat and potatoes or bread and butter, is the predictive and prescriptive analytics, that is, building, training and serving machine learning models. They've tried to move into traditional business intelligence, the more traditional descriptive and diagnostic analytics, but they're less mature there. So the reason you see Databricks and Snowflake side by side is there are many, many accounts that have both: Snowflake for business intelligence, Databricks for AI and machine learning. Where Databricks also did really well was in core data engineering, refining the data, the old ETL process, which kind of turned into ELT, where you load the data into the analytic repository in raw form and refine it. And so people have really used both, and each is trying to get into the other.

Yeah, absolutely. We've reported on this quite a bit, Snowflake moving into the domain of Databricks and vice versa. The last bit of ETR evidence that we want to share in terms of the company's momentum comes from ETR's round tables. They're run by Eric Bradley, along with Daren Brabham, a former Gartner analyst and George, your colleague back at Gartner. What we're going to show here are some direct quotes from IT pros in those round tables, a data science head and a CIO as well. Just to make a few call outs here, I won't spend too much time on it. Starting at the top: like all of us, we can't talk about Databricks without mentioning Snowflake; those two get us excited. The second comment zeroes in on the flexibility and the robustness of Databricks from a data warehouse perspective. And the last point is that despite competition from cloud players, Databricks has reinvented itself a couple of times over the years. And George, we're going to lay out today a scenario that perhaps calls for Databricks to do that once again.

The big opportunity, and the big challenge, for every tech company is managing a technology transition. The transition that we're talking about is something that's been bubbling up, but it's really epochal.
For the first time in 60 years, we're moving from an application-centric view of the world to a data-centric view, because decisions are becoming more important than automating processes.

So let me let you develop that. Let's talk about it here; we're going to put up some bullets on precisely that point and the changing customer environment. IT stacks are shifting, as George just said, from application-centric silos to data-centric stacks, where the priority is shifting from automating processes to automating decisions. You look at RPA and there's still a lot of automation going on, but that focus on application centricity, with the data locked into those apps, is changing. Data has historically been on the outskirts, in silos, but organizations, think of Amazon, think Uber, Airbnb, are putting data at the core, and logic is increasingly being embedded in the data instead of the reverse. In other words, today the data is locked inside the app, which is why you need to extract that data and stick it into a data warehouse. The point, George, is we're putting forth this new vision for how data is going to be used, and you've used this Uber example to underscore the future state. Please explain.

Okay, so this is hopefully an example everyone can relate to. The idea is, first, you're automating things that are happening in the real world, and decisions that make those things happen autonomously, without humans in the loop all the time. So to use the Uber example: on your phone, you call a car, you call a driver. Automatically, the Uber app then looks at what drivers are in the vicinity and which are free, matches one, calculates an ETA to you, calculates a price, calculates an ETA to your destination, and then directs the driver once they're there. The point is that this cannot happen easily in an application-centric world, because all these little apps, the drivers, the riders, the routes, the fares, call on data locked up in many different apps, and they have to sit on a layer that makes it all coherent.

So if Uber's doing this, doesn't this tech already exist? Isn't there a tech platform that does this already? Yes, and the mission of the entire tech industry is to build services that make it possible to compose and operate similar platforms and tools, but with the skills of mainstream developers in mainstream corporations, not the rocket scientists at Uber and Amazon.

Okay, so we're talking about scale, horizontally scaling across the industry, and giving a lot more organizations access to this technology. So by way of review, let's summarize the trend going on today in terms of the modern data stack that is propelling the likes of Databricks and Snowflake, which we just showed you in the ETR data, and it really is a tailwind for them. The trend is toward a common repository for analytic data. That could be multiple virtual data warehouses inside of Snowflake, but within that Snowflake environment, or lakehouses from Databricks, or multiple data lakes. We've talked about what JPMorgan Chase is doing with its data mesh, gluing data lakes together, and you've got various public clouds playing in this game. And then the data is annotated to have a common meaning. In other words, there's a semantic layer that enables applications to talk to the data elements and know that they have common and coherent meaning.
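To make that idea a little more concrete, here's a minimal, purely illustrative Python sketch, not anything Uber or Databricks actually ships, of the kind of decision that gets automated when data products share coherent semantics: drivers and riders coming from different source systems express location the same way, so a matching decision can be computed across them. All names and numbers here are hypothetical.

```python
from dataclasses import dataclass
import math

# Toy "data products" with shared semantics: every entity that carries a
# location expresses it the same way (lat, lon), so any consumer can
# combine them coherently.
@dataclass
class Driver:
    id: str
    lat: float
    lon: float
    free: bool

@dataclass
class Rider:
    id: str
    lat: float
    lon: float

def distance_km(a, b):
    # Crude flat-earth approximation; fine for a toy example.
    dx = (a.lon - b.lon) * 111.0 * math.cos(math.radians((a.lat + b.lat) / 2))
    dy = (a.lat - b.lat) * 111.0
    return math.hypot(dx, dy)

def match(rider, drivers, speed_kmh=30.0):
    # The automated "decision": nearest free driver, plus an ETA in minutes.
    free = [d for d in drivers if d.free]
    if not free:
        return None
    best = min(free, key=lambda d: distance_km(d, rider))
    eta_min = distance_km(best, rider) / speed_kmh * 60
    return best.id, round(eta_min, 1)

drivers = [Driver("d1", 37.44, -122.16, True), Driver("d2", 37.45, -122.17, False)]
print(match(Rider("r1", 37.43, -122.15), drivers))  # -> ('d1', <eta minutes>)
```

The point of the toy is only that the match decision is possible because both entities agree on what a location means; in an application-centric world, that agreement lives locked inside each separate app.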
So George, the good news is this approach is more effective than the legacy monolithic models that Mike Olson was talking about. What's the problem with this in your view?

So today's data platforms added immense value because they connected the data that was previously locked up in these monolithic apps or in all these different microservices, and that supported traditional BI and AI/ML use cases. But now, if we want to build apps like Uber or amazon.com, where they've got essentially an autonomously running supply chain and e-commerce app that humans only care for and feed, but the thing itself is figuring out what to buy, when to buy, where to deploy it, when to ship it, we need a semantic layer on top of the data so that, as you were saying, the data coming from all those different apps is integrated, not just connected, and means the same thing. And the issue is that whenever you add a new layer to a stack to support new applications, there are implications for the already existing layers: can they support the new layer and its use cases? So for instance, if you add a semantic layer that embeds app logic with the data, rather than vice versa, which is what we've been talking about and has been the case for 60 years, then the new data layer faces challenges. The way you manage that data and the way you analyze that data are not supported by today's tools.

Okay, so actually, Alex, bring up that last slide if you would. You're basically saying at the bottom here that today's repositories don't really do joins at scale, and in the future you're talking about hundreds of thousands or millions of data connections, where today's systems handle, I don't know, six, eight, ten joins. And the fundamental problem, you're saying, is that a new data era is coming and existing systems won't be able to handle it.

Yeah, one way of thinking about it is that even though we call them relational databases, when we actually want to do lots of joins, or when we want to analyze data from lots of different tables, we created a whole new industry of analytic databases where you munge the data together into fewer tables, so you don't have to do as many joins, because joins are difficult and slow. And when you're going to arbitrarily join across thousands, hundreds of thousands or millions of elements, you need a new type of database. We have them, they're called graph databases, but to query them you go back to the pre-relational era in terms of usability.

Okay, so we're going to come back to that and talk about how you get around that problem. But let's first lay out what we think the ideal data platform of the future looks like, and again, we're going to come back to the Uber example in this graphic that George put together. We've got three layers. The application layer is where the data products reside; the example here is drivers, riders, maps, routes, ETA, et cetera, the digital version of what we were talking about in the previous slide, people, places and things. The next layer is the data layer, which breaks down the silos and connects the data elements through semantics so everything is coherent. And the bottom layer is the legacy operational systems that feed that data layer. George, explain what's different here, the graph database element, the relational query capabilities you talk about. And why can't I just throw memory at solving this problem?

Some of the graph databases do throw memory at the problem.
And maybe without naming names, some of them live entirely in memory. But what you're dealing with there is a pre-relational, in-memory database system where you navigate between elements. And the issue with that is we've had SQL for 50 years precisely so we don't have to navigate; we can say what we want without specifying how to get it. That's the core of the problem.

Okay, so if I may, I just want to drill into this a little bit. You're talking about the expressiveness of a graph. Alex, if you'd bring that back up, the fourth bullet: the expressiveness of a graph database with the relational ease of query. Can you explain what you mean by that?

Yeah, so graphs are great because you can describe anything with a graph; that's why they're becoming so popular. Expressive means you can represent anything easily. They're conducive to a world where we now want something like the metaverse, with a 3D world, and I don't mean the Facebook metaverse, I mean the business metaverse, where we want to capture data about everything, but we want it in context; we want to build a set of digital twins that represent everything going on in the world. And Uber is a tiny example of that: Uber built a graph to represent all the drivers and riders and maps and routes. But what you need out of a database isn't just a way to store stuff and update stuff; you need to be able to ask questions of it, to query it. And if you go back to pre-relational days, you had to know how to find your way to the data. It's sort of like giving directions to someone before GPS and mapping systems: you had to give them turn-by-turn directions. Whereas with a GPS and a mapping system, which is like the relational approach, you just say where you want to go and it spits out the turn-by-turn directions, which the car, or whoever you're directing, follows. The point is it's much easier in a relational database to say, I just want these results, you figure out how to get them. That's why graph databases have not taken over the world: in some ways querying them is taking a 50-year leap backwards.

All right, got it. Okay, let's take a look at how the current Databricks offerings map to that ideal state we just laid out. To do that, we put together this chart that looks at the key elements of the Databricks portfolio: the core capability, the weakness, and the threat that may loom. Start with Delta Lake, the storage layer, which is great for files and tables. It's got true separation of compute and storage as independent elements, and I want you to double-click on that, George, but it's weaker for the type of low-latency ingest we see coming in the future. And some of the threats highlighted here: AWS could add transactional tables to S3, and Iceberg adoption is picking up and could accelerate, which could disrupt Databricks. George, add some color here, please.

Okay, so this is a classic competitive-forces exercise, where you want to look at what customers are demanding, what the competitive pressures are, what the substitutes are, even what your suppliers might be pushing. Delta Lake at its core is a set of transactional tables that sit on an object store; in a database system, think of it as the storage engine. And since S3 has been getting stronger for 15 years, you could see a scenario where they add transactional tables.
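As a point of reference, here's a minimal sketch of what "transactional tables on an object store" means in practice, using the open source delta-spark package. The path, schema and values are hypothetical; in production the save path would point at S3 or another object store rather than local disk, and this assumes Spark can find the Delta jars.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session wired for Delta; assumes the open source delta-spark
# package is installed and its jars are on the Spark classpath.
spark = (SparkSession.builder.appName("delta-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical path; in production this would be an object store URI
# such as s3://bucket/rides -- the "transactional tables on an object
# store" idea in a nutshell.
path = "/tmp/delta/rides"

df = spark.createDataFrame([(1, "SFO", 52.50), (2, "OAK", 31.00)],
                           ["ride_id", "pickup", "fare"])
df.write.format("delta").mode("overwrite").save(path)  # Parquet files + txn log

# ACID update-in-place -- something a bare object store can't give you.
DeltaTable.forPath(spark, path).update(
    condition="ride_id = 1", set={"fare": "55.00"})

spark.read.format("delta").load(path).show()
```

The transaction log alongside the Parquet files is what turns "just files" into tables with ACID guarantees, which is why adding equivalent capability natively to S3 would strike at Delta Lake's core.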
We have an open source alternative in Iceberg, which Snowflake and others support. But at the same time, Databricks has built an ecosystem out of tools, its own and others', that read and write to Delta tables, and that's what makes Delta Lake an ecosystem. They have a catalog, and the whole machine learning tool chain talks directly to the data here. That was their great advantage, because in the past with Snowflake, you had to pull all the data out of the database before the machine learning tools could work with it. That was a major shortcoming, and they fixed that. But the point here is that even before we get to the semantic layer, the core foundation is under threat.

Yeah, got it. Okay, we've got a lot of ground to cover, so we're going to take a look at the Spark execution engine next. Think of that as the refinery that runs really efficient batch processing; that's a big part of what disrupted Hadoop. But it's not Python friendly, and that's an issue because the data science and data engineering crowds are moving in that direction, and/or they're using DBT. George, we had Tristan Handy on at Supercloud, a really interesting discussion that you and I did. Explain why this is an issue for Databricks.

So once the data lake was in place, what people did was refine their data in batch, and Spark has always had streaming support, which has gotten better; the underlying storage, as we've talked about, is an issue. But basically they took raw data, refined it into tables like customers and products and partners, and then refined that again into gold artifacts, which might be business intelligence metrics or dashboards, which are collections of metrics. They were running all of that on the Spark execution engine, which is a Java-based engine running on a Java virtual machine, which means all the data scientists and data engineers who want to work with Python are really working in oil and water. If you get an error in Python, you can't tell whether the problem is in Python or in Spark; it's an impedance mismatch between the two. And at the same time, the whole world is now gravitating toward DBT, because it's a very nice and simple way to compose these data processing pipelines, and people are using either SQL in DBT or Python in DBT. That is a substitute for doing it all in Spark. So it's under threat even before we get to that semantic layer. It so happens that DBT itself is becoming the authoring environment for the semantic layer, with business intelligence metrics. So this is the second element that's under direct substitution and competitive threat.

Okay, let's now move down to the third element, which is Photon. Photon is Databricks' BI query engine for the lakehouse, which has integration with the Databricks tooling, and that tooling is very rich. But it's newer, and it's also not well suited for the high-concurrency and low-latency use cases which we think are increasingly going to become the norm over time. George, the call-out threat here is customers want to connect everything to a semantic layer. Explain your thinking here and why this is a potential threat to Databricks.
Okay, so two issues here. The first is what you were touching on, the high concurrency and low latency. When people are running thousands of dashboards and data is streaming in, that's a problem, because a SQL data warehouse query engine is something that matures over five to ten years. It's like the joke Andy Jassy makes, he's really talking about Azure, that there's no compression algorithm for experience. The Snowflake guys started more than five years earlier, and for a bunch of reasons that lead is not something Databricks can shrink; they'll always be behind. That's also why Snowflake has transactional tables now, and we can get into that in another show. So near term, Databricks is struggling to keep up with the use cases that are core to business intelligence: highly concurrent, lots of users doing interactive query. But then, when you get to a semantic layer, you need to be able to query data that might have thousands, tens of thousands or hundreds of thousands of joins, and a traditional SQL query engine is just not built for that. That's the core problem of traditional relational databases.

Now, a quick aside. We always talk about Snowflake and Databricks in the same context, but we're not necessarily saying that Snowflake is in a position to tackle all these problems; we'll deal with that separately, so we don't mean to imply that. We're just laying out some of the things that we think Databricks customers need to be thinking about and having conversations with Databricks about, and we hope to have those conversations as well. We'll come back to that in terms of strategic options.

Finally, I want to come back to the table. We have the Databricks AI/ML tool chain, which has been an awesome capability for the data science crowd. It's comprehensive, it's a one-stop-shop solution, but the kicker here is that it's optimized for supervised model building, and the concern is that foundation models like GPT could cannibalize the current Databricks tooling. But George, can't Databricks, like other software companies, integrate foundation model capabilities into its platform?

Okay, so the soundbite answer to that is: sure, and IBM 3270 terminals could call out to a graphical user interface running on an XT, but they're not exactly good citizens in that world. The core issue is that Databricks has this wonderful end-to-end tool chain for training, deploying, monitoring and running inference on supervised models, but the paradigm there is that the customer builds, trains and deploys each model for each feature or application. In a world of foundation models, which are pre-trained and unsupervised, the entire tool chain is different. It's not that Databricks can junk everything they've done and start over with all their engineers; they have to keep maintaining what they've built for the old world while building something new that's optimized for the new world. It's a classic technology transition, and their mentality appears to be, oh, we'll support the new stuff from our old stuff, which is suboptimal. And as we'll talk about, their biggest patron and the company that put them on the map, Microsoft, really stopped working on its old stuff three years ago so that it could build a new tool chain optimized for this new world.
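To make that tool chain distinction concrete, here's a hedged Python sketch contrasting the two paradigms. In the supervised path, you supply labeled data and build, train and deploy a model per application; in the foundation-model path, someone else has already done the training, and the work shifts to prompting, fine-tuning and serving. The libraries shown, scikit-learn and Hugging Face transformers, are generic stand-ins for illustration, not a statement about what Databricks or Microsoft actually use.

```python
# Paradigm 1: supervised ML. You curate labeled data, then build, train,
# deploy and monitor one model per feature or application.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great ride", "driver was late", "loved it", "awful experience"]
labels = [1, 0, 1, 0]  # hypothetical labels you have to supply yourself
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)               # training is your job
print(model.predict(["loved the ride"]))

# Paradigm 2: foundation model. Pre-trained by someone else; there is no
# fit() step here, and the tool chain shifts to prompting and serving.
from transformers import pipeline     # assumes transformers is installed
clf = pipeline("sentiment-analysis")  # pulls down a pre-trained model
print(clf("loved the ride"))
```

The asymmetry is the point: everything built around that fit() step, labeling, experiment tracking, retraining, per-model deployment, has no direct counterpart in the second paradigm.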
Yeah, and so let's close with what we think the options are, and the decisions Databricks faces for its future architecture. They're smart people; we've had Ali Ghodsi on many times, super impressive, and I think they've got to be keenly aware of the limitations and of what's going on with foundation models. At any rate, in this chart we lay out three scenarios. One, re-architect the platform by incrementally adopting new technologies; an example might be layering a graph query engine on top of its stack. Two, license key technologies, like a graph database. Three, get aggressive on M&A and buy in relational knowledge graph, semantic and vector database technologies. George, as David Floyer always says, there are a lot of ways to skin a cat; think of how EMC maintained its relevance through M&A for many, many years. Give us your thoughts on each of these strategic options.

Okay, I find this question the most challenging, because remember, I used to be an equity research analyst. I worked for Frank Quattrone; we were one of the top tech shops in the banking industry, although this is 20 years ago, and the M&A team was the top team in the industry and everyone wanted them on their side. And I remember going to meetings with CEOs where Frank and the bankers would say, you want us for your M&A work because we can do better, and they really could do better. But in software it's not like with EMC and hardware, because with hardware it's easier to connect different boxes. With software, the whole point of a software company is to integrate and architect the components so they fit together and reinforce each other, and that makes M&A harder. You can do it, but it takes a long time to fit the pieces together.

Let me give you examples. If they put a graph query engine, say something like TinkerPop, on top of, I don't even know if it's possible, but let's say they put it on top of Delta Lake, then you have this graph query engine talking to their storage layer, Delta Lake. But if you want to do analysis, you've got to put the data in Photon, which is not really ideal for highly connected data. If you license a graph database, then most of your data is in the Delta Lake, and how do you sync it with the graph database? If you do sync it, you've got data in two places, which kind of defeats the purpose of having a unified repository. I find the semantic layer option in number three actually more promising, because that's something you can layer on top of the storage layer you already have; you just have to figure out how to have your query engines talk to it. What I'm trying to highlight is that it's easy as an analyst to say you can buy this company or license that technology, but the really hard work is making it all work together, and that's where the challenge is.

Yeah, well look, I thank you for laying that out. We've seen it, certainly with Microsoft and Oracle. I guess you might argue that Microsoft had a monopoly in desktop software and was able to throw off cash for a decade-plus while its stock was going sideways, and Oracle had won the database wars and had amazing margins and cash flow to be able to do that. Databricks hasn't even gone public yet. But I want to close with some of the players to watch.
Alex, if you bring that back up, number four here. AWS: we talked about some of their options with S3, and it's not just AWS, it's other blob and object storage players too. Microsoft, as you alluded to, was an early go-to-market channel for Databricks; we didn't really address that, so maybe we can in the closing comments. Google, obviously. Snowflake, of course; we're going to dissect their options in a future Breaking Analysis. DBT Labs, where do they fit? And Bob Muglia's company, Relational AI. Why are these players to watch, George, in your opinion?

So everyone is trying to assemble and integrate the pieces that would make building data applications, data products, easy. And the critical part isn't just assembling a bunch of pieces, which is traditionally what AWS did. That's a UNIX ethos: we give you the tools, you put them together, because then you have maximum choice and maximum power. What the hyperscalers are doing is taking their key-value stores, in the case of AWS it's DynamoDB, in the case of Azure it's Cosmos DB, and putting a graph query engine on top of those, so they have unified storage and a graph database engine. All the data would be collected in the key-value store, and then you have a graph database on top; that's how they're going to present a foundation for building these data apps. DBT Labs is putting a semantic layer on top of data lakes and data warehouses, and as we'll talk about, I'm sure, in the future, that makes it easier to swap out the underlying data platform or swap in new ones for specialized use cases. Snowflake is so strong in data management, and with their transactional tables, what they're trying to do is take in the operational data that used to be the province of state stores like MongoDB and say, if you manage that data with us, it'll be connected to your analytic data without having to send it through a pipeline, and that's hugely valuable. Relational AI is the wild card, because what they're trying to do is almost a holy grail: take the expressiveness of connecting all your data in a graph, but make it as easy to query as it's always been in a relational database. If they do that, it'll be as easy to program these data apps as a spreadsheet was compared to procedural languages like BASIC or Pascal. That's the implication of Relational AI.

Yeah, and again, we've talked before about why you can't just throw this all in memory. In that example, we're really getting down to differences in how you lay the data out on disk, a genuinely new database architecture, correct?

Yes, and that's why it's not clear you could take a data lake, or even a Snowflake, and put a relational knowledge graph on it. You could potentially put a graph database on top, but it would be compromised, because to really do what Relational AI has done, which is the ease of relational on top of the power of graph, you actually need to change how you're storing your data on disk, or even in memory. In other words, it's not like, oh, we can just add graph support to Snowflake or to your data lake, because if you did that, you'd have to change how the data is physically laid out, and that would break all the tools that currently talk to it.
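Since the relational-ease-versus-graph-power tradeoff keeps coming up, here's a small self-contained Python sketch of the contrast. The SQL half runs as-is against an in-memory SQLite database; the Gremlin-style traversal is shown only as a comment, with hypothetical edge labels, to illustrate the navigational style George compares to pre-relational systems. The schema and data are made up.

```python
import sqlite3

# Hypothetical three-table schema: riders take rides with drivers.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE riders  (rider_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE drivers (driver_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rides   (ride_id   INTEGER PRIMARY KEY,
                      rider_id  INTEGER, driver_id INTEGER, fare REAL);
INSERT INTO riders  VALUES (1, 'Ann'), (2, 'Bo');
INSERT INTO drivers VALUES (10, 'Cy'), (11, 'Di');
INSERT INTO rides   VALUES (100, 1, 10, 52.5), (101, 2, 10, 31.0);
""")

# Declarative: say WHAT you want (everyone who rode with driver Cy);
# the engine decides HOW to execute the joins.
rows = db.execute("""
    SELECT r.name
    FROM riders  r
    JOIN rides   x ON x.rider_id  = r.rider_id
    JOIN drivers d ON d.driver_id = x.driver_id
    WHERE d.name = 'Cy'
""").fetchall()
print(rows)  # e.g. [('Ann',), ('Bo',)]

# Navigational, Gremlin-style traversal of the same question, shown only
# as a comment for contrast (edge labels are hypothetical):
#   g.V().has('driver', 'name', 'Cy').in_('drove').out('rider').values('name')
# You spell out the path hop by hop -- the turn-by-turn directions in the
# GPS analogy -- and at thousands of hops that authoring burden is the issue.
```

This is the gap the "relational ease on top of graph power" idea is trying to close: keep the declarative query style of the SQL half while the data underneath is arbitrarily connected.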
What in your estimation is the timeframe where this becomes critical for a Databricks, and potentially a Snowflake and others? I mentioned earlier midterm. Are we talking three to five years? Are we talking end of decade? What does your radar say?

I think something surprising is going on that's going to come up the tailpipe and take everyone by storm: all the activity around business intelligence metrics. Those are what we used to put in our dashboards, bookings, billings, revenue, customers, those things; they were the key artifacts whose definitions used to live in your BI tools. And DBT has basically created a standard for defining those, so they're defined in your data pipeline and executed in the data warehouse or data lake in a shared way, so that all tools can use them. This sounds like a digression; it's not. All this stuff about data mesh and data fabric, what's really going on is that we need a semantic layer, and the business intelligence metrics are defining common semantics for your data. I think we're going to find by the end of this year that metrics are how we annotate all our analytic data to start adding common semantics to it. This semantic layer is not three to five years off; it's going to be staring us in the face by the end of this year.

Interesting. And of course, SVB was shut down today. We're seeing serious tech headwinds, and oftentimes in these downturns, or flat stretches, which this feels like it could be for a while, we emerge with a lot of new players and a lot of new technology. George, we've got to leave it there. Thank you to George Gilbert for excellent insights and input for today's episode. I want to thank Alex Meyerson, who's on production and manages the podcast, and of course Ken Schiffman as well. Kristen Martin and Cheryl Knight help get the word out on social media and in our newsletters, and Rob Hof is our EIC over at siliconangle.com; he does some great editing. Remember, all these episodes are available as podcasts wherever you listen; all you've got to do is search Breaking Analysis podcast. We publish each week on wikibon.com and siliconangle.com, or you can email me at david.vellante@siliconangle.com or DM me @dvellante, and comment on our LinkedIn posts. And please do check out etr.ai: great survey data, enterprise tech focus, phenomenal. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, and we'll see you next time on Breaking Analysis.