From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante.

Snowflake is not going to grow into its valuation by stealing the croissant from the breakfast table of the on-prem data warehouse vendors. Look, even if Snowflake got 100% of the data warehouse business, it wouldn't come close to justifying its market cap. Rather, Snowflake has to create an entirely new market based on completely changing the way organizations think about monetizing data. Every organization I talk to says it wants to be, or many say they already are, data-driven. Why wouldn't you aspire to that goal? There's probably nothing more strategic than leveraging data to power your digital business and create competitive advantage. But many businesses are failing, or I predict will fail, to create a true data-driven culture because they're relying on a flawed architectural model formed by decades of building centralized data platforms.

Welcome, everyone, to this week's Wikibon CUBE Insights, powered by ETR. In this Breaking Analysis, I want to share some new thoughts and fresh ETR data on how organizations can transform their businesses through data by reinventing their data architectures. And I want to share our thoughts on why we think Snowflake is currently in a very strong position to lead this effort. Now, on November 17th, theCUBE is hosting the Snowflake Data Cloud Summit. Snowflake's ascendancy and its blockbuster IPO have been widely covered by us and many others. Since Snowflake went public, we've been inundated with outreach from investors, customers, and competitors who wanted to either better understand the opportunity or explain why their approach is better or different. And in this segment, ahead of Snowflake's big event, we want to share some of what we learned and how we see it. Now, theCUBE is getting paid to host this event, so I need you to know that, and you can draw your own conclusions from my remarks. But neither Snowflake nor any other sponsor of theCUBE or client of SiliconANGLE Media has editorial influence over Breaking Analysis. The opinions here are mine, and I would encourage you to read my ethics statement in this regard.

I want to talk about the failed data model. The problem is complex; I'm not debating that. Organizations have to integrate data and platforms with existing operational systems, many of which were developed decades ago. And there's a culture and a set of processes that have been built around these systems and hardened over the years. This chart tries to depict the progression of the monolithic data model, which for me began in the 1980s, when decision support systems, or DSS, promised to solve our data problems. The data warehouse became very popular, and data marts sprang up all over the place. This created more proprietary stovepipes with data locked inside. The Enron collapse led to Sarbanes-Oxley, which tightened up reporting requirements and breathed new life into the data warehouse model. But it remained expensive and cumbersome; I've talked about that a lot, like a snake swallowing a basketball. The 2010s ushered in the big data movement, and data lakes emerged. With Hadoop, we saw the idea of schema on read, where you put structured and unstructured data into a repository with no schema enforced on write, and figure it all out on the read.
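Now, just to make that schema-on-read idea concrete, here's a tiny sketch. It's purely illustrative, with made-up records, and it's not tied to Hadoop or any particular product; the point is simply that the structure gets applied when the data is read, not when it's written.

```python
# Purely illustrative sketch of "schema on read": raw, heterogeneous
# records land in the repository untyped, and structure is imposed only
# when a consumer reads them. All data here is made up.
import json

# Ingest: dump records as-is; no schema is enforced on write.
landing_zone = ['{"user": "a", "amount": "12.50"}', '{"user": "b"}']

# Read: each consumer applies its own schema and defaults at query time.
for line in landing_zone:
    record = json.loads(line)
    amount = float(record.get("amount", 0.0))  # the schema decision happens here
    print(record["user"], amount)
```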
What emerged was a fairly complex data pipeline that involved ingesting, cleaning, processing, analyzing, preparing, and ultimately serving data to the lines of business. And this is where we are today, with very hyper-specialized roles around data engineering, data quality, and data science. There's lots of batch processing going on. Spark emerged to address the complexity associated with MapReduce, and it definitely helped improve the situation. We're also seeing attempts to blend in real-time stream processing with the emergence of tools like Kafka and others. But I'll argue that in a strange way, these innovations actually compound the problem, and I want to discuss that, because what they do is heighten the need for more specialization, more fragmentation, and more stovepipes within the data life cycle. Now in reality, and it pains me to say this, the outcome of the big data movement, as we sit here in 2020, is that we've created thousands of complicated science projects that have once again failed to live up to the promise of rapid, cost-effective time to insights.

So, what will the 2020s bring? What's the next silver bullet? You hear terms like the lakehouse, which Databricks is trying to popularize, and I'm going to talk today about the data mesh. These and other efforts look to modernize data lakes and sometimes merge the best of data warehouses and second-generation systems into a new paradigm that might unify batch and streaming frameworks. And this definitely addresses some of the gaps, but in our view it still suffers from some of the underlying problems of previous-generation data architectures. In other words, if the next-gen data architecture is incremental, centralized, and rigid, and primarily focuses on making the technology to get data in and out of the pipeline work, we predict it's going to fail to live up to expectations. Rather, what we're envisioning is an architecture based on the principles of distributed data, where domain knowledge is the primary, first-class citizen, and data is not seen as a byproduct, i.e. the exhaust of an operational system, but rather as a service that can be delivered in multiple forms and use cases across an ecosystem. This is why we often say that data is not the new oil. We don't like that phrase. A specific gallon of oil can either fuel my home or it can lubricate my car engine, but it can't do both. Data does not follow the same laws of scarcity as natural resources. Again, what we're envisioning is a rethinking of the data pipeline and the associated culture to put the data needs of the domain owner at the core and provide automated, governed, and secure access to data as a service at scale.

Now, how is this different? Let's unpack the data pipeline today and look deeper into the situation. You all know this picture that I'm showing. There's nothing really new here. Data comes from inside and outside the enterprise. It gets processed, cleaned, and augmented so that it can be trusted and made useful; nobody wants to use data they can't trust. Then we can add machine intelligence, do more analysis, and finally deliver the data so that domain-specific consumers can essentially build data products and services, or reports and dashboards, or content services. For instance, an insurance policy, a financial product, or a loan; these are packaged and made available for someone to make decisions on or to make a purchase. And all the metadata associated with this data is packaged along with the dataset.
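To picture that linear pipeline, here's a deliberately simplistic sketch. Every name and record in it is made up; what matters is the shape: each stage depends on the one before it, so any change a domain needs has to ripple through the whole chain.

```python
# Illustrative sketch of the linear, centralized pipeline described
# above: ingest -> clean -> process -> analyze -> serve. All stage
# names and data are hypothetical.

def ingest(sources):
    # Pull rows from every source into one central collection.
    return [row for source in sources for row in source]

def clean(rows):
    # Drop records that can't be trusted (here: missing amounts).
    return [r for r in rows if r.get("amount") is not None]

def process(rows):
    # Normalize types so downstream stages can rely on them.
    return [{**r, "amount": float(r["amount"])} for r in rows]

def analyze(rows):
    return sum(r["amount"] for r in rows)

def serve(result, line_of_business):
    print(f"report for {line_of_business}: total = {result}")

# A new domain data source can only enter at the front of the line:
crm = [{"amount": "12.50"}, {"amount": None}]
erp = [{"amount": "7.25"}]
serve(analyze(process(clean(ingest([crm, erp])))), "finance")
```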
Now, we've broken down these steps into atomic components over time so we can optimize on each and make them as efficient as possible. And down below, you have these happy stick figures; sometimes they're happy, but they're highly specialized individuals, and they each do their job and do it well to make sure that the data gets in, gets processed, and gets delivered in a timely manner. Now, while these individual pieces are seemingly autonomous and can be optimized and scaled, they're all encompassed within the centralized big data platform. And it's generally accepted that this platform is domain agnostic, meaning the platform is the data owner, not the domain-specific experts.

Now, there are a number of problems with this model. First, while it's fine for organizations with a smaller number of domains, organizations with a large number of data sources and complex domain structures struggle to create a common data parlance, that is, a data culture. Another problem is that as the number of data sources grows, organizing and harmonizing them in a centralized platform becomes increasingly difficult, because the context of the domain and the line of business gets lost. Moreover, as ecosystems grow and you add more data, the processes associated with the centralized platform tend to get further generalized, and they, again, lose that domain-specific context. Wait, there are more problems. While in theory organizations are optimizing on the piece parts of the pipeline, the reality is that when a domain requires a change, for example, a new data source, or an ecosystem partnership requires a change in access or processes that can benefit a domain consumer, that change is subservient to the dependencies across these discrete parts of the pipeline, and the need to synchronize them is orthogonal to each of those parts. In other words, in actuality, the monolithic data platform itself remains the most granular part of the system.

Now, when I complain about this faulty structure, some folks tell me this problem has been solved, that there are services that allow new data sources to be added really easily. A good example of this is Databricks Ingest and its Auto Loader, which simplifies ingestion into the company's Delta Lake offering; I'll show a rough sketch of that pattern below. And rather than centralizing in a data warehouse, which struggles to efficiently incorporate things like machine learning frameworks, this feature allows you to put all the data into a centralized data lake, or so the argument goes. The problem that I see with this is that while the approach definitely minimizes the complexity of adding new data sources, it still relies on a linear, end-to-end process that slows down the introduction of data sources from the domain consumer side of the pipeline. In other words, the domain expert still has to elbow her way to the front of the line, or the pipeline in this case, to get stuff done. And finally, the way we're organizing teams is a point of contention, and I believe it's going to continue to cause problems down the road. Specifically, we've again optimized on technology expertise where, for example, data engineers, while really good at what they do, are often removed from the operations of the business. Essentially, we've created more silos and organized around technical expertise versus domain knowledge. As an example, a data team has to work with data that is delivered with very little domain specificity yet serves a variety of highly specialized consumption use cases.
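And here's the rough sketch of that Auto Loader pattern I promised above. To be clear, this is a hedged illustration, not Databricks' verbatim example; it assumes a Databricks runtime where the spark session exists, and the paths are placeholders.

```python
# Rough sketch of the Databricks Auto Loader pattern mentioned above,
# which streams newly arriving files into a Delta table. Assumes a
# Databricks runtime (the `spark` session and cloudFiles source exist);
# all paths are placeholders.
stream = (
    spark.readStream
         .format("cloudFiles")                            # Auto Loader source
         .option("cloudFiles.format", "json")             # format of landing files
         .option("cloudFiles.schemaLocation", "/tmp/schema")  # schema tracking
         .load("/mnt/landing/events")                     # new files picked up automatically
)

(
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", "/tmp/checkpoints/events")
          .start("/mnt/delta/events")                     # the Delta Lake target
)
```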
All right, I want to step back for a minute and talk about some of the problems that people bring up with Snowflake, and then I'll relate it back to the basic premise here. As I said earlier, we've been hammered by dozens and dozens of data points, opinions, and criticisms of Snowflake. I'll share a few here, and I'll post a deeper technical analysis from a software engineer that I found to be fairly balanced. There are five Snowflake criticisms that I'll highlight; there are many more, but here are the ones I want to call out.

First, price transparency. I've had more than a few customers tell me they chose an alternative database because of the unpredictable nature of Snowflake's pricing model. Snowflake, as you probably know, prices based on consumption, just like AWS and other cloud providers. So, just like AWS, the bill at the end of the month is sometimes unpredictable. Is this a problem? Yes, but like AWS, I would say kill me with that problem. Look, if users are creating value by using Snowflake, then that's good for the business. But clearly this is a sore point for some users, especially for procurement and finance, which don't like unpredictability, and Snowflake needs to do a better job communicating and managing this issue with tooling that can predict and help better manage costs.

Next, workload management, or the lack thereof. Look, if you want to isolate higher-performance workloads with Snowflake, you just spin up a separate virtual warehouse. It's kind of a brute-force approach; it generally works, but it will add expense. I'm reminded of Pure Storage and its approach to storage management. The engineers at Pure always designed for simplicity, and this is the approach that Snowflake is taking. The difference between Pure and Snowflake, as I'll discuss in a moment, is that Pure's ascendancy was based largely on stealing share from legacy EMC systems. Snowflake, in my view, has a much, much larger incremental market opportunity.

Next is the caching architecture; you hear this a lot. At the end of the day, Snowflake is based on a caching architecture, and a caching architecture has to be working for some time to optimize performance. Caches work well when the size of the working set is small; they generally don't work well when the working set is very, very large. In general, transactional databases have pretty small data sets, and analytics data sets are potentially much larger. Is Snowflake in the analytics business? Yes, but the good thing Snowflake has done is enable data sharing, and its caching architecture serves its customers well because it allows domain experts, you're going to hear this a lot from me today, to isolate and analyze problems or go after opportunities based on tactical needs. That said, very big queries across whole data sets, or badly written queries that scan the entire database, are not the sweet spot for Snowflake. Another good example would be a large audit where you need to analyze a huge, huge data set; Snowflake's probably not the best solution.

Next, complex joins; you hear this a lot too. The working sets of complex joins are, by definition, larger, so see my previous explanation. And finally, read-only. Snowflake is pretty much optimized for read-only data; maybe stateless data is a better way of thinking about this. Heavily write-intensive workloads are not the wheelhouse of Snowflake, so where this may be an issue is real-time decision-making and AI inferencing. Now, over time, and I've talked about this, Snowflake might be able to develop products or acquire technology to address this opportunity.
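On the workload-management point, here's a minimal sketch of what spinning up a separate virtual warehouse looks like, assuming the snowflake-connector-python package; the account, credentials, and table names are placeholders, not a real deployment.

```python
# Hedged sketch: isolating a heavy analytics workload on its own
# virtual warehouse, the "brute force" approach described above.
# Account, user, and object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myorg-myaccount",  # placeholder account identifier
    user="analyst",
    password="...",             # use key-pair auth or SSO in practice
)
cur = conn.cursor()

# A dedicated warehouse keeps this workload's compute separate from
# everyone else's; AUTO_SUSPEND limits the added expense.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
           AUTO_SUSPEND = 60
           AUTO_RESUME = TRUE
           INITIALLY_SUSPENDED = TRUE
""")
cur.execute("USE WAREHOUSE reporting_wh")
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")  # hypothetical table
for row in cur.fetchall():
    print(row)
```

The AUTO_SUSPEND setting is one small lever against the added expense I mentioned; the warehouse parks itself when the isolated workload goes idle.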
Now, I want to explain why these issues would be problematic only if Snowflake were just a data warehouse vendor. If that were the case, this company, in my opinion, would hit a wall, just like the MPP vendors that preceded it, which built a better mousetrap for certain use cases and hit a wall. Rather, my premise in this episode is that the future of data architectures will be to move away from large, centralized warehouse and data lake models to a highly distributed data-sharing system that puts power in the hands of domain experts in the lines of business. Snowflake is less computationally efficient and less optimized for classic data warehouse work, but it's designed to serve the domain user much more effectively, in our view. We believe that Snowflake is optimizing for business effectiveness, essentially. As I said before, the company can probably do a better job of keeping passionate end users from breaking the bank, but as long as these end users are making money for their companies, I don't think this is going to be a problem.

Let's look at the attributes of what we're proposing around this new architecture. We believe we'll see the emergence of a total flip of the centralized and monolithic big data systems that we've known for decades. In this architecture, data is owned by domain-specific business leaders, not technologists. Today, it's not much different in most organizations than it was 20 years ago: if I want to create something of value that requires data, I need to cajole, beg, or bribe the technology and data teams to accommodate me. The data consumers are subservient to the data pipeline, whereas in the future, we see the pipeline as a second-class citizen and the domain expert elevated. In other words, getting the technology and the components of the pipeline to be more efficient is not the key outcome. Rather, the time it takes to envision, create, and monetize a data service is the primary measure. The data teams are cross-functional and live inside the domain, versus today's structure where the data team is largely disconnected from the domain consumer. Data in this model, as I said, is not the exhaust coming out of an operational system or an external source that is treated as generic and stuffed into a big data platform. Rather, it's a key ingredient of a service that is domain-driven and monetizable. And the target system is not a warehouse or a lake; it's a collection of connected, domain-specific data sets that live in a global mesh.

So what is a distributed global data mesh? A data mesh is a decentralized architecture that is domain-aware. The data sets in the system are purposely designed to support a data service, or data product if you prefer. The ownership of the data resides with the domain experts because they have the most detailed knowledge of the data's requirements and its end use. Data in this global mesh is governed and secured, and every user in the mesh can have access to any data set as long as that access is governed according to the edicts of the organization. Now, in this model, the domain expert has access to a self-service and abstracted infrastructure layer that is supported by a cross-functional technology team.
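One way to picture the domain-owned, self-describing data product this model calls for is a simple catalog record. This is just an illustration; the fields and names are hypothetical, not any vendor's actual schema.

```python
# Illustrative only: a self-describing "data product" record of the kind
# a domain team might publish to a catalog in a data mesh. All fields
# and names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str          # discoverable catalog identifier
    domain: str        # owning line of business
    owner: str         # accountable domain expert
    description: str   # rich metadata for discoverability
    schema: dict       # self-describing structure
    access_policy: str # governed per the organization's edicts
    sla: str           # freshness/quality commitment
    tags: list = field(default_factory=list)

orders = DataProduct(
    name="retail.orders_daily",
    domain="retail",
    owner="merchandising-team",
    description="Daily order facts, shaped for demand forecasting",
    schema={"order_id": "string", "sku": "string", "amount": "decimal"},
    access_policy="role:retail_reader",
    sla="refreshed by 06:00 UTC daily",
    tags=["orders", "forecasting"],
)
print(orders.name, "owned by", orders.owner)
```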
Again, the primary measure of success is the time it takes to conceive and deliver a data service that can be monetized. Now, by monetized we mean a data product or data service that either cuts costs, drives revenue, saves lives, whatever the mission of the organization is. The power of this model is that it accelerates the creation of value by putting authority in the hands of those individuals who are closest to the customer and have the most intimate knowledge of how to monetize data. It reduces the diseconomies of scale of having a centralized, monolithic data architecture. And it scales much better than legacy approaches because the atomic unit is a data domain, not a monolithic warehouse or lake.

Zhamak Dehghani is a software engineer who has been working to popularize the concept of the data mesh, and her work is outstanding; it has strengthened our belief that practitioners see this the same way we do. To paraphrase her view, a domain-centric system must be secure and governed with standard policies across domains. It has to be trusted; as I said, nobody's going to use data they don't trust. It's got to be discoverable via a data catalog with rich metadata. The data sets have to be self-describing and designed for self-service. And accessibility for all users is crucial, as is interoperability, without which distributed systems, as we know, fail.

So what does this all have to do with Snowflake? As I said, Snowflake is not just a data warehouse; in our view, it's always had the potential to be more. Our assessment is that attacking data warehouse use cases gave Snowflake a straightforward, easy-to-understand narrative that allowed it to get a foothold in the market. Data warehouses are notoriously expensive, cumbersome, and resource-intensive, but they're a critical aspect of reporting and analytics. So it was logical for Snowflake to target on-premises legacy data warehouses, and their smaller cousins the data marts, as early use cases. By putting forth and demonstrating a simple data warehouse alternative that could be spun up quickly, Snowflake was able to gain traction, demonstrate repeatability, and attract the capital necessary to scale to its vision.

This chart shows the three layers of Snowflake's architecture that have been well documented: the separation of compute from storage, and the outer layer of cloud services. But I want to call your attention to the bottom part of the chart, the so-called cloud-agnostic layer that Snowflake introduced in 2018. This layer is somewhat misunderstood. Not only did Snowflake make its cloud-native database able to run on AWS, then Azure, and in 2020 GCP; what Snowflake has done is abstract away the cloud infrastructure complexity and create what it calls the data cloud. What's the data cloud? We don't believe the data cloud is just a marketing term without substance. Just as SaaS simplified application software and IaaS made it possible to eliminate the value drain associated with provisioning infrastructure, a data cloud, in concept, can simplify data access, break down fragmentation, and enable shared data across the globe. Snowflake has a first-mover advantage in this space, and we see a number of fundamental aspects that comprise a data cloud. First, massive scale with virtually unlimited compute and storage resources that are enabled by the public cloud; we talk about this a lot. Second is a database architecture that's built to take advantage of native public cloud services. This is why Frank Slootman says, we burn the boats, we're not ever doing on-prem, we're all in on cloud and cloud native. Third is an abstraction layer that hides the complexity of infrastructure. And fourth is a governed and secure shared-access system where any user in the system, if allowed, can get access to any data in the cloud.
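To make that fourth point concrete, here's a hedged sketch of governed sharing using Snowflake's documented CREATE SHARE and GRANT ... TO SHARE pattern; the database, table, and account names are all placeholders.

```python
# Hedged sketch of the governed sharing the data cloud enables: a
# provider account publishes a read-only share that another account in
# the mesh can consume. Object and account names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="provider-account", user="data_owner", password="..."
)
for stmt in [
    "CREATE SHARE IF NOT EXISTS claims_share",
    "GRANT USAGE ON DATABASE insurance TO SHARE claims_share",
    "GRANT USAGE ON SCHEMA insurance.public TO SHARE claims_share",
    # Consumers get governed, read-only access to just this data set:
    "GRANT SELECT ON TABLE insurance.public.claims TO SHARE claims_share",
    # Add the partner's account to the share (placeholder identifier):
    "ALTER SHARE claims_share ADD ACCOUNTS = partner_org.partner_account",
]:
    conn.cursor().execute(stmt)
```

Note that the consumer gets governed, read-only access to exactly the objects granted to the share, which is the behavior the mesh model calls for.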
So a key enabler of the data cloud is this thing called the global data mesh, and earlier this year, Snowflake introduced its global data mesh. Over the course of its recent history, Snowflake has been building out its data cloud by creating data regions, strategically tapping key AWS regions and then adding Azure and GCP. The complexity of the underlying cloud infrastructure has been stripped away to enable self-service, and any Snowflake user becomes part of this global mesh, independent of the cloud that they're on.

Okay, so now let's go back to what we were talking about earlier. Users in this mesh can be, will be, are domain owners. They're building monetizable services and products around data. They're most likely dealing with relatively small, read-only data sets. They can ingest data from any source very easily and quickly set up security and governance to enable data sharing across different parts of an organization or, very importantly, an ecosystem. Access control and governance are automated. The data sets are addressable. The data owners have clearly defined missions, and they own the data through its life cycle, data that is specific to and purposefully shaped for their missions. Now, you're probably asking, what happens to the technical team and the underlying infrastructure and the clusters, and how do I get the compute close to the data, and what about data sovereignty and the physical storage layer and the cost? These are all good questions, and I'm not saying they're trivial, but the answer is that these are implementation details that are pushed to a self-service layer managed by a group of engineers that serves the data owners. And as long as the domain expert slash data owner is driving monetization, this piece of the puzzle becomes self-funding. As I said before, Snowflake has to help these users optimize their spend with predictive tooling that aligns spend with value and shows ROI. While there may not be a strong motivation for Snowflake to do this, my belief is that they'd better get good at it or someone else will do it for them and steal their ideas.

All right, let me end with some ETR data to show you just how Snowflake is getting a foothold in the market. Followers of this program know that ETR uses a consistent methodology to go to its practitioner base, its buyer base, each quarter and ask them a series of questions. They focus on the areas that the technology buyer is most familiar with, and they ask a series of questions to determine the spending momentum around a company within a specific domain. This chart shows one of my favorite examples. It shows data from the October ETR survey of 1,438 respondents, and it isolates on the data warehouse and database sector. I know, I just got through telling you that the world is going to change and Snowflake's not a data warehouse vendor, but there's no construct today in the ETR dataset to cut on a data cloud or a globally distributed data mesh, so you're going to have to deal with this. What this chart shows is net score on the y-axis. That's a measure of spending velocity, and it's calculated by asking customers whether they're spending more or less on a particular platform and then subtracting the lesses from the mores. It's more granular than that, but that's the basic concept.
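Just to make that arithmetic concrete, here's a toy version of the calculation; the percentages are invented for illustration and are not actual ETR survey results.

```python
# Toy illustration of the net-score arithmetic just described; these
# percentages are invented, not actual ETR results, and ETR's real
# methodology is more granular than this.
def net_score(adopting, increasing, flat, decreasing, replacing):
    """Inputs are percentages of respondents; the buckets sum to 100."""
    assert adopting + increasing + flat + decreasing + replacing == 100
    return (adopting + increasing) - (decreasing + replacing)

# Hypothetical survey breakdown for some vendor:
print(net_score(adopting=20, increasing=45, flat=25,
                decreasing=7, replacing=3))  # prints 55
```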
Now, on the x-axis is market share, which is ETR's measure of pervasiveness in the survey. You can see, superimposed in the upper right-hand corner, a table that shows the net score and the Shared N for each company. Shared N is the number of mentions in the dataset within, in this case, the data warehousing sector. Snowflake once again leads all players with a 75% net score. This is a very elevated number and is higher than that of all other players, including the big cloud companies. We've been tracking this for a while, and Snowflake is holding firm on both dimensions. When Snowflake first hit the dataset, it was in the single digits along the horizontal axis, and it continues to creep to the right as it adds more customers.

Now, here's another chart. I call it the wheel chart, and it breaks down the components of Snowflake's net score, or spending momentum. The lime green is new adoption, the forest green is customers spending more than 5% more, the gray is flat spend, the pink is spending declining by more than 5%, and the bright red is retiring the platform. So you can see the trend; it's all momentum for this company. What Snowflake has done is grab hold of the market by simplifying the data warehouse, but the strategic aspect of that is that it enables the data cloud, leveraging the global mesh concept. And the company has introduced a data marketplace to facilitate data sharing across ecosystems. This is all about network effects.

In the mid-to-late 1990s, as the internet was being built out, I worked at IDG with Bob Metcalfe, who was the publisher of InfoWorld. During that time, we'd go on speaking tours all over the world, and I would listen very carefully as he applied Metcalfe's law to the internet. Metcalfe's law states that the value of a network is proportional to the square of the number of connected nodes, or users, on that system. Said another way, while the cost of adding new nodes to a network scales linearly, the consequent value grows as the square of the number of users. Now, apply that to the data cloud. The marginal cost of adding a user is negligible, practically zero, but the value of being able to access any data set in the cloud, well, let me just say this: there's no limitation to the magnitude of the market. My prediction is that this idea of a global mesh will completely change the way leading companies structure their businesses, and particularly their data architectures. It'll be the technologists who serve the domain specialists, as it should be.

Okay, well, what do you think? DM me @dvellante, or email me at david.vellante@siliconangle.com, or comment on my LinkedIn. Remember, these episodes are all available as podcasts, so please subscribe wherever you listen. I publish weekly on wikibon.com and siliconangle.com, and don't forget to check out etr.plus for all the survey analysis. This is Dave Vellante for theCUBE Insights, powered by ETR. Thanks for watching, be well, and we'll see you next time.