From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante. A new era of data is upon us and we're in a state of transition. You know, even our language reflects that. We rarely use the phrase big data anymore. Rather we talk about digital transformation or digital business or data-driven companies. Many have come to the realization that data is not the new oil, because unlike oil, the same data can be used over and over for different purposes. We still use terms like data as an asset. However, that same narrative, when it's put forth by the vendor and practitioner communities, includes further discussions about democratizing and sharing data. Let me ask you this, when was the last time you wanted to share your financial assets with your coworkers or your partners or your customers? Hello everyone and welcome to this week's Wikibon Cube Insights, powered by ETR. In this Breaking Analysis, we want to share our assessment of the state of the data business. We'll do so by looking at the data mesh concept and how a leading financial institution, JPMorgan Chase, is practically applying these relatively new ideas to transform its data architecture. Let's start by looking at what data mesh is. As we've previously reported many times, data mesh is a concept and set of principles that was introduced in 2018 by Zhamak Dehghani, who's a director of technology at ThoughtWorks, a global consultancy and software development company. She created this movement because her clients, who are some of the leading firms in the world, had invested heavily in predominantly monolithic data architectures that had failed to deliver desired outcomes and ROI. So her work went deep into trying to understand that problem. 
And a main conclusion that came out of this effort was that the world of data is distributed, and shoving all the data into a single monolithic architecture is an approach that fundamentally limits agility and scale. Now a profound concept of data mesh is the idea that data architectures should be organized around business lines with domain context, and that the highly technical and hyper-specialized roles of a centralized cross-functional team are a key blocker to achieving our data aspirations. This is the first of four high-level principles of data mesh. So first, again, the business domain should own the data end to end, rather than have it go through a centralized big data technical team. Second, a self-service platform is fundamental to a successful architectural approach, where data is discoverable and shareable across an organization and an ecosystem. Third, product thinking is central to the idea of data mesh. In other words, data products will power the next era of data success. And fourth, data products must be built with governance and compliance that is automated and federated. Now there's a lot more to this concept and there are tons of resources on the web to learn more, including an entire community that has formed around data mesh. But this should give you a basic idea. Now the other point is that in observing Zhamak Dehghani's work, she has deliberately avoided discussions around specific tooling, which I think has frustrated some folks, because we all like to have references that tie to products and tools and companies. So this has been a double-edged sword. On the one hand it's good, because data mesh is designed to be tool agnostic and technology agnostic. On the other hand, it's led some folks to take liberties with the term data mesh and claim mission accomplished when their solution may be more marketing than reality. So let's look at JPMorgan Chase and their data mesh journey. 
That's why I got really excited when I saw this past week that a team from JPMC held a meetup to discuss what they called data lake strategy via data mesh architecture. When I saw that title, I thought, well, that's a weird title. And I wondered, are they just taking their legacy data lakes and claiming they're now transformed into a data mesh? But in listening to the presentation, which was over an hour long, the answer is a definitive no, not at all in my opinion. A gentleman named Scott Hirleman organized the session, which comprised these three speakers: James Reid, who's a divisional CIO at JPMC; Arup Nanda, who is a technologist and architect; and Serita Bakst, who is an information architect, again, all from JPMC. This was the most detailed and practical discussion that I've seen to date about implementing a data mesh. And this is JPMorgan's approach, and we know they're extremely savvy and technically sound, and they've invested, I mean, it has to be billions in the past decade on data architecture across their massive company. And rather than dwell on the downsides of their big data past, I was really pleased to see how they're evolving their approach and embracing new thinking around data mesh. So today, we're going to share some of the slides that they used and comment on how it dovetails into the concept of data mesh that Zhamak Dehghani has been promoting, at least as we understand it, and dig a bit into some of the tooling that is being used by JPMorgan, particularly around its AWS cloud. So the first point is it's all about business value. JPMC, they're in the money business. And in that world, business value is everything. So James Reid, the CIO, showed this slide and talked about their overall goals, which centered on a cloud-first strategy to modernize the JPMC platform. I think it's simple and sensible, but there were three factors on which he focused: cut costs, always, sure, got to do that. 
Number two was about unlocking new opportunities or accelerating time to value. But I was really happy to see number three, data reuse. That's a fundamental value ingredient in the slide that he's presenting here. And his commentary was all about aligning with the domains and maximizing data reuse, i.e., data is not like oil, and making sure there's appropriate governance around that. Now, don't get caught up in the term data lake. I think it's just how JPMorgan communicates internally. It's invested in the data lake concept, so they use water analogies. They use things like data puddles, for example, which are single-project data marts, or data ponds, which comprise multiple data puddles. And these can feed into data lakes. And as we'll see, JPMC doesn't strive to have a single version of the truth from a data standpoint that resides in a monolithic data lake. Rather, it enables the business lines to create and own their own data lakes that comprise fit-for-purpose data products. And they do have a single truth of metadata. We'll get to that. But generally speaking, each of the domains will own their own data end to end and be responsible for those data products. We'll talk about that more. Now the genesis of this was sort of a cloud-first platform. JPMC is leaning into public cloud, which is ironic since in the early days of cloud, all the financial institutions were like, never. Well, anyway, JPMC is going hard after it. They're adopting agile methods and microservices architectures. And it sees cloud as a fundamental enabler, but it recognizes that on-prem data must be part of the data mesh equation. Here's a slide that starts to get into some of that generic tooling, and then we'll go deeper. And I want to make a couple of points here that tie back to Zhamak Dehghani's original concept. The first is that unlike many data architectures, this puts data as products right in the fat middle of the chart. 
The data products live in the business domains and are at the heart of the architecture. The databases, the Hadoop clusters, the files and APIs on the left-hand side, they serve the data product builders. The specialized roles on the right-hand side, the DBAs, the data engineers, the data scientists, the data analysts, and we could have put in quality engineers, et cetera, they serve the data products. Because the data products are owned by the business, they inherently have the context, and that is the middle of this diagram. And you can see at the bottom of the slide the key principles include domain thinking and end-to-end ownership of the data products. They build it, they own it, they run it, they manage it. At the same time, the goal is to democratize data with a self-serve data platform. One of the biggest points of contention on data mesh is governance, and as Serita Bakst said on the meetup, metadata is your friend, and she kind of made a joke. She said this sounds kind of geeky, but it's important to have a metadata catalog to understand where data resides, the data lineage and overall change management. So to me, this really passed the data mesh sniff test pretty well. Let's look at data as products. CIO Reid said the most difficult thing for JPMC was getting their heads around data products, and they spent a lot of time getting this concept to work. Here's the slide they used to describe their data products as it related to their specific industry. They said a common language and taxonomy is very important, and you can imagine how difficult that was. They said, for example, it took a lot of discussion and debate to define what a transaction was. But you can see at a high level these three product groups around wholesale credit risk, party, and trade and position data as products, and each of these can have sub-products, like party will have know your customer, KYC, for example. 
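The metadata catalog point is worth making concrete. Here's a minimal Python sketch of what a catalog entry with lineage links might look like; the `DataProduct` class, its fields, and the sample product names are our own illustration of the idea, not JPMC's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative catalog entry for a domain-owned data product."""
    name: str             # e.g. "party", "trade", "wholesale-credit-risk"
    domain: str           # owning line of business (end-to-end ownership)
    owner: str            # the accountable product owner
    location: str         # where it lives; logical, not infrastructure-bound
    schema_version: str   # supports change management
    upstream: list = field(default_factory=list)  # lineage: source product names

# name -> DataProduct; registering here is what makes a product discoverable
CATALOG = {}

def register(product):
    CATALOG[product.name] = product

def lineage(name):
    """Walk upstream links to answer: where did this data come from?"""
    seen, stack = [], [name]
    while stack:
        for up in CATALOG[stack.pop()].upstream:
            if up not in seen:
                seen.append(up)
                stack.append(up)
    return seen
```

A catalog like this is what lets an authorized user ask where a data element resides and how it got there, without the data itself ever being centralized.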
So a key for JPMC was to start at a high level and iterate to get more granular over time. So lots of decisions had to be made around who owns the products and the sub-products. The product owners, interestingly, had to defend why that product should even exist, what boundaries should be in place, and what data sets do and don't belong in the various products. And this was a collaborative discussion. I'm sure there was contention around that between the lines of business, and around which sub-products should be part of these circles. They didn't say this, but tying it back to data mesh, each of these products, whether in a data lake or a data hub or a data pond or a data warehouse or a data puddle, is a node in the global data mesh that is discoverable and governed. And supporting this notion, Serita said that this should not be infrastructure bound. Logically, any of these data products, whether on-prem or in the cloud, can connect via the data mesh. So again, I felt like this really stayed true to the data mesh concept. Now, let's look at some of the key technical considerations that JPMC discussed in quite some detail. This chart here shows a diagram of how JPMorgan thinks about the problem. And some of the challenges they had to consider were how to write to various data stores. Can you, and how can you, move data from one data store to another? How can data be transformed? Where is the data located? Can the data be trusted? How can it be easily accessed? Who has the right to access that data? These are all problems that technology can help solve. And to address these issues, Arup Nanda explained that the heart of this slide is the data ingester, which takes the place of traditional ETL. All data producers and contributors send their data to the ingester, and the ingester then registers the data so it's in the data catalog. It does a data quality check and tracks the lineage. 
Then data is sent to the router, which persists it in a data store; the best destination is informed by the registration. This is designed to be a flexible system. In other words, the data store for a data product is not fixed. It's determined at the point of inventory, and that allows changes to be easily made in one place. The router simply reads that optimal location and sends the data to the appropriate data store. Now, as you see, the schema inferrer is used when there is no clear schema on write. In this case, the data product is not allowed to be consumed until the schema is inferred, so the data goes into a raw area, the inferrer determines the schema, and then it updates the inventory system so that the data can be routed to the proper location and properly tracked. So that's some of the detail of how the sausage factory works in this particular use case. It was very interesting and informative. Now, let's take a look at the specific implementation on AWS and dig into some of the tooling. As described in some detail by Arup Nanda, this diagram shows the reference architecture used by this group within JPMorgan, and it shows all the various AWS services and components that support their data mesh approach. So start with the authorization block right there, underneath Kinesis. Lake Formation is the single point of entitlement, and there are a number of buckets, including, you can see there, the raw area that we just talked about, a trusted bucket, a refined bucket, et cetera. Depending on the data characteristics, the data catalog registration block, where you see the Glue catalog, determines in which bucket the router puts the data. And you can see the many AWS services in use here: identity management, EMR, the Elastic MapReduce cluster from the legacy Hadoop work done over the years. 
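The ingest, register and route flow described above can be sketched in a few lines of Python. To be clear, the class and function names here (`Inventory`, `Router`, `infer_schema`) are our own illustration of the pattern, not JPMC's code, and the "stores" are just in-memory lists.

```python
def infer_schema(record):
    """Schema inferrer: derive a schema when none was declared on write."""
    return {key: type(value).__name__ for key, value in record.items()}

class Inventory:
    """Registration/inventory: catalogs each product and its current best store."""
    def __init__(self):
        self.catalog = {}       # product -> schema (the metadata catalog entry)
        self.destinations = {}  # product -> target store; changeable in one place

    def register(self, product, schema, destination):
        self.catalog[product] = schema
        self.destinations[product] = destination

class Router:
    """Reads the optimal location from the inventory and persists the record."""
    def __init__(self, inventory, stores):
        self.inventory = inventory
        self.stores = stores    # store name -> list, standing in for real stores

    def route(self, product, record):
        dest = self.inventory.destinations[product]  # looked up per write, not fixed
        self.stores[dest].append(record)
        return dest

def ingest(product, record, inventory, router, raw_area, schema=None):
    """Ingester: the single entry point for all producers and contributors.
    (The real system also runs quality checks and tracks lineage here.)"""
    if schema is None:                 # no schema on write:
        raw_area.append(record)        # park the data in the raw area,
        schema = infer_schema(record)  # infer the schema, then update inventory
    inventory.register(product, schema, destination="trusted")  # toy placement decision
    return router.route(product, record)
```

The point of the indirection is the one made in the talk: because the router reads the destination from the inventory on every write, moving a data product to a different store is a one-place change.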
Redshift Spectrum and Athena: JPMC uses Athena for single-threaded workloads and Redshift Spectrum for nested types, so the two can be queried independently of each other. Now remember, very importantly, in this use case there is not a single Lake Formation deployment. Rather, multiple lines of business will be authorized to create their own lakes, and that creates a challenge. So how can that be done in a flexible and automated manner? That's where the data mesh comes into play. So JPMC came up with this federated Lake Formation accounts idea, where each line of business can create as many data producer or consumer accounts as they desire and roll them up into their master line-of-business Lake Formation account. And they cross-connect these data products in a federated model. These all roll up into a master Glue catalog, so that any authorized user can find out where a specific data element is located. This is like a superset catalog that comprises multiple sources and syncs up across the data mesh. So again, to me, this was a very well thought out and practical application of data mesh. Yes, it includes some notion of centralized management, but much of that responsibility has been passed down to the lines of business. It does roll up to a master catalog, but that's a metadata management effort and seems necessary to ensure federated and automated governance. As well, at JPMC, the office of the chief data officer is responsible for ensuring governance and compliance throughout the federation. All right, so let's take a look at some of the suspects in this world of data mesh and bring in the ETR data. Now of course, ETR doesn't have a data mesh category. There's no such thing as a data mesh vendor. You build a data mesh, you don't buy it. So what we did is we used the ETR dataset to select and filter on some of the culprits that we thought might contribute to the data mesh to see how they're performing. 
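For a sense of the tooling, here's a hedged sketch of two building blocks such a federated model could use on AWS: a Lake Formation cross-account grant, and a Glue resource link that surfaces a database shared from another account in the local catalog. The API shapes (`grant_permissions`, and `create_database` with `TargetDatabase`) are real boto3 calls, but the account IDs, database names and function names are placeholders, not JPMC's actual configuration.

```python
def grant_cross_account_select(lf_client, consumer_account_id, database, table):
    """Use Lake Formation (the single point of entitlement) to grant a
    consumer account SELECT on a producer's Glue table."""
    return lf_client.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": consumer_account_id},
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        Permissions=["SELECT"],
    )

def link_shared_database(glue_client, producer_account_id, database, link_name):
    """Create a Glue resource link so a database shared from another account
    shows up in this account's catalog -- one way a master catalog can roll up
    products so any authorized user can find where a data element lives."""
    return glue_client.create_database(
        DatabaseInput={
            "Name": link_name,
            "TargetDatabase": {
                "CatalogId": producer_account_id,
                "DatabaseName": database,
            },
        }
    )
```

In practice, the clients would come from `boto3.client("lakeformation")` and `boto3.client("glue")` in the consuming account, and the grants themselves would be automated so the lines of business can self-serve.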
This chart depicts a popular view that we often like to share. It's a two-dimensional graphic with net score, or spending momentum, on the vertical axis and market share, or pervasiveness in the dataset, on the horizontal axis. And we filtered the data on sectors such as analytics, data warehouse and adjacencies to things that might fit into data mesh. We think these pretty well reflect participation in data mesh. It's certainly not all-encompassing, and it's obviously a subset of all the vendors that could play in this space. Let's make a few observations. Now as is often the case, Azure and AWS are almost literally off the charts, with very high spending velocity and a large presence in the market. Oracle, you can see, also stands out, because much of the world's data lives inside of Oracle databases. It doesn't have the spending momentum or growth, but the company remains prominent. And you can see Google Cloud doesn't have nearly the presence in the dataset, but its momentum is highly elevated. Remember that red dotted line there, the 40% line: anything over that indicates elevated spending momentum. Let's go to Snowflake. Snowflake has consistently shown itself to be the gold standard in net score in the ETR dataset. It continues to maintain highly elevated spending velocity in the data. In many ways, Snowflake, with its data marketplace, its data cloud vision and its data sharing approach, fits nicely into the data mesh concept. Now, we caution you, Snowflake has used the term data mesh in its marketing, but in our view it lacks clarity, and we feel like they're still trying to figure out how to communicate what that really is. But we think there's a lot of potential in that vision. Databricks is also interesting, because the firm has momentum and we expect further elevated levels on the vertical axis in upcoming surveys, especially as it readies for its IPO. The firm has a strong product and managed service and is really one to watch. 
Now we included a number of other database companies for obvious reasons, like Redis and Mongo, MariaDB, Couchbase and Teradata. SAP is in there as well; it's not all database, but SAP is prominent, so we included it, as is IBM, more of a traditional database player, also with a big presence. Cloudera includes Hortonworks, and HPE Ezmeral comprises the MapR business that HPE acquired. So these guys got the big data movement started: Cloudera and Hortonworks, which was born out of Yahoo, the early Hadoop innovator, while MapR went off on its own course, and now that's all kind of come together in various forms. And of course, Talend and Informatica are there, two data integration companies that are worth noting. We also included some of the AI and ML specialists and data science players in the mix, like DataRobot, which just did a monster $250 million round, Dataiku, H2O.ai and ThoughtSpot, which is all about democratizing data and injecting AI, and I think fits well into the data mesh concept. And you know, we put VMware Cloud in there for reference, because it really is the predominant on-prem infrastructure platform. All right, let's wrap with some final thoughts here. First, thanks a lot to the JPMorgan team for sharing this data. I really want to encourage practitioners and technologists to go watch the YouTube video of that meetup; we'll include the link with this session. And thank you to Zhamak Dehghani and the entire data mesh community for the outstanding work that you're doing in challenging the established conventions of monolithic data architectures. The JPM presentation gives the concept real credibility. It takes data mesh well beyond concept and demonstrates how it can be, and is being, done. And you know, this is not a perfect world. You're going to start somewhere and there are going to be some failures. 
The key is to recognize that shoving everything into a monolithic data architecture won't support the massive scale and agility that you're after. It's maybe fine for smaller use cases and small firms, but if you're building a global platform and a data business, it's time to rethink data architecture. Now, much of this is enabled by the cloud, but cloud first doesn't mean cloud only. It doesn't mean you'll leave your on-prem data behind. On the contrary, you have to include non-public cloud data in your data mesh vision, just as JPMC has done. You've got to get some quick wins. That's crucial so you can gain credibility within the organization and grow. One of the key takeaways from the JPMorgan team is there is a place for dogma, like organizing around data products and domains and getting that right. On the other hand, you have to remain flexible, because technologies are going to come and go. So you've got to be flexible in that regard. And look, if you're going to embrace the metaphor of water, like puddles and ponds and lakes, we suggest, maybe a little tongue in cheek, but still we believe in this, that you expand your scope to include data oceans, something John Furrier and I have talked about and laughed about extensively in theCUBE. Data oceans, it's huge. Transcend the data lake. Think oceans. And think about this: just as we're evolving our language, we should be evolving our metrics. Much of the last decade of big data was about just getting the stuff to work, getting it up and running, standing up infrastructure and managing massive amounts of data. And there were many KPIs built around, again, standing up that infrastructure and ingesting data, a lot of technical KPIs. This decade is not just about enabling better insights, it's more than that. Data mesh points us to a new era of data value, and that requires new metrics around monetizing data products. 
Like how long does it take to go from data product conception to monetization? How does that compare to what it is today? And what is the time to quality? If the business owns the data and the business has the context, the quality that comes out of the chute should be, at a basic level, pretty good, and at a higher mark than what comes out of a big data team with no business context. Automation, AI and, very importantly, organizational restructuring of our data teams will heavily contribute to success in the coming years. So we encourage you: learn, lean in and create your data future. Okay, that's it for now. Remember, these episodes are all available as podcasts wherever you listen. All you've got to do is search Breaking Analysis Podcast and please subscribe. Check out ETR's website at etr.plus for all the data and all the survey information. We publish a full report every week on wikibon.com and siliconangle.com, and you can get in touch with us. Email me, david.vellante@siliconangle.com. You can DM me at dvellante or you can comment on my LinkedIn posts. This is Dave Vellante for theCUBE Insights powered by ETR. Have a great week everybody. Stay safe, be well and we'll see you next time.