From theCUBE Studios in Palo Alto and Boston, bringing you data-driven insights from theCUBE and ETR. This is Breaking Analysis with Dave Vellante.

The introduction and socialization of data mesh has caused practitioners, business technology executives, and technologists to pause and ask some probing questions about the organization of their data teams, their data strategies, future investments, and their current architectural approaches. Some in the technology community have embraced the concept, others have twisted the definition, while still others remain oblivious to the momentum building around data mesh. We are in the early days of data mesh adoption. Organizations that have taken the plunge will tell you that aligning stakeholders is a non-trivial effort, but necessary to break through the limitations that monolithic data architectures and highly specialized teams have imposed on frustrated business and domain leaders. However, practical data mesh examples often lie in the eye of the implementer and may not strictly adhere to the principles of data mesh. Part of the problem is a lack of open technologies and standards that can accelerate adoption and reduce friction, and that's what we're going to talk about today: some of the key technology and architecture questions around data mesh.

Hello and welcome to this week's Wikibon Cube Insights, powered by ETR. In this Breaking Analysis, we welcome back the creator of data mesh and director of emerging technologies at ThoughtWorks, Zhamak Dehghani. Hello, Zhamak. Thanks for being here today.

Hi, Dave. Thank you for having me back. It's always a delight to connect and have a conversation.

Great, looking forward to it. Okay, before we get into the technology details, I just want to quickly share some data from our friends at ETR. Despite the importance of data initiatives, since the pandemic CIOs and IT organizations have had to juggle a few other priorities. That's why in the survey data, cybersecurity and cloud computing are rated as the two most important priorities; but analytics, machine learning, and AI, which are data topics, still make the top of the list, well ahead of many other categories. And look, a sound data architecture and strategy is fundamental to digital transformation, and much of the past two years, as we've often said, has been like a forced march into digital. So while organizations are moving forward, they really have to think hard about the data architecture decisions they make, because those decisions are going to impact them for years to come, aren't they, Zhamak?

Yes, absolutely. I mean, we are slowly moving from logic-based, algorithmic computation and decision-making to model-based computation and decision-making, where we exploit the patterns and signals within the data. So data becomes a very important ingredient not only of decision-making, analytics, and discovering trends, but also of the applications and features that we build. We can't really ignore it. And as we see, the existing challenge around getting value from data is no longer access to computation; it's access to trustworthy, reliable data at scale.

Yeah, and you see these domains coming together with cloud, and obviously it has to be secure and trusted, and that's why we're here today talking about data mesh. So let's get into it. Zhamak, first, your new book is out — Data Mesh: Delivering Data-Driven Value at Scale — just recently published.
So congratulations on getting that done, awesome. Now, in a recent presentation you pulled excerpts from the book, and we're going to talk through some of the technology and architectural considerations. Just quickly for the audience, the four principles of data mesh are domain-driven ownership, data as a product, the self-serve data platform, and federated computational governance. I want to start with the self-serve platform and some of the data you shared recently. You say that data mesh serves autonomous, domain-oriented teams, versus existing platforms, which serve a centralized team. Can you elaborate?

Sure. The role of the platform is to lower the cognitive load for domain teams — for the people who are focusing on the business outcomes, the technologists who are building the applications — so that they can work with data, whether they're building analytics, automated decision-making, or intelligent models. They need to be able to get access to data and use it. So, stepping back for a moment, the role of the platform is to empower and enable these teams. Data mesh by definition is a scale-out, decentralized model that wants to give autonomy to cross-functional teams, so at its core it requires a set of tools that work really well in that decentralized model.

When we look at the existing platforms, they try to achieve a similar outcome, right? Lower the cognitive load, give the tools to data practitioners to manage data at scale. But today, the centralized data teams' job isn't directly aligned with any one business unit or business outcome in terms of getting value from data; their job is to manage the data and make it available for the cross-functional teams or business units to use. So the platforms they've been given are really centered around, or tuned to work with, that structure — a centralized team. And although on the surface it seems, why not? Why can't I use my cloud storage or computation or data warehouse in a decentralized way? You should be able to, but some changes still need to happen to those underlying platforms. As an example, some cloud providers simply have hard limits on the number of storage accounts you can have, because they never envisioned you'd have hundreds of lakes. They envisioned one or two, maybe ten lakes, right? They really envisioned centralized data, not decentralized data. So I think we're seeing a shift in thinking toward enabling autonomous, independent teams versus a centralized team.

Just a follow-up, if I may — we could be here a while. This assumes that you've sorted out the organizational considerations, that you've defined what a data product is, and its sub-products. And people will say — of course, we use the term monolithic as a pejorative, let's face it — the data warehouse crowd would say, well, that's what data marts did, so we've got that covered. But the premise of data mesh, if I understand it, is that whether it's a data mart, a data warehouse, a data lake, or a Snowflake warehouse, it's a node on the mesh. Don't build your organization around the technology; let the technology serve the organization. Is that right?

That's the perfect way of putting it, exactly.
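To make the "node on the mesh" idea a bit more concrete, here is a minimal, hypothetical sketch of what a domain team might hand to a self-serve platform: the data product described declaratively, so the platform serves the organization rather than the other way around. Data mesh prescribes no particular format, and every name below is illustrative.

```python
# A hypothetical data product descriptor: the domain team declares what it
# owns and shares; the self-serve platform provisions the rest.
from dataclasses import dataclass, field

@dataclass
class OutputPort:
    name: str              # e.g. "daily-orders"
    schema_ref: str        # a versioned contract, not the internal storage layout
    access_modes: list[str] = field(default_factory=lambda: ["sql", "events"])

@dataclass
class DataProduct:
    domain: str                      # the owning business domain
    name: str
    owner_team: str                  # cross-functional team, accountable end to end
    output_ports: list[OutputPort]
    policies: list[str]              # e.g. ["pii-masked", "retention-90d"]

orders = DataProduct(
    domain="order-management",
    name="orders",
    owner_team="order-domain-team",
    output_ports=[OutputPort("daily-orders", "schemas/orders/v2")],
    policies=["pii-masked", "retention-90d"],
)
# platform.register(orders)  # a hypothetical platform call that would provision
#                            # storage, pipelines, and access control from the spec
```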
I mean, for a very long time we've approached the decomposition of complexity around technology, right? And maybe that's a good segue to the next item on that list. We said: I need to decompose based on whether I want access to raw data — put it on the lake — or access to modeled data — put it in the warehouse — and whether I need a team in the middle to move the data around, and then we tried to fit the organization into that model. Data mesh really inverts that. As you said, look at the organizational structure first — the boundaries around which your organization and operations can scale — and then, as the second layer, look at the technology and how you decompose it.

Okay, so let's go to that next point and talk about how you serve and manage autonomous, interoperable data products, where code, data, and policy, you say, are treated as one unit — whereas your contention is that existing platforms have independent management and dashboards for catalogs, storage, et cetera. Maybe double-click on that a bit.

Yeah. That functional or technical decomposition of concerns is one way — a very valid way — of decomposing complexity, and then building independent solutions to address each concern, and that's what we see in the technology landscape today. You see technologies that take care of managing your data, bringing it under some sort of control and modeling. You see technology that moves that data around and performs various transformations and computations on it. And then you see technology that tries to overlay some level of meaning — metadata, understandability, and policy. That's where your data processing pipeline technologies, your warehouse and lake storage technologies, and then governance come into play. Over time we decompose and recompose, construct and reconstruct, but right now that's where we stand.

For data mesh to really become a reality — where independent sources of data and teams can responsibly share data in a way that can be understood right then and there, where policies are imposed right at the point the data gets accessed at the source, and in a resilient manner, so that changes to the structure or schema of the data don't cause downstream downtime — we've got to think about a new nucleus, a new unit of data sharing. We need to bring the transformation, the governance, and the data itself together around these decentralized nodes on the mesh. So that's another deconstruction and reconstruction that needs to happen around the technology, to orient ourselves around the domains and around the data — the logic and the meaning of the data itself.

Great, got it. And we're going to talk more about the importance of data sharing and its implications. But the third point deals with how operational and analytical technologies are constructed. You've got an app dev stack and you've got a data stack, and you've made the point many times that we've contextualized our operational systems, but not our data systems — they remain separate. Maybe you could elaborate on this point.
Yes, I think this again has a historical background. For a really long time, applications have dealt with the features and logic of running the business, encapsulating the data and state they need to run each feature or business function. Then, for anything analytically driven — which required access to data across these applications, across a longer dimension of time, around different subjects within the organization — for this analytical data, we made a decision: okay, let's leave those applications and those databases aside; we'll extract the data out, transform it, and load it into the analytical data stack. Downstream from it, the analytical data users — the data analysts, the data scientists, and a growing portfolio of users — use that data stack. That led to this separation into dual stacks with point-to-point integration. Applications went down the path of transactional databases or document stores, using APIs for communication, while on the other side we went to lake storage or data warehouses. And that, again, enforces this separation of data versus app, right?

If we're moving to a world where our ambitions are around making applications more intelligent, making them data-driven, these two worlds need to come closer — as in, ML and analytics get embedded into those applications themselves, and data sharing, as a very essential ingredient of that, becomes closer to those applications. So if you're looking at a cross-functional app, data, and business team, the technology stacks can't be so segregated, right? There has to be a continuum of experience — from app delivery, to sharing the data, to using that data to embed models back into those applications — and that continuum of experience requires well-integrated technologies. To give you an example — and in some sense we are somewhat moving in that direction — when we talk about data sharing or data modeling, applications use one set of APIs: HTTP-compliant GraphQL or REST APIs. On the other hand, you have the proprietary, SQL-like model: connect to my database and run SQL. Those are two very different models of representing and accessing data. We have to harmonize and integrate those two worlds a bit more closely to achieve those domain-oriented, cross-functional teams.

Yeah, we're going to talk later about some of the gaps, and how you actually look at them as opportunities more than barriers — they are barriers, but they're opportunities for more innovation. Let's go on to the fourth one. The next point deals with the roles the platform serves. Data mesh proposes that domain experts own the data, take responsibility for it end to end, and are served by the technology — we referenced that before. Whereas your contention is that today, data systems are really designed for specialists — I think you use the term hyper-specialists a lot, I love that term — and the generalists are passive bystanders, waiting in line for the technical teams to serve them.

Yes. Again, the intention behind data mesh was creating a responsible data sharing model that scales out.
And I challenge any organization that has at-scale ambitions around data, or the usage of data, but relies on small pockets of very expensive specialist resources. We have no choice but to upskill and cross-skill the majority population of our technologists. We often call them generalists, right? That's a shorthand for people who can really move from one domain or one technology to another. Sometimes we call them paint-drip people; sometimes we call them T-shaped people. Regardless, we need the ability to really mobilize our generalists, and we've had to do that. At ThoughtWorks we serve a lot of clients, and like many other organizations we are also challenged with hiring specialists. So we have tested the model of having a few specialists convey and translate the knowledge to generalists and bring them forward. And of course, the platform is a big enabler of that: what is the language of using the technology? What are the APIs that delight that generalist experience? This doesn't mean no-code or low-code, that we throw away good engineering practices. Good software engineering practices remain; of course, they get adapted to the world of data to build resilient and sustainable solutions. But specialization, especially around proprietary technology, is going to be a hard one to scale.

Okay, I'm definitely going to come back and pick your brain on that one. And on your point about scale: in the practical examples I've seen of companies that have implemented data mesh — there's only a handful that I've really gone deep with — in all cases it was their Hadoop instances; the clusters wouldn't scale, and they couldn't scale the business around them. So that's really a key point; it was a common pattern. Now, I think in all cases they went to a data lake model in AWS, and maybe that violates some of the principles, but we'll come back to that. Let me go on to the next one. Of course, data mesh leans heavily toward decentralization to support domain ownership over centralized approaches, and we certainly see the public cloud players and database companies as key actors here, with very large installed bases pushing a centralized approach. So my question is, how realistic is this next point, where decentralized technologies rule the roost?

If you look at the places in our industry where decentralization has succeeded, they relied heavily on standardization of connectivity across different components of technology. And right now, you're right, the way we get value from data relies on collection. At the end of the day — whether you have a deep learning model that you're training or reports to generate — the model is bringing your data to a place where you can collect it so you can use it. That naturally leads to a set of technologies that try to operate as a full-stack, integrated, proprietary solution with no intention of opening data for sharing. Conversely, if you think about the internet itself, the web itself, or microservices — even at the enterprise level, not the planetary level — they succeeded as decentralized technologies to a large degree because of their emphasis on openness and sharing: API sharing.
In the API world, we don't say, I will build a platform to manage your logic or your applications — maybe to a degree, but we've actually moved away from that. We say, I will build a platform that opens up your applications, that manages your APIs, your interfaces, and gives you access through APIs. So I think the shift needs to happen there: that definition of decentralized really means composable, open pieces of technology that can play nicely with each other, rather than a full stack — I'll have control of your data, yet be somewhat decentralized within the boundary of my platform. That's simply not going to scale if data needs to come from different platforms, different locations, different geographies. It needs a rethink.

Okay, thank you. And the final point is that data mesh favors technologies that are domain agnostic versus those that are domain aware. I wonder if you could help me square the circle, because it's nuanced and I'm kind of a 100-level student of your work, but you have said, for example, that the data teams lack the context of the domain. Help us understand what you mean here.

Absolutely. As you said, data mesh tries to give autonomy, decision-making power, and responsibility to the people who have the context of those domains, right? The people who are really familiar with the different business domains, and naturally with the data each domain needs and the data it shares. So if the intention of the platform is to give the power to the people with the most relevant and timely context, then the platform itself, as a shared component, naturally becomes domain agnostic to a large degree. Now, "platform" is a fairly overloaded word, and if you think of it as a set of technologies that abstracts complexity and allows building the next level of solutions on top, those domains may still have their own sets of platforms that are very much domain aware. But the generalized, shareable set of technologies and tools that allows us to share data — that piece of technology needs to relinquish knowledge of the context to the domain teams and become more agnostic.

Okay, makes sense. All right, let's shift gears and talk about some of the gaps and some of the standards that are needed. You and I have talked about this a little before, but this digs deeper. What types of standards are needed? Maybe you could walk us through this graphic, please.

Sure. What I'm trying to depict here is that if we imagine a world where data can be shared from many different locations for a variety of analytical use cases, then naturally the boundary of what we call a node on the mesh encapsulates quite a few pieces internally: the data itself that it's controlling, updating, and maintaining; of course the computation and the code responsible for that data; and then the policies that continue to govern that data as long as it exists. If that's the boundary, then when we shift focus away from implementation details — we can leave those for later — what becomes really important is the seam: the APIs and interfaces that this node exposes. And I think that's where the work needs to be done and where the standards are missing.
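As a rough illustration of that seam — and only an illustration, since these standards don't exist yet — here is a sketch of an output-port interface in which the node's internals stay hidden and policy is evaluated at the moment of access. All names are hypothetical.

```python
# A sketch of a node's "seam": consumers see an interface, not the storage;
# the policy travels with the data and is checked when the data is accessed.
from typing import Any, Iterator, Protocol

class OutputPort(Protocol):
    def describe(self) -> dict[str, Any]:
        """Self-describing: schema reference, semantics, quality, policies."""
        ...

    def read(self, credentials: str) -> Iterator[dict[str, Any]]:
        """Serve records only after policy checks pass for this caller."""
        ...

class DailyOrdersPort:
    """One node-internal implementation; swappable behind the same seam."""
    def describe(self) -> dict[str, Any]:
        return {"schema": "schemas/orders/v2", "policies": ["pii-masked"]}

    def read(self, credentials: str) -> Iterator[dict[str, Any]]:
        if not self._authorized(credentials):       # policy lives with the data
            raise PermissionError("policy denied access")
        yield from self._query_internal_storage()   # storage format stays internal

    def _authorized(self, credentials: str) -> bool:
        return credentials == "analyst-token"       # stand-in for a policy engine

    def _query_internal_storage(self) -> Iterator[dict[str, Any]]:
        yield {"order_id": 1, "total": 42.0}        # stand-in for real storage
```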
And we want that seam and those interfaces to be open, because that's what allows different organizations with different boundaries of trust to share data — not just by moving the data to yet another location, but in a way that distributed workloads, distributed analytics, and distributed machine learning models can run on the data where it is. If you follow that line of thinking — connection of data versus collection of data — then the very, very important piece, one that needs really deep thinking and I don't claim to have done it all, is: how do we share data responsibly and sustainably, in a way that is not brittle?

Think about how we share data today. One of the very common ways is: I'll give you a JDBC endpoint to your database of choice, and now as a technology — as a user, actually — you have access to the schema of the underlying data and can run various SQL queries on it. That's very simple and easy to get started with, and that's why SQL is the evergreen standard, or semi-standard, or pseudo-standard, that we all use. But it's also very brittle, because we become dependent on an underlying schema and format that was designed to tell the computer how to store and manage the data. So the data sharing APIs of the future really need to remove these brittle dependencies, and to share not only the data but what we call metadata, I suppose — an additional set of characteristics that is always shared along with the data to make its usage ethical, and also friendly for the users.

The other element of that data sharing API is to allow computation to run where the data exists. If you think about SQL again as a simple, primitive example of computation — when we select, when we filter, when we join — the computation is happening on that data. So maybe there's a next level of articulating distributed computation on data, one that, say, trains models, where your language primitives change to allow sophisticated analytical workloads to run on the data more responsibly, with policies and access control enforced. So that output port I mentioned is simply about next-generation, responsible data sharing APIs, suitable for decentralized analytical workloads.

Okay — I'm not trying to bait you here, but I have a follow-up as well. Schema, for all its good, creates constraints; and no schema on write didn't work, because it was just a free-for-all and it created the data swamps. Now you have technology companies trying to solve that problem. Take Snowflake, for example, enabling data sharing — but within its proprietary environment. Databricks is certainly doing something too, coming at it from its own angle, bringing some of the best of the data warehouse together with data science. Is your contention that those remain proprietary, de facto standards, and that what we need is more open standards? Maybe you could comment.

Sure, there are two points. One, as you mentioned, is open standards that actually make the underlying platform invisible. My litmus test for a technology provider claiming to be data mesh compliant is: is your platform invisible? As in, can I replace it with another and still get the same data sharing experience that I need? So part of it is that — part of it is open standards.
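A toy rendering of that litmus test, with entirely hypothetical adapter classes: if consumers program against an open sharing contract, the platform behind it can be swapped without touching consumer code.

```python
# Two hypothetical adapters that speak the same open contract; neither class
# reflects a real vendor API.
class SnowflakeAdapter:
    def read(self, product: str):
        yield from ()  # would translate the open protocol into vendor calls

class DatabricksAdapter:
    def read(self, product: str):
        yield from ()  # same contract, different engine underneath

def consumer(port):
    # written once against the contract -- the platform is "invisible"
    return list(port.read("order-management/orders"))

consumer(SnowflakeAdapter())    # swapping in DatabricksAdapter() changes nothing
```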
So they're not really proprietary. The other angle for sharing data across different platforms, so that we don't get stuck with one technology or another, is around the APIs — around the code that protects that internal schema. Where we are on the curve of evolution right now, we are exposing the internal structure of the data — a structure designed to optimize certain modes of access — to the end clients and application APIs. The APIs that use the data today are very much aware that this database was optimized for machine learning workloads, hence you'll deal with columnar file storage, versus this other API that's optimized for a very different, report-type, relational access organized around rows. That should become irrelevant in the data sharing APIs of the future, because as a user I shouldn't care how the data is internally optimized. The language primitives I'm using should be agnostic to the machine optimization underneath. And if you did that, perhaps this war between the warehouse and the lake and the rest would actually become irrelevant. We'd be optimizing for the best human experience as opposed to the best machine experience. We still have to do the latter, but we have to make it invisible — an implementation concern. So that's the other angle: if we daydream together about the best, most resilient experience of data usage, these APIs become agnostic to the internal storage structure.

Great, thank you for that. We're up to our ankles now in the controversy, so we might as well wade all the way in — I can't let you go without addressing some of it, which you've catalyzed and which, by the way, I see as a sign of progress. This gentleman, Paul Andrew, is an architect, and he gave a presentation, I think last night, that he teased as, quote, "the theory from Zhamak Dehghani versus the practical experience of a technical architect — aka me," meaning him. And Zhamak, you were quick to shoot back that data mesh is not theory, it's based on practice — some practices are experimental, some are more baked — and that data mesh, by design, avoids the specificity of vendor or technology. And then you said, perhaps you intended to frame your post as a technology- or vendor-specific implementation. So touché, that was excellent. Now, you don't need me to defend you, but I will anyway. You spent fourteen-plus years as a software engineer and the better part of a decade consulting with some of the most technically advanced companies in the world. But I'm going to push you a little here and say that some of this tension is of your own making, because you purposefully don't talk about technologies and vendors — and sometimes doing so is instructive for us neophytes. So why don't you use specific examples of technology as frames of reference?

Yes. My role is to push us to the next level. Everybody picks their battles, and my role in this battle is to push us to think beyond what's available today. Of course, that's my public persona; on a day-to-day basis I actually work with clients and existing technology. In fact, just hours ago a colleague of mine, Sinha Jahan, and I gave a case study talk, and I intentionally got him to talk about the technology that we used to implement data mesh. But there are reasons I haven't really embraced specific technologies in my conversations.
One is that I feel the technology solutions we're using today are still not ready for the vision. We have to go through this transitional step no matter what; we have to be pragmatic, of course, and practical, and use the vendors that exist — I wholeheartedly embrace that — but it's just not my role to showcase it. I have gone through this transformation once before in my life. When microservices happened, we were building microservices-like architectures with technology that wasn't ready for them: big web application servers designed to run giant monolithic applications, and now we were trying to run little microservices on them, and the tail was wagging the dog. The environmental complexity of running those services consumed so much of our effort that we couldn't really pay attention to the business logic, the business value. And that's where we are today: the complexity of integrating existing technologies is overwhelming, capturing a great deal of our attention, cost, money, and effort, as opposed to letting us focus on the data products themselves. That's just the role I have. It doesn't mean we have to rebuild the world; we've got to make do with what we have in this transitional phase, until the new generation of technologies comes around and reshapes our landscape of tools.

Well, impressive public discipline. Your point about microservices is interesting, because a lot of those early microservices weren't so micro — and for the naysayers, look, past is not prologue, but ThoughtWorks was really early on the whole concept of microservices, so we're very excited to see how this plays out. Now, there were some other good comments. One gentleman said the most interesting aspects of data mesh are organizational, and that's how my colleague Sanjeev Mohan frames data mesh versus data fabric. I'm not sure — I think we've shown today, even just scratching the surface, that data mesh is more than that, and I still think data fabric is what NetApp defined as software-defined storage infrastructure that could serve on-prem and public cloud workloads, back in, whatever, 2016. But the point you make in the thread we're showing here — and you referenced this earlier — is a warning that segregating different modes of access will lead to fragmentation, and we don't want to repeat the mistakes of the past.

Yes. Going back to that original conversation: at a macro level, we've got this tendency to decompose complexity based on technical solutions. The conversation becomes, oh, I do batch and you do streams, so we are different; we create these bifurcations in our decisions based on the technology — I do events and you do tables, right? That sort of segregation of modes of access causes accidental complexity that we keep dealing with, because every time you create a new branch in this tree, you create a new set of tools that then somehow need to be point-to-point integrated, and you create new specialization around them. So the fewer branches we have, the better. Think really about the continuum of experiences that we need to create, and about technologies that simplify that continuum of experience.
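The Apache Beam work that comes up next is a concrete instance of that continuum. As a minimal sketch — the file name and the event fields are hypothetical — the same windowed transform serves batch and streaming, with only the source changing:

```python
import json
import apache_beam as beam

def windowed_counts(events):
    """Count events per type in fixed 60-second windows; batch or streaming,
    the computation is identical -- only the window and source change."""
    return (
        events
        | "Stamp" >> beam.Map(
            lambda e: beam.window.TimestampedValue((e["type"], 1), e["ts"]))
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "Count" >> beam.CombinePerKey(sum)
    )

with beam.Pipeline() as pipeline:
    parsed = (
        pipeline
        | "Read" >> beam.io.ReadFromText("events.jsonl")  # a bounded source;
        | "Parse" >> beam.Map(json.loads)                 # swap in an unbounded
    )                                                     # one to go streaming
    counts = windowed_counts(parsed)
```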
So one of the things, for example — to give you a past experience — I was really excited by the papers and the work that came out around Apache Beam, and generally around flow-based programming and stream processing, because basically they were saying that whether you're doing batch or streaming, it's all one stream. Sometimes the window of time over which you're computing narrows, and sometimes it widens, but at the end of the day you're just doing stream processing. It's those sorts of notions — ones that simplify and create a continuum of experience — that resonate with me personally, more than creating tribal fights over this type versus that mode of access. That's why data mesh naturally selects multimodal access to support the end users — the personas of the end users.

Okay, so the last topic I want to hit: this whole discussion of data mesh is highly nuanced, it's new, and people are going to shoehorn data mesh into their respective views of the world. We've talked about lakehouses and the three buckets, and of course there's the gentleman on LinkedIn with Azure — Microsoft has a data mesh community. So you're going to have to enlist a serious army of enforcers to adjudicate. And I wrote some of this down; it's interesting: Monte Carlo has a data mesh calculator, Starburst is leaning in, ChaosSearch sees itself as an enabler, Oracle and Snowflake both use the term data mesh, and then of course you've got big practitioners — JPMC, we've talked to Intuit, Zalando, HelloFresh has been on, and Netflix has an event-based streaming implementation. So my question is, how realistic is it that the clarity of your vision can be implemented and not polluted by really rich technology companies and others? Is it even possible?

Yes — well, that's why I say I should practice Zen, because I think it's going to be hard. What I'm hopeful about is that the socio-technical labeling of data mesh — that this is a socio-technical concern and solution, not just a technology solution — always brings us back to reality when vendors try to sell you something and say, oh, this solves all of your data mesh problems. That's just going to cause more problems down the track. So we will see; time will tell, Dave. And I count on you as one of the folks who will continue to use their platform to bring us back to the roots — the why in the first place. I dedicated a whole part of the book to the why, because, as you said, we get carried away with vendors and technology solutions trying to ride the wave, and in that story we forget the reason we're even making this transformation and spending all of these resources. So hopefully we can always come back to that.

Yeah, and I think we can. You've really given us some deep thought here, and as we pointed out, it's based on practical knowledge and experience. Look, we've been trying to solve this data problem for a long, long time, and you've not only articulated it well, you've come up with solutions. So Zhamak, thank you so much. We're going to leave it there, and we'd love to have you back.

Thank you for the conversation. I really enjoyed it, and thank you for sharing your platform to talk about data mesh.

Yeah, you bet.
All right, and I want to thank my colleague Stephanie Chan, who helps research our topics. Alex Myerson is on production, and Kristen Martin, Cheryl Knight, and Rob Hof are on editorial. Remember, all these episodes are available as podcasts — wherever you listen, just search "Breaking Analysis podcast." Check out ETR's website at etr.ai for all the data. We publish a full report every week on wikibon.com and siliconangle.com. You can reach me by email at david.vellante@siliconangle.com, DM me @dvellante, or hit us up on our LinkedIn posts. This is Dave Vellante for theCUBE Insights, powered by ETR. Have a great week, stay safe, be well, and we'll see you next time.