In 2009, Hal Varian, Google's chief economist, said that statistician would be the sexiest job in the coming decade. The modern big data movement really took off later in the following year, after the second Hadoop World, which was hosted by Cloudera in New York City. Jeff Hammerbacher famously declared to me and John Furrier in theCUBE that the best minds of his generation were trying to figure out how to get people to click on ads, and he said that sucks. The industry was abuzz with the realization that data was the new competitive weapon. Hadoop was heralded as the new data management paradigm. Now, what actually transpired over the next 10 years was that only a small handful of companies could really master the complexities of big data and attract the data science talent necessary to realize massive returns. As well, back then, cloud was in the early stages of its adoption, when you think about it, at the beginning of the last decade. As the years passed, more and more data got moved to the cloud and the number of data sources absolutely exploded. Experimentation accelerated, as did the pace of change. Complexity just overwhelmed big data infrastructures and data teams, leading to a continuous stream of incremental technical improvements designed to try and keep pace: things like data lakes, data hubs, new open source projects, new tools, which piled on even more complexity. And as we reported, we believe what's needed is a complete bit flip in how we approach data architectures. Our next guest is Zhamak Dehghani, who is the director of emerging technologies at ThoughtWorks. Zhamak is a software engineer, architect, thought leader, and advisor to some of the world's most prominent enterprises.
She's, in my view, one of the foremost advocates for rethinking and changing the way we create and manage data architectures, favoring a decentralized over a monolithic structure and elevating domain knowledge as a primary criterion in how we organize the so-called big data teams and platforms. Zhamak, welcome to theCUBE. It's a pleasure to have you on the program.

Hi, David. It's wonderful to be here.

Well, okay. So you're pretty outspoken about the need for a paradigm shift in how we manage our data and our platforms at scale. Why do you feel we need such a radical change? What are your thoughts there?

Well, I think if you just look back over the last decades, you gave us a summary of what happened since 2010. But even if we go before then, what we have done over the last few decades is basically repeating and, as you mentioned, incrementally improving how we've managed data based on certain assumptions around, as you mentioned, centralization. Data has to be in one place so we can get value from it. But if you look at the parallel movement of our industry in general, since the birth of the internet, we are actually moving towards decentralization. If we think today, let's move data aside, if we said the only way the web would work, the only way we get access to various applications or pages on the web, is to centralize it, we would laugh at that idea. But for some reason, we don't question that when it comes to data. So I think it's time to embrace the complexity that comes with the growth in the number of sources, the proliferation of sources and consumption models. You know, embrace the distribution of sources of data: they're not just within one part of the organization, they're not even just within the bounds of the organization, they're beyond the bounds of the organization.
And then look back and say, OK, if that's the trend of our industry in general, given the fabric of computation and data that we have put in place globally, then how do the architecture and technology and organizational structure and incentives need to move to embrace that complexity? And to me, that requires a paradigm shift, full stack, from how we organize our organizations, how we organize our teams, and how we put technology in place, to look at it from a decentralized angle.

OK, so let's unpack that a little bit. I mean, you've spoken about and written about today's big data architecture, and you basically just mentioned that it's flawed. So I want to bring up, I love your diagrams, you have a simple diagram. Guys, if you could bring up figure one. So on the left here, we're ingesting data from the operational systems and other enterprise data sets, and of course external data. We cleanse it. You've got to do the quality thing and then serve it up to the business. So what's wrong with that picture that we just described? And granted, it's a simplified form.

Yeah, quite a few things. And I would flip the question maybe back to you or the audience. We said that there are so many sources of data, and the data actually comes from systems and from teams that are very diverse in terms of domains, right? If you just think about, I don't know, retail: e-commerce versus order management versus customer, these are very diverse domains. The data comes from many different, diverse domains, and then we expect to put it under the control of a centralized team, a centralized system. And I know about that centralization: if you zoom out, it's centralized; if you zoom in, it's compartmentalized based on functions. And we can talk about that.
And we assume that the centralized model will be getting that data, making sense of it, cleansing and transforming it, and then satisfying the needs of a very diverse set of consumers, without really understanding the domains, because the teams responsible for it are not close to the source of the data. So there is a bit of a cognitive gap and a domain understanding gap, without really understanding how the data is going to be used. When I came up with the idea, I talked to a lot of data teams globally just to see what the pain points are and how they are doing it. And one thing that was evident in all of those conversations was that they actually didn't know, after they built these pipelines and put the data into data warehouse tables or a lake, how the data was being used. Yet they're responsible for making the data available for this diverse set of use cases. So a centralized system, a monolithic system, often is a bottleneck. What you find is that a lot of the teams are struggling with satisfying the needs of the consumers, and they're struggling with really understanding the data. The domain knowledge is lost. There is a loss of understanding in that transformation. Often, we end up training machine learning models on data that is not really representative of the reality of the business. And then we put them into production and they don't work, because the semantics and the syntax of the data get lost in that translation. And we are struggling with finding people to manage a centralized system, because the technology is still, in my opinion, fairly low level and exposes the users of those technologies to a lot of complexity. So in summary, I think it's a bottleneck. It's not going to satisfy the pace of change, the pace of innovation, and the pace of availability of sources.
Even though it's centralized, it's disconnected and fragmented from where the data comes from and where the data gets used. And it's managed by a team of hyper-specialized people who are struggling to understand the actual value of the data, the actual format of the data. So it's not going to get us where our aspirations and ambitions need to be.

Yeah, so the big data platform is essentially, I think you call it, context agnostic. And as data becomes more important in our lives, you've got all these new data sources injected into the system. Experimentation, as we said, with the cloud becomes much, much easier. So one of the blockers that you've cited, and you just mentioned it, is that you've got these hyper-specialized roles: the data engineer, the quality engineer, the data scientist. And it's illusory. I mean, it's like an illusion. Seemingly, they're independent and can scale independently, but I think you've made the point that, in fact, they can't: a change in a data source has an effect across the entire data lifecycle, the entire data pipeline. So maybe you could add some color on why that's problematic for some of the organizations that you work with, and maybe give some examples.

Yeah, absolutely. So in fact, initially, the hypothesis around data mesh came from a series of requests that we received from clients that were both large-scale and progressive, progressive in terms of their investment in data architecture. So these were clients that were larger scale. They had a diverse and rich set of domains. Some of them were big technology companies. Some of them were retail companies, big healthcare companies. So they had that diversity of data and that number of sources and domains. They had invested for quite a few years in multiple generations of proprietary data warehouses on-prem that they were moving to the cloud.
They had moved to various revisions of Hadoop clusters, and they were moving those to the cloud. And the challenge that they were facing was simply, if I want to simplify it in one phrase, that they were not getting value from the data that they were collecting. They were continuously struggling to shift the culture, because there was so much friction between all of these three phases: consumption of the data from sources, then transformation, and then providing it and serving it to the consumer. That whole process was full of friction. Everybody was unhappy. So the bottom line is that you're collecting all this data, there is delay, and there is lack of trust in the data itself, because the data is not representative of the reality. It's gone through transformation by people who didn't really understand what the data was, and it got delayed. And so there's no trust. It's hard to get to the data. Ultimately, it's hard to create value from the data. And people are working really hard and under a lot of pressure, but they're still struggling. So often with our solutions, as we are technologists, we point at the technology. We go, okay, this version of some proprietary data warehouse we're using is not the right thing. We should go to the cloud, and that certainly will solve our problem, right? Or the warehouse wasn't a good one; let's make a lake version. So instead of extracting and then transforming and loading into the database, where that transformation is a heavy process, because you fundamentally made an assumption with warehouses that if I transform this data into this multi-dimensional, perfectly designed schema, then everybody can run whatever query they want and that's going to solve everybody's problem. But in reality, it doesn't, because you are delayed and there is no universal model that serves everybody's needs.
Everybody's needs are diverse. Data scientists don't necessarily like the perfectly modeled data; they're looking for both the signal and the noise. So then we've just gone from ETLs to, let's say, the lake, which is: okay, let's move the transformation to the last mile. Let's just load the data into object stores, into semi-structured files, and let the data scientists use it. But they're still struggling because of the problems that we mentioned. So then what is the solution? Well, a next-generation data platform: let's put it on the cloud. And we saw clients that actually had gone through a year or multiple years of migration to the cloud. It was great: 18 months, and we've seen nine-month migrations of the warehouse versus two-year migrations of the various data sources to the cloud. But ultimately the result is the same: unsatisfied, frustrated data users and data providers, lacking the ability to innovate quickly on relevant data and to have the experience that they deserve to have, a delightful experience of discovering and exploring data that they trust. All of that was still a miss. So something else, more fundamental, needed to change than just the technology.

So then the linchpin to your scenario is this notion of context. And you made the other observation that, look, we've made our operational systems context aware, but our data platforms are not. Like CRM systems: sales guys are very comfortable with what's in the CRM system; they own the data. So let's talk about the answer that you and your colleagues are proposing. You're essentially flipping the architecture, whereby those domain knowledge workers, the builders, if you will, of data products or data services, are now first-class citizens in the data flow, and they're injecting domain knowledge into the system by design. So I want to put up another one of your charts. Guys, bring up figure two there. It talks about, you know, convergence.
You show distributed domain-driven architecture, self-serve platform design, and this notion of product thinking. So maybe you could explain why this approach is so desirable in your view.

Sure. The motivation and inspiration for the approach came from studying what has happened over the last few decades in operational systems. We had a very similar problem prior to microservices, with monolithic systems. Monolithic systems were the bottleneck; the changes we needed to make were always orthogonal to how the architecture was centralized. And we found a nice niche. I'm not saying this is the perfect way of decoupling a monolith, but it's a way that, where we currently are in our journey to become data driven, is a nice place to be, which is distribution or decomposition of your system as well as your organization. I think whenever we talk about systems, we've got to talk about the people and teams that are responsible for managing those systems. So it's the decomposition of the systems and the teams and the data around domains, because that's how we are decoupling our business today, right? We're decoupling our businesses around domains, and that's a good thing. And what does that really do for us? It localizes change to the bounded context of that business. It creates clear boundaries and interfaces and contracts between the rest of the universe of the organization and that particular team. So it removes the friction that we often have in both managing change and serving data or capabilities. So the first principle of data mesh is: let's decouple this world of analytical data in the same way, to mirror how we have decoupled our systems and teams and business. Why is data any different?
And the moment you do that, the moment you bring the ownership to the people who understand the data best, you get the question: well, how is that any different from the silos of disconnected databases that we have today, where nobody can get to the data? So the rest of the principles are really there to address all of the challenges that come with this first principle of decomposition around domain context. The second principle is, well, we have to expect a certain level of quality and accountability and responsibility from the teams that provide the data. So let's bring product thinking, treating data as a product, to the data that these teams now share, and let's put accountability around it. We need a new set of incentives and metrics for domain teams to share the data. We need a new set of quality metrics that define what it means for the data to be a product, and we can go through that conversation perhaps later. So the second principle is: okay, the domain teams that are now responsible for their analytical data need to provide that data with a certain level of quality and assurance. Let's call that a product and bring product thinking to it. And then the next question you get asked, by CIOs or CTOs and the people who build the infrastructure and spend the money, is: well, it's actually quite complex to manage big data, and now we want every independent team to manage a full stack of storage and computation and pipelines and access control and all of that? Well, we've solved that problem in the operational world, and it requires a new level of platform thinking, to provide infrastructure and tooling to the domain teams so that they are able to manage and serve their big data. And I think that requires reimagining the world of our tooling and technology.
But for now, let's just assume that we need a new level of abstraction to hide away a ton of complexity that people unnecessarily get exposed to. And that's the third principle: creating self-serve infrastructure to allow autonomous teams to build their domains. But then the last pillar, the last fundamental pillar, is: okay, once you distribute a problem into smaller problems, you find yourself with another set of problems, which is, how am I gonna connect this data? The insights happen and emerge from the interconnection of the data domains, right? They're not necessarily locked into one domain. So the concerns around interoperability and standardization, and getting value as a result of the composition and interconnection of these domains, require a new approach to governance. And we have to think about governance very differently, based on a federated model and based on a computational model. Once we have this powerful self-serve platform, we can computationally automate a lot of the governance decisions and security decisions and policy decisions that apply to this fabric of the mesh, not just to a single domain and not in a centralized way. So really, as you mentioned, the most important component of the data mesh is distribution of ownership and distribution of architecture and data. The rest of them are there to solve all the problems that come with that.

So, very powerful. Guys, we actually have a picture of what Zhamak just described. Bring up figure three, if you would. So, I mean, essentially you're advocating for pushing the pipeline and all its various functions into the lines of business and abstracting the complexity of the underlying infrastructure, which you kind of show here in this figure: data infrastructure as a platform, down below.
And you know what I love about this, Zhamak, is that to me it underscores that data is not the new oil, because I can put oil in my car or I can put it in my house, but I can't put the same quart in both places. But I think you call it polyglot data, which is really different forms, batch or whatever, but the same data. The data doesn't follow the laws of scarcity. I can use the same data for many, many uses, and that's what this graphic shows. And then you brought in the really important sticky problem, which is the governance, which is now not command and control; it's federated governance. So maybe you could add some thoughts on that.

Sure, absolutely. It's one of those, I think. I keep referring to data mesh as a paradigm shift, and it's not just to make it sound grand and exciting or important. It's really because I want to point out that we need to question ourselves every time we make a decision around how we're gonna design security or governance or modeling of the data. We need to reflect and go back and say: am I applying some of my cognitive biases around how I have worked for the last 40 years and seen it work, or do I really need to question it? And we do need to question the way we have applied governance. At the end of the day, the role of data governance and its objective remain the same. I mean, we all want quality data accessible to a diverse set of users, and these users now have different personas, like data analysts, data scientists, data application users. These are very diverse personas. So at the end of the day, we want quality data accessible to them, trustworthy, and in an easily consumable way. However, how we get there looks very different. As you mentioned, the governance model in the old world has been very command and control, very centralized.
They were responsible for quality, they were responsible for certification of the data, for making sure the data complies with all sorts of regulations, for making sure data gets discovered and made available. In the world of the data mesh, the job of data governance as a function really becomes finding the equilibrium between what decisions need to be made and enforced globally and what decisions need to be made locally, so that we can have an interoperable mesh of data sets that can move fast and change fast. It's really about, instead of putting the systems in a straitjacket of being constant and not changing, embracing change and the continuous change of the landscape, because that's just the reality we can't escape. So the governance model I call federated and computational. By that I mean every domain needs to have a representative in the governance team. The domain data product owner, who understands the data of that domain really well but is also a product owner, is an important role that has to have representation in the governance team. So it's a federation of domains coming together, plus the SMEs, the subject matter experts who understand the regulations in that environment and who understand the data security concerns. But instead of trying to enforce and do this as a central team, they make decisions about what needs to be standardized and what needs to be enforced. And let's push that, computationally and in an automated fashion, into the platform itself. For example, instead of trying to be part of the data quality pipeline and injecting ourselves as people into that process, let's actually, as a group, define what constitutes quality: how do we measure quality? And then let's automate that and codify it into the platform, so that every data product will have a CI/CD pipeline.
And as part of that pipeline, those quality metrics get validated, and every data product needs to publish those SLOs, or service level objectives, or whatever we choose as a measure of quality. Maybe it's the integrity of the data, the delay in the data, the liveliness of it, whatever decisions you're making. Let's codify that. So the objectives the governance team is trying to satisfy are the same, but how they do it is very, very different. I wrote a new article recently, trying to explain the logical architecture that would emerge from applying these principles. And I put in a kind of light table to compare and contrast how we do governance today versus how we would do it differently, just to give people a flavor of what it means to embrace decentralization and what it means to embrace change and continuous change. So hopefully that could be helpful.

Yes, very. So many questions I have, but the point you make, too, on data quality: sometimes I feel like quality is treated as the end game, whereas the end game should be how fast you can go from idea to monetization with a data service. You sort of addressed this, but what happens to the underlying infrastructure? I mean, spinning up EC2s and S3 buckets and my PyTorches and TensorFlows: where does that live in the business, and who's responsible for that?

Yeah, I'm glad you're asking this question, David, because I truly believe we need to reimagine that world. I think there are many pieces that we can use as utilities and foundational pieces, but I can see for myself a five to seven year roadmap of building this new tooling.
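As an illustration of the point above about codifying quality measures and validating them in each data product's pipeline, here is a minimal Python sketch. It is not from the interview or the articles mentioned: the names `QualitySLO` and `check_slo`, the two measures chosen (share of missing values for integrity, age of the newest record for delay), and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class QualitySLO:
    """Thresholds a governance group might agree on (illustrative values only)."""
    max_null_fraction: float   # integrity: allowed share of rows missing a required field
    max_staleness: timedelta   # delay: allowed age of the newest record


def check_slo(rows, required_field, newest_record_at, slo, now):
    """Return a list of SLO violations; an empty list means the check passes."""
    violations = []
    missing = sum(1 for row in rows if row.get(required_field) is None)
    # An empty data set is treated as fully missing so that it fails the check.
    null_fraction = missing / len(rows) if rows else 1.0
    if null_fraction > slo.max_null_fraction:
        violations.append(
            f"null fraction {null_fraction:.2f} exceeds {slo.max_null_fraction:.2f}"
        )
    if now - newest_record_at > slo.max_staleness:
        violations.append("data is staler than the agreed limit")
    return violations


# Hypothetical data product output: one of four rows lacks an order_id.
now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}, {"order_id": 4}]
slo = QualitySLO(max_null_fraction=0.10, max_staleness=timedelta(hours=1))

violations = check_slo(rows, "order_id", now - timedelta(minutes=30), slo, now)
print(violations)
```

In a pipeline, a step like this would fail the build whenever the returned list is non-empty, so that the people in the governance group decide the thresholds once and the platform applies them to every data product.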
I think, in terms of the question around ownership, that would remain with the platform team, perhaps a domain-agnostic, technology-focused team. They are providing a set of products themselves, but the users of those products are data product developers, data domain teams that now have really high expectations in terms of low friction and in terms of lead time to create new data products. So we need a new set of tooling, and I think the language needs to shift from "I need a storage bucket," or "I need a storage account," or "I need a cluster to run my Spark jobs," to "here's the declaration of my data product. This is where the data for it will come from. This is the data that I want to serve. These are the policies that I need to apply, in terms of perhaps encryption or access control. Go make it happen, platform. Go provision everything that I need," so that as a data product developer, all I need to focus on is the data itself: the representation of the semantics and the representation of the syntax, and making sure that the data meets the quality that I have to assure, and that it's available. The provisioning of everything that sits underneath will have to be taken care of by the platform, and that's what I mean when I say it requires a re-imagination. And there will be a data platform team. The data platform teams that we set up for our clients, in fact, themselves have a fair bit of complexity internally. They divide into multiple teams, multiple planes. So there will be a plane, as in a group of capabilities, that satisfies that data product developer experience. There will be a set of capabilities that deal with those nitty-gritty underlying utilities. I call them utilities at this point because, to me, the level of abstraction of the platform has to go higher than where it is. So what we call a platform today is a set of utilities that we'll continue using: we'll continue using object storage, we'll continue using relational databases, and so on.
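The shift described above, from asking for buckets and clusters to declaring a data product and letting the platform provision what it needs, can be sketched roughly as follows. This is only an illustration under stated assumptions: `DataProductSpec`, `Policy`, and `plan_provisioning` are hypothetical names, not an existing tool or the interface of any real platform, and the provisioning steps are just printed strings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Policy:
    """A policy the platform should enforce (hypothetical kinds and settings)."""
    kind: str      # e.g. "encryption" or "access_control"
    setting: str


@dataclass(frozen=True)
class DataProductSpec:
    """What a domain team declares; no buckets or clusters are mentioned."""
    name: str
    domain: str
    sources: Tuple[str, ...]          # where the data will come from
    outputs: Tuple[str, ...]          # the data the team wants to serve
    policies: Tuple[Policy, ...] = ()


def plan_provisioning(spec: DataProductSpec) -> List[str]:
    """Hypothetical platform step: turn a declaration into provisioning actions."""
    actions = [f"create storage for {spec.name}"]
    actions += [f"connect input {src}" for src in spec.sources]
    actions += [f"expose output {out}" for out in spec.outputs]
    actions += [f"apply {p.kind} policy: {p.setting}" for p in spec.policies]
    return actions


# A made-up declaration from an order management domain team.
orders = DataProductSpec(
    name="orders",
    domain="order-management",
    sources=("order-service-events",),
    outputs=("daily_orders",),
    policies=(
        Policy("encryption", "at-rest"),
        Policy("access_control", "analysts-read"),
    ),
)

for action in plan_provisioning(orders):
    print(action)
```

The design point is that the declaration names only sources, outputs, and policies; which storage or compute utilities sit underneath is left to the platform team.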
And so there will be a plane and a group of people responsible for that. There will be a group of people responsible for capabilities that enable the mesh-level functionality. For example, being able to correlate and connect and query data from multiple nodes is a mesh-level capability; being able to discover and explore the data products across the mesh is a mesh-level capability. So it would be a set of teams building the platform as a product, with strong product thinking and product ownership embedded in them, again, to satisfy the experience of these now business-oriented domain data teams. So we have a lot of work to do.

I could go on. Unfortunately, we're out of time, but first of all, I want to tell people there are two pieces that you've put out so far. One is how to move beyond a monolithic data lake to a distributed data mesh. You guys should read that. And then data mesh principles and logical architecture is kind of part two. I guess my last question, in the very limited time we have, is: are organizations ready for this?

I think the desire is there. I've been overwhelmed by the number of large and medium and small, private and public, government and federal organizations that have reached out to us globally. I mean, this is a global movement, and I'm humbled by the response of the industry. I think the desire is there. The pains are real. People acknowledge that something needs to change here. So that's the first step. I think that awareness is spreading. Organizations are more and more becoming aware. In fact, many technology providers are reaching out to us asking, you know, what shall we do? Because our clients are asking us; people are already saying, we need a data mesh and we need the tooling to support it. So the awareness is there, in terms of the first step of being ready. However, the ingredients of a successful transformation require top-down and bottom-up support.
So it requires, you know, support from chief data and analytics officers or above. The most successful clients that we have with data mesh are the ones where, you know, the CEOs have made a statement that we want to change the experience of every single customer using data, and we're going to commit to this. So the investment and support exist from the top through all layers. The engineers are excited, and maybe, perhaps, the traditional data teams are open to change. So there are a lot of ingredients of transformation that need to come together. Are we really ready for it? I think the pioneers, perhaps the innovators, if you think about that innovation curve of adopters, probably the pioneers and innovators and lead adopters are making a move towards it. And hopefully, as the technology becomes more available, organizations that are less engineering oriented, that don't have the capability in-house today but can buy it, would come next. Maybe those are the ones who are not quite ready for it, because the technology is not readily available and requires, you know, internal investment today.

I think you're right on. I think the leaders are going to lean in hard, and they're going to show us the path over the next several years. And I think the end of this decade is going to be defined a lot differently than the beginning. Zhamak, thanks so much for coming on theCUBE and participating in the program.

Thank you for having us, David. It's been wonderful.

All right, keep it right there. We're back right after this short break.