Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Officer of DATAVERSITY. I'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight. Today William will be discussing architecting a modern data platform, sponsored this month by Reltio. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel. And if you'd like to chat with us or with each other, we certainly encourage you to do so. To open the Q&A panel or the chat panel, you will find the icons for those features at the bottom of your screen. And just to note, the chat defaults to just the panelists, but you may absolutely change that to network with everyone. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar. Now let me turn it over to Mike from Reltio for a brief word from our sponsor. Mike, hello and welcome.

Hello, Shannon. Thank you for having me. I'll go ahead and hop right in because I know we only have an hour. This is a very complicated and exciting topic. In today's digital landscape, we find that data is more and more siloed, more and more fragmented, and that fragmented, siloed data is leading to failed digital outcomes. Data is poor quality, duplicated, unreliable. Today, in the average enterprise, there are 446 different applications with distinct silos of data in operation. And so this is becoming ever more complex. It's getting harder every day to take advantage of all the different SaaS apps, MarTech apps, cloud platforms, on-prem systems, and legacy databases that you have across your enterprise, and this is really becoming a barrier to driving the digital outcomes that organizations want to drive. So to solve this, organizations have started to commit to investing in data unification and management platforms. The core of this is being able to take all of these different data assets from all these different applications, bring them together in one place, and serve that trusted, interoperable data wherever it needs to be, whether that's an LLM, an operational application, analytics and reporting, or an automation tool, and provide that trusted core data, products, customers, households, and make it available wherever you need it, whenever you need it. Because it's not just a matter of getting the data into these places; oftentimes batch APIs and batch delivery make it unusable in operational applications. So it's not just where this data needs to be, but when this data needs to be available across the organization. At Reltio, our focus and our mission is to provide this trusted, unified data in order to mobilize all of that core data across your enterprise. We bring that data together, we organize it, we enrich it, we standardize it, and then we make it available in milliseconds wherever it needs to be. And as we're doing this with various customers, we've found that as they make investments in these platforms, not everybody's on the same maturity level.
There are certain customers that are just starting this journey and certain customers that have been doing this for a long time. And so we see three primary flavors of this unification platform. On the left side, we see a focus on entity resolution. These are things like understanding that the Mike Frasca in your marketing stack is the same Mike Frasca in your shipping system, which is the same Mike Frasca in your customer service system, so that whatever channel I come in on, you have a universal ID to refer to me by. This way you know me across your entire organization. Then, shifting to the right, we have multi-domain master data management, or multi-domain MDM. It's not just understanding me as a person and adding additional demographics; it's understanding Mike Frasca and my relationships. Mike Frasca is related to my household. Mike Frasca is related to my wife, Nicole. Mike Frasca is related to the products I've bought from your organization and to what other suppliers I may have relationships with. And so now you're starting to get a much larger view of that person or that object across your enterprise, to share into customer experience, into your risk and fraud detection algorithms, into your privacy and compliance teams. But there's one step further. Not only do you understand those core data assets, who Mike Frasca is, where I do business and what products I have, but what transactions I have with you. Mike Frasca went for a test drive in a used car. I applied for credit. I bought three items last week. I've made five calls to your customer service application. Unifying all of that together with this core data asset allows you to start delivering customer data products across your organization. So start thinking about how powerful it is when you can do things like understand my likelihood to churn, and when I call into your customer service application, do a live lookup and route me directly to a top-tier customer service person. No longer are you going to make me go through that level one, level two hop, because you've been able to identify who I am and what I'm doing, and then provide analytics in real time to that customer service operational application. So we see that journey from the beginning, understanding these core assets, all the way to attaching not only relationships but transactional and analytical information to them, to be able to serve that to all of your applications. These are very foundational data programs. But in order to drive real value across your organization, we really want to see how they impact the actual business initiatives you're engaged in. Whether that be sales effectiveness, because your sales teams spend less time on the phone chasing their prospects, or they have the right addresses to drive out and visit them; or self-service through your web portal, where you can automatically understand that three accounts are really the same person and provide a better experience for them; or fraud detection, where by figuring out that those three different accounts are really the same person, you can see that my return rate is much higher than the average person's, and maybe you flag that for your fraud detection team. All of these outcomes are based on having clean, unified, trusted data to serve them.
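As a rough illustration of the entity resolution step Mike describes, matching the same person across marketing, shipping, and customer service systems and assigning a universal ID, a minimal Python sketch follows. The field names, sample records, and matching rule are hypothetical; real MDM platforms use far more sophisticated, probabilistic matching.

```python
# Minimal entity-resolution sketch: records from different source systems are
# grouped under one universal ID when a naive matching key agrees.
# All field names, records, and the matching rule are hypothetical.
import uuid
from collections import defaultdict

records = [
    {"source": "marketing", "name": "Mike Frasca",    "zip": "02110"},
    {"source": "shipping",  "name": "Michael Frasca", "zip": "02110"},
    {"source": "service",   "name": "Mike Frasca",    "zip": "02110"},
]

def match_key(rec):
    """Naive rule: last name plus postal code (real MDM matching is probabilistic)."""
    return (rec["name"].split()[-1].lower(), rec["zip"])

# Group records by key and mint one universal ID per resolved entity.
groups = defaultdict(list)
for rec in records:
    groups[match_key(rec)].append(rec)

universal_id = {}
for members in groups.values():
    uid = str(uuid.uuid4())
    for rec in members:
        universal_id[rec["source"]] = uid

# All three source-system records resolve to the same universal ID.
print(universal_id)
```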
And so it can't be just data for data's sake; measuring the true value of that data is going to be about moving the needle on these other core business initiatives across your entire enterprise. So thank you for that time. Shannon, I'll turn it back over to you.

Mike, thank you so much for kicking us off. And thanks to Reltio for sponsoring today's webinar and for helping to make these webinars happen. If you have questions for Mike, feel free to submit them in the Q&A portion of your screen, as Mike will be joining us in the Q&A at the end of the webinar today. Now let me introduce our speaker for the series, William McKnight. William has advised many of the world's best-known organizations. His strategies form the information management plan for leading companies in numerous industries. He is a prolific author and a popular keynote speaker and trainer. He has performed dozens of benchmarks on leading database, data lake, streaming, and data integration products. And with that, I'll give the floor to William to get his presentation started. William, hello and welcome.

Hello, and thank you, Shannon. Thank you, Mike. It is great to have Reltio aboard. Like Mike said, this is a big subject. It's a really big subject. As a matter of fact, that hit home this past month when I was putting this together, because there are so many components to a modern data platform. I think what I'm going to end up doing today is a lot of what we might call riffing on all these different components in order to get them all in. So it's kind of a survey topic, and you're going to get a lot of my opinions along the way. I do work on this stuff on a daily basis. With each component, I'm going to try to talk about how difficult it is to add that component, what its priority is, and maybe a little bit about the perception versus the reality of adding that component to your modern data platform. Now don't hate me, but I'm actually on a cruise on the Rhine River as we speak. But I am so grateful for having this webinar to come to, to get me out of a three-hour dinner, which I didn't want to go to, and I'm so glad that we can work from different places these days. So that is me. This is my company. These are some of the logos of the technologies that we work with. I'm proud of all of these, and I would say they all have their place somewhere in the environment. Sorting it out can be a challenge. If you need any help with any one of them, or especially if you have multiple of these that you're thinking about implementing, and you've got to sort out the workloads and the priorities and the work plans and so on, let us know; we're really good at that stuff. It's a mixture of big data, analytic data, data movement, APIs, data management, and operational transactional data management. So the big theme here is data. That's what I've spent my career in. I really enjoy data, and let's modernize your data platform. It's gotten more difficult than ever to do this. It's more complex, and it's going to continue to get more complex. There's no one size fits all. One colleague of mine who saw me present noted that sometimes I say "it depends." I don't know if that's because I say it a lot or because the topic lends itself to that, but I will try to keep that to a minimum and really give you the reasons behind the "it depends." Now, there are a lot of perspectives that you can take on data platform decisions, and these are them. Not everybody thinks of all of them.
It is surprising to me sometimes how each shop comes at the problem from a different perspective, and it's usually one of these, and usually that drives the whole thing. But I suggest that you think about all of these as you platform your data. Platforming your data and developing your data architecture is really important today. As a matter of fact, I say it's important not only to the application, but to the overall TCO of the organization, the overall possibilities for the organization, and really the sheer profitability of the organization. So these are the perspectives, and they're all pretty relevant. OLTP versus OLAP. Yeah, that's still real. You still have a difference between operational data and analytical data. And I might throw in here, it didn't make it to the slide, HTAP, hybrid transactional analytical processing, which spans both OLTP and OLAP. So one decision you should be making as a shop is, what do you think about that? A technology like SingleStore or Apache AsterixDB is both operational and analytical at the same time. There's also the Lambda architecture, which is a data processing architecture designed to handle massive amounts of data by combining batch and stream processing; there's that too. Relational or object storage. Wow, this is a huge one, and they're not mutually exclusive either. For example, Snowflake uses S3, which, okay, that's object storage, but it lays the data out in relational format, so I call that relational; all the rows are in there, et cetera, et cetera. Batch versus stream; I mentioned the Lambda architecture there. Big data versus not big data. A lot of people in our industry don't like the term big data. I like it, and I get it. There are distinct differences, and it has mostly to do with the speed at which the big data accumulates. It's accumulating right now; as we sit here on this presentation, there's big data accumulating in your company. It used to be we could just ignore it; it was operational, it did its job. But we really can't do that anymore if we want to be competitive. And so there's big data and there's not big data, which is what we used to try to manage. And by the way, the not-big data is probably still more important if you haven't brought that to a certain level of maturity. SMP versus MPP. SMP shares memory, the operating system, and the I/O subsystem; it's like having multiple cores in a single CPU that work together, and usually this is good for smaller, maybe OLTP-type systems. MPP, that's shared nothing; in an MPP system, each processor has its own dedicated resources. And I will break down how to make a decision on this in your shop.

William, just FYI, you're cutting in and out. Your audio is dropping in and out.

Oh, well, hmm. I don't know if there's anything to do about that. I will carry on, and hopefully you can all hear me. SMP and MPP, I did that. Okay. Polyglot persistence. That's the idea that you need a distinct database for every distinct type of data in your pattern, in your architecture. Now, there are multi-model databases, and I actually gave a presentation on this some time ago, where these databases, mostly of NoSQL origin, have all kinds of different possibilities of what you can store in them, all the different NoSQL breeds. And then there's single vendor versus best of breed.
That's another perspective you can take on this. And I really like best of breed. As a matter of fact, a lot of the components that I'm going to be sharing today, you really can't find great versions of them if you stick with a single-vendor approach. So definitely open yourself up to best of breed if you want to have a great data architecture, and that's going to be evident as we go through the different components. Are you departmental or are you enterprise-wide? At what scope are you making these decisions? How much of your organization will they encompass? Some people don't know this even as they're making those decisions. And how do you get the word out if it's supposed to be enterprise-wide? And then finally, a perspective is open source versus closed source. Some shops have a thing about open source: we don't want it. Well, open source is actually at the root of many of the technologies that make a lot of sense in a data architecture today. In my opinion, not allowing any form of open source is not a tenable position in an organization today. Now, you may want to get the enterprise editions of the open source software; I get that totally and actually agree with it. Now, a modern data platform, and I hope I'm coming through better, I could have drawn this about ten different ways. As a matter of fact, when I walk into a shop, I walk in with about ten different laminated architectural patterns today, and the set is constantly changing. That's just because there are so many different starting points that enterprises are at with their data architecture, and so many different places they want to go. Your starting point actually has a lot to do with where you want to go, because there's no way you're going to erase everything you have and jump to some other architecture that has nothing to do with where you are today. That's not going to work. So all of this that you see here, operations, the analytic layers, the data science layers, et cetera, all of this can be covered by, hopefully, a chief data architect or similar kind of role within your organization. Also, on this slide, I couldn't put everything. I couldn't put data virtualization over the top, which is pretty important. I couldn't put graph databases, which are also pretty important, with their vertices and edges; anytime you have a highly networked type of application, you're going to want that. Some of the data producers are actually NoSQL databases, like your product catalogs and so forth. I also didn't put in time series databases, which also have a role to play if that's really important to you, if you're dealing with a lot of recent, rapidly changing data like weather, sensor readouts, and so on. That might be your InfluxDB, QuestDB, Prometheus, ClickHouse, and others like that. So these are some modern data platforms, and I show you the big three, if I may call them that: Azure, AWS, and Google Cloud, and what they provide for some of these major categories. I don't have all the categories listed on the left, but as you can see, you can stick within the big three and find your way through most of the categories. Now, Snowflake I had to include because they are very popular, and I wanted to show a very heterogeneous type of platform: with Snowflake, you have to get a different ETL or data integration tool, so I've got Talend, you've got your Kafka, Confluent Cloud, et cetera, et cetera.
So in reality, we're all heterogeneous to some degree or another, but I threw a Snowflake stack on here as well. I'm not going to go through each of these; hopefully this makes some sense to you in terms of the components that you're going to need. Already I'm starting to show you the vast number of components that you're going to have to have today in a modern data platform, and you've got to make them all work well together, like a bunch of Lego pieces. So it really takes a lot of skill today, more than ever, to do this right and achieve good TCO in the process. You've got data warehouse compute, storage, data integration and streaming (they're different), data exploration, data lake, BI, data science, identity management, data catalog, and some others. Now, some of the others, like data catalogs, data observability, which I'll get to later, master data management, and a lot of the rest, are going to be hard to cover while sticking 100% with one of the big three vendors. So that's where we get into: you've got to be heterogeneous today, to some degree. So let's start with the data warehouse compute. That's really, to me, the heart and soul of the entire data platform. It's the core of the analytics stack; dedicated compute represents the data warehouse itself, the heart of the analytics stack. I show you what the big three and Snowflake provide, and I've got some pricing up there. The pricing is circa 2023, so it may have changed a little bit, but the specs across these four are roughly equivalent; I tried to do that for you. Azure Synapse Analytics, they have the unified workspace experience; reserved capacity pricing is not yet available for compute resources. AWS Redshift RA3 includes addressable Redshift managed storage in its price, but there's a separate storage charge. And Google BigQuery, you've got dedicated slots that offer a more economical option. As a matter of fact, if you're in BigQuery land, if you're in Google land, you are probably getting one of those annual commitments, because it's just, in my opinion, not really tenable otherwise. So enjoy that. Snowflake, I usually have Enterprise Plus installed at my clients, and that's $4 per credit, per hour. And we all like to keep an eye on our Snowflake, and all our credits really, for that matter. So you want cost optimization techniques in place, like pausing compute: Synapse, Redshift, and Snowflake allow pausing compute to avoid unnecessary billing during idle periods, which is great. Scaling compute: the ability to scale compute size up or down. And you've got slot commitments, like for BigQuery. All of those help manage your budget. Storage. Now, whenever I talk to vendors, I get kind of excited: everything's in the cloud, right? Everything's going to the cloud. That may be true, but then when I work with my clients, I'm back to reality: there's a lot of on-premises out there, and it's not really in the plans that every piece of it is going to go to a cloud anytime soon. As a matter of fact, my clients have convinced me a time or two that it actually makes sense for that data to be on premises. So I get it; we won't break that all down here. It is a very important piece of the storage picture out there. So I say you definitely want to make sure that your vendor technologies work with on-premises, if you have it. Cloud storage, of course; network-attached storage; and storage area networks. Now back to cloud for a moment.
Of course, there's public cloud, the big three like I mentioned; Oracle, we might consider them too, okay, at this point, with what they're doing. Private cloud, though, that tends to be a pretty squishy term. And I like to say that it's not just that you turned your on-premises into a private cloud and did nothing. That's not actually activating any kind of cloud capabilities, like containerization, and that cloud-native capability is really, to me, the definition of private cloud. Master data management. Very important. I've given several presentations completely dedicated to master data management in this series; take a look if you have more questions about that. There it is, the hub in the middle of everything. It really is, as I like to say: operational data, the data warehouse, analytical data, interfacing to the data catalog, with all the subject areas that you see on the side there. And that's just a beginner set. As you get into master data management, you learn the process, and you begin to think of subject areas that you and I couldn't think of on day one when I walked in and we were talking master data management. So about a year down the road, my clients are surprising me, when they're having great MDM success, with what other subject areas we can do this with, because it's working out so well. So you want MDM today to really solidify this data that's so important everywhere, and it's going to be a heterogeneous, dynamic environment. You don't want to be messing with this and having ten versions of customer and so on and so forth. Is it hard to do? It's on the medium difficulty side of things, I will admit that. But it is a high priority. Now, data integration. For a fuller treatment of data integration, get my paper, which was just recently published at the easy-to-remember URL on this slide here. In that paper, I broke it down. I broke down ten of the major data integration vendors; I plotted them on a famous quadrant-style layout, and as you can see, things shook out. We've got ETL versus ELT; that's one perspective on data integration. And we have a new category, folks. We have reverse ETL, something that I wish I had thought of before, because I've been doing manual approaches to this for quite some time. It is important that once data flows in an analytical direction and you get insights, you are able to get them back into operations. I never had a great solution for this except, well, let's do data integration and move it back. That's kind of what these vendors are doing, but they're doing it in a more automated, managed way. So reverse ETL focuses on moving data from a central repository, like a data warehouse, out back, if you will, to operational systems and business applications. Some vendors to watch there are Hightouch, Census, Hevo Activate, let me think, Omnata, RudderStack, okay, there are a few others. And I show you here what the big three charge, and Snowflake with Talend, because you've got to have a data integration tool with that architecture. It's usually on some kind of squishy per-hour basis, like the Azure DIU, the data integration unit hour. So you really have to play with it to learn how many of these hours you're going to consume. And except for Talend, what you're seeing here is more kind of bare-bones ETL, but truth be told, for a lot of applications, that's all they need.
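To make the reverse ETL idea above concrete, here is a minimal, hypothetical sketch: churn scores computed in the analytics store are pushed back out to an operational CRM. SQLite stands in for the warehouse, and the table, endpoint URL, and payload are placeholders; the managed reverse ETL tools William lists handle the scheduling, change detection, and retries this sketch ignores.

```python
# Reverse ETL sketch: read an insight from the analytics store and write it
# back to an operational application. sqlite3 stands in for the warehouse,
# and the CRM endpoint is a made-up placeholder.
import sqlite3
import requests

CRM_ENDPOINT = "https://crm.example.com/api/customers/{id}/attributes"  # hypothetical

warehouse = sqlite3.connect("analytics.db")
rows = warehouse.execute(
    "SELECT customer_id, churn_score FROM churn_scores "
    "WHERE scored_at >= date('now', '-1 day')"
).fetchall()

for customer_id, churn_score in rows:
    # Push the score into the CRM so, for example, a call center can route
    # high-risk customers to a top-tier agent.
    resp = requests.post(
        CRM_ENDPOINT.format(id=customer_id),
        json={"churn_score": churn_score},
        timeout=10,
    )
    resp.raise_for_status()
```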
Speaking of moving data around, this is the new, elegant way of moving data, and you're going to move big data with streaming. Remember earlier I said, yeah, there's a difference; you're going to move your big data with streaming, because you probably can't do it any other way. So if you're really serious about it, we're talking about things like Kafka. There's Apache Kafka, okay, and many managed Kafka services; probably the most prominent is Confluent. Confluent provides streaming data everywhere with its Confluent Platform, which is self-managed, and Confluent Cloud, which is fully managed and available across all cloud providers. I think that was as of last year, maybe the year before, not too long ago. Stream processing ecosystem: a wide range of complementary stream processing engines, like Apache Flink, and software-as-a-service solutions have emerged to handle the stream. So you don't have to be in Kafka, that's for sure. During the same year that Kafka came out, which was 2011, the first open source version of NATS.io was released. NATS was originally developed as a stateless messaging technology supporting low-latency, real-time communication patterns such as publish/subscribe and request-reply. So these, whether it's Kafka, NATS, or one of the others, are great little hubs where data comes in and, based upon the subject of that data, gets distributed out in real time. There's not as much transformation capability; you don't want to slow down a process like this with a lot of that. So you live with maybe a lower level of data quality, but I'm going to get to data quality solutions in a little bit, when I talk about data observability, because there is a solution for that. But just to finish out on some of the players here: I talked about Confluent; Cloudera provides Kafka as a self-managed option; Red Hat provides Kafka as a partially managed cloud offering and self-managed Kafka on Kubernetes via OpenShift. AWS has its famous Amazon MSK, Managed Streaming for Apache Kafka, which is partially managed, and MSK Serverless, which is fully managed. Kafka support is excluded in the MSK offerings; it's not Kafka support, technically. AWS has hundreds of cloud services, and Kafka is part of that broad spectrum, only available on AWS; and there's Microsoft Azure HDInsight, only available on Azure. And you've got technologies like Pulsar and Redpanda vying for market share alongside established solutions. So this is a big area of competition right now, and it's a big area of opportunity; those usually tend to go hand in hand.
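A minimal publish/subscribe sketch of the streaming hub pattern just described, using the confluent-kafka Python client against a local broker; the broker address, topic name, and payload are hypothetical, and a production pipeline would add keys, schemas, and error handling.

```python
# Streaming hub sketch with the confluent-kafka client (assumed installed).
# Broker address, topic, and payload are placeholders.
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"
TOPIC = "sensor-readings"  # data is routed to subscribers by subject/topic

# Producer side: an operational system publishes events as they happen.
producer = Producer({"bootstrap.servers": BROKER})
producer.produce(TOPIC, value=json.dumps({"sensor": "s1", "temp_c": 21.4}).encode())
producer.flush()

# Consumer side: any downstream system subscribed to the topic receives the
# event in near real time; heavy transformation is deferred to later stages.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "demo-consumers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```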
Data science and machine learning. Wow, I thought I could just access the data. Well, data scientists usually have a set of tools that they prefer. The focus is on predictive and prescriptive analytics using machine learning and AI techniques. Data science is a sophisticated aspect of data analytics that involves collecting, cleansing, and curating source system data and implementing AI and ML. Now, truth be told, a lot of data scientists today still, by necessity, do a lot of data integration and a lot of what they usually call data wrangling, and we want to get them out of that work, because that's work that other specialists can do. So we want to free up data scientists from manual tasks; I'm not saying all data integration is manual, but we want to free them up from those other tasks and allow them to get involved with the data. The players in this space: Altair, RapidMiner, Amazon, Cloudera, Databricks, Dataiku, DataRobot, Google, H2O.ai. And what do you get with these? You get data science and machine learning workflows and a powerful, user-friendly tool for visualizing those workflows. Now, some people ask me about Snowpark, because it's gotten a lot of publicity. What is Snowpark? Is it one of these? No; kind of. It empowers data science tasks within Snowflake, right? Snowflake pushes data processing down to the database, but its approach is more to let third parties do the work, like some of the ones that I just mentioned. So you could use a partner tool like Fivetran to develop the pipeline, and tools from partners like Dataiku, DataRobot, or H2O.ai to develop, train, and manage the life cycle of ML models. After developing in those partner tools, through the Snowpark API you would then run them inside of Snowflake. So it's a way to get data science and machine learning working in Snowflake. I think it was ingenious of them to do it, and there's definitely been a lot of uptake on that one. But all the players I mentioned are doing well. This is another area of high competitiveness, and there's definitely a lot of opportunity here in data science and machine learning. How difficult is it to do data science and machine learning? Well, a lot of the difficulty is in the data wrangling, as I mentioned; it's in the data architecture. A lot of everything I'm saying today has to do with preparing the way for the data scientist to do his or her data science and machine learning. I say it's a priority as well. The data lake. The data lake is for your big data, can I just say that? It is less valuable data per capita than the data warehouse and our relational data sets. But still, you are expected today to have your data warehouse and your analytical relational environment up to snuff, up to a standard, and then beyond that, where competitiveness is today, it's in big data. And that means the data lake. Now, be careful with your terms; a lot of people talk about a data lake when it's really a data warehouse they're talking about, in my opinion. So, data lake: cloud data, big data, cloud storage, all data formats. This is where you might do historical data retention. Where are you doing historical data retention? It could be here; it could be other places. But someplace, you want to keep all historical data, other than the data your legal team says to get rid of. Other than that, you want to keep all historical data, but maybe not all in one platform; you might have a tiered approach to that, something that you need to design. Now, the data lake is the domain of the data scientist and the higher-level, if you will, analysts, unlike the data warehouse, where you get some of the more routine reporting, visualization, and so on. I'm not saying one is more important than the other, but I am saying that there is a distinction today. Now, five years from now, when I'm giving this presentation, I don't know. Looking at the tea leaves, there's a lot moving into the data lake, and because it can handle very big, scalable data at a good price point, the chances are that the data lake will be surviving very well at that point. Now, you tend to have less governance there, but lack of any data governance, I should say, is at your peril, and a data catalog is good for this. And I have shown you an example of Parquet.
I didn't label it, but in the graphic, that's an example of the Parquet approach. I'm a big proponent of the Parquet approach. It's kind of the columnar approach, if you will, for the data lake, and I like columnar very much for your data warehouse. And I find that most data lakes are for analytical environments and applications, so we like Parquet. Only Azure Synapse and Google BigQuery have a serverless pricing model. But watch out for serverless; I don't know that you want to do a lot of production in serverless. I'm showing you the prices there: for Azure, I show serverless, and then Amazon Redshift Spectrum, Google BigQuery, and Snowflake. As you can see, for Redshift and BigQuery it comes to $5 per terabyte scanned. And again, that was last year; things may have changed a little bit. Snowflake is back to your Enterprise Plus, $4 per credit, per hour. Keep an eye on it, learn, and project from there. A lot of people get surprised after the first month or two with that, though, so don't be surprised; do a good trial. Data governance. Is that a tool? Well, actually, it's a set of tools and a set of approaches, right? And I keep saying tools, but, you know, you can do a lot of this without a tool. As a matter of fact, please prove to me that you need a data catalog before we start talking about a data catalog and getting that all dusted off and getting that whole process up and running in the shop. If half the people see the need and half the people don't, I don't like that. I'd rather we start building our data catalog in a spreadsheet, and when the pain is so great, I want 75% of the people to be clamoring for something new, and that's going to be a data catalog. Anyway, that's kind of my approach to things. Data DevOps, okay, that's to make sure that what you do makes its way into production, and MLOps, like Azure Machine Learning or AWS SageMaker; these are for machine learning, right? Machine learning operations, flowing machine learning along its path to production, because you're going to want to iterate quickly. And finally, I threw security information and event management in there. You mean to tell me that the other security we've been doing all these years is not enough? Yeah, it's not enough anymore. SIEM, am I saying it right? SIEM technology supports threat detection, compliance, and security incident management through the collection and analysis of security events and a wide variety of other events. The products there might be Micro Focus, actually it's now OpenText, excuse me, OpenText ArcSight; Splunk Enterprise Security; IBM QRadar; Microsoft Sentinel. Yeah. And a big question you want to ask there is how well the vendor solutions perform in detecting attacks that leverage techniques recognized by the MITRE ATT&CK framework. Okay, speaking of catching things early, speaking of data quality, there is data observability. How important is it now? Hmm. Again, I want to make sure that the shop recognizes there's a data quality problem; there probably is. And I've been as guilty as anybody of promoting manual approaches to data quality. Well, not totally manual, but I have said: get the business involved, get them to say what the data quality problems might be, go see if they're there, and then manually do something about it, because that's really all we could do for a while. Yeah. Okay, there were data quality tools that could find data quality problems.
I'm not saying, I never said, don't use those tools, but data observability takes it one step further. It's the new data quality, if you will, because we're backing up, backing up, backing up to the beginning, and we're trying to find these problems early in the cycle. There are so many, what I like to call drifts, that can occur in data, that you want to catch early on. And most of these have only been around for a year. But here's another pocket of what we do that you just can't rely on the big three, or the big four if you will, to fully solve today. The greatness isn't coming from them. So, observability vendors. Wow, there are so many: Monte Carlo, Metaplane, Datadog, maybe you've heard of some of these, CloudWatch, Anomalo. Yeah. They're all going to detect these data quality issues earlier, although it's broader than data quality, encompassing data in motion, where data quality focused mostly on data at rest. It leverages automation, AIOps, predictive analytics, and knowledge representation for its core functionalities. How difficult is it? It's really not that difficult if you get a great tool, a tool that finds things that you care about; that's kind of the key here. So do get yourself data observability if you have data quality issues. And that brings me to the end of the components that I wanted to tell you about. How about that? Weren't there quite a few? Maybe some that you were overlooking. Either way, your data architecture can be plus or minus millions of dollars, depending upon the decisions that you make in all those components. Millions of dollars; perhaps even the whole profitability and even the whole company viability is resting on this. It's that important. If you're not generating great capabilities with data, then what are you doing? We as data professionals must be doing that. Now, I'm going to talk a little bit about TCO quickly, but remember, the flip side of TCO is ROI, much more exciting to talk about, but that could go in a thousand different directions. TCO is just your cost of that modern data platform. Now, it's not that simple. Does it include this? Does it include that? Does it include people? Does it include people on the business side? Yeah, figure all that out, and do it consistently, but as far as the stack breakout, it looks something like this. Now, I did this study probably three years ago now, but I think it pretty much hangs together still, because it was a pretty widespread study of all the big three. This one happens to be the medium enterprise, which is a $10 million to $2 billion company, for AWS. Dedicated compute on here was the data warehouse. The big stuff, the big costs, are still the data warehouse, the compute in particular, data integration, the data lake, and data exploration even sneaks into the top four. And except for the data lake, these are kind of the traditional components of a data architecture, aren't they? Yes, they're still there. We're just building around that. So that basic data warehouse, if you will, you're expected to have that in place, and you're expected that it be great, and then we're building around it. All these different components have entered the picture. Am I just hyping this stuff? Am I just jumping on everything? No, I thought about it. All the things I'm telling you, I really believe in.
I really believe that these components add value if you do it right, but you have to have the skills in your shop to make sure that you do it right. And you've got to pick the right components, et cetera, et cetera, all the good old things, right? Now, I show you the cost breakout, but the price-performance metric is important. Price per performance, dollars per query hour: this is defined as a normalized cost of running a workload. It's calculated by multiplying the rate charged by the cloud platform vendor by the number of computation nodes used in the cluster, and dividing that amount by the aggregate total of the execution time, because you're paying for the execution time. And so performance is important, if for no other reason than to save you money. So definitely look at performance to determine pricing. Each platform has options that you should be aware of, and I talked a little bit about that earlier. Cost predictability and transparency. Wow, has this raised its head with me quite a bit lately. Clients are saying, I'm willing to pay a little more if it's predictable. And I say, yes, and guess what? Cost predictability lowers your cost, because you're not in this mode of: I don't know what it's going to cost, I don't know what it's going to cost, I need to take care of this, I need to take care of that, because I don't know what it's going to cost. And who's going to be upset if it costs more, or costs less, as the case may be? So cost predictability is pretty important in all of these, and transparency; you want to be sure you understand that. As a matter of fact, a new category, it didn't even make my list here, is coming into focus. It's called FinOps for data. I'm not going to mention the vendors, but there are vendors for this, and they are looking at things like: hmm, is your Snowflake workload on the right cluster for you, or could you save money and not lose performance by moving to this other one? Or how about this one: there's a lot of data here, many terabytes, that you never access; maybe we should get rid of it or move it off to some colder, cheaper storage. And I look at these things and I go, well, you know, most shops are just not going to discover that on their own. So cost consciousness and licensing structure are important. Let me point out: not paying when the system is idle, please, okay; compression to save storage costs; moving or isolating workloads to avoid contention; and look for the ability to directly operate on compact formats like Parquet and ORC without converting the data. That adds just a ridiculous amount of overhead to a workload if you have to convert the data. So don't get into a situation where you've got a tool that doesn't work with the format of your data, so you have to convert it. And there are a lot of extras, bells and whistles, that can get hung on your license, so be very careful here. A lot of extras: are you paying per user, per node, per terabyte, per CPU, per scale-up, as the case may be with one of the famous architectures?
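Written out, the price-performance metric described above looks roughly like this; exact benchmark definitions vary from study to study, so treat it as a sketch of the idea rather than a standard formula.

```latex
\text{price per query hour} \;=\;
  \frac{\text{hourly rate charged by the vendor} \times \text{number of compute nodes in the cluster}}
       {\text{aggregate execution time of the query workload}}
```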
All right, distributed data architecture patterns. Now, before I get all buzzwordy here, and before I bring you in on some of these things, I think about, before I jump on something: have I tried to do that without having the technology? Have I tried to do it because it's good, because it's right? And if I have, and technology or science comes along and provides a more efficient way of doing it, I say, that's great, that's relevant. So these distributed data architecture patterns, which I'm going to talk about, are relevant. And there are pros and cons to following the patterns. Don't get too crazy about following the pattern. Keep your eye on the prize, which is business happiness and business return on investment. Nobody's going to come along with a checklist and give you a bonus based upon whether you adhered to the proven, or the published, components of a data fabric. It just doesn't happen; I've never been given that, and I've created some that actually do adhere pretty well. But that's not what matters; what matters is the bottom line, so keep your focus on that. All right, these are not mutually exclusive, and I'll talk a little bit about them: data fabric, data mesh, data lakehouse, and data cloud. As a matter of fact, last month, right here, I spent the whole time on the data fabric, so if you really care about this, go find that one. And I will mention here that there's no one size fits all; a lot of it depends on where you're at. These are not mutually exclusive; the best is to use them all. Now, your entry point into this stuff, though, might be the lakehouse. What's a lakehouse? Simply, there's a warehouse and a lake. Databricks kind of invented the term, right? And they're pretty good at it as a result. It's a lake plus a data warehouse, and it's a good idea. So, for example, you might have your lake on S3, okay, and you might have your data warehouse on Oracle in a cloud somewhere, or Snowflake in a cloud somewhere. Great. Does the warehouse reach through to that data lake? Well, not all do. There are certain players in data lakehouses, which I'll get to, where that makes sense. And the real skill here, which the vendors have put into their lakehouse approaches, is the pattern matching between the warehouse and the lake, so it knows when to go into the lake and can seamlessly bring that data into your query with what I still call a reach-through; I don't know what they're calling it anymore. All of the major data platform vendors have converged their messaging around the lakehouse, so we know it's good. Data lakehouse principles: a data lakehouse offers many pieces that are familiar from historical data lake and data warehouse concepts, but in a way that merges them into something new, like open storage formats, flexible storage, and support for streaming, and it can handle all the tasks of the organization; I shouldn't say all, I should never say all, but most of the tasks of the organization from a query perspective. Benefits: administration and management, better organization of information, simplified rules and regulations, and more cost efficiency. So, data lakehouse. Here's an example: Snowflake external tables. There's some good and bad with this, right? Schema on read. If an error occurs, it skips to the next file, but still returns rows found in the current file up until the error occurred. Just a little FYI: Snowflake recommends 16-megabyte to 256-megabyte file sizes or higher for Parquet. It has Delta Lake support, and the workflow is maybe not intuitive: you create a stage, you create an external table, you create a cloud object storage event notification, and it does automatic refresh for you. So that's an example of a data lakehouse.
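As a rough sketch of the external table workflow just outlined (stage, external table, cloud object storage event notification, automatic refresh), here is what the core statements might look like issued through the Snowflake Python connector. The account, database, stage, bucket, storage integration, table, and column names are all hypothetical, and the bucket-side event notification itself is configured in the cloud provider, not shown here.

```python
# Snowflake external table sketch over Parquet files in object storage.
# All names (account, database, stage, bucket, integration, columns) are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="PUBLIC",
)
cur = conn.cursor()

# 1. A stage pointing at the cloud object storage holding the Parquet files.
cur.execute("""
    CREATE OR REPLACE STAGE sales_stage
      URL = 's3://my-bucket/sales/'
      STORAGE_INTEGRATION = my_s3_integration
""")

# 2. An external table over the staged files; schema-on-read via the VALUE column.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE ext_sales (
      sale_date DATE   AS (value:sale_date::DATE),
      amount    NUMBER AS (value:amount::NUMBER)
    )
    LOCATION = @sales_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = TRUE
""")

# 3. Query it like any other table: the "reach-through" from warehouse to lake.
cur.execute("SELECT sale_date, SUM(amount) FROM ext_sales GROUP BY sale_date")
print(cur.fetchall())
```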
What about a data mesh? Is that a product? Hmm. Well, actually, in a data mesh, data products are what you create, what departments create. And the idea here is to have some structure around the reality that in an enterprise there are multiple warehouses, multiple lakes, multiple data integration components, et cetera, et cetera. Why not do it in a more organized fashion than throwing up your hands and saying, well, we tried an enterprise data warehouse and it didn't work, so we're just doing whatever? It's more than that, right? It's taking that reality and coalescing the problem into a solution, which is: each department, and I'm using that term loosely, you might define it differently than a business department, but each department will create its own lake or warehouse or both, or data integration component, or all of the above, and it will apply adaptive governance, which is really important in a data mesh approach. You might have some tiered governance; you might have some governance at the departmental level, because that's where they really know their business. I think that's what we've learned over the course of years doing data architecture: it's hard for a central group to understand everything in our complex business environments to the point where they do a great job for everybody. And so let's break it up, divide and conquer, and still have a centralized place, but allow for some decentralization. All of these are decentralized architectures. And what's really important, yes?

I just wanted to give you a quick time check. There's about eight minutes left, and we've got some questions coming in.

Okay, let me get to those questions. I think I will just mention that there's the data fabric, which, again, go back to last month's presentation; it's like a big enterprise data virtualization layer for data integration and data analytics. So you can get to data all over your organization with a great data fabric, and there are a lot of principles and benefits of that. Finally, a data cloud, which is kind of a fuzzy term here, and it's being used differently: Snowflake has their data cloud, and Salesforce has a data cloud that's more dedicated to unifying a company's customer data for marketing and other channels. So it kind of makes sense, but I think the industry has kind of conflated data cloud with database, so just be careful when you're talking about the data cloud. Now, an important part of the data cloud is the data marketplace, which provides live access to ready-to-query data with a few clicks, and I predict massive growth in the data marketplace. And finally, that brings us to a summary, so Mike and I can get to your questions. There are many lenses through which to view data platform decisions, and a lot of components, which I went through, so many, and for most companies, all are essential, and most are product decisions today. Data architecture can easily make or break a company; it's that important. TCO, ROI, cost predictability, and transparency are important. And these decentralized architecture patterns, lakehouse, mesh, fabric, and cloud, are all valuable, and they are not mutually exclusive; you can have them all. So back to you, Shannon, to see what questions we have.

Mike, William, thank you so much. And despite a couple of audio breaks, it's amazing how technology works now, even from a boat. And to answer the most commonly asked question: just a reminder, I will send a follow-up email by end of day Monday for this webinar with links to the slides and links to the recording.
So diving in here, I don't want to miss a chance to get to these questions. A question came in initially, Mike, about Reltio. Does Reltio require ETL and rehosting data, or does the platform support virtualization to push queries to underlying data sources?

So for Reltio, we do rehost the data inside of the Reltio platform. That's primarily driven by the need to do that enrichment and standardization. And then, oftentimes, we end up getting direct queries into the platform, right from operational applications. And so, to support that scale and that throughput, we do rehost that data.

Really cool. So, to both of you: what about virtualization as part of the data platform, say PostgreSQL FDW, or Denodo, or Data Virtuality, et cetera?

Oh, well, I would say that we just had to draw the line somewhere, but yeah, I actually do believe in data virtualization; I think I mentioned it. Over the top of all of this that I showed, all of this that we're doing, you need a virtualization layer to catch the fact that you're never going to get it all right, and you're never going to get all data in a singular platform for all queries. So queries will have to span platforms by necessity, and sometimes you don't care, you want that, and it's okay. With great data virtualization like a Denodo, it's okay to have some queries that are kind of just dedicated to run under Denodo. I wouldn't push everything to virtualization, because you're setting yourself up for performance challenges down the road, not because of Denodo, just because you're spanning databases. So yes, as your environment becomes more complex, and if you're doing the things I just said to do, you're complex, period, get data virtualization to catch those situations, I would say.

Okay. And Mike, feel free to jump in at any time here. So how would a data platform fit into an organization driving towards a data mesh architecture, which strives to move to decentralized storage and processing?

I can take a first shot at that one. I would say that when we talk about data mesh architecture, data mesh doesn't necessarily preclude centralized storage and processing. It certainly has a principle dedicated to decentralized data ownership and decentralized governance of that data. But within that overall data architecture, storage and processing and compute on that data can be centralized in order to meet performance characteristics. And so I would separate those two pieces: the actual storage, and the overall access and ownership of that data.

I agree with that. I mean, you can separate storage; probably that's more of a, well, that's how we did it, and now we're trying to get to a great data mesh, and that's not a priority to fix. So you might have some spread-out storage, but it is more about ownership. And I will add that master data management is really great for any of these decentralized data architectures. It takes something very important kind of off the table for you and allows at least that level of commonality across all of the data products in the departments.

Perfect. Just a couple of minutes left. There's a question I see in the chat here: how does current data privacy legislation impact the architecture and data management policies of the modern data platform?

Well, I think what it means is that data governance is really important.
All those things I mentioned about data governance, including security. I think it means that data architecture is much more important. It can't be haphazard. You have to be able to jump into your data and find things out quickly, and not allow it to be so disjointed that you might have some accidental, or maybe some overdone, levels of privilege granted out there. So it means SIEM technology is more important, it means governance is more important, and data architecture is more important. You have to be able to draw it and understand where things are. And so you've got to pull that in centrally; you've got to pull something in centrally, and whether those departments that are building their lakes and so on like it or not, they need a visit from the central data security person, and they need coordination there, because that person is seeing the landscape of regulation and what's coming. We often want to overlook that and drive hard to the finish line, but we need security more than ever.

I would 100% agree. The primary driver that I see with customers is that they are approaching this with more robust security models for data access and data privileges. And we still see some customers go even further, especially where there are data storage laws in certain regions, where they maintain separate, distinct storage environments for things like the EU and things like North America. But the primary vector, at least that I've seen, is through that security framework and a strong investment in what that should look like.

Perfect. Well, thank you both so much. William, thank you for joining us from Germany, from your cruise, and for taking the time to do such a great presentation. And Mike, thank you so much for joining us today, and to Reltio for helping to make today's webinar happen. Really appreciate both of you, and thanks to our attendees and everything you all do. And just a reminder again, I will send a follow-up email by end of day Monday with links to the slides and links to the recording, along with anything else. Thanks, y'all. Hope you all have a great day. Bye-bye. Thank you.