Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of Dataversity. We would like to thank you for joining the latest installment of the monthly Dataversity webinar series Advanced Analytics with William McKnight, sponsored today by Chaos Search. Today William will be discussing data architecture best practices for advanced analytics.

Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting those via the Q&A section, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. To open the Q&A panel or the chat panel, you'll find the icons for those features in the bottom middle of your screen. And just to note, the chat defaults to sending to just the panelists, but you may absolutely change it to chat with everyone. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. Now let me turn it over to Courtney from Chaos Search for a brief word from our sponsor. Courtney?

Hello and welcome. Thank you, Shannon. It's awesome to be with you guys again. I think this is our third Dataversity session this calendar year, and we're really happy to be back with the group. I'm happy to share some new topics with you today, and I look forward to getting some perspective from our audience here. We'll just do a quick little summary of Chaos Search: who we are, what we do, and how we think we are, or know that we are, helping organizations do better when it comes to considering modern analytic infrastructures, which will be the topic at hand today.
What do we do at Chaos Search? We activate the data lake. When people think about infrastructure, they talk about lakes and lake houses and fabric and mesh, many of which we'll talk about today. But we activate the data lake for analytics at scale. For those organizations looking to analyze multiple types of data, our data lake platform eliminates complexity and helps organizations overcome the challenges posed by costly, siloed analytics solutions. We'll go a little more into what that means. Our customers can perform both search and SQL queries at the same time, directly from their cloud object storage: no pipelining, no transformation, no movement. What does that mean for the customers we have noted below? Huge reductions in time, cost and complexity.

So for us, the Chaos Search approach is really about simplicity, elimination of complexity, the ability to constantly scale, and ultimately reduced cost. If you look at this architecture, you may be sitting on this call going, oh, I know, yes, I have log data and OLTP data, and yes, I put that stuff in object storage. It's as simple as it looks: the Chaos Search data lake platform connects directly to your object storage. When we do that, we are providing you a single unified view of your data without copying it or moving it in any way. We're reaching in, reading that data set, and making it available to you. Then, whether you're asking a search question through an Elastic API or a SQL question through any sort of SQL front-end tool connecting through our open API, you may use your tool of choice to come into that data set and ask questions. What does that mean for a customer? Besides the simplicity we've talked about, it allows you to get to answers more quickly, without a doubt.
It also allows you to access data of multiple types, and so ask different types of questions, versus this idea of: I have my search data here, my SQL data set here, and my ML data sets over there. There are many solutions in the market today helping organizations solve for that; I certainly won't name them here. What Chaos Search has done is fundamentally disrupt how we think about those data sets: whether you're asking a search question or a SQL-style question, you can do that on the same data set without change.

For use cases, a lot of our customers today start in logs, so they have cloud application and infrastructure security, monitoring, troubleshooting, and threat-hunting style use cases. Those are traditional large-volume log use cases. Where organizations often find themselves falling over with traditional solutions is that you're constantly limited in the amount of data you can retain, and therefore troubleshooting, monitoring, or looking for threats without a real lens of perspective can be a serious challenge for a CloudOps, DevOps or SecOps team. With Chaos Search those limits come off; you're able to look at your data in an unconstrained way because, as we talked about, it's not moving out of the cloud object store.

To give you an example of an organization seeing a benefit: Blackboard, now part of Anthology, is an online education business, and their director of DevOps engineering has been a Chaos Search customer. When all of a sudden their usage grew by 3,000% with the move to online schooling, their existing log solution was simply falling over, and there was no way they could support the amount of new activity with the infrastructure and the budgets they had in place. With the move to Chaos Search, as Joel talks about it, 98% of all the operational burdens have been lifted.
"So we can focus on more Blackboard-specific tasks, not making the system work or run." What does that really mean for those on the call thinking about how you do that? Well, what was the challenge? They needed to reduce Elasticsearch costs. They were architecturally challenged to put as much data as they were generating into their existing Elasticsearch infrastructure, and usage was growing in an unprecedented way. Their SREs were spending 10 to 15 hours a week simply maintaining the environment, not helping end users derive answers. They knew they needed a better solution. They turned to Chaos Search because they knew our log solution was going to give them a single pane of glass, Joel talks about that pretty often, giving them, you know, five nines of uptime.

What can they do now? They have complete visibility into their cloud environments at scale, and they can do app troubleshooting and alerting and root cause analysis with a long-term view. For some organizations, seven days is not long enough to really understand what a root cause may look like. The facts were significant: uptime was improved. They retain much more data, more than twice the amount of data at half the cost, with no movement and no duplication, and they can query on demand. So there's this benefit of really focusing on value-add work; if anyone here works with or participates on an SRE team, hopefully that resonates with you.

From a picture perspective, it's as simple as this. If you look at what's on the left with Amazon Elasticsearch, you had multiple clusters across multiple regions simply working to get the job done. With Chaos Search you have a single point of access within our platform, and a dramatic simplification of the footprint. I hope I've told you enough about how we're helping customers and what we do, but I'd love to know from this audience, if you're out there live today.
If you were going to chime in: what are your greatest challenges (you can pick more than one) when it comes to deriving insight from all your data? It's not a polling question, but if you choose to answer, I'd love to get a sense for it; it will help us better answer questions at the end of this webcast. Is it expertise, a need for more data scientists or engineers? Is it resources? Is it tech: is your analytics platform not set up for scale, or limiting you in any way? Is it time, and always, the ask around cost? Or is it something else? It may be multiple. I'd love to get your perspective on what's top of mind when it comes to thinking about getting more and better access to information. Thanks, I see answers coming in through the chat already. And with that, I'm going to turn it back over to Shannon to take it away from here. If you'd like to learn more, our website's here; you can try it, demo it, anything you want, and take a look at some of our recent data. So with that, thank you for the time, and over to you, Shannon.

Thank you so much, and thanks to Chaos Search for sponsoring today's webinar and helping to make these webinars happen. Always a pleasure to have you here. If you have questions for Courtney, feel free to submit them in the Q&A, as she will be joining us in the Q&A portion at the end of the webinar today. Now I'm going to introduce our speaker for the series, William McKnight. William has advised many of the world's best-known organizations; his strategies form the information management plans for leading companies in numerous industries. He is a prolific author and a popular keynote speaker and trainer, and he has performed dozens of benchmarks on leading database, data lake, streaming and data integration products. And with that, I'll give the floor to William to get his presentation started.

Hello, and thank you, Shannon.
Thank you, Courtney, for that great review of Chaos Search. Glad to have you back aboard here. I trust that my slides are being shown correctly; if not, chime on in. I want to introduce the talk today by saying I'm going to be talking about data architecture: best practices for analytics and advanced analytics. So we're going to focus on that post-operational data, okay? Data that's already done its transaction, done its operation, and now it's a record of history that we need to capture in our analytic ecosystem. And that's exactly what it is today, an ecosystem. No two are the same, but the stack is getting so complicated, and so voluminous in terms of the things we need to pull off a great stack.

Okay, so I have 12 best practices for you today. I figured 12 was about right, so that I'm not rapid-fire throwing best practices at you. We get to sit with each best practice for just a little bit, which I hope will help them sink in and help you think about them, agree or disagree. Again, my perspective, but let's move on. Okay, let's go.

A little bit about me. I run a data strategy and implementation consulting firm; I've been in consulting now for 25 or so years, all in the space of data. Of course, that has evolved quite a bit over the years, so I've seen a lot. I started with a lot of data warehousing, analytics and BI. That's still relevant today, by the way, but the envelope for data has only expanded. As a matter of fact, to Courtney's question earlier, the answer was data: that is the impediment to success. We'll see if you agree as we go along here.

All right, one principle I want to start with is that it's about all data.
It's about all your data in the enterprise, and the relevant data from third parties as well. We don't talk enough about third-party data today; I think there's tremendous opportunity there, and that could probably be a best practice itself, to take a look into that marketplace. But anyway, let's get all your data under management, and let's start to erode the edges of this chart where we have data that's not under management, data that maybe isn't even being considered to be brought under management. I'll talk a little more about what I mean by this. This is a foundation of my implementations, a foundation of my philosophy as I look at data architectures, so it's something I thought I'd put up here at the beginning. Some of you may have heard me say this before, because it's definitely something I talk about a lot. I think we need to get the data under management so that the business can utilize that data.

Now, I may hear some pushback at this point: well, the business isn't asking for it. It's not asking for all that data, just some of the data, or just this part of the data, just summary data, et cetera. Then I think, concurrent with that activity of getting all data under management, which I'm not backing off of, you need to be working on the business side of things as well and helping them understand the possibilities. Because here's another thing: it's not all about just following on to business priorities, although that is great. It's also about data professionals, people who come to this webinar, shaping what that business direction should be, because we're going to have all this knowledge about data that others don't have. Best practice: get all data under management.

Okay, so what does "under management" mean? Data is under management when it is the following. It's in a leverageable platform.
It's in a platform that is built for leverage, built for multiple applications, not built for one application only. That doesn't mean single-application structures won't exist; they surely will. But I would like to see all data be somewhere in a leverageable platform, so that if a department wants special consideration for their transformations, if they have special security requirements, if they throw up all kinds of impediments to doing it this way, that's fine: we at least have the data someplace in a leverageable platform, and then we can push it to their special platform as appropriate.

Data is under management when it's in an appropriate platform for its profile and usage. We are not in the day and age of one size fits all for all data. There will be a difference between the performance of analytics and operations in different databases; databases are engineered for one or the other most of the time, not all the time. And that is true for a lot of the surrounding elements of the platform, not just the platform itself where the data resides.

Data is under management when it has high non-functionals: specifically, it's available, it performs well for its purpose, it's scalable, it's stable, people can count on it, it's durable, and it's secure. Data is captured at the most granular level. It meets a data quality standard as defined by data governance, not defined by us technical people who enjoy doing such things, but by data governance. Almost always, I will say, if I'm implementing a data warehouse, a data lake, a master data management hub, what have you, data governance needs to be enhanced for that particular activity. And sometimes it just needs to be started, which means more work before we get to the quote-unquote real work. Data governance must be there for these things. And, finally, it enables self-service.
Empower everyone with true self-service analytics, not just expert-fed insights. We can't just be over here finding the insights and metering them out to the business; hopefully we're doing some of that, but that can't be all of it. We need to empower the business to interact with the data, and we need to do that by delivering engaging, data-driven user experiences, not just simplistic transactional user experiences. We need to be delivering user experiences that allow for continued access to the data, where the data can be queried again and again and again. You can drill and drill and drill, and heck, maybe you never really get to the end of the data; you just keep drilling in circles because you're learning and learning and learning. Sooner or later, with that approach, the business gets to real insights, which need to be operationalized for action, not just put into things like reports and dashboards, which are kind of going by the wayside in terms of big-time interest amongst the user communities that I follow.

So here's the best practice: empower everyone with true self-service. Let's get out of that game. We builders, whether you're IT or not, whether central IT or not, need to get out of the way between users and their data. There's plenty more work for us; hopefully you'll see that a little bit here today.

Here's a quote: 80% of analyst time is spent simply discovering and preparing data. This was from 2017, and you know what, it's still pretty true today. It may not be exactly 80%, and I don't know how this is measured, but it's certainly very high, and I know even a data scientist's job is pretty heavy in terms of discovering and preparing data. That means we have not built it once so that it can be used many times. We haven't done a great job at the architectures out there, and that is really where we need to focus; that's where the big leverage comes from.
And I know you've got the day-to-day deadlines, the day-to-day phone calls, and so on, maybe Slack today: give me this report, give me that report. We've got to do that. But at the same time we have to be adhering to our higher calling in the organization, which is to raise the foundation and get out of that game. So let's start getting concerned with the tools and processes of the analysts. How are they accessing the data? How can we facilitate that access? If there's a better way they could access it, how can we change that? I don't think we should just be leaving the user community to their own devices in terms of how they access data, because they're under those same deadlines I just mentioned; they just kind of turn around and hand them to us as well. They're under those deadlines, and they may not be thinking in, how should I say, System 2 thinking. I like that: System 1 versus System 2 thinking. They're almost always in System 1 thinking; I'd like to think I bring you a little bit of System 2 thinking here to go back and mull over.

Now we're building up to the structures that are relevant to our analytics: the relational data page, which I won't go into great detail on here, except to say you've got all the records there on the page. And they're in some kind of order if you've clustered the table; if not, they're in some kind of random order. But we know where they are because of the row IDs at the end of the page, and that's how the data manager of the DBMS navigates the page to bring us back a record. The record that's on page number 100, record number two? Well, there it is. That's how they do that. It's a beautiful thing; it's been around since the 1970s. And here, let's add another row: we add a row ID, we add the record. Again, a beautiful thing, works great.
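That "page 100, record 2" navigation can be sketched in miniature. This is a hypothetical, simplified slotted-page layout for illustration only, not any particular DBMS's on-disk format: records sit on the page, and a row ID (slot) directory at the end of the page points back at each record.

```python
# Toy sketch of a slotted relational data page: records are stored on the
# page, and a slot directory (the "row IDs at the end of the page") maps a
# slot number to each record, so the DBMS can fetch "page 100, record 2".
class Page:
    def __init__(self, number):
        self.number = number      # page number
        self.records = []         # the records themselves, in insertion order
        self.slots = []           # slot directory: slot i -> index into records

    def insert(self, record):
        self.records.append(record)
        self.slots.append(len(self.records) - 1)
        return len(self.slots) - 1            # the new row's slot number

    def fetch(self, slot):
        # Follow the slot directory to the record, just like the data manager
        return self.records[self.slots[slot]]

page = Page(100)
page.insert(("Acme", 1200))
page.insert(("Globex", 870))
print(page.fetch(1))   # record number two on page 100 -> ('Globex', 870)
```

The indirection through the slot directory is what lets a real DBMS move records around on the page (for compaction) without changing their row IDs.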
And nothing has supplanted the relational data page as the preeminent way to store enterprise data. Yes, it's been around quite a while. What else has been around since the 1970s? Think about email, the microprocessor, Ethernet, Post-it notes, and mobile phones, which weren't very mobile back then. Nonetheless, the relational data page has all kinds of neat things about it, and it's kind of like the QWERTY keyboard, right? We all know the story of the keyboard that's probably under our fingertips right now: the layout was designed to be inefficient, spreading commonly paired keys apart, but I guess it's the most efficient way to do it now, because that's what we've all learned and passed down to our children, perhaps in our genes as well.

So, the relational data page. Now let me give you a little twist to it, and that's the columnar orientation. I cannot talk about the relational data page without talking about columnar orientation. If you're going to be relational and you're doing analytics, those structures, almost always in my experience, are optimized as columnar. So here's the best practice for you: make all analytics structures columnar. And you can do some quick tests to figure this out. Maybe you don't have a columnar database; well, that's maybe problem number one. Or you don't have columnar capabilities within your relational database, or you don't know how to use them, or some such thing. These are problem number ones. But usually they work. Columnar works best for analytics structures, where you get to deal with a single column at a time instead of the whole record, kind of like on iTunes: you can go buy one song, or you can go buy the whole album. And that's a beautiful thing too. So columnar orientation has been around for quite a while; the history is interesting.
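The row-versus-column point can be made concrete with a toy sketch (illustrative data, not a real engine): summing one column in a row layout touches every whole record, while the columnar layout touches only that column's values.

```python
# Row store vs column store, in miniature. Summing the "sales" column in the
# row layout reads every full record; in the columnar layout it reads only
# the one column -- the "buy one song, not the whole album" point.
rows = [                                    # row orientation: whole records
    {"region": "east", "sales": 100},
    {"region": "west", "sales": 250},
    {"region": "east", "sales": 175},
]
columns = {                                 # columnar orientation
    "region": ["east", "west", "east"],
    "sales":  [100, 250, 175],
}

row_total = sum(r["sales"] for r in rows)   # scans full records
col_total = sum(columns["sales"])           # scans one column only
print(row_total, col_total)                 # both 525
```

On disk the columnar layout also compresses far better, since each column holds values of one type, which is a large part of why analytic databases prefer it.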
If anybody wants to look into it quickly: in the early 1990s, Expressway Technologies developed the Expressway 103, a column-based engine optimized for analytics that would eventually become Sybase IQ. Anybody remember Sybase IQ? Sybase acquired Expressway and reintroduced the product in 1995 as IQ Accelerator, then renamed it shortly thereafter to Sybase IQ, and the story goes on and on. Now pretty much most analytic databases have columnar capabilities, so take advantage of them.

Now, data lakes. This is where we need to focus a lot of time and energy. I know we've still got to shore up our data warehouses, but we've been ignoring big data, under-utilizing big data. Big data, you know, being defined as special and different in many ways: there's more of it, there are different data types in there, and so on. It's largely streaming data, mostly data that just accumulates very, very rapidly; it's not tied to a quote-unquote business transaction. "Data lake" is the terminology the industry seems to have taken up. I don't necessarily like it, but I'll go with it because I want you to be able to follow along after this presentation. Most of the time, "data lake" means whatever we're doing with cloud storage, S3, et cetera.

Okay, so that's what we're talking about, and how they're normally organized is in the upper left here. The record has a record length and a key length; it has the key, and it has the value, et cetera. And the record may or may not be compressed; that's a big deal and matters a lot, but we won't go into it in great detail. What we do like is the Parquet format for our data lakes for analytics, and most data lakes are for analytics. There are such things as operational data lakes; we've built them, and that's great as well.
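The record layout just described (record length, then key length, then key, then value) can be sketched with Python's `struct` module. The field widths here are illustrative assumptions, not any vendor's actual wire format:

```python
import struct

# Toy length-prefixed record: [total len][key len][key bytes][value bytes].
# Four-byte big-endian lengths are an assumption for illustration only.
def pack_record(key: bytes, value: bytes) -> bytes:
    body = struct.pack(">I", len(key)) + key + value
    return struct.pack(">I", 4 + len(body)) + body

def unpack_record(buf: bytes):
    (total,) = struct.unpack_from(">I", buf, 0)   # total record length
    (klen,) = struct.unpack_from(">I", buf, 4)    # key length
    key = buf[8:8 + klen]
    value = buf[8 + klen:total]
    return key, value

rec = pack_record(b"host", b"web-01")
print(unpack_record(rec))   # (b'host', b'web-01')
```

The length prefixes are what let a reader walk a file of variable-length records without any external index; compression, when used, would typically be applied to the value bytes.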
But most of the time, when we refer to the data lake, it's an analytic structure that sits in the architecture alongside the data warehouse. Conceptually, to me anyway, it's very much like a data warehouse, just different data, and obviously a different price structure: better for big data, period. Big data is competitive data today; we must get a handle on it. And by the way, since the data is sitting out there in data lakes, and we're going to have so much more going there, we do need to be able to access it, index it, and search it efficiently, and that's where tools like Chaos Search come into play, which is why we love that tool. So here's your best practice: put big data in data lakes. And here's another best practice, rapid fire: index the data lake.

Okay, so you've got your data lakes, and hopefully they're architected pretty well. But, let me bring up the slide, they're not necessarily brought to the same level of governance and quality as the data warehouse. Going to the last bullet here first: the lack of data governance. You need to bring some data governance to the data lake, in parallel, and understand where maybe you haven't consciously applied data quality rules, where you haven't applied real data governance, and hopefully you grow that over time. Data catalogs are really good for this, really helpful for understanding data lakes; almost, to use the word, mandatory, but I'm just going to say very helpful. Data lakes are common, centralized storage for the enterprise. There is no defined data model into which the data is formed, no relationships. And it's a great place for historical data retention; we don't talk enough about historical data retention. You have to have data that's available to the user community.
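The "index the data lake" practice can be illustrated with a toy inverted index over log objects. This is a minimal sketch with made-up object keys; real lake indexing (what a tool like Chaos Search does against object storage) is far more sophisticated, but the principle of answering a search without re-scanning every object is the same:

```python
# Toy inverted index over log lines in "object storage" (here, just a dict
# of object key -> text). Searching the index avoids re-reading every object.
objects = {
    "logs/2024/05/01.txt": "error timeout db-primary",
    "logs/2024/05/02.txt": "ok login web-01",
    "logs/2024/05/03.txt": "error disk-full web-01",
}

index = {}                                  # term -> set of object keys
for key, text in objects.items():
    for term in text.split():
        index.setdefault(term, set()).add(key)

# One lookup instead of a full scan of the lake:
print(sorted(index["error"]))   # ['logs/2024/05/01.txt', 'logs/2024/05/03.txt']
```

Without such an index, every search is a brute-force scan over all retained objects, which is exactly what makes long retention windows painful in unindexed lakes.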
You don't have to have that data going back a certain number of years, no doubt about it. But prior to where they need that data going back to, you've got to store it someplace. Well, you don't really have to worry about that if the data is in the data lake, because it's such low-cost storage; I would just keep all history there for all time, without having different tiers of data. In the data warehouse you do have to be concerned about that, and perhaps the data lake is a good place to put some of that historical data as well. Again, don't just shovel things in here; do it mindfully, do it under catalog, do it under governance. All data formats can go in the data lake. It is for big data; hopefully I've made that point. It's mostly analytical processing, as I've said. Today it's mostly data scientists, and coming on board really quickly, as best I can tell, are the data analysts. More and more, we're going to have to curate and understand what goes in the data lake, and how it's being used, so that we can do a better job as builders for our users, who are becoming so much more diverse.

Now, I don't want to fool anybody: the data in the data lake is less valuable per capita than the data in the warehouse, meaning any given random terabyte you pull out of the data lake is not going to be as valuable as a terabyte in the data warehouse. That's okay. Again, different pricing models and different purposes, and over time I think we may get to the point where the data science within our organization says that all of this data is equally, highly valuable.

Graph databases. There's a finite set of structures that your analytic data should be in when it comes to best practices, and I am at the end of that list right now as I say graph databases. A little background: in the mid-1960s, navigational databases such as IBM's IMS (remember that?) supported tree-like structures in its hierarchical model.
But the strict tree structure could be circumvented with virtual records, and graph structures could be represented in network-model databases from the late '60s. So they've been around, but they've only come into the big time, I would say, in the last few years, where we have learned that the network data within the organization is important and must be navigated simply and efficiently, and that is not done well in a relational database. I have tried, and maybe you have tried as well, to force-feed graph data into relational databases; heck, I've tried to force-feed big data into relational databases and done an okay job with it for the time. But now you've got to use the tools of the graph databases. I'm not going to go on and on about them or about the other tools, but they are definitely a tool for you to be using for your sizable connected data. I had to throw the word "sizable" in there, as a lot of relational databases now will do an okay job with graph algorithms, like betweenness and things like that; betweenness centrality is a big one, showing the path between two nodes in the network, et cetera. So you may find that you can get away with using some of that; check it out, as that may be equivalent to a graph database for you.

So we've got graph databases, we've got data lakes, we've got data warehouses, and we have other islands of analytics structures in relational databases. Now, because we have so much, and you're probably thinking, what about this, what about that? Yeah, I'm thinking about it too, but it's not worth bringing up the other things at this level. Whatever we have, we need to access some of this data that's in multiple structures at once, and so data virtualization, I just want to slip in here how important that is in these new architectures. That's the best practice.
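Going back to the graph point for a second: the "path between two nodes" style of query can be sketched with plain breadth-first search over an adjacency dict (the graph here is a made-up example). A real graph database executes exactly this kind of traversal natively and at scale, which is what makes it worth reaching for on sizable connected data:

```python
from collections import deque

# Minimal shortest-path-between-two-nodes sketch: BFS over a directed
# adjacency dict. Returns the node sequence, or None if no path exists.
def shortest_path(graph, start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

social = {"ann": ["bob"], "bob": ["cat", "dan"], "dan": ["eve"], "cat": []}
print(shortest_path(social, "ann", "eve"))   # ['ann', 'bob', 'dan', 'eve']
```

Betweenness centrality, mentioned above, builds on this same primitive: it counts, for each node, how many of the graph's shortest paths run through it.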
So enable data virtualization for edge and temporary needs. Not all needs, but for your edge needs, where you happen to have data in multiple structures and you need it in one query, great. Maybe you're in the process of architecting that data together, which is always going to give you the best performance, but it's okay to have some queries go to data virtualization in the modern enterprise analytics stack, which is right here. It's a modern stack, and I've had other webinars completely devoted to this stack, explaining each and every element. Let me see if there's anything I want to call out here. Dedicated compute is going to be your highest cost; storage is up there as well, and data integration. Sometimes you have to go to a third party.

Speaking of third parties, here's the best practice for you, drum roll: leverage best of breed for your analytics stack. You don't have to go all in with whatever AWS provides, or whatever Azure provides, or whatever Google provides. They may not like me saying that, but, number one, you may bring some tools to the party that you're already very skilled in. That might be one reason to swap out something in their stack. And some things are just best of breed and don't fall under those categories; that should be perfectly fine. By the way, I just mentioned three stacks; there are a few others that are more or less complete as well, and there are heterogeneous stacks galore out there.

Okay, I think that's it. I have slipped in the data catalog, data virtualization, identity management, machine learning, the set of libraries, and so on. The others may be a little more evident, but I think there are 11 things here, and they're kind of all required for most of your big-time machine learning applications. And remember, total cost of ownership is more than just cloud costs.
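Before moving on, the data virtualization idea above can be sketched as a toy federated lookup: one query function spans two differently shaped sources (a warehouse-style table and raw lake-style JSON) without physically moving either. All names and shapes here are made up for illustration; real virtualization layers also push predicates down into each source rather than merging in the client:

```python
import json

# Two "sources" with different shapes: curated warehouse rows vs raw lake JSON.
warehouse = [{"cust": "acme", "revenue": 1200}]      # modeled, governed
lake = ['{"cust": "acme", "clicks": 42}']            # raw JSON blobs

def virtual_customer_view(name):
    # One logical query federated over both sources, no data copied or moved.
    row = next(r for r in warehouse if r["cust"] == name)
    raw = next(d for d in map(json.loads, lake) if d["cust"] == name)
    return {**row, "clicks": raw["clicks"]}          # one unified answer

print(virtual_customer_view("acme"))   # {'cust': 'acme', 'revenue': 1200, 'clicks': 42}
```

The trade-off is exactly the one noted above: the federated answer is convenient, but co-locating the data will always outperform it, which is why virtualization is best reserved for edge and temporary needs.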
There's the cost of administering it, and the lack of platform features that leads to more work for workarounds; that's definitely true when it comes to data lakes when you move from the data warehouse environment. And performance impacts cost of ownership as well. So, best practice: get a strong handle on your cloud costs. I've given complete webinars dedicated to cloud costs: how to look at them, how to understand them, how to manage them. So check that out, and definitely do that.

Now, best practices for analytics. There are some capabilities for data integration that you need to have. Here's an example of best of breed in the stack, right? You have to ask yourself whether AWS Glue, et cetera, provides enough for your enterprise data integration needs, because those tools don't do all of what I'm showing you here without getting into back and forth. Now here are some of the tools; we've done full evaluations of these, and here's our net-net: Cloud Arrow, IBM, Informatica, Talend for anything. What else do we have for anything? Matillion; Matillion for anything within a contained project scope, I would say. That's true for some of the other data prep vendors as well, but I think there's a role here for them in an enterprise alongside these enterprise data integration tools. I'm not trying to over-tool anybody, but there are needs, and there is no one-size-fits-all, and that's true for data integration as well. And the best practice: use fit-for-purpose data integration, not one-size-fits-all. Don't think about it as one-size-fits-all. It's not as involved, if you will, as the data management platforms, of which you will have several; here you might just have two, but I'm saying you will have at least two, and that's okay.

Competitive analytic architectures. Okay. And by the way, I'm giving these examples today; I'm not trying to exclude anybody.
I'm not trying to exclude the other great stacks that are out there. Okay, architecture component needs: security, privacy (that's a big one), governance, compliance, availability, et cetera. I want to get to some other slides. Here's an analytics reference architecture. You can see that we load the data lake, and we also load the data warehouse. I like to use the data lake as staging for the data warehouse. What does that imply? It implies all data in the data warehouse is in the data lake as well. It implies that my historical data is safe, secure, and cost-effective in the data lake, so I can age data off my data warehouse if I want to, say data over a few years old, and not have to worry about it. I've got it; it's in the data lake. It's not as accessible there, and we might have to bring data virtualization to bear a time or two after we do that, but that's okay. This is how we take our big data, our low-latency data: we stream that into the lake, but we use ETL, or ELT as the case may be (probably more likely ELT), for loading batch data, relational data, structured data, et cetera, into our data lake. There I show you S3 as an example, and I show you Parquet, so S3 storing Parquet. Parquet applies to the structured data in the data lake in this case. From there we can stream, or ETL, or what have you, into the data warehouse; you have choices there. You might create some data in the data warehouse that you want to offload into the data lake, and that's fine. You might have some reach-through queries from the warehouse into the lake, and some of you are picking up that that is the data lakehouse notion, right? Now let me introduce the lakehouse and some of the other constructs we're hearing about lately. It's very exciting. Data lakehouse: the terminology comes from Databricks, right? Okay.
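The aging-off pattern just described, where everything lands in the lake while the warehouse holds only recent history, can be sketched in a few lines. The function name, record layout, and three-year retention window are all illustrative assumptions, not anything from the talk.

```python
from datetime import date

# Sketch of "the lake as staging": the lake keeps all history, the
# warehouse keeps only recent data, so old rows can be aged off the
# warehouse without losing anything. Names and the 3-year retention
# window are invented for illustration.
RETENTION_YEARS = 3

def route(records, today):
    lake = list(records)                    # the lake retains everything
    cutoff = today.replace(year=today.year - RETENTION_YEARS)
    warehouse = [r for r in records if r["loaded"] >= cutoff]
    return lake, warehouse

records = [
    {"id": 1, "loaded": date(2016, 3, 1)},  # old: safe in the lake only
    {"id": 2, "loaded": date(2020, 9, 1)},  # recent: lake AND warehouse
]
lake, warehouse = route(records, today=date(2021, 6, 1))
print(len(lake), len(warehouse))  # 2 1
```

Every warehouse row is also a lake row by construction, which is exactly the invariant that makes aging off safe.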
It's the idea of the lake and the warehouse working together. Points of integration can be points of failure, so you want to be careful. But these data lakes have emerged to handle raw data in a variety of formats on cheap storage, for data science and machine learning. So there is a skill to deciding what data goes where, or really, what data will go on to the data warehouse. I talked about virtualization and the fact that you generally want your data together most of the time, so it is important to carefully cultivate what data moves from the lake to the warehouse. But we're talking about the lakehouse right now, which works the other way: queries for the lakehouse start at the warehouse, and when a query needs more data, it reaches into the lake and simply brings that data to bear on the query. That's the basic notion of the data lakehouse. It is more difficult than it sounds to set up, to make sure the query understands what data is where and can connect that data up appropriately between the two structures. This is where you want something like a database platform that has already ensured that will happen appropriately. All the major data platform vendors have converged their messaging around this concept, though. It takes the best attributes of traditional data warehouses and enables them to run on platforms with data lake storage architecture. So the data warehouse might sit on S3, but I'm still calling it a data warehouse. This is tricky, so follow me; I'm trying not to confuse you. Some data warehouses do sit on cloud storage. We can still call them data warehouses, still call them relational, because of how the data is structured on the cloud storage. How is it structured? I showed you a few slides ago: in that relational format. And of course it's important that SQL works directly upon that data.
All right, you've been hearing about this: the data mesh. The mesh, and the fabric, which is next, are architectural patterns for data management that are decentralized. And to me, decentralization is something you may not want as a purist, but it's almost inevitable. So I look at the data mesh idea, which, by the way, is not attached to a vendor the way the lakehouse is attached to Databricks. It's consultant-led; I'm not sure where the term originated, I just suddenly started hearing a lot about it. And looking at it, I can tell pretty quickly that a lot of my implementations have sort of backed their way into looking like this by default, just out of necessity: multiple warehouses, multiple lakes, multiple ETL pipelines, if you will. But in the quote-unquote mesh, how is it done? It's done by domain. You have different warehouses, lakes, and integrations for different quote-unquote domains. Domains are typically business domains, like a business department: marketing, sales, HR. But domains can be other things, or you can organize it a little differently. The point is not what it's organized by; the point is that it is decentralized and there are multiple structures working together. This enables flexibility in design, especially for self-service. And it does have some challenges, because you don't necessarily have the enterprise level recognized in here. You may have that as one of the domains, or maybe you don't want to acknowledge it at that level. So governance, security, availability, recovery, and performance can be difficult. But it's flexible and designed for self-service; you just may have to work around the shortcomings. I think the data mesh idea is definitely here to stay, so I look forward to some best practices coming out about data meshes.
I look forward to contributing to best practices around data meshes, because I definitely see this as an acknowledgement of reality. Now let's move on to the data fabric. I don't have a great way to draw the data fabric, and I have not seen a great way to draw it, so I threw a fabric over the architecture, and hopefully that conveys the message. You've got your architecture there, which looks a bit like the lakehouse, but the fabric provides common shared services, connectivity, and application portability. It's all about the use of metadata to enable the data to inform its own management and governance. So the big challenge here is making sure that we capture the metadata. Oftentimes architectures have not captured metadata, do not have great metadata, and therefore they are behind the eight ball when it comes to trying to get to a data fabric. Nonetheless, it can be done. I like the idea of the fabric, and I like the idea of the automation that's inherent within it: the fabric working on the data in the background, applying rule A and rule B, because it's automatic. Different things are happening automatically within the confines of the fabric. It utilizes the continued flow of data over all metadata assets to provide insights and recommendations. So I really like the idea of the data speaking for itself. I was at the ThoughtSpot conference this week, and I brought this up many times: we want processes to start working on the data in the background and telling us more about what we should be focused on, not just sitting there statically. And so the data fabric has these engines running on the data in the background, which is great; we look forward to seeing more of this. Data virtualization is like a logical data fabric.
The data fabric comes from different places, but IBM is one company that's all over it, and they're clearly a bellwether company, so we have to pay attention. By the way, you cannot do either a mesh or a fabric effectively without enterprise-level data and what I might call master data management. How do you answer the question "how many customers are there?" when you have these decentralized architectures? Again, that's something you want to take care of as well. Context matters to the business departments, but you must have a place for overall answers. So, best practice: pursue mesh and fabric architectures to the degree possible. It's carefully worded. I know it's a murky, high-level kind of best practice, but that's the best I can do right now, because we have to do some waiting and seeing, and this is a general best practice. For each of you individually, I might advise a more specific best practice around a mesh, or a fabric, or neither, but generally speaking, I can say this with confidence. And finally, we have the Data Cloud from Snowflake, where everything is a cloud service. The Data Cloud allows organizations to unify and connect to a single copy of all their data. The result is an ecosystem of thousands of businesses and organizations connecting not only to their own data, but also to each other, by effortlessly sharing and consuming shared data and data services. The Data Cloud makes the vast and growing quantities of valuable data connected, accessible, and available. It looks a little more integrated to me than the lakehouse, but in reality it may be more complex and expensive. Data science is only supported with new clusters; in fact, almost every additional demand for performance or scale can only be met by adding new resources, at a price. It's more integrated than the lakehouse, like I mentioned, but in reality more complex and expensive: you're continually spinning up new clusters.
The best practice around this is to do ELT into Snowflake: load the data, then transform it using SQL into a Snowflake data warehouse. Just about every client will buy an ETL tool for this (Matillion, Fivetran, Informatica, et cetera) to load Snowflake. This does take advantage of things like cloud security, mostly at a pass-through level. Again, if you can get past the expense of it all and you can set this up, it's not a bad way to go. I'm not saying it's a bad thing; I think most organizations would give an arm and a leg to have a great data cloud up and running today, as they would a data mesh, a data fabric, or a data lakehouse. You can have lakehouses and fabrics and meshes all at once. I don't know about throwing the Data Cloud in on that one, but under the Data Cloud you could definitely have some of those things as well. I hopefully haven't confused you too much, but this is all still getting sorted out. This is all new stuff; the Data Cloud idea hasn't been around that long. This is the year for it, anyway, from Snowflake. They are providing more and more best practices as time goes on, and we're seeing more and more customers doing more and more with this. I wouldn't say anybody is fully there; I think that's a fair statement to make. But once they are, I think they'll keep adding on to it, because I think it is great as well. In summary: get all enterprise data under management. Relational databases, especially in the cloud, plus cloud storage, especially Parquet, plus graph cover most analytic platform needs. Cost of ownership is more than the cloud cost, so get a handle on that. And data integration is vital to data architecture for modern analytics; that's why I spent some time on it today. There's no point putting the data structures in place in a best-practice way without the data flowing in a best-practice way. That's great data integration.
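The "ELT, then transform with SQL" best practice has a simple shape: land the raw data first, then do the typing and shaping inside the engine. A minimal sketch follows, with SQLite standing in for Snowflake (in Snowflake this would be a COPY INTO a raw table followed by a CREATE TABLE ... AS SELECT); the table names are invented.

```python
import sqlite3

# ELT sketch: Extract + Load the raw, untyped data first; Transform
# afterward with SQL inside the engine. SQLite stands in for Snowflake
# here, and the table names are illustrative.
db = sqlite3.connect(":memory:")

# E + L: land the data exactly as it arrived, as text.
db.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?)",
               [("1", "19.99"), ("2", "5.00")])

# T: type and shape the data with SQL, producing the warehouse table.
db.execute("""
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER) AS id, CAST(amount AS REAL) AS amount
    FROM raw_orders
""")
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 24.99
```

The design choice ELT makes over ETL is visible here: the raw table is an untouched copy of the source, so the transform can be rerun or revised entirely in SQL without re-extracting anything.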
Data integration, data prep, streaming data, what have you. The data mesh and data fabric are decentralizing the architecture and acknowledging reality, and that's a great thing, so I did recommend pursuing them. I have a couple of offerings there at the bottom of the page related to the concepts discussed today. Also, and I fail to mention this most of the time in these presentations, I'll give anybody out there a free half hour with me to talk about your situation. How do you say it: no obligation, or whatever. Just get in touch with me; I'm happy to talk about these things. That leads to our upcoming topics in this series. I can't wait for next month, one of my favorite topics (I like this one too): "Is Our Information Management Mature?" Everybody's asking about maturity. How mature are we? How do we compare to our peers? Well, let me give you some guidelines around that. Then "The Future Based on AI and Analytics"; I really like that one. I'm going to go a little astray of the pure technical aspect and take a look at the future. We're all going to live in the future to some degree, so let's learn about it. Organizational change management is going to be out there, then graph database use cases, and then assessing new databases for translytical use cases, those use cases that are both operational and analytical, eroding what I said before about there being different databases for operational and analytical. But that's how things are going. Some of the best practices you saw today, you will see again next month in the maturity talk. Why is that? Because I have found that mature environments actually do these things, and so when I share the model with you, you're going to see that it correlates to some of the things I said today. Okay, so that brings me to the end of my formal presentation.
I'll turn it back over to Shannon and Courtney, and I will now take your questions.

Thank you so much, William, for another great presentation. Just to answer the most commonly asked questions: a reminder that I will send a follow-up email by end of day Monday with links to the slides, links to the recording of this session, and anything else requested throughout. So, diving in here: data virtualization introduces very complex views and breaks the idea of having modular code to debug and analyze, so why would one pursue it, knowing the drawbacks?

Well, I'll take a first shot at that, since I brought it up in my presentation. I hope I put the right words around the use of data virtualization, because I essentially agree with the question and don't want to see it overdone. When you're moving fast, as enterprises are, you're going to have those edge cases where you didn't load this data set into the data warehouse, yet you need to access it along with the 80% of the data for the query that is in the data warehouse. So what are you going to do? Wait a month for a turnaround on loading that data set? And a month is probably pretty generous; in a lot of enterprises it's more like two or three. They can't do that, so data virtualization saves the day there. Now, you don't want to be trying to implement data virtualization on the fly in response to something like this. As a matter of fact, you probably won't; you'll probably just go ahead and load the data set, because you don't want to spend that month. Instead, you want data virtualization as part of your core architecture, so that you can take advantage of those situations and be responsive. I think it's incumbent upon us to be uber-responsive to the business, and this is just part of that to me. Anything to add there?
Oh, I think William's answer on that one was great. I love it. So, lots of great questions coming in here. What is your view on using graph databases versus columnar?

They're two different things, for different data. Graph is for connected data: relationship data, network data. Things like the Twitter graph, how we're all connected to each other; things like maps, where you're connecting nodes on the earth to create efficient pathing; finding relationships between people, between customers, between parts, what have you. Columnar is simply how an analytic relational database should store data, because a columnar database only accesses the columns that are needed for the query in an I/O, which is very expensive. I don't think the concept of columnar necessarily applies to graph data. I showed you on the slide, real quick, the RDF format for graph data, which is what's called a triple store: subject, predicate, object in the triple, if you will. That's how that data is structured. So they're just different. And Courtney, I'll just let you jump in whenever for these questions.

So, can you explain your comment that you need MDM before data mesh and fabric, and why?

Yes, I'll be happy to, because I want everybody to know this, and I'm happy to underscore what I said before about the importance of master data management. I've given a full webinar on the importance of master data management, so you can find out more about what I think about it there if you want to look that up. I think it's just really important to an enterprise where data has been determined to be important; it's an elegant way of managing that data. Now, that being said, in conjunction with a mesh or fabric, I think we have to be careful, in that even without MDM, folks have enterprise-level data somewhere.
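The subject-predicate-object triple shape mentioned in the graph answer can be illustrated in a few lines. This is a toy model with made-up names, not a real triple store, but it shows why the structure suits connected data.

```python
# Toy subject-predicate-object triples, the RDF "triple store" shape
# described above. A real graph database indexes these for fast traversal;
# plain tuples are enough to show the model (all names are made up).
triples = {
    ("alice", "follows", "bob"),
    ("bob",   "follows", "carol"),
    ("alice", "follows", "carol"),
}

def objects(subject, predicate):
    """Everything `subject` is connected to via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(sorted(objects("alice", "follows")))  # ['bob', 'carol']
```

Note there are no tables or columns here at all, which is the point of the contrast with columnar: the data is the relationships themselves.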
But if you go with a mesh or a fabric, you have the option, I guess, of not having anything at the enterprise level, because everything is decentralized out at the department level, or whatever other domain level you've established. And that is where you're going to have a hole in your capabilities. I don't think virtualization is the answer there; I think MDM is the answer: to have that data physically manifested and up to date in real time, not virtual, and probably pushed out to the applications that need that kind of data. That's what MDM does, so that's why I'm high on MDM, if you will, and believe it's a very important special concept when we're talking about decentralized architectures like mesh and fabric.

And what are your thoughts on the data lakehouse combining data warehouse and data lake?

It's very important. I think you should do it, absolutely. Make those structures work together; you're going to need both. I hopefully established here today the need for both of those structures for different types of data in the organization, in order to get all data under management. So you're going to need both, and it's best for them to work together in harmony. Again, it's another way of providing great service to the business and being ready for anything that comes up within the data. Kind of like the virtualization concept, right, but special because it combines the data warehouse and the data lake.

I'll chime in, Shannon, on that one too. We spent so much time talking about all four of these things, really five: the warehouse, the lake, the lakehouse, the mesh, the fabric. And I guess I'm echoing William's comment that there's room for both things. I think you see a lot of opportunity, and like, what is the macro problem here?
Data is diverse in its type, diverse in where it persists, and diverse in the types of groups that are trying to derive answers from it. One thing I would say is, when it comes to a warehouse or a lakehouse (I think that was the question), it doesn't have to be one or the other. But there is an awareness point for enterprises around trying to drive toward simplicity, right? People aren't going to want to have a mesh, a fabric, a warehouse, a lakehouse, and a lake; I don't think that's really our end game. What is important is that everybody here is really thinking through: where are the greatest needs for data within my business? What is the best way for me to answer a couple of questions, so that I have something that's extensible in its structure? And then move toward that. So it's not only a one-thing answer, but it's also not an everything answer.

I love it. So, one more: what about Data Vault?

What about Data Vault, okay. Very popular in Europe, by the way. That's a modeling technique that can work with any of the above; it's not better or worse in any of them, just a modeling technique that some have decided upon. Personally, I don't pursue vaults in my implementations. Maybe that's just because I'm good with straight-up Kimball approaches to data modeling. I do like the idea behind it, to get all the data in, but I think you can do that in other structures as well. So: a modeling technique, not necessarily an architecture best practice, but something you can work with within all the others. Not something I'm going to raise to a best practice.

Love it. Well, thank you both so much for these great presentations, and thanks to ChaosSearch for sponsoring today's webinar and helping to make these webinars happen. I appreciate it, and great to have you with us again, Courtney.
And thanks to our attendees for being so engaged in everything that we do. Just a reminder: I will send a follow-up email by end of day Monday with links to the slides and links to the recording of this session. So thank you, everybody; I hope you all have a great day. Thanks, everyone. Bye.