Hello and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight, sponsored today by AtScale. Today, William will be discussing the evolution of the data platform and what it means for enterprise analytics strategy. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We'll be collecting questions via the Q&A panel in the bottom right-hand corner of your screen, or, if you'd like to tweet, we encourage you to share your questions via Twitter using the hashtag #ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the bottom right-hand corner of your screen for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested during the webinar. Now, let me turn it over to Dave for a brief word from our sponsor, AtScale. Dave, hello and welcome.

Great, thanks, Shannon, and hello, everybody. Thanks for joining us today. Just a little bit about AtScale before I hand it over to William here. What AtScale does is help customers like the ones you see here modernize their analytics infrastructure, so William's talk is very pertinent to what we help our customers do. We have great representation across financial services, healthcare, and retail. Each of these customers was looking to modernize their infrastructure and move to the cloud, but still carry with them all the functionality they had, especially when it comes to multidimensional analysis. So, next slide. AtScale fits right here in this new analytics stack.
I think William will be talking a lot about what this new analytics stack looks like today, but our message is: there are data warehouses and there are data lakes. It's not an either/or, it's an and. So you need an architecture that serves all your consumers, whether they're people using BI tools like Power BI, Excel, and Tableau, data scientists using their Jupyter notebooks, or developers building applications with analytics. You need that one multidimensional, governed view of data, and that comes with a scale-out virtualization engine. That's what we do. We let people consume atomic data from the data lake if you're a data scientist, or blessed, enriched data in the data warehouse if you're a BI analyst, and we make where that data sits invisible to those consumers. Next slide. Now, some of these customers and some of our value propositions. One is to accelerate decisioning. Koch Industries, for example, had a big investment in Redshift, but only a handful of users could run concurrent queries. We extended that to hundreds of users and improved their performance by almost 40 times. We have Kohl's, the department store: they eliminated nine months of data prep, so they're truly getting to new data sets and new decisions very, very quickly. We control the complexity and the cost of analytics. As you move to the cloud and the big cloud bills arrive, we, for example, helped bol.com reduce their Google BigQuery spend by 91%. We helped customers like Wayfair migrate to Google BigQuery and still run their OLAP workloads from Excel, with thousands of analysts. And we've helped companies like Tyson Foods, Home Depot, and UnitedHealthcare govern their analytics and make sure that only the right people can see the right data, and that the data is secure regardless of who's consuming it. Next slide.
These are our core pieces of IP, our core value propositions. We have a data virtualization engine that leaves the data where it is: there is no loading data into AtScale. We pass those queries right down to the underlying data stores and hide all the complexity from users. We have a universal semantic layer, which means whether you're querying with OLAP through tools like Excel and Power BI, or through SQL with tools like Tableau, everybody gets the same answer at the same time, including the data scientists building their models. Autonomous data engineering is our technology for eliminating all the laborious, manual ETL it takes to stand up an analytics infrastructure. We do that automatically with AI: we look at user queries, tune them automatically, and make your underlying data platforms perform orders of magnitude better. And we do all of that with a security and governance layer that, again, makes sure people can see only the data they're supposed to see. Next slide. We do that with a platform that looks like this. I encourage you to come to our website, where we can tell you more about it, but basically we're connecting consumers to the underlying data that you see at the bottom there. We do it with a modeling framework and a multidimensional engine with data virtualization built in, which makes sure we can join data across these different silos. And of course, we make it all super fast. Then finally, one last slide: I encourage you to come to our website and look at our TPC-DS benchmark. We ran it against all these different data platforms: BigQuery, Redshift, Snowflake, Synapse, Databricks. We ran them with AtScale and without AtScale, so you can see the raw performance if you're in the market comparing these platforms, and you can see what we've done with AtScale. We improved query performance by up to 12x.
In terms of user concurrency, you're looking at up to 86 times more users, with faster queries. And of course, we saved a ton of money: up to 10x cheaper, or 11x cheaper in Databricks' case, for the same amount of queries, and that means lower cloud bills. So that's it for me. I'm going to hand it over to William. Back to you, Shannon.

Dave, thank you so much for kicking us off, and thanks to AtScale for sponsoring. If you have questions for Dave, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen, as he will be joining us in the Q&A at the end of the webinar today. Now let me introduce our speaker for the series, William McKnight. William is the president of McKnight Consulting Group, which focuses on delivering business value and solving business problems using proven, streamlined approaches to information management. They have won several best-practice competitions for their implementations, and he's been helping companies adopt big data solutions. With that, I will give the floor to William to get his presentation started.

Hello, and thank you, Shannon, and thank you, Dave. It's great to have AtScale aboard this month, and it's been really great catching up with AtScale over the past few weeks. Of course, I see them in the wild, as it were. I'm a fan of improved performance, so what can I say? And it slides right in there. So definitely something to think about. There's a little bit about me here, but I have been introduced, so I will move along. We do strategy, training, and implementation, as most of you know. And I have packed a lot of information into this presentation; it just kept getting bigger. Hopefully it all makes sense as it comes together. There's just a lot of information on this topic that I wanted to share with you, as we all pursue getting all our data under control. All of our data, not just some of it.
And are we getting this data together in anticipation of its use? Or do the uses come first? Does the demand for usage come before we actually decide to get some data together here? I think a little bit of leadership is necessary, and I do think that all data should be interesting. I think we have to work both sides of the equation. We have to deliver the data in a manageable format, and I'm going to talk about that; I'm going to define that in a minute. But I think we also have to stimulate the demand, because we are the ones, we data professionals, which is usually who I speak to in these webinars (excuse me if you're not), who are sitting on the gold of the organization. I hope you believe that, and I hope you feel strongly about it and can put that message forward to your organization. Because we want to get all data together, working for the business. We want to get it under management. So let me define that a little bit, because it can be taken as fairly nebulous. Some people may think their data is under management, but these are my criteria for being under management. First: it's in a leverageable platform, not something you just sort of decided, ah, yeah, this is what we do for everything, so let's do it for this as well. This is not the era for that. This is the era, as Dave was saying, of multiple vessels for carrying this information asset into the organization. He mentioned data lakes and data warehouses; I might mention a few more that are relevant here, but there's no one platform that does it all right now. We'll see about ten years from now if something comes along; there are some glimmers of hope on that front, but it's certainly nothing we're taking action on today. We are building multiple vessels for our data, for our very important asset. I'm going to bring the history in here in just a minute, because I know some of this is about the history of it.
And history is important because it shows you where something is coming from and what it was built for. So anyway, let me go on. Next criterion: an appropriate platform for the data's profile and usage. As you may know, getting the data into the right platform is a real hot button of mine. You can have data under management by the other criteria, but if it's not in the right platform for its profile and usage, your odds of success go way down. Next: high non-functionals. We know what they are; we've got to have them. I find a lot of organizations get hyper-focused on one versus the others, like performance at all costs or scalability at all costs, forgetting about the rest. These are all important. Next: data captured at the most granular level. This gets back to the philosophy of getting all data under management, not just summary data, not just some data, not selectively the data that's convenient to capture; convenience should not be the criterion. You need to work together with the business side of your business and make sure it's all getting put into action, because I get asked this question a lot: Is data important? Is data where we should be placing our energy, our budgets, our focus within organizations? The answer is absolutely, unequivocally yes, as long as you're doing it the smart way. If you're not doing it the smart way, then all bets are off. We want that data at a data quality standard as well. We do not want to be moving data around and making data available that is below the quality standard the organization has agreed upon. And we don't want everyone coming up with their own rules for the data they're going to use in their application. Most applications need multiple data sets, multiple subject areas like customer, product, site, part, and so on. We could go on and on; you know we could. Those should all be governed somewhere so that we're on the same page, and that's data governance.
Data governance is not a product. Data governance is a process; some products help, but it is not a product. One of the main reasons we need all data under management is artificial intelligence. I'm not going to read all of this; these are just some examples of data we put to work for artificial intelligence, and I'm sure you can think of many more. Most of you are starting to at least think about artificial intelligence for your enterprises and the data that goes along with it, getting all that data under management. So what's the great architecture for artificial intelligence? For me, anyway, the answer right now is a great data architecture, a great data architecture with multiple components working together. We've got the lake, and the lake is going to be very important here; I don't want to just throw it in there with everything else. The lake is going to be where the bulk of the data that artificial intelligence draws from lives, at least in most organizations. But there are others, and I'll get into all of that as we go along. By now, maybe you're catching on to my philosophy, which is kind of embodied in this graphic. Maybe you've seen me give this one before, but it just speaks volumes. Below that waterline is the data, and most users don't see it, don't care about it, don't have the capacity to think about it. They just want you to do the right thing there and make that data available. But we should be putting our energies commensurately into the respective areas based on what you see here. Yes, of course, the BI and AI layer is something of importance that we need to place emphasis on, but that data layer, and putting foundational pieces on that data layer, like AtScale, making that data really more functional within the organization, that's where we should be putting most of our focus.
Yes, you can get some quick wins on the BI and AI side, but I've seen so many organizations where that's all they work on. They get some quick wins, but they just keep adding on to that layer until they have an inverted version of this picture: their BI layer is the behemoth, and the data layer is huge as well, but they don't work on it. So it's ugly, and that can become a real mess. Please keep the focus on data, even when a user says, well, can you change this report? Think data. Think data. I want that data to be able to jump into any BI tool and make sense, right? It shouldn't be the BI tool standing on its head trying to get value out of the data; it should be the other way around. Okay, now I'm going to bring in some of the history. The relational database data page: this is obviously the structure where we put most of our data in an organization. I say most; that may be questionable today with the data lake in place, which is not usually on one of these things, all right? By the way, as I get into these terminologies, you know the industry doesn't have great standards, and one person's data warehouse might be another person's lake, might be another person's mart. I just try to understand what you all are trying to say out there and go along with it. But in the absence of that, or in places where there's a lot of difference of opinion, if you will, this is where I go: the data warehouse being that relational database, okay? Or databases, as the case may be, where you have that centralized data that's been governed and is well-performing and so on. So, back again, what is the relational database? Well, it sits on the data page. Now, there's a lot more to it than just the data page, but the data page comprises the bulk of the actual storage unless you have a ton of indexes. Okay. So, the data page has your records on it.
I'm not getting into gory details here, but this is stuff you should know if you work at all with a relational database. I'm talking Snowflake, Redshift, Synapse, on and on, right? And I'm talking about the historical DB2 and its peers. If you're working with these systems, you've got to know what happens down at the storage layer, because I think that helps you determine what is right for putting into this kind of storage. And today, most appropriately, by the way, we've been putting most relevant data into relational databases. But that should not be the end of the story. While it's great, you can see the page carries these row IDs, so you can get some good random access; random access is going to be a lot better here in a relational database. But it's a costly endeavor to have all this structure around the page. You're going to have some holes: right before the row IDs is where you typically have some gap space. And you really need indexes, which, as I mentioned before, can take up a lot of space, just to navigate the records in a relational database. So there's a lot of good and a lot of bad when it comes to a relational database. But it's been around for a while. Let me just fill this out. Let's add a record. Okay, so we added another row ID and pointed it to our new record. How do you like that? That's how it works. It figures out what page the record should go on, keeps the row IDs in order if you have any clustering sequence or anything like that, and then it puts the record in there. What I don't show here, by the way, is that records typically have a header of a few bytes with some navigational information about the record, which comes in handy especially if you have varchars, because it says how long the record is, and so on and so forth. But anyway, this all came around in the 70s. That's right, the 70s.
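The page mechanics just described, a slot directory of row IDs growing from one end and records from the other, can be sketched in a few lines. This is a toy illustration only; the names, sizes, and layout are my own assumptions, not any vendor's actual page format.

```python
# Toy sketch of a "slotted" relational data page: row IDs (slots) point at
# records, and free (gap) space sits between the two. Illustrative only.

PAGE_SIZE = 8192  # a common default page size

class DataPage:
    def __init__(self):
        self.records = []   # record bytes, conceptually grown from the top
        self.row_ids = []   # slot directory, conceptually grown from the bottom
        self.free = PAGE_SIZE

    def add_record(self, payload):
        # Each record carries a small header with its length, which matters
        # most for variable-length columns (varchars).
        record = len(payload).to_bytes(2, "big") + payload
        needed = len(record) + 4      # 4 bytes for the new row-ID slot
        if needed > self.free:
            return None               # page full: the record goes elsewhere
        self.free -= needed
        self.records.append(record)
        self.row_ids.append(len(self.records) - 1)  # slot points at record
        return self.row_ids[-1]       # row ID enables fast random access

page = DataPage()
rid = page.add_record(b"Frank,Smith,TX")  # rid 0, like the slide's example
```

The row ID is exactly what buys the good random access mentioned above, and the slot-plus-header bookkeeping is the overhead cost that comes with it.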
Now, many other things we still use today are also from the 70s; it's not all bad. WYSIWYG, the microprocessor, Ethernet, Post-it notes, and even mobile phones were invented back in the 70s. They weren't so mobile; they were huge. But we got some good things out of the 70s. I was around then, even before, but we don't need to talk about that. The 70s did give us a lot of good things we're still using today, and the relational database is one of them. We're not discounting the relational database just because it is so old, all right? We haven't been able to improve upon it, and it's kind of like the keyboard we're all typing on, right? ASDF, right? We're just so used to it. So there's a lot of goodness there for sure. But there are some nuances to it that have been helpful. In the early 90s, there was a company called Expressway, and they developed the Expressway 103, a column-based engine optimized for analytics, because they figured, well, why are we doing I/Os on... let me back up and show you this here. Why are we doing I/Os on all of this when we only want certain columns? I'm not going to fully explain column orientation, but it's significant enough that all databases come to market now with at least the possibility of column orientation, and most of the databases out there have been engineered so they can go this way. What it simply means is that you have columns and their values stored together instead of complete rows. Now, back to the history. I think it's interesting that Expressway eventually became Sybase IQ: Sybase acquired Expressway, introduced it as the IQ Accelerator, and renamed it to IQ around 1995. So you see there was some time spent in the market with just the relational non-columnar, I should say row-oriented, approach to things.
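That "why are we doing I/Os on all of this" point can be made concrete with a tiny illustration, column values stored together versus complete rows. The data and layout here are made up for demonstration.

```python
# Why columnar helps analytics: to aggregate one column, a row store must
# touch every field of every row, while a column store touches only the
# values of that one column, stored contiguously.

rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 30.0},
]

# Row orientation: complete rows stored together.
row_store = [list(r.values()) for r in rows]

# Column orientation: each column's values stored together.
col_store = {k: [r[k] for r in rows] for k in rows[0]}

# Averaging "amount" from the column store reads one contiguous list...
avg = sum(col_store["amount"]) / len(col_store["amount"])
# ...whereas the row store would have scanned every field of every row.
```

Scale those two rows up to billions and that difference in I/O is the whole argument for columnar in analytic workloads.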
But the majority of database systems that you and I have been working with have been row-oriented, unless you just came into this business in the past year or so. You may never have applied that label, known why it's important, or known that alternatives exist. And for many DBAs, the need to know down at this layer has not even been there, but it should be. The better DBAs are going to know this stuff and use it to their advantage. Another company very important in the history of columnar was Vertica, which came along in 2005, founded by Michael Stonebraker and Andy Palmer. They presented it as an alternative to IQ, in some ways better, and it continues to be strong to this day. I'm going to talk about the DBMSs relevant to you a little later in the presentation. And I know I'm belaboring columnar here, but it's because I'm just such a fan of it for the analytic workload. In the absence of information to the contrary, I think probably a good 90% of analytic workloads, which includes the data warehouse, by the way, should be columnar-oriented. And I'm even applying this to our data lakes today, and I'm going to show you how we do that when I get to the lakes part of this. Now, along came distributed file systems, and Hadoop was probably the first actually marketable one. Its co-founders were Doug Cutting and Mike Cafarella, and it grew out of the Google File System paper published in October 2003. So you see some time gap to distributed file systems, and really, for most of us, we didn't start embracing Hadoop until, what, maybe 2010, around there. And even that was early days, I would say. So Hadoop came along, and it was not the only distributed file system by any stretch. We also have all the NoSQL databases out there: the Mongos, the Couchbases, Redis, Riak, and so on.
All the NoSQL databases were also distributed file systems, along with Hadoop. But Hadoop and the NoSQL databases differ in what they attack: Hadoop is more for your analytic workloads, whereas NoSQL is going to be much more operational. I won't get into all the reasons why, but they both follow this kind of pattern, where the data is blocked up and spread around different nodes, spread to nodes on different racks and within the same rack. All of this is to minimize failure; this is how it handles failure. So, yes, it does blow up the data a little bit, but still, there's not a lot of overhead on it. There's not a lot of navigational ability and so on. That's why with the first Hadoop we had to read everything from start to finish, which was rather taxing when you just wanted to pull a record or two. We don't have that anymore, fortunately. And it's holding its ground, I would say, but a lot of new work has gone on to cloud storage, which doesn't even have what you see here. As a matter of fact, there aren't a lot of exciting things to show you about cloud storage, but I'll get to that. Let me put a nuance here on Hadoop. I mentioned earlier that we like putting Hadoop data into a columnar structure; we like putting cloud storage into a columnar structure. And the big thing there is the Parquet file format. We're big advocates of Parquet for the analytic workload, for all the reasons I mentioned before around columnar databases, really. But Hadoop has many different file formats. The default is going to be something like a sequence file, like what you see here in the upper left of my graphic. There can be compression; there doesn't have to be. Nothing much more to say about the sequence file; it's just what you see here, and it's row-oriented. There are some other row-oriented structures for Hadoop, too.
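Before going further into the file formats, the block-and-replica pattern described a moment ago, data blocked up and spread across nodes on more than one rack to survive failure, can be sketched as follows. Rack names, node names, and sizes are all invented for illustration (real HDFS blocks default to 128 MB with three replicas).

```python
# Sketch of distributed-file-system block placement: split data into blocks,
# replicate each block, and make sure replicas span more than one rack so a
# single node or rack failure loses nothing. Illustrative names and sizes.

BLOCK_SIZE = 4   # tiny for demonstration; HDFS uses ~128 MB
REPLICAS = 3

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}

def place(data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    nodes = [(rack, n) for rack, ns in racks.items() for n in ns]
    placement = {}
    for i, _ in enumerate(blocks):
        # Round-robin over nodes so each block's replicas land on >1 rack.
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(REPLICAS)]
    return blocks, placement

blocks, placement = place(b"abcdefgh")  # 8 bytes -> two 4-byte blocks
```

This is why the data gets "blown up a little bit": each block exists three times, but that redundancy is exactly what buys the fault tolerance.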
Those other row-oriented structures include the map file and Avro, in case you've heard of those. But column-oriented, there are a few there, too: Parquet, the RC file, and the ORC file. We like Parquet, and I'll explain it a little, because that will really help explain all of them. Now, Parquet, as you see here, is based on Google Dremel. It's especially good at handling nested data, where you can have varying amounts of data for a given column. Parquet converts the data to a flat column store, represented by what they call repetition levels and definition levels, which help define all the nesting you see here. So, yes, like I mentioned, there's Parquet for columnar; there are also the ORC file and the RC file as columnar approaches to Hadoop. Parquet also works in cloud storage, which is something we're advocating strongly for data lakes. Speaking of that, let's talk about it. Nothing too exciting here to show you: I'm showing you a data lake architecture, and there's nothing exciting to show you about the storage. It's more or less bare bones. Now, cloud storage itself has been around a while, since back in 2006, and what we've found to be a really elegant use of it is the data lake. Hadoop sort of had its heyday and is still out there for many lakes, but now we're pushing forward into cloud storage, which seems to have the right balance of things companies are looking for in terms of cost: lower cost than Hadoop, generally speaking. And we don't have a super high amount of usage on it right now, at least in terms of numbers of people; that will change, and I'll explain that as we go along. But cloud storage price-performance seems to fit the bill for the data lake, which is where you're going to put most or all of your data. That's right, most or all of your data in the lake. Now, you're going to push some of that out to the data warehouse, probably a lot, but not necessarily all of it.
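The repetition/definition-level idea from the Parquet discussion above is worth one small sketch. This is a heavy simplification: real Parquet (following Dremel) tracks both repetition levels for repeated fields and definition levels for optional ones; the toy below shows definition levels only, and the records are made up.

```python
# Much-simplified sketch of how Parquet-style encoding flattens optional
# (nested) data into a flat column: actual values for rows where the field
# is present, plus a "definition level" per row saying how much of the
# nesting path exists. Real Parquet also needs repetition levels.

records = [
    {"name": "John", "phone": "555-0100"},
    {"name": "Mary", "phone": None},      # optional field absent
    {"name": "Frank", "phone": "555-0199"},
]

def encode_optional(records, field, max_def=1):
    values, def_levels = [], []
    for r in records:
        v = r.get(field)
        if v is None:
            def_levels.append(0)        # path undefined: store no value
        else:
            def_levels.append(max_def)  # fully defined: store the value
            values.append(v)
    return values, def_levels

values, def_levels = encode_optional(records, "phone")
```

Notice the column stores only two values for three rows; the definition levels are what let a reader reconstruct which row each value belongs to, which is how varying amounts of nested data end up in a flat column store.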
And you're going to have your data scientists working in the lake, and so on. So, a data lake: it's a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, on low-cost technologies, from which multiple downstream facilities may draw, like the data warehouse, the data marts, and so on. There's no one-size-fits-all here in terms of architecture. I'll show you an architecture slide later that puts some parameters in place for you. But nobody's coming to the table with a blank sheet of paper; consequently, nobody is going to get to any kind of great reference architecture in any short amount of time, or really by any reasonable pursuit. You have to move forward from where you are toward these architecture bits that I'm showing you here. So, could most companies use a data lake? Yes, most companies could use a data lake. If you believe in the notion of getting all your data under control, if you believe in data science, if you have data scientists, et cetera, you're going to need one. The other big structure that came along was influenced, I would say, by some technology that also came out in the 1960s. Some of you may remember IMS from IBM; I do remember it. Tree-like structures, the hierarchical model. Graph structures could be represented in network-model databases from the late 1960s, but they frequently were not, and graph only became really commercial in the mid-to-late 2000s, I'd say. The early player was Neo4j, still a huge player in this business, and a nod to Oracle there, too, because they were a good early player in graph as well. There's a whole lot to say about graph databases. As a matter of fact, I've given a whole webinar on that in this series; you can check that out on YouTube or, I think, here at DATAVERSITY. So there's a lot to it. Let me just pick on something. Gosh, I want to give you a concrete example.
Now, graph databases use a subject-predicate-object notation. That's probably the point I should make to be consistent with the other structures I've talked about: subject, predicate, object. So, John knows Frank. Okay, great. What about that "knows" business? Well, "John knows Frank" is what we call a triple, and a triple can be annotated. We have a confidence that John knows Frank of 70%, whatever that means; fairly confident, but not 100% confident by any stretch. So this is what we think. That same triple also has a provenance, which is how John got to know Frank: John got to know Frank through Mary Jones, Mary Jones being the object of that triple. So, subject, predicate, object. That is the structure of a lot of graph databases, especially the ones that follow the RDF notation; not necessarily one like Neo4j, which follows a different kind of orientation but produces a lot of the same results for you. Here, in the upper right of my graphic, you see a lot of the nodes, the so-called nodes of the graph. And we're going to label that one there in orange as a bridge vertex, because it stands out as something that connects the nodes on its left to the nodes on its right, seemingly very well. That makes it pretty important, which is one thing I always like to say about graph databases when I share information: they are not only great for visualization, they are great for determining what's important in my network. All right. So let's put this to work. Now we know some structures: we know about relational databases, column orientation, Hadoop, cloud storage, and graph databases. Those are the main ones. So let's put them to work. There is an increasing probability that if you get this right, if you get the platform selection right, that's going to lead to success.
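Before moving on to platform selection, that subject-predicate-object structure can be sketched as a minimal in-memory triple store. The confidence and provenance annotations are modeled loosely in the spirit of RDF reification, statements about a statement; the statement ID and names are invented for the example.

```python
# Minimal triple store: every fact is a (subject, predicate, object) triple,
# and annotations like confidence or provenance are themselves triples about
# a statement ID (a simplification of RDF reification).

triples = set()

def add(s, p, o):
    triples.add((s, p, o))

def query(s=None, p=None, o=None):
    # None acts as a wildcard on that position, like a SPARQL variable.
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

add("John", "knows", "Frank")
# Annotate the statement itself with the talk's confidence and provenance:
add("stmt1", "about", ("John", "knows", "Frank"))
add("stmt1", "confidence", 0.7)       # 70% confidence
add("stmt1", "provenance", "Mary Jones")

who_john_knows = [o for _, _, o in query(s="John", p="knows")]
```

Even this toy shows the appeal: one uniform shape holds the facts and the metadata about the facts, and pattern-matching queries fall out naturally.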
If you have just one question before you make your platform selection, and that question is "What do we always do?" or "Who do we have an enterprise agreement with, and what do they think?", those are not enough questions to get this right. Those put you at high risk. I've got 20 questions, and of course they're nuanced, and of course I have to believe the answers that I get, right? You should too. So you need to get more nuanced about this, most of you, to get it right. You don't want to be throwing the proverbial dart against the wall when it comes to platform selection. And here I'm not even talking about whether you're choosing Oracle or IBM or Microsoft or what have you. I'm just talking about getting into the right category, okay? Getting into the right category from the categories I mentioned before. You don't want to put a graph workload into a data lake. You don't want to put a relational workload into a graph database, et cetera, et cetera. Get that wrong and you pay for it. So that's something I want you to avoid. Let me see if I can help a little. Big decisions, big decisions. First, the data store type: whether you need a relational database or a distributed file system. Okay, that's number one. Which do you need? Do you need everything that relational gives you, that random-access ability, the fact that it works with so many tools, and that you probably have a lot of good in-house expertise and synergy with that approach? Now, it's hard to explain, say, the 20 questions and how the answers might help you arrive at the answer to that question, but I'll pick on one thing, and it's an important thing: data size, right? Data size. If you're below, I don't know, a few terabytes, you're most assuredly going to be okay on a relational database. You just don't have the complexity.
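As a toy illustration of using data size as one input to the store-type decision, here is a first-pass sketch. The thresholds are illustrative assumptions in the spirit of the talk's rough numbers, not hard rules; the real decision comes from all twenty questions, not size alone.

```python
# First-pass heuristic on data store type, driven only by data size.
# Thresholds are illustrative; a real selection weighs many more factors
# (profile, usage, non-functionals, in-house skills, and so on).

def first_pass_platform(size_tb):
    if size_tb < 3:        # "below a few terabytes"
        return "relational database"
    if size_tb < 20:
        return "either; look harder at profile and usage"
    return "probably a distributed file system (but not 100%)"

choice = first_pass_platform(1.5)  # a small analytic workload
```

The hedged middle and upper answers are deliberate: as the talk goes on to say, relational engines keep extending their abilities, so size alone never settles the question.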
You may actually have some unstructured data in a workload that size, but you know what, that's okay. Most relational databases have provisions for unstructured data today. They may be clunky, they may be poor-performing, but you'll get there from here. Now, the better organizations don't just look at data size and go, okay, let me make a quick decision here. They understand these architectures and they make a more nuanced decision. Somewhere around, again just picking a number, 20 terabytes or so, you probably need to be in a distributed file system today. But that's not 100% true, really, because there's a lot of value in relational databases in terms of what they bring. My information has to be kept up to date, because they keep adding so much to their abilities that, frankly, you can have petabytes in a relational database and it may be the right decision. I'm not sure about that, but it may be the right decision; I'm thinking more of the two working together, something like that, but nonetheless, hopefully you get my point. Next, data store placement: where are you going to put it? Public cloud, private cloud, on-prem, et cetera. That's important. Yeah, it's important. Then the workload architecture: is it operational or analytical? There's still a difference. I'm going to pick up the pace just a little bit. What is the data platform for? Here are your choices. If you're picking something outside of this, or if you think you have something outside of this, I'd ask you to reconsider. I'd ask you to reconsider whether you actually do have, oh, I don't know, what you might call a marketing database that acts as an operational data mart but is really a data warehouse to some people, et cetera, et cetera.
When you have that kind of long-tail definition that you have to put on a data store, that might be a candidate for reverse engineering into something that makes more sense in real architectures, like the things you see here. Hopefully, most of these things make sense. And yes, I did get back into the operational arena a little bit for this list. But analytically, we've got the data warehouse. We've got dependent data marts. I should have said just data marts, which are dependent or independent, meaning whether they're fed from the warehouse or not. The data lake itself. Maybe you need cloud storage, but it's not a data lake. It's not a shared-experience kind of structure that you're building; it's for an individual application. Well, first of all, I'm going to ask: are you making sure you're not shortchanging yourself and your future by building it for one application? And maybe this is the data lake, but if it's not, then you have an analytic big data application. Big data might belong in a separate cloud storage artifact, which I wouldn't call the data lake. It might be on cloud storage, but there might lie a separate category. Archive storage and staging areas for all of the above, especially the data warehouse. Yeah. Okay. So your analytics reference architecture is going to look like this. And I say this is the start. Okay. Like I mentioned before, there's no one size fits all. Not trying to say that. Most of you are going to have a much messier slide. Of course, so do I. But this is the start. Okay. And you do want to have that true-north architecture that you're moving towards. So my box here at the bottom says S3. That's one of the cloud storages. That one is from AWS; Azure and Google have them too. That's just an example. And I mentioned before, I like to put the data in Parquet. So this is your data lake. This is your data lake. Okay. And you're streaming data in.
You might be ETLing, but more likely you're going to be streaming some data in there through a Spark-oriented routine. Maybe you have some Kafka that's dealing with some of the topics at the front end of that routine. You're also going to have some ETL here. The data warehouse, typically we ETL or ELT data in there. There may be some stream processing that occurs there as well. Okay. Depends upon whether you need that data right away. I like to push things up where I can, where it doesn't hurt anything else. In other words, getting things to where they need to go faster rather than slower. So if it doesn't hurt, let's go ahead and push it in there faster. There are some pros and cons to the stream, though. I will get into that maybe a little bit later. Now the data warehouse gets to reach through to the data lake. We call that the lakehouse concept. It also pushes some offloaded data into the lake. Now that all kind of depends upon where you're storing history data. That's another of my 20 questions. That's an architecture question. Where are you storing history data? That is, all data for all time. Is that going to be in the lake? Is that going to be in the warehouse, where it's a little bit more expensive, but maybe you already have it going there and you don't want to mess with it? Maybe it's really not that big. So make some good decisions there as well. And the data warehouse tends to be where we send our users with their queries for their dashboards, their reports, et cetera, et cetera, first. And again, the data warehouse can dip into the data lake. Don't get into trouble. Don't start turning users loose for basic reporting on the data lake today. That's for other things, and the data warehouse can reach in there. I know you have to step up the next rung on the ladder to enable some of these things in your environment. Frankly, the whole idea of a data lake is to step up on that rung.
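The history-data placement question above can be reduced to a simple tiering rule. A minimal sketch, with an assumed 90-day hot window for illustration (the real cutoff is one of those architecture decisions): recent data stays in the warehouse for fast dashboards and reports, and older history is offloaded to cheaper lake storage, where the warehouse can still reach it via the lakehouse pattern.

```python
from datetime import date

def storage_tier(record_date: date, today: date, hot_days: int = 90) -> str:
    """Toy illustration of the warehouse-offload decision: "hot" data
    stays in the warehouse; older history goes to the lake. The 90-day
    default is an assumption, not a recommendation."""
    return "warehouse" if (today - record_date).days <= hot_days else "lake"

today = date(2021, 1, 1)
print(storage_tier(date(2020, 12, 1), today))  # warehouse
print(storage_tier(date(2019, 6, 1), today))   # lake
```

The same rule also works in reverse: if history turns out to be small, the cutoff can simply be extended and everything stays in the warehouse.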
And the more sophisticated organizations are the ones that have done it and embraced the technology. They've been experimenting with it. They have knowledge and expertise. And they have ways to persuade the company to do the right thing. And they get into these technologies earlier rather than later. But I have laggards in my network where pretty much the whole data environment is going to fail before they embrace distributed file systems. And that is an extreme that is definitely going to hurt you, not only in data, but as a company. So please don't be there. Please be on, not the bleeding edge, okay, but the leading edge of some of these things. You've got to pick your winners and go with them and get on early, because those companies are going to be the only winners eventually. Okay. Okay. Data warehousing. Again, I said before it's going to be on a relational database. I'll put a period after that and we'll leave it at that. Data warehouses still have a lower total cost of ownership than data marts. I've been dragging this slide around from Gartner for a long time, as you can see, but the idea is you're either doing a data warehouse or you're building data marts, whether you call them that or not. And if you're building marts that aren't associated with the warehouse, you're building all these structures that eventually will cost you more and give you less opportunity than a proper data warehouse would. I don't think there's a lot of people that would argue with that too much anymore, but the challenge becomes in getting there. And I think a lot of the times when I get pushback on this idea of build a great data warehouse, a lot of it comes because it's been hard to do, not because it isn't actually a great thing to have done. So I like to attack: well, why has it been so hard? Let's talk about that, see if we can't solve that problem, because it is so beneficial.
The data warehousing concept has been around a long time, and today it's almost like nobody has just one. Despite the fact that one would be great to have, you know, one data warehouse in the sky, they've moved into flavors. So I can almost put a flavor on every data warehouse that I encounter. Most organizations have multiple of these flavors. So the data warehouse being the primary analytic data store that's out there, and, like I said, it's going to use a relational database. So what are the great relational databases out there that want your data warehouse today and have some great ideas and provisions for handling your data warehouse? These are them. These are them. And there's a lot to say here. This is not going to be the webinar where I do a complete teardown of all of these. But quickly: Redshift was the first managed data warehouse cloud service and has great, interesting features within the data engineering ecosystem. And of course, that's Amazon, so it's going to be tight with AWS if you're there. It is limited by its tight coupling between compute and storage. BigQuery has a distinctive serverless approach and distinctive pricing by data queried, but it has great access to a data ecosystem, and it works well in real-time environments, we found. Vertica is a solid ANSI SQL-compatible relational database. High performance at all levels of data volume. Great database to go to for that. IBM Db2 cloud data warehouse, I believe that's the latest label for it, is a solid offering with a robust history, now with scaling of compute. Azure SQL Data Warehouse, sorry, that's the old term, isn't it? Things change so fast. Azure Synapse Analytics. Strong security. Oracle, with its Autonomous Data Warehouse; that's the one I put into this mix from Oracle. Don't be using OLTP databases, by the way, anymore for warehouses of any substance. All right. HANA, as we know, that's all in-memory. Snowflake went public a couple weeks ago.
That was the big news of this industry, right? Great horizontal and vertical partitioning. Great usability, no doubt about it. They're taking their long-beloved database into a strategy with shared data at the center in the cloud. And Actian Avalanche, yeah, super-fast data warehouse. They're another solid one for your consideration. But again, not trying to go pros and cons on all of these today, but these are them. And you've got to know things as you step into these waters. What is a node? They all have somewhat different definitions of a node anymore. Yeah, it used to be cleaner. But Azure SQL Data Warehouse, I need to update the slide, it's obviously Azure Synapse Analytics now, is scaled by data warehouse units, which are bundled combinations of CPU, memory, and IO. According to Microsoft, DWUs are abstract, normalized measures of compute resources and performance. There is a trend going on now, isn't there, of obfuscating the infrastructure. And some of these are more prominent in this arena than others. And some of these are also going to be more forward in terms of removing, is that a way to say it, better at having fewer knobs to turn and so on. Snowflake comes to mind. And their architecture is described as a hybrid of traditional shared-disk database architectures and shared-nothing database architectures that virtually runs itself, in their words. I still like knobs, whether I want to use them or not. I don't think we're at the point where that is a detrimental thing. But anyway, that's a little philosophy. Amazon Redshift, I skipped that one; it uses EC2-like instances with tightly coupled compute and storage nodes. And BigQuery, as I mentioned, they have what they call slots, which you can buy by the month, all you can eat, or you can pay per query by data processed. Very interesting, very interesting service approach. Maybe on the cusp of some big things here. That was fun.
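The "bundled unit" abstraction that DWUs and slots represent can be sketched in a toy way. The per-unit resource numbers below are made up purely to show the shape of the idea (one scaling dial, many underlying resources); they are not Microsoft's or Google's actual sizing.

```python
# Illustrative only: a normalized pricing unit bundles several physical
# resources behind a single dial. The per-unit figures are invented.
BUNDLE_PER_UNIT = {"vcores": 0.08, "memory_gb": 0.6, "io_mbps": 2.0}

def resources_for(units: int) -> dict:
    """Scale every bundled resource linearly with the unit count,
    the way an abstract warehouse unit hides the individual knobs."""
    return {k: round(v * units, 2) for k, v in BUNDLE_PER_UNIT.items()}

print(resources_for(500))
```

The point of the abstraction is that you scale one number up and down, and the vendor maps it to hardware; the trade-off, as noted, is fewer knobs for those who still want them.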
Everybody tends to really, if I had one of those meters where the interest in my webinars goes up or down, everybody tends to, I think anyway, go way up when I start talking about vendors. But I'm going to bring it back here to some generalized things that are very important. Costing the platform. For many, you pay for compute resources as a function of time, and I'm going to skip over a little bit, but alternatively, some cloud vendors have consumption-based pricing models: instead of paying by the hour, you pay by the byte processed. This all isn't rocket science, though. But it is something that's different than what it used to be, and you've just got to get on board with it. And once you've got it, you've got it. And you can keep an eye on it a whole lot closer, because the price is certainly a consideration as well when you're looking at these things. So price-performance really is the ultimate factor in making decisions around workloads. And it's more than just cloud costs. If your database lacks administration features, if it lacks features such that you have to do increased workarounds, increased configuration and management, if it doesn't have things like stored procedures, referential integrity, and uniqueness capabilities, you can survive with that, clearly, you know, especially in smaller workloads. But that is going to create some more coding, some more configuration of the database and so on, things that we're trying to get away from. Mission-critical options for backup and disaster recovery, which typically include a standby database: if you don't have that, you're going to do it yourself. Full ANSI SQL compliance: without it, you're going to have some rework when it comes to moving SQL to the database, et cetera. And performance. Wow. Yeah. If it doesn't perform, it's going to be sitting there racking up a bill on you without giving anything back. And that's no good. You want performance. Obviously, think of all the possibilities lost when users are sitting and waiting.
Let's do this: iterating with their data, which is what you want. There's another pricing gotcha. Scale-out impacts cost. Scale-out impacts cost. And there's a whole lot behind this. But even if you're just scaling out, let's just pick on something here, not even 24-7 or Monday through Friday, but end of month. Just end-of-month processing. It's going to blow your budget unless you've thought this through. Just scaling up: let's say that you have eight nodes, and the database requires the scale-up to be another eight nodes, because that's what they do. You may or may not need it, but it will definitely double your cost, at least for that time period. And as you can see here, there are some numbers we threw into our calculators, and it can go from 2.2 on a flatline basis to, oh my God, do you ever have to scale up, all the way up to 3.3 if you're scaling up at three times just Monday through Friday, nine to five, to handle those workloads. So if you don't spec it right in the beginning, you're in that boat. And so there's no free lunch. Sorry. Memory pressure on scale-out compute. Yeah, there's a lot of that as well. Whenever a data warehouse doesn't have enough memory to build a join hash table, which is frequent, and keep it in memory, it has to spill to disk. This is costly in terms of performance, because the database has to do double work: writing, sorting, and reading rather than having it all in memory. And you can't just provision a medium, whatever a medium-sized cluster is to you, and let it scale up to two of them during the busy hours. A large join would spill to disk on one of the clusters in order to handle the concurrency. I hope I explained that well. And then finally, getting into the appropriate IO is expensive. And the appropriate IO, for example, might not be what you are looking at. There could be a little bit of sticker shock when you have to move to a different IO tier.
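The scale-out pricing gotcha above is easy to put into rough numbers. A minimal sketch, with made-up rates and a simplified 730-hour month: you pay for the base cluster around the clock, plus the extra nodes for every hour the cluster is scaled up.

```python
def monthly_cost(base_nodes, rate_per_node_hour, scale_factor=1,
                 scaled_hours_per_week=0):
    """Sketch of the scale-out cost math (rates and hours invented):
    base cluster runs 24/7; extra nodes bill only while scaled up."""
    hours_per_month = 730                            # average month
    base = base_nodes * rate_per_node_hour * hours_per_month
    extra_nodes = base_nodes * (scale_factor - 1)    # e.g. 3x => +16 on 8
    scaled_hours = scaled_hours_per_week * 52 / 12   # weekly -> monthly
    return base + extra_nodes * rate_per_node_hour * scaled_hours

flat = monthly_cost(8, 1.0)                          # never scales
bursty = monthly_cost(8, 1.0, scale_factor=3,        # 3x, Mon-Fri 9-to-5
                      scaled_hours_per_week=40)
print(round(bursty / flat, 2))                       # roughly a 1.5x bill
```

Even with these toy numbers, scaling 3x for business hours only adds close to half the flat-line bill again, which is the shape of the 2.2-versus-3.3 spread on the slide.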
And they all have them now, different IO tiers for your workload. So anyway, we've gone through the different formats. We've applied them in our analytic enterprise. We've looked at the pricing. But we have to bring the data there, and we have to constantly bring the data there and move it all around, because there's no one place for all data. There's some data that's going to be in multiple places. Now, you can't put all data in all places, you know, just to cover yourself. That's not going to work. That's extremely inefficient. You have to allow for the fact that there's going to be some data in multiple places. And what are you going to do about that for a given, say, query? Well, I'm going to knock this one down right away. I don't mean down like that; I mean, let's talk about it real quick here. And this is definitely a solution to this: it's data virtualization. This is saying, okay, I've got data in multiple places for this query, but that's okay, because my query can reach around and get what it needs. There might be a performance hit to that. But I can live with that. It's not something I do every day. It's just for the edge queries that fall into this crack between the seats, as it were. And sometimes it's perfectly okay to architect so that there is data virtualization every day, as long as it's performing for you. It is. I still admit it is sometimes some heavy lifting to bring a new data set into one of these data stores I've talked about, data warehouses, data lakes, and so on. So in the meantime, you can do things like this. Hopefully you don't live your entire life in the meantime, and you do come back on those things and make them more architecturally sound. Capabilities for data integration: these are some of the things that you should be looking for in your data integration solution. I just finished a report on this. It'll be out probably in a week or so. Keep an eye on my social for that if you like. First, completeness of native connectivity.
By that, I mean the connectivity leverages technology-specific data access APIs when they're available, instead of using generic protocols like JDBC and ODBC. That you need. Multi-latency data ingestion: ingestion that works in batch, real-time streaming, change data capture, or a combination of those. Data integration. What do I mean here? I mean that integration is available in all of the methods that I just talked about, all of the multi-latency data ingestion methods. You have robust data integration there. Maybe ETL, maybe ELT, maybe streaming. And this should be a visual, code-free development environment supported by artificial intelligence, by the way. Data quality and data governance: applying data quality consistently across the enterprise is essential. I don't believe data quality is a product, but I do believe products help out a lot. Especially those products like data catalogs, where you can enter rules and they will be automatically applied to new data integration routines that you build. That seems very efficient. The cloud data management solution should be able to connect and scan metadata for all types of databases, SaaS apps, ETL tools, BI tools, and more to provide complete and detailed data lineage. And then we have enterprise trust and enterprise scale: all cloud infrastructure and data platforms certified to industry standards such as SOC 1, SOC 2, HIPAA, ISO, Privacy Shield. You know them. Have your data integration solution be certified to those. Artificial intelligence and automation? Yeah. So many possibilities for AI in DI. Data discovery: looking at similarity, matching, classification, schema inference, lineage inference, business term association, and so much more. Even data pipeline building, which should be self-integrating, self-healing, and self-tuning, with anomaly detection. And finally, I'll say that I like your solution to be multi-cloud, because most of us are multi-cloud today.
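The catalog-driven data quality idea above, enter a rule once and have it applied to every new integration routine, can be sketched in miniature. The rule names and checks here are hypothetical, purely to show the mechanic of centrally registered rules validating incoming records.

```python
# Toy sketch of catalog-driven data quality: rules are registered once
# per column and applied automatically to any record that flows through
# a new ingestion routine. Rule names and checks are invented examples.
RULES = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "age":   lambda v: isinstance(v, int) and 0 <= v <= 130,
}

def validate(record: dict) -> list:
    """Return the names of columns whose catalog rule failed."""
    return [col for col, ok in RULES.items()
            if col in record and not ok(record[col])]

print(validate({"email": "a@b.com", "age": 34}))        # []
print(validate({"email": "not-an-email", "age": 200}))  # ['email', 'age']
```

The efficiency claim in the talk falls out of this shape: the rule lives in one place (the catalog), so every pipeline built afterward inherits it for free.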
A solution should be hybrid and operable on the three major cloud service provider platforms as well as on-premises. These are your options, running up to the end here, so I'll be quick about it. I wasn't intending to say too much about them, but these are sort of the great options out there today. Some of them are going to be heterogeneous and work for any workload, like Informatica, Talend, and I'll throw IBM in there. There are other aspirants to that. The Cloudera approach is certainly a formidable approach for a data science environment for data integration. So there are four of these, I would say, that are good for any workload, if you want to go that route. There are others that I would say are more specialist. As a matter of fact, why don't I just show you the next slide. There are specialists such as AWS, Azure Data Factory, Fivetran, Google. Now, they're specialists for different reasons. Sometimes it's because they only work in their cloud, and they only ever will work in their cloud, and that's okay if you're going to be doing work in their cloud. But do know that you're not doing something that's going to work throughout the enterprise. And then you've got your Fivetrans, you've got your Matillions, an aspirant, I would say, for heterogeneous, but you've got Fivetran, Matillion, and others, Stitch from Talend and so on, that are more data prep, great at data ingestion more than anything else. So I'd say contain scope around that. SAP is for SAP-only environments, and Oracle is great with Oracle databases. And yeah, so I'm going to move on from that. And I have left a few minutes for questions. Back to you, Shannon, to see if we have any questions. William, thank you so much for another great presentation. If you have questions for William or Dave, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen.
And just to answer the most commonly asked question: just a reminder, I will send a follow-up email to all registrants by end of day Monday for this webinar, with links to the slides and links to the recording, as well as anything else requested here. So diving in: does the data fabric architecture help get to the one size fits all? I'll jump in on that, and Dave, you may have some things to add as well. So again, data fabric architecture, I'd have to ask a little bit more about what that means, but it depends on who's asking. Because if a user, a pedestrian user, let's say, if I may, is asking: yes, there's a data architecture here; you don't have to know about the details down in here. That's just for the data engineering geeks to know and deliver for you. So yeah, from their perspective, it's the one size fits all. But, you know, I am the data geek. We are the data geeks, most of us on this call. So for us, it does not achieve a one-platform-fits-all. That's not to say that there aren't organizations out there that are living, breathing, dying on one platform. Okay, that's certainly true. It doesn't mean it's the best thing for them or something that I would recommend. So I would say that we have a few more years to go, at least, when we are still breaking out workloads like this, like an accordion. It's expanding right now. Maybe one day it'll come together, but we're not living in that world. We're taking action on what is true today, because everything I've built for 20-plus years of consulting, everything I've built, is going to go into production as soon as we're ready, right? Not years down the road. So that's what I'm thinking about. No, the data fabric is not one size fits all from a data geek perspective. Dave, anything to add to that? Yeah, look, I think you mentioned that there are multiple workloads, right? There's analytics and operational.
So I think that for analytics, it's really beneficial to have a data fabric or a data mesh, because you have knowledge workers within the enterprise that are making business decisions on the data. So whether you're a BI analyst or whether you're a data scientist, we want to make sure that the data that you're accessing and making decisions on is consistent, and it's also governed, and people are getting access to what they should be getting access to, and they're all speaking the same language. That's why I started AtScale: at Yahoo I had all these different consumers, from Yahoo Labs to my MicroStrategy users and my Tableau users, my Qlik users, and we couldn't agree on the single simplest metric to run the business, which was what a page view was. And so it's that that really says to me that it is good to have and use a data fabric, especially for the analytics workloads and use cases. And speaking of data mesh, how does the data mesh fit in with the enterprise data platform? Dave, I'm going to let you continue right into that one, because I think you've got a great handle on it. Yeah, and I think that whether you call it data fabric or data mesh, or whether you call it virtualization, to me they're all sort of synonyms for the same sort of architecture, which is that different platforms are going to be better for storing different types of data. And I think William really spelled that out: you know, a data lake is the landing zone, and that's where data needs to land, and the data warehouse is great for taking data, blessing it, and enriching it, and making sure that it holds up to scrutiny. And you mentioned graph databases and other operational databases that are made for doing real-time types of decisioning to run a business. So again, for me, if you're making a decision on data and using it for analytics, I think it's beneficial.
I think that it's not a one size fits all, so I definitely agree with William on that. But I think it's something that, and you see how fast time is moving and how these things are changing, it also gives you that level of abstraction and allows you to sort of insulate a lot of your downstream knowledge workers from the databases, data platforms, and data technologies changing under their feet. Well, Dave, thank you so much, and William, thank you so much, but that is all the time we have for this webinar; we're just about at the top of the hour. Again, just a reminder to all registrants: I will send a follow-up email by end of day Monday for this webinar with links to the slides and the recording. Thanks to AtScale for sponsoring today, and thanks to all of our attendees for being so engaged in everything we do; we just really appreciate it. I hope everyone has a great day, and stay safe out there. Thanks, all. Thanks, Dave. Thanks, William. Bye-bye. Thanks, Shannon. Thanks, William.