Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight. Today William will be discussing platforming your data for success. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. If you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. And if you'd like to continue the conversation after the webinar, you can follow William and each other at community.dataversity.net. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce to you our speaker for this series, William McKnight. William is the president of McKnight Consulting Group. He takes corporate information and turns it into a bottom-line-producing asset. He's worked with major companies worldwide, 15 of the Global 2000, and many others. McKnight Consulting Group focuses on delivering business value and solving business problems, utilizing proven, streamlined approaches in information management. His teams have won several best practice competitions for their implementations. And he has been helping companies adopt big data solutions. And with that, I will give the floor to William to get today's webinar started. Hello and welcome.
Thank you, Shannon. And welcome, everybody. So glad you are taking some time out of a most crazy day out there in the world.
But you know, these skills are going to be necessary when the dust settles. And now might be just a great time to invest in yourself. And I hope to be a small part of that today. We're going to talk about platforming data across the divide of analytics and operations. We'll focus here on analytics, of course. That's where there seem to be a lot of choices these days. This is a topic where, in the course of an hour, I'm only going to be able to hit a certain level of information as opposed to drilling down too deep. But hopefully I get you onto a path that's right for you in your current situation. Because we're all out there either dealing with something that has been platformed or we are going to platform something in the near future. And this is going to be ongoing. It's going to be a big part of any data professional's career. And any analytic professional's career has to do with where their data is located. And it's important that you get it right. I'm going to try to make that case here as well: that it's important that you don't just pick the same old platform. That can really get you into trouble as you go forward and lack capabilities. And a lot of that sort of thing still happens, of course, but it does get smoked out in due time. And I want you to make a good, scalable decision here. Not that any decision is worth spending months and months and years and years to make. You've got to make it based upon the information that you have and use good judgment, of course. And I'm just trying to add to that judgment here, that judgment mix that you bring to the table here today. So with that, okay, I've been introduced. You see a couple of my books there, and this is what we do at McKnight Consulting Group. If I can help you in any way, please feel free to reach out. Platforming data is part of what I do all the time. So it's a lot of fun keeping up with this industry. And I can tell you this guy, this is Mr. Toad, is that right?
From Disneyland, apparently he has a pretty wild ride over there. And we've been on a wild ride ourselves, haven't we? With Hadoop, with cloud, with NoSQL, with MDM, with stream processing, with graphs, with in-memory databases, and with competing, shall we say, standards, if you will, about how we should be doing things, and reference architectures that are completely contradictory being proposed to you out there, I know. And so it is a challenging environment. And it's important that we get all our data under control. And so even as the environment becomes more challenging in terms of platforms, it's all good. It's all because the market is moving to where we sit. We sit on data. And that is really what's important today in business. I wrote a blog entry the other day. I said business is about data. And really, I mean that. We are in that era of give me all data, fast and effectively. If for no other reason today, many of you are telling me you're getting ready for artificial intelligence. And you're gathering your data in a great platform or platforms, as the case may be, so that it's ready for artificial intelligence. You know that that's going to require a lot of data. All the data you can muster, really. So if for no other reason, let's do that. Now, at some point or another, I need to address the Hadoop question. And I think that, you know, it kind of came upon us in the 2000s and some semblance of it is still around. Is it dead? Depends what you mean by Hadoop, right? Okay, there's about 50 projects in there, 50-plus Apache projects. I tried to count them yesterday. I ran out of time. There are so many projects that are under Hadoop now. That's alive and well. And those projects are definitely finding their way into stacks all over the place. Now, original Hadoop was HDFS and MapReduce.
We're not really doing that so much anymore, although plenty of companies did that and still have that and haven't come to a point where they have to change that, so they're not. Now, Hadoop did come to mean a lot of on-premises hard work that we are now discovering may not have been the best thing for us now that the cloud is really maturing. And so what has matured a lot is cloud databases. And I'm going to make a point here today that that is where a lot of your data belongs. Certainly most of it belongs in the cloud, but a lot of it belongs in these cloud analytical databases. And they've been around for a while. I just recalled the other day that back in 2011, Google BigQuery came out. I know you're thinking, well, BigQuery, that just came out a few years ago. Well, yeah, the update did, but it originally came out back in 2011. And it really didn't catch on back then. It wasn't very good at some kinds of joins and didn't have standard SQL, so it was not widely used. But Google itself made a lot of use of it, and now makes that available, of course, along with so many others. And we'll get to some of that as we go along here. Now, I did mention how important artificial intelligence data is. Come back next month when I will focus on this as a topic. But this is a list of some of the data that you're going to want to collect for artificial intelligence. I'll go into the why of it more next month. But for now, if you look across this, I think you can just see on the face of it that not all of this is going to fit in one platform. There are so many different kinds of data in here. And we are definitely in the era where you have to pick the right platform for the workload. And there is no one-size-fits-all. And not all of this or anything else is going to go in any one platform. So I'm asked a lot about what's the platform for AI data, or how do we architect for the AI future?
And the answer right now, I'll say my best answer right now, is to have a great architecture. Have a great architecture and a place for data. Have fit-for-purpose artifacts in there. Have a great data flow, a sensible data flow, one that moves data sensibly with pipelines or even ETL and whatnot. So great architecture is the foundation of great platform selection. You want to do that first; do the architecture selection first. You don't want to always be running to the same vendor for your tooling, necessarily. That can be really problematic. Now, speaking of things that are bugaboos for me, here's another. It's this one. It's that we focus so much on the AI component of our stack. We don't focus enough on the data stack, but that is really where the majority of the work effort should go. And if that is done right, you can interchange and have a lot of different BI and AI stacks on top of great data, and it's all going to work harmoniously. But so many times we look at our data stack and we say, well, it's insufficient, but the solution is to get another tool that sits on top and has to go through a ton of work to get to the point where it's actually doing what it should be able to do just by slapping it on the data. And that's why these BI tools, or what we now refer to as BI tools, the Qliks and the Tableaus and Sisenses and so on, they're really infrastructures in and of themselves now, aren't they? They're not like the BI tools of yesteryear, like the Business Objects and Cognos and so on, that pretty much sat on data and pulled data and did their job. They have built in a lot of abilities for you to pull data and do certain other things. Certainly the Trifactas and the Paxatas and that lot are really famous for doing that. And the reason is because we don't have great architecture, but I still say, even with those tools in place, that the energy needs to go into the data platform.
That's why we're here today, because there's an increased probability that good platform selection is going to lead to success. If you just keep reaching for the same old platform, your chance of success for the overall project, and this isn't scientific or anything, this is just my experience having been a part of some 70-odd programs, is going to be pretty low today, maybe 50-50. It's like throwing a dart. And success is really a relative term now, isn't it? But you kind of know when you have it if you've been part of enough projects. Most projects, I would say, don't get to that point, unfortunately. But here's one of the reasons why. We just go for the same old platform. Now, it's a little bit better if you extend within that vendor and get into the right platform category, if you will, from that vendor. But if you keep going back to the same vendor over and over again for all your platforms, that can also be pretty problematic. You're only hearing one message, one voice. And today, now that we've had about, oh, I don't know, the relational model has been out about 30 years, 30-plus years, we've had all this time to build up. Many enterprises, obviously, are 30-plus years old and have built up a lot of infrastructure. By now, we have a lot of built-up technical debt. And we have to decide, OK, what technical debt do we eliminate here? And how do we jump up? And is it a sign of maturity to get into new categories of data platform? And I say yes, to a degree. Yes, it indicates to me some degree of maturity if a client is already into cloud storage, if a client is already into some of the things I think are going to be oldies but goodies in the future, keepers: artificial intelligence, master data management, stream processing, things like that, some of which I'll get into in this presentation. But it's important that you select the platform correctly.
And it's really important that you get into the right tool within that platform. But let's start with the platform. What is it going to be used for? This is kind of one of the first decisions. These are architectural decisions here. And I suggest to you that this is a finite list of the things that we really want to see in terms of data storage in an enterprise. An operational database: the majority of databases are operational. The majority of data stores are operational. Some of them now are for real time, and that brings its own nuance to the decision. Some of them are with big data, and that might speak to a NoSQL solution, for example. There are operational data hubs, which kind of serve as operational warehouses, if you will, if you're familiar with the data warehouse and how it distributes information. That's happening now operationally because the warehouse was too late in the data cycle to be effective operationally. So some of my clients are doing that. Of course, master data management. That is an operational concept, by the way. That's an operational database, a small one, small but important. And there are definite decisions to be made in regards to where data belongs: in master data management, in an operational data hub, in a CRM, for example. So that's something that we like to parse over time. A data warehouse, of course; we're mostly going to be familiar with that, I hope. And data marts that are either dependent on or independent of that data warehouse. And data mart is kind of a catch-all term. So if you don't see your database in here any other way, it might be a data mart, but hopefully it's an architected data mart. And there is a difference between something that's architected and done well, you know, and something that was slapped together just to get through the day. Well, it gets something to production on time. And I do respect that as well, by the way.
An analytical application or an analytic big data application; here we're talking about a singular application. Archive storage, which can be a lot of things. A staging area, probably staging for your warehouse, but maybe staging for something else. So don't get into this architecture where, well, here's the database, but it's kind of a this and kind of a that. And, you know, one thing that we do a lot of is re-engineer environments to be more sensible and more efficient over time and cost less and deliver more, et cetera. And, you know, this used to be called data mart consolidation, where an organization has sprung up so many data marts to meet such small, finite purposes that there's a lot of synergy to be found by consolidating some of those data marts. And certainly that's a big part of overall re-engineering. Trying to get your artifacts into this list is a part of that. So take a look at what you've got, and see if you can label them appropriately. And if you can't, that might become a candidate for something that you can re-engineer. Now, speaking of re-engineering, wow, that sounds like a lot. You know, you've got a lot on your plate right now to deliver what you need to deliver. I'm going to get into this a little bit more later in the presentation, but the way to get to great architecture is to marry the great architecture need that you have with a business deliverable that you're going to do anyway. So a lot of my clients struggle getting into the cloud, for example. They struggle getting into big data, into cloud storage for big data, etc. They struggle getting into master data management. It's all because you've got to cross the threshold to do that.
And the poor person who thought they were going to be a champion of building a data warehouse, for example, or building a great new analytical application, now finds that, oh, well, if I'm going to be the first to move something of significance into the cloud for this organization, that's going to actually consume more of my time and energy than building the warehouse or the application database in and of itself. And so for that reason, many turn away from it and kind of kick that can down the road. And so I don't want you to do that. I want to empower you. And I just want you to be aware, because if you're aware going in that, oh, the cloud's a big deal, you know, you're going to be so much better off. Now, getting into the right data platform, there are four major decisions. Actually, just this morning I was thinking I really need to add a fifth, but let's see what I've got on here already. The data store type: do I need what the relational model brings to me? And if you're a data professional, you should understand what the relational model is, because there are ramifications. And not just, oh, it's a table. It's, you know, down at the data page level, you've got these consecutive columns that comprise a row, and then you've got consecutive rows on the storage and so on. And there are ID maps and so on that map where the data is. So it's helpful in getting random access to data. So it's really good for so many different things. And certainly most of the major data stores out there are relational. But you don't need it for everything from 2020 going forward, necessarily. Cloud storage is pretty much, I don't want to say the opposite, but it's sort of void of all of that structure that the relational model has for you when it stores data. So when do you need it, and when do you not? Well, you have to understand kind of what it is and what it gives you. Data store placement is another one. It's not so simple anymore.
A lot of you are working under cloud mandates, I know. And so are we at our clients in some cases. And in some cases, it's the opposite. It's really, you've got to prove that the cloud is going to be workable for this situation because, you know, my sponsors don't want to take all the arrows and so on. But I think that if you know what you're doing, you will do it right, and that will not be a problem. And you'll win the day if you do the right thing here. And a lot of times today, the right thing is definitely data store placement in the cloud. There are reasons why you wouldn't, and I think I have a slide on that a little bit later. But the third decision here is the workload architecture. There's still a distinction between operational and analytical workloads. And if you choose a database that's geared one way or the other for the wrong kind of architecture, you are probably asking for a bit of a world of hurt there. Now, it used to be, 25 years ago, that there was only one database. And I worked on DB2 back at IBM then. In those days, we were just happy that we had a database. And oh, we need to make a copy of this data over here for the data warehouse, but it's the same database, same, you know, DBMS, I should say. But that quickly became not good enough. And now we have significantly specialized databases for analytics. And it's a good thing; you get a lot of value from being in one of those databases, as opposed to an OLTP database, if I may say so, for your analytical architecture. Finally, I've added, and this is just this year, really, the node architecture, because we do a ton of benchmarks for our clients, for the market, and so on. And we're looking at this all the time. We're seeing this is a really big decision now, because you have different types of high-performance storage architecture that you can pick from.
And now that we're separating memory, or excuse me, compute from storage, you have to determine the profile that's right for you. You have decisions to be made there every single time. And you want to do that right, because that's a highly leverageable decision in the overall balance of things. Now, the fifth decision that I was going to add, or I will add onto this, is the pricing architecture, because you have to be somewhat knowledgeable about how these things are priced. And, you know, much of the time now it's consumption based, right? But sometimes you will get a great deal if you commit for three years. And I know you're thinking, well, I thought that was not in play anymore now that we're in the cloud, but it certainly is. It certainly is. So those kinds of decisions, and downtime decisions, are also important because, you know, if you're going to run the database 24-7 and you choose the wrong database, the one that charges you by the query instead of by, you know, the actual consumption, which, by the way, is usually around 25% or something like that for our clients, maybe not for you, but it's a lot lower than some people think, you could actually be in a world of pricing hurt. Sorry to reuse that. But anyway, let's talk about the 800-pound gorilla when it comes to platforming, and that's around these things: data warehouse, data marts, data lakes, the data. What about that? So quickly, data warehousing still is important. Data warehousing is changing, however. It is changing in terms of being able to access different levels of storage, which means different types of data than it used to. It's not all going to be necessarily strictly relational data. I'll get into that as we go along here, but take the very best, most basic definition of a data warehouse. It's a shared platform.
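To make the pricing-architecture decision concrete, here is a minimal sketch in Python that compares a pay-per-use plan with an always-billed commitment at a given utilization. The plan names and hourly rates are hypothetical, chosen only for illustration; they are not any vendor's actual prices. The 25% utilization figure is the one mentioned in the talk.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # approximate average number of hours in a month


@dataclass
class PricingPlan:
    name: str
    hourly_rate: float      # hypothetical price per compute hour
    billed_when_idle: bool  # True if every hour is billed regardless of use


def monthly_cost(plan: PricingPlan, utilization: float) -> float:
    """Monthly cost at the given utilization (fraction of hours actually used)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if plan.billed_when_idle:
        billed_hours = HOURS_PER_MONTH
    else:
        billed_hours = HOURS_PER_MONTH * utilization
    return plan.hourly_rate * billed_hours


def break_even_utilization(pay_per_use: PricingPlan, committed: PricingPlan) -> float:
    """Utilization above which the always-billed commitment becomes cheaper."""
    return committed.hourly_rate / pay_per_use.hourly_rate


# Hypothetical rates, for illustration only.
on_demand = PricingPlan("pay-per-use", hourly_rate=4.00, billed_when_idle=False)
committed = PricingPlan("three-year commitment", hourly_rate=1.60, billed_when_idle=True)

if __name__ == "__main__":
    for plan in (on_demand, committed):
        print(plan.name, round(monthly_cost(plan, 0.25), 2))
    print("break-even utilization:", break_even_utilization(on_demand, committed))
```

Under these assumed rates, a workload that is busy about a quarter of the time is cheaper on the pay-per-use plan, while a database that truly runs 24-7 favors the commitment. The point of the sketch is only that the answer depends on measured utilization, so it is worth measuring before choosing.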
And by the way, at the risk of dating myself, I went ahead and put up the old Bill Inmon definition here, which, in my view anyway, still holds true. These are great things to know about a data warehouse. But anyway, here's another Gartner Group chart I've been dragging around forever. This is kind of the old-timey slide, if you will. But it speaks volumes, right? If you have a bunch of data marts, and they don't really define what that is, but we kind of know, right? Over time, you're just going to pay more and more for that kind of environment versus building a data warehouse. So this idea of a shared database is very important in every enterprise. And it's important to put that on a relational model. So I've been talking a little bit about the relational model. I mentioned the data page and how the data is spread out on that data page and so on. There are ID maps to where the records are located and so on. These are a lot of things that you get from relational, and you don't get from anything else, okay? And certainly, a lot of these functions, anyway, have been enhanced in the recent explosion of databases on the market. Consistency, transactions. Not that you're always doing transactions in there, but it's nice to know that they will be whole. Partitioning, arrays, inheritance. You can read this as well, but some I'll point out: custom data types like JSON, XML, and so on. Built-in graph capabilities. Yes, I'm excited about this. I'm excited about the possibilities that graph algorithms now bring to the table. And if you want more on that, I believe last month or maybe the month before, which would be what, January? I think it was December, actually. I talked about graph databases. So if you want to go learn more about that, I had an hour on that for you back then. Also, in the relational model, how you organize that data will have a lot to do with the performance of the query set that you throw at it.
Now might be the time to talk a little bit about columnar and how that is the preferred way for analytic databases. All the big ones, Snowflake, Redshift, and BigQuery, for example, have columnar capabilities out of the box and as a default. So they're going to store data not by row, but by column. They're going to aggregate all of a single column's values together, or groups of columns, if you choose. But anyway, the point is that this really facilitates the analytical query. I've done a lot of studies on this for my clients to show them that, yes, they need to spin, if you will, their analytic databases to a columnar orientation. The other thing I'll mention is how you sort the data in a relational model. I think I mentioned this before, but how you sort the data has a lot to do with what queries it's going to be optimized for. So design has a lot to do with it as well. So the more you know, as I say, the better off you're going to be going into your relational design. So if you know some of the queries, great. Don't assume that you're ever going to know all of them, though, and keep your options open. Now let's look at the analytic data ecosystem. I'll have a more detailed chart as we go along here, but first, at a high level, you're going to have a data lake. Yes, I'm introducing that now. And okay, we've got to define data lake. I'm kind of using it to mean big cloud storage that is for multiple consumption. So it's like a data warehouse in that sense, but it's cloud storage. So we're saying big data belongs there, okay? We're also saying all data belongs there, because one of the functions of the lake is going to be to push data to the data warehouse and serve as the staging area. I think that's a great use of the data lake. It's also serving data to the data scientist.
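As a rough illustration of why a columnar orientation helps the analytical query, the sketch below stores the same three records by row and by column and counts how many stored values an aggregate has to touch. This is a toy model in plain Python with made-up records and function names; real engines read pages or blocks and add compression, but the proportion of data scanned follows the same logic.

```python
from typing import Dict, List, Tuple

# Made-up sample records; every record is assumed to have the same fields.
rows: List[dict] = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 42.0},
]


def to_columnar(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list of values per column."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(records: List[dict], column: str) -> Tuple[float, int]:
    """Sum one column in a row layout; each whole row must be read."""
    total = 0.0
    values_touched = 0
    for record in records:
        values_touched += len(record)  # the entire row comes off storage
        total += record[column]
    return total, values_touched


def sum_from_columns(columns: Dict[str, list], column: str) -> Tuple[float, int]:
    """Sum one column in a column layout; only that column is read."""
    values = columns[column]
    return float(sum(values)), len(values)


if __name__ == "__main__":
    print(sum_from_rows(rows, "amount"))
    print(sum_from_columns(to_columnar(rows), "amount"))
```

With three fields per record, the row layout touches three times as many values as the column layout to produce the same total. On a wide table with dozens of columns, where a typical analytical query reads only a handful, the gap grows accordingly.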
Now, the data scientist can go anywhere here, of course, but they tend to spend a lot of time in your data lake. And it's important to define terms as well here, to make sure that you're on the same page with whoever you're talking to. Some people will look at this and go, well, this whole thing you're showing me here, William, that's the data warehouse. You know, it truly is an ecosystem, or it's a logical data warehouse, okay? I can roll with that, just so I know where you're coming from. I tend to speak more physically, as in, well, that database itself, that database is the data warehouse. That data store is the data lake, and so on. So yeah, I've got to know what you think. Now, other people will look at this whole thing and they will say, well, William, that's the data lake. And that's what we're calling it. And I find that a lot of people are excited about the term and are really building a data warehouse, in my terms, but they're calling it a data lake. So again, just like me, I'll roll with whatever you want to bring to the table there. And it's important to get on the same page with whoever you're talking to. Now, data warehouses out there today have flavors. Yeah, that's my word. And this is based on a study. This was actually a published white paper I did, where I found that today, that idea that there's one database for the enterprise didn't really happen so much. There can be multiple. There probably are multiple. And they fall into these categories, generally speaking: customer experience, asset maximization, operational extension, risk management, finance modernization, and product innovation data warehouses. And I'm not saying everybody's going to have six, by any stretch. Some of these are consolidated, obviously, within an organization. But the idea that there was one for the enterprise never really happened. It's a great goal.
It's just like what I was saying before, that this idea of sharing is great. So the more the better, but it's just hard to pull off sometimes. And it's important to keep progress moving forward. So balancing all this out, and making good decisions and bad decisions, organizations have tended to come to this point where they have flavors of their data warehouse across the board. Now they all have to, or they should, work together kind of harmoniously and not have a lot of redundancy and so on. There's a lot of opportunity when you have multiple data warehouses to find some redundancy and eliminate it, and create some efficiencies. That may or may not be the best use of your time. But take a look at your data warehouses, and it's okay if you have multiple of these. Most people do. What is required for your data warehouse, or for any modern analytical database, whether it's the warehouse or not? In-database analytics. Yeah. Having the capability to throw some analytical function at that data, native to the database, is a big leg up. Now, this kind of brings up, what about in-database machine learning? You know, that's sort of a mixed bag out there right now. I would say what is out there is not that usable yet. So you're going to have to augment that and bring your own machine learning capabilities to the data, today anyway. And there are a lot of tools that are available for that. But over time, in-database machine learning will become very important. In-memory capabilities, and the columnar orientation that I spoke about. Modern programming languages and frameworks, like Python, R, Scala, TensorFlow, Spark, MLlib, et cetera. Those are some of the ones that I want to be sure this works with. And new data types; I mentioned some of those before. So make sure that your solution works with those. I talked a little bit already about the columnar orientation, so I won't belabor it here. It is quite important.
Back in 2013, I believe, when Redshift came out, that was the first good columnar database that was sort of at the modern price point that we expect for data, and not, you know, a super huge enterprise big-time contract kind of thing. Obviously, it's in class. So Amazon did this by buying the source code from a company called ParAccel. Some of you remember them. It may be a challenge to evolve that greatly at this point, but they're doing a good job. And I did want to call out their columnar orientation, kind of setting the stage for where we're at now, saying that columnar is important for all your analytics. Now, object storage instances. This is what I refer to as the data lake, right? Okay. And I also talked about how the data warehouse is changing, so this is where I get into some of that. Object storage instances and clusters have local storage, i.e., on the physical drives mounted to the instances themselves. That is HDFS and Hive. Other technologies access their cloud vendor's respective cloud storage. So Amazon EMR, for example, accesses S3, S3 being kind of the granddaddy right now, I would say, of cloud storage. But coming on strong is Azure Data Lake Storage Gen2, which has been out, I don't know, about six months, and provides a lot of advantages. And then Google Cloud Storage, of course, coming on. So local storage is used by the object storage platform for its housekeeping. There's also a serverless way with Amazon Athena, etc., but most of us are building our lakes like this. Now, what I'm suggesting, and I'm not the first, is that you build data lakes with analytical access pricing. And how you do that is by pairing a lake with an analytical engine that charges only by what you use; otherwise you can get into a lot of costs. Remember, I said the fifth dimension of the selection now is the pricing one, the pricing architecture. And this is where this really comes out. When you're deciding, well, what data do I put into my cloud storage?
What data do I put into my relational storage? Because you're going to pay a performance hit for reaching into the cloud storage, but you're obviously going to save a lot of money on the storage, by a factor of four, five, something like that. And plus, it's only right for some unstructured data. You're not going to want to force your unstructured data, as I have done, as we all have done, right, for the past decade, you're not going to want to try to force-fit that into the relational model. And by not doing that, and opening up the possibilities of cloud storage for unstructured data, you're going to be storing a lot more data. And that's going to serve you well as you go forward into this artificial intelligence future. Now here's a little reference architecture, if you will. I've introduced a few things now that may look different from before, but we have different low-latency sources. Okay, there's still such a thing as source data. I still call it source data if it's over there and we're bringing it over here, but we're bringing a lot of that now through a Kafka or a Streamlio or a RabbitMQ or something like that, that's going to parse this data down into its topics and, through streaming or Spark, send it to my data lake. That data lake is that box on the bottom there, and I am sort of suggesting that I like Parquet here, am I not? So yeah, think about that as well as an orientation for your data lake. S3 is representative of whatever cloud storage you choose to use, and you'll notice that I have all the data, including the batch data, coming on in there to the data lake. Now, how does that data lake interface with my data warehouse, or warehouses, as the case may be? Well, the data lake has all the data, so the data warehouse can reach in there and grab that data and use it. Now, there's that performance hit I talked about. The data warehouse can also offload its data.
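The ingestion side of that reference architecture, messages parsed into topics and then landed in cloud storage, can be sketched as follows. This is a pure-Python stand-in rather than Kafka or Spark code: the message shape, the bucket name `example-lake`, and the `topic/dt=YYYY-MM-DD/` path layout are all assumptions made for illustration. In a real pipeline a stream processor would write each batch as a Parquet file under a path like these.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List


def lake_path(topic: str, event_time: datetime, bucket: str = "example-lake") -> str:
    """Build a date-partitioned object storage prefix for a topic (hypothetical layout)."""
    return f"s3://{bucket}/{topic}/dt={event_time:%Y-%m-%d}/"


def route_to_lake(messages: List[dict]) -> Dict[str, List[dict]]:
    """Group message payloads by the lake prefix they would be written under."""
    batches: Dict[str, List[dict]] = defaultdict(list)
    for message in messages:
        # "ts" is assumed to be a Unix timestamp in seconds, interpreted as UTC.
        event_time = datetime.fromtimestamp(message["ts"], tz=timezone.utc)
        batches[lake_path(message["topic"], event_time)].append(message["payload"])
    return dict(batches)


# Made-up messages standing in for what a broker would deliver.
messages = [
    {"topic": "orders", "ts": 0, "payload": {"order_id": 1}},
    {"topic": "orders", "ts": 86400, "payload": {"order_id": 2}},
    {"topic": "clicks", "ts": 10, "payload": {"page": "/home"}},
    {"topic": "orders", "ts": 20, "payload": {"order_id": 3}},
]

if __name__ == "__main__":
    for path, payloads in sorted(route_to_lake(messages).items()):
        print(path, len(payloads))
```

Partitioning by topic and date is one common choice because it lets the warehouse, or a data scientist, reach into the lake for just the slice it needs, which limits the performance hit of reading from cloud storage.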
If it's collecting data over time, it can offload some of that to the cloud storage. However, if you jump into an architecture and you just do this, you don't have any historical data in the warehouse that's not in your data lake. But most of us do, because we're not jumping into this clean. We have history in the data warehouse. So at some point, we might push that off to the data lake, even though we have not necessarily sourced that data into the data lake. That's a lot of words. I hope that made sense. And you're bringing the data in from the lake into the warehouse through stream processing or ETL, ELT, or, kind of how we like to do it now, with Spark. And yes, indeed, the querying, the dashboarding, the ad hoc stuff, that is all still happening against the data warehouse and the assorted marts that get pushed off from the warehouse. Now, this is a nice clean sheet of paper. A nice clean architecture. Believe me, nobody ever presents this to me when I come to a client. So if you're saying, well, I guess we're a lost cause because ours doesn't look like this, I haven't met the lost cause yet. But should we try to get in here? Yeah, we should try to get somewhere in here, I believe. So here are some notes on the data warehouse of the future: more achievable separation of compute and storage, et cetera. I'm going to push forward here because I want to spend some time on cloud analytic databases, which I gave the number one ribbon. It should be blue, I guess. But it's number one for your workload. For most of your workloads, I say they belong here. Analytic workloads. Okay, let's talk about analytic workloads. They belong in the cloud. They belong in a specialized cloud analytic database. There are so many advantages to this. I'm going to hopefully share some of that with you. They have robust SQL, built-in optimization, and on-the-fly elasticity. So it will grow. The cluster will grow and shrink as necessary.
Keep an eye on that, though, because sometimes it's not exactly fluid and dynamic. It can be somewhat of a stepladder approach where, oh, you need to have another terabyte? Let's add another terabyte. That's not really dynamic, but some of them are still that way. So you've got to throw that in the mix. I don't think that's a knockout factor, but you've got to throw that in the mix. Separation of compute and storage. Separation of compute and storage was really, I'd say, popularized in the database world by Snowflake back in 2015 or so when they came out. They store data separately from compute, the data being in S3. You can provision any number of compute nodes, and that idea really took off in 2017. It's responsible for a lot of their success, frankly. I do believe now, of course, a lot of them do this very well. So there you go. What else do I want to say about it? That idea came from the Hadoop world, which had that already by the time the database world got some of that. Cloud analytic databases in the enterprise. Okay, so we're in the enterprise. What do we do? How do we get there? Do we put our test and dev there and keep our prod on-prem? I'm saying I want you to have all this in the cloud, but I get that it might be a stepwise progression there. But what's your plan? That's my question: what's your plan? Have a plan. Have a plan for getting where you need to go, where you're able to step back and determine where this thing needs to go, and then marry up the projects with business success along the way. And that's the real key. That's the real key to a data leader in the enterprise today. When I talk to a true data leader, they already know that it's a mixed message, that we're speaking as adults. And we know that there's no one size fits all. And they realize they have to serve the business while they grow the maturity of the environment. Not good enough. Not good enough.
It's not good enough just to serve the business, because they will take you down an artificial hole, if you will, if you're just, you know, an order taker. We don't need that. We need real leaders in our data roles. Okay, off that soapbox. Performance. Managed cloud databases are the winner of all these categories here for performance. Querying cloud storage directly is inefficient. And bringing subsets of data down for on-premises processing takes time and costs egress fees. Yes, pulling data out of the lake costs you some egress fees. Nobody can predict ahead of time exactly what it's going to be, but you can get kind of in the ballpark. And that's what I encourage you to do. And that's what we try to do when we platform these solutions and try to price them out. So the point there is not 'don't put any data in cloud storage because, oh, there's a fee and there's a cost.' No, that's not the point. Overall, you could have some serious savings there. But the point here is really that if you just have cloud storage, it's the data warehouse that's reaching into the cloud storage. If you just have cloud storage and you just have tools that access data in cloud storage, that's not going to be the most efficient. So, in my terms, is the data lake going to replace the data warehouse? Absolutely not. But I do see them coming together. Performance testing on Hadoop, with the likes of Hive, Spark, and Impala, has shown improvements in performance, but they still lag significantly behind the performance and power of a solid relational cloud database. Managed cloud databases win on administration as well. Great. So I've been pumping up the promise of the relational database here. So why do we need big data technologies for big data? Well, there are some reasons why cloud storage, and formerly Hadoop, and I shouldn't say formerly, because even today Hadoop would be better than a relational database for much of this. But I'm not sure why you'd do that if you have cloud storage there.
See, I thought that Hadoop provided a nice balance of the structure that's needed for data in the enterprise. It clearly wasn't near the structure that a relational database provides, but it wasn't on the other end of the spectrum, which is basically nothing, in terms of what cloud storage provided. You still had the HCatalog that told you where the records were. You were able to scan less than a full cluster, and so on. But I was wrong about that, in that how they priced their Hadoop, and so on, opened up a market for the cloud storage providers. And now we're seeing a bit of a scramble for what was Hadoop. Anyway, why big data technologies for big data? I mentioned new data types. Schema-less. Relaxed ACID, because that can be costly. Faster, less expensive provisioning. Programmer freedoms. Yeah, oftentimes the data scientist is a programmer. I mean, they know how to utilize these tools. They're not just turning their head and talking to a programmer and saying, would you mind giving me this kind of data? They're doing it and they're interacting with it. And that's kind of where we are now. I think it'll change. I think we'll get more specialized, just like we have with everything as time goes on, but that is where it is today. As a data architect, even for my clients, I can't even say 100% what the data scientist is doing, and I don't even know if they can say 100% what they're doing with the data, because it changes on a day-to-day basis. But I do know how to provision them an environment where they get access to the data that they need. And they're pretty happy with it right now, because the opposite, or the alternative, if you will, is to not have access to data. And that would be really bad for the data scientist. So the data lake, let's recap a little bit on that. It's the data scientist workbench and sort of data warehouse staging. The great data lakes out there do both of these functions and do them well.
They are related to the data warehouse. There are many lakes that have been built without involvement from, or with, the data warehouse. And I think that's a shame, because these two need to work together harmoniously. As a matter of fact, you could call the whole thing, especially if you're accessing data in the warehouse and it's reaching into the lake for excess data or unstructured data, etc., you could call the whole thing the data warehouse now. So I don't know where the label goes. I'm not really one for forging a lot of labels into the vernacular. But you've got a lake, you've got a warehouse, and they're kind of coming together here, architecturally becoming very close cousins, I would say. So that might be something you need on your roadmap, which is to get them working together better. Now, I put down HDFS a bit. I am going to give it a nod in terms of query performance. I'm going to give it a nod in working with objects of a certain size, working around those PUT limits that we might have in cloud storage, and so on. But knowing what you're doing, that's not the worst thing in the world. Cloud storage is more scalable and persistent. It is backed up and supports compression, making the cost of big data less. And yeah, it comes down to cost a lot. I'm not even sure we ever would have had a big data movement. And it's been around about a decade now, I'd say, but I'm not sure we ever would have had the cost of storage drop precipitously before then for other kinds of data. But anyway, leveraging cloud storage for data lakes. Yeah, these are some of the reasons why you might do that. I think I've hit on most of these. So let's talk. I'm such a graph advocate these days, it's hard for me to talk about platforming the enterprise at this level, this level of 50-minute discussion, without bringing in some minutes to talk about graph databases.
Now, like I said, I spent a whole hour on it some months ago; go get that as well. But I want you to identify graph workloads and think about the value of having them in a graph database, or at the least utilizing the graph capabilities of your favorite relational database, which is going to be less than what you get in a graph database. But anyway, if the workload is identified by words like network, hierarchy, tree, ancestry, or structure, these are flash words to me. They immediately turn my attention to perhaps a graph database. If you're planning to use different relational performance tricks to make it happen, you might be needing a graph database. If your queries are about pathing, you might need a graph database. If you're looking for non-obvious patterns, and so on. So yeah, graph databases, many of them store data in what are called triples. And it's a beautiful thing: subject, predicate, and object. We have a name, John knows Frank, etc. I'm not going to get into the same great detail. We have the nice visualization capabilities of graphs, which sells a lot of people right there. Never mind the triples; they like to see their data and interact with it in that manner. So that's a nice thing as well. But anyway, where are we going in the future? Well, there are some future things I want to put on your horizon. I'm not saying that you make decisions in 2020 based upon this future, however. I'm not saying there are any decisions I've come across in an enterprise that are worth waiting, what is it now, March? Worth waiting nine months to make. But these are things where, hopefully, if you're in a bigger enterprise, you have some R&D devoted to looking at this and being ready when the time comes, because the early adopter is going to get a lot of the benefits of doing this. Doing what? Doing GPU databases, for example. A GPU database is like CPU on steroids. They actually work with CPUs.
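Stepping back to the graph discussion for a moment: the triple idea (subject, predicate, object) and a "pathing" query can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how any particular graph database stores or queries data. Only "John knows Frank" comes from the talk; the other names, the `works_at` predicate, and the helper functions `neighbors` and `find_path` are hypothetical.

```python
from collections import deque

# Each fact is one (subject, predicate, object) triple.
# Only ("John", "knows", "Frank") is from the talk; the rest is made up.
triples = [
    ("John", "knows", "Frank"),
    ("Frank", "knows", "Maria"),
    ("Maria", "works_at", "Acme"),
    ("John", "works_at", "Initech"),
]

def neighbors(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def find_path(start, goal, predicate="knows"):
    """Breadth-first search for a shortest chain of `predicate` edges."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1], predicate):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of `predicate` edges connects the two

print(find_path("John", "Maria"))
```

In a relational model the same question would typically need a self-join per hop or a recursive query, which is the kind of "performance trick" the talk flags as a hint that the workload may be a graph workload.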
But anyway, there you see some of the names that are in this category. I'm watching this space carefully. GPU databases, spatial analytics. You can query and visualize billions to trillions of near-real-time objects. These are things that you just don't do today. Now, many of us don't have data that big, where that's actually going to be a thing. But many of us do. I encourage everybody to think broadly about your data abilities as an organization. If nothing else, think about third-party data that you could be bringing in. All that data that you've been leaving behind or ignoring because, well, nobody's asking for it. That's not good enough anymore. Not good enough. Leadership demands that we look at the artificial intelligence future and start getting all that data under control. If you do that, I think the market's going to move towards these GPU databases. Keep an eye on it. Type A organizations may want to deploy now, and that goes for operalytical databases too, if I may use that word. I did mention at the outset of this presentation that there is a divide between operational and analytical. There is still a divide between operational and analytical, for the most part. But some of the IoT, some of the graph things that we're doing, I'm having a hard time categorizing so distinctly anymore. These operalytical databases could come in handy. They could actually serve some organizations even more broadly than the actual operalytical application, because they can do both row-based for transactions and column-based for analytics. And that's the best of both worlds right there. Now, does it duplicate the data? Yes, it does, but it does it under the covers, and that's a big benefit. And they process both orders and machine learning models simultaneously, with fast performance and reduced complexity. What else? What else is out there in the future?
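As a rough illustration of the row-based versus column-based point above, here is a minimal Python sketch that keeps the same hypothetical order data in both layouts. It is a toy under stated assumptions, not how any of these databases is implemented: the sample orders and the helper names `to_columns` and `get_order` are invented for the example.

```python
# Row layout: one complete record per order, convenient for transactions.
# The sample data is hypothetical.
rows = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "amount": 30.0},
]

def to_columns(records):
    """Build the duplicate column layout: one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

def get_order(order_id):
    """Transactional access: fetch one whole record from the row layout."""
    return next(r for r in rows if r["order_id"] == order_id)

# Analytic access: scan a single column without touching the other fields.
total_amount = sum(columns["amount"])

print(get_order(2))
print(total_amount)
```

The data is stored twice here, which mirrors the duplication described in the talk; the difference in a real product is that the engine maintains both copies under the covers.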
Well, I see that enterprises are going to move towards, you know, this architecture that I've been showing you here, different pieces of it. There might be multiple of those in the future in your enterprise that work together, and how might they be divided? They might be divided by enterprise domain. So I see a future with a more decentralized, more decoupled, distributed architecture. So data infrastructure is like a platform, with complete domain mastery as the nodes of that platform. Hopefully you're getting the visual as I say this. But I definitely see things getting more complicated before they get less complicated. I don't know when they get less complicated in our world. It's going to take some time. In the meantime, we're going to go through things like this. We're going to have enterprise master data management. Master data management is clearly going to be important for our future. It's important today. And those who don't embrace it are really, I think, suffering. Sometimes they know, sometimes they don't. Nobody has solved the federation challenge that I just spoke of. Someone will, and that could be a really big thing. Moving away from conventional integration and its technical debt and effort, more streaming, more cost to stuff for sure. Containerization, microservice databases, and embedded databases. Embedded databases as part of the analytics environment. Embedded databases, they're not just for software anymore. They're for the things that you build inside an enterprise. Yes. And embedded databases as distinct from just using raw data in your enterprise. And extend this: let's say we're doing an IoT architecture, extend it out to the edge, having a real embedded database out there. Yes, that's part of our future. Integration speed, uptake, and maturity. Eliminating redundant data stores.
And, dare I say, eliminating a lot of the work effort that's involved today in integration, because a lot of that has been done. And finally, the unification of batch and streaming in tools is something of the future. Tools such as Apache Beam or Google Cloud Dataflow, these are things that are here today, but the market's definitely going to go more in that direction. But again, if you leave this presentation at the top of the hour and go back to work and go back to platforming what you've got in front of you, don't think too much about this. But every once in a while, as a data professional in a fast-moving market, you have to take a look out there. And I do that. I do that on a regular basis. And this is kind of where I'm at with that. So, something for you to think about. That brings me to the end of this part of the presentation. I still have a few minutes here for your Q&A, and I'll pass it back to Shannon to direct that. William, thank you so much for another great presentation. We do have some questions coming in. And just to answer the most commonly asked questions, a reminder: I will send a follow-up email to all registrants by end of day Monday with links to the slides and links to the recording. So William, do I need to consider the ingestion type, for example batch versus stream, in the architectural decision? Okay. Well, yes. But is it separate from the platform decision? The question said architecture. I have to say yes to that. I'm not so sure about the platforming decision. But you have to architect your ingest today. It used to be that that wasn't that important. It was pretty much what it was going to be. But now that we're ingesting IoT data, we're ingesting data at velocities that we never even thought about before. It's important to architect your ingest. And I think if you do that, you will find that ETL tools, ELT tools, are going to be insufficient for the modern workload. And that's where you get into streaming. Great.
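The idea of unifying batch and streaming, mentioned above with Apache Beam and Google Cloud Dataflow, can be sketched without either tool: write the processing logic once against a generic sequence of events, then feed it a bounded batch or an unbounded stream. The Python below is a minimal sketch under stated assumptions, not Beam's API; the event shape, the `running_totals` function, and the `fake_stream` generator standing in for a message topic are all hypothetical.

```python
from typing import Dict, Iterable, Iterator

def running_totals(events: Iterable[Dict]) -> Iterator[Dict]:
    """Same logic for bounded (batch) or unbounded (stream) input:
    emit an updated per-sensor total after every event."""
    totals: Dict[str, float] = {}
    for event in events:
        key = event["sensor"]
        totals[key] = totals.get(key, 0.0) + event["value"]
        yield {"sensor": key, "total": totals[key]}

# Batch ingestion: a bounded list, e.g. rows read from a file.
batch = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "b", "value": 2.0},
    {"sensor": "a", "value": 3.0},
]
batch_result = list(running_totals(batch))

def fake_stream() -> Iterator[Dict]:
    """Stand-in for a streaming source; a real one would never end."""
    for event in batch:
        yield event

# Stream ingestion: results are produced one event at a time.
stream_result = list(running_totals(fake_stream()))

print(batch_result[-1])
```

Because the function only assumes it can iterate over events, the ingestion type becomes a choice made at the edge of the pipeline rather than something baked into the transformation, which is roughly the appeal of the unified tools.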
And should cost be considered also as a disruption vector? Yes. Certainly it is disrupting the market. Certainly it's something at that level that you should look at when you're choosing a platform. You should put your workloads up against it. And you should know: how much are you running? Is consumption-based right? Is serverless pricing kind of right, like what BQ does and solutions like that? So I heard it said not too long ago, and I've picked up on this, that we as database professionals are now valued for our design abilities with databases, but in the future, we're going to be valued for our frugality. And the options have opened up. So you definitely want to consider that a dimension of your selection. So which solutions work, or are successful, at implementing an operational analytics database, both column and row? Column and row. What are the options there? Yeah. You've got a few. DB2 is stepping into that area somewhat. I'm talking about the popular ones. We like, yeah, Actian, Exasol, MemSQL. Okay. These are all claiming some ground there and are worth looking at. Take a particular look at MemSQL and its ability to do both, and do it well and seamlessly, and step up to those challenges. And really, it makes its case. It makes its case for doing more than just that application in the enterprise, for some smaller enterprises that want to standardize on one database, which is almost unheard of anymore. But, you know, if you're going to standardize on one, please don't. But if you're going to, you can think about these solutions as being pretty paramount in that. So where do you look to see the future of tech and options? Where do I look to see the future? Well, I do these analyst days at various vendors. So I try to get to them. I take in as much information as I possibly can.
I have a Feedly set up with RSS, I guess formerly known as RSS, kind of feeds, where I have, you know, the keywords of things that I'm following. And I hit that every day. And I see what's going on out there. And, you know, I don't just absorb information and reiterate. I think about it. And I think about how it fits with all the other information I'm getting. I'm not saying it's any better or worse than anybody else's, but that's kind of how I do it. I just make sure I invest the time in the future. I have a network that is second to none in this space. And so whenever we can get together and bounce ideas off each other, that's very helpful to me as well. All right. We have a minute here, but I'm going to try and slip in this last question. Can you elaborate a bit more on decentralized architecture for enterprise master data management? I've seen many centralized designs. Why is the decentralized architecture the trend? Pros and cons? Yeah. Well, that's a big one. So master data management is very interesting because it has so much value to the organization. But you, as the MDM champion in your organization, have to pick the right level to hit the organization with this, because you don't want to go too big, because it'll never happen. You don't want to go too small, because you're not really delivering anything of value. It's not really MDM to me if you're just supporting one application, if you're just decoupling the master data from an application and doing it over here instead of over there in terms of the team or something like that. But when you find that happy balance, and let's say it's not the entire organization that you serve with your MDM hub, and you're getting some traction, you've done customer, you've done product, it's great. Let's say the domestic business loves it, but then you've got that international business kind of sitting out there and they don't have it. Well, are they going to jump in here? That might be too hard.
So maybe they build their own. So we're now starting to decentralize what is supposed to be a centralized function. But you know what? That's okay, because you're getting things done. And so you should not listen to anybody who says this is the one way to do it, the only way. And you should not listen to anybody who says that if you're not hitting the entire enterprise with this, you're doing it wrong. Yes, that's nice to have. Yes, that's great. But I'm out here in the real world trying to spin up projects and get them going and get companies moving in the right direction. So you've got to know how to use your judgment and find the balance in things. And that is leading us to decentralized architecture. I love it. Very well answered in a very short time. Well, thank you, William, again for another fantastic presentation, and thanks to all of our attendees for being so engaged in everything we do. Again, just a reminder, I will send a follow-up email by end of day Monday to all registrants with links to the slides and links to the recording. And I hope you all have a great day. Stay safe out there. Thanks, William. Thanks, all. Thank you. Bye bye.