Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager of DATAVERSITY. We'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight. Today, William will be discussing databases versus Hadoop versus cloud storage. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #ADVAnalytics. As always, we will send a follow-up email within two business days, containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce to you our speaker for the series, William McKnight. William is the President of McKnight Consulting Group. He takes corporate information and turns it into a bottom-line-producing asset. He's worked with major companies worldwide, 15 of the Global 2000, and many others. McKnight Consulting Group focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in information management. His teams have won several best practice competitions for their implementations. He has been helping companies adopt big data solutions. And with that, I will give the floor to William to get today's webinar started. Hello, and welcome.
Hello, hello everyone. Thank you, Shannon. And welcome to everyone all over the world who joins these webinars. It's my privilege to bring this information to you. Hope you enjoy it and can make some use out of it. This is a hot topic.
A lot of companies out there are scrambling now to make sure that they're on the right data platform for their workloads. And the choices are incomparable to anything that we've ever had, really. I mean the viable choices. So you can get it right or you can get it wrong, really. And this is one of the most important decisions, not just for the data of the project, the data layer, but for the whole project, for the success of the whole project. And there are a lot of things that follow on from this decision of what platform to use. So I'm very excited to be bringing you a topic that's burning up my ears, which is databases versus Hadoop versus cloud storage: which to use, when, where, how, etc. So I've been introduced, so I'll skip this. And my company, we do a lot of strategy for our clients out there. We give training workshops to nail a particular issue you may be having by bringing in expertise, and also obviously implementation of all the things we're talking about here. And by the way, I'll add that I've been doing a lot of field work recently. I should say field testing recently on a lot of the products that will be implicated in a comparative presentation such as this. So I do bring you up-to-the-minute hands-on experience with a lot of these products, and I've been through the thought pattern, I guess you might say, of making this decision with our clients. Yeah, it all used to be: give me some data, give it to me fast. Okay, great. We didn't have data warehousing back then. We weren't thinking that way. We were provisioning rapidly and repeatedly and getting kind of crazy there with it. And then data warehousing got some things together for us in the 2000s. I was around for all of this and so I've seen it change. And now it's all data. Give it to me fast and effectively. Don't let any data slip away without us getting the value out of that data. And usually that doesn't mean just looking at the data or processing the data.
It also means storing the data for current and quite potentially future uses of that data. And one of the biggest areas has got to be artificial intelligence. So I'll hit on that in a minute. But first I want to say this guy has nothing on us in the data business. This is Mr. Toad. He's got a wild ride over there at Disney, apparently. But we've been on a wild ride ourselves and we're all scrambling trying to make sense of it. And not only is the landscape changing as we speak, but it's changing rapidly. And at the end of the presentation, I'm going to throw a wrench out there and see if you like it. I'll be talking about maybe some future direction of data platforms. But for the majority of this presentation, I'm going to be firmly grounded in the present, firmly grounded as I am with our clients in terms of making decisions today. We have to make decisions today. My motto is you can't keep delaying the provisioning of a platform for this workload just because you think there's something around the corner. There's always something around the corner. And you have to take that into consideration: if you kind of know what that is, that may factor into your decision. But as you can see, it's a multifaceted decision and I want us to make it properly. So be prepared for the wild ride that will continue in this space. We're getting ready for AI out there. And AI data means all data. It means call center recordings and chat logs, streaming sensor data, et cetera. I won't read it all, but user website behaviors, sentiment analysis, all data. And so I have clients whose data science isn't quite ready for it, but they know that that science, when it is ready, will want data because it already does in many different ways. And so we're getting our data act together.
We're getting it into a manageable, architected format, because as things happen in your business, as things happen that affect your business, and you want to react to that, if you have a chaotic environment with data mismatched to platform, you're going to have a much harder time. And if you don't understand where things are and what the level of, say, quality is in the various places and what you can do with data in various places, you're going to have a really hard time. So this is all kind of what I call architecture, what we call architecture. It's putting the data together in the right way. And getting it into the right platform is a big part of that. I want to stress this before we jump in further, which is that this is about the approximate proportion of where we should be spending our time and energy in an enterprise. Yeah, on the data, mostly. It's not what really happens, unfortunately. And I do believe companies pay the price for trying to continue to just access data, however it may be in the enterprise. Get that data act together, and then you can slap any good old BI or AI on top of a great data foundation and get some return on investment out of that. If you have a chaotic data foundation, again, you're going to have a hard time getting anything out of it. And people will just kind of walk away and go back to the old ways. They won't use the data, and we don't want that. If you believe as I do that data is a critical asset of the business, you have to take care of it in this way. And by the way, for all the things I'm going to be talking about here today, the architecture, the platforms and so on, let's be sure that data is perceived that way so you get the right attention over to these platforming issues. Because otherwise, if data is seen as a commodity, as a drag-along, if you will, to applications, as it still is in many places, you're going to have challenges getting the data in the right platform, getting your company really ready for the future.
That's what this is about. There is an increasing probability that if you do the correct platform selection, it will lead to success of the overall project. And this is not based on a study. This is just my impression from 20 years of consulting in data: you can't just kind of throw a dart at the board and use the same old platform that you've always used. I mean, I've been in application specification workshops where we're talking about the application. It's all great. And we're going to spend about the same amount of time talking about the data with five minutes to go in the meeting. And it's like, John, would you provision another (insert your favorite vendor) database instance for this? And that's it. Whoa, what happened there? We just made a very quick decision without really thinking about it. And that's not the way to go about it. So I want you to get into not only the right category, not only a top two category, but the best category. And by category I mean the subject of today's talk. A database. Yeah, and there are different categories within there. Of course we'll get into that. Hadoop, still very viable out there. And then cloud storage and other things. So get into the right category and get into the right tool. And I believe there is a best tool, a specific vendor tool, for every application. And you really increase your odds for success of that application if you get into that. There's still enough diversity in, let's say, cloud analytic databases. There's still enough diversity to matter. And there's still a lot of positioning happening out there in that marketplace. For example, there's still a lot of differentiation. And I've been doing a lot of research lately in this area, comparing the various platforms, not only from a performance standpoint but from a feature-function standpoint. And I can certainly attest that we can award some different gold medals to each of the participants for different things, right?
Best performance over here, you've got your best backup and recovery over here, you've got your best security over here, and so on. So what's important to you? Let's get into that as we get into provisioning your application. So what's it for? What are we wanting a data platform for? What is it? It's not just a database; there are categories. Now I'm putting aside anybody that wants to do something on, you know, tape or something like that, okay? We're squarely in modern architecture here. So, an operational database. Now you'll see some words in here: operational. Hopefully we know what that means as contrasted with the analytical side of the business. An operational database. Hey, this is where it all began. This is where databases began, just running the business, doing accounts receivable, doing transactions and so on. Oh yeah, we still do that, by the way. That's still pretty important. But one aspect of operational, or there are a few that have changed, but one that's changed is the need for real-time, like real-time catalogs for e-commerce and so on. That throws a wrench into the decision process and to me changes the game in terms of the database that you might select. So is it that? Is it big data? And it could be a combination of real-time and big data. But in the first three categories here, mostly I'm talking about provisioning for a single application in the operational realm. And then the fourth category is an operational data hub, which has gained some prominence recently. It's almost like the data lake, but it's not quite. It's really operational in the sense that it's serving up data to a lot of different applications for operational processing, real-time processing in many cases. And it's serving multiple applications. So are you provisioning a database for operational purposes for something like that? That's an operational data hub. Master data management? Yeah, that's an operational database too. That's a relational database. A data warehouse.
Oh yeah, data warehouse. Now we're flipping to the analytical side of things. Post-operational, where a lot of activity has happened recently. It's a very important part of the enterprise. Of course, a data warehouse. Of course, so important. I'll come back to that. A dependent data mart, a data mart that is fed from the data warehouse. There could also be an independent data mart, which is viable in certain cases. It's somewhat congruent with an analytic big data application, although that's for big data, not the non-big data. So independent data mart, non-big data; analytic big data application, big data. I believe big data is a thing. I don't mind using the words. Some people do. They say, oh, it's just data. No, it's data with different characteristics altogether. And so you have to do something different about that. And I'll get into that. A data lake. I skipped over that. A data lake. Okay, these have gained a lot of prominence recently. I think they have a lot of value in an enterprise. I hope to impress a little bit about that on you today. But if you want my take on data lakes, that was the topic of last month's Advanced Analytics webinar. That's all I talked about. So go back and maybe find that in the archive, all about data lakes. Archive storage. Some people provision just for that. Okay, that's not too exciting. But yeah, you've got to do that too. And that's a whole decision, I mean a whole list of vendor products that are specialists in that area. And also staging. Famously, we stage for the data warehouse. But these other things may have staging as well, which is where you might be doing some initial data quality work, cleanup work, transformation work before the data is ready. And in some cases, in most cases I would say for a data warehouse, it does require that extra step to get it right. Okay, this boils down to three major decisions. The data store type.
So the type is, to me, either databases or non-databases, what I call file-based systems. File-based scale-out systems. And largely there I'm talking about Hadoop, cloud storage, NoSQL, the things that aren't SQL. Whereas databases are SQL. I believe SQL is here to stay, by the way. I certainly make my decisions based upon that. I see no evidence to the contrary. And I'll be getting into that a little bit later. Not every application, not every provisioning, needs to have what a database brings to the table. And databases have been around for 25 years, you know, relational databases, or longer. And they do a lot of things, but they do impose a lot of structure. And we have learned in recent years of some of the shortcomings of databases. So sometimes you don't need what a database brings. You don't want to pay for it. Data store placement. And I'm not really talking too much about this today, but on anything and everything that I say today, you can throw a cloud dimension on top and say, well, should that be in the cloud or not? So this is an important part of the provisioning cycle. And as a matter of fact, I think next month I'm going to talk about the cloud. It's pretty important, you know, whether you put things in the cloud or not. My take is that that is the future. That is where my clients want to go. That's where I take them. And that has the pole position as far as I'm concerned, that we are going to do this in the cloud. Convince me not to. The workload architecture, because there's still a differentiation, although it's blurring, between operational and analytical. So those three decisions kind of get you into the right set of tools. We want to drill in from there. Let's start by looking at probably what most data professionals work on a lot. Data warehouses, data marts, data lakes, and generally big data. So let's talk about the analytical side of doing what we do.
And oh, by the way, I failed to mention a point here earlier when I was talking about operational data. I wanted to say that this could be SQL. This could also be NewSQL, or it could also be NoSQL. So those are some of the categories that are viable for the operational workload. And, by the way, a lot of databases these days also have NoSQL-like capabilities: the capability to store JSON as regular data, et cetera. So that's taken up a lot of slack out there. And so we'll see what happens with NoSQL. I don't believe I will come back to NoSQL very much more in this presentation. Okay, data warehousing, yeah, still has lower total cost of ownership than data marts. And I've been dragging around this thing from Gartner in the lower right-hand corner, because I like it, and it tells the story. And I don't know what they mean by however many data marts they're talking about there, but you get the idea that if you keep doing one-offs, at some point it's going to cost more. And I would say not only does it cost more, and that's not good, but it also is going to leave you without the capabilities of a shared resource like a data warehouse. So we love data warehouses. As a matter of fact, of all the things I'm going to talk about here today, if you don't have your data warehouse up to a certain standard, that's probably the place that you can get the most bang for the buck out of your data environment. So you should probably be focused on your data warehouse. I know it's yesterday's news, right? Not so interesting anymore, but it provides so much value to an organization. And so we're going to talk about provisioning for the data warehouse. Let's put it in an ecosystem here first. It's going to be next to Hadoop, as I say, next to it in that tier of the analytic architecture. Now that Hadoop could be for an application, a big data application, or could be for a data lake.
And also I show you some other appliances out there, some independent data marts, some dependent data marts, yeah, all of the above. There's no right and wrong in terms of this, because no two clients I encounter are the same out there. No two of you are the same in terms of what you've done. And by the way, none of you have implemented the nice laminated architectures that the vendors used to wag in your face in the past decade. Not quite, and most of you are pretty divergent from that. But hey, I'm not taking anything away from those laminated architectures, those reference architectures. I think you should have a reference architecture that you're targeting out there sometime in the next five years or so. And that helps you make your decisions, because getting to that reference architecture is part of the decision-making criteria for provisioning anything. So you should have that. And yes, it should have a data lake. It should have a data warehouse. It should show your prominent data marts in there, and the other things I'm going to be talking about here today. So anyway, my point is that the data warehouse sits in an analytic ecosystem. Some people call the whole thing a logical data warehouse. Okay, great, if you want to. I tend to be physical. I call the data warehouse that physical database that we're calling the data warehouse. Yes, there might be staging; yes, it might feed some data marts, but that's what they are. That's what it is. Now, I've noticed something. Data warehousing has been around for so long, right? Data warehouses now seem to have what I call flavors out there in the marketplace, and that's perfectly okay. It's perfectly okay if your data warehouse is not the do-all, end-all, be-all for analytic data in your organization, because it's hard to do that.
It's really hard, and so many organizations have multiple data warehouses. Some of those have taken on multiple of these flavors, and some take on one flavor and you have multiple of them; yeah, all of the above. But these are some of the flavors that I've seen in data warehouses recently. And yes, the flavor could impact the decision of how you're going to provision for the data warehouse. Customer experience transformation. People are doing that with their data warehouse. Asset maximization, with a lot of IoT related to those data warehouses. Operational extension: data warehouses that do things that we should be doing in operations, but we can't for various reasons. Risk management has become huge recently. Finance modernization, all the financial numbers. And product innovation. And one last reason, then I'll move on here, that I bring up these flavors of data warehouses is because maybe the ideal out there is to have a couple of different data warehouses with different flavors, because maybe it's just hard to meet all these needs in one data warehouse. I don't know. That's kind of what I'm seeing, though. But do have data warehouses, which are shared resources at the end of the day. It's shared. It's not for one application. Just because you're provisioning a database on the analytic side of things doesn't mean that that's a data warehouse, okay? Let's move on. What is required in all these analytic databases? The ability to do analytics inside the database, okay? The ability to store data in memory. The ability to have a columnar orientation to that data. That's pretty important now. I'm going to come back to that. Modern programming languages. Yeah, all the data science languages that are now in vogue and very interesting and making a difference in organizations. You want the ability to store different data types. I mentioned JSON before, Avro, XML, on and on.
The new data types that we're seeing out there in a lot of exchange environments, if nowhere else. Your analytic database should have the ability to store that kind of data. Some of the major decisions that you need to make are, okay, where to store the data. I mentioned this a little bit ago. Now I'm not saying let's store a lot of data in memory, but if I had a general piece of advice that I've built up over working with so many clients, it's that they don't use enough memory. A little bit more memory can take a lot of clients' performance up dramatically. Now, that's using memory. This graph is talking about where to store the data. So what this is saying is low amounts of data, and in their case, I mean, this can go up to 100 terabytes. Okay, so let's take that with a grain of salt. But with low amounts of data and a high query rate, you might be storing in memory. And then somewhere in between is SSD, which is really the sweet spot, I'd say, for where analytic data should be stored today. Okay, and then of course you have disk for some of the data that, maybe it's legacy, maybe you're not doing very much with it, etc. High volume, low query rate, yeah, disk for that. This is an important decision, by the way. It's not good enough to just say, okay, well, we've analyzed everything and we've decided on Oracle. Okay, well, go on. Because what about, you know, the makeup of the storage layer for Oracle? That's pretty important now too. That can make or break your workload right there. So you have to keep going. It's not as easy as it used to be. And the columnar orientation. If you've heard me over the years, you know I'm really big on this and have done a lot of work in the columnar area. Most analytic databases have a columnar capability. And what this does is it isolates each column of data in its own storage area. Okay, so when that's all you want, that's all you get.
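To make the columnar point concrete, here is a minimal sketch in Python. It is only a toy model, not how any particular database lays out storage: the table, the column names, and the helper functions (`to_columnar`, `sum_row_store`, `sum_column_store`) are all invented for illustration. It counts how many stored values a query has to touch to sum one column under each layout.

```python
# Toy table: three rows, four columns. All names and values are hypothetical.
rows = [
    {"order_id": 1, "customer": "a", "region": "east", "amount": 10.0},
    {"order_id": 2, "customer": "b", "region": "west", "amount": 25.5},
    {"order_id": 3, "customer": "c", "region": "east", "amount": 4.5},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, column):
    """Row layout: each whole row is read just to reach one field."""
    total = 0.0
    values_touched = 0
    for row in rows:
        values_touched += len(row)  # every column of the row comes along
        total += row[column]
    return total, values_touched


def sum_column_store(columns, column):
    """Columnar layout: only the requested column's storage area is read."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_touched = sum_row_store(rows, "amount")
col_total, col_touched = sum_column_store(to_columnar(rows), "amount")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts return the same total; the difference is the number of values read along the way, which grows with the number of columns in the row layout and stays at one value per row in the columnar one.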
Just those columns, that's all you'll get in the processing and in the visualization part, whatever. It doesn't have to hit and doesn't have to deal with every other column of every table that you might be touching. Okay, it's very important. I have done studies and found that most data warehouses, let alone analytic marts, them too, but most data warehouses really would do well and do better in a columnar orientation. Now I know you're thinking, wow, that's a big deal though. I've got 10 constituent parties in the data warehouse. Yeah, I know. I know. Could you plan a weekend sometime later in the year maybe to twist your data in a columnar direction and boost most of them? Maybe, maybe not. Maybe you have to provision something else, which has led me to this notion that there is the data warehouse, but then there is the specialist database in the analytic realm for specialist workloads that can't fit in the data warehouse for one reason or another. And in my work recently, I would say that's at least 50% of the workloads that we're dealing with in the analytic realm, not even talking about big data here. Because of the concurrency requirements, because of the access requirements, the deep data science that might be happening on that data, or just the volume of the data, the veracity of the data, et cetera, they just can't fit in the data warehouse anymore. So that's okay. You know, spinning up some independent data marts that are alongside the data warehouse. But now, yes, I'm putting a ribbon on this slide, because if you came to this webinar looking for the answer to the question of Hadoop versus databases versus whatever else I had, the answer is cloud analytic databases. And I'm not saying that generally for everything. Okay, don't get me wrong.
But for a lot of things out there that I know you're mulling over, cloud analytic databases are going to be front and center in terms of things that you need to really be thinking about. So yes, there is that word in there, isn't there? Cloud, yeah, cloud. These belong in the cloud. These have some of the things that you see here. Robust SQL, built-in optimization, on-the-fly elasticity. You don't have to think about it. It just expands and contracts as it needs to and you pay for what you use. And price predictability is huge in terms of what I'm asked about as we provision data platforms for our clients. Clients are even willing to accept a higher price if they can predict it, because they have to do this thing called a budget. So just put that out there. Dynamic environment adaptation, separation of compute from storage. Yeah, that's really necessary today. That's really necessary. You need to be taking advantage of that. Now this came from the Hadoop world. They got into it first, but a lot of these cloud analytic databases now have that. Support for diverse data, I mentioned this a little bit ago, very important. Cloud analytic databases in the enterprise. Now the cloud aspect is what's important here for some of these benefits. And just the database aspect is important for some of the other ones. But on this one I'm talking mostly about the fact that it's in the cloud. The fact that it's in the cloud. It can be used for test, dev, or prod, or disaster recovery. Yep, whoops. Or bursting, so that you have some of the workload in the cloud, some on-prem. CapEx accounting, of course, we love this. Our accounting department loves this. The cloud now offers attractive options with better economics, such as pay as you go, which is easier to justify in the budget. You can get up and running quicker, which is very important today.
I know for me, the cycles of getting a client up and running have shrunk dramatically from years ago, when it used to be an elongated RFP process, et cetera, et cetera. If you've been around, you know what I'm talking about. But now it's more like, let's think about it a little bit, let's get it right, let's spin it up, and let's try it out. And if it doesn't work, let's jettison that and get something else going. I encourage everybody, though, to not be so flippant about it and actually run your own proofs of concept internally. Pretty much I kind of described a proof of concept there, but do it a little bit formally and have some competition in the mix. While on-premises-first development brings a robust database to the table, not all functions are always part of the cloud solution, so be careful about that. And today, there's a lot of data gravity in the cloud because there are so many other applications out there in the cloud that we are enjoying. Now what did I call this? Hadoop versus databases versus cloud storage. What's the winner? What's the winner in terms of performance? Managed cloud databases are the winner for performance. There you go. You got a straight answer from a consultant. Today must be your lucky day. Querying cloud storage directly is inefficient, and bringing subsets of data down for on-premises processing takes time and costs egress fees. Okay, this is what we found. Now, by the way, just because it's better at performance doesn't mean it's right for everything. I hope you get that. All right, I'm just picking on one thing here. Performance testing on Hadoop engines like Hive, Spark, and Impala has shown improvements in performance, but they still lag significantly behind the performance and power of a solid relational cloud database slash data warehouse. Maybe if I had a better slide in here, I would show you that that line between database and non-database has been pushed up a little bit.
So the database keeps gaining ground. And this is interesting, because a few years ago weren't we all talking about, well, no, actually it's Hadoop that's going to push down into the realm of the database, right? It's going to be for everything, right? It's going a little bit the other way now, but there's still that top end where some other things make sense, and we'll get into that a little bit. But there you go. How about administration? That's pretty important. Now the DBAs and the people that manage this aspect of it, they care about this. Managed cloud databases win this category too. What do you know? Many of the latest and greatest fully managed cloud database platforms are streamlining and subsuming much of the DBA work these days. Yeah, there is that march. We could have a separate, complete discussion about what we're going to do in the data world in the next 5 to 10 years, but this is one aspect of it. DBA work is shrinking. Things like indexes, constraints, partitioning and other DBA-level performance tuning are fading away. Many of these cloud databases do not have those things. You don't touch any of this. It just happens. Now, second in administration is cloud storage because of its very simple architecture. I didn't say it was simple to provision. I'll get to that. But it's simple in terms of architecture. It's basically a file layout. And last place is Hadoop. You will still need expertise to help diagnose why Spark executors fail, which they do, or why Hive throws an exception, which it does, or why troublesome queries never finish. And depending upon your query set, that could very well be happening to you as it has with us. However, there are reasons why you want big data technologies for big data. So if your workload is squarely in big data, which I've mentioned is a thing, it is characteristically different than our legacy non-big data. It really is. The volume is at another level. The data types are different, to say the least.
And as I said, the volumes are at a different level. With the volume that comes in, the interesting nature of the data is less per byte, if you will, but it's still important. And this is actually an area for gaining competitive advantage in the marketplace. So we do want big data technologies for big data because of the things that you see here and plenty of other things. And by the way, if you have extreme unstructured data, you might need to do something different than even Hadoop or cloud storage. You might need a product like Amazon CloudSearch, Elasticsearch, Solr, Sphinx, or Splunk for your extreme unstructured data. If the overriding aspect or criterion of the workload is the fact that there's a lot of unstructured data, yeah, you might be in that realm instead of in Hadoop or cloud storage. That might be what makes sense. But let's talk about the data lake. There's a lot of talk about the data lake out there. And some of it's good, some of it's not so good, but I believe that the data lake has a tremendous amount of value to organizations. It supplements the data warehouse. It's not the data warehouse. It's not for your everyday reporting. It's not for the 70% or so of queries that an organization needs to run. That's the data warehouse, okay? But this will be an increasing percentage of the queries that an organization needs to run as data scientists increase within the organization. And hopefully you are embracing this. If you're a mid-size company or up, you need to be embracing data science today. And you need to be showing how you are going to be competing in data science out there in the marketplace. And I would say that the window is closing a little bit now already for those prime movers, those companies that will be data science leaders. And I think that's pretty important in terms of the overall viability of a company. So get your data act together so that you can effectively have data scientists, and they will want a data lake.
I promise you this. This is what I've found over and over again. And, yeah, in a Data Lake, I don't want to say it's all data. That sounds kind of too easy to say, hard to do. But it's a lot of data. It's more data than the warehouse. As you can see in this snippet of architecture, I believe the lake is a great staging ground for the Data Warehouse and quite potentially other data marts as well. So, the big question out there, though, after you get past the "yes, we want to do a Data Lake," is where. HDFS, or Hadoop, and HDFS is pretty much what the term Hadoop has come to mean in recent years. Let's be sure that we're on the same page when you're out there talking about Hadoop with other people. Or cloud storage. HDFS or cloud storage. Now, the early days of the Data Lake, which weren't that long ago, but it feels like there have been some sea changes. In the early days, Hadoop was it, right? Hadoop was everything, every Data Lake. And it still is the majority of Data Lakes out there. It has to be, because unless you provisioned really recently, that's what you did. But cloud storage has really taken the scene lately, and my clients are telling me this is what they want to look at. I've been doing some field testing on this, and I believe that cloud storage, at the end of the day, is probably the most elegant place to put your Data Lake these days. It's more scalable and persistent, it's backed up, and it supports compression, making the cost less. Now HDFS does have better query performance, so I'll leave that out there, but cloud storage is closing the gap on that. And it used to be that you were not able to use Parquet on cloud storage, and you know my columnar orientation, so I feel like that's pretty important. But now we do things like make external tables for the database pointing at Parquet files on S3. Of course, that's Amazon's cloud storage.
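As a rough illustration of the external-table idea just described, here is a minimal Python sketch that only builds the DDL text. It assumes the Hive-style `CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION ...` form that several engines accept; the exact clauses vary by product, and the schema, table, column and bucket names are hypothetical, not taken from the talk.

```python
def external_table_ddl(schema, table, columns, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for Parquet files.

    The data stays in cloud storage; the database only records where it lives.
    """
    column_list = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {column_list}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


# Hypothetical names, for illustration only.
ddl = external_table_ddl(
    "lake",
    "clickstream",
    [("event_time", "TIMESTAMP"), ("user_id", "BIGINT"), ("url", "VARCHAR(2048)")],
    "s3://example-bucket/clickstream/",
)
print(ddl)
```

Generating the statement as text keeps the sketch independent of any particular database driver; check the target database's own documentation before running the statement it produces.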
Now, cloud storage does have object size and single-put limits that need workarounds, but we can do that. So I'm putting forward cloud storage for your Data Lakes. Just an elegant design idea for you: pair a lake with an analytical engine that charges only by what you use, which is a lot of them these days. If you have a ton of data that can sit in cold storage and only needs to be accessed or analyzed occasionally, store it in Amazon S3, Azure Blob Storage or Google Cloud Storage. Those are the three Data Lakes that we have been working on. So that's your body of selection there, I'd say, for that. Use a database, on-premises or in the cloud, that can create external tables that point at the storage. Analysts can then query directly against it or draw down a subset for some deeper, intensive analysis. And the storage fee plus the data transfer egress fees will be much cheaper than leaving it all in a data warehouse. So this is a big idea that many are doing for the right workloads to save money. For example, S3 is $0.02 per gigabyte per month for storage. If you had a 100 terabyte data set marked for cold storage, you could keep it in cloud storage for $2,000 a month. If you provision EBS general purpose drives to hold that data, it would cost $10,000 a month. If you use higher performance solid state drives with provisioned IOPS, as you might in a good data warehouse, you would spend $12,500 a month on storage alone. So when it comes to cost, this is a great way to work with cloud storage and databases. Hadoop would be second place in the cost equation. Hopefully you see cloud storage as first. Hadoop would be second place. For example, an Amazon EMR cluster having nodes with 32 CPUs and 244 GB of RAM each would run about $1.50 per node per hour. Redshift for the same processing power would cost over three times as much at $4.80 per node per hour. But remember, I did say that those cloud analytic databases are really the fit for most of your work.
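The storage and compute figures quoted above can be checked with a little arithmetic. This is a sketch only: the S3 rate and the two hourly node prices are as quoted in the talk, while the two EBS per-gigabyte rates are back-calculated from the quoted monthly totals and are an assumption, as is the use of decimal terabytes (1 TB = 1,000 GB).

```python
GB_PER_TB = 1000  # decimal terabytes, matching the round figures in the talk


def monthly_storage_cost(terabytes, dollars_per_gb_month):
    """Monthly storage bill for a data set of the given size."""
    return terabytes * GB_PER_TB * dollars_per_gb_month


DATASET_TB = 100

# Dollars per GB per month.
rates = {
    "S3 object storage": 0.02,       # quoted in the talk
    "EBS general purpose": 0.10,     # assumed: implied by $10,000 / 100 TB
    "EBS provisioned IOPS": 0.125,   # assumed: implied by $12,500 / 100 TB
}

costs = {name: monthly_storage_cost(DATASET_TB, rate) for name, rate in rates.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f} per month")

# Compute comparison, per node per hour, as quoted in the talk.
emr_per_node_hour = 1.50
redshift_per_node_hour = 4.80
compute_ratio = redshift_per_node_hour / emr_per_node_hour
print(f"Redshift vs EMR per node-hour: {compute_ratio:.1f}x")
```

Prices change often, so treat the numbers as a snapshot from the time of the talk and plug in current rates before drawing conclusions.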
You don't want to sacrifice the performance you need or the administration you need just because something is cheaper. This is why I dread the conversation, which is almost inevitable in any provisioning cycle that I get into. I dread the conversation about what's the cheapest way to store this, especially with the CFO's office. What's the cheapest way to store this? That's not the way to look at it. But anyway, I imagine that for 80 to 85% of your use cases you will not be able to beat a good cloud database or data warehouse platform when you look at everything taken together. We are still in those days of not one size fits all. We are still in those days where you have to get into the right category in order to succeed. So hopefully I'm clear about that, and that's why we have talks like we have today, because it's still a thing. So leveraging cloud storage for data lakes is more achievable with a separate compute and storage architecture. You've got compute resources that can be taken down, scaled up or out, or interchanged without data movement. I won't read the whole thing, but most of the query execution is processing time and not data transport. So if cloud compute and storage are in the same cloud vendor region, performance is hardly impacted. So that's another tip for you out there, and I think we had more tips like this last month, if you want to check that out and learn more about data lakes. Now a little aside, because I couldn't leave them out. After all, in the first webinar in this series back in January, I did a look ahead to 2019 and beyond, and I said this year, 2019, is the year of MDM, but it's also the year of the graph database. And so I believe that we're going to see a lot of graph databases deployed, or at least graph capabilities deployed, as organizations take up this idea of a graph workload. Yeah, not trying to force-fit it into a relational database anymore, because there are alternatives to that. So, a workload. What's a graph workload?
If you can identify the workload by the terms network, hierarchy, tree, ancestry, structure, if that's how you would describe it, chances are you're in a graph workload there, so you might want to think about it that way. If you're trying to use relational performance tricks like self-referencing tables and so on, we've all done it for years, those are not going to be necessary. You're going to get performance traversing a graph that's much better than traversing relational self-referencing tables. What else do I want to point out here? I want to point out this. So there are two types of graph databases. There are RDF graphs and there are property graphs like Neo4j. There are some subtle differences between the two. There are some capability differences between the two, but to a large degree I'd say you get the same set of algorithms in both types, and they address this graph problem. Some of those algorithms are things like PageRank. PageRank is very important. I don't want to describe it, it would take too long, but let me just say about PageRank that it's emblematic of the algorithms in a graph database that show that the value of a graph database is not solely in the visualization layer. We all see that, we enjoy that. Even on the graph here, you can see in the lower right I have a bunch of nodes, and it looks great because you can zoom in on a node and you can see there's a so-called bridge vertex. That's a connecting node between this group and that node. You can see it, and that's all great. But what are the most important nodes in the network for other reasons? That's the graph algorithms, like PageRank. PageRank is about what's the most important web page in a set of web pages, or on the Internet in total if you're Google. Betweenness: what's the way to get between, and I'm reading some of the different graph algorithms here, what's the way to get between some of the nodes? Closeness: how close are a couple of the nodes on the network?
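For readers who want a feel for what PageRank computes, here is a minimal pure-Python sketch using plain power iteration. It is a textbook-style illustration and not the implementation inside any graph database mentioned in the talk; the page names, the damping factor of 0.85 and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a directed graph by repeated redistribution of rank.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share on each pass.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this node's rank along its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site, for illustration only.
links = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
print(most_important, ranks)
```

The page that collects the most incoming rank comes out on top, which is the "most important node" idea, independent of how the graph happens to be drawn.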
The nodes, by the way, don't have to be all the same. They can be very heterogeneous. You can have, for example, people. You can have their products. You can have geographic locations and various other things in there as nodes. Again, you get great performance traversing that graph. Eigencentrality, clustering coefficient. Those are some of the main graph algorithms, at least in my book. Those are some things you get only in something that is provisioned with graph capabilities. I almost said graph database, but that's not the only game in town anymore for graph capabilities. The relational database vendors have added a lot of graph capabilities. In many cases what they have are the algorithms that I just mentioned. You would point them at your so-called nodes in your relational database. You would say, well, here in this relational database, here are some of the nodes and here are some of the connections. I hope I did justice to that topic in four minutes there. I thought it was pretty important, when you are talking about platforming your workloads, to put that out there and let you know there is that alternative as well. If you are struggling with a workload that might be identified by these words, think about that. Now let's step to the future. I like looking at the future too. Remember now, though, the advice I have given today so far is the advice that I would take up now for the decisions that I know you are dealing with now. We cannot wait for this future. What we can do is keep an eye on it, and every organization upper midsize and up should have people that are doing this. It should have people that have time to do this, and people that have the capability to do this, and people that have the interest in doing this and applying it to the industry that you are in, applying it to the company that you are in, applying it to the architecture that you have already laid down, making sure that it fits. Now first of all, I am going to mention the trend of GPU databases.
A GPU database is engineered to work with GPUs, not CPUs. This is all about speed, and there are some of the leaders there across the bottom: MapD, SQream and Kinetica. The reason I mention it is that some of the other databases could be, and are, working on being compatible with GPUs, which will take all our performance calculations today and throw them out the window. But it does take time and it does take dedication, and who knows about some of these vendors, if they are dedicated to this. I think this is a keeper, which is why I bring it up. What I see is that in the future we are going to have a lot more GPUs in place in our enterprises, so that is important. What else? How about hybrid databases? Take half of what I said today about the difference between operational and analytical, and how you provision differently for each, and you might say, well, maybe in the future we don't have to. This industry is like an accordion. We started out with the database, DB2 and Oracle, back when the accordion was closed. There weren't many choices. And then in the past 10 years that accordion opened way up, and you've got all these choices. I just talked about some of them. I didn't talk about some others that have gone by the wayside and that would not be viable today for my clients. But now, what about the future? Will it close back up again, and will it come back down to something that works across the board, now that we recognize after 30 years that there's more to our business than the operational side? There is that very important analytical side. There's that pending AI that's going to have to happen. We're going to have to make that transition, and can it even be done in our current chaotic environment? We'll see. But hybrid databases are a combination of row-based for transactions and column-based for analytics. They can process both, for example, orders and machine learning models simultaneously, wow, with fast performance and reduced complexity. So there are a few players out there doing
this. There are some sprouts, I'll say, of activity in this area, and if you're a type A organization you may have some high-end combo applications already today that I would even say only a hybrid can do today. But you better be type A and you better be ready for it. It's still SQL, by the way. Still SQL. Going into our future, still SQL. There's reduced complexity there, though, by having both in one. I'm sure you would agree with that. We're talking here about, and by the way, I'm going to read some names here, they are in no particular order, and I'm not saying that all of them have all these capabilities I just talked about: Splice Machine, Databricks, MemSQL, Actian, MapR, and Kudu. That's from Cloudera. That's their hybrid database, which does updates, inserts and deletes on Hadoop data. Hortonworks used to bank on Hive for the ACID merge functionality. Now I'm not sure what the direction is, probably Kudu from Cloudera. So anyway, that is definitely something to keep an eye on for your future, but not today. Not today. I mean, for those high-end applications where you actually need this, yes. Otherwise I'm recommending what I've been talking about in the past hour. Now this brings me to the end of the formal part of the presentation. Shannon, I see we have some questions. I'm going to let you take it away from here, though. Yeah, we have a lot of great questions coming in, and just to, oh hey, I see lots of familiar names on there. Hey. Just to answer the most commonly asked questions, just a reminder: I will follow up by end of day Monday to all registrants with links to the slides and links to the recording of this presentation and anything else requested throughout. So then, William, diving right in. Early on in the presentation, I think slide 14 or so: can you support multiple flavors in the warehouse?
Absolutely you can. It gets trickier the more flavors that you do try to support, and not every business needs them all or needs a hyper focus on all of them. And don't get me wrong, there are some organizations out there that pulled off the whole EDW concept, and they have all flavors in one great unified data warehouse, which is great. It's just rare, that's all. And so what I see is more the flavoring of data warehouses. But yes, you can. And why don't you recommend leveraging Hadoop for a data warehouse? Oh, hmm, how long do we have? Listen, yeah, anyway, performance is one thing. Complexity is another. I mean, everything sort of falls from that. Provisioning is another. I didn't mention too much about the work effort that goes into setting up these environments, and I should mention that cloud storage is very difficult. I did mention that extensively in last month's webinar. But Hadoop is also difficult. It needs specialized resources that are hard to find, and we're seeing a little bit of the market moving in some other directions, so you've got to think about your commitment there. But I would say, you know, you need the tooling, you need SQL for most of these workloads, you need the simplicity of it all. And I know that a lot of us, myself included, are working on some pretty high-end stuff out there, but when I'm talking about more SQL than Hadoop, I'm talking about the fact that most applications fit that mode out there. Now, we can take any single application and look into it, and I'm sure that out of the 100 applications that I might be presented with over the course of a year, there's going to be a handful that still make sense in Hadoop, in environments that are really committed there already, etc. But the majority are not. You did that well in a few minutes. I hope so. It's a tough one. You know, we got all hot and heavy about, you know, Hadoop is replacing this, that, the other thing there for a while, but, you know, it's settled in, and it's not settled into the warehouse, in my opinion. Sure.
So should you not query cloud directly, and should you not bring data down on premise? So how do we access this data? Let's see. Let's see if I understand the question, because I think the questioner said, should we not access cloud directly, and should we not bring it down on prem? I mean, those are your two choices, right? So for the most part I would say keep it simple, and if you're getting performance from keeping it simple, which is keeping the data in the cloud where it is, then great, do that. If you're getting enough performance out of that, you don't overcomplicate it. But if you do need to bring it down because you have scientists that are going to pound on it all day or whatever, go ahead and do so. And by the way, I failed to mention in my talk that if you take me up on doing some different things for different workloads, do keep data virtualization in mind and keep that in play in your environment. In fact, I think every environment should have a data virtualization capability, because if you're going to be provisioning all of these different data storage types, you're going to need it. You can't put all data everywhere. You can only put it in the best one or two places. And this factors into the questioner's question, which is how to access that data that's in the cloud. Virtualization can bridge that gap as well. So why aren't constraints and unique indexes still important? Why aren't constraints and unique indexes still important?
Well, I don't know that I would necessarily say they're not important. They're not going away as rapidly as indexes are going away, because indexes are going away because of other performance capability, the performance that's inherent in the new ways that the data is being stored. But in terms of constraints, I want referentially integral data, for one. I want to turn on constraints wherever I possibly can to ensure that. Now, if performance takes a hit, we'll deal with that. But again, I want to keep it simple, and keeping it simple means letting the database do what it does best and not doing things programmatically. But it's just something that I would say, in comparing some of the different databases, that it's not as important, or doesn't seem as important. So I think some companies are having to go back to programmatic ways to ensure integrity in the data because of some of the capabilities that are being taken away. So it's not all good with cloud and lake databases. It's not that one thing is 100% good and something else is 0%. It's somewhere in the middle, and this is kind of a gray area. So there are always going to be some things that are left behind, or laggards in terms of feature functions that come along, and I just don't think that this has been seen as important enough to get out there initially. I'm guessing, but I'm guessing that's why we don't see it as much. I think we have time for one more question. Do you see columnar data storage as more adaptable and cost effective for evolutionary data structure changes to warehouses over other row-oriented storage?
No. If the data is going to have high change activity upon it, and you're changing, like, multiple columns of the row on a repeated basis, like IoT type of data, that's probably not a great candidate for columnar, because columnar is not great for updates, let's face it: updating rows, deleting rows, inserting rows. But if it's stable data that, once you load it, you load and go, as we tend to do for 99% of data warehouse data, then it's great for that. But if you're thinking you're going to update, I'd say no. So this is why we don't do it very much, if at all, in the operational environment. I'm going to slip in one more question. I love it, you're so efficient in your answers. On slide 25 you are showing a push of all data into a lake. Do you believe that this is a singular pattern and all governance should take place in the lake before distributing to the other warehouses or being consumed by data science and real-time apps? So how about both? How about both? I mean, to me the data lake serves both functions. It is a data science laboratory, and it's a staging area for the push out to, let's say, the data warehouse. OK, now that leaves you with some decisions. And by the way, I've hit kind of the tip of the iceberg of the decision-making apparatus for being successful here. Every day you make decisions, you know, after the selection, and they have to be good ones too. You have to decide what data goes in the lake, what data goes in the warehouse, and what you are going to do as an organization in the lake versus the warehouse. And I've done this for companies, but I don't know that saying anything generically would make sense here, except that in the warehouse you're going to do your everyday reporting, your legacy stuff, compiling data for analytics. But there is such a thing as a data scientist. Maybe this is the point I'll leave you with. There is such a thing as data scientists doing data science, and I'm not a data scientist, and I don't know completely everything that they're doing
in my data lake, but I know they want all that data, and I know I can serve them well, and I can be successful as a data manager by giving them that data lake and starting to learn what they're doing, so I can help them and get ahead of them, so that in the future we don't have to have, as we do today, some pretty advanced people doing all the data science in the organization. I can help them as a data manager, and that's what I want to get to. And I think we're going to have to get to the same spot in the data lake as we have got to in the data warehouse, where it may have gone too far, but where we have to almost spoon-feed the data out to the users because they're used to that. We highly curate the data in the data warehouse. We summarize it, we aggregate it, we derive data, we verify our calculations. We just get that data ready for consumption in the data warehouse. We don't do that in the data lake. Well, that does bring us to the top of the hour. William, thank you so much for this fantastic presentation. And again, just a reminder to everybody, I will send a follow-up email by end of day Monday for this webinar with links to the slides and links to the recording. And thanks to all of our attendees for being so engaged in everything we do. I just love all the great questions that come in. William, I'll send you all of those questions we didn't have a chance to get to, and I hope you all have a great day. William, thank you so much. Thank you. Bye bye. Thanks, all.