Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager at Data Diversity. I'd like to thank you for joining the latest installment of the Monthly Data Diversity Webinar Series, Advanced Analytics with William McKnight. Today, William will be discussing Building and Growing Organizational Analytics with Data Lakes. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share and answer questions via Twitter using hashtag ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat feature in the bottom middle of your screen. As always, we will send a follow-up email within two business days, containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce to you our speaker for the series, William McKnight. William is the President of McKnight Consulting Group. He takes corporate information and turns it into a bottom-line-producing asset. He's worked with major companies worldwide, 15 of the Global 2000, and many others. McKnight Consulting Group focuses on delivering business value and solving business problems, utilizing proven, streamlined approaches in information management. His teams have won several best practice competitions for their implementations. He has been helping companies adopt big data solutions. And with that, I'll give the floor to William to get today's webinar started. Hello, and indeed here we go. Thank you, Shannon, and welcome, everybody. This is one of my favorite hours of the month, especially during this pandemic time. I look forward to sharing with you every month. I know you're out there even though I can't see you.
This is a very hot topic, at least for me, and I would suspect for many in this industry right now. It's about data lakes: what are they, what are they good for, do I need one of these things, and how do I build one? I have been fielding a lot of calls on this topic, and I want to bring together a lot of the conversation that I've been having about this and try to clear some things up, because there's a lot of lack of clarity about this out there. I mean, you don't want to build a lake for the sake of it. You definitely don't want to build a lake that turns into a swamp, which is unused. And a lot of people are wondering, well, is this just more vendor hype and consultant hype to guilt me into doing a project that I don't really need? And what is it really? Is it the function that the lake provides, meaning that it can be on a variety of platforms, or is it really the platform that it's on? Is it cloud storage or HDFS or some combination of that? Most people use it as defining the platform, and I will as well. But I do caution that we use the term, I don't want to say appropriately, like I have the definition of it, but I'd say consistently. So if you're lacking that consistent definition, hopefully I'll give you something here today to forge within your organization. And one thing that will come out is that I am advocating that most organizations, most enterprises that are mid-sized and up, need at least one of these. And hopefully I'll share with you why as we go along here, in some of the nice use cases that others are enjoying with the data lake. So a little bit about me here. I've been introduced. I do a lot of talking and primarily consulting to you all, as well as doing quite a bit of analyst work lately. Strategy, we're known for our strategy, a lot of training and workshops, things like that to help you over an issue you might be having, and we do implementation in all of these areas.
Okay, let's start with analytic data stores, because after all, the data lake is an analytic data store. Now, I realize that sometimes it's operational. That's not what I'm talking about today. I'm going to go with the primary definition here of something that's post-operational, that receives data, kind of like a data warehouse, if you're familiar with that. So, a data warehouse, where we receive data from the operational environment and distribute that data into the analytical environment, and obviously make it available to use. The data lake does something very similar. I'm also talking about the data lake similarly to a data warehouse in that it has multiple uses. It's not just data for an individual application in your enterprise, although that's a great way to get started. And frankly, that's where many are with this: they found something that has a need for all that data. That's okay. And I'd say the vast majority of this talk is for you, but I do want you to aspire, kind of like our data warehouse, to make it available for multiple purposes within your organization. So without further ado, here are some of the right-fitting, as I say, analytic platforms. These are well-worn terms in the industry, well-worn ideas, and great artifacts for your architecture. So if you have platforms out there that fall outside of this little cluster of terms, I'd encourage you to try to make them into something that is one of these things. The enterprise data warehouse, yeah, we all know about that, right? A dependent data mart, something that's fed from the data warehouse. I also have on here an independent data mart. Yeah, I'm using some old terms, but I like them and they fit. An independent data mart is not fed from the data warehouse, but fed from the operational system. It's not for multiple purposes. And yes, people, there are real good reasons to have independent data marts.
Some of our data warehouses are locked into what they do for us as an organization. They're not going to move with agility. They're kind of like a big old freighter ship in the harbor that takes a little time to turn around, right? So if you need something with agility, and speed for that matter, yeah, you might consider an independent data mart, but keep it architected as you do that. And then the enterprise data lake, which is what we're here to talk about. And what we're kind of also here to talk about is a nice big data cluster on cloud storage that is useful for a given application. That might be the beginnings of a data lake, or it might be something that is meant forever for that singular purpose. And then you also have the data lake. Okay, so there are so many combinations of this. I know it's just five bullets, but there are so many combinations of this out there. You're not necessarily wrong with whatever you're doing now. There are reasons why you are where you are, in terms of your architecture, in terms of your technology selection. The only thing to do really is focus on the future and focus on moving forward, inching the thing forward towards something that is going to provide more opportunity for the business as you go forward. And so that's why I say things like, well, try to make all those things into one of these five things as we go forward. So we're in the definition phase of the presentation here for sure. But this is really important, because this is where many enterprises trip up: they have multiple of these definitions running around, and it just makes things a little bit slower. And whenever I'm talking to a vendor, taking in a briefing or what have you, I'm always making sure we're on the same page in terms of what we mean by a data lake. It's just thrown around so much that you've got to be careful.
Anyway, this is the cloud, of course, okay, cloud storage. And I'm trying to show here a lot of people on the data lake, not just a singular application. Now, this is a relational database page structure for a data page, the predominant, you know, shape of pages out there. This is not in my ultimate data lake. This is not a data lake. I'm going to strike through this in a minute. But I mean, this is basically what Snowflake is, right? I'm going to put another word up here. Okay, so Snowflake is relationally formed storage where SQL works with it. It's on S3. It's on cloud storage. Is it a data lake? I'd say by my definition, no, it's not a data lake, because it's formatted relational. But you know, label it what you wish. I've even heard Snowflake kind of refer to it as a data lake, sort of opportunistically. So watch out and be careful and all that. But I am talking about structure that is not completely SQL-based. There might be SQL-like access to it. Everything has SQL-like access. A little soapbox here, but SQL is so strong that we can't really fork off of it. So here we are, still using forms of SQL, if not SQL itself. And back to my slide: the data lake is not formatted relationally. It's formatted more in rows, if you will, with field delimiters and row delimiters, and yeah, there are some loose pointers over the top that tell you where things are, like the old card catalog and stuff like that. So there are ways to find your way around in cloud storage. And I specifically did not say Hadoop in here, now did I? And many of your data lakes are on Hadoop, and that's fine. I'm talking about going forward, though: many fewer, if that's a term, will be on HDFS and Hadoop, and more and more will be in cloud storage. But that doesn't mean you're rushing to get rid of your HDFS data lakes that are maybe serving a good purpose.
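To make the rows-plus-delimiters-plus-loose-pointers idea concrete, here is a minimal Python sketch. The dataset name, S3 path, and fields are purely illustrative; a real lake would use a catalog service rather than a dict, but the shape of the idea is the same.

```python
# A sketch of the lake's storage style: data lands as delimited rows (no
# relational pages or indexes), with a loose "catalog" layer over the top
# that records where each dataset lives and how to parse it.
import csv
import io

# Raw object in the lake: just delimited rows, nothing relational about it.
raw_object = "order_id|region|amount\n1|west|30.0\n2|east|45.0\n"

# Loose pointers over the top: a catalog entry records location and shape.
catalog = {
    "pos_sales": {
        "location": "s3://my-lake/pos_sales/2020/06/01.csv",  # hypothetical path
        "delimiter": "|",
        "columns": ["order_id", "region", "amount"],
    }
}

# A consumer finds its way around via the catalog, not via the storage itself.
entry = catalog["pos_sales"]
reader = csv.DictReader(io.StringIO(raw_object), delimiter=entry["delimiter"])
rows = list(reader)
print(rows[0]["region"])  # → west
```

Note that the values come back as strings; it's the catalog entry, not the storage, that knows anything about the data's shape, which is exactly the "loose pointers" point.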
But I do see a lot of companies that are starting to create a parallel strategy to that with cloud storage. Also, when I say enterprise, don't be alienated because you're working in one part of the enterprise and yet you're doing this stuff, because especially in big enterprises, it's just too much of a ball of wax to get your arms around the entire enterprise. Everything that rolls up to the CEO, it must be at that level to be an enterprise data lake and to do it right and all that? No, no, no, not necessarily at all. Now, you want it as big as possible, but that "as possible" is very fungible here, because I don't want you to just stop within your, say, small department and say, well, you know, that's all I can see, so I'm going to make a data lake here. I mean, reach out, branch out, but there do become limits to one's capability. I'd say if you're not going to be able to start and finish something of value to the business with a data lake approach in the next, say, six months, then you're thinking too big. And this is where a lot of companies just stall out: because, well, it's not big enough, it's not at the enterprise level, so let's just, I guess we can't do that. And I've heard the same thing multiple times in the world of data warehousing. "We can't do that. We'll just build data marts." Well, no, you can build the enterprise concept in a smaller part of the organization. So over here under the vice president maybe is good enough, and maybe this organization will have two or three. The enterprise is kind of an approach to things. Now, I mentioned HDFS, Hadoop, right? Hadoop versus cloud storage, and why is there some pull away from HDFS and toward cloud storage? Cloud storage is going to be cheaper. And that's obviously a big selling point. It doesn't have as much formatting on it as HDFS. HDFS has a little bit more.
That gives HDFS, I would say, straight-up better query performance, although "straight up" is kind of fungible. But generally speaking, I've had better luck out of the box with HDFS for query performance. On all these other dimensions, though, I haven't. And I think we've learned, especially in these early days, that we don't have a huge army of people necessarily that's on this. I know I said earlier it's an enterprise data lake, you've got multiple people, multiple applications. It's true, but it's not the armies of people that the data warehouse is supporting. Not today, and probably not for the foreseeable future. So I'm not saying performance isn't important. It's very important. But other things are important as well, around the manageability concept of it and the cost concept of it and so on. And by the way, there are many things you can do with cloud storage to speed up performance, and it's worth learning what they are. The vendors are definitely supporting cloud storage more and more. And we're seeing it perform much faster, exponentially faster, over the course of the past few years of benchmarks we've been doing. So cloud storage, yeah, let's use cloud storage for our data lakes going forward. It's got a lot of things going for it, including more achievable separation of compute and storage. Compute resources can be taken down, scaled up or out, or interchanged without data movement, to a higher degree, I would say, than with HDFS. And what else do I want to point out here? Yeah, most of query execution time is spent processing, not transporting data. So if cloud compute and storage are in the same cloud vendor region, performance is hardly impacted. It's just one of those tips, if you will, that we've learned makes for a great cloud storage environment. So in quick summary here at this point: we have data lakes, good for the enterprise, not necessarily for the entire enterprise. They are on cloud storage, they're not using SQL straight up, and they're not in relational format.
So stuff like S3 and Google Dataproc, and we're going to get to some names here as we go along. Data lakes, cloud storage versus data warehouse. Okay, this is the other side of the coin. I just juxtaposed cloud storage with HDFS, and now I'm going to talk about cloud storage versus the relational data warehouse. Now, before I launch into this, let me just say data warehouses have no end in sight. I'm pleased to say that. I get calls every day, you know, "Should we do everything in the lake?" And usually, anybody saying that, when they describe what they mean by data lake, they're essentially describing the data warehouse. And the RDBMS is going to be a better technology foundation, along with the ecosystem, for supporting the things that that person is describing. So again, it gets back to terminology. So much right now does. But data lakes versus data warehouse, some of the distinguishing points here. And you might need these as you're determining which of these analytics platforms to put your data into. And sometimes you select both of them. As a matter of fact, I'm going to suggest that data lakes actually collect all the data and move it on to the data warehouse from there, so the lake can be your staging area. But that's not really what I'm talking about here. I'm talking about actual access to the data. Where are you going to access the data? And do you even need to move it on to the data warehouse or not? Okay, so in the data lake, usually there's no pre-specified data model, although if one comes with the data, I'm not inclined to blow it away. A data lake is great for history data. And by the way, a little side soapbox here, another one. History data: where are you keeping your history data? There should not be three, four, five answers to that. There should be a clear answer for the enterprise. And I don't mean just one place. I just mean a clear answer.
Maybe you're collecting your POS history data in the lake and you're collecting your customer history data in the warehouse. That's okay, as long as that's what you're really doing. But it should be clear. This is one of the, you know, 30 questions that you want to ask yourself to make sure you're doing your job right, as a data warehouse manager, as a data lake manager, as an analytic data manager of the organization. Where is history data being kept? And hopefully the answer is it's being kept for a long, long time, you're not blowing that away, and it's kept in an accessible place. Tape drives are not a great answer anymore. So let me move on from the second bullet. A data lake easily accepts all data types, all data types. Now, the data warehouse is pretty good, the RDBMS is now pretty good, at accepting a lot of data types. As a matter of fact, that's keeping a lot of the things on the data warehouse that used to be destined for the data lake, because the RDBMS has done such a great job in the past few years of adding to its capabilities. It hasn't been sitting around while the lakes moved into its territory, if you will. So I see growth, I see massive growth, really, for both of these structures in organizations, in the actionable future, as I say; that's the future that you actually plan for. Yes, one day we won't need any of this. We're not planning for that day. All right, we're planning for the next five years, five to ten years maybe. And in that planning phase, we are planning in data warehouses for sure, and we're planning in data lakes for sure. Now, in the data lake, I mentioned fewer users, more exploration and discovery. You have a higher scientific level, I'll put it that way, of user base for data in the data lake.
But again, you know what, I'm really okay if you can think this through down at the technical level, about what these various platforms provide to you, and you make your decisions about what goes where, when, and how, based on that. I mean, that's really what you should do. But if you can't, you can make it at a more shallow level, and you'll still be okay. But it is important that organizations have a direction as to when the warehouse is going to be used, and which warehouse, if that's a question, and when data lakes are to be used. And yeah, sometimes it's obvious from how people are using it. You don't want to lock it in, but a lot of users out there, even the scientific ones, need some direction. More exploration and discovery, okay, lighter data governance, I'll define that as we go, limitless big data. So, I would say, you know, if it's really big data, and you know, interpret that for yourself, that needs to go into the data lake. I have also found, and I'm sure you have as well if you think about it, that the bigger the data, the smaller the audience for that data. So that kind of plays into this juxtaposition of lakes and warehouses. Now, let's talk about the balance of analytics going forward. Right now, the balance of analytics looks like this. Analytical applications, that's in my red, okay, mostly done off the data warehouse, and you've got a lot of people, hundreds to thousands, maybe even tens of thousands of users, on the data warehouse in these enterprises, right? Then you have the data lake at the other end of the spectrum, if you will. It's possibly a larger data store; that doesn't mean it's a more important data store. It's possibly larger, and you have a smaller user base, and you're moving some of your analytical applications over there. So that's what it looks like today. Now, going into the future, I see that, let me fill it out here.
Yeah, you're going to have expansion of both of these, as I mentioned. Bigger data warehouses, bigger data lakes, and more analytical applications for sure, for sure, in the enterprise. And that means more on the warehouse, and that means more on the lake. It's kind of hard to say who wins; they both win, and they should work together. As a matter of fact, the concept that I'm proffering a little bit later is the lakehouse, and that really brings these two together in the most elegant way that we can do that today, and I think it's just great. So do note that I see an expansion of analytic applications everywhere. I think especially you're going to see an exponential expansion around the data lake. So yeah, we need these. All right, let's continue with some definitions and make sure we're on the same page here. The data warehouse: here we have the geeky fellow here, and I hope you accept that as I do as an endearing term. You have the one or the few supporting the many on the data warehouse. All right, and here I'm going to show the data lake. Yeah, you don't have as many users. They're a little bit geeky like us, and whoops, things moved a little out of order there, but we have at least the same kind of squad of geeky people supporting that data lake. Now, what you did see there, back up real quick, was the movement of the data lake over to under my data warehouse profile, because over time, we, the geeky people that are building this stuff, the builders, are going to have to know more about how that data is going to be used in the data lake than we do today. I can raise my hand to the ceiling on that one. Trying to catch up with the data scientists in organizations has been a challenge, because they're great at what they do, and they've devoted, you know, a career to that, and I haven't. I've been devoting my career to the building side of things, right? And many of you have as well.
So, the data lake: we're going to have to get to know it more, just like we did the data warehouse, where we made it really user-friendly and all this. The science is going to be moving to the data lake, and that's where organizations are going to forge competitive differentiation. We're going to have to be there. Okay, so the data lake does not sit alone, and that's one thing that I am really on the soapbox about these days: you don't just build a data lake and access it just like you did the data warehouse, with a BI tool that does some UI-friendly things, all right? To really get the full advantage of a data lake, you need a stack. You need a new stack. You need a machine learning stack, and it looks something like this. This is somewhat of a flow, one flow of many that there could be, for using the data lake, and using it in conjunction with the data warehouse. So historical transactional data might be on either of these, and you might have split that into categorical data and quantitative data, which enters your machine learning platform, and I'm going to share with you some meat on that in a minute here, all right? That's where the models are going to get trained, scored, evaluated, probably retrained, okay, a few times before deployment in conjunction with the data in the data lake, where it operates in real time, and scoring occurs, and finally, what it's all about: business action occurs. We work backwards from that business action that we want, to build out this conceptual model. So the data lake doesn't sit alone. As a matter of fact, this is for all the technical people out there; these are really a couple of great stacks for what we're talking about here. When I said the data lake doesn't sit alone, okay? So this is the Azure stack and the AWS stack, and there are many others, let me just say that right now. When I say data engineering on here, I do mean the data lake. When I say data analytics on here, I do mean the data warehouse.
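As a rough illustration of that train, score, act flow, here is a toy Python sketch. The "model" is just an anomaly threshold (mean plus three standard deviations), and the field names are hypothetical stand-ins; a real stack would train on a platform like the ones discussed below. The point is the shape of the flow: historical data trains the model, live events get scored, and a score drives a concrete business action.

```python
# Toy sketch of the conceptual flow: train on historical quantitative data,
# score live events, and work backwards from the business action we want.
import statistics

def train(historical_amounts):
    """'Training': learn a flag threshold from historical quantitative data."""
    mean = statistics.mean(historical_amounts)
    stdev = statistics.pstdev(historical_amounts)
    return {"threshold": mean + 3 * stdev}

def score(model, amount):
    """Score one incoming event against the trained model."""
    return amount > model["threshold"]

def business_action(event):
    """What it's all about: scoring drives a concrete business action."""
    return f"hold transaction {event['id']} for review"

# Historical transactional data, pulled from the lake or the warehouse.
historical = [20.0, 35.0, 18.0, 42.0, 25.0, 30.0]
model = train(historical)

# A live event arrives and is scored in real time.
live_event = {"id": "txn-1001", "amount": 500.0}
if score(model, live_event["amount"]):
    print(business_action(live_event))  # prints "hold transaction txn-1001 for review"
```

In practice the evaluate-and-retrain loop sits between training and deployment; this sketch collapses it to keep the flow visible.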
There are many choices in these categories. Anytime I put a tool name on here, I have to say this, and I want to say this, and it's true: there are many choices in these categories, but these are the platforms that are really becoming the standard stack for Azure and AWS. They're the ones that we use, or the ones that we like. Now, I would not get into, oh, let's do a best-of-breed here, and let's use the Azure Data Catalog over here in AWS, not that that would work in the first place, but you know, mixing and matching across this is not something that I would get into too much, although that does not mean at all that I would not go with a multi-vendor approach as appropriate. So going down the line, you might have Qubole, Snowflake, with Databricks, and the Alation catalog, and, pulling this off the top of my head, Talend, you know, Talend for data movement, or Informatica for that. Yeah, that's another stack, if you will, a true best-of-breed stack. Go with that, if you like, as well, and then there are platforms like Cloudera that do it all as well, okay. So, you know, there are a lot of choices here, but let's kind of run down what you're going to need to really take maximum advantage of the data lake and get those analytics up and running. Obviously, a data engineering platform, HDInsight or Elastic MapReduce; this is the core of the data lake. And then you have the data analytics: Synapse or Redshift. Now, either one of these first two rows might be what you focus on to make the determination of the platform that you're going to use, and then you just fill out the platform with the other things you see on here, okay. Because a lot of you are asking me, or are fielding the question yourself, of should we go with Azure or should we go with AWS? Well, I'm trying to give you some pointers to unwind that thing and get it moving, and so sticking to the first two rows, I think those are the main things.
The other things are important too, though; you want to look at all of them. So, for your data science, there's Azure Machine Learning, and then there's SageMaker for AWS. Data catalog: you've got the Azure Data Catalog, and you've got the Glue Data Catalog in AWS. I do note that neither one of them has great workload management yet, nothing overarching; only individual applications have their own workload management features. That'll fill in over time, and I'm getting ready for that, so I went ahead and put it on here. Data movement: Azure Data Factory, or you have Glue or Data Pipeline. Your overarching deployment: the Azure Portal and the AWS Management Console. And for security: Azure Active Directory, or Identity and Access Management. Both of these stacks are pretty good at keeping up with one another and making sure that their stacks fill in, as you see with workload management. Not there for either one, so that'll change, and probably will change for both within a span of six months, so we'll see. But there you go, there's a couple of stacks, and again, there's more, but when you want to get the most out of the lake, you want to do all that. So, now, reference architectures are a funny thing, and I'm of two minds about sharing them in a presentation, but I will, because you like it, you tell me, and it's appropriate. What I also want to say about reference architectures is don't feel bad if yours doesn't look like this. As a matter of fact, none will look just like this. Okay, this is a nice clean PowerPoint reference architecture. Your mileage may vary. Your mileage may vary a lot. Yours might look like spaghetti or something. Okay, you want to start to unwind that, because that's a real problem. But anyway, you're just moving forward. We have low latency data that's going to come in now through a distributed pub-sub system.
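To keep the two stacks just walked through side by side, here they are captured as a small Python mapping. The pairings follow the talk's slide, not any official vendor correspondence, and workload management is deliberately absent from both, as noted above.

```python
# Category-by-category mapping of the Azure and AWS stacks discussed above.
# Workload management is intentionally missing: neither stack has an
# overarching answer yet.
stack = {
    "data engineering (the data lake)":      {"azure": "HDInsight",              "aws": "Elastic MapReduce"},
    "data analytics (the data warehouse)":   {"azure": "Synapse",                "aws": "Redshift"},
    "data science":                          {"azure": "Azure Machine Learning", "aws": "SageMaker"},
    "data catalog":                          {"azure": "Azure Data Catalog",     "aws": "Glue Data Catalog"},
    "data movement":                         {"azure": "Azure Data Factory",     "aws": "Glue / Data Pipeline"},
    "deployment":                            {"azure": "Azure Portal",           "aws": "AWS Management Console"},
    "security":                              {"azure": "Azure Active Directory", "aws": "Identity and Access Management"},
}

# Pick a cloud and print the shopping list for the full stack:
for category, vendors in stack.items():
    print(f"{category}: {vendors['aws']}")
```

The first two rows are the ones to focus on for the platform decision; the rest fill out the stack once that call is made.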
A lot of people use Kafka for that, or maybe Pulsar or RabbitMQ, something that's going to sort out the topics and get that into stream processing, which is going to be needed for that low latency data. That low latency data, that's the net new data to organizations in terms of getting under management, net new over the past five to ten years. If you have an older model of this architecture, you don't even have that in there yet. And you really need to get it in there and start capturing your low latency data wherever it may be. And then there's the batch data, which we're familiar with: applications, files, ELT or ETL, that's what you still have. And that's mostly coming from relational databases and ERPs and stuff like that. All of it's going into the lake; that's the box at the bottom. I'm using S3 as an example, so obviously AWS here. We have learned to prefer the Parquet format for that data because it's quote-unquote columnar and provides advantages to most of the queries that you're going to have ultimately from this. But that's not strictly necessary. And you have your data warehouse. Now, if your lake is the staging area for the warehouse, everything's going to flow from there, either through stream processing or ELT or some form of import. These days there are like five or six legitimate ways to do this, and ETL, of course, as well. We should probably have a focused webinar on that topic so we can sort out the ways, because that's pretty important as well. Okay, now the data warehouse, again, if we're keeping history in the data lake, is going to offload data to the lake, because even though I show a nice clean one-way street here, the data warehouse is going to capture some data, at least for the time being until you get to this nice target architecture, some data that's not in the lake. All right.
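A quick aside on why a columnar format like Parquet tends to pay off for analytics: an aggregate over one column only has to touch that column, not every field of every row. Here is a toy Python illustration of the two layouts; Parquet itself adds compression, encodings, and row groups on top of this same basic idea.

```python
# Row layout vs. column layout for the same data, and why columnar wins
# for the typical analytic scan.
rows = [
    {"order_id": 1, "region": "west", "amount": 30.0},
    {"order_id": 2, "region": "east", "amount": 45.0},
    {"order_id": 3, "region": "west", "amount": 25.0},
]

# Row-oriented: summing one column still touches every field of every row.
row_total = sum(r["amount"] for r in rows)

# Column-oriented: each column is stored contiguously, so the same aggregate
# reads only the "amount" column and skips the rest entirely.
columns = {
    "order_id": [1, 2, 3],
    "region": ["west", "east", "west"],
    "amount": [30.0, 45.0, 25.0],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 100.0
```

Same answer either way, but at lake scale, reading one column instead of all of them is most of the performance advantage.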
And it's going to have to push that data on there, and it may curate some data. Because of all the analytics that occurs in the data warehouse, some data may be curated there that you want in your lake. Okay, so I show that flow. And then I have a reach-through flow. See the Q? Okay, that's for queries. Queries that start in the warehouse, and most of them will, then reach on through to the data lake and capture the data there and pull that data in. It's not going to be as fast. It's not going to be as elegant. You're going to have to set it up. You're going to have to bite that bullet and get over that hump. Once you do, it's smooth sailing from there, and then you have created the lakehouse concept, and that is really a strong concept for the enterprise today, that reach-through to the data lake. Now, I'm not talking about your data scientists, which are kind of going the other way, and frankly, one of the reasons why we quote-unquote offload data from the warehouse to the lake sometimes is because the scientists want that data. Yes, they want that data in there, as well as all the maybe petabytes of detailed data that's coming in and being stored in S3. So moving on, here's an example architecture. I won't belabor this, but just to show you that forms of the architecture that I just showed you are actually out there in production. This is from Uber. They have, kind of like what I showed you, some low-latency data, some batch data. They have Kafka for their distributed pub-sub system. They have a lake, it's on Hadoop, formatted as Parquet. They use Spark ETL to get some of that, actually most of it, into Vertica, and Vertica is a foundation for a lot of the access that occurs around the machine learning that they're doing today, and of course for all the other more traditional, standard types of access via reports, etc. So yes, it's not all just theoretical. All right, using the data lake. Here we go, let's look at using the data lake.
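For what it's worth, one common way to lay out the lake in architectures like these is Hive-style partitioned paths, so query engines can prune down to just the dates a query touches instead of scanning everything. A minimal Python sketch; the bucket, table, and file names are hypothetical.

```python
# Hive-style partition layout for the lake: key=value path segments that
# query engines use to prune partitions (partition pruning is also one of
# the performance tips covered later in this talk).
from datetime import date

def partition_path(bucket, table, event_date, filename):
    """Build a date-partitioned object path, e.g.
    s3://my-lake/pos_sales/dt=2020-06-01/part-0000.parquet"""
    return f"s3://{bucket}/{table}/dt={event_date.isoformat()}/{filename}"

path = partition_path("my-lake", "pos_sales", date(2020, 6, 1), "part-0000.parquet")
print(path)  # → s3://my-lake/pos_sales/dt=2020-06-01/part-0000.parquet
```

A query filtered to one day then only opens the objects under that one `dt=` prefix, which is where most of the speedup comes from.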
Now, we're going to put some governance over the top, but it's not going to be your full-on data governance. It's not going to be the full complement of everything that you've been doing, hopefully been doing, or let me put it a different way, the data governance that you know you should be doing on the data warehouse. Okay, it may not be all that, but it is something. You just don't want to load garbage into your data lake. That's going to be a problem. So yes, we have data stewardship coverage, and ideally, a little soapbox time here again, ideally stewardship is by subject area, across business data subjects, and that spans all the different vessels that that data is going to be stored in, and that's okay. My point is, it does extend all the way to the data lake for sure. As well, the data catalog: you want that to cover the data in the data lake. There'll be a little bit more work because there's a lot of data there, but we want to establish data catalogs inside of organizations as places that really do facilitate data access as well as data management. You're going to have less transformation on data. You're going to have your nulls and blanks; you want them to be consistent. You want dates formatted in a certain way. Yes, data will have ranges, patterns, and outliers that you want to at least identify, if not manage, in the integration process. Yes, the data there will have some relationships that you'll want to know. There will be business rules that may have to be applied to that data. There may be data classification for data security purposes, for sure, in that data, and there may also be confidence scores. This is something that I keep talking about that very few people do, but it's how confident are we in this data, and different levels of confidence speak to the different ways that the data really can ultimately be used. So are people using the data lake out there?
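Here is a minimal sketch of what those lighter-touch checks could look like at integration time: consistent nulls and blanks, one standard date format, range and outlier flags, and a simple per-record confidence score. The field names, the expected range, and the scoring rule are all illustrative assumptions, not a prescription.

```python
# Hypothetical lighter-governance profile for records landing in the lake:
# identify issues rather than block, and attach a confidence score.
from datetime import datetime

def check_record(rec, amount_range=(0.0, 10_000.0)):
    issues = []
    # Nulls and blanks: treat None and "" as one consistent "missing" case.
    if rec.get("customer_id") in (None, ""):
        issues.append("missing customer_id")
    # Dates: enforce one format (ISO 8601 here) at integration time.
    try:
        datetime.strptime(rec.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("bad date format")
    # Ranges and outliers: at the least, identify values outside the band.
    lo, hi = amount_range
    if not (lo <= rec.get("amount", lo) <= hi):
        issues.append("amount out of range")
    # Confidence score: crude example rule, one deduction per issue found.
    confidence = max(0.0, 1.0 - 0.25 * len(issues))
    return issues, confidence

good = {"customer_id": "C42", "order_date": "2020-06-01", "amount": 99.5}
bad = {"customer_id": "", "order_date": "06/01/2020", "amount": -5.0}
print(check_record(good))  # → ([], 1.0)
print(check_record(bad))   # → (['missing customer_id', 'bad date format', 'amount out of range'], 0.25)
```

Note the bad record still lands in the lake; it just carries its issues and a lower confidence score, which is the lighter-governance posture described above.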
Yes, they are, they certainly are. There are successful implementations right now across many industries and use cases — pretty much, you name it — and a lot of these you're going to look at and say, well, I'm doing that today, but I'm doing it in my data warehouse. That's okay. But as you move into more maturity, as you move into machine learning, you are going to find that the data lake provides an efficient and economic component of the stack for doing this in the best way possible today, and so you are going to ultimately include the data lake in all of this type of analysis, even if the data warehouse is implicated as well; they work together in the lakehouse concept. All right, so let's talk a little bit about deploying the data lake. Now we're all convinced we need one, right? And we've maybe found some things in there that we can help out our organizations with by using a data lake approach, and we've kind of ferreted out what it is and how it's going to work architecturally, so let's deploy one. All right, so the data lake — it's not Hadoop HDFS, but these are managed deployments in the Hadoop family of products, and probably by this time next year I won't even be saying that anymore, because that would be kind of an older concept, but technically these are in the Hadoop family of products. All right, external tables in the Hive metastore — we're mostly still deploying Hive on these deployments — point to the cloud storage, and your big choices today are S3, Google Cloud Storage, or Azure Data Lake Storage Gen2, which we've recently benchmarked, and that is available for you to run SQL against the data, SQL-like kinds of things against the data. And don't forget, HiveQL and Spark SQL require entries in the metastore. Yeah, I'm going from 10,000 feet to two feet here, sorry about that. Object storage: instances and clusters have local storage on the physical drives mounted to the instances themselves — HDFS and Hive — so object store technologies access the cloud
vendors' respective cloud storage. I'll stop reading, but that is the data lake concept. The data lake of the future is going to be paired with an analytical engine that charges only by what you use, and I believe on a prior webinar in this series I talked a little bit about price-performance, because the pricing of these things is something you're going to need to know about. If you have a lot of data that can sit in cold storage and only needs to be accessed or analyzed occasionally, stored in cloud storage — this is that data lake concept. But what I want to say about this: this is emblematic of the situation that I mentioned earlier, where if you really know, down at the technical level, what's going on, you can think broader than just classifications like "data access will occur at the lake, or this mart, or that warehouse." You can think, well, what technically is really going to work, and if you really know the structure of the workload, you can make a great selection, not just a shallow selection, and this is part of that. If you have a ton of data that can sit in cold storage and only needs to be accessed or analyzed occasionally, and you know that to be true, put it in your lake. And this is a sample cluster configuration using Google BigQuery as the foundation, the analytic data store. So, I talked a little bit about AWS and Azure; I'm going to bring in Google here and show you some things. These are some of the different versions of the Google stack, at least at some point in the not-too-distant past, anyway. So as you can see, you're going to need a few things to complete the stack in order to do great data science kind of across the board with your data lake. So, some tips: if possible, configure remote data to be stored in Parquet format; use GitHub for your code distribution; use data partitioning to improve performance; co-locate compute and storage in the same region; encryption; and drop commonly used data in the lake — the data that would commonly be used by the users, or maybe
I should reword that: commonly used by the users of the lake, which are, again, the scientists. And one thing that we've just learned to do — to just do — is to push master data from MDM, if we have it in MDM, into the data lake. Yeah, that's a little different way of thinking, but again, we're trying to support our scientists, and some of them — I mean, we're kind of getting into it here, but they do work at different levels, right? And they may not want to be switching all the time. I don't mean to say they're going to do basic reporting — you know, the P&L reporting for last month being done by our data scientists — no, I'm not saying that. Not that there's anything wrong with P&L reporting, but I'm not saying that. I am saying that the scientists will do some other forms of reporting beyond their data science experiments, and they need some pedestrian data in there as well to do that. Okay, I think I promised somewhere in the lead-up to this presentation that I'd talk a little bit about the data that goes in there — and really, I'm talking about the data that we're going to store somewhere, that we're going to get under management, where we know where it is, where we believe in it, where it has acceptable data quality, where users are using it in appropriately user-friendly terms. So to get that data in, you have to look at the superset of data that you have access to, including third-party data. All data is AI data, though, so if you're truly concerned — truly committed, as well — to artificial intelligence, you will start to think this way. You'll start to think about storing all the data: all your call center recordings and chat logs, all streaming sensor data, all customer account data, email response metrics, and so on, video data, website behavior data. And that's a great place to start, by the way. Some people ask, well, where do I
start on this journey? Well, you have a website, right? And you want people to use it — even if you're B2B, you want to be great — so start collecting that clickstream. Start also looking for things that you can improve upon that are being done, maybe, the quote-unquote hard way today. Look for things that can be automated through artificial intelligence, and that usually backs you into the need for the data lake. Sentiment analysis, user-generated content, social graph data, and other external data sources — yes, third-party data. Third-party data does get stored in data lakes, absolutely. It's voluminous, and it's appropriate. You see, artificial intelligence and machine learning are on the horizon, if not here today, looking us straight in the face. Looming on the horizon is an injection of AI/ML into every piece of software that we buy, and I'm going through a study right now of all the great data integration methods of today out there in the organization — and they're not all just ETL tools, either, by the way; there's quite a variety of choices that you have — and artificial intelligence is one thing that we're looking for, because we think that's a showstopper, that's a keeper, whether it has artificial intelligence or not today. Today, the software should have strong AI — maybe not full AI, but strong AI — starting to be a little bit smarter. Okay, so much for software, but consider the domain of data integration: predicting with high accuracy the steps ahead, fixing its own bugs, you know, in the pipelines. Machine learning is being built into databases, so the data will be analyzed as it's loaded — as it's loaded, not after it's been loaded into a database and somebody comes along and does some great analysis after the fact, maybe weeks after the fact, but as it's being loaded into the database. Think about that, and think about the split of the necessary AI and ML between the edge, the corporate users, and the software itself — that's still being worked out, and you need to
think about that as you architect your data environment out there in the enterprise. Now, you're also going to need to separate data into training data and non-training data — real data. Training data for machine learning and artificial intelligence: you have to have enough data to build the model, and your data will determine the depth of AI that you can achieve. This is all tricky stuff, which we can go into in more detail, but for example, statistical modeling, machine learning, or deep learning — and the accuracy — the level of data will determine that. So, we have established the importance of the data lake, how to architect one into the environment, some specifics about building out a data lake, some of the workloads that might go on it, and how it works together with the data warehouse and other components in the environment. And this brings me to the end of the formal part of the presentation. Hopefully you've been lobbing your questions in there to Shannon; if not, maybe do so real quick right now, and I will entertain your questions now. Over to you, Shannon. William, thank you so much, as always, for a great presentation. And just to answer the most commonly asked question: a reminder, I will send a follow-up email for this webinar by end of day Monday with links to the slides and links to the recording. So, William, can you be more specific about lighter data governance in data lakes? Yeah, I mean, I tried to be, but — this is when you have the user community that you have on the data lake, as opposed to the user community that you have on a data warehouse, you don't need the same level of data governance. And by that I mean, maybe you don't need the same level of data catalog depth, you don't need the same level of business data definitions. You do need security, but you don't necessarily need the same level of, I would say, fine-tuned security, where you have, say, ten different profiles; you may have one or two, and that's
part of data governance. And so, the way I see it is, I think most organizations need more than what they have, but I think they're hesitant to get started, because they see what's going on over there in the data warehouse, and in many organizations, data governance isn't, shall we say, the most user-friendly of functions in the organization. It should be. It should be facilitating great data access; it should be creating assets that users really enjoy and get a lot of value out of. But if it's not — well, who wants that in this new structure that we're building called the data lake? So that's something to get over, because some is definitely appropriate, and some data quality — data quality is job one for data governance, right? — some data quality is going to be appropriate. So here you have data governance now in the organization creating a deep set of business rules that need to be applied to data as it goes into the data warehouse, and that's not all applicable to the data lake. But you know what I really like about where data governance is going? It's the ability, with these governance tools and the catalogs, to set your business rules and have them automatically find a way into the pipelines that you build, into the data that you load, so that all the data actually adheres to governance standards. So this is all going to get easier — all of what I'm talking about here, it's going to get easier — and this whole notion of, well, you have more governance here, less data governance there, eventually that'll go away, because we'll just have governance over the top, and that's what we're getting to. But in the actionable future, we need some light data governance over our data lakes. Why write to S3 with limited quote-unquote schema when it can go to the data warehouse with full schema and be just as accessible to machine learning, AI, stats, analytics? Well, that's
a great question, and when you say "just as accessible," it depends on the volume of data that we're talking about, as to whether that's just as accessible or not. Yes, it would probably be more accessible if you could store it in relational databases, but that's not necessarily the most cost-effective thing, and it's not accessible if it's not stored, if it's not managed — and that, unfortunately, may be the case in many organizations, because they're dealing with tons of data, I want to say tens of terabytes, hundreds of terabytes, potentially petabytes of data, that they're trying to bring into their machine learning algorithms to get them really, really fine-tuned, and that's just not going to make a lot of sense today in the data warehouse. Whereas in the lake — yes, of course, it's a lower level of function that you have, and yes, of course, it's a lower level of different things, like performance — but you actually have the data stored, you're actually bringing the data into the algorithms, and it's just a great symbiotic thing. So, horses for courses here: you're going to have some data that makes sense in both of them. And also, maybe another aspect of this question is, why should you bring it through the lake, why should you bring it through a staging area? You know, I said the lake is a staging area, so why should you bring it through a staging area — why are we staging data in the first place? We're staging a lot of data because it's not yet quite appropriate for its final formatting, and we're going to have to do some transformation to it, and we can do that in the lake, or we can do that somewhere else. A lot of people have used ETL tools for this, obviously, in the past, and one of the first big uses, I would say, of cloud storage was to take cycles off of that and just get it into an ELT place, in cloud storage, where that staging activity could be done. So I'd say the answer to the question is really
because we want to stage the data, and because it's appropriate for really large-size data that is appropriate for machine learning. What is the Hive metastore, and is it the only way to use a data lake? No, it's not the only way. It's just maybe something that's kind of stuck in my head, and stuck in our fingertips, as we create these things for our clients and for our benchmarks, but we just found that it facilitates the Hive style of queries, should we want to do any of that, and so we just automatically do it. And I think Hive is still an important part of the stack. Maybe it's a bit of a throwback, maybe you're not going to do it, but there have been a few edge times — and I can't really place it right now — a few edge times when, even with our vast set of other tools, we've fallen back on Hive to do a thing or two, and so I just kind of want that in there. The Hive metastore has always worked out great for us, and so I say just load it. So what about the cost and resources to connect and integrate various technologies to the data lake?
It seems this is usually very high. So that was about the cost of the resources that get connected to the data lake — kind of all these other things that I suggested here might need to get connected to a data lake. Yeah, oh yeah, that's why it's a stack. That's why I made a big point of it, that the lake doesn't sit by itself: it's a stack, and I can share with you some recent research that I've been doing on these stacks, and more. That best-of-breed stack that I talked about earlier — you know, I said QVL, Snowflake, Databricks, et cetera, piece it all together — that carries with it a large amount of overhead, in terms of not only the technical effort of connecting all the pieces, but also the work effort, the people effort, and the people effort is still very significant with data lakes, by the way. I didn't really bring that out in this presentation, but it is, and you want to minimize it, and so that's why a lot of people are thinking about these as stacks. And in some research that will be out, I'd say, probably in the next week or two — keep an eye on my social and look for that research — it will go into that question in a lot of detail and share with you the costs of the stacks. What could be the link, or the relation, between data virtualization and the data lake? Yeah, okay, I'm glad that question came up, because I didn't mention it, but, man, is it ever still something that needs to be in our architectures, and I usually say it, so slapping my forehead for that. But yeah, data virtualization should be a part of this as well, because, going back to the first ten minutes of this presentation, I was talking about the data warehouse and the data marts and the this and the that that you're going to have out there, and you're not going to have all the data in one perfect place for any individual query, so data virtualization is my catch-all. It's kind of like, you know, I've mentioned the Hive
metastore — we just do it. Yeah, I just want to do it in architectures that are serious, you know, because you're going to have those edge cases, and most of the time we build our architectures for the 80% and just kind of hope the 20% never happens. But it does happen, and then it sucks down, you know, 80% of our time at that point, right? If we can put a little more foresight into it, we can change that equation, and data virtualization is a big part of that. So I'm not here to say all the queries that go into the data lake have to go through a data virtualization layer — I'm not one of those virtualization people — but I think that having the virtualization capabilities across the lake and the warehouse helps with the lakehouse concept. But it's not — I mean, the lakehouse concept is the data warehouse actually doing it, but you have to set that up, and there's always a time lag, and so on. You don't want to get dependent on something that doesn't make sense for the long haul — like, you don't want to just do everything the virtualization way. Maybe I'm expanding the question way too much here, but you don't want to start doing things that are going to get a lot of traction, snowball down that hill, and you're just going to be still kind of doing that for the long haul, and it's not right. So put a fork in that at the very beginning and get data virtualization going in these environments, for sure. I think we have time for a question or two more. In the data governance section, you mentioned outliers and patterns along with data ranges; a data range I can understand, but not the patterns and outliers — could you elaborate? Well, we have expectations of data, that it fits. I'll start with data ranges, because the person gets that. Yeah, it fits within a range, right? If you have somebody's age, it's going to be between zero and — what's the
oldest person in the world, like 130? — so let's say 140. It's going to be in that range, and that's what we expect. If we see something outside that range, that would clearly be a problem somewhere. It's the same with these other things. Patterns — the patterns to data. If we see, I don't know, electricity usage or something being in a certain range for my house versus the commercial building down the street, which is at a different level, you don't want to apply the same rules to me as you do to that building. You want appropriate rules, appropriate patterns, established for me versus that building, and so "patterns" means multiple patterns across the data that are selectively applied, depending upon other components of the profile of the person, or whatever the thing is that you're modeling there. And outliers — same thing, keeping on with the electricity example: I'm going along at, I don't know what the good numbers are, but 10 kilowatt-hours or whatever, and suddenly it spikes to a thousand. That's an outlier; that's a problem. And sometimes these things are captured in analytics, and you know what, that's okay too — somebody sits there at the end of the load cycle, runs some queries against the data, and asks: are there any outliers in this data, are there any patterns in this data that are a problem, anything I should be aware of, any patterns happening in the data that are good patterns, like, wow, sales keep trending up in this region, so let's find out why and apply it everywhere? You can do that in analytics, after the fact, but you can also do it the data governance way, and data governance, to me — in terms of its life cycle, it really gets strongly applied to the data before the data gets made ready for users, and so it's in the data that gets released to
the users. I won't say that once it's loaded, it's ready for action. Once data is loaded in the data lake, or the warehouse, or what have you, it's not necessarily ready for action. You still might have some things to do to that data; you still might have some transformation to apply to that data. Okay, great, but where does it come from? What rule basis does it come from? I say that comes from data governance, and that applies to the data before it gets released to the user community. So patterns and outliers — that's just part of that. William, thank you so much. That does bring us to the top of the hour. Thanks, everybody, for attending and being so engaged in everything we do — we just love it. Again, a reminder: I will send a follow-up email by end of day Monday for this web presentation, with links to the slides and links to the recording. Thanks, everybody, so much. Stay safe out there. Thanks, William. Thank you. Bye-bye.