Hello and welcome, my name is Shannon Kemp and I'm the Executive Editor of Data Diversity. We'd like to thank you for joining the second installment of the new Data Diversity Webinar Series, Data Insights and Analytics, brought to you in partnership with First Hand Francisco Partners. To kick off the second in the series, John and Kelly will discuss today's topic, Data Lake versus Data Warehouse. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DIAnalytics. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce our speakers for today. Well-known industry analyst John Lathley is a business technology thought leader and recognized authority in all aspects of enterprise information management, with 30 years' experience in planning, project management, improving IT organizations, and successful implementation of information systems. He is the president and chief delivery officer at First Hand Francisco Partners. Also joining us is Kelly O'Neill. Kelly is the founder and CEO of First Hand Francisco Partners. Having worked with the software and systems providers key to the formulation of enterprise information management, Kelly has played important roles in many of the groundbreaking initiatives that confirmed the value of EIM to the enterprise. Recognizing an unmet need for clear guidance and advice on the intricacies of implementing EIM solutions, she founded First Hand Francisco Partners in early 2007. And with that, I will turn it over to John and Kelly to get today's webinar started.
Hello and welcome. Hello. How is everybody today out there listening? Good morning, good afternoon, good evening. Hello there, Kelly. Hello. So let's get started. First of all, the topics for today. We are going to talk about some definitional things. We have a wide variety of people listening to these webinars: some very experienced people out there listening for, you know, little gems and insights, which we hope to provide, but also some people that are still grappling with some basic fundamental issues. So we will start today with some definitions and differences and some conceptual things, then get into optimization and how to deal with the various weaknesses and strengths. We'll take a look at some examples, because that always helps tell the story, and then end up with some findings and takeaways. Maybe you'll hear something you weren't expecting to hear; then again, maybe you'll hear exactly what you were expecting, but we'll find out. As always, we will leave some time for questions, and we'll try to answer as many as we can. If we can't answer them, please keep submitting them; we do get around to answering those questions at some point after the event. So without further ado, I think we'll just keep going here. We're going to start by asking all of you a question. It's a poll, and Shannon, that poll is available to the folks, I believe, and we're ready to go on that. If not, just pop in and say so. And away we go here with the poll. First question: what type of data repository does your organization currently have and actively use? And the key there is actively using, because a lot of people could have something that they built a while ago and nobody uses it. So it's got to be in use. If you're not using it, just say neither; the options are lake, warehouse, both, or neither.
Second question: if your organization has a repository, are you going to do something to it, like enhance it, improve it, streamline it, put some investment into it in the coming year? The answers are yes, no, and we don't know, which is a perfectly acceptable answer. So anyway, Shannon is... We have the first question open, which just closed. Sorry, I set it up as one question at a time. So let me push out the poll results there. Can you see? Oh, okay. We'll just do this. We're pushing this out as we speak. There it is. I see. All right. There's question one. Okay. Everyone's answered it. So question one: it looks like the vast majority are using data warehouses. Only one person is using a data lake, 102 are using data warehouses, 28 are using both, and 22 are using neither. And the other 200 people are just hanging out. They are. They are. Okay. Waiting till after the webinar, and then they're going to answer the question. Yeah. Okay. Are we ready to push out number two then? Absolutely. Here we go. Poll is open. All right. If you have one of these things, are there plans to do something with it? You know, if you have a warehouse and you're thinking lake-ish, are you going to spend some money this year, or no, or stand pat, or you don't know. That would be another good question here, and this is helping us understand our audience. We have a lot of people listening to these, which is wonderful, but we also know that that means a very, very diverse range of experience and skills and architectures and such. So we do appreciate that you've given us a little bit of insight here. We'll probably ask a few more polls in a few more webinars as the year goes on.
And the time limit is 30 seconds on each, and that's why I'm just talking here, because I think we're almost up to it right now. We are. They're out. There we go. So it looks like most people are going to do something this year. The majority of folks are going to do something, so I think we're going to get some insight out of this. We'll just move on. Do I have control back? Yes, I do. I'll move on. And Kelly, take it away. Hello, Kelly. Hello, world. There's my mute button. So, really interesting results. Thank you guys for participating. One of the things when we were thinking about this poll: I guess, Shannon, it was a year ago that we did our survey around data science versus business intelligence and that sort of thing. I think it was a similar sort of outcome, where there is still quite a bit of investment happening in data warehouses, even though the talk is all about data lakes. Anyway, that's a very interesting comparative poll. So, just pulling from the industry analyst Gartner, how do we define a data warehouse versus a data lake? Data warehouses are generally constructed to solve predefined questions or to aggregate data for a known set of analysis and reporting needs. As a result, they have clearly defined user communities. Many times, data marts are built on top of data warehouses to then provide an even more specific user community a subset of that aggregated data, structured in a way that makes the most sense to that community, like a finance data mart or a sales data mart or what have you. So it's a storage architecture. Now, a data lake is also a type of storage architecture, but it tends to be more broad, more distributed.
And data lakes are designed for more fluid environments, no pun intended, in which some of the questions are known, but many of the questions are not known. As a result, the potential users of the data in the lake are also highly fluid. Data can be presented in a variety of different ways. And so it does need to account for the fact that it isn't a predefined community asking a relatively predefined set of questions, which means you don't know what information they are looking for or what attributes of the data they will be looking for. So it's kind of trying to fill this unknown dark abyss at the bottom of the ocean, represented here by the iceberg, which is one of my favorite graphics. So, just for some very simple level setting. We thought it would also be good to quote James Dixon, who is thought to be the creator of the term data lake. Someone could correct me, but as far as our research shows, all arrows point to him. And I know that he talks about data marts here, but you can consider this description that he's providing as also describing a data warehouse. So, in the past, the standard way to handle reporting and analysis was to identify the data, identify the most interesting attributes, aggregate these into a data mart, and provide that data in, as he says, a cleansed, packaged, and structured way for easy consumption by those known communities of users. The challenge with that, of course, and this is pulled from the same blog post in which he draws this distinction between a data mart and a data lake, is that only a subset of those attributes are examined. So the assumption is that you know everything that you need to know when you set up that data mart, or the data warehouse, if you will. And the data is aggregated.
So there's a level of visibility that is lost because the data is aggregated, whereas the data lake was created to address those challenges and requirements, to provide a more optimal solution that presents the data in that kind of raw, natural state; that's why it's called the data lake. So, just a little bit of background. Some of the key differences, as you probably heard as I was talking through this anyway: in the data warehouse, the data itself is structured. It's alphanumeric characters in a row-and-column structure, traditionally held in some sort of relational database management system. A data lake can take that structured data, but it can also take things like images, videos, or any other sort of data that doesn't conform to a certain model. So it could be the very raw data; it could be lots of different data types at the same time. The way that data is processed through a data warehouse versus a data lake is actually one of the biggest differences between the two, and is an interesting way to think about when to use one versus the other. Schema-on-write basically just means that structure is applied to the data as it is loaded into the data warehouse. Schema-on-read, by contrast, means that the structure is applied as the data is pulled out of the data lake. Fancy terms, but the implication is that if you're providing that structure as the data is pulled out of the data lake, then you have the opportunity to apply multiple structures, or lenses, depending upon the purpose the data is being used for. So this concept of structured versus unstructured also comes up. There is a big cost difference in terms of volume. Originally Hadoop was used most commonly for just cheap, high-volume storage, and then as the usage became a lot more sophisticated, people were using it more for analysis and that sort of thing.
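The schema-on-read idea can be sketched in a few lines of plain Python. This is a minimal illustration, not anyone's product: the events, field names, and the little `read_with_schema` helper are all invented for the example. The point is that the raw records land untouched, and each consumer projects only the "lens" it cares about at read time.

```python
import json

# Raw events land in the "lake" untouched -- no schema is enforced on write,
# so records of completely different shapes can sit side by side.
raw_events = [
    '{"user": "a1", "action": "click", "ts": "2016-03-01T09:00:00"}',
    '{"user": "a2", "action": "purchase", "amount": 19.99}',
    '{"user": "a1", "sensor": {"temp": 21.5}}',   # a different shape entirely
]

def read_with_schema(raw, fields):
    """Schema-on-read: project each raw record onto the fields a
    particular consumer cares about, ignoring everything else."""
    rows = []
    for line in raw:
        rec = json.loads(line)
        rows.append({f: rec.get(f) for f in fields})
    return rows

# Two different "lenses" applied to the same raw data at read time:
clickstream_view = read_with_schema(raw_events, ["user", "action"])
sales_view = read_with_schema(raw_events, ["user", "amount"])

print(clickstream_view[0])  # {'user': 'a1', 'action': 'click'}
print(sales_view[1])        # {'user': 'a2', 'amount': 19.99}
```

In a schema-on-write warehouse, by contrast, the third record would have been rejected or forced into a fixed table structure at load time; here it simply sits in the lake until some consumer defines a lens for it.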
I know a lot of people are still using it just for low-cost storage, which is, you know, great. Because a data warehouse has this requirement of structured data, schema-on-write, et cetera, this requirement for structure makes it more fixed, less agile, obviously. What that means is it can be incredibly time-consuming to change those structures to add new elements, new data types, new data sources, et cetera. In fact, when we were prepping for this, I remembered a client. It was about three years ago now. When we first started working with them, one of their issues was the performance of the data warehouse. And we asked them, well, how long does it take if you want to add a new source to the data warehouse? Their answer was nine months. We were a little gobsmacked at that answer. And I know that there are a lot of different reasons why the time frame could be that long, but part of it is because of the requirement for so much structure. The structure fits the requirement when it's built; imagine if that requirement changes and you need to refine that structure. It's kind of like building a house: building from the ground up the first time is a lot easier a task than doing a remodel. I don't know if anybody in the audience has done one or the other. My brother's a builder, and he prefers ground-up versus remodel any day of the week. Obviously there's a maturity difference. And then there's, like we said, a consumer difference. Data warehouses and data marts are generally consumed by specific lines of business that are looking to do things like operational reporting. For the data lake, we've coined this new term, data scientist: a special, fancy, and of course more highly paid sort of analytical user that has experience with the different tools and different capabilities needed to take advantage of the volumes within the data lake.
So these are some key differences that will start to play into our conversation around the appropriate and best use of one versus the other, or both together, as we will discuss as the webinar goes on. Back to you, John. Thanks, Kelly. And just quickly here, we're watching the questions coming in, and we're already getting something I was anticipating, which is a flurry of questions whose root is meaning and some of the definitions we've used already. This is kind of a theme here, and you're going to see some more of this as we go through. There are no pigeonholes, okay? A lot of the questions are, well, you said that a data warehouse is this; well, it also is that. And yes, there are shades of gray around everything here. We will be addressing that here in a little while. Let's talk about the challenges of the data lake, because it's a very popular type of construct and a lot of organizations are building them. And what we're seeing out there in our practice is that they're usually a rough start: a few nuggets come out, and then the sustainability is questionable. Some classic reasons for that are the resource intensiveness of early instantiations of this. You've got to get some really smart people that know about this stuff, and they're probably not around. Believe it or not, data lakes can be inflexible too, in a different way, just because of the volumes involved. They start out very sandbox-like, but they're not necessarily meant to be sandbox-like. Security, privacy, and governance: I'm going to kind of put all those three together. These are structures that aren't intended to be wild-west and free-form. They do have to have some discipline to them, but we see a lack of that. That is also evident in the clutter. Several speakers in this area use the family junk-drawer analogy: anyone that has that drawer in their kitchen, next to the refrigerator or wherever, that everything goes in.
I know when I was a kid, I used to enjoy dumping it out and exploring what had found its way in there. But we do get into this clutter, and you can't find things. Overconfidence is something that comes back and gets folks, because a lot of money gets spent, some really smart people do some really cool things with it and come up with a result, and that result doesn't work the magic that they had heard about. These are not unlike some of the early drawbacks of the early data warehouse technology, and these are all kind of growing pains, but they are all legitimate challenges at this point in time. The warehouse itself, I mean, it still has its challenges too. Talking about the enterprise model: in order to find things and load things, whether it's a star schema or a snowflake, normalized or denormalized, or whatever flavor of that you want or think it is or think it should be, there's modeling involved. And a lot of times, as most of us out there in practitioner land know, data modeling, instantiating a model, and getting it approved can be a difficult thing. Adding new data and subjects can take a long time. We did have a question where someone said it takes us a very short amount of time. That's absolutely wonderful. That is the exception; that is not the rule. It's typically six to nine months for most shops to add a new subject area or make a major change to a subject area, which ties back to that modeling issue. When you really hit the number of folks you want to use it, they tend to get a bit doggy. And of course, governance has to be there, because if you don't know what it means and you can't find it, it's difficult to get the benefits out of it. Some of these problems result in the expansion of shadow IT, or departmental BI. People get tired of waiting, and because of that, they do it themselves. Hadoop is out there as an option for the enterprise data warehouse, absolutely.
But what we're seeing is they tend to be siloed right now and not even part of a central, approved enterprise architecture. And lastly, there have always been big scalable solutions for data warehouses. There are appliances; there are vendors who have specialized over the years in very large data warehouses. None of them are inexpensive. So there is a set of challenges there. So why are we talking about all this depressing stuff? Well, we're going to get into some good news, and then we're going to talk about how this all works out. I'll turn this back over to Kelly to talk about some data lake good stuff that we've seen. Absolutely. So, in this slide and then the next slide, we'll talk about some use cases. These are three use cases for the data lake brought to us by a partner organization that we work with quite closely, and they are three great examples of different purposes for a data lake. On the left-hand side, this is a large medical device manufacturer, and they implemented a data lake for a similar sort of internal analytical function that potentially could also have been served by a data warehouse, but based on some of the requirements that they had, they decided to use a data lake structure. It is a pure cost center, and it's used to support data scientists and analysts to see if they can develop analytics and models that assist in improving the operational efficiency of the organization. So can they find trends in internal processes that will enable them to make recommendations to increase productivity, increase efficiency, and therefore, over time, reduce cost? Thereby they could justify it being a cost center, because the goal is that they are going to be driving best practices and driving improvements across the rest of the organization as a result. In this instance, it's entirely internal.
There is no connection to the outside world, and so it can be built entirely on-site in a building that you can lock down through, you know, physical locks, security badges, et cetera. In this instance, it would be architected so that you've got a traditional landing zone, a transformation zone where standardization occurs, and then of course the discovery zone where you put the analytics on top in order to consume the data. So that business purpose was determined when they created the data lake. Now, the second example is one in which a data lake is used to create a differentiator for the company. Think about this as your concept of recommendations: the data lake in fact helps to drive additional services and additional products that can be provided to the client. Through the concept of recommendation, additional products can be presented for consumption and purchase and things like that. This is so that the user can consume more features; they can consume more of the Netflix content, whether it is the traditional DVDs or streaming content, et cetera. And it helps in the user experience and the buying process. So in this case, they use products like Splunk to help manage the clickstream data. It's a very different environment because it's real-time. For the medical device manufacturer, the requirement for real-time was less of a priority, whereas for the operational differentiator, real-time was absolutely a requirement, which would drive the tools and the solutions that they would implement as a result. And in this real-time environment, they would not have the ability to go through the landing zone, transformation zone, et cetera, like the medical device manufacturer. Again, different use case, different architecture, different tools.
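The landing-zone / transformation-zone / discovery-zone flow described for the medical device manufacturer can be sketched as a tiny pipeline. The zone names match the talk; the machine records and cleanup rules below are invented purely for illustration and assume nothing about the client's actual data.

```python
from collections import defaultdict

# Landing zone: raw records arrive exactly as produced, inconsistencies and all.
landing_zone = [
    {"machine": "M-01", "cycle_time": " 42 ", "status": "OK"},
    {"machine": "m-01", "cycle_time": "38",   "status": "ok"},
    {"machine": "M-02", "cycle_time": "55",   "status": "FAIL"},
]

def standardize(record):
    """Transformation zone: standardize formats so downstream analysis
    compares like with like (consistent casing, numeric types)."""
    return {
        "machine": record["machine"].upper(),
        "cycle_time": float(record["cycle_time"].strip()),
        "status": record["status"].upper(),
    }

transformation_zone = [standardize(r) for r in landing_zone]

# Discovery zone: analysts sit on top looking for operational trends,
# e.g. average cycle time per machine.
samples = defaultdict(list)
for r in transformation_zone:
    samples[r["machine"]].append(r["cycle_time"])
avg_cycle = {m: sum(v) / len(v) for m, v in samples.items()}
print(avg_cycle)  # {'M-01': 40.0, 'M-02': 55.0}
```

As noted above, a real-time use case like the clickstream example would skip this staged flow entirely; the zones make sense precisely because the internal, batch-oriented use case can tolerate them.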
Now, the last one, the New York Stock Exchange, is one in which they were truly monetizing the data: access to the data lake is actually sold to companies so they can access the data themselves and run their own analysis. For example, does the Dow Jones index always go up in the time period between 3:30 and 4:00 p.m., when the market closes? Or does it only happen on certain days of the week? Or does it only happen when options expirations occur? So the idea is that people can pay to access all of the data that the NYSE has and take advantage of the lake directly. The implications here are that it has to have a very scalable multi-tenant architecture. It has to have extraordinarily high security, increased firewall protection, and of course secure user login, identity management, et cetera. So the point here is that these are all uses of the data lake, they're all very different architectural approaches, and each has different implications for when you're building the lake. So consider what your business strategy is for the lake as you start to construct it, because there are different requirements based on the way that you're planning on using the data, whether it's internal or external, real-time or batch, et cetera, et cetera. Okay. Good questions coming in here. One just recently came in that we really like. In fact, I'm just going to hit it now, if you don't mind. The question is about data lakes and poor data modeling, descriptive versus semantic. The thing with the data lake is it's raw. There is no data modeling a priori, before the data goes in. That is a significant, almost a cultural difference for those of us that have been doing the rows-and-columns stuff for 20 or 30, 35 years or so. As Kelly went over in the differences, when you talk about data modeling, that implies a schema ahead of time. There is no schema ahead of time in many, many of the data lake approaches.
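Tying this back to the NYSE example: a consumer of the lake imposes just enough structure at query time to answer their own question. The sketch below, with entirely made-up index levels, shows the kind of "does the index rise into the close?" check a paying user might run directly against raw minute-bar data.

```python
from datetime import time

# Hypothetical minute-bar records: (day, time-of-day, index level).
# The numbers are invented purely to illustrate the shape of the query.
bars = [
    ("Mon", time(15, 30), 17800.0), ("Mon", time(16, 0), 17850.0),
    ("Tue", time(15, 30), 17900.0), ("Tue", time(16, 0), 17880.0),
]

def rose_into_close(bars, day):
    """Did the index close higher than its 3:30 p.m. level on this day?
    Structure (which columns matter, which window) is decided here,
    by the consumer, not when the data was written."""
    window = {t: level for d, t, level in bars if d == day}
    return window[time(16, 0)] > window[time(15, 30)]

print(rose_into_close(bars, "Mon"))  # True
print(rose_into_close(bars, "Tue"))  # False
```

The same raw bars could just as easily feed a day-of-week analysis or an options-expiration filter; no a priori model constrains which question gets asked.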
The schema is derived once someone figures out what they're looking at. Really, really different than what we're used to, right? Most of the folks on the poll, and one of the reasons we took it was to find this out, have done data warehouses. So here's our use case for a data warehouse. This is a recent data warehouse. This is not something that was like, oh, wow, they just didn't understand what Big Data or Hadoop was or whatever. This is an organization that, through some analysis, determined it was not quite ready for that type of thing. It's an insurance company with some market-driven issues: independent agents put a lot of stress on insurance companies now, and those of us that watch television for more than 30 minutes a day will see at least two or three commercials for car insurance, for example. It's a tough, tough market, and this was an evolving environment. Notice we have two operational data stores here, because there were two. There were lots of data marts, and we just kind of labeled the whole thing the data warehouse. And out there, with some extra data movement, we created this analytical data store so that they could actually tackle this problem of agent and policyholder retention. It was driven on an appliance, and it lived merrily with all the other stuff moving around, what we call data wrangling. Now, is this what you would ideally draw as an Inmon-esque or a Kimball-esque architecture? No, but, you know, it fits, and it did the job. The users were in the areas that could most help address the business problem: sales and marketing; underwriting, to make sure that they weren't losing their shirt on products; and then claims, to make sure that service was adequate for the customers. So more traditional, sort of, but a little untraditional. But again, this is primarily what we would call a data warehouse type architecture for this particular use case.
In both cases, though, driving business benefit. Now, we've talked about the data warehouse. We have talked about the data lake. There are some differences, and some of our questions are reflecting that these are differences. And there are some of the old Kimball-versus-Inmon type things. But we want to make a point here: this isn't about choosing going forward. This is really about what fits your maturity, what fits what you're doing with the data. You can look at this many, many ways. You can look at how you use the data. You can look at how you perceive what maturity looks like. You can look at how the business or the organization is responding to its environment. You can even assign some technical characteristics to that. Many, many years ago, when we did reporting and stuff, it was about what happened yesterday; then we wanted to know why it happened; and predictive analytics is about what will happen. But we're moving very rapidly even beyond that, to making things happen by themselves, being more adaptive through machine learning, things like that, and asking, what should we do next? Actually making recommendations based on data. And you can label all of this in a bunch of ways, and it would be a great discussion to talk about all of these things. But the key is that it's not really about choosing. On one side of this, you could say that it's data warehouse-ish type stuff, as we're used to them historically being used. On the other hand, you could say it's data lake type stuff. But that's not necessarily an either/or, because we can get benefits out of both of those. So the thing to bear in mind as we're moving forward here is to understand what you want to do with all of this information, and what your business needs from all of that information.
Now we're going to talk about how to deal with some of the challenges we've mentioned and address those. But as we go through them, what we're saying here is not, John and Kelly said we should have a data warehouse, or John and Kelly said we should have a data lake. We're not saying that. What we're saying is, if you're going to have something that's data warehouse-ish or data lake-ish, you're going to have to deal with some challenges and head some problems off at the pass: problems that are already popping up on the newer things, and problems we've known about for years on the quote-unquote older type things. But the fact is that nobody fits neatly into one particular thing or the other. Anything to add to this one, Kelly, before we move on? This is kind of a big one here. Yeah. I think that this is probably the meat of this presentation, and we'll talk about it and provide some additional examples going forward. Technology is always evolving: data warehouses evolved for a reason, data lakes evolved for a reason, and there's going to be another technology that evolves for another reason, right? And each organization is at a different level of maturity with different business requirements. The idea is to think about what those business requirements are first, and how you then architect an approach that meets those business requirements, understanding your existing architecture, the existing tools available to you, your organizational capabilities, the amount of money you want to spend, et cetera, et cetera. Every organization is going to be at a different level, and we will give some further examples of how organizations have been able to leapfrog some capabilities. Other organizations are still dealing with some significant legacy environments, where they need to move a bit more like a battleship. And that's just reality, but that's the way that they can make progress.
And before we move on to the next one here, one thing I'm noticing in some of the questions coming by is that we are having an interesting discussion, and it is maybe worth having another one in the future, around the perception that a data warehouse is a certain thing or a data lake is a certain thing. Again, just to remind some folks, we've shown some common definitions, but there are going to be academic differences between what we've shown you and what some other people might say. For the sake of our discussion here, we've picked the contemporary ones. A good example would be if you have a data warehouse that is the classical monolithic, historical, time-variant structure and you're using it. That's wonderful, that's fine, that's awesome, but that definition has morphed over the years. Gartner is a good source of some new thinking about what a data warehouse really represents; if you're a Gartner client or have access to that type of material, that might be something for some of you asking these questions. So I wanted to throw that in there before we move on to some of the specific optimization things. Ready to go? I think, Kelly, it's your turn here. Sure. Again, just to reiterate what we learned from those data lake use cases: it is important to start with some sort of purpose, versus a build-it-and-they-will-come approach. And then, once that's done, the approach for optimizing the data lake ensures that you maintain the vision you started with in the first place, so that it doesn't become a swamp, and recognizes that there are very specific things that you need to do to maintain a rapid and scalable ingestion process. So although the tools are different, the same priorities apply in the sense of performance, data protection and understanding, and an integrated approach to the rest of your environment.
Now, it's also critically important to consider the capability of your users and address any skill gaps that might occur. I've heard that there are 40 job openings for every data scientist on the market. You may disagree with that number, but whether the correct number is 40 or 20, that's still a big gap. The reality is, if you think that you will be able to hire all new people, not just to build but also to use the data lake, then your cost ratio just skyrocketed, because the competition out there is going to be fierce for those one-in-20 slots. So it's a good idea to plan on upskilling some of your existing superstars, helping them through the transition from some of the older technologies to some of the more innovative technologies, and then, of course, putting some golden handcuffs on them so they don't leave to take advantage of the disequilibrium in the market. I think the reality is that that skill, and the ability to manage it yourself versus hiring external parties to do it for you, is very important. Now, in some instances you may need to provide something like a warehouse-esque environment as part of the data lake to accommodate those users who are more comfortable with a structured format. And I saw a question come in about whether anybody is using Hive to do that. Absolutely; in fact, we've got one client who is using Hive to have a, you know, I'm using air quotes right now, "warehouse" within their data lake. And this is a way that they are maintaining a unified architecture and a unified plan while accommodating these multiple user types. So, as John said, it's not this or that, but how do you accommodate the different requirements of your user communities? And what I think is compelling about this client is that they have a wonderful opportunity because they don't have a legacy infrastructure.
They're a spin-off from another organization, so they have the opportunity to create a new infrastructure based on their current business requirements, and they're lucky enough to have the funding to do it. So they are leveraging the data lake first and then looking at the, again with the air quotes, "warehouse" second to accommodate people who need a more structured data approach. And absolutely, Hive's a great way to do that. Back to you, John. Okie doke. So on the data warehouse side, if you're going to be using data warehouse-ish things, there are some lessons we learned a long time ago that are worth reinforcing here, because these questions coming in are great; they're showing such a tremendous range of people applying technology. We're literally seeing two or three generations of technology in the questions, which is amazing, and it shows why this is such an important series for us to keep hammering away at. We're talking about metadata because with data warehouse-type things, it's schema on write going in, so your metadata has got to be good; it's got to be easy to find things. A lot of data warehouses languish because even with all of the upfront work you still can't find things. Same thing with structure: there has to be enough consistency that you can navigate through the thing and find what you need. There has to be trust through data quality. This is a timeless discussion on the data warehouse. There's a part of me, and this is kind of tongue-in-cheek, that thinks if shops with a good data warehouse, some smart people, and some good enterprise architecture would just stop messing around, clean stuff up, fix data quality, and put a little bit of oversight on structure, part of my job would be different on a weekly basis.
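John's "trust through data quality" point can be made concrete with a small schema-on-write sketch. The column names and rules here are illustrative assumptions, not anything from the webinar: rows are validated against a declared schema before they are allowed into a warehouse table.

```python
# Hypothetical declared schema for a warehouse table: column name -> type.
SCHEMA = {"customer_id": int, "order_total": float, "region": str}

def validate_row(row: dict) -> list:
    """Return a list of data-quality problems for a candidate row.
    An empty list means the row may be loaded into the warehouse."""
    problems = []
    for col, expected in SCHEMA.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], expected):
            problems.append(f"bad type for {col}: {type(row[col]).__name__}")
    return problems

# A clean row passes; a dirty row is rejected with specific reasons,
# which is what builds the trust John is talking about.
assert validate_row({"customer_id": 7, "order_total": 19.99, "region": "EMEA"}) == []
assert validate_row({"customer_id": "7", "order_total": 19.99}) == [
    "bad type for customer_id: str",
    "missing column: region",
]
```

This is the essential contrast with the lake: in the warehouse the check happens on write, so consumers downstream can rely on the structure without re-verifying it.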
I might even have to find some other type of work, or do a lot more MDM or data governance work, because that would solve so many of the problems that create departmental BI. We have had clients where, when we assessed their cost of using data, only 10% of their total investment was in their data warehouse department or their information center of excellence or whatever you want to call it; the ICE, the BIC, the ACE, we've seen lots of different acronyms for those areas. We're talking an order of magnitude more spending outside of the official architecture. And that's a lot of money; even though it is getting people the data they need to do their jobs, it's a lot of money, and we could really manage it better. And of course, governance is there, too. This slide is kind of a recap, a reinforcement, that if you're going to be data warehouse-ish with some of your things, you really still need to consider these basic, fundamental lessons learned over the years. Now, let's pull this together. Kelly brought up a really cool example, and we've seen several of these. For those of you who have been doing data warehousing for a while, I'll use this analogy: you probably have a staging area or something prior to an operational data store, or maybe a staging area before data goes into the classical monolithic data warehouse. That staging area is where stuff goes in a relatively unfiltered or slightly filtered format. Well, replace that with a lake. Replace that with a Hadoop-based structure. It's kind of the same thing: you're putting a lot of stuff in there without a whole lot of concern about what it is, knowing you're going to square it away, figure it out, and clean it up. That way, you start to address the warehouse's lack of agility and its performance issues. You start to bridge over to something that's easier to put stuff into and get stuff out of, and that lets you experiment.
But it hasn't gone into the data warehouse yet, where once it's in there, it's got to be structured, it's got to be easy to find, and it's got to fit into this relational protocol. You can also, of course, keep the unstructured data in the Hadoop-type structure or in the data lake. The second, or leapfrog, analogy that Kelly mentioned earlier is just that remodeling thing. If you start to look at what it's going to take to bring your second-generation warehouse up to a fourth-generation-type structure, there is a strong case to be made for just turning the whole thing into a lake. Then take the lake and, as Bill Inmon describes in his book Data Lake Architecture with data ponds, you can go build ponds, you can build sandboxes, you can build a warehouse, and go on from there. So you could evolve from the warehouse to the lake if, again, you address some of those issues we talked about earlier. You still need to know what things mean. Schema on read is cool for a data scientist; it's not cool for anyone else. If you want a sustainable product that people are going to dip into, to create a pond or create a mart or whatever, there still has to be some reasonable sense to it and some governance and oversight around those types of structures. I think I've got that one nailed down. Kelly, is there anything else you want to add on that one? Sorry, I'm getting really passionate about this. For those of you who don't know my background, I started in data warehousing in '88, before it even had that name. Then some guy literally tackled me in a hallway in 1990, and his name was Bill Inmon, and he said, what are you doing? And I said, I don't know, we're just doing something for the client. He said, well, I invented the data warehouse; you're doing a data warehouse. And we were like, wow, that's really cool. He's been a buddy of mine ever since. But I've been in this a long, long time, and I've seen every shape and size and application of this type of stuff.
And the one thing before I leave this slide, because Kelly knows me well enough to know I'm on the soapbox now; I'll take 10 more seconds here. You have to have an open mind. You have to be very flexible with this stuff, especially going forward, because the demands, like those use cases she showed us, are incredible, and they are generating gobs of money for these organizations. And you can do it for yours, but you're going to have to really have an open mind and be flexible on some of this stuff. Okay, I'm sorry. There we go. Next slide. Actually, not next slide yet. Oh, yeah, let's go back. Okay, sure. So one of the things that is interesting is that just because you leapfrog and go directly to the data lake doesn't mean the challenges go away, right? The same client who was able to create their infrastructure brand new, and who uses their data lake as the primary data repository in their organization for analytics and that sort of thing, realized that they have naming-convention problems, that customer names are not standardized, and that they'd end up with a big mess that makes it very hard to accurately analyze that data. So they have a master data issue, right? Well, they don't want to go out and invest millions of dollars to implement a master data management solution. So they created a name canonicalization model that they're going to use to help standardize and aggregate names. Essentially they've custom-built a mastering process within the lake. Regardless of your opinion on that, I think it's highly creative, I think it solves the business problem, and in their view it's a much more economical way than taking the traditional master data management approach.
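In outline, a name canonicalization step like the one Kelly's client built might look like the sketch below. This is our own illustration of the general technique, under assumed rules (case folding, punctuation stripping, suffix expansion); the client's actual model is not public:

```python
import re

# Hypothetical expansion table so that common company-suffix variants
# collapse to the same token before matching.
SUFFIXES = {"corp": "corporation", "inc": "incorporated", "ltd": "limited"}

def canonicalize(name: str) -> str:
    """Produce a standardized key for a customer name: lowercase,
    strip punctuation, collapse whitespace, expand known suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    tokens = [SUFFIXES.get(t, t) for t in cleaned.split()]
    return " ".join(tokens)

# Variant spellings now aggregate under one master key.
assert canonicalize("ACME Corp.") == canonicalize("Acme  Corporation")
```

Grouping lake records by such a key is the "mastering process within the lake": far cruder than a full MDM platform, but, as Kelly notes, often economical enough to solve the immediate analytics problem.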
And I do think that that sort of creativity is going to continue to occur, and ultimately there might be a more economical way to, quote unquote, master the data in the lake rather than doing custom builds and things like that. So anyway, I think it's very interesting, and it just shows the multiple ways you can creatively use the concept of the lake as a data repository while realizing that some of the old challenges don't just go away. Yeah, and that's a perfect segue. You've heard this already, so we'll only spend about 30 seconds on this part of the story. Blend; be creative. It's not an either-or debate. There are some real, powerful advantages to either structure. The data warehouse is not dead. Some people are saying the lake is going to kill the warehouse; I don't believe that for a minute, but I do think we have another tool in the toolbox. So, before we get to questions, I want to talk about the process to get there, and Kelly and I will walk through this together. When in doubt, take a look at all of your business requirements. The upper left-hand corner of this panel is a deliberately fuzzy representation of a lot of requirements. Then apply some type of analysis to them, and I mean real engineering-style analysis. Don't just say, and I see this a lot, so Kelly can back me up on this one, "well, that really is something a data warehouse should do," boom, it goes in the data warehouse; or "that's analytical, it's got to be in the data lake," and bang, it goes there. Well, actually, no. We don't have time today to cover them, but there are techniques, quantitative techniques, qualitative techniques, algorithms, all kinds of stuff, to figure out what you need. What you find out is in the lower right-hand corner.
In the lower right-hand corner, you find that a bunch of requirements are going to lump into buckets: predictive models, plain old BI, maybe operational alerts, trending reporting, ad hoc reporting, or just plain old row-and-column, hey-what-happened-yesterday type stuff. And you're going to find that there's a whole spectrum in your organization, and there is not going to be one single answer. Your result becomes engineered, and what we have here on the left is kind of a hybrid architecture. It's overly simple; I know some of you are going to send a note now saying, well, that's not really how it is. No, this is simple. The data has got to land somewhere. Hardly any of us have a clear picture of our data landscape, and most of us have more data sources than we really want to have. So it has to land somewhere. Then it has to get ingested, because ingestion is a formal process for a lake or a Hadoop-type environment, and we can put it in the lake. And there it sits, in that quasi-staging area we talked about. We can spin off a sandbox really easily from that. We can even do analytics directly against the raw data if we want, because it's schema on read, so that's okay. But we could also take that data and create a data warehouse and fulfill and supply our BI needs as well. So this hybrid architecture is very, very possible. Just to review: document your business needs. Get your metadata, the stuff you want the business to do, the stuff you want to measure; good old-fashioned data strategy and data architecture work. Find the patterns; there are patterns that will tell you warehouse-ish or lake-ish or in-between-ish, and those patterns will form groupings. Now you know who your audiences are, so do your best fit. Gartner uses the term "logical data warehouse," and it is kind of a good way to look at things now. It's not just one monolithic database anymore.
You have to come up with a set of places that people can navigate to do the jobs that need to be done by your organization. And so now we will move into the Q&A session. Let's see here. I guess we should just wind down through these. Let me just go back a little bit. Boy, there's a lot of them. Where can we start? These are all timestamped. Good heavens, this is a big list. Thank you, all of you, for hanging in here. Well, let's just start about 40 minutes ago. Do you see companies using the lake to do traditional warehouse things, like Type 2 slowly changing dimensions? For one thing, I can tell you there are people who will argue that putting "Type 2 slowly changing dimension" and "data warehouse" in the same sentence commits a heinous sin, reopening the old Kimball-versus-Inmon war from years ago. But that said, Kelly, that was the example you expressed, right? Exactly. Yeah, exactly; there are people doing that now. The next one was a comment I already addressed earlier. Let's see here. Can Hadoop be used as a data warehouse effectively? Well, okay, that's a semantics thing. If you're going to call a data warehouse something with structure that maintains time variance and all of that kind of stuff, probably not. And here's why; and this is for those of you listening, we've got about 400 people still on right now. The old question of how you know your data warehouse is built right: I can ask the exact same query 10 years from now and get the exact same answer, because we've managed the time variance properly. Hadoop is not set up very well to do that. So I would say we're probably leaning away from that. Can the ODS be a data lake?
Well, with an operational data store, the key word there is operational. If your latencies are incredibly low and you're streaming in a lot of stuff, Hadoop is a little sluggish for real-time update. You can keep adding stuff to it, but if something changes, you're really in a bit of a pickle. This is where the engineering comes in. If I have a transaction that has changed a value, and I want to update that transaction in place for an operational use, that's going to be really tough in Hadoop; but if I just want to add something to the end of a columnar structure, maybe I can get away with it. Again, this is where the engineering comes in; there's no pat answer. Let's see here. If the data assets in the data lake are stored in a near-exact or even exact format as the source, wouldn't that make retrieval from the lake challenging? Yes. Kelly, that's why we have really smart people plowing through that stuff, right? Yeah, that's right. We have been in the room and watched these people working, and we work with them all the time now. And yes, it is that very challenge that spurred the evolution of warehouses and marts. Again, these questions are speaking to our point, which is that one size is not going to fit all; sorry to be redundant, but I say that for emphasis. It can be really hard to get something out of the data lake unless you've developed some understanding, when you loaded it in, of where things are, and put some type of metadata on it. There's a new product out, Atlas is one; there are vendors now, and we're not naming vendors on this call, that help you manage your data lake and actually put a layer of metadata on top so you can do a relational-style navigation of that stuff. But in its raw form, it's very hard to navigate. Do you recommend the data lake as a play area? Possibly. And then the data warehouse once you figure out the need? Absolutely. How do you govern a data lake so it doesn't become a swamp?
That's the governance requirement. If you don't keep that lake pristine, with a lot of rules, a lot of governance, and clarity about what you're putting in there, it will become that junk drawer. Right, Kelly? That's the metaphor we've heard even more than swamp: it's a junk drawer. Yeah, absolutely. And there's governance around the lake, and there's general governance around the data to start with. If you have a good understanding of the data from a source perspective, then it's very easy to extend that understanding as the data gets loaded into the lake. The tools are becoming much more sophisticated at sharing that data understanding; the concept of metadata around the data lake is coming to fruition now that Atlas is becoming more mature, and the tools are going to continue to mature and give us that visibility. Security and privacy capabilities are also continuing to mature, and there are third-party tools you would use to manage security and privacy, considering things like how you can obfuscate the data. But again, it's not just governance around the data lake; it's governance around the data, and ensuring that that same governance extends to the data lake is, I think, the important way to consider it. So, next question; you segued to it beautifully. How much data governance is needed in the lake, and if data governance is applied, does it make data ingestion slower? Data governance isn't necessarily an automated thing; you apply some rules of engagement with the data. And by the way, not reflecting on who asked this, but we also hear similar questions from data scientists who have absolutely no idea what data governance is. They understand the algorithmic part of their work and have no idea that there is actually a process out there to prevent things from happening.
It's more about applying the rules and regulations before the data even gets there. There is no automated thing that would slow you down; if you get things positioned the right way before they're even loaded, it's not going to slow you down at all. The next one is a really nice comment, and it's important for everybody listening. Starting out to build a Hive environment, we are seeing limitations; for example, we can do an append, which I talked about, but are recommended not to update. Absolutely. That's like changing the tire on an Airbus while it's rolling down the runway; it's really hard to do. We are also seeing very slow query response times, especially compared to data warehouse solutions. Absolutely, because these are columnar-type structures. They are designed for sequential reads, not for direct reads, not for selects, not for joins. You know, we get really wonky when we have a four-way outer join in a data warehouse; a simple two-way join in a massive Hadoop-type structure could be a real killer for response time. Translating what we would call a two-way join into a search through HDFS can be a nightmare performance-wise. Then again, this is where the engineering comes in. If you need to do something low-latency, near-real-time or real-time reporting or BI or operational-type things, you're probably not going to get a lot of help from a Hadoop-type structure. Let's see here. We're at the top of our time, and we still have a lot of questions. Here's what we're going to do: these are going to be sent to Kelly and me after the talk, and because this is a rich set of material, we're going to answer them tonight and get those back out to everybody as quickly as Shannon can send them, because we are out of time. Thank you so much, everybody, for the tremendous amount of interest here today. You can see what a hot topic this is and how dynamic our industry is.
It's a wonderful thing. Shannon, I'll turn it back over to you for a wrap-up. John and Kelly, thank you for another fabulous webinar, and thanks to our attendees for being so engaged in everything we do. As John said, I just love all the questions coming in on this topic. And as John mentioned, I will send a follow-up email to all registrants by end of day Monday with links to the slides, the recording of the session, and answers to the additional questions that are still outstanding. I hope everyone has a great day. Again, thank you for participating; we hope to see you next month. And John and Kelly, thank you again, as always. Yes, many thanks. Thanks so much. All you listeners, we'll see you next month. It'll be even more fun. Have a good day.