Hello and welcome, my name is Shannon Kemp and I am the Chief Digital Manager of Data Diversity. We'd like to thank you for joining the latest installment of the Data Diversity Webinar series, Data Insights and Analytics, brought to you in partnership with First San Francisco Partners. Today, Kelly O'Neill and John Loudly will discuss data lake architecture. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DIanalytics. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. Now, let me introduce to you our speakers for today. Kelly O'Neill. Kelly is the founder and CEO of First San Francisco Partners, having worked with the software and systems providers key to the formulation of enterprise information management. Kelly has played important roles in many of the groundbreaking initiatives that confirmed the value of EIM to the enterprise. Recognizing an unmet need for clear guidance and advice on the intricacies of implementing EIM solutions, she founded First San Francisco Partners in early 2007. John Loudly is a business technology thought leader and recognized authority in all aspects of enterprise information management, with 30 years' experience in planning, project management, improving IT organizations, and successful implementation of information systems. John frequently writes and speaks on a variety of technology and EIM topics. His information management experience is balanced between strategic technology planning, project management, and practical application of technology to business problems.
And with that, I will turn it over to John and Kelly to get today's webinar started. Hello and welcome. Good afternoon. Thank you very much, Shannon. Hello. Good afternoon. Good morning. And hello, Kelly. And let's get started. We have a lot of stuff to go over here. We're going to set the stage a bit with a little bit about the data lake. We can't talk about architecture unless we talk about what that architecture has to do, and the benefits and risks. We have kind of a reference architecture of our own we want to cover. There's the lab and factory aspect, which, by those words alone, you kind of get a sense of what we're going to talk about. And a little bit of a dive into some of the tech that goes with this, not super deep, and we're not here to survey vendors or anything. But it's good to know that there are some things out there and some categories. And of course, an architecture needs to hold together. We'll also have takeaways, sorry, I got ahead of myself, and the governance components, and time for questions and answers. And let's just get started. We have our usual polling question because we'd like to know where you are and what you're thinking. That actually helps us in real time during our event here to help direct our content. So do you have a data lake? Yes, no, or unsure. Please answer that question in the poll and check the box. And then if you do have one, let's talk about the usage. Is it operational or regular use, or is it informal, like a lab or a sandbox? Or if you're not sure, just check that box as well. And we'll go from there, and we'll take a look at those here in just a moment; in the meantime, moving ahead. The definition of the data lake: we've got a couple of definitions here, plus kind of our own. A storage of instances of various data assets, that's one thing. Think of it as single-stream recycling. Everything goes into one bucket, right?
So it's going to be by nature unrefined and raw, but it allows you to explore. The Gartner Group definition keys on the exploration and on the compromises around the system of record. Our own addition to these definitions is that it can support either an exploratory use of analytics or an operational use as well. And Kelly, welcome, and bringing you in here. Anything to add on that one? Nothing to add on that one. I think that the poll has finished. Do we want to see what people said for their answer? Sure. Yeah, let's take a look here. We got a lot of unsures. Well, most of the people that did answer do not have a data lake, and therefore did not flow into the last question. And then a usual bunch of people just too shy to respond, which is okay. So we're dealing with an audience that is, I would say, Kelly, we're going to make an assumption here, maybe light on data lake experience and looking to hear more about it. So we'll use that in our context going forward here. That's great. All righty then, well, let's just talk about the benefits. Kelly and I are going to talk to this stuff together. And this is really both of our insights, versus tag-teaming or taking turns on slides, which can get just tiresome. Productionizing advanced analytics: we talk about advanced analytics, but the more deeply embedded they become, the more we want them to be regularly accessible, right? Then cost effectiveness. Who out there, I can't see anyone raise their hands, but you have a data warehouse, you have operational data stores, you have data marts, departmental spreadsheets; this stuff can get expensive. So we have a keen business benefit of cost of ownership going down. Of course, there's the value from all the different kinds of data that you can put in here. And this is going to reduce cost not just in your usage of data, but across the whole ownership of the data life cycle. You're able to take a look at raw data in all of its various forms and shapes and sizes within the data lake.
That lack of having to structure it rigorously has a lot of advantages. Yeah, and I think when we think about kind of the concepts of the benefit of the data lake, it's really thinking about, truly, what are we expecting to achieve from that business perspective? So, yes, there's a benefit from aggregating data, aggregating structured, unstructured, semi-structured sorts of data. But really, what is the true purpose? What are we trying to get out of it? And how is it making us more competitive, more productive, reducing our costs, increasing our revenue, et cetera? And so when we think about this third bullet point around deriving value from unlimited data types, there's a lot of different types of value that we can get. So deriving value might be improving your customer experience initiative, because you can now link the experience that they're having across multiple channels and do some analysis around the way that they interact with you on multiple channels, by aggregating text-based data out of your call center with other sorts of more structured data that might be, for example, in a self-service application or something like that. So it's really looking at what are those end business benefits that you're trying to achieve to push your business forward, that wouldn't be done in any other sort of way, that wouldn't be done in your data warehouse, that wouldn't be done in your operational system. So creating a unique value from the actual cost and effort associated with implementing a lake. Yeah, and we have to talk about this when we talk about architecture, right, Kelly, because that architecture has to form itself to those business benefits. Also, the architecture needs to avoid the risks, and these are fairly typical. Loss of trust: I think we've seen several examples of that in our practice where the data lake exists and then people go, so what? I still don't believe your numbers, right? The loss of relevance, right?
And then, of course, you have the momentum. Everyone is very excited. Here we go. If we do something new, we're going to get to our data, and then it just kind of grinds to a halt. There can be risks of the wrong answer coming out, even though you think it's the right one. And then there's the excessive cost; a lot of folks are investing a lot of money in these types of technologies and not getting their return on it. We're beginning to see, Kelly, I think you can echo here, more and more client examples, and also in the trades, of, you know, this isn't working, we're not having the success we looked for, right? Yeah, I would agree with that. I think as a result, though, people are also being a bit more thoughtful about the lake and recognizing that it can become a swamp. So they think about what is our purpose, and what is the cost that we're willing to incur as aligned with that goal and that purpose. So it doesn't end up being long-term excessive cost or increased risk associated with it, because we understand the purpose, the unique value of the lake, the differentiated value of the lake versus other systems we already have internally, rather than it just being something that our competitors are doing and therefore something that we also want to do. Yeah, the whole me-too thing. Yeah, the me-too thing. Speaking of me-too, thank you for that segue. Reference architecture: there are a lot of them out there, but we have kind of a take on a more modern reality. And that modern reality has shifted, even in the last few years, on the data lake, and there are some reasons for that there on your slide; and then the technologies themselves have advanced very, very rapidly. And whereas the lake used to consist of just this place in Hadoop where stuff lands, it's not really that simple anymore. So our first direct comment on the data lake architecture is that there are actually three areas of a lake that you strongly need to consider.
We'll talk about each one: the landing zone, the standardization zone, and the sandbox, the analytics-type sandbox. Of course, we've got some things surrounding that which we're going to get into as well here. Well, the landing zone is closest to the original concept, where everything just kind of lands there, hence the term landing. See, aren't we creative? And the raw data is important, right? If you're a data scientist, you're gonna wanna go after the raw data; you don't want any additional context put on that to disturb your analytics. So it's available at that point. Kelly, we see that's still a pretty popular element of the data lake, right? Yeah, and I think the point we're trying to make here is that originally, a lot of data lakes were formed just as the landing zone. And some organizations have made it more sophisticated beyond that. But some organizations are still using it just as that, you know, literally a landing zone where the raw data is stored holistically, versus moving into some of the more sophisticated components of the architecture. Yeah, and then next, and we've seen this, we have this thing we referred to earlier in the year as the leapfrog effect for architectures. And that is, folks were using the landing zone as a place to draw into other technologies to get to the data, maybe even populate a data warehouse or something like that. But what was happening is you're trying to do all this work on the data to get it ready to use for somebody else, and even put a schema on it or something, while someone else is rummaging about in the landing zone. So really, there's a standardization zone. And we're not talking separate physical instances necessarily here, but you do need a place where some pedigree can be applied to the data, and it can be prepared for consumption or even the sandbox.
Now, that doesn't mean you can't go from a landing zone to a sandbox or something like that, but this is a construct that allows for the administration of this type of thing. And this is where, Kelly, we've got some folks that populate the data warehouse from here, right? Exactly, that's right. Yeah, so we see this more now than we did, I don't know, even just two or three years ago as these technologies were coming about. And then we have the analytics area, where the data scientist works to develop new models. Now, to be clear here, this is where the data scientist works. If someone is going to do a sophisticated query with four or five dimensions in a year-over-year, that's not the kind of analytics we're talking about here. That would be something out of the standardization zone, put somewhere else where they could do that. This is the pure data science type of thing. A characteristic here might be that the data stores in this area are not even persistent. They come and go at the whim of the data scientist. So there isn't anything really even production about this. It can be very, very ad hoc. It can be very, very temporary. This is where some ideas are set forth. Anything to add to our modern reality, Kelly? We'll move into some of the other pieces that wrap around it here. I think you've got it covered. Rather than thinking about this in isolation, what you're going into next, I think, is super important to ensure that the context is achieved. Yeah, so this entire environment is subject to data management. And Kelly and I are going to kind of tag-team on this. Data management as in knowing where it's coming from, having an awareness, some level of discipline around the lake. Now, the reason being, we don't want it to be a swamp, right?
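To make that three-zone construct a little more concrete, here's a rough sketch in Python. Everything here, the zone names, the record shapes, the function names, is a hypothetical illustration of the idea, not any vendor's actual layout or API: raw data lands untouched, a typed copy is prepared in a standardization zone, and the sandbox hands scientists a disposable copy.

```python
# Hypothetical sketch of the three data lake zones discussed above.
# Zone names, paths, and schemas are illustrative, not a vendor convention.

RAW = {}       # landing zone: data exactly as it arrived
STANDARD = {}  # standardization zone: pedigree applied, ready for consumers
SANDBOX = {}   # analytics sandbox: ad hoc, non-persistent copies

def land(source: str, records: list) -> None:
    """Landing zone: append raw records untouched, keyed only by origin."""
    RAW.setdefault(source, []).extend(records)

def standardize(source: str, schema: dict) -> None:
    """Standardization zone: apply light structure (pedigree) without
    altering the raw copy, so data scientists can still reach the original."""
    cleaned = []
    for rec in RAW.get(source, []):
        cleaned.append({field: caster(rec.get(field))
                        for field, caster in schema.items()})
    STANDARD[source] = cleaned

def checkout_to_sandbox(source: str, name: str) -> list:
    """Sandbox: hand out a disposable copy; it can vanish at any time."""
    SANDBOX[name] = [dict(r) for r in STANDARD.get(source, RAW.get(source, []))]
    return SANDBOX[name]

land("call_center", [{"id": "1", "text": "billing question"}])
standardize("call_center", {"id": int, "text": str})
work = checkout_to_sandbox("call_center", "churn_experiment")
```

The point of the sketch is the separation: the sandbox copy can be mutated or deleted without ever touching the raw landing data.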
And kind of within that area, of course, we have a governance aspect, which would be your rules of engagement around the lake and the understanding of quality, or the understanding of usefulness, and then an operational standpoint. Because we have seen some really large lakes. And one thing we kicked around in a conversation on our team a few weeks ago, Kelly, was that someone had to recover a lake and they realized it was going to take an inordinate amount of time. And so there was a bit of an operational aspect to this. I know some of you are thinking, well, wait a second, the lake, we just throw stuff in and that's what it's for. And that's the initial view we had. But just to reiterate here, we're trying to tell you, these architectures have evolved, and it's something that requires a little bit more than just totally informal management. You have your consumers: the scientists and the various other consumers. And yes, the scientists are perfectly fine, probably, with schema-on-read. But none of the other consumers of this are going to be happy with that. And we're going to have some type of schema-on-load, or some awareness of origin, some awareness of who can touch what, when, where, why, and how. So we need to wrap some of these components around the data lake. Anything to add, Kelly? We'll start to dive into some more here. Nope, you've got it covered. Boy, thank you. Awesome. Just to review, this came from one of our architecture discussions earlier in the year, about when you're talking architecture with someone. So for those of you who are thinking about this, based on our poll here, not everyone out there has a lake, right? Or you just don't want to tell us about it, which is fine. There's the form of the architecture, which is: can people understand it? So even if it is one big lump, it still has to be understandable. And then there's the fit for purpose. So as it evolves, there's a progression. Your data lake won't stay the same.
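The schema-on-read versus schema-on-load distinction can be sketched in a few lines. This is an illustrative toy, the raw feed and field names are made up; the contrast it shows is real: with schema-on-read, each consumer applies types at query time and bad data surfaces (or silently skews results) at the moment of use, while with schema-on-load, types are enforced once on the way in and every downstream consumer sees the same clean view.

```python
import json

# Hypothetical raw feed as landed in the lake; the last line is
# deliberately malformed to show the difference.
raw_store = ['{"amt": "19.99", "ts": "2017-06-01"}',
             '{"amt": "5.00",  "ts": "2017-06-02"}',
             '{"amt": "oops",  "ts": "2017-06-03"}']

def schema_on_read(raw_lines):
    """Each consumer types the data at query time; this particular
    reader quietly drops bad rows -- another reader might not."""
    out = []
    for line in raw_lines:
        rec = json.loads(line)
        try:
            out.append({"amt": float(rec["amt"]), "ts": rec["ts"]})
        except ValueError:
            pass
    return out

def schema_on_load(raw_lines):
    """Types are enforced once, on the way in; bad rows go to a
    quarantine, so all consumers agree on what the clean data is."""
    clean, quarantine = [], []
    for line in raw_lines:
        rec = json.loads(line)
        try:
            clean.append({"amt": float(rec["amt"]), "ts": rec["ts"]})
        except ValueError:
            quarantine.append(line)
    return clean, quarantine
```

That quarantine list is the kind of thing the scientists don't need but every other consumer does, which is why, as we said, a factory ends up with some flavor of schema-on-load.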
You could leave it like it is, strictly for data scientists. And some organizations are early on, and it's just some Hadoop sitting in the corner with some people sitting next to it, and they're just banging on it. Awesome. That's great. If it's successful, you're going to progress away from it. You can't help it. And so you have to kind of be ready for that. So let's break down the differences between the lab and the factory, because understanding how they work will help you with your architecture as well. And we're going to kind of kick through this one really, really quick so we can get into some of the deeper architectural aspects here. And the real key is: is it a lab or a factory? Or are you thinking of maybe both? Because your architectural, governance, and organizational aspects are going to change. And you have to, if you can, clearly identify if there is an evolution, or that architectural progression we talked about. Or you might want to keep them separate. There is that thing where, off to the side, we're going to load data, and then we're going to have another, more disciplined path where we use the standardization zone and push it farther down for something. And that might even be in parallel with this. But normally it's going to go from the raw to some type of standardization, and then on down the line. But either way, Kelly, defining the purpose initially is really, really helpful for the sustainability of these things. Yeah. And I think the other thing is that this is a delineation that's important to consider because it does drive, like the second bullet point says, architectural, governance, and organizational impacts. And so if something is just within the lab environment, it's treated one way. And if it goes into a factory or production environment, it's treated another way. It's very tempting to, I'm using air quotes right now, productionalize your lab.
And it's also very dangerous to do so because of the difference in the way that it is architected and governed. And so this consideration is something that, of course, would happen when you're setting it up. But it would also be something that we would encourage you to consider on an ongoing basis: to validate that, in fact, your lab isn't getting productionalized, and that people aren't using the output of the findings from the lab in a way that is production, versus going through the process of productionalizing that finding or that process that did prove to demonstrate business value. So that temptation is always there and should be reevaluated and reconsidered, because it's very easy. And I've seen organizations just kind of create that blend and not really consciously realize they're doing it, and turn around and have a governance issue or some sort of regulatory issue based on those differences. Well, I think a classic example here is the lab starts to get used on a regular basis because it's really cool. And the data scientists feel as though they're being distracted by continuous, periodic, regular requests, and they say, that's not our job, which actually is an appropriate reaction. Just to help give you a quick sense of perspective, I'm just going to go through these real quick. When you're in that lab, you've got very limited elements as well. Yeah, you're going to load it. You're going to land it. You're going to maybe extract it. You're going to have primitive HDFS in Hadoop, might be columnar, might even be a graph database. And you're going to get to the data somehow, and you're going to have some data scientists or data analysts around it. But then you're going to also move forward. And when you start to productionize or get more serious, it becomes a factory, and you're adding more architectural elements to it.
You might not add all of these here on our slide, but you're going to add architectural elements to that thing, and you can just tell at a glance here, and I don't have to read all of these, that there's a lot more to it. Well, that's what happens with this productionization and going with the operation, or the factory. The characteristics here: obviously the lab, by its name, is experimental. It's flexible. We've had several fun efforts where the documentation has been remarkably non-existent, but still it's been a useful product. And it's run by the users. I think that's the key aspect here. We've actually had a client, Kelly, you remember them, where they had two labs. They had one in an analytics area and one in a business department that had popped up kind of in parallel, and then the two had to start to work together. But both were incredibly informal. So it's very, very output-driven, right? Yeah, go ahead, Kelly, I'm sorry. Yeah, I was gonna say, when it is a lab, that's absolutely appropriate, right? Where you're seeing great work done in one area, you want to replicate it closer to the demand, closer to the users, and replicate it in another area and see how that can assist from a departmental perspective in another group. So it's absolutely appropriate that that happens, and in fact, encouraged, as the different environments become more accessible, become more understood. But it's important to understand that the output may be different than the output from another organization because of the differences in understanding, governance, structure, architecture, et cetera, that are set up differently. So, anyway. Yeah, now the functional thing is, notice I do say something here, or we say something: the results should be evaluated for relevance. We have seen the labs produce stuff and they go, this is what it is.
And due to a lack of context or experience, a lack of awareness of the business model, the conclusion that's been produced has been incredibly irrelevant, but presented as being incredibly relevant. And that is a formality with the lab that, in light of all the other informalities, definitely has to be considered. Of course, when you start to talk about being more formal, now the factory, right, comes into place. And that means that the requirements out of this environment of the lake are directed. There are things that have to happen in certain ways. So regular output, and that might be associated with a business service. That does not mean you're not running an analytical model; we're not talking necessarily about a production report here. What we are talking about is, if you are running an analytical model, that the results are highly reliable, and the organization, through some type of application of machine learning, AI, or something, is going to respond to a model-based recommendation. That is productionizing the lake, all right? Which means you need all the rights and privileges thereof of something being a regular part of your business operation. That requires a defined architecture, so you know the limits and what it can or cannot do; the rules of engagement for using it, who can use it and who cannot, what can be believed and what you do not use it for. And then functionally, our old friend data quality comes to bear here. Lineage and metadata as a tool: a big part of your architecture. Folks, we're still seeing you out there throwing up the lake and then saying, we'll get to the metadata and the lineage stuff later, and it's not helping. Dig your feet in and insist on it, those of you that are in the middle of this. Scheduled access and loading versus the ad hoc, let's-just-go-get-it-and-throw-it-in-there-over-the-weekend approach. Well, now you've got to time it.
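What insisting on lineage and metadata from day one might look like, in the smallest possible form, is a ledger that gets an entry every time data moves or is derived. This is a toy sketch; the entry shape, dataset names, and function names are all invented for illustration, but even something this small answers the "where did this number come from?" question that rebuilds trust in the lake.

```python
import time

# Minimal lineage ledger -- a sketch of "insist on lineage and metadata
# from day one". The record shape here is illustrative, not a standard.

LINEAGE = []

def record_lineage(dataset, derived_from, transform, who):
    """Append a lineage entry every time data moves or is derived."""
    LINEAGE.append({
        "dataset": dataset,
        "derived_from": list(derived_from),
        "transform": transform,
        "who": who,
        "when": time.time(),
    })

def provenance(dataset):
    """Walk the ledger backwards: every upstream source of a dataset."""
    sources, frontier = set(), {dataset}
    while frontier:
        current = frontier.pop()
        for entry in LINEAGE:
            if entry["dataset"] == current:
                for parent in entry["derived_from"]:
                    if parent not in sources:
                        sources.add(parent)
                        frontier.add(parent)
    return sources

record_lineage("landing/orders", [], "ingest from source system", "loader-job")
record_lineage("standard/orders", ["landing/orders"], "typed + deduped", "std-job")
record_lineage("sandbox/churn", ["standard/orders"], "model features", "scientist")
```

With the ledger in place, asking for the provenance of the sandbox data set walks back through the standardization zone to the original landing data.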
Publishing means quality control, maybe even approval. And running the models on a scheduled basis means there's administration and maintenance. What if that model runs for two days, right? It ties things up. You've got to build that into the fact that there's now a growing number of users on the lake. I think I got most of that there, Kelly, but did I miss anything? No, you didn't miss anything. Sorry, just making sure I'm off mute. You didn't miss anything, but you did say something quickly that I wanted to just touch on. Based on a lot more sophistication, the users of the output of a lake could be other systems; it could be event-driven. So it's not just thinking about a human consuming that output. It is driving other activities, potentially an operational system. So that should be considered as you're structuring from an organizational and functional perspective. I just wanted to make sure that that important statement wasn't missed. Thank you. And that's why there are two of us on these events, right? So, some aspects of the environment, some more things. We need to build our tools and products to get the data where it's supposed to be at the right time, still be able to experiment, and not pollute the lake. We need to bring things in rapidly. I have never seen a data technology, in the now unfortunately decades of me in this business, where the latencies have not been driven down. The early data lakes were, oh, there's some stuff loaded in there, and maybe in six months we'll think about loading it in again. Now we're talking about, and we're gonna go over, a list of some of those real-time-type technologies. And then of course, everyone wants to get to it and use it, and things like that. Also, while Kelly's talking, I do tend to look ahead at the questions, and there's one question I can address right now. Notice we have arrows between the analytic sandbox and the standardization zone.
What that means is you can learn something from the data scientists that's relevant to the other consumers. Therefore, you take that awareness, what you've learned, and it could be machine learning or it could be just some relevant context you figured out, and you can impute that back into the standardization zone. So that's what that means there. So we have kind of this base environment, and what we're really seeing now is the latencies driven down, down, down, and streaming becoming more and more likely. Not everybody we talk to, Kelly, needs to stream, and they've asked if they should, but some of you just don't need to do it, right? Yeah, I mean, again, it's really: what are you trying to accomplish, and what's the impact on the cost associated with doing something that may or may not be necessary for your business purpose? You know, I think there's an aspect of keeping up with the Joneses. We all want to make sure that we're not missing something, but at the same time, rather than over-investing in something where we're not sure if there is going to be business value yet, we need to just make sure that we're allocating costs to anticipated benefits, and then revisiting, of course, if we're starting to see a change in that ratio. Yeah, so here's, I mean, not a pretty slide, not many pictures on it, but there are some real architectural aspects to consider. Many of you have an operational data store, or your data warehouse has been driven to an operational status over the years. And you're thinking to supplement your analytics with the lake; well, the lakes can be real time, and you can effectively replace an operational data store with this landing area and a standardization area, then to a layer of operational consumers.
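That arrow from the analytics sandbox back to the standardization zone can be sketched as a simple enrichment step. All the names here, the customer records, the churn-risk rule, the field name, are made up for the example: the scientist's finding is imputed onto the standardized records as a new derived field, without touching the raw data, so the other consumers benefit from what the lab learned.

```python
# Sketch of the sandbox-to-standardization arrow: a learned conclusion
# is imputed back onto standard records. All names are illustrative.

standard_zone = [
    {"customer": "a1", "calls_last_month": 9},
    {"customer": "b2", "calls_last_month": 1},
]

def sandbox_findings(records):
    """Stand-in for model output: the lab learned a churn-risk rule."""
    return {r["customer"]: r["calls_last_month"] > 5 for r in records}

def impute_back(records, findings, field="churn_risk"):
    """Enrich the standardization zone with the learned context so
    every consumer benefits, not just the lab. Returns new records;
    the originals are left untouched."""
    return [{**r, field: findings[r["customer"]]} for r in records]

enriched = impute_back(standard_zone, sandbox_findings(standard_zone))
```

Note the enrichment produces new records rather than mutating the zone in place; whether the enriched set replaces or sits alongside the original is a governance decision, not a technical one.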
So one thing to consider with your architecture, and whether you are a lab or moving to productionized: is there any chance that you are going to replace the operational data store's functionality, that is, operational reporting, low-latency-type feedback into the business? And for review, low latency means things happen quickly, right? The time between when something happens and when I need to know about it, that's what's driving down the latencies. Also, and this is from somebody who is always dubious about data analytics technologies in operational work, we're still seeing this, and Kelly can chime in here; she's been more involved with some of those efforts than I have, actually. There are full CRUD operations going on now in some lakes, which, I'm not so crazy about that, but the technology's moving to support it. Kelly, anything to add to that one, or any of the speed and operational aspects here? Well, I think that seeing some of the CRUD operations happening in the lakes is just another form of testing and experimentation, and pushing the capabilities of the technology, or determining whether it makes sense to perform any of those CRUD operations in the lake for the purposes of that lake. And so we're seeing all kinds of unique processes that traditionally have been done in other sorts of systems being attempted in a data lake environment. And that's the experimentation aspect of it, which is all positive and all good, and something that we're constantly learning from. And the technology, as it evolves, enables us to understand more about those operations, such that we can track, govern, and share the fact that those operations have been done. So when we consume that data, then we have that understanding, just like in any other system, where we wanna have that understanding of what's been done to the data, not just the data itself.
Yeah, and I think a clarification; in a lot of instances, and Kelly, please correct me if your experience is slightly different from my interpretation of what I see. I mean, for those of you that think, well, why don't these two talk to each other? I took two months off this summer, and the business changes. So Kelly's been working on some stuff I was working on before I went away for a few months. Most of the create and update operations are the results of interpreting the data and then creating some type of interpreted value or conclusion from the analytics or some machine learning, and then making that available within the lake to other people. That's where that standardization area comes in, if you can do that stuff in that area without messing with all this raw data, which you don't wanna do, right? It's not that the model of the customer address changed and so I need to put my new address in there. It's more creative and enhancing of the data, right? Yeah, well, and maybe you're creating a derivation that is then used or assumed as part of another derivation or as part of another activity. So you're right, it's not writing the raw data, but there is data that is created or aggregated that is then reused within that lake. Yeah, yeah, so it's enhancing, from a creative standpoint. Kelly, I already touched on the thing: match your needs with the latencies. Now, there are a lot of vendors that are playing in this very fast area now. The three you hear the most about, and no thumbs up or thumbs down on anybody here: Hortonworks, Attunity, and Splice. They are all vendors that, with the ingesting, the processing, and the consumption, have various types of management.
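That idea of create-and-update-as-derivation, rather than rewriting the raw data, can be sketched like this. The record shapes and field names are invented for illustration: raw history is append-only and never mutated, updates arrive as new raw records, and a "current view" is computed as a derived data set in the standardization area.

```python
# Sketch of "create and update as derivation": raw records are never
# rewritten; the current view is a derived data set computed from the
# full history. Field names here are made up for the example.

raw = [
    {"cust": "a1", "address": "10 Main St", "loaded": 1},
    {"cust": "a1", "address": "99 Oak Ave", "loaded": 2},  # later update
]

derived = []  # standardization-zone additions, never touching `raw`

def derive_current_address(raw_records):
    """Create a derived 'current address' conclusion per customer,
    keeping the raw history intact for the data scientists."""
    latest = {}
    for rec in raw_records:
        prior = latest.get(rec["cust"])
        if prior is None or rec["loaded"] > prior["loaded"]:
            latest[rec["cust"]] = rec
    for cust, rec in latest.items():
        derived.append({"cust": cust, "current_address": rec["address"]})

derive_current_address(raw)
```

The "update" is just another appended raw record; the conclusion drawn from it is what gets created, which is the enhancing, creative flavor of CRUD we're describing rather than an in-place rewrite.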
The point I'll make here is we just recently interacted with an organization that was really struggling with the management of their data lake and their data science areas, and they'd been so immersed in the core Apache-type technology, which we see down below, like Kafka, Storm, Spark, things like that, that they didn't see that there's this third market now growing: other vendors that aren't quite in the real-time area, but are there with the quality and the management, like Podium. There are a lot of folks, and I swear every time I Google some of this stuff, once a week there's another name on the list, right? So there's a lot going on in this area. The thing to recognize here is this is changing really, really fast, and a lot of the problems you're experiencing are being dealt with as we speak by someone who has heard about this as a problem and is endeavoring to create a software product to help you manage it. The other technologies within the realm, as I just said, were Kafka and Storm for the real time, and Spark for fast batching, which looks real time, but it's fast batching. Those are all things that you'll hear about in this data lake real-time area as we become more and more of a factory and we need to manage a more rigorous data supply chain. The key here is you are gonna have some type of raw area. You probably will need some type of standardization area. You definitely will need a place for the data scientists to go. They may go to the raw area; they certainly are always entitled to do that, but a lot of times you want to give them something even more, something that they can manipulate off to the side in a pure lab-type setting. But there are technologies to help you deal with all of that out there, okay? Let's see, I've lost my button to go forward here. There we go. Governance, then, in the data lake.
Now, some things you've heard about in data lake governance and some things you may not have, and Kelly and I are going to talk about each of these for a little bit. We'll just go through the six here. Data acquisition: are you allowed to get it? Where do you get it from? Get it from where you're supposed to. If you're going to make this a factory, then there needs to be a rule for adding a new source to the lake. We recently talked to someone who was sourcing a data lake with COBOL source code, because it had an incredible richness of hard-coded values in it — which I think was pure genius, but that's a rather unusual data source, and someone should know about it. The catalog — well, what does that mean? We all know what a data catalog is for in this business. Then there are the types of decisions that are allowed to be made at certain points along the way, and the oversight of the actual analytics and models. Go ahead, Kelly. Sorry, John, can I just back up a second? Do we want to comment briefly on the difference between a data catalog and other terms used around metadata, to help clarify? Because as we heard from the poll, a lot of this audience hasn't implemented a data lake yet, so differentiating a data catalog from things like a business glossary and other sorts of metadata might be a good thing. Yeah — and chime in here. In the context of the lake, the catalog is — well, I'm dating myself, but it's kind of like the Sears catalog. Here's something I really like: what can it do to help me? Where can I get it? What's the procedure to procure it? It's that level of metadata, so it goes a step beyond the purely definitional or semantic.
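That "step beyond the definitional" metadata might look something like the entry below — what a data set is, where it lives, and how it came to be. This is our illustrative guess at the fields, not any vendor's schema, and the path is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One catalog entry: definitional metadata plus location and lineage,
    a step beyond a plain business glossary term."""
    name: str
    description: str          # what it can do to help me
    location: str             # where I can get it
    zone: str                 # e.g. raw / standardized / sandbox
    lineage: list = field(default_factory=list)  # upstream data sets
    steward: str = ""         # whom to ask when procuring access

entry = CatalogEntry(
    name="customer_churn_scores",
    description="Weekly churn probability per customer",
    location="s3://lake/standardized/churn_scores/",  # hypothetical path
    zone="standardized",
    lineage=["raw.customer_events", "raw.billing"],
)
```

The `lineage` list is what lets a browser of the catalog — or the occasional regulator — trace a derived data set back to the raw sources it came from.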
It's also going to include the location, and it might even be the avenue through which I portray the lineage of that particular aspect of the data lake. Anything to add to that one, Kelly? Yeah, the only other thing is that many times it's also associated with data sets, so you can see what is within a data set — it's not as specific or as delineated as a data dictionary or other aspects of metadata. And we're also hearing the term data catalog applied uniquely to data lakes, as opposed to applying the concept of a data catalog to a data warehouse or what have you. I haven't investigated whether that's a technical limitation; I would venture to guess the vendors would say there is no technical limitation — just that the concept of the catalog is most relevant in the lake, where the volumes of data are significantly higher. Yeah, and it's something to point you to that landscape, because there can potentially be so much out there. Just understanding what's there — well, that's another aspect of what's in the catalog, right? And again, it comes back to: what can I get? The old Sears catalog — when you got it, you went out to Grandma's house and looked at it, and it was immense. You understood where things were, and the color coding on the edges of the pages. That's the metaphor: where can I go to look at this stuff? It's a powerful metaphor, and I think that's why the word catalog is there. Then usage and model productionalization — that's hard to say, productionalization — are pretty self-explanatory: who's using it, and what are they using it for? We love a model and it's working great; we're going to run it every day or every weekend.
Well, then that means some controls need to be put on it; we need to know what those are, and the governance folks need to make sure that happens, right? Okay, moving on. So what's going to happen with governance? Well, the catalog's going to make data easy to find. We're going to have our lineage out there so that not only we, but also the occasional regulator, can understand where things came from. We'll understand that data is described correctly, so if I want to reuse it for another purpose, I know the context is appropriate for that purpose. When we make decisions around data, they are logged and communicated — and that's kind of a new thing. We looked at it, we decided to turn left, and here's why we decided to turn left. That brings collaborative tools and workflow into this environment. It's not explicitly called out in our pictures, but if you've got something like Internet of Things data really cranking through and directing production or consumer responses, you're also going to want to track why you did the things you did. And of course, we make sure we can all understand and identify things. I'll let Kelly chime in with some final thoughts on that, and then we'll move on to our last few slides. Yeah — governance can have a very sour reputation when it comes to the lake, because of the sense of control and limitation and things like that. So when we think about governance from a data lake perspective, it's really about data understanding and data optimization, and of course protection. Privacy and protection are getting a lot of press right now, so I'm sure most folks on the line are doing something about them. But if we think about how governance can ensure data understanding, that is a huge value to the folks using the data that's created and consumed from the lake.
Because if you don't have a clear understanding of what you're looking at, why it's important, and how it's driving things, then you're not getting the value out of the lake in the first place. That's really the focus of governance from a data lake perspective. And — sorry, I've been talking into my mute button. I was like, oh no, do I have to take over? Actually, on this next one we've got a lot of voiceover, so I think we'll both be talking. First of all, you do want to evolve your governance components as well, and I think that's a significant part of the architecture. I'm quite sure someone out there is saying, I want more tools and technologies and things like that — and by all means, they're out there, and as our series progresses we'll get more and more into those. But the evolution of these governance components is what's going to make your architecture stick and work. Because if you don't have it, you're going to start with one form of lake, and someone somewhere is going to alter it without you knowing. And once they alter it, no one else will be able to use it. Or they're going to add something to it and not tell anybody — and it'll be really juicy stuff that everyone could benefit from, and no one will know about it. So it's really important to do that. Governance is required, but when we move to the factory, governance's role is going to change: from "it's appropriate use and we know things" to "this is legitimate, it's in compliance, it's verifiable, and it's still doing good things for the business." So what do you need to go from point A to point B? Right, Kelly — you need a roadmap to go from point A to point B. Pretty simple concept. You're going to change your data policies at some point along the way. Let's see, what else do we need here?
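One concrete piece worth sketching here, recapping the decision-logging idea from a moment ago ("we decided to turn left, and here's why"): a minimal append-only decision log. The function and field names are our assumption, not a prescribed format:

```python
import json
from datetime import datetime, timezone

def log_decision(log: list, dataset: str, decision: str, rationale: str) -> dict:
    """Record one governance decision so it can be communicated later --
    to colleagues, and to the occasional regulator."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "decision": decision,
        "rationale": rationale,
    }
    log.append(entry)
    return entry

decisions: list = []
log_decision(decisions, "customer_events",
             decision="exclude EU records from model training",
             rationale="privacy review pending")
serialized = json.dumps(decisions)  # ready to archive or communicate
```

The value is less in the mechanism than in the habit: every "we turned left" gets a timestamp, a data set, and a why, which is exactly what collaborative tools and workflow products formalize.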
We're going to have to standardize some things. The lab is a little ad hoc — things can come and go. A lot of folks are cloud-based, so we'll add a little, we'll move a little, we'll try something. When we go into production, we may need to lock that down and get a little more formal with the infrastructure, things like that. Organizationally, I like to do people, process, technology. Kelly, we've noticed that defining roles and responsibilities around productionized lakes comes in really handy: who's supposed to do what with it as time goes on? Centers of excellence. And what is the change associated with — yeah, sorry, go ahead. Yeah, our old friend change management, right? You're going from an undisciplined environment to a disciplined environment, and the people who were really happy with the undisciplined environment might not be really happy with the disciplined one. Centers of excellence, supporting people — you might need a group of people just to do verification. When you're moving back and forth between a lab or sandbox environment and a standardization environment, who's there to say that this is right, that it mapped right, and to understand the abstractions of the data? Yes, you'll have tools to do that, but you might need people to help explain and train. Another aspect of operationalizing anything is data controls — our old friend data controls. If you operationalize something, do SLAs come into it? Folks who have heard me speak before know I'm not a huge fan of the way some SLAs, or service level agreements, are used in our profession, but in other ways they are incredibly valuable in the operational world. Kelly, the last one — I think we touched on this earlier in the talk — is business continuity. You're going to use this operationally and it's in the cloud, but what if your network goes down and you can't get to the cloud? Then what?
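In practice, an operational data control backed by an SLA often boils down to something as simple as a freshness check. This is a toy example — the threshold, names, and check itself are our illustration of the idea, not a standard:

```python
from datetime import datetime, timedelta, timezone

def meets_freshness_sla(last_loaded: datetime, max_age: timedelta) -> bool:
    """True if the data set was refreshed recently enough to satisfy
    an operational service-level agreement."""
    return datetime.now(timezone.utc) - last_loaded <= max_age

# A data set loaded an hour ago passes a 24-hour SLA;
# one loaded three days ago does not.
recent = datetime.now(timezone.utc) - timedelta(hours=1)
stale = datetime.now(timezone.utc) - timedelta(days=3)
ok = meets_freshness_sla(recent, max_age=timedelta(hours=24))
late = meets_freshness_sla(stale, max_age=timedelta(hours=24))
```

A productionized lake would run checks like this on a schedule and alert or block downstream jobs when one fails — the "controls" half of the people-and-tools pairing described above.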
Okay, things like that. A standardization layer, publish and subscribe, or just go get it — all the things you hear in the data warehouse and BI world become relevant in this world as well, like it or not. Those are parts of your architecture now. Anything else to add to those architecture elements, Kelly? No, I think we've gone through all of them. I think we got them, all right. Let's move to our takeaways — we're doing well on time, so we'll have a few minutes for questions. We started with the business benefits, right? We hear the traditional one — oh, we're going to have access to the data, yay — but don't hang your hat on that. You're asking to put in an architecture that a lot of people aren't going to understand and will have a hard time dealing with and navigating. Even people who are used to the warehouse have new things to learn here. So please, please, please have a way to justify it so that they feel their time invested is worth it. There may be some additional risks there, too: the quicker you can get the stuff, and the more stuff there is, the quicker it can go off the rails. So don't take a casual approach — and that's casual, not causal; the slide could read either way, but the word we meant to put in there was casual. Where we really need artificial intelligence is in spell checkers: how many people did this slide go through, and the typo still snuck by? All right. Understand that this architecture is going to become a standard. The three-area approach we've shown you is becoming pretty accepted in industry. It's not just the raw lake as a data lake; it's got some elegance to it, some additional form. Now, unlike the data warehouse era, we don't want to get into debates about what is really a data lake and what isn't.
Let's learn from that lesson of 20 or 30 years ago and get away from the religious wars. It's supposed to look like whatever you need it to look like to accomplish what you need to get done. But you're going to have various shadings within the lake: a standardization area, a raw area, a sandbox area, and things like that. On the technology side, keep an open mind — like I said, the players are going to change once a month with the state of things right now. And of course, we spent some time on data governance as a critical success factor. No matter how you view the lake, whether it's the lab or the productionalized factory, governance needs to be applied in some way, shape, or form. I'll let Kelly add her last few thoughts, and then we'll take questions and answers. Yeah, I would venture to guess that a lot of the people who answered that they haven't implemented a data lake yet held off because technology is changing very fast, they're not sure exactly where to start, and it's unclear what the business benefit is going to be. We have a client, a financial services institution, that is pretty cutting-edge as far as other technologies they've implemented, with a very sophisticated way of handling their customer experience. But their — and I'm using air quotes, because nobody can see me — their "data lake" is in Microsoft SQL right now. You could argue that's not really a data lake. Well, it was suiting their purposes at the time, and as technology continues to change and evolve, sometimes the first movers don't have the advantage. So I do think there's a great opportunity for folks who haven't implemented a data lake yet to learn — whether from bad press or from highly publicized failures — because things are evolving, changing, and improving.
It's a great opportunity to really assess what you need and why you need it, and to budget for the fact that technologies are changing — to give yourself an unused or unallocated portion of budget that you know will be needed at some point in the future when something else comes up. Having that flexibility, adaptability, and agility is very important to making sure you get what you can out of your lake. And — sorry, the last thing I'll add — this lake is really just one component of your overall data environment, and some of the capabilities we walked through may actually be done in other parts of your environment. So consider optimizing your existing data environment and leveraging what's working, then implementing a lake to take advantage of the additional volumes, data types, et cetera, as opposed to thinking of it as a rip and replace. Yeah. And now I'm done. Oh, well, no, you're not, because we have questions. Here's one — I'll let you field it, because we've covered this architecturally already. Can data flow directly from the landing zone to the analytic sandbox? In fact, I think a lot of lakes do exactly that, because they haven't implemented a standardization zone — they don't yet know what standards to apply in the first place. So, sure. Exactly. Yep. Now remember, the areas we presented are general areas; this isn't dogma. It's what works in many, many situations we see, but until you know what your standardization is, you might just have the landing zone, and that could be your sandbox initially. Again, there's a progression of architecture that you cannot avoid — we've learned that over the last 30 years. These things don't stay simple, and they start to need to morph to additional purposes.
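That direct landing-zone-to-sandbox flow can be sketched as a simple copy that bypasses standardization entirely. The zone names and structure here are illustrative assumptions, not a prescribed layout:

```python
import copy

def to_sandbox(landing_zone: dict, sandbox: dict, dataset: str) -> None:
    """Give data scientists a private, mutable copy straight from landing --
    no standardization step -- while the landing zone stays pristine."""
    sandbox[dataset] = copy.deepcopy(landing_zone[dataset])

landing = {"clicks": [{"user": "a", "n": 3}]}
lab: dict = {}
to_sandbox(landing, lab, "clicks")
lab["clicks"][0]["n"] = 99  # scientists may mutate their copy freely
```

The deep copy is the whole point: the lab can manipulate its data off to the side, and the landing zone remains untouched for whoever consumes it next.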
But if you do think you need to do that — you pull something into the landing zone, there's no need to standardize it, but the data scientists could really use it — sure, ship it on over. That's fine. So, here's another one. Kelly, I'll let you take the first crack at this, and then we'll be out of time. What is your view on centralized data governance when involved with one or more data lakes? Oh, boy. The centralization question is really so company-specific. I know. There are always aspects of governance that are optimized through a centralized approach, because there are efficiencies of scale and consistencies associated with it. But there might be slight differences among the multiple data lakes you have — which is why you ended up with multiple to begin with. So understanding where that delineation is, where you're comfortable between centralized and decentralized — that's a long answer, but a blended model is generally acceptable and most effective. Yeah, I agree with that. We have an awful lot of questions — this has been a huge event, and the questions are actually still banging in here. So, Shannon, back to you to wrap this up. We'll have to do what we've done a few times before, which is address the remaining questions in writing and send them out to everybody. Shannon, back to you, and thank you, everybody, for your time today. John and Kelly, thank you so much for this great presentation. As always, thanks to our attendees for being so engaged in everything we do — you know I just love that. The questions are still coming in, and as John mentioned, I'll get those over to John and Kelly and include them in the follow-up email, which will go out by end of day Monday with links to the slides and the recording of the session.
And I'll leave it open for a minute if you want to keep adding questions for me to get over to John and Kelly. Again, thanks so much to everyone, and I hope everyone has a great day. Thank you all. Bye-bye.