My name is Shannon Kemp, and I am the Chief Digital Manager of DATAVERSITY. We would like to thank you for joining the current installment of the monthly DATAVERSITY webinar series, Real-World Data Governance, with Bob Seiner. Today, Bob will be joined by guest speaker Evan Terry to discuss how to govern data lakes, sponsored today by Alluxio. Just a couple of points to get us started: due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag RWDG. And if you'd like to engage more with Bob and continue the conversations after the webinar, you can go to community.dataversity.net. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me turn it over to Medan for a word from our sponsor. Medan, hello, and welcome. Hey, guys. Thank you for having me. My name is Medan Kamara. I am a Solutions Engineer here at Alluxio. And today I want to talk to you about a fundamentally different approach to how Alluxio treats data federation. So, fundamentally, when we talk about data federation today, we see four big trends driving the need for a new architecture. We see this idea of separating compute and storage becoming more and more commonplace. And this becomes very prevalent and relevant in these hybrid and multi-cloud environments.
In this case, we would have compute and storage separated, where storage could be sitting on-prem and compute could be bursting out to a cloud environment, or vice versa, where we have storage sitting in a cloud data lake or object store and folks are running compute and analytics on-prem. We see the rise of the object store as something that is becoming increasingly common and as a cheaper-cost alternative to traditional storage silos. And finally, we see the need for data scientists and data engineers to have self-service data access across the enterprise and across these data stores, regardless of where the data may be sitting. When we talk about the data ecosystem, we talk about how it has evolved, starting with the data 1.0 ecosystem. So traditionally, we started off in siloed environments where we had compute and storage co-located in the same environment. But as we started maturing and started exploring different compute frameworks and other cheaper-cost storage, we saw this plethora of both compute and storage options come into play. When we look at this kind of separation and this kind of challenge of maintaining these one-to-one connections between compute and storage, we realize there needs to be a framework that allows us to orchestrate data access between applications and storage systems. And at the same time, we need to be able to do this with the native APIs that the applications are accustomed to, and with native storage drivers that are able to connect to a storage system of your choosing. Alluxio's approach to data federation is fundamentally different than the traditional data lake approach. Data lakes traditionally give you a single system where relevant data could be accessed from.
However, this becomes resource-intensive to build, requires costly hosting of permanent data copies, and ultimately creates latency between where the data is stored and where the data is actually being computed, which could be remote. Alluxio's approach, the new approach, is that Alluxio is actually the first data virtualization technology that federates data access across both file systems and other storage. So Alluxio acts as a virtual data lake in which applications and users can access files through our global namespace. We provide performance on demand: fast local access to, most importantly, the most frequently accessed cached data, the hottest data that you require. And we can optimize storage costs by transparently reading and writing data directly from the storage system, and thus not creating another permanent copy of the data. The three innovations we have for a data orchestration layer today are: data locality, with intelligent multi-tiering; data accessibility, being able to provide applications the familiar APIs and do the API translation to the native storage interface; and being able to scale out data elastically along with your compute, allowing you to abstract data silos and actually embrace data silos at scale. The most common use cases we see today are accelerating big data frameworks on the public cloud, running big data workloads in hybrid cloud environments, and even enabling big data on object stores across single or multi-cloud environments. Finally, Alluxio is open source. We have a vibrant open source community, and it is open source technology that you could go out and play with today on alluxio.org. Check out our GitHub and join our conversation on our Slack channel. Back to you, Shannon. Thank you so much. That was a great presentation. If you have questions for Medan, he will be joining us in the Q&A section at the end of the presentation today.
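The global-namespace idea Medan describes can be sketched in a few lines. This is purely illustrative Python, not Alluxio's actual API; the class name, mount points, and URIs are all hypothetical. It shows only the core mechanic: one logical path space resolving, via longest-prefix match, onto several physical stores.

```python
# Illustrative sketch of a federated "virtual data lake" namespace.
# Not Alluxio's real API -- all names and URIs here are hypothetical.

class VirtualNamespace:
    """Maps logical mount points onto physical storage URIs."""

    def __init__(self):
        self._mounts = {}  # mount point -> backing-store URI prefix

    def mount(self, mount_point, store_uri):
        """Attach a backing store (S3, HDFS, ...) under a logical path."""
        self._mounts[mount_point] = store_uri

    def resolve(self, logical_path):
        """Translate a logical path to its physical location."""
        # Longest-prefix match, so nested mounts shadow their parents.
        for mp in sorted(self._mounts, key=len, reverse=True):
            if logical_path.startswith(mp):
                return self._mounts[mp] + logical_path[len(mp):]
        raise KeyError(f"no mount covers {logical_path}")


ns = VirtualNamespace()
ns.mount("/sales", "s3://acme-lake/sales")            # cloud object store
ns.mount("/logs", "hdfs://onprem-nn:8020/var/logs")   # on-prem HDFS

print(ns.resolve("/sales/2019/q3.parquet"))
# -> s3://acme-lake/sales/2019/q3.parquet
```

Applications see only `/sales/...` and `/logs/...`; whether the bytes live on-prem or in the cloud is a mount-time decision, which is the compute/storage separation described above.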
So feel free to submit questions in the Q&A, bottom right-hand corner of your screen, during the presentation. Now let me introduce our speaker for today, Bob Seiner. Bob is the president and principal of KIK Consulting & Educational Services and the publisher of The Data Administration Newsletter, TDAN.com. Bob has been a recipient of the DAMA Professional Award for significant and demonstrable contributions to the data management industry. Bob specializes in non-invasive data governance, data stewardship, and metadata management solutions. And with that, I will turn the floor over to Bob to introduce his guest speaker for today and to introduce today's webinar. Hello and welcome. Hi, Shannon. Hi, everybody. Thank you, Alluxio, for sponsoring the webinar today. And this is a really good subject for right now. There is so much that is being written and spoken about data lakes. And the relationship between data governance and data lakes is pretty obvious to a lot of people; you hear the term data swamp when people talk about data lakes that go ungoverned. So I've got a special guest with me today, Evan Terry, that I'll introduce to you in a minute. But I wanted to thank you for taking the time out of your busy schedule to join us today. Before I get started, I just want to run through a couple of quick announcements, as we typically do at the beginning of the webinars. As you know, this webinar series takes place on the third Thursday of every month. And next month, we've got a really interesting subject: data governance versus information governance. Why should we call it one thing versus the other? Is there really a difference in meaning between data governance and information governance? I always talk about non-invasive data governance, so there's a bunch of resources that are listed on here.
If you want to learn more about non-invasive data governance, the first one is the book that I wrote several years ago, Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success. I'll be speaking at a couple of DATAVERSITY events coming up soon, at the Data Architecture Summit in October and at the new DG Vision event that's taking place in December. As Shannon mentioned, I'm the publisher of The Data Administration Newsletter. If you're not familiar with it, please go out to it. Lots of free information about data management, data governance, data lakes, everything that's related to data. And the name of my consulting company is KIK Consulting & Educational Services. So if you want to learn more about non-invasive data governance, please go there. And so the first thing I want to do is introduce our guest for today. Evan is the Chief Analytics Officer at Velocity Mortgage Capital. He has been in this business for a long time, consulting and working on projects associated with IT and data management. In fact, he has even co-authored a book on data modeling called Beginning Relational Data Modeling. And I'm looking forward to having you in the conversation today, Evan, so thank you for being here. Well, thanks for having me. All right, that's really good. So the way that these webinars are typically set up is there are five key topics that we're going to talk about. I'm going to spend a couple of minutes, a very short period of time, talking about each of the subjects, then turn it over to Evan, and then we'll go back and forth and discuss these topics. I think that they are some of the most pertinent and the most important for people to think about when they're talking about their data lakes and how they're going to put governance in place around those data lakes. The first thing we're going to talk about is, you know, what is a data lake? What is data governance? What's the relationship between these two?
We'll talk about the data lake in terms of becoming a data swamp. That's the term that I'm hearing most often from people who are referring to data lakes where the data in the lakes is ungoverned. They're kind of becoming a swamp. They're not as easy to use. They're not providing the return on investment. So we'll talk about some steps that we can take to prevent our data lake from becoming a data swamp. We'll talk about the metadata. The metadata seems to be one of the core factors as to whether or not your data lake is a data swamp. Do people understand the data that's there? Do they know where it came from, and who to reach out to associated with that data? The metadata is a key piece of this all as well. We'll talk about leveraging governed data to provide trustworthy analytics. That's an area where Evan has a lot of information to share with you, so I'm looking forward to that subject. And then the last thing we'll talk about will be measuring the value of governed data, or of the data in a governed data lake. So let's get started by jumping into the very first subject, which is the relationship between data lakes and data governance. If you've attended my webinars in the past, I have a definition of data governance that I use all the time. It is the execution and enforcement of authority over the management of data and data-related assets. Sometimes I break the word management out into the definition, production, and usage of data. And you'll see as we get later into the conversation today that there's really metadata associated with each of those actions that people take with data. I asked Evan to provide his definition of what data governance is, and he put it pretty simply. He said that data governance is really the management and the organization of data. And as we know, there are a lot of different definitions of data governance. I don't know what your definition is. It could be one of these.
It could be one of the ones that are listed here. But what I suggest is that when we put together a definition of data governance, we're really putting some teeth behind it so people understand what it means to govern data. Organizations have used the words orchestration, harmonization, formalization. All of these are good things for people to share in their definitions. But again, pick the definition that makes the most sense to you. Trying to turn that off, I'm sorry. But now the question becomes, what's a data lake? And then we're gonna talk about the relationship between the data lake and data governance. There are a lot of definitions of what a data lake is as well. So I just grabbed a couple that I could find that seemed to make sense to me. A data lake is basically a repository, a place for people to put raw data or objects or blobs of data, putting them out there for people to be able to react to them, to use them as part of their daily function, to make decisions, to get the information that they need. And the question really becomes, when does a data lake become a data swamp? Simply put, I think that a data swamp is a data lake that has data in it that's ungoverned, that there's no metadata for. And with that, I'm gonna turn it over to Evan and ask you: what do you think turns a data lake into a data swamp, and what is the relationship between these two disciplines that people are talking so much about? Well, thanks, Bob. So I think that the biggest challenge, or maybe one of the most interesting aspects that the data lake provides, is that it gets past the traditional data acquisition and management challenges that you would normally face with a data warehouse. Historical data warehouses would be highly structured on the output, but getting data into them is usually challenging. It takes time. It's not necessarily responsive to the needs of the business.
And so your data lake has that ability to take in raw data in its native form and store it dynamically within an environment. But what you're giving up when you do that, sometimes, is a little bit of the structure. The question is how much structure and management are you giving up as the pendulum swings more in the direction of speed of acquisition? So in this case, if you're talking about how data lakes become swamps, it's really when the power of the data lake, the ability to store that data in its native form, the ability to add to it relatively easily, starts to conflict with the ability to actually get good information out of it. And that's typically aligned with a lack of governance, a lack of management around the data that's being added to it. So from my perspective, when you're talking about a data swamp, I think the best example of a data swamp is this: probably everybody who's on this call is working in an organization that uses shared network drives to store files. And that is in some ways great: it's very dynamic, lots of people can write to it, lots of people can read from it. But the information contained within that file share is often not particularly useful to people beyond those who actually added the files to the shared drive in the first place. So when we're talking about data lakes becoming swamps, there really needs to be some kind of set of rules, some kind of governance, some kind of standards of behavior that we're going to require of people who interact with the environment, to ensure that it's maximally useful for the greatest number of people. Okay. So one thing that I heard you say that I thought was really interesting is that the data in the data lake actually only becomes valuable to the person who put it there, because they're the ones that have that understanding of the data. Can you maybe elaborate a little bit more on that?
What are the types of metadata that are being asked for? Because we want the value obviously to go beyond just the person that's putting the data there. What is the metadata that's going to help people understand the data in the data lake better? So yeah, there's typically the basic information, the basic metadata that would be useful to someone who's trying to consume a data set that's available to them. Which is going to be, first of all: do I know what's available to me? Can I find it? If I found a data set, I need to understand what it actually is. What is in that set? How was the data selected? What does it actually mean? Some kind of understanding, almost a data dictionary, or some basic information about what am I actually looking at. And then all of the other sorts of things you would normally see at some level within a structured environment, where you'd say, well, how often am I getting the information, and what kind of access do people have to this? So really it's about decoding it, because I look at data that's dropped into an environment, and if it's not at least minimally documented along those lines, what it is, when was it put there, what does it mean, then it's essentially almost a form of encryption. You've got a situation where you've got a data set out there but nobody necessarily understands it. So those would be the critical pieces to make that information broadly usable across a wider group of people. You need to understand: what can I do with it? And those kinds of things are gonna be what drives that usefulness. That's really funny that you say that, that the data that's out there is almost like it's encrypted because people don't understand it, right? And I guess that's the point of encryption. But we want to make certain we avoid the alternative, where the end result is that people use the names of the columns to decipher what the fields are. And that's not really what you want them to do.
The names of the tables, the names of the columns, it goes way beyond that, it's the description, it's where they came from, all the things that you mentioned. People talk about things in terms of data dictionaries and data catalogs all the time. And so just kind of to wrap up this subject, would you suggest that if you're putting a data lake in place that the data dictionary would become a critical component of the data lake? I would agree with that. And of course the question around data dictionaries is always one of how in depth do you go into documenting it, how much work do you spend with metadata tools to try and capture it. There's probably a balancing act to be played there between the level of detail that you need to capture and the appropriate usefulness. But yes, as a general statement, I would say yes, the data dictionary in terms of identifying what you have and what it means is critical to the usefulness of that information. And you know what, you mentioned something else, you said minimally acceptable documentation. And I think that's something to take away from this as well, define what that minimal amount is and make certain that if we're gonna govern the data in the data lake, we wanna provide at least this information because otherwise it's almost like the data's encrypted like you had said. So great perspective on that subject. Let's spend a little bit of time talking about preventing the data lake from becoming a data swamp. And I'm just gonna spend a couple of minutes here talking about it, but I get that question all the time from organizations that I'm working with and the truth is that as I mentioned before, the data in a data swamp basically is ungoverned data. 
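The "minimally acceptable documentation" Bob and Evan are circling around can be made concrete with a small sketch. This is an illustrative Python data dictionary entry, not any particular catalog product's schema; the field names and example values are assumptions, chosen to follow the definition / production / usage breakdown used in this webinar.

```python
# Sketch of a minimal data dictionary entry for a data set landing in
# the lake. Field names are illustrative, not a standard.

from dataclasses import dataclass, field

@dataclass
class DataSetEntry:
    name: str
    description: str       # definition: what is this, what does it mean?
    source_system: str     # production: where did it come from?
    produced_by: str       # production: who to reach out to
    refresh_schedule: str  # production: how often is it updated?
    sensitivity: str       # usage: e.g. "public", "internal", "PII"
    # Per-column meanings, so consumers don't have to decipher names.
    columns: dict = field(default_factory=dict)


entry = DataSetEntry(
    name="loan_applications_raw",
    description="Raw loan application records as received from the LOS.",
    source_system="LoanOriginationSystem",
    produced_by="data-eng@example.com",
    refresh_schedule="daily",
    sensitivity="PII",
    columns={
        "app_id": "Unique application identifier",
        "fico": "Applicant credit score at time of application",
    },
)
print(entry.name, entry.sensitivity)
```

Even a record this small answers the questions raised above: what is it, where did it come from, who do I ask, and what am I allowed to do with it. Without these fields, the data set is, as Evan puts it, effectively encrypted.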
So the first thing that organizations should consider is at least putting in the level of governance that's necessary for the data that's going into the data lake. That level of governance may be, oh, well, we need to collect the data dictionary as Evan talked about, and the types of things that we spoke about. But there has to be some level of governance for the lake, because if you again let it become a free-for-all, it becomes very swampy and, again, almost like it's encrypted to people. So the first thing that I would suggest people do is at least take their governance and focus on the lake, and see what levels of governance are already there before we go trying to redefine it. If you have structure in how data is getting entered into the lake, then live off of that, or expand on that to collect the information that you need. The second item was implement metadata management for the lake, exactly what we're talking about here, and data governance and metadata are very closely related. I tend to talk sometimes about non-invasive metadata governance, because metadata governance is extremely important to make certain that the value of the metadata is effective for people. And as I mentioned before, there are basically three actions that I see people take with data, almost anything, and I've challenged people to ask me "what about this action?" and they tend to all fall under data definition, production, and usage. So if we're gonna identify what metadata we're gonna collect in our data dictionary or our catalog, or whatever we call the information that we have about our data lake, we can break it down into: what do we need to know about how the data was defined? What do we need to know about where it was produced, and about how people can use that information? It's just a very simple way of breaking down the metadata, and it alludes to the question that we just talked about: what's the appropriate level?
I think you brought it up in what you said: we gotta figure out what is the appropriate level of metadata, as well as data governance, for your data lake. And with that, I'd like to hear what your thoughts are about preventing data lakes from becoming swamps. Well, so I do think that the key to keeping your data lake from becoming a data swamp is some form of organization, and like you just said, Bob, it's a question of the appropriateness of the level of control that you're going to have. And I think that ultimately you're gonna wanna look at ownership of the lake; you're gonna have somebody who's hopefully within your organization in charge of at least managing the environment, and you're gonna have to understand who that is. You're gonna have to have established some standards. So, as we just talked about: how does data get into your lake? What is the minimally acceptable amount of documentation to be able to effectively use the information that's there? And you're gonna wanna look at who can and can't access particular pieces of data. All of these things are going to be tied ultimately to the context of what your data lake is trying to do, and I think that's one of the trickiest parts of using governance as a method to prevent disorganization, or the creation of a data swamp: you've gotta be looking at your data lake and saying, how am I using it? You may be using it for multiple purposes, and it may be able to be divided into different categories of purpose. And then you're gonna wanna be thinking about, well, if all I'm doing here is data science exploration activity, I might have different sets of standards, I might have different rules governing the use and creation of that data within my data lake than I would if I'm using it for a more operationally driven, almost like an operational data store type queryable environment that's gonna actually drive real business decisions in the moment.
So all of the things that we would normally think of in a data warehouse are, I think, relevant here: who's in charge of stewardship or curation of the data, how you're going to manage the evolution of the environment, the metadata capture like you talked about. And I do think that the zones, the concept of dividing up your data lake into zones of use and then sort of planning, is a little bit like having a city plan where you've got blocks divided up into different zones, and then you can talk about the appropriate standards that would go into place if you're going to use that property for a particular type of purpose. I think you're looking at similar kinds of constructs here in terms of putting the right governance around your data lake to prevent it from becoming a data swamp. And you know what, I really love that analogy of it being like a city plan where you've got blocks that have been zoned for certain purposes. And I know that I will be digging in and trying to understand from the organizations I'm working with what is the purpose of their data lake. You mentioned a couple, but maybe can you elaborate for the folks in the webinar as to what might be some of those purposes? You mentioned scientific. What are some of the other purposes that you see organizations using data lakes for, versus data warehousing or just databases? Sure, so the first one, obviously, that I've seen probably most often in this kind of an environment is the pure-play analytics environment, where you have a group that is typically involved in doing exploratory work; it's almost a form of research and development with data. They're doing some kind of exploratory analysis. They're using, say, internally generated data sets, but they may also be going to the outside to pull in publicly available data sets, or ones that can be purchased, or they may be contracting with a third party.
And then they're trying to put all of that together into a pure analytics environment, possibly with some more intense statistical modeling, to try and drive towards a broader strategic objective. That's probably the one that I've seen most often. But right behind that would be a way to marry together the structured data coming out of operational data sets and operational systems with a variety of the unstructured data that you're going to have, whether you're talking about Twitter feeds or system logs or any of those kinds of things, where you're trying to draw together, in a more searchable, almost real-time environment, the different data sets that present themselves in very, very different structures, and pull something meaningful out of that, where putting that into a data warehouse would be prohibitively time-consuming or expensive. Those are probably, in terms of end uses, the two most common that I see. But then if you're talking about dividing your data lake up into zones, you can also think of it in these terms: you might have zones within your data lake that are purely transient and used for loading in data sets on a periodic basis; you're going to have your raw storage, clearly, which is going to be the data in its natural form; and then you may even have environments that are slightly more cleansed or slightly more structured, depending on the purpose of your groups. And I would think that those purposes would largely fall under the two broad categories I talked about first. Have you seen organizations that have broken their data swamps down by, or I'm sorry, their data lakes, that was clearly a slip, breaking their data lakes down into subject areas? Or would they more likely follow the transient, the raw storage, the partially cleansed or completely cleansed environment? Do they ever follow subject matters?
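The zoning idea Evan describes, transient, raw, and cleansed zones with different standards, maps naturally onto per-zone admission rules. The sketch below is an illustrative Python fragment; the zone names follow the discussion, but the retention values and required-metadata lists are assumptions, not a prescribed standard.

```python
# Sketch of "city plan" zoning for a data lake: each zone carries its
# own standards for what may land there. Values are illustrative.

ZONES = {
    "transient": {"retention_days": 7,
                  "required_metadata": ["name", "source"]},
    "raw":       {"retention_days": None,  # kept indefinitely
                  "required_metadata": ["name", "source", "owner"]},
    "cleansed":  {"retention_days": None,
                  "required_metadata": ["name", "source", "owner",
                                        "description",
                                        "refresh_schedule"]},
}

def admit(zone, metadata):
    """Admit a data set into a zone only if it meets that zone's standard."""
    required = ZONES[zone]["required_metadata"]
    missing = [k for k in required if k not in metadata]
    if missing:
        raise ValueError(f"{zone} zone requires metadata: {missing}")
    return True


# A quick drop into the transient zone needs very little:
print(admit("transient", {"name": "clicks_2019_10_17",
                          "source": "weblogs"}))
# The same drop into the raw zone would be rejected for lacking an
# owner: admit("raw", {"name": "clicks", "source": "weblogs"})
```

The point of the sketch is the asymmetry: exploratory zones stay cheap to write into, while zones that drive operational decisions demand the fuller documentation discussed earlier.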
Yeah, I think there is; my experience has been yes, subject matter orientation does happen. If you're looking at these things on different dimensions, you've got the purpose dimension in one way, and then you've sort of got maybe the functional area of the company on another, and subject area is a way that I've seen this broken down before. Okay, all right. So I think that information, I know I'm taking notes here, there's a lot of information that you're providing that's very helpful. Let's change gears just for a couple of minutes here and talk about governing the metadata. I just provided a class through DATAVERSITY on non-invasive metadata governance, because there's a line that I tend to use a lot, which is that if we're gonna provide metadata to people, the metadata is not going to govern itself. There need to be people within the organization that have the responsibility for doing the same things: for defining the metadata, for producing the metadata, because there are lots of different types of metadata out there. So we need, as we mentioned before, that minimal set of standards for documentation; let's define what that is. Somebody has to have the responsibility for that. Somebody has to be responsible for producing it. If it's not coming out of an existing tool, somebody physically is going to produce a definition, and there should be process around that. And certainly the metadata that seems to be most relevant is about usage. So I'm guessing that in most situations, there's data in the data lake that needs to be protected. That's one of the first things that people think of.
There are no-brainers that say the data that has certain information in it is classified as being sensitive, classified as being private, whatever your categorization is. People will need to know not only how they can and can't use and share the data, but what are the rules that are associated with that. The second bullet point is kind of interesting because it can be read two different ways. You could be asking the question, well, where is the metadata that's associated with the lake? Or even just lop off the word "where" and say, is there metadata associated with our data lake? That's one of the first steps that we can take as practitioners to gauge how much work is going to need to be done in order to provide that minimal amount of documentation we talked about. And the question that has to be asked is, who are the people in the organization that have the responsibility for the metadata? And maybe, Evan, you could share any examples of things that you've done. If we understand that there needs to be this minimal set of documentation and somebody needs to define what that is, who are typically the people that have the responsibility? Is it the people that are entering the data in the data lake? Who does that? Yeah, that's a really great question, because I was taking notes as you were speaking and I thought, really, again, I'm an analogy guy, so I like to bring analogies into it. You're talking here about who the librarians are, basically, right? Who are the ones who are going to keep the actual collection of data properly inventoried, if you will, and make sure that that information is current and up to date? And I think the challenge that you have is that most organizations are functionally oriented, and therefore they don't really have the appropriate organization that's charged with maintaining this kind of information.
They may not even really have anyone who's charged with maintaining the quality of the actual data stored in the data lake, let alone the quality of the metadata associated with the data in the data lake. I think where I've seen this work most effectively is with the role that we've been using for years, right, of data stewards: people who have a particular area of responsibility for the quality of the data are also those who tend to have the best information about what it means and how it might be used. So I think a combination of a good metadata management tool and an appropriate set of people, typically not within an IT organization, who are charged with managing the usefulness of the metadata itself, is where I've seen it be the most effective. But I'll admit it's something that I see companies struggling with an awful lot. And so you talk about librarians, and that's interesting, because one of the organizations I'm working with is calling them librarians. Is it typical to have one librarian for the whole data lake, or are there individual librarians that are associated with each, let's say, source of data or each zone of data within the data lake? It depends on the size of your data lake, I guess, but it seems to me that a data lake, if it's being used in the way that most are intended to be used, is going to get too large too quickly, I think, for a single individual to be able to practically manage. I think what you're talking about is, again, if we're taking people who have business knowledge, potentially, as these stewards, it's probably functionally oriented, because that's typically how information about the business is stored within an organization.
So I would think you're probably talking about multiple functionally oriented librarians, if you will, or stewards who have information about and the responsibility for managing the metadata. Okay, and just to share with folks that are in the webinar: the organization that I'm working with that's calling them librarians has them per business segment of the organization. So they're kind of taking what Evan said and saying, yes, we need to have these librarians, and they're actually calling them that, at least to start. But somebody that plays that role, I think, is what's most important. Can you elaborate just real quickly on what you mean by the Goldilocks mentality? Yeah, so the idea here is that whether you're talking about cataloging the data in the data lake or managing the metadata related to the data that's stored in the data lake, you can over-manage the metadata, you can over-catalog the data, and you can under-catalog or under-manage. And the balancing act, I think, is in finding a not-too-hot, not-too-cold, just-right balance between some structure, some standards, and some expectations, but not stifling the value of the data lake in the first place, which is this dynamically evolving collection of data sets. Ah, not too cold, not too warm, just right. I understand that. And you know what, working to determine what that balancing point is will be different for every organization. So it's not something where you can say, well, this is the exact balance that you need. Would you agree with that? It's really specific to the organization? I think it is specific to the organization. I think it's also potentially specific to the lake in question. If you've got multiple data lakes, or if you've got multiple zones or subject areas within your data lake, there might be some slight differentiation there as well.
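One way to picture the "Goldilocks" minimum Evan describes is a small, fixed set of catalog fields per data set, with a named steward (librarian) per business segment, rather than exhaustive or absent metadata. This is a sketch; all field names, steward names, and zone names below are hypothetical illustrations, not anything from the webinar.

```python
# A minimal sketch of 'just right' metadata: a handful of required
# catalog fields per data set, each with a named steward.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str            # data set name as users know it
    zone: str            # which zone of the lake it lives in
    steward: str         # the 'librarian' responsible for it
    description: str     # one-line business meaning
    classification: str  # e.g. 'public', 'sensitive', 'private'

# One steward ('librarian') per business segment, as in Bob's example.
stewards = {"sales": "J. Rivera", "finance": "P. Okafor"}

entry = CatalogEntry(
    name="orders_raw",
    zone="sales",
    steward=stewards["sales"],
    description="Unprocessed order events from the web store",
    classification="sensitive",
)
print(entry.steward)   # J. Rivera
```

The point of keeping the entry this small is the balance both speakers describe: enough documentation that users know what a data set means, who owns it, and how it's classified, without the cataloging burden stifling the lake's dynamism.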
Okay, and one of the things I hear you keep saying is that we really need to understand what the purpose of the data lake is, what we're trying to do with the data lake, and that it makes sense to break it down into different usable areas at different levels of metadata, perhaps. What I have found to be very successful, getting back to the data governance side of things, is having a purpose statement for data governance: being able to answer, in an elevator, when somebody asks you, well, what are you doing data governance for? They used to call those elevator speeches; I don't know if they still call them that or not, but have a quick definition of the purpose of your data governance program. And I'm assuming you're thinking the same thing around your data lakes, or the zones of your data lakes. One of those purpose statements that I found to be very effective was from an organization that wanted to "use strategic data with confidence." There are only five words there, but there's a heck of a lot of meaning behind them. When people want to know what that means, well, we need to define: what do we mean by strategic data? What does it mean to provide confidence in the data that's in the lake, that's in the warehouse, that's in the application? What are we doing data governance for? And I heard you stating that it's really important that we let people know what we're doing with the data lake, rather than just making it a sandbox that anybody can jump into and do anything they want with. And so we want to make certain that the water in the data lake, basically, is clean, or let people know that it may be unhealthy, because people may be making incorrect decisions based on that information.
And so we want to let people know, as if we're putting out a boil-water alert, that the data in the data lake has been boiled, that it is at least defined to the level that it needs to be defined. It's kind of like a conversation I had with somebody who lives in Florida about moving from the freshwater side to the saltwater side, or vice versa: they were concerned about the species that lived in each of those different types of water. Well, depending on how well the data in the data lake is understood, that's going to really help you determine who the people are, the species, if you will, that are going to live in your data lake or use the data in it. So what I'd like to do now is learn from you. I've got a couple of slides of yours here to talk about leveraging the governed data, because ultimately, as you said, the pure-play analytics piece is one of the most widely used functions of a data lake. Maybe you can help us understand what it means to leverage governed data to provide those trustworthy analytics. Sure. So, if you're talking about actually using the data that's in the data lake to do something useful: we've been talking a lot so far about data governance and maintaining effective standards and ensuring, to use your water analogy, that the water is clean. But clean enough for what? Right? Again, to go down that road, is it drinking water? Does it need to be consumable by humans, that kind of thing?
So you're talking about establishing those standards, and I think when you get to leveraging the governed data, you're saying: okay, presumably I have some kind of a data catalog or a data dictionary. I understand what I have. I've established certain standards to make sure that my data is usable and appropriate for the purpose that I've defined, for either my subject area, my zone, or my lake in general. The trick is, once I've got that basic foundational underpinning, and I know that the data is appropriate and well governed, and I understand what it is and where to find it, now you can actually start to use it, again, to your point, with confidence. You can start to say, well, now I have the confidence to use that information to extend my analysis, or ask that question that I couldn't ask before, or find that piece of information that's going to help me resolve a particular problem. So one of the points I have on this slide is that trustworthiness really is a question of context: what's your need for accuracy? You know you're going to have a little bit less controlled data in a data lake than you would in a warehouse, but if you've got that underlying governance, lightweight enough not to be intrusive, but enough that you can rely on it, now you've really set the stage for an appropriate use of the data. Okay, and does what you've described prior to this apply to a data warehouse as well? Yeah, sure. So governance, because it's that foundation of usability and appropriateness, really does apply to a warehouse as much as it does to a lake, right?
So you're talking about, again, the things that we've been talking about throughout this webinar. You're talking about ownership: who understands the data and can answer questions about it or be responsible for it. You're talking about finding the right things you need, because oftentimes, and I've seen this in so many organizations, you have the same piece of information defined multiple ways depending on which department is interested in it. You've got the level of quality appropriately set and managed, and security, hopefully, already taken care of within the context of that governance process. And this is another piece of it: just like you would in a warehouse, you want some ability to monitor the data that exists within your lake, to ensure that you're meeting appropriate standards and that you haven't got a situation where the quality of the data is degrading but no one's paying attention and no one actually recognizes it. And my comment at the bottom about the tragedy of the commons is just that: if you don't have that basic control... for those of you who aren't familiar with it, the tragedy of the commons came from the overuse of public grazing land in England during, I think it was, the 1800s. The concept is that if no one's in charge and everyone has access to a common resource, then what you wind up with is people whose behavior doesn't align, people who may be trying to achieve different results, and the shared resource degrades as a result. And you know what, the tragedy of the commons was exactly what I was going to jump on, because it's almost diametrically opposed to the idea of the zones, right? Of setting up zones. We have the tragedy of everybody trying to do everything out of one place.
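The monitoring idea Evan raises, watching for quality degradation before anyone notices it, can be sketched as a periodic check of each data set against agreed standards. This is only an illustrative sketch; the field names, thresholds, and records below are hypothetical, not from the webinar.

```python
# A minimal sketch of lake quality monitoring: score a data set's
# completeness against agreed thresholds and flag any degradation.

def completeness(records, field):
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def check_quality(records, standards):
    """Compare each field's completeness to its agreed threshold."""
    alerts = []
    for field, threshold in standards.items():
        score = completeness(records, field)
        if score < threshold:
            alerts.append(f"{field}: {score:.0%} complete, below {threshold:.0%}")
    return alerts

# Hypothetical lake records and agreed standards.
records = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C3", "email": "c@example.com"},
]
standards = {"customer_id": 0.99, "email": 0.90}
print(check_quality(records, standards))
# ['email: 67% complete, below 90%']
```

Run on a schedule and compared over time, a check like this is what turns "no one's paying attention" into an alert somebody owns.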
Just one last quick question on this subject: it sounds like you're leaning toward the idea that the lessons we've learned from governing data in a data warehouse would be directly applicable to governing the data in data lakes. Is that your thought? Yes, that's exactly right. I think the question is simply one of how far down the road you go, right? A data warehouse tends to be fairly strictly controlled, and with data lakes, one of the benefits is that you don't want that much control, but the lessons are still the same. It's maybe a question of degree more than a question of appropriateness. I like that. There are lessons we can learn from what we have done with metadata to support business intelligence and data warehouse efforts. Again, dating back to the whole Inmon-Kimball debate, metadata has been a critical component of each of those ways of doing data warehousing. And we're saying that with the lake, we might do a different level of metadata, but metadata is still one of those things that can prevent us from turning our lakes into swamps. So the last subject I wanted to speak with you about, before we turn it back to Shannon for Q&A, is that of metrics. How do we measure the value of a governed data lake? For those of you that have seen the framework I use for Non-Invasive Data Governance, there are really six core components of data governance, and one of those is metrics. We talk about data and people and process; all of those things are very important to the implementation of an effective governance program, and so metrics are certainly a big piece of it. So when I think about the data in the data lake, here are some things that jump out at me, and then I'm going to turn it back to you, Evan, to talk about what you see as being ways to measure the data lake.
We want to measure people's confidence in the data, and frankly, we just need to ask them what that is. What level of confidence do you have? How much time do you spend making certain that you understand the data? That's something that can be measured. Understanding of the data. How many people are using the data? How the data is being used. The decisions that are being made from that data. Even measuring, as you mentioned earlier, Evan, what knowledge people have of what data resides in the data lake. Those are all things that I think can be measured. And I just want to throw out, before turning it over to you, Evan, some of the things we should be considering before putting metrics in place. What are we going to compare our metrics to? We need to benchmark what our current measurement is now, so that we can measure some sense of improvement. The metrics need to mean something to someone. And I think that's also a point you had made regarding the data in the data lake: select the metrics that are associated with the data lake itself, rather than just governance in general. I typically suggest that if we're going to look for return on investment from data governance, we look where data governance is being applied, and in this case it's being applied to the data lake. So my last suggestion was, you know, go jump in the lake, jump in the data lake, once you've got that data governed. So, Evan, what do you see as being the ways to measure the value of a governed data lake? That's actually, I think, one of the biggest challenges.
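Bob's benchmarking point, measure first, then compare, can be sketched as capturing a baseline for each metric before governance is applied and expressing later readings as change against that baseline. The metric names and survey numbers below are hypothetical examples, not figures from the webinar.

```python
# A sketch of benchmarking governance metrics: record a baseline,
# then report each later reading as percent change from it.

def improvement(baseline, current):
    """Percent change of each metric relative to its baseline."""
    return {
        metric: round(100 * (current[metric] - value) / value, 1)
        for metric, value in baseline.items()
    }

# Hypothetical readings: average confidence score from a user survey
# (1-5 scale) and hours per week spent verifying data before use.
baseline = {"confidence_score": 2.5, "hours_verifying": 6.0}
current = {"confidence_score": 3.5, "hours_verifying": 4.5}

print(improvement(baseline, current))
# {'confidence_score': 40.0, 'hours_verifying': -25.0}
```

Without the baseline there's nothing to compare to, which is exactly why Bob suggests benchmarking before the governance program starts; the negative number here is the good kind, less time spent double-checking data.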
And it's been a challenge on an ongoing basis, for as long as we've talked about the value of centralized data. How do you make the case that a data collection, either in a warehouse or a lake, is of value, absent a purpose for that value? So I think that's why the context, really, I keep coming back to that point, but I think context is the key to this. What you need to be looking at from a value-measurement perspective is that it's very difficult, as I put here, to evaluate the value of R&D. It can't be tied to anything specific. It's really hard to say that anything you're doing today has any specific value, because it may not. I've been in environments before where you talk about data sets in general and say, well, if I do this analysis, I'll avoid a disaster, right? Well, that's a little bit like asking, every time I get in the car, what's the value of putting on my seat belt? It has no value unless there's a disaster; if there's a disaster, it has great value. The problem is that it becomes very difficult to make any kind of investment decision on that basis. So I think what you really need to go back to is a business purpose. You need a business purpose that you're tying the data lake to, maybe multiple business purposes. You need to tie your value to that purpose, and then you can start to measure how well that data lake helps you solve that problem or achieve that goal. And then the governance component is: how much value does the governance provide in helping you achieve that goal? Because in some cases, presumably, the governance is going to be key to your ability, as Bob was saying, to be confident in the data and understand what it is you have. So it has some value to bring along with it as well, and it might be the linchpin to the entire process.
Because then, like I said in the final bullet here, the value really is measured in combination with that final use. If you're using your data lake for AI or machine learning, or if you're using it to be really nimble and responsive, to do some analysis really quickly and maybe reduce your time to market for a particular product or service, these are the things that you're going to be trying to tie it to. They're hard to measure ahead of time, unless you really have a key pain point or a key objective that makes it clear the data lake, in a governed fashion, is going to help you achieve it. You know what, it sounds to me almost like insurance, right? You don't need insurance until you need it. You don't need a seat belt until you need it. You don't need the data in your data warehouse or data lake until you need it. So those three steps, define the purpose, match the value up to the purpose, and then measure how well it solves the problem, are a great note to end the bulk of the webinar on. So thank you very much, Evan. I think you've provided some great insight. What I'd like to do now is turn it back over to Shannon to see if we have any questions for anybody today. Absolutely, we've got a lot of great questions coming in, and if you have questions, feel free to submit them in the bottom right-hand corner of your screen. Thanks to all the speakers today; these have been great presentations. And just to answer the most commonly asked question: as a reminder, I will send a follow-up email to all registrants by end of day Monday with links to the slides and the recording of today's presentations. The first question that came in here is for you, Madan. Just a quick question about Alluxio: is Alluxio a database as a service? Great question, Shannon. Alluxio is fundamentally data orchestration, or data access, as a service.
So we would be deployed coupled with things like your compute application, and when and as your data is being requested or accessed from different data silos, Alluxio would be able to intelligently and efficiently transport that data to your application. Perfect. So, on to the data lake: what is the difference between a data lake and an operational data store? Is that addressed to anybody in particular? Madan, do you have an answer to that? Or Bob? Well, I'd be interested if Madan has an answer to that question too. Well, let me... you go ahead first. Yes, so, a data lake and an operational data store. For us at Alluxio, essentially, we embrace data silos, in the sense that regardless of whether data is living in a data lake or an operational data store, from the application perspective and for end users, they still need self-service, performant access. One of the challenges with maintaining access between data lakes and data stores and multiple app teams is that someone, at the end of the day, needs to manage each one-to-one connection. Ultimately, you need some sort of virtualization layer, I think, in this new modern architecture, that lets you access data regardless of where it is residing, in the lake or the operational store. And you know what, I always viewed the operational data store as being kind of a staging area. I know that more and more these days, organizations are trying to make use of data directly from the data store, but to me it always was a staging area. With a data lake, it's where we're putting the data to be consumed, and I think that's a big differentiation between them. Evan, do you have anything to say on that? I was going to chime in there and say, yeah, I think the operational data store is in some ways less a structure than it is a purpose, and you could certainly use a data lake in that way.
In the sense that ODSs are usually used to look at data in relatively native form, with relatively little external cleansing or rules associated with it, I would say they're not dissimilar. It's really less a question of how they're structured, and more, going back to the purpose of it: oftentimes the ODS becomes that structure that people put data in for staging, so I agree. Are zones in the data lake like a staging area for a warehouse, before the data moves to the actual warehouse? I think that's a good question for you, Evan, because you started with the idea of the zones. Sorry, Shannon, can you repeat the question one more time? Yeah: are the zones in the data lake like a staging area for a warehouse, before the data moves into the actual warehouse? I think they can be. So the way I look at zones is that you could have a zone that actually is a staging area that you use to flow the data through, in a similar way that you would for a data warehouse. But I think you can also look at zones as being functionally oriented. You do your data acquisition first, then you might have a subsequent zone where the data is slightly more processed: it might be cleansed slightly, it might be aggregated, potentially in different ways with other data sets, to support perhaps an analytical environment where you've got an analyst who's doing specific kinds of analysis, but who needs a certain level of quality in the inbound data and needs to understand that it's gone through the appropriate quality gates, I guess, before they can do their work. So you certainly can think of them as a staging area, but I think they're more conceptual, logical separations by purpose, more so than strictly staging into a data warehouse.
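The zone progression Evan outlines, raw acquisition, then a cleansed zone behind a quality gate, then an aggregated zone for analysts, can be sketched as a small pipeline. Everything here (zone contents, record fields, the cleansing rule) is a hypothetical illustration, not a specific design from the webinar.

```python
# A sketch of flowing data through functionally oriented zones:
# raw acquisition -> cleansed (quality gate) -> aggregated (analysis).

# Raw zone: data lands as-is, including rows that will fail the gate.
RAW = [
    {"region": "east", "amount": "100"},
    {"region": "east", "amount": "50"},
    {"region": "west", "amount": None},   # incomplete; fails the gate
    {"region": "west", "amount": "75"},
]

def cleanse(records):
    """Quality gate into the cleansed zone: drop incomplete rows
    and normalize types."""
    return [
        {"region": r["region"], "amount": float(r["amount"])}
        for r in records
        if r["amount"] is not None
    ]

def aggregate(records):
    """Aggregated zone: totals by region, ready for analysts."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

cleansed = cleanse(RAW)
print(aggregate(cleansed))   # {'east': 150.0, 'west': 75.0}
```

The analyst working from the aggregated zone can trust that everything there has passed the earlier gates, which is the "appropriate quality gates" point; whether a warehouse sits downstream of the final zone is then just a question of purpose.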
And you know what, that just re-emphasizes one of the things you said earlier, which is the purpose, knowing what you're doing with your data lake. If the purpose is to be this staging area, to be the area where we're going to do some of the cleanup and the aggregation, to me that's the definition of the purpose, and then it's fine for the ODS to be similar to the data lake, or, of course, I'll say it the other way around. Madan, anything you want to add to that? No, I agree. I think a lot of that can be interchangeable, and certainly in today's environments we're getting multiple different data streams coming in, and a lot of times there does need to be some sort of staging area, whether we're ingesting data through Kafka or different REST APIs, before the data eventually goes to the single source of truth, in this case the data lake. So yeah, I agree with that. I love it, and there are so many great questions coming in that we're not going to have time to get to them all. However, keep them coming, as I will get them over to Bob, who will write up the answers to be included in the follow-up email going out to everybody on Monday. So, continuing on here: what would be the use cases around having both a data lake and a data warehouse, and do you see any challenges with having your data lake as a source for the data in your data warehouse? We've talked a little bit about that, but maybe you can expand on it a bit more. Evan, I think that would be a great one for you to touch first. Yeah, so: what would be the use case around having both a data lake and a data warehouse, and do you see any challenges with having your data lake as a source? On the first question, there's nothing inherently wrong with having both a data lake and a data warehouse.
I think it's a question of repeatability and structure. The warehouse is going to lend itself to a presumably slightly more limited and more structured set of uses, and that might be perfectly appropriate for some component of your analytics or reporting environment. And actually, as a staging area for a data warehouse, that is a good use of a data lake, especially if you're trying to manipulate data that comes in in a variety of different structures into something that's more uniform and more usable within the construct of some of the reporting tools out there. So there's nothing wrong with doing an old-fashioned data warehouse as opposed to a data lake, and there's nothing wrong with using a data lake as that staging environment for the data warehouse, if you're receiving the data in a way that makes that advantageous. You know what, I find that a lot of the organizations that are jumping into data lakes have been around for a while, so they have data warehouses. It's very common to see that there is both a data lake environment and a data warehouse environment. In fact, they could be owned by two completely different parts of the organization, or the purpose might be defined differently for each, but it's very common for there to be both a data warehouse and a data lake. Any other comments there, Madan? Yeah, absolutely. In most of the environments we see today, there absolutely are both data lakes and data warehouses, and typically the end users we see on top of each of them vary slightly. With data warehouses, because by nature the data is structured in that format, we see business-professional reporting; a lot of that is done on traditional data warehouses. With data lakes today, however, we mostly see much more exploratory data analysis.
So think about your data scientists and data engineers who are really trying to query this vast, unstructured amount of data. And I think now, with the tools in the cloud, like what AWS offers, querying that kind of vast unstructured data in a data lake becomes much easier than it used to be in the past. Perfect. Let's see if we can slip in one more question for y'all; we've got a couple of minutes left. How frequently can or should a data lake be refreshed? Well, I am going to guess what Evan's going to say, and probably what Madan's going to say as well: it all depends on the purpose. It depends on the purpose of the data and how often you need it. Do you need real-time data? Well, that's obviously going to require a certain level of governance to provide it in real time. What do you think, Evan? What is the right answer to that question? Well, I think that's exactly right. You've got different data sets that are going to have different purposes, and those purposes are going to require different data refresh cycles. So you could have something near real-time that requires a constant feed, and you could have something on a much slower cadence than that. I don't think there's really one answer to that question, even for an individual data lake. I agree; certain operations like refreshing can tend to be expensive, so ultimately you don't want to refresh and take up additional resources unless it's absolutely necessary for your end users. Perfect. Well, that brings us right to the top of the hour. Thank you so much to all of our speakers for today. It's been a great education, great content, and thanks to Alluxio for sponsoring today's webinar and helping make all of this happen.
Just a reminder, again: I will send a follow-up email to all registrants by end of day Monday with links to the slides, the recording, and the additional information requested throughout, and we will get the answers to the remaining questions out as well. Thank you all, and thanks to all our attendees for being so engaged in everything we do. We just love it. Hope to see you all next month at Bob's webinar. Thanks, all. Thanks, everybody. Thank you.