Hello and welcome, my name is Shannon Kemp and I'm the Chief Digital Manager for DATAVERSITY. We want to thank you for joining the latest in our monthly webinar series, Data Architecture Strategies with Donna Burbank. Today, Donna will discuss Data Lake Architecture: Modern Strategies and Approaches. You can see WebEx has undergone a significant UI update, so feel free to look around. You will find most of the icons you need in the bottom middle of your screen. Just a couple of points to get us started: due to the large number of people that attend these sessions, you will be muted during the webinar. We very much encourage you to chat with us and with each other throughout the webinar. To do so, click the familiar chat icon, again in the bottom middle of your screen, to activate that feature. For questions, we will be collecting them via the Q&A section, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DAstrategies. And as always, we will send a follow-up email within two business days containing links to the recording of the session and additional information requested throughout the webinar. Now, let me introduce to you the speaker of the series, Donna Burbank. She is a recognized industry expert in information management with over 20 years of experience helping organizations enrich their business opportunities through data and information. She currently is the Managing Director of Global Data Strategy Limited, where she assists organizations around the globe in driving value from their data. She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia and Africa, and speaks regularly at industry conferences. In fact, hey, Donna. Hello. Hi. Hello and welcome. We will get started. So as Shannon mentioned, today's topic is data lakes, and the architecture and approaches around them are always a very hot topic in the industry.
And we'll mention this again: as Shannon mentioned, this is a series, and all of these webinars are available on demand. So if any of the past topics are of interest to you, they're all available on the DATAVERSITY website, and this one will be available on the website after the session. We also hope you can join us for the upcoming ones this year, on master data and some other key topics like data modeling. And we are just putting together the lineup for next year. So on that note, if anyone has any topics for next year that they are dying to hear about or are interested in, please use the chat or Q&A. Let me know, because we'll be putting it together before the end of the month. So if there are any topics that you haven't been hearing and you'd love to hear, we are open to that. So without further ado, a little bit on data lakes. We've all heard by now, you know, data lake, data swamp, data sewer, whatever word people use. But I don't want to be negative, because I think data lakes do have a lot of opportunity, and this is why they have become so popular. And like anything, I guess I'm old enough where I get a bit jaded, and you hear that about any technology, insert technology here: data warehouses don't work. Well, there's work involved, and people involved, and there have been data warehouses that have failed and there have also been very successful data warehouses, MDM, etc. For any technology we'll ever mention on this webinar, there have been good ones and there have been bad ones. So I do think there's a lot of opportunity around data lakes, which is why they're so popular, but like anything, without the proper architecture, without metadata, without coordination between teams, it can quickly devolve into a swamp, particularly because data lakes are by their very nature cross-functional and have a lot of different, disparate data sources, and very quickly that can sort of devolve into chaos.
So here are just a few tips on how to hopefully avoid that, based on what we've seen in our practice. But let's start with the opportunity, because if you know me, I love to be positive, which is why I'm in data, because there's a lot of cool stuff you can do nowadays. Data lakes are almost the epitome of that. We can talk about a data warehouse, and throughout this we will make the comparison between data lakes and data warehouses. So, say we just take one subject area, and there are many different use cases for data lakes, but consumer or customer has to be one of the more popular. It's not like we've not been getting a view of customer and a lot of related information around customer, but with data lakes, the opportunity is even broader because data now is so much broader. So think of social media, and a typical customer living on their iPhone, right? You know, social media interactions: are they on Twitter? Are they on LinkedIn? What are they saying about your brand, voice of customer? Have they called your support line? Can we do voice to text? Can we see scanned support logs? Are there photos, videos, you know, video chat or things they might have uploaded to the internet that we want to see? Can we look at actual purchasing patterns? Do we know where they've been from their phone or other systems? Do we know that they've been on our website and they've been clicking on certain things? Are they wearing an Internet of Things device, so we can see how they're using it? Is our product Internet of Things enabled, so we can see usage patterns? Et cetera, et cetera, et cetera. What all of those have in common is that very few of them are stored in a typical relational database. Some of those you could. Most of them are very high volume and happening in real time, or happening disparately across time periods, and may be high volume at one point, as with web clicks.
You're not doing that continually, but when you're on it, there's a lot of activity. So that's sort of, you know, we'll use the term big data, and that's another overused term, and we'll get to more on that. But that is the challenge of the data lake. You don't write select star from IoT. You know, you can, I guess, but that's really not the idea. Or take video: select star from video. This is a whole other area that isn't well suited to a relational database. Thus this idea of the data lake: store all of these disparate pieces of information together to do the next generation of analysis, which links to big data. And unless you've lived in a cave, you've probably heard the idea of the three V's and the four V's, the five V's, however many V's you can come up with when it comes to big data. But the V's are valid, which is why you keep hearing them so often. They are, as I mentioned, high volume. And again, you could roll your eyes and say, well, we've had high volumes of information for a long time, but, you know, these can be up to terabytes per minute or per day. There's a lot of information, it's high volume and rapid, and we may have had this volume before, but we were unable to really analyze it like we can now. A friend of mine is a weather analyst, and his father was a weather analyst. People can stick together. There are a lot of cool things done in weather. And the father was saying, well, they used to do some of this analysis, and there was a building in Wyoming, I guess it's still there, with server after server after server. He said even what was in that building in the 60s and 70s when he was doing this doesn't compare to what most people have in their laptop. And he said, I'm so jealous of you. I'm too old to be doing this stuff now, but if I had the capacity that you guys have today. And weather is a good example of high volume information to manage.
So he said, yeah, I'd love to have the capabilities you have, which is part of why data lakes have come along: because we can store more. High velocity, as I mentioned: the data arrives every second. You know, think of IoT. You've got a machine, or think of weather, just putting out weather feeds every second, every millisecond, whatever. And variety is the picture I mentioned before. It's machine data. It's media files. It's log files. Some things may be relational. There are areas in a lake where you can put relational data, in Hive or certain table-like structures, but that's really not the value of a data lake. If you're just doing relational databases, I wouldn't recommend using a lake. That's what relational databases are good for. It's really about getting at these other kinds of data. And in fact, the idea is that you're getting insight, right? Is it sentiment analysis across social media channels? Is it web browsing analytics? Is it looking through sensor data and machine logs to get patterns? Is it customer support call log analysis? All of these things, again, are varied and cross-functional. And that's really the idea of it. To get value from these massive amounts of data requires different tools. So different analytical tools, like statistical analysis, or, you know, R and Python and things like that, sitting on a platform that is more of a data lake than a relational database. But as for the business need, again, databases don't go away. It's just expanding the capabilities we already have. So you might have your typical business user, right, who just asks what customers are saying about our product. That also doesn't mean that things are less complicated. In many ways, they're more so. So if you live in the traditional database and data warehouse world, you know how complex just that is.
Even if we had everything on one of these systems, all in SQL Server, all in a similar format, even just getting that "tell me everything about customer" is hard. Well, they might be named different ways in different databases. There are different formats, there are different integrations. You know, ETL is around for a reason: getting all of that together can be complicated. But now, compound that in the big data, data lake world. There is your elusive data scientist, right? Trying to ingest those raw data sources and really parse that out and analyze it and make it relevant. So I think, and again, I've been around for a while, I'm starting to feel like I've been around for a while, in the early days there was a lot of over-promising about the data lake. Yeah, just sort of dump stuff in there and magic comes out the other side. Maybe folks don't feel that way, but nothing's magic, right? It's just work one way or the other. You just sort of parse it more as schema on read, rather than schema on write, which is more your traditional data warehouse, where I design that system and then I build it. With big data, the idea is that you're discovering things as you parse through the data and analyze it. So again, okay, we had four V's, right? We had the volume, velocity and variety, which are kind of your more typical ones, and then value, which is: what's the value from that? And the last one is similar: do we know this data is right? You can put a lot of stuff out in that system. How do I know it's true? How do I know it's the right data, the cleansed data? So we've probably all seen some of these figures, and some of us might be living it today, right? There's this idea that data scientist is the sexiest job of the 21st century. Why don't they say metadata manager is the sexiest job of the 21st century, right? Because what a lot of people want to get to is the insights.
And so you may have an analytics degree and you want to do all the great insights, but you spend most of your time cleaning and formatting the data. And some of us nerds like me find that fun, but that has its limits. No one went and got their degree to clean up name data and make it all consistent. That's sort of a means to an end, right? So a lot of people are frustrated that they're spending all this time cleansing the data, trying to correct that poor data quality, in the lower left, because that's 80% of your time during the day, right? You want to spend the time doing the analysis. Up in the upper right is this idea that without metadata, without understanding the data definitions, that can often be one of the biggest barriers to the success of a data lake. Radiant Advisors did a survey on that. But on the upside, in the lower right, there's this idea that most people want to digitize their business. You know, digital transformation is huge. And when we think of digital transformation, a lot of that has to do with clickstream analysis, voice of customer, all the things we were talking about, but one of the biggest barriers is trying to find the right data. Or once you find the right data and get it in the right place, I have it in the lake, then the data is inconsistent, right? So, again, these problems don't go away with the data lake. They're just compounded in a way. But big data is a growing trend. I've mentioned this in previous webinars, and it's available on the DATAVERSITY website, as well as ours at Global Data Strategy: we did an architecture survey last year, and we'll be doing another one this year. So stay tuned. We'd love your input. It found that over 70% of organizations are either using a big data solution today or plan to use one in the future. And when you look at the use cases, it's a lot of what we just mentioned.
Data science discovery, sort of the sandbox exploration, doing analytics and discovery on these new data sources that we couldn't before. And I guess machine learning and AI would be similar to that, and things like Internet of Things data are some of the big drivers for big data. But again, there are concerns. Some of it's governance. We didn't put that on the list, but the top write-in concern about big data was how we govern it, right? That was just what we were talking about before. Security is also a big concern. A lot of these solutions are out in the cloud. You don't just willy-nilly dump PII or personal information out there. I think we all know that, but sometimes trying to govern or manage that can be a challenge. And again, they're not easy. So there's the complexity of those solutions and the skills required: you may not have them in-house, you may have to train, you may have to outsource. So again, there's a lot of opportunity around big data and data lakes, but they don't come without their risks and concerns as well. So it's the idea of balancing that. There's immense opportunity, but as with anything, we need to balance it: reducing risk and increasing opportunity. There are many ways, but the two I'll talk about in this session are the architectural ways you can manage risk and then the governance ways. We've done a whole webinar on how architecture and governance are interlinked, two sides of the same coin. With architecture, there's scalability: how do we scale, and at what cost? One of the main reasons people go to a data lake is cost, or the cloud. Then there's how fast we can get the information: is it real time? One of the concerns with a warehouse is the idea that you may have to do refreshes. Maybe you get it weekly, daily, monthly, but you probably want that more real time in the data lake. Then there's this idea of how we store diverse data sources. Those are some of the basic architecture points.
Then we get to governance. How do we secure that? How do we handle privacy and compliance? And I think one of the most important considerations, well, not the most, but one of the most, is this idea of collaboration between new roles. Often you have data scientists working with more traditional data architects, or BI teams working with citizen data scientists, and not only is it a different way of working and a different way of governing, but the tools are also very different. Yet if you don't have those teams work together, you lose most of the value. Say we want that voice of customer. Your core customer data may be in your master data system and/or your warehouse, and you really need to link that together. Otherwise, you really lose the value. So how do we get all of that working together? You've seen this slide if you've seen me talk before, and people say they like it, so I keep using it. It covers some core things. One, what's the strategy? Why are we looking at some of this information? And I know sometimes we do it just for fun, right? Sometimes I don't know; maybe I'm just doing some exploration. But generally these efforts work better when aligned with the business strategy. And I'll tease one company we worked with that will remain nameless. It was a very core commodity. I don't want to give away what it was, but think of a keeps-the-lights-on kind of company. It wasn't a consumer product, but one eager data scientist wanted to do some analysis on Twitter sentiment. We did a lot of work on that, and I was just so curious how many people tweeted about this brand. Hardly anyone over the last year, and those who did said their bill was late. You know, it really was not the right strategy for that business. The Internet of Things might have been better for that particular company. It was more of a manufacturing business, more of a commodity type product. It wasn't somewhere that voice of customer was much of a concern.
Compare that with a company we're working with today in the UK, a retail consumer product company, where that's hugely important to them. We looked at customers and Twitter and did a similar analysis, and there are thousands of people per day tweeting about this product, because it's something they use in their daily life. So look at the business strategy before you go too far down into the technical strategy. That may be obvious, but all of us get excited about technology and then we sort of lose sight of that. The other part is, if we go bottom up, what data sources are we looking at? You know, big data, unstructured data, semi-structured. Maybe it really is only relational databases. We don't need the data lake for that, right? Or maybe it is more Internet of Things, and we do. How do you coordinate that? And then what are the ways we want to manage that, through data quality, architecture, et cetera? And then governance is that layer that really gets the people and process and culture around that. Often the whole data lake world is a very different culture in terms of how we analyze and look at the information, and we'll talk about that later in the call. So we'll get to all of this when it comes to the data lake. But it's worth thinking about this idea of big data as not only a technology shift but really a paradigm shift, and allow me to wax philosophical a bit. You know, I tend to do that, if you've heard me speak before, but really it's a different way of looking at the world. So on the left, it's more the traditional data warehouse, which is sort of top-down and hierarchical: I'm going to design and I'm going to build. We're working with companies building several right now, where we look at: what information do you want? What's the data model for that? How do we define customer? What's the definition of customer? How do we store the data type of name and make that consistent?
Super important stuff, and that is a very top-down, hierarchical approach, or even more importantly, master data. The whole point of master data management is to get that consistent view. So yes, by definition, I'm locking that down a bit. I put some things in quotes there, like a manageable volume of information. I know there are some huge data warehouses out there. It's not that we're talking small. Same thing with a stable rate of change, when change tends to happen a lot. So it's all relative, right? But that would be your world of data warehouse, business intelligence, design and build. If we go into the big data world, it really is a different idea. It's more about discovery, right? So I would say that's almost a more democratic view. We look at the data, we analyze it, we have discovery. It's more collaborative, interactive, iterative, agile, whatever word you want to use. And there are larger volumes of information with an exponential rate of growth. You know, just think of clickstream analysis of customers, and hopefully at a company that's growing, that's a massive amount of information. Or Internet of Things for your product. That's where you apply statistical analysis. And that's schema on read, right? So I'm discovering things. Once I discover something that's of value, I may start to store that, those variables. But I don't know until I start looking at it. I have some ideas, and that links back to understanding the business strategy before you start going through things. If you don't have a customer-facing product, you may not want to do voice of customer. It may be something else. But it really is about looking at things, and it ties back to why we're doing this.
And so, again, if you think of the traditional way of looking at the world, if you remember from school, there's Linnaeus, in 1735, and this whole idea of the hierarchy for organizing biological systems: kingdom, phylum, class, order, family, genus, species. I remember having to memorize that, however old I was. It's very structured. Think of the periodic table of elements. The whole beauty of that is taking a very complex system and having a chart, having rules. In my mind, this is the data model, right? In a way: how do we organize, how do we have structure? This is metadata around a very complex world. And maybe that was naive. This was a different world back then in 1735: if we could just organize everything, we could understand the entire planet through classification. Probably not true. There's so much complexity that we don't understand, but it went a long way to really help us understand nature and biological species. So it's a good thing. It didn't go away. But if you look at a lot of the research now, there's a lot of thinking around emergence or chaos theory, or how much we can really classify. So this idea of emergence is the idea that there are complex systems and patterns. You could think of a snowflake, right? Again, religion aside, no one thing or person designed a snowflake. No two snowflakes are the same. But there's this idea of snowflake-ness, right? So all the snowflakes on the right, I should have said left, sorry, you know that each is a snowflake. There are patterns that come out. It isn't just chaos, right? This is sometimes used in building, in what we call city planning. You know, we've all been at a university or a campus or something, and there are these nice square pathways, and then where people really walk is the direct way across the lawn, right?
So when you think of city planning, rather than just make straight lines, they looked at traffic patterns and what would make the most sense. So again, out of these chaotic traffic patterns, what would be the best way? I sort of liken that to social media. You know, I'm trying to get a sense of my customer sentiment. And sometimes it makes sense: I have someone on Facebook saying, I love my new Levi's jeans. Or it could be they're saying, is Levi coming to my party? That's a name, right? Or there's a sale: one of my partners is having a 20% discount on Levi's. Probably there are good things in there, but it's out of chaos. You sort of create these patterns, and that's almost the definition of statistical analysis. I'm looking at chaos. I'm trying to create order. That's schema on read, in a way: from this chaos, I'm finding patterns in the data. So in a lot of ways, that's the difference between a warehouse and a lake, right? On the left is more the kingdom, phylum, class, order: customer, product, order number. In a way, you're adding this layer of organization on a complex world, which is your company. This could be student and courses. It's not just customer and product; that's just a popular one. But that's a good thing. It doesn't go away. You need that. If I'm doing financial reporting, then I'm reporting to the Street or reporting to the board. Yes, I would like customer and product and order number and invoice locked down, because I don't want that willy-nilly. I'm not going to randomly discover patterns, hopefully, in my financial figures. There's a little bit of art and science when it comes to some of that, but yeah, it should be fairly locked down. The lake is different. That's more your idea of emergence. I'm going to look at patterns in the data. Just give me the raw data in its native format, and from my statistical analysis, I'm going to find patterns. And I perhaps don't know what that is until I see it.
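The schema-on-write versus schema-on-read contrast above can be illustrated with a small sketch. This is not from the talk; the column names, the in-memory lists standing in for a warehouse table and a lake, and the function names are all assumptions made for illustration.

```python
import json

# Schema on write (warehouse style): the structure is agreed up front,
# and a record that does not fit is rejected at load time.
WAREHOUSE_COLUMNS = ("customer_id", "product", "order_number")  # hypothetical columns

def load_into_warehouse(table, record):
    missing = [c for c in WAREHOUSE_COLUMNS if c not in record]
    extra = [k for k in record if k not in WAREHOUSE_COLUMNS]
    if missing or extra:
        raise ValueError(f"record does not match schema: missing={missing}, extra={extra}")
    table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))

# Schema on read (lake style): raw data is stored in its native format,
# and a structure is applied only when someone reads it.
def store_in_lake(lake, raw_line):
    lake.append(raw_line)  # no validation at write time

def read_with_schema(lake, fields):
    for raw in lake:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable for this particular reading; skipped, not deleted
        if not isinstance(event, dict):
            continue
        # Fields absent from an event come back as None.
        yield {f: event.get(f) for f in fields}
```

Two analysts can call `read_with_schema` on the same raw lines with different field lists, which is the discovery-style flexibility described above; the cost is that every reader has to cope with missing or malformed values.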
So that's more of that emergent approach. But they're both good. They both have their place, right? And I think the true value comes from combining these together. So that's the golden nugget there in the middle. I have a data warehouse and I have a data lake, and I can find new insights. So one customer I worked with was a financial institution, and they were trying to do some analysis on their high net worth customers. They did some amazing things with web analysis and scraping, finding information on purchasing patterns and when folks had been sued. Really interesting information, or creepy information, depending how you want to look at it. But when it came down to finding that John Smith, high net worth customer, had these three lawsuits and had been doing this on the Internet, in their own data warehouse they had 16 John Smiths, or in their own master data, you could say. And they weren't able to easily link those new insights because their data warehouse and their master data weren't clean. So unless you can link it with your customer data, some of these new discoveries aren't of value. But when you can, that's really where you get some really interesting insights. So the beauty is really in combining these two systems together. This is again from that survey we did on data architecture, about adoption. And there are always your buzzwords, I guess. If you look at the Gartner hype cycle and the trough of disillusionment, they're sort of past the peak of expectations. People know that Gartner has the height of hype, and then people get disillusioned because it didn't match the hype, and then you sort of get to the state of normalcy. So I was just curious. We used to hear so much that data lakes are going to change the world, and they are. Is it a commodity now? Where are people using it? So this was the survey. The majority of people who are using a data lake are using it in conjunction with the warehouse.
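Returning to the John Smith example above, a minimal sketch shows why a lake-derived insight is hard to use when the master data holds duplicates. The record layout and function names are hypothetical, and exact name matching is a deliberate simplification; real matching would normally use more attributes than a name.

```python
from collections import defaultdict

def index_by_name(master_customers):
    """Build a lookup from normalized name to the customer IDs that carry it."""
    index = defaultdict(list)
    for customer in master_customers:
        index[customer["name"].strip().lower()].append(customer["customer_id"])
    return index

def link_insight(insight, name_index):
    """Try to attach a lake-derived insight to exactly one master record."""
    matches = name_index.get(insight["name"].strip().lower(), [])
    if len(matches) == 1:
        return ("linked", matches[0])
    if not matches:
        return ("no_match", None)
    # Several master records share the name: the insight cannot be attributed
    # safely until the master data has been de-duplicated.
    return ("ambiguous", matches)
```

With sixteen records named John Smith in the master list, `link_insight` returns an "ambiguous" result carrying all sixteen IDs, which is the situation described in the talk: the discovery exists, but it cannot be tied to a customer.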
Some have it just as its own separate standalone solution, which has its place as well. But surprisingly, a lot of people still weren't using one. But that's fine. I mean, there isn't always a use case for it. Not every company has unstructured data. Maybe you just need typical data warehousing. So I don't know if that's bad or good, but it was interesting. I was heartened to see this, though, because I think in the early days of data lakes there was a lot of misinformation, where people would say, we don't do data warehousing anymore, it's all in a data lake. Maybe that was the vendors, maybe that was the maturity of the market. I'm honestly hearing less of that, and I think the data here shows it. The caveat is that a lot of the survey respondents were from DATAVERSITY, and we have to pat ourselves on the back. I think, by definition, DATAVERSITY people have understood architecture and governance and structures, and that's why we're here, right? We talk about this stuff. So had we done this more industry-wide, I think there may have been more misinformation there. But they're really not the same things. So don't replace one with the other. Just use them for what they're meant to be used for. So we have a poll, and I'd be curious whether people on this call are different. We had a lot of questions on this one, but here it's just a simple yes or no. We're going to put together a poll. Are you currently implementing a data lake? Yes or no? So I'm going to pass it over to Shannon, who's going to open up the poll. You can see on the right there a little yes or no question. You pick your answer, you hit submit, and there'll be a brief pause where we'll put together the answers, and then we'll get to see what everybody else says.
We'll give you a little bit of time, because you might have been multi-tasking, and you're going to stop multi-tasking and you're going to answer, and you're not going to be shy, because these are not linked back to individual people, so we don't know what you said as an individual. We just take a summary. So I hope you will answer. In fact, I'll do my own answer. So Shannon will let us know when the poll is live and we will see the response. I'm kind of curious. I would think it's fairly high, because I think people are self-selected and you're interested in it, but I am curious. So the poll has ended. So I think shortly we should be seeing the results. Yeah, I am working on opening the poll. Drum roll, please. Yes, it is. Calculating the results. So here we go. There we go. Okay, so a few more "we're not" than "we are", but it's about half and half. So out of the group, 45 said they are and 50 said they are not, and then 48 of you are shy or multi-tasking or just chose not to answer. But I find that interesting, and I think that's fine, if you're joining this webinar because you are looking to implement one and you want to know more. So I was just kind of curious. It's about half and half. Some of you are. Some of you are not. If anyone wants to share their experiences with implementing, or questions about implementing, in the chat, often we have quite a lively discussion as things are going. I will admit I'm a horrible multi-tasker, so I will probably not type in the chat because I'm talking, but feel free. We'll continue at the end. So as I mentioned, integrating the data lake and traditional data sources has a lot of value. Here's a very high level schematic of that. If you look at the left, you've got your multiple sources streaming in, and that's going to be your data analysis and data lake in that light blue. That's often what we call the sandbox environment. I don't know what's out there.
I'm just going to dump stuff out in the sandbox. We'll do some analysis and we'll understand. That might also be exploration. Maybe those are two sides of the same coin, depending what you call it. There's also some lightly modeled data. So it could be that I've done some cleansing and some analysis, or put some structure on that. Maybe I put some things into a Hive structure or something. But it's still sort of an exploratory analysis. On your right, it isn't just the warehouse; I call that more the enterprise systems of record. Think of your master data, your reference data, data marts, warehouses, even operational data. And the lines are trying to show that on your right, hopefully, I have a data model for that. I've done some, you know, schema design. But you'll see those arrows between those two systems, and it truly is bi-directional. So let's just take master data. Think of those percentages earlier on the call, with the amount of time people spend cleansing data before they can analyze it. I am sure, and any data scientists on the call, feel free to chime in if you disagree, that if there was a list of common country codes that were cleansed, or a common customer list that is correct, that can be fed into the data lake as a source. People would love to use it, right? So a lot of these enterprise systems of record can feed the data lake as a source. So here's a list of customers, here's a list of sales regions, or whatever you're looking at: that can definitely be a source into the lake. You may want to link that with social media analysis or click-through rates or whatever. But it can also go the other way. Sometimes the outcome of this analysis is a field you may want to start tracking, right? It could be, I didn't think having somebody's Twitter handle was important to the warehouse, but after we did some analysis, it certainly is. So let's add a field.
Or it could be, you know, a variable, a weather variable we're using for sales data: people shop more when it's raining, or they shop less because it's raining and they don't want to drive, right? It could be things that you discovered were valuable and sent up to the system of record. Slightly off topic: this could all be done for some local datasets that aren't necessarily in a lake; it could be a SQL database. I'm doing some local analysis for my region. "Hey folks, this is so important. Let's put it in the warehouse." So again, it should be bi-directional. And that, again, is the idea of discovery. You're discovering something worthwhile that you may want to store more permanently.
So, on "publish": this is the idea of collaboration and governance. What is the governance around publishing something you discovered? What kind of vetting needs to be done? What kind of communication is done? Do people using the data lake even know where these reference datasets are? Are they stored in a place that people can use? Are the people talking together, or communicating via a wiki or whatever method, to share these insights? Because the whole idea of these insights is that you're sharing them with other people. And we'll show some ideas around that.
Above it is the reporting and analytics. You could be doing standard BI reports. You could be doing self-service BI and exploration, or advanced analytics, which is probably more typical of something that's on a data lake type of environment. And of course, don't forget the security and privacy at the bottom, because that just spans everything. No matter where it is, if it's personal information, you need to track it in a certain way, or if it's HIPAA-regulated information, you really shouldn't be pulling that out and just randomly doing exploration on things you shouldn't be. So that spans all of it. It kind of ties in with governance. And it all leads to this idea of a data ecosystem.
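
As a rough illustration of that bi-directional flow, here is a minimal Python sketch. It assumes a small cleansed country-code list published by the system of record and a handful of raw lake records; the names `COUNTRY_CODES`, `standardize_country`, and `enrich_events`, and all the sample data, are hypothetical and not from the talk. The reference list feeds the lake, and values the list does not cover are collected so they could be sent back to whoever stewards it.

```python
# Hypothetical cleansed reference data, as published by an enterprise system of record.
COUNTRY_CODES = {
    "united states": "US", "usa": "US", "us": "US",
    "germany": "DE", "deutschland": "DE", "de": "DE",
}

def standardize_country(raw_value):
    """Map a raw country string to the reference code, or None if it is unknown."""
    return COUNTRY_CODES.get(raw_value.strip().lower())

def enrich_events(raw_events):
    """Attach a reference country code to each raw lake record.

    Values with no match are collected separately so they can be reported
    back to the reference data steward (the other direction of the arrow).
    """
    enriched, unmatched = [], set()
    for event in raw_events:
        code = standardize_country(event["country"])
        if code is None:
            unmatched.add(event["country"])
        enriched.append({**event, "country_code": code})
    return enriched, unmatched

# Hypothetical raw records sitting in the sandbox.
raw_events = [
    {"user": "a", "country": " USA "},
    {"user": "b", "country": "Deutschland"},
    {"user": "c", "country": "Atlantis"},
]
enriched, unmatched = enrich_events(raw_events)
```

The point of the sketch is only the shape of the exchange: analysts reuse the vetted list instead of cleansing country values themselves, and the `unmatched` set is a candidate for the publish step discussed above.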
I think it's important to remember, and this is kind of my mantra: the more data is shared, either across or beyond the organization, the more formal the governance needs to be. So when we're talking about something like master or reference data, yes, you really do need to have that more traditional, you know, kingdom-phylum-type taxonomy structure, your customer and product hierarchies. Because that is so important, it's the core of the organization, and it's shared across everything else, you do need the structure there. That lower green might be your warehouse. I think that's maybe slightly, but not much, looser than the master data. It could be, you know, just some data marts there as well. But that also is kind of structured by definition.
The stuff in the light blue is getting a little more exploratory. It might be operational reporting, or I have my own local reporting dataset that might be relational, or it could be some analytical model data or other structure around it, but it's, again, more loose or more local. When you get to the dark blue, that idea of exploratory data, that's definitely your data lake zone. The whole idea is raw, lightly prepped data and ad hoc analysis, and it should have very light-touch governance. Yes, don't let people put personal information out there. Don't go willy-nilly. I should know that you're rolling up a sandbox platform; you don't just go off and build your own randomly. But the exploration itself should be allowed. That is the whole purpose.
So you don't want to under-govern your reference and master data, and you don't want to over-govern your data lake. And I've seen conflict around that because, in some cases, unfortunately in many, with the tools that are out there it's easy to roll some of these things up, and people will just find a way to go around you. So don't over-govern. And we'll talk about that later, because I have seen too much of that.
"Oh, I want to go around the rules, so I'm going to go build my own lake," and then there are 17 lakes and, you know, that's not a swamp. What is that? It's just a very leaky lake district. You don't actually want that, because that's not helping anyone. And the idea is also the interaction between systems, technology systems. Again, this can be a lake, it can be a local mart, it can be an MDM system, it can be a data warehouse, all of those, but they need to interact together as an ecosystem. Just as a lake is part of a larger ecosystem with streams and forests and mountains, your data lake is part of a grander ecosystem.
But it's also human beings, and the roles between them as well. And there, I've seen, there can be conflict as well. So you have your typical data warehouse roles, and I think we're all familiar with those: your data warehouse developer, maybe your BI reporting analyst, ETL. You may have new roles with this data lake. It could be your data scientist, it could be this idea of a citizen data scientist. There's your data lake platform administrator, who's managing this platform, kind of like your DBA, but for the data lake. And then there are these cross-cutting roles. It could be a data steward. You know, if I'm the data steward for customer information, I shouldn't care whether it's in a warehouse or a lake. I still care about that as part of my stewardship, right? Or if it's patient data, it doesn't matter where it is. I'm still looking at that as my entire ecosystem.
Then the data architect. And this is, I think, something important to remember about the data architect. Actually, there's a DATAVERSITY article coming out on what we mean by data architect versus data modeler and that kind of thing. But there are different styles of data architect. There's the platform architect. So a data architect that's used to just doing data warehouses needs to expand her skills into things like Hadoop and data lakes and clouds.
Because it isn't just one thing. Similarly, a data architect who might be looking at things like data models and enterprise architecture diagrams: don't just model what's in the warehouse, right? Or what's in your relational systems. Is it important to start looking at, or thinking about, those other things? Maybe it isn't. Maybe it truly is just exploratory. But unfortunately, I think data architects often sit too much on that left side. You've got your traditional data warehouse, and there's a need to have at least those conversations about what might be in the lake. And are you honing your skills in things like cloud and lake architectures, and things like Hadoop and other systems? Otherwise, you have silos again.
The same goes for data governance. Are you only governing stuff in the warehouse and not even noticing the lakes, or are you over-governing the lakes? Do you understand what a lake is, and how citizen data scientists really may need the extra flexibility? They're not trying to be unruly. It really is true that they need some flexibility and control. So I think all of those roles at the bottom really need to live in both worlds. And some of those human beings might be involved in both worlds. You might be using a warehouse and be a citizen data scientist, or you might be doing citizen data science on the warehouse. Again, this is a simplification. I think all of these people need to work together, just as those technical components need to work together.
And that kind of leads to one way to govern, which is through metadata, and you know I love metadata. There are different ways of managing metadata as well. You may have heard me say this before, but this idea of encyclopedia versus Wikipedia is almost similar, again, to this idea of a warehouse versus a lake. With an encyclopedia, it's sort of a vetted truth, right?
The people in academia who define these sit in a room and say, this is the version of what a skunk is, and whatever we have goes in our encyclopedia and you publish it out. And it doesn't change much. It's not like, well, maybe a skunk is sort of this. If you learn new things about skunks, you put them in the encyclopedia at the next update. It's slow. It's sort of a standard body of knowledge. Wikipedia, right? That's sort of eventual consistency. You get all these people on Wikipedia, and it just seemed like this would be chaos. But you know what? I use Wikipedia all the time. And in some ways you can say Wikipedia is more accurate, because it's updated more often. You have a wider set of voices looking at it. It changes. It's dynamic. But really it is sort of a different thing. And I almost think of the data lake like that. It's really more for data exploration, self-service analysis, insights. And you can maybe get even better information by opening up your brain, opening up your ecosystem, looking at different sources and finding those patterns. So each one is good for what it's good for.
But I've also seen companies kind of use the wrong things. If we go back to that pyramid, make sure that when you're choosing a metadata or collaboration tool, you're thinking of the environment the tool is handling. I don't want to get into tools, so don't ask me; I won't use names. But there are types, and some do it all well. In general in life, though, it's hard to be great at everything, so they tend to fall into categories. One is the metadata repository, which is more about stricter governance. I have a glossary, and maybe there's a feedback mechanism, but typically there's a group, your data governance lead, that manages that and publishes it out. It may have a data dictionary, and have your approved sources. Super important. And there are some really good recent DATAVERSITY webinars on how you do that. The way I think of it is: I'm doing a warehouse, I want my source-to-target mapping at the attribute level, and I want an audit trail.
I want to track PII, or personal information. I want an audit to see which databases have my PII. I want to classify secret data versus classified data and be very strict. And there's a reason for that, because it's my warehouse. I want to have that rather formal metadata repository. And those tools are excellent. I sort of grew up on those tools.
But then there's this idea of the data catalog, which is kind of emerging; Gartner finally has its own Magic Quadrant or report on data catalogs, separate from metadata repositories. This is more the Wikipedia: it's more crowdsourced and open. And these tend to work well with data lakes. I mean, they're not exclusive. That's where that piece in the middle, especially that light blue, tends to be a mix, where it can be structured but there's still some exploration. There's probably some overlap between the tools. But with some of these data catalogs, okay, the metadata for a repository may be: what's the data type for field X? The metadata for some of these catalogs may be: what elements did you use? Great idea, what were your results? What does this field mean? Some of that is overlap. But maybe the glossary is crowdsourced. You know, I thought this was age, but really it was, you know, date of birth certificate, which is different. I don't know why I made that up. But it's more of a crowdsourced approach. The lineage is probably more high level, sort of by definition. You're surely not doing that direct source-to-target mapping as you would with a warehouse.
But a lot of them have sort of different things. Was this useful? It could be that there are different ways to calculate total sales, but 90% of people are using this one, so let's go this way. This is a really helpful algorithm. This is the code I used. Almost more like a GitHub, right, than maybe a traditional metadata repository. Maybe it's more loosely tagging information: this kind of has to do with customer information.
Or "this might be helpful", rather than the strict data classification that might be in more of a traditional repository. Again, there's overlap: some repository tools are getting more into collaboration, and some collaboration tools are adding more of the structure. But just give that some thought before you choose one. I've worked with several customers in the past couple of years who were frustrated with their tool. They either got a catalog when what they needed was more of a metadata repository. Sometimes you do want strict rules: you must use this field for that. Catalogs are just not designed for that. You can't lock down Wikipedia as much as you can an encyclopedia. Or people had historically a metadata repository and they wanted to add more collaboration. And again, maybe the tool you choose has some overlap. Just give that some thought, because I have seen frustration with several of my customers. They have a tool in there and, you know, it's apples and oranges. It all depends on what you're trying to do.
A little bit more on this idea of a data catalog. Again, it's that Wikipedia idea of harnessing tribal knowledge. So, was this helpful? What are the queries or algorithms that people are using? Was this helpful? Because I think of it as collaboration. And often, you know, if you have a repository, they're starting to add this as well. You have a definition, then someone says, "Hey, actually, you know, three years ago I used something different, and this is the same thing." You're having a chat around that. It can be really, really helpful. You find things you didn't know just by getting that input. You can't talk to everybody, so having a web-based way for folks to do that is helpful.
Again, avoid silos. I don't know if the analogy works, but in my mind I call these data lily pads, right? What I've seen too often is that, because data lakes can be sort of easy to spin up, right?
I can do an Azure instance on my own. It maybe happens without connection to the wider data strategy or the wider data governance. And I've seen this, and partly, maybe if I did the survey in the wider organization, there may be more confusion about what a data lake is versus a data warehouse. I've seen maybe a sales team saying, "I just want this information. I don't know. I'm just going to outsource and create my own lake." And then, I don't know, marketing does their own. They've got the marketing lake. And then you have these lily pads all across, and none of them connect. You lose all the value of the lake, because you're not integrating, right? Marketing doesn't know what sales was doing. Or R&D had their own. Or Joe just wanted to play around and create his own. And so that loses the point of the approach.
And what I've also seen with teams is that often it's the business doing this, because they're, quote, frustrated with IT. IT took too long because there's too much governance, which gets back to that pyramid. Don't over-govern. You may run the risk of governing so much that people just go around you, and you don't want that either. So what's the light touch? It could just be, and I've had some organizations I've worked with just say, "Just let me know you've built one, and basically tell me what's in it, because we may be doing the same thing." The other piece of that is cost, right? You don't want to pay for a bunch of different instances where you could get an enterprise license for something and just have different, you know, areas that people can work in. So there are a lot of reasons not to do this, but unfortunately I see this happen all the time: there's always a data lake with lily pads where people just go off and explore. Hey, maybe that's okay. Maybe it truly is, you know, exploration. I'm not trying to shut that down, but again, that could just be a cost issue. You're losing the scalability that you could have.
But again, you may find something that's valuable and that you really should share. So again, if you can coordinate, please do.
So, some risks to avoid, or things to think about, as you're going to a data lake. One is just the platform. Do you want to have an on-premise thing? You're in control of that. There are benefits to that, especially when you're thinking of security and privacy. You may not want to go to the cloud; that is a concern for some people. The cloud, however, has benefits. A lot of these lakes are in the cloud, especially when it truly is a sandbox. You've got the scalability. It's easy to spin one up. A lot of these platforms are easy. And that's one of the benefits, because if you think back to the survey, a lot of the hesitation was on the skill set. I don't have a Hadoop administrator who can do this, but if I can have one of the cloud providers, the platform providers, help me with that piece, I can more easily spin it up. So give some thought to your provider. A lot of these also have integrated tools for real-time data streaming, and there are a lot of cool things that kind of come with these platforms. So give that one some thought. And again, please do that together, so you don't have what I've seen, where not only did people spin up different lakes, but all on different platforms. And again, you're losing some of the cost savings that come from doing it together as a company.
Skills: give that some thought. Do you want to outsource the lake altogether? Maybe you want to try it. "I don't really know if this is a thing for me. Do I want to outsource to a third party?" They can do [inaudible]. If that goes well, maybe I want to take it in-house. If I do want to do it in-house, even to start, what training do I need? If this is you on the call looking to do this, what training do you need to really take a different path, or a different level, to do data lake things?
Security: who has access to this lake?
This one comes up all the time. I mean, give this thought before you build the lake. I've seen really expensive analytic lakes, with some third party that sales or marketing was paying to build a lake, get stopped. You know why? Because there were security concerns. It was when security and legal were brought in that it all kind of stopped, with no analysis allowed. So do your due diligence first. Ask security what might be the issue. How is PII managed? Don't go so far that you can't use the lake because it has PII. And again, if you've gone and built the lake without having told people, legal or security is going to be a little more nervous than if you had said, "Hey, I'm building this lake. What are some of the concerns?" Ask your data governance manager too. And can you just do some simple things? Are you obfuscating information? Do you really need to store anything personal on it? Or can you get just summarized information on patterns without any PII? So please do think of that.
And again, I always leave the names out, but I've been at some very large corporations, and at one we were talking about the lake, and this well-meaning developer raised his hand. He said, "So I shouldn't be putting personal information, name and address and email, out on the lake in the cloud?" "We'll talk to you after the meeting." And it was embarrassing for everybody, because he didn't know. He had never been told he wasn't supposed to do that, and you can't assume people know until they're told. So again, don't be dumb. I think that wasn't the right technical thing to say. But again, sometimes it's a very easy thing. One of my clients I work with is doing a lot of this, and they just had a very simple checklist: is there PII or no? If no, go ahead. So again, that's the idea of light-touch governance. They didn't want to over-govern. They didn't want to under-govern either. And I think that's the key takeaway.
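
A "PII or no" gate like that could be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the client's actual checklist: the column-name hints in `PII_NAME_HINTS`, the e-mail pattern, and the function names are all hypothetical, and a real list would need to be agreed with security and legal.

```python
import re

# Hypothetical hints; a real checklist would be agreed with security and legal.
PII_NAME_HINTS = ("name", "email", "address", "phone", "ssn", "birth")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def find_pii_columns(rows):
    """Return a sorted list of columns that look like they hold personal data.

    A column is flagged if its name contains one of the hints, or if any of
    its string values looks like an e-mail address.
    """
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if any(hint in column.lower() for hint in PII_NAME_HINTS):
                flagged.add(column)
            elif isinstance(value, str) and EMAIL_PATTERN.search(value):
                flagged.add(column)
    return sorted(flagged)

def may_load_to_lake(rows):
    """Light-touch gate: if nothing looks like PII, go ahead."""
    return not find_pii_columns(rows)

# Hypothetical sample extracts.
summary_rows = [{"region": "west", "total_sales": 120.5}]
customer_rows = [{"customer_email": "x@example.com", "notes": "call bob@example.com"}]
```

A check like this is deliberately crude: it will miss some personal data and flag some harmless columns, so it works as a prompt to go and talk to security rather than as a replacement for that conversation.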
Cost: again, cloud is often the right model for scaling usage. One of my retail customers had a lake for some of their clickstream analysis for web and things like that. And they used the cloud because they were a retailer: around December, hopefully, they had massive scale of volume, and the rest of the year they didn't. A lot of this was raw data, and they used the cloud for that. So that was a great model for them. Again, the cloud is a very different model. That scalability is great, but sometimes on-prem can be cheaper, depending on your usage. And a lot of us grew up in a relational world. I had a customer just a couple of weeks ago who told this story. I had nothing to do with it. And this was obviously quite a bit of money: people were spinning up sandboxes and were thinking, "Hey, it's just kind of like creating an Access database, or just doing my own little SQL Server instance, and it's free." And it's not, because you pay subscription pricing. And they kind of forgot the sandboxes. So people were creating all these sandboxes and they never shut them off, and they were paying for them when they weren't using them. So again, that's a very simple governance checklist. It doesn't have to be rocket science. You don't have to go in and say, "What data are you using?" or "Is it going to conform to a standard?" But did you turn it on, and did you shut it off? Do we know that you're doing it?
So this is just a basic checklist. Governance: is there common semantic meaning? Is there common data that people could be using to export to the lake? Do we know what a customer is? Do you have an operating model of how teams work together? How may things in the lake be published? Again, who is even spinning up a sandbox? Why? Do we have all the lily pads kind of going off in their own silos? And then what is that life cycle? It could be that sometimes you're using a lake just for some storage. I'm not even using it. I'm not analyzing it.
It's just a cheap place to put stuff. That's okay too. But you just want to know that usage. When can it be deleted? Do you have that life cycle? It might just be storage. Again, cloud isn't free. As in that last example, you don't want to just be putting stuff out there forever. And most cloud providers have archival storage options, where that life cycle becomes important. And then there's the life cycle I mentioned before: how do we move from exploration to enterprise? If that makes sense.
So again, data lakes can provide significant opportunity. They're cool things. They're here for a reason. Data is getting more complex, and it's a great way to manage some of that complexity. But the traditional systems don't just go away. This is another tool in your toolbox, to have these two different data sources that can work well together. And just as the data platforms and architecture should work together, so should the people. So make sure you have the collaboration, the governance around that, and kind of the life cycle management and operating model of ways of working together on this.
This will be on demand. For those of you who asked, I think it's next week, or in several days, that the link generally gets sent out, as well as the slides. Next month is on master data management, and that was kind of that top of the pyramid, where it is highly governed. So if you're interested in that, please join us. Just quickly, that white paper I mentioned is out on our website. It's also on the DATAVERSITY website, if you are interested in more of those trends and the details behind them. So without further ado, I do want to open it up to questions or thoughts or ideas. I don't know if that's been overdue.
Donna, thank you for another fantastic presentation. There are lots of questions coming in. And if you have a question, feel free to submit it in the bottom right-hand corner in the Q&A section.
And to answer the most commonly asked question, I will be sending a follow-up email by end of day Monday with links to the slides and the recording of this presentation, as well as anything else requested throughout. So diving in here, Donna: do you have any suggestions for training for data architects or for the data lake?
Well, DATAVERSITY is always a good source for that. Sometimes the vendors themselves have some really great training. They have a vested interest in people getting to know their products, so I wouldn't discount that. I think YouTube is a great source. I'm always surprised. It seems like a place to look for pictures of kittens, but it actually has some really great information there as well. So those are three options that would be helpful.
Thank you. Perfect. In addition to metadata repository and data catalog, would you recommend any other methods of documenting or representing the semantics of data, for example data models and ontologies, to help correlate governance across traditional systems and lakes?
Yes. Thank you. I'm going to give Dwight a virtual hug. He's my people. Yes, I am a fan of data models and semantics, so all of the above. I think, too, you know, that's one of the misconceptions, and I should probably add that as a slide, so thank you for that topic, if we do this again. Something like a conceptual data model, which I use all the time, or maybe a slightly logical one, is a perfect way of showing things across a lake and a data warehouse. What do we even mean by customer? How does that relate? How is that different from a consumer? How is that different from X, Y, Z, a prospect? So some of those very high-level definitions can be super important. And, if nothing else, also some high-level enterprise architecture: where does this customer data link to? Very helpful.
Ontologies can be interesting, too, because that's kind of a different way of looking at it, but it still has your semantic definitions. So yes, I'm sorry for that oversight, but I am a huge fan, especially in the data modeling world. I think that's a great way to integrate those systems at a high level and get the conversations going. And one more point on that, since we're talking about the people. I often do workshops around data, building a high-level conceptual data model. And I think we remember to invite the data warehousing and the architecture and business people, but are you inviting the data scientists and some of these citizen data scientists that maybe you hadn't thought of, who may have some really interesting ideas based on the data they've been doing discovery on? So, great question. Great comment.
Now, we don't, of course, get into recommending one vendor over another, but in general, what are some of the vendors that offer a data lake platform?
Yeah, I hate vendor questions, and they always ask me. So Microsoft Azure has some good ones, Google has its own, Amazon has some, so AWS-type platforms. Those are some of the big ones. You can do Hadoop and Cloudera and a lot of those pure-play vendors. And one of the nice things about the cloud vendors is they kind of wrap it up in a package. And again, I wax poetic on the definition of a lake, but they may have related tools that aren't quite lakes; maybe it's a real-time data streaming service, and they wrap a lot of those other things around it. So my recommendation, especially if you're new to this, would be to start with some of the cloud providers, because they make it a little easier and they offer some tangential things all in one nice package, and relational databases as well. So that might be a way to start.
There was a question to show slide 30 again, and I think we've got time for a couple more questions.
And in the new platform they made these little things really small, so it's not as easy for me to find slide 30. But unless somebody on the DATAVERSITY side has a bigger screen, I'm getting there. 24. All right.
One of the other questions is...
I muddled my way through. Let me know if you can use me. 30. 30. All right. Go ahead. I'm making a fool of myself.
Yeah.
Help.
Oh, it's the second to last slide. It's just the next slide up. Yeah, there you go.
This. Okay. Sorry.
Recently, an analytics leader told me, "Analytics do not need data governance." And I laughed. However, how do I educate them?
Thank you for laughing. Well, as for some of the ways to educate, and I won't go back to the slide because I can't see in these little buttons: the slide in this presentation that has a lot of the industry statistics can help you, with the percentage of data scientists that are frustrated and the percentage of effort spent on data quality. There's so much published around that, and those can help you because, you know, you didn't say it: Gartner, or an industry advisor, or somebody else gave some statistics around it. I had a customer say that to me. And obviously he was doing, scariest of things, some data science on medical record data. And he said, "No, someone told me that when you're doing this, you don't need to do any quality." I'm like, well, if all your gender codes are wrong, and you're trying to say, you know, that something is more prevalent in men than women, and they're labeled wrong as men and women, your whole research is done for. So sometimes that gut-level example, kind of showing an example of that, helps: the analysis is only as good as the data itself. And then, you know, I've been doing some work with a university, and there's the governance of the analytical models. And depending on the person who said it, that might be over their head.
But how did you get there, what model did you use? And actually it was so refreshing working with the university because, and I don't want to offend anybody, often I'm working with a retail company where you're working with a sales guy that says, "I don't know, I just want the numbers." And with the university, you know, they would argue about what methodology you used to get that number, because they're used to scientific research. But they have a point: what model did you use, what modeling technique? And there are governance methodologies around the analytical models themselves. So hopefully that's enough arrows in the quiver, in addition to laughing, to help with that argument.
I love it. All right, Donna. Well, thank you again. I'm afraid that's all we have time for. Thanks for another great presentation. Thanks to all of our attendees for all the great questions. We love how engaged everybody is. Just a reminder again, I will send a follow-up email by end of day Monday to all registrants with links to the slides and the recording. And I will include a link to the research paper as well, so you'll have that available to download, along with additional resources from Donna, as she does so much.
Thank you.
So I hope to see you all next month in September, and I hope everybody has a great day. Thank you.
Thank you.