And here we go. Hello and welcome. My name is Shannon Kemp and I'm the Chief Digital Manager at DATAVERSITY. We'd like to thank you for joining the latest installment of the monthly DATAVERSITY webinar series, Advanced Analytics with William McKnight, sponsored today by Looker. Today, William will be discussing when and how data lakes fit into a modern data architecture. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we'll be collecting them throughout the presentation in the Q&A section, or if you'd like to share questions or highlights, you can also share them via Twitter using hashtag ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom middle of your screen for that feature. And if you'd like to continue the conversation after the webinar, you can follow William and each other at community.dataversity.net. And as always, we will send a follow-up email within two business days, containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me turn it over to Joel for a brief word from our sponsor, Looker. Joel, hello and welcome. Oh, you're still muted there. Yeah, there you are. Yeah, well, just having a whole lot of fun. Hello, everybody. Sorry about that. I promise not to take too much of your time with crazy sponsor stuff. But I thought I might take a few seconds and just talk about why Looker is interested in a conversation about data lakes. And it goes almost without saying that where data is valuable is where it's used, right? Data that sits in a data lake isn't really valuable until it's analyzed. It isn't really valuable until it's used. Depending on who you ask, only about 20% of data in data lakes is ever analyzed.
And about 50% of business intelligence against that data is still run in tools like Excel. So this idea of fueling a company and being more sophisticated about the way you use and consume data is core to what Looker does. It's really core to what William's going to talk about going forward. But it's very, very important to our businesses, right? As data people, we're very lucky because this is fueling our business's success going forward. And it's a place of ongoing investment from our executive teams. I will say this. BI, business intelligence, and how we look at data has really changed. It is very valuable, obviously. But the change is from traditional reports-and-dashboards type analytics, where an analyst generates a report or a dashboard for their team. That is still valuable. But remember that historic number: we're still only using about 20% of our data. And I think William will talk a little bit about this. This process of pulling out, aggregating data from a data lake into a data warehouse, or into an operational data store, or a data mart, that is a narrowing of that pipeline of data. What's core to Looker is opening up that pipeline. So you can access all the data, put it in the hands of people in really new and interesting ways. So I would like you to think about that a little bit as William is talking, because getting that data out, getting the full set of that data out into the hands of people who will find it valuable, is extremely important. Now, underlying your data architecture really is this concept of a data lake or a data warehouse, where you're aggregating data from many sources. Lots of operational data stores, a lot of transactional databases, a lot of row-based data stores, aggregated together into something that is more suitable, more centralized for analytics. And that's where Looker plays. Looker actually helps you compress this idea of data lakes and data warehouses.
By modeling the data right in the data lake, you don't have to go through and own some time-consuming ETL process transforming the data as you pull it into a data warehouse. The point of a data warehouse, of course, is to shrink the data set and organize the data set for performance. And I know William is going to talk about how database performance has changed in the cloud era, but at this point, you can put everything in a data lake and really access it right away and still get the incredibly high performance that you need in order to run tons of queries with your users and also your analytics with your analysts simultaneously. We have a bunch of customers that are using what we consider to be four big ways of consuming data. One is around BI and analytics, which is that sort of 20% use case. And the other 80% are around integrating it right into workflows like Slack, or integrating right into applications or custom applications that you don't even necessarily know you're using data for, which are very, very valuable. You've all probably had an application, or maybe you're a Netflix subscriber, where you're getting data served to you, but it doesn't feel like data. And it's that native data experience that's just seamless and frictionless that we're all aiming for in order to get the value into the hands of the people who need data to make decisions. So Looker is, well, William and I were talking about this briefly beforehand. Looker was a relatively small company yesterday, but today we announced the close of the deal of being acquired by Google. So some of these numbers may not be 100% accurate after our acquisition by Google yesterday, but we have about 900, actually closer to 1,000, employees now, 500 or 5,000 developers globally. And a really interesting number: we have about 1.2 million licensed users around the globe.
So we're big, we're growing, and we have a lot of momentum in making sure that data gets to the people who need it in order to make better business decisions. I would request one thing from you after the webinar if you get a chance. Today, Gartner announced their 2020 Magic Quadrant for Analytics and Business Intelligence Platforms. Looker's got some great momentum there. You can see us there on the diagram. If you go to Looker.com, right there on the homepage, click right on the diagram that you'll see there. It'll take you to the report. You can download it and read it. But between now and then, we have an actual analyst with us right now in the form of William. And he knows a great deal about this market as well, and he's going to share a lot about data lakes. So I just want to make sure that he's on board, and I'll hand it over to William here. Thanks, William. Hey, Joel. And actually, I'm going to step in here for a second. And thank you for this great introduction, and thanks, Looker, for sponsoring. If you have questions for Joel, feel free to submit them in the Q&A section in the bottom right-hand corner of your screen, as he will be joining us in the Q&A at the end of the webinar today. And now let me introduce to you our series speaker, William McKnight. William is the president of McKnight Consulting Group. McKnight Consulting Group focuses on delivering business value and solving business problems, utilizing proven, streamlined approaches in information management. His teams have won several best practice competitions for their implementations. He has been helping companies adopt big data solutions. And with that, I will give the floor to William to get his presentation started. Hello, and welcome. Hello. Thank you, Shannon, and I trust my presentation is projecting correctly now. Looks great. All right, good. Thank you, Joel, for that part at the beginning.
And I'm so glad that you mentioned the big news, because I wanted to say something, but I didn't think it was my place to say something. So I'm glad you did. And congratulations to all of Looker. So as Joel mentioned, today we're talking about data lakes. I've been thinking a lot about data lakes. I've been thinking about them vis-a-vis the data warehouse. And Joel got into some of that. I'm going to share with you maybe some new thoughts around data warehousing itself in the course of the next 40 minutes or so. So Joel also mentioned the terrible rate of data usage that we have out there. With only 20% of the data in the current data lake being used, we've got to do better. I am definitely coming from the camp of I've found data lakes to be tremendously useful. Nothing that's worthwhile is easy. And so I'm going to throw data lakes into that category. But it is worthwhile. It is on the maturity spectrum of different artifacts that you need to have as an enterprise if you want to be really successful, not just with data, but successful as an enterprise going forward. It is a tremendous vehicle for exposing mass amounts of data in a way that really no other vehicle does. Now we have to get our terms straight, because out there in the field, people throw the term data lake around. It used to mean a Hadoop cluster for whatever purpose. And now it's sort of come to mean, according to some people, just use of cloud storage. Is that truly a data lake? To me, a data lake is really something that's like a data warehouse in the sense that it's built once for multiple purposes, OK? So it's a lot of data. And maybe I'm talking about an enterprise data lake here. But I'm going to really use the term to describe that shared platform on cloud storage, not on a relational database, although it could be argued that you could use a relational database for a data lake.
As a matter of fact, I'm going to muddy the waters a bit as I get into this and start using data warehouses and data lakes a little bit interchangeably here as we go forward. Because we have to solve this problem. We can't keep needing to replicate data all over the place. As a matter of fact, that's one of the reasons why we have a data warehouse. It's a consolidated platform. So let's not make the data lake another platform that's going to have to get a bunch of the data and so on. I'll show you a way as we go forward here to kind of break that up a little bit. So the data lake, again, before we jump into it: very important for an enterprise. It's not a swamp. It's not just vendor or consultant hype to guilt you into spending more money on something. It can be a tremendous asset. I want to show you that throughout the course of this presentation. So first here, find my mouse. All right. A little bit about me there. Shannon gave me the introduction, so I'll go quickly. Got some books, lots of talks. McKnight Consulting Group, we do strategy. We do training. We do implementation work in the areas that I love to talk to you all about. As a matter of fact, this is one of my favorite times of the month, this hour that I get to spend with you. I don't see you, of course, but I see your names, and it just brings me great joy to be sharing information from my practice. Truth be told, I'm talking about this stuff kind of just like this every day to smaller audiences, usually vendors and clients and so on. And I do love talking about it. And I think it's very dynamic. And if you feel like, when's the train going to stop on all this stuff, I don't see it happening. I think it's just getting more complicated here before it's going to get less complicated for you. So buckle up and find your niche and find what needs to happen in your shop. And one thing that needs to happen is you have to make smart platforming decisions.
And there are three basic prongs to this decision today. One is the data store type. You need a relational database. And by that, I mean the rows and columns adhering to relational theory and all that good stuff. There are so many relational databases. And frankly, they haven't been sitting around while Hadoop came aboard. They are not sitting around now while cloud storage is starting to take over a lot of things. They have definitely advanced. And every release now has actually important advancements, whether we're talking about BigQuery, Redshift, Snowflake, and so on. Big advancements. Advancements I couldn't even have thought of. But yet here they come and there they are. And a lot of it's around performance, which is great. We know that's very important. But a lot of it's great functionality too. And some of that functionality I'm going to be sharing with you here today, functionality you need to consider, especially in your data warehouses. OK. The second decision is the data store placement. Now some progressive shops will tell me day one, yeah, this is going to the cloud. Don't even think about it. And other shops will really lean hard on on-prem. I don't think anybody's stuck there. I think it's fashionable to say we want to move to the cloud. If you do want to move to the cloud, let's take some action towards moving harder to the cloud by actually considering the cloud for the next big thing that you do. It doesn't have to be a public cloud, although I think they give tremendous advantages to you. The data store placement used to not be a thing, right? But now it's very much a thing. Where are we going to put this database? And many of you have multi-cloud environments. So it's not just the cloud, it's which cloud, many times. And finally, the workload architecture. There's still a divide between operational and analytical. Still differences.
One is much more a transactional, data-in kind of thing, and analytical environments are much more about how we're going to get the data out. There still need to be both of these. And then we have a seriously emerging category kind of right in the middle called, well, called different things. I'll call it operatical, okay? So you see what I'm doing there? I'm combining operational and analytical. And we find this a lot in IoT, where we need operational analytics to actually get the job done. And those are becoming kind of hard to classify. But whatever the classification is around operational and analytical, you need to know it before you step into your platform decisions, so you make a smart decision. Now, as we go forward with the idea of the data warehouse, just so that I am clear with you, I am not in any way, shape, or form taking away any kind of value from the data warehouse as I talk about the data lake. I think the data warehouse is absolutely essential. We need a place for that consolidated intake, distribution, and access layer. We need a place to apply data governance rules so that multiple parties within the enterprise can benefit from it. It provides a level of efficiency and economy that an enterprise needs to have today. And truly a data warehouse is a data warehouse when it serves multiple purposes within the organization. If you say, well, we're building a supply chain data warehouse, well, maybe that could be expanded a little bit more to the enterprise. But realistically, I've stepped into many shops and they have multiple data warehouses. They have them for multiple purposes. I've done a full presentation on data warehouses and a big report on data warehouses in terms of where they are recently. And so I know that they're flavored. That's the word I kind of like to use. And that's OK. They've kind of come to that. But try to forge an enterprise orientation as best as you can.
Data warehouses are clearly on relational databases. That's where I'm drawing the line. I'm going to expand on that, though, as we go along here and bring data lakes into the picture, maybe even combine them. So a little bit more on data warehousing. I've been dragging this information around for a long time. I can't wait to retire it. But people still seem to need to know this information, so I still put it out there. The old definition of data warehousing by Bill Inmon, it's still true today. It's still relevant. People still need this. As a matter of fact, if you don't have a lake or a warehouse, you're in a distinct minority there. But if you don't have one or the other, start working on your data warehouse. So it's a shared platform. Built once to use many times. I like to allow access at the data warehouse level, not require marts. Marts will happen. And I'm not going to get into that much more than that. OK, so you come to points in the road, forks in the road, if you will, where you can go different directions with your platform. As a matter of fact, whether you're trying to platform something today or not, you should have as an enterprise a direction for where you want your architecture to go for data in the next five to 10 years. Now, it will change as new things emerge, but you should have that picture. And I help draw that picture for many of our clients. And you'll get it one way or the other. But you should have like a three-year picture and a five-year picture. And this helps inform how you platform things as you are looking at them today. So there are many reasons why you actually might want to not just platform something new, but many things that are, quote unquote, new are really variations on what you have going on already. So when might you want to switch? OK, to take advantage of a cloud database.
BigQuery, what have you. You might see the advantages there over what you have on-prem, see the advantages in system administration, patching and updates and so on and so forth, and say, yeah, we want to go that direction. So that might be a reason. You might want to get into a columnar data orientation, because that's a really hard thing to do if you haven't started out that way. So sometimes that requires more rework than just reorienting an existing data warehouse. You might want to get into a data architecture that you want, or maybe just a real data architecture. And so that might necessitate some change. You might want to get into using cloud storage like many data lakes have, very economical and quite functional for what its purpose is. And I'll get into that. Or you might have projects that require consolidated data. So you might want to be bringing data together from multiple disparate data marts and warehouses and whatnot that you may have around the organization today into something so that the new application can go one place and get all of its dog food. It doesn't have to virtualize everything. A lot of judgment. I'm leaving a lot of judgment in there for you, by the way, in this. That is absolutely required for success. So the key is to get into right-fitting platforms. And here's where I'll distinguish a little bit amongst the right-fitting platforms. The data warehouse, I've talked enough about that. Make sure it's a real data warehouse. You've modeled it. It's not a lock and load of an operational model. You have data quality there. You know you can change data to make it better, to bring it to a higher level of quality standard. You have great tools there, like a Looker, not ancient tools or Excel, the bane of my existence, Excel. Conformed dimensions, yes, you're sharing dimensions across the schema. And you have data governance on top. OK, you might have data marts.
You might have a data lake, as I will spend some time here defining. It might just be a big data cluster. Maybe it's a Hadoop cluster. Maybe it's a cloud storage cluster for an individual purpose. OK, but that's not a data lake. But it still might be the right platform for that particular application. There are reasons why applications do not join in to the data warehouse today. And there are reasons why applications will not join in to the data mart today. Now these reasons should be bigger than the drop of a hat. OK, too many enterprises out there will hear one little word of doubt and say, OK, well, I guess you can go off and do your own database. That's not a winning strategy for the future. We have to get more efficient with this stuff now. OK, other data marts, an operational hub. Data is shared out operationally. No, the data warehouse is not sharing data out operationally. An operational hub would be, or an operational data lake. So I'm not going to talk too much about the operational data lake, but I do know that they exist. And they are coming on and do perform a function in the enterprise. OK, so let's juxtapose this a little bit here. I have a couple of axes here: data cultivation on the left, and usage understanding by the builders, that's you and I, along the bottom, the x-axis. So let me start with that one. How much do we need to know as builders about how the data is going to be used? Now, I suggest that, at least in my experience, as I'm building quote unquote data warehouses, I need to know a lot. I need to lock it in to some usage. I don't want to lock it in too hard, but I've got to lock it in to get some early wins. So I have to understand how that data is going to be used. So yes, we do ask for top 10 queries and what's your business strategy and all this sort of thing. OK, that's how we build data warehouses. Now, the cultivation on the y-axis, that is how refined we need to make the data.
It's kind of another way of saying the same thing, but how much do we have to kind of lock in the platform to how the use is going to be? The data warehouse, I would say, we put a lot of summary data in there to do it well. We do data quality updates to the data. We rework the schema of the thing. We put derivations in there and calculations, and we do a lot of things to it to make it really usable by the users. This is for your pedestrian end user. I don't mean that in a pejorative sense, but this is where 80% of usage is going to come from on this diagram. It's going to come from the warehouse or maybe an offshoot of the warehouse. Data lakes are on the opposite side of the spectrum. You and I as builders of these things can be successful by not really fully understanding what those data scientists are doing to the data lake. I don't fully understand everything like I do for a data warehouse, but I know enough to handle that handful of users that are going to hit the data lake, and that's going to be a higher caliber, more of the data science type of user. They're going to be interacting with it. This is their day job, in other words. Not like the data warehouse users, where they're kings of supply chain and product management and marketing and so on. This is their work. The data scientists are okay with a lower level of data cultivation. Then I can't forget the data mart. You have them all over the place really. I could have probably dropped that on here eight times and been okay about that. So you've got your marts still. Now let's look at the post-operational ecosystem. There's your data lake. Looks pretty big, doesn't it? Compared to the warehouse, yes. It can get pretty big. It can get much larger than the warehouse. Depends on your business: healthcare, telecom, supply chain, manufacturing, et cetera. Yeah, your data lake, even in retail, could be a lot bigger than the data warehouse in terms of size. But who cares?
Size doesn't really matter so much anymore. What's important is the architecture and the arrows and how things are flowing, and that you don't have hyper-redundancy going on in this architecture. So you might have things alongside the data warehouse for workloads that don't belong in the warehouse. And I have a long diatribe on this I'm gonna try to stay away from, but there are reasons why you may not wanna put a new application in the data warehouse. And then of course the data warehouse feeds marts, source data feeds marts, et cetera, et cetera. You've got your own post-operational ecosystem. No two are the same. And it's probably much more complex than what I'm showing you here. There's a little spaghetti thing going on with just about everybody. So take heart, you're not alone if that's you, but we have to go forward. We have to be working to improve it, clean it up, make sure it makes sense. And as I said before, get into that architecture that you want. So what if the warehouse and the lake were combined and we could have one structure that suits all across the board here? Now I'm not taking away data marts, but one structure that hits, I think, that magic 80% that I talked about before. Well, that's possible today. As a matter of fact, that's how warehouses are being built today, the ones that are being built to modern standards. And the ones that we're recommending, frankly. And some would call it the data lake now, because I'm putting cloud storage behind it. And some would call it, well, that's still the data warehouse, with cloud storage. Please just clarify your terms as you go forward and talk to your vendors, talk to your consultants, talk to each other, talk at conferences and so on, so that we're all on the same page here. I know that's a tall order in this industry, but anyway. So I'll explain a little bit what I mean by what if we could have one thing to do it all. Yeah, there's your data mart.
I put that right there, but I could put it again still all over the place and it would make sense. So let's deploy a data lake. All right. It's a data scientist workbench. It's also a data warehouse staging area. So all data will go, I say all, nothing's all, but a lot of data, most data, we'll try to get through the data lake as the staging area for the data warehouse, but also all of it's gonna be left behind for the data scientists to do all the things that data scientists will do to that data. I'm gonna get to that toward the end of the presentation. And just like any database out there, there's data coming in, there's data going out, okay? Just like every database, and there's ways to do the in and the out, better or worse depending upon the situation. I will say a lot of data lake focus when you're architecting it has to be on the input, the intake, whereas in the data warehouse, we're doing a lot of that in batch and intake performance is not really front and center. But it certainly is with the data lake if you're trying to get all that big data in there, okay? So a lot of times we turn to streaming methods for getting the data in, and I'm a huge proponent of that. It's replacing some ETL out there, but mostly it's for modern applications, IoT, et cetera. It is the way to go. As a matter of fact, it's the only way to go. Okay, so other patterns of the lake here. We have the data refinery. Many companies started, gosh, as much as five to 10 years ago, doing data warehouse ETL on the data lake, simply on the face of it to try to save money on their ETL, which is laudable, of course. You wanna save money any place you can. Sometimes it worked, sometimes it didn't. It can work. It can be a very legitimate pattern for using the data lake. It can be used for archive storage. That's at the end of the cycle, okay? Older data warehouse data, if you will, that you don't need.
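To make the streaming intake pattern above concrete, here's a minimal Python sketch of a micro-batching writer that buffers incoming events and lands each batch in the lake under a date-partitioned key. The class name, the in-memory `sink` dict standing in for an object storage bucket, and the key layout are all illustrative assumptions, not any particular vendor's API.

```python
from datetime import datetime, timezone

class MicroBatchWriter:
    """Buffer streaming events and flush them to the lake in batches.

    `sink` stands in for object storage; here it is just a dict of
    key -> list of records so the sketch stays runnable.
    """

    def __init__(self, sink, batch_size=3):
        self.sink = sink
        self.batch_size = batch_size
        self.buffer = []

    def ingest(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Land each batch under a Hive-style date partition so SQL
        # engines can later prune on the dt= value.
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"events/dt={day}/batch-{len(self.sink):04d}.json"
        self.sink[key] = list(self.buffer)
        self.buffer.clear()

sink = {}
writer = MicroBatchWriter(sink, batch_size=3)
for i in range(7):
    writer.ingest({"sensor": i, "reading": i * 1.5})
writer.flush()  # don't strand the partial final batch
```

In a real pipeline the flush target would be an S3, GCS, or ADLS client and the batch trigger would usually be time-based as well as size-based, but the buffering-and-landing shape is the same.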
It can be the data science lab, which I suggest is probably its highest purpose in the organization. So if you don't get to all the others, be sure you provide that data science lab so that you can get into machine learning and artificial intelligence as an organization, and step up to that modern level of maturity with data. So the example components here, you see different ways to ingest the data, process the data, analyze the data. This does not have to be really complicated, okay? I've seen diagrams that just get pretty crazy with all the things that you supposedly need in a data lake today. Don't be overwhelmed, and definitely do it in an agile fashion. Remember, it's data in, data out. Okay, data in there is gonna be used. It has to be managed well, of course. You have to have some of these things that you see here. Means to process it, means to analyze it. A way to store the data, be that on cloud storage or object storage, different terms for that, or straight Hadoop, which mostly was on premises. Data lakes can be either, but really cloud storage is gonna be primarily in the cloud, except for these one-offs that mega corporations can engineer. All right, data lake setup. Managed deployments are in the Hadoop family of products. That doesn't mean that the data is organized on storage as Hadoop, as HDFS. There is a little bit of, I'll say, architecture on top of HDFS that is not present in cloud storage. What I thought, what many thought, was that that little bit was necessary. And that's why Hadoop was gonna take off. And it did for a while, but we kind of learned that in terms of price performance, cloud storage is more where it's at for a lot of the data that we have in our shops today. So it fits the need. You can also have external tables in a Hive metastore that point at cloud storage. And cloud storage, the big three here: Amazon S3, Google Cloud Storage, and Azure Data Lake Storage, in its latest Gen2.
To run SQL against the data, HiveQL and Spark SQL require entries in the metastore. You can have data, in other words, almost SQLized, if that's a word, in your cloud storage. And that's pretty important for the functionality that I'm talking about here. Hadoop instances and clusters have local storage on the physical drives, mounted to the instances themselves, as HDFS and Hive. Managed Hadoop technologies access their cloud vendor's respective cloud storage. That's pretty important to know. Amazon EMR, that's Amazon's managed Hadoop, right? It accesses Amazon's cloud storage, S3. Dataproc, Google Cloud Storage. HDInsight, Azure Data Lake Storage, et cetera. Local storage is used by the Hadoop platform for housekeeping. So the data warehouse of the future, I've been building up for this, but okay. We're gonna pair a lake, which I've defined as cloud storage, with an analytical engine that charges only by what you use. It's a nice balance for enterprise spend. And by the way, why am I bringing up spend in an architecture discussion? Why can't I leave that to the accountants and so on? You cannot do that anymore. It used to be that we could get by, all of us, right? We could get by with our great architecture skills. Not so anymore. We have to bring price into the picture in order to be successful at this job, at this role, at your role today as a builder of such things. Because there are so many options today, you have to put a price on data. The data is becoming so highly important. There are platforms that are appropriate, platforms that are not, and some of it does have to do with price. So you have to bring price into the picture. So if you have a ton of data that can sit in cold storage and only needs to be accessed or analyzed occasionally, store it in Amazon S3, Azure Data Lake Storage, or Google Cloud Storage, and use a database that can create external tables that point at the storage.
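As a sketch of what "external tables that point at the storage" looks like, here's a small Python helper that builds a HiveQL-style CREATE EXTERNAL TABLE statement. The table name, columns, and bucket are hypothetical examples; the exact DDL dialect varies by engine (Hive, Spark SQL, Redshift Spectrum, BigQuery all differ), so treat this as the general shape, not a definitive syntax.

```python
def external_table_ddl(table, columns, location, fmt="PARQUET"):
    """Build a CREATE EXTERNAL TABLE statement that points a SQL
    engine at files already sitting in cloud storage (no data copy)."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS {fmt}\n"
        f"LOCATION '{location}'"
    )

# Hypothetical table over a hypothetical bucket.
ddl = external_table_ddl(
    "clickstream",
    [("user_id", "BIGINT"), ("url", "STRING"), ("ts", "TIMESTAMP")],
    "s3://example-lake/clickstream/",
)
print(ddl)
```

The key design point is the LOCATION clause: the engine registers only metadata in the metastore and reads the files in place, which is what makes the lake queryable without another copy of the data.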
This works in every one of the aforementioned frameworks, Amazon, Microsoft, and Google, okay? So analysts can query directly against it or draw down a subset for some deeper or more intensive analysis. And your fees are gonna be much less than leaving it all in a data warehouse. Data warehouses are great, but if we're still down around that 20%, that Joel talked about, of data that's actually being used, it doesn't all have to sit there. I do wanna grow that number in the organization, by the way. 20% is not nearly enough of data that's being used. Now, I agree that data has hotspots and that we shouldn't over-engineer around this, but more data should be used, and more of us should be knowledge workers in an organization. That's part of our job, I think, to get data enabled in the organization. And the data lake certainly helps with that. So here's some notes on this idea of the data warehouse of the future, or the data lake of the future, if you will, okay? You get more achievable separation of compute and storage, which gives you much more flexibility and also saves you money, because you can choose profiles that make sense for you. Compute resources can be taken down, scaled up or out, or interchanged without data movement. Storage can be centralized. I'm not gonna read the whole thing, but there you see some notes; get the slides so you have the notes. And you have this sample cluster configuration. This is BigQuery combined with Google Cloud. I would say this is a starter set, and the numbers may not be completely up to date. That's not the point. The point is to show you that, yes, even though it's supposed to be easy, there are still a lot of decisions to be made here as you step into one of these data lakes, and here are some of them, all right? And there are different profiles, and no two profiles are alike across the vendors.
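To see why the pay-for-what-you-use pairing can be attractive, here's a back-of-the-envelope Python sketch comparing an always-on cluster against cold data in object storage plus a per-TB-scanned query engine. Every rate in it is a hypothetical placeholder, not any vendor's actual price list, and real bills have many more line items (egress, requests, tiering), so this only shows the shape of the comparison.

```python
def monthly_cost_always_on(nodes, rate_per_node_hour, hours=730):
    """A warehouse cluster that runs all month, used or not."""
    return nodes * rate_per_node_hour * hours

def monthly_cost_lake(storage_gb, storage_rate_gb, tb_scanned, scan_rate_tb):
    """Cold data in object storage plus a pay-per-query engine."""
    return storage_gb * storage_rate_gb + tb_scanned * scan_rate_tb

# Illustrative numbers only -- check your vendor's current pricing.
always_on = monthly_cost_always_on(nodes=4, rate_per_node_hour=1.10)
lake = monthly_cost_lake(storage_gb=10_000, storage_rate_gb=0.023,
                         tb_scanned=20, scan_rate_tb=5.00)
print(f"always-on cluster: ${always_on:,.2f}/month")
print(f"lake + on-demand queries: ${lake:,.2f}/month")
```

The crossover depends entirely on how much you scan: occasionally-touched cold data favors the lake side, while data that is hammered by queries all day can easily favor a provisioned warehouse, which is exactly why price belongs in the architecture discussion.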
That makes it a little bit challenging if you're trying to compare vendors, but anyway, you've got compute costs there that you have to consider, and there are some goodies in, for example, Google Cloud's premium tier that you may want to avail yourself of. That'll bring it up to about $1.10 per node per hour. Of course, this may have changed since I did the slide, and some of the details are inherent in here. So, tips: if possible, configure remote data to be stored in Parquet format. This is just something that I've come to, as opposed to comma-separated or other text formats. I like Parquet format for analytical usage. It allows me more flexibility in my design, allowing me to have super long rows and not worry about it. As new data sources are added to cloud storage, use a code distribution system like GitHub. If you haven't stepped up to GitHub or a code distribution system, now may be the time, because you need to be highly efficient, you need to be sharing information across the enterprise, not continually reinventing the wheel. The whole, what's the word I'm looking for, the whole path to production, if you will, needs to be streamlined for you to take advantage of data science, and that's really where it's at today. So this is a part of that. Use data partitioning to improve performance, but don't forget, new partitions have to be declared to the Hive metastore. Some of this gets detailed, encryption there. Oh, co-locate compute and storage in the same region, I should mention that. You have the option not to, but we've found some performance benefits if you do. And don't forget that data in the data lake, and I could give a whole talk on this, but data in the lake still requires data governance. Okay, it still requires data quality standards. This data is still gonna be relied upon, and I find with data quality across the board, really, the hard part is getting started with data quality.
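The partitioning tip above has two halves: laying data out in key=value partition paths, and declaring each new partition to the metastore so engines can see it. A minimal sketch, with made-up table names and paths:

```python
# Sketch: Hive-style partition paths, plus the ALTER TABLE statement that
# declares a new partition to the Hive metastore. An engine won't see data
# dropped into a new partition directory until the partition is registered.
# Bucket, table, and column names are invented for illustration.

def partition_path(base, **parts):
    """Build a key=value partition path, e.g. .../dt=2020-01-15/region=us/."""
    segments = "/".join(f"{k}={v}" for k, v in parts.items())
    return f"{base.rstrip('/')}/{segments}/"

def add_partition_ddl(table, **parts):
    """Build the DDL that registers one new partition with the metastore."""
    spec = ", ".join(f"{k}='{v}'" for k, v in parts.items())
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec});"

path = partition_path("gs://example-lake/events", dt="2020-01-15", region="us")
ddl = add_partition_ddl("events", dt="2020-01-15", region="us")
```

Partition pruning is what buys the performance: a query filtered on `dt` only scans the matching directories instead of the whole dataset.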
The hard part is breaking through, getting some business input, and making some changes to that first set of data that you know is wrong. Then it quickly gets wheels under it and spins out from there into a place where you have great data quality across the board and the data can be trusted. And believe me, that trust is necessary even in the data lake. Now let me wrap up my part here, and don't forget we're gonna have Q&A coming up. So if you have any questions for Joel or myself, go ahead, put them in the Q&A panel and we will get there in just a few minutes. The data science lab role of the data lake. I said this was a very important role, right? Okay, artificial intelligence and machine learning is looming on the horizon in every piece of software that you're gonna be buying, and really a lot of the things that you're gonna be putting out. It should be, it should be. One thing I like to say, and I give a whole talk on AI, which I won't do right now: wherever you're thinking of BI, think of AI. That's a simple way to start changing your mindset. Consider data integration. Data integration is about to be turned on its head in terms of all the work that's seemingly required to do it. Data integration is taking advantage now of AI. It's much faster to do data integration today if you take advantage of this. You can predict with high accuracy the steps ahead, or it can, I should say, the software can. It can really help you out with some predictive analysis of where you're going with your data integration and fixing bugs in your work streams and so on. Machine learning is being built into databases so the data will be analyzed as it is loaded. As it is loaded. I'm just trying to get your attention here to artificial intelligence and machine learning. It's not just for the software that you buy. It's for the goods and services that you put out as well. So it needs to be injected in there as well.
So, machine learning being built into databases so the data will be analyzed as it is loaded. I had a chance just this morning to be thinking about my data maturity model and how we access data, and how level one is the departmental spreadsheet. Okay, ugly, but it still happens. Level two is reporting. There's a lot more to it, but then we get into dashboarding and self-service dashboarding and machine learning algorithms. But eventually I think we have to say that a high level of maturity of data usage now has to be this third bullet: machine learning being built into databases or data stores, so the data is analyzed as it's loaded, and the data immediately, as it's loaded into this great platform, gets sprung into action wherever it needs to go. Does it need to go to call center reps? Does it need to make calls? Does it need to fix the supply chain? Does it need to change a product specification? Whatever it needs to do, it does. And this is the enterprise of the future, and this is what leading companies now are thinking about as they build their enterprises. So finally, the final bullet here: the split of the necessary AI and ML between the edge, the corporate users, and the software itself is still being determined. How much logic do we have to put out there on the edge in an IoT environment? How much logic has to go out there? That takes up space, and we want that to be small, et cetera, et cetera. That's all still being worked out. We need great architects to be doing that today. You need training data, not just data. You need to be allocating some proportion of that data to training. And you must have enough data to analyze and build models, and then data left over to actually make some business gains with it. Your data determines the depth of AI that you can achieve. So we want to be storing all of our data. Artificial intelligence algorithms can get quite deep today if we let them.
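The point about allocating some proportion of your data to training, with data left over to evaluate against, is commonly done as a reproducible random split. A minimal sketch, where the 80/20 ratio and fixed seed are conventions rather than anything prescribed in the talk:

```python
import random

# Sketch: hold back a slice of the data for model training vs. evaluation.
# A fixed seed makes the split reproducible, so the same rows land in the
# same partition every run; the ratio and data here are illustrative only.

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = rows[:]  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))          # stand-in for real records
train, test = train_test_split(rows)
```

The held-out portion is what lets you measure whether the model generalizes, which is the "business gains" half of the allocation described above.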
So a lot of people ask me, when we do our AI strategies for companies and so on, where do we store all this data? Where do we store all this data? And the answer is not just simply a data lake. The data lake is a big part of it, though. The answer is really a great data architecture, which has the warehouse, the lake, data marts, master data management, other things in it, fit for purpose for that enterprise, but the data lake surely is one of them. And this is some of the data that we think about the data lake for. It's AI data. It's data that will support AI algorithms: call center recordings and chat logs, streaming sensor data, you can read the rest. Not all of this will apply to you, and much, much more than this will apply to you in your context. You just have to think about your context in these terms, and think about it in terms of the future, of which the data lake is going to be a big part. And this concludes my lecture time here. I'm gonna turn it back over to Shannon now for any Q&A that you may have for Joel and myself. Thank you. Perfect, William. Thank you so much for another great presentation. Really appreciate it. And if you have questions, feel free to submit them in the bottom right-hand corner in the Q&A section. And just to answer the most commonly asked question, a reminder: I will send a follow-up email for this webinar by end of day Monday with links to the slides, to the recording of this session, and anything else requested throughout. So diving in here, for both of you: some time ago I made a comment on a data lake article saying that a data lake is a data warehouse, and another commenter criticized me, stating that a data lake as a data warehouse is creating a data swamp. But I disagree and would like to know your thoughts on that. You want to start, William? Okay, okay. Well, it could have been that that whole interchange was at an earlier point in time.
And I'd be curious as to what the person who equates that combination to a swamp may think today. Because, and I think I pointed this out quite extensively in the talk, we really have to define our terms a lot better. Because is a lake now, you know, that cloud storage at the back end of a great data warehouse? So the data warehouse can reach into the data lake storage. Do we want to call the whole thing the data lake? I don't know. I don't really care. Let's just be consistent about it. So if I'm going to call it something, I'm going to call the whole thing the data lake. I'm going to say the data lake now is encompassing the warehouse, and the warehouse is encompassing the lake. So we'll get this all straight. But the important thing is the architecture and how it works together. So I definitely don't think that you're creating a swamp by thinking this way. Yeah, I think that's a really good point, William. And now that Amazon Athena can reach into S3 and read Parquet files, and BigQuery can reach into Google Cloud Storage and read Parquet files, right? What we're seeing is the collapsing of that tiering. Now, there's actually a question here about data virtualization as well, which is a very similar topic, which is: why do I need to move my data around? And so just to talk about data virtualization a little bit, right? Data has weight and has gravity. It is hard to move it around. But I think, William, your point is, if it's cheap to put it in S3 in Parquet format, drop it there. It just makes sense. And you can still access it effectively through the same SQL query mechanism that you're going to attach to Athena. So it just makes your life super-duper easy to be using these new technologies. And so it's not a swamp. What it is, it's almost a single interface into the whole set of your data, both aggregated and not aggregated, structured and semi-structured, all through one ability to query in one place.
Yeah, I mean, if we want to use that to segue into the data virtualization question: I didn't say much about it, but I'm a huge proponent of data virtualization. And as you go forward, if you accept what I say here today, you're going to have a lot of platforms for a lot of different purposes. And I didn't really elaborate on the fact that the warehouse isn't fit for everything and the lake isn't fit for everything. Okay, you're going to have many things, many artifacts in your analytical environment and your operational environment. But you can't just put all data everywhere. That's just not realistic. And there are going to be times when you got it wrong. You didn't put all the data together that now needs to be accessed together. And let's face it, there's going to be some time lag before that happens. So, virtualization to the rescue in cases like that. And sometimes you can lean on virtualization for quite a bit. But I don't ever want it to cover up great design. That data architecture design is still required. Virtualization is not the catch-all for that. And what I'm trying to leave out there for you, once again, is great judgment in terms of knowing what the bounds of virtualization are going to be in your organization, and when you're not going to cross that line. And that has to do with query performance, it has to do with data being accessed all together in a single query. And cost could come into the equation as well, priorities also. So all of that kind of thrown on the table together says that you need some data virtualization, but you don't want to count on it for everything. And just so everybody has it, the question specifically on data virtualization is: how does data virtualization work in these environments, particularly in IoT? Yeah, my only guidance on data virtualization is that latency is real, and it does exact a performance toll. But if it's the best way for you to get access to that data, it is a great tool. Absolutely, I agree.
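The data virtualization idea discussed here, one query interface over data that physically lives in different systems, can be reduced to a toy sketch. The "sources" below are just in-memory dicts standing in for a warehouse and a lake; real virtualization layers add query pushdown, caching, and the latency cost mentioned above.

```python
# Toy sketch of data virtualization: the caller queries by table name and
# never learns (or cares) which physical source holds the data. All table
# names, rows, and sources are invented for illustration.

WAREHOUSE = {"orders": [{"id": 1, "amount": 120}, {"id": 2, "amount": 75}]}
LAKE = {"clickstream": [{"id": 1, "page": "/home"}, {"id": 2, "page": "/buy"}]}

# The catalog is the virtualization layer's map of table -> physical source.
CATALOG = {"orders": WAREHOUSE, "clickstream": LAKE}

def query(table, predicate=lambda row: True):
    """Route a filtered read to whichever source holds the table."""
    source = CATALOG[table]
    return [row for row in source[table] if predicate(row)]

big_orders = query("orders", lambda r: r["amount"] > 100)
pages = query("clickstream")
```

The design point matches the talk: virtualization hides location, but it does not remove the cost of reaching a slow or remote source, which is why it complements rather than replaces good architecture.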
There have been a lot of great questions coming in the chat, too. And if you have questions, feel free to submit them in the bottom right-hand corner of the screen in the Q&A. We've got quite a bit of time here for some questions. Is there a set of steps to prepare data so it can be used for AI and machine learning? Well, I'll start. You don't apply the same kind of data quality standards to data that's just going to be used for AI. But for me, it's kind of hard to say this data is just going to be for AI. I mean, some data, you know, is kind of fringy. You know, it's the high-volume data. You know that your pedestrian end user is never going to get in there. But will your data scientists want to go in there kind of directly, in a non-AI way, and look at that data? It's hard to say. So to me, it's a value proposition in terms of adding as much quality as seems to make sense to the data wherever it's going to sit, regardless of the purpose. So to me, data quality is a value proposition. I take it much more seriously for the warehouse and for end users, because at the first sign of a hiccup, of a problem with the data, they're going to come for my head over here in IT or wherever I happen to be. Data scientists are much more interactive about this. And so I still apply data quality, but you may not apply it to the same standard. Yeah, that's right. I think that when you look at AI and ML, you need to really be talking about two kinds of data. One is the operationalized data: somebody has a question and they would like to run it through a pre-trained, pre-built engine, say for sentiment analysis or something along those lines, which is very common. The other one is training data for your data scientist team. And that needs to be highly granular, in the classic Kimball definition of granularity. And it also needs to be version controlled.
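One lightweight way to get that version control on training data is to fingerprint each snapshot, so you can later prove which exact data a model was trained on. A minimal sketch, with invented records; any stable serialization of your real rows would do:

```python
import hashlib
import json

# Sketch: fingerprint a training-data snapshot with a content hash. If the
# hash matches, the data is byte-for-byte the same snapshot; if a model's
# behavior changed while the hash did not, the change came from the model.
# The records below are made up for illustration.

def snapshot_fingerprint(records):
    """Return a SHA-256 hex digest of a canonical serialization of records."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

v1 = [{"id": 1, "label": "spam"}, {"id": 2, "label": "ham"}]
fp_a = snapshot_fingerprint(v1)
fp_b = snapshot_fingerprint(v1)                                  # same data
fp_c = snapshot_fingerprint(v1 + [{"id": 3, "label": "spam"}])   # data changed
```

Storing the fingerprint alongside each trained model gives the auditability discussed here without copying the whole dataset per experiment.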
So what you are pulling out needs to be consistent each time, and you need to have some type of auditability, so you can tell whether or not changes in the outputs of your models are the result of changes in the data or changes in the model itself. So that consistency is very important in the training data space. So I don't think that there's a prescribed set of steps, but I would say you really need to think about those two very different realms: the data scientist workflow, and then the operationalized AI/ML workflow. And when you're thinking about it, and it's very germane to the data lake, if you're the data scientist you need highly granular, very raw data, but you need a consistent set of that data every time. Perfect. Now, the hot topic of governance, of course, comes up, and you mentioned governance is still important in a lake. How would a new governance program approach applying governance and quality standards to an existing data lake? So, I like to apply data governance programs to so-called enterprise subject areas, not necessarily to the platform. And that means I want to get the rules out that apply to customer, apply to product, apply to supply chain, et cetera. And I want to apply that to the data in its most leverageable place. And the most leverageable places in an enterprise now are the data warehouse, master data management, of course, and the data lake. Because the data lake, you know, I didn't even really get into this too much, but the data lake can feed databases as well, all over the organization. So I want to attend to data governance by subject area as opposed to by platform. But then judgment comes into play to say that we may get more or less deep into applying the data quality rules and data governance standards to data that we're putting into a warehouse versus a lake. I hope that makes sense. I think it really does, William. And you know, at Looker, because of who we are and our technology, we're big on this.
We're big proponents of a unified semantic layer that is effectively the interface between access to the data and the data as it resides in that data lake and data warehouse construct. But I think that as you do that, you need to make sure that you're reflecting those data governance controls and security rules at every step of the way, right? Are you masking Social Security numbers or payment card numbers in the database? Are you masking them as they flow through the analytic tools? Are you masking them as they come out in query results for end users, right? You need to make sure that takes place. You need to make sure that governance rules about who has access to what data, based on the database and the function, are reflected in that analytic tool, as is the role-based access control system that you have for your whole company. So I think that one of the big challenges people have with governance and control of data, and access to data, and data quality, is where they do it. And the challenge is that today, with the best technology, it is done in multiple places, and it really does depend on your use case, what you're trying to do with it, and the rules you wanna put in place around it. Perfect, such a hot topic and so important to everything going on right now. Do you have any detailed architecture resources for building a data lake warehouse? Everything I keep reading is too generic and not specific enough. I think Joel probably just came into some great resources today with Google, who is one of the leaders in building these data lakes. I don't know if you're aware of any of those resources, but I have found some things out there and contributed some, frankly, myself. So yeah, they're out there, but Joel, are you aware of any in particular? You know, I can't mention any specifically off the top of my head, but what I would say is this: data lakes grow.
So think of the use case, think of the first piece of value that you wanna get, and start small. In these new usage-based consumption models, you're not gonna kill yourself starting out and seeing how it works, trying it, practicing, doing some skunkworks. It could cost you a few hundred bucks, whereas previously, like five, ten years ago, it would cost you many thousands of dollars to play around. So my guidance for you would be some of that. I mean, if and when you engage with a vendor, I'm sure they'll give you lots and lots of cheerful help that you will pay for. But I think that really the key here is that elastic consumption models and on-demand consumption models change the game in how much you have to do up front before you start deploying. You can really start small today. Yeah, and refer to the deck here. I gave you some parameters of a starter set on Google Cloud for a data lake, and those are parameters you'll need to bring into that provisioning. And yeah, now I know what I wanted to say. I just want to echo what Joel said: we have to have the ability to spin up, spin down. We have to have the ability to play, to get creative, to do things without purpose. And that's really going to help define the enterprise of the future: playing with data and going where the data leads us, as opposed to coming top-down with, you know, we're going to build an ERP, it's going to do this, that, and the other thing. Well, let's be open to the possibilities. What is the data telling us? Great advice. And so moving on to the next question here: data scientists and engineers expend more than 80% of their time and effort on data wrangling and need better data architectures to alleviate the problems. Any thoughts on this? I've got lots. William, you want to start? I'll leave it to your lots here in just a second. I'll just say it's still true.
If you have not, as an organization, attended to this matter specifically and with resources, not just, you know, "it's part of the job, improve your job," where nobody ever gets the time to do that. If you have not specifically focused on improving this, you are in that category. There's no doubt about it. Things have moved so fast in the past few years that you undoubtedly have let things slip and slide, and maybe you've found yourself in the accidental architecture. And you have to get a mindset outside of that, and get back to my earlier answer about getting creative, getting open, getting free, getting agile, and applying that sort of creative process. Or else you are going to be one of those who are continually spending 80% of their time wrangling data as opposed to doing anything with the data. Yeah, absolutely. I mean, core to our ideas at Looker is really to compress the amount of work required prior to analytics by building in DRY, don't-repeat-yourself, processes at every step, right? If you are manually scripting your ETL every time and not using something like Airflow, if you are spending a lot of time wrangling and shaping instead of modeling, if you are gluing together a lot of SQL snippets in order to do your queries, right, there are tools that exist out there that can make your life easier. And what I showed in my slides at the very beginning is that data isn't valuable before it gets into the hands of users. It only is valuable when it hits the hands of users, and that's where we need to be focusing our time: getting fresh, granular data into the hands of people who are making those decisions, so that they can analyze it and get better outcomes. So it's imperative we compress the early parts of the data pipeline. There are tons of tools in there.
I think William's conversation about this virtual compression of what's a warehouse, what's a lake, what's an operational data store, a data mart, or even what's an OLTP database, right? All these things are coming closer together. And if you have tools, and I can talk Looker all day, but if you have tools that are able to look at that granular data in the data lake and query it as it is, in its current schema, with the same performance as if you had some incredibly expensive on-prem database from one of the big vendors, well, then that's the right way to go. Because it gives you the flexibility and agility that William's talking about, and that ability to really just bring the data in, build a model once, query it as much as you'd like, and get the data to whomever needs it. That's really powerful. I'll leave it at that; I'll start getting a little too Looker-y if I go any further. I love it. How does the data lake deal with change over time, not only from loading, but also extracting meaningful data over time as business rules change? We've just got a minute left here, but we'll see if we can slip in a quick answer. Well, it's typically a load-only environment; we're not changing data that exists there. That was part of the question, right? How to deal with change. What was the second part, Shannon? Extracting meaningful data over time. Okay, and two different levels of business rules and so on. This is inevitable, and this should all be part of a data quality program where you know the standard that data is at, depending upon, say, the date range that it exists in out there in the data lake. Because you can't go back in time and say, oh, we got a new business rule here to improve data quality, so let's go apply it to all the data in the data lake.
Well, you just have to know. You put that in metadata and you make that available to the end user, so that they can make heads-up decisions about the data, because they know what level of standards it was held to. And you've got to do this, because if you don't, you are going to wait forever to get the perfect data quality standard and to make the data perfect, and you're never gonna get there. So that's unreasonable to think about. What's reasonable to think about is agile data quality: doing something today, doing something next week, and so on, and knowing that that might leave data out there at different levels for you. But you just have to communicate that and pre-train your users that this is how it's gonna be and why it's gonna be that way. I think too often we under-communicate what it is that we're doing, and we cross our fingers and hope the users won't notice and we'll just go with it, but that can really mess them up. So I say communicate it in metadata, communicate it otherwise, but you can do it. I agree. Yeah, I would also add to that: changes will happen. So if you've got a DAG in your ETL process that can alert you, that will save you a ton of time. And if you have something like a sophisticated semantic layer, what you can do is mask those changes in the data while letting people continue to run the queries that they're used to. So those are two ways we see it. Let me now hand it back to Shannon at the very end. Well, thank you both for the great presentation and Q&A, but I'm afraid that is all the time we have for today. Thanks to our attendees for being so engaged in everything we do, and for all the great questions. Just a reminder, I will send a follow-up email to all registrants by end of day Monday with links to the slides and links to the recording of this session. And thanks again to Looker for sponsoring. Thank you all. Thanks, everybody. Thanks, William. Thanks, Joel. Have a good day. Thank you.