Hello, and welcome. My name is Shannon Kemp, and I'm the Chief Digital Manager of DataVersity. I'd like to thank you for joining the latest installment of the Monthly DataVersity Webinar Series, Advanced Analytics with William McKnight, sponsored today by Matillion. Today William will be discussing using data platforms that are fit for purpose.

Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel, or if you like to tweet, we encourage you to share your questions on Twitter using hashtag ADVAnalytics. And if you'd like to chat with us or with each other, we certainly encourage you to do so. To open the Q&A panel or the chat panel, you will find the icons for those features in the bottom middle of your screen. Just note that the chat defaults to sending only to the panelists, but you may absolutely change that at any time to chat with everybody. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and any additional information requested throughout the webinar. Now, let me turn it over to Paul from Matillion for a brief word from our sponsor. Paul, hello and welcome.

Well, thank you, Shannon. And hello, good morning, good afternoon to everybody. Really excited to be with you today to talk about building data platforms and analytics platforms that are fit for purpose. I wanted to start out the talk today by giving you a vendor perspective from Matillion. Matillion, if you're not familiar with us already, is a platform-agnostic cloud data integration platform. That gives us a really interesting perspective on the industry: we get to work with a lot of the leading cloud databases and cloud analytics providers around the industry. So I'd love to get into that a little bit with you and show you how things have been evolving over the last few years. We're going to wind up at one of the newer architecture paradigms that has come about, called the lakehouse, and we're going to talk about why that is and why you might want to use a lakehouse, by way of taking a trip down memory lane through some of the paradigms we've evolved through in the past.

Go ahead and get your chat ready. I'm going to ask you a couple of poll-the-audience questions here in a minute, so I'll give you a second to pull that up and get it going. In the meantime, I'll introduce myself. As Shannon mentioned, I'm Paul Lacey. I'm a senior director of product marketing here at Matillion. I've been in the industry for quite a while and worked at a number of companies that specialize in big data and analytics infrastructure technologies, working with everything from Hadoop to OCR data extraction, and now at Matillion in the data integration space. Prior to that, I was an engineer; I like to call myself a recovering engineer. I was a hardware engineer by training, studied electrical engineering in school, and built a lot of analog and digital systems in the early parts of my career before transitioning into firmware design, then full software, and then on into product marketing. So really excited to share some of that experience with you today.

OK, so if we've got the chat open, let's start with a quick overview of the timeline of data warehousing, and we can see how we got here.
So first question: who out there wants to guess when the database was first invented? What year was the database invented? Pause for dramatic effect. We have 1971, 1968, the 60s, the 70s, 1930, 1950. Great, great answers. Yeah, 1950, 1960s, 1968. Yep, 1942. Perfect. 1950, 1955. All great guesses. Yeah, absolutely. 1800s to 25 BCE. I love it. I love it. 1838. Yeah, exactly. You know, it really comes down to what we define as a database, so that's a good question for sure. When we think about the invention of the database management system, there was an article published in the early 1960s, I think 1962, that Wikipedia credits with the very first usage of the term database management system. But it was absolutely sometime around then. It was a time of a lot of innovation and deep thinking, when people were coming out with new ideas and new concepts. That was a lot of fun.

Let's try one more. How about the data warehouse? When do you think the data warehouse was first invented? 1980, 1971, 1980, 1987, 1981, 1980, 1980, 1970, or yeah, 74, 78, 86, 93, late 80s. All right, you guys are on it. Absolutely. So just scooting along here: the invention of the term data warehousing is largely credited to an IBM Systems Journal article that came out in 1988. In between, of course, we had the invention of the relational database system, which, as some folks mentioned, was in the 1970s, and then the invention of the SQL language. Somebody says last year; that's funny. We could do this all day, but for the sake of expediency, let's jump to the end.

So we see a lot of innovation over the space in the last, call it 40 to 60 years. Just a lot of motion, a lot of new things. We see the publication of Inmon's Building the Data Warehouse, and then Kimball's The Data Warehouse Toolkit in 1996; Hadoop coming onto the scene in 2005; and then we start to move towards the cloud, and we see things accelerate from there. It's pretty fascinating to take a step back and understand just how far we've come. And if we zoom in and explode the last decade or so of cloud technology, things get even more interesting as they really start to heat up: Google BigQuery opening up in 2011, and Kafka not too far behind it; technologies like Redshift coming onto the market in early 2013; Snowflake not too far behind, in 2017; the open sourcing of Apache Spark around that time as well, with Databricks having been founded a year prior to that. So just a tremendous amount of innovation happening in the space today, which is really great news for operators and designers in the space, especially from an architecture perspective, as we see improvements with each new technology that comes onto the scene. And that is what we're seeing if we look at the major architectural paradigms over the last 15 to 20 years: the effort-to-reward ratio has dropped dramatically as innovation has increased through each of these architectural paradigms. So let's walk through each of them quickly, talk about what they are, and then come around to answer the question of why the lakehouse came about and what problem it's trying to address.
Starting out, many of you are probably familiar with what we call the original big data stack. This is what a lot of people built back in the early 2000s and 2010s: you have a fat ETL pipe that pulls or extracts information from its respective sources and passes it through. If you want streaming analytics, you do it on the front end, directly off that ETL pipeline, because you need access to that data at very low latency. Then that data was stored and stashed in the data lake, right? Because at the time we didn't have a lot of resiliency in our data pipelines, and we wanted to make sure that as schema drift happened, we didn't lose data and pipelines didn't break and cause holes in our production data set. So we stashed it in a data lake, did additional processing on it in that schema-on-read environment via another ETL job, and then loaded it into an enterprise data warehouse, which was probably sitting on-prem somewhere. That's kind of where things started. A little bit of lightweight data mining was going on in the data lake, kind of foreshadowing the data science revolution yet to come. But this was the original place most people started when they began working with big data.

What happened next, circa 2013 to 2015, depending on what you're looking at from a timeline perspective, is that the tools evolved, and the pipeline tools got better at flattening data. They could extract nested data and flatten it in flight, which made it possible to stage data directly in a staging table in the data warehouse and then run data transformations on that warehouse, either via pushdown or by extracting the data from the warehouse, processing it, and putting it back into another table in the downstream warehouse. The paradigm was all about moving faster by getting data into production systems sooner. And then we split off the data lake, which stored a lot of different types of data in service of the growing data science needs of our organizations; that was accessible via data prep to some of the data science workloads as well. But those two were very separate parts of the stack, and in many cases still are, which is what we'll talk about in a second. A lot of people are still operating here today, by the way. This is not necessarily old news, and it gets the job done for a lot of folks.

But what we see at Matillion is that a lot of people who move to the cloud want to take advantage of the elastic scalability and modern compute available in some of these newer cloud platforms. You don't want to be doing compute at all in your pipeline or your ETL product; you want to be designing logic and pushing it into those production systems to take advantage of their scalability. And so that's where we get to what we call hybrid storage, where we're now able to take in and extract data without even flattening it, thanks to advancements in many data warehousing technologies that can store semi-structured data natively. We can take all that information and stick it directly into a staging table in the database.
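For flavor, here is a minimal sketch of that in-flight flattening step from the middle era, in Python, with the record shape and field names invented. Real pipeline tools do this with far more care around types, arrays, and schema drift:

```python
# Minimal sketch of flattening nested records before staging them in a
# flat warehouse table. Field names and the record shape are invented.
import json

def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts so each row fits a flat staging table."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

raw = json.loads('{"id": 1, "user": {"name": "ada", "geo": {"lat": 53.48}}}')
print(flatten(raw))  # {'id': 1, 'user_name': 'ada', 'user_geo_lat': 53.48}
```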
We can run pushdown commands back and forth, process it through varying degrees of data cleansing and data preparation routines, and get it ready to go into some of our traditional BI and analytics applications around the business. We're also seeing an increasing connection between the data lake and the data warehouse, either via a cloud ETL tool or via some sort of virtualization strategy, which a lot of companies are pushing today. That enables bringing data science closer to the production data in the data warehouse as well. That's one of the dynamics we see happening in the industry today, and a lot of people are operating under this paradigm. This is almost best in class, if you don't consider the evolution we're about to talk about next, which is the lakehouse.

One of the interesting things we've noticed in traditional analytics over the last few years is that as companies move towards the cloud and embrace more and more of the capabilities that exist there for processing unstructured data, semi-structured data, multimedia data, and so on, that motion is becoming more and more standard in enterprise analytics. What that means is that people need to bring more and more data science into their environments: hence the need for virtualization, hence the need for tighter coupling between the data lake and the data warehouse. And you see that data science now sits between these two stacks. Unfortunately, when it comes to productionizing these workflows on a traditional stack, things are very separated and siloed. If you look down the list here, and you'll recognize a lot of these technologies, each of these teams has its own answer for each layer of the stack, each part of the journey, and the two sides don't really match up very well. So it makes it very difficult to take, say, a data science experiment and turn it into a production data pipeline, and vice versa, to get production data into data science environments without duplicating it. Which is where we come to the last and final concept today: the lakehouse.

So what is a lakehouse? At the end of the day, we're trying to combine data science and data engineering workflows into a single unified environment, with a change in architecture that comes down to how you think about your storage layer. A lakehouse really combines the best of both worlds. It takes the performance and scalability of fast queries in a data warehouse (a columnar-based, highly structured environment with lots of indexes and lookup tables and things like that) and merges that with the schema flexibility and data format flexibility of a data lake, with data in a schema-on-read type environment. And if you can do that well, which some folks in the industry, particularly Databricks, are running after with a high degree of success, you can get to an environment where you have your data, you have an ACID-compliant layer, and you've got a performant SQL query engine on top of that which can deliver all of your streaming analytics, BI, data science, and machine learning needs from a single place, which is quite interesting. So we wind up with a paradigm where you can extract your data, all data sources, into one environment.
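As a concrete taste of that storage-layer change, here is a hedged sketch using open-source Delta Lake on Spark, one implementation of the ACID-table-on-lake-storage idea; the paths and table names are invented, and the local path stands in for cloud object storage:

```python
# Hedged sketch: an ACID table layer over lake storage, using Delta Lake
# on Spark (one open-source route to the lakehouse pattern described above).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Land raw schema-on-read data as an ACID-compliant table on lake storage.
events = spark.read.json("/tmp/lake/raw/events/")  # invented path
events.write.format("delta").mode("append").save("/tmp/lake/bronze/events")

# Serve BI, data science, and ML from the same files with a SQL engine.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA "
          "LOCATION '/tmp/lake/bronze/events'")
spark.sql("SELECT count(*) FROM events").show()
```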
You can push your transformations into that environment, and then you can run all of your enterprise analytics, machine learning, and AI applications off of a unified data set.

Paul, just to let you know, we're a little over time.

Oh, thank you, Shannon. Thank you. Yes. So I'll go ahead and hand it over to William. Thank you all for your attention. If you want to learn any more about the lakehouse, we have a lot of information for you at Matillion.com. You can learn how a performant data integration environment can help you stand up and build a lakehouse, and learn a lot more about the business benefits of the lakehouse there as well. I'm happy to have some more conversation about this when we get to the Q&A section. Thank you very much for your attention. Now over to you, William.

I love it. Thank you so much for kicking us off, and thanks to Matillion for sponsoring. I hate to cut you short, because it's such great content, but this is awesome. There are questions coming in already, and if you have more questions for Paul, feel free to submit them in the Q&A section of your screen, as he'll be joining us in the Q&A portion at the end of the webinar. And now let me introduce our speaker for the series, William McKnight. William has advised many of the world's best-known organizations. His strategies form the information management plan for leading companies in numerous industries. He's a prolific author and popular keynote speaker and trainer. He has performed dozens of benchmarks on leading databases, data lakes, streaming, and data integration products. And with that, I will give the floor to William to get his presentation started. William, hello and welcome.

Thank you, Shannon. And thank you, Paul. That was a really informative talk. I love that walk through history that so many of us were a part of, and the history just marches on in this space. Obviously, under the lakehouse umbrella there are components, and I want to focus in on that and talk about how we select those components, because when it all comes together, as Paul showed us, it's a beautiful thing. But it's not a beautiful thing if those components don't work well together, or if they are lacking in an area you were expecting. So that's what I want to go through. Now, we've had a little back and forth on swapping displays today with Zoom, so I hope that I'm sharing the right thing; come on in, Shannon, if I am not. And I apologize up front: I have a little histamine going today. Hopefully that doesn't come through too much in my voice or otherwise.

I'm going to start today with the number one thing that people ask us about, which I think is pretty much the number one complaint and the number one expectation of these platforms. Obviously, I'm focusing on analytical platforms here today, but I do think a lot of what we're talking about, or most of it, has everything to do with an operational platform as well. So performance is very important. Performance is a critical point of interest. And to measure it, we use similar price specifications across data warehouse competitors.
The reason I say price, and I am speaking of our benchmarks, but really how I think any enterprise should be looking at it as well, is because the componentry is quite different when you go from one cloud to another: AWS to Azure to GCP to IBM Cloud to Oracle Cloud. Things are different. And we are definitely marching hard towards obfuscation by the vendors in terms of what it is you're actually getting. T-shirt sizes, extra small to large, that's sort of the way it's going. You can try to figure out what's under the covers; sometimes you can, and sometimes you can't. And it will change over the course of time, so you just have to be aware of that.

Usually when people say they care about performance, the ultimate metric is price performance. Obviously, you could throw a lot more resources at something and make it hum; with most of these things you really can, and there's no top end as long as you're willing to pony up. But that's not really what we do in enterprises, right? We spend wisely, and we care about things like budget for these platforms. So price performance is the ultimate metric. The reality is that creating fair tests can be overwhelming to many shops, and it is a task usually underestimated. I'm going to give you a checklist at the very end of this so that you can do this for yourself: your own benchmarks, a fair test of the platforms under consideration.

Excuse me, William. I think we might have paused the share. The slides are not advancing.

OK, how are we doing now?

We are on the presenter screen, so I think we need to flip over.

Well, see, this is what has been happening. Oh, thanks, Bob. There we go. OK, let's try this. All right, I don't know why it stopped sharing on me. But anyway, not important.

The perils of performance alone. This is a problem that I find a lot of shops get into: they look at performance, they get overwhelmed by it, because it takes up way more cycles than they thought, and it ends up being the only thing they look at. While I obviously admit it's very important, there are other things we need to be looking at. A modern workload is less frequently a set number of queries and more an interactive, variable number of queries. So if we just look at query performance, that does not tell us how many queries are going to need to run because of the other things that are lacking in the platform. How much more hard work are we going to have to do because the SQL is lacking, or the administration is lacking, or something like that? So it is definitely important to look at some other things, which I'll share with you here. There can be hidden downsides to some data warehouse platforms with features that appear beneficial and desirable. So do not let the vendor sway you by saying, hey, look at this recent release and all these very important things, which may or may not be important to you. Be sure you keep your needs front and center as you make this determination.

Now, today we have many platforms. We have pretty much the AWS stack and the Azure stack, and I'm going to call something a heterogeneous stack, or a Snowflake stack, which obviously has different parts in it. You could definitely mix and match your way around here, and sometimes that's how it grows up over time: all mixed and matched and so on.
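Since price performance keeps coming up as the ultimate metric, here is the arithmetic in its simplest form; the throughput and cost figures below are placeholders, not measurements of any real platform:

```python
# Price performance as queries per dollar: higher is better. All numbers
# below are hypothetical placeholders, not benchmark results.
def price_performance(queries_per_hour: float, dollars_per_hour: float) -> float:
    return queries_per_hour / dollars_per_hour

candidates = {
    "platform_a": (1200.0, 16.0),  # (queries/hour, $/hour) - invented
    "platform_b": (950.0, 11.5),
}
for name, (qph, cost) in candidates.items():
    print(f"{name}: {price_performance(qph, cost):.1f} queries per dollar")
```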
But I will echo what Paul said earlier: there are a lot more components to the modern workload, the modern enterprise handling platform, than there used to be. Now, I'm resuming share once again. I don't know why I keep stopping that, but okay. Now we're back on presenter mode. I don't know what's going on today. Okay, I'll swap again, and maybe this will just be the new normal for today till we figure this out.

Okay, the enterprise analytics platform. We've got data engineering, data analytics, data science, data catalog, workload management. I think this pretty much corresponds to a lot of things Paul was saying in his talk as well. You can see some of the players I slotted in there; this is not exhaustive by any stretch of the imagination, but we're gonna go through some of these, and I'm gonna call out what I think is really important.

And it paused on me again. Okay, play. All right, let me guess: I'm sharing the wrong thing. Okay, no feedback. I'm sorry, I'm playing with the mute button there.

Yeah, you are sharing this.

I don't know why. Okay, all right. Anytime I see that I am paused, I will resume and swap. I'm watching. My neck is getting a good workout here too.

That's all right. You can't neglect your neck; you don't wanna do that.

So, data is fit for purpose, FFP, when it is in a leverageable platform, an appropriate platform, et cetera. This is my criteria for whether your data is fit for purpose. I didn't say the data has to be perfect; I just said it has to be fit for purpose, and that's really as good as I can expect any enterprise to do. Now, that first word has taken on some new meaning: leverageable, in a leverageable platform. A data lake is ultimately leverageable because it's built for many purposes and there's a lot of data in there from across the enterprise. Same thing with the data warehouse; in terms of how those two come together, they're pretty similar. However, these days a lot of databases are multimodal, and that's a term you're gonna be hearing more about. These are databases that can handle not just relational, for example, but maybe graph as well, maybe JSON as well, maybe caching data as well, et cetera. Any two of them make something multimodal. Now, does that create a category? I don't know, but there are certain things you wanna look at if you're considering multimodal, and I can get into that in another presentation. We wanna look at all of the so-called models that the platform is supposedly good for and make sure they all fit the great norms of the respective models.

Leverageable can also mean a great use of data virtualization. You can have your data scattered about, maybe not in the good old Inmon-Kimball structure from one of the earlier slides Paul showed, but it can still work if you have great data virtualization on top, or even great data integration hard at work, making sure data is moved about in rapid fashion to all the places it's needed. So there are different ways to accomplish a leverageable platform besides building data lakes and data warehouses that have consolidated data. You can achieve leverageability by having these other things in place, and as long as you have them working to performance standards, I'd say that's leverageable.
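For a small taste of multimodal in practice, here is a hedged sketch with SQLite's built-in JSON functions standing in for a full multimodal platform: one engine serving relational rows and JSON documents side by side:

```python
# Sketch: relational rows and JSON documents queried together in one
# engine. SQLite's JSON functions stand in for a real multimodal platform;
# table and field names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, doc TEXT)")
con.execute("""INSERT INTO orders VALUES
    (1, '{"customer": "acme", "items": [{"sku": "a1", "qty": 2}]}')""")

customer = con.execute(
    "SELECT json_extract(doc, '$.customer') FROM orders WHERE id = 1"
).fetchone()[0]
print(customer)  # acme
```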
Okay, we're doing the thing again. All right, let me make sure I didn't skip anything. Okay, product setup.

And I'm gonna guess I might have to swap this for you.

Yeah, I don't know why, but okay. Let's look at product setup. Hopefully everybody's hanging in there with us today. One of the first things I wanna look at is cost predictability and transparency. Those are a couple of different things, and it's not just about absolute cost. The cost profiles are straightforward if you accept the defaults and don't get into negotiation, don't get your enterprise discount. But they're not the same from one to the other; you will have to spell it out. Take a workload and run it through, not just for performance, but also for cost. Look at your bill. Look at what the bill would be if projected to the enterprise workload it's eventually going to have to support, because initial entry costs and inadequately scoped environments can artificially lower expectations of the true cost. This is something I get a lot of calls on: such-and-such environment is costing me a lot more than I thought. Well, sometimes that's because you didn't know going in that you would need all the components I just listed for you, and Paul did as well. You might have thought of the old school: you need data integration, you need a database, you need some BI on top of it. But what about the other things? Today you really do need that data catalog, right? You need data engineering. You need all of those things. I'm going to go through some of these and help you with your cost consciousness.

Yes, we're still on product setup. And somebody recommended stopping the share, putting the slides in presentation mode first, and then sharing again, and that should fix the other things. I have stopped here. I don't know why this month is different from all the rest, but okay, let me share again. I'm sorry, everybody. Yeah, it's weird. It's being finicky today. Here we go. Hang in there. Okay. All right. Everybody cross your fingers. Let's move along here.

Cost consciousness and licensing structure. I do have a lot to cover. Be on the lookout for cost optimizations like not paying when the system is idle, compression to save storage costs, and moving or isolating workloads to avoid contention. I'm focused a lot today on the cost aspect, because that's where we find a lot of enterprises get surprises, and that's never good; it can sour the whole implementation, and that's not what we want. Now I'll show you on this slide, again back to my big four stacks (and by the way, there are plenty more; I'm just using these as an example), some examples of what you might select for the implementation. We like all the ones you see here; they're kind of typical for data warehouse and analytical workloads. And there you can see the respective costs. These are, believe it or not, about as close as they really get out there when you're comparing the platforms. So again, we try to make the costs as equal as can be, and then we use that to put the evaluation in place. And of course, price performance is going to be the ultimate equalizer at the end of the evaluation.
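The projection William describes is simple arithmetic; the point is to actually do it. A hypothetical sketch, with every rate and multiplier invented:

```python
# Project a monthly enterprise bill from a small POC run. All figures are
# hypothetical; substitute what your own trial actually measured.
poc_hours, poc_cost = 40, 180.0        # measured during the trial
enterprise_hours_per_month = 24 * 30   # always-on production system
scale_factor = 8                       # growth in data and concurrency

hourly_rate = poc_cost / poc_hours
projected = hourly_rate * enterprise_hours_per_month * scale_factor
print(f"Projected monthly compute bill: ${projected:,.0f}")
```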
So we try to be as equal as we can at this point, knowing we can't be exactly, and price performance at the end is going to be the ultimate equalizer. Just remember that as you take a look at these platforms.

We're also looking for ease of administration. Over time, every shop is looking to do more with less, right? The things we used to have to do the hard way, we hope get automated, and so on. Frankly, a lot of data integration is heading hard that way, so you've got to roll with this and find the things we're still needed for, and there's plenty. But some of the things around administration are things we believe can be automated. Overall cost, in time as well as storage and compute resources, is affected by the simplicity of configurability and overall use. Yes, when you go to the cloud and select your setup and do all the things it takes to get to hello world, that is important. It's an indicator of how easy or difficult it's going to be to get things done going forward. Of course, all you're doing there is the very basics, but it is nonetheless an indicator. A great database engineer, and obviously you have or are that in your shop, should be able to get any of these up and running in short order. And if not, it may reflect on the engineer, but it probably also reflects on the vendor and the documentation. Don't let them make you feel dumb because you didn't set parameter XYZ to one-two-three when it was never documented to do it that way.

The optimizer. The warehouse should be designed for complex decision support and machine learning activity in a multi-user, mixed-workload, highly concurrent environment. Optimizers should still be under heavy development out there in all databases. I'm always checking on this in my briefings and my analysis: what have you done lately for the optimizer? Because no optimizers are, shall I say, optimized. They still need a lot of work, and some are much better than others. You want to check on the things you can learn about the optimizer. Does it have conditional parallelism, for example? Check on dynamic and controllable prioritization of resources for queries, which obviously bleeds a little over into workload management. But the optimizer category, yeah, that's pretty important.

Now, those were categories, and now I'm getting into more of the hard, quantifiable stuff you look at when you're selecting a platform, like dedicated compute. The dedicated compute category represents the heart of the analytics stack, the data warehouse itself. It's not the only thing, but it's a big part of it, and a lot of times when people talk about the warehouse, this is what they mean. Now, all the vendors (and again, I'm limited in space here, but there are others: IBM obviously, Oracle, on and on; there are other stacks) have pricing kind of like what you see here. They'll all do an enterprise agreement for you, by the way. But believe it or not, not every enterprise is at the level where they need that, or where that's ultimately going to make sense with every vendor. Even though it's pay-as-you-go stuff, they all like the bigger commitments; obviously that helps their cash flow. So you've got to keep that in mind.
You can obviously proof-of-concept stuff all day long at the pay-by-the-hour rate, but eventually you might want to kick it over to a commitment, or at least explore that option. And there are some different terms in here too as you go from one to the other. Google has this notion of slots, and some of the others conform to what's more or less called a node, the term we have learned to love over the years, but they don't all even acknowledge that term.

Now, dedicated storage. Dedicated storage represents the storage of the enterprise data, obviously, but in former days it was tightly coupled to compute. Now it's not, in any of these, and it's priced separately. As you can see, you will pay on the order of two to four cents per gigabyte per month in each of these. This doesn't usually tend to be a big part of the overall cost, but it is nonetheless worth your look.

Data integration. Here I'm showing you some of the crude, if you will, data integrations. Of course, we don't have that for Snowflake; we need a product like Matillion. But the others have their own kind of load tools. It's not really data integration, I probably shouldn't have called it that; it's just the load tools: Azure Data Factory, AWS Glue, and Google Dataflow. Talend is an example, Matillion, and on and on; we know our favorites in data integration. So we need one of those for some stacks, and I would say we really need them for a lot of stacks, even in Azure, AWS, and Google, because the tools I'm showing you here don't do enough for an enterprise workload.

Then you're going to have to access the data. Once again, I'm showing you the crude query engines provided by the various vendors, not the robust engines like the Lookers and the Tableaus that you're probably really going to need quite a bit of. But with these tools, Azure Synapse Serverless, Amazon Redshift Spectrum, Google BigQuery, and Snowflake, you can access data in kind of a crude way. Not really fit for the typical end user, I would say, but good enough to get started, perhaps. I do know there's more opportunity there; I'll put it that way.

The data lake. Okay, I'm a big proponent of the data lake. We could talk extensively about why that's needed in a modern enterprise; Paul touched on it as well. One of the things Paul said was that it is a foundation of the lakehouse, and hopefully you're okay with that term; I've grown to be okay with it: the lake and the warehouse working together. So the lake category represents the use of a data lake that is separate from the data warehouse. This is common in many data-driven organizations as a way to store and analyze massive data sets of colder data that don't necessarily belong in the data warehouse. Now, in the architectures I typically do, we have the lake, and the lake also serves as staging to the data warehouse. So that changes the statement a little bit here, in that all the data in the warehouse did actually go through the lake and is still there, by the way, in the lake. So the lake is, I don't wanna say everything, because nothing is everything, but it's a lot of the data in the organization: all stripes, all data types, all data sources, et cetera, as much as you can get. It depends on your roadmap and where you are in that journey.
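The storage line item, at least, is easy to sanity-check against that two-to-four-cents figure; the footprint below is invented:

```python
# Back-of-the-envelope dedicated storage cost at $0.02-$0.04 per GB/month.
tb_stored = 200  # hypothetical combined lake + warehouse footprint, in TB

for dollars_per_gb in (0.02, 0.04):
    monthly = tb_stored * 1024 * dollars_per_gb
    print(f"${dollars_per_gb:.2f}/GB/month -> ${monthly:,.0f}/month")
```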
I don't think anybody has a lake that they're completely happy with and is done. Mostly what I see is that if you have some success with the data lake, you want more: you wanna start putting more data in there. Hopefully you didn't architect it in the first place to be a lake for a single source of data, or a single subject area of data, and hopefully you can get to its real high purpose in an organization, which is to be kind of that warehouse of all data.

Now, we've done quite a bit of research on these various platforms, and we've priced them out. For a modern enterprise workload, it'll probably run into the millions per year, and that's across all these categories. I show at least two categories on here at 0%, but I left them in because on the other platforms, not AWS, but maybe Azure, Snowflake, et cetera, you may have to pay separately for them. That's not a statement on the overall costs of AWS versus Azure, et cetera; it's a statement on how it gets broken out, and I'm letting you know that it gets broken out differently by platform. This is the AWS breakout as an example, over the course of time. So if you have a true machine learning stack in place from AWS, once you build up to the point where you have a solid stack, and that takes most enterprises a good six months, maybe a year, to get there, then you're gonna be running something that breaks out kind of like this. You'll see the dedicated compute in AWS is gonna be the big nut there. Data integration is not insignificant in the overall mix. And the data lake itself costs less but has more data, obviously with different characteristics. So be sure you are accounting for all of these costs and all of these goodies, all these capabilities that you have in a modern enterprise. You're going to need all of them, okay? Maybe not day one, but you're going to need machine learning, you're going to need identity management, and so on.

So let's talk about some of the product utilization characteristics. Concurrency is a funny thing; we throw the term around quite a bit. What I find is that most databases are pretty good up to concurrency levels of five to ten. If you're really hammering the database with five to ten concurrent queries, we're talking linear scaling at that point. But at about ten, maybe it's eight, things start to really diverge, and the linear scaling drops off. It drops off for all of them, but obviously for some to a greater degree than others. What this means is that when you're analyzing a platform, you have to understand the workload characteristics of where you're going, at least in general, and know whether concurrency is an important factor, and it probably is. If you're building a lake or a warehouse, concurrency is important. You're going to have multiple users over the course of time, more so today in the warehouse, obviously, but the lake will start to approach the level of usage the warehouse has today over the course of the next, I'd say, five years.
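One way to find that knee in the scaling curve for yourself is a simple concurrency probe: replay the same work at increasing concurrency and watch where throughput stops scaling. In this sketch, run_query is a stand-in for a real call through your platform's driver:

```python
# Concurrency probe sketch: throughput should scale roughly linearly at
# low concurrency, then flatten where the platform starts to diverge.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    time.sleep(0.1)  # placeholder for a real warehouse query

for concurrency in (1, 2, 5, 10, 20):
    jobs = concurrency * 5
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(jobs):
            pool.submit(run_query)
    elapsed = time.perf_counter() - start
    print(f"{concurrency:>2} concurrent: {jobs / elapsed:.1f} queries/sec")
```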
And then all the things that we do to support the warehouse, to make it great for our user communities, we're gonna have to do for the data lake, because we can no longer sit back and say, well, I don't really know what they're doing in the data lake, I'm just providing them the data. That's kind of how we started with the data warehouse, and it didn't last for long. Even though we have a more cultivated, I'd say more scientific, audience for our data lakes, eventually their need will be to really focus on the analysis and less on the data wrangling and all that good stuff we do behind the scenes, or are supposed to be doing behind the scenes.

So, resource elasticity, up and down, by the way. A warehouse needs to be able to scale up and down and take advantage of the elastic compute and storage capabilities in the cloud. This is one of the promises of the cloud. If it's not elastic, it's not good enough. If elasticity means that at certain points you have to call the vendor, renegotiate the license, and experience delays of days to weeks while the system has hit the wall, that's not good enough either. There should not be these huge stepping stones in elastic scalability; it should be small stepping stones, and stepping stones that happen automatically as needed. Now, I totally get the vendor idea that they have to be paid for this elasticity if you're using more resources. But please be sure you negotiate that up front. What if it needs to scale, and are you good with how the product will scale? Are you good that at the drop of a hat, maybe when your concurrency hits three or four, it's gonna kick up from eight to 16 nodes, and you're gonna be paying for that, because the database is hedging against any concurrency problems? Are you good with that? Be sure you understand that aspect, because that's the stuff you don't necessarily see in the first month; you start to see it when you've really invested in the platform and are really committed to it, and then it's like, oh no, we have these issues. So that has to do a little bit with concurrency, and with data volume, and with the complexity of the queries.

Now, machine learning. Today, data warehouse query languages need to be extended to include machine learning. We want that built in, or we wanna know the story of how we are going to run machine learning algorithms against the data in this platform. It's not good enough to figure it out when we get there. You should know what algorithms are built in, how many there are, hopefully how many hundreds there are, because over the course of time I'm pretty sure we're gonna see more machine learning access to data than we ever will SQL access. Of course, that's gonna take a little time, but that is the way of things, and it's where I think we should really be focused when we talk about machine learning. Excuse me.

Data storage format alternatives. Now, you have choices here. Paul and I have been speaking about the data lake as if it's on cloud storage, and that's what I believe is best for a data lake; not everybody does. Some people are doing the lake concept on relational databases, which I would say looks more like a data warehouse to me, but we'll get our terms straight over the course of time. You do have choices in this.
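On that machine-learning-in-the-query-language point, one concrete form it takes today is BigQuery ML, where a model is trained with a SQL statement; the dataset, table, and column names below are invented:

```python
# Hedged example of ML extended into the warehouse query language, using
# BigQuery ML. Dataset/table/column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE OR REPLACE MODEL demo.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM demo.customers
""").result()  # trains inside the warehouse; no data leaves the platform
```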
And wouldn't it be great if the data storage could accommodate all data types? Wouldn't it be great if you could actually index data in the data lake, so that we don't have a lot of sequential processing, so that we have some random access capabilities, so that we know where data is and can quickly zoom in on where a term is found in petabytes of data? Say, William, where is that, in petabytes? If you wanna do that search today, you can pretty much forget about it, but I think things are on the horizon that are gonna make that much more palatable.

Also, when you think about lakes and warehouses, you might think, and I've heard this from clients recently, well, I'm not ready to make the jump to the data lake. I can just patch on the warehouse here and go forward and get through the day. Yes, of course you can do that for a lot of things. But if you look at the vendor community, as I try to do, I see that a lot of the briefings I'm taking and a lot of the new releases have a lot to do with cloud storage and the way data is stored on there. So if you wanna be taking advantage of this, that's another reason to dip into cloud storage. Now, make sure you are able to take advantage of the many types of data available, such as Apache ORC, Apache Parquet, JSON (very common), and Apache Avro, et cetera. Modern data warehouses need to be able to analyze that data without moving or altering it. And wouldn't it be great if your data integration vendor could flatten all that JSON out there? Snowflake recently made an announcement about this as a feature, where it can flatten nested data; I think that's great. You wanna be sure your data integration tool can take advantage of things like that as well.

So, Hadoop. That's still out there; it was all the rage for a little while in terms of data lakes, so it gained a bit of a foothold, and we still deal with it, right? In Hadoop, things are much more crude than a relational database ever was, even a relational database on day one. And I think I learned today from Paul that that was 1986, or I guess 1987; I thought it was the year I started working at IBM and that I was on the ground floor of it then, but it was '86, I guess. Whenever it was, it came out of the gate with organized data pages and data blocks. Within those data pages and blocks you had your records, your record headers, your ID maps, and a lot of other things that worked with indexes to help you get to random access of data. I won't go through all of that today, but it's not present in Hadoop, and it's not present in cloud storage either. In Hadoop, you have the header, you have records, you have sync bytes every once in a while; within the record, you have key lengths and keys and values, and depending on the data type you might have multiples of those, and so on. But essentially, these are schema-less file formats, and the schema is applied upon data access.

Now, Parquet. The reason I'm bringing that up is because it's huge for me when I'm architecting a data lake. I've grown to believe in Parquet as the optimal way to organize a data lake, whether it's Hadoop or straight-up cloud storage. There you have a header, you have your blocks, you have your footer, and you have column chunks.
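A hedged illustration of what those column chunks buy you, using pyarrow; the file and column names are invented:

```python
# Write a small Parquet file, then read back only one column. Column
# pruning means only the requested column chunks are fetched from storage.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "customer_id": [1, 2, 3],
    "region": ["west", "east", "west"],
    "lifetime_value": [120.0, 85.5, 240.0],
})
pq.write_table(table, "customers.parquet")

ltv = pq.read_table("customers.parquet", columns=["lifetime_value"])
print(ltv.to_pydict())  # {'lifetime_value': [120.0, 85.5, 240.0]}
```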
So it's kind of like a columnar database, if you're familiar with that, in that certain columns of the record are stored contiguously, and that occurs for as many quote-unquote columns as there are in the record. That makes for better access when you're not looking at complete records. And let's face it, a lot of the records we put into the data lake are very flat, very de-normalized, so they can be quite large, and that I/O can be very expensive, even though lakes are generally pretty fast.

Okay, I cannot talk about putting data in fit-for-purpose platforms without mentioning the graph database. It's often overlooked, I think at our peril. If you have a workload that is about, quote-unquote, the word network, or the word relationship, or something like that, anytime you have a workload where you're trying to access data through relationships really quickly, a graph database is going to be ideal for that, even sometimes at the expense of copying the data from a relational database or a data lake out into the graph database. There we talk about the data being organized as triples and so on; we won't get into that in great detail, but there are a lot of great algorithms in a graph database that have to do with relationships, with what the most important nodes in the network are, and that helps your business. And that's all in addition to the data display, which obviously is quite nice.

So there you have a lot of the platforms, 90-plus percent of the platforms, that I would recommend to an enterprise. And now we're parsing out the lake and the warehouse. I know the lakehouse makes them all one, but we are the builders; we're behind the scenes. We have to parse this out. We have to put data in places. I talked about everything going into the lake and so on, but as we build out our data lakes, we don't, today anyway, have to cultivate that data a lot, because we're talking about volume more than anything else. In the data warehouse, we've grown to cultivate the data quite a bit. I like to use the term spoon-feed, because that's the paradigm we often use with our users; that's just where things have come to, and it will come to that again with the data lake. We don't cultivate the data very much for the data lake as compared to a data warehouse, but you know what, that's okay today, because those scientists, the users of the lake, know what they want. They can step in, and if they can't, they're a little bit out of luck today, because that's not what the data lake is. But we have to grow our capabilities, our understanding of what they're doing with the data. As builders, we have to grow our data science so that we are more supportive of the true needs of the data lake, and in that way we can really create the lakehouse concept.

This just follows on to what I just said. The balance of analytics today is a lot on the data warehouse and a little bit on the lake. The analytical applications on the lake are kind of one-offs; it's not the big shared resource it needs to be. I think that's gonna change, and I think everything's gonna grow in the actionable future. I like to talk about the actionable future, because the future further down the road, well, for one, we don't really know what that's gonna look like; we can guess, but we have to make decisions based upon 2021, 2022.
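As a tiny illustration of the node-importance algorithms mentioned above, here is PageRank over a toy network, with networkx standing in for a real graph database and the node names invented:

```python
# Rank the most important nodes in a small directed network. networkx is a
# stand-in here; graph databases ship similar algorithms natively.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("dave", "carol")])

for node, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```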
What's available? I've never been in a situation where we could just wait and see what the market does and call that good enough. We have to make decisions based upon what the market provides when the business needs it. Over the course of time we're gonna have more usage of both, and more data in both. We're definitely gonna have them working together; that's the yellow line, the pipe between the two, so you wanna be sure you develop that. I would not recommend building a lake that doesn't work with your warehouse, or vice versa if that's you. But I do believe that a lot more analytical applications are going to start hitting the lake, and it's going to be a beautiful, harmonious thing as long as we build that lakehouse properly.

Now, I've given you a lot of information here today, hopefully good information that helps you go to market and take a harder look at what you've got to make sure it's right. And I did allude to the fact that I think you should be benchmarking this stuff before you make selections. There's a lot that goes into that; I ought to know, I've been doing this quite a bit. So what are you benchmarking? Query performance, load performance, query performance with concurrency, ease of use. When you go through a benchmark, you run a workload, let's say a TPC workload, or maybe one that represents what you're actually going to do (great, by the way, if you have that), and you run it through. These are some of the things you can get out of that benchmark. You obviously have to load the data. You obviously are running queries, and you may or may not have concurrency, but you probably should. Ease of use is less quantifiable, more of a feel thing I would say, but you can nonetheless quantify it and bring it into the evaluation. Then: who is the competition, and is it fair? A lot of these questions have to do with whether you are making it fair, and that's not as easy as it seems. Query schemas and data; the scale, how big are you going; the cost, how much can you spend on this; and how are you making fair comparisons between the various platforms? I gave you my rule of thumb about trying to find configurations of equal cost and still doing the price performance at the end of it. Number of runs.

Yes, Shannon? I know we had some tech issues today, but we're just coming up on five minutes left, and we'll want to get in some Q&A.

Definitely, definitely. I just have a few more bullets here, just what you see; this is my last slide. So, number of runs; whether you populate the cache first or not; any other software that gets involved; and, I don't know where I mentioned it, but any kind of tuning, whether you're allowing for tuning or not. If you do some sort of tuning on one platform, you probably need to do it on the other to make it fair. And overall, measure price performance; a minimal skeleton of such a harness appears below. That brings me to the end of what I wanted to say about using data platforms that are fit for purpose. And Shannon, we must have questions.

We do. Thank you so much, and sorry for the technical challenges; thanks for hanging in there. Just to answer the most commonly asked question: I will send a link to the slides and the recording by end of day Monday, so you'll be able to see all the slides for this webinar as well.
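For reference, the minimal benchmark skeleton mentioned above, along William's checklist: a fixed query set, an optional un-timed warm-up pass to populate the cache, and repeated timed runs. The execute function is a stand-in for a real driver call, and the queries are placeholders:

```python
# Skeleton of a fair benchmark run: same queries, controlled cache state,
# multiple runs, median timing. Swap execute() for your platform's driver.
import statistics
import time

QUERIES = ["SELECT 1", "SELECT 2"]  # placeholder for a representative workload

def execute(sql: str) -> None:
    time.sleep(0.05)  # placeholder for a real query execution

def benchmark(runs: int = 3, warm_cache: bool = True) -> float:
    if warm_cache:
        for q in QUERIES:  # un-timed warm-up pass to populate the cache
            execute(q)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for q in QUERIES:
            execute(q)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

print(f"Median run: {benchmark():.2f}s")  # divide cost by this for price/perf
```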
So diving in here. There's a question that came up, Paul, when you were talking: how do Matillion's solutions fit for a company with an established, traditional multi-dimensional BI solution? Can we reuse some of the current BI components?

Yeah, the short answer is yes, absolutely. A lot of folks are looking at hybrid deployments right now, especially as you start to age out of or gracefully exit some long-term on-premise contracts on your legacy technology. The good news with the cloud is that it makes it possible for you to go at your own speed; you can move little bits of workflow up as you need. And what's also nice is that the cloud supports a lot of legacy technologies on the BI and visualization front as well. So absolutely, yes: you can use Matillion and cloud ETL solutions with your legacy or traditional multi-dimensional BI solutions.

All right, so when is it the right time to consider a cloud migration? What are the indicator signs?

Well, I'll start on that one, I guess. I'm pretty much there in terms of most everything getting the cloud as the first port of call, the first decision. But I do acknowledge that there are some workloads where, let me put it this way, you're not gonna be able to overcome the challenges of migrating to the cloud for one reason or another. And you know what, that's fine. If an organization wants to keep high control over certain kinds of data, be it financials, or their formulas, or their customer lists and so on, I totally get that. And frankly, some of these data centers are gonna be hard to wrest data out of, because they are so good inside some organizations. So there's gonna be that dynamic in play, and there are always business priorities other than replatforming; the benefits kick in, but it takes some time to really pay back the migration. I get that totally. But when I put in place the total cost of doing something like this and the return on investment, I think it's there most of the time. So pretty much, the cloud is for now.

Yeah, and I would just add, William, you mentioned before elasticity being one of the benefits of the cloud, and also the price performance and price variability of the cloud as well. I think those are two key benefits of newer cloud technologies. If you're finding that you're running into a performance wall with your existing stack, or you're worried about variability of load and how that might impact your budget, things like that, those are some of the common triggers we see from folks considering migrations. They kind of push people over the edge.

I love it. Well, there are a lot of great questions here; maybe I'll shoot them over to William and to Matillion as well, but I'm afraid that is all the time we have for today. Again, thank you so much to everybody for your patience, for hanging out with us, and for the assist. I love the assist; we just have the best community ever. Thank you both for these great presentations; really appreciated. It's just really good information. Again, I will send a follow-up email by end of day Monday with links to the slides and to the recording as well. Thanks, everybody. I hope you all have a great day.

Thanks, all. Thanks, Peter. All right, thanks, Paul. Thanks, William. Thank you. Bye-bye. Thank you both. Bye-bye.