Hello and welcome, my name is Shannon Kemp and I'm the Chief Digital Manager at DATAVERSITY. We'd like to thank you for joining today's DM Radio Deep Dive, "Modern Data Pipelines: Improving Speed, Governance and Analysis," sponsored today by Fivetran. It's a deep dive continuing the conversation from a live DM Radio broadcast a few weeks ago, which, in case you missed it, you can listen to on demand at dmradio.biz under podcasts. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right-hand corner for that feature. For questions, we'll be collecting them via the Q&A section in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DMRadio. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and additional information requested throughout the webinar. Now, let me turn the webinar over to Eric Kavanagh, the host of DM Radio, to introduce today's webinar and speaker. Eric, hello and welcome.

Hello and welcome. Thank you very much, Shannon, and thanks to all of you out there for joining us today for another DM Radio deep dive. Yes, indeed, my name is Eric Kavanagh. I'll be presenting today with Taylor Brown, who's done some really fascinating stuff over at Fivetran. He's going to tell us how we got here and why modern data pipelines are so important for the future of business, quite frankly, not just for our industry, but for business at large.

So let's get some context from recent events. Just yesterday, General Electric got booted from the Dow. Wow. How did that happen? Anyone remember about 10 years ago, or less, in fact, when GE was going to be the right-hand man to government innovation? There was a lot of hope, a lot of prosperity in the mix. People really had good ideas about where things were going. And now they just got booted from the Dow. Well, there are lots of reasons for that, obviously, but I can tell you one thing for sure: inertia is a killer inside large organizations. Inertia really prevents agility, and there are all kinds of reasons why that happens. But I guarantee there are people inside GE who saw this coming and have been working feverishly to right the ship and get it going in a better direction. And guess what? It just wasn't enough.

So why is this happening? Well, in our industry, we play a pretty significant role. We can really help large organizations right the wrongs, get back on track and find new directions. That's all through the classic use case of data management: getting data into analytical environments, operationalizing those environments, tying analysis to operations and business. All of these things are really important. But again, how did we get here? Well, let's think about a couple of things. Constraints drive design. This is true across any architectural landscape, whether you're talking about information systems or traditional architecture.
I know from years ago in my reporting history, when I did a story on some highway improvements up in Illinois, one of the guys from the Illinois Department of Transportation explained to me that, at least back then, and I'm pretty sure it's still the same, you can only plan new highways according to existing traffic patterns. In other words, you're not allowed to predict how traffic might change once a new highway is put into place and use that as your basis for justifying a new highway project. You have to look at existing traffic patterns and base it on that. Well, that may not be effective enough in the world of information management. As we see the raw amounts of data and all the different data types coalescing around us, we have to realize we need a new way of getting this information into the systems that drive analysis and that drive the business.

And there are lots of other things happening too. Moore's Law, we've talked about that for years. Well, Moore's Law, in its original form at least, is fading. We can't get much more power out of these little processors; that's why massive parallelism is the better path. And let's think about application retirement, right? This is something we've talked about for years, and it's really never happened. I have a famous wisecrack from a good buddy of mine, Gilbert Ben Cutson, who joked a few years ago that elephants go to a special place to die, but there is no software graveyard. He said it all just goes to the cloud. Well, that can work if you port the data and you port the applications and all the workloads to the cloud. That's what we're talking about at the DataWorks Summit right now in San Jose; I'm dialing in from the hotel room. That's all taking place right now, and really in the last 18 months, large organizations have finally gotten serious about moving to the cloud. It is the future, at least the near-term future. There may be interesting ways that the extant data centers will be renovated and remodeled and refurbished and reused and so forth. But right now people are moving to the cloud. There's a lot of gravity up there, and that's in part because there are so many applications now in the cloud. If you want to avoid the fate of GE, if you want your enterprise to rise and shine and do better each day and each week and each month and year, you need to be using these new data sources, and they are everywhere and they are huge. That's why modern data pipelines are so important.

Let's talk about real-world architecture, buildings, right? Architecture matters. The keystone: you can see it throughout the arches of the Colosseum, and the keystone holds the whole archway together. Old-world architecture may not have been as good as the stuff we have these days, but the Colosseum is about 2,000 years old and it's still standing. So if you come up with a good architecture for your ideas, for your business, for your information landscape, you're going to have some pretty good success. Well, let's think about what's happening these days. These are the modern skyscrapers. The Sears Tower, I actually watched that go up because I grew up in Chicago, and I remember seeing it at various stages of its development. It was an amazing thing to witness; I was only, I think, seven years old when the Sears Tower went up. And of course, that's the Freedom Tower in the middle. And then the new tallest building in the world, the Burj Khalifa, which was originally going to be called the Dubai Tower. I'll talk about that in a second.
But the point of this slide right here is that you cannot build skyscrapers with bricks and mortar. It's just not going to happen; they would collapse under their own weight. We needed a revolution in engineering to be able to build skyscrapers, standing on the shoulders of giants, so to speak. And look at how amazing they are today. The Sears Tower was amazing back then, 40-odd years ago, but the Freedom Tower, and now this new tallest building in the world, are just magnificent. And I would argue we're in the same kind of scenario right now with information landscapes and information design: you cannot build web-scale applications using the old tools. You cannot build them using old data warehousing technology, legacy ETL, for example. All that stuff needs to change if you're going to graduate to the next level of big business driven by data.

But some new and exciting problems are going to arise too. And I love this story; it's further proof of my half-joke that we're living in the matrix. This was the Torch tower in Dubai, which caught on fire, right? I don't know if you heard the story, but something about the material used on the outside of the building was just not engineered very wisely, and it caught fire, and it was a huge problem. So there are lots of things to worry about in this new age of information design, this new age of, let's call it, skyscrapers for data management systems. And some of the old issues still matter as well. Don't forget about the costs; don't forget about the basics. Modern solutions have cost structures too. This is a story about that particular tower, and how the people building it ran out of money when they were about two-thirds of the way through. They had to borrow a whole bunch of money from their neighbor, and that's why they changed the name of the tower. So the old rules matter, things like financing, like thinking through your project and making sure you have some buffers built in. You've got to think through all that stuff, get your early-stage success stories, and then move on from there.

And communicate. You really need to share plans with all of your stakeholders. Good communication is highly underrated, I think, and it's really important to get multiple members of your team on board with the project. Find out what they want. What's gonna make them happy? How are you gonna make them successful? These are all really important considerations to take into account when you're designing these new skyscrapers of information systems. And remember that process is always the key. Process is what everything boils down to in business, and what we're seeing right now is a reinvention of business processes. We're seeing processes collapse, not just streamlined but taken down from weeks or days to minutes or seconds. That's just amazing stuff, and it's gonna shake out throughout the entire organization, step by step, bit by bit. As you think about how to leverage these new technologies, how to make use of these skyscrapers of information systems, just remember that you have to start somewhere, and you must align this to process. The best thing you can do is look at wherever you have the most pain points in your organization. Is it sales? Is it marketing? Is it operations? Is it supply chain? Wherever the case may be, you can leverage datasets, internal and external, to make better decisions going forward.
And with that, I'm gonna hand it off to Taylor Brown, who's gonna take us through a bit of the history of this industry, explain how we got here, and then talk a bit about what you can do next. So with that, Taylor, take it away.

Awesome, thanks so much, Eric. That was a great intro to exactly what I'm gonna go through, so I appreciate it. Welcome, everyone; hope everyone's having a good day so far. So what I wanted to get into today is a brief history of data warehousing, ETL and BI, and the data governance around them. I thought it'd be interesting to go through the last 18 years of what has changed in each of these individual pieces of the BI stack. A few things I want to get out in front: I'm looking at the history of trends here, and the dates I'm throwing out are somewhat rough, so don't hold me exactly to them; they're roughly the times I felt were applicable.

So let's start with the 2000s. Just to get you warmed up on what the 2000s were: that's the decade when the pink RAZR came out, and also the original iPod. So that sets the scene a little bit. And here's what the data stack looked like. It was a somewhat complicated data stack. It was mostly all custom systems; the source systems that companies were using were enterprise, on-premise, custom applications. Companies were loading into the data warehouse using classic ETL, then loading into data marts and cubes, typically OLAP cubes, and then querying directly from their reporting tools. So that was the general shape of the stack.

From there, what was the warehouse at this point? Really, there was a cube. OLAP cubes were often set up for faster dashboards and reporting, used to overcome performance limitations in row-store databases, especially in data models that contain hundreds or thousands of columns. OLAP cubes boosted the speed of analytical queries against row-store databases by limiting the amount of irrelevant data that needed to be scanned. By defining schemas and data structures up front, the dashboard could easily access the data it needed and answer its queries. That sets the stage. At that point the infrastructure was pretty slow and fairly expensive; to put that in context, the cost of one gigabyte of storage was around $7.70. Keep note of that, because we'll see it continue to drop over time.

So in the 2000s, what did the data pipeline look like? This is really just extract, transform, load. Think Informatica or custom code, with heavily customized pipelines: you're mapping the types, the columns, the tables; you're transforming data prior to loading it into the OLAP cubes or into the warehouse; and all the aggregations were also being performed within the pipeline. So it was a somewhat complex system, but at that point it really needed to be complex, because all the sources it was pulling from and loading to were also extremely complex.
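To make that pre-compute idea concrete, here's a tiny illustrative sketch of my own (not from the slides), assuming pandas; the table and column names are invented. The pipeline aggregates the fact table along its dimensions once, up front, so a dashboard query reads a small summary table instead of scanning every raw row, which is essentially what the cube bought you.

```python
import pandas as pd

# Raw fact table: one row per sale (in a real warehouse, millions of rows).
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "year":    [2004, 2004, 2004, 2005, 2005],
    "revenue": [100.0, 150.0, 200.0, 250.0, 300.0],
})

# "Cube" step: pre-compute the aggregate along the region x product x year
# dimensions once, at load time, rather than on every dashboard query.
cube = sales.groupby(["region", "product", "year"], as_index=False)["revenue"].sum()

# A dashboard question ("revenue for product A in the West") now scans the
# small pre-aggregated summary, not the raw sales.
print(cube[(cube["region"] == "West") & (cube["product"] == "A")])
```

The trade-off discussed a bit later follows directly from this: every new row or new dimension means re-computing that summary before anyone can see fresh data.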
And what did the data governance layer look like? These were very hardened systems, and a lot of central planning went into them, so there was actually really good data governance. There was pretty good control and oversight into whatever anyone was looking at, into how companies were defining their metrics, and into data security. And what did the BI tools look like? At this point, they're really these heavy, monolithic BI tools focused around reporting. Think Cognos; this is the high period of MicroStrategy and the like. At that point we were really only able to ask one question: what happened? What happened in the past? They were extremely accurate and very detailed, but they were very inflexible. They were hardened systems that would take months to change or update and years to build, with a tremendous amount of planning; this is the era of waterfall development. You'd sit down and plan everything out to a tee, you'd build it across all of your teams, and you'd get these very detailed reports. But those reports were mostly aimed at the C-suite, and it was really a top-down approach to BI. There was also a pretty high cost of ownership at this point; all these tools were fairly expensive.

So just looking at the total tools: if you break down the number of tools you're looking at here, there's a minimum of five-plus tools in your entire data stack. Keep that number in mind: the more tools, the more complexity, because these things all need to interact with each other. Also, in terms of the teams that needed to be involved in these BI setups: you had your executive and management team at the top, you had your project management team, you also had engineering, IT, and the analysts, and at the very end, you had the business users. These heavily complicated team structures relied on a tremendous amount of communication, on making sure that communication was great. And that's where a lot of these systems became more complex, because they relied on so much of that communication.

For instance, you'd have an end user who needed a specific metric, and they would ask the analysts, and an analyst would work out exactly how they wanted it aggregated. They'd figure out, okay, this is how I want it to be in the cube. They would pass that off to project management, project management would pass it to IT, and IT would pass it to engineering. So there was basically a game of telephone going on. Oftentimes engineering would aggregate something differently than the analyst wanted, so by the time it got back to the analyst, the analyst would say, this is incorrect, and there'd be constant back and forth. So, as I'm pointing out here, there was a huge game of telephone, a high cost of communication, and a high cost of planning. So let's continue into the 2010s.
The limits on scalability, the pre-computed structures such as star schemas, the difficulty of changing the structure of the database, and the need for experts to configure and administer the data warehouse eventually led to a fair amount of frustration. All this orchestration, all this planning; I think eventually people said, all right, I'm tired of having to set things up like this. So what were the challenges of OLAP? Really, the need to pre-compute and maintain secondary data structures, i.e. the cubes and the indexes, impacts data availability: adding new data or corrections meant re-computing summaries and aggregates, and you can't see new data until the rebuild is complete, which takes time. Adding dimensions gave end users the insight they demanded, but it was a game of compromise: you traded off the most-needed dimensions against capacity limitations and the tightening of batch windows. So there was this constant give and take: do we allow more dimensions here, or do we tighten things down a little for capacity concerns? And then ever-increasing data volumes exacerbated the limits of the cube, whose storage could really only scale up by leveraging increasingly esoteric and expensive storage and hardware. So it's like, all right, let's just throw more money at the hardware, and obviously that becomes expensive as well. Those were the really big challenges.

So around 2006, what happens in the warehouse is that the column-store MPP database comes out. The column store is designed for analytical queries, and massively parallel processing (MPP) divides queries up between the nodes in a cluster, with each node doing a portion of the work. This was a huge revolution, I think, because you were able to remove the cube from the equation. So you have Teradata, HP Vertica, Netezza, Oracle; this was a huge introduction into the space.

So what did the stack look like then? It still somewhat looked like the 2000s. There wasn't necessarily a cube anymore, but there were still data marts, you still had your business rules, and there were still a number of different constituents involved: your executive team, your project management team, your engineering, your IT, and all those business users. So five-plus tools and still six-plus teams; still pretty heavy-handed, with a lot of communication having to go on.

What happens next? In 2008, along with the development of MPP technology, the rigidity of the monolithic BI stacks was starting to become a little too constricting for business. This is where self-serve BI comes out. We go from just asking what happened in the past to asking why did this happen, and really looking at the present moment: what's happening right now. So you're looking at Tableau and Qlik, and you're really allowing your users to drill down and explore, but they're still somewhat in data silos.
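As a toy picture of that MPP scatter-gather idea (my own sketch, not any vendor's implementation): each "node" sums its own horizontal slice of the table in parallel, and a leader combines the partial results. This is the mechanism that let the column-store MPP warehouses answer analytical queries without a pre-built cube.

```python
from multiprocessing import Pool

# Each "node" in the cluster holds one horizontal slice of the table;
# here, just the revenue column of its rows.
partitions = [
    [100.0, 150.0, 200.0],   # node 1's rows
    [250.0, 300.0],          # node 2's rows
    [50.0, 75.0, 25.0],      # node 3's rows
]

def partial_sum(rows):
    # The work each node does locally against its own slice.
    return sum(rows)

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_sum, partitions)   # scatter
    print(sum(partials))                               # leader gathers: 1150.0
```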
One of the challenges those data silos start to bring up is that there are multiple versions of the truth. But this was definitely driven by the business teams saying, I want freer access to this; I don't want these long cycles of having to go back and ask the engineering and IT teams to keep changing and updating things. So from a data governance standpoint, there's more data, more consumers, more complex data, and now multiple versions of the truth. With these decentralized desktop tools, data governance ends up being a bit more like herding cats: everyone's accessing data from different places, people are passing CSVs around all over the place. And so all these data governance tools start to crop up to say, how do we control this? How do we make sure we have good quality data and data security across our organization? That's the result of self-serve BI.

Actually, one thing I forgot to mention here: companies also start to recognize that instead of relying on technology alone for data governance, there are a lot of policy-centric approaches cropping up. Businesses focus on creating policies for their data models, data quality standards, data security and lifecycle management. That, I think, is a critical evolution in data governance, so it's a big one to point out.

So, 2011, speeding along here. The challenges with on-prem MPP column stores start to crop up. The arrival of big data really puts the MPP data warehouse to the test, and it's not really due to volume; MPP data warehouses can scale very high. The bigger problem was the variety of big data and the vast array of new types of analytics, some of which were not easy to do inside of the data warehouse. So this is the new big challenge: now what do we do?

Enter Hadoop. Boom. Hadoop was originally developed around 2006 and started to take hold in 2009, 2010, 2011; as I said, the dates are a little vague here, but for all intents and purposes, let's say 2011. Hadoop comes to the rescue and says, all right, great, this is a fantastic potential answer to our problems, because it can scale to big data, it can handle all different forms of data, and it allows lots of different types of analytics. So it's like, ooh, we finally may have found something that's going to work for our challenges, and it can potentially work for a long time, which is really exciting. When you look at the Hadoop stack: you have your operational source systems and logs, things like that. You load them into Hadoop, and then you're transforming, either in Hadoop itself or still using some of the same old tools, Informatica and the like. Then you've got Hive or Spark, maybe, sitting on top of it, and those are pumping over to your data warehouse and also your analytics tools. So that's what the stack starts to look like. You still have six-plus tools, so it's not that the stack has gotten any simpler.
It's just kind of shifted the way companies are doing it. And there are still the same number of constituents involved: the executive team, project management, engineering, IT, analysts and business users. So that hasn't really shifted all that much yet.

Fast forward to 2013; things keep chugging along, and unfortunately, we start to run into some issues with Hadoop. Hadoop initially seems great, but while it's easy to dump data into Hadoop and a data lake, it's hard to manage it and extract value from that data. There's complicated low-level setup, a tremendous amount of maintenance and cost, and it requires experienced development teams to really get the most out of Hadoop. So ultimately, companies end up sending a lot of their data from Hadoop into a SQL database for analytics. And that's where you loop back: Hive, Spark, these sorts of tools end up pumping into your data warehouse, and your analytics tools run off of that. So I think it ends up being a bit of a dead end to some degree: it doesn't really solve the big data problem, and it also doesn't really solve the data warehousing problem for analytics.

So what comes next? Right at this time, in 2013, the MPP column store moves to the cloud with Redshift. AWS forks ParAccel and launches it in their cloud, and they're really just a first mover. Redshift is really still a system that was designed for on-premise, but moving to the cloud makes it a little more accessible and definitely makes it cheaper. And this is something that really takes off for the entire industry. Companies say, okay, great, let's move to the cloud, or at least start moving some workloads there. It's a fast, affordable enterprise data warehouse on AWS, fantastic. It's definitely less expensive than on-premise column stores, and it allows mid-market and smaller SMB companies to start leveraging this technology as well, which was really not possible before because of the price constraints. It's fairly easy to resize clusters. And if you look, the cost of a gigabyte of data is now at five cents: in those 13 years, the cost dropped about $7.65 per gigabyte, which is a pretty big shift.

The other big thing that happens around this time is the cloud era of self-service BI. We had self-service BI with Tableau and Qlik, the 2008 set of tools, but I think what companies were really looking for was a little bit more of a cloud-native tool. The decentralized nature of desktop self-service BI brought its own organizational challenges, and so cloud-native BI was born. It's really driven by massive infrastructure changes coming from the cloud platforms like AWS, and tools like Looker want to be collaborative, centralized, and focused on data governance. Another big thing happening here is that data is no longer a competitive advantage; it's a standard, and it's a disadvantage if you're not fully utilizing your data. So this is a shift in the paradigm: before, if you really were using data well, you had a huge competitive advantage.
Now it's like, if you're not using it, you're really just behind the curve. The key thing with cloud-native self-service BI was centralizing it and making it collaborative. And I think for the first time, we're no longer just looking at the past or the present with the data; we're trying to predict the future as best we can. That's difficult, but it's starting to be possible, and part of that's just through accessibility, right? You have more people using the data, more people looking at it, more people saying, I use this to help myself, to help us plan, to help us put goals together, things like that. The other big thing with cloud-native self-service BI is that you're much more likely to have a single version of the truth. You don't have as many of the challenges you had before with a bunch of siloed self-service BI; you have it all focused around one central point. Full data accessibility, and another big thing: fast queries running directly against the data warehouse. So this is the big revolution on the BI side of things.

That brings us to 2015, and of course, there are new challenges. It felt like 2013 was really close: all right, great, we almost have the perfect stack together, cloud is really starting to work out for us, we now have MPP in the cloud. But people still start to see some of the challenges with Redshift in particular. There's a survey that a company called Intermix did of companies that were leveraging Redshift, and the problems sounded like: our queries are too slow, our dashboards are slow, Redshift is a black box, we're growing and afraid our current Redshift stack won't scale. Redshift by all means is a fantastic warehouse, and it will scale very well, but the market had a slightly different appetite. The market was interested in a little bit more of a managed system that you can plug and play, without having to go in and do the DBA-type work. And the other issues with Redshift come from the fact that it's not really cloud-native; it was designed for on-premise. It doesn't separate compute and storage, and it's not easily scalable up and down. I mean, you can resize, but that's a bit of a pain. A single queue for your queries is also a huge issue: query performance degrades when you have a lot of producers and consumers of data. It's really difficult to manage the cluster; you have to write these really complicated workload management rules, you have to run lots of vacuums, all that kind of stuff. So it's not exactly as flexible as a lot of data teams need. It's better than what came before, but there's still an appetite for a better solution.

So, the mid-2010s: this is when the cloud-native column-store MPP data warehouses come out. Data teams want new levels of flexibility to adapt to the needs of workloads, to support self-service, and to get simplified administration through automation.
So Snowflake, Google BigQuery and Azure SQL Data Warehouse come onto the market, and you have this separation of compute and storage, which is really vital to the success of the MPP warehouse. There's also zero infrastructure management; they really thought through, okay, what do people want here? What do our customers really want? And I think they nailed it. The other big thing is that these tools support structured and unstructured data, whereas the older tools like Redshift didn't really handle, say, JSON all that well, or unstructured data in general. And the last thing is that they're instantly scalable in compute: you can go up and come down very quickly.

Let's take a slightly deeper dive into what this actually means. With separation of compute and storage, you now have a single S3 layer or Google Cloud Storage layer with essentially infinite storage running as your storage layer, and your queries run against a leader node in your MPP cluster. The data is copied onto the nodes in the cluster at compute time, which is pretty exciting. This really solves the queue issue: you can have multiple different consumers and producers in your company all reading off the same data sets. You don't have to have multiple data sets and silos; everyone can work off the same thing. So maybe you have your ETL running through one cluster, your finance team running through another, and BI running through another. And say BI accidentally runs some crazy queries that end up taking down its whole cluster: that's perfectly fine. I mean, it sucks for everyone who's using BI, but your finance team won't go down and the ETL won't go down. This is a huge advance in the space and just makes life a lot easier; there's a lot less to handle in terms of orchestration across teams, because teams can be independent while still working off the same data set.

And then there's elastic compute. You can resize the cluster in seconds and just add more nodes. So let's go crazy: hey, we know it's Black Friday, we know we're gonna have a tremendous amount of traffic and a tremendous amount of purchases going through our system, so let's scale this thing up to what it needs, which is great. And then you can scale it back down after Cyber Monday or whenever you need. That's the flexibility and the agility that newer teams, and really companies in general, need at this day and age, given the way we're running our businesses.

The last thing I'll point out, a huge perk here, is data sharing. Data sharing is one of the perks of having all of your data running off these shared storage layers: you can share data sets across different companies, so we at Fivetran could share with a partner company, who could share with someone else, and so on.
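Here's a little toy model of that separation (my own illustration, with invented names): one shared storage layer, and several independent compute clusters that read the same single copy of the data, resize elastically, and fail independently.

```python
# Stand-in for S3 / Google Cloud Storage: one shared, durable copy of the data.
shared_storage = {
    "orders": [("o1", 120.0), ("o2", 80.0), ("o3", 45.0)],
}

class ComputeCluster:
    """An independent pool of compute reading from the shared storage layer."""
    def __init__(self, name, nodes):
        self.name, self.nodes = name, nodes

    def resize(self, nodes):
        # Elastic compute: add or remove nodes in seconds; storage is untouched.
        self.nodes = nodes

    def total(self, table):
        # Data is read from shared storage at query time, so every cluster
        # sees the same data set: no per-team silos, no per-cluster copies.
        return sum(amount for _, amount in shared_storage[table])

etl = ComputeCluster("etl", nodes=2)
bi  = ComputeCluster("bi", nodes=4)

bi.resize(16)                 # scale up for Black Friday...
print(bi.total("orders"))     # 245.0
bi.resize(4)                  # ...and back down afterwards

# If a runaway BI query saturated `bi`, `etl` would keep running: the
# clusters share storage but not compute queues.
```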
Data sharing is still fairly early; there isn't a tremendous number of companies leveraging it quite yet. But keep an eye on it, because I think it's going to turn into a really great way of passing data around.

So moving on: how does all of that affect ETL? Let me recap everything I've gone through in the last 20 slides or so. The warehouse has gone through a number of advances: in 2000, OLAP; 2006, the on-premise column-store MPP; 2011, Hadoop; 2013, the cloud column store; 2015, the cloud-native column store. BI: in 2000, monolithic rigid BI, then self-service, then centralized cloud-native self-service. To be honest, ETL has not really changed much at all. And part of the question, when I was looking at this, is: why has ETL not changed that much? I think there are a couple of reasons for it, but really, I guess, it's time for a change.

There are huge challenges with ETL from the 2000s. ETL was optimized for slow, on-premise data warehouses; that's what it was initially designed for, at a time of massive storage constraints, and it was optimized for pulling from those on-premise enterprise applications. There's just been so much change in the last 15 years, and the fact that ETL hasn't changed really indicates that it needs to at this point. One of the challenges an ETL strategy inherently brings is that there's still an extensive amount of setup involved, plus ongoing maintenance and extensive planning. And oftentimes, if you're doing this, you have to plan so far ahead that you're essentially trying to predict the future, and that's just really difficult to do, especially in today's business environment, where things change so quickly that companies really need to be able to switch directions very fast.

So, 2015: a shift in company structures. And I joke on the slide here, cloud: "so hot right now." (I love that.) Moving to the cloud, IT teams are shrinking, and analysts are coming to the front with self-service BI. They want simple infrastructure, they want fully managed services, they want more holistic control over the stack. They're tired of having to go through all these teams to get what they want. Today's teams really want infrastructure that's simple to use, SQL-based, and manageable by themselves without a ton of overhead, either for themselves or for other teams in their company. They want tools that set them up for success, with low overall complexity. So there's a big push for this, and it's been 15 years, so it's time: let's make some changes here.

The other big thing happening at this point is the rise of cloud applications. There's this tiny picture here on the left, but it shows something like 5,000 marketing apps and tools alone.
Between marketing, CRM and the rest, easily thousands of these specialized cloud tools have cropped up in the last 10 years. That's a tremendous number of new tools that companies are leveraging, each highly specific to particular needs. Also, the drop in data storage cost: one gigabyte is now two cents, a whole lot cheaper than it was 15 years ago, and that also plays into how the ETL paradigm should shift. The last thing is that agile workflows are really what companies want: they want to quickly iterate and move as fast as possible.

So let's dive into what this shift looks like. On the left is classic ETL: you extract, you transform, you load, you visualize. This is simplified, of course. On the right, you have more of a modern data stack, an ELT stack: you extract the data, you load it in a more holistic way, then you transform it and build that modeled, semantic, data-governance layer within the warehouse. Then you visualize by running queries against the data warehouse and shipping only the results to the browser. Super fast, and the data warehouse does all the heavy lifting.

The benefits of this stack: it's all cloud-native, self-service analytics. You have full schema replication coming off the sources, essentially building a data lake within your data warehouse, so you don't need to add an extra Hadoop stack or those other things; you just put it all in there. It's stored in the S3-style layer of a lot of these warehouses, so it can scale up infinitely. You're then putting the data into more of a standard schema inside the warehouse, building a set of base tables, and I'll speak to why that's important in a bit. Then you have SQL-based modeling and transformations that happen post-load, and a centralized, collaborative BI layer on top of that.

One of the reasons I think this is a really great move is that it modularizes: it separates replication and load from transformation. One of the largest complexities of the ETL paradigm was that you were extracting data and transforming it at the same time. So if you had to go back and understand where issues were coming from, why the metrics aren't working out right or why the numbers aren't exactly right, you had to isolate: is it the replication code? Is the data not coming correctly from the source? Or is it the aggregations or transformations that were applied? When you build monolithic systems that combine multiple steps, that complexity compounds and makes everything harder to figure out. This approach really separates the two: replication is one thing, transformation is another, and that gives you a fantastic clean break. You get a semantic layer you can run data governance on, you can have a single version of the truth, and it just simplifies your stack.
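A minimal sketch of that ELT shape, using sqlite3 as a stand-in warehouse; the table names and the revenue metric are invented for illustration. Extract and load land the raw data untransformed, and the modeled, governed layer is built afterwards, in SQL, inside the warehouse, so replication problems and transformation problems can never masquerade as each other.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")   # stand-in for Snowflake/BigQuery/etc.

# EXTRACT + LOAD: land the source rows raw, full schema, no transformation.
raw_rows = [("o1", "east", 120.0), ("o2", "west", 80.0), ("o3", "east", 45.0)]
warehouse.execute("CREATE TABLE raw_orders (id TEXT, region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# TRANSFORM: post-load, in SQL, inside the warehouse. This view is the
# centralized, governed "single version of the truth" the BI tool points at.
warehouse.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY region
""")

# VISUALIZE: the BI layer ships only small query results to the browser;
# the warehouse does the heavy lifting.
print(warehouse.execute("SELECT * FROM revenue_by_region").fetchall())
```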
So with this stack, you're now down to three teams, right? You don't necessarily need engineering, and you don't need IT or project management. You have analysts who can essentially run most of this stack: they understand how the data warehouse works, and the warehouse is fully managed, so they don't really need IT to oversee it. The extraction process, using a tool like Fivetran, is designed for analysts to set up and get working quickly, without the help of engineering or IT. And then you have your business users and your executive and management team, but there isn't a whole lot of interface that needs to happen there, not a tremendous amount of communication. The analyst obviously still needs to get the data governance part right, and they'll work with the executive team on that, but then it's really: okay, let's get it out to the rest of the company, let them use it, and not worry about them mishandling or misusing it, because you have that strong, centralized transformation and data governance layer.

So just to recap: there's been all this change in the warehouse and in BI, and I'd argue that ETL is ripe for change. You're seeing just the tip of this happening right now, with a lot of companies moving over to doing things this way.

So quickly, in the last few minutes, I want to give a very fast plug: what is it we do, and why bring all of this up? Fivetran fits into this picture as a data replication tool: zero-configuration, zero-maintenance data pipelines. Fivetran helps you achieve that accessibility. We pull the data from the source for you: you connect with an authentication, we pull the data, load it into the warehouse, and continue to keep it up to date. We'll pull from applications, databases, files, events, really wherever you have business data, and we'll load it into your warehouse: Snowflake, Google BigQuery, Amazon Redshift, Microsoft Azure SQL Data Warehouse, and a number of others. In terms of sources (I think this slide is already outdated; we're adding about two to four per month), we have a long list of applications, some of the popular ones being Google AdWords, Google Analytics, Marketo, NetSuite, Salesforce, Zendesk, Zuora; databases like MySQL, Oracle, Postgres and SQL Server, and there's a longer list of those as well; and files, which you can grab from Amazon S3, FTP, FTPS, SFTP. We even support email, so you can email reports to us and we'll automatically grab and upload them. Think of it as a toolkit for your analyst team, to help you get data from all the places you need into the warehouse without having to call on the rest of your engineering team. So how does it work? You authenticate, and we do the rest.
So say we connect a system like Salesforce: you authenticate it for us, you give us access to pull data on your behalf. We pull all the historical data, we normalize that data, we create the schemas and tables for you in the warehouse, we load the data, and then we continue to keep it up to date at, say, a five-minute interval.

In terms of the normalization behavior, this is getting into the weeds, but I think it's important to a lot of people. We have two different levels of normalization. For data that comes out fairly denormalized, so we're talking APIs that send very denormalized data as JSON, we actually spend a lot of time normalizing that data into a clean schema for customers, so they can start working directly on top of it. They generally don't even need to build a modeling layer in the warehouse; they can just drop their BI tool directly on top. And for normalized schemas, databases and places like Salesforce and NetSuite, because those are so customized, we replicate them as is, so you get essentially a direct replica of your source in the warehouse. A big thing we offer with the standard schemas for our sources is our ERDs. These are great because you know exactly what you're going to get; you can plan ahead of time rather than set it up and hope you get something usable.

The other good thing is that we do very exhaustive pulls: we pull everything by default. So it's not that game of telephone again, where the analyst says, I need this one table, tells the engineering team to go add that one table, and then the next day needs another table, back and forth. No, we just pull everything by default. You can turn off tables if you want to for some reason, or turn off columns, say for GDPR purposes, but that sync-everything-by-default behavior is, I think, a really critical component of what we do.

And in the last minute or two, how the update process works. We do incremental updates. By default, when you insert, we insert; when you update, we update; when you delete, we mark the row as deleted and archive it. We're working on expanding this in some other directions. We also do automatic schema migration: you add a column, we add the column; you remove a column, we don't remove it, but we give you an alert on it; when you change a data type, we change the data type in the warehouse as well; you add an object, we add the object; you remove an object, we don't remove it at this point. That handles a tremendous amount of the complexity, so you really don't have to worry about the thing that would break these old ETL systems: something changing in the source. That would break everything in your ETL, and you'd have to have your team go through and fix all of it, which could slow down your entire pipeline, or all of your analytics, for days or weeks or months at a time. This lets you stay extremely agile: keep making all the changes you need to make in the source without worrying about breaking your downstream BI. So I'll leave you with one last thing: complexity compounds. I suggest you automate, standardize, and simplify as much of your stack as you can.
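Here's a small sketch of that incremental upsert and soft-delete pattern (my own illustration, not Fivetran's actual code): deletes are marked rather than dropped, so history survives in the warehouse and downstream models keep working.

```python
from datetime import datetime, timezone

# Destination table keyed by primary key. Deletes are "soft": rows are
# flagged, never dropped, so nothing downstream silently loses data.
warehouse_table = {}

def apply_change(op, row):
    """Apply one source change (insert / update / delete) incrementally."""
    key = row["id"]
    if op in ("insert", "update"):
        row["_deleted"] = False
        row["_synced_at"] = datetime.now(timezone.utc)
        warehouse_table[key] = row                     # upsert
    elif op == "delete":
        if key in warehouse_table:
            warehouse_table[key]["_deleted"] = True    # mark, don't remove

apply_change("insert", {"id": 1, "plan": "basic"})
apply_change("update", {"id": 1, "plan": "pro"})
apply_change("delete", {"id": 1})
print(warehouse_table[1]["plan"], warehouse_table[1]["_deleted"])   # pro True
```

Schema changes follow the same additive spirit: new columns and objects get added automatically, while removals only generate alerts, so a change in the source never silently breaks the pipeline.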
And I think the modern ELT approach is one great way of doing that. So that's all I've got. The last slide just says: what's coming next? Question marks. I love feedback, questions, thoughts; my email is taylor@fivetran.com, and we'd love to hear from you. Otherwise, I think we should open it up for questions.

Yep. Taylor, that's a great presentation. I love the history, and I love the fact that you draw these lines to where the inflection points occurred. I think the challenge we face today is that we have multiple inflection points all coalescing, and that's gonna create some interesting opportunities, but also some pretty significant challenges for how to move forward. And to that point, we do have a number of good questions from the audience. Let me throw this one over to you. One of the attendees asks: why normalize? Should the standard schemas be normalized or denormalized? What do you think about that?

I would say that textbook data warehousing is to normalize them, because when you denormalize, oftentimes you're gonna end up with duplicate versions. It comes back to a single source of truth: I think it's better to have a single source of truth, so you don't have multiple teams working off multiple overlapping datasets full of denormalized data.

Okay, good. And there are a couple of questions around data governance. I'm glad you brought that into the mix, because data governance obviously gets more important by the day. And what you've talked about, mitigating or limiting the amount of transformation that occurs, is actually a fairly significant component of the data governance landscape. But still, the question is: where do you handle data governance? I think it's sort of an end-to-end process, right? You have to have bits and pieces of protocol woven in. But if you don't transform a lot of the data as you load it, that at least helps stabilize the whole picture. What do you think?

Yeah, absolutely. I think there are two places where you can apply data governance; I'm gonna go back up a few slides so I can illustrate it. If you look at it, there's the point where you're extracting data out of the sources, and there's the point where you're transforming it. You have the opportunity in both of those places to do data governance. What I mean at extraction time: that's where you can decide what we want to make sure never lands in the warehouse, things like security or GDPR concerns. You may decide, hey, I don't want to load any PII into my warehouse, or you may decide to hash or otherwise obfuscate that data, so you're not worried about security issues downstream, across all the consumers you have on that warehouse. That's point one, and Fivetran definitely can help with that. The second half, thinking about data quality and the centralization of your metrics, of the way you define things like revenue: I'd say that should happen at the transformation layer. That's where you build the dimensional tables within the warehouse that you then point your BI at. So those are the two places I'd point to in the current setup.

Okay, that makes sense.
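The "hash or obfuscate PII at extraction" idea from that answer, as a quick sketch of my own; the flagged column list and the choice of SHA-256 are assumptions for illustration, not anyone's documented behavior. Because the hash is deterministic, joins on the obfuscated column still work downstream, while the raw values never land in the warehouse.

```python
import hashlib

PII_COLUMNS = {"email", "phone"}   # columns flagged by policy (hypothetical)

def obfuscate(row):
    """Hash PII values during extraction so they never reach the warehouse."""
    out = {}
    for col, val in row.items():
        if col in PII_COLUMNS and val is not None:
            out[col] = hashlib.sha256(str(val).encode("utf-8")).hexdigest()
        else:
            out[col] = val
    return out

print(obfuscate({"id": 7, "email": "pat@example.com", "plan": "pro"}))
```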
One of the attendees is asking about latency: how real-time can you get in this environment? I'm guessing you can get pretty real-time. What do you think?

Yeah, I mean, it really depends on which warehouse you're loading into, because with column-store warehouses like Redshift, for instance, which only has a single queue, that queue can get backed up really quickly. With tools like Snowflake, you can get the latency down to minutes or even seconds. It's not quite at the point where we're talking milliseconds, because it's still a column-store data warehouse, so it's not really designed for that, but it's getting pretty darn close, and you can definitely get to a pretty fast speed.

Okay, that sounds good. And you talked about the efficiency of making changes: are you formally using change data capture as part of this architecture, one of the attendees is asking?

Yes, we do use change data capture when pulling off of databases; we're reading off the databases' logs. For MySQL, it's off the binlogs, and for Oracle it's LogMiner, I think. So we are doing change data capture there. For sources that are APIs, we're just polling for changes every so often, every few minutes. Multiple changes can happen between those polls which we can't see, because that's just inherent in the way the APIs are set up, but we will pull all the changes we can and capture those.

Okay, good. And we have a question about large organizations, especially government organizations, that are not allowed to use the cloud. What can you do for those organizations, defense companies, for example?

Sure. At this point, Fivetran doesn't have a fantastic story there; we're very focused on cloud. At some point we may move to offering something a little more on-prem, but to be honest, we've seen so many companies focusing on going to the cloud that it seems silly for us to create an on-premise offering when we're loading to the cloud, especially when the data is often coming from a cloud service like Salesforce in the first place, and you'd be going from cloud to on-premise and back to cloud again. So I don't know if that's a great answer to the question, but we're not exactly focused on it at this time.

Yeah, no, that makes sense. You have to focus on where you excel. And I think, frankly, that's gonna be a hallmark of the information economy going forward, because you can design software from anywhere and deploy it pretty much anywhere. Companies such as yours really have to focus on their key differentiator, right? You have to focus on what you're doing well and realize that there are segments of the market you won't be able to capture right away. But that's a good point. We have one question, and I'm curious to know your thoughts on this, because I've been tracking Apache Drill for some time, and I've seen how they made some really good advances and then, I think, kind of had to go back to the drawing board, as I recall. Are you familiar with Apache Drill, and how do you think it fits into the broader picture of understanding large data sets?

I'm not super familiar with it. I have a cursory understanding, but I don't want to speak to it in depth without really knowing it; I'd definitely look at it more deeply, and that individual is welcome to just email me.
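To contrast the two capture styles from that change data capture answer, here's a toy sketch of my own (the event shapes and the `updated_at` field are invented): log-based CDC replays every change in order, while API polling only sees the latest state of a row as of each poll.

```python
import time

# 1) Log-based CDC (binlog / LogMiner style): every change event is in the
#    log, so the intermediate update below is captured, not lost.
change_log = [
    {"op": "insert", "id": 1, "plan": "basic"},
    {"op": "update", "id": 1, "plan": "pro"},
    {"op": "update", "id": 1, "plan": "basic"},   # intermediate state kept
]
for event in change_log:
    print("replicate:", event)

# 2) API polling: ask the source for rows changed since the last cursor.
#    Changes made and overwritten between polls are invisible by design.
last_cursor = 0.0

def poll(source_rows):
    global last_cursor
    changed = [r for r in source_rows if r["updated_at"] > last_cursor]
    last_cursor = time.time()
    return changed   # latest state of each changed row, not every change

print("poll:", poll([{"id": 1, "plan": "pro", "updated_at": time.time()}]))
```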
No, that makes sense. Okay, here's another good question, about metadata. One of the attendees asks: on the extract-load step, do you also create metadata about the specific load, such as date, time, source, et cetera? I'm sure that's all baked in, right?

Yeah, so we have a pretty granular logging system that logs everything that's happening, even within Fivetran's own systems. You can think of Fivetran as a glass box: there's not a whole lot you can do inside of it, it's fully managed, and we're sitting there 24 hours a day watching it, but we want to give you information about what's happening. So we'll give you the time we call the API, sometimes even the particular query we're running, how long the response takes, how long it takes to process the data, how long it takes to load into the warehouse, so you have very granular information on every step happening within Fivetran. And we'll actually ship that log data into whatever log system you're using: if you're using CloudWatch or some other logging system, we can send the logs there so you can put them right into the system you already have.

Okay, good. And here's a really good detailed question from an attendee who asks: if the source's schema doesn't track changes over time, will you add that capability to your schema, or will only the latest values exist in the warehouse?

So we automatically have our own system for tracking the schema. A lot of times the source isn't passing any sort of metadata about the schema, especially when we're creating a schema ourselves: a lot of times we'll get these denormalized data sets coming from APIs, and we'll transform them into a much better, more normalized schema in the warehouse. In order to do that, we track the schemas very closely. We'll track all the columns, we'll know exactly what we've seen before and what we haven't, what happens if a column gets dropped; we know very closely what's happening with the schema.

Okay, good. Well, folks, we've burned through a solid hour here. We have a few more detailed questions, and we'll be sure to pass those on to Taylor. And yes, we do archive all these webinars for later viewing; an email will go out on Friday with the details about all that. Any closing comments from you, Taylor, about where things are going, what you're working on next?

I mean, I think the warehousing space continues to evolve, and the transformation layer in the warehouse is pretty interesting. For us, we're really focused on continuing to make the most reliable product and supporting the largest number of connectors, because as I showed on that slide, there are 5,000 of them out there right now and we've got, what, 80 or so? So we've got a long way to go.

Yeah. It really is fun and exciting to watch this all pan out, and I think you guys are playing a really valuable role in allowing us to move through what is a very dynamic and turbulent period in the information management space. And with that, I'm gonna hand it back to Shannon Kemp to take us out. Thanks again.

Thank you, Taylor, thanks to Fivetran, and thanks to all of you for your great questions today. We'll get them all to Taylor to get you some answers.

Yeah, thanks so much, Eric.

Thank you, Eric, and thank you, Taylor.
And Eric's pretty much said it all. So thanks to all of our attendees for joining today, thanks to you both for a great presentation, and just a reminder, as Eric mentioned, I will be sending a follow-up email by end of day Friday with links to the slides and to the recording of this session as well. We hope to see you at more Deep Dive and DM Radio webinars. Thanks, guys. Thank you. Take care. Bye-bye.