Hello, and welcome. My name is Shannon Kemp and I'm the Chief Digital Officer for DATAVERSITY. I'd like to thank you for joining today's DATAVERSITY webinar, What's in Your Data Warehouse? It is the latest installment in the monthly series Data-Ed Online with Dr. Peter Aiken. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. We'll be collecting questions via the Q&A panel. And if you'd like to chat with us or with each other, we certainly encourage you to do so; note that the Zoom chat defaults to sending to just the panelists, but you may absolutely switch that to network with everyone. You'll find the icons for the Q&A and chat panels in the bottom middle of your screen. To answer the most commonly asked question: as always, we will send a follow-up email to all registrants within two business days containing links to the slides. And yes, we are recording, and will likewise send a link to the recording of the session as well as any additional information requested throughout the webinar. And now I'm going to introduce our speaker for today, Dr. Peter Aiken. Peter is an acknowledged data management authority, an associate professor at Virginia Commonwealth University, president of DAMA International, and associate director of the MIT International Society of Chief Data Officers. For more than 35 years, Peter has learned from working with hundreds of data management practices in 30 countries, including some of the world's most prominent organizations. His 12 books include many firsts, starting before Google, before data was big, and before data science. Peter has founded several organizations that have helped more than 200 organizations achieve data-specific savings measured at more than 1.5 billion US dollars. His latest endeavor is Anything Awesome. 
And with that, I'll turn everything over to Peter to get today's webinar started. Hello and welcome. There we go; I was muted. Yes, thanks so much; you'd think I'd have that down by this point. But welcome, everybody. It's a beautiful sunny day here on the East Coast, and our topic today is data warehousing. Of course, you know I like to mess with the titles a little bit. "What is in your data warehouse?" is how we put it out, but actually the real question is: what is in your data warehousing operation? Your data warehouse is probably not a single standalone data warehouse; it's a combination of various activities, so it's more useful to ask that question than literally what is in the data warehouse. In fact, what we really should be asking is: what do your data warehousing operations consist of? And gosh, if you haven't figured it out at this point, a lot of this stuff is moving to the cloud, so we'll certainly throw in some components of the cloud as well. So let's dive in on what your data warehousing operations consist of. We're going to start with some definitional material, just some standard definitions we can work from. Then we'll talk about two aspects: integration, which is really a focus of warehousing, and preparation, which is another component of warehousing, because it doesn't do any good to stick stuff in the data warehouse if you don't have a good way of getting it back out and letting people explore it. We'll talk about some best practices, and in all cases I contend that all methods accrue up to something called Plan-Do-Check-Act; we'll talk about that in specific detail about an hour from now. Then we'll come back with some takeaways and references, and I look forward to the Q&A section that Shannon hosts for us as well. Let's dive in. 
So the general data warehousing input-output diagram is that we have some inputs, a process, and some outputs, and those outputs consist in most cases of warehoused data. Now, I'm not going to say warehousing is the right name, or that we should use it, but it's the name we have. What happens is we have a lot of different applications essentially hung off of this data warehouse, and that's a good thing, because it means our data can be reused, which, if you recall from my other talks, is really the main issue here: most applications are designed around a single use of data, while in this case we're actually designing for reuse. So our operations are really focused on something called extract, transform, and load, which turns out to be a major source of data structure and transformation information, and therefore a major source of metadata; and the fact that it runs in production means this set of conversions is perhaps more reliable than in many other situations you'd have in your organization. If you're familiar with the DAMA DMBOK, that's what I'm showing on the screen here, and you can see that warehousing is one of the major pie wedges, but notice it is not separate from data governance; there's a lot that you need to pull into this. If you're seeing it for the first time, the DAMA DMBOK really just describes what is involved in data management. And I think there's a bit of self-critique we can do here: when we created it, we weren't thinking as clearly about how others would use it, and people will come back, read it, and say, oh gosh, this means I have to do data warehousing, and I have to do document and content management. 
No, actually, it just says these are the 11 business practice areas that comprise our Data Management Body of Knowledge. One of the components of the DMBOK is an input-output process diagram; here's a verbal one. I won't read it to you, but if you take nothing else away, this is a great articulation: it shows the inputs, the activities, the primary deliverables, who the suppliers are, the participants involved, the tools, the consumers, and the metrics you should have. At the top we have the goals: to support effective business analysis and decision making by knowledge workers, and to maintain an environment supporting business intelligence activity. I'll define those terms in a quick second, but let's start with a sort of negative example. Data warehousing is a wonderful idea, and people put things in there, but it also ends up with some companies not doing as well as they could. It's a technology that permits us to do things like query and reporting and the development of capabilities around those areas. It gives us information that has not previously been integrated, so we're finding a way to integrate all of this by virtue of how the warehouse is structured, and for many organizations it represents a new set of organizational capabilities they haven't had before. Now, we've also talked about this in the context of business intelligence, also known as decision support, which dates back to the year before I was born, so before computers were commonplace. Business intelligence is about supporting better business decision making, and gosh, if we had a place where we could always go to grab some data that we knew was of at least known quality and make decisions on it, that would be a help for many organizations. 
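A moment ago I mentioned extract, transform, and load as the workhorse of warehousing operations. Here is a minimal, illustrative sketch of that pattern; the field names (`customer_id`, `amount`) are hypothetical, not taken from any particular system:

```python
# Minimal ETL sketch: extract rows from a source system, transform them,
# and load them into a warehouse-style structure.
# All field names are illustrative only.

def extract(source_rows):
    """Extract: pull raw records from a source system."""
    return list(source_rows)

def transform(rows):
    """Transform: standardize types and drop records that fail basic checks."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:  # reject keyless records
            continue
        cleaned.append({
            "customer_id": str(row["customer_id"]).strip(),
            "amount": round(float(row["amount"]), 2),
        })
    return cleaned

def load(warehouse, rows):
    """Load: append transformed records to the warehouse table."""
    warehouse.extend(rows)
    return warehouse

source = [
    {"customer_id": " 42 ", "amount": "19.999"},
    {"customer_id": None, "amount": "5.00"},  # no key -> dropped
]
warehouse = load([], transform(extract(source)))
```

The point of the sketch is simply that the transform step is where data structure and conversion knowledge, the metadata mentioned above, gets captured in code rather than in someone's head.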
There are a set of technologies, applications, and practices, which is what we'll talk about today, that allow us to pull these things together at a scale we haven't been able to before, giving us the opportunity to look at historical patterns in the data to improve future performance. Many organizations call this the use of mathematics in business; it's the idea of applying statistical analysis, what people call analytics, in order to do this. So looking at all of these things, it comes back to the question. "Should we build a data warehouse?" is the way most people approach the process. The better question is: how can data-warehouse-based integration address our challenges? That's really what you should be looking at; the data warehouse concept is a sort of Swiss army knife. The one thing you don't want to do with your data warehouse is something the Indiana Jones movies always made fun of. If you know the story, the Ark of the Covenant supposedly ends up stuck in a crate in a giant warehouse somewhere. Unfortunately, many organizations think of data warehousing the same way: they don't use it daily, so they don't tend to see the use of it. Another thing we need to understand about data warehousing is the idea that better organized data increases in value, but nevertheless 80% of organizational data is ROT. Now that sounds awful, and it does have some bad implications; ROT is an acronym that stands for redundant, obsolete, or trivial. The only pushback I've gotten in 35 years of doing this kind of work is, "What if ours is higher than that?" I've never had a company come to me and say, "We have less than 80% of our data composed of redundant, obsolete, or trivial data," and I've had many companies go as high as 90%. 
There's general agreement out there that much of the data in organizations is redundant, obsolete, or trivial, but which data do you eliminate? And when you add the fact that most enterprise data is never really analyzed in the first place, it becomes a bit more challenging. So let's take a very specific example we ran into at one point in time. It was a healthcare company somewhere in the Midwest of the United States, and they had 1.8 million members in this healthcare plan. In the data warehouse they also had 1.4 million providers. Now, I'm pretty sure they did not have an individual provider for each member, or even close to it, but that's the way the numbers looked, so okay, first problem right there. The real problem was that of these 1.4 million providers, 800,000 of them had no key, which meant there was no way of getting access to the data in the first place; the data had simply been locked in a warehouse and was not retrievable by most conventional means. 29% of them did not have the Social Security number identifier they were using, in this case illegally, but nevertheless using; so that's another third they couldn't get to. In fact, only 2.2% of them had the required nine digits. And most importantly, and this was the thing that made the boss blow up, this entire data warehouse had been built for one user. Now, it's not to say that you shouldn't build a data warehouse for one user, but it's typically not your best use case. It cost 30 million bucks for this one guy, and what happened was the boss said, look, next time we do this, I can just grab a handful of MBAs, lock them in a room, and we can accomplish all of this much more quickly than spending 30 million dollars on a gigantic data warehouse that has exactly one user. 
Again, I'm telling you a sort of worst-case scenario; most data warehouses have multiple users and a pretty good context for them, but this does happen on a regular basis, and I end up being the person who gets called in routinely to help straighten these things back out. The reason we have to do this is because of the way systems have evolved, what we call paving the cow path. A startup company or some other organization comes along and creates a series of pieces, a payroll database, research and development, marketing, and notice each of them has its own separate pile of data associated with it, the proverbial silos you hear about so often. So looking at these things, what do we need to do? Well, first of all, everything ends up tied together, kind of like Gulliver being tied down by the Lilliputians: there are just lots and lots of interfaces going back and forth, and they keep us spending our time and money on transforming things we really should be handling programmatically instead of individually. What needs to happen is that we integrate and put in a new, if you will, collection of data that we can start to add pieces to, and we can start to differentiate between marketing-specific data and data that is shared across the organization. By the way, this is a great way to help determine your critical data elements, so that you can re-architect your environment, kind of like fixing the plane while you're flying the plane, but that's what most of us are faced with. Then we can re-architect into something that does in fact make a bit more sense. There are a couple of additional pieces that are really important not to forget; they are not a panacea, and they will not solve everything. 
But by gosh, there are a lot of problems that linked data can solve. I saw that same $30 million spent by one government agency, the example I was using earlier, and the entire data warehouse was recreated by a firm well versed in this technology for $300,000; that is two orders of magnitude less than the original price they had been looking at, and had in fact spent. So they were able to replace a very large data warehouse with one user with a more generalized facility for much, much less. It's not the topic of today's piece, but it is definitely an opportunity to leverage what has happened in society, because there are more and more of these opportunities to link data back and forth. Another thing to keep in mind is that there are technologies using XML and JSON and all sorts of other cool new bits and pieces that allow you to virtualize your data, so that you don't need to actually have it captured in the data warehouse the way you'd like; you can do this on the fly, creating virtual pieces. Everything you see here in the teal color has been created on the fly, whereas the orange represents the core database tables. So I'm going to create these one, two, three, four, five, six other tables on the fly for the purpose of the analysis, but I'm not going to store them, which means we can save on some of our cloud bills. And it is important to address cloud, because we've seen many, many folks following good guidance that says you should definitely learn about the cloud and figure out where it fits in; by the way, that's just good general advice with technology. It's not a question of whether it's the right thing for you or not, but where it can fit in. 
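Those on-the-fly virtual tables can be sketched with ordinary SQL views. This is a toy illustration using SQLite; the `sales` table and its columns are made up for the example:

```python
import sqlite3

# Toy illustration of data virtualization: a stored "orange" core table
# versus a "teal" derived table that is a SQL view, computed on the fly
# and never persisted. Table and column names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# The view exists only as a definition; its rows are computed per query,
# so nothing extra is stored (and, in a cloud setting, nothing extra billed).
con.execute("""CREATE VIEW sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM sales GROUP BY region""")

rows = con.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
```

Querying `sales_by_region` behaves exactly like querying a real table, which is the appeal: consumers see the derived structure without anyone storing it.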
These cloud technologies, whether Amazon's or Microsoft's or Google's, are all very good, and they are all virtually identical. The real question is: what data do you want your cloud to get access to? And that's actually kind of easy. If you want access to retail data, Amazon is going to give you better access to those kinds of things; if you want access to LinkedIn, that's the Microsoft version, because they have much better access to that; and Google, of course, does YouTube and search. So, lots and lots of different pieces, but the real differentiator among cloud technologies is going to be the type of data you want to access, not the capabilities of the individual warehousing platforms. You've all seen the standard cloud characteristics we have to pay attention to: location independence and resource pooling, which is great, meaning we can virtualize just as I was showing in the previous example, and the details are abstracted from consumers, so they don't have to really know about the particular pieces. The nice thing about cloud, and it's a mixed bag, but in some instances really good, is capacity. Looking at the chart, the pink salmon color represents insufficient capacity and the yellow represents wasted capacity, and as you grow, of course, both grow. This is the real selling point for clouds: you only need to buy as much as you need. On the other hand, there are so many variants of this, and so many organizations trying to sort through whether it's an on-prem cloud or a private cloud or a mixed environment, or the other types of options you have here. The real question, when you're putting data in the cloud, is this: data in the cloud should have three attributes that data outside the cloud does not. Data in the cloud should be cleaner, and data in the cloud should be smaller 
in volume, and more shareable, by definition. Let's take a look at how that works, because most instances of data warehousing end up with something very much like this story: I've got some data, I want to put it in the cloud, and there we go, we put it all in. Now, the problem with putting all of your data in the cloud is that there's no basis for the decisions being made; we're simply saying "all of it." There's no inclusion of architecture and engineering concepts in the cloud, which is a real problem, and there's no awareness that these concepts are missing in the first place. Remember, 80% of our organizational data is ROT; let's not put ROT in the cloud and pay Amazon or anybody else for that data. Instead, let's look at the opportunity to do warehousing through a series of transformations: we take the data and boil it down, if you will, to its essence, so that it is smaller in volume; there's no point in putting redundant data out in the cloud. Similarly, the data in the cloud should be cleaner than the data outside the cloud, or outside the warehouse; I'm using these terms interchangeably because I do believe they are, at this point, completely interchangeable. Do you imagine anybody would like to have data in the warehouse that was of lower quality than the data outside it? No, I don't think so. Very simple questions, but good ones to ask people. And the last criterion, of course, is that the data is more shareable; that's an actual characteristic we can objectively test for, to determine whether or not this data is more shareable. And as I said, this is true for warehousing, and it's true for cloud. 
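The "smaller and cleaner" tests can be expressed as a tiny pre-load step. This is only a sketch of the idea; real cleansing rules would be far richer, and the `id` field is a stand-in for whatever key your data actually carries:

```python
# Sketch: boil data down before loading it to the cloud/warehouse --
# drop duplicates (smaller) and records without a usable key (cleaner).
# Field names are illustrative only.

def prepare_for_cloud(records):
    seen = set()
    out = []
    for rec in records:
        key = rec.get("id")
        if not key:          # cleaner: no keyless records
            continue
        if key in seen:      # smaller: no redundant copies
            continue
        seen.add(key)
        out.append(rec)
    return out

raw = [{"id": "a1"}, {"id": "a1"}, {"id": None}, {"id": "b2"}]
clean = prepare_for_cloud(raw)   # 4 records in, 2 records out
```

The "more shareable" attribute is harder to reduce to one function, but the same principle applies: it should be an objective test you run before loading, not an aspiration.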
Yes, all of these things are necessary, but this gives us an opportunity to say: with this smaller data, this cleaner data, and this more shareable data, we can do what we need to do with the branding, and organizations will start to come around. You will actually hear it in an organization: "I don't want that data, I want the data that's over here." Which "over here"? The stuff that's in that warehouse, or that cloud, or whatever it is. Yes, absolutely, that's what you want. Finally, we don't want to clean it after the fact, because that means we're really doing glove-box work, where there's something on the other side of those large plastic shields that keeps us away from the things inside. It's kind of hard to work on data once it has been put in the cloud, as opposed to where you currently have it on-prem. So clean it before it goes into the cloud; don't put it in the cloud and then clean it. And by the way, cloud, warehouse, same kind of thing there. I'm sorry, I went a little fast on that; we're going to move now to integration, where we're talking about the real challenges we have. There are two basic warehousing purposes. The first is integration, which is just that: I've got data of disparate types, and I want to make sure I can compare and contrast it. Absolutely reasonable. Remember again that most data is never organized, so the idea is that the inputs and outputs are handled consistently all the way around, and the downstream knowledge is incorporated upstream, so that we do understand what the sources and uses are. The other part is a preparation concept, typically seen as the last mile, if you will: preparation for decision support or other types of activities. These are generally closed-ended activities, as we're trying to come up with an answer. 
Goodness knows we'd like to have good answers as we go through this, but it gives you the opportunity to finalize pragmatic quality measures that we can put in place. So again, two basic purposes, and the point is to pick one. Right? Don't try to do both within the same warehouse design; it generally produces much more complexity than you need to have. I have seen several companies that can do it, but again, it's really sort of a problem. Let's illustrate why that is the case in many instances. If we look at six different applications, we can come up with a measure that says the worst possible graph-theoretic complexity is n times (n minus 1) divided by 2, meaning there can be up to 15 interfaces if we are forced to connect everything to everything else. Now, Royal Bank of Canada told me I could use their numbers: they had, at one point in time, 200 major applications connected by about 5,000 batch interfaces. So if we look at that on the complexity scale, pushing those numbers into the formula, with 200 applications we could have up to almost 20,000 interfaces, and Royal Bank of Canada only had 5,000. They were below average in terms of their complexity metric. Now, the main thing I hope you're taking away from this is: gosh, I would sure hate to have all of this in one person's head, or in something we had to manually manage, like hand-maintained programming code. Really, what we want to do is go after programmatic activity. This is where things like business rule engines, and again the aforementioned ETL, perform super, super well. A really good reference here is Claudia Imhoff's wonderful work along with Bill Inmon and Ryan Sousa, who created a book called The Corporate Information Factory, which thinks about these things as a production function, if you're going to do this kind of work. 
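The interface arithmetic is easy to check; this snippet just evaluates the n(n−1)/2 worst case for six applications and for Royal Bank of Canada's 200:

```python
# Worst-case number of point-to-point interfaces among n applications:
# every application connected to every other one, n(n-1)/2 pairs.
def max_interfaces(n: int) -> int:
    return n * (n - 1) // 2

six_apps = max_interfaces(6)     # 6 applications -> 15 possible interfaces
rbc = max_interfaces(200)        # 200 applications -> 19,900 possible interfaces
# RBC's ~5,000 actual batch interfaces sat well below that ceiling.
```

Even "below average" means thousands of hand-maintained interfaces, which is the argument for programmatic ETL rather than point-to-point code.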
It makes absolutely good sense to check out their book; I think it's still available online in Google Books, so you don't even have to buy it. I hate to say that as an author, but that's something Google and Bill Inmon have to work out between them. Here's a warehouse applied to a specific challenge, a healthcare scenario, one that perhaps doesn't have Epic or Cerner in place just yet, but maybe does. Either way, you still have all sorts of different types of data coming in at the top that are all integrated into this platform, so we can add some dashboards, a patient and physician portal, and, importantly, a workflow manager that will allow us to better represent what happens across all of these things. So, there are three basic models for the warehouse, as you might imagine: one focused on integration, one on preparation, and one on everything else, or rather everything, and I'll get to that in just a second. The first is, of course, Bill Inmon's implementation. The definition here is that a data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of summary and detailed historical data used to support strategic decision-making processes in the organization. 
Bill is known as the father of data warehousing, and he does a wonderful job; if you get a chance to hear him speak, he's an excellent presenter, although he's really into text analytics right now. Nevertheless, this other material hasn't gone away. So we can have data from operational systems, flat files, CSVs, whatever we need to bring in; there's typically some staging operation; and the warehouse itself consists of data, summary data, and metadata, all combined in one nice integrated environment that we then use to produce data products, typically breaking things up into subject areas. The warehouse is considered to be the combination of everything from staging through the data marts, so that everybody can actually use the various analyses we have. By the way, in today's environment, data marts have now become data products. No problem with that; it's just a little bit confusing in terms of how that works. So, when we use third normal form, what we're really doing is making the data as flexible and adaptable, by design, as it possibly can be, and I'm going to go back a piece and rebuild this diagram very quickly. We want this orange warehouse at the center to be the most flexible and adaptable, because we don't understand, and can't foresee, all of the potential uses. However, when we move the data from that orange center into each of the green cubes, the data marts or data products we're talking about here, each is transformed in a way that makes it more natural for the type of analysis its user community is going to perform. 
Again, there's not much true expertise out there in this normalization function, so if you are unfamiliar with it, or you had a very poor instructor in that area, I would definitely urge you to get some help, and make sure the entire record is dependent on the key, the whole key, and nothing but the key. These topics are taught very unevenly in many universities, and as a result many people get this simply wrong. So there are some pros and cons to the Inmon-style third normal form. It's easily understood by business and end users. It reduces data redundancy by storing the data in a highly normalized fashion. It enforces the various business rules and referential integrity constraints. And the attributes can be indexed, which means we have the opportunity to do more flexible querying. The problems with it are that joins of large amounts of data can be very expensive, and quite frankly, once you've built it, it's not easy to make it ten times larger; it doesn't easily scale, so you sort of have to have the end very well in mind as you're looking at this. The relationship between engineering and architecture is critical in this environment. Architecting is the idea that I'm going to create this larger reporting environment, which could be cloud based or not, though in today's environment you're likely to be cloud based, and that we're trying to do this from an architectural perspective to make sure we can anticipate certain things. There's a good architectural saying for this, which is: always design the chair for the room. 
Right? If you don't know what room you're going to put the chair in, it doesn't do a lot of good to design the chair. Similarly with your warehousing: you need to design it given the larger business context. Organizations are having a tremendously hard time trying to get hold of and attract the right type of people to do this, and so, while everybody's really psyched about AI and machine learning these days, there's still an awful lot of demand for people who know how to do this really, really well. Everybody may be familiar with the old Target example here: Target had this little thing called a guest ID, where they connected a lot of information together, and found out they were actually getting into people's business in a way that probably wasn't appropriate given the circumstances. I did want to tell you a couple of specific things about this, though. Most organizations spend an awful lot of their time, in fact too much time, to the right of that yellow line there, between their traditional data management practices and their warehoused data. The reason is that people understand the loop on the right-hand side, and when they try to go back and fix things, they fix them in the data warehouse staging areas and in the ETL I described earlier. However, they would be more productive if they instead fixed them in the black box that is their traditional data management practices. 
So I'm going to close this quick section with a little shout-out to a wonderful project we did for a group called Feeding America. Feeding America is the overarching organization for the food banks in this country, and they're really good at making sandwiches and getting food to people in need, a wonderful, wonderful mission. But at one point they also said, I wonder what else this data we're collecting about the food we're producing could be used for, and this was a terrific example. They were looking at a bus system in a major city; the bus went down one street, the Main Street bus line, or whatever they were going to call it, and they found out, by looking at their data, that if you take this bus line and, instead of routing it straight up and down Main Street, put a couple of little hooks at the ends, you could eliminate food deserts for thousands and thousands of people. And that wouldn't have come out if they hadn't taken this data, added it into a nice integration data warehouse, combined it with some additional pieces, and used an Inmon warehouse exactly as it has been described here. The other part of this is preparation, and the preparation is really all focused on business intelligence. So here's a taxonomy, if you will, of different types of business intelligence. You'll see I'm getting a little skeptical about some of this, and I'll tell you why, but the basics are absolutely understood by business users: they can drill into this relational data warehouse anywhere they need to, which means the entire cube becomes accessible, and we can identify summaries or drill all the way down into transaction details if something isn't looking the way we think it should. Another way of describing this is the emphasis on the cube. 
I might say that I have different dimensions on the cube: one dimension might be product, one geography, one time, and then I can look at the various components that go into this. It's a very old slide, but the principle has not changed. I love using MicroStrategy's material here, because the users can go in, look at this cube, and grab the data from different perspectives, answering questions that in the past we would have had to ask IT to prepare a program or a report for, and then we might have used that report only once, discovering in the process that, oh, I forgot to include time, and having to send it back to IT for something else. So here's an example of something called set analysis; think of it as a sort of Murder, She Wrote kind of story, where we're trying to find something out. We'd like a list of customers whose income is less than $100,000, or who are younger than 30 years; these are the two characteristics of the suspect we're looking for. Now I add a third criterion and say, oh yes, and they live in New York State; well, that takes us from 30,000 down to 6,000. And then: they purchased whatever the item was within the last seven days; well, that takes our 6,000 down to 800 customers. 
And now we can add one more piece to this: the list of suppliers who supplied the product that those customers bought. So now we have managed to go from a giant list of 30,000 customers to just 40 suppliers, and we can find out more specifically what happened there. Again, that would be a one-time analysis. A more general type of analysis looks at portfolios: bank accounts of varying values and risks, where you can slice by social status, geographic location, and value, and try to come up with a balance that says, I'm going to make some risky loans, but I'm going to balance them out with less risky loans, and evaluate the portfolio as a whole. The least risky loan would be to very wealthy individuals, because they're unlikely to default, but at the same time there are a very limited number of those individuals. You certainly can give lots of loans to poor customers, but the risk increases with that. So combining these types of analysis, where to lend, what type of interest rate to charge, is all very, very useful coming out of this. The Kimball implementation of this is a star schema, which you could describe as a copy of transaction data specifically structured for query and analysis. So our map then looks this way: we have the same kinds of inputs coming into the staging activities, but now we go straight to a data mart that is set up in what we call the proverbial star schema, and those star schemas can be easily reported on. Again, everybody kind of knows this process intuitively; many have been taught it at universities and places like that, and it does give us the opportunity to get at data very easily. Here the fact table in the middle is sales, and the dimensions are date, store, and product, and we could drill down further on each of those. Of course, when you get to dozens of tables it becomes more complex.
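A minimal version of that star schema can be built with SQLite from Python's standard library; the table and column names here are illustrative choices, not from the talk. The point is the shape: one fact table keyed by its dimensions, and queries that join outward and aggregate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: date, store, product.
cur.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
-- Fact table in the middle: sales, keyed by the three dimensions.
CREATE TABLE fact_sales (
    date_id INTEGER, store_id INTEGER, product_id INTEGER, amount REAL
);
""")

cur.execute("INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2)")
cur.execute("INSERT INTO dim_store VALUES (1, 'Richmond'), (2, 'Ottawa')")
cur.execute("INSERT INTO dim_product VALUES (1, 'grocery'), (2, 'hardware')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 100.0), (1, 2, 1, 50.0), (2, 1, 2, 75.0)")

# A typical star query: join the fact to its dimensions and roll up.
rows = cur.execute("""
    SELECT s.city, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_store s   ON s.store_id = f.store_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY s.city, p.category
    ORDER BY s.city, p.category
""").fetchall()
print(rows)
# → [('Ottawa', 'grocery', 50.0), ('Richmond', 'grocery', 100.0), ('Richmond', 'hardware', 75.0)]
```

Any question along the dimensions you built in (city by category here) is a fast join-and-group; a dimension you left out simply cannot be asked about, which is the design tradeoff discussed next.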
Nevertheless, it's still a very easy way of obtaining data very rapidly, and of delivering reporting capabilities rapidly. The pros of this: it's a very simple design, the queries are really fast, people love playing around with them, and most major databases are optimized for these types of designs. But the questions have to be built into the design. If there's a dimension that you don't build into that star schema structure, you will not be able to answer questions about it, no matter how good your prompt engineers are (a new term of art these days, right?). The schemas are often centered on just one particular fact. So once again, we're looking at the tradeoffs that come with these things. Again, the design is more well known; you don't have to have quite as much expertise, because the design is basically pretty straightforward, and once people understand it, they can rebuild it for other dimensions pretty easily. So just like everything else, there's good and bad. Of course, data growth is like crazy, and the ability to analyze it is not growing nearly as fast, and this leads us into a couple of problems. Everybody wants to do this data analysis better. If we were all in person, I would ask you the question: do you think 50% of the time being spent on data preparation is good? Let's put some numbers on it. Pretend we're paying a data scientist $100,000 a year. All right, well, that means 50% of their time is not spent doing data analysis. It's doing something else: preparation. Okay, well, guess what, preparation can be done at a much lower price point than a data scientist's salary. And so we shouldn't be having the same individual perform both of those activities. Then the question becomes, what would a good ratio be, and most people say, oh, I'd really rather have my people working 80% on data analysis and 20% on preparation. But everybody knows it's just exactly the opposite.
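A quick sketch of that salary arithmetic (the $100,000 figure and the 50% split are from the talk; the $60,000 preparation-specialist salary is a hypothetical assumption for illustration):

```python
ds_salary = 100_000   # data scientist salary from the talk
prep_share = 0.50     # half their time goes to preparation

# Dollars of data-scientist pay spent on preparation rather than analysis.
prep_cost_at_ds_rate = ds_salary * prep_share
print(prep_cost_at_ds_rate)  # → 50000.0

# Hypothetical: a dedicated data-prep specialist at $60k doing the same
# half-year of preparation work costs far less, freeing the scientist.
prep_specialist_salary = 60_000
prep_cost_at_specialist_rate = prep_specialist_salary * prep_share
print(prep_cost_at_specialist_rate)  # → 30000.0
```

The gap between those two numbers is the argument for not having one person do both jobs.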
You ask any data scientist, and they're going to come back and tell you that they spend 80% of their time doing data preparation as opposed to actually doing the analysis work. The prime example of this was in a Forbes article from a couple of years back, during the height of the pandemic, that valued both American and United Airlines: American Airlines was valued at $6 billion and United at around $9 billion. The actual value of the data locked up in their frequent flyer programs was a lot more than the actual airlines were worth. If I had had extra billions lying around, which of course I do not, I would have bought American Airlines for $6 billion, then given the airline operation to somebody else for another $6 billion and kept the data piece of it, to try and capture the $10 or so billion of value sitting in there. You had better believe everybody in these organizations is trying to do this, and the reason they can't is that most of our knowledge workers do not understand data, data literacy, or these concepts around warehousing. And so many organizations will spend a lot of money building data warehouses and such, but notice the numbers coming out here. By the way, I want to call attention to thedataliteracyproject.org; it's a great resource for these numbers, and Qlik and Accenture have been collaborating in that area for a couple of years, coming up with some very, very good statistics. For example, if I just put data in front of people, even higher quality data or more data, nevertheless 48% of them frequently defer to their gut, and it's even worse when we go to management: two thirds of management actually goes with their gut. We're not going to look at this stuff in a data warehouse. It's sad.
In fact, when we go a little further into the data, we find out half of them incorporate data, but 36% of them find an alternative method. I'm not sure what that is: is that gambling, or flipping a coin, or consulting something or other along those lines? And 14% of them avoid the task entirely. So these are the levels, objectively, that we need people to be at in order to make best use of your data warehousing capabilities. I won't go through all of them at this point, but I will talk specifically about the knowledge-worker areas. The first one involves a very bright data scientist who was working on a project for a company and got on the elevator one day. The boss looked over and said, oh, you work for me as a data scientist; how's that project you're working on? And the data scientist said, oh, you know, I've gotten it to 72%, it's a real improvement over what we had before. The boss exploded, never something you want to have happen in a closed elevator, but the boss absolutely blew his cork, and what happened was a miscommunication, of course. The boss was mad because they thought the data scientist wasn't giving 110% effort, which is of course what they required of everybody else. So there was simply a mismatch here, but when we did a little post-analysis investigation, we also found out that the data scientist didn't know there was $5 million in unlocked value in that data at a 68% confidence level, as opposed to 72%, so they missed two years of revenue because the data scientist was so far removed from the process. Similarly, anybody working with the data is going to be working with some aspects of stewardship. This data is provided on behalf of the organization, and there are qualities of it that stewardship should maintain.
But our knowledge workers need to demonstrate value in what they do, to show that they're in fact adding value, that their current skill sets, and more importantly their data skills, are current to the point where they need to be. I think that they now have fiduciary responsibilities and also share the same fate. For shared fate I use the analogy of a swimming pool: we're all swimming in the same swimming pool, and if somebody unfortunately does something in it that makes the quality of the water less, it really hurts everybody. We have to do some more work to get a greater focus on these things, and I think this diagram really represents the essence of the challenge that we need to address as a data community. And that is that we get real excited when we achieve 68% or 72%, but we're not as good about talking about how everything else does. There's a whole concept over here where these organizational things happen, and we need to make sure that the organizational things happen in a way that, while we celebrate it as data people, actually means something to the business people. This is going to take us probably a ways into the overall process of getting people to understand what's really going on with these topics. If we can get people to take a look at this, the easier it's going to be to have everybody understand that it's not just good that something happened with data, but that something good happened with data that made something happen in the organization. When we look at this, we also need to be driven by strategy, and the strategy component here is the idea that somebody is going to come in and say, these things are important to the point that you should do them first, because we never have all the resources we'd like to have when we're trying to do these things.
Unfortunately, the definition of strategy has gotten, if you will, bastardized by the management consulting industry, and again, I'm a management consultant, so I have to own part of that. It's been turned into something that's a grand plan or a master plan, or more importantly, a thing. And we don't want that thing to be the piece that somebody has to go out and take a look at. Instead, it's better to go back to the military definition. The reason I put this yellow arrow at the top of the diagram is because before about 1950, the word strategy was not really used in the common vernacular; it was restricted to being a military term, and the military definition was a pattern in a stream of decisions, which means that strategy is closer to a process than it is to a thing. Let me give you three quick examples of this. Walmart is an interesting company. They had a strategy that said every day low price, and they drove that into everybody's head. If you were on the planes going out to Walmart, if you were on the planes coming back from Walmart, if you were in the community there, you understood that Walmart's goal was every day low price. And more importantly, if you were an associate working in Walmart IT and you had a decision to make, you knew the decision was correct if it supported the business strategy of every day low price. This is what we mean by a pattern in a stream of decisions. Similarly, you may or may not be familiar with Wayne Gretzky. I was just up in Canada two weeks ago, in Ottawa for their data day event, and had a wonderful time up there. They did a terrific nationwide event; it was really, really super. One of the things we're going to hope for is that we can get a Canadian data day out there; after all, Canada already has a national beer day, October 4.
A little aside on that one, but Wayne Gretzky, of course, is one of the major stars, and he got into a little bit of cryptocurrency trouble recently, but he did have a strategy with respect to his playing hockey. And that is that he didn't chase the puck; he skated to where he thought the puck was going to be. It's a wonderful definition of strategy, a really good articulation of it, and if you want to learn more about it, a bunch of it is up on his Wikipedia page, so everybody can gain the benefit of the conversations he had with his father, which were clearly formative in nature and, most importantly, actually produced results. Strategy example number three, then: let's just take us as the good guys, whoever us happens to be, and the bad guys are over there on the right-hand side of the diagram. I'm going to use a different strategy if I'm going to engage the bad guys over there on that terrain than if we are up here and the bad guys are down there, or just the opposite, the bad guys are up there and we are down below. What you don't want, of course, is the people who are participating in this having to go consult a 100-page manual or 100 PowerPoint slides to figure out what their strategy is. If our strategy is every day low price and this is the scenario, we know what we're trying to do. This pattern in a stream of decisions should also be guiding your warehousing activities, just the same way it guides work group activities. There's a book on that, on data strategy, which really just says it's the highest level of guidance available, and it usually involves a balance of remediation and proactive, preventive measures. Again, we don't have time in this session to get into it. But with that, now we can start talking about best practices. This is a famous illusion: you either see the old lady or the young lady in it, or perhaps both.
Seeing it the other way requires changing perspective, which means we really need to change the way we approach our warehousing activities. Most companies start out and say, how shall we build this data warehouse? Sometimes they know enough to know that there are two different competing models, which consumes an awful lot of energy without good results. Or worse still, they say, I've already got a warehouse, now what should I put into it? Right, well, again, that's putting the cart before the horse. The question is, how can warehousing capabilities solve this business challenge? That's going to give you a much more generic answer and a design that allows you to address not just the immediate problem staring you in the face, but in fact a multitude of problems that you have. So: how can warehousing capabilities solve a class of business challenges? There are lots of other examples. Again, are you ready for it? There are foundational practices, and there are project deliverables. Will you get it right the first time? The business environment is constantly evolving; how do I build something knowing that it's going to keep evolving going forward? Well, if evolution is important, then flexibility, adaptability, and lower risk are your primary levers, and those immediately point you towards the third normal form, the Inmon kind of warehouse. Do you have an agreed-upon enterprise vocabulary? Do you have your warehouse set up to be the auditable system of record, with extract, transform, and load doing the data transformation? What sort of results do you need: can you wait overnight, or is this something people have to have on their devices as they're looking at it? Then there's the performance of reloading the warehouse. I worked with one organization, a very large bank, where they put in place a warehouse that took 48 hours to reload with the dailies at the end of the day. Of course, if you do the math on that one, you realize that is simply not going to work under any set of circumstances.
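That 48-hour reload math can be sketched in a few lines of Python (only the 48-hour figure comes from the talk; the simulation framing is my own): a warehouse that needs 48 hours to load one day's dailies can only clear half a batch per day of wall-clock time, so the backlog grows without bound.

```python
# One new daily batch arrives each day; loading a batch takes 48 hours.
reload_hours = 48
hours_per_day = 24

backlog_batches = 0
hours_free = 0  # wall-clock processing time accumulated

backlog_days = []
for day in range(1, 8):
    backlog_batches += 1         # today's dailies land in the queue
    hours_free += hours_per_day  # one more day of wall-clock time
    # Finish as many queued reloads as the elapsed time allows.
    while backlog_batches and hours_free >= reload_hours:
        hours_free -= reload_hours
        backlog_batches -= 1
    backlog_days.append(backlog_batches)

print(backlog_days)  # → [1, 1, 2, 2, 3, 3, 4]
```

The queue gains roughly one unloaded batch every two days, which is the "do the math" point: you never catch up.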
Across all of these things, the word analytics always pops up, and quite frankly, I've never heard a definition of analytics that I like. In fact, when I find a term that I'm trying to understand, one of the places I go is images.google.com, and I ask, what is analytics? And look at this: connect any data, explore and visualize, and show the story. Great, right? Here's one with four steps. And here's one with five, and here's one with six, and here's one with seven, and here's one with eight. So I know I'm not alone in my confusion as to what the word analytics actually represents. I'll take it from a textbook here, and I'll do a little bit of reductive analysis on it. This is a book they like to run through our MBA programs to teach MBAs how to do business analytics and statistics, and this is the actual text of it. I won't read it for you; in fact, I'm going to pick it apart, because when you look at this much verbiage and try to figure out what's actually there, it becomes a real problem. So let's just look at it and call it analytics; it doesn't make any difference, business analytics or just analytics. And it's the use of data and information technology. Well, what is that? It's really hardware and software, is it not? And it's the use of statistical analysis and quantitative methods. Golly, quantitative methods. I'm sorry, statistical analysis is a subset of quantitative methods, so let's just eliminate that phrase entirely. And mathematical and computer-based models. Well, once again, models, right? How is a model different whether it's in a mathematical form or a computer form? It's still a model. To help managers gain improved insight about their business operations and make better fact-based decisions, right, blah blah blah blah blah. Everybody okay with: to help managers make better decisions?
Let's go to the next clause in here: it is the process of transforming data into actions through analysis and insights in the context of organizational decision making and problem solving. The process of transforming data into actions through analysis to solve problems, supported by various tools such as Microsoft Excel, SAS, and Minitab. And that 31 words is one third of the original definition of analytics, and it even covers those quantitative models. If we teach it with this definition, the thing we're missing is the business context. We've got to have a because, a why, a motivator that says why we are doing all of these various activities, and the answer is to solve a business problem. Just because I build a big warehouse, if I haven't solved the problem, I haven't added any value; in fact, I've hurt the organization, because I've expended resources without getting a result. This all comes down to what the word analytics really means, which is data analysis. There's no reason we should use anything but that term; analytics is the marketing term of the year. Now, when we talk about this, let's talk specifically about analytics in the context of understanding the organization. I've got the Tower of Babel in the upper right-hand corner of this diagram to remind us that the business and IT tend to speak different languages. Sometimes these architectures are understood and documented, which means they can be useful, but if they are not documented, they are going to have limited use in terms of understanding. Architecture really means documented and articulated as a digital blueprint, and again, if you're going digital, this is one of those key pieces you'll need to have. What we're trying to get to is shared understanding by the business, shared with IT, and shared with the actual system names, because if I don't have that understanding shared by everybody in the organization, I'm going to have confusion. And that confusion is not going to result in good business outcomes; instead, it's going to result in poor business outcomes. The goal of this is to get to a common vocabulary, which means that your warehouse has to be accompanied by some sort of trusted business glossary, catalog, whatever it is you want to call it, because that catalog is what's going to get us singing off the same sheet of music, to use a musical analogy. There's one more component to this. I've already told you about Inmon and Kimball being the two major schools of thought; there's a third approach, which is actually the better of the two. That is to say that both Inmon and Kimball have come together and agreed that this structure I'm showing you here, the data vault, is really the best type of implementation; however, it does incur greater overhead. It is the best place to start assuming that your investment is going to be significant for your organization: if you're starting a data warehousing program that you anticipate is going to be a large component of your business, or you have a mess of a data warehousing program and it's already a component of your business. This is the solution whose relative simplicity lets us drive better towards something. The key with the vault is that it allows you to store the business rules and the lineage of the data along with it. These structures are really a hybrid of the two approaches: hubs, links, and satellites are the words you want to use. Again, a data vault is going to be a very, very much more intensive exercise, but it gives you simple integration in the long run.
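A toy sketch of the hub-and-satellite idea, using SQLite from the standard library (all table and column names are my own illustrative choices): the hub holds only the business key plus load metadata, while descriptive attributes live in satellites stamped with load timestamps, which is what preserves history and lineage. A link table joining two hubs would complete the picture:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: just the business key, when we first saw it, and from where.
CREATE TABLE hub_customer (
    customer_key TEXT PRIMARY KEY, load_ts TEXT, record_source TEXT
);
-- Satellite: descriptive attributes, versioned by load timestamp.
CREATE TABLE sat_customer (
    customer_key TEXT, name TEXT, state TEXT, load_ts TEXT,
    FOREIGN KEY (customer_key) REFERENCES hub_customer(customer_key)
);
""")

conn.execute("INSERT INTO hub_customer VALUES ('C001', '2024-01-01', 'crm')")
# Two satellite rows: the customer's state changed, and both versions are kept,
# so full lineage of the attribute survives.
conn.execute("INSERT INTO sat_customer VALUES ('C001', 'Ada', 'VA', '2024-01-01')")
conn.execute("INSERT INTO sat_customer VALUES ('C001', 'Ada', 'NY', '2024-06-01')")

# Current-state view: latest satellite row for the hub key.
row = conn.execute("""
    SELECT h.customer_key, s.state
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_key = h.customer_key
    ORDER BY s.load_ts DESC LIMIT 1
""").fetchone()
print(row)  # → ('C001', 'NY')
```

Notice the overhead the talk mentions: even this two-table toy needs a join and a versioning convention to answer a question a star schema would answer from one row.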
So if you are facing lots of integration challenges, you need to have this, and if lineage is also critical, that's another reason to look at it, but it's complicated to push all this stuff back to the back end, and these people are going to have to be more practiced in these areas in order to use these techniques. So here's a comparison of the three different types. Again, Kimball is the dimensional model, Inmon is the third normal form, and vault is the data vault, and you can see there are pros and cons and necessary pieces in there. But as general guidance: if you're starting out and you need a simple reporting warehouse, a star schema will likely get you there and support you for years. So use the time that the star schema buys you to plan your data vault and what additional complexities and business challenges you can solve within it. If you already have an existing third normal form warehouse, you may want to look at the vault as a way of starting to simplify your existing environment. Lastly, don't start from a blank sheet of paper. Yes, your screen went blank because I hit the B button; that's what B stands for, blank. There is lots and lots of good guidance out there for organizations, in the form of pre-done patterns. My personal favorite is Len Silverston's universal data model patterns. I was literally doing one of these webinars, and we had a question during the Q&A part, which is coming up in about four minutes for us, and somebody said, I wonder if there's a pre-built model for a pharmacy cash system. It turned out Len was walking right by me at the time, and I grabbed him, and he was able to tell me that's in volume two of his book. If you buy Len's books, if you can actually find them out there, they come with CDs. I know those are funny for some people, but these models literally come with the book. Again, you can look at some of these different models as starting places.
Again, I'm certainly not going to walk you through this model or this one, but these are good ways of modeling your ETL and other types of things. There's a wonderful open standards group out there doing a meta model of various data warehouses, so there are lots and lots of opportunities for you to take a look at those. So when we look at all of this put together, what we're really talking about is understanding that warehousing is a capability for organizations. That capability can show up on-prem or in the cloud; it doesn't make a difference, except that it's more expensive to put your data in the cloud and not use it than it is to put it on-prem and not use it. The DMBOK is good guidance; the DAMA DMBOK allows you to go in and find out how to build some of these things. Again, if we're looking at our legacy environment and trying to get to digital, whatever digital happens to be, there are some things we need to pay attention to in order to do that. We looked at two basic patterns: integration, which calls for real engineering talent; it is not leading-edge technology, but it requires an adaptive stance rather than a prescriptive stance. And then our preparation with the star schemas, emphasizing storytelling first and visualization second. But analytics is ubiquitous and not well understood by everybody, so I don't even like the word; just use data analysis. I'll never win that battle, but you can at least try to get people to say, look, we're doing analysis, so what are we trying to analyze? Because if you're just doing analytics, it doesn't matter what you do them on, right? And finally, there are some best practices. You can't use what you don't understand, so you have to understand what it is that you have. And in order to do this, iterate on the process; figure out a way of doing plan, do, check, act around all of these pieces. And then cull your requirements.
Now, that's a very technical term: culling the herd is getting rid of the weakest. Right, so we don't want to have an overly complex warehouse. We're going to have it set up so that we can start to use it in a way that will make a lot more sense. And I'll let you guys get ready for the Q&A; we're going to do just a couple of quick takeaways. First of all, I'm including a list here of 16 major causes of data warehousing failure. That's a hat tip to TDAN, which is now part of DATAVERSITY, but Bob Seiner for years and years has done a great job of collecting these together. So these are all really good reasons why your data warehouse might not work perfectly the first time. And secondly, the last piece here, which I'll just spend a minute on, is that warehousing requirements tend to be what are called use case driven. So somebody will write a use case that puts down a scenario. And I've seen lots of organizations that have lots and lots of use cases. Again, the idea of a use case is how somebody would interact with the system we're looking at. The problem with most use cases is that they do not have integrated glossaries. So if somebody hands you a stack of use cases, hand it right back to them unless they give you an integrated data dictionary, because a tank could be one thing to some people and something completely different to somebody else. These use cases are also unable to capture what we call non-functional requirements, and the planning then has to be detailed, because the architectural requirements are very significant. And finally, the average data warehouse is rebuilt seven times before it is considered useful in most organizations. So we've reached the top of the hour here, and it's time for Shannon to pop back in with us. Hopefully we will see some of you at the upcoming DGIQ event that we're going to be doing in DC around December.
That'll be just before the next one of these webinars that we do, which is on data management best practices; you can see what's coming up here in the future. And Shannon, back over to you. Peter, thank you so much for another fantastic presentation, as always. And just to answer the most commonly asked questions: I will send a follow-up email to all registrants within two business days containing links to the slides and links to the recording from this session. If you have questions for Peter, feel free to put them in the Q&A portion of your screen. So, diving in here: Peter, is there a published, updated version of the CIF architecture that incorporates data mesh, data fabric, and even data lakes? I believe they're working on one; I don't think it's done just yet. Somebody may be able to correct me on that, but I looked a while ago and saw that something was in progress. So yes, and let's talk about those concepts that are coming on board now, right? Again, whether you like Gartner or not, Gartner declared data mesh dead, but data fabric has got a little bit of life in it, although it's on the downward slope of the hype cycle curve. And the reason for that is quite interesting. Both of these topics, meshes and fabrics, which are both very, very well thought out concepts, require an organization to be able to optimize its existing data management practices. Now, optimize is a maturity concept: it means you have documentation, repeatability, and measurements in place, and once you've taken those measurements, you have an idea of what you can improve. So it's a level four maturity requirement. It's kind of like handing me a snorkel and saying, go diving, when I have no snorkeling experience. It's great equipment and will certainly keep me safe, and I have many friends who snorkel, but I am not mature enough in my own practice of it. Great question. Thank you for that.
Indeed. So: why should data on the cloud be smaller than on-prem data? Great question. Let's go back to the slide on that, and let me take it real slowly. Let's think for just a minute. All of our data is coming to us, in most organizations, through a series of silos. The sales data is going to be put in a sales warehouse, and the manufacturing data is going to be put in the manufacturing warehouse, and all of that sort of thing. Do you think there's redundant data in there? Yes, there is. And again, our statistics show that 80% of corporate data is in fact redundant, obsolete, or trivial. If that's the case, by taking our data and simply dropping it into the warehouse, forklifting it if you will, and it doesn't matter whether your warehouse is in the cloud or not, you're just taking a bunch of stuff that most people aren't going to use in the first place and putting it in two places: in my legacy systems and in the new systems. If instead I do this properly, I can treat it as an exercise in engineering and architecting my warehousing capabilities. In doing that, I'm going to find a lot of like things, and I'm going to put those like things together, and in fact find that I can eliminate large chunks of the data. Now, you will never achieve 80% savings on the first pass, but let's just say you achieve only 10% savings. That's 10% off of your linearly expanding cloud bill; or again, your warehouse can be 10% smaller, or will grow at a lower rate. 80% of organizational data is redundant, obsolete, or trivial, so if you're going to put things in a warehouse, or in the cloud, you should do some analysis on them first. Remove that redundant, obsolete, or trivial data, and put the data out there where it will be leaner, cleaner, and more shareable. Again, great question. Thank you for asking that. Peter, who is...
Oh, do you know who is working on the updated CIF architecture? No, but I would reach out directly to the organization. It's funny, that's the one question I get all the time: when is the DMBOK 3 going to be out? We'll get it out as soon as it's done, but it ain't done yet, so that's probably the answer they'd have too, though they may have a timeline. The last I looked wasn't that long ago. So: how does data fabric relate to the data warehouse? When you're looking at capabilities, what a fabric does is go to an idea, I forget which Silicon Valley company pioneered it, but the idea was that you can simply plug into the wall and, just like you get electricity, you can get data the same way. Having a fabric allows the organization to identify its mission critical data in a very easy fashion, and to take that mission critical data and make it more accessible with generic types of tooling. You'll see this in companies like Databricks and some of the other companies out there doing really, really good work around it. And they're doing this, by the way, in spite of Gartner saying their technology is doomed. So again, I'm not necessarily a great Gartner fan, though they've done some really good things over the years. In fact, a friend who went to Gartner came back and told me this: last year was doom and gloom, everything's going to be software as a service, you're going to have to outsource everything, there's no more on-prem, and your budget is going to get cut by 30% next year. That was their message last year. This year, the message was, life is really, really great, because ChatGPT is going to fix all of our problems. Right, well, we know neither of those two statements is true, but it does nevertheless become the tail wagging the dog. So the fact that Gartner doesn't like fabric or mesh.
It's somewhat immaterial, given that some companies are using it very well in specific, limited environments. In fact, the US Army is looking at an implementation of mesh in a way that looks like it might be very useful for them, and something they're easily able to take care of. Great question. Maybe we should do one of these on fabric, Shannon, as we go forward. Yeah, very much so, I think so; we can take a look at that. So, what is the future of the data warehouse with new cloud concepts like the data lake, Delta Lake, data fabric, and data mesh? The wonderful thing is that we've got the lake concept down pretty well. I still see organizations having trouble with multiple data lakes, and I'm not saying multiple data lakes is a bad idea, but a lake is a great landing place for an awful lot of data; we can put it in a DMZ and do exploration before we decide how we're going to incorporate it into our production environment. And we've gotten quite good at that. Again, I'll point to Bill Inmon; he's written a really good book on the data lakehouse that many, many people are finding value in. The idea of the data lake is that we generate the schema on the fly as we need it, as opposed to trying to do the analysis up front. That works really well for a relatively limited subset. So here's the key with lakes and this type of concept: if you're not going to formally specify your metadata, the understanding of that metadata must be shared informally by the work group. If the work group is 100 people, you have a very big problem; if the work group is 10 people, you have a manageable problem. The lake size is determined by the size of the informal sharing of the metadata among the work group.
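To make the schema-on-the-fly idea concrete, here is a minimal sketch of schema-on-read: raw records land in the "lake" with no pre-declared schema, and a work group derives the schema it needs at query time. The record shapes and field names are made up for illustration.

```python
# Schema-on-read sketch: infer a schema from heterogeneous raw records
# only when someone needs it, instead of modeling everything up front.
import json

raw_lake = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": 2, "amount": 12.0, "coupon": "SPRING"}',
    '{"sensor": "t-7", "reading": 21.3}',   # a different silo's shape
]

def infer_schema(raw_lines):
    """Union of field names mapped to value types, built on the fly."""
    schema = {}
    for line in raw_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

schema = infer_schema(raw_lake)
```

Notice that nothing here records what `coupon` or `sensor` actually mean; that understanding lives only in the heads of the work group, which is exactly why lake size is bounded by how far informal metadata sharing can stretch.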
And that's a lesson that it's taken an awful lot of companies an awful lot of time and expense to learn. But it does absolutely add one more trick to our bag of tricks: we can put in place a landing zone and do some analysis on data that, in the old ways, required a lot more formal activity. Just putting it in a lakehouse or a lake or whatever we're going to call these things doesn't actually solve your production problem, though. Until you've got a billion users banging on your warehouse, you have no idea what kind of scale you're going to need, what indexing performance, what refresh cadence you'll need to support. You will get a much better look at the data faster, and that's a real advantage in many cases for that limited work group. But that work group then has to decide how to introduce the data into your larger warehousing environment, and again, I gave the example earlier of the warehouse refresh taking 48 hours. You're never going to catch up, quite simply. It's just an impossible situation, and I can't tell you how many times I've been called into organizations where they say, you know, we built this, and we built it the way we were supposed to, but we can't get the thing to reload in time.
It's like, no, and it never will; you're going to have to go to some sort of specialized processing. Again, what I'm calling specialized warehousing is a well-established industry, and there is lots and lots of guidance around it. Dataversity has a huge library of articles to go back and look at through the TDAN collection, along with additional topics around this. So while I say it's specialized knowledge, and you don't want somebody who isn't knowledgeable doing it, you also don't have to hire somebody with super fantastic expertise to come up with this; there are things you can do yourself. In fact, here at VCU, we have, quote, a warehouse. There's actually a SAS-based warehouse that somebody built, and that individual moved on, and now nobody really knows how to maintain or rebuild the warehouse, except that the ETL jobs still work and there are people still using it, so they figure, we'll just leave it alone and let it continue to run. That's not a typical story for warehousing, although again, I can tell you the one earlier about 30 million bucks and one user. Right, that's got to be a pretty good use case; I'm really going to be calling that successful. Peter, since you brought up what the definition of analytics is, can you tell me what a data scientist is? Because I fail to see the science. I see statisticians, and sometimes not even that, just people who know how to filter data and push data through program objects and a statistical library. I guess somebody asked me one of my favorite questions. I like to quote Eric Siegel, who is the person actually credited with creating data science as a category, and he says calling somebody a data scientist is like calling somebody a book librarian. That's a great response. Yes, I agree 100% on that. And really what we should be doing is looking at data science not as a generic class of things.
Because when we look at data science at the moment, we're seeing three major problems. The first is that data scientists are not productive enough, precisely because of that 80/20 split I showed you in terms of how much time is spent on data preparation versus analysis. That's just a huge, huge inappropriate use of money. A business in Germany called me the other day and said, but we've hired 20 data scientists and we can't figure out what to do with them, right? Well, that's clearly the wrong way to think about it. By the way, if your data scientists are only 20% productive and I come in and reduce their data preparation time from 80% to 60%, I've doubled the productivity of those data scientists. You're still paying them to do things they are not skilled at, but it's at least an improvement. So I said the first piece is they're not productive enough. The second is that they don't have any interest in learning the business; it's not put into them. These programs are taught out of statistics departments that have no idea what the business is, and all the data students get in their training programs is perfectly packaged data that says go find the optimal cluster analysis of this, that, and the other thing, which as a classroom exercise is really great but doesn't work well in the real world. I get to teach elective courses, which is really a luxury for me, because it means people don't have to be there. But I get students who come out of these wonderful courses, and they immediately say, great, where do I go apply cluster analysis? And it's like, in my courses, cluster analysis is probably not going to be the problem, unless you're doing your cluster analysis on the metadata, which they find a really interesting proposition.
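The productivity arithmetic above is worth spelling out: cutting preparation time from 80% to 60% of the workday takes productive time from 20% to 40%, which is a doubling. A few lines make that explicit.

```python
# Productive time is whatever is left after data preparation.
def productive_share(prep_pct):
    """Percent of time spent on actual analysis, given prep overhead."""
    return 100 - prep_pct

before = productive_share(80)   # 20: only a fifth of the day is analysis
after = productive_share(60)    # 40: prep trimmed by a quarter of the day
gain = after / before           # 2.0: productivity has doubled
```

The counterintuitive part is that a modest 20-point reduction in overhead produces a 2x gain, because the baseline productive share was so small to begin with.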
Our data scientists are not interested in learning the business, they're not productive, and they have a real challenge with domain knowledge. So what I've said over and over is that we should not let anybody get a degree in, quote, data science. But if you're going to get a data science degree in dental hygiene (yes, they do have them), that's a great job for a data scientist. Or get one in logistics, where somebody really understands how goods and services move around and how data is used there: logistics data science. If you don't have that additional word in front, you're just a data scientist. Unfortunately, I think the hype cycle around this is now starting to fizzle out, and we're seeing it go into the trough of disillusionment, where people have invested. By the way, everybody on this webinar should remember big data. Remember how that was going to solve all of our problems as well? What happened? Nobody talks about big data anymore, because it was never a thing in the first place. What we did create out of all that was a whole new class of tools that allowed us to incorporate parallelization, so we can talk about big data technologies. That's what some of these really wonderful companies have now started to put in place: big data technologies that are becoming useful. But you don't see people playing with Hadoop anymore, do you? Right. So again, let's be careful as we dive into these things and figure out how they can be used properly, rather than simply whether or not they are useful, because that's never the answer, I don't think. Again, great question. Thank you for asking that. I didn't know that was your favorite question, Peter. No, yes, exactly; I go back at least 15 years on that one. Yeah, it's true. So, Peter, can data warehouses and data marts be implemented with a graph database? Pros and cons versus existing methodologies? Fantastic topic.
I wish I had space in here to incorporate some of that material, but the short answer is yes. More to the point, it's not so much that graph technologies are going to do those things for you; graph technologies allow you to do more with a basic vocabulary than you've been able to do before. One of the first steps with graph is to come up with that common vocabulary we're going to use and to build constructs around it. And the basics? You're not going to build an Inmon warehouse in a graph database, and you're not going to do a Kimball thing in graph databases, but what you can do is that preparation and query type of activity, so the graph database points you in the right direction pretty quickly. Let's say the entire problem space is 100%; graph will allow you to go in and figure out what the relevant 20% is much more quickly. But in all likelihood, and maybe it's just that the implementations I've seen are not as mature as we'd like, every graph implementation I've seen eventually runs out of road; you eventually need to get back to real data at some point. By then, though, you've narrowed the problem space to the point where it's a fairly easy thing to do, and I've seen a lot of organizations just resort to spreadsheets, which is not really what we want but is nevertheless still a good tool, and a lot less expensive than a $30 million data warehouse with a single user on it. Again, great question; graph databases is one we get with increasing frequency, Shannon, so maybe we'll look at that when we put together the next round of these programs next year. Indeed. So, can you briefly mention the role of natural language processing and recent AI advances, and how they will affect reporting, presentations, and user interactions? How many of you out there have heard the term prompt engineer?
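The narrowing idea above can be sketched without any graph product at all: start from a term in a shared business vocabulary and walk its neighbors to find the small slice of entities worth querying against the real data. The vocabulary and its edges below are entirely made up for illustration.

```python
# Sketch of using a vocabulary graph to narrow the problem space before
# going back to "real" data: a breadth-first walk from a starting term.
from collections import deque

vocabulary = {   # term -> related terms (edges in the graph); illustrative
    "customer": ["order", "address"],
    "order": ["product", "invoice"],
    "product": ["supplier"],
    "address": [],
    "invoice": [],
    "supplier": [],
    "employee": ["department"],    # unrelated branch we never reach
    "department": [],
}

def related_terms(start, depth=2):
    """Collect every term within `depth` hops of the starting term."""
    found, frontier = {start}, deque([(start, 0)])
    while frontier:
        term, d = frontier.popleft()
        if d == depth:
            continue               # stop expanding past the hop limit
        for nxt in vocabulary.get(term, []):
            if nxt not in found:
                found.add(nxt)
                frontier.append((nxt, d + 1))
    return found

terms = related_terms("customer")
```

The walk returns only the handful of entities near "customer", which is the narrowed 20% you would then take back to the warehouse or, yes, even a spreadsheet.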
Now, the nice thing about being a prompt engineer is that it means our need for English literature and humanities degrees is back on the rise. Most of us in IT and data tend to be pretty linear, or logical, in our thinking. If you're going to actually get ChatGPT to do something... and I should point out, I had a student this semester who did a great job of taking some general metadata and asking ChatGPT to create a data model out of it. It did a passable job. Now, just like all AI, it hallucinates at a rate of about 15%, so some of the output was absolute rubbish, but we got a good 80% out of it that was a workable model to start with, which meant we could leapfrog ahead of the traditional methods. Remember: your job will not be taken over by an AI; it will be taken over by somebody who knows how to use AI better than you. When we look at this new concept of prompt engineering, it's somebody coming in with a liberal arts background, who is more conversant with the English language than a traditional data or IT person, and who can actually frame the problem in the context in which the business is trying to solve it, which means you have knowledge of the customers and of the environment the customers are working in. I can remember a query somebody was trying to show me, and I finally figured out: you mean they're going to be making this query while they're literally under fire, in a military context? That doesn't make a whole lot of sense, whereas "help, I'm pinned down and need the vector to get us out of here" does. That is something a prompt engineer will be much better at than a traditional IT person, so look for these prompt engineering degrees and jobs to start coming up in the future. We will definitely see more of them as this latest AI bubble bursts and we go into our third AI winter.
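One way to picture what a prompt engineer contributes is wrapping a bare request in business context before it ever reaches a model like ChatGPT. The template and every field below are hypothetical, purely to illustrate the framing step Peter describes.

```python
# Illustrative sketch: a prompt engineer's job is largely composing
# context (who is asking, under what conditions) around the raw task.
def build_prompt(role, situation, request):
    """Compose a context-rich prompt from business knowledge."""
    return (
        f"You are assisting a {role}.\n"
        f"Situation: {situation}\n"
        f"Task: {request}\n"
        "Answer concisely and flag anything you are unsure of."
    )

prompt = build_prompt(
    role="logistics analyst",
    situation="a unit is pinned down and needs an exit route",
    request="compute the safest vector out of the area",
)
```

The model sees the same underlying task either way; the difference is that the contextual framing is exactly the knowledge of customers and environment that a traditional schema-first query would never carry.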
Again, if you haven't seen it, there's a whole talk on that topic as well. Great question; the prompt engineer is going to be a new job category, a really useful one, and a set of skills people are going to have to start to put together if they want to live in this new world and get data out of their warehouse using artificial intelligence. And remember, AI is simulating intelligence; it's not real intelligence. By simulating it we can still learn from it, but just as with any model, remember George Box's statement: all models are wrong, some models are useful. That's a really good way to think about AI as well. So, Peter, a correctly formulated question, you know. Going back to the previous question about why a data warehouse on the cloud should be smaller than on premise: we've answered that already; it was just restated, so apologies. That's it for the questions we have. If there are any additional questions from the attendees, we'll give you all a moment to type them in. And while you're doing that, Shannon, I'll show a little bit of additional information that I'm including with this. I didn't go over most of it, but I consider the Corporate Information Factory such a useful concept that there's a fair amount of material we've pulled out of it to give you all a little more guidance on the kinds of things you want to pull together; so just know that the PDF you get is this plus a little bit more. At Dataversity we love to give everybody just a little bit more than what they asked for. Very true. We love our resources, we love our data, right? Lots of learning. So, oh, are you going to be speaking at EDW 2024? I certainly hope so.
And we are all looking forward to that: DGIQ next and then EDW after that. Very much looking forward to seeing everybody; it's been so much fun to get together in person. I've seen people I haven't seen literally in years, and it's practically a joyous occasion for us all, so we're super happy that's an option at this point and looking forward to it. Yeah, the recent EDW we had was so much fun. Same for the next one, right? Florida in March. Yeah. Good. Yeah, there seems to be a theme. There's a comment in here, I think on the question: putting all the data in a warehouse is analogous to words that can be deleted from a piece of writing without the meaning being diminished. Thank you for observing that. Well, I think that's all the questions we've got for today. Peter, thank you so much, as always; this was a great presentation. I really appreciate it. Thanks to all of our attendees for being so engaged in everything we do. Again, just a reminder: we'll send a follow-up email by end of day Thursday with links to the slides and to the recording. Thanks, y'all. Thanks, Peter. Thanks, everybody, for joining. Have a great day.