Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor for DATAVERSITY. Thank you for joining the latest in the monthly webinar series, Lessons in Data Modeling with Donna Burbank. Today Donna will discuss data modeling for big data.

Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. We very much encourage you to chat with us and with each other throughout the webinar. To do so, just click the chat icon in the upper right-hand corner of the screen to activate that feature. For questions, we'll be collecting them via the Q&A section in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag Lessons Data Modeling. As always, we will send a follow-up email within two business days containing links to the recording of the session and additional information requested throughout the webinar.

Now let me introduce our speaker, Donna Burbank. She is a recognized industry expert in information management with over 20 years of experience in data management, metadata management and enterprise architecture. She is currently managing director of Global Data Strategy, an international data management consulting company. Her background is multifaceted across consulting, product development, product management, brand strategy, marketing and business leadership. She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia and Africa and speaks regularly at industry conferences, including ours. And with that, let me turn it over to Donna to get us started. Hello and welcome.

Hello Shannon, glad to be here. So, yeah, I will just kick it off. Shannon already introduced me very well, so I don't think I need any more introduction.
I will mention, though, if folks are on Twitter, you can follow me at Donna Burbank. And Shannon, remind me what the hashtag is for this event. Lessons Data Modeling. It's up in the chat. Lessons Data Modeling, yeah. So if people are Twitter folks, please do.

Some people might know me. As Shannon mentioned, I've been in the industry for over 20 years, either in consulting or with several of the data modeling products I have been involved with. I wrote a couple of books on data modeling, which are listed there. So data modeling is near and dear to my heart. And one of the things we'll talk about today, and I guess actually throughout the year, is that one of the reasons I love data modeling is that it applies to so many different areas of the business and of technology, and in this series we're going to try to touch on a lot of them. So if some of you joined last month (and if you didn't, it's available as a recording), we talked a bit, just to kick off this series, about why a data model is important to your overall strategy. And I feel strongly that it is, because it touches so many different types of technologies and projects.

And then, again, I mentioned just a minute ago that so many things have changed in the industry, and big data is one of them. So unless you've lived under a rock for the past few years, you've heard the words big and data put together. But what does that really mean? So I'm going to go through that in this presentation and talk more about it. It's a hard one to put together, I'll be honest, because I know the audience for these webinars, and it was true in the last one, is so varied when we look at the different titles, especially when we're talking about big data. So some people are looking at it from an organizational point of view. They might be at the C level or the management level. Some people are very techy and are actually building the models themselves or implementing a Hadoop infrastructure and maybe have questions about that.
So I tried to touch on all of those, and that's another great thing about a data model: it can touch on all levels of the organization. So I hope I give something to everybody and nobody is overly disappointed. So that's some background on that.

So without further ado, what we will cover today. And apologies, or hopefully it's all right: I tend to get philosophical about some of these things. That's just how my odd little brain works. As I've been thinking about the industry, I've just seen such a change, not only in technology but in culture in general, which I think kind of led to this new big data world. So I want to talk a little bit about that. And then how that maps to the larger information management landscape, what it means there, and how data modeling fits with that. I'll talk a bit, as I just mentioned, about the modeling and technology, as well as the organization. And selfishly, what does that mean for us, the folks that are data architects or data modelers on the call? Do I still have a job in big data? What does that mean? Is it something new? Do I have to change skills? So we'll talk a little bit about that and then, as always, summary and questions. So as Shannon mentioned, there is a chat for the questions at the end.

So for those of you who are data modelers or aspiring to be one, you will know that data modeling is all about context, and a lot of it is definitions. So we're going to start off the call with that, because when we talk about data modeling, there's often the question: what do we mean by a model? Especially when we talk about big data, because there are different connotations to that. So what we're not talking about (and you're free to leave now) is the type of analytical, statistical modeling that you might be doing in R or Python or SAS.
We will talk about the model in the middle, your typical entity relationship model, or what was formerly known as the entity relationship model, in the world of big data, and how that might help and augment the world in the first column: how, when you're doing your statistical modeling in SAS or Python or R, you might need a data model to help with that and with some of the context and definitions. So we will cover that. This isn't a how-to on scripting in R. I say that just because I know, especially when we're talking about modeling in the context of big data, there are two connotations, and then the third one, Mr. Fabio on the right. We're definitely not talking about that type of modeling, but you never know why someone might have joined this. I will say that my last name is Burbank, which is also a city outside of Hollywood, California, and my profile says modeling, so I get a lot of strange modeling agencies trying to follow me on Twitter. They clearly did not look at my picture, because I'm sure they're disappointed when I start tweeting about data. So we will talk more about the importance of these definitions as we go through. When I talk about a model, what do I mean? What we mean on this call is the graphical data models that describe both the business and the technology in an organization.

Okay, so again, I apologize in advance for my slight Donna rant, but I hope it will make sense in your mind as it does in mine. When I first got interested in this world of big data, I started to step back and look at how the industry and technology in general have changed. So I applied it to something like the telephone. So think way back. When telephones were first invented, you actually had a switchboard operator, right? So there was a central source; one by one, you would make a phone call. I would schedule that phone call and have to go to the operator, and whether they listened in or not, which they sometimes did, they would pass you through to the other operator.
It was a big, huge technological innovation when people actually had their own telephone in their home and could dial. And yes, I'm old enough to remember the rotary phone, which was kind of fun to dial, but a lot slower than what we had later. And then you had the touch-tone phone. And then a huge technological jump: not only a phone you could pick up and carry around the house, but this idea of a cell phone that followed you around. And then as we progress, you'll see that, as we know now, a cell phone isn't a phone. Who actually talks on the telephone anymore, right? It's a text device. It's a way to get to the Internet. It really is a lot of people's lives, and it really is a little mini computer. So just think of that when we think of telephones.

So what has happened? If you look at the bottom, there are some parallels with what happened in the computer world. So think of around that same timeframe: you had the big mainframe, which is very much like that switchboard operator. It's big and centralized, and you did scheduled jobs, right? You sort of got your time on that server. When you think of things like distributed computing, that's kind of the phone you can pick up and move around the house yourself, right? That democratization of computing was a huge switch. And then the world turned on its head and you have things like the Internet, and again, it all comes back to this ubiquitous cell phone. And the change is rapid. When I was looking at the technology, you look at the phone in 1970 and then 1980, and then you get to the 2000s. I mean, I think I'd have to update this each week. I think next week, my slide will show a new iPhone 25 or something that's come out, right? So not only is it a much more massive volume, it's more democratized, and it's more active. No longer do you passively talk on the telephone. People use this as a way to interact.
Think of when the Internet, the web, first came out. It was something you read. Now, if someone can't tweet about it or comment on it or share it or edit it, it's not so interesting. So it really is a whole new world. Just think of the volume behind that.

So to go a bit deeper on that, when you think of this technological and cultural shift in data management, it really is different. When you think of the world of the mainframes (and yes, I am old enough to have programmed on those too, although I didn't have the most modern high school, so I actually did some way back then), it was very much waterfall methodology, partly because you did have to plan ahead and schedule. Back to the phone analogy, I do remember when you'd have to plan ahead: I was going to call grandma on the weekend because it's cheaper. Now you just pick up the phone and call. It really is a different world.

And again, thinking of the 1980s and 1990s, there was that democratization: the relational database, individual PCs, this idea of client-server computing. That's really where a lot of the traditional technologies come from that we traditional data modelers are used to, like this idea of data warehousing. That was sort of a glory time in a way. That was a very new concept and we had some excellent new innovations. For a while after, if I'm honest, I think it got a little boring. We had the dot-com bubble, and that sort of burst, and then all of that great stuff we built in the 1990s and beyond, we had to clean up. So a lot of it did seem to be just doing more with a little less. How do we integrate what we have? How do we clean it up? How do we have data quality? Which is valuable, but not the most exciting. I will have to say that has all changed, and that's why I am still in data management. So there are the innovations that have happened, not just in big data, but in things like cloud computing and NoSQL.
The new technologies that are available in big data really are turning the data world on its head. So how do we relate to that? And how are things done differently? There is this idea of agile development, which is all about democratization, if you think of the Agile Manifesto. It's all about smaller chunks of information, enabling people to work together in a more democratic way. That is very different from the traditional waterfall methodology, which historically has fit better with the whole data modeling paradigm. I designed something and I built it and I tested it, which is fine. That works very well for many things. If I'm sending a space shuttle to the moon, I'd like to have that tested. But it doesn't work well for everything. So I think we are, as an industry, kind of battling with those two paradigms. We've got design, build and manage. And then we've got this new world that is rapidly changing, real time, agile, turning on its head each day. And how do we apply data modeling to that?

So there really is a bit of, I think, a paradigm shift in the industry. If you think of the traditional data modeling paradigm, it's the idea that you design and then implement. It's a very top-down, hierarchical way of doing things. When you think of big data, it really is the other way around. It's more: how do I discover information that's happening and then maybe analyze it? It's a very different way to work. Take all of these with a grain of salt. For the traditional world I say, in quotes, a manageable rate of information and a stable rate of change. I certainly don't think that, back in the days of the data warehouse, we thought we didn't have a lot of data. There's just so much more now. And as in my first slide, we're not talking as much about the statistical analysis type of modeling in this presentation. It's the data modeling that supports that.
So it really is a different paradigm, from design and build to discover and then build and analyze, which is a very big switch.

Again, as I wax poetic in my brain, I really think of it in terms of how a lot of things change in life. So think of the, quote, old traditional way of looking at the world. You may remember from school that Linnaeus in 1735 created that taxonomy of biological systems, right? Kingdom, phylum, class, order, family, genus, species. I remember having to organize that and memorize that in biology class. Well, it was valuable in its day, and we still use it. Think of the periodic table of elements. It was a way of looking at the world and saying, if we could only understand the structure of the world and organize it, we've got it. So, not to get philosophical or religious, but maybe a creator created the world in a certain way, and all we have to do is unlock this, and we have a piece of it by understanding. I just put it in the right box. There's a huge part of me that really likes that. It's just so nice. And yes, if you look at my office, everything's organized in file cabinets and they're labeled and they have sub-files. And before you judge, I think a lot of us are in the industry because we like that organization, and that's a great thing. Think of other related industries, like accounting. It's all about how we organize and classify, and that's a lot of what we do in data modeling.

But as I learn more about the world and start thinking about this, there is a whole new way of looking at the world when you think of things like chaos theory. So I think this traditional way of saying, if we just organized everything into this little bucket, we'd understand it, is being turned on its head, just like our technological world.
So you may be familiar with the concept of emergence, which was new to me several years ago when I was researching this. And notably I took this definition from Wikipedia, which is not the Encyclopaedia Britannica but this new way of crowdsourcing, the wisdom of the crowds, which has its downsides, but I use it a lot at work. So where is the balance between this crowdsourcing model and the more designed and built one? But I digress. The idea of emergence really is just that: taking complex systems and finding patterns in that complexity.

So one example might be a snowflake, right? No two snowflakes are alike. They're immensely complex. Each snowflake is minutely different. Yet there is this concept of snowflakeness that comes through. I think we can all identify a snowflake. There are patterns within those.

It's also used more practically in things like city planning. So think of it most simplistically. In the old days, you might have built a city by saying, I'm going to start from scratch, and we're going to build streets and roads that are lettered A through Z and numbered 1 through 10, and they're all in a nice grid. Or you can build roads more organically. I remember at my university, they had those nice straight pathways until everyone wanted to find the quickest way across and walked right over the lawn. So if you can't beat them, join them: they found the way that everyone was walking and actually paved it. If anyone has lived in Rome for a while, it's a very simple city to follow if you just walk the way people would have walked. Driving a car is terrible, because the city grew organically out of how people walked to the store. On foot, the streets make sense. So this is that idea of emergence: organically generating patterns out of what seemingly is chaos. So when you think of things like, on the right, this idea of social media, a lot of people are doing things like sentiment analysis on social media.
There are gems of information out there, but how do you find them in this maze of massive amounts of information? So again, say you are a jeans company. You might be something like Levi's Jeans. You're trying to find out what people are saying about your product. Well, maybe somebody on Twitter is saying, hey, there's a sale on Levi's, 20% off at Macy's. Well, I would want to know that. They're discounting my product. Pretty important. Or it could be a name or a typo. Is Levi coming to my party? That's a name. Or, LOL, leaving soon. That's a clear typo, right? Or, I love my new Levi's jeans. I certainly want to know that. But how do you find the two that do make sense among the ones that don't? So again, this idea of taking a massive volume of information and finding relevant patterns in it is a big part of big data and part of the idea of modeling big data.

So anyway, I find that interesting because this pattern isn't just something that's unique to data. It really is changing the way we look at the world and the way people interact with each other. It's changing the operating model of an organization as well.

So how does big data fit with the larger information management landscape? We showed a similar diagram in our last webinar, last month, on strategy. This is the framework we use in our practice. There are different ways of organizing the enterprise; this is the one we tend to use. And we always start at the top with this idea of business strategy. So what am I trying to do with my business or my organization, and why is that relevant? That should drive everything else. So it's this top-down alignment with business priorities. That should drive everything, especially when we're talking about big data. We'll talk more about that.
But to really make the world make sense, it's a mix of how I do this top-down from the business and then bottom-up, because there is an inventory of different data sources, from your typical relational databases to unstructured data, semi-structured data, document and content management, et cetera. That's the complexity I'm talking about. And some of you might rightly say, why do you have big data as a separate box? Because that does include things like unstructured and semi-structured data. Yes, it does. But there are things like the big data platforms that we'll talk about today, Hadoop and Cassandra, et cetera, that can manage those.

I won't spend too much time on this other than to say big data doesn't live in a vacuum. You can do some really neat things with big data and do some discovery. And a lot of the companies I work with have these interesting proofs of concept to test things like social media analysis, for example. But unless that's tied into the larger enterprise strategy, it really isn't valuable. How does that relate to what you're trying to do? So what we spend a lot of time on in our organization is: what are the good use cases for big data? How does it apply to your larger enterprise strategy? Is it part of your 360 view of the customer, for example?

And then how do you build things like governance around that? The model I just described, this idea of managing chaos, is very much what you're trying to do with data governance in the big data world. That doesn't mean it can't be done. We're actually doing it with several organizations. But it's a very different model from your traditional relational database. There is some idea of looseness: I'm really trying to do creative discovery on these data sets. In many cases there is not very rigorous data governance in certain areas of this big data world. And then how do you integrate that?
If you are doing some social media analysis, how do you integrate that with your master customer record, that golden record you've created? With data warehousing information? What relevant information is mixed with that? Is the data quality relevant? We'll talk a lot about that in this presentation as well. And then, again, how do architecture and modeling apply to that? And then, as we alluded to, how do you integrate that with the other sources? Because big data by itself is interesting, but it's really interesting when you start integrating it with other areas of the organization, adding the metadata around it, really doing a full asset inventory and seeing how it's relevant. This is kind of a cool finding, but how does it help drive your business and effect change?

So, quickly, on definitions. Don't roll your eyes as we go to the definition of what big data is. But again, if you haven't been living under a rock, you've heard of big data, and you've probably also heard different variations of the Vs, right? So everyone has their version. I'll give you mine. I think the common ones are often the three Vs. It's the volume that I've talked about already, the high volume of data; the velocity, the fact that it's generated so quickly and changes so quickly; as well as the variety, the fact that it's not just relational databases. It's in log files, machine data, et cetera. But I'm a big fan of adding a fourth: value. Unless you manage it and have that modeling and metadata around it, where's the value? So what's the insight you're getting from this? There's a reason you're doing this. There's a cool factor of, yeah, I'd like to play with Hadoop, because who doesn't, right? With new technology, we want to play with it. But what's the use case around it? And how are we going to relate that to the other data in our organization to actually drive some value?
Further on that point, there's another V, and I've heard this from others, it's not mine: veracity, or truth. Is the data accurate? Is it relevant? So if you look at some of the statistics, you'll see that folks often think, okay, data science and data lakes, we can gather the information together and magic happens. Well, there's no magic. So if you look on the upper right there, this idea of data lakes: these are all quotes from industry experts and organizations that have done some surveys. The quote there says the lack of metadata and data definitions is one of the main impediments to a data lake. So some of the data swamps get created because there isn't that metadata. You can't just dump data and have things magically happen. There's a lot of work to do.

So if you look on the upper left, that's kind of the work. If you talk to these data scientists, the people who are doing the data analysis, often the frustration is that a huge part of their time is spent cleaning and reformatting the data to make it fit for purpose. Now I'm in the lower left, right? That's not the favorite part of any data scientist's job. You don't go into this job to say, I'd love to just do some address and name matching and clean up my data. No, you want to find the new discoveries, right? So how do we make that easier? Because, in the lower right, a lot of the reason people are going to big data and a lot of these technologies is this whole idea of the digital organization. But if you can't find the right data, and it's not consistent and it's not right, we're not getting anywhere. So I'm probably, as they say, preaching to the choir with this audience, but it is something to consider. And maybe these are statistics you can use in your organization to help drive home that idea that it is not magic. There's a lot of work that needs to be done, and some modeling behind it.
And this furthers that idea: there are massive opportunities with big data analytics and big data storage options to help store that massive volume of information. But think about the metadata behind it. Where did this data come from? What's the purpose of that data? What are the units of measure, right? What are the definitions of key terms? Without those, it's not going to be valuable. And online, there's actually a metadata course we put together with DATAVERSITY that talks a little more about this. I think this slide was stolen from that, in all honesty. We talk more about that there. There's so much information. Think of open data. Most governments now will post information on the web on weather, on population, on finance. But there's always metadata behind it, right? When was this data calculated? Who organized it? What metric units is it in, et cetera? Otherwise, the data makes no sense.

So on that note, yes, there are data modeling cartoons in the industry, in case you were wondering. And this is a cartoon from the book I put together with Steve Hoberman and Chris Bradley. For those of you who have gone through this, you might laugh; for others, you might just think it's weird, which maybe it is. But who has not gone through this, right? So, okay, we're almost done with our acceptance testing and everything looks great. We're going to launch the application. Just one question. What do we mean by customer? And that was always something that came up when I was early in my career and I'd go to the DATAVERSITY conferences and things like that. People would always bring up that example of trying to get a single view of the customer. And early in my career, I was like, how hard is that? A customer is a customer, right? All right, don't laugh. But we all know there are so many variations. Is it an active customer? Is it a lapsed customer? Is it a high value customer? Is it a customer versus a prospect?
But that's all the stuff you do in data modeling. And that's important, whether it's big data or small data or any data, or just having a conversation with someone. Context is so important in getting the meanings of these core terms. I mean, I have worked for and with many organizations that have made massive business errors on something as simple as getting the definition of customer wrong. I get ads all the time from a credit card company with whom I have a credit card, offering me a good deal on that same credit card. They don't know that I'm a customer. And that's different from a prospect. This happens all the time. I think we're used to it. So again, these basic definitions are so critical to any business organization.

Further on that, big data analytics is hot. A big reason you would have big data is to do the analytics on it. But you need the metadata behind it. So here's just one example. One business use case of big data is something like smart metering, right? So, not the old thermostat in your house. You actually have a smart home where you can control it on your cell phone. You can get great analytics from it. And the company that's providing the information can help optimize that because they have the analytics. So maybe it's something as simple as: our analysis shows that energy usage with smart meters increases less than if you had a traditional thermostat, because you can control it more. So, sort of an interesting finding. But it's only when you start looking at the metadata behind it that it starts making sense. What was the source of this data? Was it taken monthly? Was it taken weekly? Are they averages? Did you have the same metric readings? Were the meter readings taken from the meter or from the billing? Did you calculate this by individual or by household? All of these questions really need to be answered before you understand what that analysis means.
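The smart-meter questions above can be made concrete by recording the metadata alongside each data set and checking it before two sets are compared. What follows is only a minimal sketch in Python, not a real smart-metering API: the names `ReadingSetMetadata` and `metadata_mismatches`, the field names, and all the values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadingSetMetadata:
    """Hypothetical metadata describing one set of energy-usage readings."""
    source: str        # where the reading came from, e.g. "meter" or "billing"
    interval: str      # how often it was taken, e.g. "weekly" or "monthly"
    unit: str          # unit of measure, e.g. "kWh"
    aggregation: str   # e.g. "raw" or "average"
    grain: str         # e.g. "individual" or "household"


def metadata_mismatches(a: ReadingSetMetadata, b: ReadingSetMetadata) -> list[str]:
    """Return the names of the metadata fields on which two data sets differ.

    An empty list means the sets were collected the same way; anything else
    has to be reconciled before the two sets are compared.
    """
    fields = ("source", "interval", "unit", "aggregation", "grain")
    return [name for name in fields if getattr(a, name) != getattr(b, name)]


# Illustrative values only.
smart = ReadingSetMetadata("meter", "weekly", "kWh", "average", "household")
traditional = ReadingSetMetadata("billing", "monthly", "kWh", "average", "household")

mismatches = metadata_mismatches(smart, traditional)
print(mismatches)
```

In this sketch the two sets differ in source and interval, so a finding like "usage increases less with smart meters" would need those differences resolved before it could be trusted.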
And that's the metadata behind it. And again, it doesn't matter if it's big data or small data, you still need those core definitions. The business case remains the same, whether it's the data warehouse or traditional relational data or big data. I mean, your business sponsor, or you yourself if you're a business person, might just say something, quote, simple: tell me what our customers are saying about our product. It seems easy enough if you're sitting on the business side. And can I have that by this afternoon? How hard could that possibly be? Well, we know how hard that could possibly be. So if you look at the lower left, in the traditional data warehousing world, there's just the question of which customer database you want. We have about 25. Can we get an inventory of that? And by the way, something as simple as customer name is formatted 16 different ways across the 25 systems. We get to the world of big data and that complexity doesn't go away. It's just a different source. So you may have a data scientist saying, okay, I have to import the raw data from all of these sources and write a program to parse it and then analyze it. This takes a lot of time, and you need the metadata behind that, and then you need the models to combine the two.

And as I mentioned before, that's where the value comes from: really combining data warehousing and big data. So think of something like customer experience optimization, or how likely my customer is to churn, or how my customers are using our products, especially if it's an online product or maybe even a cell phone. You can literally see how people are using your product and then do some really interesting things with big data. But if you don't know who your customer is, you're not going to get any value. I've done some of that big data analytics with customers on very similar projects. We did some great social media analysis, and we have a subset of high net worth individuals we'd like to target.
But when we look at our own data, one John Smith is a high net worth individual and one John Smith is bankrupt, and we can't tell which John Smith is which. It makes a big difference when you're doing a marketing campaign. So again, big data is great. In many cases it does need to be integrated with a warehouse and more traditional data to really get the full value of that information. You can get insights without that. It's still interesting to see trends and things. But a lot of the value really comes when you can integrate the two, the more traditional and the more innovative sources.

One of my favorite case studies is a little dated now, but I keep using it, maybe because I'm old-fashioned. No, it's because it is very helpful, and it's out there on YouTube if you want to look. It's from TDWI Chicago back in 2013, and it was Facebook talking about the data warehouse. And I like that because, if you look at the quote at the bottom, one of my pet peeves (and I won't do the full rant here, just the short version) is this idea that our society, or people, or however our brains are wired, tend to think in terms of or, right? It's one thing or the other. And Jim Collins, the author of the business book Good to Great, talks about the genius of the and and the tyranny of the or. It's not big data or data warehouse. It's and; it's putting them together. Each has its use case, and they're starting to merge now with some new technologies, but the philosophy behind each is very different.

So think of Facebook as one of the founders of big data. They're excellent at big data. One of the things big data was great for, for them, is this idea of exploratory analysis, finding patterns in data. Sometimes it's the creepy stuff: I'm posting from Paris and I didn't put in my location. They can infer it based on who I'm talking to and where I am. And a lot of the very interesting things they do are based on big data.
What they couldn't do very well is something simple, like how many customers do I have. And that was where they needed a data warehouse. Not only for the semantic definition of a customer that I mentioned, but also just technically, if you're trying to do something very simple, like how many users are logged in by region, that was actually a lot faster on a data warehouse. I think the quote he gave was a minute on the warehouse, an hour on Hadoop at that time. I know things have evolved since then, but still the idea is that certain technologies are good for different things. So I thought it was actually very clever of him to sort of almost come out at TDWI and say, yes, I am Facebook and I built a data warehouse. Because it isn't an or, it's an and. They can work together. What was interesting is one of the other examples he gave was they had an issue with defining what they meant by a user. And we can all relate to this if you are a Facebook user. You might log in on your laptop and log into Facebook and post, listening to a great webinar by Donna Burbank, you should join. Or you might just sort of always be logged in on your cell phone and not really actively using it. Are you logged in? Or what if you're on Spotify and you post a song from Spotify? Are you technically a user of Facebook at that time? All those different variations of user are what a data model was good for. Does that sound familiar? We're all on the way down with our assessment testing. What do we mean by customer? Again, that doesn't go away just because it's big data. So how does that relate to things like modeling and technology? You may be familiar, and I've shown this before, and other people have, it's not mine, with this idea of the levels of data modeling. And each does have its own purpose. So a lot of what I was talking about earlier with these definitions may live up at this conceptual data model level.
That's where you define some of these core business terms and rules. What do I mean by customer? How does that relate to a prospect? Is that different from an internal customer, if I'm talking about an internal group I'm selling to? All of those types of basic questions and relationships and business rules around data can be done at the conceptual level. It's also done at the logical level. Often I think you get a bit more detailed when you get to the logical level, with the detailed rules. Can a customer have more than one account? Does that customer have to have an account to become a customer? All of those types of questions can be done at the logical level. And I'll talk more about this. That's often where you start to get into physical design, and that's sometimes where, with big data, things break down. And we'll talk about that further. And then when we get into physical, I'll talk further about that, but that's really where you are generating a physical database. And so what does that mean in the big data world? I think things make a lot more sense in the big data world up here. All the stuff I was talking about, this idea of communication of business rules. What do I mean by a customer? What is a payment? Who's an employee? All of that can be done with any technology. I've even done these for customers that never actually did a data implementation at all. They were trying to do some business process optimization. I might not even have called it a data model. I was just talking about some of the core aspects of their business and used a model to help define who is your customer and what is your product and how do you sell it? So again, this is valuable just to understand the business, because we talked about the importance of definitions. What do we mean by a model? What do we mean by a customer? What is a household? Is it people who live in the same building or do they have to be family members?
If you're a data modeler, all of these should sound sort of familiar. And my little joke, if anyone speaks Italian: API means api, which is a bee. I lived in Italy for a while and I was cracking up because they would just call an API an api. A little picture of a bee. So anyway, all of these things, when you talk about terms, this happens all the time in data warehousing. What do you mean by total sales? We have different regions calculating total sales differently, and when you're trying to get through an audit, those types of things are going to come up. So that does not change. Whatever technology you're using, it is still valuable. If you remember back to that fifth V, veracity, one of the biggest failures or issues in data lakes is that lack of definitions and lack of metadata. And we'll talk a little bit more later on about why that is. So when you get into the logical level, I mentioned, okay, in the traditional relational world, this is generally your precursor to physical design. You're thinking about normalization and things like that. So does that make sense in the big data world? Not so much, when you're thinking of the physical design. But the key business rules and relationships and attributes do. So again, can a customer have more than one account, do they need to have an account? What are the key attributes of a customer? Are they required? What are the data types of those? All of that type of information is still valuable to the business. So that still does need to be done. It can be a very helpful exercise. Now the physical data model, that's the optimization and design of a physical database for storage and performance, as well as table structures and all of that. I think most of you on the call probably are very familiar with this. But what does this mean in the big data world? And I think that's kind of where our brains get kind of turned around, because it is a different world.
So in your traditional data modeling world, you kind of have this idea of schema on write, or the idea of design and build. I was talking about the kind of kingdom, phylum, class, order, family, species, right? If I design it and then build it, everything works well. It works very well in many systems. So the idea is I have customer information. What is it about the customer that I want to store in the warehouse? I need their name and their address and their gender and their employment options, et cetera. You create that structure and then you can generate your physical database structure. Most data modeling tools on the market can also do what they call reverse engineering: if you have a database structure, the beauty of having that metadata is that you can then read from that. So it is bi-directional. Yes, it was schema on write. Some human being had to create that, but then once that is created, that can be read automatically by many systems. So ETL tools and BI tools and data models can all read that schema and that data structure very well. There's sort of inherent metadata in it. Well, when we're thinking of the world of big data and you've... Did I just do that backwards? Yes, schema on write and then schema on read. This idea of schema on read. So you've probably heard that if you're in the big data world. What the heck does that mean? It doesn't mean that you sort of automatically read the schema and it happens. There's still human design in that. So the idea is that idea of discovery. So if you're familiar with the big data ecosystem, you have HDFS, which is in the Hadoop world. That's your file system. That's where you take all that great information and you dump it, for lack of a better word, right? And there are many reasons to do that in the big data realm. It might just be cheap storage, right? Well, it can be a lot less expensive than other systems. It could be, and this is a valid use case:
I don't know how I'm going to use it later and in what format I'm going to use it. I just know it's important. We're getting all this great sensor data or all this sentiment analysis data. We might need this in a way we don't even know yet. So just dump it in this raw format and we can use it later. Well, that analysis, that exploration of how I want to use it, or tweaking it in different ways, that is your schema on read. This is still, as I say on the slide, not magic. A person is still creating that. And if you're familiar with Hive, that's basically kind of a SQL-type layer on top of HDFS. It's a little more limited than your full SQL, but you can still generate table structures. But it's not as if you can take something like a data modeling tool and just create things on it. That's not the idea of big data. It really is about that discovery and taking the massive volumes and making sense of it. So that is really that philosophical difference that we were talking about earlier, between this idea of design and build, the schema on write, and getting a bunch of stuff, figuring out how you want to use it later, and then designing it as you want to use it. And that's super valuable. And I've worked with many organizations that are really doing both. Often this is a precursor to your data warehouse. Let's do all the munging. I'm not a big fan of that word, but it does seem to fit. Let's take the data, see the patterns in the data. Maybe there's an attribute I hadn't thought of. I thought I just needed gender and name and address, but I noticed age is a big differentiator in how they buy our product. Well, I should put that in the warehouse. So again, there might be things you would not have seen without the patterns in the data, like what they say on social media or whatever. So that is some of the beauty of some of this. But it does make data modeling a different thing.
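To make the contrast concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the records, field names, and the `customer` table are hypothetical, a list of JSON strings stands in for raw files dumped on HDFS, and an in-memory SQLite table stands in for the warehouse. The schema-on-write side stores only the columns designed up front; the schema-on-read side profiles the raw data afterwards and surfaces a field (age) that nobody designed for.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw records, standing in for files dumped into a data lake.
RAW_EVENTS = [
    '{"name": "Ana", "gender": "F", "address": "1 Main St", "age": 34}',
    '{"name": "Raj", "gender": "M", "address": "9 Elm Rd"}',
    '{"name": "Lee", "gender": "F", "address": "5 Oak Ave", "age": 61}',
]


def load_schema_on_write(conn, raw_lines):
    """Design first: only the columns chosen up front are stored."""
    conn.execute(
        "CREATE TABLE customer (name TEXT NOT NULL, gender TEXT, address TEXT)"
    )
    for line in raw_lines:
        rec = json.loads(line)
        conn.execute(
            "INSERT INTO customer (name, gender, address) VALUES (?, ?, ?)",
            (rec["name"], rec.get("gender"), rec.get("address")),
        )


def profile_schema_on_read(raw_lines):
    """Discover later: count which fields actually occur in the raw data."""
    counts = Counter()
    for line in raw_lines:
        counts.update(json.loads(line).keys())
    return counts


conn = sqlite3.connect(":memory:")
load_schema_on_write(conn, RAW_EVENTS)

# Column names as the database itself reports them (inherent metadata).
warehouse_columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]

# Fields present in the raw data that the designed table never captured.
field_counts = profile_schema_on_read(RAW_EVENTS)
undesigned = sorted(set(field_counts) - set(warehouse_columns))

print("warehouse columns:", warehouse_columns)
print("found only in raw data:", undesigned)
```

Neither side is automatic. A person wrote the table definition, and a person wrote the profiling step and still has to decide whether the discovered field belongs in the warehouse.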
So this idea of data modeling in the big data ecosystem: you have your data sources, whether it's videos or files or streams or relational databases or semi-structured data. And either you create some sort of Hive layer on it that can then be read through a data model, or some of these sources are semi-structured. With XML, for example, even though that's not exactly what we're talking about, most of the data modeling tools can either create a hierarchy out of that or kind of map it to a more traditional ER model, depending on the tool. They kind of do that differently. But it isn't the idea of design and build. It just is a very different thing. Related to this discussion, although they're very different in my mind, and another rant is that people sort of treat these as the same thing, is this idea of NoSQL. So a lot of rants. I've done this rant about that. One is that NoSQL isn't a thing. There's SQL, and then there's everything else, which is huge. These are all types of NoSQL databases. And I guess, yes, collectively, that's part of the big data ecosystem, because big data is a lot of things. But it really is separate. Since it is so related in many people's minds, and it is so different from the traditional ER world, I thought it was valuable to talk about. So key-value databases are very cool things. There have been versions of this for a while, but they're really coming to the fore. There are some examples there, Oracle NoSQL, et cetera. They can really support these extremely high volumes of records and state changes. So think of something like managing user sessions in a web application, or online gaming, or your shopping cart is actually a good one that a lot of people use it for. I'm collecting all this stuff in real time and kind of linking it to the profile. Generally, the structure is done in the application code. It's kind of a coding effort. As you can see in the example, you have your keys and your values.
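The shopping cart case can be sketched in a few lines of Python. This is only an illustration: a plain dictionary stands in for the key-value store, and the key format, function names, and cart layout are all hypothetical. The point it shows is that the store holds nothing but opaque values, so the structure of a cart exists only in the application code that reads and writes it.

```python
import json

# A plain dict stands in for a key-value store (hypothetical; a real store
# would be a separate service, but the access pattern is the same).
store = {}


def put_cart(session_id, cart):
    # The store only ever sees an opaque string value under a key.
    store[f"cart:{session_id}"] = json.dumps(cart)


def get_cart(session_id):
    # The cart's structure is defined here, in application code.
    raw = store.get(f"cart:{session_id}")
    return json.loads(raw) if raw is not None else {"items": []}


def add_item(session_id, sku, qty):
    cart = get_cart(session_id)
    cart["items"].append({"sku": sku, "qty": qty})
    put_cart(session_id, cart)


add_item("s-42", "SKU-1", 2)
add_item("s-42", "SKU-7", 1)

total_qty = sum(item["qty"] for item in get_cart("s-42")["items"])
print("items in cart:", total_qty)
```

Nothing in the store says that a cart has items or that an item has a quantity, which is why a modeling tool has little to read from it.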
There's not really a huge database structure on top of that. So generally, data models don't make a lot of sense there. I mean, you could have a model for that, but it's really more in the application code. You don't have the tables and columns that we're normally used to. I'm going a million miles an hour on this, but just kind of going through some of the differences here. Now, another type of NoSQL, the document database. Kind of a different thing. MongoDB is a good example of that. Now, there are some tools on the market that can both read and write these as well, because you can do some data modeling here. So one neat use case, multimedia is a good example. Or social media posts, where you might have a bunch of documents, in this example, that talk about, say, things in China. Maybe I'm a museum, right? So I have books on China and I have artifacts from China. So the attributes on each one of these are different, but there are some commonalities. So you can be flexible in how your data model supports this. That is one of the reasons they're so popular, because you can have that flexibility. So some modeling can be done in these, and, as I mentioned, some of the tools on the market can both read and write them, which is a little different than, say, the key-value pair database, which is normally done in code. Graph databases are another interesting thing, and they're used for a lot of neat use cases, like fraud detection or threat detection, where you're trying to see patterns in data. So if you look at that data there, you can kind of see groups. So maybe I had this criminal, and that person kept making telephone calls to certain people, and I can kind of see that network. Or on a more positive note, maybe social media with marketing.
This person bought my product, and they posted it on social media to these other people who have similar interests, and can I see patterns between them? Or things like network optimization. I'm seeing a lot of network traffic here, Internet of Things patterns. So again, part of the beauty of these is that discovery. So it's not like you design a data model to see who's going to commit fraud, right? I mean, the model itself is that database. It's looking at that metadata and the patterns between it, which is very neat and really cool, but a very different way of looking at things. And there are some neat vendors. There are some MDM vendors that are kind of mixing worlds here, where you might store the data in a more traditional database, but maybe some of the relationships or hierarchies between customers are done in graph. Again, as we kind of open up our minds to some of these new things, it's not that either or. It's an and. There are some neat ways to link these. It's just important to kind of know what fits where and where a data model might fit within all of these. So moving on, again, a lot of different things we're hitting on, and hopefully some of them meet the needs of somebody on the call. So when you talk about the organizational considerations, what does this mean for being a data architect or a data modeler in the world of big data? Do I still have a job? Right? So if you read the blogs, and on Dataversity, there's a lot of discussion around this. What does it mean in the new world? And yes, this job is just different. So, and excuse me, I will probably offend somebody on the call, but I will put myself in there: in one form or another I have been in each of these roles, in a different organization or a different phase of my life. And you tend to act a certain way in a certain role, but I think also certain personalities tend to gravitate to different roles. And it's important to sort of think of this. So I'll start with the upper right, executives, who are often our sponsors, right?
They want things done. They're busy, or they think they're busy. Everyone's busy, but business people think they're busier. And they're big picture focused. They tend to be, you know, when we think of a successful business person, optimistic. They think of opportunities. I heard about this big data thing, and I know it can change our world, and it can change our business. And you might get frustrated and say, that's a lot harder than that, but they may be right, right? There are some new things that we can open up our minds to, but when you're talking to them, they're often thinking, what's the business opportunity? That's why they care. They probably don't get as excited as I do about the technical considerations of a graph database. They want to know, how can this help my business? So think of the upper left, almost the other extreme. You have a database administrator whose job is to make sure this database runs and runs effectively. And they're very analytical and very structured. And they're busy, too, but they're very busy on a project or a task. So you as the data modeler in the center might go to them and say, I need something, and they say, I've got a beeper on me and I'm busy and I've got to get this task done. Just let me code. And they're also very cautious, right? You want to make this new change and add some fields. Well, it's going to break 17 different things. So it's kind of their personality. And you in the middle, the data modeler, generally, and I'm making a huge generalization, I'll say me when I was a data modeler, or am, excuse me, you're a mixture of both. And I think that's the beauty of a data modeler. I've used this analogy before, apologies if you've heard me say it, but it's a good one, so I'm using it again. I think of Janus, right? The god or goddess, I don't know which. The god, I think, of January, the first month of the year, who has two heads.
One is looking one way into the new year and one is looking the other way into the old year. I almost see that as a data modeler. On the one hand, you have to speak to the business and use business language about customers and products. And then you turn around to the DBAs and talk DDL, and how that data model you've created can be optimized on Teradata. And that's kind of the fun of the job, but often that personality has different sides, right? So on one side this person is analytical and structured, but we also tend to be kind of big picture focused and very passionate, and, well, other data modelers, but not me, I've heard sometimes like to talk a lot. Crazy idea, but I've heard that. And we are very passionate, so you're probably wondering about that picture, right? And so sometimes, I have to say, we can be considered old school, right? Because we're the ones that get up there and say, oh, if your data model is not in third normal form, the world is going to end. And we get so passionate about that. We're kind of seen as that weirdo on the street corner holding up the sign that the world is going to end. You know, we can be evangelists. So that's the picture, the old guy in the street holding up the sign, the world is going to end if the model isn't in third normal form. And sometimes we're right. Sometimes it should be in third normal form, or a dimensional model for your warehouse, right? But I think some of us can get so caught up in that that we miss the opportunity that the business executives are looking at. So we have to be careful of that. Or even if we don't feel that way, I think sometimes the organization does. There are some blogs out there: is data modeling dead? Well, there's a positive title. You know, you don't get that in the lower left, and not that I'm bitter, but you get the data scientist, the sexiest job of the 21st century. You've probably all seen that, I think it was a Harvard Business Review quote.
They probably have a different definition of sexy than I do, but data is sexy, I guess. A little bit of a stretch. But it is a hot new role. Why is it a hot new role? Partly because they're talking to that business executive in the upper right. They've sold themselves as, we can have these great new opportunities. I like to explore. I'm seen as modern. And that is right. There are a lot of opportunities. So make sure, if you're in that data modeling center, you're looking at those opportunities and talking to and helping those people. And can you be the hipster carrying your Starbucks and get the skinny jeans on, right? I don't know if that's what's hip. Wear some skinny jeans. I don't know. I'm not a fan. But again, these are stereotypes, but they are important to know because they are kind of there in business. And then I'm probably going to offend people, but I have been a software vendor too. So I can say, in the past, often the pitch is a little bit of a stretch. It's magic. It's easy. No modeling is needed. I have actually spoken to the analysts at some of the vendors and always scolded them. Really, it's not this big red button. And I think that's an extreme, and, you know, if you think of the upper left, the DBA who never wants to change, that's a bit of an extreme too. But if you're talking about extremes and opposing forces, you have a lot of those, and you're in the center as a data modeler, right? You have to serve the business. You have to serve the DBAs, who do have to keep the systems running. You want to support the data scientists and the new opportunities. And then you're also stuck with, why does it take so long? The vendor said you just dump everything on Hadoop and magic happens. Well, it's not magic. Even schema on read is a lot of work. You have to do the analysis. So key to this, when we think of our careers, is really working with and understanding these new roles.
So a lot of these are existing roles. What is a data scientist, right? There's a new one, and we could have a whole conversation on that. A lot of it is, you know, I did econometrics way back. I guess that's a data scientist now, right? You are doing statistical analysis, that first type of modeling we were talking about. But that needs the role of a data architect. The data architect often understands the whole way the business works and how those existing systems work. They have the metadata in their head. Things like privacy don't go away. How do we do the ETL, how do we provision the data from that source? So how do something like ETL and Hadoop administration work together? So there are some new roles. I think there's a lot of alignment with existing roles. When we think of the evolution that I showed in the first couple of slides, we didn't start from scratch with a cell phone. You know, there were phones before that. So there's a lot of overlap and evolution that we need to be a part of in our roles. If we do get that right, and again, this is why I am still in this industry and why I have the company that I do, it's really this idea of data-driven business, and that's why I think data is so cool right now. Why is it the sexiest job of the 21st century? Because people are finally wising up. I think a lot of companies have for a long time, but there's so much more opportunity now that data really is hot. Finally, and we always joke, you have a friend who's, I don't know, an airplane pilot or some, you know, seemingly sexy job, and you're at the dinner party: oh, wow, you're a pilot? Yeah, well, I'm a data analyst. I actually have had people say, cool, you do big data? I mean, suddenly we're the cool kids at the party, so let's go with it, right? And when you talk about data-driven business, what does that mean? I see, well, many ways, but I've divided it into two. One is this idea of optimizing your business.
When you hear people saying, I'm becoming a data-driven company, I love that. I've worked for several organizations in my consulting that have that on the wall. Imagine that, before I even came in. You know, they came up with the idea that data was cool. And how do you even have better marketing? Marketing is a huge consumer of data. How do you understand your customer? How do you understand your competitors? You can build better products by understanding your customer. Things like customer support. Can you look at the support logs and network outages and really understand your customer in a better way? So through data, we can become a better company, do what we do better. What I think is even more exciting is this idea of business transformation. How can we become a data company? And I've worked with several organizations that do that. You might say, I'm a telecommunications company. That's sort of a commodity. How do we utilize the data, through doing footfall analytics or finding patterns in our data? Think of the example I gave of smart meters. Companies, again, just had the picture on the wall before I came in saying, we're now a data company. And how can I monetize data as an asset? The beauty of that is you need data to do that, and you need metadata to do that, and you need data models that really help you understand, especially when you think of that conceptual model, how data relates to everything else. And I've worked with big data organizations that started with a conceptual model to understand what the interrelationships in the data are and how it can tie together. It's been an exploratory phase. What data do we have and which data can we leverage? Is it our customer data we can monetize? Is it our product data? Some people came up with surprising answers. They didn't realize they had this weather data that they had collected just for their own analysis, and they could sell that externally, as one example. But you don't know that until you model it.
One example: we work with a consumer energy company that is all about big data, with smart meters really changing the way they do business, giving users control, all completely data driven, and doing a lot of great things with Hadoop and the whole ecosystem. One of their very first steps was to identify what business data was critical by doing a data model, creating definitions and really identifying what that data means, where it's stored, and how they can leverage it. So this was one of the more innovative companies I've worked with, one that was really changing its entire business model from, I'm going to bill a certain way by usage, to letting consumers drive through smart meters and connected devices. And one of the first steps was the data model, data definitions, thinking of governance and the things that are related to a data model. So again, it's a new world. Some of the ways of looking at it technically are different, but it's still relevant. It's not at all that it goes away, which is part of, again, as I mentioned, why it's sort of fun to be in data management. So in summary, we are in a period of disruptive change. I mean, my head explodes when I think of where we were. I mean, just the things you can do on your cell phone that you couldn't before. So rather than being frightened by it, which sometimes I am, how do you embrace it and really think of new ways that you can expand your career by being a data person? And that's a bit of a social change, as well as a technological change. And do create a fit-for-purpose solution. It's not an either-or. There are relational databases. They're great for certain things. They've been around for a long time because they're good at certain things. Big data has new opportunities. So how do you integrate those? And with any age of change, the basics still apply. You still need governance. You still need an operating model. You still need your business requirements.
And the last one, have fun. I really do believe this is a fun time to be in information management, which is why I'm still in it. And I hope you do too. So with that, just a little bit more, and Shannon will send out the slides, if you do have any questions after this or just want to contact me about anything, there's my contact information. And a little bit of a sales pitch for next month. In September, we'll be talking about UML for data modeling. Does that make sense? When does it make sense? How can you do data modeling with it? And I'll have two excellent guest speakers, Norman Daoust and Michael Blaha, who are experts in data modeling and have both written books on the subject. So I hope you can join us there. So with that, Shannon, we can open it up for questions. Donna, thank you for this great presentation. I just love it. And just to immediately answer one of the most popular questions, as you mentioned, I always send a follow-up email, by end of day Monday for this presentation, with links to the slides, the recording, and anything else requested throughout. So let's dive into it. What is the fundamental difference in a technical sense between big data modeling and traditional data modeling? Well, I hope I touched on that. I guess if I had to summarize the two, I think at the conceptual business level, they're very much the same. With the conceptual model you should still be understanding the business, and that's where you're talking to the business. I'm trying to go back to my slide. When you think of the technical model, I would say the big difference is the idea of schema on write, which is more your traditional ER model. You're building the entities and relationships and attributes and scalability, and then you build it, whereas big data is more schema on read, where you're discovering patterns and discovering through the data what the schema should be and creating it then. So that's probably the quickest version of an answer for that. Sure.
And traditional data modeling uses tools like ERwin and ER/Studio. Could you recommend some modeling tools for big data? Are they different? There are, I mean, the big three, two of which you mentioned, so there's ERwin and there's ER/Studio and there's PowerDesigner. All of those tools in their own way can definitely read from Hive, and you can infer a model. So that example I'm actually showing now is one of the modeling tools that did, quote, reverse engineer from Hive, and there are a lot of tools that can do that. When you get into the actual Hadoop ecosystem, I would love for someone to come and tell me I'm wrong, so I'm open to that. As I just said, the world changes so fast. Most of what I see is a lot of HiveQL, the SQL, being written, and people are sort of doing that manually. There are a lot of these data exploration tools, like Alteryx, and this could be a whole other topic, but this data preparation, this user-defined data preparation, is being done kind of visually, but it's not the traditional model we're looking at. So that idea of reverse engineering is more at the code level, or in some of these visualization tools that can do that data preparation in a visual way. That's what I've seen in the market. But again, if someone has better ideas, please drop those in, because things are changing quickly. So I think you've covered this a little bit already, but is the modeling done on the HDFS Hadoop system, or is it still done for tables and databases? So HDFS is just a file system. So this is the entire Hadoop ecosystem, the Hadoop framework, which has a lot of components. You've got data movement and you've got data storage and you've got MapReduce, so there are a lot of those. So Hive and HBase, that is actually a layer on top of HDFS, where you can create these table-like things. So, yeah, that's a layer on top where you're creating these tables.
Thank you. And can you recommend further resources on big data modeling? Where else can people find more information? I have found a lot of the vendors themselves are very good at actually putting out a lot. Apache, you know, the open source ecosystem, some of the data modeling vendors. If you are looking more at NoSQL, Steve Hoberman, a colleague of mine, just wrote a great book, I guess it was last year, on data modeling for MongoDB, which is sort of interesting, because he comes from our kind of relational background and goes on from that. But I think a lot of the open source, I mean, the beauty of Hadoop is that it is an open source system, and as a result, there are a lot of resources just open and free on that. So that's what I would recommend. Free is good. And Dataversity, of course. Dataversity, of course. We actually have a lot of content coming up on that. Can you tell us more about what makes a graph database different than a relational database? Wow. A lot of differences. So let me go show the picture. So a good use case for that would be social media. At its most basic, and I'm sure those people are going to cringe at my simplicity of it, it's thing relates to thing, right? So that's how I'm able to say I'm friends with Shannon and Shannon's friends with all of you. You can see those patterns. So it really is linking, those linkages. So it's great for data that has a lot of hierarchies in it. You don't have foreign keys, you don't have tables. It really, simplistically I would say, is thing relates to thing. And then you can identify those patterns that you see. What have you found to be the best tool for use of data governance and lineage, and are they different for big data and data warehouses? I distracted myself, because someone pointed out when I answered the previous question that Coursera does have some good courses as well on big data.
I've taken a few myself, so that's another good one. And could you repeat the question, because I shouldn't look at them while you're reading them.
What have you found to be the best tools to use for data governance and lineage, and are they different for big data and data warehouses?
I haven't seen, well, there's different ones. For data governance, there's things like, I always say tool names, but there's some tools in the market that are great for process, as data governance has the process, the people part, so there's some tools that definitely focus on that. Some of the data modeling tools, like the three we mentioned, have some good lineage. Meta Integration is one that does the lineage between a lot of those, so do a lot of the metadata repositories on the market. A lot of those though, because of what we discussed, can often read from Hive, and they can read things that are structured. A lot of the tools will also get lineage from the Hadoop ecosystem to see what files were loaded and that sort of thing, but it often breaks down when you get to the big data layer. But all of them are tools, and I mentioned they're different types, so it could be your ETL tool, it could be your metadata repository, some of the data modeling tools. It kind of depends on your use case. Most can at least get the lineage when there's something like a Hive layer on top. When you're just looking at the raw data, you can probably see some of the files that go through. I have seen kind of a convergence with your traditional metadata repositories. A lot of those haven't quite got there yet. And again, unfortunately, with some of those, you can get a great view of what went onto Hadoop. But when you try to link it to more traditional sources, generally there has to be one of these kind of SQL-y things. That was very technical, SQL-y things in the middle, so that you can really do that lineage.
A new industry term. I love it.
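Editor's illustration of the lineage point in the answer above. This is a rough sketch under stated assumptions, not how any particular metadata tool works: the node names and the `upstream_of` function are hypothetical. It treats lineage as a list of upstream-to-downstream links and shows how a structured layer in the middle is what connects raw files to traditional targets.

```python
# Hypothetical lineage links, each written as (upstream, downstream).
LINEAGE_EDGES = [
    ("hdfs:/landing/customers.txt", "hive:staging.customers"),
    ("hive:staging.customers", "warehouse:dim_customer"),
    ("warehouse:dim_customer", "report:customer_summary"),
]


def upstream_of(target, edges):
    """Return every source that feeds `target`, nearest first."""
    parents = {}
    for up, down in edges:
        parents.setdefault(down, []).append(up)

    found = []
    queue = list(parents.get(target, []))
    while queue:
        node = queue.pop(0)
        if node not in found:
            found.append(node)
            queue.extend(parents.get(node, []))
    return found


if __name__ == "__main__":
    for source in upstream_of("report:customer_summary", LINEAGE_EDGES):
        print(source)
```

If the `hive:staging.customers` links were removed from this sketch, the trace from the report would stop at the warehouse table and never reach the raw file, which mirrors the breakdown described in the answer.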
I'm afraid that is all we have time for today, however. I'll get any questions that we have remaining over to you, Donna, if you want to take a look at those. Just a reminder, I will send a follow-up email to everyone by end of day Monday with links to the slides and the recording of this session. And Donna, thank you so much for this great presentation. Clearly a hot topic. And thanks to our attendees for being so engaged in everything we do and for all the great questions. I just love it. And I hope everyone can join us next month when we're talking about UML.
Great. 20 seconds, yeah.
All right. I hope everyone has a great day. Thank you.