Hello and welcome, my name is Shannon Kemp and I'm the Executive Editor for DATAVERSITY. We'd like to thank you for joining today's DATAVERSITY webinar, A Framework for Implementing NoSQL and Hadoop, the latest installment in a monthly series called Data-Ed Online with Dr. Peter Aiken, brought to you in partnership with Data Blueprint and sponsored today by HP Security Voltage. Now let me give the floor to Steven McLaughlin, the webinar organizer from Data Blueprint, to introduce today's speakers and webinar topics. Steven, hello and welcome.

Hello, everyone, thank you for joining us. We appreciate you finding the time in your busy schedules to join us for today's webinar, A Framework for Implementing NoSQL and Hadoop, or as we like to call it around here, Go Small Before You Go Big Data. As always, a big thank you goes out to Shannon and DATAVERSITY for hosting us, and another big thank you to our sponsor, HP Security Voltage. We'll get started in just a few moments after I let you know about some general housekeeping items and introduce your presenters. We have a one-hour presentation today followed by 30 minutes of Q&A. We'll try to answer as many questions as time allows, but feel free to submit questions as they come up throughout the session. To answer the two most commonly asked questions: yes, you will receive an email with links to download today's materials and the webinar recording so you can view them afterwards. These materials will be sent out within the next two business days. You can also find us on Twitter, Facebook and LinkedIn. We set up the hashtag #dataed on Twitter, so if you're logged in, feel free to use it in your tweets and submit your questions or comments that way. We'll keep an eye on the Twitter feed and we will include answers to those questions in our post-session email.

All right, now let me introduce you to our presenters today. Peter Aiken is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He's written dozens of articles and eight books; the most recent is Monetizing Data Management. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his and Data Blueprint's expertise. Peter has spent multi-year immersions with groups as diverse as the U.S. Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. He often appears at conferences and is constantly traveling. Speaking of which, Peter, where are you today?

So today I'm out in San Diego, California, at the DATAVERSITY-sponsored Data Governance and Information Quality Conference. We have 482 participants at this event and we're really excited about it. We're hoping the weather holds off here, because in nine years of holding this event we've never had to hold the whole lunch inside, and we might actually have that happen today. But fingers crossed, we'll see what happens. Awesome. And Josh Bartels is also joining us today.
He is a highly qualified data management consultant and leader with 10 years of experience across multiple industries in delivering tailored data management solutions that focus on data's business value while enhancing the client's overall capability to manage data. Josh holds certifications as a data management professional, project manager, and Data Vault 2.0 practitioner. He holds two master's degrees, in business administration and information systems, from Virginia Commonwealth University, and his most recent efforts focus on the creation of and migration to new data platforms for clients in the financial and insurance industries. And Josh, that's the White House in your picture there. You're not getting ready to jump over it, right? No, no. I lived there for a while. Josh would never do something like that.

So hi, everybody, and welcome. Again, the topic today, as Steven said, is really much more about going small before you go big, and that's what we'd like to get across to you today. We've got a sort of longish title: A Framework for Implementing NoSQL and Hadoop, demystifying Data 2.0 and developing the right approach for implementing big data techniques. What we're going to talk about today is, first of all, the fact that we're using the wrong vocabulary to discuss this topic in a big data context, and we'll start off by giving you some more precise definitions for words like framework. Hopefully, you'll understand what we mean by the term non-von Neumann architectures by the end of the presentation. Josh speaks specifically to Hadoop and NoSQL in terms of that. Then we'll move into a sort of historical perspective on big data and show you that there really is not a lot that's new, but there is something new. So it's not revolutionary; it's really more evolutionary. And of course, with all evolutionary approaches, the real key is crawling before you try to walk and then run, because when you stumble, as all of us do, you're less likely to have a catastrophic failure. We'll finish up with a couple of examples looking specifically at social and operational contexts.

Let's go ahead and get started then. The first piece of this is, again, the idea that we are simply using the wrong vocabulary to describe these things. All right, I have a question for you guys. Big data has a clear definition. Fact or myth? It's a myth. Really, the fact is that the term is used in so many different contexts that its meaning has become ambiguous, or not well understood. It's become a buzzword in the industry: somebody can say big data, but you don't really know if they're talking about Hadoop or NoSQL or other types of platforms. And often, people disagree as to what the term actually means.

And of course, the real key, as Josh was alluding to, is that we'd like people to be very precise about their terms. Because if you are imprecise about a term, it's very hard to ascribe business value to that term. If I say, for example, that cars are useful, that might be a perfectly reasonable claim, unless you happen to live in Cairo, Egypt, where cars may not be quite as useful because the traffic is unbearable and other things are more important. So how did we get to the point of things being a little bit out of whack here? The first piece of this: Doug Laney is a good friend and did a good job of getting us started in this area.
His original observation was that the volume of data is increasing, the speed of data moving through our systems is increasing, and the variety of data that we are obtaining is increasing. So these were the three Vs. I think he would agree that these have been taken out of proportion, but he was the first to observe, as a Gartner analyst, that this was happening and to say we have to pay attention to it. Now, we've also seen some additional definitions in here. Variability was a 2011 term from IDC. Our own Central Intelligence Agency uses the term vitality to describe big data techniques. Courtney Lamers calls it virtual, which limits the discussion to only online assets. And Stuart Madnick up at MIT has said, well, we've got to put value and veracity in place here as well. A first observation is that if all the definitions start with the letter V, or any one letter for that matter, you know the marketing people are winning this particular argument, and it's very difficult for the rest of us to focus. So: volume, velocity, variety, variability, vitality, virtuality, value, veracity. Wow. One problem with all this, of course, is that there's very little objective information in it.

So let's look at a couple more definitions. Gartner in 2012 calls it high volume, high velocity, high variety, carrying on Doug's original work there. IBM says it's data sets beyond the size and ability of typical software tools to use. Yeah, no problem. Wikipedia's definition currently says: so large and complex that it becomes difficult to process using on-hand data management tools. Well, okay. The New York Times says it's shorthand for advancing trends in technology. That's pretty broad. Tom Davenport: a broad range of new and massive data types that have appeared over the past decade. Data of a very large size, from the Oxford English Dictionary. Interesting. I was quoted back in 2007 as saying it's about putting the "high" back in IT. Well, once again, we have a problem here, because if we have no objective definition of big data, then any measurements, any claims for success, any quantifications at all must be viewed skeptically and with suspicion.

So rather than arguing about big data, let's look at something that we can in fact define objectively, and that is, by adding a simple word, big data techniques. And I would challenge all of you that are listening to this webinar to try this in your own conversations with vendors or anybody that comes along and says, hey, I want to talk to you about big data. Say, let's just try an experiment: let's talk about big data techniques. And then they ask you, what are big data techniques? And you can say, well, they are currently characterized by continuously, instantaneously available data sources; non-von Neumann processing, and I'll define that later on in the presentation; capabilities approaching or past human comprehension, and we can measure human comprehension; and built-in, architecturally enhanceable identity and security capabilities. These are often considered to be trade-off-focused data processing. So the question then can change from big data to: where in our existing architecture can we most effectively apply big data techniques? And as I said before, if you have a conversation with anybody about this and say, let's not use the phrase big data, let's use the phrase big data techniques, watch how that conversation changes and, in our estimation, becomes much more productive. Because big data technologies by themselves are a one-legged stool.
And given that I'm on a red-eye coming back from San Diego on Wednesday night, I sure don't wanna spend that flight on a one-legged stool. It would be very uncomfortable. I probably won't spend it on a three-legged stool either, but we'd like you to take our three-legged stool as being comprised of people, process and technology. Governance, of course, is the major means of preventing over-reliance on these one-legged stools by themselves, and it allows us to have a proper balance as we're approaching what are, relatively speaking, new problems with some new techniques or technologies. The big data landscape, when you look around at it, is in fact full of various different types of vendor offerings that are terrific. They do a wonderful job. This is one way of characterizing them. Here's another way, and we'll refer you to the 451research.com website for more details on this, but they have done a marvelous job of mapping out where all of these things apply. We're not gonna get through this; we did include it here for reference purposes. There's even a key to the map that you can look at, so you can find a quadrant and see where a particular product is at a particular point in time. It's a wonderful resource for people to use. But as you see, we don't really have a good definition for big data, although we can objectively define big data techniques. So let's move on now to our next question, Steven.

Fact or myth? Everyone should invest in big data. Josh, what say you? So I say myth. The facts are that not every company is gonna benefit from big data, and it depends on your size and your capabilities, right? So there's a difference between a local mom and pop versus a statewide or national chain, which might have some of the Vs that Peter previously discussed: volume, velocity, those types of things.

The real key, of course, is that big data creates different value propositions for different parts of the community. Healthcare: people are looking at it and saying, yeah, there's quite a bit of value. We've seen this number fluctuate between 300 billion and 600 billion dollars. The European Union says that government can save 250 billion euros, which is a half percent of annual productivity growth. Not a bad number. Global personal location data, again, 100 billion to 700 billion depending on what's happening, and if you've ever been lost and had Google Maps help you out in one form or another, that's certainly some value: it saves some time, and it's easy to quantify, that type of thing. Josh and I spend a fair amount of time driving up and down the 95 corridor, and Google Maps will pop up in the middle of a trip to say, hey guys, there's a shorter route, and we always appreciate that information. Retail is a little bit different: a 60% increase in net margin, but not talking in terms of the billions and billions of dollars. In manufacturing it's a fairly substantial piece. So there's a fair amount of variation in these areas that people can look to in order to figure out the value, but again, it's not uniform across all of them.

So let's look a little bit historically and see what's going on. First of all, the spending increases that people are projecting are just enormous. One of the projections here puts it at 232 billion dollars by 2016.
Well, hopefully we'll all be around to double-check these numbers afterwards; that's the funny part of it. But we do caution all of the customers we work with not to fall victim to shiny object syndrome. There's a lot of money being invested, but is it in fact generating the expected returns? The Gartner Hype Cycle suggests that the results are going to be disappointing. Now, a question for the audience, and I don't know if anybody will get this one or not, but here's an interesting quote: in considering any new subject, there is frequently a tendency, first, to overrate what we find to be already interesting or remarkable, and secondly, by a sort of natural reaction, to undervalue the true state of the case. The author of that is Lady Augusta Ada Lovelace, who is credited with publishing the first computer program, back in the 1800s. So she was smart enough to get this back then, and we keep going through it.

The Gartner Hype Cycle, which many of you are familiar with, and again Gartner does a great job on these, says that something starts off with a technology trigger. The technology trigger starts at the bottom left-hand of your screen and then jumps to the peak of inflated expectations. That peak is very high, and as Lady Lovelace says in her comments, it's probably a bit higher than it should be. Then of course it drops into the trough of disillusionment. Again, in all of these cases the trough is below where it should be as well, and what happens, just like any natural oscillation, is that it climbs back up the slope of enlightenment and eventually winds up on the plateau of productivity. It's never as good as the peak suggested and never as bad as the trough suggested, but it does fit in and have a utility. These Gartner Hype Cycles are quite useful in terms of looking at things in relative context.

So here's a Gartner Hype Cycle from July of 2012, and if you were invested in text analytics in July of 2012, you were in great shape, because there was nowhere to go but up. On the other hand, if you were in social network analysis, the way to read this chart, where it's a dark blue circle, is that social network analysis according to this chart was five to 10 years away from peak hype, the peak of inflated expectations in this case. So it's headed for a crash. And of course predictive analytics and web analytics, off on the right-hand side of your screen, were approaching a relatively steady state. So you're getting an idea how to read these. By the way, Gartner also said, as we've been saying here, that a focus on big data is not a substitute for the fundamentals of information management, and that's something we of course agree with 100%.

So let's go to a specific focus on big data. Here's a Gartner Hype Cycle focusing in on just big data as of July of 2012. Again, on the right-hand side, I'm circling it in pink there, and you can see that big data, in the light blue dots, is two to five years away from the top of the peak of inflated expectations, as of July 2012. Now we flip ahead just one year, to July of 2013, and big data is approaching the peak of inflated expectations. However, according to Gartner, it is now five to 10 years from the top of that peak. So in one year we've gone from two to five years from the peak to five to 10 years from the peak, and this indicates an increase in the velocity of the hype around this particular topic. Five to 10 years instead of two to five years, and that's in just one year.
Here we go one more year, to 2014, and all of a sudden big data has passed the peak of inflated expectations and is now headed for the trough of disillusionment. So the Gartner reports here, reading between the lines, said that a lot of things happened too quickly in big data, and what was two to five years away is now five to 10 years away from the trough of disillusionment. Big data might have been a great thing to have on your resume last year, in 2013, but in 2014 it is not the thing to have on your resume. Now again, it's not that Gartner per se is reading these things correctly or incorrectly, but when you're looking at five to 10 year swings on these things, these are problems. And again, we can look at prescriptive and predictive analytics in relative context here.

So Steven, we'll get back to another myth or fact. Fact or myth: big data is innovative. Josh Bartels, what say you? I say it's a little bit of both. So big data techniques can be innovative, but big data as a storage platform or technology may have been around for years. Also, innovation is usually tied to ROI and insights, depending upon the size of the business and the amount of data used and produced. Now, the size of the company doesn't necessarily matter; the insight, the value to the company, is what really needs to be measured here. Again, there will be all kinds of different contexts, and one thing we hope you all are able to take away is to place yourself in this area and see what comes up.

Now, some of you may be asking why you're looking at a picture of my barn being built. And the answer is that my barn, when I was building it, had to pass a foundation inspection. In other words, before further construction could proceed, the bank which is loaning the money wants to make sure I have a good foundation upon which to build the barn. If I have a poor-quality foundation for this barn, there's not a chance it will stand: it will probably fall down and hurt my horses, and other sorts of bad things will happen. Unfortunately, we have not developed an IT equivalent in most organizations, and so people are simply unaware of this absolutely critical engineering concept that we need in building our large, complex IT systems. What we end up with are very good IT systems built on top of poor foundations.

Now, in this context we need to understand what a framework is. If we go back to the barn, we would say the next thing we build on top of the foundation is a framework, and then we would put on a roof, because the roof then allows us to work in the rain, whereas if we put the walls up first, the walls might get wet before the roof is on. Simple things like that. A framework is a system of ideas that guides the analysis. It is a means of organizing the data about the specific project. It allows us to set integration priorities and gives us a framework for making those priorities. And finally, it gives us a means for assessing progress. Some of you may be unfamiliar with the Zachman Framework; it is now in its third version. If you are not familiar with it, let us know and we'll do a Data-Ed webinar on that topic. It's one of the more important concepts that is missing from many, many IT projects.

Now we're gonna go back into a little bit of history here, research history, in the sense that what we're looking at right at the moment is that processors have gotten fast, but they can only get so fast, because they're governed by the laws of physics, and the slowest part in any standard information processing system is the hard drive.
The hard disks are too slow, or the memory is too small, and a cool device called a flash drive removes both of those bottlenecks. It's funny: Apple and Yahoo have spent hundreds of millions of dollars on this, and some people call them just purveyors of overpriced flash drives, which is in some ways a correct characterization. The real key is that we can make the flash drive look like more memory, or like a hard drive, or in some cases, very innovatively, both. The bottom line on all this, of course, is that from a processing perspective the new data techniques do give us some new capabilities.

Now, I promised earlier I would define non-von Neumann processing, so let's go ahead and jump into that. It's a computer science term, and what you really end up with is the idea that John von Neumann is the father of computing architectures as we have practiced them up until relatively recently. The idea was that you had a program and data, and they were moved to a processor, and as that processor processed the data, it moved the program and the data back, and sometimes erased the program and put a new program in place. But the only way of speeding that process up was, in fact, to shrink the wires that the program and the data had to travel along between the processor and the other pieces. And while that's a good idea, it does eventually have a point of diminishing returns. So people started to think about it, and Michael Stonebraker was one of the first. If you don't know Michael, he's the creator of Ingres, out of Berkeley and MIT, and he made the observation that modern database processing was about 4% efficient. Now, the big-iron database people did not like that particular piece, but it is nevertheless true. We are faced with a zero-sum game: we have to trade off, and in order to get more useful work, we have to decrease other things; we have to trade the characteristics against each other. And if we can come up with some effective trade-offs, then we're able to do some new things, and the topics that I'm popping up here on the right-hand side of your screen are the things that have come out of this: Google's MapReduce, Amazon's Dynamo, Netflix's Chaos Monkey. There are all sorts of different things out there. And this is the fundamental piece to remember: big data techniques exploit non-von Neumann processing.

What that means is, and I haven't found a better illustration of this, I'd love somebody to suggest one, it's like Pac-Man, but in a large number of parallel instances. If we can take the problem that we're facing and have these Pac-Men process it in parallel, we are going to be able to get through it faster, better and cheaper. There are just two pieces that are not optional in all of this. The first is the ability to decompose your problem into multiple sections, and once your problem is decomposed and solved in pieces, you have to reassemble it. So certain types of problems lend themselves to non-von Neumann, or what we've been calling big data, techniques. And one of the little litmus tests we use at Data Blueprint: people look at one of those big data clusters, and we show a picture here that, Josh, I think you took yourself at one point, and people look at this and go, oh, that's one of your big data clusters? It looks like a bunch of Dell computers that you put together. And yes, that's the interesting answer: it is a bunch of Dell computers that Josh assembled.
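Before we come back to Josh's cluster, here is a minimal sketch of the decompose, process-in-parallel, reassemble pattern just described, using Python's multiprocessing module as a stand-in for a real cluster. The word-count task, chunking scheme, and worker count are illustrative assumptions, not anything from the webinar slides.

```python
# A minimal sketch of the non-von Neumann "Pac-Men" idea: decompose the
# problem, solve the pieces in parallel, then reassemble the answers.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Each worker independently counts words in its own slice of the data."""
    return Counter(chunk.split())

def parallel_word_count(text, workers=4):
    # Decompose: split the input into independent pieces.
    lines = text.splitlines()
    step = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    # Process the pieces in parallel, then reassemble the partial results.
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)
    return sum(partials, Counter())

if __name__ == "__main__":
    sample = "big data techniques\nexploit parallel processing\nbig data"
    print(parallel_word_count(sample))
```

The same shape, mapping work over the pieces and then reducing the partial answers, is what frameworks like Hadoop automate across hundreds of machines instead of four local processes.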
With Hadoop, it turns out that you need two computers to control the overall processing, which means we end up with the ability to run three parallel tasks in this particular big data cluster. Three nodes of parallelism is not really the number we're looking for; it should probably number in the hundreds or thousands. And that means, again, dividing the problem up and reassembling it are the really important pieces.

Now, when we look at this, one of the things we wanna see is the overall way in which big data can help, and here we refer to something called the analytic insight cycle. In this cycle, we look at patterns and objects, and we find that some hypotheses will occur in that process. Here is a little animation. It's something you can get on the web called Boids, B-O-I-D-S. And it's the observation that you can control birds flocking with only three variables. Kind of an interesting observation. Law enforcement, for example, might have benefited from this in crowd control. I'm thinking of the Egyptian police and the Tahrir Square protests. Had they known this, it might have turned out very, very differently for the Egyptian government. Now, having that insight is good, but if we're not able to get feedback and discern what's going on in that insight, we have no ability to use it. More importantly, we have no ability to reuse it, to operationalize it, unless we can make that insight exploitable and put it into some sort of a knowledge base. And this cycle that I've just described for you is what we call the analytical bottleneck. This is the call, if you will, to the data science community: we'd like to have more people who can come up with insights in this process quickly.

Now, when we add big data techniques to this, bringing them in on the left-hand side, and I'm just gonna abbreviate them again as volume, velocity, and variety, we can say that some things happen up front where we can start to apply some sense-making techniques, addressing what is happening. We can get actual and potential insights, where we look for things and try to get them confirmed as we look around. And finally, we can look at combinational insights that come through. These conform very closely to Margaret Boden's computational creativity taxonomy: exploratory, combinational, and transformational.

And what this really leads us to are two very easy-to-understand use cases for big data techniques. We call this the sandwich analogy. We have a landing zone where we bring in data; many organizations are calling this a data lake, where the data is essentially not well understood, but at the same time not ready to be processed. In the middle are our standard architectural processing capabilities. And the other piece of the bread is the other use case, which is offloading: needing less structure for cold transactional and analytical data. Now, we're not going to get dramatic credit for coming up with this idea, but hopefully you can see that the landing zone and the archival zone are great places to put these techniques in place, and that the middle part, again, just as Gartner has said and as we've said several times in this webinar, is absolutely required in order to do this.
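A quick aside before the handoff: for anyone who wants to see the Boids idea rather than take it on faith, here is a rough sketch of the three rules (separation, alignment, cohesion) from Craig Reynolds' 1987 model. The weights, neighborhood radius, and flock size are arbitrary values chosen just to make the sketch run.

```python
# A rough sketch of the three Boids rules mentioned above. Every bird
# follows only these local rules, yet flock-like behavior emerges.
import numpy as np

def boids_step(pos, vel, radius=1.0, w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """Advance every boid one time step using only the three rules."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        dists = np.linalg.norm(pos - pos[i], axis=1)
        neighbors = (dists < radius) & (dists > 0)
        if not neighbors.any():
            continue
        # Cohesion: steer toward the neighbors' average position.
        coh = pos[neighbors].mean(axis=0) - pos[i]
        # Alignment: steer toward the neighbors' average velocity.
        ali = vel[neighbors].mean(axis=0) - vel[i]
        # Separation: steer away from neighbors that are too close.
        sep = (pos[i] - pos[neighbors]).sum(axis=0)
        new_vel[i] += w_coh * coh + w_ali * ali + w_sep * sep
    return pos + new_vel, new_vel

rng = np.random.default_rng(0)
pos = rng.random((30, 2)) * 10   # 30 boids on a 10x10 field
vel = rng.random((30, 2)) - 0.5
for _ in range(100):
    pos, vel = boids_step(pos, vel)
```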
So now I'm gonna turn it over to Josh and let him tell you a little bit about what we mean by NoSQL and what sort of role it plays within organizations.

Thanks, Peter. So, talking about NoSQL: it's commonly interpreted as "not only SQL" or "not ordinary SQL." A lot of different people say different things, but really what we see it as is a broad class of database management technologies, right? They span different motivations and different activities, but in general what NoSQL is used for is simplicity of design; horizontal scaling, and what we mean there is scaling out, the parallel processing that Peter was talking about, versus scaling up, so adding a new commodity machine versus buying bigger and better machines, right?; and finer control over the availability of data. The structures usually differ from relational databases, making operations faster in NoSQL for certain things, with relational databases better for others. There are even different capabilities and activities that the various NoSQL technologies are better at. There's a difference between a graph database and a column store database, which are both NoSQL technologies: a graph database is better at traversing connected data, where a column store is better at seeking information, search-specific types of activities, right?

So that's what NoSQL is, and now we're gonna talk a little bit about Hadoop, because there's a difference between NoSQL technologies and what the Hadoop platform is doing. The Hadoop platform is really a storage and processing system that runs on clusters of commodity servers. So this is basically what Hadoop is used for: the data lake, or the data landing area. You might put NoSQL technologies on top of the data lake to process some of the information that's in the Hadoop storage platform, but Hadoop can store any kind of data in its native format. It can provide a wide variety of analysis and transformations by mapping work out to the cluster and then reducing it back to an individual answer. It stores terabytes, petabytes, and larger sets; it consistently scales out; and it handles hardware and system failures automatically without losing data. So, like the concept of RAID with hard drives, Hadoop has that built into the whole storage platform. And there are two critical components of Hadoop. There's HDFS, which is the file system that actually stores the files and is managed across the cluster of machines. And then there's MapReduce, which is the programming model that allows programmers to interact with the data as it sits in HDFS: work is mapped out to the nodes where the data is stored, and then, as the answer is being generated, the reduce part of the program reduces the set of data down to the answer that was asked for. We'll show a tiny sketch of that mapper-and-reducer shape in a moment.

Why would you want to use NoSQL or Hadoop? If you potentially have a large number of users, so reading internet data or e-commerce transaction logs or interactions with any type of website where you may want to track conversion rates or similar things, and you have a large amount of data, that would be more of a Hadoop use. Rapid app development and deployment could be a use for NoSQL: I want to build a web application quickly, and I don't necessarily know, or am not going to have, a rigid structure for what my data is, so I'm going to use a NoSQL type of technology so that I can scale out quickly.
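Here is that promised sketch of the map-out/reduce-back flow, in the style of Hadoop Streaming, which lets you write the mapper and reducer as plain scripts reading standard input. The file names, sample data, and command below are illustrative assumptions; exact jar paths vary by installation.

```python
# mapper.py -- map step: emit "word<TAB>1" for every word in the input
# split this node was handed. Hadoop Streaming feeds the split on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- reduce step: Hadoop sorts mapper output by key before it
# reaches us, so all counts for a given word arrive together and can be
# summed in a single pass. (For local testing, pipe mapper output
# through `sort` first.)
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

A typical invocation would look something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/logs -output /data/counts`; the point is that the scripts never know the data is spread across hundreds of machines. Now, back to the use cases.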
Or if you have a large number of critical writes: you're capturing sensor data or other types of data that arrive every second or every millisecond, on specific tests, engine tests, things like that. Social networking data, right? So I want to look at 140 characters; I'm trying to analyze tweets, but they're small, they're consistent, they come in constantly; it's basically a data stream that never ends. Those are the types of things where big data processing is going to be important. Or you want to solve a scaling problem, right? We have such a large amount of complex data that our relational systems can't handle the scale of that complex problem; Hadoop is a specific answer that can work on that. Another thing: if you need to be able to scale out under financial constraints, you can scale out Hadoop by adding cheaper or more efficient hardware as you need it. So it's not buy-it-all-now; you buy as you need it, so it becomes more efficient for processing. So what are you gonna do?

Yeah, so you can see that this is a little bit confusing to people, because Josh is advocating using cheaper hardware for certain types of processes, and that makes sense in certain circumstances, but it's a really bad idea in some other ones. As he said, we're heading on to this next slide here with how the real-life uses work. So here are some use cases in the real world, right? And I think on the next slide we go to the graphic here, but really we're looking at risk modeling, customer churn analysis, recommendations, targeting of ads, social sentiment on social media, threat analysis, trade surveillance. There are many uses. The question is: are they applicable, do you have the amount of information that really justifies the use of these technologies, and are you gonna get the return on that? I think that's the framework that we'll be discussing here in a little bit, but those are the major questions that come in.

You're looking at a context here of improving operational efficiencies. That's gonna have certain characteristics which are gonna be different from growing revenues, which are gonna be different from reducing risk. And so that's a great place to take a look: just start in and say, all right, which side of the pie am I trying to look at here? Again, we're giving credit here: this is from an Informatica blog, and we pulled this information from the reference in the upper right-hand corner there. It really pretty articulately lays this out. So again, medical research may be in a growing-revenues context, but infrastructure management may be one where we're improving operational efficiencies, and both of them may be good candidates for looking at these big data techniques. It's not necessarily that they're going to be doing the same kind of thing. Many people get scared, for example, when we say we're gonna use commodity hardware to do infrastructure management. Isn't that a bit dangerous? And the answer is, well, that depends on how you set it up, in all of these cases.

Now, big data has some challenges associated with it. And David Brooks, of all people, did a really terrific job of articulating them on the New York Times op-ed page a couple of years back. Big data really struggles with the social aspects of things. There's absolutely no way that big data techniques are going to give you the kind of warm fuzzies that you feel when you meet a long-lost friend, somebody you haven't seen since high school. It's very good with quantity.
It's not so good with the quality. So it struggles with the social. It also doesn't get context nearly as well. Our brains are really, really good at taking in stories, and one of the things we always help our data management colleagues with is telling stories, so that management can place the value in some sort of context. Data analysis is not very good with narratives. It doesn't really understand this concept of emergent thinking, or even explanation. In fact, what it really is good at is creating bigger haystacks. More data is always going to lead you to more statistically significant correlations, but the vast majority of them are going to be spurious, and they're going to mislead you. So again, correlation is not causation, and that falsity grows exponentially with the greater amounts of data that we collect. A good way to think about all this: if you've got a friend who takes political view X or Y, whatever it is, maybe that we're in a recession and we should increase government spending, or that we should cut government spending, big data doesn't change anybody's mind. I've never met a single person who's been persuaded to change their political allegiance based on big data techniques. Big data also favors the memes, if you will, over the masterpieces; cat videos, if you will, over the Mona Lisa. People don't really understand this. And finally, big data really obscures values, because those values are always there according to the predisposition and agenda of whoever is presenting this to you. David Brooks had his on the opinion page; we have ours in our approach to this webinar. And again, of course, what we're advocating is: crawl, walk, and run your way into this area, but do treat it as something other than just another standard IT project.

All right, we now return to fact or myth. Big data is just another IT project. Josh, what say you? I say myth. Big data is not a typical IT project. It doesn't answer typical IT questions, right? What we're looking for in big data are usually exploratory activities, trend analysis, actionable insight; it enables new capabilities, and it can be disruptive for the business. It might sound simple, but that doesn't mean it's easy. And we have to be aware of shiny object syndrome, as we say, or SOS.

So let's talk specifically about disruptive for a second, Josh. I mean, this is one of the things you and I have both observed in organizations where they try to bring big data techniques into the IT department, and the IT department has been trying for years to address and answer a very specific question. The difference here, in a disruptive sense, is that the best answer from a big data exercise might be another series of questions, which leads to another build or rebuild of the existing piece. Yeah, so it's more of an iterative development: I ask one question, and I'm gonna ask six more after I answer that one, because I might find different insights. Typical IT project planning is gonna follow either agile or waterfall, which have distinct use cases and problems and requirements, where big data is more what-if types of analysis.

So on to another myth. I labeled this one wrong, gosh, I'm sorry guys, I didn't catch it last night: big data myth number four. All right, fact or myth, lightning round: big data's new. Josh, 30 seconds on the clock. All right, myth.
So it's been around since the 1990s. The concepts have been used before; we can even look at mainframe architectures that mirror some of the architectures that we have with big data today. And it's harder to leverage when you lack the appropriate techniques. I think that's kind of what we've been talking about today: what are the big data techniques that can... Got it. And as Steven said, lightning round here.

We're just gonna dive back into history, to the 1600s, when the Black Plague was going on. This was a book called the Bills of Mortality, and what they were doing was counting the number of people who were dying of various diseases. It does represent an early data collection, a wonderful big collection of data that allowed them to do things that we now consider to be, for example, geocoding. If you knew where the deaths were occurring, you could not go there. Kind of a good idea. Similarly, if you understood when it was happening, you could say: are things getting better or worse? They were able, in real time, to predict the peak of the plague and to say things were, in fact, getting better. And you'd feel a lot better if somebody said things are getting better and actually had some data for it, as opposed to simply saying it. And finally, they were able to identify the cause: what was happening, and why it was happening. And this was because, of course, there were too many rats around. The rats themselves weren't necessarily the bad things; it was the fleas on the rats. So they could either give the rats flea baths, or go ahead and get rid of the rats, right? And of course, what everybody wants to get to is predictive analytics, which is: what will happen next? We don't anticipate the outcome for this particular fellow is going to be a very good one, no matter what happens next on this particular score.

Now, just very, very briefly: data management as a formal discipline has only been around for literally 101 years. The British discovered in 1914 that they were at war with much of Europe and that there were 14 million Germans living in the United Kingdom. What they decided to do was keep a 3x5 index card with some critical information on each of those 14 million Germans, and managing those 14 million cards was a non-trivial exercise. A little later than that, in 1962, the CIA published an article in a journal called Studies in Intelligence. You can read it later on, but it predicted the use of computing in the intelligence community and forecast the development of predictive analytics and the accompanying privacy challenges. An interesting subsection of the topic was a bit where they were talking about covert actions in Afghanistan. So it's just a fascinating look into the future of what was happening in all of these areas. Again, very, very challenging: people back then imagining the planning of a covert action program in Afghanistan, and asking questions that would only be answerable by these very large data collection programs.

I guess this is number seven instead of six, according to my bad numbering, Steven. That's right, this is another lightning round, for a thousand bonus points. Big data provides all the answers. Josh, fact or myth? Myth. Big data, like we said earlier, is exploratory, right? It doesn't replace scientific theory and it doesn't replace human thinking, right?
And it doesn't replace the complex problem-solving activities that we go through when we assess problems and address ideas. It can definitely give us insight into those activities, but it doesn't substitute for the hard work that's necessary in developing a solution, right? So we're gonna say you need the right approach, and in the next slides we're gonna talk about the approach framework.

So this picture that we're looking at here is a big data approach framework that we've come up with at Data Blueprint. It starts in the upper left but, as you can see, iterates around, and we're gonna talk about the different points here and why they're relevant as we go through, understanding whether it's go small or go big, right? So we're gonna start by identifying the business opportunity, right? And what we wanna do there is see how the data that exists within the business can be leveraged: exploring external marketplace factors, analyzing opportunities and threats, or looking at internal efficiencies. Can we leverage some of our strengths or improve some of the weaknesses that we have within our organization?

Once we've been able to identify that opportunity, what we wanna do is move on to applying the six Vs, for lack of a better term, right? Because we don't have an absolute definition of what big data is, but if we know what the business opportunity is and the associated data, we can look at some of the definitions that have been floating around and ask: is there a larger volume? Is the velocity high? What about variety, variability, vitality? Or is it virtual information, is it in the cloud? And there's tons of it in the cloud, right? By applying those six Vs, we can then ask ourselves the question: is big data something that is going to be useful for us? The analysis will allow us to say, you know, if the six Vs indicate this is big-data-specific, let's go ahead and look at applying it in a big data environment. If not, why not use the current BI environment that exists within the organization? Are there budgetary restrictions? Do you have a financial constraint? Are there technology constraints? This question area here is very important, because it's asking what constraints exist within the environment and whether big data will be applicable. If the answer is no, we're gonna use our current business intelligence competencies, and we can cycle through and iterate. If the answer is yes, what we wanna do is look at the foundational practices associated with big data.

So this has a foundational piece to it that goes along with the technical practices that need to be thought about later on, right? The foundational practices really look at things like the big data strategy, the architecture, and the governance of the activities that need to be applied. These things allow us to tie to the business strategy. They allow us to understand how we're gonna apply the big data techniques that we talked about earlier, and they also allow us to make sure that we control the situation with governance, so that we don't just go willy-nilly with it and it doesn't become shiny object syndrome, just a big data activity for its own sake, right? Moving on to the technical practices, right? We're gonna look at the models and algorithms that need to be applied. Is there a specific big data platform that's gonna be beneficial for us?
What kinds of visualizations can we use to show the business that we've gotten insights or created exploratory activities? And we also need to source the data from wherever we're gonna get it and integrate it into the big data platform that we choose, right? And just a quick note there: one of the things we observe time after time is that our data science community spends 80% of their productive time sourcing and integrating the data. That is a very unproductive use of these highly paid resources. And so after we go through the technical practices, right, it becomes this exploratory activity, where perfect results are not necessary. We're gonna iterate and refine our activity, using iterative processes to reach a decision point and using feedback for exploration. The insight that's gained needs to be actionable and understood, and you have to document what you learn in order to create that necessary feedback loop throughout the process, right? I don't know how the slide's building on Peter's side, but I think that, yeah, I think we're up to the final round of myths.

Well, one more piece on this before we jump on to the myth and fact, and that is that the components here, big data or not, remain the same. The existence of a data strategy, good governance around it, solid architecture, and education around how these work are going to be much more important, because these technologies don't deliver value on their own; we've absolutely gotta have those pieces in place in order to do this. You're right, I was off a section on that. So there's the direction and insight at the back end, which allows you to reevaluate that business opportunity and come to a good decision on that particular piece. And yes, Steven, now we are at the final round. One more round: you need big data for insights, fact or myth?

Myth. We work with many clients where we can provide insight with standard analytical tools, and big data is not a necessary activity, right? Big data is just defined by the technology stack or the big data technique that is used. It can be used for predictive and prescriptive analytics, but we've also seen clients use relational platforms to do predictive analytics, based upon historical demand planning data that exists in their SQL Server database. It's all about the models you wanna apply, and whether or not you have constraints where big data can help you remove those constraints so that you can look at the whole scope of the data that you have. So you need to understand how the data is structured, architected and stored, and from that you can gain insights from either your relational or standard databases or big data, depending upon what needs to be applied.

So we've got just a few minutes left here, and Josh has got a couple of examples we're gonna run through. We may or may not get through all of these, but let's give it a shot. All right. So social sentiment analysis is one of the major areas where big data and Hadoop platforms are being used. It allows for the landing of multiple sources of unstructured data from Twitter, Facebook, LinkedIn, name your social network, right? We can integrate the data into the platform. It's often used with algorithms looking for keywords that determine positive or negative feedback.
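Here is a deliberately tiny sketch of that keyword approach, just to ground the idea. The word lists and sample posts are made up for illustration; production systems use much richer language models than a bag of keywords.

```python
# A toy keyword-based sentiment scorer: count positive words minus
# negative words in each post. The word lists are illustrative only.
POSITIVE = {"love", "great", "awesome", "excellent"}
NEGATIVE = {"hate", "awful", "terrible", "broken"}

def sentiment(post: str) -> int:
    """Return >0 for net-positive, <0 for net-negative, 0 for neutral."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

stream = ["I love this show", "awful episode terrible plot", "meh"]
for post in stream:
    print(sentiment(post), post)  # prints 1, -2, 0
```

At scale, a scorer like this would run as the map step over a never-ending stream of posts, with the reduce step aggregating scores per show, product, or candidate.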
One of the newer instances that has come along in the last year or so is that not only do they look for positive or negative feedback, but they've been using it for voting mechanisms: TV and other media platforms are using social media as a way to produce outcomes for live TV shows and things like that. So all of that has to be processed through big data to be able to read the results and provide feedback, basically in real time. All right, next slide.

We had a client recently that took this data and fed it into their operational activities. They utilized real-time pricing data from multiple sources to dynamically update the prices of their books in the Amazon marketplace. So they were pulling data from multiple sources, whether the book was on eBay, on Amazon, or on other marketplaces for selling books; they would look at the real-time changes in those and apply a predictive model to determine the best price point for their book in the marketplace. Now, a caution here: if you let the computer run wild and update your operational activities, you can have an issue with a very large race to the bottom, where your book price might become one cent in order to win the battle. So you have to think about how you apply that big data technology and what the return on investment may be, and adjust the algorithms, models, and frameworks accordingly; otherwise you could run into situations which might produce problems.

Another example is healthcare, dealing with patient data: analyzing clinical data for diagnosis, and genetic data. Looking at billing data is another way of creating cost savings for healthcare systems. One of the clients we work with is analyzing palliative care information, gathered from billing and other data, to look at two things: savings to the hospital, but also quality of life for the patient, by comparing treatments across different patients to understand whether we can better their lives and create savings for the hospitals.

Next example: retail, so loyalty programs and big data, right? We've had some interaction with companies who are going out, getting pricing data from different retailers, and selling that data back to other clients so that they can price their products similarly. There's a big agency here in Richmond, Virginia that does that; that's all they do; they're called Retail Data. Another big thing too: every time you use one of your customer loyalty cards, data is being tracked about you, your preferences, and your purchase history. The next time you come into the store, they're gonna analyze that data and give you coupons; even now, when you check out, you get coupons based upon what you buy or what you previously purchased. All of that is data being processed and returned back about you, based on all the data they've collected about you and about people who have made similar purchases. There's even a well-known instance where a retailer determined, based upon her purchasing habits, that a young woman was pregnant, and the promotional mailers went to her parents' house. The parents didn't know she was pregnant, but the company, which was Target, knew. So it created a large insight for the parents, and Target had that insight, but I don't know that the young woman wanted that information shared.
So it's just interesting that these loyalty programs can produce these types of insights through big data and NoSQL technologies, even when we're not expecting it. And just a note on that too: Apple, yesterday, at the big announcements at their Worldwide Developers Conference, included loyalty cards in their Apple Pay scheme, their plan for doing electronic transactions. And one of the things Apple made a big point of in the announcement was saying that Apple will never actually see what things you buy. So the material would still have gone to Target, but the idea is that if you incorporated that into a competitor's wallet instead of an Apple wallet, the competitor would have that information in addition to Target having it. And so Apple was trying to differentiate themselves by their ability to handle big data anonymously. So again, that may be appealing in some circumstances, maybe not appealing in others.

We're getting up to the top of the hour. Thank you, Josh, for all of that. Josh has assembled a series of references here that we've incorporated, and you guys will get them as part of the follow-up on this. And unlike most webinars, we've actually included a number of additional slides here for you all that we're not covering in the one-hour presentation; they just give a little bit more background in the area. So there's a good 20 pages or so that we're gonna add to the PDF here. But we're at the top of the hour, and it's time for your questions and answers. Let's see what sort of questions you guys have for the three of us.

Yeah, that's right, thank you guys for sticking it out with us for the hour and for suffering my alter-ego game show host persona. If you would like to submit a question to the Q&A, you can click on the Q&A window feature at the top of your screen and you should be able to submit your questions through that window. Again, we have about half an hour here, and we've got a couple of questions that have come in, so I'll just go ahead and hit you guys with the first one. The first question we have is regarding the second myth: if I'm a small company and/or I don't have petabytes of data, does that mean that none of the big data technologies or tools can be applicable to me? So this is for Josh, or Peter. I'm sorry, no, go ahead, Josh. I'll just pull the myth up.

I would say that the answer is no, big data technologies can be applied. We've seen small businesses use more of the NoSQL-type platforms, where they're trying to do rapid application development, trying to create an application quickly, and they're gonna use a NoSQL backend for that. It's usually HDFS, or the Hadoop platform, that is gonna be more applicable to a larger company than a small company. But some of the NoSQL backends can definitely be used at smaller companies, mom-and-pop shops, or by people who are trying to do things with data that don't necessarily need to scale quickly but are trying to do something different that doesn't fit into a relational model. Yeah, I think Josh has given the age-old answer of "it depends," it sounds like to me. And I think in general, if you're dealing with smaller amounts of data, a traditional, well, not a traditional, but a NoSQL technology might fit the bill, whereas some of the big data, maybe Hadoop, technologies might be a little bit of overkill. On the other hand, if your small business is dealing in massive amounts of data, then obviously it's gonna be a good fit.
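Picking up Josh's point about NoSQL backends for rapid application development, here is a hedged sketch of what a flexible-schema document store buys a small team, using MongoDB via pymongo purely as one example. The connection string, database, collection, and documents are all invented for illustration, and the sketch assumes a local MongoDB instance is running.

```python
# A sketch of schema-flexible storage for rapid app development.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local mongod
events = client["webapp"]["events"]                # hypothetical names

# Documents in the same collection need not share a schema, so the
# application can evolve without relational-style ALTER TABLE migrations.
events.insert_one({"user": "u1", "action": "click", "page": "/home"})
events.insert_one({"user": "u2", "action": "purchase",
                   "items": [{"sku": "b-101", "qty": 2}], "total": 19.98})

# Scaling out later is a deployment decision (sharding), not a rewrite:
# this query is the same on one node or fifty.
for doc in events.find({"action": "purchase"}):
    print(doc)
```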
So, Peter, what were you gonna say? No, good answers, both of you guys. I was gonna add on to that that our challenge in IT for years and years has been: how do we put something in front of the user so that they can react to it? Because almost always, when you hand them something in response to the initial requirements, they say, well, that's close to what I wanted, but could you make it pink, and I'd like it to dance? And you go, oh wow, I didn't get that from the first pass. So the faster you can shorten that cycle the better, and that is something people have, I think, ignored about these big data technologies: they do lead to some faster rapid prototyping of applications. Now, the goal here, in our minds, would not be to train all of your people in big data techniques, but to get one group good with them and have them understand that they would be the ones to go in and do lots and lots of prototyping as a way of refining those requirements, as Steven and Josh have said. Great question, thanks.

All right, and we'll move on to the next question, which is: Hadoop, as you've defined it, doesn't include technologies or products such as Impala, Spark, Drill, et cetera. Josh, I'm looking at you. Yeah, so the ecosystem of Hadoop, especially Hadoop 2.0, does include those technologies. I think for the purposes of this presentation we were focusing on HDFS and MapReduce, which are the core of Hadoop. Spark is a way to do stream or in-memory analytics, and Impala is a way to query the data as it exists in the HDFS platform. So to me those are all hooks that go into the core of what Hadoop is. As the Hadoop ecosystem continues to expand, we'll see more of these things being written, by either the Apache Software Foundation or other organizations, as improvements. I think the improvement here is trying to bring more of the relational users into the big data world; that's one of the big things that Impala and Hive were written for, so that people could write standard queries against these big data stores. So the ecosystem does include those additional things, but for the purpose of this discussion we were just trying to balance what Hadoop is versus NoSQL, and the ecosystem definitely does include other tools, processes, and techniques.

Yeah, so thanks for pointing that out. On the other hand, it does now require a larger vocabulary and a larger corpus of knowledge for your organizations. Going back to the previous question: if you're gonna stand up a group that's gonna learn how to use big data techniques to do rapid prototyping as well as the other things we've described, the learning curve is steeper, given that it's simply more information to integrate into the process. Again, if you guys are interested in exploring those topics a little bit more, gosh, how many people do we have out at the NoSQL conference this week, guys? I think we're sending two people out to the NoSQL conference. Two people. Two or three, yep. So there are a couple of Data Blueprinters out there at the NoSQL conference in San Jose. But again, we're as anxious to learn about some of these extensions as everybody else is, because the field is moving very, very rapidly. So again, great question and clarification there. Thank you for that.

Yeah, these are great questions. So the next one is: I heard that Hadoop is not so good for too many writes. For quick writes, a JSON doc store is usually recommended. What are your thoughts on that?
Yeah, these are great questions. So the next one is: I heard that Hadoop is not so good for too many writes. For quick writes, a JSON doc store is usually recommended. What are your thoughts on that? So I think Hadoop in its traditional sense is not necessarily write-intensive or write-efficient. There is a platform called Cassandra, an Apache project, where the file system underneath the Hadoop stack is essentially replaced with Cassandra's own storage, and the Cassandra platform is made specifically for fast writes. Now, JSON doc stores also have that capability. But once again, I think when you get into big data, once you're trying to figure out what you're trying to do, is it write-intensive? Is it analytics-intensive? Then it becomes a choice of looking at this platform map here, understanding the pros and cons of each of the different platforms, and picking and choosing the ones that are gonna give you the best lift and leverage for the activity you're trying to execute. If you think about, again, not to pick on the Apple example too much this week, but one of the things they're doing in the new releases of both iOS and, what are they calling it, El Capitan, the new Mac OS, is exactly this. They look at how these things are operationalized, they look at how people use them, and then they go back and recode those sections of the operating system, or those pieces of functionality, using faster techniques. So, for example, the original code might have been written in some object-oriented language, and then they recode it in a way that runs much, much faster. And it's the same thing here. As you look at this and discover how your prototype is being used in real life, then absolutely it makes sense to go back in and look at the alternatives. And there are alternatives being built, because people will do this: they'll find that a platform scales great for reads, but if you need to do lots of writes, maybe not so good. So again, that's why we love this particular map, and I'm just sort of keeping it up there while we've got this very technical discussion going on. Can you imagine trying to take this to management, though, and saying, well, really, we can't figure out whether we should use a JSON store or Cassandra? Management's not gonna be able to help you in these areas; you need people with a lot of good experience already. All right, and we've got another question here. Could you define your usage of the term real time, and whether this relates to the data creation or the data analysis and processing? All right, so in the specific use case I was talking about with our client, real time dealt with both. First, the creation of data: they were actively tracking changes in prices, so they wrote scripts that would refresh web pages and actively look for changes on each page to see if there was a real-time change in that information. That information could then be fed in real time to the processing and analytical engine, which would update the operational activity. So it was real time both in the creation of the data and in the analytical processing, as in, let's update our price in the marketplace. But it had to be monitored and controlled, which is where some governance had to be put in place; otherwise the machine would run and turn it into a race to the bottom. So basically, minimum prices were set on specific types or categories of books, which is part of their predictive model for the pricing scheme.
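Circling back to the quick-writes question for a moment, here's a minimal sketch of the JSON doc store pattern, assuming a local MongoDB instance and the pymongo driver. The collection name and document shape are invented; this illustrates the general write-heavy pattern rather than the client system described above.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# Each write is a single self-contained JSON document: no schema migration
# and no joins, which is what makes doc stores attractive for quick writes.
events.insert_one({
    "sensor_id": "engine-42",
    "reading": 97.3,
    "recorded_at": datetime.now(timezone.utc),
})

# Bulk inserts amortize network round trips when the write rate is high.
events.insert_many(
    [{"sensor_id": f"engine-{i}", "reading": 90.0 + i} for i in range(100)]
)
```

The trade-off, as the discussion above suggests, is that what you gain in write speed and schema flexibility you give up in the strong consistency a relational store would enforce.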
We did mention this in another part of the presentation too, when I was defining big data techniques, and it's in many ways a similar analogy. Many people characterize these big data techniques by the availability of continuous, instantaneously available data sources. So Josh mentioned the Twitter feeds around movies. It's quite useful; in fact, it's almost counterproductive to go in a different direction. The best predictor of how a movie will do subsequent to its opening release is the Twitter feed, and so studios are taking all of this continuously, instantaneously available data and reading it in. Now, does that mean you should focus exclusively on that? Well, if you're a movie studio trying to decide whether putting millions of dollars into an already losing movie will change people's perceptions and draw more audiences to it, that's a great application. On the other hand, I don't know if it would have helped New Coke out or not. Of course, that example's probably too old for most of the people listening to this one. Anybody remember New Coke out there? Nope. All right, I got another one for you. What types of data problems, i.e., things that someone would wanna do with analysis of data, must be done in a relational database? So looking at big data, I don't think there's a "must be done." I think it's more a question of what's more efficient or more capable in a relational database. A relational database is ACID, right? Atomic, consistent, isolated, and durable. What that means is that when we query a relational database, we're gonna get a consistent answer. But if we're looking at big data or NoSQL, we only get eventual consistency. So for something like finance data, if we wanna know what our revenue was for yesterday, we wanna make sure all that data is consistent. That's more of a relational type of question, because we need to make sure the data's accurate and consistent. Versus: give me all the revenues for the past 20 years and let's project out what revenue might be for the next five years, where I need to crunch all 20 years' worth of data. That might be more of a big data question, because the relational system may balk at grabbing blocks that are 20 years old and trying to put them into a model for future projection. I think you have to think about the use case there and what becomes more efficient and more useful. Traditional BI is another relational activity, where we use cubes, right? Star schemas let us look at our data and answer questions efficiently, and allow people to explore data from the perspective that's built into the cube. So, I want all my customers divided by the product type they bought and how much they spent on that product: that might be more of a cube-based answer, versus looking at all the loyalty program data about how many times they used their loyalty card and everything they purchased. I think Peter just brought up a CAP theorem slide on the trade-offs between consistency, availability, and partition tolerance, and those things need to be thought about too, because they're all drivers for selecting the right technology platform for what you're trying to do.
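Before we get to the slide, here's a toy illustration of Josh's contrast: the "crunch 20 years and project five" question is an analytical batch job, sketched below with an invented revenue series and a simple least-squares fit. The numbers are made up for illustration only.

```python
import numpy as np

years = np.arange(1995, 2015)  # 20 years of history
# Synthetic revenue: a trend plus some noise, standing in for real history.
revenue = 50 + 3.2 * (years - 1995) + np.random.default_rng(0).normal(0, 4, 20)

# Fit a straight line to the full 20-year history...
slope, intercept = np.polyfit(years, revenue, 1)

# ...and project the next five years from that fit.
future = np.arange(2015, 2020)
for year, value in zip(future, slope * future + intercept):
    print(f"{year}: projected revenue ~{value:.1f}")
```

The "what was yesterday's revenue" question, by contrast, needs exact, consistent numbers from the system of record, which is why it stays a relational question even though it touches far less data.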
So Josh gave a great response without the benefit of the slide there. I of course ran straight for the slide to see if I could get it up, but yeah: atomicity, consistency, isolation, and durability are characteristics of problems that are well solved by relational database technologies. NoSQL, on the other hand, as Josh said, is BASE: basically available, soft state, eventual consistency. And of course the other piece to note on this diagram is that small data sets can be both consistent and available, so clearly it's adding the volume that really pushes you out of that small data set area. Now again, this is going to continue to evolve as the technologies grow, et cetera, et cetera, all the way around; these are all things to watch. I'll make sure we toss this slide into the deck here as well, so you guys get a copy of it. Okay, and a couple of quick questions here. One is: will a copy of the slides be made available? The answer to that is yes. We also got a nice anecdote; this is just a statement: "I was using parallel processing in the early 1980s to handle big data on a mainframe, because our files did not fit on any of the tapes and disk was expensive." I think that speaks to what you guys mentioned, that some of these techniques have really been around for quite a long time. They've come back. Risen from the grave, if you will. I'm sorry, was the 1980s really that long ago?
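For readers who want the BASE idea from that slide in miniature, here's a toy sketch of eventual consistency: a write is immediately accepted on one replica, and the other replicas converge only after propagation, so a read in between can be stale. This is purely illustrative and not any real NoSQL product's implementation.

```python
class EventuallyConsistentStore:
    """Toy store: available for writes immediately, consistent eventually."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to every replica

    def write(self, key, value):
        # The write lands on one replica and returns right away ("available").
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=1):
        # Reading another replica may return stale data ("soft state").
        return self.replicas[replica].get(key)

    def propagate(self):
        # Background replication eventually makes every replica agree.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("revenue:2014", 1_250_000)
print(store.read("revenue:2014"))  # None: replica 1 hasn't seen the write yet
store.propagate()
print(store.read("revenue:2014"))  # 1250000: eventually consistent
```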
Okay, moving right along. This is a really good question here: you highlight architecture, governance, and models as part of big data practice. Vendors promote "collect all the data and we'll figure it out later; no need for data models, et cetera, that slows projects down." That philosophy has killed many a data warehousing project. Your thoughts? And a follow-up there: storing the data first and doing the transform in Hadoop sounds okay, but if the data needs to be transformed every time it's used, isn't that paying for MIPS to transform over and over again? Shouldn't the data be landed in the most usable format? I think this is a great question. Josh, you wanna lead with this one? All right, so I'm gonna start with the first part of Bob's question. We definitely take the approach of highlighting architecture, governance, and models in thinking about a big data practice. It's part of avoiding the shiny object syndrome, where the vendors tell you to worry about those things later. And I don't think the need to understand data models goes away when we come to big data, and there's definitely a need to understand how we're gonna use the platforms. What's the return on investment? What governance should we put in place? Not only for landing the data, like scoping the data that lands in it, but also for how we intend users to interact with it, right? We don't want data scientists going wild in the big data platform and coming up with insights that aren't necessarily useful insights. So when we talk about governance, architecture, and some of those foundational practices on the front end, they're a little bit different from traditional data management governance, architecture, and models, but they should still be thought about as part of that process. Because if you just ignore those things and do what the vendors say, yes, the vendors are gonna get their money very quickly, and you're gonna have a shiny object in your shop, but you didn't necessarily think about how it fits into your business strategy. You might be able to land all the data in the world, but that doesn't necessarily mean it's gonna help you gain the insights you're looking for. And I don't think considering some of these things up front is gonna kill your data warehousing project; it's just giving you a good foundational and consistent context for what big data is gonna be used for in your business or organization. All right, on the second part, where we're talking about storing the data first and then transforming it in Hadoop: some of the other things in the Hadoop ecosystem can help with this, and can definitely be used to do transforms as things go into Hadoop. Spark does in-memory analytics, so you might be able to do some in-memory or streaming processing first and then land the data in Hadoop pre-analyzed. There's also a tool called Flume, which allows some transformation; basically, it lets you set up a streaming pipeline from certain types of APIs or other data sources, and then you can shape that stream into how you wanna land the data in Hadoop. But also, part of the processing, and part of getting away from the MIPS on one machine, is the platform parallelism that allows for processing across multiple machines. So MapReduce is a great framework for doing those transformations, but you're also not gonna transform the data every single time you land it. You might land, say, ten days' worth of data and then do one analysis across those ten days; there's a rough sketch of that batch pattern below. Unless you're gonna create something that's more real-time, in which case you might utilize Spark as part of the ecosystem and land the data eventually; Spark is gonna give you more of your real-time transformation and information, where Hadoop is more of an "I might have 10, 20, 30 days, or 15 years' worth of data," or "I have 100 sensors on an engine and I'm gonna store all that sensor data and then run analysis after I've stored it." Or the Human Genome Project is another one, where you're gonna analyze the whole set of genomes at a given point in time. So you have to think about the ecosystem and what you wanna use: Hadoop is gonna store a large amount of data and then allow you to do analysis and transformation, where if you're trying to do something more real-time, you might use Spark or something else in the ecosystem. And the only reason I'm putting up the glove box example is that a lot of people say, let's just put it up and then we'll fix the data later. It just ends up being a more expensive process, because you're constrained once you've done it that way. This particular example talks to using it in the cloud: we'd rather you clean the data on the way into the cloud than put the data in the cloud and then clean it. We can help you do it right up front, or we can help you clean it after you've gotten it in; we're happy to work either way, but we'd rather you guys move more efficiently through that process.
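Here is that rough sketch of the land-first, analyze-later batch pattern, using PySpark's RDD API as a stand-in for hand-written MapReduce. The HDFS path and the "sensor_id,value" line format are hypothetical, echoing the engine-sensor example above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-sketch").getOrCreate()
sc = spark.sparkContext

# Map step: parse each landed line into (sensor_id, (value, 1)) pairs.
pairs = (
    sc.textFile("hdfs:///landed/engine_sensors/*.csv")
      .map(lambda line: line.split(","))
      .map(lambda parts: (parts[0], (float(parts[1]), 1)))
)

# Reduce step: sum values and counts per sensor, then take the average.
averages = (
    pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
         .mapValues(lambda total_count: total_count[0] / total_count[1])
)

for sensor, avg in averages.take(10):
    print(sensor, avg)
```

Note how this answers the MIPS concern: the transform runs once, in parallel across the cluster, over everything that has been landed, rather than being repeated machine by machine.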
All right, I've got one more question here, so if anyone has any follow-up questions, feel free to sound off. A self-serving career question: how do I transition from being a relational data modeler or analyst to a valuable big data world position? I'm gonna keep silent on that one and let Josh take it. So I think it comes down to what you would want to do in the big data world, right? There are technologists who are all about setting up the ecosystem, and then there are data scientists who are all about utilizing the ecosystem to gain insights and answers. So if you're a data analyst, there's now this concept, and I don't know if there's a better term for it, of the data scientist, right? And there are lots of organizations out there starting to put together programs for data science, which is a natural transition if you're more on the analyst side of understanding how we analyze data. The data science programs are gonna teach you more about statistics than you probably ever wanted to know, because it's moving into, how do I use statistics to gain insights and look for correlations? I think Coursera has a program from Johns Hopkins that deals with data science, and some of the major universities, like Berkeley, are starting to release programs too. So I think that's the next step if you're on the analyst side. If you're on the technologist side, it becomes more of: do you wanna become a MapReduce programmer? Do you wanna become a Pig writer? Pig's one of the scripting languages for pulling data out of Hadoop platforms. Do you wanna learn more about the different platforms and what they're beneficial for? Like I told you, Cassandra's a write-heavy system. Another thing to look at would be NoSQL data modeling: there are approaches for taking a relational store and turning it into a column-store type of model, and there's the concept of anchor modeling, where every table holds only one attribute and the value stored with it. So it's more like key-value pairs, but they're modeled together to build up the entities and their relationships. So I think you can transition either way: the analyst becomes more of a data scientist, and the modeler side becomes more of a, let's look at some of the NoSQL technologies and understand how data models apply to NoSQL. There's another component that goes into that too, which is whether you're an introvert or an extrovert. There are an awful lot of folks out there who just prefer to interact with the data, and that clearly focuses in on one aspect of it, whereas identifying the business problems is another aspect, and we absolutely need both. Both Steve and Josh are superb in that area.
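A rough sketch of the anchor-modeling idea Josh mentions, reduced to its essence: each attribute lives in its own narrow "table" keyed by the entity. Real anchor modeling has more machinery than this (anchors, ties, knots, history), and the record shapes below are invented for illustration.

```python
from collections import defaultdict

def to_anchor_tables(entity_id, record, tables):
    """Decompose one wide record into per-attribute (id, value) rows."""
    for attribute, value in record.items():
        tables[attribute].append((entity_id, value))

tables = defaultdict(list)  # one narrow "table" per attribute
to_anchor_tables(1, {"name": "Ada", "city": "Richmond", "tier": "gold"}, tables)
to_anchor_tables(2, {"name": "Grace", "city": "Arlington"}, tables)  # sparse is fine

for attribute, rows in tables.items():
    print(attribute, rows)
# name -> [(1, 'Ada'), (2, 'Grace')]
# city -> [(1, 'Richmond'), (2, 'Arlington')]
# tier -> [(1, 'gold')]
```

The payoff of this shape is the same one the column-store discussion hints at: attributes can be added, versioned, or left sparse without restructuring one wide table.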
So listen, guys, thanks. It's been a terrific session, really interesting questions on all of this. Steve, did we get any more questions? We did, yeah. Here's a quick one. Oh, there, okay, trying to cut us off; we actually got two more. All right, so this question is: I am preparing a business case for a data innovation lab to look at the possibilities of using data, both big and structured, to drive business value. Would Hadoop be a good ecosystem on which to build this data innovation lab? Probably a component of what you need to be building, but I would also be building skills in there, as I mentioned before about Steve and Josh having the ability to turn these results into return on investment. There's a little book out there called Monetizing Data Management that would be worth looking up on Amazon and grabbing a copy of. So it's not just the ability to go back to management and say, hey, look what we can do with the data, but look what we do with the data and how it works in supporting our business. That's much more important than just looking strictly at the data. I got dumped off the internet here, guys; I'm not sure what happened to me, but I'm still on the audio. You guys still there? Yep, we're here. Yep. All right, keep going then. Okay, and the last question I have here is: what are the best tools out there for cataloging the metadata for the data stored in Hadoop or NoSQL, so business analysts can see what's available? That's a really good question. I'm not really sure; the only one I know off the top of my head for Hadoop is Hortonworks' HCatalog, maybe. Yeah, so as part of the ecosystem, you can create what would be considered structures, or metadata, around the files that are stored in the Hadoop platform. There are two pieces there: HBase and, I think it's called, HCatalog. Like you said, in order to use HBase, which kind of transforms some of the data into a table-ish structure, the catalog generates metadata about the data that's stored. I'm not gonna say it's a hundred percent answer, because, as the technology vendors will tell you, you can land whatever you want and try to figure it out later; metadata isn't a strong suit of the core Hadoop platform. Now, metadata is built into the NoSQL technologies, which are often put on top of Hadoop, but it requires the technologists working with the people who understand the data to say, here's the NoSQL store and here's the metadata that's in it. That would give the analyst information, so maybe the analyst would be exposed to the NoSQL layer rather than the back end of Hadoop. Cool, yeah, I believe that's the last question we have here. Shannon, are you still there? Maybe it's just the three of us; we may have lost her. We have a little technology glitch on this end, but as long as you're still there, I guess everybody else is, so thanks, Shannon. No, I'm still here, and thanks to the three of you for this great presentation, informative as always. And as always, thanks to our attendees for being so engaged in everything we do, with all the questions in the chat; we just love the level of engagement. And as Steven already answered, one of the most common questions we get is whether people are gonna receive a copy of the slides and the recording. I will be sending a follow-up email by the end of Thursday for this webinar, with links to exactly those things, so you can have both, plus anything else that was requested throughout the webinar today. So, Peter, I hope you continue to have a great show there at DGIQ, and hopefully we'll see everybody at NoSQL Now in San Jose. Very good. And for our next one, we'll join you guys online on July 14th; we're gonna talk about the data management maturity model, and there are very exciting developments in that area. Perfect, and thanks. As always, thank you guys. If you have any questions, feel free to shoot us an email. Shannon, I'll go ahead and hand it over to you; you can take us out. Thanks, yeah, and just another big thank you to our sponsor, HP Security Voltage, the sponsor of today's webinar, and to all of our sponsors who make these webinars happen for everybody. So thanks, everybody, thanks again for the great presentation, and I hope everyone has a great day. Bye all.
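As a quick addendum to that last metadata question, here's a minimal sketch of how an analyst might browse what's available once tables are registered in a Hive metastore, the metadata layer that HCatalog exposes. This assumes PySpark built with Hive support; whatever databases, tables, and columns it prints are simply what exists in your metastore, and this stands in for, rather than reproduces, the HCatalog tooling discussed above.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("catalog-sketch")
    .enableHiveSupport()  # connect to the Hive metastore
    .getOrCreate()
)

# Walk the metastore: databases, then tables, then columns.
for db in spark.catalog.listDatabases():
    print("database:", db.name)
    for table in spark.catalog.listTables(db.name):
        print("  table:", table.name, "| type:", table.tableType)
        for col in spark.catalog.listColumns(table.name, db.name):
            print("    column:", col.name, col.dataType)
```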