From San Jose in the heart of Silicon Valley, it's theCUBE, covering Big Data SV 2016. Welcome back to theCUBE and the Big Data Silicon Valley event, we've been live for the last couple of days and tonight we're being joined by a whole bunch of the members of our community to talk about some of these core Big Data issues. This segment we're going to spend some time talking with some of the industry influencers, some of the leading analysts in the business, and I've been around analysts a lot and this is one of the best groups that I've ever been a part of assembling here. I wanna introduce each of them in turn and I'm gonna ask them to introduce themselves. So Tony Baer of Ovum. Okay, I'm Tony Baer, I'm principal analyst with Ovum and my coverage basically is data management, database management and Big Data. And really the core theme of my coverage is looking at what will make Big Data a first-class citizen in the enterprise. Excellent, Mike Gualtieri of Forrester. Yeah, Mike Gualtieri, principal analyst at Forrester. I generally cover advanced analytics and Big Data platforms. So I'm co-author on the predictive analytics wave, on the streaming analytics wave, on what we call the Big Data search and knowledge discovery wave, a few other waves, Hadoop waves as well. So covering everything Big Data. You're one of Forrester's big surfers. And George Gilbert of Wikibon. So my name's George Gilbert. I'm the chief Big Data analyst at Wikibon. I like to say chief, but no Indians. And I'm a techie at heart and I like to cover how the technologies work, but I think what's ultimately more interesting is how we're gonna apply them. And I think that having Peter here to help steer us with that kind of discipline is gonna be a good experience for us. Well, George, you're still here, so I haven't driven you nuts yet. So you heard me talk about this evolving role of Big Data and digital business. 
And I know Tony, you for example, talked about making data a first-class citizen. So I'll kick off some questions and then we'll open it up for everybody. I'll wander out into the audience and really, really frighten you. But Tony, why don't you talk a bit about what it means for data to become more of a first-class citizen within the enterprise? Right. Well, it really goes on a couple of levels, and part of it is really looking at the fairly mundane governance level, which is that with Big Data, you need to know what data is in your data lake. And the thing is that you need to have the same type of confidence that the data in that data lake is as protected and managed as it is in your enterprise databases. And so in the Big Data space, we basically started with a raw platform, and so we've really had to invent as we've gone along. What we've found is that we've really had to adapt and extend governance. And so I think that's been a real key challenge for the Big Data community. But the other part, which is really where the advantage is, and really where the big payoff is, is looking at how data can make a competitive difference. And that's where I really see a key role for machine learning, because ultimately it's going to transform not just analytics but applications, so that your supply chain application will have embedded analytics with embedded machine learning. You will not have to worry about having to program different machine learning algorithms, but it will make you smarter. And that's the type of thing that gets me very excited. Yeah, Mike, until recently I was at Forrester with you and got to know your research extremely well. 
And one of the advantages that you bring to the table is great visibility both into the technology and into many of these new business problems, what I call the problems of demand, through some of Forrester's research about customers and how customers are evolving. Can you help identify a few of those key issues that are starting to percolate up? Yeah, so first I sort of use this framework: there are four types of insights that a business needs. Strategic insights, those are used to make decisions about should I build a new building here? Should I acquire this company? So there's strategic insights. Then there's KPI insights: how is my business performing? There's different time frames for these as well. And then there's operational insights, and then finally there's real-time decisions. So there's sort of a spectrum, and across that spectrum, you need four types of analytics. You need descriptive analytics, which is more of your traditional BI. You need predictive analytics, and that's a little bit of what Tony was saying, in terms of using a combination of statistical and machine learning algorithms to build predictive models. You need streaming analytics, which is all about what's happening in real time, think IoT. And then you need prescriptive analytics, which is, okay, what am I gonna do? I've predicted this outcome, what's the next best course of action? So what our research at Forrester shows is that the insight companies are most focused on is about the customer, right? Because if you can predict how the customer will behave, perhaps you can serve them better. And you can serve them better not tomorrow, but right now, right? Because the consumer is increasingly connected as well. 
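As a hedged aside, the jump Mike describes from predictive ("how will the customer behave?") to prescriptive ("what's the next best action?") can be sketched in a few lines. Everything here, the plan names, the churn data, the discount action, the 40% threshold, is a hypothetical illustration, not anything from the panel:

```python
# Toy sketch: a "predictive" step (churn rate per plan, learned from history)
# feeding a "prescriptive" step (next best action). All data is made up.
from collections import defaultdict

def train_churn_rates(history):
    """history: list of (plan, churned) pairs; returns churn rate per plan."""
    counts = defaultdict(lambda: [0, 0])  # plan -> [churned_count, total]
    for plan, churned in history:
        counts[plan][0] += int(churned)
        counts[plan][1] += 1
    return {plan: churned / total for plan, (churned, total) in counts.items()}

def next_best_action(plan, rates, threshold=0.4):
    """Prescriptive step: act on the prediction right now, not tomorrow."""
    return "offer retention discount" if rates.get(plan, 0) > threshold else "no action"

history = [("basic", True), ("basic", False), ("pro", False), ("pro", False)]
rates = train_churn_rates(history)          # {"basic": 0.5, "pro": 0.0}
action = next_best_action("basic", rates)   # "offer retention discount"
```

The point of the sketch is only the shape of the loop: a model built from batch history, then consulted at the moment of customer contact.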
So I think what's behind big data and what's behind all of those companies on the exhibit floor is moving towards providing all those forms of analytics, but doing it cheaper, doing it faster, and especially using more advanced forms of analytics, not just reports that someone throws on the desk. And by the way, things like visualization tools, which have been very hot, I don't see those as particularly useful compared to other forms of analytics, because I think they're really flashy tools, but how many insights are really gained from them? So I think companies should refocus their efforts more on the advanced analytics, more on the predictive modeling and prescriptive analytics. They're increasingly in service to that fundamental question of engaging the customer. Yes, yes. So George, we've talked predictive analytics. I find it fascinating that we can almost look at the history of the industry through the problems that we've solved; the relationship between data and time is inextricable. OLTP recorded what happened in the past; spreadsheets and a lot of planning documents were about telling people what's gonna happen in the future; and then there's this real, real hard problem of what's happening right now. How do we compress the time to move data through the analysis chain to get faster to that real-time notion? Okay, let me start with a usage scenario that I think might be relevant, which is Geoffrey Moore has done a great job of popularizing the notion of systems of engagement, but he talks about it as sort of a consumer, internet-service-provider-class user experience tied into a traditional enterprise application. 
But it's when you start peeling away the foundation that you need to build that, that you get to some of the answers to the questions you're posing. Which is, for one, and following on Mike's comment, you can't anticipate and influence and guide the customer's interactions unless you have a machine learning model in the back, or at least a machine learning model that's been snapshotted out into a predictive model that says, here's how you should guide this customer. And then creating that model, in the past we had a database over here and a batch load, was it daily, weekly, whatever, to the data warehouse, which might have, if it was very advanced, cranked out a model, and then the model would have gone back into the operational application. Now we broke down the entire pipeline into a bunch of mix-and-match engines, and that's the Hadoop analytic pipeline. And we thought it was the coolest thing since sliced bread because it was 5 to 10% of the cost of your typical data warehouse appliance. And then we got all this flexibility, you had incredible choice. And then we woke up one day and we realized we were in the same position we were in with PCs in the 90s, when Gartner did a study and found out that the average annual cost of maintaining a PC was $6,000. And so with our big data pipelines, it was like that Verizon commercial where you have 150 people behind you: can you hear me now, can you hear me now? Anyway, the short answer, or the end of the answer, is that we're revisiting simplicity with technologies like Spark. It doesn't replace Hadoop; it complements the storage and management and it replaces some of the analytic engines. It's not all totally mature, but I think there's a fair amount of agreement that it can deliver what we used to go to these mix-and-match engines for. And Peter, you asked about sort of data as a first-class citizen, right? I'm thinking about that. I'm thinking, well, okay, it's really higher than that. 
It's analytics, but really it's predictive, right? So I like to think, what does it mean to be a predictive enterprise, right? Where for every decision you make, wouldn't it be good if you could have a certain probability, a better probability, of what that decision was gonna result in, whether that decision is about what to offer a customer or, at the strategic level, what company to acquire. I think there's an opportunity to use that data to create predictive models, and I think once companies do that on all levels, they will become a predictive enterprise. And guess what, there already are predictive enterprises. It's called Google, it's called Facebook, it's called Amazon, right? Those are the companies that are truly predictive enterprises. They're using it across the board for everything. And in many respects, it's almost the difference, or the differentiation, between some of those time frames, the prediction versus the operations, that's starting to blur, that the future is a series of operational steps that can be mapped out in advance. Well, there's another concept that I talk about a lot, which is called perishable insights, right? Many insights are perishable: you get the insight immediately, and there are some things where, if you don't act on that insight, you're done, right? It's really easy to think of some stock trade or something. But increasingly, as consumers are connected and your B2B customers are connected, that window goes down, right? So how are you gonna capture those perishable insights? It's not just about throwing it into a data lake and doing the analytics there; that's insufficient. You need that, but that's insufficient. You also need to do real-time analytics to capture and then act on those perishable insights. 
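Mike's perishable-insights idea boils down to evaluating each event as it arrives, inside its window of usefulness, rather than after a batch load. A minimal sketch, using the heart-rate example that comes up later in the conversation; the window size, the 150 bpm limit, and the class name are all invented for illustration:

```python
# Hedged sketch of acting on a "perishable insight": a streaming check that
# fires the moment the recent average crosses a threshold. Parameters are
# illustrative, not from the panel.
from collections import deque

class PerishableInsightDetector:
    def __init__(self, window=5, limit=150):
        self.readings = deque(maxlen=window)  # only the recent window survives
        self.limit = limit

    def ingest(self, bpm):
        """Evaluate every reading on arrival; the insight expires otherwise."""
        self.readings.append(bpm)
        avg = sum(self.readings) / len(self.readings)
        return "alert" if avg > self.limit else "ok"

d = PerishableInsightDetector(window=3, limit=150)
results = [d.ingest(x) for x in [120, 160, 180, 190]]  # ok, ok, alert, alert
```

The `deque(maxlen=...)` is doing the "window goes down" part: older readings fall out automatically, so the decision is always made on fresh data.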
And then the other thing that came up in one of the CUBE interviews, the notion of today's perishable insight, or this context's perishable insight, like, oh, my heart rate's too high, becomes tomorrow's not-so-perishable insight when I have to go to a doctor because I work out so much. I mean, it's just... I have to go to, this is not my problem, I'm talking about a friend of mine. I have to go to a doctor and actually, that data now becomes not so perishable. In a new context, it takes on a new role and a new source of value. Tony? Well, there's another way of thinking about it, which is really breaking down the barrier between the transactional application and analytics. Traditionally we did transactions, and analytics we did after the fact, to figure out how we could do transactions better the next time. What we're finding out now is that through the advances in hardware and in-memory processing, this can now be done fast enough, and in fact at the speed of today's business, where you're trying to give, let's say, a next best offer, you're trying to engage a gamer, you're trying to stop some intrusion in real time. Integrating analytics with the transaction is not a luxury, it's a necessity. Right, so do we have any questions from the audience? Is anybody, so we got one, so let me see if I can take my incredibly fit self and we got a microphone here. Yes, there it is, Lawrence. Thank you, sir. Say who you are, please. I am Phil Hotston, I'm with Sama Technologies. So a new buzzword that I heard today in a meeting with a customer, this is for you, Mike: a client that I was working with talks about building not just a data lake, but an intelligent data lake. Can you tell me where you think that trend is going? Well, I think it's gonna go to an advanced intelligent data lake. It should be predictive as well. Yeah. Autonomous, yeah. 
So, yeah, I haven't heard this, but I think what they're probably referring to is, the first notion of a data lake is that we just have silos. You know, we have a portfolio of hundreds of applications and the data is just everywhere. We have some of it in a data warehouse, but the data warehouse itself has become a silo, so let's just get it all together in a data lake. No intelligence. The intelligence is that now we finally have the data together, so now we can do some more sophisticated analysis. The intelligent data lake could mean, I don't know what they're referring to, but could mean, we're seeing some companies now adding continuous analytics on that data lake, right? Starting to do machine learning to build models in that data lake, not using the data lake simply as a repository, as sort of a landing zone for data that is then pushed up to a data warehouse or to another platform for analysis, but actually running continuous analytics. And if you look at an example of this, Google Dataflow, which is sort of their real-time batch-slash-streaming platform in one, that's really the technical approach that I would say is probably similar to an intelligent data lake. Another question? James, say who you are. Is it on, guys? Hello? Did I turn it off? Yeah, James Kobielus with IBM. I'm also a former Forrester analyst, but that's immaterial here. The notion of perishable insights was really interesting. In many ways, you can look at those as perishable in the sense that they may be very short-lived in terms of windows, but they may encode very important response loops to things that might happen in the future. And that's algorithms, essentially. So I'd like to hear what the panel's thought is in terms of governance: there's data governance that you should be doing in your data lake. 
But how should you be doing algorithm governance or model governance so you can preserve those perishable insights algorithmically, so that they can be used in the future when those particular scenarios present themselves? How can you make those kinds of perishable insights discoverable in the future when somebody's building another next-best-action application that requires those kinds of assets? Yeah, it's a good question, Jim, because a real-time or streaming analytics platform doesn't magically ingest data and figure out what to do. And this is why it's also important to have a batch processing regime as well. So you're gathering all of this data, you're building models, and once you're confident with that model, then you can inject it into the streaming engine to detect those scenarios. But it's a continuous process, because models are based upon correlation, not causation, so they can go out of whack. So you have to continuously update and monitor those. But you asked another question, too, which I thought was about how do you trust the algorithm? This is a big problem as well, right? Because some of these algorithms are just running, and a business executive has to make a decision. A data scientist says, here, I found the perfect model. That model could have implications of millions of dollars if it's wrong. And there's techniques that companies use to overcome that. I mean, Amazon, for example, is constantly testing models. They'll do an A/B test, or sometimes it's known as a champion/challenger, for those models. But there's also a big issue as more models are used: how can we trust those models from a business perspective? I think another issue there is also, are we using the right data set? Are we looking for the right signals? So it's really both models and data sets. 
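The champion/challenger pattern Mike mentions has a simple core: route a small slice of live decisions to the candidate model, record which model answered, and only promote the challenger once its results hold up. A minimal sketch, where both "models" are toy threshold functions and the 20% traffic split is an arbitrary illustration:

```python
# Hedged sketch of champion/challenger (A/B) testing for models.
# The two models and the traffic share are hypothetical stand-ins.
import random

def champion(x):    # current production model: a fixed threshold
    return x > 10

def challenger(x):  # candidate model under evaluation
    return x > 8

def route(x, rng, challenger_share=0.2):
    """Send a fraction of decisions to the challenger; tag each decision
    with the model that made it, so outcomes can be compared later."""
    model = challenger if rng.random() < challenger_share else champion
    return model.__name__, model(x)

rng = random.Random(42)  # seeded so the split is reproducible
decisions = [route(x, rng) for x in range(20)]
names = {name for name, _ in decisions}
```

In a real deployment the tagged outcomes would feed the monitoring Mike describes: if the challenger's decisions measurably beat the champion's, it gets promoted; if the champion drifts "out of whack", that shows up in the same log.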
I would add that there's sort of a third derivative, Jim, which is that as we get to, in that S-curve chart, when we're up to the right where we have sort of intelligent self-tuning systems, you never really take the data scientist or the human out of the loop, but you can apply machine learning to the analytic pipeline more and more. So it might evaluate new data sets to see if they are valid predictors, or really if they have a positive correlation, if they add to the fidelity of the model. And you can even go so far as to have it benchmark a model against an existing one, essentially to keep the existing one from drifting. So I guess what I'm saying is there's a runtime one, which is the one we're all familiar with, and then there's a design-time one that we'll see several years out. Just one thing to add to that, which is that what's really essential is to have, I mean, it almost sounds cliche-ish, but you need a good, effective collaboration environment. So where people have insights, let's say on different data sets or on specific models or algorithms, or know what to apply in certain scenarios, your domain experts or whatever, you need to have some sort of way of sharing that. And eventually over time you build that up, I mean, I don't want to sound like I'm trying to promote a knowledge management system, you know, but you need something very dynamic, and maybe there's kind of like a Yelp rating system or whatever. I mean, that's the type of thing which hopefully the community will start to figure out, but we need collaboration and to pool our collective intelligence. And one thing I'd add to that, James, and then I'm gonna move on, one thing I'd add very quickly is that it is easier to govern hardware and software than it is to govern human beings. 
And so the other thing we haven't mentioned is that as this machine learning system does more self-tuning, we can translate policy directly into software, and that's gonna have an enormous implication for how things run. One more? Okay. Who are you? That's a really good question, independent consultant. On a related point, I have a question: as you move to real time, how do you see companies dealing with data quality? You know, as a consultant, I always talk about it as the Mickey Mouse problem: almost every company has Mickey Mouse as a customer in their data set somewhere. It's a really good question with regard to data quality. And the thing is that, through our research at Ovum, and I realize that what you're talking about is a slightly different use case, but there's your traditional analytics where you're making some hard-and-fast decisions that must be auditable, and therefore that data has to be as watertight as possible. You need data where you have, let's say just for arbitrary sake, a 90% confidence rating. Then there's the exploratory work, where you're basically trying to figure out what's the right data to look at, what's the right problem, what question to ask. And there, if your data set is not totally complete, maybe you've missed some click streams or some tweet streams or whatever, it's not a showstopper. That being said, when you're dealing with a real-time stream, you need to know the provenance of that stream, the provenance of the source, I should say, and take that into account in terms of whatever decisions you're making on that stream. All right, so I wanna thank our panelists, Tony, Mike, George, great to see you guys again. And we're gonna reconfigure the stage and bring up some customers and talk about how the doers are getting things done. Thanks guys. Thank you. Thanks.
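As a closing aside on the "Mickey Mouse problem" from that last exchange: a minimal sketch of the screening Tony's answer implies, tagging each record with a confidence level so watertight, auditable decisions and looser exploratory work can apply different bars. The junk-name list, field names, and rules are all invented for illustration:

```python
# Hedged sketch: rule-based data-quality screening with a confidence tag.
# Rules and names are hypothetical; real systems use far richer checks.
KNOWN_JUNK_NAMES = {"mickey mouse", "test test", "asdf"}

def assess_record(record):
    """Return (confidence, issues) for a customer record.
    'high' confidence records can feed hard-and-fast, auditable decisions;
    'low' ones may still be fine for exploratory analysis."""
    issues = []
    name = record.get("name", "").strip().lower()
    if name in KNOWN_JUNK_NAMES:
        issues.append("junk name")
    if record.get("email", "").count("@") != 1:
        issues.append("bad email")
    confidence = "high" if not issues else "low"
    return confidence, issues

conf1, issues1 = assess_record({"name": "Mickey Mouse", "email": "mm@example.com"})
conf2, issues2 = assess_record({"name": "Jane Doe", "email": "jane@example.com"})
```

On a real-time stream, the same tag would carry the provenance signal Tony describes: the decision engine can weigh a low-confidence record differently instead of rejecting it outright.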