Hello everyone, welcome to theCUBE studios in Palo Alto, California. I'm John Furrier, the co-host of theCUBE and co-founder of SiliconANGLE Media Inc. I'm here with George Gilbert for a Wikibon conversation on the state of big data. George Gilbert is the analyst at Wikibon covering big data. George, great to see you. Looking good. Good to see you, John. So George, you're obviously covering big data. Everyone knows you always ask the tough questions. You're always drilling down, going under the hood and really inspecting all the trends, and also looking at the technology. What are you working on these days in big data? What's the hot thing that you're covering? Okay, so what's really interesting is we've got this emerging class of applications. The name that we've used so far is Modern Operational Analytic Applications. Operational in the sense that they help drive business operations, but analytic in the sense that the analytics either inform or drive transactions, or anticipate and inform interactions with people. That's the core of this class of apps. And then there are some big challenges that customers are having in trying to build, deploy, and operate these things. That's what I want to go through. George, this is a great piece. I can't wait to dig into these questions and ask you some pointed ones, but I would agree with you that, to me, the number one thing I see customers either fumbling with or accelerating value with is how to operationalize their data in a way they've never done before. So you're starting to see disciplines come together. You're starting to see people realize that digital business is not a department; it's not the marketing department. Data is everywhere, it's horizontally scalable, and the smart executives are really looking at new operational tactics to handle that. So with that, let me kick off the first question to you.
People are trying to balance the cloud, on-premises, and the edge, okay? And that's classic; you see it now: I've got a data center, I've got to go to the cloud, hybrid cloud, and now the edge of the network, we just talked about blockchain today, this is a huge problem. They've got to balance that, but they've got to balance it against leveraging specialized services. How do you respond to that? What is your reaction? What is your presentation? Okay, so let's turn it into something really concrete that everyone can relate to, and then I'll generalize it. The concrete version is: for a number of years, everyone associated Hadoop with big data. And Hadoop, you tried to stand up on a cluster on your own premises, for the most part. Amazon had EMR, but the activity at big companies, even including the big tech companies, was to stand up a Hadoop cluster as a pilot and start building a data lake, then see what you could do with the huge amounts of data that you couldn't normally collect and analyze. The operational challenges of standing up that sort of cluster were rather overwhelming, and I'll explain that later, so park that thought. And because of that complexity, more and more customers, all but the most sophisticated, are saying we need a cloud strategy for that. But once you start taking Hadoop, the components of this big data analytics system, into the cloud, you have tons more alternatives. So whereas in Cloudera's version of Hadoop you had Impala as your MPP SQL database, on Amazon you've got Amazon Redshift, you've got Snowflake, you've got dozens of MPP SQL databases, and so the whole playing field shifts. And not only that, Amazon has instrumented their application, in that particular case, to be more of a managed service.
So there's a whole lot less for admins to do. And if you look at the slides, you take every step in that pipeline, and when you put it on a different cloud it's got different competitors. And even if you take the same step in the pipeline, let's say Spark on HDFS to do your ETL and your analysis and your shaping of data and even some of the machine learning, if you put that on Azure and on Amazon, it's actually on a different storage foundation. So even if you're using the same component, it's different. So there's a lot of complexity and a lot of trade-offs. Is that a problem for customers? Yes, because all of a sudden they have to evaluate what those trade-offs are. They have to evaluate the trade-off around specialization: do I use the best-of-breed thing on one platform? And if I do, it's not compatible with what I might be running on-prem. That'll slow a lot of things down, I can tell you right now, versus having the same code base in all environments and the same seamless operational model. Okay, that's a great point, George, thanks for sharing that. The second point here is harmonizing and simplifying management across hybrid clouds. Again, back to your point, you set that up beautifully. Great example: open source innovation hits a roadblock, and the roadblock is incompatible components across multiple clouds. That's a problem. It's a management nightmare. How does harmonization across hybrid clouds work? You couldn't have asked it better. Let me put it up in terms of an XY chart where on the X-axis you have the components of an analytic pipeline: ingest, process, analyze, predict, serve. And on the Y-axis, this is for an admin, not a developer, are just some of the tasks they have to worry about: data governance, performance monitoring, scheduling and orchestration, availability and recovery, the whole list.
Now, if you have a different product for each step in that pipeline, and each product has a different way of handling all those admin tasks, you're basically taking all the unique activities on the Y-axis, multiplying them by all the unique products on the X-axis, and you have overwhelming complexity, even if these are managed services in the cloud. Here you've got several trade-offs. Do I use the specialized products that you would call best of breed? Do I try to do end-to-end integration, so I get simplification across the pipeline? Or do I use the products I had on-prem, like you were saying, so that I have seamless compatibility? Or do I use the cloud vendor's? That's a tough trade-off. There's a similar one for developers. Again, on the Y-axis are all the things a developer would have to deal with, or not all of them, just a sample: the data model and the data itself, how to address it, the programming model, the persistence. So on the Y-axis you multiply all those different things you have to master for each product, and then on the X-axis all the different products in the pipeline, and you have that same trade-off again. Complexity is off the charts. And you can trade for end-to-end integration to simplify the complexity, but we don't really have products that are fully fleshed out and mature that stretch from one end of the pipeline to the other. So that's the challenge. All right, let's talk about another way of looking at management. That was looking at the administrators and the developers. Now we're getting better and better software for monitoring performance and operations, diagnosing root cause when something goes wrong, and then remediating it. There are two real approaches. One is you go really deep but on a narrow part of your application and infrastructure landscape, and that narrow part might be your analytic pipeline, the big data.
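George's multiplication argument can be made concrete with a back-of-the-envelope sketch. The stage and task names come from the chart he describes; the counts are purely illustrative:

```python
# Back-of-the-envelope model of the admin-complexity trade-off: every
# (product, admin task) pair is something an admin must learn and operate.
stages = ["ingest", "process", "analyze", "predict", "serve"]
admin_tasks = ["data governance", "performance monitoring",
               "scheduling and orchestration", "availability and recovery"]

# Best of breed: a different product at every stage of the pipeline, each
# with its own way of handling every admin task.
best_of_breed = len(stages) * len(admin_tasks)

# End-to-end integration: one product family, one way of doing each task.
end_to_end = 1 * len(admin_tasks)

print(best_of_breed, end_to_end)  # 20 4
```

Even with only five stages and four admin tasks, the best-of-breed path carries five times the operational surface area, which is the trade-off against specialization that the chart is meant to show.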
The broad approach is to get end-to-end visibility across the edge with your IoT devices, across on-prem, perhaps even across multiple clouds. That's the breadth approach: end-to-end visibility. Now there's a trade-off here too, as in all technology choices. When you go deep, you have bounded visibility, but that bounded visibility allows you to understand exactly what is in that set of services, how they fit together, how they work. Because the vendor knows they're only giving you management of your big data pipeline, they can train their machine learning models so that whenever something goes wrong, they know exactly what caused it, and they can filter out all the false positives, the scattered errors that can confuse administrators. Whereas if you want breadth, you want to see your entire landscape end to end, so that you can do capacity planning, and so that you can see that if there was an error way upstream, something might be triggered way downstream, or a bunch of things downstream. So the best way to understand this is: how much knowledge do you have of how all the pieces work together, and how much visibility do you have into how all the software pieces fit together? This is actually an interesting point. So if I kind of connect the dots for you here: the bounded root cause analysis is where you see a lot of machine learning; that's where the automation is. The unbounded, the breadth, that's where the data volume is. But they can work together, that's what you're saying. Yes, and actually I hadn't even gotten to that, so thanks for teeing it up. Did I jump ahead on that one? No, no, you teed it up, because ultimately, well, we know where it's going: all the undifferentiated labor at scale can be automated away.
Well, when you talk about them working together: for the depth-first approach, there's a small company called Unravel Data that has modeled something like eight million big data jobs or workloads from high tech companies. So they know how all of that fits together, and they can tell you, when something goes wrong, exactly what went wrong and how to remediate it. But then take something like Rocana or Splunk; they look end to end. The interesting thing that you brought up is that at some point that end-to-end product is going to be like a data warehouse, and the depth products are going to sit on top of it. So you'll have all the contextual data of your end-to-end landscape, but you'll have the deep knowledge of how things work and what goes wrong sitting on top of it. So just before we jump to the machine learning question, which I want to ask you, what you're saying is the industry is evolving, almost looking like a data warehouse model but in a completely different way. Yeah. Think of it as another cube. That's what I do, George, I help you out with the cues. But I mean, the data warehouse, everyone knows what that was: a huge industry that created a lot of value, but then the world got rocked by unstructured data, and their bounded, if you will, view got democratized. So creative destruction happened, which is another way of saying new entrants came in and incumbents got rattled. But now it's kind of going back to something that looks like a data warehouse, but it's completely distributed. Yes, and I was going to do one of my movie references, but. No, don't do it, save us, George. Yeah. If you look at this starting in the upper right, that's the data lake where you're collecting all the data, and it's for search; it's exploratory. As you get more structure, you get to the descriptive place where you can build dashboards to monitor what's going on. And when you get really deep, that's when you have the machine learning.
Well, the machine learning is hitting the low-hanging fruit, and that's where I want to go next as we move along: sourcing machine learning capability. Let's discuss that. Okay, all right, just to set context before we get there: notice that when you do end-to-end visibility, you're really seeing across a broad landscape. And when I'm showing, say, public cloud big data, that would be depth first, just for that component, but for breadth first you could use a Rocana or a Splunk that then sees across everything. And the point I wanted to make was, when you said we're reverting back to data warehouses and revisiting that dream again: the management applications started out saying, we know how to look inside machine data and tell you what's going on with your landscape. It turns out that machine data and business operations data, your application data, are really becoming one and the same. What used to be one business transaction, once summarized, went into the data warehouse. Then, with systems of engagement, you had about a hundred interaction events that you tracked or stored for every business transaction. And then when we went out to the big data world, it's so resource intensive that we actually have 1,000 to 10,000 infrastructure events for every business transaction. So that's why the data volumes have grown so much, and why we had to go first to the data lake and then curate it into the data warehouse. Classic innovation story, great. Machine learning, sourcing machine learning capabilities, because that's where the rubber starts hitting the road. You're starting to see clear skies when it comes to where machine learning fits in, sourcing machine learning capabilities. Even though we didn't really rehearse this, you're cueing me up perfectly.
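The volume progression George quotes can be sketched with simple arithmetic. The per-transaction event counts are the ones from his explanation; the daily transaction volume is an assumption added purely for illustration:

```python
# Illustrative arithmetic on the event-volume progression: 1 summarized
# record per transaction in the data warehouse era, ~100 interaction events
# with systems of engagement, 1,000-10,000 infrastructure events per
# transaction in the big data era.
transactions_per_day = 1_000_000   # assumed, for illustration only

warehouse_records = transactions_per_day * 1
engagement_events = transactions_per_day * 100
infra_events_low  = transactions_per_day * 1_000
infra_events_high = transactions_per_day * 10_000

# The same business activity now throws off three to four orders of
# magnitude more raw events than the curated warehouse ever stored.
print(infra_events_low // warehouse_records)   # 1000
```

That three-to-four-orders-of-magnitude gap is why the raw events land in a data lake first and only a curated summary makes it into the warehouse.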
So let me make the assertion that with machine learning, we have the same shortage of really well-trained data scientists that we had of administrators when we were trying to stand up Hadoop clusters and do big data analytics. We did not have enough administrators, because these were open source components built in essentially different projects, and putting them all together required a huge amount of skill. Data science requires real knowledge of algorithms, where even very sophisticated programmers will tell you, geez, now I need a PhD to really understand how this stuff works. That shortage means we're not going to get a lot of hand-built machine learning applications for a while. There are a lot of libraries out there right now. You see TensorFlow from Google, big traction with that. But those are for PhDs, that's my contention. Well, developers too, you could argue developers, but I'm just putting it out there. I will get to that actually; there's a slide just on that. Let me do this one first, because my contention is that the first big, widespread application of machine learning is going to be the depth-first management, because it comes with a built-in model of how all the big data workloads, services, and infrastructure fit together and work together. And if you look at how the machine learning model operates: when it knows something has gone wrong, let's say an analytic job takes 17 hours and then just falls over and crashes, the model can actually look at the data layout and say, oh, we had way too much on one node, and it can change the settings and change the layout of the data, because it knows how all this stuff works. The point about this is that the vendor, in this particular example Unravel Data, built into their model an understanding of how to keep a big data workload running, as opposed to telling the customer you have to program it. So that fits into the question you were just asking, which is: where do you get this talent?
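A toy sketch can show the shape of the depth-first remediation idea George attributes to tools like Unravel Data. The real product is proprietary; the job fields, the skew threshold, and the remediation message below are all invented purely for illustration:

```python
# Toy sketch of depth-first remediation: because the tool has a built-in
# model of how big data workloads behave, it can map a failure to a cause
# and a fix instead of just surfacing an alert. All names are invented.
def diagnose(job):
    """Return a remediation suggestion for a failed big data job, or None."""
    sizes = job["bytes_per_node"]
    avg = sum(sizes) / len(sizes)
    # One node holding far more than its share of data points to a skewed
    # layout, the "way too much on one node" failure mode from the example.
    if max(sizes) > 3 * avg:
        hot = sizes.index(max(sizes))
        return "rebalance data layout: node %d is overloaded" % hot
    return None

failed_job = {"bytes_per_node": [10, 12, 11, 200]}  # one hot node
print(diagnose(failed_job))  # rebalance data layout: node 3 is overloaded
```

The design point is the one from the conversation: the knowledge of what "wrong" looks like lives in the vendor's model, not in code the customer has to write.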
And here, when you were talking about TensorFlow, and Caffe and Torch and MXNet, those are all like assembly language. Yes, those are the most powerful places you can go to program machine learning, but the number of people who can use them is inversely proportional to their power. Those are at the bottom. Those are really unique specialty people, the top guys, rocket scientists. Well yeah, high-end tier-one coders, tier-one brains, AI gurus; this is not your working developer. But if you go up: one level up is Amazon Machine Learning, Spark machine learning; go up another level, and I'm using Amazon as the example here, Amazon has a vision service called Rekognition. They have a speech generation service, natural language. Those are developer ready. And when I say developer ready, I mean the developer just uses an API. He calls it, passes in the data, and the result comes out. He doesn't have to know how the model works. It's kind of like what DevOps was for cloud. At the end of the day, this slide is completely accurate in my opinion, and we're in the early days, and you're starting to see the platform developments. The classic abstraction layer. Whoever can abstract away the complexity as AI and machine learning grow is going to be the winning platform, no doubt about it. Amazon is showing some good moves there. And you know how they abstracted it away? In traditional programming, it was just building higher and higher level APIs to make it more accessible. In machine learning, you can't do that. You have to actually train the models, which means you need data. So if you look at the big cloud vendors right now, Google, Microsoft, Amazon and IBM: most of them, the first three, have a lot of data from their B2C businesses. So people talking to Echo, people talking to Google Assistant or Siri, that's where they get enough of their data. So data equals power. Yes.
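"Developer ready" in George's sense can be sketched against AWS Rekognition's `detect_labels` call: the developer passes in data and gets labels back, with no knowledge of the model. This is a sketch, not a definitive implementation; it assumes `boto3` is installed and AWS credentials are configured, and the bucket and key names in the usage line are hypothetical:

```python
# Sketch of the top of the abstraction stack: one managed-service API call,
# no model knowledge required. The client parameter exists so the function
# can be exercised with a stub; in real use boto3 builds the client.
def label_image(bucket, key, min_confidence=80.0, client=None):
    """Detect objects in an S3-hosted image via AWS Rekognition."""
    if client is None:
        import boto3  # assumed installed and configured with AWS credentials
        client = boto3.client("rekognition")
    resp = client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    # The developer consumes plain label names; the model stays a black box.
    return [label["Name"] for label in resp["Labels"]]

# Hypothetical usage: label_image("my-demo-bucket", "car.jpg")
```

Injecting the client is a deliberate design choice: it keeps the sketch testable without live AWS access, and it underlines the point that everything below the API line is the cloud vendor's problem, not the developer's.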
By having the data, you have the ingredients, and the more data you have, and the more context that data carries, the more effective it is for training machine learning algorithms. Yes. And the benefit comes back to the people who have the data. Yes. And so your capabilities get narrower, even though you could do anything on TensorFlow. That's why Facebook's getting killed right now, just to change tangents. They have all this data and people are very unhappy. It just came out that Russians were targeting anti-Semitic advertising, and Facebook enabled that. So it's hard to be a data platform and still provide user utility. This is what's going on: whoever has the data has the power. It was a Frankenstein moment for Facebook. So there's that out there for everyone. How do companies do the right thing? And there's also the issue of customer intellectual property protection. As consumers, we say you can take all our speech to Siri or to Echo or whatever and get better at recognizing speech, because we've given up control of that; we want those services to be free. Whoever can shift the data value back to the users, to the developers, or to communities, better said, will win. Okay, in my opinion, that's my opinion. And for the most part, Amazon, Microsoft and Google have similar data assets. So far IBM has something different, which is that they work closely with their industry customers and build things up progressively: they're working with Mercedes, they're working with BMW, on the connected car, the autonomous car, and they build out those models slowly. So George, this slide is really, really interesting, and I think it should be a roadmap for all customers as they try to peg where they are in the machine learning journey.
But then the question comes in: they've done the blocking and tackling, they have the foundational low-level stuff done, they're building the models, they understand the mission, they have the right organizational mindset and personnel. Now they want to orchestrate it and put it into action. That's the final question. How do you orchestrate the distributed machine learning feedback and the data coherency? How do you get this thing scaling? How do the machines and the training work so that you have the breadth, and then you can bring the machine learning up the curve into the dashboard? Okay, we saved the best for last. It's not easy. So, when I show the chevrons, that's the analytic data pipeline. And imagine, in the serve and predict steps at the very end, let's take an IoT app, a very sophisticated one, which would be an autonomous car. And it doesn't actually have to be autonomous; you could just be collecting a lot of information off the car to do a better job of insuring it, for the insurance company. But the key is that you're collecting data on a fleet of cars, right? You're collecting data off each one, but you're also then aggregating across the fleet. And the cloud is where you keep improving your model of how the car works. You run simulations to figure out not just how to design better ones in the future, but how to tune and optimize the ones that are on the road now. And then, that's number three; then, in four, you push that feedback back out to the cars on the road. And you have to manage, and this is tricky, you have to make sure that the models you trained in step three stay coherent, the same, when you take the fleet-trained model and put the model for a particular instance of a car back out on the highway. George, just a great example. I think this slide really represents the modern analytic operational role in digital business.
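The step-four coherency requirement George describes can be sketched as a lineage check before a per-car model goes back over the air. This is only an illustration of the idea; the field names, the fingerprinting scheme, and the VIN are invented:

```python
# Toy sketch of the fleet coherency check: a model tuned for one car must
# trace back to the exact fleet model it was derived from before it is
# pushed back out to the road. All field names here are invented.
import hashlib

def fleet_model_id(weights):
    """Fingerprint a fleet-trained model so per-car variants can be traced."""
    return hashlib.sha256(repr(weights).encode()).hexdigest()[:12]

def safe_to_deploy(car_model, current_fleet_id):
    # Deploy only if the car's model descends from the current fleet model;
    # otherwise the car and the cloud drift out of coherence.
    return car_model["parent_fleet_id"] == current_fleet_id

fleet_weights = [0.12, -0.4, 0.88]            # stand-in for real parameters
fid = fleet_model_id(fleet_weights)
car_model = {"car": "VIN-123", "parent_fleet_id": fid,
             "weights": [0.13, -0.4, 0.88]}   # per-car tuned variant

print(safe_to_deploy(car_model, fid))         # True: lineage matches
```

A car carrying a variant of a stale fleet model fails the check, which is the tricky part of step four: the feedback loop only works if the cloud knows exactly which model each car is running.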
You need look no further than Tesla, and now essentially all cars, as a great example, because it's complex, it's an Internet of Things device, it's on the edge of the network, it's mobility, it's using 5G, it throws off a lot of data; it encapsulates everything you are presenting. So I think this example is a great one for the modern operational analytic applications that support digital business. Thanks for joining this Wikibon conversation. Thank you, John. George Gilbert, the analyst at Wikibon covering big data and the modern operational analytic systems supporting digital business. It's data driven; the people with the data can train the machines and have the power. That's the mandate, that's the action item. I'm John Furrier with George Gilbert, thanks for watching.