Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome back to theCUBE. It's day two at Spark Summit 2017. I'm David Goad, here with George Gilbert from Wikibon. George.

Good to be here.

All right, and the guest of honor, of course, is Ash Munshi, who is the CEO of Pepperdata. Ash, welcome to the show.

Thank you very much. Thank you.

Well, you have an interesting background. I want you to just tell us real quick here, not to give you the whole bio, but you've got a great background in machine learning. You were an early user of Spark. Tell us a little bit about your experience.

So, I'm actually a mathematician originally, a theoretician. I worked for IBM Research, and then subsequently for Larry Ellison at Oracle and a number of other places. Most recently I was CTO at Yahoo, and then did a bunch of startups that involved different types of machine learning and, in general, a lot of big data infrastructure stuff.

And go back to 2012 with Spark, right? You had an interesting-

So, 2011, 2012, when Spark was still early, we were actually building a recommendation system based on user-generated reviews. That was a project done with Nando de Freitas, who's now at DeepMind, and Peter Cnudde, who's one of the key guys who runs infrastructure at Yahoo. We started that company, and we were one of the early users of Spark. What we found was that we were analyzing all the reviews at Amazon. Amazon allows you to crawl all the reviews, and we basically had natural language processing that would allow us to analyze all of them. When we were doing sort of MapReduce stuff, it was taking us a huge number of nodes and 24 hours to actually do the analysis. Then we had this little project called Spark out of AMPLab, and we decided to spin it up and see what we could do.
It had lots of issues at that time, but we were able to spin it up onto, I think it was on the order of 100,000 nodes, and we were able to take the times for running our algorithms from tens of hours down to an hour or two. So it was a significant improvement in performance. That's when we realized this was going to be something really important once the set of issues was resolved, once it was mature enough to make it happen. And I'm glad to see that that's actually happened now, and it's taken over the world.

Yeah, that little project became a big deal, didn't it?

It became a big deal, and now everybody's taking advantage of the same thing.

Well, bring us to the present here. We'll talk about Pepperdata and what you do, and then George is going to ask a little bit more about some of the solutions.

Perfect. So Pepperdata was a company founded by two gentlemen, Sean Suchter and Chad Carson. Sean used to run Yahoo Search, and he was one of the first guys who helped develop Hadoop, alongside Eric14 and that team. Chad was one of the first guys who figured out how to monetize clicks, and was the data science guy around the whole thing. So those are the two guys who started the company. I joined the company last July as CEO, and what we've done recently is expand the focus of the company to addressing DevOps for big data. The reason DevOps for big data is important is that in the last few years people have gone from experimenting with big data to taking big data into production, and now they're starting to figure out how to make it run properly, scale, and do all the other kinds of things that are needed, right? It's that transition that's happened. So: hey, we ran it in production, it didn't quite work the way we wanted, and now we actually have to make it work correctly.
That's where we sort of fit in, and that's where DevOps comes in, right? DevOps comes in when you're trying to make production systems that are going to perform in the right way. And the reason for DevOps is it shortens the cycle between developers and operators: the tighter the loop, the faster you can get solutions out, because business users want that to happen. That's where we're squarely focused: how do we make that work? How do we make that work correctly for big data? The difference between classic DevOps and DevOps for big data is that you're no longer dealing with just a set of computers solving an isolated problem; you're dealing with thousands of machines that are solving one problem, and the amount of data is significantly larger. So while the classical methodologies, Agile and all that, still work, the tools don't work to figure out what you can do with DevOps, and that's where we come in. We've got a set of tools focused effectively on performance, or distributed systems performance, I should say. That's the big difference between that and classic compute, even scale-up compute. If you've got web servers, yes, performance is important and you need data for those, but that can be sharded nicely. This is one system, or a set of systems, working on one problem. That's much harder. It's a different set of problems, and we help solve those problems.

Yeah, and George, you look like you're itching to dig into this. Feel free.

Well, so one of the big announcements at the show, the headline announcement today, was Spark Serverless. So it's not just someone running Spark in the cloud as a managed service; it's up there as, you know, sort of a SaaS application.
You could call it platform as a service, but it's basically a service where the infrastructure's invisible. Now, for all those customers who are running their own clusters, which is pretty much everyone, I would imagine, at this point: how far can you take them in hiding much of the overhead of running those clusters? And by the overhead, I mean primarily performance and maximizing resource efficiency.

So you have to actually double-click on the kind of resources we're talking about here, right? There's the number of nodes you're going to need to do the computation. There's the amount of disk storage you're going to need, what type of CPUs you're going to need. All of that is part of the costing, if you will, of running an infrastructure. If somebody hides all that and makes it economical, then that's a great thing, right? And if it can be made to work for huge installations and hides it appropriately, so I don't pay too much of a tax, that's a wonderful thing to do. But our customers are enterprises, typically Fortune 200 enterprises, and they have a mixture of cloud-based stuff, where they actually want to control everything that's going on, and infrastructure internally, which by definition they control everything about. For them, we're very, very applicable. I don't know how we'd be applicable in this sort of new world of a service that grows and shrinks. I can certainly imagine that whoever provides that service would embed us to be able to use the stuff more efficiently.

No, you answered my question, which is: for the people who aren't getting the turnkey SaaS solution, and they need help managing what's a fairly involved stack, they would turn to you.

Yes.

Okay. Can I ask about the specific products?
I saw your booth, and I saw you were announcing a couple of things. What is new at the show?

Correct. So at the show, we announced a code analyzer for Apache Spark, and what that allows people to do is really understand where performance issues are actually happening in their code. One of the wonderful things about Spark, compared to MapReduce, is that it abstracts the paradigm you actually write against, right? That's a wonderful thing, because it makes it easier to write code. The problem, though, when you abstract, is: what does that abstraction do down in the hardware, and where am I losing performance? And you want to be able to give that information back to the user. So, in Spark, you have jobs that can run in parallel. An app consists of jobs, jobs can run in parallel, and each one of these things can consume resources: CPU, memory (and you see that through garbage collection), disk, or network. What you want to find out is: which one of these parallel tasks was dominating the CPU? Why was it dominating the CPU? Which one actually caused the garbage collector to go crazy at some point? While the Spark UI provides some of that information, what it doesn't do is give you a time series view, a blow-by-blow view, of what's going on. By imposing the time series view on sort of an enhanced version of the Spark UI, you now have much better visibility into which offending stages are causing the issue. And the nice thing is, once you know that, you know exactly which piece of code you want to go look at. So a classic example would be: you might have two stages that are running in parallel. The Spark UI will tell you that it's stage three that's causing the problem, but if you look at the time series, you'll find out that stage two actually runs longer, and that's the one that's pegging the CPU.
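The stage-two-versus-stage-three scenario can be sketched in a few lines of plain Python. The sample data below is made up for illustration and stands in loosely for the per-stage task metrics Spark records; it is not the actual Spark History Server schema. Bucketing CPU samples into time intervals shows which of two concurrent stages is really dominating the CPU at each moment:

```python
from collections import defaultdict

def dominant_stage_per_interval(samples, bucket_secs=60):
    """Bucket per-stage CPU samples into time intervals and report,
    for each interval, the stage consuming the most CPU time.

    `samples` is a list of (timestamp_secs, stage_id, cpu_secs) tuples,
    a simplified stand-in for Spark's task metrics.
    """
    usage = defaultdict(lambda: defaultdict(float))
    for ts, stage, cpu in samples:
        usage[int(ts // bucket_secs)][stage] += cpu
    # For each time bucket, pick the stage with the highest CPU total.
    return {
        bucket: max(stages, key=stages.get)
        for bucket, stages in sorted(usage.items())
    }

# Two stages overlap in time. Stage 3 finishes last, so an aggregate
# view blames it, but the time series shows stage 2 pegging the CPU
# for most of the run.
samples = [
    (10, 2, 55.0), (70, 2, 58.0), (130, 2, 50.0),               # stage 2: heavy
    (10, 3, 5.0), (70, 3, 6.0), (130, 3, 4.0), (190, 3, 7.0),   # stage 3: light but longer
]
print(dominant_stage_per_interval(samples))  # -> {0: 2, 1: 2, 2: 2, 3: 3}
```

In this toy run the time-series view points at stage 2 in every interval it is alive, even though stage 3 runs longer overall, which is exactly the distinction the aggregate view hides.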
And you can see that because we have the time series, but you couldn't see it any other way.

So you have the code analyzer and also the app profiler?

The app profiler is the other product, which we announced, I guess, about three months ago. What the app profiler does is look at things after the run is done. It looks at all the data the run produces, that the Spark History Server produces, and then it goes back and analyzes that and says, well, you know what? Your executors here aren't working efficiently; these are the executors that aren't working efficiently. It might be using too much memory, or whatever. And then it allows the developer to click on it and say, explain to me why that's happening. And it gives you a little fix-it, if you will: if this is happening, you probably want to do these things to improve performance. So what's happening with our customers is they're asking developers to run the application profiler before they actually put stuff into production. Because if the application profiler comes back and says everything is green, there are no critical issues, then they say, okay, fine, put it on the production cluster, but not before. The application profiler, to be clear, is actually based on an open source project called Dr. Elephant, which comes out of LinkedIn. And now we're working very closely together to make sure we can advance the set of heuristics we have, so developers can understand and diagnose more and more complex problems.

The Spark community has the best code names ever. Dr. Elephant? I've never heard that one before.

Well, Dr. Elephant actually isn't just part of the Spark community; it's also part of the MapReduce community. So yeah, I mean, remember Hadoop, the elephant thing. So, Dr. Elephant, you know.
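As a rough illustration of the kind of post-run heuristic described above, in the spirit of Dr. Elephant's executor checks, here is a minimal sketch in plain Python. The metric names and thresholds are invented for the example, not Pepperdata's or Dr. Elephant's actual rules:

```python
def flag_skewed_executors(executors, mem_ratio=2.0, gc_share=0.1):
    """Scan per-executor metrics after a run and flag outliers.

    `executors` maps executor id -> {"gc_secs", "run_secs", "peak_mem_mb"}.
    The thresholds are arbitrary placeholders for this sketch.
    """
    if not executors:
        return {}
    avg_mem = sum(e["peak_mem_mb"] for e in executors.values()) / len(executors)
    findings = {}
    for eid, e in executors.items():
        issues = []
        # Flag executors spending a large share of their runtime in GC.
        if e["run_secs"] > 0 and e["gc_secs"] / e["run_secs"] > gc_share:
            issues.append("excessive GC")
        # Flag executors using far more memory than the fleet average.
        if e["peak_mem_mb"] > mem_ratio * avg_mem:
            issues.append("memory skew")
        if issues:
            findings[eid] = issues
    return findings

execs = {
    "1": {"gc_secs": 2, "run_secs": 100, "peak_mem_mb": 900},
    "2": {"gc_secs": 30, "run_secs": 100, "peak_mem_mb": 4000},  # the outlier
    "3": {"gc_secs": 1, "run_secs": 100, "peak_mem_mb": 850},
}
print(flag_skewed_executors(execs))  # -> {'2': ['excessive GC', 'memory skew']}
```

A real profiler works from the Spark History Server's recorded metrics rather than hand-built dicts, and attaches an explanation and suggested fix to each finding; this only shows the flag-the-outlier shape of such a heuristic.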
Let's talk about where things are going next, George.

So, you know, one of the things we hear all the time from customers and vendors is: how are we going to deal with this new era of distributed computing, where we've got the cloud, on-prem, and the edge? For the first question, let's leave out the edge and say you've got your Fortune 200 client. They have production clusters on-prem, even if it's just one, but they also want to work in the cloud, whether it's for elastic stuff or just because they're gathering a lot of data there. How can you help them manage both environments?

Right. So I think there's a bunch of time still before most customers actually face that problem. What we see today is that a lot of our customers have significant deployments internally, on-prem. They do experimentation on the cloud, right? The current infrastructure for managing and orchestrating all of this is typically YARN. What we're seeing, or at least what our intelligence tells us, is that it's more than likely going to wind up being Kubernetes that manages that. So what will happen is, on-prem, and let me get to that, right? I think YARN will certainly be replaced on-prem by Kubernetes, because then you can do multi-data-center and things of that sort. The nice thing about Kubernetes is that it can in fact span the cloud as well. So Kubernetes as an infrastructure is certainly capable of handling a multi-data-center deployment on-prem along with whatever actually happens in the cloud. There is infrastructure available to do that. It's very immature, and most customers aren't anywhere close to being able to do that. I would say even before Kubernetes gets accepted within the environment, it's probably 18 months.
And there's probably another 18 months to two years after that before we start facing this hybrid cloud and on-prem kind of problem. So we're a few years out, I think.

So for those of us, including our viewers, who know the acronym and know that it's a scheduler slash cluster manager, a resource manager: would that give you enough of a control plane, and enough knowledge of the resources out there, for you to be able to either instrument, or deploy and instrument, all the clusters?

So we're actually leading the effort right now for big data on Kubernetes. There's a small group working on it: Google, us, Red Hat, Palantir, and Bloomberg has now joined the group as well. We are actually today talking about our effort on getting HDFS working on Kubernetes. So we see the writing on the wall. We're clearly positioning ourselves to be a player in that particular space, so we think we'll be ready and able to take that challenge on.

All right. Ash, this is great stuff. We've just got about a minute before the break, so I wanted to ask you one final question. You've been in the Spark community for a while. What other open source tools should we be keeping our eyes out for?

Kubernetes. That's the one. To me, that is the killer that's coming next. I think it's going to unify the microservices architecture plus, you know, the multi-data-center deployments and everything else. I think it's really, really good. Borg works. It's been working for a long time.

All right. And I want to thank you for that little pepper pin I got over at your booth. That's the coolest gadget here.

Come and get more. We also have pepper sauce.

Oh, of course. Well, there's the hot news from Pepperdata's Ash Munshi. Thank you so much for being on the show. We appreciate it.

My pleasure. Thank you very much.

And thank you for watching theCUBE.
We're going to be back with more guests, including Ali Ghodsi, CEO of Databricks. Coming up next.