Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to Boston, everybody. This is Spark Summit East, hashtag SparkSummit, and this is theCUBE. Ion Stoica is here. He's the executive chairman of Databricks and a professor of computer science at UC Berkeley. The smarts are rubbing off on me; I always feel smart when I co-host with George. Ion, having you on is just a pleasure, so thanks very much for taking the time.

Thank you for having me.

So, loved the talk this morning. We learned about RISELab, which we're going to talk about, the son of AMP. That makes you the father of both of those. Again, welcome. Give us the update. Great keynote this morning. How's the vibe? How are you feeling?

I think it's great, and thank you, and thank you to everyone for attending this summit. There's a lot of energy, a lot of interesting discussions, and a lot of ideas around. So I'm very happy with how things are going.

Yeah, so let's start with RISELab. Maybe take us back, for those who don't know, to the birth of AMPLab, what you were trying to achieve there, and what's next.

Yeah, so AMPLab was a six-year project at Berkeley. It involved around eight faculty members and, over the duration of the lab, around 60 students and postdocs. The mission of AMPLab was to make sense of big data, and it started at the beginning of 2011. The premise was that in order to make sense of this big data, we need a holistic approach which involves algorithms, in particular machine learning algorithms; machines, meaning large-scale systems; and people, meaning crowdsourcing. More precisely, the goal was to build a data analytics stack for interactive analytics, to be used across industry and academia. And of course, being at Berkeley, it had to be open source. So that's basically what AMPLab was, and it was the birthplace of Apache Spark. That's why we are all here today. It was also the birthplace of a few other open source systems, like Apache Mesos and Alluxio, which was previously called Tachyon.

AMPLab ended in December last year, and in January we started a new lab, which is called RISE. RISE stands for Real-time Intelligent Secure Execution. The premise of the new lab is that the real value in the data is the decisions you can make on the data. You can see this more and more at almost every organization: they want to use their data to make decisions, to improve their business processes, applications, and services, or to come up with new applications and services. But if you think about this emphasis on decisions, it means you want decisions to be fast, because fast decisions are better than slow decisions. You want decisions to be made on fresh data, on live data, because decisions on the data I have right now are in general better than decisions on the data from yesterday or last week. And you also want to make targeted, personalized decisions, because decisions on personal information are better than decisions on aggregate information. That's the fundamental premise. So you want to build platforms, tools, and algorithms to enable intelligent, real-time decisions on live data, with strong security. Security is a big emphasis of the lab, and it means providing privacy, confidentiality, and integrity, and you hear about data breaches and things like that every day.
So for an organization it's extremely important to provide privacy and confidentiality to its users. And it's not only because the users want that; indirectly it can also help the organization improve its services. Because if I guarantee that your data stays confidential with me, you are probably much more willing to share some of your data with me, and if you share some of your data with me, I can build and provide better services. So that's, in a nutshell, what the lab is and what the focus is.

Okay, so you said three things: fast, live, and targeted. Fast means you can affect the outcome, live data means it's better quality, and targeted means it's relevant. Okay. And then my question on security: I felt like when cloud and big data came to the fore, security became a do-over. Is that a fair assessment? Doing it over. Or, as Bill Clinton would call it, a mulligan. Do you get a mulligan on security?

I think security is always a difficult topic, because it means so many things to so many people. There are instances in which the cloud is actually quite secure; the cloud can be more secure than some on-prem deployments. In fact, when you hear about these data leaks or security breaches, most of them you don't hear about happening in the cloud, and there are some reasons for that: cloud providers have trained people, they are paranoid about this, they do certifications much more often, and things like that. But still, the state of security is not that great. For instance, if I compromise your operating system, whether it's in the cloud or not in the cloud, I can do anything. Or your VM: in all these clouds you run on a VM, and increasingly on containers. And there are also physical attacks, and attacks where your data is encrypted, but if I can look at the access patterns, how much data you transfer or how much data you access from memory, then I can infer something about what you are doing, about your queries. If a query returns more data, maybe it's a query about New York; if it returns less data, it's probably about something smaller, maybe about Berkeley. So you can infer things from multiple queries just by looking at the access patterns.

So it's a difficult problem. But fortunately there are some new technologies being developed, and some new algorithms, which give us some hope. One of the most interesting technologies happening today is hardware enclaves. With a hardware enclave, you can execute code inside the enclave, and it is hardware protected: even if your operating system or VM is compromised, it cannot access the code that runs inside the enclave. Intel has Intel SGX, and we are working and collaborating with them actively. ARM has TrustZone, and AMD has also announced they are going to have a similar technology in their chips. So that's a very interesting and very promising development.

I think the other aspect, which is a focus of the lab, is that even these hardware enclaves don't automatically solve the problem, because if the code itself has a vulnerability, then yes, I can run it inside the hardware enclave, but the code can still send data outside.

A more granular perimeter, right?

Yes. So we are looking at this, and there are security experts in the new lab looking at how to split an application so that you run only a small part, the critical part, inside the enclave, where you can also make sure that code is secure, and the rest of the code runs outside but only ever works on data which is encrypted. So there is a lot of interesting research there.
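To make that partitioning idea concrete, here is a minimal, hypothetical sketch, not from the interview and not real SGX code: a small trusted component (standing in for the code that would run inside a hardware enclave) holds the key and is the only code that ever sees plaintext, while the untrusted part of the application handles ciphertext only. The class names and the business logic are invented for illustration.

```python
# Illustrative sketch only: simulates splitting an application so the sensitive
# part (which in a real system would run inside a hardware enclave such as
# Intel SGX) is the only code that ever touches plaintext.
from cryptography.fernet import Fernet

class TrustedScorer:
    """Stands in for the small, audited component running inside the enclave."""
    def __init__(self, key: bytes):
        self._fernet = Fernet(key)   # the secret key never leaves the "enclave"

    def score(self, encrypted_record: bytes) -> bool:
        record = self._fernet.decrypt(encrypted_record).decode()
        amount = float(record.split(",")[1])
        return amount > 1000.0       # the sensitive decision logic

def untrusted_pipeline(encrypted_records, scorer: TrustedScorer):
    """Runs outside the enclave: sees only ciphertext and the final decisions."""
    return [scorer.score(rec) for rec in encrypted_records]

if __name__ == "__main__":
    key = Fernet.generate_key()            # shared by the data owner and the enclave
    enclave = TrustedScorer(key)
    records = ["user0,50.0", "user1,2500.0"]
    ciphertexts = [Fernet(key).encrypt(r.encode()) for r in records]
    print(untrusted_pipeline(ciphertexts, enclave))   # [False, True]
```

The point, as in the discussion above, is that compromising the untrusted pipeline exposes only ciphertext and the final decisions; it can still observe access patterns, which is exactly the side channel mentioned earlier.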
And does blockchain fit in there as well?

Yeah, I think blockchain is a very interesting technology, and there are very interesting directions in that area as well, absolutely.

So, George, you've shared with me what you were calling new workloads: there was batch, there's interactive, and now you've got continuous.

Continuous, yes.

And I know that's a topic you want to discuss, and I'd love to hear more about that, but George, tee it up.

Well, okay. So we were talking earlier, and the objective of RISE is fast and continuous decisions, which is different from the traditional model where you either do batch or you do interactive. So maybe tell us about some applications where that is one workload among the other traditional workloads, and then let's unpack it a little more.

Yeah, so I'll give you a few applications. And it's more than continuously interacting with the environment; you also learn continuously. I'll give you some examples. For instance, think about wanting to detect a network security attack and respond, diagnose, and defend in real time. What this means is that you need to continuously collect logs from the network, and from as many endpoints as you can get, because more data will help you detect things faster. But then you need to detect new patterns, and you need to learn the new patterns, because the new security attacks, the ones which are effective, are slightly different from the past ones; you hope you already have defenses in place for the past ones. So you are going to learn that, and then you are going to react: you may push patches in real time, you may push filters, installing new filters in firewalls. So that's one application that runs in real time.

Another application is self-driving. Self-driving has made tremendous strides, and a lot of very smart algorithms are now implemented on the cars; all the systems are on the cars. But imagine now that you want to continuously collect information from the cars, aggregate it and learn from it, and then send what you learn back to the cars. For instance, if there's an accident, a roadblock, or an object dropped on the highway, you can learn from the other cars what they did in that situation; in some cases a driver took an evasive action. Maybe you also monitor the cars which are not self-driving but are driven by humans. You learn that in real time, and then the other cars which follow and are confronted with the same situation now know what to do. So again, I want to emphasize that it's not only continuously sensing the environment and making decisions; there is a very important component about learning.

Let me take you back to the security example as I process the auto one.
So in the security example, if you have a vast network of endpoints, software, and infrastructure, you're not going to have one God model looking out at everything. So I assume that means there are models distributed everywhere, and they don't necessarily know what an entirely new attack pattern looks like. In other words, that isolated model doesn't know what it doesn't know. I don't know if that's what Rumsfeld called it or whatever. So how does it know what to pass back for retraining?

Yes, yes. So there are many aspects, and there are many things you can look at. And again, this is a research problem, so I cannot give you the solution now; I can hypothesize and give you some examples. For instance, you can correlate by observing the effects, because some of the effects of an attack are visible. In some cases, a denial-of-service attack is pretty clear. Worms and so forth may cause computers to crash. So once you see these kinds of anomalies, anomalies on the end devices and hosts and things like that, maybe reported by humans, then you can try to correlate them with what kind of traffic you got. And from that correlation, probably, and hopefully, you can develop some models to identify what kind of traffic, where it comes from, what the content is, and so forth, that causes the anomalous behavior.

And where is that correlation happening?

I think it will happen everywhere, right? Because it's...

At the edge and at the center.

Absolutely.

And then I assume it sounds like the models, both at the edge and at the center, are ensemble models.

Yes.

Because you're tracking different behavior.

Yes, you are going to track different behavior. I think that's a good hypothesis. And then you are going to ensemble them, to come up with the best decision.
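As a rough illustration of that ensembling idea, and purely as a sketch rather than anything from RISELab, one simple pattern is to combine anomaly scores from several independently trained detectors, some at the edge and one at the center, and flag traffic for retraining when the combined score is high. The detectors, numbers, and the 0.7 threshold below are all invented for the example; a real system would use learned models correlating traffic features with observed anomalies, as described above.

```python
# Illustrative sketch: ensemble anomaly scores from "edge" and "central"
# detectors watching a simple traffic metric (requests per second).
from statistics import mean, pstdev

def zscore_detector(history):
    """Return a scoring function: how unusual is a value versus past traffic."""
    mu, sigma = mean(history), pstdev(history) or 1.0
    return lambda x: min(abs(x - mu) / (3 * sigma), 1.0)   # score in [0, 1]

def ensemble_score(scorers, observation, weights=None):
    """Weighted average of the individual detectors' anomaly scores."""
    weights = weights or [1.0] * len(scorers)
    scores = [s(observation) for s in scorers]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

if __name__ == "__main__":
    # Two edge detectors trained on local traffic, one central detector trained
    # on aggregate traffic (all numbers are synthetic request rates).
    edge_a  = zscore_detector([100, 120, 90, 110, 80])
    edge_b  = zscore_detector([105, 95, 115, 85, 100])
    central = zscore_detector([98, 102, 110, 90, 100])
    detectors = [edge_a, edge_b, central]

    for rate in (110, 900):                     # normal rate vs. suspicious burst
        score = ensemble_score(detectors, rate)
        flag = "send for retraining" if score > 0.7 else "ok"
        print(f"rate={rate:4d}  score={score:.2f}  {flag}")
```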
Okay, so now let's wind forward to the car example. It sounds like there's a mesh network. At least in Peter Levine's talk, there are near-local compute resources, and you can use Bitcoin to pay for it, or blockchain, or however it works. But that sort of topology, we haven't really encountered before in computing, have we? And how imminent is it?

I think some of this you can do today in the cloud. If you want super-low latency, you probably need to have more computation towards the edges. But if I want reactions on the order of tens to hundreds of milliseconds, in theory you can do it today with the cloud infrastructure we have. And in many cases, if you can do it within a few hundred milliseconds, it's still super useful, right? To avoid some object that was dropped on the highway: if I have a few hundred milliseconds, many cars will effectively be able to avoid it, having that information.

Let's take the conversation about the edge a little further, the one we were having off camera. There's a debate in our community about how much data will stay at the edge and how much will go into the cloud. David Floyer said 90% of it will stay at the edge. Your comment was, it depends on the value. What do you mean by that?

Yeah, so I think it depends on who I am and how I perceive the value of the data. And what can the value of the data be? This is what I was saying: I think the value of the data is fundamentally what kind of decisions, what kind of actions, it will enable me to take. Here I'm not just talking about credit card information or things like that, although even in that case there is an action someone is going to take on it. If I believe the data can give me the ability to take better actions or make better decisions, then I want to keep it. And it's not only about the decisions it enables me to make now: everyone is going to continuously improve their algorithms and develop new algorithms, and when you do that, how do you test them? You test them on the old data. So for all these reasons, a lot of the data, the data that is valuable in this sense, is going to go to the cloud. Now, there's a lot of data which will remain at the edges, and I think that's fair. But again, if a cloud provider, or someone who provides a service in the cloud, believes the data is valuable, I do believe that eventually it's going to get to the cloud.

So if it's valuable, it will be persisted, and it will eventually get to the cloud. And we talked about latency: you gave the example of an evasive action. You can't send that back to the cloud to make the decision; you have to make it in real time. But eventually that data, if it's important, will go back to the cloud. The other question is, of all this data that we are now processing on a continuous basis, how much will actually get persisted? Much of it probably does not get persisted, right? Is that a fair assumption?

Yeah, I think so. And not all the data is equal, right? Even if you take the continuous video on a car: it continuously captures video from multiple cameras, plus radar and lidar; all of that is continuous. If you think about it, I would assume you don't want to send all of that data to the cloud, but you may want to send the data around the interesting events: before and after the car was in a near accident, or took an evasive action, or the human driver had to intervene. In all these cases I probably want to send the data to the cloud, but in most cases, probably not.
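A sketch of that kind of event-triggered filtering, purely illustrative and not anything described in the interview, might keep a rolling window of sensor frames and persist only the frames surrounding an interesting event such as an evasive maneuver. The window sizes and the trigger below are invented for the demo.

```python
# Illustrative sketch: buffer the continuous sensor stream and upload to the
# cloud only the frames around "interesting" events; everything else is dropped.
from collections import deque

class EventTriggeredUploader:
    def __init__(self, pre_frames=5, post_frames=5):
        self.buffer = deque(maxlen=pre_frames)  # rolling pre-event context
        self.post_frames = post_frames          # event frame plus frames after it
        self._remaining_post = 0
        self.uploaded = []                      # stand-in for a cloud upload queue

    def on_frame(self, frame, event_detected):
        if event_detected:
            # Flush the pre-event context and start capturing post-event frames.
            self.uploaded.extend(self.buffer)
            self.buffer.clear()
            self._remaining_post = self.post_frames
        if self._remaining_post > 0:
            self.uploaded.append(frame)
            self._remaining_post -= 1
        else:
            self.buffer.append(frame)           # older frames fall out of the window

if __name__ == "__main__":
    uploader = EventTriggeredUploader(pre_frames=3, post_frames=2)
    for t in range(12):
        # Pretend frame 7 is an evasive action detected by the on-car model.
        uploader.on_frame(f"frame-{t}", event_detected=(t == 7))
    print(uploader.uploaded)   # frames 4 through 8: the event plus its context
```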
Good. We have to leave it there, but I'll give you the last word on things that are exciting you, things you're working on, interesting projects.

Yeah, so what really excites me is this idea of continuous applications: you are going to continuously interact with the environment, and you are going to continuously learn and improve. And there are many challenges there; I'll just mention a few that we haven't discussed.

One, in general, is explainability. If these systems augment the human decision process, if these systems are going to make decisions which impact you as a human, you want to know why. Like the example I gave: assume you have machine learning algorithms making a diagnosis on your MRI or X-ray. You want to know why. What is it in this X-ray that caused the decision? If you go to the doctor, they are going to point and show you: okay, this is why you have this condition. So I think this is very important, because as humans we want to understand not only why a decision happened, but also what we have to do about it, because we want to understand what we need to do to do better in the future. If your mortgage application is turned down, I want to know why, because next time I apply for a mortgage I want to have a higher chance of getting it through. I think that's a very important aspect.

And the last thing I will say, and this is super important, and you mentioned it: it's about having algorithms which can say, "I don't know." It's like, okay, I've never seen this situation in the past, so I don't know what to do. That is much better than just giving you a wrong decision.

Right, or a low-probability answer that you don't know what to do with. Yeah, excellent. Ion, thanks again for coming on theCUBE. It was really a pleasure having you.

Thanks for having me.

You're welcome. All right, keep it right there, everybody. George and I will be back to do our wrap right after this short break. This is theCUBE. We're live from Spark Summit East, and we'll be right back.