From San Francisco, it's theCUBE, covering Spark Summit 2017. Brought to you by Databricks.

You are watching theCUBE at Spark Summit 2017. We continue our coverage here talking with developers, partners, customers, all things Spark. And today we're honored to have our next guest, Dr. Jisheng Wang, who's the Senior Director of Data Science in the CTO office at Hewlett Packard Enterprise. Dr. Wang, welcome to the show.

Thanks for having me here.

All right, and also to my right we have Jim Kobielus, who's the lead analyst for Data Science at Wikibon. Welcome, Jim.

Great to be here, like always.

Well, let's jump into it. First I want to ask about your background a little bit. We were talking about the organization. Maybe you could tell me where you came from; you just recently joined HPE.

Yes, I actually joined HPE early this year through the Niara acquisition, and now I'm the Senior Director of Data Science in the CTO office of Aruba. As you probably know, about two years back HPE acquired Aruba, a wireless networking company, and now Aruba is in charge of the whole enterprise networking business in HPE, which is over three billion dollars in annual revenue.

That's not confusing at all. I could follow you.

Yes, okay.

Well, all I know is you're doing some exciting stuff with Spark. So maybe tell us about this new solution that you're developing.

Yes, most of my experience with Spark actually goes back to my time at Niara. Niara was a three-and-a-half-year-old startup that reinvented enterprise security using big data and data science. The problem we tried to solve at Niara is called UEBA, User and Entity Behavior Analytics. I'll try to be very brief here. Most traditional security solutions focus on detecting attackers from the outside. But what if the attacker originates inside the enterprise, say a Snowden? What can you do, right?
So you've probably heard many cases today of employees leaving a company and stealing a lot of company IP and sensitive data. UEBA is a new kind of solution that monitors behavior change among enterprise users to detect both this kind of malicious insider and also compromised users.

Behavioral analytics.

Yes. So it sounds like it's a natively analytics-driven product.

Yeah. And Jim, you've done a lot of work in the industry on this, so any questions you might have for him around UEBA?

Yeah, give us a sense for how you're incorporating streaming analytics and machine learning into that UEBA solution, and then where Spark fits into the overall approach that you take.

Right, okay. So actually, when we started three and a half years back and developed the first version of the data pipeline, we used a mix of Hadoop, YARN, Spark, even Apache Storm for different kinds of streaming and batch analytics work. But soon after, with the increased maturity and also the momentum of the open-source Apache Spark community, we migrated all our streaming and batch work, the ETL and the data analytics, onto Spark. And it's not just Spark: it's Spark, Spark Streaming, MLlib, the whole ecosystem.

There are at least a couple of advantages we experienced through this transition. The first thing that really helped us is the simplification of the infrastructure and the reduction of the DevOps effort.

So simplification around Spark, the whole stack of Spark that you mentioned, okay.

So for the Niara solution, originally and even to this day, we support both on-premise and cloud deployment. For the cloud we support public clouds like AWS and Microsoft Azure, and also private cloud.
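The behavioral-baselining idea behind UEBA described above can be sketched in a few lines. This is an illustrative toy, not Niara's actual method: the function name, the z-score rule, and the threshold are all assumptions for the sake of the example.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's activity count if it deviates far from the user's baseline."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    # Simple z-score test against the user's own historical behavior.
    return abs(today - mu) / sigma > threshold

# A user who normally downloads ~10 files a day suddenly downloads 500.
baseline = [9, 11, 10, 12, 8, 10, 11]
is_anomalous(baseline, 500)  # flags the spike
is_anomalous(baseline, 10)   # normal day, no alert
```

A real UEBA system would model many signals per user (servers accessed, data volumes, login times) rather than one count, but the per-user baseline is the core idea.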
So you can understand that if we had to maintain a stack of different open-source tools across this many different deployments, the overhead of doing the DevOps work to monitor, alarm on, and debug that kind of infrastructure across deployments would be very hard. Spark provides a unified platform where we can integrate the streaming, real-time, near-real-time, and even long-term batch jobs all together. That heavily reduces both the expertise and the effort required for DevOps. This is one of the biggest advantages we experienced, and certainly we also got things like scalability, performance, and convenience for developers to build new applications, all of this from Spark.

So you're using the Spark Structured Streaming runtime inside of your application, is that true?

We actually use Spark for the stream processing when the data comes in. In the UEBA solution, the first step is collecting a lot of data from different kinds of data sources: network data, cloud application data. So when the data comes in, the first stage is a streaming job doing the ETL to process the data. After that, we also developed analytics jobs at different frequencies, like one minute, ten minutes, one hour, one day, on top of that. And recently we have started some early adoption of deep learning here: how to use deep learning to monitor user behavior change over time, especially after a user gives notice. Is the user going to access more servers or download some of the sensitive data? All of this requires a very complex analytics infrastructure.

Now there were some announcements today here at Spark Summit, some by Databricks, of adding deep learning support to their core Spark code base.
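The multi-frequency analytics the guest describes (one-minute, ten-minute, hourly, and daily jobs over the same event stream) boils down to tumbling-window aggregation. A minimal plain-Python sketch of that idea follows, with invented event data; in their stack these would be Spark Streaming and batch jobs, not hand-rolled loops.

```python
from collections import defaultdict

def bucket_events(events, window_seconds):
    """Count (timestamp, user) events per tumbling window of the given size."""
    counts = defaultdict(int)
    for ts, user in events:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        counts[(window_start, user)] += 1
    return dict(counts)

# Toy event stream: (timestamp in seconds, user).
events = [(5, "alice"), (61, "alice"), (65, "bob"), (3599, "alice")]

per_minute = bucket_events(events, 60)    # would feed a one-minute analytics job
per_hour = bucket_events(events, 3600)    # the hourly job reuses the same stream
```

The point of the design is that one ingested stream fans out to jobs at every granularity, which is exactly what a unified engine like Spark makes cheap to operate.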
What are your thoughts about the Deep Learning Pipelines API that they announced this morning? It's new news, so I'll understand if you haven't digested it totally, but you probably have some good thoughts on the topic.

Yes, actually this is also news for me, so I can just speak from my current experience. How to integrate deep learning into Spark has actually been a big challenge for us so far. For the deep learning piece we use TensorFlow, and certainly most of the other streaming and data massaging or ETL work is done by Spark. In this case, there are a couple of ways to manage the two. One is to set up two separate resource pools, one for Spark and the other for TensorFlow. But some of our deployments are very small on-premise environments with only a four- or five-node cluster, and it's not efficient to split the resources that way. So we are also looking for some closer integration between deep learning and Spark. One thing we looked at before is called TensorFlowOnSpark, which was open-sourced a couple of months ago by Yahoo. So this is certainly more exciting news, that the Spark team is developing this native integration.

Very good. We talked about the UEBA solution, but let's go back to the broader HPE perspective. You have this concept called the intelligent edge. What's that all about?

So that's a very cool name. Actually, to step back a little bit, I come from the enterprise background, and enterprise applications have actually lagged behind consumer applications in terms of the adoption of new data science technology. There are some inherent challenges there. For example, collecting and storing large amounts of sensitive enterprise data is a huge concern, especially in European countries.
Also, for a similar reason, when you develop enterprise applications you normally lack a good quantity and quality of training data. So these are some inherent challenges in developing enterprise applications. But despite all of this, HPE and Aruba recently made several acquisitions of analytics companies to accelerate the adoption of analytics into different product lines. That intelligent edge actually comes from IoT, the Internet of Things, which is expected to be the fastest-growing market in the next few years.

So are you going to be integrating the UEBA behavioral analytics and Spark capability into your IoT portfolio at HPE? Is that a strategy or direction for you?

Yes, yes, in the big picture it certainly is. I think some of the Gartner reports expect the number of IoT devices to grow to over 20 billion by 2020. All of these IoT devices are connected to the internet, either wired or through wireless. So as a networking company, we have the advantage of collecting the data and even taking some actions in the first place. The idea of the intelligent edge is that we want to turn each of these IoT devices, small devices like an IP camera or a motion detector, into both a distributed sensor for data collection and also an inline actor that can make real-time or close-to-real-time decisions. Behavioral anomaly detection is a very good example here. If an IoT device is compromised, say an IP camera has been compromised and is being used to steal your internal data, we should detect that and stop it in the first place.

Can you comment about the challenges of putting deep learning algorithms natively on resource-constrained endpoints in the IoT?
That must be really challenging, to get them to perform well considering there may be just a little bit of memory or flash capacity on the endpoints. Any thoughts about how that can be done effectively and efficiently?

Very good question.

And at low cost?

Yes, very good question. So there are two aspects to this. First is the global training of the intelligence, which is not going to be done on each device. In that case, each device is more like a sensor for data collection. We are going to collect the data, send it to the cloud, and use the pooled computing resources to train the classifier, to train the model. But once we train the model, we are going to ship the model down, so the inference and the detection of those behavior anomalies really happen on the endpoints.

Do the training centrally and then push the trained algorithms down to the edge devices?

Yes, but there is a second aspect as well. Like you said, for some of the devices, say the small chips people try to put in a spoon or in the hair in hospitals to make things more intelligent, you cannot put even just the detection piece there. So we are also looking into some new technology. Caffe recently released some lightweight deep learning models, and you probably know there is also some improvement coming from the chip industry, how to optimize chip design for these more analytics-driven tasks.

Yep.

So we are looking into these different areas now. Just a couple of minutes left, and Jim, you get one last question after this. But I've got to ask you, what's on your wish list? What do you wish you could learn, or what did you come to Spark Summit hoping to take away?
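The train-centrally, ship-the-model pattern described above can be sketched as follows. This is a deliberately tiny illustration, a hand-written logistic scorer rather than a real TensorFlow model, and every weight and feature value in it is made up.

```python
import json
import math

# "Cloud side": the model is trained centrally on pooled data; these weights
# are stand-ins for whatever the real training job would produce.
model = {"weights": [0.8, -0.4], "bias": -1.0}
payload = json.dumps(model)  # the small artifact shipped down to the edge device

# "Edge side": the endpoint only needs a few lines of scoring code, not the
# training framework, so it fits in constrained memory.
def score(payload, features):
    m = json.loads(payload)
    z = m["bias"] + sum(w * x for w, x in zip(m["weights"], features))
    return 1.0 / (1.0 + math.exp(-z))  # probability the behavior is anomalous

probability = score(payload, [2.0, 0.5])
```

The asymmetry is the design point: training needs the joint data and heavy compute in the cloud, while inference is cheap enough to run inline on the device.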
You know, I've always considered myself a technical developer, and one thing I'm very excited about these days is the emergence of new technology like Spark, like TensorFlow, like Caffe, even BigDL, which was announced this morning. So the first goal when I come to these big industry events is to learn the new technology. The second thing is mostly to share our experience adopting this new technology, and also to learn from colleagues in different industries how people are changing lives and disrupting older industries by taking advantage of these new technologies.

You know, the community is growing fast. I'm sure you're going to find what you're looking for. Jim, final question?

Yeah, I heard you mention DevOps and Spark in the same context, and that's a huge theme. We're seeing more DevOps being wrapped around the life cycle of development, training, and deployment of machine learning models. If you could have your ideal DevOps tool for Spark developers, what would it look like? What would it do, in a nutshell?

I'll just share my personal experience. At Niara, we actually developed a lot of in-house DevOps tools. For example, when you run a lot of different Spark jobs, streaming and batch, like a one-minute batch versus a one-day batch job, how do you monitor the status of those workflows? How do you know when the data stops coming? How do you know when a workflow has failed? Even just monitoring is a big thing, and then alarming: when you have some failure or something wrong, how do you alarm on it? And debugging is another big challenge. So I certainly see the growing effort from both Databricks and the community on different aspects of that.

Very good. All right, so I'm going to ask you for kind of a sound-bite summary on the spot here.
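The monitoring problem the guest raises, knowing when data stops arriving for a one-minute versus a one-day job, reduces to a per-workflow staleness check. Here is a minimal sketch; the job names, timestamps, and the 2x-interval alerting rule are all hypothetical choices for illustration.

```python
def stale_jobs(last_seen, expected_interval, now):
    """Return jobs whose last data arrival is older than twice their expected interval."""
    return [
        job for job, ts in last_seen.items()
        if now - ts > 2 * expected_interval[job]
    ]

# Last time each pipeline produced data, and how often it is supposed to run (seconds).
last_seen = {"etl-1min": 1000, "rollup-1day": 950}
expected_interval = {"etl-1min": 60, "rollup-1day": 86400}

# The one-minute ETL is 200 seconds quiet, well past 2x its interval; the daily
# rollup is fine, so only the ETL job would page someone.
alerts = stale_jobs(last_seen, expected_interval, now=1200)
```

Scaling the expectation by each job's own interval is what lets one alarming rule cover both streaming and long-batch workflows.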
You're in an elevator, and I want you to answer this one question: Spark has enabled me to do blank better than ever before.

Certainly, certainly. As I explained before, it has helped a lot from the developer side, even for a startup trying to disrupt an industry. It helps a lot. And I'm really excited to see this deep learning integration and all the different roadmap items down the road. I think they are on the right track.

All right. Dr. Wang, thank you so much for spending some time with us. We appreciate it. Enjoy the rest of your day.

Yeah, thanks for being here.

And thank you for watching theCUBE. We're here at Spark Summit 2017. We'll be back after the break with another guest.