Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks.

Hi everybody, this is George Gilbert, big data and analytics analyst with Wikibon. We are wrapping up our show on theCUBE today at DataWorks 2017 in San Jose. It has been a very interesting day, and we have a special guest to help us do a survey of the wrap-up: George Chow from Simba. We used to call him chief technology officer. Now he's technology fellow, but when he was explaining the difference in titles to me, I thought he said technology felon. He's since corrected me.

Yes, very much so.

So George and I have both been looking at Spark Summit last week and DataWorks this week. What are some of the big advances that really caught your attention?

What's caught my attention, actually, is how much manufacturing has really caught on to streaming data. Last week, I think it was very notable that both Volkswagen and Audi actually had case studies for how they're using streaming data. And just before the break now, there was also a similar session from Ford showcasing what they're doing around streaming data.

And are they using the streaming analytics capabilities for autonomous driving, or is it other telemetry that they're analyzing?

I think the Volkswagen study was about production, but I still have to review the notes. The one from Audi actually was quite interesting, because it was for managing paint defects.

For paint?

Paint defects.

Oh.

What they were doing, essentially, was recording the environmental conditions that they were painting the cars in, basically across the entire pipeline.

To predict when there would be imperfections, because paint is an extremely high-value step in the assembly process.

Yes. What they're trying to do is essentially make a connection between downstream defects, like future defects, and trying to pinpoint the causes upstream.
So the idea is that if they record all the environmental conditions early on, they can turn around and hopefully figure it out later on.

Okay, this sounds really, really concrete. So what are some of the surprising environmental variables that they're tracking? And then what's the technology that they're using to build the model and then to anticipate if there's a problem?

The surprising finding, I think, was that it was humidity or fan speed, by my recall, at the time the paint was being applied, because paint is very sensitive to the conditions in which it is applied to the body. So, I mean, my recollection was that one of the findings was that there was a narrow window during which the conditions were ideal, in terms of having the least amount of defects.

And so had they built a digital-twin-style model, where it's like a digital replica of some aspects of the car? Or was it more of a predictive model that had telemetry coming at it, and when it's outside a certain bound, they know that they're going to have defects downstream?

I think they're still working on the predictive model, or rather the model is still being built, because they are essentially trying to build that model to figure out how they should be tuning the production pipeline.

Got it, so it's still in the development phase.

Yeah, yeah.

And can you tell us, did they talk about the technologies that they're using?

It's a little hazy now, after a couple of weeks of conferences, so I don't remember the specifics. I was counting on the recordings to come out in a couple of weeks' time, so I definitely will share that, but it's a case study to keep an eye on.
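The idea described here, recording conditions at paint time and flagging readings outside a narrow ideal window, can be sketched in a few lines. This is a toy illustration, not Audi's system: the variable names and thresholds below are hypothetical, since the talk only said that humidity and fan speed at application time correlated with downstream defects.

```python
# Hypothetical "ideal window" per environmental variable, as (low, high).
# Real values would come from the predictive model Audi is still building.
IDEAL_WINDOWS = {
    "humidity_pct": (45.0, 55.0),
    "fan_speed_rpm": (1100.0, 1300.0),
}

def out_of_window(reading: dict) -> list:
    """Return the variables in this reading that fall outside the ideal window."""
    flagged = []
    for name, (low, high) in IDEAL_WINDOWS.items():
        value = reading.get(name)
        if value is not None and not (low <= value <= high):
            flagged.append(name)
    return flagged

# A body painted at 60% humidity would be flagged as at risk of downstream defects.
reading = {"humidity_pct": 60.0, "fan_speed_rpm": 1200.0}
print(out_of_window(reading))  # ['humidity_pct']
```

The point of recording everything upstream is exactly this: once defects show up downstream, the stored readings let you fit those windows after the fact.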
So tell us, were there other ones where this use of real-time or near-real-time data had applications that we couldn't do before, because now we can do things with very low latency?

I think that's the one I was looking forward to with Ford; that was the session just earlier, about an hour ago. The session actually included a demo done live. It was being streamed to us, and they were showcasing the data coming off of a car that had been rigged up.

And so what data were they tracking, and what were they trying to anticipate here?

They didn't go into enough detail, but it was basically data coming off the CAN bus of the car, if anybody's familiar with that.

Oh, that's right, you're a car guru, and you and I compare; well, our latest favorite is the Porsche Macan SUV. Okay. So, but yeah, they were looking at streaming the performance data of the car as well as location data.

Okay, and oh, so this sounds more like a test case; like, can we get telemetry data that might be good for insurance, or for...?

They've built out the system using the Lambda architecture with Kafka, so they were actually consuming the data in real time, and the demo was exactly that: seeing the data being ingested and acted on. In this case they were doing a simplistic visualization of just placing the car on Google Maps, so you could basically follow the car around.

Okay, so what were the technical components in the car? And then where was the data being sent to, and how much of the data?

The data was actually streamed all the way into Ford's own data centers. They were using NiFi with all the right products.

NiFi being from Hortonworks, the Hortonworks DataFlow.

Yeah, with all the appropriate proxies and firewalls to bring it all the way into a secure environment.
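The ingest-and-act pattern from the Ford demo can be sketched as follows. This is a stand-in, not Ford's code: in the demo the CAN-bus and GPS data flowed through NiFi and Kafka into Ford's data center, while here a plain generator simulates the stream, and "acting on" a record is just extracting the position used for the map overlay. All field names are hypothetical.

```python
def telemetry_stream():
    """Simulated stream of car telemetry records (field names are hypothetical)."""
    records = [
        {"speed_kph": 42.0, "lat": 37.3382, "lon": -121.8863},
        {"speed_kph": 45.5, "lat": 37.3390, "lon": -121.8855},
        {"speed_kph": 43.1, "lat": 37.3399, "lon": -121.8846},
    ]
    yield from records

def to_map_point(record):
    """Reduce a telemetry record to the (lat, lon) pair plotted on the map."""
    return (record["lat"], record["lon"])

# Consume records as they arrive, the same shape a Kafka consumer loop
# would have: read, transform, act.
path = [to_map_point(r) for r in telemetry_stream()]
print(path[0])  # (37.3382, -121.8863)
```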
So it was quite impressive from the point of view that it was live data coming off of, or actually being uploaded through, the 4G modem in the car.

Okay, did they say how much compute and storage they needed in the device, in this case the car?

I think they were using a very lightweight platform. They were streaming, apparently, from a Raspberry Pi.

Oh, interesting.

But they were very guarded about what was inside the data center, for competitive reasons. They couldn't share much about how big it was and how large a scale they could operate at.

So Simba has been doing ODBC and JDBC drivers, standard APIs to databases, for a long time. That was an era when it was either interactive or batch. So, big picture, how is streaming going to change the way applications are built?

Well, one way to think about streaming is that if you look at many of these APIs, or many of these systems, I think Spark is a good example, they're trying to harmonize streaming and batch, or rather to take away the need to deal with a system as streaming as opposed to batch. Because it's obviously much easier to think about and reason about your system in the traditional batch model. So the way I see it happening is that streaming systems will adapt, or actually become easier to build; everyone is trying to make it easier so that you don't have to think about and reason about it as a streaming system.

Okay, so this is really important, but they have to make a trade-off if they do it that way. There's the desire to leverage skill sets that were all batch-oriented, and presumably SQL, which is the data-manipulation language everyone's comfortable with. But then, if you're doing it batch-oriented, you have a portion of time where you're not sure you have the final answer.
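That worry about late data can be made concrete with a toy sketch. This is not Spark code; it is plain Python illustrating event-time windowing, the mechanism that lets a record be counted in the window its own timestamp belongs to, no matter how late it arrives, which is the behavior Structured Streaming aims to give you without special-purpose code.

```python
# Events carry their own event time; a correct per-window count must key
# on that time, not on arrival order.
from collections import Counter

def window_counts(events, window_sec=60):
    """Count events per event-time window, regardless of arrival order."""
    counts = Counter()
    for event_time_sec, _payload in events:
        window_start = (event_time_sec // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# The third event arrives late, after an event from a newer window, yet it
# still lands in the one-minute window its timestamp belongs to.
events = [(5, "a"), (130, "b"), (59, "c"), (125, "d")]
print(window_counts(events))  # {0: 2, 120: 2}
```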
And I assume if you were in a streaming-first solution, you would explicitly know whether you have all the data or not, as opposed to late-arriving stuff that might come later.

What I'm referring to is actually the programming model. All I'm saying is that more and more people will want streaming applications, but more and more people need to develop them quickly, without having to build them in a very specialized fashion. So when you look at, let's say, the example of Spark, when they focus on Structured Streaming, the whole idea is to make it possible for you to develop the app without having to write it from scratch. And the comment about SQL is actually exactly on point, because the idea is that you want to work with the data without being mindful of, actually, without a lot of work to account for, the fact that it is streaming data that could even arrive out of order. So the whole idea is that if you can build applications in a more consistent way, irrespective of whether it's batch or streaming, you're better off.

So last week, even though we didn't have a major release of Spark, we had a point release, or a discussion about the 2.2 release. And that's, of course, very relevant for our big data ecosystem, since Spark has become the compute engine for it. Explain the significance of the reaction time, the latency, for Spark going down from several hundred milliseconds to one millisecond or below. What are the implications for the programming model and for the applications you can build with it?

Actually, hitting that new threshold of a millisecond is a very important milestone, because when you look at a typical scenario, let's say in ad tech, where you're serving ads, you really only have maybe on the order of 100 or maybe 200 milliseconds max to turn around. And that max includes a bunch of things, not just the calculation; that, let's say, 100 milliseconds includes network transfer time.
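The ad-serving budget just described works out roughly as follows. The 85 ms network figure below is an assumption, chosen only to be consistent with the 10-20 ms compute slice mentioned in the conversation.

```python
# Of a ~100 ms ad-serving deadline, network transfer eats most of it,
# leaving only a thin slice for actual computation.
deadline_ms = 100
network_transfer_ms = 85  # hypothetical round-trip cost inside the deadline
compute_budget_ms = deadline_ms - network_transfer_ms
print(compute_budget_ms)  # 15

# A millisecond-level engine fits comfortably inside that slice; an engine
# with several hundred milliseconds of latency blows the whole deadline.
old_spark_latency_ms = 300  # "several hundred milliseconds"
new_spark_latency_ms = 1    # "one millisecond or below"
print(old_spark_latency_ms <= compute_budget_ms)  # False
print(new_spark_latency_ms <= compute_budget_ms)  # True
```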
Which means that in your real budget, you only have maybe 10 to 20 milliseconds to actually compute and do any work. So being able to have a system that delivers millisecond-level performance actually gives you the ability to use Spark right now in that scenario.

Okay, so in other words, now they can claim, even if it's not per-event processing, that they can react so fast it's as good as per-event processing. Is that fair to say?

Yes, that's very fair.

That's significant. So how would you see applications changing? And we've only got another minute or two, but how do you see applications changing now that Spark has been designed for people who have traditional batch-oriented skills, but who can now learn how to do streaming, real-time applications without learning anything really new? How will that change what we see next year?

Well, I think we should be careful not to pigeonhole Spark as something built for batch, because I think the idea is that the originators of Spark know that it's all about the ease of development and the ease of reasoning about your system. It's not that the technology is built for batch. So the fact that you can use your knowledge and experience, and an API that is familiar, and leverage it for something you're building for streaming, I mean, that's the power, and you could say that's the strength of what the Spark project has taken on.

Okay, we're going to have to end it on that note. There's so much more to go through. George, you will be back as a favorite guest on the show. There'll be many more interviews to come.

Thank you.

With that, this is George Gilbert. We are at DataWorks 2017 in San Jose. We had a great day today. We learned a lot from Rob Bearden and Rob Thomas up front about the IBM deal.
We had Scott Gnau, CTO of Hortonworks, on several times, and we've come away with an appreciation for a partnership now between IBM and Hortonworks that can take the two of them into a set of use cases that neither one on its own could really handle before. So, today was a significant day. Tune in tomorrow; we have another great set of guests. Keynotes start at nine, and our guests will be on starting at 11. So, with that, this is George Gilbert signing out. Have a good night.