from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert. Hey, welcome back everybody. Jeff Frick here on theCUBE. I'm sitting with George Gilbert in Midtown Manhattan at Spark Summit East. We're coming to the end of two days of wall-to-wall coverage. There's a lot of excitement happening with Spark and we're closing out strong, which is what we like to do. We're excited to have CUBE alum and many-time guest, IBMer Rob Thomas. Rob, welcome. Hey, thanks guys, great to see you again. So if it's Spark, we know that you must be in the neighborhood. Absolutely, wouldn't miss it. Would not miss it. So I think the last time I saw you was at the development center in downtown San Francisco. So why don't you give us kind of an update of how that's progressing, and some new updates there? Yeah, so if I recall, we stood there in the Spark Technology Center in San Fran, and at the time I described to you, we were still building up a team there. We had a team focused on design, which we thought was kind of unique, and at the time, a team of about 30 people that were really exclusively focused on open source in and around Spark. The team has now doubled in size since that time and we're still hiring as fast as we can. We're moving into a new office in June actually, right on Howard Street. There was an announcement around Watson West, which is really IBM's flagship location on the West Coast in San Francisco. So that opens in June, and our Spark team will be moving over there to work closely with the Watson team, where we've got a lot of work going on. So things are going great. We're still really focused on SystemML, which we contributed to open source; it's now an Apache incubator project, as you know. So the team's spending a lot of time on that as well. Excellent. 
Well, so why don't you tell us what's the latest in product announcements? You know, after SystemML, what's new? So we did an announcement about three weeks ago that we called Open for Data. And the idea behind Open for Data is that our whole strategy around data analytics starts with open source, in terms of how we're contributing and how we're building, and the Spark Technology Center is probably one of the best examples of that. But Open for Data was really setting up for another set of things to come, the first of which was just two days ago, when we announced Quarks. Quarks is an open source contribution that we've made for streaming analytics on endpoints. What we did is we looked at the market for IoT, the Internet of Things, and we said we're in a market that's going from around 15 billion connected devices to 30 billion devices. And the current model for IoT doesn't work. The current model is: collect data on an endpoint, move it back to a centralized area for processing. That might work with a few billion devices. When you get up to 30, 40, 50 billion, it doesn't work. So what Quarks does is streaming on the endpoint: it will do the predicate filtering, the initial analysis of what's the relevant data, and then push only the relevant data back to a centralized location, so it can integrate into Kafka, integrate into Spark on the way back. But the whole idea is: how do we collect data on the endpoint efficiently, do pre-processing and pre-analytics there, and then bring it back? So from the point of view of, let's take a real example, GE Predix, where they're gathering telemetry, let's say periodically from a jet engine since it may not always be connected, but maybe a windmill is always connected. What data would it not send up? What data would it filter out? 
And then let's talk about how it fits in the programming model of Spark or another framework. So the situation with someone like GE is that GE has the advantage of being vertically integrated, where they're not only producing the software, but in most cases, they're producing the turbine or the engine. And so they can have a proprietary or closed system for collecting data. What we see in the market, though, is that most companies aren't GE; most companies aren't vertically integrated from manufacturing software down to microprocessors and the equipment itself. And so then you say, how is the average company or the average developer going to access data on an endpoint? That's the problem Quarks solves: give them an open source capability to go collect that data, do it in a sub-millisecond response time, do the data analysis, and then, to your question on linking back to Spark, connect that up to maybe Kafka for a data pipeline, maybe directly into Spark Streaming, some way that you can then bring that back to where you're doing your centralized processing. It's a very efficient way to collect data on an endpoint, where if you don't have the advantage of being vertically integrated like GE, you need something like this. Okay, there's a couple of things in there, let's try and unpack them. So maybe on a hardware level, would it be fair to compare it to a low-cost Raspberry Pi in terms of an execution platform? Quarks itself? So I guess I'd say the difference here is that Quarks at its base is a streaming engine. It's a component of the IBM Streams technology that we've been in the market with for a while, and it will collect data on a continuous basis. Oh, it's a piece of software, it's not hardware. Yeah, it's software. Oh, okay, yeah. All right, that clears that up. And it's a streaming engine. Okay. 
And so what's unique there is that if you're collecting data at a very rapid pace, a large amount of data on an endpoint, you can ingest that, you can quickly filter out what's relevant and what's not, and you can build models on top of that using something like Java, to your question on programming language. Okay. And those models would decide what gets kicked back to a centralized area. Okay. So the model is what gives you the quick reaction time, that sub-millisecond response. And then it can choose what not to filter out and send back up to a central location, where you have your choice of programming model. But the model at the edge has to evolve at times. So how does your central platform, whether it's Spark or some other choice, IBM Streams perhaps, communicate back down to the edge device and keep it up to date, so it doesn't drift? Yeah. I'm not sure that I see us going to that kind of a model. That would be the kind of hub-and-spoke model you describe, where you decide at the hub what happens and then push that out. I would say we're trying to invert the approach to analytics, to say that analytics should actually be decentralized. The starting point should be decentralized. And so if you wanted to make a change to the endpoint, you would change the model or the algorithms running on the endpoint to adjust how data is being collected, or something like that. And then that would eventually flow back in. But the point is, we believe IoT will happen at the edge. And so the analytics have to be decentralized, and the centralized piece is not an afterthought, but it's what happens next. It doesn't have to be viewed as one system, if that makes sense. So it's not just doing a better job of sending data upstream, but in fact using the analytics there locally at the endpoint to start to take control and make decisions. That's right. 
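The pattern Rob describes, ingest on the endpoint, apply a predicate to decide relevance, and forward only what matters, can be sketched in plain Java. This is a hypothetical illustration of the idea, not the actual Quarks API; the class and method names below are invented for clarity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of Quarks-style edge filtering: readings are
 * ingested continuously, but only those matching a relevance predicate
 * are queued for the central platform (Kafka, Spark Streaming, etc.).
 */
public class EdgeFilter {
    private final Predicate<Double> relevant;
    private final List<Double> outbound = new ArrayList<>();

    public EdgeFilter(Predicate<Double> relevant) {
        this.relevant = relevant;
    }

    /** Ingest one sensor reading; keep it only if the predicate says it matters. */
    public void ingest(double reading) {
        if (relevant.test(reading)) {
            outbound.add(reading); // in practice: publish upstream instead of buffering
        }
    }

    /** Readings that would be sent back to the centralized location. */
    public List<Double> forwarded() {
        return outbound;
    }

    public static void main(String[] args) {
        // Forward only temperatures outside a normal operating band.
        EdgeFilter filter = new EdgeFilter(t -> t < 10.0 || t > 90.0);
        for (double t : new double[] {20.0, 95.0, 50.0, 5.0}) {
            filter.ingest(t);
        }
        System.out.println(filter.forwarded()); // prints [95.0, 5.0]
    }
}
```

The point of the sketch is the inversion Rob talks about: the decision about relevance runs on the endpoint, so only the two out-of-band readings ever travel upstream.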
Act on other types of activities there. So it never has to necessarily loop back. The point being that if you're collecting a billion events off an endpoint (with The Weather Company that we acquired, some of their devices are collecting a billion events a year), do you really need to send all one billion events back? Probably not. But that's what most people do today. Instead, if you can build intelligence on the endpoint, then you can send back a subset, and that will really be where you're getting value out of it. So instead of having to build a data pipeline to handle enormous scale, build it to handle the right scale, which is the right data that needs to go back. And then will it also drive local action, without necessarily having to go back to the mothership? So, again, back on the turbine example, which everybody loves to use, how much of the control of how they optimize the turbine configuration comes now from an algorithm, comes from the sensors, and how much could you do better with the streaming engine there locally on the thing? Let me give you a real example. We were working with Dimension Data, who has instrumented all the bikes for the Tour de France. So now think of a bike in the Tour de France as a node that is collecting data and doing that in real time. And that data actually never needs to come back. If you want to monitor just where it is, what's happening, any changes in speed, changes in altitude, the data never needs to come back. But if you get to the point where, okay, we want to analyze what happened or why something happened in the race, at that point you might bring it back. So it doesn't presuppose that you have to have a centralized model, but it gives you the flexibility for a centralized model if that's what you want. 
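One common way to get from a billion raw events down to the "right scale" Rob mentions is change detection: forward a reading only when it differs meaningfully from the last value the central side saw. The sketch below is a hypothetical illustration of that idea (the class name and the bike-speed numbers are invented), not code from Quarks or Dimension Data:

```java
/**
 * Hypothetical change-detection sketch: an edge node suppresses readings
 * that are close to the last forwarded value, so only meaningful changes
 * travel upstream to the central platform.
 */
public class ChangeDetector {
    private final double delta; // minimum change worth reporting
    private Double last;        // last value forwarded upstream (null until first event)

    public ChangeDetector(double delta) {
        this.delta = delta;
    }

    /** Returns true if this reading should be sent to the central platform. */
    public boolean shouldForward(double reading) {
        if (last == null || Math.abs(reading - last) > delta) {
            last = reading; // remember what the central side last saw
            return true;
        }
        return false;       // suppress: nothing new to report
    }

    public static void main(String[] args) {
        // Bike speed samples (km/h): only changes greater than 1.0 go upstream.
        ChangeDetector cd = new ChangeDetector(1.0);
        double[] speeds = {40.0, 40.1, 40.2, 45.0, 45.1, 45.2, 38.0};
        int forwarded = 0;
        for (double s : speeds) {
            if (cd.shouldForward(s)) {
                forwarded++;
            }
        }
        System.out.println(forwarded + " of " + speeds.length + " events sent upstream");
        // prints "3 of 7 events sent upstream"
    }
}
```

With a tolerance tuned to the use case, the same mechanism that cuts seven samples to three would cut a billion near-duplicate telemetry events to the small subset that actually carries information.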
So is the IBM Streams product going to be something like SystemML, something that will be contributed to the community and find its way into Spark as a platform? We have no plans to contribute IBM Streams at this point. The technology that was contributed with Quarks is a subset of that technology that does the streaming on the endpoint, on an embedded basis really. So I would say we've taken a step in that direction, but right now we don't have any plans to do that. Are there plans to do something like The Weather Company as a hardware and software solution that's a service, where you would be more like a GE and say, we're going to take a bunch of equipment, we're going to embed our software in it, and it will be a service that others plug into? Or are you going to try and stick mostly to software? I think where you're going is, does it become a broader platform for IoT analytics? Which is definitely our vision. The Weather Company, and I wrote a bit about this: I think when we bought The Weather Company, the first perception was that we were buying a TV station. We didn't do that. You left that one behind. We left that behind. Then the second impression was that we were buying data, and we certainly got a rich data asset. The most impressive thing about The Weather Company that most people don't know is that they have built a premier internet-scale platform that would rival what you see from a Netflix or a Facebook. All open source, huge data ingestion, huge consumption from different endpoints, and it really gives you a platform for how to ingest data, manage IoT data, do analytics on the fly. So you didn't buy it just for the weather data to inform all these different business processes. You bought a platform. We bought a platform. Oh. That's the secret sauce that most people don't know is there. Okay. Unfortunately, it's not a secret anymore. 
Now the whole world knows. We don't want it to be secret. So unfortunately, we are up against the clock. I've got to get the guys and the gear to the airport in relatively short order, but I want to give you the last word, Rob. One, you've been at the Spark thing for a little while now. What has surprised you most over the last year as you've seen this development and the community and the technology evolve? And then what are you most excited about? I won't say a year, that's forever from now, but over the next six months, nine months, maybe 12 months that you're working on. I'm impressed by how much this event continues to grow, how much the ecosystem continues to grow. The contributions that we see in the community are tremendous. I would say we're at a point now, though, where Spark needs to move to be a business discussion, and it's not a business discussion today. And you can even see that with a lot of what we've been doing here at the conference. It's still very much a developer discussion, but to get mainstream, which helps people move along this curve to big data maturity, Spark has to get to a business audience. That's a big part of our focus this year: how we start to take use cases and move from just the technical, developer crowd to actually making this mainstream for line-of-business use cases. That's the big thing that's next. So you're going to be talking about that at InterConnect next week? Absolutely. Oh, very good. Great segue. So Rob, thanks for stopping by. Always great to see you. I'm Jeff Frick. We're at Spark Summit East, and we're going to be at IBM InterConnect next week in Las Vegas. Stop by the Mandalay Bay, where we'll have the full CUBE team. So stop by, say hello. We look forward to taking more of a business message out to the marketplace. You're watching theCUBE. I'm Jeff Frick with George Gilbert. 
We're wrapping up day two of wall-to-wall coverage at Spark Summit East. We'll see you next week at IBM Interconnect in Las Vegas. Thanks for watching.