Hey, welcome back everybody. Jeff Frick here with theCUBE. We're at the Chief Data Scientist USA Conference in downtown San Francisco, and we're really excited to have a representative from IBM, Sam Lightstone, distinguished engineer from IBM, joining us. Sam, great to see you.

Thank you very much. A pleasure to be here.

Absolutely. So we cover a ton of IBM events. We were at World of Watson, the World of Watson Developer Conference, the big event in New York earlier this year around Strata. So we're big fans of all the things that IBM is doing, and Rob Thomas and the Spark group, so I could go on and on. But we won't go there. You were talking earlier today and you kind of let the cat out of the bag, which is always exciting, breaking news or breaking beta, I don't know exactly how we would describe it, but you talked about something new, IBM Data Confluence. I wonder if you could share with us what that's all about?

Yeah, so it's a whole new idea, a whole new paradigm that we're incubating right now inside of IBM. It's not yet available, but we're hoping to start trials in the January-ish timeframe. It comes from a realization that so much data is about to come upon us from distributed data sources. Everybody's got not only their cell phone, but increasingly data is coming from the internet of things. You're going to have data coming from your car, data coming from your glasses, from smart meters on your house. It's just going to be a deluge of data. The way people like to do data science on this data today is they pull it from these devices and put it into a central repository, which is a perfectly legitimate strategy, but it means that you're creating copies of the data, and there's a certain complexity in dragging that data through the internet into some central repository. So the idea we had with Data Confluence is to leave the data where it is and allow all these different data sources, if you can imagine cars, cell phones, or smart meters on buildings, to find one another and collaborate on data science problems like a computational mesh, so that we can bring hundreds, thousands, millions of microprocessors to bear on the data where it lives, without moving it around. And our theory is that not only is that simpler for everyone, because the data doesn't have to move around, but we can actually bring more computation to bear, because every one of those data sources has compute and has persistence, and you can multiply the opportunities.

Right, and you took a chance, you ran a live demo, which is always risky business. But there were some really interesting concepts you highlighted: an organically forming, adapting constellation of these sources, in the example you used they were solar panels, but they do this kind of automagically, if you will, as opposed to someone going in and scripting and building the structure, because tomorrow, as you demonstrated in your demo, you might want to add more. So that dynamic function was pretty interesting.

It's a very powerful concept and a very necessary concept. The reason it's so necessary is that these devices could be anywhere, right? You could have most of your devices in New York, but a few of them in the Yukon or Alaska or something, and you don't want them all to be equally connected. So it's important to build this network so that it's geospatially aware and connectivity aware, not just hard coded.
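To make the "bring the computation to the data" idea concrete, here is a minimal Python sketch of the pattern Sam describes. This is not IBM's Data Confluence code; every name and number here is a hypothetical illustration. Each device reduces its own readings to a tiny summary, and only those summaries travel and get merged, never the raw data.

```python
# Hypothetical sketch of "compute where the data lives": each device
# reduces its local readings to a small summary, and only summaries are merged.

def local_summary(readings):
    """Runs on the device: reduce raw readings to (count, sum, sum of squares)."""
    n = len(readings)
    s = sum(readings)
    ss = sum(x * x for x in readings)
    return n, s, ss

def merge_summaries(summaries):
    """Runs anywhere: combine per-device summaries into a global mean/variance."""
    n = sum(c for c, _, _ in summaries)
    s = sum(t for _, t, _ in summaries)
    ss = sum(q for _, _, q in summaries)
    mean = s / n
    variance = ss / n - mean ** 2
    return mean, variance

# Three "devices" (say, solar panels) holding their own local data.
devices = [
    [3.1, 2.9, 3.4],        # panel in San Jose
    [1.2, 1.0, 1.5, 1.1],   # panel in Toronto
    [2.0, 2.2],             # panel in the Yukon
]

summaries = [local_summary(d) for d in devices]   # computed at the edge
print(merge_summaries(summaries))                 # combined without moving raw data
```

The point of the sketch is only that the network traffic is a few numbers per device rather than every reading, which is why adding more devices adds compute instead of adding data-movement cost.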
So one aspect of that is being sensitive to network latencies and topology; that's one reason why it has to be automatic. The other reason it has to be automatic is that if you really want this to scale to thousands of devices, you can't have some programmer trying to figure out who connects to what. It's just too hard. So making it really adaptive and automatic is super important. Another thing that's really important for the internet of things, depending on the circumstance: if you can imagine cell phones, for example, you can have a network of thousands, millions of phones, but at any point in time some of those phones are going to be turned off. So the network has to be adaptive to the possibility that devices go offline, either intentionally, like a phone, or perhaps unintentionally because they break. If you have a device on a smart meter, it may simply break, and then that particular device is offline for a period of time. So the network has to be resilient to that. That's part of what we've been building, particularly using technology that we incubated in our UK lab in Hursley. So it's been a great collaboration across IBM. This is not just one set of people in one lab, but actually a corporate collaboration. And really our goal is to make this, as you say, automagic, but I would say beyond automagic, to make it resilient. It's got to be resilient and fault tolerant, because the complexities we could be dealing with are just too large for a human being to deal with.

Right, and clearly it's distributed, right? That's the big thing. You guys are leveraging the IBM Bluemix cloud; none of this happens without cloud capabilities. In the demo you did here, the data center was in San Jose and the actual data elements were in Toronto. Amazon and Microsoft and Google always get talked about a lot in the cloud space, but IBM is a major player, and if not in that top three, certainly right there in the fourth position as a leader in cloud and in what this cloud enables, and then really with the whole cognitive push. That's a priority for Ginni and the team, to really bring more intelligence to this computing.

That's exactly right. And with Data Confluence, we're hoping not only to tap into data science on distributed systems for IoT, and for enterprise use cases as well, but really to take it to the next level of hybrid cloud, because these data sources could be in the cloud, they could be on premises, they could be anywhere in the world, and you can mix and match. That's a very powerful capability for our customers; many companies are now struggling as their data becomes part cloud and part on premises.

Right, and the compute as well, right? You could shift it.

Exactly.

From the edge to the cloud in a dynamic fashion, based on what the optimal solution is, or, as you said, sometimes maybe the edge is offline and you can't do it there.

That's exactly right.

So kind of a cool story. You said this came out of something called Blue Unicorn. What is Blue Unicorn?

Oh, fantastic. So Blue Unicorn was an initiative that a few of us got together on inside of IBM. You probably know some of these folks: Rob Thomas, who I think you've interviewed, Girish Venkatachaliah, and myself. The three of us got together and we said, you know, we want to find a more effective way to tap into the creative juices of our staff. We've got some of the greatest minds in the world working at IBM.
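As a rough illustration of the fault tolerance Sam describes, here is a hypothetical Python sketch, not the actual Hursley technology, of how a coordinator might tolerate devices dropping offline by falling back to whatever peers are still reachable. All class and device names are invented for illustration.

```python
# Hypothetical sketch: try devices in order and fail over when one is offline.

class Device:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online

    def run(self, task):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return f"{task} done on {self.name}"

def run_with_failover(task, devices):
    """Try each device until one that is online completes the task."""
    for device in devices:
        try:
            return device.run(task)
        except ConnectionError:
            continue  # device is turned off or broken; try the next peer
    raise RuntimeError("no device available for " + task)

mesh = [
    Device("phone-1", online=False),  # turned off, like a phone at night
    Device("meter-7"),                # healthy smart meter
    Device("car-3"),                  # healthy car
]

print(run_with_failover("aggregate readings", mesh))
```

A real mesh would also rebalance which peers hold which data and prefer nearby, low-latency nodes, but the core behavior is the same: work routes around members that disappear rather than requiring a person to re-wire the topology.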
We hire brilliant people, PhDs, masters from the top schools all over the world, and all too often we hire these people and we tell them what they should be working on. Wouldn't it be better if we could find a repeatable process for them to come to us and say, here's the next big innovation that IBM should have? Blue Unicorn came out of that desire to tap into and nurture the creative passion of our staff. It was really designed almost like an internal VC initiative. People would come to us with proposals, we would vet those proposals; we started out with hundreds, vetted that down to dozens, then down to just a small few that we would fund. The ones we funded would go through periodic reviews, until eventually we ended up with a very small set that are still being incubated, and Data Confluence happened to be one of those projects.

Awesome. So it's different from kind of the 10% thing. This is actually almost like an internal VC: you put your proposal together, you pitch it, you get funded, and then you go do that with your team full time.

One thing I would say is, as we were setting out, we were trying to find ways to make it work, make it efficient. One of the best filtering factors we came up with is that people had to show us running code before it was funded. And that was amazing, because it meant people had to work nights and weekends; they had to have that level of passion and commitment for their idea to get to that level of vetting. That was incredible. It definitely filtered the people who were super passionate about what they were doing from the people who just said, yeah, I'd like to tinker. And that was tremendous.

Okay, and then you're here at this show, a relatively small show, tight group, kind of multi-industry. Any good takeaways or surprises from the last couple of days here at the Chief Data Science USA show?

It's been an amazing conference, actually, with some great speakers and some great insights. I think one of the most useful insights for me was, I was curious to hear from this audience, what is the duration of data that is important to them? Do they need to see data from the last hour, the last month, the last year, the last ten years? Of course, it does vary from problem to problem, but many people said, for the work that I do, I need about three months of data to build a model, and then once I have a model, I'm really looking at the last two to four weeks of data to gain data science insight. That was a very important point for me, especially as we continue our work on analytics and data science at IBM. It's very important for us to understand the range of data that people are using.

Is it shorter than you expected?

It is shorter, yeah. It's shorter because, you know, certainly in the data warehousing space, where I've spent a lot of my career, people do data analytics on six months, a year, three years. So this is definitely somewhat of a shift. And it tells us something about our society: things are moving faster, and data that's older than six months is usually not as interesting anymore.

Yeah, it really shows kind of the dynamic, real-time nature. Analyzing the old stuff is interesting, but not nearly as interesting as being on top of the Spark stream and some of these other things. It's funny, Beth Comstock kicked off the GE Minds + Machines event a couple of days ago, and she said we even walk faster in cities. They've done studies. Everything is continuing to speed up.
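The "three months of data to build a model, two to four weeks to look for insight" pattern that Sam mentions is essentially two sliding windows over a time-indexed dataset. Here is a minimal, hypothetical pandas sketch; the column names, dates, and the trailing-average "model" are illustrative assumptions, not anything from the interview.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings covering roughly the past four months.
days = pd.date_range(end="2016-11-15", periods=120, freq="D")
df = pd.DataFrame({"reading": np.random.rand(120)}, index=days)

latest = df.index.max()
train_window = df.loc[latest - pd.Timedelta(days=90):]   # ~3 months to build the model
score_window = df.loc[latest - pd.Timedelta(days=28):]   # last ~4 weeks for insight

# Toy "model": a 90-day baseline compared against recent behavior.
baseline = train_window["reading"].mean()
recent = score_window["reading"].mean()
print(f"baseline={baseline:.3f}, last-4-weeks={recent:.3f}, drift={recent - baseline:+.3f}")
```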
All right, so a year from now, if you're back here, what are we going to be talking about?

Wow, okay. Well, you know, we just launched, a few weeks ago actually, the Watson Data Platform.

That's right, big event.

Right, a huge event for us. And it really is, for us, the data foundation of all the cognitive computing that IBM is coming out with. It's going to bring together data science, data storage, and collaboration among analysts and data scientists, all on one platform for all your data needs. I'm hoping that a year from now I'm going to speak to you about how Data Confluence is a core part of that platform, and we're going to be running analytics on millions of devices all over the world.

All right, Sam, well, thanks for taking a few minutes, I know you've got to go catch an airplane, for stopping by and sharing your insight. Thank you.

All right, Sam Lightstone, I'm Jeff Frick. You're watching theCUBE. Thanks for watching.