From New York, it's theCUBE, covering machine learning everywhere. Build your ladder to AI, brought to you by IBM.

And welcome back here to New York City, where IBM's Machine Learning Everywhere: Build Your Ladder to AI event is underway. Along with Dave Vellante, John Walls here. We're now joined by Sam Lightstone, who is an IBM Fellow in Analytics. And Sam, good morning. Thanks for joining us here once again on theCUBE.

Thanks a lot, great to be back.

Yeah, good to have you here on kind of a cold New York day here in late February. So we're talking, obviously, data is the new norm, is what I've certainly heard a lot about here today and of late from IBM. Talk to me, in your terms, about when you look at data and its evolution, and how it's now become so central to what every enterprise is doing and must do. Give me that 30,000-foot level right now from your prism.

Sure. If you just stand back, way far back, and look at what data means to us today, it's really the thing that separates companies, one from the other: how much data do they have, and can they make excellent use of it to achieve competitive advantage? So many companies today are about data and only data. I'll give you some really striking, disruptive examples of companies that are tremendously successful household names, and it's all about the data. The world's largest transportation company, I can't call it a taxi company, but Uber owns no cars, right? The world's largest accommodation company, Airbnb, owns no hotels. The world's largest distributor of motion pictures owns no movie theaters. These companies are disrupting because they're focused on data, not on the material stuff. The material stuff is important, obviously. Somebody needs to own a car, and somebody needs to own a way to view a motion picture and so on. But data is what differentiates companies more than anything else today.
And can they tap into the data? Can they make sense of it for competitive advantage? And that's not only true for companies that are, you know, cloud companies. That's true for every company, whether you're a bricks-and-mortar organization or not.

Now, one level of that data is to simply look at the data and ask questions of it, the kinds of questions that you already have in your mind, you know, generating reports, understanding who your customers are and so on. That's sort of a fundamental level. But the deeper level, the exciting transformation that's going on right now, is the transformation from reporting and what we'll call business intelligence, the ability to take those reports and that insight on data and visualize it in a way that human beings can understand, and to go much deeper into machine learning and AI, cognitive computing, where we can start to learn from this data, learn at the pace of machines, and drill into the data in a way that a human being cannot, because we can't look at bajillions of bytes of data on our own. But machines can do that, and they're very good at doing it. So that's a huge transformation, that's one level.

The other level is there's so much more data now than there ever was, because there are so many more devices collecting data. Every one of our phones is collecting data right now. Your cars are collecting data. I think there's something like 60 sensors on every car that rolls off the manufacturing line today, 60. So it's just a wild time and a very exciting time, because there's so much untapped potential. And that's what we hear about today: machine learning tapping into that unbelievable potential that's there in that data.

So you're absolutely right on. I mean, data is foundational, or must be foundational, in order to succeed in this data-driven world. But it's not necessarily the center of the universe for a lot of companies.
I mean, it is for the big data guys that we all know, the top market cap companies, but for so many organizations, human expertise is at the center of their universe, and data is sort of, oh yeah, a bolt-on, and like you say, reporting. So how do they deal with that? Do they get one big giant DB2 instance and stuff all the data in there and fuse it with ML? Is that even practical? How do they solve this problem?

Yeah, that's a great question. So again, there's a multi-layered answer to that. But let me start with one of the big changes, one of the massive shifts that's been going on over the last decade: the shift to cloud. And people think of the shift to cloud as, well, I don't have to own the server, someone else will own the server. That's actually not the right way to look at it. I mean, that is one element of cloud computing, but it's not, for me, the most transformative. The big thing about the cloud is the introduction of fully managed services. It's not just that you don't own the server; you don't have to install, configure, or tune anything.

Now, that's directly related to the topic you just raised, because people have domains of expertise in their business. Maybe you're a manufacturer and you have expertise in manufacturing. You're a bank, you have expertise in banking. You may not be a high-tech expert. You may not have deep skills in tech. So one of the great elements of the cloud is that now you can use these fully managed services and you don't have to be a database expert anymore. You don't have to be an expert in tuning SQL or JSON and yada yada. Someone else takes care of that for you, and that's the elegance of a fully managed service. Not just that someone else has got the hardware, but they're taking care of all the complexity. And that's huge.
The other thing I would say is the companies that are really the big data houses, they've got lots of data, and they've spent the last 20 years working so hard to converge their data into larger and larger data lakes. Some have been more successful than others, but everybody has found that it's quite hard to do. Data is coming in many places, in many different repositories, and trying to consolidate it, constantly ripping it out and replicating it to some data lake or data warehouse where you can do your analytics, is complicated. And it means in some ways you're multiplying your costs, because you have the data in its original location and now you're copying it into yet another location. You've got to pay for that too. So you're multiplying costs.

So one of the things I'm very excited about at IBM is we've been working on this new technology that we've now branded as IBM Queryplex. And that gives us the ability to query data across this myriad of sources as if it's in one place, as if it's a single consolidated data lake, and make it all look like one repository. And not only does it appear to the application as one repository, we actually tap into the processing power of every one of those data sources. So if you have a thousand of them, we bring to bear the power of 1,000 data sources, and all that compute and all that memory, on these analytics problems.

An example of why that matters, what would be a real-world application of that?

Oh, sure. So there's a couple of examples. I'll give you two extremes, two different extremes. One extreme would be what I'll call enterprise data consolidation or virtualization, where you're a large institution and you have several of these repositories. Maybe you've got some IBM repositories like DB2. Maybe you've got a little bit of Oracle and a little bit of SQL Server. Maybe you've got some open source stuff like Postgres or MySQL.
You've got a bunch of these, and different departments use different things, and it develops over decades, and to some extent you can't even control it, right? And now you just want to get analytics on that. You just want to know, what's this data telling me? And as long as all that data is sitting in these dozens or hundreds of different repositories, you can't tell, unless you copy it all out into a big data lake, which is expensive and complicated. So Queryplex will solve that problem.

So it's sort of a virtual data store.

Yeah, and there are many different terms that are used, but one of the terms used in the industry is data virtualization. So that would be suitable terminology here as well: to make all that data in hundreds, thousands, even millions of possible data sources appear as one thing, and to tap into the processing power of all of them at once.

Now, that's one extreme. Let's take another extreme, which is even more extreme, which is the IoT scenario, the Internet of Things. Imagine you have devices, shipping containers and smart meters on buildings. You could literally have 100,000 or a million of these things. They're usually small. They don't usually have a lot of data on them, but they can store usually a couple of months of data. And what's fascinating is that most analytics today are really on the most recent 48 hours or four weeks, maybe. And that window is getting shorter and shorter, because people are doing analytics more regularly and they're interested in, just tell me what's going on recently.

I've got to geek out here for a second.

Please, well, thanks for the warning.

And I'm not a technical person, but I'm old, so I've been around a long time. A lot of questions on data virtualization, but let me start with Queryplex. The name is really interesting to me. And you're a database expert, so I'm going to tap your expertise.
When I read the Google Spanner paper, I called up my colleague, David Floyer, who's an ex-IBMer, and I said, this is like a global Sysplex. It's a globally distributed system. David goes, yeah, kind of. And I got very excited. And then my eyes started bleeding when I read the paper. But is the name Queryplex a play on Sysplex?

There's actually a long story. I don't think I can tell the whole story out here, but suffice it to say we wanted a name that was legally usable and also descriptive, and we went through literally hundreds and hundreds of permutations of words, and we finally landed on Queryplex. But you mentioned Google Spanner. I probably should spend a moment to differentiate how what we're doing is a different kind of thing. With Google Spanner, you put data into Google Spanner. With Queryplex, you don't put data into it.

You have to move it.

You don't have to move it. You leave it where it is. You can have your data in DB2. You can have it in Oracle. You can have it in a flat file. You can have an Excel spreadsheet. And, you know, think about that. An Excel spreadsheet, a collection of comma-delimited text files, SQL Server, Oracle, DB2, Netezza, all these things suddenly appear as one database. So that's the transformation. It's not about, we'll take your data and copy it into our system. This is about, leave your data where it is, and we're going to tap into your existing systems for you and help you see them in a unified way. So it's a very different paradigm than what others have done. And that's part of the reason why we're so excited about it: as far as we know, nobody else is really doing anything quite like this.

And is that what gets people to the 21st century, basically? They don't have to have all these legacy systems, and yet the conversion is much simpler?

Much, much more economical for them. Yeah, exactly. It's economical. It's fast.
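To make the leave-data-where-it-is idea concrete, here is a minimal sketch of query federation in Python. It is not Queryplex or any IBM API; the `FederatedTable` class and its methods are invented for illustration. It presents two heterogeneous sources, a SQLite table and a comma-delimited text file, to the caller as one logical table, with each source scanned in place rather than copied into a warehouse first.

```python
# Toy data-virtualization sketch: many sources, one logical table.
# All class and method names here are hypothetical, not a real product API.
import csv
import io
import sqlite3

class FederatedTable:
    """Presents several heterogeneous sources as one iterable of rows."""

    def __init__(self):
        self.sources = []  # each source is a zero-arg callable yielding rows

    def add_sqlite(self, conn, query):
        # The relational source answers its own query in place.
        self.sources.append(lambda: conn.execute(query).fetchall())

    def add_csv(self, text, columns):
        # The flat file is parsed on demand, never loaded into a database.
        def read():
            rows = csv.DictReader(io.StringIO(text))
            return [tuple(r[c] for c in columns) for r in rows]
        self.sources.append(read)

    def scan(self):
        # A query sees one stream of rows, whatever the backing store.
        for src in self.sources:
            yield from src()

# Demo: sales data split across a database table and a flat file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0)])

csv_text = "region,amount\nnorth,75\nsouth,125\n"

fed = FederatedTable()
fed.add_sqlite(conn, "SELECT region, amount FROM sales")
fed.add_csv(csv_text, ["region", "amount"])

total = sum(float(amount) for _, amount in fed.scan())
print(total)  # 550.0 across both sources
```

A real engine would also push filters and aggregations down to each source, as discussed later in the interview, rather than pulling whole rows back; this sketch only shows the unified-view half of the idea.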
You know, you can deploy this in a very small amount of time. And we're here today talking about machine learning, so maybe it's a very good segue to point out: in order to get to high-quality AI, you need a really strong foundation of an information architecture. And for the industry to show up, as some have done over the past decade, and keep telling people to re-architect their data infrastructure, and keep modifying their databases and creating new databases and data lakes and warehouses, it's just not realistic. So we want to provide a different path, a path that says, we're going to make it possible for you to have superb machine learning, cognitive computing, artificial intelligence, and you don't have to rebuild your information architecture. We're going to make it possible for you to leverage what you have and do something special.

This is exciting. I wasn't aware of this capability. And we were talking earlier about cloud and the managed service component of that as a major driver of lowering cost and complexity. There's another factor here, which is, we talked about moving data. That's one of the most expensive components of any infrastructure. If I've got to move data, there's the transmission cost and the latency. It's still bounded by the speed of light. I know you guys are working on the speed of light, and you'll eventually get there, maybe. But the other thing about cloud economics, and this relates to Queryplex, is this API economy. You've got virtually zero marginal cost, I was writing these down, you've got global scale, it's never down, you've got this network effect working for you. Are the standards there? Are you able to replicate those sorts of cloud economics, the APIs, the standards, that scale, even though you're not in control of this, there's not a single point of control? Can you explain sort of how that magic works?

Yeah.
Well, I think the API economy is for real, and it's very important for us. And when we talk about API standards, there's a beautiful quote I once heard: the beautiful thing about standards is there are so many to choose from. The reality is that you have standards that are official standards, and then you have the de facto standards, because something just catches on. Nobody blessed it, it just got popular. So a big part of what we're doing at IBM is being at the forefront of adopting the standards that matter. We made a big investment in being Spark compatible, and in fact, even with Queryplex, you can issue Spark SQL against Queryplex. Even though it's not a Spark engine per se, we make it look and feel like it can take Spark SQL.

Another critical point here, and we've talked about the API economy and the speed of light and the movement to the cloud: the friction of the internet is an unbelievable friction. It's unbelievable. I mean, when you go and watch a movie over the internet, your home connection is just barely keeping up. You're pushing it, man. A gigabyte an hour or something like that, right? Okay, and if you're a big company, maybe you have a fatter pipe, but not a lot fatter. Not orders of magnitude fatter. You're talking incredible friction. And what that means is that it is difficult for companies to move everything to the cloud en masse. It's just not happening overnight. And again, in the interest of doing the best possible service to our customers, that's why we've made it a fundamental element of our strategy at IBM to be what we call a hybrid data management company, so that the APIs we use on the cloud are compatible with the APIs we use on premises, whether that's software or private cloud. You've got software, you've got private cloud, you've got public cloud.
And our APIs are going to be consistent across them, and applications that you code for one will run on the other. And that makes it a lot easier to migrate at your leisure, when you're ready.

Makes a lot of sense. So that way you can bring cloud economics and the cloud operating model to your data, wherever the data exists. Listening to you speak, Sam, it reminds me, do you remember when Bob Metcalfe, who I used to work with at IDG, predicted the collapse of the internet? He predicted that year after year after year, speech after speech at the time. It was so fragile. And you're bringing back the point of, guys, there's still a lot of friction. So that's very, very interesting.

You think Bob's going to be happy that you brought up that he predicted the internet was going to collapse?

Well, he did, I'm just saying. He did it as a lightning rod, as a talker. The industry had a response, and he had a big enough voice that he could do that. That worked. Yeah, that was brilliant.

But I want to get back to Queryplex and the secret sauce. How is it that you're creating this data virtualization capability? What's the secret sauce behind it?

Yeah, so we're not the first to try, by the way. This problem of all these data sources all over the place, and trying to make them look like one thing, people have been trying to figure out how to do that since, like, the '70s, okay?

And it really hasn't worked.

And it hasn't worked. And really, the reason it hasn't worked is that there have been two fundamental strategies. One strategy is you have a central coordinator that tries to speak to each of these data sources. So let's say I've got 10,000 data sources. I want to have one coordinator tap into each of them and have a dialogue. And what happens is that coordinator, a server or an agent somewhere, becomes a network bottleneck. You were talking about the friction of the internet. This is a great example of friction.
One coordinator trying to speak to N collaborators becomes a point of friction. And it becomes a point of friction not only in the network, but also in the computation, because it ends up doing too much of the work. There are too many things that cannot be done at these edge repositories, aggregations and joins and so on. So all the aggregations and joins get done by this one sucker who can't keep up.

Big queue.

Yeah, so it's a big queue. Right, so that's one strategy that didn't work. The other strategy people tried was sort of an N-squared topology, where every data source tries to speak to every other data source, and that doesn't scale either.

So what we've done in Queryplex is something that we think is unique and much more organic, where we try to organize the universe, or constellation, of these data sources so that every data source speaks to a small number of peers, not a large number of peers. That way, no single source is a bottleneck, either in network or in computation. That's one trick. And the second trick is we've designed algorithms that can truly be distributed. So you can do joins in a distributed manner. You can do aggregation in a distributed manner. When I say aggregation, I'm talking about simple things like a sum or an average or a median. These are super popular in analytic queries. Everybody wants a sum or an average or a median. But in the past, those things were hard to do in a distributed manner, getting all the participants in this universe to do some small incremental piece of the computation.

So it's really these two things. Number one, this organic, dynamically forming constellation of devices, forming in a way that is latency-aware. So if I represent a data source that's joining this universe or constellation, I'm going to try to find peers that I have a fast connection with.
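The distributed-aggregation idea described above can be sketched in a few lines. This is not Queryplex's actual algorithm, just an illustration of the general technique: each source computes a small partial result locally (here, a sum and a count), the partials are merged pairwise, which any peer in the constellation could do because the merge is associative, and only at the end is the combined partial turned into an average. No source ever ships its raw rows.

```python
# Sketch of distributed aggregation via mergeable partial results.
# The function names are invented for illustration.
from functools import reduce

def partial_agg(rows):
    """Work done locally at one data source: a (sum, count) pair."""
    return (sum(rows), len(rows))

def merge(p, q):
    """Associative merge, so any peer can combine any two partials."""
    return (p[0] + q[0], p[1] + q[1])

def average(partial):
    """Finalization step, applied once to the fully merged partial."""
    s, n = partial
    return s / n

# Three "sources", e.g. a DB2 table, an Oracle table, and a flat file.
sources = [[10, 20, 30], [40, 50], [60]]

partials = [partial_agg(rows) for rows in sources]  # runs at the edge
combined = reduce(merge, partials)                  # runs across peers
print(average(combined))  # 210 / 6 = 35.0
```

Sum, count, and average all decompose this cleanly; a median does not, since it cannot be recovered from per-source medians, which is why the interview singles it out as one of the harder aggregates to distribute.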
Of all the universe of peers out there, I'll try to find ones that are fast. And the second is having algorithms that we can all collaborate on. Those two things change the game.

We're getting the two-minute sign, and this is fascinating stuff. So how do you deal with the data consistency problem? You hear about eventual consistency and people using atomic clocks and...

Right, so there's a reason we call it Queryplex, not Dataplex. Queryplex is really a read-only operation.

There you go. You've got all these...

Problem solved.

Problem solved. You've got all these data sources. They're already doing their... Data's coming in how it's coming in.

Simple and brilliant.

Right, and we're not changing any of that. All we're saying is, if you want to query them as one, you can query them as one.

I should say a few words about the machine learning that we're doing here at the conference. We've talked about the importance of an information architecture and how that lays a foundation for machine learning. But one of the things we're showing and demonstrating at the showcase today is how we're actually putting machine learning into the database, creating databases that learn and improve over time, learn from experience. You know, in 1952, Arthur Samuel was a researcher at IBM who had one of the most fundamental breakthroughs in machine learning when he created a machine learning algorithm that would play checkers. And he programmed this checkers-playing game of his so it would learn over time. And then he had a great idea: he programmed it so it would play itself thousands and thousands of times over, so it would actually learn from its own mistakes. And in the evolution since then, Deep Blue playing chess and so on, the Watson Jeopardy game, we've seen the tremendous potential in machine learning.
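The self-play idea Samuel pioneered can be shown with a much smaller game than checkers. Below is a minimal sketch, not Samuel's actual method, of a tabular learner that plays the game of Nim against itself: from a pile of stones, players alternately take 1 to 3, and whoever takes the last stone wins. After each game, every move is credited or blamed depending on whether its mover went on to win, so the program learns from its own mistakes exactly as described above. All names and parameters here are invented for the sketch.

```python
# Self-play learning on one-pile Nim, in the spirit of Samuel's checkers player.
import random

N = 10               # starting pile size
ACTIONS = (1, 2, 3)  # stones a player may take per turn

Q = {}  # (pile, action) -> estimated win value for the player moving

def legal(pile):
    return [a for a in ACTIONS if a <= pile]

def choose(pile, eps):
    """Epsilon-greedy: usually play the best-known move, sometimes explore."""
    moves = legal(pile)
    if random.random() < eps:
        return random.choice(moves)
    return max(moves, key=lambda a: Q.get((pile, a), 0.0))

def self_play(games=50000, eps=0.3, lr=0.1):
    for _ in range(games):
        pile, history = N, []  # history holds (pile, action), movers alternating
        while pile > 0:
            a = choose(pile, eps)
            history.append((pile, a))
            pile -= a
        # The player who made the last move took the final stone and won.
        reward = 1.0
        for move in reversed(history):  # alternate +1 / -1 back through the game
            old = Q.get(move, 0.0)
            Q[move] = old + lr * (reward - old)
            reward = -reward

random.seed(0)
self_play()

# The learned greedy move from the full pile; with enough self-play this
# converges to taking 2, leaving the opponent a multiple of 4.
best = max(legal(N), key=lambda a: Q.get((N, a), 0.0))
print(best)
```

The machine never sees the known Nim strategy of leaving multiples of 4; it discovers it purely from thousands of games against itself, which is the point Samuel's checkers experiment made in 1952.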
We're putting it into the database so databases can be smarter, faster, more consistent, and, you know, really just performing out of the box.

I'm glad you brought that up. I was going to ask you, because the legend Steve Mills once said to me, when I asked him a question about in-memory databases, that ever since databases have been around, in-memory databases have been around. But ML-infused databases are new.

That's right. Something totally new.

Yeah, great. Well, you mentioned Deep Blue. We're looking forward to having Garry Kasparov on a little bit later on here, and I know he's speaking as well. But fascinating stuff that you've covered here. Sam, we appreciate the time. Thank you.

Thanks for having me on the show.

I wish you continued success as well.

Thank you very much.

Sam Lightstone, IBM Fellow, joining us here live on theCUBE. We're back with more here from New York City right after this.