Okay, we're back. This is Dave Vellante. I'm with Wikibon.org. I'm here with John Furrier. And this is theCUBE, SiliconANGLE.TV's continuous coverage. We're here live in Orlando. This is IBM Edge, really the first breakout conference for the IBM Storage Group. We've been talking today about whether IBM can get its storage mojo back, and this branded event called Edge has been a very successful first attempt to reach out to that community. Now we're here with Jeff Jonas. Jeff is a chief scientist at IBM Entity Analytics and a big data expert. Jeff, welcome to theCUBE. Hey, man, thanks for having me. Yeah, our pleasure. So we were talking off camera a little bit about how you came to IBM. Why don't you tell us that story? Well, one day I hired a CEO to run my company. I guess going back a little further, I started a company. Well, going back even further, I bankrupted a company when I was 20. Then I moved into my car. Fail fast. Yeah, fail fast. And then after I finished crying, I started my next company from my car. That company grew up okay, and then I hired a CEO to run it. And then he said I wasn't a very good chairman. So we brought in a real chairman of the board. And then the two of them put me in a stranglehold, you know, a UFC move, and sold my company to IBM, which turns out is okay. Okay, so you were sort of fighting it to begin with? Nah, I didn't fight it that much. It was a while ago. It was seven years ago. No one thought I would stay. And you did stay. I did, and I love my job. I'm working with amazing people on amazing stuff, and I have global access to really interesting problems. And it's fun. IBM is an interesting culture, because there are a lot of things that drive behavior at IBM, because it's such a large company that's been around so long. Sort of humorless in some areas, because these are important issues. Not a lot of swashbucklers, I guess I would say, right?
So, having lived out of a car, you seem like more of the swashbuckler style. So how does your personality integrate with IBM's? Well, I'll be candid, I was pretty sure the antibodies were going to attack, right? And I spent basically every month for the first couple of years thinking they're just around the next corner. And the reality is, they just never did. I mean, maybe you could count two moments where I felt like there was a little bit of infringement, but it's negligible. And IBM has become just a great place for me. It's like my hobby. I don't think I've had one order from my boss. So let's talk about some of the really interesting things that you're doing. It's like they hit the undo button on the data mining, right? That's kind of some of what you do. The undo button, right? Well, a lot of times it's, this is the answer... oh no, I got new information. Most all the data mining stuff going on is batch, and I'm somewhat obsessed with real time. Like, I spent a decade in the casinos, where you can lose a quarter of a million in 15 minutes. A batch job at the end of the hour? Like, why? So I'm somewhat obsessed with real time. And that is, how can you make sense of transactions as they're happening, fast enough to do something about it while it's happening? Right. So what does that mean? Does that drive decisions around processing, or does it drive decisions around storage? Because you need to handle a lot of data. Here's the interesting piece. I want you to think about it like when you put a puzzle together at home. You grab a piece out of the box. It's very hard to tell what it is. It just has flames on it. You don't know if it's good news or bad news. It might be a fire in the kitchen or a fire in the fireplace. Until you take it to the puzzle and figure out where it goes, you really don't know whether it's good news or bad news.
But that's what it's about: how does one piece of data find and figure out how it relates to every other piece of data ever seen? So I've got a couple of systems out there now with a hundred terabytes of solid state, just for metadata. A hundred terabytes of solid state for metadata? For metadata. One of these had 20 terabytes of solid state, and we maxed out the I/O on direct attach. So we went to a hundred terabytes of solid state. It's kind of exciting. So you get to put those together in the labs? Yeah, and in the mission. Like real, real customers, real projects. Are there any things that you can talk about? No. I didn't think so. So, you know, everybody talks about connecting the dots, talks about big data. The Osama bin Laden hit, right? Were you involved in that? No. Most people I ask in big data go, I can't say. No, I don't think I need to go there. That's not the first time you've gotten that answer, right? Are you sure? You know, you've got to be careful not to overclaim. Are you sure? I'm not 100% sure. It's like the number of people that take credit for the Viora. But no one would tell me anyway. I don't have any government clearances, so, like, you know. But I'm like 100% sure I wasn't involved. Yeah, yeah. Do you think big data was involved in that? Is that real-time enough? Well, it depends what the definition of big data is. Some people just call everything big data. Well, then sure. What's your definition? My definition of big data is: something magical happens when you put enough data together in the right way. Something changes, it starts to behave differently. And I've done a few speeches about this called Big Data, New Physics. And what this is, is it turns out the more data you get, puzzle pieces into puzzles, not piles of puzzle pieces, puzzle pieces into puzzles, you get lower false positives and lower false negatives. That means the predictions are better.
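Jonas's "puzzle pieces into puzzles" idea can be sketched in a few lines. This is a toy illustration, not IBM's engine: the record fields, scoring rule, and threshold below are all invented, but it shows how accumulated context makes each later piece easier to place.

```python
# Toy sketch of incremental context accumulation ("puzzle pieces into
# puzzles"). All names, rules, and thresholds here are invented for
# illustration; real entity-resolution engines use far richer scoring.

entities = []  # each entity is a dict of accumulated feature -> value

def score(record, entity):
    """Count features where the record agrees with an entity's context."""
    return sum(1 for k, v in record.items() if entity.get(k) == v)

def resolve(record, threshold=1):
    """Attach a record to the best-matching entity, or start a new one.
    As entities accumulate features, later records match more decisively:
    the third record below agrees on two features, not just one."""
    best = max(entities, key=lambda e: score(record, e), default=None)
    if best is not None and score(record, best) >= threshold:
        best.update(record)   # the piece joins the puzzle, enriching context
        return best
    fresh = dict(record)
    entities.append(fresh)    # the piece starts a new cluster
    return fresh

resolve({"name": "J. Smith", "phone": "555-0100"})
resolve({"phone": "555-0100", "addr": "12 Elm St"})   # links via the phone
resolve({"name": "J. Smith", "addr": "12 Elm St"})    # agrees on two features
```

After the three calls, all three records have collapsed into a single entity carrying name, phone, and address: each new piece had more context to land against, which is the "collapse" Jonas describes.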
The bad data in your enterprise, you're gonna be glad you didn't clean it. It turns out the bad data helps you. And get this: the more data you get, not only does it get more accurate, it gets faster to process. You need less CPU. I saw this in 2006, a system with 3 billion rows of data. When it got to 8 billion, it was more accurate and the cost to ingest the next piece was going down. I'll explain how this works in 15 seconds. Okay, go. Why is it that when you put a puzzle together at home, the last few pieces are as easy as the first few? You have more data in front of you than ever before. Why is that? What happens is the puzzle puffs out and then it begins to collapse. And while it's collapsing, it's getting more accurate, and your decisions about where to place each piece get faster. I saw that for the first time in 2006. I'm engineering systems today that are designed to take advantage of exactly that. And I think it's going to significantly change what's possible in big data. So I remember a puzzle that my kids and I had when they were little, and some of the pieces were exactly the same size, so they'd fit interchangeably. So you could absolutely put the wrong piece in a place where it fit, right? Yeah, so that's called a false positive. It doesn't happen that often, but it can happen, especially when the puzzle has a lot of the same shapes and a lot of the same colors, more ambiguity. But how do you find them? The way you find them is the arrival of new puzzle pieces. Because now you find another piece and you go, it goes right here... wait a minute, the piece next door is fighting me. You actually discover false positives while you continue the puzzle assembly process. So the big mistakes that people are making in big data are what? Not paying enough attention to the...
I think one of the big mistakes is trying to do everything batch. As soon as you start giving the business users really great answers every Monday morning, it's not gonna take long, assuming they're great answers, for them to say, why did I have to wait till Monday morning? They left the website; we ought to have given them a loan. So there's a class of things that I think are gonna be well suited for big data batch. But I think there's a whole slew of things that can be done in real time while the data's streaming. Yeah, so is it growth in real-time data, or is it in real-time decision making? Is that where the big investment is for IBM? Or is it both? It's really both. They play off each other real nicely. Let me differentiate these for a second. There's a moment when you're trying to make sense of stuff while the transaction's happening, trying to do something about it while it's happening. Like a credit card validation, for example. You've got half a millisecond to respond, otherwise it goes to the other service provider, right? Something like that. You can think of that as sense and respond. But there's another aspect, where you want to be able to deeply reflect over what you know. And this is no different than when you're sitting on the couch at home at the end of the day and you're just thinking about what you know. You're data mining on yourself. You're not reading emails or watching TV. And you go, that's interesting. Something is revealed. Those are two very distinct processes, and they're two different legs of the stool. And the tighter the feedback loop between them, I think, the more competitive it's going to make organizations. And you would argue that you can take the second, which is the reflection on the couch, and instantiate that in code. Yeah, I think people do that right now with predictive analytics.
They do deep reflection. You go analyze the 70,000 fraud cases and you realize that there are three factors that are almost always true. That discovery is something you'd want to then instantiate on the stream. Yeah. So Jeff, you gave us your definition of big data. What's your definition of real time? Fast. Right now, my goals are like sub-200-millisecond sense and respond. So it's fast enough to blink, 200 milliseconds. Okay, so it's not really fast. Well, if you're a jet fighter pilot and it's a heads-up system, I don't know. But in my work right now, I'm trying to do sub-200 milliseconds. I'll tell you what's hard about this. I'm gonna just tell you the hardest piece. You've seen a billion puzzle pieces, you've made a billion bets about where the puzzle pieces go. Then you get piece billion-and-one. At the moment you get that, you have to say: now that I know this, had I known it in the beginning, over the billion decisions I've made, should I have made any of them differently? And if you can't let new observations reverse earlier assertions, your whole model drifts from the truth. Doing that at thousands per second, over billions of rows of data, with hundreds of millions of entities, is non-trivial. We've been working on that for eight years. Right now, on a hundred terabytes of solid state, I can do that at about 2,000 decisions a second. And what I'm working on now, my next-generation stuff, code name G2, two and a half years in secret. Which you just started writing about. I just started writing about it. And it's coming of age, and the world will see more of it. I've done something radical in the schemas, designed for grid compute, to collapse this window of time, the latency to change your mind about the past, over a trillion rows of data. That's what I'm working on now. So, are you clamoring for an entirely new I/O architecture to support this vision? I need action. The funny thing is, it's kind of like extreme OLTP.
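The "let new observations reverse earlier assertions" point can be illustrated with a small sketch. The record IDs and match rule below are invented and have nothing to do with G2's real schemas: two records are first asserted to be different people, and a later observation reverses that bet.

```python
# Toy sketch (invented IDs and match rule; not IBM's engine) of letting a
# new observation reverse an earlier assertion. r1 and r2 are first bet to
# be different people; r3 arrives, matches both, and undoes that bet.

records = {}
parent = {}                      # union-find over record ids

def find(r):
    while parent[r] != r:
        parent[r] = parent[parent[r]]   # path halving
        r = parent[r]
    return r

def union(a, b):
    parent[find(a)] = find(b)

def matches(a, b):
    """Assumed rule: at least one shared feature, and none disagree."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[k] == b[k] for k in shared)

def observe(rid, rec):
    records[rid] = rec
    parent[rid] = rid
    for other, other_rec in records.items():
        if other != rid and matches(rec, other_rec):
            union(rid, other)   # may glue previously separate entities together

observe("r1", {"name": "Pat Lee", "phone": "555-0100"})
observe("r2", {"name": "P. Lee", "addr": "12 Elm St"})
assert find("r1") != find("r2")                  # earlier bet: two different people
observe("r3", {"phone": "555-0100", "addr": "12 Elm St"})
assert find("r1") == find("r2") == find("r3")    # the bet is reversed
```

The hard part Jonas describes is doing exactly this reversal incrementally, at thousands of decisions per second over billions of rows; the toy union-find only shows the semantics of a reversed bet.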
Any piece of data that arrives is just as likely to need any piece you've ever seen. So that means you can't optimize the usual way, like, oh, let's just take the most recent stuff and push it to the top. It needs all of it. And that's why you end up with these hundred-terabyte solid-state kinds of instances. Now the question is, how do you even further collapse that latency? What if I took a hundred-terabyte instance of solid state and then said, now I want a hundred of those? So I've been designing something that can talk across a grid like that, and it always knows where every piece of data is. It never has to broadcast. So what do you make of, you know, the Hadoop talk? Hadoop is batch, MapReduce one, MapReduce two, son of MapReduce, whatever. Can that world play in real time? Well, I don't think it can play in real time, but I think it can play in deep reflection. And deep reflection is hugely important to making organizations smarter. So here's how I see it. Puzzle pieces are arriving. Every time a piece of data arrives, the enterprise just learned something. At that second, what most organizations do today is they let the data land and they wait for users to ask it questions. There are not enough humans to ask every question every day. What needs to happen is: every piece of data is the question, because you just learned something. So that means when a piece of data arrives, just like a puzzle piece, it lands in a puzzle and you figure out where it belongs. Then you're taking information in context, where you know the things that are the same and you know the things that are related, and you're publishing in-context data to these big batch processes for deep reflection. These are the questions that you should be asking. What you can notice when it's landing is... it's like the little kid, the baby that doesn't yet know a bee stings. You watch the bee hovering around them and the baby thinks it's cute. They don't have a feedback loop, right?
So sometimes it's a secondary process. Sometimes you get stung. But whatever that is gets fed back in, and then in real time you know when to duck; you can go, hey, something's not right about this. And again, it's these dual processes. I've been doing these puzzle projects with kids and adults, because it's inspiring my algorithms. I mean, literally, I'm watching people put puzzles together and finding really subtle things about the cognitive process of making sense out of diverse data: where I hide puzzle pieces, where I have pieces from many puzzles in one pile. And it's really interesting. But one of the things I've learned is about the cycle time between real-time sense-making and the chance to deeply reflect. The tighter you couple them, the more intelligent things seem to become. Jeff, as an IBMer working in your organization, what advantages do you have relative to when you were a startup, and what disadvantages are there? I think the advantage is, right after the acquisition, I got to go see the labs, to tour around the labs. Well, I'm a built-a-company-out-of-my-car kind of guy. So you go see what it means when an organization is spending $6 billion a year on R&D. I would go to each lab and I would talk for an hour about what I do, and then they would hand-pick five or six things to show me at each lab. Unbelievable. It's just jaw-dropping. Yeah, so that's what's exciting to me: the access to that kind of investment, which is really unique. There are very few companies on earth that are that committed to the R&D process. And then the hard part is, you know, all these big organizations are matrixed. If you cannot establish a coalition of the willing, it's hard to get from point A to point B. And it's turning out okay for me, but it's about the coalition of the willing.
And it's also, if you want to get something big done, I call it a tightly held conspiracy to do good. You get a few people who go, let's tell no one. Because if everybody knew how important the internet was gonna be, we'd still be discussing the standards. So, you know, if you want to get something that's going to grow and be big, keep it small, and try to make it so it can go viral later. So how does it work? Is it like a grab bag of technology, an invention that you can just reach into and grab and apply, or is it somewhat... You mean across IBM Research? I'll tell you what, for me, there's a very narrow band in which I think I'm smart, and in most all other bands I'm an actual idiot. So it's a narrow place where I'm useful to IBM. And in that, when I look across IBM, like, I'm talking to the Blue Gene people, and these Blue Gene folks explain to me how to make my kind of thing run on Blue Gene, and they taught me something. And when they taught me that, it fundamentally rewired the way I think about the way data should talk to schema. Can you explain that? Okay, here's how that goes. I'm talking to a couple of Blue Gene guys and I go, wow, the way my NORA-class technology for the casinos would run on Blue Gene would be this. And they looked at me and go, are you kidding? That would never scale. And it took the wind out of me. I'm like, why? At the time, it was tens of thousands of nodes or something, I forget. And they taught me two main things. One, you want the data evenly distributed across all the nodes, and two, you want every piece of data to know where it lives, so you never have to ask every node. Those were the two main takeaways. How does that translate? Well, what it translated to for me was, I need schemas where I'm gonna index everything on a hash, so it's evenly distributed.
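Those two takeaways, even distribution and data that knows where it lives, can be sketched in a few lines. The 16-node grid and the key scheme below are made up for illustration:

```python
# Sketch of the two Blue Gene takeaways: hash each record's key so that
# (1) data spreads evenly across nodes and (2) every piece of data "knows
# where it lives". The 16-node grid and key scheme are invented here.
import hashlib

NODES = 16

def home_node(key: str) -> int:
    """A record's node is a pure function of its key, so any node can
    compute where a piece lives without broadcasting to the rest."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NODES

# Writer and reader agree on placement with no coordination:
assert home_node("entity:12345") == home_node("entity:12345")

# A uniform hash spreads 10,000 keys roughly evenly over the 16 nodes:
counts = [0] * NODES
for i in range(10_000):
    counts[home_node(f"entity:{i}")] += 1
```

Because placement is a pure function of the key, a lookup goes straight to one node rather than asking every node, and the uniform hash keeps the buckets close to evenly loaded.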
And then I'm gonna have only one index per schema, so that you're never optimal on one schema and suboptimal on the others. And there's some other little trickery around that, but it's damn exciting for me, I gotta tell you. So my colleague, John Furrier, this is actually the first time I've ever seen him give up the mic. I know. Where is he, by the way? I always heard he was gonna be here. John's right here. He's right there. I know, I'm horsing around. And so he actually has a few questions. He's actually really angry now about giving up the mic. He's fired off like seven or eight to me, but here are some of the ones that he's really interested in, so I'll give you credit, John. What are the top trends in big data that aren't yet on people's radar? I'll tell you what's coming fast: geospatial data about where you and I are and how we move. It is going to blow your socks off. In the U.S. today, 600 billion records are being created every single day about where you and I are, when, and how we move. 600 billion a day. And this data is being anonymized and shared with secondary companies. And the kinds of predictions you can make are stunning. And this is going to lead to a whole bunch of privacy challenges. So I find myself working with things called space-time boxes, about defining where things are when, and then how to anonymize space-time boxes. Yeah, you've done a lot of work around privacy. Yeah, in fact, my new G2 project has more privacy features baked into it than anything I've ever created. I thought there was no privacy on the internet. We're supposed to get over it. Isn't that a Scott McNealy quote on theCUBE? Get over it. So the other question John had is, what fundamental changes are you seeing in computer science that are changing the status quo? I touched on some of them. I think it's this notion of what I would call incremental context accumulation. Okay, let me step back and say this.
What has been happening is, if you have structured data, you use structured data analytics. It's like, oh, those are the red puzzle pieces? Well, process them with the red puzzle piece processor. Ah, you have unstructured data. Well, those are blue puzzle pieces. Ah, that's good, because we have blue puzzle piece processors. And then Twitter: magenta. We've got magenta. And the problem is, if you process each puzzle piece in isolation, you can get some gains, but those gains start to flatten out. General-purpose context accumulation says you can take a very wide feature space across structured, unstructured, social, geospatial, biographic, biometric, take your pick. And how do you weave all those puzzle pieces together? The quality of predictions that you can get out of that, I kid you not. What I believe is imminent is we're gonna see computers more frequently arguing about something with a human, and the computer being right more often. And I have one example of that. You wanna hear one example? Yeah, please. So, one day in my little email inbox, I get this press release about one of my technologies running at MoneyGram. I didn't have anything to do with it, really. A lot of them I do, but this one was news to me, so I'm looking at it, right? The day they turn on context accumulation, where data's coming together, fraud drops, I think, 70-something percent. Fraud complaints. Fraud complaints drop over 70 percent. That's nothing. Get this. I'm reading this, and it actually choked me up. A 100-year-old grandmother goes to MoneyGram, says, transfer $2,500 to my grandson. And MoneyGram says, no, we think it's fraud. She goes, it is not fraud. I'm on the phone. MoneyGram says, no. She goes, I've been giving you my business for years. This is my time of need. You will transfer it. MoneyGram loses nothing if they transfer it. But they were so confident in their prediction, they still told her no. She calls back three days later in tears, thanking them. That's cool. That's cool.
That's a prediction where you're so confident you can dispute a human who's contextually closest. That's interesting. Yeah. And I think context accumulation, diverse things coming together, is going to do all kinds of interesting things. So, my last question: we've been batting around this premise that was put forth by Peter Goldmacher of Cowen on theCUBE. Talking about who's going to make the money, who the big winners are, he basically said the guys who are going to win are the big data practitioners, the guys that are using big data and getting value for their customers, versus the suppliers, which I think is an interesting premise. What advice would you give to those big data practitioners in terms of what they should be doing, where they should be focused, how they can create that value? A few years ago, somebody on stage said that every time you can take half a millisecond out of something, you can save $100 million a year. Who was that? Goldman Sachs. It was Goldman Sachs, 2008. Every millisecond is $100 million a year. What I think comes next? You know, first it's Web 2.0, where everything's connected. Phase two, pretty soon, it's all about the data. Now the question is, what can you do analytically on the data? And what's left to compete with? Here's what I think it is. It is latency. If all of us have access to a similar observation space, the next question is, who can make sense of it fastest? So I think the real big game next is compression of latency. And thus, 100-terabyte, 200-terabyte solid-state storage arrays serving up metadata to processors, or schemas that can process faster than anybody else. How can you integrate what you're observing? You're not just comparing the puzzle piece as it flies by. Imagine the puzzle piece first slams into the puzzle and finds its chunk. And what comes out of the back is richness, rich context. And then you're going to use that for really fine-grained decisions.
And it's going to be over super-fast I/O. All right, Jeff Jonas, we're out of time. This was fantastic. Thanks very much for coming on theCUBE. Thanks for having me here. Thanks to John Furrier for letting me in. This is theCUBE. We're at IBM Edge. We'll be right back after this word. Keep it right there.