campus in San Francisco. This is theCUBE, covering the Apache Spark Makers community event, brought to you by IBM. Now here are your hosts, John Walls and George Gilbert.

And welcome back to the Apache Spark Makers community event here in San Francisco. We're in the Galvanize campus, one of seven campuses that Galvanize has up and operating right now, mostly in the Western US. But we're here with the IBM-sponsored event, and of course the Spark Summit, which continues tomorrow and Wednesday at the Hilton Union Square. We'll be streaming from there, so be sure to join us for that. George Gilbert is with me, as is Adam Kocoloski, who is the CTO of Cloud Data Services at IBM Analytics. And Adam, we certainly appreciate the time. Thanks for joining us here.

Delighted to be here. Thanks for having me.

Let's just talk about, you know, cloud in general. I mean, how is that changing the data game, and with regard to Spark? Because all of a sudden you have this, you know, this influx of massive amounts of data coming in, and of course a lot of storage capacity to hang on to it.

Sure, absolutely. I think one of the key things you have to keep in mind is that, you know, you have to architect in the right way to take advantage of the scale of the cloud. If you just do what you're doing in an on-premise environment and multiply that by a thousand, you're not really doing it right. When you look at building out a hyperscale object storage facility, it stores the data for pennies a gigabyte because they designed the racks and the network and the disks and the chassis all to specifically support that very targeted use case. So that's what we see in the cloud: the specialization into focused individual data services, right? That's how you get those economies.
Just to follow up on that, in other words, you're designing the data service, and there will be a bunch of them, to be aware of a hardware infrastructure that is designed completely differently. In other words, one, that it might be elastic; two, that it might be aware of, sort of, racks and, you know, local storage that may be different on different nodes.

Sometimes yes, sometimes no. The key is that when workloads achieve a certain scale, it makes sense to optimize the hardware for them, right? But you only get that scale if you bring lots of people together in these large, cloud-scale data center environments.

So is that what allows you to build a hyperscale object store? The fact that it'll be multi-tenant?

Yeah, that's a big part of it, exactly right. It doesn't make sense at the 100 terabyte level. It arguably doesn't really make sense at the petabyte level. But when you're talking about tens and hundreds of petabytes, now it really makes sense to design for that. But what you've done when you design that architecture and focus it that way is you've kind of run counter to the way that we've traditionally analyzed data on premises. We said, oh, you have to co-locate the compute with the data, right? That was rule number one of Hadoop, right? Co-location is a must. But now in the cloud, I don't really have that option, right? I've got my huge-scale data services in the background, and then I've got my workloads that might be running somewhere else. So the name of the game has changed a little bit, and I think we're seeing in the Hadoop space a gradual evolution, where some of the original tenets around how one had to do the processing are being rethought in this cloud world.

At the risk of diving into the weeds, isn't Amazon's implementation of Hadoop criticized somewhat because, I mean, they want to separate compute from data for cost reasons, so you just pay for storage when you're not spinning up servers?
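The scale argument here, that purpose-built storage hardware only pays off at tens or hundreds of petabytes, can be sketched as a simple amortization calculation. Every number below is a made-up placeholder chosen purely to illustrate the shape of the break-even, not a real IBM or cloud-provider figure.

```python
# Hypothetical amortization model: when does purpose-built storage hardware
# pay off? All prices and costs are illustrative assumptions, not real figures.

def cost_per_gb_month(stored_gb, fixed_engineering_cost, months,
                      specialized_gb_month=0.004):
    """Effective $/GB-month once the one-time design cost is amortized."""
    amortized = fixed_engineering_cost / (stored_gb * months)
    return specialized_gb_month + amortized

GENERIC = 0.02          # $/GB-month on off-the-shelf gear (assumed)
FIXED = 50_000_000      # one-time cost to design racks/network/chassis (assumed)
MONTHS = 36             # amortization window

for label, gb in [("100 TB", 1e5), ("1 PB", 1e6), ("100 PB", 1e8)]:
    c = cost_per_gb_month(gb, FIXED, MONTHS)
    print(f"{label}: ${c:.4f}/GB-month vs ${GENERIC}/GB-month generic")
```

With these toy numbers the specialized design is wildly uneconomic at 100 TB, still loses at a petabyte, and only undercuts generic hardware near the 100 PB mark, which is the multi-tenant scale effect being described.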
And then the problem is you have to move the data every time you're going to work on it.

Sort of.

So help us figure out how you get around that.

Absolutely. So, you know, you're right that that's the reason they did that, that's the reason they separated the compute from the storage. They could have that elasticity, they could have that burstability, and there's a lot of attractiveness about doing that. If you've got a very steady workload and you're reading all the data all the time, then, yes, admittedly, you're doing more transfers in and out of the data services than you might otherwise want to do, right? But we think that there are ways to get smarter. We think that oftentimes you want to do an analysis that's a little more targeted, a little more focused, right? And there are smart ways that you can just retrieve a subset of the objects to do that. So there are games that one can play to make this efficient.

Almost like a column store, where you only read the columns you need.

Quite right. That translates in a completely analogous fashion to the world of data services, where you want to push down that predicate. You want to take as much of the intelligence in the filter and actually push that into the data service, and have the data service just give you back the data you really needed to analyze in more depth.

Okay, but even at this, you push down to some semi-intelligent storage and say, just give me the elements that are relevant to compute on.

Exactly right.

Because the guys at Snowflake talk about that.

I think it's a very relevant model, and I think it's a model that we're going to see a lot of in Spark going forward, actually.

Where Spark will be sort of data storage aware?

Yeah, I think so.

They're going to have to change Catalyst and the Data Sources API. I mean, dropping way into the weeds, but...

You're right. You're right.
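The pushdown idea in this exchange can be sketched in a few lines: instead of shipping whole objects to the compute tier and filtering there, the client hands the storage service a predicate and a column list, and only the matching subset ever crosses the network. This is a toy model of the concept, not any real object-store or Spark API.

```python
# Toy model of predicate + projection pushdown into a storage service.
# `ObjectStore` stands in for a "semi-intelligent" data service; no real
# product API is implied.

class ObjectStore:
    def __init__(self, rows):
        self._rows = rows                      # list of dicts, one per record

    def scan(self):
        """Naive path: ship every full record to the compute tier."""
        return list(self._rows)

    def scan_pushdown(self, predicate, columns):
        """Smart path: filter and project inside the service itself."""
        return [{c: r[c] for c in columns}
                for r in self._rows if predicate(r)]

store = ObjectStore([
    {"id": 1, "region": "us", "sales": 100, "notes": "..."},
    {"id": 2, "region": "eu", "sales": 250, "notes": "..."},
    {"id": 3, "region": "us", "sales": 300, "notes": "..."},
])

# Without pushdown: three full records transferred, filtered client-side.
client_side = [r["sales"] for r in store.scan() if r["region"] == "us"]

# With pushdown: only two narrow records ever leave the service.
pushed = store.scan_pushdown(lambda r: r["region"] == "us", ["sales"])

print(client_side)   # [100, 300]
print(pushed)        # [{'sales': 100}, {'sales': 300}]
```

Both paths return the same answer; the difference is where the filtering happens and how many bytes move, which is exactly the "games one can play" point above.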
There are changes, architectural changes, that have to happen, but I do believe we're going to get to a place where portions of these query predicates are going to be understood in greater detail by some of the data services, and we will actually start doing smarter and smarter predicate pushdown as time goes by.

So in other words, we will see essentially more knowledge of Spark in the dumb storage, and more knowledge of what's getting to be slightly smarter storage in Spark.

That's one of the things that has me really excited about IBM, because we are sort of operating on both sides of that coin. We understand enough about the core Spark engine to figure out the right way to do some of these predicate pushdown operations. And we also have this portfolio of data services that are purpose-built for certain types of workloads: object storage, NoSQL databases, relational databases and so on. And we can provide a level of consistency in the way that those predicates can be pushed down.

Well, George took us into the weeds. I'll get us back to the grass tops. So you're dealing with customers, and they get the cloud. I like this cloud thing, it makes sense: save costs, get things off-premises and all that. But I like fast, and I like speed, and I like saving on my other operations, and I like this Spark thing. How are you marrying all that up with your clients? I mean, what's the education you're having to do, and then how are you trying to get them to the point where they can operationalize it and put it into practice?

You know, it's a different conversation with different customers, because people are at different stages in their sort of analytics journey, right? I think a lot of times we see folks starting on-prem and coming to the cloud and wanting a like-for-like copy of what they were used to doing. They'd already gone through this evaluation.
They already know how they expect their environments to be deployed, and they have opinions about these things. They start with a virtual machine. They're like, just take this and stick it over here, and I want to do the same thing.

Because I've got a mandate to go to the cloud.

Right, right.

So they check the box and they can move on.

Exactly, exactly. And that's all well and good. We have offerings that cater to that kind of model. But then once people are there, they understand the real benefits of the elasticity, and the benefits of allowing a lot more experimentation, right? A lot more, sort of, data scientists trying out new ideas in a very rapid fashion. I think that is the place where right now we see the cloud truly shining, right? We're not yet at the place where we have to impose all the operationalization controls, right? It's a discovery area. And in that world, we're seeing a ton of traction right now, because the architecture and the technologies are well-suited for it. And because CISOs and IT departments are maybe not as far along, in some cases, to allowing full-on production workloads into the cloud environment. But as we get there, I think we're going to see this convergence, right? Where you started with the like-for-like copies, you get past the discovery, and you get to a place where we have production-class environments driving business processes out in the cloud.

Just to try and get a sense for where we are on that continuum of just move my VM, and I can do metered pricing, and I have slightly more automated management: Amazon says about 10% of their revenue is on databases, I believe actually on their database services. Could you use that as a proxy for we're not just moving the VM, but we're starting to treat data as database services?

You know, it's tough.
You have to break it down a level further, because I think a very attractive proposition is that, you know, if a client is building an application that is creating data, or building an application that is consuming exogenous data, data from the outside world, IoT data, social data, things like that, it's a very natural fit to store in a cloud database, right? From Amazon's perspective, that's still cloud database revenue, but you really have to break it down and say, is this the kind of database that is optimized for, you know, traditional system-of-record workloads, or is it the kind of database that's optimized for, you know, systems-of-engagement data?

Okay, got it. When you were talking about, almost painting this, not quite Wild West, but you're talking about the cloud as room for experimentation, you know, to try things. Where's the growth area in terms of what Spark is generating around machine learning? Is there some genesis there? I mean, does that provide opportunities for machine learning to explore and develop and grow, because there's a big growth opportunity there?

Yeah, so I mean, machine learning's been around a long time, right? But why, you know, why the sudden sharp spike in interest in machine learning frameworks and deep learning frameworks today? And I think the answer is, because we now have access to unprecedented amounts of data and unprecedented amounts of compute. The algorithms were there in some way, shape or form, but it's the application of them at this scale that makes them really, really interesting. It's a very good fit for a cloud world. You couldn't justify the financial investment, the capital expenditure, to go buy a bajillion GPUs and, you know, a petabyte-scale data store. But in the cloud, you know, you can do a shorter-scale, you know, sort of run at this and demonstrate value out of the machine learning.
I have to ask what may sound like a geeky question, but it could be significant, in the sense that whenever we go through platform shifts, whole new classes of apps become possible. And now there are two big things that are sort of looming on database architecture. One is memory-intensive systems, where, you know, rather than architecting everything to drink through this really narrow straw that goes to a physical spinning disk, you know, spinning rust, we can fit most of it in memory, and then the disk is more like tape, for backup. And the other thing is the increasing prevalence of graphics processing units, to give you incredibly more parallelism to analyze the data that's in memory, so you know better how to query it. Put those together. What does your platform start to look like a few years out?

It's fascinating. It's one of the questions that keeps me up at night, right? And I'll throw a third technology trend in there, and that's the storage-class memory space. I think we're seeing a place where persistent memory becomes a real driver.

You could put, like, flash on a PCIe or memory bus.

Yeah, exactly. So, you know, I'm very, very interested in that space. I think right now we are absolutely seeing GPUs make a difference in the deep learning space. That becomes, you know, the way that one actually goes and trains at production scale. The applications of GPUs in the database space, you know, we see a couple of different companies out there that are making a bet on this, and certainly we've got research investigations. I think you have to be fairly targeted in the types of operations that you can, you know, push down into that space.

Not to be totally geeky, but...

Go ahead, George.

I'll dive in.

I think you're already there, actually.
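The tiering being discussed here (disk as the new tape, main memory above it, processor cache above that, storage-class memory slotting in between) can be sketched as a nested cache hierarchy where each miss falls through to a slower, larger tier and hot data gets promoted upward. The tier names, sizes, and eviction policy below are toy assumptions, not a real memory hierarchy or any dashDB internals.

```python
# Illustrative multi-tier lookup: a miss falls through to a slower, larger
# tier, and the fetched value is promoted into the faster tiers above it.
# Tier names and capacities are toy assumptions.

from collections import OrderedDict

class Tier:
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.data = OrderedDict()              # LRU order: oldest first
        self.hits = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)         # refresh recency
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)      # evict least recently used

class Hierarchy:
    def __init__(self, tiers, backing):
        self.tiers = tiers                     # fastest tier first
        self.backing = backing                 # "disk is the new tape"

    def read(self, key):
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                for faster in self.tiers[:i]:  # promote into faster tiers
                    faster.put(key, value)
                return value
        value = self.backing[key]              # cold read from the bottom
        for tier in self.tiers:
            tier.put(key, value)
        return value

cache = Tier("cpu-cache", 2)
memory = Tier("memory", 8)
disk = {f"row{i}": i for i in range(100)}
h = Hierarchy([cache, memory], disk)

h.read("row1"); h.read("row1"); h.read("row2"); h.read("row1")
print(cache.hits, memory.hits)   # prints: 2 0
```

The point of the sketch is the one Kocoloski makes: the optimization game is the same at every boundary, and "moving up a layer" just means replaying it between memory and processor cache instead of between disk and memory.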
Yeah, the reason I mentioned the two together is, I mean, databases have been built pretty much the same way, except for a few splinters, since, you know, Codd's 1970 paper and the original research project at IBM. But if you use GPUs, and if you have all your data in memory, the stuff that's so hard to figure out about how to query it, which is, I don't really know what's stored, because it's all on my disk, and there's no way I'm going to analyze it for this, you know, for this very query. Whereas if it's all in memory, and I can have a pretty good up-to-date map of what's there, I can do stuff much, much faster. Is that going to change?

You know, the reality is that this tiering you talked about, you know, sort of disk is the new tape, right? This tiering goes many levels up. It doesn't stop with main memory, you know. Modern processors have multiple levels of cache and things like that. And if you look at, for example, the data warehousing offering that we have in our cloud platform, dashDB, right? dashDB understands at the query layer that certain subsets of the data actually need to be kept in processor cache to optimize the performance of the query. So this kind of work will continue to happen. It just sort of moves up a layer. Instead of optimizing for disk versus memory, it becomes memory versus processor cache, and so on and so forth. The games we play will continue to be played.

Okay. You feel good about that, George?

Yeah, I was hoping for more, but you might be playing it coy.

Well, Adam, thank you for the time. We appreciate your sharing some of your time with us here today at the IBM Spark Summit, and we look forward to meeting you down the road and seeing you here on theCUBE.

Sounds good.

Good deal. Thank you. We'll be back with more from San Francisco here on theCUBE in just a moment.