theCUBE coverage, DataWorks Summit Europe 2017. Brought to you by Hortonworks.

Welcome back to the DataWorks Summit in Munich, everybody. This is theCUBE, the leader in live tech coverage. Chandra Mukyala is here. He's the offering manager for IBM Storage. Chandra, good to see you. It always comes back to storage.

It does, it's the foundation. We're here at a data show and you've got to put the data somewhere.

So how's the show going? What are you guys doing here?

It's going good. I mean, lots of participation. I didn't expect this big a crowd, but there's a good crowd. People don't look at storage as the sexiest thing, but I still see a lot of people coming and asking, what do you have to do with Hadoop? Which is exactly the kind of question I expect. So it's going well.

Well, it's interesting. In the early days of Hadoop and big data, I remember John and I interviewed Jeff Hammerbacher, founder of Cloudera. He was at Facebook, and he said, my whole goal at Facebook, when we were working with Hadoop, was to eliminate the expensive storage container. So they succeeded, but now you see guys like you coming in and saying, hey, we have better storage. Why does the world need anything different than HDFS?

This has been happening for the last two decades, right? In storage, every few years a startup comes along, they address one problem very well, and they build a whole storage solution around that. Everybody understands the benefit of it, and eventually that capability becomes part of mainstream storage. And I say mainstream storage because these new point solutions address one problem, but what about all the rest of the features storage has been developing for decades? The same thing happened with other solutions, for example deduplication. Very popular, right? At one point there were dedupe appliances, but nowadays every storage solution has dedupe. I think it's the same thing with HDFS.
HDFS is purpose-built for Hadoop, right? It solves that problem in terms of giving local-access storage, scalable storage, a big pool of storage. But it's missing many things. One of the biggest problems with HDFS is that it's siloed storage, meaning the data in HDFS is only available to Hadoop. What about the rest of the applications in the organization that may need it through traditional protocols like NFS or SMB, or through new applications with S3 or Swift interfaces? You don't want that siloed storage. That's one of the biggest problems.

So you're putting forth a vision of some kind of horizontal infrastructure that can be leveraged across your application portfolio. How common is that? And what's the value of that?

It's not very common. That's one of the messages we're trying to get out. I've been talking to a lot of data scientists over the last year. One of the first things they do when they're implementing a Hadoop project is copy a lot of data into HDFS, because until they ingest the data into HDFS, they can't run any analytics on it. That copy process takes days.

It's a big move, yeah.

It's not only wasting the data scientist's time, it also makes the data stale. And I tell them, you don't have to do that if your data is on something like IBM Spectrum Scale. You can run Hadoop straight off that. Why do you even have to copy into HDFS? You can use your same existing MapReduce applications, with zero changes, and point them at Spectrum Scale. You can still use the HDFS API. You don't have to copy the data. And every data scientist I've talked to says, really, I don't have to do this? I'm wasting time? Yes. It's not very well known. Most people think there's only one way to run Hadoop applications, and that's on HDFS. You don't have to.
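Mechanically, "pointing MapReduce at Spectrum Scale" works through Hadoop's pluggable FileSystem layer: the cluster's `core-site.xml` tells Hadoop which implementation backs the default file system, so a connector can serve the same API from different storage. A rough sketch of the pattern; the scheme and class names below are illustrative placeholders, not IBM's actual connector names:

```xml
<!-- core-site.xml sketch: Hadoop's FileSystem abstraction is pluggable,
     so a connector can serve the HDFS API from another file system.
     The scheme and class names here are hypothetical placeholders. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hypothetical scheme for the connector-backed file system -->
    <value>examplefs://cluster1/</value>
  </property>
  <property>
    <!-- fs.<scheme>.impl maps the URI scheme to an implementation class -->
    <name>fs.examplefs.impl</name>
    <value>com.example.hadoop.ExampleFileSystem</value>
  </property>
</configuration>
```

With a mapping like this in place, an unmodified MapReduce job resolves paths through the connector instead of copying data into HDFS first.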
And the advantages there are, one, you don't have to copy; you can share the data with the rest of the applications; and there's no more stale data. But there's also one other big difference between HDFS-type storage and shared storage. In the shared-nothing model, which is what HDFS is, the way you scale is by adding new nodes, which adds both compute and storage. What about applications that don't necessarily need more compute? All they need is more throughput. You're wasting compute resources, right? So there are certain applications where shared storage is the better architecture. Now, the solution IBM has will allow you to deploy it either way, shared nothing or shared storage. That's one of the main reasons people, data scientists especially, want to look at these alternative storage solutions.

So, when I go back to my Hammerbacher example, it worked for Facebook in the early days because they didn't have a bunch of legacy data hanging around. They could start with pretty much a blank piece of paper, re-architect, plus they had such scale they probably said, okay, we don't want to go to EMC or NetApp or IBM or whomever and buy storage, we want to use commodity components. Not every enterprise can do that is what you're saying.

Yes, exactly. It's probably okay for somebody like a very large search engine where all they're doing is analytics, nothing else. But if you go to any large commercial enterprise, they have lots of data. The whole point of analytics is that they want to pool all of the data and look at it to find the correlations, right? It's not about analyzing one dataset from one business function. It's about pooling everything together and seeing what insights you can get out of it. So that's one of the reasons it's very important to have support for accessing the data from your legacy enterprise applications too, right?
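The scaling argument above can be made concrete with a toy model. The per-node numbers here are invented for illustration, not IBM or Hadoop benchmarks; the point is only that in a shared-nothing cluster, buying throughput forces you to buy compute too:

```python
# Toy model of shared-nothing scaling: compute and storage throughput
# grow together, so a throughput-bound workload can strand compute.
# Per-node figures below are assumptions for illustration only.
import math

NODE_THROUGHPUT_GBS = 2.0   # assumed storage throughput per node (GB/s)
NODE_CORES = 32             # assumed compute per node

def shared_nothing_nodes(throughput_target_gbs: float) -> int:
    """Nodes required when throughput only grows by adding whole nodes."""
    return math.ceil(throughput_target_gbs / NODE_THROUGHPUT_GBS)

def surplus_cores(throughput_target_gbs: float, cores_needed: int) -> int:
    """Compute capacity purchased but not needed by the workload."""
    nodes = shared_nothing_nodes(throughput_target_gbs)
    return max(0, nodes * NODE_CORES - cores_needed)

# A throughput-bound job: needs 40 GB/s but only 64 cores of compute.
print(shared_nothing_nodes(40.0))   # 20 nodes to hit the throughput target
print(surplus_cores(40.0, 64))      # 576 cores bought but idle
```

Shared storage avoids this coupling by letting throughput and compute scale independently.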
Yeah, so NFS and SMB are important, and so are S3 and Swift. But also, for these analytics applications, one of the advantages of the IBM solution here is that we provide local access to the file system, not just through NAS protocols like NFS. We do that, but we also have POSIX access, direct local access to the file system. With HDFS, you have to first copy the file into HDFS, and you have to bring it back out to do anything else with it. All those copy operations go away. And this is important, again, in an enterprise, not just for data sharing, but also to get local access.

You're saying your system is Hadoop-ready?

It is.

Okay, and then the other thing you hear a lot, from IT practitioners anyway, is that when people spin up these Hadoop projects, big data projects, they go outside of the edicts of the organization in terms of governance, compliance, and often security. Do you solve that problem?

Yeah, that's another reason to consider enterprise storage. It's not just that you're able to share the data with the rest of the applications; you also get a whole bunch of data management features, including data governance features. You can talk about encryption. You can talk about auditing. You can talk about features like WORM, write once, read many, right? So especially for archival data, once you write it, you can't modify it. There are a whole bunch of features around data retention and data governance. These are all part of the data management stack we have. You get that for free. You not only get universal, unified access, you also get data governance.

So this is just one of those situations where, on the face of it, when you look at the CapEx, you say, oh wow, I can use commodity components, save a bunch of money. You remember the client-server days.
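The WORM guarantee mentioned above is simple to state in code. This is a toy sketch of write-once-read-many semantics, not IBM Spectrum Scale's actual retention implementation, which enforces immutability at the file system layer:

```python
# Toy illustration of WORM (write once, read many) semantics.
# A sketch of the retention idea only, not a real storage product.
class WormStore:
    def __init__(self):
        self._objects = {}

    def write(self, key: str, data: bytes) -> None:
        # The write-once rule: a key, once written, is immutable.
        if key in self._objects:
            raise PermissionError(f"{key} is immutable: already written")
        self._objects[key] = data

    def read(self, key: str) -> bytes:
        # Read many: reads are always allowed.
        return self._objects[key]

store = WormStore()
store.write("xray-001", b"scan data")
print(store.read("xray-001"))          # b'scan data'
try:
    store.write("xray-001", b"tampered")
except PermissionError as e:
    print(e)                           # xray-001 is immutable: already written
```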
Oh wow, cheap, cheap, cheap microprocessor-based solutions, and then all of a sudden people realized, we have to manage this. Are we seeing a similar sort of trend with Hadoop, where the complexity of managing all this infrastructure is so high that it actually drives costs up?

Actually, there are two parts to it, right? There is real value in utilizing commodity hardware, industry standards. That does reduce your cost. If you can just buy a standard x86 server with storage and utilize that, why not? That is going to reduce the cost. But the real value in any kind of storage or data management solution is in the software stack. Yes, you can reduce your CapEx by using industry-standard hardware; it's a good thing to do, and we support that. But in the end, data management lives in the software stack. What I'm saying is, HDFS solves one problem, but it's missing the whole set of other data management problems we just touched on. And that all comes in software, which can run on industry-standard servers.

Well, you know, I've been saying for years that if you peel back the onion on any storage device, the vast majority anyway, they're all based on standard components. It's the software you're paying for. So it's sort of artificial, in that a company like IBM will say, okay, we've got all this value in here, but it's on top of commodity components, and we're going to charge for the value. And if you strip that out, sure, you can do it yourself.

Yeah, exactly. In the end, it's all standard servers. It's been like that always. Now, one difference is, 10 years ago people used proprietary RAID controllers. Now all of that functionality is coming into software. As for ASICs, I think 3PAR still has an ASIC, but most don't. That's probably the only company I can think of, right?
Almost everybody has some kind of software-based RAID coding where you're able to utilize standard servers. Now, there is an advantage in the appliance model because, yes, the software can run on industry-standard hardware, but this is storage. This is the foundation of all of your infrastructure.

Yeah, and you want RAS, right? You want reliability and availability.

The only way to get that is a fully integrated type of solution where you're doing a lot of testing on the software and the hardware together. Yes, it's supposed to work, but what really happens when a drive fails? How does the system react? That's where I think there is still value in integrated systems. But if you're a large customer with a lot of storage-savvy administrators who know how to build solutions and validate them, then yes, software-based storage is the right answer for you.

And you're the offering manager for Spectrum Scale, right, which is the file offering?

That's right.

And it includes object as well?

Spectrum Scale is our file and object storage stack. It supports both file protocols and object protocols. The thing about object storage is that it means different things to different people. To some people, it's the object interface, like S3 or Swift.

To me, it means get, put.

Yeah, if that's the definition, then Spectrum Scale is object storage too. The fact is, everybody supports S3 and Swift now.

Sure.

But to some other people, it's not about the protocol, because they're still going to access it through file-based protocols. To them, it's about the object store itself, which means a flat namespace: there's no hierarchical name structure, and you can get to billions of objects without any scalability issues. That's an object store. And to some other people, it's neither of those. It's about the erasure coding that object storage provides. So it's cheap storage. It allows you to run on storage servers and you get cheap storage.
So it's three different things. If you're talking about the protocols, then yes, by that definition Spectrum Scale is object storage also.

So, thinking about Spectrum Scale generally, but specifically your angle on big data and Hadoop, we talked about that a little bit, but what are you guys doing here? What are you showing? The partnership with Hortonworks? Maybe talk about that a little bit.

So we've been supporting what we call the Hadoop connector on Spectrum Scale for almost a year now, which allows our existing Spectrum Scale customers to run Hadoop straight on it. But if you look at the Hadoop distributions, there are two or three major ones, right? Cloudera, Hortonworks, maybe MapR. When we tell our customers they can run Hadoop on this, one of the first questions we get is, oh, is this supported by my distribution? So that has been a problem. What we announced is a partnership with Hortonworks. Now Hortonworks is certifying IBM Spectrum Scale. It's not new code, it's not new features, but it's a validation and a stamp from Hortonworks. That's in process. The result is a Hortonworks-certified reference architecture, which is what we announced about a month ago. We should be publishing that soon. Now customers can have more confidence in the joint solution. It's not just IBM saying it's Hadoop-ready; Hortonworks is backing that up.

Okay, and your scope, correct me if I'm wrong, is sort of on-prem and hybrid, not cloud services. You might sell your technology internally, but...

Correct. IBM Storage is primarily focused on on-premises storage. We do have a separate cloud division, but almost every IBM Storage product today, Spectrum Scale included, which is what I can speak to, we treat as hybrid cloud storage. What we mean by that is we have built-in capabilities.
We have a feature in most of our products called transparent cloud tiering. It allows you to set a policy on when data should be automatically tiered to the cloud. Everybody wants public cloud, everybody wants on-premises. Obviously there are pros and cons of on-premises versus off-premises storage, but it basically boils down to this: if you want performance and security, you want to be on-premises. But there's always some data that's better off in the cloud, and we try to automate that with transparent cloud tiering. You set a policy based on age, based on the type of data, based on ownership. The system will automatically tier the data to the cloud, and when a user accesses that file, it comes back automatically. It's all transparent to the end user. So yes, we are in the on-premises storage business, but our solutions are hybrid cloud storage.

So, as somebody who knows the file business pretty well, let's talk about the business, file and sort of where it's headed. There are some megatrends and dislocations. There's obviously software-defined; you guys made a big investment in software-defined a year and a half, two years ago. There's cloud; Amazon with S3 sort of shook up the world. At first it was sort of small, but now it's really catching on, and object obviously fits in there. What do you see as the future of file?

That's a great question, right? When it comes to the data layer, it's really either block, file, or object. Software-defined and cloud are different ways of consuming storage. If you're a large service provider, you would prefer a software-based solution so you can run it on your existing servers, or from whoever your preferred server vendor is. Depending on the organization's preferences, how concerned they are about security, and their performance needs, they may prefer to run some of the applications in the cloud.
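The age/type/ownership policy described above can be sketched as a small decision function. The thresholds and rule names here are hypothetical, made up for illustration; Spectrum Scale expresses real tiering rules in its own SQL-like ILM policy language, not in Python:

```python
# Toy sketch of an age/type/owner tiering policy, in the spirit of
# transparent cloud tiering. All thresholds and rules are hypothetical.
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    age_days: int
    suffix: str
    owner: str

COLD_AGE_DAYS = 90                     # hypothetical age threshold
ARCHIVE_SUFFIXES = {".log", ".bak"}    # hypothetical "type of data" rule
PINNED_OWNERS = {"analytics"}          # hypothetical owner exemption

def tier_to_cloud(f: FileInfo) -> bool:
    if f.owner in PINNED_OWNERS:       # owner rule: keep this team on-prem
        return False
    if f.suffix in ARCHIVE_SUFFIXES:   # type rule: tier regardless of age
        return True
    return f.age_days > COLD_AGE_DAYS  # age rule: tier cold data

print(tier_to_cloud(FileInfo("q1.csv", 200, ".csv", "finance")))    # True
print(tier_to_cloud(FileInfo("hot.csv", 5, ".csv", "finance")))     # False
print(tier_to_cloud(FileInfo("old.csv", 200, ".csv", "analytics"))) # False
```

The "transparent" half of the feature is the reverse path: a read of a tiered file triggers an automatic recall, which this sketch does not model.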
These are different ways of consuming storage. But coming back to file and object, right? Object is perfect if you are not going to modify the data. You're done writing the data and you're not going to change it; it just belongs in object storage. It's more scalable storage. I say scalable because file systems are hierarchical in nature. Because it's a file system tree, you have to traverse the various subdirectories, and beyond a few million subdirectories it slows you down, right? But file systems have a strength. When you want to modify a file, any application that is going to edit the file belongs on file storage, not on object. But let's say you're dealing with medical images. You're not going to modify an X-ray once it's done. That's better suited to object storage. So file storage will always have a place. Take video editing: all these videos we're doing, video production, they do a lot of video editing. That belongs on file storage, not on object. If you care about file modifications and file performance, file is your answer. But if you're done and you just want to archive it, and you want scalable storage, billions of objects, then object is the answer. Now, either of these can be software-based storage or an appliance. That's, again, an organization's preference. If you want an integrated, robust, ready-made solution, then an appliance is the answer. If you're a large organization with a lot of storage administrators who can bring something up on their own, then software-based is the answer. I think most vendors will give you a choice.

What brought you to IBM? You used to be at NetApp. You know, IBM's buying The Weather Company, Dell's buying EMC. What attracted you to IBM?

Storage is the foundation for everything else, but it's really about data and it's really about making sense of it, right?
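The hierarchical-versus-flat distinction drawn above can be sketched in a few lines. This is a toy model, not a file system: a path lookup walks one dictionary per directory level, while a flat object namespace resolves the whole key in a single hash lookup:

```python
# Toy model of hierarchical vs. flat namespaces (not a real file system):
# a path lookup costs one hop per directory level; a flat object store
# resolves the full key in a single hash lookup regardless of "depth".

def make_tree(depth: int) -> dict:
    """Build a nested-dict 'directory tree' with one file at the bottom."""
    tree: dict = {"file.dat": b"x"}
    for _ in range(depth):
        tree = {"sub": tree}
    return tree

def path_lookup(tree: dict, path: str) -> bytes:
    """Traverse the tree one component at a time, like a path resolver."""
    node = tree
    for component in path.split("/"):   # one step per directory level
        node = node[component]
    return node

depth = 5
tree = make_tree(depth)
path = "/".join(["sub"] * depth + ["file.dat"])

# Flat namespace: the entire path is just a key in one big table.
flat = {path: b"x"}

print(path_lookup(tree, path))   # b'x' after six dictionary hops
print(flat[path])                # b'x' in a single hop
```

In a real file system each "hop" is a directory operation with metadata and locking costs, which is why very deep or very wide trees slow down while flat key lookups do not.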
And everybody's saying data is the new oil, right? IBM is probably the only company I can think of which has the tools and the IP to actually make sense of all of this. NetApp was great in the early 2000s, but even as a storage foundation they have issues with scale-out, true scale-out, not just a single namespace. EMC is a pure storage company. In the future, it's all about, you know, the reason we're here at this conference: analyzing the data. What tools do you have to make sense of it? That's where machine learning and deep learning come in, and Watson is very well known for that. IBM has the IP and the right research going on behind it, and I think storage will make more sense here. And also, IBM is doing the right thing by investing almost a billion dollars in software-defined storage. They were one of the first companies who did not hesitate to take the software from their integrated systems, for example XIV, and make it available as software only. We did the same thing with Storwize: we took the software out of it and made it available as Spectrum Virtualize. Some other vendors say, I can't do that, I'm going to lose all my margins. We didn't hesitate; we made it available as software, because we believe that's an important need for our customers.

So the vision of the company, cognitive, the halo effect of that business, that's the future, and it's going to bring a lot of storage action, is sort of the premise there. Excellent. Well, Chandra, thanks very much for coming on theCUBE. It was great to have you, and good luck attacking the big data world.

Thank you, Dave, thanks for having me.

You're welcome. Keep it right there, everybody, we'll be back with our next guest.
We're live from Munich. This is DataWorks 2017; we'll be right back.