Live from Midtown Manhattan, it's theCUBE, covering Big Data New York City 2017, brought to you by SiliconANGLE Media and its ecosystem sponsors.

Hello everyone, welcome back to our CUBE coverage here in New York City, live in Manhattan for theCUBE's coverage of Big Data NYC. This is our event, five years in a row, and our eighth year covering big data: Hadoop World originally in 2010, then Strata Hadoop, now the Strata Data Conference. In conjunction with that event we have our own Big Data NYC event, SiliconANGLE Media's CUBE. I'm John Furrier, co-host with Jim Kobielus, big data analyst at Wikibon.com. Our next guest is Gus Horn, global big data analytics and CTO ambassador for NetApp, a machine learning and AI guru who talks all around the world. Great to have you, thanks for coming in and spending the time with us.

Thanks John, appreciate it.

So we were talking before the camera came on. You're doing a lot of jet-setting, really evangelizing, but also educating a lot of folks on the impact of machine learning and AI. Obviously we love AI, we love the hype, and it motivates young kids getting into software development; computer science makes it kind of real for them. But there's still a long way to go in terms of what AI really is, and that's good. So what is really going on with AI? Machine learning is where the rubber hits the road, that seems to be the hot area, and that's your wheelhouse. Give us the update: where is AI now? Obviously machine learning is super important, it's one of the hot topics here in New York City.

Well, I think it's super important globally, and it's going to be disruptive. As I said before we started, this is going to be a disruptive technology for all of society. But regardless of that, what machine learning brings is a methodology to deal with this influx of IoT data, whether it's autonomous vehicles, active safety in cars, or even predictive analytics for complex manufacturing processes like an automotive assembly line. Can I predict when a welding machine is going to break, and can I take care of it during a scheduled maintenance cycle so I don't take the whole line down? Because the impacts are really cascading and dramatic when you have a failure that you couldn't predict. What we're finding is that Hadoop and the big data space are uniquely positioned to help solve these problems, both from quality control and process management, and how you can get better uptime and better quality. Then we take it full circle: how can I build an environment to help automotive manufacturers do test and dev, and retest and retraining and learning of the AI modules and the AI engines that have to exist in these autonomous vehicles? The only way you can do that is with data, and managing data like a data steward, which is what we do at NetApp. So for us it's not just about the solution; the underlying architecture is going to be absolutely critical in setting up the agility and flexibility you'll need in this environment, because the other thing that's happening in this space right now is that technology is evolving very quickly. You see this with the DGX from NVIDIA, you see the P100 cards from NVIDIA. I have an architecture in Germany right now where we have multiple NVIDIA cards in a Hadoop cluster we've architected, but I don't make NVIDIA cards. I don't make servers. I make really good storage. And I have an ecosystem that helps manage where that data is when it needs to be there, and especially when it doesn't need to be there, so we can get new data.
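To make the welding-machine example above concrete, here is a minimal sketch of the kind of predictive-maintenance model Gus describes. The sensor features, labels, and threshold are all hypothetical illustrations, not any NetApp or customer system; in practice the model would be retrained as new labeled failures arrive, which is the retraining loop he mentions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# hypothetical per-welder sensor features: vibration RMS, weld-current drift, tip temperature
X = rng.normal(size=(n, 3))
# synthetic ground truth: failures become likely when vibration and temperature drift together
logits = 1.8 * X[:, 0] + 1.2 * X[:, 2] - 2.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# score the fleet: probability of failing before the next scheduled maintenance window
p_fail = model.predict_proba(X_test)[:, 1]
flagged = np.where(p_fail > 0.7)[0]
print(f"{len(flagged)} of {len(p_fail)} machines flagged for service at the next maintenance window")
```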
Yeah, and we were also talking before the cameras, for the folks watching, about how you were involved with AI going way back to your days at MIT. That's super important, because there's a pattern we're seeing across all the events we go to — and we'll be at the NetApp event, Insight, next week in Vegas. The pattern is pretty clear. You have one camp saying, oh, AI is just the same thing that was going on in the late 70s, 80s, and 90s, but now it has a new dynamic with the cloud. A lot of people are saying, okay, there have been concepts developed in AI and computer science, but now with the evolution of hyperconverged infrastructure, with cloud computing, with a new architecture, it seems to be turbocharging and accelerating. So I'd like to get your thoughts on tying that in. Why is it so hot now? Obviously machine learning — everyone should be on that, no doubt. You've got the dynamic of the cloud, and NetApp's in the storage business, so that stores data, I get that. What's the dynamic with the cloud? Because that seems to be the accelerant right now, along with open source and AI.

Yeah, I think you've got to stay focused. The cloud is going to play an integral role in everything. What we do at NetApp as a data steward — and as George Kurian, our CEO, said, data is the currency of today. It's fundamentally what drives business value. But there's one slight attribute I'd add to that: it's a perishable commodity. It has a certain value at T sub zero, when you first get it — and that's especially true when you're trying to do machine learning and you're trying to learn new events and new things — but it rapidly degrades and becomes less valuable. You still need to keep it, because it's historical, and if we forget historical data we're doomed to repeat mistakes. So you need to keep it, and you have to be a good steward of it, and that's where we come into play with our technologies, because we have a portfolio of products and management capabilities that move the data where it needs to be, whether you're in the cloud, near the cloud — like in an Equinix colo — or even on-prem. And the key attribute there, especially in automotive, is that they want to keep the data forever, because of liability, because of intellectual property and privacy concerns.

Let's double-click on that. Hold on, one quick question on this, because I think you bring up a good point. The perishability is interesting, because real time — batch and real time are the buzzwords in the industry now — but you're talking about something really important: the value of the data when you get it fast, in context, is huge, but then the historical piece, where you store it, also plays into the machine learning dynamics, how deep learning and machine learning have to use the historical perspective. So in a way it's perishable in the real-time piece, in the moment — if you're a self-driving car you want the data in milliseconds because it matters — but then the historical data comes back around. Is that where you're going with that?

Yeah, because of the way these systems operate. The paradigm is deep learning: you want them to learn the way a human learns, right? The only reason we walk on our feet is because we fell down a lot. But we remember falling down, and we remember how we got up and could walk. So if you don't have that historical context, you're just always falling down. You have to have that to build up the proper machine learning neural network, the kinds of connections you need to do the right things. And then as you get new data and varieties of data — and I'll stick with automotive, because it can almost be thought of as an intractable amount of data, because most people keep cars for periods measured in decades. The quality of the cars is incredible now, and they're all loaded with sensors: high-definition cameras, radar, GPS tracking. And you want to make sure you get improvements there, because you have liability issues coming as well with these same technologies.
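One hedged way to picture "perishable, but still worth keeping": weight training samples by their age so that fresh events dominate the model while history is retained rather than discarded. The decay function and half-life below are illustrative assumptions, not anything NetApp prescribes.

```python
import numpy as np

def recency_weights(age_days: np.ndarray, half_life_days: float = 30.0) -> np.ndarray:
    """Exponentially down-weight older samples; the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

# example: events observed today, a month ago, and a year ago
ages = np.array([0.0, 30.0, 365.0])
print(recency_weights(ages))   # -> [1.0, 0.5, ~0.0002]

# these weights can be passed to most scikit-learn estimators, e.g.
# model.fit(X, y, sample_weight=recency_weights(ages))
```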
Yeah, so when we talk about the perishability of the data, that's a given. What is less perishable, it seems to me at Wikibon, is what you derive from the data — the correlations, the patterns, the predictive models, the meat of machine learning and deep learning and AI in general — in the sense that it has a validity over time. What are your thoughts at NetApp about how those data-derived assets should be stored, managed for backup and recovery, and protected? To what extent do those requirements need to be reflected in your storage retention policies if you're an enterprise doing this?

That's a great question. I think what we find is that first landing zone — everybody talks about it being the cloud, and for me it's a cloudy day, although in New York today it's not. There are lots of clouds, and there are lots of other things that come with that data, like GDPR and privacy: what are you allowed to store, what are you allowed to keep, and how do you distinguish one from the other? That's one part, but then you're going to have to ETL it, you're going to have to transform that data, because like everything there's a lot of noise, and the noise is fundamentally not that important. It's the anomalies within the stream of noise that you need to capture and then use as your training data, so that you learn from it. So there's a lot of processing that's going to have to happen in the cloud, regardless of which cloud, and it has to be kind of ubiquitous in every cloud. And then from there you've decided: how am I going to curate the data and move it, and then how am I going to monetize the data? Because that's another part of the equation — what can I monetize?
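A minimal sketch of the "anomalies within the stream of noise" idea: a rolling z-score filter that keeps only the readings that deviate sharply from recent history as candidate training examples. The window size and threshold are arbitrary illustrations.

```python
import numpy as np

def anomalies(stream: np.ndarray, window: int = 50, z_thresh: float = 4.0) -> np.ndarray:
    """Return indices whose value deviates from the trailing-window mean by > z_thresh sigmas."""
    idx = []
    for i in range(window, len(stream)):
        hist = stream[i - window:i]
        mu, sigma = hist.mean(), hist.std() + 1e-9   # avoid divide-by-zero on flat signals
        if abs(stream[i] - mu) / sigma > z_thresh:
            idx.append(i)
    return np.array(idx, dtype=int)

# mostly noise, with a few injected spikes standing in for the real events worth learning from
rng = np.random.default_rng(1)
signal = rng.normal(0.0, 1.0, 5000)
signal[[1200, 3100, 4700]] += 12.0
print(anomalies(signal))   # -> roughly [1200, 3100, 4700]
```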
Well, that's a question we hear a lot on theCUBE. On day one we were riffing on some of the concepts we see as challenges when we talk to enterprise customers, whether it's a CIO, a CDO — chief data officer — or a chief security officer. There's huge application development going on in the enterprise right now, open source is booming, huge security practices are being built up, and then there's this governance overlay on the data, which with IoT is kind of an architectural — I won't say reset, but a retrenching for a lot of enterprises. So the question I have for you, as a critical part of the infrastructure with storage — storage isn't going away, no doubt about that, but the architecture is changing — is: how are you advising your customers? What's your position when you come in to a CXO, give a talk, and they say, hey Gus, the house is on fire, we've got so much going on, bottom-line me, what's the architecture, what's best for me? But don't lose the headroom — I need headroom to grow, that's where I see some machine learning. What do I do?

I think you have to embrace the cloud, and that's one of the key attributes NetApp brings to the table. Our core software, our ONTAP software, is in the cloud now, and we want to make it very easy for our customers to both be in the cloud and be very protected in the cloud, with encryption and protection of the data, and also get the scale and all of the benefits of the cloud. But on top of that, we want to make it easy for them to move the data wherever they want it to be. So for us it's all about data mobility, and the fact that we want to become that data steward, that data engine, that helps them drive to where they get the best business value, because it's going to be in the cloud.

On-prem and in the cloud. Just for the record, you guys were, if not the earliest, one of the earliest in with AWS, back when it wasn't fashionable — I interviewed you guys on that many years ago.

Let me ask a related question. What is NetApp's position, or your personal thinking, on what data should be persisted closer to the edge in the new generation of IoT devices? IoT edge devices do inference, they do actuation and sensing, but they also do persistence now. Should any data be persisted in them long term as part of your overall storage strategy if you're an enterprise?
It could be. The question is durability, and what's the impact if for some reason that edge is damaged or destroyed, or there's data loss. A lot of times when we start talking about open source, one of the key attributes we always have to take into account is data durability, and traditionally it's been done through replication, which to me is a very inefficient way to do it. But you have to protect the data. It's like having 20 bucks in your wallet — you don't want to lose it. You might split it into two tens, but you still have 20. You want that durability, and if it has that intrinsic value you've got to take care of it and be a good steward of it. So if it's at the edge, that doesn't mean that's the only place it's going to be. It might be at the edge because you need it there — maybe you need what I call reflexive actions. It's like with a car: you have deep learning and machine learning and vision and GPS tracking and all these things there, and how it can stay in the lane and drive. But the sensors themselves, coming from Delphi and Bosch and ZF and all of these companies, also have to have the capability of being what I call a reflex. The reason we can blink and not get a stone in our eye is not because it went to our cerebral cortex; it's because it went to the nerve stem and triggered the blink. You have to do the same thing in a lot of these environments. Autonomous vehicles are one; it could be using facial recognition to restrict access at a gate — all of a sudden this guy is on a blacklist, do you stop him at the gate?

Before we get into some of the product questions I have for you — Hadoop in-place analytics, as well as some of the regulations around GDPR — we're still in the trend segment here. What are your thoughts on decentralization? You've seen a lot of decentralized apps coming out, you see blockchain getting a lot of traction. Obviously that's a tell sign, certainly in the headroom category of what may be coming down the road. It's not really on the agenda for most enterprises today, but it does indicate that the wave is coming for a lot more decentralization on top of distributed computing and storage. How do you look at that, as someone who's out on the cutting edge?

For me, it's just yet another industry trend you have to embrace. I'm constantly astonished at the people who try to push back on things that are coming, who think they're going to stop the train that's going to run them over. The key is how we can make even those trends better and more reliable, and do the right thing for our customers, because if we're the trusted advisor for our customers — regardless of whether or not I'm going to sell a lot of storage to them — I'm going to be the person they trust to give them good advice as things change. Because that's the one thing that's absolutely coming: change. And oftentimes, when you lock yourself into these quote-unquote commodity approaches, with a lot of internal storage, the counterpart is that you've also locked yourself in, probably for two to four years, to a technology you can't be agile with. This is one of the key attributes of the in-place analytics we do with our ONTAP product, and we also have our E-Series product, which has been around for six-plus years in this space as the de facto performance leader. By decoupling that storage — in some cases very little, it's still connected to the data node, and in other cases it's shared, like in an NFS share — that decoupling has enormous benefits from an agility perspective, and that's the key.

Well, that kind of ties in with the blockchain thing, it's kind of a tell sign. But you mentioned in-place analytics — that decoupling gives you a lot more cohesiveness, if you will, in each area, but tying them together is critical. How do you guys do that? What's the key feature? Because that's compelling for someone who wants agility. Certainly DevOps and infrastructure as code are going mainstream, you're seeing that now — cloud ops, whatever you want to call it, on-prem, off-prem, cloud ops is here, and this is a key part of it. What are the unique features of why that works so well?

Well, some of the unique features we have — I'll stick with the ONTAP product — one of the key things is the ability to have incredible speed with our AFF product, but we can also dedupe it, we can clone it and snapshot it, snapshotting it into, for example, NPS, NetApp Private Storage, which sits in Equinix. Now all of a sudden I can choose to go to Amazon, or I can go to Azure, I can go to Google, I can go to SoftLayer. It gives me options as a customer to use whoever's got the best computational engine, versus being stuck. I can do what's right for my business, and I also have a DR strategy that's quite elegant. But there's one really unique attribute, too, and that's the cloning. A lot of my big customers have thousand-plus-node traditional Hadoop clusters, and it's nearly impossible for them to set up a test/dev environment with production data without an enormous cost. But if I put it in my ONTAP, I can clone it — I can make hundreds of clones very efficiently. That gets the cost of ownership down, but more importantly it gets the sandboxes up and running faster, and the sandboxes are using true production data, so you don't have to worry about, oh, I didn't have that in my test set and now I have a bug.
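To picture why hundreds of clones can be cheap, here is a toy copy-on-write sketch. It is not how ONTAP or its cloning actually works internally — just the general idea that a clone references its parent's data and only stores what it changes, so spinning up many test/dev sandboxes over the same production data costs very little extra space.

```python
class Volume:
    """Toy copy-on-write volume: a clone shares the parent's blocks and
    stores only the blocks it has overwritten (illustrative sketch only)."""

    def __init__(self, blocks=None, parent=None):
        self._blocks = blocks or {}      # block_id -> data written directly to this volume
        self._parent = parent

    def read(self, block_id):
        if block_id in self._blocks:
            return self._blocks[block_id]
        return self._parent.read(block_id) if self._parent else None

    def write(self, block_id, data):
        self._blocks[block_id] = data    # copy-on-write: the parent is never modified

    def clone(self):
        return Volume(parent=self)       # near-zero cost: no data is copied


prod = Volume({i: f"prod-block-{i}" for i in range(100_000)})
sandboxes = [prod.clone() for _ in range(100)]   # 100 "test/dev" copies, almost no extra space
sandboxes[0].write(42, "experimental change")
print(sandboxes[0].read(42), "|", prod.read(42), "|", sandboxes[1].read(42))
```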
A lot of guys are losing budget because they just can't prove it, and they can't get it working — it's too clunky. All right, cool. I want to get one more thing in before we run out of time: the role of machine learning. We've talked about how important that is — algorithms are going to be here, they're going to be a big part of it. But as you look at that, the foundational policy and governance piece is huge. You're seeing GDPR, and I want to get your comments on the impact of GDPR, but in addition to GDPR there's going to be another Equifax coming — it's out there, it's inevitable. So as someone who's got code out there, writing algorithms using machine learning, I don't want to rewrite my code based on some new policy that might come in tomorrow. GDPR is the one we're seeing now, and you guys are really involved in it, but there might be another policy that forces me to change. How should a CXO think about that dynamic of not rewriting code when a new governance policy comes in — and then there's GDPR, obviously.

Well, I don't think you can be so rigid as to say you'll never rewrite code, but you want to build on what you have. How can I expand what I already have, as a product, let's say, to accommodate these changes? Because again, it's one of those trains you're not going to stop. GDPR is one of these disruptive regulations coming out of EMEA, but what we forget is that it has far-reaching implications even in the United States, because of their ability to reach into, basically, the company's pocket and fine them for violations. So what's the impact of GDPR on the big data ecosystem? It can potentially be huge. The key attribute is that you have to start when you're building your data lakes: when you're building these things, you always have to make sure you're taking into account anonymizing personally identifiable information, or obfuscating it in some way. But as with everything, you're only as strong as your weakest link, and this is again where NetApp plays a really powerful role, because in our storage products we can encrypt the data at rest at wire speed, so it's part of that chain. You have to make sure that all of the parts are doing that, because if you have data at rest in a drive — say, inside your server — it doesn't take a lot to beat the heck out of it and find the data that's in there if it's not encrypted.
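One hedged illustration of the "anonymize or obfuscate PII before it lands in the data lake" point: keyed hashing (pseudonymization) of direct identifiers using only Python's standard library. The field names and the secret key are made up for the example, and real GDPR compliance involves far more than a hash — consent, retention limits, the right to erasure — but it shows the shape of the transform applied at the landing zone.

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-store-in-a-vault"   # hypothetical key, managed outside the data lake

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash so records can still be joined
    without exposing the raw value. This is pseudonymization, not full anonymization."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"vin": "1HGCM82633A004352", "driver_email": "driver@example.com", "speed_kph": 87}
safe_record = {
    "vin": pseudonymize(record["vin"]),
    "driver_email": pseudonymize(record["driver_email"]),
    "speed_kph": record["speed_kph"],            # non-identifying telemetry kept as-is
}
print(safe_record)
```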
Yeah, let me ask a quick question before we wrap up. How does NetApp incorporate ML or AI into these kinds of protections that you offer to customers?

Well, for us, again, we're only as successful as our customers are, and what NetApp does as a company — we'll just call us the data stewards — is part of the puzzle, but you have to build a team to be successful. When I travel around the world, the only reason a customer is successful is because they did it as a team. Nobody does it on an island, nobody does it by themselves, although a lot of times they think they can. So it's not just us; it's the server vendors that work with us, it's the other layers that go on top — companies like Zaloni or BlueData and BlueTalon, people we've partnered with that are providing solutions to help drive this for our customers.

Gus, great to have you on theCUBE. Looking forward to next week — I know you're super busy at NetApp Insight, I know you've got like five major talks you're doing, but if we can get some time on theCUBE it'd be great. My final question is a personal one. We were talking about how you do search and rescue in Tahoe — cases like avalanches, a lost skier. A lot of enterprises feel lost right now, so you kind of come in, and I'll bring my dog — the avalanche is coming, the wave or whatever is coming. You've probably seen the situation; you don't need to name names, but talk about what someone should do if they're lost. When you come in, you do a lot of consulting. What's the best advice you can give someone? Because a lot of CXOs and CEOs, their heads are spinning right now; there's so much on the table, so much to do, and they've got to prioritize.

Yeah, it's a great question, and here's the one thing: don't try to boil the ocean. You've got to be hyper-focused. If you're not seeing a return on investment within 90 days of setting up your data lake, something's going wrong — either the scope of what you're trying to do is too large, or you haven't identified the use case that will give you an immediate ROI. There should be no hesitation about going down this path, but you've got to do it in a manner where you're tackling the biggest problems that have the best hit value for you, where the ETL goes into your systems of record, your enterprise data warehouses. You've got to get started, but you want to make sure you have measurable, tangible success within 90 days, and if you don't, you have to reset and say, okay, why is that not happening? Am I reinventing the wheel because my consultant said I have to write all this Sqoop and Flume code and get the data in?
Or should I maybe have chosen another company as a partner, one that's done this a thousand times, so it's not a science experiment? We've got to move away from science experiments to solving business problems.

Well, science experiments and boiling the ocean — don't try to overreach; build the foundational building blocks. The successful guys are the ones who are very disciplined, and they want to see results. Some call it baby steps, some say building blocks, but ultimately the foundation right now is critical. All right, Gus, thanks for coming on theCUBE — great to have you share, great conversation about machine learning's impact on organizations. TheCUBE, bringing you the data here, live in Manhattan. I'm John Furrier, with Jim Kobielus of Wikibon. More after this short break — we'll be right back.