 Live from Galvanize, San Francisco. Extracting signal from the noise. It's theCUBE, covering the Apache Spark community event, brought to you by IBM. Now your hosts, John Furrier and George Gilbert. Okay, welcome back everyone. We are here live in San Francisco, winding down the day here. What a day, big day today. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host George Gilbert, our big data analyst at Wikibon. And we're here on the ground in San Francisco, live for the IBM special edition of theCUBE for the Spark community event, in conjunction with Spark Summit, where all the action is going on. We're here with all the thought leaders. And now the mainstream is looking at Spark as that next innovation, that disruptive enabler that's creating a lot of opportunity. And our next guest to break it down for us is Paco Nathan, training director at Databricks, O'Reilly Media, a consultant in the trenches, been involved in data for decades. And now we're in a new era of data. Welcome to theCUBE. Thank you, John. Thank you, George. Great, so you had a panel up there in front of the live audience downstairs, here at the Galvanize workspace — incubator, education space, however it evolves. Either way, developers are everywhere. You see startups here. IBM announcing their support for Spark throws a huge endorsement to the community, and also to customers, as a way to telegraph what's next. And this has implications. I want to get your thoughts. First question: you've been involved with Spark for a long time. You've been there from the beginning. You've been seeing it evolve. What's going on? What's happening? Why is IBM doing this? Well, it's really clear when we take a look at the use cases out there in the field that Spark is a game changer.
I mean, it's allowing more of the business customers inside of an organization to get hands-on with the data, get something useful right out of it. And at the end of the day, that's what you need. So I've got to ask you — you know, being 30 years in the enterprise business, the software business, and an entrepreneur — I've seen the waves come and go. Client-server created a great gravy train. Certainly TCP/IP was a disruptive enabler back in the day, created networking and all that wealth. Now, the enterprises have been consolidating over the past two decades: cut costs, cut down to the bone, outsource everything. And all of a sudden, with cloud, just in the past five years or so, now they have to invest. So all of a sudden you're anemic and now you've got to be explosively strong, like Superman. It's hard. So I want to get your take, as someone who's out talking to customers. This is a challenge, because some enterprises are like, okay, I'm lean, mean, operational, I've cut costs, we're keeping the lights on — and now they're asked to invest more. Then on the other end, you've got financial services, the big spenders; they've been doing it for a long time, they still invest. But that enterprise dilemma is, what do I do? What do I do now with this new innovation? What's the roadmap? So I think a lot of the Spark adoption that I've seen is about mitigating risk. I mean, we were cutting back in IT because these projects are enormous, they're huge, they cost a lot of money, and they don't always pay off. The thing that's different with Spark is you can get right in, get your hands on the data, and prove what you need to prove. And when it's time to scale it up, it's very simple to scale it up. So there's a lot less risk. Instead of saying, I'm gonna embark on this path and it's gonna take me nine months before I know yea or nay, you can get that answer back in nine minutes. And I think that's the game changer.
This whole Spark thing in cloud, mobile, social — the innovation over the past five years has been really exciting. Reminds me of the Steve Jobs 1984 commercial, where the drones are walking — this was against IBM, ironically, at the time — walking off the cliff. And that is essentially what the enterprise has been like. But now, all of a sudden, the people close to the action, the people in the field, the data scientists — who in the past have been hindered from innovating because of the hurdles, the friction of going back into the data warehousing, the business intelligence systems, which are all fenced off behind the schema — it's a train wreck in terms of innovation, it's friction, it's restrictive. So they say, I'm gonna go home. But now, with the agile nature of Spark, people close to the action can be creative. The hackathons are booming. They had 28,000 at IBM. They had one this weekend. And the innovation is just amazing. So I wanted to get your perspective. That excites people, because now people can feel, I can contribute — the Moneyball thing we kick around all the time with the baseball analogy. But what does that really mean? What's going on in the field relative to this new dynamic of creativity with data? Sure, well, I mean, one thing is there's a lot of data that we're just surrounded by. Take a look — like flying in here into the airport, I'm on some large airliner, you know the names. So here's a nice statistic: you take a look at GE and the other vendors who provide those turbines. There's one sensor just to watch the bearings on those turbines. And currently, for all the commercial aircraft in flight, it generates 12 exabytes a day. And there's no way that we can just take and store that. I actually heard a conversation from a particular vendor about, can you help us put that in the cloud? Are you saying it's one exabyte just from the sensors in all the engines? Twelve.
Just for the bearings? Just for the bearings. And that's just for commercial aircraft. Wow. If we start thinking about trains and trucks and autos and all this, we're swimming in data, and most of it we're ignoring. But there's a lot of things we could be doing really well, you know — especially with transportation; that's where I started out. Yeah, I mean the cost savings are in the billions. Yeah. In the enterprise you see hundreds of millions; when you get to transportation, whether it's railroads or other transportation, it's in the billions. Paco, I want to get your thoughts on the creative thing. I love this creative thing because it reminds me of that old movie — I can date myself with this — Contact, with Jodie Foster. That little data point that's an outlier can be opened up now, and innovation can come from that. So we're seeing data science now explore these new capabilities that were once kind of like, well, I've got to provision a data center. What is the impact on the customer? What's the mindset of people you've worked with that have been successful, looking at everything in real time — maybe it's passive data and active data in real time — but when those outliers are out there, those are the new potential signals that could shift the company's trajectory? Sure, well, you know, for a long time, when we wanted to go and do some kind of modeling — maybe customer segmentation for marketing — the thinking was, you take a sample of your data and make your models. We're past that now. Number one, the data's too large to really sample it effectively. So it's good to be able to do your training at scale. If you have 100 million customers, why not work with the data for all 100 million customers directly, and come up with segments — not just two or three, but maybe you need 10 or 20 or 100. Spark is opening up the ability to get in and treat the data directly. And what are the consequences? I want to get at that, because that was a great example.
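Paco's segmentation point — fitting many clusters over the whole customer base rather than a sample — can be sketched at toy scale. This is a minimal, hypothetical illustration in plain Python (in Spark you'd reach for MLlib's k-means to do the same thing at cluster scale); the two-feature customer data here is invented for the example.

```python
import random

def kmeans(points, k, rounds=20, seed=7):
    """Toy k-means over 2-D points: assign each point to its nearest
    centroid, then recompute centroids, repeated for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                  (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # recompute each centroid as the mean of its members
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers described by (monthly spend, visits per month):
# one low-spend group and one high-spend group.
rng = random.Random(1)
customers = ([(rng.gauss(20, 5), rng.gauss(2, 1)) for _ in range(100)] +
             [(rng.gauss(200, 30), rng.gauss(10, 2)) for _ in range(100)])
centroids, clusters = kmeans(customers, k=2)
```

The same shape of code scales to 10, 20, or 100 segments by changing `k` — which is Paco's point: once training runs over the full data set, the number of segments is a dial, not a redesign.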
So, in order to do that 10 years ago — compare and contrast that scenario. Because back then you'd have to do some panels, you'd have to do some statistical calculations based on the correlation. Now you're saying essentially everything's instrumentable — why not use that data? Is that what you're saying, did I get that right? I want to go directly to it. So compare and contrast: what would it cost, order of magnitude — you know the number, but close enough — 10 years ago versus today? Well, you know, it's— Speed as well as deployment. It's time. I think deployment — you really hit it on the head there. It's mostly about time. It's the opportunity cost, and you take a look at how the likelihood of it working trails off over time. If you've got to write a PRD and an ERD and it's going to be six months before you get your first confirmation, who knows if it's going to work. But— Depending on the scope, that number can vary. Sure, for the enterprise. You said something interesting earlier about risk mitigation in the apps, in the use cases — that rather than the traditional big bang of the old systems of record, you sort of inventory your data, you inventory the functionality that you want to go after, and carve off a small piece. Today, though, you still need to be a data scientist, you know, at the programming level, or a data developer at the data modeling level. Are notebooks or GUI tools going to empower more people to at least get the process started, or to help with the process? I really believe that notebooks are as fundamental a change as — I'm going to date myself — back when we saw spreadsheets being introduced. I mean, I was a programmer working in finance. I worked at Lotus, had a summer job there. The last year we were bigger than Microsoft in revenues — '86. It was such a game changer. And I think we're seeing this now with notebooks, because it speaks to the collaboration of teams.
It speaks to people who aren't necessarily programmers collaborating with other people who are. So, you know, when I think about putting together a data science team, I know that I've got to have somebody working on the cluster — IBM is really good at doing those clusters, by the way. Somebody's writing some code, but other people are working at the analysis level — you know, the business analysts, the data scientists. There are other people who are stakeholders, right? They represent the business. If you're in finance, you have to know the regulations. You have to know how to play the game. Same thing if you're in pharma, same thing if you're in transportation. You have to have that business knowledge as well. So you can't just imagine that all these different roles are gonna sit there in their IDEs and write code, and it's gonna magically work. Notebooks, on the other hand, being there in a browser — I think it allows a wider range of people to engage, but it's that team context, really contextualizing the business problem. That's what it solves. And also separation from the infrastructure. By decoupling that, it takes the provisioning out of it. It's also the speed. No one feels like they're moving mountains to get something provisioned if it doesn't work out. So, like, I think that whole infrastructure separation — which developers like, because they didn't want to deal with the infrastructure guys. I mean, it's possible you could try to move mountains, but the fact is, this is more likely to work. So there's a likelihood that people will go after tougher problems then. I just wanna circle back to — well, there's one thing that John said, where the notebooks, since they're developer oriented, are really like PaaS — you know, IBM Bluemix would serve up this PaaS capability, potentially. But I also — or I guess you could look at it as SaaS, you know, for data science as well.
But I wanted to ask if you could help us visualize the notebooks a little bit — you know, what would the data modeler or data developer do, and how would they collaborate with the data scientist and the business analyst? Well, you know, if I'm building some kind of a data product, if I've got a team that's got a pipeline they're responsible for, or some set of apps like that, there's always the work to pull in from different data sources. You know, you've got some big data, you've got some structured data, you've got some metadata to put it together, et cetera. There's always gonna be a lot of ETL. You're gonna have to clean up that data. You listen to DJ Patil, who's now the chief data scientist, right? First time in the history of the government — a data scientist in the White House. Cheers. DJ Patil, yes. You know, and he talks about a Pareto ratio, right? The 80/20 rule: you spend 80% of your time and your money cleaning up the data, over and over and over. And so that's where these notebooks really simplify that, in how people bring it together. But then once you have your data, then you can start to explore. So you're saying that they can help in the ETL process. Exactly. Because they visualize it, and they can even apply some machine learning techniques. The nice thing with the notebooks is you've got these different cells. It's basically like a big stack of cells. And a cell could be a piece of code, or it could be some documentation — or not even documentation, it could be HTML that embeds video, maybe a YouTube clip. So you've got your code that's pointing out to your data, but you've also got some notes or some instructions, some explanation. And then you've got other cells that are the results. So it could be that your code is generating a dashboard for the execs to look at. And it's right there, it's right in context.
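The 80/20 cleanup Paco describes is the kind of thing that lives in a single notebook cell. Here is a small, hypothetical sketch in plain Python — the record fields, date formats, and values are all invented for illustration; in a Spark notebook the same per-record logic would be mapped over an RDD or DataFrame instead of a list.

```python
from datetime import datetime

# Hypothetical raw records pulled from different sources -- note the
# inconsistent whitespace, date formats, and missing values.
raw_rows = [
    {"name": "  Alice ", "signup": "2015-06-01", "spend": "120.50"},
    {"name": "Bob",      "signup": "06/02/2015", "spend": ""},
    {"name": "",         "signup": "2015-06-03", "spend": "80"},
]

def clean(row):
    """Normalize one record; return None if it's unusable."""
    name = row["name"].strip()
    if not name:
        return None                       # drop rows with no usable name
    signup = None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # tolerate two common date formats
        try:
            signup = datetime.strptime(row["signup"], fmt).date()
            break
        except ValueError:
            pass
    spend = float(row["spend"]) if row["spend"].strip() else 0.0
    return {"name": name, "signup": signup, "spend": spend}

cleaned = [r for r in map(clean, raw_rows) if r is not None]
# cleaned keeps Alice and Bob; the unnamed third row is dropped
```

The value of doing this in a notebook is exactly what's described above: the cleanup code, a note explaining why the rules are what they are, and a peek at the resulting rows all sit in adjacent cells, so the whole team can see and challenge the decisions.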
This is — I guess I'd missed this, but having been at a spreadsheet company, this was the dream of the visual development environment. Exactly, yeah. I'm not gonna ask why it took us 20 more years to get there. But let me back you up to where you talked about the ETL side. So that might be the data modeler who's doing some visualization. Once he's got a view of the data that makes sense to him, what does he hand off to the data scientist, and what does it look like to them? So downstream, the data scientists may be doing some unsupervised learning to take a look at what's really the structure — get in and do some complex visualizations, some clustering, look at the clusters, more than what you would typically do as an analyst. But really downstream from there, you typically wanna do predictive analytics, right — modeling. And so that's where the machine learning tools come in. And the nice thing is, with Spark and these notebooks, so far I've described maybe 10 lines of code. I mean, with the machine learning models, you can get in, set up your feature vectors, and run the model in a couple of lines of code, maybe another three lines of code to evaluate the results. And so you take a look at having a pretty broad range of people across these teams, hands on, all in the different cells in the notebook, and you just run the whole thing and you get your results. It's not a lot of code. I mean, you have the data sets available. It's almost like the way SQL was in the 80s. You know, it's a really simple way to extract data. You don't have to be a super-duper badass backend developer, a C programmer. That's a really good point. Actually, what we're doing with Spark — the new innovation coming out of 1.3, and now further in 1.4 — is about DataFrames. And the idea is that as we move up to higher levels of abstraction, we use DataFrames, which means you're writing SQL. And by virtue of having that, Spark can do smarter things with optimizing the code under the hood.
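The DataFrame idea — write a declarative query and let the engine choose how to execute it — can be illustrated without a cluster. This sketch uses Python's built-in sqlite3 purely as a stand-in for Spark SQL (in Spark 1.3+ you would register a DataFrame and run a query like this through the SQL context instead); the table and figures are hypothetical.

```python
import sqlite3

# Stand-in for a DataFrame: a small table of invented purchase events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("alice", "purchase", 30.0),
    ("alice", "purchase", 45.0),
    ("bob",   "view",      0.0),
    ("bob",   "purchase", 12.5),
])

# A declarative query: we state WHAT we want; the engine, not the
# programmer, decides the execution plan -- Paco's point about Spark
# "doing smarter things with optimizing the code under the hood."
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events "
    "WHERE action = 'purchase' GROUP BY user ORDER BY user"
).fetchall()
# rows == [('alice', 75.0), ('bob', 12.5)]
```

That's the whole appeal: a few lines of SQL replace hand-written aggregation loops, and the optimizer is free to reorder filters and aggregations however is fastest.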
But it's flexible as well. It's not just SQL, in the sense of its ease. There are also some other options, right? Exactly. What are some of those things, and what use cases? You can get into it with functional programming — at the end of the day, there's functional programming below, and there are great reasons for that; I won't go off the deep end. You can also get into it with R, because R fits this very well. A lot of people know how to use R — great visualizations — Python, et cetera. A lot of the kids coming out of college these days all have R under their belt. That's what's in academia right now. So that's a nice job market. I mean, here — I'm hearing some of the things that IBM is doing. There's an immersion program at Galvanize. People make 120 grand a year coming out of, like, a three-month program. Great program. I'm an academic advisor for Galvanize. You are? I'll put in a plug. Yes, excellent. I'm very impressed. There's also a 12-month master's program. Great, definitely. And not just San Francisco, but also Seattle and Denver as well. Yeah, I love it. So, going back to re-platforming — the way IBM re-platformed the database, WebSphere, Notes, all on top of Linux, essentially re-hosting and establishing Linux as a key enterprise platform. We understood from some of the folks we interviewed today that they're eventually bringing the SQL query optimizer, in addition to machine learning and streaming — all richer personalities on the Spark core. Should we expect to see IBM start building their own notebooks? Or is that something that the Databricks guys are uniquely qualified for, because they own the stack all the way down? How should we think about that? It's interesting. So this is a big value prop for Databricks in terms of doing cloud-based notebooks. There's the Databricks Cloud offering.
And a lot of that is — actually, I should back up and say the Spark that you're using in Databricks Cloud is 100% open source, what you get from Apache. It's actually the nuts and bolts of managing the cloud that's a lot of the hard part that Databricks is providing as a service. And that's where it's not just Linux — over the top of Linux these days, more and more, we're using containers. So doing something smart with containers in a cluster solution, that's a lot of the value it brings. So that's the operational value. Exactly. How to really cut the risk of launching a new app without having to hire a small army of ops people. But are they gonna open source their tools? I don't know if they've done it, or philosophically how they're thinking about it. Because operationally, making it simple to run a cluster — there's clearly value in that, and that's not open source. But would they keep the development tools the same? Well, the nice thing is that, I mean, it's 100% open source Spark, and then over the top of that you've got Python and these great libraries like NumPy and SciPy that you can use — those are all open, all the different libraries there. You can bring in SQL, everything. So as you move up the stack, it's much more open. The nuts and bolts of running the clusters, that's more proprietary, and I don't think that necessarily translates, but... So, Paco, we're getting tight on time, since it's been a long day here. I wanna get your take on what you think is gonna happen next. Let's speculate, because connecting the dots is my favorite part of the interview. IBM comes in; Spark's been kicking butt, we've been seeing it. We were there at the first AMPLab event, and then for the first Spark Summit — we were there for the first one, small. And it gets bigger every year. But is this the second year or third year? Third year, and it's doubled in size.
So we've been here for all three years, just on the ground, kinda kicking around with Cloudera and those guys. What's next? IBM has basically shone a spotlight, a global spotlight, on Spark — something that probably no one's ever heard of outside of Silicon Valley and the geek community. This is gonna open up a huge amount of visibility. What happens next? I mean, obviously there's gonna be a huge amount of interest. People are gonna be flooded with questions, events are gonna happen, the community's gonna grow. What's your take on this? So I think the key phrase there is beyond Silicon Valley, outside of Silicon Valley. You know, there are a lot of great companies here that have specialized in ad tech and social networks and these things we talk about in Silicon Valley all the time — fraud is probably in there somewhere, e-commerce. But when you get outside of Silicon Valley, there are real businesses doing real things. Big and small, too. Big and small, absolutely, yeah. Mom-and-pop all the way to enterprise. This is not just for big companies, right? Spark can work great. Let's go to Atlanta, and you have transportation, you have package delivery, shall we say. You have large beverage manufacturers. You have all kinds of things. The whole world — this is for all businesses. Exactly. This is for all businesses. I think that's what this signals: it's not just about Silicon Valley and what we tend to focus on here. Yeah, the other thing that Rod Smith from IBM brought up — great guy, it was a fantastic interview, he has historical perspective — he said something that was interesting to me: custom analytics. This enables custom analytics; that was kind of his word, not IBM's messaging, but it opens up the thing: okay, with verticals you have domain expertise. A critical piece of these vertical markets, where big data has been thriving with analytics, is people who are close to the action and have expertise building things, whether it's ontologies or other algorithms.
So now, with IBM on this, it's going to be a vertical expansion. This is going to create a custom analytics market, in our opinion. So what do you think about that? I mean, how do I say, I want some custom analytics? I'm an enterprise, I'm a transportation person. I have an IT department, I read the blogs, I read the New York Times. I'm not in the inside baseball. What do I do? I'm an enterprise. How do I get involved? What's my next step? Well, there are a lot of hard problems in the enterprise: supply chain, maintenance schedules, all kinds of routing problems. Spark works wonderfully for this, and that's what I've been going out and doing as part of my training — showing a lot of these kinds of industrial apps. Whether you've got a railroad, or whether you've got a bioinformatics problem, some pharma thing — these are hard problems, and this is what the world runs on. And they're being solved with Spark. Some people are actually tackling these things with these new capabilities. And that's where I think the custom analytics comes in, because for a long time we've had the PhD students focused on making some better social network tweak. But now let's go after real problems. And with machine learning, you can actually get into some of these things where you had needed expertise in the past. It opens it up a little bit. Okay, so I'm going to put you on the spot — final question. You don't have to name names, but just kind of give us a range, an order of magnitude, kind of like which solar system we're in. The biggest couple of examples of huge dollar savings from companies that have deployed good big data solutions — companies that have been full of data, or "data-full" if you will, as I say. It's like: I'm full of data. I'm not a tool, I'm not an apparatus or software like some of the vendors in the industry. But I'm a company, I have a lot of data, I've done something, and it's been a big money saver. Can you, like — is it a hundred million?
What are the numbers you've seen that have been the big numbers? You don't have to name names, just quote an order of magnitude — like, company X saved a billion dollars. I mean, I talked to United Airlines in an interview with GE once, and the guy said, we save over a billion dollars on fuel transport costs alone with the internet of things. And that was the highest number I'd heard. As far as dollar values, I stay more towards the use cases where they're talking about the data rates, so I don't know that I could really quote the dollar values quite as much. I can do that as homework. But I do see some pretty big cases. But there are some transformative infrastructure things you've seen? I think some of the biggest is gonna be in, like, cancer research and what we're seeing in medicine, where genomics is a game changer. And yeah, that is billions and billions of dollars. Either way, we're totally in a transformative market — you would agree. Absolutely. Bubble? No. I mean, the transformation is legit. These are real businesses. You can't say it's a bubble when actual money is being saved. I was just talking to the CEO of DocuSign, and he talked to a customer that saved $280 million on one process worldwide. Savings — guaranteed, quantified savings. And I think that's kind of what I'm seeing across the board. That seems to be the norm: for the big companies, hundreds of millions of dollars. And it cuts across so many different verticals — like I say, energy, health, on and on and on. Okay, we're getting the hook here, but I want to get your final thoughts for the folks that are watching this interview that aren't here, that didn't experience it in the moment like we did, here live all day, talking to all the folks. What is this announcement all about? What is going on in Spark? Why is this event so huge? Why is IBM, the community action, everything orbiting around Spark a big deal?
Well, you know, Spark really makes it a lot simpler and this is work that needs to be done. We need to be smarter about how we use resources. We need to be smarter about how we answer to the bottom line and everybody in business has this and Spark is making it a lot easier than the tooling that we had before. Paco Nathan here inside theCUBE. Thanks so much for your time and your insight for this big data conversation. This is theCUBE. Thanks for watching. We'll be right back after this short break.