Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert. Hey, welcome back everybody. Jeff Frick here with theCUBE with George Gilbert from Wikibon. We are live in Midtown Manhattan for day two of Spark Summit East. We went wall to wall yesterday. Today, I think we had 12 segments, George. You presented the Spark Summit forecast or Spark forecast, which was great. First time we opened that up. We had a customer panel, of course, a lot of great guests and we're really excited to kick off day two with CUBE alumni and always a great guest, Joel Horwitz from IBM, welcome. Thanks guys, thanks for having me here. It's great to see theCUBE at Spark Summit. It's awesome. Yeah, absolutely. So we're excited, and you've got a new role. So give us a quick update on, you know, you keep moving and grooving. New year, new role for me. I think that's kind of my trend. And so it's no different at IBM. I'm taking on a new role in business, corporate development. Clearly there's a lot of activity going on in this space where there's some very interesting companies coming out that we're looking at. It's really, it's really exciting. Yeah. That's good to say. And to compare and contrast, we were talking about, we're here at the Hilton Midtown, right? Which is where Hadoop has been for years, kind of ground central of big data. How about the evolution of the Spark community and how do you see that relative to kind of what we saw before with Hadoop? Yeah, you know, it was interesting because we actually, I was just talking about that last night with Rob Thomas. And I came here in this place when Hadoop, you know, Hadoop World was just getting started. In fact, it's almost like we were joking. It's like there's a progression of like the hotels you go through to get to the top. 
Now Hadoop is at Javits and I don't know what comes after that, maybe Vegas we'll see, but you know, you kind of like move, you know, move south, I guess, through Manhattan as you get bigger. So it's exciting to see that there's still a ton of excitement, so much so that they're able to fill this great space. Right. You know, it's really exciting to see them. It's still real early days and the thing I like about conferences when they're early and they're new is the amount of interaction, the amount of engagement amongst the community. People trading cards and really sharing information, it hasn't really got to be all vendor overflow yet and it is a real significant, you know, sharing information. We went to the meetup before the show kicked off. I think there was 500 people in there. Wow. Really engaged, kicked off by Matthew Hunt from Bloomberg who I guess is ground zero for Big Data Manhattan. Yeah. So really exciting stuff. And a guest this afternoon, that's right. Yeah, no, Matt's great. We're actually working with him and talking to him about what is going on. And as you mentioned, I mean, it's definitely still early days for Spark. That's exactly why we opened up the Spark Technology Center. You know, partly to help bring kind of a business solution oriented thinking to the Spark community. I think that's what IBM brings to the table. Probably more so than anyone else, frankly. And so that's exciting to kind of bring that perspective. When you compare Spark to, you know, say Linux or even going all the way back to System/360, I mean those, they're mature, right? Operating systems, 10 plus million lines of code. Spark is still, as you mentioned, very early in a lot of regards, you know, less than a million lines of code in total so far. And so, you know, clearly we're, you know, only around 10% of the way there if you're measuring it by code. Right, right. And so there's a lot, a lot of work to do. 
And as you mentioned, I mean, if you look over here, there's, you know, an arcade going on. I mean, that's, it's very much kind of like Silicon Valley East at the moment around here. It's pretty cool. But IBM has put a lot of investment into Spark. And for those that aren't paying attention or maybe this is their first exposure, you know, you guys did the event at Galvanize last year, you're doing the Datapaloozas. Talk about kind of IBM's investment into Spark, not only as a technology, but the community, and as always, business solutions for your customers. Yeah, I mean people, when they think of IBM, they like to, you know, throw around big numbers, right? They always want to say, oh, IBM invests a billion here and a billion there. But, you know, we took a very different approach this time around with Spark, in my opinion. Instead of, you know, throwing money at a problem, we actually threw, you know, people and resources at a problem. And so for us, I mean, a lot of it has been, frankly, you know, as you mentioned, growing the community, contributing to Spark, and consuming Spark internally with a number of our solutions and to help our clients actually adopt Spark faster. Because this stuff is not, it's non-trivial, it's not easy, right? I mean, I think what the community is doing, I mean, adding DataFrames, you know, improving APIs, you know, doing all of, you know, adding all of these usability features and functions. I mean, we're certainly getting there, but, you know, even if we had the best UI in the world, the simplest, you know, code, you know, and APIs to use, there's still like a, you know, a thinking gap in terms of how you work in distributed environments, you know, doing data science. And so that's exactly why we also launched, you know, what we call Datapalooza. We just had one last week in Seattle, it was a lot of fun. 
So, Joel, maybe tell us, sort of, by analogy, the type of work that you layered on Linux, because it was a, I mean, a world shaking event, at least in the tech world, when you got behind Linux when it was so immature, you know, it had promise, but it didn't have legitimacy. And you added a lot of, not just legitimacy, but technology on top. Can you walk us through, to the extent that you can share, your plans? I mean, I would go back, you know, maybe 20 years, even 30 years before Linux and talk about System/360. I mean, as antiquated as that sounds, it was the first time, I would posit that it was the very first operating system. And, you know, and that brought together a number of disparate peripheral solutions, right? That each had its own, you know, way of interacting, its own, you know, programming, you know, environment or programming language, like per solution, right? And, you know, going up to Linux, it was the same thing. There was a lot of different operating systems at the time. And frankly with Linux, it became clear, again, that the community was behind it. And so what we did is we adopted it just like we're doing with Spark. We said, okay, let's contribute to Linux. Let's make sure that it's, you know, enterprise ready. And then let's also use it because the best, you know, improvements, as I'm sure you know, and the best thing that we can do is actually get it into the hands of more people, right? Cause then once you have that business kind of context for technology, then I truly believe that it tends to steer the project in a much more, you know, usable, you know, practical direction. And Spark is no different today, by the way. I mean, if you look at all of the different ways that people are interacting with data, through SQL, through Python, through R, through Scala, through, you know, machine learning algorithms, I mean, that's like overload, right? 
And so to me, again, what we're seeing is we're in the new era where, you know, Spark, we see it as an analytics operating system to truly unify the languages, right, around how we interact with data, how we access data, how we, you know, analyze data and how we actually build data products, as I would call it, that run on that operating system. Let me ask you questions in two slightly different directions. Sure. So you're talking about the community of people around Spark. If we think of it like sort of dropping a stone in the water and the ripples spreading, yeah. Who are the first ripples? Right. And then what are the roles that follow? Right. And then sort of what are the class of problems that they solve first, and then how does that evolve? Yeah, I mean, as uninteresting as it sounds, this all, I think the pebble that we are, that we have dropped into the ocean, or into a data lake. Right. I knew that. I knew we were going to do that. I had to say that. I had to say that. I'm going to run to that one, okay. So, no, I mean, so that's what this is all about. I mean, so Hadoop opened the door, frankly, right? So I, again, I use the same analogy. Hadoop was like the Unix of the time, right? Where it's this heavy thing, you know, it's MapReduce, it's HDFS, it's not portable, right? I can't just spin up a cluster over here, take my work and port it over there in any other system, right? Right. Spark is not like that. Spark is very portable, and it's like, you know, Matei is like the Linus Torvalds of, you know, of Spark, right? And so it's very exciting. So the first ripple, or the first kind of ring, right, is data engineering and data developers. So what you're seeing now is a huge push to make Spark really accessible to practically any system, to make it truly portable. And at IBM, we're no different. I mean, we're expanding the scope of Spark to work with, you know, many different data systems, right? Whether it's mainframes, right? 
People are like, why would you ever do Spark on mainframe? Well, why not, right? Plenty of our customers and clients, there's a crap load of data still in there, right? And so, you know, so the first ripple is really, you know, is really getting that, is really getting Spark to sit on many different data systems. I think as you go out from that, the second layer that really gets me excited is you're seeing companies like, you know, H2O, Dato, you know, a ton of other machine learning companies, DataRobot's here. I mean, you walk around, right? You can kind of start to see that emergence. And so the next layer out is actually, you know, working with the data, right? And so doing things like, you know, data science or just even data analysis, like there's a new term that people are throwing around called interactive analytics, you know, with Spark. So folks like Zoomdata and others are starting to show a lot of this. And then there's also going out from that, I would say then there's going to be, frankly, solution building, right? So you look at companies like what Uber and Airbnb are doing, and Bloomberg, right? And, you know, on the wider periphery, and that's when you start bleeding back into, you know, and expanding into other open source projects. Let me follow that up because we've been working for a couple months on a sort of market overview and forecast. The forecast part is, because everyone likes to latch onto numbers, but the real interesting part is the, what's causing the knee in the curve in different parts of the marketplace. And we've been on this search for this elusive Bigfoot called, you know, large scale packaged applications built around, you know, machine learning and predictive analytics. And we've seen footprints, but we can't find the beast. Yeah, so there's, so sorry, I have to jump in here because I don't know if you know this, but like I think it's out in, I want to say Oregon, but there's like a Bigfoot trap. 
Have you guys ever heard of this? No. So like there's an old trap. There's like this giant massive trap that they built, like, I don't know, 30 years ago, and it still exists to catch like Bigfoot. So I think what we have to be careful of is going after things that might not exist. Okay, but this is the thing. That's where, you know, frankly, I mean, to jump in, I mean, I think that's where like Hadoop and other data stuff kind of, like they promised so much, like they promised like these magical creatures, like, you know, unicorns and Bigfoots and, you know, all of these things that, you know, frankly, we haven't seen materialize. And so I would boil that down to saying, hey, let's just go out and like, you know, see a few wild animals. Let's not try to find mystical creatures. Because IBM doesn't seem to aspire to build these ERP, you know, generation two type apps, you know, where it's front to back all the, you know, system of record type processes. What I'm trying to get at is there's, there's no such thing as a recommendation engine because it's different depending on the context, you know, what industry. And IBM seems to be set up better than anyone else to say, okay, we'll take the pieces and we'll customize them for your business. And I'm wondering if that's sort of where you see that on the maturity curve and in IBM's sort of evolution. I mean, look, I think that, you know, bringing this back to Spark, I would say that what we are seeing is very repeatable architectures, right? I mean, we see people, you know, following similar patterns and we're trying to harden those architectures into, you know, frankly kind of purpose built solutions that we can actually then help our clients get to. And what's the scope of those solutions? I would say it's pretty broad. I would say that, you know, as far as Spark goes, it's pretty far-reaching. 
I'm sure you heard Anjul's keynote yesterday and how she talked about, we have around 29 solutions that now work with Spark. You know, either it's consuming Spark or working directly on Spark. You know, I mean, a perfect example is Watson Health. I don't know if you're familiar with this. You know, last year, we announced our contribution of SystemML and Watson Health is using it with Spark to analyze, you know, they have a 250 node cluster running that's basically crunching through, you know, petabytes of health records to find, you know, areas that they can actually improve the healthcare, you know, business. So it's pretty incredible what we're seeing. And so I wouldn't say there's really like one particular solution to point to. Again, I think it's anywhere there's analytics. I think the challenge that we've had in the past with Hadoop and other things, SQL even, has been it hadn't been very portable. So you did have to like, invest a lot of, you know, resources to solve like this one big hairy problem. And it wasn't transferable to any other industry, frankly. And so that's why I talk about this notion of data products and to use another example, you know, one of the presenters at Datapalooza, they invented an iWatch app that basically, they wrote an algorithm in Spark that basically senses how much weight, you know, you're lifting. And so for them, they were like, all right, we're going to consumerize this and make a workout application. It senses the weight that you're lifting? Yeah, yeah, so it senses how much weight you're lifting. Go figure, figure that one out. Here's the interesting part, right? So as we're sitting in the session, one person raised their hand and said, you know, yeah, the exercise, you know, market is so big. But why don't you go talk to UPS or go talk to FedEx? They like actually would love to have a sensor tell you, hey, you're lifting too much weight. 
Like over 50 pounds, you know, they know that that leads to health problems, sure, problems. My point being is that what we're going to see and what you saw with Linux is that it led to the explosion of the internet. It led to the explosion of applications. I think Linux invented computer science. Computer science when I was growing up, I mean, not that long ago in college, right? Was a novel field to go into. And I think data science is like that now because of things like Spark and Hadoop. But let's follow up on the data science because we talked about the ripples of the applications and the solutions on the mythical data scientist, right? Which is less mythical than Bigfoot and the unicorns. But still, you know, there was the promise of the old original BI tools and, you know, really smart people in mahogany row, or I don't know what you call the data science row, you know, doing a lot of heavy lifting. But now we're trying to move that beyond just the data science. Talk about what you guys see in the marketplace in terms of getting it out of the, you know, the hallowed halls of the multi-PhDs. Ivory tower research type. Exactly. Well, look, I mean, they're certainly the visionaries, like leading the way, the data scientists, in a lot of ways. I mean, I think when you talk to data scientists, you know, they're not, you know, stuffy, like mathematicians. They're actually, you know, from my experience, extremely creative people. And what you're seeing is that, you know, the gravity is kind of shifting to them. And so people, and frankly, when we talk to chief data officers, another emerging role, they're the ones who are actually forming teams around the data scientists. And so, you know, clearly, I would say this conference is more of a conference for data engineers than it is even for data scientists. But I don't think you can do good data science without good data engineering, by and large. 
And in the food chain, which comes first, data engineering, and then data science, or is it vice versa? I think they're both equally. I mean, I, you know, that's probably not the answer you're looking for, but I frankly think that it's a team sport. I mean, DJ Patil said this the best. I think that's one of the biggest misconceptions that I hear, and I try to debunk any, you know, at every opportunity, is that there is no, you know, unicorn data scientists. What you have actually is data science teams. There's an emergence of data science, right? Just how, you know, there's computer science and computer engineering, right? Within the computer, you know, science profession, right? And there's programming. So, you know, so we have to be careful not to, like, individualize this. And then you say, well, you know, the data scientists aren't growing as fast, you know, as people say it, it's like, no, but the discipline of data science is. If we're smart, and hopefully if this catches on, which I believe it will, then there's a whole field. In fact, you know, we work with AMPLab over in UC Berkeley, and we were talking with them, and actually also the data science professors there, and they're making data science a mandatory, you know, a mandatory class for their 5,000 incoming freshmen. Wow, for all of Berkeley? All of it. So, and that's, and they're not the first, like data science becoming a mandatory, you know, thing to learn. Sort of like statistics would have been in, you know, like high school. Well, computer science should be in high school, along with biology, and physics, and chemistry, and why it's not, should be. But unfortunately, Joel, we are out of time. Man, I was just getting started. I know, but the good news is we will be at, well, first off, thanks for stopping by. It's always great to see you. And again, the good news, theCUBE will be at IBM InterConnect next week, Monday through Wednesday. 
So if you're going to be at InterConnect, Joel will be there. Stop on by theCUBE. We'll have a good production at Mandalay Bay. You're watching theCUBE. We're at Spark Summit East. I'm with George Gilbert. Thanks for watching. We'll be back with our next segment after this short break.