Live from Union Square in the heart of San Francisco, it's theCUBE, covering Spark Summit 2016. Brought to you by Databricks and IBM. Now, here are your hosts, John Walls and George Gilbert.

Welcome back to San Francisco as theCUBE continues our coverage of Spark Summit 2016. I'm John Walls, along with George Gilbert. Glad to have you along for the second day of coverage here on the show floor at the Hilton Hotel in downtown San Francisco. It's really been a fascinating couple of days. We're joined now by Ritika Gunner, the vice president of big data and analytics offerings at IBM. And I like to look at her as, I'll call her, the futurist. You have all the cool stuff at IBM. So talk first about your portfolio, Ritika, if you will, because you really are on the forward-thinking, leading edge of what's going on in big data at IBM.

Yeah, we were talking about how I am working on the next-generation data and analytics technologies within IBM. That includes things like Hadoop. It includes things like Spark, which is what we're talking about here. It includes things like the data science profession, and what we're doing to enable that next generation of data and analytics technologies for the data professionals who are going to drive data-driven cultures in their organizations.

Which leads us to a big announcement you had on Monday night. I'm curious about the uptake so far on the Data Science Experience, and about reaching that community of data scientists. It's all about getting them up to speed, or getting more of them: learning, collaborating, and then putting it into practice. And you've had pretty good uptake on that so far.

Absolutely. Earlier this week, on Monday night, we announced our Data Science Experience offering, geared toward the data science professionals within the organization.
And we've had phenomenal uptake from our introduction until now. We have well over 600 to 700 participants; I haven't actually checked since our keynote earlier in the day at Spark Summit, and we're really close to four-digit numbers. It's been phenomenal.

So what do you want to do for them? What's the ultimate goal there?

As part of the Data Science Experience, we wanted to create a place where those data science professionals can come and understand what it means to be a better professional, and put that data to work for them. Through our interviews of hundreds of data science professionals, we found that the number one gap was skills. If you think about the number of data scientists actually out there today, in a very traditional sense of what data science is, compared to the number the rest of the world needs in order to have a truly data-driven culture, that gap is immense. And if you look at the way the tools have transformed, from very proprietary model-building tools to very open-source skills, especially within the last three to four years, the number of languages, tools, and capabilities you need to be a data scientist has grown enormously. So one of the foundations was: how do you help even existing data scientists learn about new technologies? How do you help aspiring data scientists start to learn what it means to be better? Whether that's through blog posts, articles, snippets of code, videos, or tutorials. So helping that set of data science professionals learn was foundational, one of the main fabric pieces.

So you're saying the relationship right now between the technology, the capabilities, the know-how, and the personal expertise is a pretty big gap, right? There's a lot of opportunity here on one side in terms of the tools you can use.

Absolutely.
And just not enough muscle right now, in terms of actual people power, if you will, to take advantage of that technology.

Absolutely. And that's what the Data Science Experience is all about.

You got it. You need to go out and do the pitch for us. Let's go. I'm ready.

Would it be fair to say that that objective announced a year ago, about training a million people... was it a million on Spark, or a million data scientists?

Well, we said that we wanted to educate a million professionals. We didn't really specify whether they were data scientists or on Spark specifically. And we have done pretty phenomenally in that area as well. Our commitment to the data science profession actually started exactly a year ago, exactly here at Spark Summit, where we announced that we believe the Apache Spark open-source project is the analytics operating system. As I mentioned earlier, since that time we opened our Spark Technology Center. Through that center, we went from just a few people committing into the community to well over 50 developers and 20 designers who helped contribute not only to the Apache Spark community code, but also helped a set of aspiring data professionals build applications based on Spark. And through that initiative, within the past 12 months, we are now the second-largest contributor to the Apache Spark project. That's phenomenal progress in just one year. And as for the Data Science Experience, think about what I talked about: if we announced last year that Spark is the analytics operating system, guys, what does every operating system need? Apps. So the Data Science Experience is effectively the IDE, the integrated development environment, to help data scientists start building those applications.

There are two angles on this. A couple of years ago, McKinsey came out with this authoritative report.
I mean, it had to be authoritative because it was from McKinsey, where they said we're going to be, like, two or three million data scientists short of a six-pack. And some folks in the industry were like, I guess they don't really get software, because software's whole point is that it makes people more productive.

Absolutely.

So help us describe how the Data Science Experience might evolve to make those data scientists more productive, so that we don't need three million of them.

You know, just as skills and that skills gap were something we wanted to address, the other thing was this notion of allowing data scientists to collaborate with each other. Collaboration effectively means you're learning very quickly from each other. You may not need as many people, because there's a foundation for you to start from. If I've developed a model, or an algorithm, or some set of capabilities, and I have a community where I can share that with the rest of my fellow professionals, they essentially get started quicker and don't need to do as much, right?

Would that be something analogous to GitHub, but not for hardcore developers, more for data scientists?

Well, GitHub is absolutely a way to share, if you will. But collaboration is a lot more than that. Collaboration is saying: through what I have in my development environment, I can actually work on projects together, on multiple types of pieces together, in an integrated environment. So it does take from that notion of GitHub, definitely, but it builds a lot more on top of it.

Okay, so whereas in a previous era you might collaborate on a doc, now you might collaborate on a data science pipeline, or a model, or something.

Exactly, and then share that with the rest of the community, for them to do whatever they want with it.

George was making a point, and if I'm wrong, just tell me.
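The "collaborate on a pipeline, not a doc" idea above can be sketched in a few lines of plain Python. This is a hypothetical illustration only, not the actual IBM Data Science Experience API: one data scientist publishes a declarative pipeline definition, and a colleague loads and reruns it on their own data instead of starting from scratch.

```python
import json

# Named, reusable steps that the community agrees on (illustrative only;
# the step names and this whole scheme are invented for the example).
STEPS = {
    "drop_nulls": lambda rows: [r for r in rows if None not in r.values()],
    "scale_age":  lambda rows: [{**r, "age": r["age"] / 100} for r in rows],
}

def run_pipeline(definition_json, rows):
    """Apply each named step from the shared definition, in order."""
    definition = json.loads(definition_json)
    for step_name in definition["steps"]:
        rows = STEPS[step_name](rows)
    return rows

# Data scientist A publishes a pipeline definition as a shareable artifact...
shared = json.dumps({"steps": ["drop_nulls", "scale_age"]})

# ...and data scientist B reuses it on their own data.
data = [{"age": 40}, {"age": None}, {"age": 25}]
print(run_pipeline(shared, data))  # [{'age': 0.4}, {'age': 0.25}]
```

The shareable artifact here is just JSON, which is the point: it can be versioned, diffed, and handed around the way the GitHub analogy in the conversation suggests.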
But the fact that the intelligence is being built in now, that it almost lessens workloads in some respects, or lessens the need for deep diving: does that help the data science community? Maybe there will be less for them to do going forward, because the machine will do more of the work for you? It still requires some hands-on work, right?

You have to think about it like this: the mundane tasks are automated for you, and there's a level of automation and machine-learning-driven recognition that happens. It makes things simpler and easier. Of course, someone has to develop the code that actually provides those capabilities, and that's part of what we're doing through the Data Science Experience. You can envision, as we go forward, that we're going to introduce capabilities that do the auto-modeling for you, that do the canvassing for you. All of those types of capabilities will enrich it. But exactly: it eases the burden of what you have to know to get started and to start working.

Sure. Would the data scientist role be the most constrained, in terms of needing to grow that pool of, I don't know, not necessarily people, but the pool of contribution needed to make this whole effort work?

Yeah, that's definitely what we found through our research. As I mentioned, we interviewed hundreds of data professionals: data scientists, data engineers, et cetera. We wanted to understand the life cycle of how organizations work with data to start infusing and operationalizing insights into their applications, that whole life cycle from data to actually operationalizing it. When we looked at that life cycle, the biggest bottleneck was the data science profession. We came out of that research with the notion of four critical skills, or types of people, that are necessary in an organization for any data project.
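The "auto-modeling" capability mentioned above can be illustrated with a minimal sketch, assuming the simplest possible form of it: the tool fits several candidate models for you and keeps the one with the lowest error, so the user starts from a working baseline rather than building from scratch. This is a hypothetical example, not how any specific IBM product works.

```python
# Toy training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

def fit_mean(xs, ys):
    # Baseline: always predict the mean of y.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b, computed by hand.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def auto_model(xs, ys, candidates):
    """Fit every candidate and return the model with the lowest squared error."""
    def sse(model):
        return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))
    return min((fit(xs, ys) for fit in candidates), key=sse)

best = auto_model(xs, ys, [fit_mean, fit_linear])
print(round(best(6.0), 1))  # the linear model wins; predicts roughly 12
```

Real auto-modeling systems search far larger spaces of algorithms and hyperparameters, but the selection loop, fit everything and keep the best scorer, is the same shape.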
The data scientist, the data engineer, the business analyst or citizen analyst, and of course the application developer. And what we noticed is that today's data science professional may be taking on more than one of those roles. They're doing data engineering. They're doing actual data science, machine-learning development. They're probably even infusing that into applications within the organization. So the boundary of skills necessary for today's data science professional is a lot broader than it traditionally was. And that's what we want to enable as well, because it's not just about knowing how to create an algorithm or a model. To be an effective data scientist today, you need to be able to fill multiple roles.

And just to continue that thought, does that mean you want to take a data scientist's perspective on the tool set, since that's the primary audience, but then extend their capabilities down that food chain?

Absolutely. You can envision a set of capabilities where it's built for you but works for us, where each one of us may be a different kind of professional in the organization. That allows the business to really accelerate what it means to be data-driven and to drive insights out of the applications they create. If you think about "built for you but works for us," that means the business is operating much more efficiently and quickly, and is far more data-driven than it ever was.

There's one question we've been wrestling with, which is what we love to term the legacy apps, which in customer-speak means the apps that work.

In some notions, probably.

What happens when you operationalize these insights? With legacy apps, how do you plug those actionable insights in without breaking them?

You know, I think Spark has helped tremendously in that area as well. Apache Spark has the ability to connect data from everywhere, right?
Whether those are very operational systems or new types of instrumented data. From our perspective, you use that capability to go from discovery onward. Let's say an organization has a thousand different hypotheses it wants to test. Using Spark, you can access all the data you need, then go from that set of a thousand hypotheses down to maybe 10 that have real, significant value, and from those 10, maybe you only want to operationalize one. So the ability to go from your hypotheses, to discovering that 10 of them really matter, to the one you want to operationalize, is something we aim to make very efficient. And that means you need to be able to collaborate across the multiple different data professionals in the organization.

Interesting. I guess that's a roadmap of what's happened in the past year.

Absolutely.

And we're looking forward to seeing what's going to happen this next year. Let's get back together in 2017 and see where this has gone, because there's no doubt that IBM has put a huge wind in the sails of the Apache Spark project, and now it's become, I mean, the center, right? Of the processing universe. Good. Always a pleasure to speak with you guys.

Ritika, thanks for being with us.

Thank you. Thanks for the time.

We'll be back with more here from Spark Summit 2016 in just a bit, here on theCUBE.