Live from Las Vegas, extracting the signal from the noise. It's theCUBE, covering IBM Insight 2015, brought to you by IBM. Now your hosts, Dave Vellante and Paul Gillin. Welcome back to IBM Insight, everybody. This is theCUBE. We're here at day two. We're winding up our coverage of day two at IBM Insight. This event is what we call IBM's big data and analytics event. IBM doesn't really use the term big data. They sort of shifted the market a couple of years ago, really focusing on analytics. They now use terms like fast data. We've heard that. Mike Tamir is here. He's the Chief Science Officer and Executive Vice President of Data Science and Education at Galvanize, training the next generation of data scientists. Mike, welcome to theCUBE. Thanks for coming on. Thanks for having me. The first data scientist I ever interviewed on theCUBE, many, many years ago, was Hilary Mason, quite famous now. She sort of educated us back then on the excitement of data science. Maybe talk about how this whole thing has evolved. You're a data scientist yourself. Hal Varian, of course, has pumped you guys up and increased the value of all these data scientists out there. But few there are. But you're working to create more. So give us the update. There's a huge demand for data science. There are figures cited all the time about the growing demand, the growing interest in data from individual companies, the growing interest in professionals who can work with that data. I was attracted to coming to Galvanize for that reason, because I had worked as a data scientist in Silicon Valley for many years and saw the need for qualified data science professionals. And so one of the reasons why I wanted to create this education company for data science with Galvanize was that I wanted to train the kind of data professionals I had been looking for all those years. So talk more about the mission of Galvanize.
Obviously train data scientists, but add some color to that and give us a picture of where you've come from and where you want to go. Yeah, so Galvanize has gone through an explosive growth phase over the last year. We went from 20 employees to nearly 200 employees now. So 10x growth. The metaphor that I like to use to help crystallize how people can think of all the different things that Galvanize is doing is this idea of a teaching hospital, a teaching hospital for data scientists or for web developers. We have member companies that come on site. They work and learn and grow at the Galvanize campus. We have appointed professors who run our master's programs. We have instructors who run our boot camps. And the important thing is that it's not just about staying in the classroom. It's about getting your hands dirty, about being very close to the industry: close in proximity to the member companies, close in working with real-life data sets and real-life data projects, and close in the actual experiential learning. Our professors and our instructors get to do data science consulting with each of our member companies and with our Galvanize experts' clients. And then, going back to the teaching hospital metaphor, the students get to shadow the professors as they're working on these projects and get to see what a real data science project life cycle is from end to end. And that's huge when they go out into the field. They already have that experience under their belts. I always say data science is a dirty business. You've got to roll up your sleeves. So, I'm sure you get this question all the time, but the skill sets, let's run through them. I mean, you've got to do math. You've got to be able to maybe do some development. You've got to be kind of a data hacker, right? Maybe some stats thrown in there. What do I need? Do I need all of these? Can I have some of these?
Because you're seeing that title being bandied around a bit. Somebody who's maybe got a strong stats background, a deep statistician, says, I'm a data scientist now. Valid, but maybe they need a little bit more training and should call you guys and join the teaching hospital. What are your thoughts on that? So, you hear that question a lot. What is a data scientist? What does a data scientist need to know? One answer to that question is that there are all sorts of different kinds of data professionals. There are data engineers. There are data scientists who are more focused on product use cases. There are data science researchers who are more focused on coming up with really interesting algorithms and really interesting machine learning applications in order to solve huge problems. This is what's happening with cognitive computing, for example. You always have to start with that question. What kind of professional are we training? And what kind of professional do you want to be? If you're a statistician, but you want to be a data engineer, you've got quite a number of skills to pick up in scalable computing and data architecture and data engineering, using Hadoop and Spark and all the different data engineering tools that are now in vogue. If you want to become a data scientist who does more modeling, more machine learning, maybe all you need to do is bone up on the code, of course, but also work to learn more machine learning, and advanced deep learning if you want to be more of a researcher. So it really depends on the kind of professional track that you want to go into. And we've actually created tracks for each of those. We have a data engineering immersive program, a data science immersive program, and a master's program for data science. So that kind of underscores the evolution of this role. I mean, five years ago it was kind of however you defined it, and people like yourself would just dive in and say, okay, we're going to just sort of create this role.
And now you're starting to get more specialized. You know, the teaching hospital is a really good example. Is that a correct way to think about it? Absolutely. Three years ago there were no master's programs in data science. And over the last year or two in particular, there's been this proliferation of academic and industry-focused groups like Galvanize trying to create a standard of training for those professionals. It's bringing clarity to what the profession is all about and, if done well, it's going to prepare the students to become the kinds of professionals that the industry is looking for. When you talk about standardized training or a standard set of training criteria, you also defined data science as spanning a great many different areas. How do you boil that down to a set of core skills that you need to teach? And what are those core skills, I guess? So we at Galvanize focus our instructional design on a methodology called backwards planning. We always keep the end in mind. We think about what is the professional that I want to hire as a data scientist? What do they look like? What are they able to do? Not what topics are they familiar with, but what are they able to actually accomplish? And then we teach to meet those standards, those learning objectives, when we create those different topics, or, I should say, those different lessons. So this is something that's core to how we have created it. I came in from data science, so I knew exactly what kind of professionals I wanted to hire and what kind of tasks I wanted them to accomplish. And that's how we designed the entire master's curriculum and the entire evolution of the immersive program curriculum, as well as the data engineering curriculum. Are these skill sets going to change dramatically over time, or do you see them having some staying power? That's a fantastic question.
Certainly the models and the algorithms are going to change. Models that are in vogue in 2015 are not going to be in vogue in 2025 or in 2035, but the methodology, the way of thinking about optimization, the way of breaking down an algorithm into a cost function, an optimization technique, and a decision process, that kind of core is going to stay. Knowing how to prepare your data sets and deal with the data and apply the algorithms in a way that gives you the best out-of-sample performance, knowing how to think about what the problem is and how to set it up with a good scientific methodology, all of that's going to stay the same. And so those are the core skills that we try to impart to our students on top of the coding, which is very important as well. So building efficient algorithms, a lot of people will say that's programming. What's the difference? Well, building the algorithm certainly takes programming. You might want to think about, and again this is a spectrum, what a data scientist is who's approaching it as a scientist. How does that kind of professional work? A chemist needs to know more than just the chemistry. They need to know how to actually go into the lab, how to do good lab work, how to titrate well, how to run valid experiments. And so being able to do that, that's a valued skill in chemists. And in the same way, for data scientists it's a valued skill to be able to code well, to be able to code efficiently, quickly, elegantly. And we try to really focus on that with a practice-makes-perfect methodology. So you guys are talking about algorithms. A lot of times on theCUBE you get these sound bites that stick with you, and then you've got to go back and try to better understand them. At the recent Gartner Symposium, a lot of the discussion was around algorithms and how your business is going to essentially become one big algorithm. The flip side of that is Abhi Mehta.
And if you know Abhi, he's very outspoken. He's a CUBE alum, and he said to us a couple of years ago, algorithms are free, right? It's how you apply the algorithm that's really going to differentiate your company. So there's dissonance there. You've got Gartner, a very respected firm, saying you've got to figure out the algorithms, you are becoming an algorithm. And then you've got somebody else saying, no, no, that's not where the value is. The value is on top of that. Help us squint through that dissonance. Sure. I heard a quote repeated to me just a couple of minutes ago about how business professionals leading teams of data scientists get more results than IT professionals leading teams of data scientists. And immediately that brought to mind a common trope: you need to be able to think about what the problem you're solving is in order to put together the solution, right? And so algorithms might be free. Those tools, those ingredients of a data science solution, might be free. But unless you know what the problem is that you're solving, and you can think creatively, because you understand the algorithms, about how to apply them effectively to solve that problem, you're not going to get very far as a data science professional. And so that's another big thing that we try to impart through all the experiential learning. The shadowing program lets the students not just see the coding, but see the whole thing from day one. You sit down with a professional who's got a real problem, has data, and wants to solve it, and then you see the process of thinking: this is the problem, so how do I put together all this machine learning, all this fancy, cool, interesting stuff that is in vogue right now? How do I put those building blocks together in order to create something that's really going to have remarkable success in solving that problem? And implement that continuous improvement over time.
I would think that's a huge factor, that ongoing learning, and that's going to create competitive advantage. How does that fit into your curriculum? Well, this is taking a page out of maybe more of the engineering side of things, the way an engineer should think about product development. You don't want to come up with the fanciest thing first. You want to come up with the MVP first. You want to come up with the simplest thing, and then also have a roadmap for how you're going to iterate, how you're going to add new things in, how you're going to use the learnings you get from the performance of that first iteration to guide what you do with the second and third and fourth iteration. So that's a huge part of it, and that's really more of an engineering part of it. Is there the potential that technology could obviate the need for data scientists? I'm thinking of Watson. Some of what Watson does is something that data scientists might otherwise do. Could technology make the discipline irrelevant eventually? I don't think that's going to happen. I do think that eventually, for the parts that take a lot of the data scientist's time right now, doing feature engineering and manipulating and cleaning the data, there are going to be tools. There are already a lot of tools for making that job easier, and this is what representation learning is all about. Take matrix factorization, which is the core technology behind all the best recommendation engines out there, whether it's factorization machines or NMF or whatever the algorithms are. Deep learning too. This all falls under the umbrella of something called representation learning, which is machine learning applied to figuring out how to manipulate the data itself so that the machine learning algorithm can do better with it. So it takes the same sort of tactics, the same sort of strategies for doing machine learning and coming up with a good algorithm, and uses them to figure out how to manipulate your data.
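As a rough illustration of the matrix factorization idea mentioned here, the following is a minimal sketch in plain Python. It is not anything the speaker or Galvanize describes using. The tiny `ratings` table, the function names `factorize`, `predict`, and `mse`, and every hyperparameter are hypothetical and chosen only for illustration. The sketch learns a short vector for each user and each item so that their dot product approximates the observed ratings, which is one simple way a representation can be learned from data rather than hand-engineered. It also shows a cost function (regularized squared error) paired with an optimization technique (stochastic gradient descent).

```python
import random

# Hypothetical observed ratings: (user, item) -> score from 1 to 5.
ratings = {
    ("ann", "a"): 5, ("ann", "b"): 3, ("ann", "d"): 1,
    ("bob", "a"): 4, ("bob", "d"): 1,
    ("cat", "a"): 1, ("cat", "b"): 1, ("cat", "d"): 5,
    ("dan", "b"): 1, ("dan", "c"): 4, ("dan", "d"): 4,
}


def factorize(ratings, n_factors=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn one small vector per user and per item with SGD.

    Cost being minimized: squared error on observed ratings plus an
    L2 penalty (reg) on the learned vectors.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    P = {u: [rng.uniform(0, 0.5) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(a * b for a, b in zip(P[u], Q[i]))
            for k in range(n_factors):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the regularized squared error.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, user, item):
    """Predicted score is the dot product of the two learned vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


def mse(ratings, P, Q):
    """Mean squared error over the observed ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items())
    return total / len(ratings)


if __name__ == "__main__":
    P, Q = factorize(ratings)
    print("training MSE:", mse(ratings, P, Q))
    print("estimate for ann on item c:", predict(P, Q, "ann", "c"))
```

After training, `predict(P, Q, "ann", "c")` gives an estimate for a pair that never appears in `ratings`. How useful that estimate is depends entirely on the data and the settings, and a real system would judge it on held-out ratings rather than on training error.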
And if you think about what's happening in a neural network, what's happening in a deep learning algorithm, it's just massaging that data. It's stretching it, rotating it, over and over and over again, in a way that gets feedback, so that at the final stage you can apply a simple, you know, MaxEnt or whatever it is, algorithm to find the answer. But all of those layers in the middle are really an attempt to automate what data scientists do all day. Can you talk more about, maybe peel the onion on, the supply and demand imbalance for data science? Obviously it's critical to your plan as an organization. We've certainly heard a lot about it for several years. We interviewed Mike Rappa a few years ago at the MIT IQ. He had the program in North Carolina, one of the first in data science. And you've seen an explosion. IBM's training data scientists, EMC's training data scientists. Now you've got Galvanize, really kind of the cool, you know, the "now" training organization. Obviously it's acute, but how big is that imbalance, and is it closing or is it getting wider? So there is a proliferation of organizations and companies who have seen this vacuum and are moving to fill it. I'm not worried that that demand's going to end any time soon. You know, that would be a good problem, of course. There's the need for data professionals who can competently work with the data, and not just at the extreme research end, but at all levels of stratification. Data-savvy marketing analysts, data-savvy social media people, at every level we need the ability to work with and use our data better. Someone who can't use Excel, at any level of an organization, is probably not going to be as effective as if they could, and we need to expand that scope to all kinds of data. So kids are always asking, you know, what should I get into? What do you think? And if they like data, I'm always saying, get into data.
Figure it out some way or another. I remember years ago, somebody was asking advice of a colleague and the advice that that person received was, you've got to be a Windows admin. No, no. So you're not worried about a glut of data scientists, is what I'm hearing. There's no shortage, or sorry, no waning of the demand in sight. If anything, the gap is widening. So the gap is widening based on, I guess it's anecdotal, it's really hard to tell, right? But you can observe organizations and where they're struggling. They've certainly been struggling just deploying analytics infrastructure. That's hard enough as it is. They've probably spent 70% of their time deploying infrastructure and cleaning data. And then maybe 20% is actually getting insights out of it. First of all, is that what you're seeing? And will that flip? And what needs to be in place for that to flip? So data always comes out messy. It will always come out messy. It's always going to need that massaging. Good news for you guys. There will be ways of automating that massaging a little bit. This goes back to the representation learning that I was talking about earlier. There are also going to be other professionals who are trained to be a little bit more focused on doing that very well. Data engineering is a piece of that, although not all data engineers do that. Some are data architects who deal more with distributed computing. And so there are a lot of aspects of each category and each stratification of data professional that could help. Do you see the Internet of Things introducing new disciplines for data science? Well, it will certainly add new data streams. And maybe that's enough. There are use cases. So maybe the best answer would be that there will be far more classic use cases for data that are right around the corner. Earlier today I went to the indoor location, which has been an interest of mine for years now.
Around the corner with the Internet of Things there are going to be all sorts of ways of tracking how you go into a store, into a Target or a Walmart or one of these big box stores, where sometimes you've got to get something immediately. You can't just order it on Amazon or online. You want to find out where this product is and you're searching around for somebody to help you. Imagine getting the kind of service where you can get the path straight through the store, and have that path be dynamically routed so that you walk by only the products that you're going to like on your way from where you are to where you're going. It's going to make that experience so much better, and the Internet of Things is one of the mechanisms for pushing that kind of change in how we shop. Do you teach ethics? I mean, because there are so many ways to misuse data in ways that invade privacy or cause harm. Is ethics part of your curriculum? It is. It's integrated throughout the entirety of the master's program, and we definitely have a strong focus on that. So there's security, there's privacy, and there's also authorization, right? So you might have the data but you didn't protect it well. You might not have permission to use the data and to draw insights from personally identifiable information. Or you might have some access but not full access, or you might be using it in the wrong way. So we have to always be careful to observe what the bounds are, what we have been authorized to do, and how we can use the data that we've been given for the better. What industries do you see? I mean, I know it's across the board, but are there any that stand out? Obviously we know companies are sucking up data scientists in a sort of arms race between Google and Facebook and all the other internet guys, but are there any that stand out? We know financial services is a big return there.
I'm curious about the government's demand. I mean, it's got to be huge. You think about the future of war, cyber security. Are there any that stand out in particular that are of interest? For me the answer is easy, it's healthcare. For good reasons, going back to data privacy, there's been a lot more friction getting your hands on healthcare data than maybe consumer data or retail data. For good reason, because that's very private, that's very personal, and people are going to want to protect that. At the same time, that's where so much more good can be done. We could do a lot to help analyze the data on how different drugs are interacting, how different regimens might work, gaining insights at large scale about how we can help improve the health of everybody, of all Americans, based on having that data. All Americans, all humans, having access to that data and having large-scale insights. What we can gain from that I don't want to speculate, but that's a huge untapped area of research that could give great gain. Well, and fraud too, right, Paul? We heard yesterday that fraud in the healthcare business is between, I think the numbers were, 50 and 300 billion dollars annually. And so to the extent that you can access data and make things more transparent and do some analytics on that, you would think you'd be able to directly attack that problem. Oh yeah, and fraud detection is a huge use case for data scientists. In the end it boils down to looking for certain kinds of anomalies with certain kinds of signatures, much like virus detection and these other security threat use cases. Yeah, I mean, in just the last couple of years we've seen a compression. Everybody's witnessed the fraud detection, and you know, you get the text, did you make this transaction? And there are still a lot of false positives, but it's better than six months of waiting or maybe never finding out, right?
And you would equate that or relate that directly to the improvements in data science, the application of data science, right? Certainly, though I don't want to say too much about the false positives that I get. No no no, not the false positives. I mean, I said that in a negative light. The compression of time in which you get notified has been amazing, you know? And it's game changing, right? And that's automated. Depending on which group and what company is actually doing that detection, you can fine-tune the balance between false positives and false negatives. Certainly there's a higher cost to the false negatives, as you say. But I'm not going to take credit for either one. There's a lot of room for algorithmic applications and machine learning to detect when something is a case of fraud, or at least a potential case of fraud. You mentioned recommendation engines earlier, and that's a fascinating area. There's this sort of arms race in recommendation engines between the quality of the recommendations and the ability of people to game the engines and deceive them. Is that a problem that you think data science can solve? Can we create recommendation engines that really are completely reliable? So recommendation engines, and this has been the case for quite a while now, hinge on, well, let's take Amazon as the case. When Amazon was just doing books, and Netflix just movies, you could actually look at the book and say, oh well, this is the author, this is the subject, this is the size of the book, and you could recommend based on those properties. The core fundamental insight, which is still a fundamental insight and has lasted for the last decade, is that the best indicator of a product that you'll buy is the fact that other people like you buy it.
That's a property. You are a property of every product you bought, in the sense that this is the kind of product that you like to buy, and if we look at other people who have a similar profile to yours, then we know that we can recommend those products. And so now when you talk about a kind of pencil or a kind of computer or whatever it is that you might be shopping for, because you bought all these other products, and those products have a similarity in that someone with a similar profile also buys them, we have much better insight into what your purchase behaviors are. That goes way beyond the fact that a computer or a pencil doesn't have an author, it doesn't have a length, right? So we have to use these other properties of objects in order to sell them and recommend them. All right, Mike, we have to leave it there. Thanks very much for coming to theCUBE, sharing your insights. Good luck with your mission, obviously a fantastic opportunity for your organization and for many people who want to really effect change in this world, so we appreciate the effort. My pleasure, thank you. All right, keep right there everybody, we'll be back with our next guest right after this. This is theCUBE, we're live from IBM Insight 2015. Check out ibmgo.com, you'll see all the keynotes, all the presentations, all the content and social content related to IBM Insight. We'll be right back, this is theCUBE.