Live from New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now here's your host, Dave Vellante. Welcome back to New York City, everybody. This is theCUBE, the worldwide leader in live tech coverage, and this is a CUBE first. We've got a nine-person, actually eight-person, panel of experts, data scientists, all alike. I'm here with my co-host, James Kobielus, who has helped organize this panel of experts. James, welcome. Thank you very much, Dave. It's great to be here. And we have some really excellent brain power up there, so I'm gonna let them talk. Okay, well thank you for- And I'll interject my thoughts now and then, but I wanna hear them. Okay, great. We know you well, Jim. We know you'll do that. So thank you for that, and appreciate you organizing this. Okay, so what I'm gonna do with our panelists is ask you to introduce yourself. I'll introduce you, but tell us a little bit about yourself and talk a little bit about what data science means to you. A number of you started in the field a long time ago, perhaps as data warehouse experts before the term data science was coined. Some of you probably started after Hal Varian said it was the sexiest job in the world. So think about how data science has changed and/or what it means to you. We're gonna start with Greg Piatetsky, who's from Boston, a PhD, KDnuggets. Greg, tell us about yourself and what data science means to you. Thank you, Dave, and thank you, Jim, for the invitation. So data science, in a sense, is the second oldest profession. I think people have this built-in need to find patterns, and whatever we find, we want to organize the data. We do it well on a small scale, but we don't do it well on a large scale. So really data science takes our need and helps us to organize what we find, so that the patterns we find are really valid and useful and not just random. I think this is a big challenge of data science.
I actually started in this field before the term data science existed. I started as a researcher, and I organized the first few workshops on data mining and knowledge discovery. Then the term data mining became less fashionable and became predictive analytics. Now it's data science, and it will be something else in a few years. Okay, thank you. Yves Mulkers. Yves, I, of course, know you from Twitter. A lot of people know you as well. Tell us about your experiences and what data science means to you. Well, data science to me is, if you take the two words, the data and the science, it holds a lot of expertise and skills in there. It's statistics, it's mathematics, it's understanding the business and putting that together with the digitization of what we have. It's not only the structured data or the unstructured data, what you store in the database and try to get out and try to understand what is in there, but even video, what is coming in, and trying to find, like Greg already said, the patterns in there and bringing value to the business, looking from a technical perspective but still linking that to the business insights. I mean, you can do that on a technical level, but then you don't know yet what you need to find or what you're looking for. Okay, great, thank you. Craig Brown, CUBE alum. How many people have been on theCUBE, actually, before? I have. Okay, good, I always like to ask that question. So Craig, tell us a little bit about your background and data science, how has it changed? What's it all mean to you? Sure, so I'm Craig Brown. I've been in IT for almost 28 years, and that was obviously before the term data science, but I started out as a developer and evolved through the data ranks, as I call it, working with data structures, working with data systems, data technologies, and now we're working with data pure and simple.
Data science to me is an individual or a team of individuals that dissect the data, understand the data, help folks look at the data differently than just the information that we usually use on reports, and get more insights on how to utilize it and better leverage it as an asset within an organization. Great, thank you, Craig. Okay, Jennifer Shin, math is obviously part of being a data scientist. You're good at math, I understand. Tell us about yourself. Yeah, so I'm a senior principal data scientist at the Nielsen Company. I'm also the founder of 8 Path Solutions, which is a data science, analytics and technology company, and I'm also on the faculty in the Master of Information and Data Science program at UC Berkeley. So math is part of that. I teach statistics for data science, actually, this semester, and I think for me, I consider myself a scientist primarily, and data science is a nice day job to have, right? It's something where there's industry need for people with my skill set in the sciences, and data gives us a great way of being able to communicate what we know in science in a way that can be used out there in the real world. So I think the best benefit for me is that now that I'm a data scientist, people know what my job is, whereas before, maybe five, 10 years ago, no one understood what I did. Now people don't necessarily understand what I do, but at least they understand kind of what I do, so that's still an improvement. Excellent, thank you, Jennifer. Joe Caserta, you're somebody who started in the data warehouse business and saw that snake swallow a basketball and grow into what we now know as big data. So tell us about yourself.
So yeah, so I've been doing data for 30 years now, and I wrote The Data Warehouse ETL Toolkit with Ralph Kimball, which is the best-selling book in the industry on preparing data for analytics. The big paradigm shift that's happened for me in the past seven years has been that instead of preparing data for people to analyze to make decisions, now we're preparing data for machines to make the decisions, and I think that's the big shift from data analysis to data analytics and data science. Great, thank you. Miriam, Miriam Friedel, welcome. Thank you. I'm Miriam Friedel, I work for Elder Research, we are a data science consultancy, and I came to data science through a very circuitous route. I started off as a physicist, went to work as a consultant software engineer, then became a research analyst, and finally came to data science. I think one of the most interesting things to me about data science is that it's not simply about building an interesting model and doing some interesting mathematics or maybe wrangling the data, all of which I love to do, but it's really the entire analytics life cycle and the value that you can actually extract from data at the end. That's one of the things that I enjoy most, seeing a client's eyes light up, or, wow, I didn't really know we could look at data that way, that's really interesting, I can actually do something with that. So I think that to me is one of the most interesting things about it. Thank you. Justin Sadeen, welcome. That's correct, yeah, absolutely, thank you, thank you. So my name is Justin Sadeen. I work for Morph EDU, an artificial intelligence company in Atlanta, Georgia, and we develop learning platforms for non-profit and private educational institutions. I'm a Marine Corps veteran turned data enthusiast, and so what I think about data science is that it's the intersection of information, intelligence and analysis.
I'm really excited about the transition from big data into smart data, and that's what I see data science as. Great, and last but not least, Dez Blanchfield, welcome, mate. G'day, yeah, I'm the one with the funny accent. So data science for me is probably the funniest job I've ever tried to describe to my mom. I've had quite a few different jobs and she's never understood any of them, and this one she understands the least. I think a fun way to describe what we're trying to do in the world of data science and analytics now is that it's the equivalent of high altitude mountain climbing. It's like the extreme sport version of the computer science world, because we have to be this magical unicorn of a human that can understand plain English problems from the C-suite down and then translate them into code, either as solos or to teams of developers. And so there's this black art where we're expected to be able to transmogrify something that someone just says in plain English, I would like to know X, and we have to go away and figure it out. So there's this neat extreme sport view I have of rushing down the side of a mountain on a mountain bike and just dodging rocks and trees and things, occasionally, because invariably we do have things that go wrong and they don't quite give us the answers we want. But I think that we're at an interesting point in time now with the explosion in the types of technologies that are at our fingertips and the scale at which we can do things now. Once upon a time we would sit at a terminal and write code and just look at data and watch it in columns, and then we ended up with spreadsheet technologies at our fingertips. Nowadays it's quite normal to instantiate a small high performance distributed cluster of computers, effectively a supercomputer, in a public cloud and throw some data at it and see what comes back. And we can do that on a credit card.
So I think we're at a really interesting tipping point now where this coinage of data science needs to be slightly better defined, so we can help organizations who have weird and strange questions that they want to ask, tailor solutions to those questions, and deliver on them as, I guess, a commodity deliverable. I want to know XYZ, I want to know it in this timeframe, I want to spend this amount of money to do it, and I don't really care how you're going to do it. And there are so many tools we can choose from and so many platforms we can choose from. It's this little black art of computing, if you like, we're effectively making it up as we go in many ways. So I think it's one of the most exciting challenges that I've had, and I'm pretty sure I speak for most of us in that we're lucky that we get paid to do this amazing job, that we get to make up on a daily basis in some cases. Excellent, well, okay, so we'll get right into it. I'm gonna go off script. Do they have unicorns down under? I think they have some strange species, right? Well, we put the pointy bit on the back. Hang on, yeah, you guys have it on the front. So I was at an IBM event on Friday, it was a chief data officer summit, and I attended what was called the Data Divas breakfast. It was a women in tech thing. And one of the CDOs said that 25% of chief data officers are women, which is much higher than you would normally see in the profile of IT. It happens that 25% of our panelists are women. Miriam and Jennifer, is that common for the data science field, or is this a higher percentage than you would normally see? Or a lower percentage? I think certainly for us, we have hired a number of additional women in the last year, and they are phenomenal data scientists.
I don't know that I would say, I mean, I think it's certainly typical that this is still a male dominated field, but I think like many male dominated fields, you know, physics, mathematics, computer science, that is slowly changing and evolving, and I think certainly that's something that we've noticed in our firm over the years as our consultancy is hiring new people. So I don't know if I would say 25% is the right number, but certainly, hopefully, we can get it closer to 50. Jennifer, I don't know if you have. Yeah, so I know at Nielsen we have actually more than 45% of our team is women, at least the team that I work with. So there seem to be a lot of women who are going into the field, which isn't too surprising, because with a lot of the issues that come up in STEM, one of the reasons why a lot of women drop out is because they want real world jobs and they feel like they want to be in the workforce. And so I think this is a great opportunity, with data science being so popular, for these women to actually have a job where they can still maintain that engineering and scientific background that they learned in school. Great. Well, Hilary Mason, I think, was the first data scientist I ever interviewed, and I asked her what are the sort of skills required. The first question that we wanted to ask, and I just threw that other question about women in tech in there because we love women in tech, is about this notion of the unicorn data scientist, right? It's been put forth that the skill sets required to be a data scientist are so numerous that it's virtually impossible to have a data scientist with all those skills. And I love Dez's extreme sports analogy because that plays into the whole notion of data science. We like to talk about the theme now that data science is a team sport. Must it be an extreme sport, is what I'm wondering. The unicorns of the world, is that realistic now in this new era?
Yes, I mean, when automobiles first came out, they were concerned that there wouldn't be enough chauffeurs to drive all the people around. Is there an analogy with data? To be a data driven company, do I need a data scientist, and does that data scientist need to have this unbelievable mixture of skills, or are we doomed to always have a skill shortage? I'd like to have a crack at that. So it's interesting, when automobiles were a new thing, when they first brought cars out, before they were modernized by the likes of Ford's Model T and we got away from the horse and carriage, they actually had human beings walking down the street with a flag warning the public that the horseless carriage was coming. And I think data scientists are very much like that, we're kind of expected to go ahead of the organization and try and take the challenges we're faced with today and see what's gonna come around the corner. And so we're like the little flag bearers, if you like, in many ways: this is where we're at today, tell me where I'm gonna be tomorrow, and try and predict the day after as well. It is very much becoming a team sport though, but I think the concept of the data scientist being a unicorn has come about because the coinage hasn't been very well defined. If you were to ask 10 people what a data scientist was, you'd get 11 answers. And I think this is a really challenging issue for hiring managers and C-suites when they turn around and say, I want data science, I want big data, I want an analyst. They don't actually really know what they're asking for. Generally, if you ask for a database administrator, it's a well-described job spec and you can just advertise it, and some 20 people will turn up, and you interview them and decide whether you like the look and feel and smell of them. When you ask for a data scientist, there are 20 different definitions of what that one data science role could be.
So we don't necessarily know what the job is, we don't know what the deliverable is, and we're still trying to figure that out, so yeah. Craig, weigh in. So from my experience, when we talk about data science, we're really talking about a collection of experiences with multiple people. I've yet to find, at least from my experience, a data science effort with a lone wolf. So you're talking about a combination of skills, and no one individual needs to have everything that makes a data scientist a data scientist, but you definitely have to have the right combination of skills amongst the team in order to accomplish the goals of the data science team. So from my experiences and from the clients that I've worked with, we refer to the data science effort as a data science team, and I believe that's very appropriate to the team sport analogy. Yeah, for us, we look at a data scientist as a full stack web developer, a jack of all trades. I mean, they need to have a multitude of backgrounds, coming from a programmer, from an analyst. You can't find one subject matter expert, because it's very difficult, and if you're able to find a subject matter expert, you know, through the life cycle of product development you're going to require that individual to interact with a number of other members from your team who are analysts, and then you just end up, well, training this person to be, again, a jack of all trades, so it comes full circle. So I own a business that does nothing but data solutions, and we've been in business 15 years, and the transition over time has been going from being a conventional wisdom run company with a bunch of experts at the top to becoming more of a data driven company using data warehousing and BI, but now, you know, the trend is absolutely analytics driven. So if you're not becoming an analytics driven company, you're going to be behind the curve very, very soon, and it's interesting that IBM is now coining the phrase of a cognitive business.
I think that is absolutely the future. If you're not a cognitive business from a technology perspective and an analytics driven perspective, you're going to be left behind for sure. So in order to stay competitive, you know, you need to really think about data science, think about how you're using your data. And I also see that, you know, what's considered a data expert has evolved over time too, where it used to be just someone really good at writing SQL or someone really good at writing queries in any language, but now it's becoming more of an interdisciplinary action where you need soft skills and you also need the hard skills, and that's why I think, you know, there are more females in the industry now than ever, because you really need to have a really broad breadth of experiences that really wasn't required in the past. Greg Piatetsky, you have a comment? So there are not too many unicorns in nature or as data scientists, so I think organizations that want to hire data scientists have to look for teams. There are a few unicorns like Hilary Mason or maybe Usama Fayyad, but they generally tend to start companies, and it's very hard to retain them as data scientists. What I see is another evolution, automation, and you know, steps like IBM Watson data first platform are potentially a great advance for data scientists in the short term, but what's likely to happen in the longer term is more and more of those skills becoming subsumed by a machine learning layer within the software. How long will it take? I don't know, but I have a feeling that the paradise for data scientists may not be very long lived. Greg, I have a follow up question to what I just heard you say. When a data scientist, let's say a unicorn data scientist, starts a company, as you phrased it, and the company's product is built on data science, do they give up being a data scientist in the process? It would seem that they become a data scientist of a higher order if they've built a product based on that knowledge.
What are your thoughts on that? Well, I know a few people like that, so I think maybe they remain data scientists at heart, but they don't really have the time to do the analysis and they really have to focus more on strategic things. For example, today actually is the birthday of Google, 18 years ago. So Larry Page and Sergey Brin wrote a very influential paper back in the 90s about PageRank. Have they remained data scientists? Perhaps in a very, very small part, but that's not really what they do. So I think these unicorn data scientists quickly evolve, so you really have to look for teams to capture those skills. Clearly they come to a point in their career where they build a company based on teams of data scientists and data engineers and so forth, which relates to the topic of team data science. What is the right division of roles and responsibilities for team data science? Before we go to you, Jennifer, did you have a comment on that? Yeah, so I guess I would say for me, when data science came out and there was the Venn diagram that came out about all the skills you were supposed to have, I took a very different approach than all of the people I knew who were going into data science. Most people started interviewing immediately. They were like, this is great, I'm going to get a job. I went and learned how to develop applications and learned computer science, because I'd never taken a computer science course in college. I made sure I shored up that one part where I didn't know these things or didn't have the skills from school, so I just went head first and learned it, and now I have actually a lot of technology patents as a result of that. So to answer Jim's question, actually, I started my company about five years ago, and it originally started out as a consulting firm slash data science company.
Then it evolved, and one of the reasons I went back into industry and now I'm at Nielsen is because you really can't do the same sort of data science work when you're actually doing product development. It's a very, very different sort of world. When you're developing a product, you're developing a core feature or functionality that you're going to offer clients and customers. So I think you really don't get to have that wide range of looking at eight million models and testing things out. That flexibility really isn't there as your product starts getting developed. Before we go into the team sport, the hard skills that you have, are you all good at math? Are you all computer science types? How about math? You all math? What were your GPAs? Anybody not math or anything? Anybody not love math? You don't love math? I love math, I think it's a requirement. Okay, so math, yes. You dream in equations, right? You dream. Computer science, do I have to have computer science skills, at least the basic knowledge? I don't know that you need to have formal classes in any of these things, but I think certainly, as Jennifer was saying, if you have no skills in programming whatsoever and you have no interest in learning how to write SQL queries or R or Python, you're probably gonna struggle a little bit. That would be a challenge. So I think, yes, I have a PhD in physics, I did a lot of math, that's my love language, but I think you don't necessarily need to have formal training in all of these things. I think you need to have a curiosity and a love of learning, and if you have that, you still wanna learn, however you gain that knowledge. But yeah, if you have no technical interest whatsoever and don't wanna write a line of code, maybe data science is not the field for you, even if you don't do it every day. And statistics as well, you would put that in that same general category. How about data hacking?
That's right. You gotta love data hacking, is that fair? Yves, do you have a comment? Yeah, I think so. Well, as we've been discussing, for me the most important part is that you have a logical mind and you have the capability to absorb new things and the curiosity that you need to dive into that. Well, I don't have an education in IT, I have a background in chemistry, and those things that I learned there, I apply to information technology as well. And if you say, okay, I'm a tech savvy guy, I'm interested in the tech part of it, you still need to speak that business language, and if you can do that crossover and understand what other skill sets or parts of the roles are telling you, I think the communication in that aspect is very important. I'd like to throw in something really quickly. I think there's an interesting thing that happens in IT, particularly around technology. We tend to forget that we've actually solved a lot of these problems in the past. If we look in history, if we look around the Second World War and Bletchley Park in the UK, we had a very similar experience as humans to the one we're having currently around the whole issue of data science. So there was an interesting challenge with the Enigma and the Shark code, right? And there was a bunch of men put in a room and told, you're mathematicians and you've come from universities and you can crack codes, but they couldn't. And so what they ended up doing was running these ads and putting out challenges. They actually put, I think it was crossword puzzles in the newspaper, and this deluge of women came out of all kinds of different roles, without math degrees, without science degrees, but they could solve problems. And they were thrown at the challenge of cracking codes, and invariably they did the heavy lifting on a daily basis, converting messages from one format to another, so that this very small team at the very end could actually get and play with the sexy piece of it.
And I think we're going through a similar shift now with what we refer to as data science in the technology world and business world, where the people who are doing the heavy lifting aren't necessarily what we would think of as a traditional data scientist. And so there have been some unicorns and we've championed them and they're great. But I think the shift's going to be to accountants and actuaries and statisticians who understand the business and come from an MBA style background, who can learn the relevant pieces of math and models that we need to apply to get the data science outcome. I think we've already been here, we've solved this problem, we've just got to learn not to try and reinvent the wheel, because the media hypes this whole thing of data science as exciting and new, but we've been here a couple of times before and there's a lot to be learned from that, that's my view. I think we had Joe next. So I was going to say that data science is a funny thing. To use the word science is kind of a misnomer, because there is definitely a level of art to it. And I like to use the analogy of when Michelangelo would look at a block of marble. Everyone else looks at the block of marble and they see a block of marble. He looks at a block of marble and he sees a finished sculpture. And then he figures out, what tools do I need to actually make my vision? And I think data science is a lot like that. We hear a problem, we see the solution, then we just need the right tools to do it. And I think part of consulting, in data science in particular, is that it's not so much what we know out of the gate, but it's how quickly we learn. And I think for everyone here, what makes them brilliant is how quickly they can learn any tool that they need to see their vision get accomplished. Justin? Yeah, I think you make a really great point.
For me, I'm a Marine Corps veteran, and the reason I mention that is because I work with two veterans who are problem solvers, and I think that's what data scientists really are in the long run, problem solvers. And you mentioned a great point that, yeah, I think just problem solving is the key. You don't have to be a subject matter expert, just be able to take the tools and intelligently use them. Now, when you look at the whole notion of team data science, what is the right mix of roles, like role definitions, within a high quality or high performing data science team? Now, IBM, with, of course, our announcement of Project DataWorks and so forth, we're splitting the role division in terms of data scientists versus data engineers versus application developers versus business analysts. Is that the right breakdown of roles, or what would the panelists recommend in terms of understanding what kind of roles make sense within, like I say, a high performing team that's trying to develop applications that depend on data and machine learning and so forth? Anybody want to contribute? Yes, I'll tackle that. So the teams that I've created over the years, the data science teams that I brought into customer sites, have a combination of developer capabilities, and some of them are IT developers, but some of them were developers of things other than applications. They designed buildings, they did other things with their technical expertise besides building technology. The other piece besides the developer is the analytics, and analytics can be taught as long as they understand, you know, how algorithms work and the code behind the analytics, in other words, how we are analyzing things. From a data science perspective, we are leveraging technology to do the analyzing through the tool sets. So ultimately, as long as they understand how tool sets work, we can train them on the tools, and having that analytic background is an important piece.
Craig, is it easier to, I'll go to you in a moment, Joe, is it easier to cross train a data scientist to be an app developer than to cross train an app developer to be a data scientist, or does it not matter? Yes. And not the other way around. It depends on the time frame. It's easier to cross train a data scientist to be an app developer than the other way around. Why is that? Developing code can be as difficult as the tool set one uses to develop code. Today's tool sets are very user-friendly for developing code. It's very difficult to teach a person to think along the lines of developing code when they don't have any idea of, you know, the aspects of building something. I think Joe, are you next, or Jennifer? Who is it? So I would say that one of the reasons for that is that data scientists will probably know if the answer's right after you process data, whereas a data engineer might be able to manipulate the data but may not know if the answer's correct, right? So I think that is one of the reasons why having data scientists learn the application development skills might be an easier time than the other way around. I think Miriam had a comment? I mean, sorry, go ahead, Joe. Yeah, sorry. I think that what we're advising our clients to do is to not think, you know, before data science and before analytics became, you know, so required by companies to stay competitive, it was more of a waterfall. You'd have a data engineer build a solution, you know, then you'd throw it over the fence and the business analysts would have at it. Whereas now it must be agile, and you must have a scrum team where you have the data scientists and the data engineer and the project manager and the product owner and someone from the chief data office all at the table at the same time and all accomplishing the same goal, because all of these skills are required collectively in order to solve this problem, and it can't be done daisy-chained anymore.
It has to be a collaboration, and that's why I think Spark is so awesome, because, you know, Spark is a single interface that a data engineer can use, a data analyst can use and a data scientist can use, and now, with what we've learned today, having a data catalog on top so that the chief data office can actually manage it, I think, is really going to take Spark to the next level. Miriam. I wanted to comment on your question to Craig about, is it harder to teach, you know, a data scientist to build an application or vice versa? One of the things that we have worked on a lot in our data science team is incorporating a lot of best practices from software development, you know, agile, scrum, that sort of thing. And I think particularly with a focus on deploying models, so we don't just want to build an interesting data science model, we want to deploy it and get some value, you need to really incorporate these processes from someone who might know how to build applications. And I think for some data scientists that can be a challenge, because one of the fun things about data science is you get to get into the data and you get your hands dirty and you build a model and you get to try all these cool things. But then when the time comes for you to actually deploy something, you need deployment grade code in order to make sure it can go into production at your client site and be useful, for instance. So I think that there's an interesting challenge on both ends, but one of the things that I've definitely noticed with some of our data scientists is it's very hard to get them to think in that mindset, which is why you have a team of people, because everyone has different skills and you can mitigate that. DevOps for data science. Yeah, exactly. DevOps for, well, we call it insight ops, but yeah, I hear what you're saying. Data science is becoming increasingly an operational function as opposed to strictly exploratory or developmental.
I think someone else had a comment. Des? One of the things I was going to mention, one of the things I like to do when someone gives me a new problem is take all the laptops and phones away and we just end up in a room with a whiteboard. And developers find that challenging sometimes. And so I have this one line where I said to them, don't write the first line of code until you actually understand the problem you're trying to solve, right? And I think where the data science focus has changed the game for organizations who are trying to get some systematic, repeatable process so they can throw data out and just keep getting answers on things no matter what the industry might be, is that developers will come with a particular mindset on how they're going to codify something without necessarily getting the full spectrum and understanding the problem in the first place. What I'm finding is that people that come out of data science tend to have more of a hacker ethic, they want to hack the problem, they want to understand the challenge and they want to be able to get it down to plain English simple phrases and then apply some algorithms and then build models and then codify it. And so most of the time we sit in a room with whiteboard markers just trying to build a model in a graphical sense and make sure it's going to work and that it's going to flow. Once we can do that, we can codify it. I think when you come up from the other angle, from the developer ethic, you're like, I'm just going to codify this from day one, I'm going to write code, I'm going to hack this thing out and it's just going to run and compile. Often you don't truly understand what you're trying to get to at the end point and you can just spend days writing code, and I think someone made a comment that sometimes you don't actually know whether the output is actually accurate in the first place.
So I think there's a lot of value being provided from the data science practice of understanding the problem in plain English at a team level. So what am I trying to do from the business consulting point of view? What are the requirements? How do I build this model? How do I test the model? How do I run a sample set through it, train the thing and then make sure what I'm going to codify actually makes sense in the first place? Because otherwise, what are you trying to solve in the first place? Wasn't it Einstein who said if I had an hour to solve a problem I'd spend 55 minutes on understanding the problem and five minutes on the solution, right? That's exactly what you're talking about. Well, I think, I will say, getting back to the question, the thing with building these teams, something I think people often don't talk about is that engineers are actually very, very important for data science projects, data science problems. For instance, if we're just trying to prototype something or just come up with a model, then data science teams are great. However, if we need to actually put that into production, that code that the data scientist has written may not be optimized, right? It might not be optimal. So as we scale out, it may be actually very inefficient. At that point, you kind of want an engineer to step in and actually optimize that code. So I think it depends on what you're building and that kind of dictates what kind of division you want among your teammates. But I do think that a lot of times the engineering component is really undervalued out there. Jennifer, it seems that the data engineering function, data discovery and preparation and so forth, is becoming automated to a greater degree. But if I'm listening to you, I don't hear that data engineering as a discipline is becoming extinct in terms of a role that people can be hired into.
You're saying that there's a strong, ongoing need for data engineers to optimize the entire pipeline to deliver the models, the fruits of data science, to production applications, is that correct? So they play that very much operational role as the backbone for... Right, so I think a lot of times businesses will go to a data scientist to build a better model, maybe build a predictive model. But that model may not be something that you really want to implement out there when there's like a million users coming to your website, because it may not be efficient, it may take a very long time. So I think in that sense, it is important to have good engineers, or your whole product may fail. You may build the best model, it may have the best output, but if you can't actually implement it, then really what good is it? How are you calibrating these models? How do you go about doing that and testing that in the real world? Has that changed over time or is it? So one of the things I think that can happen, and we found this with one of our clients, is that when you build a model, you do it with the data that you have and you try to use a very robust cross-validation process to make sure that it's robust and it's sturdy. But one thing that can sometimes happen is after you put your model into production, there can be external factors, societal or whatever, things that have nothing to do with the data that you have or the quality of the data or the quality of the model, which can actually erode the model's performance over time. So as an example, we think about cell phone contracts. Those have changed a lot over the years. So maybe five years ago, the type of data plan you had might not be the same that it is today because a totally different type of plan is offered.
So if you're building a model on that to say, predict who's gonna leave and go to a different cell phone carrier, the validity of your model over time is gonna completely degrade based on nothing that you put into the model or the data that was available. So I think you need to have this sort of model management and monitoring process to take these factors into account and then know when it's time to do a refresh. Cross-validation, even at one point in time, for example, there was an article in the New York Times recently that they gave the same data set to five different data scientists. This is survey data for the presidential election that's upcoming. And five different data scientists came to five different predictions. They were all high-quality data scientists. The cross-validation showed a wide variation about who is on top, whether it was Hillary or whether it was Trump. So that shows that even at any point in time, cross-validation is essential to understand how robust the predictions might be. Does someone else have a comment, Joe? I just want to say that this even drives home the fact that having the scrum team for each project and having the engineer and the data scientists, data engineer and data scientists working side by side, because it is important that whatever we're building, we assume we'll eventually go into production. And we used to have in the data warehousing world, we would have, you get the data out of the system, so out of your applications, you do analysis on your data and the nirvana was maybe that data would go back to the system, but typically it didn't. Nowadays, the data, the applications are dependent on the insight coming from the data science team. The way the behavior of the application and the personalization and individual experience for a customer is highly dependent. So it has to be, you said, is data science part of the DevOps team? Absolutely now, it has to be. 
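The model management and monitoring process described above (knowing when a deployed model has degraded enough to need a refresh) can be sketched roughly as follows. This is a minimal illustration, not any panelist's actual tooling: the `DriftMonitor` class, its window size, and its tolerance are all hypothetical names and values chosen for the example. It simply compares rolling accuracy on recent predictions against the accuracy measured at validation time.

```python
from collections import deque


class DriftMonitor:
    """Track rolling accuracy of a deployed model and flag when to refresh.

    Hypothetical helper for illustration: baseline_accuracy is whatever
    was measured during cross-validation before deployment.
    """

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        # Keep only the most recent `window` outcomes (1 = correct, 0 = wrong).
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one production prediction turned out to be right."""
        self.outcomes.append(1 if predicted == actual else 0)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_refresh(self):
        """True once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance
```

In a churn setting like the cell phone example, each `record` call would be made once the customer's real outcome (stayed or left) becomes known, so the monitor reacts to changes in the outside world rather than to anything in the training data.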
Whose job is it to figure out the way in which the data is presented to the business? Where's the sort of presentation, the visualization plane? Is that the data scientist role? Is that dependent on whether or not you have that gene? Do you need a UI person on your team? Where does that fit? Good question. Well, usually that's the output. I mean, once you get to the point where you're visualizing the data, you've created an algorithm or some sort of code that produces what is to be visualized. So at the end of the day, the customers can see what all the fuss is about from a data science perspective, but it's usually post the data science component. So do you run into situations where you can see it and it's blatantly obvious, but that doesn't necessarily translate to the business? True. How do you deal with that? There's an interesting challenge with data. I mean, we bandy the word data around a lot. And I've got this fun line I like throwing out there. If you torture data long enough, it will talk. So the challenge then is to figure out when to stop torturing it, right? And it's the same with models. And so I think in many other parts of organizations we'll take something. So if someone's doing a financial report on performance of the organization and they're doing a spreadsheet, they'll get two or three peers to review it and validate that they've come up with a working model and that the answer actually makes sense. And I think we're rushing so quickly at doing analysis on data that comes in various formats at high velocity that I think it's really important for us to actually stop and do peer reviews of the models and the data and the output as well. Because otherwise we start making decisions very quickly about things that may or may not be true. It's very easy to get the data to paint any picture you want. And you gave the example of the five different attempts at that thing. And I have this shootout thing as well. I'll take a team.
I'll get two different people to do exactly the same thing in completely different rooms and come back and challenge each other. And it's quite amazing to see the looks in their faces and they're like, oh, I didn't see that. And then go back and do it again until, and just keep iterating until we get the point where they both get the same outcome. There's a really interesting anecdote about when the Unix operating system was being written and a couple of the authors went away and wrote the same program without realizing each other were doing it. And when they came back, they actually had line for line the same piece of C code. Because they'd actually gotten to a truth, a perfect version of that program. And I think we need to often look at when we're building models and we're playing with data, if we can't come at it from different angles and get the same answer, then maybe the answer isn't quite true yet. So there's a lot of risk in that. And it's the same with presentation. You can paint any picture you want with the dashboard, but who's actually validating with the dashboard that's painting the correct picture? So, go ahead, please. I'm sorry if I could. So there is a science actually behind data visualization. You know, if you're doing trending, it's a line graph. If you're doing comparative analysis, it's a bar graph. You're doing percentage, it's a pie chart. Like there is a certain science to it. It's not that much of a mystery as, you know, the novice thinks there is. But what makes it challenging is that you also, just like any presentation, you have to consider your audience. And your audience, whenever we're delivering a solution, either insight or just data in a grid, we really have to consider who is the consumer of this data and actually cater the visual to that person or to that particular audience. And that is part of the art and that is what makes a great data scientist. 
The consumer may in fact be the source of the data itself, like in a mobile app. So you're tuning their visualization and then their behavior is changing as a result. And then the data on their changed behavior comes back. So it can be a circular process. Sure. So Jim, at a recent conference you were tweeting about the citizen data scientist and you got emasculated by some of it. I spoke there too. Okay. TDWI, on that same topic. Kirk Borne, I hear, came after you. Oh, the Kirk Ment. Called foul flag in the play. Kirk Ment, well. So, no, great. I love Claudia Imhoff too. But yeah, it's a controversial topic. So I wonder what our panel thinks of that notion of a citizen data scientist. Can I respond about citizen data scientists? Yeah, please. I think this term was introduced by Gartner analysts in 2015. And I think it's a very dangerous and misleading term. I think definitely we want to democratize the data and give access to more people, not just data scientists but managers, BI analysts. But there is already a term for such people, we can call them business analysts, because that implies some training, some understanding of the data. If you use the term citizen data scientist, it implies that without any training, you take some data and then you find something there. And I think, as Des mentioned, we've seen many examples; it's very easy to find completely spurious random correlations in data. So we don't want citizen dentists to treat our teeth or citizen pilots to fly the planes. And if data is important, having citizen data scientists is equally dangerous. So I'm hoping that, I think actually Gartner did not use the term citizen data scientist in their 2016 Hype Cycle. So hopefully they will put this term to rest. So Gregory, you apparently are defining citizen to mean incompetent as opposed to simply self-starting. Self-starting is very different, but that's not, I think, what was the intention.
I think what we see in terms of data democratization, there is a big trend toward automation. There are many tools, for example, there are many companies like DataRobot, probably IBM has interesting machine learning capability towards automation. So I think I've recently started a page on KDnuggets for automated data science solutions. And there are already 20 different firms that provide different levels of automation. So one can believe either in full automation, maybe some expertise, but it's very dangerous to have a partly automated tool and at some point then ask a citizen data scientist to try to take the wheel. I want to quickly chime in on that. Yeah, pile on. I totally agree with all of that. I think the comment I just want to quickly put out there is that the space we're in is a very young and rapidly changing world. And so what we haven't had yet is this time to stop and take a deep breath and actually define ourselves. So if you look at computer science in general, a lot of the traditional roles have sort of had 10 or 20 years of history. And so through the hiring process and the development of those spaces, we've actually had time to breathe and define what those jobs are. So we know what a systems programmer is and we know what a database administrator is, but we haven't yet had a chance as a community to stop and breathe and say, well, what do we think these roles are? And so to fill that void, the media creates coinages. And I think this is the risk we've got now, that the concept of citizen data scientist was just a term that was coined to fill a void because no one quite knew what to call somebody who didn't come from a data science background if they were tinkering around in data science.
And I think that's something that we need to sort of sit up and pay attention to, because if we don't own that and drive it ourselves, then somebody else is going to fill the void and they'll create these very frustrating concepts like citizen data scientist, which drives us all crazy. Miriam's next. So I wanted to comment. I agree with both of the previous comments in terms of a citizen data scientist. And I think whether or not you're a citizen data scientist or an actual data scientist, whatever that means, I think one of the most important things you can have is a sense of skepticism, right? Because you can get spurious correlations and it's like, wow, my predictive model is so excellent, and you need to be aware of things like leaks from the future. This actually isn't predictive at all. It's a result of the thing I'm trying to predict. And so I think one thing I know that we try and do is if something really looks too good, we need to go back in and make sure. Did we not look at the data correctly? Is something missing? Did we have a problem with the ETL? And so I think that a healthy sense of skepticism is important to make sure that you're not taking a spurious correlation and trying to derive some significant meaning from it. I think there's a Dilbert cartoon that I saw that described that very well. Joe, did you have a comment? I think that in order for citizen data scientists to really exist, I think we do need to have more maturity in the tools that they would use. My vision is that the BI tools of today are all going to be replaced with natural language processing and searching, just be able to open up a search bar and say, give me sales by region. And to take that one step into the future even further, it should actually say, what are my sales going to be next year? And it should trigger a simple linear regression. Or be able to say, which features of the televisions are actually affecting sales and do a clustering algorithm.
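The "what are my sales going to be next year" query mentioned above would, in its simplest form, come down to an ordinary least-squares line fitted to past sales. A minimal sketch of that step follows; the function names `fit_line` and `forecast` are hypothetical, and the yearly sales figures are made-up illustrative numbers, not data from the panel.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, next_x):
    """Extrapolate the fitted line to a new x value (e.g. next year)."""
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * next_x


# Hypothetical yearly sales, invented purely for illustration.
years = [2013, 2014, 2015, 2016]
sales = [100.0, 110.0, 120.0, 130.0]
print(forecast(years, sales, 2017))
```

A straight-line extrapolation like this ignores seasonality and uncertainty, which is part of why the natural-language front end is the easy half of the problem.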
I think hopefully that will be the future, but I don't see anything of that today. And I think in order to have a true citizen data scientist, you would need to have that. And that is pretty sophisticated stuff. So I think for me, the idea of citizen data scientists, I can relate to that. For instance, when I was in graduate school, I started doing some research on FDA data. It was an open source data set, about 4.2 million data points. Technically, when I graduated, the paper was still not published. And so in some sense, you could think of me as a citizen data scientist. I wasn't getting funding. I wasn't doing it for school. I was still continuing my research. So I'd like to hope that with all the new data sources out there, that there might be scientists or people who are maybe kept out of a field, people who wanted to be in STEM and for whatever life circumstances couldn't be in it, that they might be encouraged to actually go and look into the data and maybe build better models or validate information that's out there. Justin, I'm sorry, you have one comment. It seems data science was termed before academia adopted formalized training for data science. But yeah, you can make, like I said, you can make data work for whatever problem you're trying to solve, whatever answer you see. You want data to work around it, you can make it happen. And I kind of consider that, in project management, data creep. So you're so hyper-focused on the solution, you're trying to find the answer, that you create an answer that works for that solution, but it may not be the correct answer. And I think a crossover discussion works well for that case. So, but the term comes up because there's a frustration, I guess, that data science skills are not plentiful and it's potentially a bottleneck in an organization. Supposedly 80% of your time is spent on cleaning data. Is that right? Is that fair? So there's a problem. How much of that can be automated and when?
So I think there's a shift that's going to come about where we're going to move from centralized data sets to data at the edge of the network. And this is something that's happening very quickly now, where we can't just haul everything back to a central spot when the internet of things actually wakes up. Things like the Boeing Dreamliner 787, that thing's got 6,000 sensors and it produces half a terabyte of data per flight. There are 87,400 flights per day in domestic airspace in the US. That's 43.5 petabytes of raw data. That's about three years of the disk manufacturing in total, right? We're never going to copy that across to one place. We can't process it in one go. So I think the challenge we've got ahead of us is looking at how we're going to move the intelligence and the analytics to the edge of the network and precook the data in different tiers. So have a look at the raw material we get and boil it down to a slightly smaller data set, bring a metadata version of that back, and eventually get to the point where we've only got the very minimum data set and data points we need to make key decisions. Without that, we're already at the point where we have too much data and we can't munch it fast enough and we can't spin up enough tin even if we switch the cloud on. And that's just this never-ending deluge of noise, right? And you've got that signal versus noise problem. So I think we're now seeing a shift where people are looking at, how do we move the intelligence back to the edge of the network? Which we actually solved some time ago in the security space, you know, spam filtering. If an email hits Google on the West Coast of the US and they create a checksum for that spam email, it immediately goes into a database and nothing gets through on the opposite coast because they already know it's spam. They recognize an email coming in and that's evil, stop it. So we've already fixed it in security with intrusion detection, we fixed it in spam.
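The back-of-the-envelope volume estimate quoted above can be reproduced in a few lines. This sketch uses only the figures as spoken (87,400 flights per day, half a terabyte per flight) and makes two loud assumptions that are not in the discussion: that every flight generates the full half terabyte, and that decimal units apply (1 PB = 1,000 TB). The constant and function names are hypothetical.

```python
# Figures as quoted by the speaker; not independently verified.
FLIGHTS_PER_DAY = 87_400
TB_PER_FLIGHT = 0.5

# Assumption: decimal units, 1 petabyte = 1,000 terabytes.
TB_PER_PB = 1_000


def daily_volume_pb(flights, tb_per_flight):
    """Raw sensor data per day in petabytes, if every flight produced
    the same amount (an assumption made for this estimate)."""
    return flights * tb_per_flight / TB_PER_PB


print(f"{daily_volume_pb(FLIGHTS_PER_DAY, TB_PER_FLIGHT):.1f} PB per day")
```

Under these assumptions the product comes to roughly 43.7 PB per day, in line with the rounded figure of about 43.5 PB given in the discussion; either way the order of magnitude is what drives the argument for processing at the edge.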
So we now need to take that learning and bring it into business analytics, if you like, and see where we're finding patterns and behavior and push that out to the edge of the network. So if I'm seeing a demand over here for tickets on a new sale of a show, I need to be able to see where else I'm going to see that demand and start responding to that before the demand comes about. I think that's a shift we're going to see quickly because we will never keep up with the data munging challenge and the volume's just going to explode. We have a couple minutes. That sounds like a great topic for a future CUBE panel, which is the data science on the edge of the fog. I got a hundred questions around that. So we're wrapping up here, just got a couple of minutes. Final thoughts on this conversation or any other pieces that you want to punctuate? I think one thing that's been really interesting for me being on this panel is hearing all of my co-panelists talk about common themes and things that we are also experiencing, which isn't a surprise, but it's interesting to hear about how ubiquitous some of the challenges are, and also at the announcement earlier today, some of the things that they're talking about and thinking about, we're also talking about and thinking about. So I think it's great to hear we're all in different countries and different places but we're experiencing a lot of the same challenges, and I think that's been really interesting for me to hear about. Anybody else, final thoughts? To echo Des's thoughts. It's about, we're never going to catch up with the amount of data that's produced. So it's about transforming big data into smart data. I could just say that with the shift from normal data, small data to big data, the answer is automate, automate, automate.
And we've been talking about advanced algorithms and machine learning for the science for changing the business but there also needs to be machine learning and advanced algorithms for the back room where we're actually getting smarter about how we ingest data and how we fix data as it comes in because we can actually train the machines to understand data anomalies and what we want to do with them over time. And I think the further upstream we get of data correction, the less work that will be downstream. And I also think that the concept of being able to fix data at the source is gone. That's behind us. Right now the data that we're using to analyze, to change the business, typically we have no control over. Like this, they're coming from sensors and machines and internet of things and if it's wrong, it's always going to be wrong. So we have to figure out how to do that in our laboratory, right? Yves, final thoughts? Yeah, I think it's a mind shift in being a data scientist. If you look back at the time, why did you start developing or writing code because you like to code whatever just for the sake of building a nice algorithm or a piece of software or whatever. And now I think with the spirit of a data scientist, you're looking at a problem and say, this is where I want to go. So you have more the top down approach than the bottom up approach and have the big picture. And that is what you really need as a data scientist, just look across technologies, look across departments, look across everything. And then on top of that, try to apply as much as skills that you have available. And that's that kind of unicorn that they're trying to look for because it's pretty hard to find people with that wide vision on everything that is happening within the company. So you need to be aware of technology. You need to be aware of how a business is run and how it fits within in a cultural environment. 
You have to work with people, and all those things together, to my belief, make it very difficult to find those good data scientists. Jim, your final thought? My final thought is this is an awesome panel and I'm so glad that you've come to New York, and I'm hoping that you will all stay, of course, for the IBM DataFirst launch event that will take place this evening about a block over at Hudson Mercantile. So that's pretty much it, thank you. I really learned a lot. I want to second Jim's thanks. Really great panel, awesome expertise. Really appreciate you taking the time, and thanks to the folks at IBM for putting this together. And I'm a big fan of most of you, all of you on the session here. So it's great just to meet you in person, thank you. Okay, and I want to thank Jeff Frick for being a human curtain there with the sun setting here in New York City. Well, thanks very much for watching. We are going to be across the street at the IBM announcement. We're going to be on the ground. We open up again tomorrow at 9:30 at Big Data NYC, Big Data Week, Strata + Hadoop World. Thanks for watching everybody. That's a wrap from here. This is theCUBE, we're out.