Hello and welcome. My name is Shannon Kemp and I am the Chief Digital Manager of DataVersity. We would like to thank you for joining the latest installment of the DataVersity webinar series, Data Insights and Analytics, brought to you in partnership with First San Francisco Partners. Today, Kelly will interview Chief Data Scientist Narasimha Adala to discuss the role of a data scientist. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A panel in the bottom right-hand corner of your screen, or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using the hashtag #DIAnalytics. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and any additional information requested throughout the webinar. Now, let me introduce our speaker for today, Kelly O'Neill. Kelly is the founder and CEO of First San Francisco Partners, having worked with the software and systems providers key to the formulation of enterprise information management. Kelly has played important roles in many of the groundbreaking initiatives that confirmed the value of EIM to the enterprise. Recognizing an unmet need for clear guidance and advice on the intricacies of implementing EIM solutions, she founded First San Francisco Partners in early 2007. And with that, I will turn it over to Kelly to introduce today's guest speaker and to get today's webinar started.

Hello and welcome. Thank you so much, Shannon. Good morning, good afternoon, and if anyone is in Europe, good evening. Joining me today is Narasimha Adala. Narasimha is the chief data scientist with Databloom.com. Before co-founding Databloom in 2016, Narasimha was a solution architect at Intel, building high-volume manufacturing control systems.
Prior to that, he worked as a consulting research scientist with Reed Elsevier Incorporated, building petascale massively parallel processing information extraction systems. Narasimha holds a master's degree in human factors engineering from Wright State University and ABD status in data mining, also from Wright State University. So welcome, Narasimha. Very happy to have you with us today.

Thank you, Kelly. It is my privilege.

Great. Well, we have a few questions that we've put together that we thought would be a great way to get this webinar started, and then we would love to hear your questions. As Shannon mentioned, use the Q&A panel on your screen; the goal is to have this be an open interview and have your questions get answered. So we're going to start off with some very foundational questions: What does a data scientist do? What skills are needed? What training is needed? And then my current favorite, which I realize is a terribly crafted grammatical question: Who do data scientists report to? I am just imagining my English teacher, Mrs. Clark, screaming at me for ending a sentence with a preposition. Anyway: to whom do data scientists report? And of course, anything else that you want to bring up. So without further ado, why don't we just get started? Narasimha, in your experience, what does a data scientist do?

Good morning, folks. So, data scientists draw actionable insights from data. That's the accepted definition in the industry. But what do we really mean by that? Of course, everybody draws actionable insights from data; that's what data is supposed to do for us. So to introduce you to what we do, I wanted to give you an anecdotal story of how I serendipitously walked into data science. I used to work as a research scientist with Reed Elsevier, as you rightly introduced.
I was building massively parallel processing systems to index content and make it searchable for lawyers and academics, so they could do their research. Now, in 2011, a company called Manta Media had me out from Dayton to Columbus to be their chief data scientist. Mind you, I had worked as a research scientist and I was applying for, quote-unquote, a research science job with Manta Media. I had gone to the interview and aced it; of course, I knew what I was talking about. And so they immediately called me back and said, you know, hey, we can actually make you chief data scientist. And to be very candid and honest with you, I had no idea what a data scientist was; I had never even heard of data science. And so I was really upset. I thought this was one way my skills were being underrated, if you will. Now, I have a very good friend in Mountain View, California, and we all know him: he's called Google. So I duly searched on Google. Right? Don't we love Google? My best friend, I'll tell you, I spend a good 50% of my time every day with him. So I searched and came across the article Dr. Tom Davenport had written about how data scientist was the sexiest job of the 21st century. And what it led me to believe was that data scientists dive neck-deep into the data and draw insights that really solve business challenges. The key takeaway that I had, and that I believe my management had as well, was that we were this unicorn clan that was day in, day out breathing data, if you will, in order to find the golden nugget that would really advance business prospects. It was an awesome deal, but frankly I had no idea what I was doing.
So for the first three months, I wanted to fake it till I made it, hoping that something would click in my head as to what a data scientist really does, even as I was focused on quote-unquote research science. But the long story is, three months later I was still pretending, still faking it till I made it. By the way, in those three months I hadn't really solved any business challenge per se. Then one business problem came about. Our chief technology officer posed it to me. He said, you know, Narasimha, do me a favor. By the way, Manta Media is an online small-business listings company. They source business listings information from secretaries of state and make web pages available with business contact information, so small and medium businesses have a web presence, if you will. Long story short, we sourced that business listings information — the 75 million businesses in the U.S. and worldwide, for example, that we had information for — from Dun & Bradstreet and from secretaries of state, and we paid a lot of money for it. But what was happening was our information was being stolen — stolen by pirates that came through underground networks, stole this information, and made a black market out of it. And obviously this was not a healthy prospect for our business. And so he said, you know, hey, Narasimha, can you help me? And I had no idea. But he gave me the web logs. He told me, here are the kinds of patterns that we're seeing. We have about 30 million visitors a day. Can you tell us who are the rogues out of this, and who are the genuine customers that are soliciting or seeking business information?
I sat through it, and what immediately became apparent was how these underground networks were being leveraged to crawl this information and steal it — how there were directed patterns versus undirected patterns. Because we as humans usually use the hyperlink structure to read the content; we space out our requests, and so on and so forth. But the point I'm trying to get across is, after three months I had this Eureka moment saying, ah, now I've solved the problem. I can tell who is really a pirate user stealing our information from a genuine customer who is really soliciting small and medium business listing services, if you will. And when I solved that problem, it gave me a tremendous amount of satisfaction, and for the first time it told me something. I'm sorry for the very long verbal story here, but the point I'm trying to get across is: we draw insights from the data, but we don't do undirected searches. Maybe it was my misunderstanding, reading Tom Davenport's article in Harvard Business Review, that we were this unicorn clan trying to forge new business models, new discoveries. That wasn't the case. These are all purposefully driven business problems that currently need solving, but the business does not have the very vital skills to solve them. Data scientists dig into these big data structures, if you will, to find the patterns that solve those problems. So it is a very meaningful, applied, purposeful investigation that seeks to draw insights from the data.

That's really interesting, Narasimha, because I do think there's a bit of a misperception — or maybe it's my own misperception — that data science is somehow this magical thing where the data reveals itself, as opposed to, like you said, purposeful investigation. Anyway, I think that's really interesting. So thank you for sharing that. Thank you.
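[Editor's note: the crawler-versus-human signal Narasimha describes — steady, rapid, metronome-like request streams versus irregularly spaced human browsing — can be sketched roughly as below. The 0.5-second mean-gap and low-variance thresholds are illustrative assumptions, not values from the webinar.]

```python
import statistics

def classify_session(timestamps):
    """Label a visitor session 'crawler' or 'human' from its request times.

    Heuristic sketch: automated crawlers tend to fire requests at a steady,
    rapid cadence, while humans pause irregularly to read pages. Thresholds
    here are hypothetical cutoffs for illustration only.
    """
    if len(timestamps) < 3:
        return "human"  # too little evidence to flag anyone
    # Gaps between consecutive requests in the session.
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    spread = statistics.pstdev(gaps)
    # Fast, evenly spaced request streams look like directed crawling.
    if mean_gap < 0.5 and spread < 0.1:
        return "crawler"
    return "human"
```

Applied over 30 million daily visitors, even a crude session statistic like this separates directed scraping patterns from genuine browsing.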
So what I additionally wanted to do, Kelly, is illustrate a recent set of use cases, perhaps, to try to define what data scientists do — at least the ones I have worked on. And the intent of doing that is not just bragging, of course. We all would like to brag.

It's okay, Narasimha, you're on the interview to brag. We want to hear what you do.

The other intent is, obviously, there are questions on how this is different from a research scientist, or from, say, a data analyst who lives and breathes the same data. Hopefully the use cases will make it distinct. If not, I do have some explicitly stated statements that will try to draw the distinction between the three personas. If you don't mind advancing to the next slide, Kelly. Perfect. So I will not speak to this slide right now; I will come back to it. But there are three use cases that I really, really wanted to highlight. So, Reed Elsevier is an online information listings company. What we do is crawl information — legal information, business information, financial information and so on — and make it indexed and searchable, so that lawyers, academic researchers, tax professionals, health professionals and so on can come to our website and do research. So think about how you and I have our best friend Google day in, day out; the lawyers and scientific researchers use Reed Elsevier quite extensively for their research. One of the problems posed to us back in the 2009, 2010-ish timeframe was: hey, how come our royalty rates are as high as they are? Obviously, we have to pay for this content. It's not like we simply crawl and fleece the copyrighted materials for commercial purposes. We have to pay royalty rates for every link that we present to our customers and consumers — royalty rates back to the suppliers, the content suppliers and publishers.
So in 2010, again, the question was posed as to why our royalty rates were as high as they were. It turns out, when we actually looked into the data — the web logs, the dwell times and so on for each piece of content in our search results — the problem was with the content we were surfacing. I don't know how many of you know this, but there's a consortium, basically: you have Associated Press or Reuters, for example; they're republishers. They compile material from other people and republish that content. The long story is, Reed Elsevier was soliciting information from multiple sources for the same content, and we were paying double and triple the royalty rates for what we were actually serving the customer. So if, let's say, article one — titled the same way, written by the same person — was sourced through Factiva versus sourced through Reuters versus sourced through Associated Press, we would literally pay the royalty to all the publishers, even though we only surfaced content from one of those three sources.

That's really interesting, and I think some of the folks on the phone can probably relate to that, because we all buy data, right? Whether it's, you know, Dun & Bradstreet or what have you. And in these large organizations, it's not always clear whether we are buying it once or multiple times.

Exactly. Exactly. And for a research firm with at least three petabytes of content, you can imagine how much in royalty savings we could have achieved had we had a deduplication module in place. That was as a research scientist — I should really recast that as data science — but we wouldn't have known that we had these leaks in our content royalty rates had we not had the data at our disposal: the web log patterns, the analytical patterns of the dwell times and so on, to optimize and reduce all of this leakage that we had going on.
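[Editor's note: the deduplication module Narasimha wishes they had built might look, in miniature, like the sketch below — the same article syndicated through Factiva, Reuters, and Associated Press collapses to one key, so the royalty is paid once. The field names and the lowercase/whitespace normalization are simplifying assumptions, not details from the webinar.]

```python
import hashlib

def content_key(title, author):
    """Build a source-independent key for an article.

    Normalizes case and whitespace so the same article from different
    feeds hashes identically. Real systems would normalize far more
    aggressively; this only illustrates the idea.
    """
    normalized = " ".join(title.lower().split()) + "|" + " ".join(author.lower().split())
    return hashlib.sha1(normalized.encode()).hexdigest()

def deduplicate(articles):
    """Keep one record per (title, author), whatever the source feed."""
    seen = {}
    for art in articles:
        key = content_key(art["title"], art["author"])
        seen.setdefault(key, art)  # first source encountered wins
    return list(seen.values())
```

With the duplicates collapsed, royalties would be owed only on the single surfaced copy rather than on every feed that supplied it.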
A second use case, for example: at the high-volume manufacturing company that I worked for, we do what is called layout-sensitive defect analysis. That's a very complex term describing something that is otherwise very simple. We make circuits, right? Circuits — wires that connect point A to point B. And when you are making a complex circuit on a wafer and a die, mistakes happen. Mistakes happen, one, because we are operating at such nanoscales, if you will, but also, you know, with respect to just the environmental factors that exist. And so when a circuit fails in the field, we have an obligation to root-cause exactly what went wrong. Is it because of a dry etch? Is it because of a cross-connect between two circuit patterns that are not supposed to connect with each other, and so on? So one of the things that we did is we would put the circuit under a scan to see what in fact went wrong. And this, you can imagine, is a very manual, laborious process that frankly takes days and weeks for us to do. Obviously, when a multitude of these failures happen, you don't want to examine every one of them; you want some automated way to coalesce these kinds of problems into groups, so you can dispense with an entire diagnostic group through one fault-analysis correlation versus doing multiple investigations, if you will. And so we had learned about this, quote-unquote, deep learning — how we could use convolutional neural networks to do some of this automated image analysis.
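[Editor's note: the coalescing step — grouping failed units so one root-cause analysis covers many — might be sketched as below. This is a stand-in: it bins defect maps by a coarse spatial signature instead of clustering the CNN image embeddings Narasimha describes, and the coordinate scheme is hypothetical.]

```python
def diagnostic_groups(defect_maps):
    """Group wafer defect maps so one fault analysis covers many units.

    defect_maps: unit id -> list of (x, y) failure coordinates on the die.
    Units whose failures fall in the same coarse die regions are binned
    together; a production system would group learned image embeddings
    instead of this toy signature.
    """
    groups = {}
    for unit_id, coords in defect_maps.items():
        # Coarse signature: which 50x50 region of the die each failure sits in.
        signature = frozenset((x // 50, y // 50) for x, y in coords)
        groups.setdefault(signature, []).append(unit_id)
    return list(groups.values())
```

One diagnostic investigation per group then replaces one per failed unit, which is what made the approach such a big deal.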
The long story is, we were able to take big lots of circuits, if you will, and group them into diagnostic groups that we could dispense to our magicians — our real magicians, if you will — to root-cause as groups versus individual units. And that was a huge deal, as far as I was concerned. Similarly, there's a third use case — and stop me, by the way; this is the last use case I'll go through.

No, that's good, that's good. And then we've got a question that I wanted to add in, so when you're done with this use case, I'm going to take a question from the participants.

Perfect, perfect. So the third use case: there's a rental car company that operates in the US and has the challenge of fleet management. When I say fleet management: how many cars do you hold on the parking lot, so you never turn away a customer who rightfully deserves, or is willing to pay the dime for, a rental car? When we dug into the data — one good example here is the Phoenix airport; for those that don't know the Phoenix environmental conditions, it's extremely hot in summer — we were seeing approximately 100,000 fewer cars rented in summer versus the usual moving average for Phoenix. And then we could also see, for example, what class was being under-rented versus what class was being oversubscribed at other times, and so on and so forth. Obviously, the rental car company had a problem on their hands: they were scratching their heads and saying, how do we maximize the utilization of the fleet, if you will? And so we could apply temporal pattern analysis to see the supply-demand patterns and so on. But what we'd also done was take a holistic geo-temporal view, and we had a Eureka moment: Los Angeles, for example, was renting 300,000 cars more during the same timeframe that Phoenix was exhibiting a slump.
And so now we knew how to do mixed modeling in order to maximize profits by doing proper fleet management for this rental car company. So these are the kinds of use cases that data scientists deal with: they have a stated business problem that absolutely needs solving — we're not magicians — and we dig into the data, look for the patterns in this data, and look to solve these challenges, typically by doing what-if analysis. And that gets me to the first talking point on this slide, which is that typically people start out answering what and how questions. You talk about business analysts, you talk about the business intelligence solutions that we build today: they're almost always looking at how we performed last month or last year, what went wrong, and so on. But what data scientists really try to do is answer the preemptive questions: they ask why. The first trait of a data scientist is that they're not caught up between the what and how answers; rather, they're asking why. And they're also asking what if — meaning they're willing to challenge the status quo, emulate or simulate the scenarios as they play out, and drive actionable strategies that they can then recommend to the business. So in that sense they are formulating new business models — not so much by just staring at the data for a long, long time, which is what I was initially led to believe, reading the Harvard Business Review article through my misconceived eyes. That, in a sense, is the difference between the BI analysts, the data analysts, and the data scientists. A data scientist is willing to ask the why and the what-if questions.
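[Editor's note: the geo-temporal rebalancing insight — a Phoenix summer slump offsetting a Los Angeles surge in the same window — can be sketched as a toy matching of surpluses to deficits. The greedy matching and the numbers are illustrative; the company's actual mixed model is not described in the webinar.]

```python
def rebalancing_plan(demand_delta):
    """Suggest car moves between cities from geo-temporal demand deltas.

    demand_delta maps city -> (rentals this period minus moving average):
    negative means idle surplus cars, positive means unmet demand.
    Greedily ships surplus cars toward deficit cities.
    """
    surplus = {c: -d for c, d in demand_delta.items() if d < 0}
    deficit = {c: d for c, d in demand_delta.items() if d > 0}
    moves = []
    for src in sorted(surplus):
        have = surplus[src]
        for dst in sorted(deficit):
            need = deficit[dst]
            if have == 0 or need == 0:
                continue
            shipped = min(have, need)
            moves.append((src, dst, shipped))
            have -= shipped
            deficit[dst] -= shipped
    return moves
```

On the webinar's example figures, the plan would route the idle Phoenix fleet toward the Los Angeles surge instead of letting it sit on a hot parking lot.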
The other difference, I think, is from the research science perspective, where I was focused so much on the CRISP-DM process when I was with LexisNexis doing traditional statistical data mining. I don't know how many of you are aware of the CRISP-DM industry-standard process, but basically it's about collecting data, formulating a hypothesis, evaluating the data against the hypothesis, and so on. The distinction the data scientist has beyond the research scientist is that they use the same principles, but they are now dealing with big data — data that's happening in the real world. This is not controlled data that has been pre-canned, if you will, or postulated for the real problem; rather, this data is happening in near real time, at a very high velocity, in a variety of data structures. And so what we have the responsibility to do is apply some of these hypotheses not in a controlled lab setting, but on this live data as it's happening. And we're able to see all of this through lean sigma principles — the value, agility, and engagement kinds of imperatives that the Six Sigma principles really preach to us. Data scientists are able to apply the traditional research science hypotheses and methodologies in the big data context, with a very clearly, purposefully driven business process and value insight. At the bottom, by the way, I've listed a bunch of use cases. These are just exemplary, if you will; there are tons of other use cases that we'll come across as we go through the interview, hopefully.

Excellent. Excellent. Let's take a couple of questions while we're on this topic. So the first one is: would you consider operations research analysts to be data scientists?
This comes from a woman who works for the federal agency that was identified in Tom Davenport's Competing on Analytics as the first non-defense application of analytics. She says: we have economists, statisticians, and/or analysts. They all perform analytics, but they don't have the title data scientist. So again, would you consider operations research analysts to be data scientists?

I think so. Absolutely. I mean, they are. And I believe it used to be called decision science before, in the operations research domain; we're calling it data science right now. But I think one clear distinction, even if it is a subtle one, is that it's not necessarily an optimization routine, as you will see in the top-left graphic. You're not just doing the traditional BI descriptive analytics or diagnostic analytics, or even operations research optimizations — your warehouse management and your supply-demand matchmaking, if you will. You're also able to do some preemptive analytics, and that does not necessarily mean operations research alone. So we're not necessarily just walking towards the Taguchi-optimal point of dispensing; we're able to also formulate data strategies beyond the business strategies as well.

Got it. Got it. So then it's really the way that those operations research people behave, and how far they get into the why and the what-if. Exactly. Yeah. Okay, great. So another question from the audience: how much of the data that is gathered is already structured? Is there a significant amount of time spent cleaning and labeling the data sets so that they are suitable to train machine learning models?

Whoever asked that question actually knows what they're talking about, to be honest with you. Very good. We do have an experienced audience here, so I'm not surprised.
So, data scientists — I kind of address this in the upcoming slides, but in any case, let me give it away right now: it's not really the modeling that's challenging. Everyone basically develops an instinct, as they go along in this practice, for which statistical model to apply, what kind of analysis to apply, and what kind of quantitative metrics to use to measure the analysis. That's not the challenging piece, frankly. Initially it might seem like it. I guess it's like driving: initially we remember every little physical movement we have to make, the motor skills we have to develop, but once you really become good at it, it becomes muscle memory. It's very similar in data science. It's not the modeling that's very challenging — that becomes muscle memory. But the data never comes in a nicely formatted data frame that you can simply apply a model to: here is my regression equation. It never comes in that form. 80 to 90 percent of the time, frankly, goes into simply collecting, curating, shaping, imputing, and exploring the dataset into a form that is then manageable.

Wow. 80 to 90 percent. Oh, yes. Oh, my goodness.

And that, in fact, is why we are the unicorns, right? I mean, it's very hard to master that particular skill, and that mastery really comes with a lot of experience, if you will. That persistence and patience to go through the rigor of data preparation is why, in fact, data scientists get paid a higher dime than most other professions. Usually. Right. Great. Okay. Excellent. So let's move into our next category, and that is: what skills are needed by a data scientist? I think we've got two aspects of this: what skills are needed, and then what training.
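[Editor's note: the 80-to-90-percent claim — collecting, curating, shaping, imputing — can be made concrete with a toy preparation pass. The field names (`id`, `revenue`) and the mean-imputation choice are illustrative assumptions, not from the webinar.]

```python
def prepare(records):
    """Toy version of the data-preparation work: curate, clean, impute.

    Takes raw rows with missing or dirty fields and returns rows a model
    could actually consume.
    """
    # 1. Curate: drop rows with no usable identifier.
    rows = [r for r in records if r.get("id")]
    # 2. Clean: coerce numeric strings, treating junk as missing.
    for r in rows:
        try:
            r["revenue"] = float(r.get("revenue", ""))
        except (TypeError, ValueError):
            r["revenue"] = None
    # 3. Impute: fill missing revenue with the column mean.
    known = [r["revenue"] for r in rows if r["revenue"] is not None]
    mean = sum(known) / len(known) if known else 0.0
    for r in rows:
        if r["revenue"] is None:
            r["revenue"] = mean
    return rows
```

Even this three-step toy dwarfs the one line of "modeling" that would follow it, which is the point being made about where the time actually goes.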
So I think the way we have flowed this presentation is answering the two questions at the same time. Sure. So is this an appropriate slide? Should we start here? Yes, this is very appropriate. Great.

And this slide is really not intended for someone to read through from top left to bottom right.

Well, that's good, because I can't. I don't know if anyone else can.

So the point I wanted to get across with this slide is that the scope is very wide and deep. What do I mean by that? Data science — and you will see this in the upcoming slides as well — is really a symphony, if you will. Symphony is probably a loaded word, but it is a mutual consensus between three units: one of them is the business, the other is the technology, and the third is the science. What I have on the left-hand side is the scope of the business problems that we usually try to address. Now, they're still articulated in technical terms, and I apologize for that, but really what we're trying to do is collect data and draw data-driven insights, optimize our royalty rates — for example, through deduplication — automate some of these decisions, and so on and so forth. So on the left-hand side you see the kinds of use cases that data scientists usually deal with from a business perspective. From a technology perspective, in the top-right corner you see that we deal with a variety of data. When I say a variety of data: it comes unstructured, it comes semi-structured, it comes structured. And not all analytics are the same. In some cases you're trying to do forecasting, which typically means you're doing temporal, time-based analysis. In other cases, you're looking to do conjoint analysis — which feature is selling most to our customers? In those cases, you might want to do regression.
In other cases, you're doing customer segmentation, looking to optimize your messaging to the customers who are likely to buy your product up front, given that your field sales resources are very limited, for example. In that case, you're doing clustering. The point I'm trying to get across is that it depends on the nature of the problem and the kind of data you have at your disposal; the data can be coming in very fast, in a variety of formats, and so on. From a technology perspective, you should be able to deal with whatever kind of data comes at you, at whatever speed. The third unit — and the one that most data scientists really gravitate to first — is the statistical skills. Typically, people go through the same CRISP-DM process of collecting data, curating data, exploring the data, formulating a model, validating the model, and scoring the model, and they apply this cyclically, if you will, to understand the maximum return you can get from the data. As you're doing that, there is the variety of analytics, like I've said, that you see going down the y-axis. And on the x-axis, you can look at it like a life cycle, where you're acquiring data, modeling it, and ultimately scoring it at the other end. So there's a variety of statistical skills that you have to develop, in a sense. The summary of this slide is not that you'd remember all of these terms — only that no one becomes a real data scientist overnight, or even possibly after 10 years. By the way, I'm not a real data scientist even today; I'm a hack. But the point of practicing the process is...

Well, don't tell the audience, because they're listening because they want to listen to a real data scientist. Very, very humble of you. No, I'm kidding. So there's always more to learn, right?
I mean, we all have to acknowledge that the field is very wide and very deep, and all of us should have the humility to acknowledge that we know little and have more to learn, in a sense. But the scope is very wide and very deep. Yeah, for sure. So if you don't mind moving to the next slide, I do have a little bit better breakdown of what it is that you absolutely need. In the bottom-right corner there is a Venn diagram from Dr. Drew Conway, which I think is a very beautiful representation of what a data scientist really needs. They need to be very good at software — a better software engineer than a statistician, and a better statistician than a software engineer. Someone said this beautifully, and I wish I could find that quote. But the point of that particular Venn diagram is that the data scientist trifecta requires that you have good software skills: if you have a hypothesis about something, you don't mind digging into the data, and you may be implementing that in whatever language of your choice — Python, R, Jam, SAS, Java, whatever really makes you comfortable — to be able to tinker with the data and try out those hypotheses very quickly. So you absolutely need software skills. I believe Dr. Drew Conway calls it hacking skills, but to me it's software skills — whatever lets you quickly try something. You need to have statistical knowledge. When I say statistical knowledge, let me not send the wrong message that you need a PhD in statistics. Actually, it's quite the contrary. You need to be able to understand exactly what the nature of the data is. I won't bore you too much, but there are these things called normal distributions, the central limit theorem, and so on and so forth.
So you need to know whether your data's grain, your data's distribution, and so on are violating some of these normality assumptions that we make about the real world, and you should be able to tell that. And you don't tell that because you have a PhD in statistics. You are able to study bounds — such as that age, for example, shouldn't be outside 20 to 80 in your business domain. That doesn't need a PhD in statistics, but you should be able to understand that there are these real-world conditions that need to be applied, and you should be smart enough to weed some of these outliers out. That is the fundamental statistical skill that any data scientist needs. The third, and often overlooked, is the business skills. Like I said, I was faking it till I made it — there's a great book called Fake It Till You Make It. When I was faking it for those three months, pretending to be a real data scientist, not knowing really what I was looking for in the data, what I was missing was the business expertise: understanding what is the purposeful investigation that you're really trying to solve for. And so a data scientist needs to be a combination of software, statistics, and business to succeed. So if you don't mind, I'm going to skip past the statistical and hacking skills, which we've already covered. You need to deal with big data; that's just the harsh reality of today. You're not dealing with siloed databases anymore; you're collecting data at a very fast pace. You need to be savvy in big data — collocating all this data into a data lake and being able to draw insights out of it. And I can't stress enough: all of us can use the communication skills, the visualization skills, the leadership skills. In a sense, you could be the greatest data scientist in the world — and we know a few, actually, right here in the San Francisco area.
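[Editor's note: the age-bounds example — no PhD required, just domain knowledge that age should fall between 20 and 80 in this business — reduces to a few lines. The bounds are the webinar's example numbers.]

```python
def flag_outliers(values, lower=20, upper=80):
    """Split values into (in-bounds, outliers) using domain bounds.

    Encodes the real-world condition a data scientist applies: anything
    outside the business domain's plausible range is suspect and gets
    weeded out before modeling.
    """
    keep, outliers = [], []
    for v in values:
        (keep if lower <= v <= upper else outliers).append(v)
    return keep, outliers
```

The same pattern applies to any field with a known plausible range; only the bounds change with the domain.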
But if you cannot digest and communicate the analysis that you've done, the insight that you have drawn, to your key stakeholders in a way that is digestible from their viewpoint, then we have tremendously failed. So you could be the best scientist, but if you're not able to articulate that into something they can relate to and validate, then we have failed. So communication skills are very, very important, and I myself am still learning that skill as we go along. Visualization skills are important. You do not want to tell a story through a spreadsheet. You do not want to tell a story through complex data screens or matrices on the screen. Rather, you want to present summarized findings in rich, powerful visualizations, so your customers consume these insights perceptually first, more so than cognitively. Once their interest is piqued, their perceptive processing will switch over into cognitive processing, but you've got to have the skills to present visualizations in a very compelling manner. Third is leadership skills, and we'll come back to that on the next slide. But the point is, you need to be willing to take the risk and lead the way by educating others on your viewpoints, and often by disagreeing and committing as well. You need to exhibit leadership traits as you go along. Okay. And finally, you need to be a lean-startup champion. What do I mean by that? In the 1990s and early 2000s, in research science as it was done then, the appetite for failure was less, because labs were expected to explore new ideas and succeed at them, even if that meant long-tenured research.
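The "summarize first, then visualize" advice can be sketched briefly. This is a toy example with invented regional revenue figures: the raw spreadsheet is collapsed to a summary, and the summary, not the raw matrix, is what gets plotted:

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw findings: rows you would NOT hand to a stakeholder
raw = pd.DataFrame({
    "region": ["East", "East", "West", "West", "North", "North"],
    "revenue": [120, 135, 80, 95, 60, 70],
})

# Summarize before presenting
summary = raw.groupby("region")["revenue"].mean()

# One compelling picture instead of a spreadsheet
fig, ax = plt.subplots()
summary.plot(kind="bar", ax=ax, title="Mean revenue by region")
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

The stakeholder sees three bars and a title, perceptually first, cognitively second, exactly as described above.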
In data science, and that's not to say you don't need the same discipline or regimen, you should be willing to try quickly, be willing to fail, celebrate that failure, and iterate toward success very quickly. Now, I'm preaching what is very obvious, but in data science, at least since 2010, this quick lean-startup mentality is absolutely necessary from a skill perspective: failing fast. Failing fast, yes. Failing fast. And then people typically ask me, how do I become a data scientist? I always tell them: sure, there are a bunch of online classes, and online contests like Kaggle and so on that you can participate in. There is tons of online material you can run through, and you can network through Silicon Valley Data Science, the meetups, the local chapters, for example. There are a lot of online graduate programs you can take, and they're all very, very good. There are tons of tools now. Previously, we didn't have all of these tools unless you went with SAS or SPSS and so on. Now, with tools like pandas, R, and so on, it's easy to just download them and start tinkering with these data sets. But, and let me go back to the left-hand side here of what I was trying to say, there is no surrogate for hands-on data science. Going to Kaggle, going to online courses might teach you the academic underpinnings of the science. Absolutely, you should know that. But there is nothing better, because if you don't practice, you forget. So there's no surrogate for real hands-on experience. And that is in fact where most people struggle: finding a real use case that they can apply these data science skills to. What I'm really saying is start with an enterprise data science use case. What do I mean by enterprise data science?
You know, say a customer is calling you and you collect all these service-management calls. That data is available, and obviously you want to reduce the burden on your customer call center, for example. That's an enterprise use case. Or say a lot of people are logging into the VPN and stealing information. You want to prevent this exfiltration of sensitive data, and IT collects all of these logs. Start with something where you have friends in your enterprise from whom you can solicit this data, and provide some real value within your own enterprise. So you're not really doing product data science; you're not building a new business product. An enterprise clean-up data science use case is a wonderful way to get your feet wet. Great. Excellent. Well, let's take a couple of questions here. I think I know the answer, but I'm going to go ahead and ask this anyway. We've got a question that says: is it compulsory to have IT-based knowledge or skills to be a data scientist? Or is it more about teamwork between business and IT? Excellent question, and I'm going to defer it until we hit the next slide. It is actually a combination. It is not a singular plane of skill that you need to acquire. Got it. Okay. So I have one more question, and you might defer this one too, but I'm going to ask it anyway. Is it possible to graduate into a data scientist's role, or do you need an apprenticeship in other disciplines first? Oh, that's a good one. If you're really talking about an academic graduation, then the answer is no. And it's like this: I'm a mechanical engineer by training, I'm a human factors engineer by training, but I'm a data scientist. So in that sense, it's not an academic degree that gets you there. And I don't think that's what the question is really aimed at.
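The VPN exfiltration example above can be prototyped in a few lines. This is a minimal sketch on synthetic log data; the volumes and the 3-sigma flagging rule are illustrative assumptions, not a recommendation for a production detector:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical daily VPN download volumes (MB), one row per user
volumes = pd.Series(rng.normal(200, 30, 100))
volumes.iloc[42] = 2000   # one user pulling far more data than normal

# Simple z-score rule: flag anything more than 3 standard deviations
# from the mean; a real system would use something more robust.
z = (volumes - volumes.mean()) / volumes.std()
suspects = volumes[z.abs() > 3]
print(suspects)
```

Even this crude rule isolates the anomalous user, which is the spirit of the enterprise use case: start from logs IT already collects and deliver a first piece of real value.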
Can you ease forward, from wherever you are, into a data science role? I think that is the most appropriate reading of the question, and the answer is yes. You don't go inventing new problems, and you don't go change your role for it. I'm sure there are plenty of problems within your current role, wherever you are, whether that's IT, financial analysis, supply chain analysis, manufacturing, customer care, sales and marketing, whatever it is. I'm sure there's a standing problem, and you have a data set that articulates that problem. Try to solve it from right there. Now, regarding the apprenticeship: I think an apprenticeship is also something you could take on, but not at the expense of leaving what you currently have. That's something I would recommend against, frankly, because then you're also starting from scratch trying to learn a new business domain, and that I wouldn't recommend. You have plenty of problems in your existing domain, I suppose. Great. Okay. Good. And then the last question in this category, and then we'll go on to the next group of slides. Is it required to take the Hadoop developer learning path to work in this field? Yes and no. I usually don't make technology calls, but to be brief so we can advance to the next slide: yes, Hadoop is the right technology, or the right architecture, I would say. But if you're really getting into big data, I would focus more on the Spark side of things versus MapReduce programming and so on. Got it. Okay. Great. Well, let's go to the next set of slides. So we'll go to this next category, and we are going to talk about teamwork again. So again, my poorly worded question, apologies: to whom do data scientists report? Yes. All right. Sure.
So I want you to read this; there's a problem here, and I'm trying to describe it. Dilbert's pointy-haired manager says: we have a gigantic database full of customer behavior. And Dilbert replies: excellent, we can use non-linear math and data-mining technology to optimize the results. So the techie in Dilbert is saying one thing, and the pointy-haired manager means something else. Really, the highlight of that particular image is that it's exactly the challenge all of us face today. The business has a certain need, and they think they have data to solve the problem. So they go to technology to get the problem solved. Technology has its own nomenclature, its own protocols, DevOps, and processes that it follows in order to extract this data and make things happen. And what I call the dissonance of data science, as you can see at the bottom corner, is that the business, the technology, and the science are three competing vectors of every enterprise; small or big is a different question. There is absolutely a dissonance: the business wants very quick decisions and very agile business strategies to be implemented, while technology wants to make sure it is still going through the quote-unquote security and privacy compliance, the governance, the DevOps, and so on and so forth. And finally, in between all of this thrash, we have people who are very, very smart, the science geeks, if you will, the data ninjas, who sit there trying to figure out where exactly they fit, because they can't really deal with all this thrash, if you will.
And what I'm really trying to say is that this dissonance exists in every enterprise, and it's a data scientist's responsibility to make sure a common nomenclature and a common viewpoint are reached, to drive harmony across business, technology, and science. As such, they have to deal with a lot of different people within the company, right from the very top, the CXOs, the CEO, the chief operating officer, and so on, all the way to the database administrators. Now, I'm not suggesting a hierarchy here; I'm not saying database administrators do any lowlier work than a CXO does. The point is that it's broad: the data scientist has to touch every one of these personas to understand and drive a common nomenclature across business, technology, and science. So to go back to that earlier question: no, not quite. It's great if you're already sitting on one of these wings or flanks, because you've already established trust with your partners in the organization. But it's much better if you bring this combinatorial knowledge, your business relationships, your technology relationships, and most importantly your academic clarity, to the table than a single point of view. So, to answer your question very specifically, Kelly: because data scientists need to drive this synergy between these three factions, and faction is probably an overloaded word, these three camps, we currently sit, conveniently, under solutions architecture teams. Whenever there's a business problem that has already gone through an exploration scenario, that is when data scientists currently engage, although that's not the right model. But currently, we report into solutions architecture groups, into the technology groups, and so on.
Here is where my viewpoint comes in, by the way, and this is just my viewpoint. In every enterprise or organization we have worked with, and you can relate with your own domains here, there is always a chief product officer carrying the business vision forward. Every organization has one. Then there's a chief data officer, who usually assumes the governance role, if you will, ensuring that the data is adequately maintained and that technology follows the right processes and protocols, like I said before. Unfortunately, you don't see a chief analytics officer, and that's my pet peeve with the industry today, not that I'm especially qualified to comment, but I think it would be great if companies realized there absolutely needs to be a champion for propagating the science across the entire organization. There needs to be a chief analytics officer in order for companies to really realize the value of data science. We are seeing a little bit of that following the rise of the chief data officer, but sponsorship at the most senior level is really critical. I mean, I think you brought up a very good point around the dissonance, particularly as it pertains to privacy: using that data can have a very hard cost for organizations if mistakes or gaffes occur. One of the questions that came in, and then I'll let you get back to your slides, is: as a data scientist, who is your ideal non-scientist partner on the business side, and what critical skills or strengths should they have or cultivate? Ah, that's a good one. The solutions architect would definitely be my favorite person to associate with, because they bring the end-to-end viewpoint that a data scientist absolutely needs.
So, they could be your champion in overcoming this dissonance, the fear, uncertainty, and doubt that each of these business, technology, and science partners carries, because they participate in the meetings right from the beginning of the exploration phases all the way to the execution stages. I think they are the best partners you can make. And they could value your partnership just as much as you value theirs, so assume a collaborative relationship; it's a give and take going forward. And you shouldn't be afraid to quell some of the FUD that has arisen in the industry, with big data for example. Hadoop, for example, could be used to great effect. But when a lot of these companies saw that, hey, Hadoop is very cheap, we can go buy a bunch of machines and put our data in them, they wanted to build a data warehouse out of big data. Now, I will get back to the question. But what happened is that the chief information officers and so on were expecting Hadoop to basically give them analytics on their data lakes. The expectation of what the technology could enable and the business expectation of what it could actually deliver diverged, and it failed. And therefore there was some hyperbole about big data and so on. What I'm getting at is that for data science not to fall prey to the same kind of hyperbole, we have to ensure, at all echelons of the organization, that there are realistic expectations, and that you prove the value through applied data science. So sit close to your best friends in solutions architecture. Sit close to your technology partners to grapple with the data hands-on.
Sit close to the business stakeholders, not by simply preaching to them, but by showing them. Show and tell; put your money where your mouth is, day in and day out, in order for us to be successful as data scientists in this world. Great. Great. So one more question, and then we're going to get to your last couple of slides, which I'd like to get to because I think they're really compelling. Do you have any advice for an organization that wants to start a department working on, or with, data science? Yes and no. Hire a chief analytics officer. Okay. Good. You want to see some of this explore, educate, engage kind of exciting activity happen, but you do not want to sign up for something without a hypothesis, on the assumption that there will simply, absolutely be value. I think there is some precursory education that needs to happen within any organization, and the chief analytics officer should be chartered with driving the prescriptive, predictive, preemptive insights that data science confers on companies. They should champion that cause first. Excellent. Okay. Great. Now, this next section I would like to get through, and then we will wrap up by taking any final questions that have come through. So: where is data science headed in the future? Awesome. This is actually my favorite. If I may summarize: the industry is headed to where deep learning is becoming the norm, inherently an expectation of every data scientist to know. Before I define deep learning, or ease into what deep learning is, let me tell you why I say that. The very same Tom Davenport, whom I respect very much, defines analytics 1.0, 2.0, 3.0, and 4.0, with 4.0 being the deep learning, cognitive learning phase of analytics.
So what I want to do is paint a picture of how it looked in the past. In the past we had User 1.0, Data 1.0, and Technology 1.0. What do I mean by that? There were specific personas or roles doing traditional things: there was a financial analyst, an operations research analyst, a supply chain analyst, for example. They had enough statistical skills and enough data-management skills to look at their domain and analyze its data set; to each their own is the best way of putting it. There was a technology guy and a science guy, and each had their own favorite tools that they would use. The data was also very scattered. We didn't have data lakes, we didn't have big data warehouses; we had data marts, where sales data would sit in one database and marketing data in another, and the sales and marketing folks didn't really talk. Similarly, behavioral analytics data sat somewhere else entirely. The technology was also very dispersed, meaning some people were absolutely in love with SAS, while others absolutely loved their Power BI and Cognos tools, or their SPSS tools, and they warred with each other; they were so wedded to their technology and tools that they never really talked to each other. What is happening now is that we are seeing the evolution of User 2.0, Data 2.0, and Technology 2.0. These scattered data sets, semi-structured, structured, unstructured, arriving at various velocities with various complex event models, come together in a single data lake, a singular logical data warehouse, because of the big data environment, Spark, and so on. We also have technology where you can run in-place analytics, meaning you don't necessarily have to extricate the data from your SQL Server database, for example, and take it to your JMP tool in order to run the
model right there, on your schema, on your table, in place. But most importantly, the users are gaining power personas: they are now assuming a hybrid role, becoming data scientists, understanding the business persona and able to do the data exploration, the shaping, the imputations, and so on, themselves. As this transformation happens, User 2.0 really is the data scientist. And in machine learning, where we were trying to apply explainability and palatability principles to the analysis, that rigor has been thrown out the door with the rise of deep learning. For those of you keeping in touch: the self-driving cars we see, the Google Maps driving directions, the Google Home appliance or Amazon Alexa, and game-playing systems for Go and chess, for example, are all possible because of a technology called neural networks. And these neural networks, unfortunately, don't give you any explainability. A neural network can associate a pattern in action with a data pattern it sees, but if you ask a neural network how it came about making that association and what the inferences were, it cannot tell you. But who cares? The point I'm trying to make is that they have become so good at image categorization, and so good at detecting human stimuli and predicting outcomes from them, even so far as driving a car completely unattended, that it's awesome. And deep learning skills are absolutely necessary for you to succeed in the next four to five years, in my view. I'm not saying throw away the machine learning that you've learned; hold on to the scientific and academic foundations. But also embrace these new, non-explainable deep learning capabilities that the industry is basing itself on. I mean, it sounds like fundamentally this is an ongoing learning process, and we are still very much in a dynamic industry where, as soon as we
think we've learned something, where we think we've mastered something, then something else comes up that we need to look at and build on top of those previously mastered skills. Exactly, yeah. But again, there's one subtle thing; I don't know if I got it across. At least for me, if I may speak from a personal point of view: it's very hard for an academic, and I am an academic, by the way, who wants to understand how an algorithm is behaving, because I want to be able to explain it. I'm not able to do that with these neural networks, the convolutional neural networks, the recurrent neural networks, and so on. It bothered the heck out of me. But I've come around to the feeling that maybe it's not meant for me to understand. I don't mean to get into theology here, but in practice it works; the dang thing, excuse the language, actually works. So what I'm really realizing is that maybe it's not for me to fully understand. It's enough that I have the knowledge to grapple with these modules that are already out in the open-source domain, that Google is putting out through TensorFlow, that Facebook is putting out through Facebook Artificial Intelligence Research, that Microsoft is putting out, that IBM is putting out. Maybe it's for me to just download them, tinker, be very hands-on, and apply them to a real problem: self-driving, image categorization, subtitling a video, for example. Just apply it and see how it works. And that's what I want other people to embrace as well: don't get philosophical about the theory, thinking you absolutely have to understand everything. Not everything needs to be explainable; there are some advances where you just have to learn by doing rather than staying stuck in academic theory. I guess that's the point I want to make. That's great. I think there are a lot of people on the line who want to figure out the how of everything, so I think that's good guidance. So, just a couple more questions; we've got
three minutes left. Without a strong statistical or mathematical background, would it be tough to make progress in the deep learning domain? Not at all; it's quite the contrary. Like I was saying, it's not the academic knowledge that's most important; it's the hands-on hacking skills. And like I said, statistics has never held me back. As a matter of fact, in 2010 I only knew what a normal distribution was, what a t-test was; they're all fundamental principles you learn, but they're not inhibitors to succeeding in this domain. Got it. Okay, great. So, last question: can you elaborate a bit more on the business analyst, which looks at the what, versus the data analyst, which looks at the why? You made this statement early on, when you were looking at what a data scientist is and how that differs from other analyst roles in an organization, with the data scientist thinking about the why as opposed to a business analyst thinking about the what. The direction of the question is trying to understand the key behavioral differences you would identify to delineate between the two. Sure. I think data analysts are very close surrogates of data scientists. There are actually two differences I tend to draw, like I was saying. One is the ability to apply this in a live-data context. It's not that data analysts don't do it; they do. But data analysts are typically dealing with structured data sets that have already been defined, or have already been identified as where the problem, or its solution, exists. Data scientists need the skills to dig directly into the live stream, the big data sets, if you will, to extrapolate the theory, the hypothesis that you have, and try it out on the real data as well. That, I think, is the why aspect; it's a scalable why. And two, they're also trying what-if
theories. What-if, meaning they're not afraid to construct a champion/challenger scenario: what kind of optimization can they achieve by effecting a change commensurate with the hypothesis they've postulated? So it's very similar; the distinction can be very, very fine. But it's the what-if scenarios the data scientist is willing to play with, and the big data they're able to apply them to. Okay, great. So, last question. I know we're at the top of the hour, but this is a quick one and I think it will be interesting: on tools, could you pick a winner? I know that's a big question, but a quick answer. I won't pick a winner among the deep learning frameworks I've been watching; there are many: Caffe, Torch, Theano, TensorFlow, Deeplearning4j, and so on. But I think what you should really focus on is Keras, K-E-R-A-S, a library that abstracts the implementation from the interface. With Keras you can plug and play a different deep learning backend behind it without having to re-architect your implementation; it's all plug and play, in a sense. So if you're really confused about which tool to pick, I would definitely pick Keras. Fantastic. Okay, thank you so much, Narasimha. Shannon, back to you. Thank you so much, Kelly, and thank you so much, Narasimha, for joining us and giving us great insight into what it is to be a data scientist. And thanks to our attendees for being so engaged and asking such great questions throughout the presentation. Just a reminder: I will send a follow-up email by end of day Monday with links to the slides and the recording of the presentation, along with additional contact information so you can continue the conversation. I hope everyone has a great day. Thanks to all. Thank you. All right, take care. Bye.
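The plug-and-play idea Keras embodies, one high-level interface over interchangeable backend implementations, can be sketched in plain Python. These toy classes are not the real Keras API; they are a hypothetical illustration of the design pattern:

```python
# Toy illustration of "abstract the implementation from the interface":
# the Model is written once against a backend contract, and backends
# can be swapped without touching the Model code.
class Backend:
    """Hypothetical minimal backend contract: a dot-product kernel."""
    def dot(self, a, b):
        raise NotImplementedError

class ComprehensionBackend(Backend):
    def dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

class LoopBackend(Backend):
    def dot(self, a, b):
        total = 0.0
        for i in range(len(a)):
            total += a[i] * b[i]
        return total

class Model:
    """High-level 'Keras-like' layer: depends only on the contract."""
    def __init__(self, weights, backend):
        self.weights = weights
        self.backend = backend  # plug and play: swap without rewriting Model

    def predict(self, inputs):
        return self.backend.dot(self.weights, inputs)

# The same Model code runs unchanged on either backend.
m1 = Model([1.0, 2.0, 3.0], ComprehensionBackend())
m2 = Model([1.0, 2.0, 3.0], LoopBackend())
print(m1.predict([4.0, 5.0, 6.0]))
print(m2.predict([4.0, 5.0, 6.0]))
```

Real Keras applies this same separation at the scale of whole deep learning engines, which is why it was recommended as the safe entry point.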