from downtown San Francisco. It's theCUBE! Covering IBM Chief Data Officer Strategy Summit 2018. Brought to you by IBM.
Welcome back to San Francisco, everybody. We're at the Park 55 in Union Square, and this is theCUBE, the leader in live tech coverage, with exclusive coverage of the IBM CDO Strategy Summit. IBM has these things. They bookend on both coasts, one in San Francisco, one in Boston, spring and fall, great event, intimate event, 130, 150 Chief Data Officers, learning, transferring knowledge, sharing ideas. Karen Woodruff is here, she's the principal data scientist at IBM, and she's joined by Ritesh Arora, the director of digital analytics at HCL Technologies. Folks, welcome to theCUBE, thanks for coming on.
Thank you for having us.
You're welcome. So, we're going to talk about data management, we're going to talk about data engineering. We're going to talk about digital, as I said, Ritesh, because digital is in your title. It's a hot topic today, but Karen, let's start off with you, Principal Data Scientist. So, you're the one that is in short supply. So, a lot of demand, and you're getting pulled in a lot of different directions, but talk about your role and how you manage all those demands on your time.
Well, you know, a lot of our work is driven by business needs. So, it's really understanding what is critical to the business, what's going to support our business's strategy, and, you know, picking the projects that we work on based on those items. So you really do have to cultivate the things that you spend your time on and make sure you're spending your time on the things that matter. And, as Ritesh and I were talking about earlier, you know, a lot of that means building good relationships with the people who manage the systems and the people who manage the data so that you can get access to what you need to get the critical insights that the business needs.
So, Ritesh, data management, I mean, it means a lot of things to a lot of people and it's evolved over the years. So, help us frame what data management is in this day and age.
Sure. So, there are two aspects of data, in my opinion. One is the data management, another is the data engineering, right? And over the period, as the data has grown significantly, whether it's unstructured data, whether it's structured data or the traditional data, we need to have some kind of governance and the policies to secure data, to make data an asset for a company so that business can rely on the data you are delivering to them. Now, the other part is the data engineering. Data engineering is more of an IT function, which is data acquisition, data preparation, and delivering the data to the end user, right? It can be business, it can be third party, but it all comes under the governance, under the policies which are designed to secure the data and define how the data should be accessed by different parts of the company or the external parties.
And how do those two worlds come together? The business piece and the IT piece, is that where you come in?
That is where data science definitely comes into the picture. So if you go online, you can find Venn diagrams that describe data science as a combination of computer science, math and statistics, and business acumen. And where they come together in the middle is data science. So it's really being able to put those things together. But what's so critical is, Inderpal actually shared at the beginning here, and I think a few years ago here, talked about the five pillars to building a data strategy, and one of those things is use cases, like getting out, picking a need, solving it, and then going from there. And along the way, you realize what systems are critical, what data you need, who the business users are, what would it take to scale that?
So these are like proof point projects that eventually turn into these bigger things, and for them to turn into bigger things, you've got to have that partnership, you've got to know where your trusted data is, you've got to know how it got there, who can touch it, how frequently it is updated, just being able to really understand that and work with partners that manage the infrastructure so that you can leverage it and make it available to other people and transparent.
I remember when I first interviewed Hilary Mason way back when, and I was asking her about that Venn diagram and she threw in another one, which was data hacking. Well, talk about that. You've got to be curious, you've got to be taking a bath in data.
Yes, yeah, I mean, sometimes you have to be a detective and you have to really want to know more, and I mean, understanding the data is like the majority of the battle.
So Ritesh, we were talking off camera about how titles change, things evolve, data, digital. They're kind of interchangeable these days. I mean, we always say the difference between a business and a digital business is how they use data. And so digital being part of your role, everybody's trying to get digital transformation right. As an SI, you guys are at the heart of it, certainly IBM as well. What kinds of questions are clients asking you about digital?
So ultimately, see, whatever we derive from data, it is used by the business, right? So we are always trying to solve a business problem, which is either to optimize the issues a company is facing or to try to generate more revenue, right? Now the digital as well as the data have been married together, right? Earlier, you can say we were trying to analyze the data to get more insights into what is happening in the company, and then we came up with predictive modeling: based on the data that we historically collect, how can we predict the different scenarios, right?
Now digital, over the period of the last 10, 20 years, as the data has grown, different sorts of data have come into the picture. We are talking about social media and so on, right? And nobody is looking for just reports out of Excel, right? It is more about how you are presenting the data to the senior management, to the entire world, and how easily they can understand it. That's where the digital, from the data digitization as well as the application digitization, comes into the picture. So the tools have developed over the period to have a better visualization, better understanding. How can we integrate annotation within the data? So these are all different aspects of digitization on the data, and we try to integrate the digital concepts within our data and analytics, right? So I used to be more, I mean, I grew up as a data engineer, analytics engineer, but now I'm looking more beyond just the data or the data preparation. It's more about presenting the data to the end user and the business, and how easy it is for them to understand it.
Okay, I've got to ask you, so you guys are data wonks. I am too, kind of, but I'm not as skilled as you are, and I say that with all due respect, I mean, you love data. As data science becomes a more critical skill within organizations, we always talk about the amount of data, data growth, the stats are mind-boggling, but as a data scientist, do you feel like you have access to the right data, and how much of a challenge is that with clients?
So we do have access to the data, but the challenge is a company has so many systems, right? It's not just one or two applications. There are companies who have 50 or 60 or even hundreds of applications built over the last 20 years, and there are some applications which are basically duplicates, which replicate the data. Now the challenge is to integrate the data from different systems because they maintain different metadata.
The quality of data is a concern, and sometimes these are international companies. The rules, for example, might be different in the US or India or China, the data acquisitions are different, right? And as you become more global, you try to integrate the data beyond boundaries, which sometimes becomes more of a compliance issue, beyond the technical issues of data integration.
Any thoughts on that?
Yeah, I think one of the other issues too is, you've heard of shadow IT, where people have servers squirreled away under their desks. There's your shadow data, where people have spreadsheets and databases that they're storing on like a small server or that they share within their department. And so we were talking earlier about the different systems, and you might have a name in one system that's one way and a name in another system that's slightly different, and then a third system where it's different and there's extra granularity to it or some extra twist. And so you really have to work with all of the people that own these processes and figure out what's the trusted source? What can we all agree on? So it's funny, a lot of the data problems are people problems. So it's getting people to talk and getting people to agree on, well, this is why I need it this way and this is why I need it this way, and figuring out how you come to a common solution so you can even create those single trusted sources that then everybody can go to, and everybody knows that they're working with the right thing and the same thing that they all agree on.
Yeah, man, the politics of it. I mean, politics is kind of a pejorative word, but let's just say dissonance, where you have maybe a back-end financial system and the CFO, he or she is looking at the data saying, oh, this is what the data says. And then, I mean, I was talking recently to a chef in a restaurant who said that the CFO saw this, but I know that's not the case.
And I don't have the data to prove it so I'm going to go get the data. And so, and then as they collect that data they bring it together, so I guess in some ways you guys are mediators.
Yes. Yes. Absolutely.
The data doesn't lie, it's just, you know, you've got to understand it.
You have to ask the right question.
Yes. And sometimes when you see the data you start to, you don't even know what questions you want to ask until you see the data. Is that a challenge for you?
Yes, all the time, yeah.
So, okay, what else do we want to talk about? The state of collaboration, let's say, between the data scientist, the data engineer, the quality engineer, maybe even the application developer. John Furrier, my co-host and business partner, often says data is the new development kit. Give me the data and I'll, you know, write some code and create an application. So how about collaboration amongst those roles? Is that something? I know IBM has announced some products there, but to your point, Karen, a lot of times it's the people.
It is.
And the culture. What are you seeing in terms of evolution and maturity of that challenge?
You know, I have a very good friend who likes to say that data science is a team sport, and so these should not be like solo projects where one person is wading up to their elbows in data. This should be something where you've got engineers and scientists and business people coming together to really work through it as a team, because everybody brings really different strengths to the table and it takes a lot of smart brains to figure out some of these really complicated things.
I completely agree. Because, see, the challenge is we are always trying to solve a business problem. It's important to marry IT as well as the business, right?
We have the technical experts, but we don't have domain experts, subject matter experts who know the business, in IT, right? So it's very, very important to collaborate closely with the business, right? And data scientists are an intermediate layer between the IT as well as business, I will say, right? Because data scientists, over the years, as they try to analyze the information, they understand business better, right? And they need to collaborate with IT to improve the quality, right? Those are the kind of challenges they are facing, and the data engineer has to work very hard to make sure the data delivered to the data scientist or the business is as accurate as possible, because wrong data will lead to wrong predictions, right? And ultimately we need to make sure that we integrate the data in the right way.
It's a different cultural dynamic than it was, say, 10 years ago where you'd go to a statistician, you'd fire up the SPSS.
We still use that.
I'm sure you still do, but run some chi-squares and give me some probabilities and maybe run some Monte Carlo simulation, but one person kind of doing all that, to your point.
Well, it's interesting. There are some students I mentor at a local university and we've been talking about the projects that they get, and that more often than not, they get a nice clean data set to go practice learning their modeling on, and they don't have to get in there and clean it all up and normalize the fields and look for some crazy skew or null values or where you've just got so much noise that needs to be reduced into something more manageable. And so you made the point earlier about understanding the data. It really is important to be very curious and ask those tough questions and understand what you're dealing with before you really start jumping in and building a bunch of models.
Let me add another point.
The way we have changed over the last 10 years, especially from the technical point of view: 10 years back, nobody talked about real-time data analysis because there was no streaming application as such. Now nobody talks about batch analysis, right? Everybody wants data on a real-time basis, or if not real-time, on a near real-time basis. That has become a challenge, and it's not just the traditional data in their ERP environment or on the cloud. They want real-time integration with social media for the marketing and the sales, and how they can immediately do the campaign, right? So for example, if I go to Google and I search for any product, right? For example, a pressure cooker, right? And I go to Facebook, immediately I see the ad.
Within two minutes.
Yeah, the re-targeting, right? So that real-time analytics is happening across different applications, including the third-party data which is coming from social media. So that has become a good source of data, but it has become a challenge for the data engineers and data scientists. How quickly can we turn around all the data analysis?
Because it used to be, you would get ads for a pressure cooker for months even after you bought the pressure cooker, and now it's only a few days, right?
It's minutes, it's minutes. You close this application, you log in to Facebook, and the ad is there.
There it is.
Yeah, because everything is linked, either through your phone number or through your email ID. You're done.
Well, that's interesting. We talk about disruption a lot. I wonder if that whole model is going to get disrupted in a new way because everybody's sort of using the same ad.
So that's a big change over the last 10 years.
Do you think? Okay, go ahead.
Oh no, I was just going to say, you know, another thing is just there's so much that is available to everybody now.
You know, there's not this small little set of tools that's restricted to people that are in these very specific jobs, but with open source and with so many software as a service products that are out there, anybody can go out and get an account and just start, you know, practicing or playing or joining a Kaggle competition or, you know, start getting their hands on, there's data sets that are out there that you can just download to practice and learn on and use. So, you know, it's much more open, I think, than it used to be.
Community editions of software, open data, and the number of open data sources just keeps growing. Do you think that machine intelligence can, or how will machine intelligence help with this data quality challenge?
I think that it's always going to require people. You know, there's always going to be a need for people to train the machines on how to interpret the data, how to classify it, how to tag it. There's actually a really good article in Popular Science this month about a woman who was training a machine on fake news. And, you know, it did a really nice job of finding some of the same claims that she did, but she found a few more. So, you know, I think, on one hand, we have machines that we can augment with data and they can help us make better decisions or sift through large volumes of data, but then when we're teaching the machines to classify the data or to help us with metadata classification, for example, or to help us clean it, I think that it's going to be a while before we get to the point where that's the inverse.
Right, so in that example you gave, the human actually did a better job than the machine. Now, it's amazing to me how what machines couldn't do that humans could, you know, last year, now all of a sudden, you know, they can. It wasn't long ago that robots couldn't climb stairs.
It's really creepy.
I think the difference now is, earlier, you knew that there was an issue in the data, but you didn't know how much data was corrupt or wrong, right? Now there are tools available, and they're very sophisticated tools. They can pinpoint and provide you the percentage of accuracy, right, on different categories of data that you come across. Even forget about the structured data, even when you talk about unstructured data, the data which comes from social media or the comments and the remarks logged by the customer service representative, there are very sophisticated text analytics tools available which can talk very accurately about the data as well as the personality of the person who is giving that information.
Tough problems, but it seems like we're making progress. All you have to do is look at fraud detection as an example. Folks, thanks very much.
Thank you very much.
Thanks for sharing your insight.
You're very welcome.
All right, keep it right there, everybody. We're live from the IBM CDO conference in San Francisco. We'll be right back. You're watching theCUBE.