Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015. Welcome back everyone. You are watching theCUBE, our flagship program. We go out to the events and extract the signal from the noise. We are live in Silicon Valley for Big Data SV. It's our second event in Silicon Valley, fourth overall. We're now running Big Data SV here in Silicon Valley and Big Data NYC in New York City to really capture the geographic differences in what's happening in the Big Data business, how it is evolving, and the implications for business and innovation. I'm John Furrier with SiliconANGLE. My co-host here is Jeff Kelly, Chief Data Analyst with Wikibon.org. And our next guest is Dr. Satyam Priyadarshi, Chief Data Scientist at Halliburton. Welcome to theCUBE. Thank you. Big Data, I'll say the science is at the root of it. Data science is not a new field. Data processing has been around since mainframes, and data has always been a big part of computing. Now more than ever, data has proliferated in all kinds of forms. Certainly real time, in the moment, and systems of record, systems of engagement. Whatever you want to call it, data is everywhere. And you have a great background in data science, and your company is certainly doing a lot of stuff there. So first tell us about your role at Halliburton as Chief Data Scientist. What does that mean? What are some of the things you work on? I'm sure you can't reveal all the secret projects you're working on. In general, what is the main focus of your role? So Halliburton is an oil and gas service company. So we actually play in what is called upstream oil and gas, which means exploration, drilling and production. The oil and gas industry is very big. So in technical terms, people talk about upstream, midstream, downstream. And downstream is where we all come in when we buy the gas. And we find this fall into the midstream. So Halliburton is in the upstream business. 
And it is a very, what you'd call, scientifically driven company. So in order to do exploration, you use what are called seismic waves and you convert them into some meaningful analysis. And that is all driven through first principles and empirical models, because you want to predict where you will go and drill, right? So if you think of it, we do collect quite a significant amount of data, as the oil and gas industry has said. But they have been doing mostly the scientific aspects of it. But as we know, the data is growing for everyone, not only oil and gas, but for every industry. And data-driven models, as I call them, augment all these first principle models, right? First principles are when you understand that some phenomenon is happening because of certain events or attributes. You want to actually augment that with all the data that you have, in principle, right? So oil and gas is a big data company because there's a lot on the line, right? So Jeff has covered this with the Internet of Things. That's a big application. So you're in the middle of the Internet of Things. How deep are you in? I mean, obviously there's a lot going on. I mean, are you fully engaged with this? Are there certain practices in place? Are the systems all out there? Where in that journey are you? So I don't think anybody can claim that they are mature in this Internet of Things, right? This is still an evolving field, especially if you're talking about sensors, because they are evolving day to day. And the amount of data that is being collected at different phases of this oil and gas life cycle is going to grow. So we can't say that we are mature. We can't say that we are in the elementary state. So we are somewhere in the middle. Almost all of the oil and gas industry would be like that. And you've got probes up. But you've got action going on. Absolutely. You've got some stuff happening. Absolutely. So Internet of Things is, again, it's a big word. 
Whatever we are talking about, which phase we are talking about. So I think there's a lot more to learn in this space. Talk a little bit about the evolution of the oil and gas industry from the way things maybe were done in the past in terms of exploration and deciding where you're going to put the drill in the ground and those kinds of questions, from maybe a less data-driven approach to some of the things that are happening now. Because obviously it's changing a lot; as you said, it's still evolving. But kind of walk us through how it's evolved. I mean, this is an old industry, obviously. It's been around for a long time. We're not talking about the little web startups that are just kind of cropping up. This is an age-old problem, really. How has it evolved? And how are we kind of taking those next steps? If I can add something to Jeff's comment, the value piece, because in your industry, there's a lot of value. So data can add a lot of value quickly. So I would just bolt that on real quick. So if you think of it at a very high level, right? So you do the seismic studies, which is basically you have sound waves created by the sensors. And then the sound wave is collected in some digital form and then converted into visual forms to understand what is lying under the earth, right? What clay, water, whatever you want to. The geological studies, geosciences studies take place. And so we employ a lot of geophysicists and geologists to explain to us what is there. Based on that, they figure out or they predict, saying this is where oil is likely to be. And based on that, then they do the land surveys and all the legal governance, right? It's a highly regulated industry, right? So once that has been figured out, then they plan to do what is called exploration drilling. And then they try to figure out, if yes, how big the reservoir is likely to be. 
And once that is done, then they go into the production phase and then the completion phase, which is like trying to put the drill well and the oil comes out and all those chemical treatments and things like that. So it's a highly complex process if you think of it. But data is collected in all these phases. So data at every step is important for us, right? It's been collected, it has been analyzed from a first principle model; somewhere you want real-time analytics, somewhere there is no real-time part of it. But again, if you think in big data terms, if you look at the volume and velocity problems, right? So when you talk about velocity, by default, when you say big data and velocity, people assume that the data has to be real-time. That's not the case, right? The data can come in real-time, and you may not have to analyze it in real-time. Data may not come in real-time, but your predictive models will have to be run in real-time. So it's a two-by-two matrix, as I say, right? Real-time collection, real-time analysis, right? So it depends on which phase of this oil and gas life cycle you are in. It's not one thing that fits across the life cycle. Seismic, for example, is another case, right? Data is collected and then it has to be analyzed thoroughly. So no real-time component is really needed, right? But the calculations, or what you call the attributes that are calculated, could be done much faster. So talk about the properties of data, obviously, with the Internet of Things and/or probes and/or data in general. Passive data, people are used to. You store data, you collect data, you put it in a pile, you put it in a data warehouse, systems of record, all these things that people in this industry are comfortable with, and then you go get the data and you bring it out. That's passive data. What is active data? 
The role of active data, because with mobile devices, with edge devices, if you will, people or things, you have a lot more activity going on, engagement or things. So active data is coming in and you have passive data, a lot of passive data. So what's the relationship in your mind and how do you look at that paradigm, active versus passive? So as I understand it, when the drilling process is taking place, then we are collecting a lot of active data, right? It may not all be real-time, but it's still pretty active data, right? But the seismic study was the passive data. But at the same time, we are also collecting seismic data when an activity is happening. Connecting them can actually help us optimize the operations. And what happens is, it's not always the case; it depends on where the field is, what the concept is. People talk about digital oil fields; when the data becomes really more real-time, then the challenges will show up. So right now we don't see all those big challenges from the Internet of Things yet, but I think as more and more sensors are deployed, then we will see a lot more of those. Is the evolution of the Internet of Things in your mind bottlenecked by smart devices, the connectivity, or just the state of the evolution? So I started saying that we should not use the term Internet of Things. Internet of Things is good, but I think it should be what I call emerging technology devices. Because some may be connected to the Internet, some don't have to be connected to the Internet, right? Because all of them actually contribute to it. So the term Internet of Things narrows down the focus. So in terms of digital oil fields, I think these devices should be called emerging technology devices. Because it could be a modification of what is called the bottom hole assembly. It could be what you call the sensors on the wires. There are so many properties we are measuring. So what device will come out, we don't know yet. 
It doesn't have to be really connected through the Internet, but they have to be. So most of these are emerging devices in your mind. I mean, Internet of Things is just kind of, but you say that's more of a buzzword. People get up on that. But that's a broad term, but so is emerging devices. So how would you classify the difference? Is emerging devices more generic, or emerging as in non-old, new? No, I think it's the new things. Because as we learn about how to predict something or learn what other attributes are necessary, if those metrics are kind of programmed into those sensor devices, then we can get much richer data, which will be very helpful. This is why I call them emerging technology devices, because it's not only about creating a sensor; we have to learn about the data bits as well. I wonder if you could comment on kind of the market and the vendor ecosystem out there. There's a lot here at Strata Hadoop World. There's, I don't know how many vendors out on that floor. As a practitioner, how do you view what's happening in the market? Obviously there's a lot of marketing battles going on between all the different companies for mind share and that kind of thing. From your view as a practitioner, do you listen to all that noise? Does that matter to you? What is your take on how the ecosystem is evolving? And is it evolving, you think, in a healthy way? So I've been doing this big data since 2005 or something. So I do keep up with what's happening in the industry because it's important to watch what kind of tools and technology they are deploying. Some are mature, some are not mature. But at the same time, the interesting part that one has to look at as a data scientist is what combination of tools you can use. You cannot be tied to one particular tool, right? Especially because the technology evolves very fast. Especially if you look, since we are in the Bay Area, things come up very fast, right? 
Two years back, nobody would talk about streaming analytics that much, but now there are so many funded companies, well-funded companies. So the thing is, can I deploy that or not? Is it useful for me? So those are due diligence processes that one has to go through, and it's not always easy, because one thing is the hype that is created around a process. If you think of it, somebody says, oh, social media, I can do analysis, but does it actually apply to me? One of the concepts that I talked about yesterday was that value in the data is important, and how does the value come, right? For example, in seismic the volume is high and the value is much higher, compared to social media, where the volume is okay compared to seismic. The velocity is much faster, but the value in terms of total business value is much smaller compared to the oil and gas industry. So I cannot take a solution that has worked 100% in social media analysis and expect that it will actually apply to my seismic or drilling. But a combination of these tools can work for it. So in order to create that combination that would work, you have to be aware of what's happening in the industry. Well, that's also a challenge, right? Because you've got so many different players. And I think part of my thinking around this, having talked to a lot of practitioners, is that in fact, having to cobble together the different solutions is a challenge for most enterprises. I mean, we're seeing the Global 1000, the really big enterprises, out there adopting Hadoop and some other tools. But when you get beyond that, I think a lot of companies don't have the expertise to do that; they're not equipped for that. And I think they're waiting for the market to mature a little bit, so you'll see some consolidation, you'll see a little bit more of a platform built out versus point tools. How do you approach that? 
Are you comfortable bringing together the different point tools, or would you like to see a more platform approach? So the business unit that I work in, we also build platforms. So we know the platform business well, and we're one of the leaders in that sense. But for some of us who have been around, I don't think we wait for somebody else to build the platform. I think we know enough about what is needed from a business point of view. One of the things one has to do as a data scientist, a chief data scientist, is to actually know the domain. Because if you don't know the domain, then you can pick up tools and do whatever you want, but that doesn't make sense. So if you know the domain, then you can actually find out what the right tools and technology are to deploy, and then connect it. And this is important, right, for any data scientist for that matter. It's not just about what you call coding and running the algorithm, right? I think a full data scientist is one who understands the business also. Yeah, talk a little bit about that. Obviously you need the analytics skills, the data science skills, but you've got to have the domain knowledge, and people talk a lot about the communication skills as well to kind of get across. You find a great insight in the data, you've got to communicate that to the business. Talk a little bit more, expand on that, to kind of what you see as the key characteristics of a successful data scientist. So for a successful one: first, of course, if they know computing well, then that's a great value. Second is understanding the domain, and the third is visualization of patterns, right? We all can recognize patterns, but everyone looks at the patterns differently. And then of course, communicating what you see in the patterns for actionable insights. An important part is actionable insight, right? You can see a pattern, but there's just no meaning, right? What do you do? 
And so how do you understand whether it's actionable or not? That's important. Unless you know the domain, you can't talk about it. So these four things are needed for a data scientist. And of course, the communication has multiple layers, right? One is being able to communicate with your peers in the data science realm, where you're programming or running predictive models or building predictive models. Second is talking to your product people, software, right? Or the business people, sales people, because they are the ones who are going to sell these solutions, right? So if you can't explain to them, we have a challenge, right? And then of course, communicating to the executives, which is also a challenge, because if a data scientist says, okay, I can solve this problem, and they say you have 20 days, you really can't do that, right? Your predictive models don't get built on a fixed time frame. But there is the ability to communicate what you can achieve in that 20 days: what is my far side vision of the problem and what's my near side? What can I solve in the near side? That is what you want to solve first and show them some results and then go forward. And then simplifying the visualizations, right? When you create patterns, okay, fine. You can create this complex fractal diagram, whatever, right? But then how do you explain it, right? Bringing it down to what I call the kindergarten level sometimes helps, because that is what most people can understand. Because eventually we are running a business; we are not an academic institute in that sense. So especially in the business world, you have to really bring it down to a level that everybody can understand and see the value in terms of profit, innovation and revenue and cost. Because eventually you have to boil it down to that. So that was a great description of what's required of a data scientist. 
But of course we know there's a shortage of those types of people out there. What's your view on how we're going to expand that pool of data scientists? Is it a training issue? Or do you have to come at it through real world experience? Or can you learn it in the classroom? What's the approach you think is going to be needed to really expand the pool of data scientists? Yeah, so I think in my past life I've built some data science teams. So I don't know if it's a real shortage, but yes, it looks like a shortage of people. But I think a lot of people who have done PhDs and all, they can actually be very easily trained into being a data scientist, given the business knowledge. The other concept that I talk about is that a complete data scientist could be two people or three people, right, a business person, a computer person, and a product person or whatever you want to call it, but those three people can combine, right? So the team approach? Team approach, because that is very important. Otherwise, you can have a lot of people doing computing, right? Nowadays, if you know, in Silicon Valley there are so many companies that will claim, okay, data science as a service, right? Or you can download this package and you can run it yourself, machine learning, fine. You can learn machine learning, but what do you do with it? Right, so how do you scale that? So look, that's a good setup. Now, companies that need to scale that. What's the strategy there? What do you see as the best practice advice that you'd offer? So one is bringing in some PhD-level people, right, and then partnering them with the people who have spent their career in business intelligence, for example, right? Some of these people have done a lot of analytics work, but they have been doing it in a mode where the question is given to them and they find the answers. You have to flip that model, saying, here are the answers, how do you ask the question? 
Right, so it's a reverse order, basically, but that means you're forming the patterns and then saying, why do I see that? So they can go to the business guy or a product guy and ask them, I see this pattern in the data, can you explain it to me? So a lot of business intelligence people could actually be converted into data scientists. Yeah, well, that's interesting because I've noticed, if you look at LinkedIn, a lot of the people that were previously calling themselves business intelligence professionals changed that wording to data scientist, and I think it's definitely possible, but it's not as easy as just flipping the board. It certainly offends real data scientists, right? I mean, so it's like, what do you call data scientists? I mean, is there a test? Is there a... Well, I think it's, obviously I want to get your point. Certainly I want to get paid. I think it's results, I think it's results. Can you... So I think data scientist is not a new term, but it has become more popular. But if you think of quantum physicists, quantum chemists, computational chemists, a lot of atomic physicists, they were all data scientists because they were generating huge amounts of data. My background happens to be quantum chemistry; I have a PhD in that, and we generated a lot of data, and from there we were trying to predict, say, whether a drug molecule is interacting with the protein. It's all about analyzing the data. We talk about visualizations. We were all looking at the structure of DNA, trying to figure out where this molecule will connect. So we were all data scientists, but we never used the word data scientist. We were called scientists, right? But nowadays people have just put this label on top of it called data scientist. Certainly you'll get a pay raise if you go from being a DB2 administrator to a data scientist. Certainly your pay just went up, and your job security. So, but no, there's an element of hype around it. It's legit, there's data growth going on. Absolutely. 
There are levels of science. But certainly the mindset is needed in terms of knowing the answers and then asking the questions. That is certainly a flip side of it, from business intelligence to data science. It's essentially the chemistry background, where you're looking at real reactions, and that's a mindset that's really compatible with how data is, because there's a lot of fusion going on around interactions of different data sets now. Integration is a big part of it. So final question as we get down to our time here. What do you like in the market right now? As you look out at the landscape, you're here kicking the tires. You're here at the Big Data SV event looking around and looking at all the content at the Strata conference and all the different startups that may or may not be around if they don't get sales, and there are some good technology tracks. What do you see, and what do you like and what don't you like? Oh, that's a tough question. It could be anything like, so I think that... Obviously there are things we know you don't like; we just talked about that. But in general, what are you liking? What's getting you excited? The thing that is interesting is that yes, a lot of startups especially and some companies are trying to make it easier to use machine learning. Machine learning is nothing new. It was done in 1960. So we're not doing anything special, right? But what it has enabled is leveraging all these open source technologies, which people or even organizations were not ready for. So what Silicon Valley has done is this: bringing in this open source technology and making this hard-looking science easy to use is a great trend, because it can benefit a lot of companies, right? Because one of the things... And compute is now available. And heavy compute is easily available. You don't have to worry about buying servers and all. So it's making that layer very easy, right? As they say, starting a startup is not a problem. 
Technology is not the problem, right? It's the idea and then its execution that is the problem, right? So that is the good part of it. The difficult part is that when these solutions are coming up, if you don't have the domain expertise or know how to apply them, how to interpret them, then it looks like, oh, we are wasting so much energy. So it's about understanding the real problem, but that coaching doesn't come from many of these so-called applications, or what I call TAPS: technology, applications, products and solutions, right? From the TAPS creators, that training doesn't come for the domain, right? They will pitch their solution, saying, oh, we have this fantastic algorithm and it can create these beautiful graphs, right? But then how do you interpret them? You need the domain. Yeah, you need to integrate that in. And that education gap will be the problem. So you're excited about the technology growth, the opportunity, but yet there's a practical integration. A practical integration. For solutions. The products aren't necessarily solutions. So, absolutely great. Dr. Satyam, thank you so much for coming on theCUBE. We really appreciate your candid practitioner's perspective. We love to talk with folks actually implementing, and this is really the reality. It's where the rubber hits the road. That's where the value's created and there are a lot of great opportunities. Thanks for coming on theCUBE. We really appreciate it. We'll be right back with more live coverage here in Silicon Valley after this short break.