Yeah, I'm sure Todd, Todd's the real one, he's the real one.
Okay, we're back here live at siliconangle.com's exclusive coverage of O'Reilly Media's Strata Conference. I'm John Furrier, the founder of siliconangle.com, joined by my co-host.
I'm Dave Vellante of wikibon.org, and we're here with Josh Wills, who's a data scientist at Cloudera. You all know who Cloudera is, and we're at Strata, where there's a big emphasis on data science and data scientists. Josh, thanks for coming on.
Thanks for having me.
So data science is all the rage, and you know, you have the guy who invented the term, Jeff Hammerbacher, and it's great to see him actually leading the news cycle for EMC Greenplum these days, you know, in all the news press, highlighting his famous quote about, you know, all the best minds clicking on ads, it's kind of a disgrace or embarrassment.
I know, great quote. As someone who used to work on advertising systems, I particularly appreciate that.
And we have books coming out here and great content about real problems. But in the enterprise world, data science is all the rage. So we want to talk about that segment here around data science and what you guys are doing. And so let's break it down. So relative to your world right now, what you're in, what is the current view of data science? Obviously, there's a little bit of data science trendiness going on where, hey, I'm a data scientist, now I have that on my Twitter handle just to kind of make fun of myself and say I'm a data scientist. But what does it mean to be a data scientist?
Oh, I think, you know, honestly it means to be a data janitor, I think more than anything else, right? I think it's like, you know, you try to describe to people what it is I do, how do I spend my time, right? I think of myself as a mathematician. I think I'm a math nerd, writing equations, right?
And I think, you know, in the public perception, people will see like Nate Silver, they see some of the cool visualizations the New York Times guys do, and so they imagine like, you know, kind of like Minority Report, you know, like the crazy moving stuff around, that kind of thing. To me, I'm Forrest Gump and I have a toothbrush and I have a whole bunch of data and I scrub. And that's really like, that's really what I do.
And what does that entail? Obviously data, you look at data, we had a big chat this morning about data quality and what raw data means, with some stuff in the press here, but you know, data is now part of the developer community and we had talked to someone about Python being an excellent language for data scientists. So what are some of the tools of the trade for scrubbing, interrogating, or training data to learn? These are all kinds of concepts that we call data as code. So what can you share with the folks out there that you're seeing as data as code, and how are developers playing with data and how is data evolving?
Oh, okay, wow, that's a very broad question. My personal stack is primarily Python, Hive, and R. Those are really kind of my three go-to tools for just about all of my data cleansing, data examination, like just kind of getting myself familiar with the dataset. That's primarily what I use. But I mean, it's like a religious debate among data scientists when it comes to their tools. Like for Hadley Wickham, it's R and ggplot all the way for everything, right? For other people, it's Python, for other people, it's SAS. You know, a lot of SAS-
It's a preference issue.
It's a personal feel, right? It's very, it's like, it's personal. It's like religious. It's like, you can have my SAS when you pry it from my cold dead hands, kind of thing.
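[Editor's sketch, not part of the interview.] The scrubbing described above, done in the Python part of the stack Josh mentions, might look roughly like the following. The log format, the sample lines, and the names `LINE_RE`, `scrub`, and `count_levels` are all assumptions made up for illustration; nothing here is taken from his actual code.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration only:
# "<YYYY-MM-DD> <HH:MM:SS> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def scrub(lines):
    """Yield (date, level, message) for well-formed lines and skip the rest."""
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # malformed or truncated line: drop it
        date, _time, level, message = match.groups()
        # Normalize inconsistent casing and stray whitespace.
        yield date, level.upper(), message.strip()


def count_levels(lines):
    """Count how many cleaned records fall under each log level."""
    return Counter(level for _date, level, _message in scrub(lines))


# Made-up sample input with the usual mess: padding, mixed case, junk lines.
sample = [
    "2013-02-27 10:15:02 INFO job started",
    "  2013-02-27 10:15:09 warn  slow reader  ",
    "### corrupted ###",
    "2013-02-27 10:16:40 ERROR disk full",
    "2013-02-27 10:17:01 info job finished",
]

if __name__ == "__main__":
    print(count_levels(sample))
```

Silently dropping malformed lines keeps the sketch short; on real data one would probably want to count or log the rejects as well, since they are often where the surprises are.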
Let's talk about that for a second, because obviously given that, you know, whatever version of a hammer you own or kind of tools you want to use, what are the table stakes? So in your mind, what are the minimum things that a data scientist needs to have at their disposal for tools?
Oh, in terms of tools. Oh, some kind of scripting language. R is a great language, but it's primarily tabularly oriented, and I'm usually working with data before it's been nicely formatted into tables. It's still like log files or some other kind of binary format. So it really requires a real programming language just to even get started at all. So I think of that as-
So to set it up.
Just to set it up, just like for an absolute beginning. Because then beyond that, a lot of it can really just be counting things. Like a lot of what data scientists do, you know, is just count stuff. And it turns out that when you count things over enormous data sets, it can work like surprisingly well.
What's the mindset right now in terms of what, you know, we were talking earlier about, you know, operational data warehouses and business intelligence tools that people have out there. And it's known queries, you know, SQL is one of them, right? Or in one dimension. That's kind of the old way. It's still relevant, installed in these accounts and large enterprise customers. But there's this whole new way emerging where you don't necessarily know yet, it's a lot of unstructured data. You need to code on the data. What do some of the architectural data platforms look like to you when you're looking at data sets? Because you're dealing with multiple sets of data. You've got to pull it in from multiple sources. It could be log files, machine data, user data. What's your point of view on all this?
Oh, wow. This new way.
The new way.
That's a great question. So I think there's sort of two phases to it. Right now I like, I mean, you know, I'm a Hadoop vendor and all that kind of stuff.
I really like a lot of the in-memory tools that are coming along right now. I really like Spark. I really like SAS's LASR server. I think there's a lot of great stuff for just doing in-memory exploration of data sets when you're just kind of getting familiar with it. Like a lot of my problem with taking a relatively small sample from a data set is I will miss a huge number of edge cases that will completely screw up my analysis. So if I'm grabbing like 100 megabytes from a terabyte, it's no good for me from a sampling perspective. I actually need a much larger chunk of that data in order to get started. That said, though, I still think of, you know, the typical data warehousing models, all the typical sort of tabular tools, Tableau, all that kind of stuff, as being fantastic for publishing results, for sharing results, for basically enabling other people to do analysis on data sets I've prepared. So I mean, that's, you know, that's always sort of the end state for me. How do I publish this? How do I empower other people? So.
One more question, Dave. One more question. On the data science side, one of the things is collaboration is a big thing, where teams are working together and you have a global workforce of people in India. You have a follow-the-sun strategy. Is there anything, because we haven't found anything yet to really talk about around using the cloud or other vehicles to put data somewhere where people can jump in. Are there tools out there on the collaborative side of data science analysis?
Oh, that's a good question. Nothing that like pops out at me as being particularly compelling yet. That sounds like a very good, are you suggesting like maybe you and I should start a company to do that or something? Is that a-
No, it's a need that I think we're hearing people have-
I would agree with that. I think that's probably fair. I think, I mean, collaboration, not even like across the globe, across the office.
I don't think collaboration in terms of analysis and seeing, like, I really want almost like a social network to see what kinds of analysis, what kinds of queries are my coworkers doing? How are they preparing data sets in kind of a, you know, very lightweight, very ad hoc kind of fashion. I think that would be incredibly useful.
So LinkedIn's great, right? Go to LinkedIn. There are skills.
Those are mine.
And this is like, okay, what do you need to be a data scientist? So Hadoop, machine learning, statistical modeling, obviously stats, data mining, Python, big data, Java, data science, R. What's missing? I mean, curiosity about data, data hacker.
Oh, that's it. I think, you know, a relentlessness. I think it's like one of the qualities I point to. Like it's sort of the joke I make: a data analyst, sorry, a data analyst has a question. If the tool doesn't answer the question, if SAS doesn't answer it, if like the data warehouse doesn't answer it, the question doesn't get answered. A data scientist says, if the tool doesn't answer my question, I go get a new tool. It's like unacceptable that my question not be answered. It's a very kind of Generation Y, sort of entitled approach to answering questions and data analysis.
So you say that's on the presumption that the answer exists in the data? How do you know that?
Presumption, I mean, oh, it's gut feel.
You know, okay, gut feel, okay. So all those sort of bad, naughty words that the data people don't talk about.
Yeah, exactly. I don't know. And I'm wrong a lot too. So I mean, a lot of the time, my intuition leads me wrong. Leads me astray, but.
And so you go, fine, we're good.
It's got me this far, yeah, exactly.
What do you see for computer science and/or disciplinary-type paradigms? I mean, we were talking last night in the Hyatt Lounge area around, you know, fourth generation open source.
You know, you've seen this movie before, all those cliches, but you know, if you look at some of the computer science paradigms, like AI, ontologies, reasoning, these learning machines, do you see any of those paradigms that are more relevant now than before? I mean, they kind of go through peaks and valleys. I see machine learning has been around for a while, it's modernizing. What do you see out there that's rearing its head as a relevant paradigm for these kinds of data science situations?
I'm honestly like sort of, I'm sort of like Charlie Brown with the football. I'm kind of really excited about deep learning and neural networks right now, like the cat recognition stuff that Google was doing. I find that stuff really, really exciting. I'm incredibly optimistic about that. I think these sort of trends tend to be so cyclical, right? Like neural networks were hot in the 60s, they were hot in the 90s, they're hot again now. So it's just like wait, like whatever was hot, like you know, what, 25 years ago will be hot again in five years.
AI's coming back, Joseph Turian told us.
That's right, I would agree with that.
Has it ever gone away?
And deep learning in particular.
Never went away, but people used to run away from the term AI.
Well, Google was on and talking about, you know, code is evolving and learning code and, you know, we say data as code from a developer standpoint. You know, one paradigm in the data warehousing market that's kind of being disrupted is, there's one data mart and you interrogate it. And now people are talking about, you know, okay, one data mart evolves into two data marts of the same data, so data's a living, breathing kind of thing.
Yeah, so that requires a different technology to do that.
And when Charles from Cloudera talks about Impala being that resource-based, you know, distributed resource, what's your take on that? So you work at Cloudera, so you probably agree.
I mean, I certainly agree.
I guess for me as a data scientist, I mean, Impala right now is obviously not a data warehouse by any stretch of the imagination, right? But it does two things really well. One, it helps me debug intermediate outputs in my ETL, in my machine learning pipelines. And two, it's a really compelling way for me to come up with new dimensions and new fact tables and new perspectives on data that isn't in a data warehouse. Things where like, I have a hypothesis, I have an intuition, I have a gut feel that something would be useful. This would be a useful perspective on the data. A data warehouse architect should not re-architect his data warehouse based on my hunches, right? Impala lets me actually like, create it, prove that it's actually useful, and then at that point publish it to the data warehouse.
I want to follow up on that. So prior to Impala, you wouldn't be able to do that or it would just take you longer, so is it making you more productive or allowing you to do things that you couldn't have done before, or both?
Oh, it's interesting. I mean, I was going to say it was sort of like a political advantage in some way. I didn't have to beg and plead with a data warehouse architect to let me try this crazy idea. I can just try the crazy idea, see if it works, and then if it does, fantastic, make it available to lots of people. So I think basically it frees me, it enables me to take on more of my crazy ideas than I might have before.
Crazy ideas, the crazy ones, as Steve Jobs put out in the ad campaign, really making a difference. Josh Wills, data scientist from Cloudera, final comment, I'll give you the last word. What's on your to-do list for the next 12 to 14 months from a personal data science standpoint and from a business standpoint? And I see you mentioned you're interested in and geared up on neural networks and those kinds of things. What specifically are you working on that you're excited about?
Both personally, from a personal data science standpoint, and for Cloudera.
I'm very interested in making machine learning techniques available to a wider audience. I think from an academic perspective, people tend to focus on the fastest algorithm, like speed, speed, speed, accuracy, accuracy. I'm much more interested in easy to use. Anybody, like any statistician, any SAS programmer, anybody can just pick these tools up, apply them to a problem and get them out there. That is, to me, it's not something that's going to be like a huge deal in the industry in the next 12 months, but I think it will be a huge deal in about 18 or so, and that's kind of where I try to keep my focus.
Simplifying, make it easy.
Simplify, make this stuff easy to use. Primarily easy to use. Anybody can do it. Robust, reliable, easy to do both.
No one ever went out of business for making things easy to use and simple and reducing the steps it takes to do something, so that's a good point.
Thank you.
Josh Wills with Cloudera, data scientist. We'll be back with our next guest inside theCUBE right after this short break.
Thanks a lot.