Hi, this is George Gilbert. We're on the ground at the Marriott Marquis for the Data Science Summit in San Francisco. I'm with Sandy Ryza of Cloudera, and Sandy, as part of one of the largest Hadoop distributors, has had a lot of experience helping take customers down the journey toward machine learning applications. Sandy, why don't you give us a taste of how they get started? What are some of the key skill sets they need, the data they need to put in place, the tools they start out using?

Certainly, George. So, just some quick background: I work on the data science team at Cloudera, and a lot of what we do is help customers get stood up with more complex analytics use cases on Hadoop. I particularly focus on Spark, so often when I'm helping someone out they're using MLlib or the associated tools in the Spark ecosystem. Of course, there's a lot that surrounds the ecosystem. Often we'll get to customers and they'll have no data on the cluster, and the question is, can we figure out a way to take all these different data sources and put them together? So there's a broad range of things that need to happen before you can get to a successful data science POC, and generally it starts with taking a bunch of data and putting it somewhere.

Okay, and once you collect the data, what has to happen to it? Who has to work with it so that you have a good inventory and know how to bring the different pieces to bear on each other?

Yeah, so typically there will be data engineering teams at some organizations. Those might be separate or distinct from Hadoop operations teams, and those teams will be responsible for taking data and normalizing schemas, maybe the kind of work a traditional DBA might do for relational databases in a different kind of organization.

Okay, so once you have the data, is it with the data in place that you determine what problems you can solve with reasonable risk and reasonable cost, or is it the other way around, or is it a mix of the two?

Yeah, I think it's a mix of the two. Often we'll go somewhere and they'll say, this is the amount of data that we have, it's pretty rigidly set, and then we try to figure out what we can do with it: given this data, what sorts of approaches are the most useful? In other situations, they want to ask us a little more about what data sources they can wrangle to solve their problems.

Does it make sense to think of a common customer journey in terms of greater sophistication of the problems they solve and greater business impact, or is that perhaps more of an industry-specific question?

Yeah, I think so. I mean, we always say start slow: take some small data set that you have, maybe not small in size but small in complexity, take some manageable problem, maybe an existing problem that you're solving with a more rules-based approach, and figure out some way you can use analytics, complex analytics, machine learning, and so on, to improve upon what you have.
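To make the ingestion and schema-normalization work described above concrete, here is a minimal PySpark sketch of pulling two disparate sources into one common schema and landing the result on the cluster. The file paths, table layouts, and column names are hypothetical, purely for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("ingest-normalize").getOrCreate()

    # Two hypothetical sources with inconsistent schemas
    crm = spark.read.json("hdfs:///raw/crm/customers.json")
    billing = spark.read.csv("hdfs:///raw/billing/accounts.csv", header=True)

    # Normalize both to a common schema -- the kind of work a data
    # engineering team does before any data science can begin
    crm_norm = crm.select(
        col("cust_id").cast("long").alias("customer_id"),
        to_date(col("signup"), "yyyy-MM-dd").alias("signup_date"))
    billing_norm = billing.select(
        col("account_no").cast("long").alias("customer_id"),
        to_date(col("created"), "MM/dd/yyyy").alias("signup_date"))

    # Union the normalized sources and land them in a columnar format
    customers = crm_norm.unionByName(billing_norm)
    customers.write.mode("overwrite").parquet("hdfs:///warehouse/customers")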
Can you give us some examples of where you start small, and why, for instance, Spark has become so popular now as almost the default tool?

So, fraud is a pretty common use case: going to different banks, helping them out with different kinds of fraud detection. They might have a specific data set they want to use that will help them enhance their ability to automatically detect fraud, and we'll take that and use various MLlib algorithms to develop it.

Now, does that start off in batch mode and then get operationalized, say, once a day or once every six hours, with the goal of making it close to real time?

Yeah, so that's perhaps the next step on the data science journey. You have a model that you think works pretty well, and you have to figure out how you want to put that model in production. It's fairly use-case specific. With fraud there's often a real-time component: things shift over time and you want to be able to be adaptive. So in that case, once you have some sort of model, you start thinking about how often you need to retrain it on the existing data, and how you can serve it in really quick ways.

Okay, so let's dive into that a little more. There are two issues there. There's operationalizing the model with low latency, so that the training gets applied very quickly to the application that has to decide, is this fraudulent or not? And then there's keeping the model up to date. In addition to operationalizing it quickly, how sophisticated does a customer need to be to do that? What are some of the tools required, and what are some of the skills and processes needed?

Yeah, so at that point we're often getting outside the realm of traditional exploratory data science and more into systems engineering, you might say. Both of those problems we'd probably group under operationalizing machine learning, or operationalizing models. One thing that's proven to be really helpful is an open source software project started by Sean Owen from Cloudera called the Oryx project. Oryx is built to aid in this process of operationalizing models: it can retrain your models at configured intervals, maybe really quickly if you want, and it also provides code that can score in real time.

Okay, and what are the underpinnings of this? What data does it need, and what other Apache tools, Hadoop tools, support it?

So Oryx itself is built mostly on Kafka and Spark. It has servers of its own that are used for storage. The idea is that it takes data sitting in Hadoop, probably as text, Avro, or Parquet, probably the most common formats, runs machine learning algorithms that are most likely written in Spark, and then all the internal piping it does as the data moves around uses Kafka.
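This is not Oryx's actual API, but as a generic sketch of the pattern described above, assuming a Spark MLlib pipeline trained and periodically retrained in batch, saved to a hypothetical HDFS path, and a hypothetical Kafka topic of transaction events, real-time scoring might look roughly like this in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StructField, StructType
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()

    # Load a pipeline that a batch job retrains and republishes on a schedule
    model = PipelineModel.load("hdfs:///models/fraud/latest")  # hypothetical path

    # Hypothetical schema for incoming transaction events
    schema = StructType([
        StructField("amount", DoubleType()),
        StructField("account_age_days", DoubleType()),
    ])

    # Read the live transaction stream from Kafka
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "transactions")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("t"))
              .select("t.*"))

    # Score each event as it arrives; the feature and prediction columns
    # come from however the pipeline was defined at training time
    scored = model.transform(events).select("amount", "prediction")

    scored.writeStream.format("console").start().awaitTermination()

In this pattern, the retraining half of the loop is just a scheduled batch job that refits the pipeline and republishes the model for the scoring process to pick up.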
Okay, interesting. So now Spark is going through this transformation where the different libraries are getting more and more integrated, and it's becoming a center of gravity in itself. What key pieces of Hadoop do you see it requiring and depending on well into the future?

Yeah, it's a great question. So Hadoop traditionally was MapReduce, YARN, and HDFS; roughly, that's data processing, resource management, and storage. Now we're taking out that data processing layer and in part replacing it with frameworks like Spark. It's not the only data processing framework on Hadoop, but it's coming to be a pretty dominant one for lots of use cases. So, at least in our platform, our way of seeing things, there's YARN, which does resource management between different data processing frameworks and even within one, so you can have multiple Spark applications and resource-manage them on the same cluster, as well as HDFS, which is storage. Those are both very core at the basic data-center-operating-system level. Then there are all the tools that come around them: Sentry for security, Hive for its metadata management, as well as ingest tools like Flume and Kafka. So Spark is a very important part of a large ecosystem with a bunch of other important components.

Do you see the Hadoop ecosystem with Spark being perhaps simplified if it's deployed by cloud service providers, who can smooth over some of the seams between the components?

Yeah, I mean, I think cloud is important, but maybe somewhat orthogonal to the integration of those components. At this point the world has sort of gotten comfortable with the idea that it's nice to have companies that take a bunch of disparate open source projects, integrate them, and make sure they play nicely together, and it's also nice when those companies make it easy to take that integrated product and deploy it on the cloud.

Okay. With that, this is George Gilbert wrapping up at, well, not the Spark Summit but the Data Science Summit in San Francisco, and we'll be back shortly. Thanks.
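As a footnote to the platform pieces mentioned in this last exchange, here is a minimal PySpark sketch of how one application plugs into them: YARN for resource management, the Hive metastore for table metadata, and HDFS for storage. The table name and output path are hypothetical, and this assumes a cluster already configured for YARN and Hive:

    from pyspark.sql import SparkSession

    # Run under YARN so the cluster's resource manager can schedule this
    # application alongside other Spark (and non-Spark) workloads
    spark = (SparkSession.builder
             .appName("platform-example")
             .master("yarn")
             .enableHiveSupport()  # table metadata comes from the Hive metastore
             .getOrCreate())

    # Read a table registered in the Hive metastore (hypothetical name)
    transactions = spark.table("raw.transactions")

    # Persist results back to HDFS as Parquet (hypothetical path)
    (transactions.filter("amount > 0")
     .write.mode("overwrite")
     .parquet("hdfs:///warehouse/transactions_clean"))

    spark.stop()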