This is George Gilbert. We're on the ground at Databricks. We're with Joseph Bradley, the brains behind Spark machine learning. And we are talking about how to build machine learning not just as a pipeline, but as a process, not in terms of tools, but in terms of activities. So having said that, Joseph, take us through how Spark would put together an end-to-end process, a set of activities, where we want to be constantly learning and constantly putting those learnings, in the form of a model, back into production.

Right, that's a great question. I'll start off at a very high level, quoting a professor I had, in terms of defining what machine learning's goal is. That would be to take experience and use it to modify behavior in a way that improves performance on some metric, and to do this over time. This was from Tom Mitchell, a major name in machine learning. Mapping this back onto what an organization might want to do with, say, Apache Spark: essentially, first, someone needs to define the problem. What is the performance you're trying to improve, and how do you actually measure whether it's improving?

Pick a domain, just to be concrete.

Sure. Let's take something very straightforward: say there are users coming to a site, and you want to predict which ad to show them. A stereotypical machine learning use case in industry, I think, but one where there's a clear mechanism at work: you're trying to model the user's behavior, and the performance of your model is easy to quantify in terms of whether or not the user actually looks at the ad.

I think the first way Spark helps an organization get started with that problem is simply in its basic, very interactive style of analysis, allowing a user to understand the data, do rapid prototyping, and so forth. From there, you enter the step of putting the model into production and starting a constant feedback loop of monitoring how it's doing and updating it as needed.

OK. So you've got this feedback loop which is focused on the metric you want to improve, which is, I assume, the click-through rate or something like that. And when you've got that feedback loop, what happens next in the process? How do you take that feedback and improve the model?

Right. There are several possibilities. I'll group them into, say, batch updates on the one hand and online learning, which you mentioned before, on the other. But ultimately it is just about updating the model with new training data, in order to either improve its performance overall or follow the changing behavior of, say, your user demographic.

OK, so if we're taking structured streams and we're doing continual online learning, in that case we might, in an automated fashion, build several models, look at the efficacy or the accuracy of each one, and test those. And today, I assume, a data scientist would be involved in that process.

Right.
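To make the batch side of that feedback loop concrete, here is a minimal Scala sketch against the spark.ml API. The data path and the column names (`clicked`, `features`) are hypothetical stand-ins, it assumes the feature vectors have already been assembled, and `spark` is the ambient SparkSession:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hypothetical historical ad impressions: an assembled "features"
// vector column plus a binary "clicked" label (1.0 = user clicked).
val impressions = spark.read.parquet("/data/ad_impressions")

// Hold out a test set so evaluation mimics unseen traffic.
val Array(train, test) = impressions.randomSplit(Array(0.8, 0.2), seed = 42)

// Fit a simple click-through model; this is the batch "update" step,
// rerun as new training data accumulates.
val lr = new LogisticRegression()
  .setLabelCol("clicked")
  .setFeaturesCol("features")
val model = lr.fit(train)

// Score the held-out data and compute area under the ROC curve, a
// proxy for the click-through metric the feedback loop monitors.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("clicked")
val auc = evaluator.evaluate(model.transform(test))
println(s"Held-out AUC: $auc")
```

A batch update would simply refit on fresher impressions and compare the new model's held-out metric before promoting it.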
Yeah, so to get more concrete about how you would actually use, say, MLlib: you would, for example, fit several possible models offline, presumably tuning them and figuring out what features to use, and then you could either test on held-out data offline, or put the models into your production system, maybe just sitting in the background making mock predictions, in order to see how they really do in real time and figure out which one is best and which one you should really put into production.

So this testing process is sort of like the learning itself. It could be batch, or it could be online but not really live; it runs in the background.

Definitely. And there, I think it's critical, as you said, to have people with a data science background involved, because that kind of statistical knowledge is needed to make sure that the way you are testing the model to be put into production really mimics how it will actually be used in practice. For example, if user behavior is changing over time, it could be really bad to tune your model on data from five years ago, as opposed to testing on a smaller but more recent set of data.

OK, so then let's talk about the deployment step. Whether you've tested the model offline on recent data, or tested it online but without serving up live answers to a real audience, once you pick the model you want to use, what are the options for putting it into production?

That's a great question, and there are a few. I think we mentioned before what we're calling ML persistence, which is the ability in MLlib to take a model and save it out to a format which can be read back into another Spark context, possibly in another language, and certainly on a different Spark deployment, and then executed in exactly the same way. This is nice if, for example, a data scientist develops something and saves it in Python, and then it needs to be deployed within the JVM.

Another option is to take that model and move it completely outside of Apache Spark, if your production system needs to operate outside of Spark, either because of constraints of that cluster or because of some business constraint. In that case, models can be serialized in a couple of different ways. There is limited PMML support, which is just a model export format. There is also ongoing work toward supporting local implementations of MLlib models, where you could use the MLlib format and read the model back in, but execute it outside of a Spark context. I think that work in particular is going to be super valuable for the community.

Where you're executing it outside the Spark context.

Right. And essentially what that gets you is the ability to execute MLlib models in exactly the same way, using the same code paths, but to deploy them in your particular application, which can have fewer requirements than if you were deploying Spark itself.
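Continuing the earlier sketch (and assuming its `model` and `test`), ML persistence looks roughly like this; the save path is hypothetical. A model saved this way from PySpark could be loaded identically on the JVM:

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel

// Save the fitted model in MLlib's language-independent format.
model.write.overwrite().save("/models/ctr-lr-v1")

// Later, possibly on a different Spark deployment (or after training
// in another language binding), load it back and score new traffic
// through exactly the same code path.
val restored = LogisticRegressionModel.load("/models/ctr-lr-v1")
restored.transform(test).select("clicked", "prediction").show(5)
```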
OK, so that application might or might not include Spark. In this case, you're using Spark at design time, not at runtime. So now, on the operations side: tell us some of the things you could do for building, testing, and deploying the model in an increasingly automated fashion. What exists today, and what do you see coming down the road to help?

Definitely. In terms of what there is now to help data scientists start building these models, I think the biggest effort, which was actually driven by another Spark committer here at Databricks, Xiangrui Meng, is what's often called ML Pipelines. This was the effort to first base ML transformations for features, models, and tuning around DataFrames, and then to provide tooling for stringing those together into complex machine learning workflows. Traditionally, when you think of machine learning, you might think of, say, a random forest or linear regression or something like that. But really, a whole lot of the work goes into munging your original data into whatever numerical features the machine learning algorithm expects. And this is arguably the most important part of machine learning, because it's really how you're phrasing your problem in the first place. So a lot of steps may be involved.

And that phrasing is the knobs that you're putting on the model: the features, the parameters.

Exactly. It's like: do you represent an email as, say, a set of word counts? Do you look at pairs of words? Or do you use much more complex and expressive ways of representing it?

And by expressing the transformations and the estimators on DataFrames, you have a uniform construct to string everything together more easily.

Right. It's meant to help data scientists picture this workflow as starting from their original input DataFrame, with every transformation essentially modifying that DataFrame: maybe appending new columns of features you've generated, maybe pruning some if you're doing feature selection, feeding the data through models which add new columns with predictions or probabilities, that sort of thing. It lets you phrase that complex workflow in a really uniform, intuitive way. That, I think, is the most impactful thing which is there right now.

OK. So then that ML pipeline has one or more models, potentially an ensemble, which gets deployed, as you were saying earlier, either outside the Spark context or within it, potentially scaled out to a cluster, and, as we talked about, doing either online learning or batch. Now, over time, how might you see greater levels of automation move into this process?

That's a good question. There are a number of ways there could be automation. Certainly the easy one to answer, I think, is the scaling out you mentioned: all of these elements are designed to scale out. That is something we're working on now; speed and scalability will, I'm sure, continue to improve in the future, but it's something we're really always focusing on.

And how about feature selection and the other elements? How do you generate features, how do you select them, and which models do you even use? Do you use logistic regression or a random forest?

That, I think, is a case where there are ways to start to automate, but where there will always be ways to improve and always a need for a human in the loop. First, there's certainly the opportunity to provide automation where you try multiple models automatically, try multiple ways of feature generation, and just rely on what we currently have for automated model tuning to figure out, from held-out test data, which of these ways is best.

So the machine learning machinery would generate several sets of features for the models, and then test and evaluate those models.
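As a sketch of both ideas, a pipeline stringing feature transformations and a model together, with the built-in tuning searching a small grid against held-out folds; the `docs` DataFrame and its `text` and `label` columns are hypothetical:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Each stage appends columns to the DataFrame: raw "text" becomes
// "words", which becomes a "features" vector, which feeds the model.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Automated tuning: try two featurizations (word-count dimensions)
// and two regularization strengths, four candidate pipelines in all.
val grid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// Cross-validation fits each candidate on training folds and keeps
// the one that scores best on the held-out folds.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val bestModel = cv.fit(docs) // docs: DataFrame with "text" and "label"
```

The word-count representation here echoes the email example above; trying pairs of words or richer features is just a change of pipeline stages.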
Right. I think that's something which quite a few researchers are certainly looking at, and which some open source projects are even starting to look at, but which is definitely still in the research domain. The other big aspect machine learning can help with is not automation in the sense of push a button and everything's done, but automation in terms of telling you what's going wrong or how you might improve. You can compute a lot of statistics after you fit a model which indicate whether or not that model is any good. Right now you require a data scientist or statistician to really read those statistics and understand how to improve from there. But I think more work could be done, maybe not necessarily using machine learning, maybe more as a design question, to make it easier for a data scientist or a non-expert to take their current model, get feedback, and know how to improve it.

So the machine learning process would include some sort of visibility into how it made its decisions or recommendations. Or visibility for the data scientist into where his decisions made things more or less accurate.

Yes, I think really both aspects. And right, that interpretability and visibility is key.

All right, let me make sure, before we lose you, that we've got all the key questions we wanted to put to you. Notebooks have become more and more pervasive in the Spark ecosystem. Might we see machine learning pipelines start to have graphical representations that would make it easier for data scientists or data engineers, or perhaps even others, to start prototyping?

Right, that's a great question. It's something which Databricks demoed an early prototype of at an earlier Spark Summit, and which I would definitely like to provide. There are multiple ways of providing it. Partly it's simply visualization, being able to deep-dive into elements of an existing pipeline. And there's also the case of more of a point-and-click, drag-and-drop kind of GUI for building these complex pipelines. Both are certainly quite valuable, I think, especially for communicating the intuition and being able to look at the overall picture of what an ML workflow is doing.

Would those be open source, or is that proprietary Databricks value-add?

Good question. I think certain elements of it will be open source and certain elements will certainly be Databricks value-add. A lot depends on whether something can easily fit into one project or the other, and also where the interest is coming from and who's really pushing for these things.

All right, Joseph, on that note: we've had a very lengthy set of interviews with you and covered a lot of ground. That was most enlightening. This is George Gilbert. We are on the ground at Databricks. We've been with Joseph Bradley, talking about machine learning in Spark 2.x and its futures. We will be back with more interviews.