This is George Gilbert. We're on the ground at Databricks with the creators of Spark. With us now is Joseph Bradley, who's the principal behind machine learning, graph processing, and graph algorithms at Databricks and in the 2.0 release. So, Joseph, let's start with Spark machine learning and how it's differentiated from other approaches, beginning with the sweet spot of its use cases.

Sure, that's a great question. The most obvious sweet spot of machine learning on Spark, the first thing everyone would think of, is of course scalability. Traditional machine learning libraries tend to be built for a single core from the beginning, whereas Apache Spark's library was designed for distributed computing. That's partly a matter of taking advantage of all the work that the other people you've spoken with today have contributed, in terms of being built on top of RDDs and DataFrames and so forth. But sometimes the machine learning algorithms themselves also need to be architected correctly from the beginning to really scale out.

Okay, so from my understanding there are actually algorithms that don't really lend themselves to parallel processing, but it sounds like for the others, Spark or Databricks has done the homework of creating a data structure, in the case of RDDs or DataFrames, that makes parallelization possible for the algorithms that can scale out.

Yeah, I'd agree that with certain algorithms it's very straightforward to scale them out, while with others it requires quite a bit of work. I think in all cases, with some thought, you're able to take certain steps forward to scale out, and that has certainly benefited from those underpinnings.
Yeah, the other thing I did want to add about what is exceptional about Spark's approach to machine learning is that it's meant to offer the same implementations, APIs, and algorithms from multiple languages. I think this has really been one of the big barriers in machine learning: a lot of the people who are best at machine learning may be used to something like R or Python, whereas the people at companies who are best at, say, deploying it in production may need to operate on, say, the JVM or other frameworks.

Okay, I want to hold that thought, because we're going to come back to the ability to go from design to production, and that's a big differentiator. But let's come back to scalability. I've talked to a bunch of companies, and I've actually only come across a couple, we'll leave names aside, who believe they can do scale-out machine learning; others are emphasizing other things, like usability of the pipeline, that sort of thing. So if you eventually have a couple of other vendors who can do scale-out, who are doing the work, then what are the other differentiators you're going to bring to bear?

Definitely. So I will separate Apache Spark from Databricks here. With Apache Spark, multiple vendors are contributing to it, and a lot of work has been done to scale out there. I think the real challenge for a data scientist wanting to do machine learning is not just in having a good implementation of the algorithm, but in having all of the operations that need to surround it: self-service, very accessible cluster management, creation, scaling, monitoring, checking out logs, all of that which is so important for facilitating scale-out machine learning. And there, I think the ease of use in Databricks is just amazing.
Okay, so let's take a look at perhaps the life cycle of a model, where, as you're alluding to, it's not just scaling it out. Let's talk about where we want to do online learning, so we're getting feedback loops in the form of, I guess, the accuracy of the predictions. How will that work, both online and in batch mode?

Right, that's a great question, and there are currently many answers out there that both researchers and companies have tried. You were talking with Michael earlier about Structured Streaming, and that's a case where I think we have a real opportunity to mesh those two modes together in terms of usability and the API. Fitting on a batch dataset and on a stream of data should be very intuitively similar for the user. In terms of the API, we've been preparing for that by modifying Apache Spark's MLlib to use DataFrames and Datasets as input, and in terms of the implementation, that's something we at Databricks have been thinking a lot about: how to take advantage of Structured Streaming to do those online updates.

Okay, and that's what gives you, if you're using Structured Streaming, both the online and the batch modalities, is that right?

That's certainly true now for, for example, DataFrame and SQL queries. With respect to machine learning in Apache Spark right now, it's really just the batch mode.

Okay, so you just haven't gotten around to that, as far as doing the online learning.

Once you fit your model, you can certainly go ahead and deploy it to make predictions using streaming right now.

But that's a roadmap thing.

That's definitely something we think is super valuable and that a lot of people would benefit from.
Now, in terms of agility, the ability to rapidly test and evaluate the models. Because I don't assume that just because models are learning online, they become the production model. They have to go through a process. What does that process look like now, and what might it look like in a more automated world?

That's a great question. I think there are a couple of elements to that, and those elements will depend a lot on whether you're expecting to deploy a new model every minute or every day. We can come back to that later, but I'd say there are two big questions there. One, just in terms of operations, how do you get that model into production? And second, how do you verify that that's the model you want, and that you really do want it answering queries in real time?

Well, I assume you actually do those two in the opposite order.

It's a good question, and really both happen. As far as moving a model into production, that's something that's very doable with Spark right now, in terms of, say, deploying in a Structured Streaming setting or using DStreams from earlier versions of Spark. As far as deciding whether you want that model to be making the main predictions for your application, there are multiple ways to do that. You can certainly simulate online behavior offline to make a decision. You can also, ideally, do a deployment where you do a bit of A/B testing before you really put it in as your main model.

Sorry to interrupt, but it sounds like both of those are almost supervised. With the A/B testing, I mean, ultimately someone's going to look at the answers.

Right. Although maybe further out they don't have to. That's a good question.
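The A/B-testing step described above can be sketched in plain Python. The hash-based traffic split, the 10% candidate share, and the model callables are all illustrative assumptions, not a Databricks API.

```python
import hashlib

def route_to_candidate(user_id: str, candidate_fraction: float = 0.1) -> bool:
    """Deterministically bucket a user for the A/B test.

    Hashing the user id (rather than sampling per request) keeps each
    user's experience consistent across requests.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2.0**64  # uniform in [0, 1)
    return bucket < candidate_fraction

def score(request, baseline_model, candidate_model):
    # The candidate model serves a small slice of live traffic; its
    # predictions and outcomes are compared against the baseline before
    # it is promoted to be the main model.
    use_candidate = route_to_candidate(request["user_id"])
    model = candidate_model if use_candidate else baseline_model
    return model(request)
```

In this sketch, promotion is raising `candidate_fraction` toward 1.0 once the candidate's measured accuracy holds up, and rollback is dropping it to 0.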
So bringing up the question of what can be supervised online, versus what needs to be done in an unsupervised way or in a supervised way offline later, is a great question, and that really is very dependent on your application. But the question of how to move your model into production, and also a lot of the infrastructure around monitoring, can be shared across both of those domains, supervised and unsupervised.

And by monitoring you mean the evaluation of the accuracy?

Right, and seeing how behavior changes as you move from one model to another.

Okay. So I put up a shadow model, and if it doesn't work in simulation, I roll it back.

That is a common practice, right.

Okay. And then part of your question was: how is that done now, and how could I imagine it being done ideally in the future? There are multiple ways right now, and I think the big division is whether you are deploying within Apache Spark, where you may be deploying either in a batch context with scheduled jobs, or in a streaming context with, say, Structured Streaming. If you're deploying within Apache Spark, that is very straightforward: we provide ways to persist MLlib models and load them back, even across languages. If you need to deploy the model outside of Apache Spark, there are a few options. There's some limited PMML export. There are also efforts, though, which I think are really going to change how things are done, where we at Databricks are collaborating with some other open source contributors to get local implementations of models into open source Apache Spark. Those would be able to make very fast predictions without actually doing DataFrame or RDD operations. So I think that'll be really exciting as far as the operational aspect of deploying.
A model serving layer without all the richness of the Spark engine.

Right.

Just take essentially the model's metadata, the JSON metadata and the Parquet features, and let someone else execute it. It says, I need this type of model; go use the one in C# or whatever.

That's the hope: you could, say, have your web application running, which needs to be a very quick application not touching Spark, and be able to deploy a model trained within Spark in that application.

And that could potentially be at the edge, for IoT scenarios.

Definitely, that'll be very important there.

Okay. Let's take a break right now, and we'll come back with more on making data scientists productive, which is today the most manual part of machine learning.
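The local-serving idea discussed above, fast predictions with none of the Spark engine, amounts to scoring with nothing but the parameters extracted from the saved model files. This plain-Python sketch uses invented coefficient values as stand-ins for what a serving layer would read out of the model's metadata and data files.

```python
import math

def predict_logistic(coefficients, intercept, features):
    """Score one example from exported logistic-regression parameters;
    no Spark runtime, DataFrame, or RDD operation is involved."""
    margin = intercept + sum(c * f for c, f in zip(coefficients, features))
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1.0 if probability >= 0.5 else 0.0
    return label, probability

# Hypothetical parameters, as if read from a saved model's files.
coefficients, intercept = [1.5, -0.8], 0.2
label, prob = predict_logistic(coefficients, intercept, [2.0, 1.0])
```

A web application or an edge device only needs this handful of arithmetic operations per request, which is why exporting the model's parameters is enough for low-latency serving.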