Okay everyone, let's continue. Our next speaker is Nick from IBM, Principal Engineer, and he's going to talk about putting Spark machine learning pipelines into production.

Yeah, good afternoon everyone and thanks for having me here. It's my first time at FOSDEM, so it's great to see all the energy and enthusiasm around open source software at this event; it's great to be here. So today we're going to talk about productionising Spark ML pipelines, well actually any machine learning, with the focus on Spark, using the Portable Format for Analytics, so using open standards for machine learning models.

A little bit about me: I'm MLnick on Twitter and GitHub. I work at IBM in the Spark Technology Center and the Cognitive Open Technologies team, predominantly working on machine learning, AI and deep learning. A lot of my time is spent in the Apache Spark project, where I'm a committer and PMC member, and I've written a book that's fairly out of date now, called Machine Learning with Spark.

So today we're going to talk a little bit about the machine learning workflow, then some of the challenges inherent in that workflow, in particular the end piece, which is deploying those machine learning pipelines to production, then how open standards can help solve that problem and some of the work that my team and I have been doing around this challenge, and finally a summary overview and the future directions of this work.

So the machine learning workflow is really simple, as you know: you take data, you apply machine learning, and you profit, make money, right? But in reality, it's a very complex beast. It spans multiple teams. You've got your data, which can be in various forms; some of it is historical, some of it is arriving in real time; it's across data stores, across systems; you've got lineage, you've got metadata, and you need to record all of that. That's typically the domain of your data engineering teams in a medium to large organization.

Then the traditional data science or machine learning workflow that most people think about is in the middle here. You take your data, which you assume is all nice and clean and available to you, you ingest it, you apply some data exploration, feature transformation and engineering, and you get that data into a format that you can feed into your machine learning model. And then the piece that most people tend to think takes up most of the time, but actually takes up some of the least time, is the model training. Actually selecting the model and doing your hardcore, pure machine learning is only a really small piece, but it is a critical piece. That's the domain of your data scientists and your researchers.

And then once they're done with it, they tend to throw it over the wall and say: okay, machine learning and production engineers, here's my model written in R or Python or whatever, and you need to deploy it. So that spans a pretty big boundary, and you need to take this system and this workflow into production with all the requirements for scalability, uptime, speed, performance and monitoring that that entails.

And it's not just the models that you deploy. You train this fancy model, but you actually need to deploy the entire workflow of pre-processing steps, feature engineering, cleaning of the data, and everything that comes before it. And then you've got to worry about versioning. Which version of this model are you deploying? Have the features changed since the last version?
So your training pipeline needs to talk to your production system about these kinds of issues. And finally, once it's in production, as I mentioned, the real-world system needs to predict on new data, but it also needs to monitor and do live evaluation; it needs to know if things are going wrong. Do I need to retrain the model? Is performance dipping? You need to get feedback from that system and put it into this feedback loop, which then comes back to your historical and streaming data at the beginning, as well as that data going back into your data science pipeline. So this workflow is really a loop that spans teams as well as tools.

So there are a lot of disparate formats for data, and those schemas vary with time. Features, input formats, all the data that you have available may change; the definition of that data, the definition of the features, may change over time. You've got a whole bunch of different tools for doing your pre-processing. It's pretty common now to have pipelines that do that, or some sort of framework or toolkit that allows you to build pipelines, such as scikit-learn, Spark MLlib, or TensorFlow with TensorFlow Transform for the pre-processing steps. And then you perform cross-validation, again spanning R, Python, Spark (whether that's Scala, Java, Python or R), TensorFlow, etc. And then you've got the final model.

So one important thing to note here, which I mentioned a little bit before, is that the pipeline and the data schema itself have to be completely consistent between your training and data-processing step and production. It doesn't help if you push a model to production and then the data coming into the ingest part of that pipeline is different. If the schema changes, if a feature changes or disappears in that model, you get complete garbage; garbage in, garbage out, and it's not going to work. So you effectively need to freeze that pipeline, and you need to deploy the frozen version of that pipeline to production.

So I've mentioned some of the challenges: you need to bridge and manage all of these gaps between languages, frameworks, the dependencies of those frameworks, and their versions. If a version changes, the behavior might change. You have a tight coupling between the version in training and the version in production. So deployment here is not really deployment in the DevOps sense of the word, of deploying an artifact; it's really the fact that you're deploying this pipeline. Some of the DevOps solutions do help here. For example, containers are becoming quite popular for all kinds of deployment, but certainly in data science and machine learning, and it's obvious to see why. It's really compelling to think that we can wrap up the final version of that model in a container and deploy that. That does help with things like dependencies and versions, certainly. But it still means that you have to have a workflow that allows that, and every time a version updates, that component of your workflow might change.

One other thing to think about is that the performance characteristics can be really variable across these boundaries. Model inference and scoring in TensorFlow may be quite performant, whereas something in R, or even Python scikit-learn in some cases, and certainly in some cases Spark, can have serious performance issues, particularly for real-time inference. So ideally, you want to try and homogenize this as much as possible: you want to know that whatever you use for training and pre-processing, when you put it into production, the performance will be predictable.
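(Coming back to the schema-consistency point above: a minimal sketch, in Scala against the Spark schema API, of the kind of guard this implies. The helper is hypothetical, not something from the talk.)

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical guard: compare the schema the pipeline was frozen against
// with the schema of the data arriving in production, and refuse to score
// on a mismatch rather than silently producing garbage.
def checkSchema(expected: StructType, incoming: StructType): Unit = {
  require(incoming == expected,
    s"Schema drift: pipeline expects $expected but production data has $incoming")
}
```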
So a lack of standardization in this space leads to custom solutions. Everybody rolls their own framework for scoring models, their own model serialization formats, their own interchange mechanisms, and it's all custom. Where standards do exist, they have serious limitations, and that ends up leading to more custom stuff. So you end up writing custom extensions, which essentially means you get no benefit from a standard, because the standard itself is broken and the portability component of standardization is lost.

That applies to any kind of machine learning framework, but for Spark itself we have a lot of additional challenges. Those of you who know a little bit about Spark ML may know that you have Spark ML pipelines, a component that allows you to quite easily and elegantly create machine learning pipelines using the DataFrame abstraction. You take data frames and you transform the columns of that data frame as you go through the components of your pipeline: pre-processing, feature extraction, and then machine learning model training. You end up with something called a pipeline model, which is exactly this frozen version of your machine learning pipeline or workflow, and you can just feed a data frame in and get your result out.

Sounds very neat. However, when you try to use that and actually deploy it in a production scenario, you have a tight coupling to the Spark runtime. For training, that's great, because you want to do training at scale, and Spark allows you to scale up the traditional ETL data-processing steps as well, sometimes using Spark SQL components and leveraging the power of the optimizer there. But when it comes to inference and scoring in real time, you have the overhead of the data frame: just generating the query plan can take more time than you have for scoring. You have the overhead of task scheduling, even if you're running entirely locally. So really it's optimized for batch scenarios, and in some cases streaming micro-batch, yes, it can work. But certainly if you have a hard real-time limit, and depending on the domain that can range from microseconds up to, I'd say, a few hundred milliseconds or half a second of latency, Spark is just not going to cut it. It's not fast enough. So despite this elegant API and high-performance training, you can't really use it for scoring.

So in order to actually take that Spark pipeline that you've now trained and spent a lot of time working on and deploy it to production, you need to do something completely custom. Spark uses its own format for exporting things. You have to write a custom reader for that format, back into your own custom machine learning library for real-time scoring, or write some custom converter from the Spark format into your library of choice, whether that might be scikit-learn, TensorFlow, H2O or whatever it might be. So everything is custom; there's no off-the-shelf solution. Well, there's sort of one that's a little bit older, and we'll talk about one that is a bit newer.
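(To ground the Spark side of that: a minimal sketch of the pipeline flow just described, using the standard Spark ML API with made-up column names, showing where the frozen PipelineModel comes from and why scoring through it carries DataFrame overhead.)

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

def trainAndScore(training: DataFrame, incoming: DataFrame): DataFrame = {
  // Pre-processing, feature assembly and the model are all pipeline stages.
  val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
  val assembler = new VectorAssembler()
    .setInputCols(Array("categoryIdx", "amount"))
    .setOutputCol("features")
  val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
  val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))

  // fit() freezes the whole workflow into a single PipelineModel.
  val model = pipeline.fit(training)

  // Scoring still runs through the DataFrame machinery, so every call pays
  // query planning and task scheduling overhead: fine for batch, too slow
  // for hard real-time latency budgets.
  model.transform(incoming)
}
```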
So the Portable Format for Analytics is, I believe, one of the solutions for this particular problem. It's an open standard for deploying analytic pipelines, created by the Data Mining Group, of which IBM is a founding member, though there are many other enterprises, large and small, involved in that group. And it's really trying to be the successor to PMML, the Predictive Model Markup Language, which is arguably the only real open standard that is viable today. PMML is an XML-based serialization format, and it specifies the transforms, the model, and the prediction that you want to do in that pipeline. So it's great, and it has out-of-the-box support for many common components, you know, your logistic regressions or your tree models and random forests and so on, some pre-processing. But it has many limitations, and that led to exactly the problem of custom extensions. Everyone wrote custom extension points for PMML, which completely nullified the benefit of the standard.

So PFA was created specifically to address those shortcomings. The idea is that instead of XML it uses JSON, so a little bit more modern; that is the serialization format. It specifies schemas using Avro, so you can use any Avro data type, which effectively covers anything you care about. And then it encodes the set of functions, or actions, that you perform on your input to generate an output. So you can think of this pipeline as a set of transformation nodes, and PFA allows you to specify, in a kind of mini functional, mathematical language, what you do on the input to create the output. And given a single piece of PFA that does one transformation node, you can fairly easily combine those nodes together into one holistic document, called a PFA document, to specify your entire pipeline.

The type and function system means that once you've generated a valid PFA document, it can effectively be type-checked at runtime, so you know there are not going to be any strange runtime errors. So you've got type safety, and once you've got a valid PFA document, you can run it on any compliant PFA scoring engine. So you have true cross-platform, cross-framework, cross-language portability. There are reference implementations available in Java, Python and R. Typically, for producing PFA documents, versions of a serialized model, you might be using Python or R. For scoring, you're probably going to be using Java, or if you really want performance, maybe you're going to write something in C++ or Go.

So this is what it looks like. The JSON is not really meant for human readability, necessarily; it's made to be easy for machines to generate, so it's a bit verbose. But here is, for example, logistic regression. We take a double array, which is just an input vector, and we output a predicted class; those are just Avro types. And then the core function here is what we call the action, and that just specifies a set of functions to apply to the input to arrive at the output. There are obviously a lot of built-in functions within PFA, and you can write your own user-defined functions using any of the built-in functionality, but there's some handy stuff for your typical models, including linear models. So you just call the linear regression function, which takes an input vector and a model cell. Now, we don't have that much time to go into cells, but a cell in PFA is just a way of specifying stored data. It will typically be the coefficients of your model, or any state that your pipeline needs in order to do its work.
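(Schematically, a PFA document along these lines looks roughly like the following. This is reconstructed from the PFA spec rather than copied from the slides, using the single-output model.reg.linear built-in with made-up coefficients, so treat the details as illustrative.)

```json
{
  "input": {"type": "array", "items": "double"},
  "output": "double",
  "cells": {
    "model": {
      "type": {
        "type": "record",
        "name": "LinearModel",
        "fields": [
          {"name": "coeff", "type": {"type": "array", "items": "double"}},
          {"name": "const", "type": "double"}
        ]
      },
      "init": {"coeff": [0.5, -1.2, 2.0], "const": 0.1}
    }
  },
  "action": [
    {"model.reg.linear": ["input", {"cell": "model"}]}
  ]
}
```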
That state is stored in these kind of immutable, read-only cells. So this function takes the cell, which is the model, effectively the set of coefficients; it does a matrix multiplication, a softmax link function, an argmax, and you're done. So despite being fairly verbose, you can see that the actual application of it is very simple: it's just doing the math and specifying that in this kind of DSL.

So, as I mentioned, there are reference implementations of PFA engines, and there is one in Java, but it doesn't really give you a way to author PFA documents or to export from something into a PFA document. And this is what I've created, at the moment for Spark ML. We call it Aardpfark. I'm from South Africa, so you may know that an aardvark is literally an "earth pig"; it's like an anteater. So that's where the name comes from. The core of it is a Scala DSL, which you can see on the right here, for creating PFA, and then Aardpfark-SparkML uses that same DSL to export Spark pipelines to PFA. You can see, if you compare the Scala code to the JSON, that it's generating exactly that JSON: you've got an input, you define the cell with your data, and then you just specify your action with the Scala DSL. The idea is to try and make it as natural and as type-safe as possible, and we rely on Avro4s to do some case-class magic and automatically derive the Avro types.

In terms of coverage, we have full pipeline support, most feature transformers, and almost all the machine learning models in Spark. There are some major missing features. For example, there's no generic vector in PFA, so you have to have either a dense vector or a sparse vector, and really you want a generic one that can mix the two, because it gets very cumbersome to try and deal with both.

So, very briefly, why PFA? Why not something else? Are there alternatives? MLeap is a good alternative, specifically for Spark and more recently for scikit-learn and TensorFlow. It's completely written in Scala, which means that if you want to do anything, any custom export of your own model, you have to know Scala, which may or may not be a problem: many data scientists and machine learning engineers may not know Scala; they may be working in Python or R. It's an open format in the sense that it's open source, but it's not a standard, and that has led to a few issues in the project. Performance is very good and coverage is very good, but it really doesn't have this concept of independence across frameworks and versions. You have a tight coupling between the version of Spark that you use to generate the model and export to MLeap and the MLeap version you run in production: every time you update your version, you have to change the version of both Spark and MLeap in your production system, which is not ideal.

And recently the Open Neural Network Exchange, ONNX, was announced. It's a protocol-buffer serialization format that also specifies the set of actions or functions or operators that are applied in your graph, your neural network graph. It's quite specific to deep learning at this stage, but in that sense it's very similar to PFA, and it appears to be a great standard for deep learning. But as I mentioned, it has pretty poor support for your traditional machine learning or analytic pipeline: tree-based models, string processing, control flow, intermediate variables are not really there. So this is something to watch.
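(Putting the Aardpfark export and a PFA engine together, the intended flow is roughly the sketch below. The Aardpfark entry point follows the project's README and the Hadrian calls follow its tutorials, but treat the exact signatures as assumptions, since both APIs may differ by version.)

```scala
import com.ibm.aardpfark.spark.ml.SparkSupport._
import com.opendatagroup.hadrian.jvmcompiler.PFAEngine
import org.apache.spark.ml.PipelineModel

def exportAndScore(pipelineModel: PipelineModel, jsonRecord: String): AnyRef = {
  // Export the frozen Spark pipeline to a PFA JSON document (Aardpfark).
  val pfaJson = toPFA(pipelineModel, pretty = true)

  // Load it into Hadrian, the JVM reference scoring engine: no Spark on
  // the classpath, no DataFrame or query planning overhead at score time.
  val engine = PFAEngine.fromJson(pfaJson).head
  engine.action(engine.jsonInput(jsonRecord))
}
```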
In summary, PFA, I believe, provides an open standard for machine learning pipeline deployment and analytic artifact deployment. It provides true portability across languages, frameworks, runtimes and versions, and it gives you an execution and scoring environment that is completely independent of the producer. So whatever language or framework you use to produce your model, you can keep the same scoring runtime and not have to worry about version changes, upgrades, and so on. That solves a significant pain point for the Spark ML ecosystem, because, as I mentioned previously, it's very difficult, if not close to impossible, to deploy your Spark ML models currently. And it also benefits the wider ecosystem: those using PMML to do export today, XGBoost, LightGBM, scikit-learn, R, can at some point graduate to PFA.

But there are risks. PFA is a young standard; it's still developing. The performance is not tested at any scale in production; that is something we are working on. What about deep learning? It's very hot right now, and everyone wants to know about it. Can PFA be a contender there, or is it just not suitable? And the standard moves slowly: if you want to change anything, you have to go through the standards committee. That comes with the benefits of standardization, but there are some downsides to open standards too.

So, to wrap up with the future directions: this is not open source yet, but we are working on getting it to a state where it can be released, starting with Spark ML pipelines and later looking to add support for scikit-learn, XGBoost, LightGBM and some of these other projects. R support already exists in the Hadrian project; there's a link in the slides, so many R models and a lot of functionality are already exportable. We're busy doing performance testing against Spark and against MLeap, trying to tease out where the performance issues are, of which there are many. And as I mentioned earlier, there are a lot of gaps even in the PFA standard, one of the main ones being support for a generic tensor or vector, along with performance improvements to the scoring engine. And finally, we're looking at whether one can use PFA for deep learning. It requires this generic tensor schema; it requires your GPU operators and all the deep-learning-specific operators to be built in. But I think this is something that wouldn't be too difficult, and that's what we're looking at.

Thanks very much. There are some links and references in the slides, which are online. Thanks for your time. Questions?

Thank you very much, that was very interesting. I understand that the purpose of the standard is to facilitate this link between R&D, the development of the models, and then putting them in production. One of my concerns is, I mean, supposedly this means that the framework used for the development of the model and then for the inference might be different, right? So one concern I have is that when the actual implementation of the model is different, it doesn't necessarily mean we're going to get the same results, right? So probably that's not the role of the standard to solve, actually. But do you think that puts a limitation on the practicalities of actually having this separation, you know, developing a model in one framework and running it in another?

Yeah, I think that's a good question. And did everyone hear the question?
So there are obviously benefits to either approach. On the one hand, you've got this multiplicity, this plethora of producers. We call them producers in this scenario; they produce the model. And in some cases you can use that same producer to score. Generally, scoring in, for example, scikit-learn is really fast. Scoring in TensorFlow can typically be quite fast, and with some of the tricks of freezing the graph and making it more efficient at inference time, it can be quite quick. Scoring in R for a large model, for example, may not be fast. So on that side of it, it's much easier to just say, okay, we'll use the same producer to score the model, and you can solve some of the pain points of productionizing it, things like the versioning and performance and so on, for example using Docker containers. There are a few projects that do that, where you provide a kind of serving layer, and underneath are independent Docker containers that effectively house that model and a scoring function, whether it's a scikit-learn predict or transform, or a TensorFlow run. So that is a valid approach. But even then, there's a big challenge in managing that versioning, the runtime dependencies, and the changes that happen.

On the other side, using a standard approach, yes, there's a big challenge in creating the PFA version of the model, for argument's sake, in the sense that you have to write that logic. You can, to some extent, have automatic translation layers that will try to inspect, let's say, a TensorFlow graph and convert it. That's certainly possible, but obviously a lot of work. Otherwise, at the moment for Spark, we've written them by hand. So that does involve a lot of duplication of logic, writing it in a different form. But for scoring, you typically need much less logic than for training. The training pipeline can involve some very complex algorithmic work; scoring, as you saw, for even the most complex model, is pretty much just a little bit of linear algebra and some lookups, pretty basic stuff. And that's really the core idea: this is not meant to deal with training at all; it's meant to make the scoring component easier to deploy and more standardized.

So I don't know if that fully answers the question, but it's not necessarily one solution. I think there are definite drawbacks to the standard, and I've listed a few of them, and you do have to rewrite a lot of the logic. But at the same time, once you've done that, you're independent of any major changes in the producing framework. Of course, if the inference code or code path changes, you need to update that, but it means you don't have to redeploy an entire new version of your scoring system, for example. And that, I think, is the key benefit: you can isolate your production system and have it just take these arbitrary PFA documents, and as long as it's a compliant engine, it can read and score them without ever upgrading that engine version, unless, of course, you need a new version for bugs or performance.

Thank you very much.
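(To underline that last point about scoring being mostly a little linear algebra and some lookups, here is a minimal, dependency-free sketch of what binary logistic scoring boils down to. It is illustrative only, not code from the talk.)

```scala
// Binary logistic scoring: a dot product, a sigmoid and a threshold,
// a tiny fraction of the logic the training side needs.
def score(coeff: Array[Double], intercept: Double, features: Array[Double]): Int = {
  require(coeff.length == features.length, "feature/coefficient length mismatch")
  val margin = intercept + coeff.zip(features).map { case (c, x) => c * x }.sum
  val probability = 1.0 / (1.0 + math.exp(-margin))
  if (probability >= 0.5) 1 else 0
}
```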