All right, hi, everyone. Our next speaker is Ben Sadeghi. He previously worked at Microsoft as a solutions architect, and he's now at Databricks, based here in town. He'll talk about MLflow and the lifecycle of ML and AI. Over to Ben.

Thank you, Dan. Hello, everyone, great to be here. I'll get straight to it. This is MLflow. It's basically a machine learning lifecycle management platform. Let's talk about why it exists and why people are investing so much time building it. Primarily, it's because ML work, especially pipeline development, is complex. Those of you who are practitioners know that it's an iterative process; it really is a cycle. You have a data preparation phase, followed by model training, model evaluation, and then deployment. Once deployed, you're constantly capturing metrics and checking whether your model is drifting, and as new raw data comes in, you potentially go back and run through the whole training and deployment cycle again.

There are dozens of open source tools out there for each of these phases, primarily in the R and Python ecosystems; I'll talk about Spark a bit more in a second. And each of these steps is itself iterative: in data preparation you'll loop through new features, and you'll iterate on model training as well. So there are a lot of parameters to be tracked, and that in itself can be a challenge.

Then you need to make all of this scale, not just to many servers but to large groups. We're talking about bringing siloed teams onto one flow: for data preparation you might have data engineers involved, for model training the data scientists, and for deployment and collecting the new raw data, the DevOps folks. The lifecycle needs to scale out to all these teams. On top of that, you ideally want to impose some sort of governance. That's rarely done today, but MLflow is helping change that. There's also model exchange: the ability to train a model with one open source framework yet deploy it using another, say train in TensorFlow, deploy in PyTorch. That's the complexity we're trying to address, and MLflow is here to the rescue.

So what is it? It's an open source project, started by Databricks and open sourced in June of 2018. It's a set of conventions, specifications, tools, CLIs, libraries, and of course a community. All development is currently on GitHub, with a lot of different folks involved, and it's already been integrated into three or four commercial products.

A quick note on the design philosophy. It's API-first: everything has an API, because this is about automating the lifecycle, so everything needs to be doable programmatically. It's modular; we'll talk about its various pieces in a second, but you can take what you need and discard the rest. And it's easy to use, which I'll demo in a moment. It's available right now via pip and via Conda (on the Conda side, through the conda-forge channel), with APIs for Java, Python, and R.
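Just to show how lightweight the tracking API is, here's a minimal sketch; the parameter and metric names are hypothetical, purely for illustration:

```python
# A minimal sketch of MLflow tracking; the names and values are hypothetical.
import mlflow

with mlflow.start_run():
    mlflow.log_param("max_depth", 3)      # any key-value pair you care about
    mlflow.log_metric("accuracy", 0.98)   # a model performance metric
```

Everything inside that `with` block is captured against a single run, which is the pattern I'll use in the demo later.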
And it's open source because this is a big problem: we're trying to get as many contributors as possible to pitch in on the problems they face. We want this tool to address everyone's needs, and for that we need more input and more contributions from others.

So let's get into the actual components. There are four. One is brand new, the Model Registry, which I won't talk about much. The major ones are Tracking, Projects, and Models.

Tracking is literally tracking of everything: the code used for data preparation, the code used for modeling, all the parameters used within the machine learning algorithms, the corresponding model performance results, and any environment configurations. Within tracking you have a few concepts. Parameters are generic key-value pairs, so they can be anything you like. Metrics are performance metrics for the models. Artifacts can be any generic files; if you've generated an image for your model results, you can bundle that in. And of course the source code itself is tracked.

Then you have the Projects component, which bundles all those artifacts into a package that can be redeployed anywhere, so you can reproduce the exact same results. Projects aims for reproducibility: it bundles the code, the configuration, and the data sets so that you can readily re-execute the entire environment and reproduce the same results, whether remotely or on your local setup. As an example of what a project contains, you always have some sort of YAML config file, your main modeling script, and so on. You can then just do an `mlflow run`: it'll find the entry point, get what it needs from the YAML, set up the environment for you, and you're up and running with the entire workflow reproduced.

On the Models piece, you have two open source components integrated with MLflow, namely MLeap and ONNX, which are both essentially converters from one machine learning framework to another. Using MLeap and ONNX, you can train in, say, TensorFlow, but then convert the model and deploy it as a PyTorch model. Furthermore, you can have the native model, in this example TensorFlow, saved as-is, alongside a converted one exposed as a generic Python function, which can then be run by any Python environment, say in Docker or on Spark, as I'll demonstrate in a second.

Deployment environments are basically anything you can imagine. Java should be included on this slide too, my apologies. We often see deployments done on Docker containers, we'll do some batch deployment using Spark, and there are cloud services out there as well, namely Microsoft's Azure Machine Learning service and AWS's SageMaker. So it's a lightweight, open platform that integrates well with existing frameworks, and it has its own server running in the background, keeping track of all this logging activity. And within Databricks, you have a managed version of MLflow available.
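To make that Spark batch-deployment pattern concrete, here's a rough sketch using the generic python_function flavor; the run ID, model name, and column names here are assumptions for illustration:

```python
# A sketch of batch scoring on Spark via MLflow's generic python_function flavor.
# The run ID, artifact path, and input columns are hypothetical.
import mlflow.pyfunc

model_uri = "runs:/<run-id>/decision-tree-model"
predict = mlflow.pyfunc.spark_udf(spark, model_uri)

# input_df: a Spark DataFrame of new measurements to score (hypothetical)
scored = input_df.withColumn(
    "prediction",
    predict("sepal_length", "sepal_width", "petal_length", "petal_width"))
```

The same logged model could instead be served from a Docker container or a cloud endpoint; the point is that the generic flavor decouples the model from the framework that trained it.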
So if I may, it's demo time. Here we are in Azure Databricks. Databricks is a managed Apache Spark environment available on Microsoft's Azure cloud and on AWS; I'm on the Azure side right now.

I have a little Spark cluster going, and by little I mean minimal: it has one worker. Let's take a look at the libraries. I have two installed: MLflow, which I fetched from PyPI, and Koalas, which I'll talk about tomorrow if you're around; that's basically a pandas API for Apache Spark. So the MLflow library is installed on this Spark cluster, and I'm going to jump into a notebook. This is a Databricks notebook; if you're familiar with Jupyter or Zeppelin, this should be very, very easy for you, it's the same thing, a web-based notebook IDE.

For those of you who are practitioners, or who have studied machine learning a bit, you're probably familiar with this data set: the Iris data set. Some call it vintage data because it's from the 1930s. I really like it because it's pretty straightforward. I've connected to the cluster, which I've named fossasia, and we're going to do some machine learning and keep track of all the experiments.

First, I'm going to load the Iris data. It's a CSV file; I'll read it into the Spark cluster, do a little bit of renaming, and display the first 10 rows. This is the first execution on the cluster, so give it a second. As soon as the data set is up, we can get started on putting together our pipeline, basically getting the data prepped for machine learning work.

While that runs, we'll continue on. Once we have the data set in memory, we're going to do a couple of things. First, you'll see that one of the fields, species, is a string, and we have to address that, because the Spark machine learning library we'll be using later demands that all data be in numerical format. So we'll map those strings to integers using the StringIndexer. There's one other step that needs to be done: the VectorAssembler, which takes all the feature columns and crunches them into a single vector. That gives us our prepped data set.

So here we go. Those of you familiar with the Iris data set will find this a very familiar sight: you have four length and width fields, for the sepals and the petals; the sepals are the long ones, the petals the smaller ones. What Fisher did back in the 1930s was take these length and width measurements of the petals and sepals, collected by a botanist who could identify the species of these flowers, and construct this data set. It has three species: Setosa, Virginica, and Versicolor. We're going to build a model, basically an expert system, which is fed these lengths and widths and predicts the species of the flower.

As I mentioned, we map the species to an integer, which becomes the label column, and the original features get vectorized into that single features vector. It's these label and features columns that we're going to feed into our machine learning algorithm.
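In code, the prep steps I just described look roughly like this; the file path and column names are assumptions for illustration:

```python
# A sketch of the data-prep pipeline described above; path and column names assumed.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Read the Iris CSV into a Spark DataFrame and rename the columns
iris = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("/data/iris.csv")
        .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species"))

# Map the string species column to a numeric label (Spark ML needs numbers)
indexer = StringIndexer(inputCol="species", outputCol="label")

# Crunch the four measurement columns into a single features vector
assembler = VectorAssembler(
    inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    outputCol="features")

prepped = Pipeline(stages=[indexer, assembler]).fit(iris).transform(iris)
prepped.select("features", "label").show(10)
```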
A very quick run through the data science process: I just want to touch on splitting your data set into training and test sets, basically to make sure you have a sensible way of evaluating model performance. Typically you hand over the majority of your data set for training, and what's held out is used for testing. In Spark there's a very simple function for that called randomSplit; in this case I'm going to pass two-thirds of the data set over for training and keep the remaining third for testing.

Now comes the MLflow piece. I'm going to import mlflow, along with its Spark extension, mlflow.spark. While we're at it, we'll pull in a couple of things from Spark's machine learning library, namely a decision tree classifier and a multiclass classification evaluator.

I'm building a little helper function called train_and_evaluate, and it takes in two model parameters for the decision tree: max bins and max depth. I start off with mlflow.start_run(); everything that follows within that indentation is going to be logged by MLflow. I construct the decision tree classifier and pass it the max bins and max depth parameters that were fed into my helper function. I then train the model, that is, fit it on the training DataFrame, and out pops a decision tree classifier model; I do a little print, just as a sanity check. Straight afterwards, we use that same model to make predictions on the test set; that's the transform function, which you can think of as predict in, say, scikit-learn. So what are we predicting again? The species of the flower, given the lengths and widths of the sepals and petals.

We've built a model and made predictions, but we need a way to gauge its performance. I'm going to use two separate metrics, one being accuracy, the other the F1 score, generated by comparing the actual species against the predicted ones; they're two different techniques for measuring how well a model is performing.

And then comes all the logging. With MLflow, I log the parameters, which, as I said, are just key-value pairs: the max bins and max depth model parameters. I also log the two metrics I'm generating, accuracy and F1 score. At this point I could even have logged any other artifact I'd like: code, images I've generated, whatever; you can log all of that. And I'm also logging the actual model itself, in this case the pipeline model, and just giving it a name. So that's my little helper function, and by the way, it returns the model at the very end.
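Here's roughly what that helper looks like. This is a sketch reconstructed from the description above, so the exact names, split seed, and print format are assumptions:

```python
# A sketch of the train-and-evaluate helper; names and seed are assumptions.
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Hand two-thirds over for training; hold out the remaining third for testing
train_df, test_df = prepped.randomSplit([2.0, 1.0], seed=42)

def train_and_evaluate(max_bins, max_depth):
    # Everything inside this block is logged against a single MLflow run
    with mlflow.start_run():
        dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                    maxBins=max_bins, maxDepth=max_depth)
        model = Pipeline(stages=[dt]).fit(train_df)   # train on the training split
        predictions = model.transform(test_df)        # "predict" on the held-out split

        evaluator = MulticlassClassificationEvaluator(
            labelCol="label", predictionCol="prediction")
        accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
        f1 = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})

        # Log the model parameters, the metrics, and the model itself
        mlflow.log_param("max_bins", max_bins)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("f1", f1)
        mlflow.spark.log_model(model, "decision-tree-model")

        print(f"maxBins={max_bins}, maxDepth={max_depth}: "
              f"accuracy={accuracy:.3f}, f1={f1:.3f}")
        return model
```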
Now let's go ahead and use it. I call train_and_evaluate with a max bins value of 15 and a max depth of two and have it build a model for me. There we go: accuracy is 0.90, F1 score is 0.91. That's a pretty good model. Can we do better? Let's go with a max depth of three, altering just that one model parameter. Much better: 98%. I can continue this: a max depth of four, leaving max bins unaltered; looks like that didn't make too much of a difference.

You know what, I've already forgotten the first few scores, but that's okay, because that's what MLflow is doing in the background, tracking all that activity. Within Databricks you actually have a little sidebar where MLflow keeps track of the three runs I just did; these two were pretty similar. But better yet, there's a whole UI. This is part of the MLflow server; on your own machine you'd run `mlflow ui` on the CLI to get this environment up and running. Here you can come in and compare results; if I want to, say, sort these runs by accuracy, I can do that.

Good so far, but I want to go a little crazier. So far I've made very simple changes to one parameter. Typically in real machine learning work you might have dozens of parameters that you want to search through to find the best model, so you wind up with a multi-dimensional parameter space to explore. In this case I'm just going to go with a two-dimensional one.

Now, there's a right way of doing this and a wrong way, and I'm going to do the wrong way for the sake of simplicity: a brute-force search, basically just a couple of for loops. I'll put my mic down for a second to type. For max bins from 5 to 16 in increments of two, and for each max depth value, call the helper function. How many is that? Roughly 25 runs. So it's chugging away, and I don't have to pay too much attention here, because all of this is being tracked; I'll have a whole user interface for exploring the results in just a second. It's a pretty aggressive sweep. The right way of doing this would be cross-validation while doing your parameter tuning, and if you're using Spark ML, MLflow will capture all that activity within your cross-validation runs as well. It's still going... ooh, look at that, we might have a champion right there. Okay, that's done.
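In code, the sweep amounts to a nested loop over the two parameters. The max bins range is the one I spelled out; the max depth range here is an assumption:

```python
# A sketch of the brute-force sweep; the max depth range is an assumption.
for max_bins in range(5, 16, 2):       # 5, 7, 9, 11, 13, 15
    for max_depth in range(2, 7):      # assumed depth values
        train_and_evaluate(max_bins, max_depth)
```

Each call opens its own MLflow run, so every parameter combination shows up as a separate row in the tracking UI.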
Let me jump back into the UI and do a quick refresh. (To a question from the audience: yes, it was a range, 5 to 16, in increments of two.) So now we have a bunch of runs. Nice. I want to compare all of these together now, so let's compare. You can go in and look at one parameter dimension at a time, say max depth versus accuracy; that's fine, and we see that a max depth of four is faring better. But we were actually searching a two-dimensional space, so it makes more sense to go with a contour plot: max bins versus max depth, colored by F1 score. So where are we scoring high? There's max bins on the x-axis, and we're trying to get to the lighter shade; lighter means an accuracy near one, so lighter is better.

It looks like these pockets are doing pretty well, but for some reason this whole region around a max bins of 13 isn't faring well. Good to know; we'll avoid that when we do a final model run. So we've identified the corners within our multi-dimensional parameter space that are good for this specific task: a max depth of four, and for max bins, anything above, say, 13.

Good, that's the little demo. Even without the managed version in Databricks, you still get this whole MLflow UI; you just run `mlflow ui` and point your browser at localhost, port 5000, I think. And that's it. Thank you very much. By the way, you can find me on LinkedIn, GitHub, and Twitter. I've posted all the slides and the demo code on GitHub: under my GitHub account there's a FossAsia 2020 demos repo, and the slides and the code are all there. Thank you very much.

Thank you, Ben.