Okay. Can everyone hear me? Sounds good. Hi, I'm Joey Frazier. I'm going to be talking about machine learning and artificial intelligence life cycles. A little bit about me: I'm a technical enablement lead at Databricks, I'm an Apache Software Foundation member, and I live in Austin, Texas, but I'm very proudly from Toledo, Ohio. Since we're here in Spain, I should probably clarify that that's the Toledo I'm talking about.

Here's what we're going to do. We're going to talk a little bit about why machine learning and AI can be hard, what a traditional software development life cycle looks like, and how that aligns with the ML and AI life cycle; oftentimes, people think the two are pretty distinct. Then I'm going to look at MLflow, and dig into how MLflow projects and multi-step pipelines are set up and, specifically, how those appear in experiment tracking. Then I'll show you an alternative view from another project I spent some time on called LaVar, and mention why it's important that these tools be open source. At the end of this, I really just want you to go out there and start using some of the tools that are available to make your projects more efficient.

Okay, it often gets said that the hardest part of AI isn't actually the AI, it's the data, and a big part of that is because there are all these other things you need to do, basically from the point of data acquisition to actually deploying models and getting some value out of them. A lot of times, people think we're only talking about the things on the left-hand side, primarily the data problems. That's an area Databricks has excelled at for a long time. But if you're not doing the things on the right-hand side in particular, actually getting your models into production, serving up results, monitoring those processes and so on, then there isn't really a lot of point to doing any of it. And this is where things like MLflow and other tools come in, because they can actually help you do those kinds of jobs.

This gets thrown around a lot, this thing about how a lot of data scientists just feel like data janitors. I wanted to put it up here mostly because, if you actually look at this breakdown, what's kind of funny is that almost nothing is allocated to that endpoint of getting things into production, doing something with them, getting business value. That really makes you wonder whether anybody's thinking about it. And another aspect of it: I doubt anyone who has actually developed a model would say they can get it into production the very next day, and that's roughly all the 5% of a month in this breakdown would allow, so it's completely unrealistic.

So what would it take to get some value out of things, if you're having trouble getting value out of your ML and AI projects, which ostensibly most people think is a real problem in the industry today? Comments like this get thrown around: "To complete the picture, these models must be deployed." It's kind of astonishing that we even need to say something like that, but we probably really did, and going back a few years it was probably even more necessary. A lot of people also think that their team doesn't have the right tech stack to execute on what they need to. I would say this is a real problem, but we're making progress, and that's where tools like MLflow come into play.
And then there's a last problem, where oftentimes people think that their teams aren't organized right, or that the data science team doesn't get along with the data engineering or IT team who might have the responsibility for getting things deployed. That's a completely artificial problem, because once you have the right tools it becomes easy to solve: everything starts to look more similar from both sides of the fence.

Okay, this probably shouldn't be new to anybody. The software development life cycle is pretty straightforward: you start with some problem, you come up with a plan for how you're going to implement things, you implement, you get things tested and deployed, and so on. Nothing very new about that. But there are tons and tons of tools for doing it really well; they've all been around for a long time, and they're all pretty mature. We know how to do continuous integration and continuous delivery, containerization is pretty mature at this point, and there are tons of choices for instrumentation, application logging, bug detection, all that kind of stuff. Do we have as many tools for actually getting ML projects into production? Not even close.

So there's this common wisdom, or thing that you also see thrown about, where teams think that the process they need to use for ML and AI is fundamentally different from what software engineering has done in the past. They think that because they're research oriented, or because aspects of what they do take a longer cycle, the process must be different. The deliverables are different, that's for sure. But I would say the process really isn't that much different; we don't need to go in search of some fundamentally new process.

If we look at what somebody might claim the ML and AI life cycle looks like, at first glance it looks a lot more complicated. There's more time spent on understanding the problems, potentially more on actually testing things, and the way you test things is more complex. But we're just using different descriptions. We still go through the same phases of requirements, analysis, design, implementation, testing, and deployment. It's no different.

We can go even further back in time and look at how things were done in statistical consulting. Go way back to the 70s and 80s, when people had to think about how to rigorously deliver on data projects. They already had a nice little approach: you spend time with the person you're going to be delivering the product to, you identify how you're going to approach it, you go through implementing and analyzing things using whatever statistical methods you've identified, and then you review and present the data and so on. Exact same story. It looks a lot like the software development life cycle. So we don't really need to think in terms of a new life cycle; we just need to think in terms of getting better tools that accommodate what people are used to from traditional software development.

So if we look at some of this: what's not a problem is coming up with a business need. What is a problem is getting the data you need to address it. Finding frameworks or runtimes for actually doing machine learning is also not a problem; there's scikit-learn, there's MLlib in Spark, there's TensorFlow, there's SVMlib, and so on.
And there are more every day. But it is hard to maintain these systems, it is hard to install them, and it is hard to make them widely available to a large team. We know how to test our models, or we can figure that out, but we don't have a good place to store the results or track them over time. Tuning can also be hard, and it's especially hard if it isn't something you do from the very beginning or aren't thinking about at the outset. And it's not really that the models don't have value; you'll see executives talk about this, or businesses say they're not realizing any return on their data science teams. It's that it is too hard to get the models into production. So these aren't fundamental problems. The problems are really driven by a lack of tools to help facilitate getting this stuff done.

So what would you need? On the infrastructure side, it needs to be really easy to do things like manage clusters, install packages, and provide them to teams. On the workflow side, pipelines are complex and they keep getting more complex, so you need good tools for implementing workflows, scheduling jobs, and so on. You need some place to do experiment management; Excel spreadsheets and notebooks, whether written down or actually in code, aren't enough. And then you need to identify some kind of serving layer and actually get the models there. That's made more difficult by the fact that the different frameworks all use different serialization formats, but we're making progress in that area as well.

So while I'll be talking about MLflow and LaVar, there are a variety of tools coming online in this area, and a lot of them target different needs. We have things like Kubeflow; that's going to be a lot more focused on deployment and serving up everything from the notebooks to the models themselves. From Facebook we have FBLearner Flow. That covers a lot end to end. It tends to be more opinionated in terms of data schemas, but it also gives you things like feature stores and feature reuse, so your teams can just get access to the same data and reuse the same features on the same datasets. We have Michelangelo from Uber. It's also pretty much an end-to-end tool, but there are constraints around it, because it's going to make assumptions about how your programs are structured and so on. I'll be talking more about MLflow; a lot of you have already seen a few things about it, so I'll skip the introduction. And then LaVar is a tool specifically for tracking evaluation datasets and datasets containing the output, or predictions, from experiments.

Okay, so if we look at the life cycle, then, this is slightly simplified, but what we really want to identify here is that there are lots and lots of tools on the data prep side, lots and lots on the training side, and the same for deployment and for data ingest. Every one of these steps needs something to tie it all together, and that's what a lot of these tools are trying to attack.

So MLflow is an open-source project that is really trying to address three big problems. It defines a specification in which you can organize projects and multi-step pipelines so that you can run and re-run the same experiments, potentially with different parameters.
It provides a tracking server, which gives you a UI for looking at the runs of experiments, digging into them, comparing results, and keeping that data over time. And it provides a way to serialize models, along with integrations with a variety of serving-layer infrastructure, so that it's easy to get those models into production as well. I'm going to look at the first two parts.

In MLflow, when you look at the UI, it looks something like this. You have this nice big table of the runs of your experiments. In MLflow, everything's organized around a run, which is an execution of some piece of code or a workflow; think of that as an experiment. There are parameters, which are the inputs: things like learning rates or the number of features. You have your metrics; that's typically the output of your evaluation. Artifacts: you can store arbitrary files along with the run, and this could be anything from a log to pre-featurized data. And then actual links to the source in your version control as it was when the run executed.

So that's one view. How did we actually get to that? This is where I'm going to walk through and break down an MLflow project. The project consists of a handful of files. There's an MLproject file; that's the specification of what's going to happen, and I'll explain it more in a second. In addition to that, there's a conda environment definition as well as several entry points, and these are the steps in a multi-step pipeline. So if you think of the process as data acquisition, data prep, featurization, training, evaluation, you can break things down in different ways. In the MLproject spec, you specify what the conda environment file is, and then really just jump into defining entry points. In this particular one, the process I'm going through is: download some data, do a train-test split of that data, extract the features and save them so I can look at or reuse them later, perform a training run, and then there's always a main entry point which ties everything together and gives you the overall run.

The conda environment file has nothing MLflow-specific about it, but we can specify the name of the environment and where we're going to get the packages, and this makes it really easy to persist and keep track of the environment that was used. There are a number of nice things about that. It could simply be to go back and re-run something, or see how it ran if you needed to, but you could also view it as having advantages for compliance purposes and so on, because you could, in principle, archive those environments.

Okay, for the main entry point, I'm just going to draw your attention to a handful of things. This entry point really just takes a list of other entry points and runs them. The things I have circled in red here are the things specific to tracking in MLflow. We start a run; that's what will show up in the UI and what we can dig into later. And then, as part of that run, I log a metric. If you go down several lines, I'm logging accuracy for this experiment, and I'm bubbling that up from a sub-task in this overall project.
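To make that concrete, here's a minimal sketch of what a driver like that main entry point could look like using the standard MLflow Python API. The entry-point names, parameter names, and the metric key are illustrative assumptions on my part, not the exact contents of the project on the slide, and a real pipeline would also pass artifact locations between steps.

```python
import mlflow
from mlflow.tracking import MlflowClient


def workflow():
    # Top-level run: the sub-steps launched below each show up as their
    # own run in the tracking UI.
    with mlflow.start_run():
        # Each call executes another entry point declared in the MLproject file.
        mlflow.run(".", "download_data")
        mlflow.run(".", "train_test_split", parameters={"test_ratio": 0.2})
        mlflow.run(".", "featurize",
                   parameters={"max_features": 5000, "ngram_max": 2})
        train_step = mlflow.run(".", "train")

        # Bubble the accuracy metric from the training sub-run up to the
        # top-level run, so it is visible at a glance in the runs table.
        accuracy = MlflowClient().get_run(train_step.run_id).data.metrics["accuracy"]
        mlflow.log_metric("accuracy", accuracy)


if __name__ == "__main__":
    workflow()
```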
So the first thing I do when I'm actually running it is just download some data and do a little data prep on it. I'm downloading some movie review data, and I'm going to look at its polarity; that's the experiment I'm going to be running. So I download that file, and then I combine all of the individual files into single files to make them easier to work with.

One of the next entry points, as I mentioned, does the train-test split on the data. Really what I'm going to do is take those files that I previously downloaded, do an 80-20 split, and then write them back out. You'll see down at the bottom that I'm logging some metrics to go along with this step, in particular how big the test and train sets were, so that when we go back into the MLflow UI we can actually see it.

Next I'm going to featurize the data. This is the first time we actually see a parameter being logged; when we run our experiment, there are going to be different parameters associated with it. For this featurization of the movie review data, I need to specify how many features there will be and, basically, how to combine specific words in that text into n-grams: am I going to use just a single word, am I going to use pairs of words, and so on? This is also the first place we see an artifact being logged: once I've featurized this data and saved it, I log those files as artifacts so I can go back and look at them, reuse them, or re-run experiments on exactly the same artifacts I had in this particular run. And then finally I train, and you see another instance of logging some parameters along with the metrics associated with the run. So when we go into the UI, we can dig into all of this stuff that we've actually logged and used.

What do we do from there? We just run it. If we run it in the directory where we've set up the project, it goes through the MLproject file, completes that run, saves all the parameters, metrics, and artifacts I specified, and then we get everything nicely in the UI. So let me drill into what all of that is.

One cool thing is that workflows in MLflow are nested. I have this multi-step pipeline and a top-level run, but I can also see the individual steps. Metrics, artifacts, and parameters are all associated with individual steps, but you can also bring them up to the top, so I can look at each stage here and what happened. In addition to all of that, things like the actual command that was run, the command definition, and how long it took to run are saved as well. If I log image or text artifacts, those can actually be viewed in the UI. And we can even compare runs, and not just run A to run B; we can do multi-run comparisons to see how, or whether, things are getting better and how each one performed, across arbitrary metrics and parameters. On the upper left-hand side you can see I'm comparing six of those runs with different parameter sets. On the lower right-hand side, we can see a chart of how the result changed as the number of features changed in my training.
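As a rough sketch of what one of those sub-steps might look like, here is a featurization entry point that logs a parameter, artifacts, and a metric, assuming scikit-learn for the featurization itself. The file names, parameter names, and data-loading details are illustrative, not taken from the actual project.

```python
import argparse
import pickle

import mlflow
from sklearn.feature_extraction.text import TfidfVectorizer


def featurize(train_path, test_path, max_features, ngram_max):
    with mlflow.start_run():
        # Record the inputs to this step so they show up as parameters in the UI.
        mlflow.log_param("max_features", max_features)
        mlflow.log_param("ngram_max", ngram_max)

        # Hypothetical data layout: one review per line.
        with open(train_path) as f:
            train_docs = f.readlines()
        with open(test_path) as f:
            test_docs = f.readlines()

        vectorizer = TfidfVectorizer(max_features=max_features,
                                      ngram_range=(1, ngram_max))
        train_features = vectorizer.fit_transform(train_docs)
        test_features = vectorizer.transform(test_docs)

        # Save the featurized data and log it as artifacts so later runs
        # can reuse exactly the same features.
        with open("train_features.pkl", "wb") as f:
            pickle.dump(train_features, f)
        with open("test_features.pkl", "wb") as f:
            pickle.dump(test_features, f)
        mlflow.log_artifact("train_features.pkl")
        mlflow.log_artifact("test_features.pkl")

        mlflow.log_metric("vocabulary_size", len(vectorizer.vocabulary_))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", default="train.txt")
    parser.add_argument("--test-path", default="test.txt")
    parser.add_argument("--max-features", type=int, default=5000)
    parser.add_argument("--ngram-max", type=int, default=2)
    args = parser.parse_args()
    featurize(args.train_path, args.test_path, args.max_features, args.ngram_max)
```

From there, the whole pipeline is kicked off with `mlflow run .` in the project directory, and `mlflow ui` brings up the tracking interface shown on the slides.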
There's another perspective on tracking, though. In MLflow, what we did is track the metrics; we didn't actually save the dataset itself in our tracking server. We could, but it would just be an artifact: it's not especially queryable, and it's not something you're going to do comparisons on. But imagine if we tracked the test and evaluation datasets themselves, and you had a system that was then responsible for calculating the metrics. What would, or could, that mean?

That's what we tried to do when we were attacking things in LaVar, which is a training and evaluation database. This is fundamentally different from something like MLflow, which is a mixture of project specification, tracking on the basis of things you explicitly say you're going to track, and model serialization and deployment; LaVar is entirely focused on training and evaluation. You can save your evaluation and test sets, you can save the actual results from individual runs, meaning the predictions, the actual output of your training, and then it handles all the metrics calculation for you. It supports both regression and classification and gives you all the usual metrics you're used to working with. There's REST API access so you can integrate it with other tools, so this is something you can use alongside something like MLflow to have all of your evaluation datasets stored in a structured way and available for the future. And there's also a command line interface with which you can do simple things like pull down a dataset, see what the metrics were, and so on.

Okay, so what does that give you? You might think, well, everything we had in MLflow already gave me the metrics and what I was interested in tracking, but there are a few things it doesn't do yet. You can save artifacts, but it doesn't really do data governance around them in the sense of making them introspectable. In LaVar you can actually look at the evaluation items and the predictions you got for them. You also have the ability, as you march forward, to change and extend those evaluations: if I add a new evaluation metric to LaVar, I get it for free across everything already stored, which is kind of cool, because I don't have to know in advance what I want to track or what I want to evaluate. We've also totally decoupled the training, and really the metric calculation, from the testing phase, so I can iterate on one or the other completely independently. It also makes it a lot easier to do error analysis or error remediation. If somebody came to me and said, here's a specific example that's appearing in our app, I got this answer for it, why did I get this answer? You can go back in and look at how it was answered over time. So if it changed from one model to the next, you can identify that for specific data points, and you can either correct those as you go forward or maybe remove them from the test set and so on. Not the training set.

So, one last thing, and I'm pretty passionate about this piece: why should these tools be open source? You might just care about getting the functionality they provide, but this is a massive ecosystem of tools. There are probably at least a dozen things on this page alone, there will probably be a dozen more next year, and it's constantly changing. With an ecosystem of that size, making sure we have a variety of integrations is absolutely essential. In general, something purely proprietary or purely vendor specific isn't going to be able to keep up with that pace of change, and might not even be incentivized to integrate with every single tool.
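To illustrate the general idea of an evaluation database (this is purely hypothetical and is not LaVar's actual API; the endpoints and field names are invented for the sketch), the workflow would roughly be: register the evaluation set once, post each run's raw predictions, and ask the server for metrics, which can then be extended retroactively because the raw predictions are what gets persisted.

```python
# Purely illustrative sketch of the evaluation-database idea.
# The base URL, endpoint paths, and payload fields below are hypothetical.
import requests

BASE = "http://eval-db.example.com/api"  # hypothetical server

# 1. Register an evaluation set once (inputs plus gold labels).
eval_items = [
    {"id": "rev-001", "text": "A wonderful, moving film.", "label": "positive"},
    {"id": "rev-002", "text": "Two hours I will never get back.", "label": "negative"},
]
requests.post(f"{BASE}/eval-sets",
              json={"name": "movie-polarity-test", "items": eval_items})

# 2. After each training run, store that run's raw predictions.
predictions = [
    {"id": "rev-001", "predicted": "positive"},
    {"id": "rev-002", "predicted": "positive"},  # an error we can inspect later
]
requests.post(f"{BASE}/eval-sets/movie-polarity-test/runs",
              json={"run_id": "run-42", "predictions": predictions})

# 3. Ask the server for metrics; metrics added later apply to every stored
#    run, because the raw predictions are what is persisted.
metrics = requests.get(
    f"{BASE}/eval-sets/movie-polarity-test/runs/run-42/metrics").json()
print(metrics)
```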
So those integration challenges just aren't always going to appeal to a single organization. Open source communities are uniquely equipped to incorporate support that isn't already in a project. For example, if you build something in-house and want it integrated with MLflow, with a completely proprietary tool you may or may not be able to do that; with something open source, of course, you can build on top of it and get what you want out of it. Or if tomorrow there's a new deep learning tool that everybody starts using, somebody can go and build that into one of these projects and actually have that integration, without the integration having to depend on the incentives of the original project.

The other big thing is that it's really important to be runtime agnostic. I think it's completely unrealistic that a productive machine learning, data science, or artificial intelligence team would lock themselves into a small number of tools. They're not going to be as productive, they're not going to get as good results, and it will always hold them back if they're forced to use just one tool. So it's important to have things that are runtime agnostic, so that you can actually stitch everything together. And the last thing is, arguably, that this kind of approach is more business friendly, in the sense that it can address more concerns as things march forward and as the technology ecosystem evolves.

Does anybody have any questions? Thank you.