We have our first guest. She's a little bit far away from here. She's Brooke Wenig. Hi Brooke, how are you doing? Hi, good afternoon. Good morning, everyone. Hi Brooke, where are you? Because you are like 5,000 miles from here, something like that. I'm normally based out of San Francisco, but I actually relocated to Toronto about a month ago. Okay, not bad, not bad. Well, we'll have the opportunity to ask Brooke questions after her presentation. So remember, you can send all your questions through the platform. Questions in any language are welcome, and I will translate them for Brooke. So Brooke, it's your time, and we are expecting your presentation about managing chaos. I think this is always useful, but especially in times like these, with so much chaos everywhere. So, Brooke, thank you for being with us. The time is yours.

Perfect, thank you so much for the introduction. Can you see my screen? Yes, we can see you, and now your screen. Yes. Perfect. So I wanted to start off just by saying thank you to the organizers for inviting me to present today. My talk will be focused on managing chaos: how to do reproducible machine learning on Databricks.

So I just want to do a quick overview of who I am. My name is Brooke Wenig. I lead the machine learning practice team at Databricks. I've been working with Databricks for almost five years now. I'm also the co-author of Learning Spark, 2nd Edition. It's the new edition, focused on Spark 3.0, and I have included a link to the free PDF version, if any of you are interested in checking that out. And fun fact, I'm fluent in Chinese. I lived in China, and I have an undergrad degree in Chinese, so if any of you want to ask questions in Chinese, you're welcome to do so as well.

So that's enough about me. Now let's talk about the outline for the talk. I want to start off with the motivation for why reproducibility is important, more so now than ever, especially with COVID going on: why we need reproducibility in any of our scientific endeavors, from machine learning to healthcare. Then I want to do a quick overview of Databricks for folks who aren't familiar with it, and two key open source projects that we contribute to: Delta Lake for data quality and consistency, and MLflow to reproduce and productionize your models. Because you need to be able to reproduce both the data and the models to have reproducible machine learning. Then I'll give you a demo of how both of these open source projects work together on Databricks, and then a recap of the talk.

So let's get started with the motivation for reproducibility. If you had asked me four weeks ago what I was planning to present on today, I would have told you I was planning to present on new features in Spark 3.0. And then I saw this article show up in my email about Google Health claiming they can outperform human radiologists at detecting breast cancer. And that got me thinking about reproducibility. You might be wondering why. The article about Google Health claiming they'd beat human radiologists at detecting breast cancer failed to disclose the model. So they did not provide the model for others to reproduce. They also failed to disclose the data and the hyperparameters they used to train this model. Even though they built their model on some open source data, they augmented it with additional private and proprietary data that no other researchers have access to. So you can see the first problem here.
They don't have the data available for others to reproduce. They also failed to disclose the commonly tuned hyperparameters for their model, such as the number of epochs they trained the model for, which optimizer they used, the learning rate, etc. And so I saw a critique of this paper, which was published in Nature, and one of the people said, it's like posting a photo of a cake. Sure, it looks beautiful, but if you can't taste the cake yourself, and if you don't have the recipe, how can you reproduce it, and how can you trust it?

And so a lot of people have talked about explainable machine learning being very important. Paco mentioned that in his previous talk, and explainability was my area of research in grad school as well. But now more and more, reproducibility is the pivotal part of any type of machine learning work, because whether or not my results are correct, and whether or not I violated some assumptions about the distribution of my data for that given model, if I can't reproduce the data, or if I can't reproduce the model, I can't trust anything about it anymore. And so you can see there are some other key tenets of machine learning: explainability, regulation, integrity. Reproducibility is so key these days.

And so now I want to walk you through what I consider to be the key aspects of reproducible machine learning. So what is needed for reproducibility? First and foremost, you need the data. If you don't have access to the underlying data, you can't actually reproduce the experiment, because whenever we're working with data, we don't have access to the entire population. If we had access to the population, we would know everything about it, and why would we need to model it? But often we have access just to a given sample, and there can be bias in this sample. For example, if you're polling people just in San Francisco for who you think is going to be the next president, it's going to look very different than in Cleveland, Ohio. And so you need to have access to this underlying data to reproduce your models.

The next thing you need is some type of documentation about the data. From working with customers, this is one of the key areas that I find our customers struggle with: providing documentation, especially across teams. And the reason why this is so important is that data engineers will often do some pre-processing for the data scientists, and sometimes they think they're helping out the data scientists. For example, they're dropping all of the missing values in the dataset, or they're imputing them with the mean. They think they're doing something helpful. So the data scientists go and build their model off of this dataset, then they put it into production. And in production, it starts throwing all these weird errors, or it starts making predictions that seem somewhat bogus, and they're scratching their heads trying to understand what went wrong. Well, it turns out they were seeing some missing values in the production data that they weren't seeing in their training data. And so this is just one additional reason why we need to document our processes. And taking it one step further, there's this great paper called Datasheets for Datasets by Timnit Gebru and a few other researchers, saying we need to be documenting our data: how it was obtained, any limitations of the data, what it should be used for, and also what the data actually contains.
You'd be surprised by the number of projects we engage in with customers where I'm looking at their data and two different people give me two different explanations of the same column. I might see a column called miles. And someone might say, oh, that's the number of miles the car drove in a day. Someone else might say, it's the number of miles the car drove in a month. Nobody really knew what it meant, or even what their definition of a day was: is it 24 hours? We need to have consistent definitions for our data and for our features, to be able to reproduce this and to hand it off to other people.

And so going back to that Google Health example, they didn't actually provide all of the information needed to document or reproduce their results. What they did provide, though, was the set of libraries that they used. So they included the frameworks and the libraries. But taking it one step further, you also need to include the specific versions: whether you're using scikit-learn 0.23 or 0.19, because there are different default parameters and different behaviors in the algorithms across these versions. Also, you should be specifying the language that you're using. If I tell you I built an XGBoost model, did I build that in Java? Did I build that in Python? Did I build that in R? They actually have different default parameters as well, and different implementations. And so we need to be mindful of this.

Next, hyperparameters. You spend a bunch of time and compute on these. Paco talked about this in his talk as well, how generating one BERT model with a given set of hyperparameters cost the equivalent of about five cars' worth of emissions. So researchers clearly spend a lot of time developing these hyperparameters. You should share them with others so they can reproduce your experiments as well, rather than forcing them to do a hyperparameter search. This is even more true when you're dealing with non-convex functions, like deep learning. You're never going to get to the global optimal solution. And so if you do a hyperparameter search and I do a hyperparameter search, even with the same ranges of values, it's possible we come up with very different hyperparameters in the end. So we need to provide the hyperparameters for reproducibility.

And lastly, the computing environment. You can get different results using CPU versus GPU, especially now that there are different floating-point precisions for GPUs. And then furthermore, if you're working across clouds, you develop something on AWS, you want to port it over to Azure, oops, they have different instance types, oops, they have different default libraries installed. This is all needed for reproducibility.
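Just to make that last point concrete, here is a small, purely illustrative sketch (not something from the talk or the demo) of capturing the language, library versions, and platform alongside your results, so whoever tries to reproduce them knows exactly what you ran:

```python
# A minimal sketch of recording your environment next to your results.
# Everything here is illustrative; record whichever libraries you actually use.
import json
import platform

import sklearn

environment = {
    "language": f"Python {platform.python_version()}",
    "sklearn": sklearn.__version__,   # e.g. 0.23 vs 0.19 have different defaults
    "platform": platform.platform(),  # OS / instance details matter across clouds
}

# Save it with the model artifacts so the run can be reproduced later.
with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```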
So now that I've hopefully given you a bit of the motivation for why reproducibility is important, and some of the key things that are needed for reproducibility, I want to talk about how you can do this on Databricks. So how can Databricks help solve this problem? To start off, Databricks provides a unified data analytics platform. What that means is it's a common platform that data engineers, data scientists, machine learning engineers, and data analysts can all collaborate on. So Databricks provides this proprietary collaborative workspace. However, it also contributes to two key open source projects to help other people with their reproducibility issues as well.

The first open source project is Delta Lake. With Delta Lake, we like to say it's Spark on ACID, because it provides ACID guarantees for your data lake. But what it effectively provides is data quality and consistency. So I can always roll back in time, reproduce a given version of my data, and ensure data quality. There's also the open source project MLflow, to help you reproduce and productionize your models. So taken together on the Databricks platform, you can address all of the reproducibility concerns I listed on the previous slide. Now, in case you all aren't familiar with these two open source projects, I want to give you a high-level overview of both of them, then a demo of how we can do this end to end.

So let's start off with Delta Lake, because any good machine learning starts with high-quality data. If you have lots of noisy, ill-formed data, you're not going to build a good machine learning model off of it. So we need to start with high-quality data that we can trust. And there are some data reliability challenges you have to first overcome. What happens if you have a failed production job? Let's say that your write fails partway through. What do you do with those partially committed files? Or what if your data evolves over time? For example, right now my data has 10 columns, but somebody adds some new features to my data. How do I either allow those columns to be added, or go back and retroactively add those columns to the existing data? And then there's just lack of consistency: how can you mix appends with reads, or batch with streaming solutions? These are all the data reliability challenges that Delta Lake solves.

Delta Lake is able to solve this by leveraging two key facets. One, all of the data is stored in Parquet files, so it's a very scalable storage format. Two, it uses a transaction log to manage all of the commits that have happened to the underlying data. This way, whenever you read from Delta Lake, it first checks the transaction log: what was the last thing that was committed? This guarantees that even if a write failed partway through, that transaction wasn't committed, and you don't accidentally read those underlying partially written files, because they were never committed to the transaction log.

And so with Delta Lake, you can unify both batch and streaming. You no longer have to have different setups for your batch jobs versus your streaming jobs. And it gives you these ACID transactions, because you get the guarantee that a write either fully succeeds or it does not, and you can verify that with the transaction log. It also provides schema enforcement: it will throw an error at you if your schema is incompatible with the existing schema, but it gives you the ability to explicitly evolve the schema with an optional command. And then one of my favorite features is that it's an open format. There's no lock-in with Delta Lake. You're not tied to some proprietary technology from one of the large cloud providers. You can run Delta Lake locally on your laptop, or on AWS, Azure, Databricks, wherever you want, and it has lots of different connectors as well.

And lastly, for data scientists, we all love the ability to time travel, or take data snapshots. The reason why this is so important is that if I build a model today with a given set of data, and that data source continually has a streaming job or an overnight batch job writing to it, then tomorrow I build a new model with a different set of hyperparameters and it performs better.
Did it perform better because I had additional data? Or did it perform better because of that hyperparameter combination? You don't know. And so the ability to time travel, to fix your data at a given snapshot without actually copying your data around and ending up with stale copies, the ability to just arbitrarily say, hey, I want to keep all the data as of November 1st and build all my models off of that, so we can compare apples to apples, is something truly unique to Delta Lake. Because with the transaction log, it will just go back to that earlier state in time. It's not actually copying your files around over and over again. So Delta Lake helps solve the issue of reproducible data.

It also comes with many different connectors. For the demo today, we'll be using Spark. Spark is one of the other open source projects that Databricks contributes to, but you can see you can access Delta from many other different sources. And if you want to see some of the syntax for how to do time travel, you just write your standard SQL code: SELECT count(*) FROM events TIMESTAMP AS OF a given timestamp. So you can say, please give me all my data as of a given time, or as of a specific version. And there are Python and Scala APIs as well.

So now you've seen how to get high-quality data. Now let's talk about how to reproduce our machine learning models using MLflow. If this were a live audience, I would love to ask all of you, please do a show of hands: how do you track data science experiments? Do you use a Google spreadsheet with your colleagues? Do you use an Excel file locally? Do you have a napkin next to your desk where you write down all of your results? There actually isn't an industry-standard technique for keeping track of data science experiments. And so we see this with our customers: sometimes customers would be saving models and they would do something like _alpha_2_final_final to keep track of all the different model names. And that just becomes very tedious. And especially if you're an outsider to that organization, you have no idea what their conventions are for saving their experiments. So that's one hurdle you have to tackle.

The second one is, how do you allow others to reproduce your experiments? So I've got my Jupyter notebook set up locally, I've got all of the different conda and pip libraries installed. How do I allow you to reproduce them? Do you need all of the libraries that I installed, which may or may not be relevant to this given application that I want you to reproduce? And then how do you manage your models across various frameworks, languages, and algorithms? For example, I love using Python. Some of my colleagues love using R. How can we all collaborate, talk about the same models, and compare them head to head? And then lastly, how do you version and deploy your models? Because oftentimes the most current model is not the one in production. You need to be able to iterate and prototype off of it. And if you accidentally release something to production and need to roll back, how do you go back to an earlier stage of that model?

So these are all of the issues that MLflow seeks to address. MLflow is an open source platform for the machine learning lifecycle. It's as simple as pip install mlflow. And MLflow has these four key components to address everything from the previous slide. The first one, and by far the most used, is MLflow tracking.
It allows you to record and query your experiments from your code: data, configuration, results, et cetera. So it allows you to keep track of everything, which parameters you used, what the resulting metrics were. And you can also log arbitrary artifacts, like matplotlib images, to the MLflow tracking server.

The next one is MLflow Projects. This allows you to package up your machine learning code and any of your dependencies to run it on other platforms. This way, I keep track of, oh, I just need scikit-learn and pandas. It keeps track of those two libraries and their versions, and packages them up so I can send the project to other colleagues. If they have MLflow installed, it will ensure that they have all the correct dependencies, and if not, it will install them for them, to be able to reproduce that project on any other platform.

Next is MLflow Models. It's a general format that standardizes deployment paths. This way, if I build a scikit-learn model today and I switch to TensorFlow tomorrow, the DevOps team doesn't need to know about this. They simply need to know it's mlflow.pyfunc.load_model, and it doesn't matter what type of model it is, because MLflow will tell it internally what type of model it is.

And then lastly, we have the MLflow Model Registry. It's centralized and collaborative model lifecycle management. This way, we can all be comparing the different models. We can see which models are in production, which ones are in staging, what the comments on them are, and we can roll back to earlier stages of a model if need be.

So now what I would like to do is a little bit more of a deep dive into each of these, so you can better internalize these components. I'm going to start off with MLflow tracking. With MLflow tracking, you start off by having some code in your notebook, local app, or a cloud job, and you write to the central tracking server. So here you can log your parameters and your metrics, and then you can allow others to query your results, either through the UI or through the REST API, and it's actually now a supported Spark data source as well. And so this is great, because now it's an industry-standard framework that's open source. So if you work at your company today, and you go to another company, or you're consulting for another company tomorrow, you know how to track their experiments, you know how to evaluate them. There aren't all these proprietary in-house tools that become very brittle, and that are very unique or bespoke to a given organization. Anybody can set up MLflow tracking. I personally use it with all my Kaggle experiments. I love using it for that reason.

And so this is a screenshot of the UI. We'll actually do a deep dive into that in a minute. But you can see here, I'm using MLflow locally, I am using it from Python, and I created a run called random forest. You can see here the git hash of the project, the parameters for my random forest, and a not-so-great R-squared, but at least I can register all of my runs here. And if I had multiple runs, I could search for them, I could download them, I could compare them side by side. So this is what the tracking UI looks like.

Now let's take a quick look at the Model Registry. The Model Registry is a central repository to be able to search and discover models across all of the different teams. And let's just jump to this next slide. If you want to see what it looks like, you can have different versions of a registered model created by various people, and you can see the various stages here. So you can see these two are now archived, this one's in production, and version five is now in staging, but it isn't quite ready to move into production yet.
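And everything you've just seen in these UIs is also queryable from code. As a rough sketch of what that looks like on the tracking side (the experiment name and metric column here are hypothetical placeholders, not values from the talk):

```python
import mlflow

# Look up an experiment by name and pull all of its runs back as a
# pandas DataFrame (the experiment name is a placeholder).
experiment = mlflow.get_experiment_by_name("/Shared/airbnb-price")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Parameters and metrics come back as columns, so comparing runs
# side by side is just an ordinary DataFrame operation.
print(runs[["run_id", "metrics.r2"]].head())
```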
And so now that you've seen how to track your runs using MLflow tracking, and how to register your models using the Model Registry, let's talk about how to package up your code to be able to run it on other platforms. With MLflow Projects, you define a project specification: what your code is, its configuration, data, and any dependencies you have. You can then launch it locally, just on your MacBook, or you can launch it on any remote execution environment you want. So for example, I can call mlflow run on a project and actually launch it on a Databricks cluster, but I'm launching it from my laptop. Here's an example of an MLflow project. They all have the same standard format: an MLproject file, a conda.yaml, and then the rest depends on your code. But you can see here I specify lambda, with a default parameter of 0.1. If I want to override that, I say mlflow run, I give it the git URL of my project, and I can pass in parameters, for example lambda equals 0.2. You can also invoke it in these other ways you see here as well.

So now I just want to take the time to do a quick demo of how all of these come into play on the Databricks platform. Just for the sake of time, I've pre-run this entire notebook, but I want to walk you through the code and through the MLflow UI. Here I have some data from Airbnb; they've open sourced their data through Inside Airbnb, so I grabbed a snapshot of that. And what I did is I wrote it to a Delta table. So here you can see I'm reading it in, using Spark. Delta is one of the supported data sources, so I just say spark.read.format("delta"). This would be the same as if I were reading from Parquet; I just swap parquet for delta. And I can tell it which version I want to read; here I want to read version 0. So I'm going to load in my training data and my test data that way. And you can see here the schema. I have lots of different fields, like whether the host is a superhost, what the cancellation policy is, et cetera. And the goal of my model today is to predict the price of the Airbnb listing based off of all of these features.

So here I'm going to go ahead and create a table using this Delta path, and I can look at its history. I can see right now there is just version 0: the timestamp from when I wrote to it this morning, the user ID, that the operation was a write with mode overwrite, and you can see which cluster ID was used to write it, the number of files, et cetera. And by default, Delta keeps a commit history of 30 days. Part of the reason why this is set to 30 days: one is GDPR, and the other is that your storage costs will go up if you keep every single copy of your data. So this is configurable based off of your business needs.

So now you've seen how to load in data from Delta. Let's see how to use MLflow now. To use MLflow, you simply import it. And if you want to save any models, you'll need to import those flavors here as well, so mlflow.spark, mlflow.sklearn, et cetera. Then mlflow.start_run will kick off a run. I'm telling it to log these parameters; these are the three parameters I want to log, and you can also optionally pass them in as a dictionary. Then here I want to save my model, and my metrics, and if I had any artifacts, I could save them here as well.
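Putting those pieces together, a rough sketch of that pattern in Python looks something like this; the path, parameter names, and values are placeholders rather than the exact demo code, and it assumes a SparkSession called spark and a Spark ML pipeline defined earlier in the notebook:

```python
import mlflow
import mlflow.spark

# Read a pinned snapshot of the training data from Delta (version 0),
# so the exact data behind this model can be reproduced later.
train_df = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/tmp/airbnb/delta"))  # hypothetical path

with mlflow.start_run(run_name="random-forest") as run:
    # Log the hyperparameters and the data version used for this run.
    mlflow.log_params({"num_trees": 100, "max_depth": 5, "data_version": 0})

    model = pipeline.fit(train_df)  # Spark ML pipeline from earlier cells

    # Log the fitted model and its evaluation metric to the tracking server.
    mlflow.spark.log_model(model, "model")
    mlflow.log_metric("r2", 0.22)  # placeholder value
```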
So you can see this took about a minute to train the model and save it to the MLflow tracking server. And what we can do is click on this beaker icon here. This is the first experiment that I ran, so let's take a look at that. This is what it looks like with the managed version of MLflow: Databricks sets up the tracking server, though this is something you could do on your own as well. You can see the source notebook, the status, who triggered the run, these are my parameters, these are my metrics, and then this is the model that I created. So I can see I created a Spark ML model with these stages. And if I look into the conda.yaml, I can see all of the different dependencies. So it automatically tracked, hey, I'm using Python and I'm using PySpark; those are my key dependencies here. And what I can do is I can always click Reproduce Run. You'll notice it will clone a copy of this notebook, and it will also attach it to the same cluster that I used to create this demo. And so this way we get a reproducible computing environment, reproducible code, and a reproducible model. So if I click confirm, this will automatically clone it and attach it to the cluster for me, and then I can run it top to bottom should I want to.

So that was a quick look into the MLflow UI for this first experiment. Let's take a look at the second experiment, and then we can view them side by side. When I was building this demo, one of my colleagues pointed out, hey, it's actually better if you predict the log of the price rather than the price itself, because it follows a log-normal distribution. So they went ahead and added a column called log_price to my Delta table, and they wanted to write it out to the source Delta table. But because the schemas now mismatch, we have to pass in the mergeSchema option. It's nice that Delta automatically throws this error of, hey, you're trying to do something that I'm not expecting. And if you pass this option in, you're explicitly saying, yes, we know we want to evolve the schema. So we're going to merge the schema here. And now if we call DESCRIBE HISTORY, we can see version 0 and version 1. And I can see the write here: the read version shows I had read in version 0 before I committed these additional files. So now if I want to load in this latest version, I specify version 1 and read the data as of that version.

And so now what I want to do is predict log price as the target, and run this with MLflow, rather than just predicting price. You can see it's practically the same code here, logging the same parameters, but with different values now. And so now let's go ahead and compare the two side by side. To do that, I'm going to click the beaker icon here and open up the central MLflow tracking experiment. I can see here I have these two runs, and I can see at a high level data version 1 versus 0 and my R-squared, but I want to see them side by side. So I can select these two and select compare. And I can see that overall, this log model is actually doing much better: it has a higher R-squared and a lower RMSE. So now I'm going to want to put this log model into the staging environment. If I click on this run ID, not only can I reproduce the run, but if I scroll down into the model section here, I can register this model. So I can select an existing model or create a new model; I'll call this one Big Things Conference. This will register a new model for me, which I can then request to move into staging.
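For reference, the same registration and stage change can also be done from code. A minimal sketch, assuming the run object from the training cell above; the registry name here is a placeholder for whatever you choose in the UI:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged under a given run in the Model Registry.
# (run comes from the earlier mlflow.start_run; the name is a placeholder.)
model_uri = f"runs:/{run.info.run_id}/model"
details = mlflow.register_model(model_uri, "big-things-conference")

# Once registration completes, move the new version into Staging,
# just like requesting the transition in the UI.
client = MlflowClient()
client.transition_model_version_stage(
    name="big-things-conference",
    version=details.version,
    stage="Staging",
)
```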
Back in the UI, this registration process takes a few seconds, so just bear with me. You can see here registration is pending. Once the registration completes, I can then request to move it into staging and have one of my colleagues look at it and give the okay. All right, just for the sake of time, I'll go back into the slides here.

So I just wanted to provide a quick recap of everything that I presented. I know there's a lot of information, but the key thing that I want to leave you all with is that reproducibility is not the same thing as correctness. Reproducibility ensures transparency and helps gain confidence; it still does not mean that your results are correct. For example, if your data violates the assumptions of a linear regression model, it doesn't matter that you built a model that's reproducible, because it's not correct. However, if you build a model that's correct but not reproducible, then nobody will actually trust those results. So it's more important to have a reproducible, trustworthy model than one that's 100% correct. There's this great phrase that says, all models are wrong, but some models are useful. You want a useful model that is fully reproducible. That's the key goal.

And so we can use the open source project Delta Lake to reproduce our data, and we can use MLflow to help reproduce our models, share them, and have a standard format so other people can compare these different models. And then we can leverage the power of Databricks to reproduce our compute environment, both on AWS and on Azure, and it has managed and optimized versions of Delta and MLflow. The key thing to take away is: if you can't reproduce your own results, how will someone else?

And just as a quick shout-out, our Data + AI Summit is actually going on this week as well. All of the talks are recorded, so I don't want you to leave this conference to watch them, but they are recorded, so you can attend as you like, and there are going to be lots of great speakers there as well. And so with that, I want to say thank you very much, and now I want to take time for questions.

So, thank you, Brooke. Thank you so much. We're waiting for some questions; I don't know if there are any, since there's still some time. Well, you know, people here are a little bit shy, so not that many questions. Okay, so maybe everything is clear, or maybe nothing is clear, because there are no questions. Or they fell asleep, one of the two. Yeah, yeah, yeah. I promise you, it's not siesta time anymore, so it's not that; of course, people are watching your presentation. So it seems there are no questions, but your contact info is everywhere, so maybe people will drop you a line after this. So thanks so much, Brooke. Of course, thank you so much for inviting me. Bye, thank you.