Okay, great. Yeah. So I'm excited to be here today to talk about a new open source project that we've launched at Databricks to accelerate the machine learning lifecycle. As you saw in Professor Rodi's talk, machine learning has huge potential to transform basically everything we do, from our day-to-day interactions with technology to going to space and so on. And it's obviously a very exciting time to be working on that. But at the same time, there are a lot of challenges with turning that into reality, even into working computer applications, let alone a working spaceship. And this is what I want to talk about today.

So first, just a little bit about Databricks and my background. I've been part of Databricks since we started in 2013. It was founded by the original creators of Apache Spark, the team of researchers that started the project at UC Berkeley. And we provide a unified analytics environment in the cloud for everything from data science and engineering to machine learning. Our customers work across this whole set of technologies and problems, and we get to see what the problems are. We have customers in a wide variety of industries, including a lot of customers now in Europe. And working across these customers, we see some of the common trends and problems they have with using machine learning and big data in production.

So the main thing that we see, and I'm sure everyone else who has started doing this sees the same, is that machine learning development is extremely complex today. It's more complex than traditional software development because of specific challenges that come up in machine learning: its probabilistic nature, problems such as overfitting, the many parameters, and so on. To illustrate some of that, I'm going to show a picture of the machine learning lifecycle and explain some of the issues that come up.

So here is a simplified machine learning lifecycle. First of all, machine learning begins with data; by definition, machine learning is computer software that learns from and improves with data. And of course, machine learning works better the more data you have, and the more diverse and high quality it is for the problem at hand. So that's the first thing you need. The second thing you need is data preparation. That's where a lot of big data technologies like Apache Spark come in, to actually get the data into a shape usable by various algorithms. Then you've got, of course, the machine learning training. And finally, you have deployment, where you actually have this run in production and monitor it and maintain it. And then, if the machine learning application is doing anything important, you probably want to collect more data about how it's doing in deployment and feed that back in, as we saw, for example, with all the measurement systems on the spacecraft.

So what are some of the challenges? The first challenge is that there are just a lot of tools involved at every stage of this process. For example, data can be in a huge range of storage systems that are available today. And again, in machine learning you usually do better with more data, so you want to combine all these sources into your application. For data preparation, we've got a huge range of tools and algorithms: things like Apache Spark, or SQL, or Python, or libraries such as scikit-learn and pandas.
For training, there's an even larger array of algorithms and libraries, anything from linear models to decision trees to deep learning, and many different implementations of each one. And finally, for deployment, there's also a huge range of tools and environments where you want to deploy your model.

Now, in traditional software development, you've got a lot of tools to choose from at each stage of the lifecycle, but you can just choose one. For example, for your web application, you choose one database, one web framework, one JavaScript framework, and you're done. But the problem in machine learning is that your goal is to produce the best result possible. So your job is actually to try out all the algorithms and keep improving the model to get the best result. You can't just say we'll be using linear models and that's it, and we'll be using one data preparation step. So unlike traditional software development, you have to be able to run and interoperate with all these tools. That's the first problem.

The second problem is that some of these tools have an additional dimension, which is tuning: they have many configurable parameters, or hyperparameters. And this adds a new dimension to your lifecycle, where you have to figure out, well, when I ran with this processing step, what did I set these tuning parameters to? That's very difficult to keep track of, and it needs to be a first-class concept in your workflow. Then there's scale: all of these steps need to operate at high volumes of data, or they will operate at high volumes of data in production, so all of them have to be designed to scale. Then there's model exchange, or code exchange, across these tools. How do you get some data preparation code that's written in Python and pandas to then be deployed in production when you've got an iPhone app or something like that? Passing this code and these models between teams is the kind of challenge you need to deal with. And the final issue is governance. Anytime you have a large company, or a company in a regulated space, you want to be able to keep track of what you did, make sure you're not using features that you're prohibited from using, and so on. So these are all challenges that you have to deal with in machine learning that maybe don't come up as much in traditional software development.

Here are some examples of this from a couple of customers that talk to us. This is the chief scientist at an ad tech firm. It's a smaller company, but this person runs a data science team of a few data scientists there. And they say: I build hundreds of models per day to lift revenue. The whole goal of the team is to increase the company's revenue by better predicting what people will click on. And they'll use any machine learning library: MLlib, PyTorch, R, etc. So they're constantly trying new things that are coming out, new algorithms, to see whether they help with the problem. But unfortunately, there's no easy way to see what data went into a model from a week ago and rebuild it. So they have this whole problem of keeping track of what they did as they're using this wide range of tools.

And similar problems happen in much larger companies as well. This is a large consumer electronics firm, and they say: our company has 100 teams doing machine learning worldwide, and we can't share work across the teams.
When a new team tries to run some code that was written by someone else, it often doesn't even give the same result. And they have lots of horror stories where experts in one country try to look at a model built in a different country and can't even get it running, let alone actually use their expertise to make it better.

Okay, so what are people doing about this problem? One trend that's really begun in the industry, and it's happening in companies of all scales, is the idea of machine learning platforms. Some of the best known ones are internal software systems at large web companies, who have written about them and explained how they use them. Some examples are Facebook's FBLearner, Uber's Michelangelo, and Google's TFX, but many other companies are building platforms as well. And the idea of these ML platforms is: let's put some structure, some APIs, around the ML development process so that we can make certain guarantees about it. Like, we can make sure that we can deploy the same job in production, or we know how to move the model onto a new platform, or something like that.

So the benefit of these platforms is that they standardize the machine learning development lifecycle: data preparation, training, and deployment. Basically, as long as you work within the APIs that the platform's designers built, the engineering team behind it makes sure that your code is deployable in production, and can be monitored, and can be retrained and so on, in a standard way. So you don't need to reinvent that each time for every new ML application. And this has been a very successful model in these companies; there are basically dozens of applications using these internally, which is great.

But there are also some limitations. One of the main limitations we see talking to users of these systems is that they're limited to the algorithms and frameworks that the platform supports. There's often an engineering team that supports maybe 10 or 20 different algorithms, or maybe one specific framework and one specific software version. If you work within that, everything is great. But if you want to go and try something new, you're kind of on your own. And remember, as I mentioned at the beginning, when your job is to improve some metric in the company, where every percent of improvement leads to millions of dollars of revenue, you want to be able to try everything that's available, every new technique, and see how it helps. And the other challenge is that each platform is tied to each company's internal infrastructure, so it's very hard to move things around, even to change infrastructure internally.

So we looked at this problem at Databricks, and starting at the beginning of this year, we asked ourselves: can we provide some of the same benefits of these platforms, but in an open manner? Open source, and also open in that you can bring whatever technology you want into the system and have it work by default, as opposed to being limited to whatever the system decided to support. And that's what we're doing in this new project called MLflow. It's an open source machine learning platform, and it's also designed to be open in terms of the interface, in terms of what can run in it. So it's designed to work with any machine learning library and language; it's not limited to a few things. And it's designed to run the same way everywhere. In particular, we are a cloud company running in the public cloud,
so we want to make sure things run the same in every public cloud, so that you're not dependent on one provider. And it's also designed, of course, to scale to big data and to integrate well with Apache Spark, because that's something we know how to do well.

We launched the project in June, so it's still a pretty new project, but it's actually grown quite a bit as an open source community. We already have over 50 contributors to the project, many new features that were built both by us and by community members, and at least a few dozen distinct companies contributing code. So it's exciting to see that people are interested in this idea and to see what's happening with it.

So I'll talk a little bit about how MLflow is designed, and then I'll go into some of the details. We actually have a more hands-on talk on MLflow tomorrow at the conference as well. There are two elements of MLflow's design philosophy that also make it a little different from some of the company-specific ML platforms. The first is that it's an API-first, or what we call open interface, platform. We want to make sure you can submit runs of your code, models, experiments, and so on from any machine learning library and language, so that by default MLflow just works with whatever technology you choose, as opposed to you being blocked on us to make something available. One really simple example of this is to think about deploying a model. What is a model? There are a lot of formats for models, like ONNX and PMML and so on, that try to capture all the weights and internals, but in MLflow we take a simpler view: a model is basically anything that takes input and produces output, so any library can plug in.

MLflow has three components. The first is MLflow Tracking, and it's a very simple concept: let's log what we're doing and what came out, keep track of it in a database, and then be able to see and compare the results, and use that to improve your model as well. So that's the first part: MLflow Tracking is just this API.

The second part is MLflow Projects. This is a way to make it so that someone else can run your code again. And this is, again, a very simple design: it's a little spec file you can include with your project in your Git repository that says how to run it, and it can declare an environment and dependencies. So if you develop your project with this type of spec, you can then push it, and anyone can run it again later; one of the great things about this is that different people, on different machines, can be running it later. To run a project, you just need to type mlflow run and pass in a URL for the project, and it will just be able to execute.

So what does a project look like? Very briefly, your project is just a directory with some code, maybe data files, whatever you want to put in it, and it can live in a Git repository. And you have this little descriptor file that describes an environment, such as required Python libraries, and also describes parameters to the project and how to call it. So anyone can call it just by passing these parameters, and it's sort of self-documenting how to run this on new data or how to change the parameters. You can run this either from the command line or using an API, to have a script that sets up a multi-step workflow. A couple of minimal sketches of these first two components follow below.
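To make the Tracking piece concrete, here is a minimal sketch using MLflow's Python tracking API; the scikit-learn training setup is just an illustration:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    alpha = 0.5
    mlflow.log_param("alpha", alpha)          # the hyperparameter used for this run
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    mlflow.log_metric("rmse", rmse)           # the result, so runs can be compared later
    mlflow.sklearn.log_model(model, "model")  # the trained model, saved as an artifact
```

And here is roughly what the Projects spec file, called MLproject, looks like; the entry point, parameter names, and file names here are hypothetical:

```yaml
# MLproject file at the root of the Git repository
name: my_training_project

conda_env: conda.yaml          # declares the environment: Python libraries and versions

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      data_path: {type: string, default: "data.csv"}
    command: "python train.py --alpha {alpha} --data {data_path}"
```

Anyone can then reproduce a run with something like mlflow run https://github.com/your-org/my-training-project -P alpha=0.4 (the URL is hypothetical), and MLflow sets up the declared environment before executing the entry point.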
And then the final component is MLflow Models. Again, it's a pretty simple concept: basically a way to package up models so that they can expose different APIs, which we call flavors, for doing inference with the model. So whatever library you're using for training, you can package the result into an MLflow model, and then we provide built-in tools where you can take any model and deploy it: as a REST server, for example on Docker or Kubernetes; as a batch or streaming job, for example in Apache Spark; or just as code that you can call from a programming language. And the design is that each model is just a folder with some arbitrary files that you saved, plus a descriptor file that says how to load it in different formats. In particular, one of the formats is just "load this as a Python function", so any tool that understands Python can apply it. Other formats are more library-specific. For example, this one you can also load as a TensorFlow graph, so if your back end is TensorFlow, you can actually load it back in, modify the graph, and do things with it. This simple descriptor format also captures when the model was made and things like that.

So what does this mean for deployment? Now, instead of the data scientist throwing lots of different frameworks over the wall at the engineer, the data scientist just says: hey, please deploy this MLflow model. And there's a standard interface for using it in various modes, so the production engineer can just plug it in and do either real-time or batch scoring with it. And likewise, if you want to run your training code in production, you just say: please run this MLflow project. The production engineer doesn't need to know what's inside it, or even what programming language it's in, because they have a standard interface for running it.

Okay, so that's a brief overview of MLflow. Since the project started, it's been actively developed, and we've got a lot of new features since our first release. Both we and people in the community have contributed model packaging interfaces for a lot of common libraries, so you don't have to write that descriptor file yourself: you can just call a function to save your model, and then all the MLflow deployment tools work with it. RStudio contributed an R API. We're really excited about that, because they're very good at building excellent R APIs, and it's great to see that in there. We've got a Java and Scala API in addition to R and Python. And we've also got pluggable storage backends for a wide variety of storage systems. And just a couple of days ago, we released the next version, MLflow 0.8, which brings more improvements, especially around the UI: visualizing multi-step workflows, a more compact table view, and also letting you take your model and deploy it to an Azure ML serving workspace, which is a new serving product on Azure based on Kubernetes. So basically, all your previous models can now be pushed to that as well.
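Before getting to use cases, here is a sketch of the descriptor file, called MLmodel, that sits inside a saved model folder. The values shown are illustrative, and the exact fields vary by library and MLflow version:

```yaml
# MLmodel descriptor inside the saved model folder (illustrative values)
utc_time_created: '2018-12-06 10:14:35'
flavors:
  python_function:          # generic flavor: any tool that understands Python can load it
    loader_module: mlflow.sklearn
  sklearn:                  # library-specific flavor: load it back as a scikit-learn object
    sklearn_version: 0.19.1
    pickled_model: model.pkl
```

The built-in deployment tools work off this descriptor. For example, in recent MLflow versions, running mlflow models serve -m runs:/<run-id>/model starts a local REST server for a logged model, and for batch scoring you can apply the same model as a Spark UDF. A sketch, with a hypothetical run ID, input file, and column names:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the logged model as a Spark UDF and score a DataFrame at scale
predict = mlflow.pyfunc.spark_udf(spark, "runs:/1a2b3c4d/model")
df = spark.read.parquet("events.parquet")  # hypothetical input data
scored = df.withColumn("prediction", predict("feature1", "feature2"))
```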
Some examples of use cases: MLflow is open source, so there are people using it through that channel, and we're also working with a number of Databricks customers to develop and try the system early on. As one example, a European energy company is using MLflow Tracking to build and monitor hundreds of machine learning models for entities in the energy grid. This is a case where they want to model every power plant, every renewable energy source, and every consumer of electricity as well, like a town or a factory. So that's hundreds of distinct machine learning problems, and they use MLflow to keep track of them and make sure that, for each specific task, they have the best model and that it doesn't regress.

An online marketplace working with us is using MLflow Projects to package and run reproducible deep learning experiments using Keras. So you develop the experiment on your laptop, and then you can kick off runs in parallel on GPU instances in the cloud, compare the results, and keep track of what went into each one.

And then an online retailer is using our model packaging format to package and deploy a recommendation model with custom logic. They found that it's not enough for them to just take an off-the-shelf recommendation model and serve it as a REST API; the data science team wants to write a lot of logic around it, with business rules to customize the recommendations that come out. And they like that, with MLflow, you can develop these together, package them together, and ship one artifact to the production team that captures both the weights of the model and that version of the business logic. So you don't have to split it up across many little applications. Okay, so these are just some examples.
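To illustrate that last pattern, packaging custom logic together with a model, here is a minimal sketch using MLflow's pyfunc flavor. The recommender weights file and the business rule shown are hypothetical:

```python
import pickle

import mlflow.pyfunc


class RecommenderWithRules(mlflow.pyfunc.PythonModel):
    """Wraps a trained recommender and applies business rules to its output."""

    def load_context(self, context):
        # Load the underlying model weights packaged alongside this wrapper
        with open(context.artifacts["recommender"], "rb") as f:
            self.recommender = pickle.load(f)

    def predict(self, context, model_input):
        recommendations = self.recommender.predict(model_input)
        # Hypothetical business rule: never surface out-of-stock items
        return [r for r in recommendations if not r["out_of_stock"]]


# One artifact captures both the model weights and this version of the rules
mlflow.pyfunc.save_model(
    path="recommender_with_rules",
    python_model=RecommenderWithRules(),
    artifacts={"recommender": "recommender.pkl"},  # hypothetical weights file
)
```

The production team can then serve this folder like any other MLflow model, and the rules travel with the weights as one versioned artifact.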
So anyway, to summarize: machine learning platforms, we think, are a very important new part of the data science and big data toolkit, and if designed well, they can simplify ML development for both data scientists and engineers. That's a key thing that we've noticed: if you've got the right abstractions, the data scientist is more productive just working by themselves, and they're excited to use it, and at the same time, it makes it easier to deploy. Whereas today, in a lot of organizations, these two groups are at odds, where the data engineers tell you, well, you can't use this, you can't use that, because otherwise we won't deploy your model.

MLflow is easy to install: you can just install it through pip and start using it in Python. We also have pretty extensive documentation and tutorials on the website. And if you're interested in seeing more about it, including a hands-on walkthrough of using it with deep learning, Jules Damji from Databricks is presenting that tomorrow as well. So thanks a lot, and I'm happy to answer any questions.

Hi, it's time for questions. Yes. So if there is any question, it's time to ask Matei. If not, I have many questions. Okay, yeah. You're all looking like: please, no, Jules, no questions. Yeah, I don't know, yeah. Okay, I actually can't see you guys. Unfortunately for you, if you're raising your hand, he can't see it; it's hard because of the lights, yeah. Yeah. Now the microphone has arrived, yes.

Thanks for the talk, Matei. I have a question about the integration between Spark MLlib and MLflow, because with some friends we tried it, and it's very nice. But what happens when these models are called from MLflow in Scala? Because I think the main API is in Python.

Oh, yeah, that's a good question. So I think to load the model in Scala right now, you'd have to do it manually from the serialized form of it. That's actually something we're working on, to just provide a Java and Scala API; that part is just not done yet.

You have it on the roadmap to integrate a native way to do it, right?

Yeah, we will, yes. A lot of these integrations are still in development, so that's definitely good feedback; I'll pass it back to the team, yeah. Thank you, thank you so much. Yep.

Until the next question arrives, I have one. Okay. What do you think the future of MLflow is going to be in the next few years?

Yeah, that's a great question. So actually, the pieces I showed today are just the beginning of what we want to build. We're also planning components of MLflow for monitoring models that are deployed, and also a component that's basically a model registry, so kind of a centralized repository where you keep track of the models inside your company. And we've got a lot of interesting ideas and feedback there. Our goal is really to build a complete platform that works together, and we've tried to design these pieces so they're just things you can add on to the current workflow. So even though the monitoring and so on aren't available yet, we've been thinking about them since the beginning, so that they will work with the existing workflows. The other thing we care a lot about in general at Databricks is making sure the platform is very solid. So we're going towards an MLflow 1.0 version, which will have a basically stable interface so you can build on it, and it keeps working the same way each time we upgrade it as well. That's probably something you'll see early next year, or towards the first half of next year. So, yeah, we're excited to keep building this out.

I have more questions. Hi. Another one. No, I'm here. OK. The other one. Right. Right. There you are. OK, great. Sorry, it's very hard to see. Thank you. It's really simple. OK. How many companies are using this tool at this moment? And how big are they?

Yeah, that's a good question. So the companies I talked about before, well, one of them was a smaller one, the retailer one I talked about, but they're large enterprise companies. And there was a European one and a US one that I talked about before. And we have a number of other customers as well that are using this, in life sciences and in finance and in other domains. In terms of the community, it's hard to tell; we actually want to run a survey to figure that out. But as I said, there are around 50 total contributors to the project, actually a little bit more now, and our team that works on it is six engineers plus myself. So we have... yeah, exactly: it's a lot of non-Databricks contributors. And usually, if someone submits a patch, it means they've at least tried it out to the point that they want to use it and make it better. So I don't have an exact number, but if you look at the contributors, they're from at least 20 different external companies.

This is a good moment to ask for more contributors, then.

Yeah, we'd love to get contributors. By the way, if you're interested in looking at it, it's very easy to contribute to. It's mostly Python, the server, all the server bits, and the front end is all React. So, actually, a lot of the features I showed, like the note taking and the scatter plot, were done by people outside of Databricks. They sent the patch, and they actually worked with our UI designer, just talking on GitHub about how to design it, and then they put it in. And that comes from the design of the platform, too: it has these very clean APIs and data model where you can just add this stuff on top. Yeah.

Great. OK. Any further questions? OK, there's another one. Yeah. Go ahead. Thank you for your talk.
Is there any plan for a .NET interface or API?

Yeah, that's a great question. Yeah, we'd love to have .NET eventually. We're not working on it right now, but I think it shouldn't be too much work to add, so it's definitely something we'd love to have. Yeah.

OK, thank you. I ask because my company is very latency-sensitive, so we can't use a REST API to call it online. So that's why I asked.

Got it. Got it. Thank you. OK, yeah, it makes sense.

OK. As you can see, he's really passionate, because all the questions are really good, great questions, so he always says "great question". And it's cool, because he loves to talk about MLflow. That was the last one, or you can talk with him afterwards at the Ask the Expert area if you have a question that you'd rather not ask in front of the big audience. Of course, you have the app in the App Store and so on: you can go in, download the Big Data Spain app, and you have all the information, all the speakers, and all of that in there. So try to do it right now. If there are no more questions, a big round of applause for Matei. And thanks so much. It's been an honor. Thank you.