Hello, everyone. Thank you for being here. I'm Maria Medina. I work as a data scientist at Microsoft, on the consulting side here in Madrid, and I've been working in data science consultancy for a little more than six years now, in different places. It's amazing how things have evolved lately. Data scientists all over the world are making great models that solve really complex problems, but, I'm sorry, someone has to say this: we suck at putting models into production. And this is of course a problem, because life out there is not like Kaggle. It doesn't matter if we have a model with 99% accuracy; if nobody is using it, then we're not providing any value. But it's fine, there are things we can do to help fix this. So I'm going to walk through some of our pain points and some things we can do to improve, so we can put models into production quicker, or even at all.

One of the main things here is that in machine learning systems, the part of the code that is actually related to machine learning is very, very tiny. The code dedicated to training models and scoring predictions is very small compared to everything else: handling data, creating features, configuring things, deploying things, and so on.
So if we are using notebooks, for example, to handle all of this, it's very likely that we're going to fail to build a strong, robust system. But we also need notebooks at some point, because data science is a science, right? We spend a lot of time experimenting and trying out new things: different ways of handling data, different models, different configurations. And we keep this as an iterative process, starting from the business understanding part, moving on to data acquisition and understanding, doing a bit of analysis, and then starting to build the models. But eventually we stop experimenting, and we should start building something more robust that we can eventually operationalize.

From that perspective, a data science project looks something like this: we have a big experimental phase where we try out many new things, and then we start developing something that is eventually going to go into production, into a deployment environment. You might see now where I'm going, because this looks similar to something they talk about in traditional software development. There they have the development and the operations parts, but recently they have come up with a philosophy in which these two areas aren't disjoint anymore: development and operations work together to make processes more efficient. That practice is called DevOps, right?
The definition we like to use at Microsoft is this one: DevOps is the union of people, process, and products to enable continuous delivery of value to our end users. I like this definition a lot because it puts people at the top, stressing the importance of communication and collaboration, inside the team and between teams, to enable this kind of efficiency. We also have process there, because there are probably a lot of things that will have to change for us to implement these practices. And of course there are also a lot of products and tools that can help us along the way.

In traditional software development, delivering something can mean building a product or a service that our end users are going to use. But in machine learning, delivering something actually means having a model ready, for example in a service, in an API, that given input data is going to issue predictions. That's what we want to deliver with this DevOps approach.

There is no official list of what is DevOps and what isn't, but all the sources agree on some common points, and I'm going to try to explain those points and relate them from the traditional software development world to the machine learning world. The main difference between these two worlds is that in software development there is only the code: every behavior of the system is ultimately written into the code, and if you look at the code you can see how the system will behave in any specific scenario at any time. In machine learning it's not only the code that determines the behavior of the system, right? It's also the models, and ultimately the data used to train those models, that define the behavior of our system. So we need to take these three things into consideration and put them together into these practices.
So we have to take into consideration data, models, and code: three things.

The first practice is, of course, using version control: having a way of tracking your code changes, so if something goes wrong you can go back and fix it as quickly as possible. In our case we also need to track the changes in our models and in the data we use to train those models, so we will need a version control system for models and data as well.

The second practice is called continuous integration, which is automating the process of going from code to something ready to use: building and testing your code. And in machine learning, having something ready to use also means having a trained model that is able to issue predictions, right? So our continuous integration practice is also going to include training a model; our build process includes model training.

Next we have continuous delivery, a practice in which, from the code previously built, we are able to deploy into different staging environments and ultimately into a production environment, where that code, or system, is going to be used by the end users. In our case this is very convenient because we can have multiple stages with different models we want to try out, and finally, automatically, move the most successful model to production.

And if we're going to build many different environments to try out our system, we need those environments to be the same, right?
We cannot risk having errors related to, for example, different packages being installed in different systems. So we need a way to replicate our environments safely, and we do that with infrastructure as code. That is a practice in which you use code and configuration files to automatically build your systems and all your environments from deterministic code, so every time you build again, you are sure you are making an exact copy of your environment.

That is related to the use of microservices, because if you are going to rebuild your system very often from scratch, it's much easier if you use tiny pieces that relate to each other instead of one big environment. Microservices are those tiny pieces, each with a very specific function, that communicate with the other microservices to make up a whole environment; that way it's much easier to deploy and automate things.

And finally, it's very important to monitor all of this. Since you are building everything automatically, you need a way of quickly finding out if something is going wrong: you get a notification and you go fix it as soon as possible. In software development, the only thing you have to monitor, the only thing that can fail, is your code, right?
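To make the infrastructure-as-code idea concrete, here is a deliberately tiny sketch: an environment described as data, rendered deterministically into a build file, so rebuilding always yields an exact copy. The base image, package list, and port are invented for illustration; in practice you would write a real Dockerfile or ARM/Terraform template rather than generating one like this.

```python
# Illustration only: environments described as configuration and
# rendered deterministically, so every rebuild is an exact copy.
ENVIRONMENT = {
    "base_image": "python:3.10-slim",                    # hypothetical base image
    "packages": ["scikit-learn==1.3.0", "pandas==2.0.3"],  # hypothetical pins
    "port": 5001,
}

def render_dockerfile(env: dict) -> str:
    """Render the same Dockerfile text every time for the same config."""
    lines = [f"FROM {env['base_image']}"]
    # Sort the packages so output never depends on insertion order.
    for pkg in sorted(env["packages"]):
        lines.append(f"RUN pip install {pkg}")
    lines.append(f"EXPOSE {env['port']}")
    return "\n".join(lines)

print(render_dockerfile(ENVIRONMENT))
```

The point is the determinism: the same configuration always produces byte-identical build instructions, which is what lets you treat environments as disposable copies.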
But in machine learning systems, we can have everything running properly from a development point of view, with our model giving out the numbers we expect, and those numbers might still be wrong, because the model's performance might drop, for example. So we don't only need to monitor our code; we also need to monitor our model's performance, and the data we're feeding it, because that data might change. As Casey was saying, the world we're living in might change, our model is probably going to suffer from that, and we will need to adapt.

So now we know which practices are the best ones for operationalizing things in data science. Let's have a look at six steps we can take toward MLOps adoption; this is how we do it at Microsoft.

Let's imagine you have already trained your model: after the experimentation you have come up with a good model that performs well. I don't know if this has happened to you, but it has happened to me a lot of times: you have your model and you either forget to store it in a file, or you rerun all your code and suddenly your model has changed, it's not as good as it was before, and you have overwritten it. You have to start all over again, trying to figure out what went wrong so you can replicate your model. We had that problem because we didn't have a version control system for our models.
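One way to picture the model version control we were missing is a minimal local registry: every saved model gets a name, an auto-incremented version, and a date, so a good model can't be silently lost or overwritten. This is an illustration only, not any product's API; a temporary folder stands in for shared storage, and a plain dict stands in for a trained model.

```python
# Illustration only: a minimal local "model registry".
import datetime
import json
import pathlib
import pickle
import tempfile

# A throwaway temp folder stands in for shared/cloud storage.
REGISTRY = pathlib.Path(tempfile.mkdtemp()) / "model_registry"

def register_model(model, name: str) -> dict:
    """Store a model file plus metadata; versions auto-increment per name."""
    REGISTRY.mkdir(parents=True, exist_ok=True)
    index_file = REGISTRY / "index.json"
    index = json.loads(index_file.read_text()) if index_file.exists() else []
    version = 1 + sum(1 for e in index if e["name"] == name)
    path = REGISTRY / f"{name}_v{version}.pkl"
    path.write_bytes(pickle.dumps(model))
    entry = {"name": name, "version": version,
             "date": datetime.date.today().isoformat(), "path": str(path)}
    index.append(entry)
    index_file.write_text(json.dumps(index, indent=2))
    return entry

# Any picklable object stands in for a trained model here.
entry = register_model({"coef": [0.3, 1.7]}, "churn-model")
```

Registering the "same" model again simply produces version 2, which is exactly the behavior that saves you when a rerun overwrites a good model.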
So the first thing, once you have a good model, is saving it, so you don't lose it again. For that we use a tool called Azure Machine Learning, which is a resource available in Azure. It's not only meant for saving models; it's meant for much more, and it doesn't replace any of our machine learning code. For example, if we're using Python with scikit-learn, all that code stays; Azure Machine Learning assists us through the whole lifecycle of our data science project. This saving-the-model step is like the second step in that diagram, but the tool is going to help us through all the steps, as we're going to see.

The way we create this Azure Machine Learning resource is by going to the Azure portal, in the cloud, and creating what is called a workspace, which is kind of a folder, a project, for our system, storing all the assets and processes related to our machine learning system. We have several ways of interacting with that workspace. We can do it by hand, clicking in the portal: they have what is called the designer, a web-based tool where, by dragging and dropping, you can build a machine learning system from scratch without needing programming skills. But since we want to automate things, we need to use code, of course, and we have three ways of interacting with the workspace through code: the Azure command-line interface, which you use from your console, and the Python and R packages, with a lot of functions and functionality to interact with the workspace.

So if we want to save our model, we take our model file, give it a name, and send it to the workspace through a function call. Then in our workspace we will have a list of all the models we have registered, with the name, the version number, the date we uploaded it, and so on. All our models are stored in there; we are not going to lose them anymore.

But simply having your model file and handing it to the software developers on your team, if you're lucky enough to have them, is not going to work, right? You need some machine learning knowledge to be able to extract the predictions from that model. So Azure Machine Learning also helps you build a really simple web service, and that way you have an API with an endpoint you can give to the software development team, so they can integrate that endpoint into their bigger system, which is kind of a microservices approach, right?

The way of doing this is: you take the model ID you got for the model registered in the previous step; you write a very simple scoring script, a few instructions saying how you load your model, which data transformations you apply, and how you predict, which is normally a call to the predict function; and if your code has dependencies, you can specify those too. Then you choose what they call a compute target, which is where in the world this web service is going to be deployed. It can be your local machine.
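A scoring script like the one just described usually boils down to two functions: one that loads the model once when the service starts, and one that runs per request. This mirrors the init/run pattern Azure ML entry scripts use, but the sketch below is self-contained: the "model" here is a hypothetical stand-in so it runs on its own.

```python
# Sketch of a scoring script following an init/run convention.
import json

model = None

def init():
    """Load the model once at service startup.
    In a real service you would load the registered model file here."""
    global model
    model = lambda rows: [sum(row) for row in rows]  # dummy stand-in "predict"

def run(raw_data: str) -> str:
    """Called per request: parse the input, transform, predict, respond."""
    rows = json.loads(raw_data)["data"]
    predictions = model(rows)
    return json.dumps({"predictions": predictions})

init()
response = run(json.dumps({"data": [[1, 2], [3, 4]]}))
print(response)  # {"predictions": [3, 7]}
```

Because the service only ever calls these two functions, the software team integrating the endpoint never needs to know anything about the model inside.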
In that case, Azure Machine Learning gives you the Docker image, so you can deploy it anywhere you like. Or you can use Azure Container Instances for simple, lightweight containers, or Azure Kubernetes Service if you need something more robust. And that's it: you just make the function call and you get your web service running, with an endpoint you can call, give it data, and the model returns the predictions.

You can get this information programmatically through function calls, but in the Azure portal you can also see the list of all your deployments, active, running, or even stopped, and if you click on one you can trace back to the model actually being used in the service you have just deployed.

So now we have our model up and running, and we want to monitor it, right, to make sure everything's going fine and it's still working properly. The way we do this is through an integration with another service called Application Insights, inside the Azure portal; this is the service in charge of monitoring every app inside Azure, and the integration is very easy.
You just have to click through some configuration to activate the web service logs, so you have a log of everything happening inside your web service. But you also get two very cool things related to the model and the data: you can track all the data coming through your web service and the predictions you are issuing, so you can check whether the performance of your model is dropping, and you can get notifications if the data you're receiving doesn't match the data you used back when you trained your model. You will get an alert, you can go see what is happening, and maybe retrain your model if you need to.

If you do want to retrain your model, you might want to try a whole load of new algorithms and parameters, maybe even create new features. Since we're doing this with an MLOps approach, and we already experimented back when we built the first model, we now know more or less which things are likely to work and which are not, so we can start building something more robust, and we can easily automate it. The way we do this with Azure Machine Learning is: you can have the dataset downloaded directly from your code, but if it's a big dataset it's more likely you'll upload it to the cloud and use it from there, so you just specify the connection.
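The kind of check behind those drift alerts can be sketched very simply: compare the statistics of an incoming feature against the statistics the model saw at training time, and alert when they diverge too far. The feature, data, and threshold below are invented for illustration; real drift monitors use richer distribution tests.

```python
# Illustration only: a naive data-drift alert on one numeric feature.
import statistics

def drift_alert(train_values, live_values, threshold=2.0):
    """Alert when the live mean sits more than `threshold` training
    standard deviations away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) > threshold * sigma

# Hypothetical "age" feature: what the model trained on vs. what arrives now.
train_ages = [34, 29, 41, 38, 30, 33, 36, 40]
assert drift_alert(train_ages, [35, 31, 37]) is False  # looks like training data
assert drift_alert(train_ages, [72, 68, 75]) is True   # the population shifted
```

When the alert fires, that is the signal to go look at what changed and, if needed, kick off the retraining described next.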
You say: I'm going to use this dataset located over here, in whatever storage option you want. You build a training script, a file with all the steps you will follow for the training, so everything you were doing before, such as splitting the data into train and test, or trying out many different models and options, you keep doing with the framework you prefer. You specify the dependencies and the compute target, which in this case can also be your local machine, or a virtual machine with CPUs or GPUs if you're doing deep learning, or Databricks if you're using Spark, or many other options. Azure Machine Learning packages everything together, starts the virtual machine, for example, sends everything there, builds an image to be executed, and stores all the results you produce during the training process. When the training is finished, it stores everything and shuts down the machine, so you only use the compute you need, only while you are training; it's started automatically and turned off when you're not using it.

Then, in the workspace, you get a summary of all the experiments: every run, everything you try with a new model, is called an experiment, and there you have all the metrics you stored, like your accuracy, the time it took to train the model, and so on. It's all there, and you can pick the best performing model from the whole log of experiments you have made.

So you take the best experiment, whose model is the one you might actually want to put into production, and you register it as we did before. Before, we took the model file we had on our computer and registered it, which is like keeping a list of favorite models that are likely to end up in a production environment. Here, what we do instead is say: I'm not giving you a file, because I've already trained everything in the system, so go take that experiment, the one that was the best, take the model it produced, and save it as a favorite to be used further on.

What we get is the same list we had at the beginning, but now with a parameter called a run ID that tracks back to the experiment we ran, and the experiment tracks back to the code and the data we used to train the model. So from this model we can go back even to the data we trained it on, and we can go forward to every place this model is deployed. Now we have end-to-end traceability of our whole machine learning system, from the data used to train the model to the endpoint where the model is serving.

And now that we have that end-to-end process, it's time to automate everything using continuous integration and continuous delivery pipelines. For that we use Azure DevOps, the tool we use inside Microsoft, but you can use any other end-to-end software building tool you want. In there we have the repositories, though you can integrate with GitHub, for example, if you want; we have the pipelines; we have boards for tasks, but you could also use Jira if you like, or Jenkins or any other automation tool, and integrate all the pieces, if you prefer different tools for different things. So here we have our code repository.
We normally try to follow a trunk-based philosophy, where everything tries to live on the master branch, and if we make changes we keep them as short and small as possible, to prevent a mess when we try to merge branches together. We set up a trigger on the master branch, and every time something changes there, we trigger a build pipeline that performs every step we have configured. This is totally customizable; we have selected the steps we want to run each time our code changes in the repository.

The steps we take here are: first, whenever something changes in our code, we launch a training with Azure Machine Learning. We take that code, which also contains all the variables, the configuration for the models we want to try, and everything else, and we launch the training. Those experiments start to run, and when all of them have finished, the pipeline moves on to the next step. The next step for us is a command-line script configured to pick the best experiment; we register the model from that successful experiment as our favorite model, Azure Machine Learning packages it, and we publish it into the pipeline as an artifact we're going to use later.

So the result of our build pipeline is the model we will eventually use in a production environment. And this is how it looks when you execute it: you can have many jobs if you want to do things in parallel, but in this simple case we run one line, with every step we configured in our pipeline. We get green check marks if everything goes right; if something goes wrong we get a red cross, and the pipeline stops there without continuing to the next step. So, for example, if the model training fails, we don't register any model, because the previous step failed.

Then we have our release pipeline, which is the continuous deployment practice. We can set a trigger on artifact changes, so in this case, when our model artifact changes, meaning we have a new trained model, the release pipeline fires and deploys that model into different environments. We can configure this: we may want, for example, to deploy the model first into a development environment, then move through the following stages until we reach the production environment. We can even put a manual approval process in between, so somebody reviews the change before the model is swapped in production, if we want business decisions involved there.

And this is how it looks: if everything has gone right, we have all our environments one after the other, and if we click on one of them we see all the steps performed, all green. So now we have both our continuous integration and our deployment pipelines in place.
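The gating logic in those pipelines, pick the best finished run, then promote it only if it actually beats production, can be sketched in a few lines. The run IDs and metrics below are hypothetical; in the real pipeline these values would come from the experiment log, not a literal list.

```python
# Illustration of the build-pipeline gate: select the best run, and
# register its model only if it beats the model currently in production.
runs = [
    {"run_id": "run-017", "accuracy": 0.89},  # hypothetical experiment results
    {"run_id": "run-018", "accuracy": 0.93},
    {"run_id": "run-019", "accuracy": 0.91},
]
production_accuracy = 0.90  # metric of the model currently deployed

def best_run(runs):
    """Pick the run with the highest tracked metric."""
    return max(runs, key=lambda r: r["accuracy"])

def should_register(candidate, production_metric):
    """The gate: if this fails, the pipeline stops here, so a worse
    model never reaches the release stages."""
    return candidate["accuracy"] > production_metric

winner = best_run(runs)
```

This is exactly the red-cross behavior described above: a failed comparison at this step means no artifact is published and no release is triggered.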
So we have automated everything, but there's one very important thing missing. I didn't mention it yet, to simplify the process, but of course, if you're doing all of this automatically, you need tests everywhere, to make sure nothing goes wrong and you're not propagating a problem all the way to the last stage of your deployment.

The problem is that testing machine learning systems is complex and difficult. Here you can see the difference between a traditional software system, where you have only unit tests, integration tests, and monitoring, and machine learning systems, where, again, we have data to test, models to test, and code to test, so there are many different kinds of tests to take into account. There is a very good paper with a battery of some fifty tests, which you can use as a checklist of everything you need to test, but I'm going to sum up with some examples of things you might want to try.

For example, before launching the training: imagine you make a change in your code, that triggers the training pipeline, you start a machine with GPUs and have it running for eight hours, and when it finishes you realize the data was wrong. That's of course a waste of time and resources, right?
So you want to make sure all your data is fine before you actually start doing anything that consumes resources. In this step you might want to check things like the features: how they are built, whether their distributions look fine, whether everything is more or less correct, before you move on to the modeling phase.

Then, before registering the model, you might want to check the model you have created: whether it actually does better than the model you currently have in production, for example, or that it is not biased or skewed in any way. You might want several different tests here to check that your model is actually good.

And then, every time you deploy to a new environment, you need a lot of tests there, integration tests, to check that everything works, and maybe some deliberately difficult examples, to make sure everything behaves properly before you move on to the next stage. If any of these tests fails, the pipeline stops, and the tool does not move forward to the next step.

Azure DevOps integrates with many of the testing frameworks out there, so you get a test tab inside the tool where you see all the tests you have run, which ones passed and which didn't, and the log of everything that happened inside your system. And with that, we have an MLOps system that runs automatically and efficiently automates most of a machine learning or data science project.
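A few of those checks, written as the kind of plain assertions a CI test step would run. The schema, example rows, and predictions below are invented for illustration; real test suites would cover many more cases.

```python
# Illustration only: sample data tests and model tests for an ML pipeline.

def check_schema(rows, expected_columns):
    """Data test, run before training: fail fast here rather than
    after eight hours of GPU time on broken data."""
    return all(set(row) == set(expected_columns) for row in rows)

def check_no_missing(rows):
    """Data test: no missing values in any field."""
    return all(v is not None for row in rows for v in row.values())

def check_model_not_degenerate(predictions):
    """Model test, run before registering: a model that predicts one
    constant value for everything is suspicious."""
    return len(set(predictions)) > 1

# Hypothetical feature rows and predictions.
data = [{"age": 34, "income": 51000}, {"age": 29, "income": 43000}]
assert check_schema(data, ["age", "income"])
assert check_no_missing(data)
assert check_model_not_degenerate([0, 1, 1, 0])
```

Any failing assertion stops the pipeline at that step, which is exactly the behavior the red cross in the pipeline view represents.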
The whole thing looks something like this. Since I work in consulting, we try to do this in every project we have, but of course it depends on the maturity level of the customer and the project, so we go step by step and try not to rush anything: it's easier to start by deploying a model, then continue with the experimentation setup, and then follow with the CI/CD pipelines. It's better to do everything step by step until you have this whole big architecture up and running.

But I think we have an advantage: in our area this might be easier than DevOps was for traditional software, because this is quite a new field. We don't have many systems that have been running for ten or twenty years, so we don't have that much legacy, and it's easier to change. Some teams are only being built now, so we're a very young field where it's easier to change the culture and implement something like this in an efficient way.

With that I conclude. If you want to know more about the business perspective of this, you can attend the talk by my colleagues Pablo and Carlos tomorrow, and if you want to know more on the data side, there's a talk about automating data quality, also tomorrow, with my colleague. Thank you. Any questions? I've put some resources here that you might want to go check. Great, thank you then.