In the real world, many problems are too complex to be solved by a single machine learning model, whether that's predicting sales for each individual store, building predictive maintenance models for thousands of oil wells, or tailoring an experience to individual users. This leads to a pattern of many models. When should you use this approach, and what are some of the best practices, or things to avoid, if you decide to go down this route? Well, to tell us everything they know about it, we have with us Maria Medina and Hossein Alizadeh from Microsoft. Welcome, Maria and Hossein. How are you?

Thank you. I'm glad to be here.

Lovely to see you. So we're looking forward to finding out all your secrets. Remember, we're in the attic, and the attic is the place of the hidden treasure. So it's all yours, Maria.

Okay, yeah. So let's talk about the many models pattern, or how to manage thousands of models at scale, all at once. As Elena said, I'm Maria Medina. I work as a data scientist in Microsoft Consulting Services, and I'm based in Madrid, Spain. Together with me we have Dr. Hossein Alizadeh, who's also a data scientist in Microsoft Consulting Services and is based in Melbourne, Australia, so he's joined us from the other side of the world today. In the following 30 minutes, we're going to talk about this many models pattern and why it's needed. We'll also talk a bit about Azure Machine Learning, which is the tool we'll be using for developing these solutions. We'll cover how to train those thousands of models and how to make predictions with them, and then we'll end by explaining how to parallelize everything, plus a bit of a wrap-up. With that, I'll pass it on to Hossein, who will start by explaining why many models.

Thank you, Maria. I'm also very glad to be here today, sharing what Maria and I have learned through helping our customers build many models solutions for their complex problems. As some of you might already have seen on this slide, organizations differ in where they are on their data and AI maturity journey. Some might only have basic analytical capabilities, some might be embarking on their digitization, and some are further up the ladder. The further up this ladder organizations get, the more mature they are in their data and AI journey, and at Microsoft we increasingly see the need for an emerging pattern: the many models solution. So instead of training only one model, or inferencing from one model, we see a need to manage the full life cycle of machine learning models at scale: training thousands of models at the same time in parallel, forecasting with them, monitoring them for drift, and productionizing all of it. As an example, one of our customers, a hundred-year-old Australian utility company, reached out to us with a problem. They are pretty mature in their AI journey; they already have a sizable center of excellence for advanced analytics, so their requirements are also very advanced. The problem they brought us was forecasting solar production and load from the grid for every household among their customers. The solution they required had to be scalable enough to cover their whole customer base, which is around 400,000 households.
So this is a perfect example of that requirement, where you need to build thousands of models at scale, create a solution that is scalable, and be able to manage the life cycle of the full solution at scale. But this is not the only example. We have observed a variety of industries and organizations that share similar requirements to train and manage the life cycle of thousands of machine learning models: whether it's a bank with cash replenishment models for thousands of ATM machines, a retail business with thousands or even hundreds of thousands of models for price optimization that wants to manage the whole solution in a proper manner, or an energy and utility company with a demand forecasting problem at scale. They all share the same pattern of problem and requirements, and that pattern requires a proper solution. What we came up with at Microsoft is what we call the many models solution accelerator, and it addresses exactly that problem. At Microsoft we open-sourced this solution accelerator, and everyone can access it through aka.ms/many-models. We shared it publicly so that we empower every organization on the planet to achieve more in terms of their AI capabilities. But what is this solution? What are its main components, and how do the different bits and pieces work together to achieve this goal? For the rest of this talk we are going to unbox this solution, talk through each component one by one, and share how we actually use it. But before that, we first need to go through some of the key concepts and components in Azure Machine Learning, for those who may not be familiar with it, so that we can take it from there. With that, I would like to ask Maria to take us through the introduction to Azure Machine Learning, please.

Thank you. Yes, so Azure Machine Learning, or AML in short as we call it, is a tool that helps you manage the life cycle of your machine learning applications, the end-to-end life cycle. It's not meant to replace any of your machine learning code; instead, it's meant to assist you during the process of executing the code, storing all the results and metrics that the code produces, and also help you put that code into production. For example, when you're training a model it can help by tracking all the experiments that you run, all the metrics that you produce, and all the artifacts that you generate. When you have built some successful models, you can also package those models and version them so they're ready to be potential candidates for production, kind of like a shortlist of your favorite models. It also has features and capabilities to help you validate those models and understand and interpret what they are doing on the inside, and why they're making the predictions they're making. It has a wide variety of functionalities to help you deploy these models and use them for making predictions in real, production-ready environments. And it has features to help you monitor the models and detect whether the data you're using is drifting or the model is drifting, in which case you might need to retrain it and start the process all over again, closing the cycle. So when you create an AML service in Azure, what you're creating in the cloud is what is called a workspace, an AML workspace. The workspace is essentially like a folder for your project, where you store all your artifacts, your data, your models, your experiments and everything, all versioned, so you can keep track of everything you're doing and never lose anything in your machine learning system. For interacting with this workspace, the two main tools are the Python package and the R package, but there are also other tools depending on the language you're using to build your models.
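As a minimal sketch of that first interaction, assuming the v1 Python SDK and a config.json file downloaded from the Azure portal, connecting to an existing workspace looks something like this:

```python
# Minimal sketch: connect to an existing AML workspace with the Python SDK (v1).
# Assumes a config.json (downloadable from the Azure portal) in the working directory.
from azureml.core import Workspace

ws = Workspace.from_config()  # reads subscription, resource group and workspace name
print(ws.name, ws.location)   # quick sanity check that we reached the right workspace
```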
One very special artifact that you can create with AML is the pipeline. Pipelines are used for defining workflows in your machine learning applications, and they are a good way to encapsulate your machine learning logic so it's more repeatable, reproducible and modular. A pipeline is essentially a sequence of tasks that you're going to perform: raw data comes in on one side, and as the result of a training pipeline, for example, you get a trained model out. Inside the pipeline, as we mentioned, we have a sequence of steps. Again, this is not meant to replace your scikit-learn pipelines, for example; instead, it's a layer of abstraction above them to make the code more manageable. So in a training process, you would have a high-level step for data preprocessing, then one or several steps for doing your training with different algorithms, different configurations and so on, and you might have a final step for evaluating all those training results, choosing the best performing model and registering it into your workspace. And there are different types of steps: steps for running Python code or R code, or specialized steps for hyperparameter tuning, for example. But in our case, we're going to be using a very special type of step, the parallel run step. This is a step that allows you to run a task in parallel as many times as you need. You provide a script that defines the task you want to perform over a set of data, and you feed in a whole dataset. The parallel run step then automatically splits that dataset into mini-batches (you can configure how the splitting is done), processes those mini-batches in parallel by executing the task defined in your script, and automatically aggregates the outputs. The output of the parallel run step is this aggregated dataset with the final result of applying the task in parallel to all your mini-batches. This is the structure we're going to use in the many models solution accelerator, both for training and for forecasting.
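To make that concrete, here is a rough sketch of what a one-step pipeline built around a parallel run step could look like in the v1 Python SDK. The dataset name, cluster name, train.py script and environment.yml file are illustrative assumptions, not the accelerator's exact code:

```python
# Sketch of a training pipeline built around ParallelRunStep (AML Python SDK v1).
# Dataset, compute and script names below are illustrative placeholders.
from azureml.core import Workspace, Dataset, Environment, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, "household_timeseries")   # one file per household
env = Environment.from_conda_specification("train-env", "scripts/environment.yml")
output = PipelineData("training_output", datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",
    entry_script="train.py",        # trains and registers one model per mini-batch
    mini_batch_size="1",            # here: one file (one household) per batch
    error_threshold=10,             # tolerate a few failed partitions before aborting
    output_action="append_row",     # collect per-batch results into a single output
    environment=env,
    compute_target="cpu-cluster",
    node_count=4,
    process_count_per_node=8,
)

step = ParallelRunStep(
    name="many-models-training",
    parallel_run_config=parallel_run_config,
    inputs=[dataset.as_named_input("train_data")],
    output=output,
)

pipeline = Pipeline(workspace=ws, steps=[step])
run = Experiment(ws, "many-models-train").submit(pipeline)
```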
And so I'll pass it to Hossein, who will explain exactly how we do this.

Thank you. Let's have a look in more detail at the training side of the many models solution. On the training side, all we need to do is provide the many models solution with a script for training one of those models on one slice of the data. The many models solution then takes that data and that script, runs them in parallel, produces the results, which are basically the models, and registers them in the model registry. But how does it do that? Let's look under the hood at what's going on. If you look at the right-hand side, you see up there that data and a training script are provided to the many models solution, and the core of the many models solution is that parallel run step. The job of the parallel run step is to take the data, break it down into very small pieces, and then pass each of those pieces, along with the training script, to the different nodes within a cluster. So you see down there, there is a cluster; each node within that cluster takes one piece of the data, runs the training script on it, trains a model, and the results are saved in the Azure ML model registry. And this is done in parallel on all of the nodes in the cluster. You can provide your own custom script for training a custom model, or, if you prefer, there is an option in the many models solution to leverage AutoML, which is also included. AutoML can automatically test many machine learning algorithms and give you the best result out of those experiments. So this is basically the training side of the solution; when it is done, all the models are in the model registry. You'll see all of the models in the model registry when they are finished, and you can keep track of them: you can see the version, which experiment they came from and its results, and so on.
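As a small illustrative sketch (v1 SDK), checking those registered models from code could look like this:

```python
# Sketch: list the models registered by a many-models training run (SDK v1).
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Each mini-batch's training script registered its model; inspect names and versions.
for m in Model.list(ws):
    print(m.name, m.version, m.tags)   # tags might hold e.g. the household id
```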
Next, please. For our specific use case, the site-level forecasting, we had 1,500 households in the first phase of the solution implementation. Each of those households had three models underneath: solar production AC was one time series to be forecasted, solar production DC was another, and then consumer usage, which is basically the load from the grid. So overall, 3 times 1,500: we had 4,500 models to be trained, forecasted and productionized. For the training script, we had a custom script training a neural network-based forecasting model. For the implementation, we leveraged the many models solution accelerator, and specifically the parallel run step, to train all of those 4,500 models in parallel. There was also a requirement from the customer to serve real-time inferencing through a single API. But how can we do that? Let's have a look further into the forecasting capability of the many models solution.

Generally, we have two types of forecasting. One is batch inferencing (batch forecasting, batch prediction), and the other is real-time forecasting. Batch forecasting is pretty straightforward: you have a pipeline and, on a schedule, you periodically generate the predictions for all of your data and save them in storage. This can happen every day, every week or every hour, whichever suits the business best; you can schedule it. For real-time inference, however, you want web services that are up and running 24/7, always ready to predict on whatever data you provide to them. They are always up, ready on request, so they are more resource-heavy than batch inference. Let's look at each of them in more detail.

Batch forecasting is very similar in architecture to batch training. As you see again on the top right-hand side, you have your data and your prediction script, and you provide them to the many models solution, specifically to the parallel run step. The parallel run step again distributes the work across the cluster, and every node within the cluster takes one subset of the data along with the predict script, retrieves the corresponding model from the model registry, applies that model to its part of the data, and returns the predictions, which can be saved to storage or wherever is appropriate.

Real-time is a bit different: you want to deploy your models on web services. Let's assume you have 1,000 models in your model registry. You could take each model with the predict script and deploy it on its own web service; you would then have 1,000 web services, and each of them would provide an API, a link, through which the end user can call that model. With this approach you deploy 1,000 web services and provide 1,000 endpoints to the end user. But that is not an ideal approach, because it does not use resources well: 1,000 web services are very costly, and most of them will be idle most of the time. So we want a solution that optimizes resource management. Therefore, if we go to the next slide: we can package groups of models into a single web service. Let's assume our web services are big enough to contain 100 models each; if so, we go with groups of 100. We package 100 models along with the predict script and deploy them into one web service, then another 100, and another 100. With this approach you have only 10 web services for your 1,000 models, and each web service provides an API, so you end up with 10 endpoints that the end user needs to call. That's much better than the first approach. However, there is still an issue: the end user does not want the headache of working out which model is deployed in which web service. They just want one API that they can call all the time. To address that requirement, we came up with a two-layer structure where, on the right-hand side, you see the routing web service. In this web service we have deployed a router. What is this router? It is nothing more than a dictionary: it records that model 1, model 2, up to model 100 are deployed in web service number 1, and so on, up to models 901 through 1,000 in web service number 10. So the router knows where each model is deployed. And because the router is itself a web service, it also has an API, and that is the API we provide to the end user. The end user provides the data and calls this single endpoint, and the routing web service's only job is to call the associated web service that holds the relevant model, get the results back, and return them to the end user. With this approach, you deploy 11 web services but expose only one endpoint, which is the best case for our end user and solves that single-API requirement. This is the same approach we leveraged for our use case as well.
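To make the routing idea concrete, here is a minimal sketch of what such a router could look like, using Flask and a hard-coded lookup table; all names and URLs are illustrative assumptions, not the accelerator's actual scoring code:

```python
# Minimal sketch of the routing layer: a web service whose only job is to
# look up which backend service hosts the requested model and forward the call.
# All names and URLs below are illustrative assumptions.
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# The "router" is essentially a dictionary: model id -> backend web service URL.
# In practice this would be built from the model registry, not hard-coded.
MODEL_TO_SERVICE = {
    f"model_{i}": f"http://service-{(i - 1) // 100 + 1}.internal/score"
    for i in range(1, 1001)
}

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    model_id = payload["model_id"]
    backend = MODEL_TO_SERVICE.get(model_id)
    if backend is None:
        return jsonify({"error": f"unknown model {model_id}"}), 404
    # Forward the request to the backend service that actually hosts the model,
    # then hand its prediction straight back to the caller.
    resp = requests.post(backend, json=payload, timeout=30)
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```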
And with that, let's see how we productionize this solution. Maria, can you please take us through the productionization phase?

Sure, thank you. So yeah, you have probably seen this figure before. It shows how little our actual machine learning code is compared to the many other parts of our machine learning system. In our particular case, the machine learning code, this little black box here, is only the training script we provide, with its modules, and the predict script we use either for batch forecasting or for real-time forecasting. All the rest, the Azure Machine Learning pipelines and all the infrastructure needed to connect everything and do the data cleaning, is part of our machine learning system, but it's not machine learning code. Still, of course, we need to make sure we follow best practices for productionizing all of it, in development and also in operations. We're talking, of course, about DevOps practices. This is a very broad area, so we'll focus on six practices to summarize. We're going to use version control; we're going to use continuous integration pipelines and continuous delivery pipelines; on the infrastructure side, we're going to leverage infrastructure as code and use microservices to make everything more interconnected and easier to use; and of course, we're going to monitor and log everything that happens in the system, to make sure everything is working properly and, if not, to get an alarm and be able to fix it quickly.

But in our case, we're not talking about traditional software systems; we're talking about machine learning systems, and we need to adapt these practices. So now we're talking about MLOps practices, not DevOps. What are the differences? Well, first, versioning. Version control exists because we need a way to control everything that influences the behavior of our system, and in machine learning systems it's not only the code that determines that behavior: it's also the models that have been trained and give predictions, and ultimately the data that was used to train those models. All of those versions need to be controlled. So we use git for controlling the version of the code, and we can use tools such as Azure Machine Learning to version the data and the models and make sure everything is tracked.
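As a small sketch of what that looks like in the v1 SDK (paths and names are illustrative placeholders), registering a model or a dataset under an existing name simply creates a new tracked version:

```python
# Sketch: versioning models and data in AML alongside git-versioned code (SDK v1).
# Paths and names below are illustrative placeholders.
from azureml.core import Workspace, Dataset
from azureml.core.model import Model

ws = Workspace.from_config()

# Registering the same model name again creates a new, tracked version automatically.
model = Model.register(
    workspace=ws,
    model_path="outputs/store_0042.pkl",   # local file produced by training
    model_name="store_0042_forecaster",
    tags={"algorithm": "nn-forecaster"},
)
print(model.name, model.version)

# Datasets are versioned the same way, so a model can be traced back to its data.
datastore = ws.get_default_datastore()
ds = Dataset.Tabular.from_delimited_files(path=(datastore, "sales/2020/*.csv"))
ds.register(workspace=ws, name="sales_data", create_new_version=True)
```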
About continuous integration: continuous integration is the capability to take the code in our version control system, build it, and have it ready to be used whenever we need it. In our case, having the machine learning system ready to be used means having a model trained and ready to give predictions, right? So our continuous integration pipeline is going to consist of the training of the model. Then we have continuous delivery; here, what we are delivering is our model predictions, so what we do in continuous delivery is deploy these models. On the infrastructure side there's not much change needed, but of course we leverage infrastructure as code, for example when running trainings remotely: the clusters we use remotely for training are configured as code, so we can replicate those nodes, with the whole environment that we need, as many times as we want, to run the training in parallel. And of course, we're using services in the cloud, with this microservices approach.

Then, on the monitoring and logging side, there is also a big difference, because the code might be working perfectly and the model might be giving predictions, so everything looks fine on that side, and yet the model predictions might not be accurate enough, which also means the system is failing. So we don't just need to monitor the code; we also need to monitor the data coming into the system and how the model is performing, and we need drift monitoring in place so that we can, for example, retrain when we see the model is not performing as expected.

For building the continuous integration and continuous deployment pipelines we mentioned, we normally use Azure DevOps, which has many capabilities for handling all these practices. A continuous integration pipeline would look something like this: changes to the code trigger the pipeline, which, as mentioned before, launches the training, evaluates all the models and registers the best one. That best model is then published as an artifact, the result of the pipeline, and that artifact triggers the continuous deployment pipeline, which puts the model into production. The typical way of doing this is deploying the model in different stages: we start with the development environment, and we deploy the model in successive environments until we eventually deploy it in production. What we normally do is set a manual approval gate before deploying to production, because data science processes are also heavily influenced by the business. Whether to deploy a model or not is not just a technical, quantitative decision; you also have to take into consideration qualitative input that comes from the business, for example.

So this is the whole architecture of how everything looks: first a setup pipeline, then the train pipeline, then the deploy pipeline, and then the monitoring in place that can eventually make us retrain the whole thing again. And this is how these pipelines look in our many models use case. We have the setup pipeline to deploy the infrastructure and do the setup of the data and the environment. Then we have the modeling; for simplicity, we have combined two pipelines here, so first we update the data we're going to use for training, we run the model training, and right away we deploy these models onto the different Azure resources for the web services. That last part is optional, of course; you can skip it if you don't want real-time deployment, and in that case you will run the batch inference pipeline. For example, you might run the training pipeline once a month to train your models, but run the batch inference pipeline once a day to make daily predictions. That pipeline is essentially the same idea: update the data, then run the pipeline to issue the forecasts.
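As a rough sketch of that cadence in the v1 SDK (the pipeline id, experiment name and times are illustrative assumptions), a published pipeline can be put on a schedule like this:

```python
# Sketch: schedule a published AML pipeline (SDK v1). The pipeline id,
# experiment name and cadence below are illustrative assumptions.
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline
from azureml.pipeline.core.schedule import Schedule, ScheduleRecurrence

ws = Workspace.from_config()
inference_pipeline = PublishedPipeline.get(ws, id="<published-pipeline-id>")

# Daily batch forecasts; a second schedule with frequency="Month" could cover retraining.
recurrence = ScheduleRecurrence(frequency="Day", interval=1, hours=[2], minutes=[0])
Schedule.create(
    ws,
    name="daily-batch-forecast",
    pipeline_id=inference_pipeline.id,
    experiment_name="many-models-forecast",
    recurrence=recurrence,
)
```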
But in a real-life use case, apart from having all these pipelines, we start at the beginning with experimentation. Data science, at the end of the day, is a science, right? So it has a big component of trial and error and experimentation, and you need to start with that part. Then eventually you move on to developing the components that will be part of your DevOps and MLOps pipelines. And then you will have, of course, your monitoring system in place, which could retrigger trainings by launching the CI pipeline if something is not going well, or you might even need to go back to the experimentation phase, for example to try different new features that will make your model perform better. But this is an ongoing cycle that never ends, because at the end we're talking about a product here. We're no longer talking about projects that have an initial phase and an end; we're talking about an ongoing cycle that never ends and always improves. And with that, I will pass it on to Hossein for a bit of a wrap-up of what we learned in this session.

Thank you, Maria. So far we have discussed and unboxed the many models solution accelerator, the different components and parts within it, especially the training and prediction sides. We also saw how we leveraged many models to address the business problem that we had with our customer, one of Microsoft's customers. Next slide, please. We also covered productionization, and how we productionized that solution into the customer's environment. But that was mainly the technical side of operationalization. What still remains, and is also very important to note, is that on the operationalization side the business always cares about the orchestration of capabilities, tools and processes to successfully drive business outcomes. Productionization, from a business point of view, is all about adopting the change that drives the organization's success. We at Microsoft are very glad that the solution we helped our customer implement has been delivered and received with great success. There have been great testimonies from the customer, and you can read more about this use case via the aka.ms many models use case link. That solution is already live in production, and they are using it on a day-to-day basis. With that, I would like to thank everyone who stayed with us for the last 35 or 40 minutes, and we are happy to take any questions.

Well, thank you both, Maria and Hossein, and especially Hossein, because I didn't realize that, obviously, being in Australia, it's what, four o'clock in the morning for you?

It is four, yes.

Well, nobody could tell; you look as fresh as a lettuce. But it is true, poor thing, it's four o'clock in the morning, so I'm not sure if you want to take the questions; we'll let Maria do it. We have time for a couple of quick questions, if you're okay with that. They're asking about security: can AML access possibly sensitive data on premises, and how do you tackle this?

Yeah, so what we normally do with Azure Machine Learning is set up connections that are transparent for the data scientist using Azure Machine Learning, essentially virtual connections to resources that are in the cloud. Normally all this data is located in the customer's Azure cloud, and of course you can put a lot of permissions in place to make sure that nobody who isn't allowed to access it can do so. In fact, what we normally do with this pipelines organization is have a pipelines user, which is not a real user, and that pipeline user is the only one that has access to the production data.
The data scientists, for example, test with a subset of data that might be masked; they don't have access to production data, and they don't need it.

Okay, so everything has been thought about. Of course, I was sure about it, but they asked, so we had to double-check. Just a quick one, if you can, Hossein, now that you're awake. Four o'clock in the morning is a terrible time, because you don't know if you should go back to bed or if there's no point, but it's too early, so you're going to hate us, just for 24 hours; tomorrow you'll be fine. Just quickly on AML as well: what are the prerequisites of the solution and of automating the solution? I guess this is a long answer, but if you could answer quickly, yeah.

Sure. So I guess the prerequisites for leveraging the many models solution are basically access to an Azure Machine Learning environment and a workspace. So if you're thinking you may need to be very advanced to be able to leverage this solution, that is already taken care of: the solution accelerator is pretty much ready and easy to use for anyone who wants to leverage it. For productionization, we also have many components already built into the system. Maria herself has worked extensively on the productionization pipelines of this solution, so it is very well prepared for everyone who wants to use it.

As you said, it's been a successful solution so far, your customers are happy, so I guess we'll just encourage our viewers and the audience to use it, and if they have any questions, to let you know.

There's always room for improvement, yeah.

We have no time for more, but it was lovely having you both, Maria and Hossein. This is one of the great things about technology, that we didn't have to fly you in, but you did have to wake up at four or five, or sorry, at three probably, or two, I don't know. It was lovely to have you all the way from Australia, and obviously Maria as well; she's closer to us, and we hope to see you very soon, maybe next year in Madrid. Hossein, Maria, from Microsoft, thank you both so much for this fantastic in-depth explanation, and congratulations on such a great job; it looks fantastic.

Thank you very much.

Stay around in the attic for our next keynote, which will come in just a few minutes. Go on the platform, remember we have a hashtag, #bigth20, and don't forget to check on your next keynote in order to ask questions to our coming speakers. See you in a minute, bye-bye.