Hi, welcome back. Have you ever sat down in a restaurant hungry, ready to eat, and when you open the menu there are so many choices that you just don't know where to start? Sometimes you can have too many options. Data science is moving faster than ever, but its evolution is producing a wide range of machine learning tools that are increasingly complex to manage. Fortunately, we have with us Antonio Rodriguez, an AI/ML specialist solutions architect with Amazon Web Services, to help us make sense of it all. Antonio, hi, how are you? I'm good, how are you? I'm fine, thank you. Good to see you. Antonio, you have the honor of being our final speaker at the garage, the last thing between me and the final keynote. That's nice, thank you. Go ahead, Antonio, share your screen whenever you can. We're ready. I'm trying now. You are. Okay, so everything is all set. Thank you.

So first of all, thank you for having me. When I was preparing this session, to be honest, I was thinking: what could be useful to share with the community in a talk like this? And I think the main thing we can share is our experience, and the lessons learned that we have gathered in machine learning, especially during the last year, because this field is evolving really fast. One of those areas, one of the main pain points we see, is industrializing machine learning workloads in the cloud.

So first of all, before we start talking about the technology, because we all love to talk about the technology, we have to step back a little bit and think: why do we need this anyway? The first question is, what is industrializing ML workloads? To us, industrializing is basically going efficiently from development, or experimentation, in machine learning to production. That's what we mean by industrializing. And obviously, when we say efficiently, we want to do this in a way that is reproducible, fully automated, et cetera.

The challenge is that the AI and ML industry is moving faster than ever; you know it. We have a graph here showing the number of papers submitted per day: around 100 new papers, and that figure is actually from last year, so we have probably grown exponentially from there. So there is a huge amount of tools, a huge ecosystem. Don't try to read this slide, because there is no point in that. The thing is, there are many options to choose from, not only in open source, but also in the cloud and from third parties. And often we feel like this, right? The imposter syndrome in us increases heavily with all this technology at our hands.

So for us it is very important to help our customers decide on the right architectures, and to understand that this can get really complex and can take a lot of time. We have to try to keep it simple, and we have to stop reinventing the wheel. That is going to be our mantra during this session when we talk about moving ML workloads to production. And obviously, in doing that, we want to increase efficiency as much as possible and reduce the time to market as much as possible. And how do we get there? Well, we basically get there with the cloud, which already has a broad set of tools to help you with that, and with automation, because we definitely want to automate as much as possible. That brings us to a concept called MLOps. And what is MLOps? Well, unless you have been living under a rock, you have probably heard about MLOps already.
If you know DevOps, which is the combination of development and operations, moving in that continuous cycle of development, experimentation, packaging, release, configuration and feedback, MLOps is simply adding a third wheel to this: playing with data, experimenting in your machine learning workloads, training models, generating those models, and then pushing those into the DevOps wheels. Okay, so that's pretty much the summary of what MLOps is. And there is a nice graph here from A Cloud Guru that says that if we keep going down this path, we are going to end up having a set of Ops for everything, right? Because we have DevOps, SecOps, MLOps and all of these things. But the point is that, at the end, we should follow at least a set of best practices that we understand for MLOps.

In Amazon, we like to say that we have best practices until you find better ones, because everybody understands the technology, the world, and MLOps in general in different ways, and that's fair. I mean, if you have your own view on how your company should address MLOps, then you should follow that. But I will tell you the ones that we know, the ones that we have seen in our customers, especially this last year, with the evolution of machine learning towards more of an MLOps way of working.

The first thing is that you want to be agile in this implementation. Another mantra that we have in AWS is: fail fast and iterate often. We are not afraid of failing, but we have to do it fast. You don't want to spend three months on a project, with a full team of data scientists working on it, and then find out that you are going nowhere, right? You want to fail really fast and then iterate until you succeed. That's the idea. The other thing is that we want to avoid a few design antipatterns, and I will talk to you about a few of them. And we want to follow the recommended principles, which are very similar to the principles in the cloud in general, but I will give you the ML flavor we have of them. And eventually we want to automate, automate, automate as much as possible: infrastructure as code, nothing in the console, nothing in the graphical UI, unless it is a test, a demo, an MVP, or perhaps presenting something to the CTO or the CEO of the company, right? In the end you want to have everything automated.

So the first design antipattern that we have seen is what we call superhero dependence. That's pretty much when you have one data scientist, or a couple of data scientists, who own the full project end to end. They start a POC, they experiment, they understand everything in the project, and then they push it to production. What happens is that this person becomes the most important person in your company. And that's not the bad thing; the bad thing is when this person moves responsibilities to another team, takes on broader responsibilities, or leaves the company. What happens then? We have a big mess in production, right? You want to avoid that. You want to decouple the responsibilities within your teams as much as possible with regard to the steps of the machine learning pipeline.

The other antipattern is the deeply embedded failure. We normally have projects that are complex, that require artifacts in machine learning, code, scripts, a lot of experimentation and parameters. You don't want to hard-code anything. You don't want to have dependencies anywhere.
So you want to be as agnostic as possible. One example of that: we have customers that start working with TensorFlow or with PyTorch, and then at the end they change their mind and say, well, maybe I just want to work with, I don't know, PyTorch, MXNet or something like that. So you want to be flexible on the frameworks, flexible on the parameters. You want everything to be based on variables and on repositories (there is a small sketch of this idea after this part).

The other antipattern we see is when you don't have lifecycle management for your machine learning models in production. So you run a project, it's a success, you push it to production and then you forget about it, right? Wrong. Any model, I would say any model that you have in real life, is eventually going to lose accuracy in production. And that's natural, simply because people change their behavior, because we have more data, because things keep moving. So it's normal that the accuracy is going to decrease. You must have a method for monitoring the accuracy in production, and also for automating how you manage the drift, the change that you see in general.

Okay. So, this slide on machine learning code: I think most of the cloud vendors show this slide in some way, because it's really good. It shows you where the machine learning code sits within the full end-to-end workflow of pushing a machine learning project to production, and it's really a small piece. In reality, when you're starting a project and you are in the phase of experimentation and research, the machine learning code is very important, because you're asking yourself the question: can I use machine learning to solve this problem? Is that the right approach? Because maybe an Excel spreadsheet can cover it, right? Is machine learning the right approach? Should I use deep learning or classical machine learning, and these kinds of things. But what happens when you push to operations? You finally have an experiment that is working, you are happy with the algorithm, you are happy with the metrics that you're getting, and now you have to push that to operations. Then the machine learning code is almost irrelevant, because that part is closed: you should have that code fixed in a repository and then focus on the rest of the elements in the workflow, right? How to verify the data, collect the data, extract features; how to manage the resources that I have; if I have to instantiate resources in the cloud, how to do that; how to analyze; what about the infrastructure, should I use GPUs or CPUs; how do I monitor this in production, monitoring performance, latency, drift and these kinds of things. So it's very important that we focus on those other areas in this second phase.

Obviously, in all of this story there are a lot of teams participating, right? It's not only the data scientists or the data analysts. We also have people from security, for example, making sure that we comply with all the regulations that we need. We also have DevOps engineers, very important, who make sure that we follow the CI/CD best practices of our company. We also have systems engineers helping us with the infrastructure that we need for deploying the resources. And obviously we have business sponsors, otherwise nobody pays for all this, right?
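Here is that small sketch: a minimal, hypothetical example of the "no hard-coding" idea, where the framework choice, the hyperparameters and the data paths all come from a versioned config file instead of being embedded in the code (the my_project modules are made-up placeholders):

```python
import json


def load_config(path="config.json"):
    # the config file lives in the repository and is versioned with the project
    with open(path) as f:
        return json.load(f)


def get_trainer(framework):
    # dispatch on a config value instead of hard-wiring one framework everywhere
    if framework == "sklearn":
        from my_project.sklearn_trainer import train  # hypothetical module
    elif framework == "pytorch":
        from my_project.torch_trainer import train    # hypothetical module
    else:
        raise ValueError(f"unknown framework: {framework}")
    return train


if __name__ == "__main__":
    cfg = load_config()
    train = get_trainer(cfg["framework"])
    # everything below is driven by the config, not by constants in the code
    train(data_uri=cfg["data_uri"], **cfg["hyperparameters"])
```

Swapping PyTorch for MXNet then means changing one line of config plus adding a trainer module, without touching the rest of the pipeline.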
So we need everybody participating. This is a collaboration, and normally a transformation that you have to make in your company if you are a manager and have to decide the structure that your teams must have. I have seen many customers this year shifting towards transversal teams that collaborate with each other, and also moving towards having something like a cloud center of excellence or a data science center of excellence for machine learning, and things like that.

In the end, any project that you do with machine learning and MLOps nowadays should follow a set of principles. Those principles are basically: consistency, because you don't want variability between environments; if you test something and get the classic "it works on my laptop", you want to push it to production and make sure it is consistent, that it still works in the same way. Flexibility: we want to make sure we can accommodate any framework, any parameters, any algorithms, as I said before. Reproducibility is very important: I need to be able to recreate past experiments that I have done. Also reusability: I did a forecasting project, or a fraud detection project, or an image classification project, and I need to be able to take those components and reuse them in another project in a short time, so that we don't reinvent the wheel by writing everything from scratch again. Obviously we need scalability: we are in the cloud, everything is elastic, so we have to make sure everything can scale on demand. And auditability: we must be able to tell who did what, so governance is going to be very important.

For that, we normally recommend following a few tenets. The first one: you want to create automated and reproducible ML workflows. As I said before, everything must be as automated as possible. The second: you want to manage those models in a model registry. We normally have services for that, like Amazon Elastic Container Registry (ECR), for example; we work heavily with containers because they are a good way to minimize variance and stay consistent, and packaging and registering those models as images is a good way of doing that (there is a sketch of this at the end of this part). But you could also use other tools, for example Artifactory or things like that, right? We also want to enable CI/CD and infrastructure as code, so follow the best practices of DevOps; machine learning should be no different in that sense. Use your Jenkins, or in AWS use CodePipeline, CodeCommit and these kinds of tools. And obviously, once we push to production, we have to monitor everything, right? We have to check performance metrics and feed information back to the models. That's very important, because it will help us improve our models.
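A quick sketch of that registry tenet, assuming boto3 credentials are configured and using made-up repository and tag names; the actual build and push happen with the Docker CLI, shown here as comments:

```python
import boto3

ecr = boto3.client("ecr")

# create a repository to hold versioned model-serving images (idempotent sketch)
try:
    ecr.create_repository(repositoryName="fraud-detector")
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass

# building and pushing the image happens outside boto3, e.g.:
#   docker build -t fraud-detector:1.3.0 .
#   docker tag fraud-detector:1.3.0 <account>.dkr.ecr.<region>.amazonaws.com/fraud-detector:1.3.0
#   docker push <account>.dkr.ecr.<region>.amazonaws.com/fraud-detector:1.3.0

# audit what is registered: every tag is an immutable, deployable model version
for image in ecr.describe_images(repositoryName="fraud-detector")["imageDetails"]:
    print(image.get("imageTags"), image["imagePushedAt"])
```

The design point is that a model version becomes an immutable image tag, which is what keeps development and production consistent.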
Okay, so now we have seen a few of the principles; let's talk about technology, because we all love to talk about technology anyway. The first thing: if we look at a machine learning cycle like this, and obviously this is extremely simplified compared to the number of steps we normally have in a real cycle, I would say that for the first part, which deals with data collection, integration, preparation, cleaning and all of that, we have a broad range of services in the cloud, no matter the vendor you're using. If you're using AWS, you probably know Amazon S3 as the storage for doing ETL and working with Spark, EMR for using Hadoop clusters, Athena for running SQL queries on the fly on your data, Redshift as the data warehouse or columnar database for your information, and SageMaker: you can also do collection, integration, preparation and cleaning with SageMaker.

And I tell you what: in two weeks from now we have the annual AWS event for launching new services and new features, which is called re:Invent. This year it is going to be special, like any event in the world this year: it's free and it's remote, and it's going to run for three weeks until mid-December. I will leave you the link at the end of the session, and I recommend you attend, especially the keynote on December the 1st and the machine learning keynote on December the 8th. We have a ton of things about AI and ML that we are going to be releasing, and SageMaker is going to play a very important role in many parts of the pipeline, this being one of them, okay? So stay tuned for that.

But what about the rest of the pipeline? Well, the rest of the pipeline we pretty much cover with SageMaker already: visualization, analysis, feature engineering, training, parameter tuning, evaluation, hosting those models, batch inference, hyperparameter optimization and these kinds of things. To be honest, most of the features that we are going to be releasing are going to complement this even more, so again, stay tuned for the news on that.

If you're familiar with the AWS stack of services for machine learning in general, you probably know this slide. We have many options, from less managed infrastructure at the bottom, where we pretty much give you an image for virtual machines or for a container that comes preloaded with libraries and frameworks like TensorFlow, MXNet and PyTorch, and obviously with APIs and libraries like Gluon, scikit-learn, Horovod, Keras, etc. We have another layer, which is the platform, where we have SageMaker, the end-to-end platform that helps data scientists be more efficient and not reinvent the wheel. And the last layer, at the top, is pretty much the AI services, a set of fully managed services that you just consume through an API, intended to be very simple to use for specific use cases, for example computer vision, forecasting, personalization, fraud detection, search, chatbots, etc. And we have new members of this family coming very soon, as I said before. So you have a broad range of options: pure compute using Amazon EC2; managing that compute with containers, for example with Elastic Container Service or Elastic Kubernetes Service, whether you're using Docker or Kubernetes, and even Kubeflow integrations if you are using Kubeflow today for orchestrating your pipelines; the registry for those images that I mentioned before, ECR; and SageMaker as the platform for doing data science and machine learning.
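To make the platform layer concrete, here is a hedged sketch of what training and hosting can look like with the SageMaker Python SDK, assuming a training script called train.py, a placeholder IAM role ARN and example S3 paths:

```python
import sagemaker
from sagemaker.sklearn import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# the framework container, instance type and hyperparameters are all declarative
estimator = SKLearn(
    entry_point="train.py",            # your training script (assumed name)
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=session,
    hyperparameters={"max_depth": 8},
)

# training runs on managed, ephemeral instances; artifacts land in S3
estimator.fit({"train": "s3://my-bucket/train/"})  # assumed S3 path

# hosting is one call away; the endpoint serves inference behind an API
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[0.1, 0.2, 0.3]]))
predictor.delete_endpoint()  # clean up when done experimenting
```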
And let's assume that you are in your experimentation: you are using SageMaker, or any tool that you like, for the experimentation and training of your models; you are happy with your model and you are ready to push it to production. So you probably have something like this: you are working in your IDE, which can be SageMaker Studio, our own IDE, or locally, even with PyCharm or Visual Studio Code, and you are happy with your process. Right now, in experimentation, the data scientist says: I'm ready, I can push this to production, right? What happens then is that we have to orchestrate the steps that we have in this cycle. We managed to run it once; now we have to orchestrate the whole thing so that we can integrate it in production.

For doing this we have two options. One, we go with managed or third-party tools, for example Kubeflow, which I mentioned before, Airflow or Jenkins; we rely on repositories like Bitbucket or GitHub; we even integrate with partners like MLflow, etc. All of those are supported with SageMaker integrations today, because SageMaker is very modular: you can use individual pieces, for example training somewhere else and hosting in SageMaker, or the other way around, training in SageMaker and hosting somewhere else. You can choose. And with regard to Kubeflow, because it's very relevant these days and there are many people trying it, we normally support two ways of connecting with SageMaker. One, when you're using plain Kubernetes: the SageMaker Operators for Kubernetes, which are pretty much preloaded YAML templates that help you leverage SageMaker for specific tasks in your machine learning workflow. And the other option, if you are using Kubeflow Pipelines today: you can integrate through the SageMaker Components for Kubeflow Pipelines. The experience is actually seamless: you integrate your pipelines from the single pane of glass of Kubeflow while still leveraging SageMaker, the infrastructure in the cloud, all the power of the instances that we have there, features like auto-scaling of instances, and so on.

Okay, but what if you use the native AWS services for the orchestration? In that case, today you have options like CodePipeline, CodeCommit, Step Functions, which is our workflow generator and manager, Lambda, which is serverless functions, CloudFormation, which is infrastructure as code, and obviously SageMaker. And I tell you what, we have many surprises coming on this in SageMaker very soon, so again, stay tuned, because it's going to be very important for this area. Normally, if you integrate those tools in the workflow as we have it here, you will have a state machine in Step Functions to define your workflow, and it can be triggered by events: for example, whenever I put new data in an S3 bucket in my storage, I want to trigger an automatic retraining of my model. In the same example: I want to do preprocessing, training and evaluation, register a new model, and propose that model to be pushed to production, right? That's going to be orchestrated by the workflow and then proposed through a CI/CD pipeline in CodePipeline, for example. Step Functions plays an important role here because it pretty much helps us build this pipeline; today it supports what we call the Data Science SDK, which is a way to integrate with SageMaker, and again, we are adding a management layer on top of all of that so that you can do everything from a single pane of glass, okay? (There are sketches of both pieces right after this part.)
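First, a hedged sketch using the AWS Step Functions Data Science SDK (the stepfunctions package), reusing an estimator like the one sketched earlier; the role ARN, bucket and names are placeholders:

```python
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import Chain, ModelStep, TrainingStep
from stepfunctions.workflow import Workflow

# values supplied per execution, so nothing is hard-coded in the state machine
execution_input = ExecutionInput(schema={"JobName": str, "ModelName": str})

train_step = TrainingStep(
    "Train model",
    estimator=estimator,                      # e.g. the SKLearn estimator from before
    data={"train": "s3://my-bucket/train/"},  # assumed S3 path
    job_name=execution_input["JobName"],
)

model_step = ModelStep(
    "Register model",
    model=train_step.get_expected_model(),
    model_name=execution_input["ModelName"],
)

workflow = Workflow(
    name="retrain-pipeline",
    definition=Chain([train_step, model_step]),
    role="arn:aws:iam::123456789012:role/StepFunctionsWorkflowRole",  # placeholder
)
workflow.create()
workflow.execute(inputs={"JobName": "retrain-001", "ModelName": "demand-model-001"})
```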
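And a minimal sketch of the event trigger just mentioned: a Lambda function subscribed to S3 ObjectCreated notifications that starts that state machine (the ARN is a placeholder):

```python
import json

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = (
    "arn:aws:states:eu-west-1:123456789012:stateMachine:retrain-pipeline"  # placeholder
)


def handler(event, context):
    # fired whenever new data lands in the bucket; one execution per new object
    for record in event["Records"]:
        data_uri = "s3://{}/{}".format(
            record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]
        )
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(
                {
                    "JobName": "retrain-" + context.aws_request_id,
                    "ModelName": "demand-model-" + context.aws_request_id,
                    "DataUri": data_uri,
                }
            ),
        )
```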
So how does the full picture look? Let's imagine we now go through the full architecture. You have three separate environments, which is also good for reducing the blast radius you can have in case of any error, etc. You have a development and staging area, an automation area, and finally a pre-production and production area. Development and staging is normally where the data scientists are going to be working with your models: they define those state machines, they prepare everything in experimentation, etc., and when they are happy they push to staging and package all the artifacts and all the lineage of the model into the registry, and then tell the DevOps folks: okay, I have this package ready to be pushed to production. The automation account is going to be central to your team, while development and staging are probably split per project; again, this is what we have seen in our customers, but every customer can be different. In the automation account they will say: okay, I am now going to run a release pipeline where we basically put an endpoint in production with our model preloaded, so that we can respond to inference through an API. That would be a typical case: obviously you need to respond to requests from a client that you could have in a front end, or outside in another system, etc.

The important thing here is that we can even monitor drift, and some parameters on how well our model is doing in production, or whether we have any problems in production, through SageMaker Model Monitor. That's one of the features we have in SageMaker among the ton of things we support today (there is a sketch of it at the end of this part). This looks the same if we have a batch inference job: let's say it's offline, so we don't need an endpoint up 24/7 in production; we can just run inference on a file and then write the results. It's the same schema; the difference is that we run a batch inference job in our pre-production or production account with our dataset.

Cool. So this is an example of a real use case from a customer; I hope you can see it fine on the slide. This is how they split the teams, and that's very important, because you want to split the responsibilities across your teams as well. As I said before, you have an automation account at the top, where the DevOps and CI/CD engineers are the owners: they define and approve the pipelines, they define the structure, they handle the model registry, and they collect insights, metrics and logs. Normally you also have operations engineers, who approve things into production and consume the metrics to check whether we have any incident in production. On the lower side we have the development and staging accounts, where the data scientists are mainly going to work, doing their experimentation and pushing things to be deployed in production; and then the pre-production and production accounts for hosting the actual inference that we are doing. At the bottom you will see what we call the data platform: depending on how mature your company already is, you might have a data lake, a modern data lake based on the cloud with Amazon S3, for example, or any other provider. If that's not the case, you probably have something more traditional here, like a traditional database, or even separate storage containers for each one of the areas of your company, or things like this.
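Here is that drift monitoring, sketched with SageMaker Model Monitor; this assumes an endpoint already deployed with data capture enabled, and all paths, names and the role are placeholders:

```python
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    DataCaptureConfig,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# 1) capture live requests/responses; passed to model.deploy(data_capture_config=...)
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/data-capture/",  # assumed path
)

# 2) baseline the training data so there is something to compare against
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",  # assumed path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline/",
)

# 3) compare live traffic against the baseline on a schedule; violations
#    (schema changes, distribution drift) are reported to S3 and CloudWatch
monitor.create_monitoring_schedule(
    monitor_schedule_name="drift-check",
    endpoint_input="my-endpoint-name",  # assumed endpoint name
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```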
Another example, from another company, was a retail demand forecasting use case. They built this concept that we see very often, which they call the ML factory; algorithm lake is another name they use for it. The idea is that every artifact that every data scientist writes, every piece of Python script, every model, every algorithm, is stored in a specific bucket in S3, and they call that the ML lake or the algorithm lake. Then they define some templates in JSON, in YAML, etc., where they basically define the pipelines that they want to follow. For example: I want to use this preprocessing script that I know I have from a previous project, I want to connect it with this regression algorithm, or with this classification algorithm, etc. That's what they call the ML factory, and it's a way of scaling, so you don't have to reinvent or rewrite all those artifacts again (there are sketches of these ideas at the end of this part).

What do you get with this? Well, you get efficiency. This customer, a retail customer from Europe as I said, is doing demand forecasting with more than 600,000 models in parallel, and they reduced the whole process from 14 hours to less than an hour and a half. The cost came to less than $800 per month for the full thing, while before they were spending $20,000 per month on this. There are even services that barely show up in the bill, like Lambda: Lambda is pretty much free here because you pay per execution, and they were actually hosting the ML models in Lambda functions, so that was very efficient.

The final thing before we wrap up is that we need to balance the needs of the ML builders, the data scientists, who want to be agile: they want to be flexible, move very fast, innovate, create new things, take the latest thing we have heard of, any new algorithm, GANs, reinforcement learning, etc. On the other side you have the people from cloud IT, who have to establish governance: they want to control cost, make sure that we comply with the regulations, make sure that nobody messes up the security. So obviously we have to find a compromise between the two, and it's easy to do that today in the cloud. With AWS you could have a multi-account structure with something like Control Tower, and there you can use services like Service Catalog, for example, to predefine some templates that give self-service access to the data scientists and the DevOps teams for this kind of project. So whenever I'm a data scientist and I want to start a new project, my forecasting project or whatever, I just click in my self-service console in Service Catalog and I deploy an environment that is already pre-approved by my governance people, has permissions to my staging and development environments, and does not have permission to push anything to production, so I'm safe on that; the data I can access is also controlled, etc. For the other people, the DevOps and systems teams, you give them permissions to the actual production environments and the metrics, logs, etc., but you don't give them permissions to mess with the models, the algorithms and the data scientists' work. So that's kind of how we do it at scale in companies today with AWS.
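Those ML factory sketches: first, a purely hypothetical example of what one of those pipeline templates and the code that consumes it could look like; the bucket, keys and script names are all made up:

```python
import boto3

# hypothetical pipeline template; in practice this would live as JSON/YAML
# in a repository, next to the algorithm lake
TEMPLATE = {
    "preprocess": "s3://ml-lake/scripts/clean_sales_data.py",
    "algorithm": "s3://ml-lake/algorithms/xgboost_regressor.py",
    "hyperparameters": {"eta": 0.1, "max_depth": 6},
}

s3 = boto3.client("s3")


def fetch_artifact(s3_uri, dest):
    # pull a reusable script or algorithm out of the algorithm lake
    bucket, key = s3_uri.replace("s3://", "").split("/", 1)
    s3.download_file(bucket, key, dest)


fetch_artifact(TEMPLATE["preprocess"], "preprocess.py")
fetch_artifact(TEMPLATE["algorithm"], "train.py")
# from here, an orchestrator wires preprocess.py and train.py into the workflow,
# passing TEMPLATE["hyperparameters"] through, so nothing is rewritten from scratch
```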
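And second, since they hosted models in Lambda: a minimal sketch of a Lambda handler serving a small scikit-learn model, assuming the artifact lives in S3 and that joblib and scikit-learn are bundled with the function:

```python
import json

import boto3
import joblib  # must be packaged with the function or provided via a layer

s3 = boto3.client("s3")
_model = None  # cached across warm invocations, so S3 is hit only on cold starts


def _load_model():
    global _model
    if _model is None:
        s3.download_file("my-models-bucket", "demand/model.joblib", "/tmp/model.joblib")
        _model = joblib.load("/tmp/model.joblib")
    return _model


def handler(event, context):
    features = json.loads(event["body"])["features"]
    prediction = _load_model().predict([features]).tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

You pay per invocation only, which is how the bill for this piece stayed close to zero.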
So, key takeaways from this session. First of all, do not reinvent the wheel; be efficient. You have to ask yourself: is my company doing engineering, are we building platforms, are we building software? Or are we building use cases, are we building solutions? That's going to be very important, because researching Kubeflow, researching Metaflow, which is the one from Netflix, researching these kinds of areas that are very green, that are still not mature, will take you a lot of time, and you have to ask yourself whether you have the time to spend on that. Otherwise, just rely on the tools that already exist, from the cloud and from third parties. So again: do not reinvent the wheel, be efficient. The other thing: avoid the antipatterns we mentioned before, be agile about them, and obviously follow the best-practice principles in your ML designs, at least the ones we know until now, right? Consistency, flexibility, reproducibility, reusability, scalability and auditability.

Final slide before we go to questions. I leave you some resources here; you can take a screenshot if you want, and you can also scan the QR code that I leave there, which is pretty much for registering for re:Invent. re:Invent is running from November 30 to the 18th of December. I cannot tell you specifically, I'm not authorized to tell you, the good things that we are going to have, but I guarantee that there is going to be a ton of good stuff coming for machine learning and artificial intelligence, so do not miss re:Invent; that's my piece of advice. You also have our training and certification portal, sample notebooks, our machine learning blog, our documentation, and obviously the Well-Architected Framework white papers; there is one specifically for machine learning if you want to check it. With that, I think we are on time, and you have my Twitter handle there if you want to connect, if you want more information or those slides that I showed before, or to have a meeting. Feel free to contact me and I will be happy to attend.

Antonio, that's great, you've been so punctual, thank you so much. That was a very thorough talk: you outlined all the dos and don'ts, and I found the use cases very illustrative; we saw the benefits, the before and the after, so that was very helpful. Thank you very much indeed. Also, I believe you mentioned there were some keynotes coming up on December the 1st and the 8th; do you want to just remind us where we go to see those? Yeah, as I said before, you can register on the AWS events site for re:Invent. The keynote from Andy Jassy, our CEO, is on December the 1st; that's where we release the most important things, obviously. And the other one is going to be on December the 8th, from Swami, who is our VP for machine learning; it is going to be focused exclusively on machine learning, and it's going to be very fun to watch as well.

Okay. You showed that very impressive graph in the early part of your talk about machine learning tools and how they're growing exponentially. That can't carry on forever, presumably; how do you see this landscape developing in the near future? So, I think it is going to consolidate eventually, right? But right now it is a wild world, so it is complex to bring order to things, and we see this in customers every day. So again, my recommendation is to not reinvent the wheel; the options are there already. Go with any cloud vendor, and I'm not saying this because I'm from AWS, but with any cloud vendor you have the options already for doing the end-to-end with managed services. That will lower your TCO and will definitely make the time to market shorter for you.

Great. There's a more specific question for you: in your view, what are the main challenges in applying MLOps in edge computing scenarios? So yeah, edge computing is a scenario that is very common these days, and obviously we are making an effort to push the services from the cloud to the
edge. But there are specific use cases where this makes sense and other use cases where it doesn't. The main concern you will have there is compute power: on edge devices you normally have very little compute power, and running inference for some specific use cases is going to be a real challenge. So again, we have news coming on that as well to help with it, and it's also an area that is accelerating a lot in machine learning. Okay, Antonio, I think that's pretty much all we have time for right now. You've given us your contact details, you've told us where we can get your slides, so that's fantastic. I really enjoyed that talk. Thank you so much, and hope to see you again soon.