This talk is about how we have built scalable and robust near real-time machine learning systems at Episodes. Episodes works in healthcare and healthcare analytics, and I am part of the team there that built this system. The talk will focus on how we built out a near real-time machine learning system, specifically the pre-computation feature pipelines and the ML serving platform we have used. Here is a little about me before we move on: I am a lead data engineer at Episodes, and so far in my career I have worked on building distributed data processing pipelines and machine learning systems, and on setting up MLOps practices in the team. The agenda for today's talk is that we will start with the challenges we faced in building one of our recent ML pipelines at Episodes, then how that led to the idea of building a pre-computation feature platform, then which ML serving platform we are using, and ultimately a complete overview of the ML model serving pipeline at Episodes. This ML pipeline started when the data science team at our organization came up with the problem statement of deploying transformer models, so this talk will specifically focus on how we deploy complex models. When the team introduced the transformer models, we picked one up and deployed it into our then-current ML serving pipeline, which was more of a real-time API system. Once that transformer-based model was in the pipeline, we saw that we were facing high latencies, more than a couple of seconds. In a real-time API framework, a latency of multiple seconds is a problem, and you cannot take such a model to production. And when it comes to complex models, an essential ingredient is that you have to use GPUs, and GPUs are really expensive. Take the first point, the high latency. If I am getting high latency, one simple solution is to add more replicas to my API and make the system more scalable. But when we are talking about GPUs, we cannot scale directly from, say, tens to hundreds of GPUs just to serve more load, because that would blow past the budget of our whole compute. So that was the first problem: we were seeing high latencies, and we wanted to scale with the incoming load and bring serving time back into milliseconds. The next problem was that the model sizes were too large; they were in the GBs. One model they came up with was itself 4 to 5 GB, and the usual way of deploying machine learning models into production, at least at our organization, is wrapping them into containers. Once the model is wrapped into a container, the whole Docker image size goes up to around 10 GB. So if we are talking about scaling the API replicas, first I have to scale up the GPU machine.
If you have worked on the cloud, you have seen that even scaling from 1 to 2 GPU machines takes approximately 4 to 5 minutes just to get the GPU machine, and if your Docker image is around 10 GB, bringing up the container takes an additional 3 to 4 minutes. That means the time to bring up one more replica was around 9 minutes. At Episodes we always believe in scaling the infrastructure according to the load itself; we do not over-provision, and we always keep the architecture cost-optimized. So we have to make sure that whenever load comes in, we first do the calculation and then scale the instances accordingly. But if it takes 9 minutes to bring up one replica, that only adds to the first problem, the high latency for serving requests. Also, in the whole NLP engine there are multiple machine learning models working together to produce the outcome, so we have to make sure we are not losing data when a request or a piece of data goes from one model to another. One required feature of the whole machine learning pipeline is data lineage management: whenever a request goes from model one to model two, we should know what the state of that request was, whether it was a success or a failure. That was the last challenge we had in designing our ML pipeline. So the first problem is high latency for the inference task, and we picked up the task of doing a deep dive into what happens during an inference task. Here is a general overview. Whenever we have an inference task, we get data; since we are on AWS, I have shown an S3 bucket for that. We get the data on an S3 bucket, pick it up, do data preprocessing, then feature generation, load the model, feed the features into the model input, and ultimately the model gives out a prediction, which we post-process and upload to an output bucket. Those were the broad steps of an inference task, and what we noticed is that the first two steps, data preprocessing and feature generation, were CPU-bound tasks, yet they were happening on a GPU-based machine and were taking approximately 30 to 40 percent of the total inference time. So the first conclusion we came to after this deep dive was that we had to remove those first two steps from the inference task and make them a pre-step to the inference task. That gave rise to the idea of creating a pre-computation feature platform: whenever data comes in, first we do the feature computation, and only then send the request to the model and get the prediction. So first, let us talk about the benefits of the pre-computation feature platform. Since we have multiple models working together to produce predictions on a piece of data, there are also cases where these models share a few features, and there we have reduced the computation load. In the last talk we also saw a mention of feature stores.
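Just to make that split concrete, here is a minimal Python sketch of the separation, under some assumptions: the bucket layout, the preprocess and generate_features helpers, and the model interface are hypothetical placeholders, not our actual code. The point is only that the CPU-bound steps run as a separate CPU-only job that persists features, so the GPU job starts from ready-made features.

```python
import json
import boto3

s3 = boto3.client("s3")

def preprocess(raw_bytes: bytes) -> str:
    # Placeholder for the real cleaning logic (tokenisation, de-identification, ...).
    return raw_bytes.decode("utf-8").strip().lower()

def generate_features(text: str) -> dict:
    # Placeholder feature generation; the real pipeline builds NLP features here.
    return {"text": text, "n_tokens": len(text.split())}

# --- CPU-only pre-computation job (runs on cheap CPU instances) ---
def precompute_features(raw_bucket: str, raw_key: str, feature_bucket: str) -> str:
    """Pull raw data, do preprocessing + feature generation on CPU,
    and persist the features for later inference."""
    raw = s3.get_object(Bucket=raw_bucket, Key=raw_key)["Body"].read()
    features = generate_features(preprocess(raw))
    feature_key = raw_key.replace("raw/", "features/") + ".json"
    s3.put_object(Bucket=feature_bucket, Key=feature_key, Body=json.dumps(features))
    return feature_key

# --- GPU inference job (no preprocessing, starts from ready features) ---
def run_inference(model, feature_bucket: str, feature_key: str) -> dict:
    """The GPU-side task now only loads features and predicts, which is
    where the 30-40 percent time saving comes from."""
    features = json.loads(
        s3.get_object(Bucket=feature_bucket, Key=feature_key)["Body"].read()
    )
    return model.predict(features)  # placeholder model interface
```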
In that feature store, we first generate all the features at once and store them there, and then each model retrieves whichever features it requires. That has reduced the overall computation load and the resources needed per request, and since we are pre-cleaning the data, it has also improved accuracy on the model side. The next benefit is increased scalability, because now we are reusing features across models instead of regenerating them for every prediction request we send to a model. And in terms of inference time, since we have moved the whole feature generation process out of the inference task, we are directly saving 30 to 40 percent of the inference time, which means we are actually saving 30 to 40 percent of our GPU time and GPU cost. So now we have moved the feature generation part out to a feature platform, and the next question is what our ML serving platform will be. In the final architecture we will see that the latency was still not a few milliseconds; it was still around one second or so for this complex model, so we were very clear that we could not go with a real-time paradigm. At this point we knew we had to go for an asynchronous API paradigm for our ML serving platform. After running experiments on the open-source Cortex, Seldon Core, and AWS SageMaker, we came to the conclusion that AWS SageMaker fit the whole problem space we were trying to solve. We also believe in keeping our team lean, so what we wanted was a managed service, so that our engineers spend less time maintaining infrastructure and more time innovating on it. What we needed in our ML serving platform was, first, support for complex, heavy models: it should be able to keep the models outside the container and help us retrieve them from there. SageMaker has its own model registry where you can keep your model outside the container, so the container or Docker image size no longer includes the model size, which is an advantage here. Second, it should tolerate high-latency inferences. For this we were sure we could not do with the real-time APIs. The real-time SageMaker API was also giving us results, but we always believe in scaling from 0 to 1 whenever there is load and not keeping resources up when there is no load. SageMaker asynchronous endpoints help us there too: they have their own internal queue, so you just invoke the asynchronous endpoint, it first saves your request, then does the inference, and then saves your output to an S3 bucket. It has its own internal queue, and for scaling out you just have to define your own auto-scaling parameters. In terms of data lineage management, there was no direct support from the SageMaker side, so we built our own pipeline to support data lineage management, which we will see on the next slide.
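To show what invoking such an endpoint looks like, here is a short sketch using the SageMaker Runtime API for asynchronous endpoints, invoke_endpoint_async. The endpoint name and S3 paths are placeholders, not our real resources; the call itself is the standard boto3 interface.

```python
import boto3

# SageMaker Runtime client; the endpoint name and S3 locations below are
# placeholder names for illustration only.
smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_async(
    EndpointName="episodes-transformer-async",
    InputLocation="s3://my-feature-bucket/features/request-123.json",
    ContentType="application/json",
    InferenceId="request-123",        # lets us correlate status notifications later
    InvocationTimeoutSeconds=3600,
)

# The call returns immediately; the request sits in the endpoint's internal
# queue, and the prediction is written to the endpoint's configured S3 output path.
print(response["OutputLocation"])
print(response["InferenceId"])
```

The key property for us is that the caller is never blocked on the GPU: the queue absorbs bursts while instances are being provisioned.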
Scalability, reliability, and security are all covered under AWS service usage, and in terms of cost optimization, specifically with the asynchronous type of endpoints in SageMaker, you have the flexibility of going from 0 to 1. Suppose there is no load at a given point in time; then you can keep your instances at 0, and whenever load comes in, since the star feature of asynchronous endpoints is their managed internal queue, the request first goes into the queue, SageMaker brings up a GPU instance depending on how you have defined your auto-scaling policies, and then the request is served. Even here the whole GPU provisioning time is around 4 minutes, because that is simply how cloud provisioning works, but you have the flexibility of keeping the request saved until it has been served and inferenced. Okay, so this is the complete overview of the ML model serving pipeline that we have built at Episodes. We were talking about the feature store and building a pre-computation feature platform. Over here you see that we are using Apache Kafka to first persist the data events, then consume those data events to compute features from those data points, and ultimately save the feature events to a feature store or directly to an S3 bucket. Once the features are pre-generated, we have integrated a Lambda that listens to the Kafka topic, picks up the data points from there, and invokes the Amazon SageMaker asynchronous endpoints. With the SageMaker async endpoints you have the flexibility of delivering your result status to SNS topics: say you are sending a request X to endpoint A; whenever the state of that request changes, say to failed or success, you get a notification on an SNS topic. So for the data lineage management part, you are getting all that information on the SNS topic; we can directly consume that topic, pull out the information, and drop it into a database, and that way machine learning engineers can directly query the database layer to know what the status of their request was. In terms of near real time, nothing is happening strictly in real time, but we have kept the whole architecture event-driven: as soon as your data is dropped on the event topic, the whole pipeline is triggered for that particular message, which is why it is a near real-time architecture. And in terms of data storage, SageMaker async itself uploads your inference output to an S3 bucket, so that is how we are dealing with the data. So that is the complete overview of how we have built out the ML serving pipeline at Episodes. Thank you.
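As a rough sketch of the scale-from-zero behaviour described above: async SageMaker endpoints are scaled through Application Auto Scaling on the queue backlog per instance. The endpoint and variant names and the target value below are placeholders; the API calls and the ApproximateBacklogSizePerInstance metric are the standard mechanism.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder endpoint/variant names for illustration.
resource_id = "endpoint/episodes-transformer-async/variant/AllTraffic"

# Allow the variant to scale all the way down to 0 GPU instances when idle.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Scale out when the per-instance backlog in the internal queue grows.
autoscaling.put_scaling_policy(
    PolicyName="backlog-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target number of queued requests per instance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName",
                            "Value": "episodes-transformer-async"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,   # cooldowns to avoid flapping while GPUs provision
        "ScaleOutCooldown": 300,
    },
)
```

AWS also documents pairing this with a step-scaling policy on the HasBacklogWithoutCapacity metric so the endpoint can reliably leave zero instances; that part is omitted here for brevity.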
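And to make the event-driven glue between Kafka, Lambda, the async endpoint, and the lineage database concrete, here is a rough sketch with several assumptions flagged: the endpoint name, table name, and the feature_s3_uri / request_id fields on the Kafka event are hypothetical placeholders, and DynamoDB stands in for "a database". The MSK-trigger event shape (base64-encoded record values under event["records"]) and the SNS notification fields (inferenceId, invocationStatus) follow AWS's documented formats, but treat the field handling as illustrative, not as our exact production code.

```python
import base64
import json
import boto3

smr = boto3.client("sagemaker-runtime")
dynamodb = boto3.resource("dynamodb")
lineage_table = dynamodb.Table("ml-request-lineage")   # placeholder table name

def kafka_handler(event, context):
    """Lambda triggered by the Kafka (MSK) feature topic: for every feature
    event it invokes the SageMaker asynchronous endpoint."""
    for records in event["records"].values():          # records keyed by topic-partition
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            smr.invoke_endpoint_async(
                EndpointName="episodes-transformer-async",  # placeholder
                InputLocation=payload["feature_s3_uri"],    # assumed event field
                ContentType="application/json",
                InferenceId=payload["request_id"],          # assumed event field
            )

def sns_lineage_handler(event, context):
    """Lambda subscribed to the endpoint's success/error SNS topics: it drops
    each status change into a table that ML engineers can query directly."""
    for rec in event["Records"]:
        msg = json.loads(rec["Sns"]["Message"])
        lineage_table.put_item(Item={
            "inference_id": msg.get("inferenceId", "unknown"),
            "status": msg.get("invocationStatus", "unknown"),
            "output_location": msg.get("responseParameters", {}).get("outputLocation"),
            "raw_notification": json.dumps(msg),
        })
```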