Hi guys, I'm Nikunj. I'm here to talk about the platform we have created at Zomato for real-time machine learning inference. Just a show of hands, how many of you know something about machine learning? Cool, about half the crowd. So I'll tell you a bit about machine learning before diving into what we have created and why we have created it.

So what is machine learning? Machine learning is a simple problem: given an input, produce an answer. Let's take an example. Given an image, you have to detect whether this image is of a car or not. The general process is: you take the image and extract a lot of features from it. In this case, the features can be the RGB values of each of the pixels. All these pixel values then go into an algorithm. Treat this algorithm as a black box; it can be something as simple as a linear combination, and it has learned, basically by going through all the past images, what kind of pixels contribute to a car image. It then returns whether this is a car or not a car. So: a simple input, the input is converted into a lot of features, the features go into an algorithm, and the algorithm gives you an answer.

These features can sometimes be very complex, and there can be lots of them. Sometimes the features are real-time, and sometimes they are static. Take the example of a personalization or recommendation algorithm. At Zomato there are a lot of personalization algorithms that recommend to a user which kind of restaurants they should see. Here, a real-time feature can be the last restaurant the user has just clicked. Maybe you have just clicked on Haldiram's and you are looking for more of a chaat kind of place, so maybe the next recommendation should be Bikanervala; those are real-time features. Then there are static features, which are basically the restaurant attributes, as you can see in this case: what kind of cuisine a restaurant serves, what kind of dishes it serves. And the models, the algorithm I have shown you as a black box, can be simple or very complex. Some well-known algorithms are regression-based, and tree-based ones like XGBoost.
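To make that flow concrete, here is a toy sketch of the input, features, model, answer pipeline; the feature values and labels are made up purely for illustration:

```python
# Minimal sketch of the feature -> algorithm -> answer flow described above.
# The feature values and labels here are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one past image, flattened into pixel features (e.g. RGB values);
# each label says whether that image was a car (1) or not (0).
X_train = np.array([[0.9, 0.1, 0.3], [0.2, 0.8, 0.7], [0.8, 0.2, 0.4]])
y_train = np.array([1, 0, 1])

# The "black box": a simple linear model that learns from past examples.
model = LogisticRegression()
model.fit(X_train, y_train)

# New input -> extract features -> ask the model: car or not a car?
new_image_features = np.array([[0.85, 0.15, 0.35]])
print(model.predict(new_image_features))  # e.g. [1] -> "car"
```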
Cool. So now I'll talk about some of the machine learning problems we are trying to solve at Zomato. Some are very simple, some are very complex. One of the problems is predicting kitchen preparation time. As some of you may know, Zomato is an order aggregation platform: we receive orders from users, ask the merchant or the restaurant to prepare them, and then allocate a rider to take the order from the restaurant to the user. In this whole journey, a very complex problem is to predict how much time a restaurant is going to take. Given all the dishes in an order, we want to predict the time the restaurant will take to prepare it, so that we can allocate or dispatch the rider only at that point of time. If the dishes are going to take 50 minutes, there is no point assigning the rider as soon as we receive the order, because the rider is just going to wait there for 50 minutes.

The second problem is predicting rider assignment time. At peak time, when a lot of orders are coming in from users and there are a lot of riders around, we want to predict how much time it will take us to allocate a rider to an order. Third, as I told you earlier, is personalization and recommendations: what kind of restaurants we should show to the user when they open the app. Then obviously there is fraud detection; there is a lot of fraud going on by users, merchants, and riders. Then photo classification: on the Zomato platform, users upload a lot of photos, and we run different algorithms to classify whether a photo is of food, a menu, or just the ambience. And yes, rider allocation.

Over the last six months we have hired a big data science team, and they have started working on these kinds of problems. But from an operational point of view, we were facing a lot of problems. First of all, every model a data scientist creates depends on an engineer to deploy it in production for real-time inference. The data scientists keep creating model after model, but there were no standard workflows so that the models could be easily deployed in production. And with each model you need to create some features, right? Some real-time features, some static features. The data science team does all their offline analysis using OLAP queries, decides what is right for them, and constructs the model. But when you take that model into production, you need a lot of cron jobs to compute those features. Six months ago there was no standard workflow for any of this. It was all one-off models: a data scientist would come to us saying "I need to deploy this," and then some engineer would write cron jobs to compute the features. Some engineers stored them in Redis, some stored them in MySQL; there was no standard database where we could put all the features. It was all very ad hoc.

So one day we sat together as a team and thought about the ideal platform we should build, so that all the data scientists and engineers could follow it to construct features and deploy models. These are the simple requirements we wrote down. The first and foremost requirement: it should be easily usable by all the data scientists in the company. And I'll tell you, data scientists are not really software engineers; they know a bit of coding, but some don't have the ability to write production-ready code. Second, the platform should let all the data scientists and engineers create real-time features at scale. Third is auditing; think of it like a GitHub for your models. Everybody has used GitHub's functionality, and we need the same functionality for models: auditing and versioning. And the last goal, which is very important for us, is that all real-time predictions should take less than 50 ms, not more than that.

I'll walk you through this platform using a case study: kitchen preparation time prediction. I'll tell you how we created and deployed this whole architecture to predict the kitchen preparation time, which I've already told you about.
So we have an in-house XGBoost-based model which takes in all the features of a restaurant and predicts the kitchen preparation time for that order. It includes static features as well as real-time features. Static features are those that can be computed at some frequency, usually a day or a week; in the case of kitchen preparation time, a static feature can be the historical kitchen preparation time for that restaurant during that meal time. Then there are a lot of real-time features; a simple one is how many orders are currently being served by that restaurant, basically the running count of orders in that restaurant over the last 30 minutes.

Yeah, so this is the proposed architecture. It's a bit overwhelming, but I'll go through it step by step. The first thing we need is a place where data scientists can create and store their models. Going from the top, that is the model manager, where you can just create your models and store them. The second thing is the model inference platform, where you can deploy your models; it's some Beanstalk-like service where I can take a model and deploy it on pods, or deploy it on Kubernetes. Those are the first two components.

Then, going from the bottom left, there is real-time feature computation. My backend systems send a lot of events: this order was created, this order was delivered, this order is done. I need an aggregation platform or a streaming platform where I can aggregate all these events by restaurant and store the result somewhere. And once computed, I need a feature store where I can store all these features. For the real-time side, let's say my backend starts pushing events that an order has been created for a restaurant at a point in time; a simple feature computation job can be "give me all the orders grouped by restaurant in the last 30 minutes." This job can run every minute or every 30 seconds, and it just pushes out the restaurant ID and the number of orders. All these features are then stored in the feature store. The feature store has two main parts: one for real-time features, and one for static features, where data scientists can write their cron jobs and publish static features to it.

Now, when I want to know the kitchen preparation time, my application asks a gateway, or feature enricher. What it does is query the feature store to get all the features, and after enriching my request, it sends it to the model inference platform. The model inference platform does nothing but take all the features, pass them through the model, and give me the output. So this is the proposed architecture; in the subsequent slides I'll talk in detail about each piece and what system we are using at Zomato.

First is the real-time feature computation. We evaluated a lot of streaming platforms for this, platforms that take in a stream of events and give me aggregations over that stream: Spark Streaming, Kafka Streams, and finally Apache Flink, which is what we chose. The main reasons for choosing Apache Flink are its very strong community, ease of setup, and job-level isolation. What do I mean by job-level isolation? In production we will be running hundreds of models, and for each model there will be tens of jobs running. I need job-level isolation in the sense that one job failing should not affect other jobs: if my kitchen preparation time job fails, my rider assignment time jobs should not be affected. So we finally chose Apache Flink for real-time feature computation.
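To make the real-time feature concrete: the production job runs on Flink, but the aggregation it performs looks roughly like this plain-Python sketch (the event fields and names here are assumptions, not our actual schema):

```python
# Plain-Python sketch of the aggregation the streaming job performs:
# count orders per restaurant over a sliding 30-minute window.
# (The real job runs on Apache Flink; event fields here are assumed.)
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60
order_times = defaultdict(deque)  # restaurant_id -> timestamps of recent orders

def on_order_created(event):
    """Called for every 'order created' event pushed by the backend."""
    order_times[event["restaurant_id"]].append(event["created_at"])

def compute_features(now=None):
    """Runs every minute (or every 30 seconds) and emits one feature per restaurant."""
    now = now or time.time()
    features = {}
    for restaurant_id, timestamps in order_times.items():
        # Evict orders that have fallen out of the 30-minute window.
        while timestamps and timestamps[0] < now - WINDOW_SECONDS:
            timestamps.popleft()
        features[restaurant_id] = len(timestamps)  # running order count
    return features

on_order_created({"restaurant_id": "r-42", "created_at": time.time()})
print(compute_features())  # {'r-42': 1}
```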
Then comes the feature store, which has two components, as I told you earlier: the real-time feature store and the static feature store. For the real-time feature store, we need a data store that can be easily scaled for high write throughput as well as high read throughput. It can be eventually consistent; it's fine if the data is reflected after five or ten seconds, it doesn't need to be strongly consistent. For the kitchen preparation time use case alone, there are 10 features output every minute by Apache Flink for 100K-plus restaurants, which means writes are around one million per minute, and this is only one model; believe me, in production we are running around ten models of this kind. For reads, there can be up to 100K simultaneous users building carts, which takes my reads to 100K requests per minute, again for just one job.

So we evaluated several data stores and finally chose Redis, specifically ElastiCache backed by clustered Redis, managed by AWS. Why Redis? First of all, it supports high write throughput by adding shards, as well as high read throughput by adding replicas to those shards. We chose the managed offering because it provides automatic failover and is therefore highly available. It has very low read and write latency. And it provides TTLs, time to live: all these features will have TTLs of around two to five minutes, so we can set those easily.

From Flink to the feature store, we just set up a Kafka connection: Flink pushes features to Kafka, and at the feature store level there is a consumer which reads these features and sets them in Redis. And this is just a Grafana dashboard showing our real-time consumer throughput.
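As a rough sketch of that consumer (the topic name, key format, and field names are assumptions for illustration, not our actual schema):

```python
# Sketch of the consumer between Flink and the feature store: read feature
# messages from Kafka and set them into Redis with a short TTL.
# Requires the kafka-python and redis packages; names are assumed.
import json
import redis
from kafka import KafkaConsumer

r = redis.Redis(host="my-elasticache-endpoint", port=6379)
consumer = KafkaConsumer(
    "realtime-features",                    # assumed topic name
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v),
)

for message in consumer:
    feature = message.value  # e.g. {"restaurant_id": "r-42", "orders_30m": 17}
    key = f"kpt:orders_30m:{feature['restaurant_id']}"
    # TTL of a few minutes: these features go stale quickly anyway.
    r.setex(key, 300, feature["orders_30m"])
```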
The second component is static features. There are a lot of static features, like the historical times I talked about; for KPT we need the historical KPT for that restaurant and time slot. Here we need a store with just basic key-value access, not a relational database. It can be eventually consistent, because if I update these features right now, I'm mostly updating them for tomorrow, so I don't need strong consistency here. I definitely need high read throughput. Write throughput can be high as well, but it can be distributed over time; I don't need to write all the records at a single point in time. For this we evaluated Redis, Cassandra, and DynamoDB. We didn't choose Redis because it keeps all the data in memory, which proved too expensive for this task. That left Cassandra and DynamoDB. I think either would work here, but at that point in time we chose Cassandra, because the team had used Cassandra before and not DynamoDB.

The third component is model management: where do we store models, and where can the data scientists create their models? I don't know how many of you have heard about MLflow before, but MLflow is like Kubeflow: a model management platform where data scientists can come, create models, and store them there. We evaluated both MLflow and Kubeflow, but finally chose MLflow because it was very, very easy to set up; I set it up myself in about an hour. We didn't choose Kubeflow because the team didn't have prior experience with Kubernetes. Here is a simple dashboard for our KPT model: you can see there are several models, version one, version two, and version three, and version three is in production.

Finally, we need a platform where these models can be deployed for inference. We chose AWS SageMaker. We evaluated a lot of options here: Elastic Beanstalk, which is a simple application deployment platform that gives you a load balancer and EC2 instances and sets them up for you; ECS and EKS, where we could also deploy; and AWS SageMaker. Why SageMaker? Because it was very, very easy to set up; within a day we were up and running with a model. Plus, it provides direct integration with MLflow: I can just run one command and deploy the whole model to AWS SageMaker. This is my final KPT model deployment; you can see the invocations and the model latency, with separate dashboards for each.

The next component is the feature enricher. I need a piece that stitches everything together: when I ask for the KPT, it goes to the feature store to get the features and then goes to AWS SageMaker to get me the inference. For that we created an in-house ML gateway, a very simple Go service which fetches features from the feature store, enriches the request, and sends it to SageMaker. This is a simple KPT plan that can be deployed on our ML gateway; the plan is very simple: fetch the KPT features, get the KPT from SageMaker, and publish it.

So this is the final architecture we have deployed in production. Data scientists publish their models to MLflow. Then an engineer comes into the picture and deploys them to SageMaker. Then there is the ML gateway, which takes the predict API call, fetches features from the feature store, and calls SageMaker. The feature store is powered by Redis and Cassandra, and Apache Flink handles the streaming features.
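To make the model-management and deployment steps concrete, here is a rough sketch of how a model gets logged to MLflow and pushed to SageMaker; the parameter names are illustrative and the exact deploy command depends on your MLflow version:

```python
# Hedged sketch of the model-management step: log a model to MLflow,
# then push it to SageMaker. Names and metrics here are made up.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

# Stand-in for the real KPT model, trained elsewhere.
model = LinearRegression().fit([[1.0], [2.0]], [10.0, 20.0])

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)          # example hyperparameter
    mlflow.log_metric("mae_seconds", 94.0)    # example evaluation metric
    mlflow.sklearn.log_model(model, "model")  # versioned in the model manager

# Deployment to SageMaker is then roughly one command (MLflow 1.x era):
#   mlflow sagemaker deploy --app-name kpt-model \
#       --model-uri runs:/<run-id>/model --region-name ap-south-1
```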
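And here is a minimal sketch of what the ML gateway flow looks like; remember the real service is written in Go, and the endpoint name and key format here are assumptions:

```python
# Sketch of the feature-enricher / ML gateway flow: fetch real-time
# features from the feature store, enrich the request, call SageMaker.
# (The real service is a small Go service; names here are assumed.)
import json
import boto3
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)
sagemaker = boto3.client("sagemaker-runtime", region_name="ap-south-1")

def predict_kpt(restaurant_id, static_features):
    # 1. Enrich the request with real-time features from the feature store.
    orders_30m = int(r.get(f"kpt:orders_30m:{restaurant_id}") or 0)
    payload = {**static_features, "orders_30m": orders_30m}

    # 2. Send the enriched feature vector to the deployed model.
    response = sagemaker.invoke_endpoint(
        EndpointName="kpt-model",               # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())  # predicted prep time

print(predict_kpt("r-42", {"historical_kpt": 18.5}))
```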
What future work are we planning for this platform? First is the ability to shadow models on real traffic before taking them live. Here we need an A/B platform of sorts: say my KPT model V1 is live and a data scientist comes up with a V2 model; I need the ability to put real-time traffic through the V2 model before taking it into production, like shadow traffic. That's an ability we are planning to build into the ML gateway itself. Second is archiving real-time features, maybe to S3, for model retraining: all these features that are getting published to Redis should get archived somewhere. Third, I need to retrain my models at some frequency, maybe a month, maybe a week. And finally, tracking my model performance online. Yeah, so that's the whole platform. Any questions?

So, on the question of why not DynamoDB instead of Redis: we could also have chosen DynamoDB, but at that point of time our team was already working very efficiently with ElastiCache, which is why we finally chose ElastiCache.