Hi folks. I lead the delivery engineering team at Swiggy, and my perspective here is from the consumer side of data: I am responsible for ensuring that delivery operations at Swiggy are both efficient and predictable. So this talk is about how we use data in logistics, especially real-time logistics, and not just data science and data engineering but data in general: how it is relevant for the decisions we take on capacity and efficiency. I have a slightly sore throat, so please bear with me. I will talk a little about the introduction to the capacity and efficiency problems, then the challenges with respect to data capture and the inherent nature of the data, then we will run through some efficiency levers that are very important to us, and then we will talk about what we have done in simulation and in running experiments in the real world.

First, a quick snapshot of what delivery actually involves. There are two parallel tracks here. On the top you see the preparation-time track: once the order is placed with Swiggy, the restaurant gets the order and starts preparing it. In parallel, we start searching for a DE (short for delivery executive), find the best one, assign the order to him or her, and the delivery executive travels to the restaurant; that is the first-mile time. If the delivery executive reaches before the food is ready, the executive ends up waiting there; if the delivery executive reaches after the food is ready, then the food was waiting during that time. Both of these are bad options for us. In any case, once both are ready, the pickup is done and then the last mile happens.
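The two parallel tracks can be made concrete with a minimal sketch. The `OrderTimeline` structure and field names below are my own illustration, not Swiggy's actual data model; the point is only that pickup can happen at the later of the two tracks finishing, and the gap between them is waiting, either by the DE or by the food:

```python
from dataclasses import dataclass

@dataclass
class OrderTimeline:
    order_placed: float      # minutes from some reference point
    food_ready: float        # when the restaurant finishes preparation
    de_at_restaurant: float  # end of the first mile
    last_mile: float         # travel + navigating the society + handover

def pickup_and_waits(t: OrderTimeline):
    # Pickup can only happen once BOTH the food and the DE are ready.
    pickup = max(t.food_ready, t.de_at_restaurant)
    de_wait = max(0.0, t.food_ready - t.de_at_restaurant)    # DE idles at restaurant
    food_wait = max(0.0, t.de_at_restaurant - t.food_ready)  # food sits waiting
    delivered = pickup + t.last_mile
    return pickup, de_wait, food_wait, delivered

# Example: the DE arrives 4 minutes before the food is ready.
t = OrderTimeline(order_placed=0.0, food_ready=18.0,
                  de_at_restaurant=14.0, last_mile=12.0)
print(pickup_and_waits(t))  # (18.0, 4.0, 0.0, 30.0)
```

Either wait term being large is bad: `de_wait` wastes delivery-executive time, `food_wait` degrades the food.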
Now, just to be clear, the last mile is not just reaching the customer's doorstep; it also involves navigating through the society or office building the customer is in, and actually delivering and handing over the food packet to the end customer.

The first problem is capacity. In very short and simple terms, capacity is about how many orders I should accept. Given, say, 100 DEs, what is the number of orders I should accept? In other words, at what point should I start saying, no, I'm not going to take any more orders from customers? That is the high-level problem. Now, given 100 DEs, I can deliver 300 orders given enough time, so the first constraint is obvious: you don't want to serve orders to customers two hours down the line. There is a max SLA, and I don't want to serve orders beyond that time. The second important thing, which we've already pointed out, is how many delivery executives I have at any given point in time; that sets the base level. If I have 100 DEs, what's the minimum number of orders I can serve? Obviously 100, and you have to do better than 100, because you have 100 delivery executives. Beyond 100, a lot of factors start affecting how many orders we can actually serve in the max SLA time.
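As a back-of-envelope illustration of why 100 DEs can serve well over 100 orders inside a max SLA window, here is a deliberately naive formula. This is my own simplification for intuition only; the real system, as described in this talk, recomputes capacity dynamically from real-time ground state rather than from a fixed average cycle time:

```python
def naive_capacity(num_des: int, window_min: float, avg_cycle_min: float) -> int:
    """Rough upper bound: each DE completes one full order cycle
    (assign -> first mile -> wait -> last mile) every avg_cycle_min
    minutes, so over the window each DE handles window/cycle orders."""
    return int(num_des * (window_min / avg_cycle_min))

# 100 DEs, a 60-minute window, 40-minute average cycle: about 150 orders.
print(naive_capacity(100, 60, 40))  # 150
```

Traffic, weather, batching, and DEs logging off all shift the effective cycle time minute by minute, which is exactly why a static number like this breaks down in practice.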
So there are factors like the traffic conditions on the road. If traffic is light, then each of my delivery executives is delivering orders faster, coming back, taking up the next order and delivering it, so you can deliver maybe not 100 but 120, 150, 170 or even more. One factor is traffic; weather is another, along with conditions that may or may not happen; then customer density, meaning how close customers are to each other, and also customer-to-restaurant distances. Distance is really a proxy for time: in Bangalore, for example, one kilometer could take you half an hour to travel, depending on traffic conditions. All of these factors in conjunction determine the true capacity I have with the same number of DEs.

Now let's say you still have 100 DEs and everything else remains the same. Say yesterday you delivered 140 orders; with the same 100 DEs, the same traffic conditions, the same number of orders coming in, can you still deliver 140 orders? Not necessarily, because 100 delivery executives at 7 p.m. is a very different situation from 100 delivery executives at 10:15 p.m.: many of that hundred are going to log off in the next 15 minutes, whereas at 7 p.m. more are going to join the fleet. So your true capacity is not just dependent on factors like the number of DEs; it also depends on what is predicted to happen in the next 10 minutes. Now you see the nature of the problem: a lot of external factors come into play, things like traffic, weather, an accident on the road, road construction going on, multiple things that are neither predictable nor controllable, and all of them affect our capacity. Another example is the batching possibility: you can take the same delivery executive and give him more than one order, as long as the customers are nearby and multiple other conditions are satisfied.

So the capacity problem is really about how close to actual capacity we can get. We don't know what the real capacity is; it's very hard to determine. And what are the costs? Suppose I over-predict my capacity, which means I end up taking more orders than I should have. What ends up happening is that customers get their orders delayed and have a bad experience. If I under-predict my capacity, then I have left orders, and hence money, on the table. Both of these outcomes are bad, and hence it's very important to operate at the true capacity level.

Capacity has two notions here. One is the more common form, the aggregated capacity, calculated at the zone level; in some sense it is a ratio of how many orders we have with respect to how many delivery executives we have, and this is the stress in the system. The other is an upper bound on how many orders we can take. Traditionally, that bound was calculated through a lot of learning in a very static, human mechanism. The reason it would still sort of work is that, given enough days and enough experience of the people running on-ground operations, it used to come out okay sometimes. But that no longer scales. It is not responsive to sudden changes in conditions on the road; it's not responsive to weather, because rain happens once in a while and when it happens you don't know what the intensity will be, and so on. So what is the right number at which you would stop accepting more orders? That is one of the challenges. What it often also leads to is that if I, as an operations person, have an incentive to serve the maximum number of orders, I may tweak these numbers without any scientific reasoning, and that too can lead to bad orders, customer dissatisfaction, and so on.

We worked through this problem for quite some time, tried various algorithms, and came up with different statistical models. As I said, I am from the delivery engineering team, so we work very closely with the data engineering and data science teams. Ultimately, the current version we are working with is called the order limiter. It is very much dependent on what is happening in real time on the ground: what state the delivery executives are in and where they are in real time, how close they are to their destinations, what is happening to the food being prepared at the restaurant, and whether the last signal we got arrived at the predicted time or was delayed or early by X minutes. All of these conditions are taken into account; things like when a delivery executive will get free are included in the computations, and we now have a much more dynamic version of capacity. Remember, this is not stress; stress is still the old notion of how many orders I have with respect to how many delivery executives are there. This is the upper bound. This kind of system, as you can see, is very dependent on what my ETA was, what my SLA was, and the current state of each and every order and each and every delivery executive in real time on the ground, so a huge amount of data processing work is required here. What it helps us do is reduce the bad orders we would have taken, and it also helps us avoid leaving orders on the table.

This is a visual representation of how the system works. On the left side, for every N-minute interval, we have computed how many orders we can take in that interval, and the green graph shows how many we actually ended up taking. As you can see, around lunchtime we are operating very close to what we could have taken, and as lunchtime fades away around 2 o'clock or so, the gap between the blue and green graphs increases; that indicates our capacity is underutilized at that time, because the orders have been delivered but the delivery executives are still available, still in the system. In the equivalent graph on the right side, the points here and here are places where we have hit capacity, and at those points we would have said: take no more orders. The yellow and blue lines here are changing dynamically; they are saying, at this point I can take so many orders because, say, all my delivery executives are busy, or at this point I can take so many because they are getting freed up in the next 10 minutes.

The other notion of capacity is point capacity: can I accept this particular order? This is also very important because, for example, if you have only 5 delivery executives at a particular point in time, those delivery executives could all be on the east side of the city, because most of the time early-morning breakfast orders come from certain restaurants, and on the other side of the city there are probably none. While those 5 are there, I cannot take 5 orders from the west side of the city, because it will take too much time for those delivery executives to travel. This is an extreme example, but it happens all the time: for this particular order, we compute whether there are any delivery executives available in the vicinity, either now or expected to get free in the next 10 minutes. Both of these notions of capacity, as you can see, require fast computations. They are approximate; you can't afford to fetch all the data in the world, so you do these computations based on some last known state, ultimately aimed at preventing both ends of the spectrum: bad orders on one side and leaving money on the table on the other.

The other notion is efficiency. In simplistic terms, efficiency is how many orders per delivery executive per hour I do. There are two notions of efficiency: one is called assigned efficiency and the other is simply called efficiency. The regular notion of efficiency also includes the utilization of my fleet; if you leave that out you get assigned efficiency, and most of today's conversation will focus only on the assigned-efficiency part.

Now, what are my efficiency levers? How do I control this? The first is obviously who I assign an order to. A greedy approach could simply say: assign this order to the nearest delivery executive possible. But the problem is that the delivery executive nearest to this order could have been the best executive for another order, one which is even closer to that executive than this order is. So now we are talking about not doing a greedy approach but doing more of a global optimization. There are versions of the Hungarian algorithm that we have coded up. I am not going to
get into that, because the focus here is not the assignment algorithm but the basis we use to do this kind of optimization. The basis is: I know where the delivery executive is, and I can estimate the time this person will take to travel. I also need to account for the possibility that this person may reject the order; he may say, I do not want to do this order. We do allow this behavior to a certain extent, because delivery executives have their own preferences: they may not want to go to a certain area, or maybe it's very late and they were just planning to log off from the system when an order came in, so rather than accept it they reject it. There can be various situations. So now we are talking about building those possibilities into our prediction algorithms, into our predictions of how much time it will take to travel to the restaurant, and so on.

The other lever is when to assign. For example, if I have to serve an order and I see a delivery executive who is 1 kilometer away, I have a choice of assigning the order to that executive right now, or I can wait for some time and see whether another delivery executive may become available who is closer than 1 kilometer to this order. So the time dimension is another lever.

Then: can I batch orders? We talked briefly about batching. What enables batching is that the 2 or 3 customers I am going to deliver to have to be close enough: the orders have to be close in terms of both the source, which is the restaurant, and the destinations, which are the customers, and also in the time dimension. You don't want to batch 2 orders that came in 20 minutes apart; you want to batch 2 orders that were close in the
time dimension as well. The reason this is important is: can I predict whether a specific order is capable of being batched with another order, given all these constraints? So again, you're talking about looking at a lot of data, learning from the past, and taking real-time input. For example, if there was already an order in the system, one that came in 5 minutes back and satisfies all three criteria for the order coming in now, I need to be able to decide: yes, there is an order which, with a lot of certainty, can potentially be batched. It is important to predict this batching not just because you want to know what the capacity and efficiency of the system are going to be like, but also because you want to promise an SLA to the customer. Now imagine there is no such order already waiting to be paired; then you ask: what is the chance that in the next 10 or 15 minutes another order will arrive that satisfies these three conditions? These two cases are called second-order batching and first-order batching, and in both cases we do a lot of data analysis, both in real time and by building data science models, to learn these behaviors.

The other levers are the different legs of the journey we talked about: order-to-assign time, first-mile time, last-mile time, preparation time, customer-to-customer distance, batching probability. Each of these is a data science model used for predictions, and each has a specific error in it; I'll talk a little about errors as we go along. The important thing is that all of these models work in unison for us to be able to promise an SLA to the customer, and, as you know, the errors in these models add up. The second important thing is that some of these models are used for internal efficiency and internal predictions, and some are used to make a promise to the customer, so we sometimes end up building different models for optimizing our systems and for making promises to customers.

The other way we look at data at Swiggy, especially in the delivery team, is with respect to trade-offs. For example, if I want to make my efficiency very high, I would want to batch orders together. But when I batch two orders and give them to a single executive, there might be another delivery executive who is waiting for an order and gets none. So this is a trade-off against the experience I give to my most important resource, in this case the delivery executive: I may give them a suboptimal experience and they may end up leaving our platform. If I make my operations very efficient, I may compromise on the experience I give; they are here to earn money, and we are impacting their earning capacity. Similarly, if I batch two orders together, the order that came in earlier could possibly have been delivered first; instead, that order is made to wait because there's a second order we are going to batch it with. So there's a trade-off right there: the first customer probably still gets the order within the promised time, but it takes more time than it could have taken. Likewise, when I make an assignment, I want to find the most optimal combination of orders and delivery executives, but in doing so I may end up starving some orders. Especially at peak times this happens all the time: I have, let's say, only 50 delivery executives and more than enough orders, and if I keep super-optimizing, the orders that are away from where the delivery-executive density is may end up getting starved. I can't let that happen, because for those few customers whose orders are starving, the experience will be very bad. So these trade-offs are very important in every decision we make.

There's an efficiency-versus-speed trade-off: as we discussed, when I batch two orders, at least one of them sees a larger delivery time. Similarly there is a speed-versus-compliance trade-off, especially in the prediction models we build. I'll talk a little later about why there is natural variance in the data; no matter what model we build, there is going to be a noticeable amount of error, and we have a choice: train our models so that we deliver within the promised time for most customers, or optimize them to be as close to reality as possible, knowing that in the extreme cases the promise will still break. Those choices, again, are very important.

So now we know that capacity and efficiency, while each has its own logistics challenges, both have a heavy dependency on the way we use data, whether for predictions, for real-time processing, or even for debugging and understanding what is happening in the system. So what are the challenges? One is obviously location-capture accuracy. It's a very common problem in logistics: because of device inaccuracies, signal presence versus absence, and battery-drain considerations, location accuracy itself is not very good, and that is one of the problems we work on consistently. Then there are customer locations and restaurant locations themselves: while we do everything we can to make sure they're accurate, there are instances where either the customer has not entered the right location, or in the case of
restaurants, for some reason or another, their lat-longs are not precisely at the place where they're actually located, and even a slight deviation in these locations has implications for how our models predict.

One thing that is very important across the logistics space, not just at Swiggy, is the accuracy of data. Everything else being right, the nature of capturing a lot of this data from physical human beings is that they are all driven by their own motivations. Unless we incentivize each of these data input points appropriately, we are bound to get signals that are explicit but not accurate. One of the things we focus on very heavily is how accurate our data is, because if we don't fix that, our models learn from wrong input, and that's an error you don't want baked into a model. An example of DE behavior: yes, he may want to save battery and therefore not update his location, but it could also be that the DE expects his food to be prioritized by the restaurant if he sends the signal that he has arrived at the restaurant, even though physically he has not. In his understanding, that makes his food get prioritized in the restaurant's stack, so his order gets prepared faster and he becomes very efficient. On the other hand, this is how human dynamics work: restaurant owners are smart people themselves. They know that some DEs, or maybe many DEs, are doing this. They also have an offline presence: there are customers physically sitting at their tables, and they're also getting orders from outside Swiggy, whether offline or from our online competitors, data we have no access to. Restaurants are optimizing for those orders as well. So restaurants have an incentive, in some cases, especially during peak time, to wait until the delivery executive actually shows up. Now you see it's a chicken-and-egg problem: the delivery executive says "I have already reached" when he has actually not reached, and the restaurant waits for him to come before starting to prepare the food, because only then is it guaranteed. All of these things cause incorrect data input into the system.

Here is an example, plotted on Kepler.gl. We instrument each of the data points: where a certain event happened, what time it happened, and so on; this is an output of that. The white dot here is a restaurant. This happens in some percentage of our restaurants, not all, but where it happens, it happens really badly. These are the pickup events, where the delivery executives marked the food as picked up; those are the lat-longs around there. The white dot is from our offline method of gathering the restaurant location: when you onboard a restaurant you get their name, address and so on, and you also capture the lat-long there. This kind of analysis clearly shows that there are restaurants in our system whose recorded location is off from the actual one. We do this analysis at scale, figure out which restaurants are more than X meters away from the centroid of their pickup locations, and that's how we proactively go and fix restaurant locations.

This is another example, again done on Kepler, which shows the DE behavior. We were always aware that some of our DEs mark their arrival locations away from the restaurant. The denser regions here are the actual restaurant locations, and the cloud of points around those denser regions is where many of the DEs have ended up marking the reached
location. So while we were always aware through plain statistics that X percent of our DE-marked arrival locations are potentially geo-breach conditions, when you visualize it in this format it becomes very clear that when the problem happens, it can be pretty bad: some of these marked locations are as far as one or even two kilometers away from the actual restaurant location.

So what do we do about it? We have a lot of product and tech interventions: automated detection of location, geo-fencing, input verification. If the DE says "I have reached", can you look at the restaurant's lat-long and his current location and actually judge whether he has reached? And you have to do this even in the absence of a strong GPS signal sometimes, which is when you make the decision based, again, on data analysis: what is the probability that he is telling the truth? You can't simply block a DE, because location inaccuracies are real. I could be standing here and my location could show up on the other side of the road; a few tens of meters of error, sometimes even 100 meters, is very much possible. You can't tell a DE "you are lying, I will not allow you to mark reached" when he may actually have reached. So you have to accommodate those errors, and there is a lot of data processing and error handling that happens to probabilistically judge whether the DE has actually reached the location or not. And then, of course, we have some operational interventions.

These two graphs are our last-mile times and our O2D (order-to-deliver) times. What I am trying to highlight here is the variance: the data is highly variant, spread across the X axis (the numbers at the bottom are hidden). This is not necessarily a sample from all zones; it is from a sample zone where the variance is very high, but the variance itself is something inherently there in the system. As I said earlier, there are so many external factors affecting variance, and not every external factor can possibly be captured by a data science model, because you simply do not have data for it. You can develop the best models; even the best model will have some errors, but here there is so much natural variance that even your best model will have noticeable errors.

So what do you do about it? Obviously this limits our prediction accuracy, and while we do chase accuracy, the way out is to proactively account for the inaccuracies in any decision you take. For example, when deciding when to assign, or which DE to assign: if your prediction accuracy is, let us say, two minutes, and two DEs are one and a half minutes apart, then the choice between them is not going to make a major difference (in practice, when such a decision is to be made, we assume the best possible choice there). We then identify the sources of high variance: it could be load on restaurants, it could be DE behavior, it could be the current stress on the system, and there could be many hyperlocal scenarios like a traffic jam in a certain place. The point is that some data, like how many tables of a restaurant are occupied right now, is something we will probably never have; but other things, like how many Swiggy orders the restaurant has and what the items on each of them are (is it ice creams, is it biryanis, is it a lot of rotis?), we do know, and each of these food items has different characteristics of preparation
right so multiple of those you know inputs are used to increase the prediction accuracies so next I come to now that we have so many you know algorithms and so many data focus decisions that we take how do we actually verify whether what we are doing let us say you want to do a same change the system how do we verify whether it is actually functioning or not so we start with the hypothesis we you know go on saying that hey if we do algorithm 1 versus algorithm 2 here is what are the metrics that are expected to increase decrease move favorably move unfavorably and so on and these are the metrics that are my guard rails right I don't want to go bad on NPS I don't want to go bad on let's say my order to deliver time for some cases and so on so you fix these hypothesis then you run the solution you do the debugging and implementation and everything and very importantly do a lot of instrumentation right so every single aspect that is possibly could be affected by the changes that you are making needs to be captured then we do a very important thing called simulation right so simulation is what is happening in the real world can I make this happen in software can I simulate delivery executive movements can I simulate the order rates can I also simulate which areas the orders are coming from can I simulate the behavior that delivery executives do like for example at certain rate I will reject you know X percent of my orders at certain point in point in time I'm going to log off at a certain point in a day at so-and-so rate right so these are different variables that your simulation environment can do so run this change through the simulation environment and one of the most important things here we have learned us it's a kind of a challenge to keep your main code and the simulation code in sync so as much as possible we try and you know do this together then we do a shadow mode execution right because each of these logistics XP experiments are something that you have you 
have human beings who are running right they are taking these decisions so it's very hard to you know just run simulations and say what's the result going to be like mostly simulation gives directional trends if I do X this metric bill will improve in the range of Z to P percent right it may not tell us that this will be the precise improvement in the metric when you actually run it on the ground so you start running it in shadow mode where shadow mode may you're actually getting real-life data but people who are using the systems they don't see the effect so you're not influencing anything on the ground but you're gathering enough data to do analysis both real time as well as offline and then finally you run on-ground experiments right so simulation and on-ground experience experiments are something that I'm going to talk a little bit about more and then finally when the on-ground experiments are successful then we roll out now before we get into the next slide I have a small video I'll just show a snapshot of it so this is how our simulation runs it's a visual representation all the green markers there are the delivery executives who are free all the red ones are those where the orders are assigned and you see those red lines shortening is where the delivery executors are moving towards their destinations and the blue ones are where the delivery executors are moving towards the restaurants so this is how we run simulation and when the simulation on the top right left corner you will see this is a certain zone there are so many d's who are free at any given point in time there are so many the ratio there is representing what is the stress on the system and the other counts are how many free d's and busy d's and all we have so this kind of the system actually gives us directional results on if I take a certain algorithmic decision then what is my likely output going to be like now once so this is so this graph actually represents what is our you know results on 
This graph shows our results on simulation versus actuals. You can see the blue and the red lines: the red line is the actuals and the blue line is the simulation results. They are not exactly on top of each other, they do not duplicate each other, but the trends are very much the same. Similarly, in the histogram the red is the actual and the blue is the simulation, and the observation is the same: to a very good extent we have been able to simulate what is happening on the ground.

The last part is how to run on-ground experiments, and that is a huge challenge in logistics because, as I said, there are too many external factors, too many factors you do not control, so choosing a test set versus a control set is a significant problem. The first strategy we tried was a simple pre-post: run without any changes for X days, then run with the changes, and observe the differences. This typically works well if the change has a significant impact on the parameters and if you have eliminated any large external variables: if an IPL match happens, the number of orders increases drastically, and if a festival happens, delivery executive supply decreases drastically. Eliminate those large variables and pre-post gives okay results, but the impact has to be large enough to be observable. The second strategy was alternate days, which worked a little better than pre-post because the day-to-day variation over a two-week period is much smaller than the pre-versus-post variation. The third is a control zone versus a test zone: find a zone very similar to the one in which we are running the experiment, make one the control and the other the test, and run both simultaneously for multiple days. In each of these designs there is some bias and some variance that are very hard to eliminate fully.
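For the pre-post strategy, the observed difference still needs a significance check before it can be believed. A minimal sketch using a Welch t-statistic on hypothetical daily delivery-time averages; the numbers are made up, and the real analysis would be richer than a single statistic.

```python
import math
import statistics

def welch_t(pre, post):
    """Welch's t-statistic for comparing a metric before vs after a change."""
    m_pre, m_post = statistics.mean(pre), statistics.mean(post)
    v_pre, v_post = statistics.variance(pre), statistics.variance(post)
    se = math.sqrt(v_pre / len(pre) + v_post / len(post))
    return (m_post - m_pre) / se

# Hypothetical daily mean delivery times (minutes), one week before and after.
pre = [34.1, 33.8, 35.0, 34.6, 33.9, 34.4, 34.8]
post = [32.9, 33.1, 32.5, 33.4, 32.8, 33.0, 32.6]

t = welch_t(pre, post)
print(round(t, 2))  # strongly negative: delivery time dropped after the change
```

A t-statistic this far from zero suggests a real effect, but it cannot rescue a pre-post design from confounders like an IPL match or a festival falling in one of the two windows; those still have to be excluded by hand.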
These simpler designs are mostly suitable only for large variations, so we then went into more sophistication. We tried time slicing: apply one strategy for five minutes, switch for the next five minutes, and so on, or do the same at a coarser one-hour granularity. We also tried randomized selection of orders, applying one treatment to some orders and another treatment to the rest, and we measured the bias and variance of each of these strategies. The last strategy, which also works well, is randomized geo-spatial selection: run one strategy in a certain geo-location and a different strategy in a nearby geo-location within the same zone.

One of the most important things here is network effects: if I take a decision for one order, does it impact another order? An example is that if I change my assignment algorithm and assign one order to a specific DE, that decision has actually affected another order for another DE. That other order can now no longer be assigned to this DE, which means the behaviour observed for the other order has changed. Because of such network effects, some of these test-and-control mechanisms can never be applied cleanly, so in logistics it is very important to know how to create isolation. For example, if I promise one SLA to one set of customers and another SLA to a different set, their behaviour can be independent, and an order-wise split works. But if I batch two orders, they influence each other, which means the second order can no longer be batched with a third order in the other set. These are the kinds of things we have to be very careful about.

Quickly on visualization: when we plot a lot of these points with a library like Kepler.gl, we get a lot of insights.
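Time slicing and randomized geo-spatial selection both reduce to a deterministic rule that maps an order's time or location to a treatment arm. Here is a toy sketch; the slice length, cell size, and arm names are chosen arbitrarily for illustration and are not the production values.

```python
import hashlib

def time_slice_arm(epoch_minutes, slice_minutes=5, arms=("control", "test")):
    """Time slicing: each 5-minute window runs exactly one treatment arm,
    so orders inside the same window never mix arms."""
    window = epoch_minutes // slice_minutes
    return arms[window % len(arms)]

def geo_arm(lat, lng, cell_deg=0.01, arms=("control", "test")):
    """Randomized geo-spatial selection: hash a small lat/lng cell to an arm,
    so nearby orders in the same cell always share a treatment."""
    cell = f"{int(lat / cell_deg)}:{int(lng / cell_deg)}"
    digest = hashlib.sha256(cell.encode()).digest()
    return arms[digest[0] % len(arms)]

print(time_slice_arm(7))          # minute 7 falls in window 1 -> "test"
print(time_slice_arm(12))         # minute 12 falls in window 2 -> "control"
print(geo_arm(12.9716, 77.5946))  # one cell, always the same arm
```

Keeping a whole time window or a whole geo cell on one arm is exactly the isolation discussed above: two orders that might be batched together, or might compete for the same DE, fall under the same treatment and cannot contaminate the other arm.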
Aggregate analysis alone does not give us these insights. For example, in this case the blue points are where the restaurant at the centre (the white dot) is serviceable, whereas the orange and red points on the outskirts are not serviceable. You will also see some blue dots right at the edge, which led us to ask: why are those red and blue dots so close to each other? In this case we found it is because of U-turns. In some places you require a U-turn, while at a lat-long on the other side of the road you may not, and that makes the difference. Here is another example where visual representation gave us a lot of insight: on the extreme right you see restaurant chains, and our customers are not doing a great job of choosing which outlet to order from. If I show a customer three nearby, serviceable outlets of, say, Chai Point, the customer does not necessarily make the best choice; he does not know that one outlet is probably only 500 metres away while another is one and a half kilometres away. The overlaps in the plot are showing exactly that. So in logistics it is very important to do these kinds of visualizations, without which you will be restricted to aggregates.

A quick summary of the learnings: know the input data very well; understand the sources of variance and account and accommodate for them; maximize input accuracy by fixing the data at the source; understand the trade-offs; and use simulation and experimentation very rigorously. Without statistical analysis of each of these experiments, it is not possible to make any kind of progress in logistics. Questions?