experiences about detecting anomalies that happen in Flipkart's fulfillment network. Now, not all of you may be familiar with the logistics industry, so we'll spend a few minutes introducing the logistics side of things, and then I'll go a little into the details of the engineering challenges as well as the data science challenges. I might be the one standing here and talking, but there's a big team behind this, so here's a photograph, and credit where it's due. A little about me: I've been in the industry for about 12 years now. Everybody went West; I chose to go East and spent time at some of the universities in the East. Now I'm spending time on supply chain automation, predictive automation and actionable insights at Flipkart. The talk that Zainab mentioned is linked there; that's the platform we built on and that brought us here. Let's take a little segue, pause our engineering brains and our data science brains, and talk about aviation for just a few minutes. How many of you came to Mumbai by flight, through Chhatrapati Shivaji Maharaj International Airport? Awesome. Let's talk about that airport. How many flights per hour on average? 42 flights landing or taking off in an hour, and that's a big number. It might seem like just two digits, but for an airport it's a big number. About 40% of flights are delayed, and 8% of those are because of air traffic control. Now I don't want to crib about flights being delayed; that's not the intent. The point I want to make here is that those 8% of flights delayed because of an ATC call are delayed because the ATC detected a risk. The ATC detected that something could potentially go wrong and hence paused the flight. With me so far? That's the reason for the delay. Let's not assume they're doing things wrong; they're actually doing things right. They pause a flight because there is a risk.
For them to do their job they need visibility, they need alerts, and they also need recommendations. These are the things which help them do their job better. Now, with this in mind, let's move to the next step, where I'll introduce you to the Flipkart logistics domain. Keep the aviation analogy in mind as we go through the talk; it will help you, and I'll come back to it a little later. I'll cover the motivations behind the problem the team is solving, the approach we used, and some of the learnings towards the end. A fairly simple agenda. What does Flipkart's fulfillment network look like? On the left-hand side we have the sources of the packages that you get. A source could be a marketplace seller, who might own a big store somewhere or their own warehouse, or it could be one of our own warehouses. For context, the warehouses are thousands of square feet, with a vertical as well as horizontal footprint, and a significant volume of items is stored all over the place. A person has to walk around and pick things up, and that takes time. In the marketplace model, somebody goes to the seller, picks up the items and brings them to a pickup hub, where a lot of the processing for those shipments happens. So that's the first stage of our fulfillment network. The second leg is what is called a sortation center. To give an analogy, if any of you still know our postal services: the head post office, where letters come in from multiple places and get sorted, this batch goes to this location, that batch goes to that location. Sortation centers are like aggregators in a hub-and-spoke model. A sortation center can come in at multiple steps of the fulfillment network. There could be a Bangalore sortation center; from there a shipment goes to Hyderabad or to NCR, and from NCR it may go on to one of those locations.
Finally, shipments end up at a delivery hub, and this is fairly self-explanatory. From here the Wishmaster picks things up and brings them to your doorstep. That's a very brief view of our fulfillment network. The point to understand is that it's a complex network in itself, with complex end-to-end processes. There is an end-to-end process starting right from the point a shipment is received to it getting to the customer. Not just that, each component has complex processes of its own. A sortation center, for example, is a five-leg process with multiple inputs and multiple outputs. A warehouse is an even more complex process. So it's an end-to-end complex process with complex processes in each leg, and none of these are homogeneous. Each process has its own identity, its own complexities, its own differences. And we deal with high variability. I won't dwell on that, because the previous talks touched upon it significantly, but yes, we do have high variability. To manage all of this, we have something called a control tower, and this is where the analogy comes back: the air traffic controller and a control tower. The ATC for aviation works at multiple levels, and likewise our control tower: we have central control towers, asset-level control towers, and then process-level control towers. They are looking at all of these things and trying to make sure that the fulfillment network works as expected and that the customer gets the delivery when we promised it. Now, we talked about 42 flights per hour, and I'm not comparing the impact or importance of an ATC versus this, but in terms of pure volume, it's orders of magnitude higher. We're talking lakhs of shipments.
Again, they have a zero-delay target; we have a zero-delay target. Hundreds of facilities, thousands of processes, like I said. Mutations of around 200 GB per second across multiple systems. These numbers should give you an understanding of the complexity of the problem we are talking about. We need to tell the control tower that something is going to go wrong, with a 15-minute turnaround time. And 15 minutes works for us, not seconds, because we are dealing with physical processes. Physical processes require somebody to go and do something physically, and that doesn't happen in microseconds or seconds; it takes minutes, even hours sometimes. An example: if I have more shipments than trucks, I need to get a new truck, and that truck takes time to arrive. So we can live with 15 minutes. What we can't live with is inaccuracy. With top-of-the-funnel data, where you are looking at visitors, you could drop a hundred or maybe a thousand visitors and your millions-of-visitors trend would remain intact. Here, if one shipment is missing, a person will spend time hunting through the entire warehouse or the entire sortation center: where did that phone go? So accuracy is paramount; latencies are okay. Again, they need visibility, they need alerts, they need recommendations. This is where we get into the solution aspect. These are the characteristics of the problem, and these are the motivators for us to go solve it. We started off, like most companies do at an early stage, with just crons. You understand crons; I don't have to go into the details. Anybody here who doesn't understand crons? Okay, we have an engineering audience. So we started with crons: you schedule crons, raw data gets pulled and sent via email to an analyst or an operations person, that person processes it in Excel or somewhere and then takes a decision.
The next point on the maturity curve was where we moved completely to dashboards: self-serve automated dashboards that refresh periodically and present the data to you. Then we discovered the next problem, where we ended up with a plethora of dashboards. Has anybody encountered this problem, where you have just too many dashboards to handle? Awesome, so somebody has. And we had a scaling problem here: how do we staff our control tower? How do we make sure nothing gets missed across the multitude of widgets and dashboards that we have? So we had reached this point on the maturity curve. Any guesses what we did next? Some of the previous talks touched on this. We looked at alerts: can we proactively detect things as and when they are happening and let the relevant person know? And I could have ended there, because alerting in itself is not a new problem. People have solved it in different manners; systemic alerts have existed for eons, if I may put it that way, and they have matured significantly. We could have stopped there, but we didn't, and I'll cover the reasons why. So push-based alerts on top of dashboards is where we are today. The reason alerting is a different problem for us: imagine a CPU. You can have a flat-line threshold; you can say, whenever it reaches 99.5% usage, let me know. With me? The different problem, which the previous two speakers spoke about, is how do I predict when Dhoni is going to hit a six? That's something we would want to know ahead of time. Here it's a similar problem: how do you decide what the threshold for action is? This image shows the planned throughput of a warehouse versus the deviations, day to day for a month. The x-axis is day of the month, the y-axis is deviation from plan. I can't give you the exact numbers, but as you can see, it's very variable.
It's difficult for somebody to come up with a flat-line threshold that isn't noisy. Similarly, I spoke about sortation processes. Let's take three processes as an example: there's a primary scan, where things come into the sortation center; a secondary scan, where they are batched together in big volumes; and then a bagging scan, where they are put into bags which go onto trucks. Three processes, for simplicity. Each of those has a throughput which varies like this. They typically move together, but even individual processes have different throughputs. A similar example for a delivery hub: the throughput of a delivery hub, with day of month on the x-axis and throughput on the y-axis, is again very variable. So what is the nature of the problem, and what would an ideal solution look like? Let's define that now. We want to enable proactive action for any anomalies in the fulfillment network. "Proactive" is a single word, but it has a lot of connotation behind it, and I'll touch upon some of that a little later. So that's the problem statement, a very simple one. What would good characteristics be? We don't want people waiting in front of a dashboard or refreshing their emails; we want them to know when things happen. Worst case, from the time an event has happened, let's put an SLA of 20 minutes for them to know that something has happened. Again, 20 minutes is tolerable here; in certain other use cases it may not be. It has to be backed by precise data, and I'm using the word precise rather than accurate because even single digits matter to us. We should allow for natural variability: if it starts raining in Mumbai a few months down the line, we should know that it's going to rain and that the throughput is going to change.
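To make the flat-line-threshold problem concrete: a threshold that adapts to recent history fires on genuine deviations without constant noise. Here is a minimal sketch of one generic adaptive baseline, a rolling median with a MAD-based band (an illustration, not the method Flipkart actually uses; the function name and constants are my own):

```python
from statistics import median

def rolling_mad_alerts(series, window=7, k=3.0):
    """Flag points deviating more than k robust sigmas from a rolling median.

    A rolling baseline adapts to day-to-day variability, where a single
    flat-line threshold would either fire constantly or never fire.
    """
    alerts = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        med = median(past)
        mad = median(abs(x - med) for x in past) or 1e-9  # guard divide-by-zero
        if abs(series[i] - med) / (1.4826 * mad) > k:     # 1.4826 ≈ normal consistency factor
            alerts.append(i)
    return alerts

# Stable throughput around 100, then a sudden collapse on the last day.
throughput = [100, 98, 103, 97, 101, 99, 102, 100, 40]
print(rolling_mad_alerts(throughput))  # [8] — only the collapsed day is flagged
```

The point is that the band moves with the data, so each facility's own recent behavior defines "normal" for that facility.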
And we should not be alerting people for that, because then it becomes noise. It has to be very specific to each use case. Each process is different; each asset, which for us means a warehouse or a sortation hub, is different; each component is different. So it has to be specific. And once a disruption has happened, needless to say, it has to be excluded. If, for whatever reason, the throughput went to zero, or, for example, if we have a six-day Big Billion Days event, the seventh day should not be judged against the previous six days; otherwise everything would be an alert on the seventh day. We use many machine learning algorithms for this. I'll touch upon one that has served us well; feel free to go read about it, I won't go into the details of the solution, all of you are smart enough to do that on your own. The isolation forest algorithm; the references are there. The goal is to isolate points that are few and different, very quickly. One step is to construct a tree: take a sample of data, randomly select a dimension, select a value in that dimension, and draw a line. We do this repeatedly until we isolate each and every point in the dataset. Now, how do you detect an anomaly? The points which were isolated quickly, by the fewest number of lines, tend to be anomalies, and then you can build a score around that and decide what's an anomaly and what's not. As simple as that. Here's an image representation: that point gets isolated in two cuts and hence is an anomaly. And here's a visual representation of the same thing; I'll pause for it to keep playing. The score is 8 for that particular point, and you can set your own thresholds on how many cuts indicate an anomaly and how many don't. So this is isolation forest. However, if you go read the paper, isolation forest was intended for static data. You understand static data? Data which is not evolving.
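The tree-construction steps above can be sketched in a few lines. This is a toy version of the isolation idea only, not the production implementation or the full paper: random axis-aligned cuts, with the depth at which a point gets isolated serving as its (inverse) anomaly score:

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=10):
    """Depth at which `point` is isolated by random splits; shallower = more anomalous."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # randomly select a dimension
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return depth
    cut = rng.uniform(lo, hi)                  # draw a random line in that dimension
    # Keep only the points that fall on the same side of the cut as `point`.
    side = [p for p in data if (p[dim] < cut) == (point[dim] < cut)]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def anomaly_score(point, data, trees=100, seed=42):
    """Average isolation depth over many random trees (lower = more anomalous)."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees

# A tight cluster plus one far-away outlier.
data = [(10 + i % 3, 10 + i % 2) for i in range(20)] + [(100, 100)]
print(anomaly_score((100, 100), data) < anomaly_score((10, 10), data))  # True
```

The outlier is separated from the cluster by the very first cut most of the time, so its average depth is small, exactly the "few and different get isolated quickly" property the talk describes.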
Our physical world is essentially time series data. Somebody made a statement, I don't remember which of you, that almost everything is time series data. Likewise, ours is time series data. And this is where some of our proprietary innovation has come in: we have taken isolation forest, which works for static data, and applied it to time series data. This is what time series data looks like, and we transform it to map to static-data characteristics that can be consumed by the isolation forest algorithm. I won't go into the details; this is IP for us. We'll be publishing a paper and will let folks know when that's available. But this is something we have been able to achieve, and that's the point I wanted to make here. The whole platform works together: there are multiple components which work together to trigger business alerts for our control tower, and this slide shows the various components. This is the part I spoke about in my previous talk, the incremental data processing platform. What we do here, and this is another of our innovations, is that we don't process data in huge batch passes; we process data incrementally so that we are able to guarantee precision within a given time frame. We have a lot of sources on the left-hand side feeding into the incremental data processing platform, and, like I said, there are dashboards. The dashboards are still used today and will continue to be used, because alerts in isolation are not enough: you have an alert, you want to deep dive, and that's where dashboards come in. What we have built for business alerts starts here. We built a sampling service which samples the relevant metrics in a pre-configured, timely fashion, so it keeps building a trend of those metrics, and it puts that into our Flipkart data platform.
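The speaker keeps the actual mapping proprietary, but one generic, publicly known way to feed time series into a static-data detector (not necessarily what Flipkart does) is to slice the series into fixed-length sliding windows, so each window becomes one static point the isolation forest can cut:

```python
def windowed_features(series, width=4):
    """Turn a time series into fixed-length vectors for a static-data
    detector such as isolation forest. Each vector is one sliding window
    of recent values, so a local dip or spike becomes an unusual point."""
    return [tuple(series[i:i + width]) for i in range(len(series) - width + 1)]

# An hourly throughput series with a dip in the middle.
series = [50, 52, 51, 49, 50, 10, 51, 50, 52]
points = windowed_features(series)
print(len(points), points[0])  # 6 windows of width 4; first is (50, 52, 51, 49)
```

Windows containing the dip at value 10 sit far from the cloud of "normal" windows, so a static detector isolates them quickly. This is only one textbook featurization; the proprietary transform mentioned above may be entirely different.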
From there, we build and retrain models, and those go into our machine learning platform, where the models are consumed. Then we have the business alerting service, which uses inputs from the sampling service to know the current metric value across dimensions, and uses the machine learning platform to figure out whether that particular value is an anomaly or not. Remember the photograph of the team I showed you at the beginning? The team is spread across all of these platforms; some are existing platforms, some are new. And eventually we have a business alerts dashboard, where a person can see which alerts have come in and their details, and then click through to the relevant dashboard. So I'm coming towards the end of the talk. What are the learnings we have had? I think some of you will be most interested in this slide, so let's spend a few minutes here. Purely from a scale and manpower perspective, going from a team which is monitoring dashboards and trying to crunch patterns in their heads to a machine-learning-based system which simply tells you that things are anomalous and you should act on them is a definite cost saving for the business. And although we are all engineers, this is the most important point I want to make: it's a significant cost saving. The second point I want to make is that we tend to feel that machine learning will do it all for us. It won't. Machine learning is not enough, and that's been a learning for us. You need deep domain expertise: somebody who understands, say, the primary scan process in your mother hub and has been looking at it for two years, to come and tell you the nature of that beast. Then a data scientist can deliver a model which works for them. This is again a very important point.
Machine learning by itself is not enough. Whenever you are trying to solve a problem through machine learning, make sure you embed a person who understands the domain well; in a gaming system, for example, embed somebody who understands gaming, and the same goes for other domains. For us, breach, if you know FSG or Ekart, is a very important number, and it gets dragged into every leadership meeting. Breach is the number of shipments which did not reach the customer by the promised date. But if I start detecting anomalies on breach, it's not going to help me. With me? Why isn't it going to help? Sorry? Yes, it has already happened. I can't do much; the breach has happened, the shipment is delayed. Figuring out only now that it's delayed doesn't help me. What's important is to identify leading metrics: what leads to a breach? Breach is a lagging metric. What can potentially lead to a breach is, for example, the packing throughput in a warehouse. Packing throughput is a leading metric which can help you prevent breaches: if there's an anomaly in the packing throughput, you can do something, put more people in that area or start diverting orders to another warehouse, and prevent breaches from happening. So leading metrics are important, and this is where the challenge comes. Your lagging metrics, your metrics of concern, may be few; most businesses run with three. But there is a plethora of leading metrics. And that's where things get interesting, because each metric and each use case is different. Just because you have a model which can detect anomalies in warehouse throughput doesn't mean the same model will detect anomalies in your sortation center throughput. And the more metrics you have, the more models you need.
So we have, or intend to have, a dedicated model for each metric and each use case. A single model may not be able to catch everything, and that's been our learning. I know there are people out there who will debate, and I'm open to that debate, that one model should be able to do it. Yes, let's get there. But today's learning is that there is no one catch-all model. False positives and false negatives are a way of life; be receptive to them and learn from them. This will happen; at the end of the day, it's still a machine, so a human may be needed. For an alert which may not require action, let the human decide. Or if an alert did not come for something that happened, let the human come in and give you feedback. The next step for us is to try to automate actions. Today the control tower receives these inputs and then acts on them; the next leg of our journey is to automate actions and automate recommendations. And this is where I want to go back to the analogy I started with, air traffic control: how many of you want automation there? Thank you. The reason, for those of you who are into the details, is that the algorithm I described and most of the algorithms we use are unsupervised. For us to get to a place where we can leverage a supervised algorithm, one which tells us that this breach happened and this was the metric value at the time, that would be of value, and that's maybe when we can look at automating actions. But we are a little far off from there, and I think the industry is far off from there. So these are, overall, the learnings we have had.
And the last point I want to leave this audience with, this is a borrowed slide and the reference is there, is that on the left-hand side we have a projection of passenger volume growth for the civil aviation industry in India. That's the growth. I can't share Flipkart's numbers, but use your imagination. If that happens, somebody detecting anomalies manually on dashboards is not going to work; even alerts themselves will have to keep getting more and more intelligent. So this would be of value to anybody who is going to see growth soon. If you project growth in the near future, start working today, because these platforms take time to develop; they don't happen overnight, and as it is, no out-of-the-box solutions exist either. Start thinking about these things. Flipkart has a lot of learnings, and so do a lot of players out there; read their tech blogs and you should be able to get there. Going back to the aviation industry: you don't want things to go wrong, and that's where you should leverage the various tools at your disposal as much as you can. This is one such tool that we have been leveraging. I'll end my talk there, and I'm open to questions. Good talk. I have two questions. One, is this tool built in-house, or are you using anything off the shelf? Interesting. To qualify that statement: we leverage open source for the tech stack, we have Elasticsearch, we have Kibana, but a lot of this intelligence is in-house. Are you planning to open source that? Hopefully; I can't give you a categorical answer today. Thank you. My other question was more on the anomaly detection models: are those models supervised or unsupervised? And I really want to understand what the strategy is like when starting to build those models.
Did you just go all out in the wild, or did you start with something? I'd like to understand a little more about that. The first question was answered: most of them are unsupervised today. I'll get into the reason for this. On the data collection side, knowing what happened on the ground and whether it was an anomaly, we are a little far off from that. These are people working on the floor, and they have their own metrics to track, so getting them to tell us that an anomaly happened is a little difficult. That's the next step: a control that says acknowledge, it's a true alert, or it's a false alert. We are trying to get that data, and once we have that labeled data, we'll move to supervised algorithms. The second part of your question, again, is very fair. We did not have data up front, so the first few models took a few days to go live, and our accuracies were low, but they have improved, and that's the natural cycle of any data science or machine learning solution. The one learning we have is: make sure your business is aware of this. Get buy-in up front, saying, for a few days, please live with us, work with us, and at some point it will be good. Get that buy-in up front. Did I answer both your questions? Did we try that out? No, we haven't. Yeah, I agree, we might explore it; today we use Elastic as one of our terminal data stores. Good talk. So you mentioned you used isolation forest. What other kinds of models did you try? And what kind of feedback loop do you have, so that for the anomalies you say are happening, you record them and build out your labeled data for the future, to take care of the false positives and false negatives? On the first question: we started with the basics, just looking at moving averages, for example. We used a few other algorithms; I want to keep those internal for now. Going back to your second question about the feedback loop, that's what I just spoke about.
Each alert today gets labeled as an alert which makes sense to the business or simply a false alert, and that's the feedback loop which goes in. We look at how many alerts were generated over a period of time and how many of those were actually acted on. One of the insights, and thank you for asking that question, is this: your end metric cannot be the judge of your model, because the end metric is going to improve anyway; otherwise you're not doing a good job. So make sure your end metric is not the judge of your model; there has to be something else which helps you evaluate whether your model is working or not. Making sense? You identify those tools and then use them to keep improving your model. For example, I'll take an anecdotal case: we raised an alert, and the operations person came and told us, yes sir, we extended our tea break by one hour, and that's why the throughput went down. Now, as a data scientist, what does that mean to me? Is my alert wrong? Is my alert right? What do you think? My alert is right, and hence it becomes a true alert. But do I need to learn from this? No. So you should be very clear, and that's where deep domain expertise comes in, about what is a good positive and what is a negative. There is no catch-all for this; you need embedded domain experts. Does that answer your question?
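The feedback loop described here, counting how many raised alerts the business marked as genuine, boils down to tracking alert precision separately from the end metric. A minimal sketch, assuming each alert gets a boolean label from the control tower (the function and data are illustrative, not Flipkart's actual tooling):

```python
from collections import Counter

def alert_precision(labels):
    """Fraction of raised alerts that the business marked as true/actionable.

    This evaluates the alerting model directly, rather than judging it by
    the end metric (e.g. breach), which should improve regardless of the model.
    """
    counts = Counter(labels)
    total = counts[True] + counts[False]
    return counts[True] / total if total else 0.0

# Hypothetical week of control-tower feedback on raised alerts.
feedback = [True, True, False, True, False, True, True, False, True, True]
print(alert_precision(feedback))  # 0.7
```

Note that by this bookkeeping the tea-break alert counts as a true positive: the throughput deviation was real, even though no model change should follow from it, which is exactly why a domain expert has to own the labels.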