Okay, so hello everyone. I'm Gaurav, and I work at Flipkart on the cloud platform team. Today I'll be talking about a product we developed in-house for our own private cloud. Most of you would already have worked on AWS or Azure or some other cloud provider, so before starting I just want to tell you that Flipkart has its own in-house private cloud that we use to deploy our applications, and it's entirely managed by the cloud platform team. There are a few abbreviations I'll be using on and off during the presentation: VM is virtual machine, BM is bare metal, and DC is data center. BCP, the business continuity program, is the program under which this product was developed. The aim of this program is to make sure the business keeps running no matter what. Even if one data center goes down, we have to make sure our customers are still able to place orders. That's very critical for an organization like ours: if people are unable to place orders for even one second, it's a business loss for Flipkart. I hope that makes the business continuity program clear; I'll talk more about BCP in further slides. As I said, Flipkart has grown organically over the last 10 years, and we have a lot of applications that were developed in-house, both stateful and stateless. Currently it looks something like this: there's a user service, a user comes in to the app or the website, and then an entire set of ecosystems collaborate with each other until eventually the order is placed. If you look, there is a pricing system, and inside the pricing system there is again a whole mesh of services talking to each other.
I mean, you can call it a mesh or a mess, depending on your perspective when you look at it. All of these interact with each other to cater to one particular business flow, which could be any business flow. Similarly for inventory. It's just a huge mesh of microservices and stateful services interacting and doing something or the other. As I said, there are thousands of microservices, stateful and stateless, and the graph is continuously morphing. When I say continuously morphing, I mean in terms of interactions: there could be one feature we want to shut down, or a new feature we have developed, and then a new edge starts appearing between two services in your system, because they are now communicating with each other to cater to that particular feature. So this is how our order system looks right now. This is a very application-level view of the system, and this is just the order system, my services interacting with each other; it does not include interactions outside of the order system. Now, this is the service-level view of the same thing. What is the difference between the application level and the service level? The application level is higher order: one application can have multiple services under it. What I mean is, you can create one app ID, which is an application, and under that application there can be multiple microservices, say a web application and a data store, which together serve a purpose under one application umbrella. It's basically a hierarchy. Again, for the pricing system, this is the app-level view. And this is how it looks for the entire Flipkart at the app level; this is not even the service-level view, just the interactions at a higher level. And if you look closely, there are two clusters which form in this entire mesh.
So what this means is that these are the two sets of data centers we have: Flipkart has two data centers. Why do we have two? I hope most of you have figured out that once a data center goes down, we have to use the other data center to make sure services are up and running and able to serve the critical business flows in the case of a disaster. Having said that, having a private cloud of our own brings two major problems: observability, and handling the challenge I spoke about, which is nothing but a huge disaster happening. Observability is about the current status of my services. Architects design their service in a certain way: multiple microservices have to interact in a particular manner. But somehow, somewhere, because product is involved, some new feature arrives and new dependencies start creeping into your microservice architecture. So there has to be a way to show developers: this is the current state of your service interactions. That comes under observability. Okay, so as I said, this is one availability zone, or one data center. The orange ones are the normal workflow, the services interacting with each other in the happy case. Then there's reserved capacity that we keep for when sales come up, to horizontally scale our services; again, a happy scenario. But one thing I need to point out here is the BCP flow. What is a BCP flow? If you see the outline around some set of services, that is nothing but the definition of a business-critical flow. A business-critical flow could be the order system, the checkout system, the payment system, and the catalog system as well: no matter what, a user should be able to see the catalog, click on a product, and eventually do a checkout and make the payment, right?
That's a set of services which have to be running no matter what, so we outline them, identify that set of services, and bring them under a BCP flow. Whatever happens, those services have to keep running. Now, if a DC goes down, what do we have to do? We have to restore this BCP flow. And how do we restore it? We need to know the boot order of the services. First of all, we have to know the topologically sorted order of this dependency graph, because obviously a service which does not take any dependency has to be the one brought back up first. Knowing the dependency graph of this set of services helps us identify which services need to be booted first in an alternate zone. Okay. Now let's say an availability zone has gone down. What do we do? We have to bring those services back up in our alternate zone. So we need to know the BCP flow, the order in which to bring the services up, and also the order in which to shut services down, because we need to shut down some services which are running in the disaster zone and bring them back up in the alternate zone. So, enter Meeseeks. First of all, why did I choose the name Meeseeks? Because I'm a big fan of the sci-fi series Rick and Morty, and the name comes from one of its episodes. Yes, I see we have some fandom here. While I was developing the service and writing down requirements from a lot of architects and product folks, once those guys got to know we were developing a service graph, a lot of requests kept coming in: can it do this? Can it identify the amount of bandwidth flowing between the services? Can it show alerts? We have an alerting system, but they wanted to see it on the UI: let's say a service behind an ELB is going down, and out of three instances, only one is remaining.
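The boot-order computation described above is essentially a topological sort of the dependency graph: services with no dependencies come up first. Here is a minimal sketch of that idea, with made-up service names; this is an illustration of the technique, not the actual Meeseeks code.

```python
from collections import defaultdict, deque

def boot_order(edges):
    """Kahn's algorithm over (service, dependency) pairs.

    Services with no remaining dependencies are bootable; if nodes are
    left over at the end, the graph has a cycle and no boot order exists.
    """
    deps = defaultdict(set)        # service -> its dependencies
    dependents = defaultdict(set)  # dependency -> services needing it
    nodes = set()
    for svc, dep in edges:
        deps[svc].add(dep)
        dependents[dep].add(svc)
        nodes.update((svc, dep))
    ready = deque(sorted(n for n in nodes if not deps[n]))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for svc in sorted(dependents[node]):
            deps[svc].discard(node)
            if not deps[svc]:
                ready.append(svc)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle: no boot order exists")
    return order

# checkout depends on payment and catalog; payment depends on catalog
print(boot_order([("checkout", "payment"),
                  ("checkout", "catalog"),
                  ("payment", "catalog")]))
# -> ['catalog', 'payment', 'checkout']
```

Running the same sort with the edges reversed would give the shutdown order for the disaster zone.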
They want to see on the user interface whether the service is in red or orange or green. So a lot of such requirements kept coming up, and I named the service Meeseeks, because a Meeseeks is a character from that world which just appears out of a box: you just have to press the button and it appears. You assign one particular task to that Meeseeks, the organism which looks like this, and it does that task no matter what, and then it just disappears, poof. That is the entire purpose of a Meeseeks: it does some task and just goes away. Similarly, this product, the Meeseeks service graph that we built, is built to perform a lot of tasks; we'll talk about what all it does. And yeah, that's how the name came up. So what does it do? It provides you the service mesh of the entire Flipkart, understands the dependencies, obviously, and enables disaster scenarios like BCP, the business continuity program; we'll see how it enables that. Then rich querying on the topology: once we have a graph, you can query on the topology, of course. And then intelligent cluster recognition; I'll get to that later, but just keep in mind that it does some cluster recognition. So what were our needs when we were developing this? Our need was basically to make sure that we don't want the developers or engineers across Flipkart to put in any sort of effort to come up with this graph. They shouldn't have to make any config or code changes, take time out of their own sprints, or plan this into their day-to-day job. We wanted to keep it as far away from developers as we could. But at the same time, we wanted the information to be extracted from the network, so that we have accurate and precise information. So we wanted to build it in a bottom-up manner: understand the network traffic, and build from the bottom up.
We didn't want to stop at the API layer and work downwards from there; we wanted to go to the deepest level, which is basically the TCP level, and build it up from bottom to top. Once the core layer has been built, we wanted to solve for multiple use cases; BCP, the business continuity program, is one of the first. What were the alternatives we had? There were no real alternatives as such. There were a few products we studied, Netflix's Vizceral and Microsoft Service Map, which were pretty similar, but not exactly what we wanted to do. And what were the factors that influenced our choice? As I said, we do not want application intrusion, and we have to run network data collection on all the motherships we have, to collect the network traffic between any two VMs owned by Flipkart. Then, coming to how we did it: we ran an agent on all the motherships across Flipkart; you can see all the motherships on the slide. Let me tell you a little bit about the agent. It's an iftop-based agent. iftop is nothing but a utility in Linux. To make it simpler: like top on our Debian or Linux systems, where you run the top command and see an ordered process list, with the process consuming the most resources at the top, iftop does the same ordering at the network level for all the connections on a particular machine, where the ordering is in terms of bandwidth: the connection using the most bandwidth is at the top. That's all iftop is. We took this utility, made a few tweaks to it, and started running it as a daemon on all the motherships. Then we periodically collect data from it, and that data gets ingested into an ingestion service.
What is this data? The data is nothing but a source socket and a destination socket: the source from which the request originated, and the destination where the request went. Once the data gets ingested into MySQL as raw data, our processing layer starts pulling it from MySQL and processing it. What it basically does is start stitching information together. What do I mean by stitching? As I told you in the beginning, there's an application, under that there are services, and under a service there are VMs, which are nothing but virtual machines. What we have here is a source socket and a destination socket, which is source IP and port, destination IP and port. So we start stitching in a bottom-up manner: we know the VM-to-VM mapping; from that we go up to the service-to-service mapping, and then the application-to-application mapping. Quite straightforward, right? That's what the processing layer does, and then it passes the data on to data store clustering; we will get into that. Once that is done, we store these connections, this graph of dependencies between nodes, in a graph database, which is OrientDB. Right now we're using the community edition of OrientDB, and it does the job. The enterprise edition is paid, so we didn't go for it, but you get clustering in the community edition itself, so it works for us as of now. And then there's a REST layer. The REST layer is where you run whatever graph query you want: say, which is the third downstream service taking a dependency on my service, or the eighth downstream service taking a dependency on my service. Any graph query you can imagine, you can run through the REST layer. So, yeah: enrich the data, build the real-time service topology, do intelligent data store clustering, which we'll come to on the next slide.
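The stitching step just described can be pictured roughly like this: each raw socket pair is lifted to a VM-, service-, or app-level edge through an inventory lookup. The IPs, names, and the shape of the inventory here are all made-up assumptions for illustration.

```python
# Raw records are (src_ip, src_port, dst_ip, dst_port) tuples from the agent.
# The inventory maps an IP to its VM, service and application (hypothetical).
inventory = {
    "10.0.0.1": {"vm": "vm-101", "service": "checkout-web", "app": "checkout"},
    "10.0.0.2": {"vm": "vm-202", "service": "pricing-api",  "app": "pricing"},
}

def stitch(records, level):
    """Lift raw socket pairs to edges at a given level: vm, service or app."""
    edges = set()
    for src_ip, _sport, dst_ip, _dport in records:
        src, dst = inventory.get(src_ip), inventory.get(dst_ip)
        # skip unknown IPs and self-edges at the chosen level
        if src and dst and src[level] != dst[level]:
            edges.add((src[level], dst[level]))
    return edges

records = [("10.0.0.1", 53211, "10.0.0.2", 8080)]
print(stitch(records, "app"))   # -> {('checkout', 'pricing')}
```

The same records stitched at `"service"` level would yield the service-to-service edge instead; deduplication falls out of using a set.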
Then store it all in a graph DB. That's pretty much it. Now, data store clustering. You must be wondering why this guy has talked about data store clustering so much. Why? Because in Flipkart, the creation of services is not very logically defined. If you want to boot up a service, stateless or stateful, you can directly create an application as the top umbrella and start assigning instances, which are nothing but virtual machines, under that application itself. So there's no proper logical grouping of these machines; you can run any service you want on those instances. It can be a web service, it can be MySQL, it can be HBase: any service you want, you can put there. There are a few developers who do logical clustering at their level, but what we need is to figure out which sets of instances are running which services, and how they form a single cluster. Say there are five clusters and all five are running MySQL. I need to figure out, first of all, which services are running on each instance. For that, we do a port scan on all the VMs. If port 3306 is open, we know MySQL is running on that instance, right? The default port of MySQL is 3306. Jumping ahead a little: these three green dots you see are nothing but VMs. We do a port scan, and since we have the raw data of source and destination sockets, we know the port numbers open on all these instances. 3306 is open on all three instances, so we determine that MySQL is definitely running on them. Then, adding in the network data: say these two instances are calling this particular instance on port 3306. That has to be a replication call, because in MySQL, replication calls also happen over 3306.
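As a rough sketch of that heuristic: 3306 being MySQL's default port is real, but the VM names and the simplified master-detection rule below are assumptions made for illustration, not the production logic.

```python
def mysql_clusters(open_ports, connections):
    """Group MySQL VMs into clusters using the replication-call heuristic.

    open_ports:  {vm: set of open ports} from the port scan
    connections: set of (src_vm, dst_vm, dst_port) from the network data
    """
    mysql_vms = {vm for vm, ports in open_ports.items() if 3306 in ports}
    clusters = {}
    for src, dst, port in connections:
        # a 3306 call between two MySQL VMs is treated as replication,
        # and the callee is treated as the master (simplifying assumption)
        if port == 3306 and src in mysql_vms and dst in mysql_vms:
            clusters.setdefault(dst, {dst}).add(src)
    return list(clusters.values())

open_ports = {"vm1": {3306, 22}, "vm2": {3306}, "vm3": {3306}, "vm4": {80}}
conns = {("vm1", "vm3", 3306), ("vm2", "vm3", 3306)}
# vm1 and vm2 replicate to vm3, so all three form one cluster; vm4 is ignored
print(mysql_clusters(open_ports, conns))
```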
So once this is done, we have a network topology, and it forms a cluster for us. Now we know for sure that these three instances are nothing but a MySQL cluster. We name that cluster and bubble it up as a service: not a new service, an existing one, but now logically grouped together. Why do we do that? You might have connected a few dots already: because we have to boot this service in an alternate environment, an alternate DC. Similarly for HBase; we do this for every data store you can name. For Hadoop, for example, we figure out the data nodes, the name node, and the journal node, look at their interactions, and group them together under one Hadoop or HBase cluster. Doing all this, it becomes much simpler, much easier, to visualize a data cluster. For this one particular application, the green ones are the incoming dependencies and the red ones are the outgoing dependencies, so it becomes very easy to read. Similarly for the pricing system. Now come the overlays. The question is: fine, you have built a graph, you have a base layer of dependencies across the entire Flipkart. What are you going to do with it? What extra can you achieve from building a dependency graph between the services? I'm going to pick up two use cases: one is BCP, and the other is a data recovery use case, but most of the stress will be on the BCP flow, the business continuity program flow. When I say overlay, an overlay is nothing but enriching this graph with a lot more information. The information has to be tagged onto the services as well as the edges, because the graph is nothing but nodes and edges. Put more information on these two attributes, and you can solve multiple use cases. How? Let's see the BCP overlay.
We tag the services as per their criticality: how critical that particular service is to the entire Flipkart business flow. The criticality goes from AAA+ down through AAA, AA, and A to B, C, D, and E. So we tag the services, which are the nodes, with criticality, and we tag the edges with whether an edge is essential or optional. Why do we do that? Because while developing a lot of services in a microservice architecture, knowingly or unknowingly you end up with a lot of dependency cycles in your service graph. Once these cycles start becoming more apparent in your architecture, it's very difficult to work out the topological order, the boot order, of the services. That's why we have to ask the architects: please tell us which dependencies are definitely essential and which are optional, maybe just a suggested dependency, and so on. So that's why we tag the edges as well. Then there's detecting anomalies; I'll come back to that in some time and tell you what the anomalies are. And the DR overlay is nothing but the data recovery overlay, which is clubbed with the BCP overlay. This is basically how much time a service takes to come back up. We have our own backup service running, and every data store in Flipkart gets backed up, with an off-site backup as well as an in-data-center backup. So how much time a particular service or data store needs to come back up is something we derive very diligently. This is how it looks: these are the sets of services that belong to these app IDs, and from a drop-down you can select the criticality of the service and the service RTO. RTO is nothing but the recovery time objective: how many hours that particular service, which could be a data store, is going to take to come back up. Restoring the data from the off-site backup, booting the whole set of services, starting the replication: this entire thing constitutes the RTO. Over there it's mentioned as one hour. Once all this information has been fed into the graph, it can do wonders for us. So this is how the tagged services look: a service graph where we tag each node with the criticality as well as the RTO. I spoke about detecting anomalies. Let's say the RTO of service 3 is 2 hours and the RTO of service 2 is 3 hours. Up to here it's fine. But when somebody comes and says the RTO of service 1 is 2 hours, that cannot happen, because it's taking a dependency on those two other services. The RTO of service 1 has to be at least 5 hours, because if you total those two, it can only be back up in 5 hours, right? These kinds of anomalies we can easily detect, and we can tell the users, the developers or the architects, that this cannot happen: your service has to take at least 5 hours to come back up because of its dependencies. Now comes the service boot order. Remember, I spoke about having to boot our set of services in the alternate DC. Having the service graph with us, and having those edges marked as optional or essential, we can create a graph like this, which takes an app ID and lists all its dependencies. One more thing: the red and the blue dots are nothing but DCs. We have two data centers, one in Hyderabad and the other in Chennai. A red dot says that particular service is in the Hyderabad data center, and a blue dot says it is in the Chennai DC.
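The anomaly check from the slide example can be sketched like this, using the rule stated above: a service's declared RTO cannot be below the total of its direct dependencies' RTOs. The numbers mirror the slide; the helper itself is hypothetical.

```python
def rto_anomalies(deps, declared):
    """Flag services whose declared RTO is below the sum of their
    direct dependencies' RTOs (the rule from the slide example).

    deps:     {service: [services it depends on]}
    declared: {service: declared RTO in hours}
    """
    issues = []
    for svc, direct_deps in deps.items():
        floor = sum(declared[d] for d in direct_deps)
        if declared[svc] < floor:
            issues.append((svc, declared[svc], floor))
    return issues

deps = {"service1": ["service2", "service3"]}
declared = {"service1": 2, "service2": 3, "service3": 2}
print(rto_anomalies(deps, declared))
# -> [('service1', 2, 5)]  (declared 2h, but needs at least 5h)
```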
So once the graph has been made, we can deduce that the information flowing through one particular service has been taken from all of this set of dependencies, and that the service in turn caters to another set of services. Yeah, so, the Meeseeks journey so far. So far it's been a good journey, because everything we jotted down that we wanted to build is there, starting from a basic service that provides the bare minimum: the services and their interactions with each other, in real time. On top of that, we are able to create a BCP flow, where the architects, the developers, and the business and product folks can sit together, figure out and jot down what the business-critical flows are, relate those flows to the graph, start tagging the services as per their criticality, and create the business flows. All of that is catered for by Meeseeks. They sit on the Meeseeks UI, first jot down their product's set of business flows, and map them directly onto Meeseeks. That is something Meeseeks has provided which was not there in the company before. Doing so, we can, for one, figure out the business-critical flows that have to be replicated in the alternate data center. And when I say replicated, they don't have to be replicated as-is; there can be services which have to be booted up in a very degraded manner. Let's say there is an order management system which interacts with recommendations as well: once you click on a product, a call goes to recommendations to fetch the set of recommendations for that product and that person; if you go to a product page and scroll down, there's a set of recommendations for that particular product and person. If we boot these services in the alternate environment, we do not want to make that call.
We can skip that call, because in the secondary DC we don't have as much infra as we had in the primary DC. So we can degrade our features to boot that particular service, or that entire business flow, in the alternate data center. There are a lot of things that can be built on top of this service graph. Another thing we're working on right now is alerts, as well as figuring out how much bandwidth is being utilized between two services: how many packets are flowing between two different services. Why are we doing that? Flipkart has a data platform; the entire data of Flipkart gets ingested into that platform, and the data people managing it want to know who the consumers of their services are, which new consumers have come up without their knowledge, and how much data a particular consumer is ingesting into a service. All of this information we can track easily using the network traffic we have. So there are multiple use cases that we are working on, that we've been solving every day. So yeah, the journey has been great so far. Another thing I forgot to mention is the clustering. Once we did the clustering of the data stores: NOAA is one team in Flipkart which deals with the backup service, which is nothing but a backup service for all your data stores. Once the clustering had been done, they used the Meeseeks information to come up with a dashboard. Basically, these are the sets of data stores; I don't know if you can read it, but it's a list of data stores: MySQL, Aerospike, ZooKeeper, Kafka and so on. On a per-data-store and per-team basis, they can tell which clusters are onboarded, how many data store clusters have been onboarded onto the backup service, and how many of those have been successfully backed up.
So data store clustering also helped with that other problem, which is observability. Okay, so we have everything; now coming to the network agent. The network agent I spoke about was the iftop-based one we were using. The problem was that iftop, as I said, is like top: it has to do some processing to figure out which connection to show at the top because it's consuming the most bandwidth. It does that processing in user space, and while it does that, it tends to miss a lot of network packets sitting in the kernel buffer. So we had to look for alternatives. Why? We don't want to miss any packet; we need to know how many packets have been exchanged at this particular VM, at every VM. Our next resort was conntrack. conntrack is a Linux kernel module that you basically have to start once your Debian OS comes up. Once it's running, it keeps running and tracks each and every packet. It doesn't go to user space to tap the kernel buffer for packets; it does everything in kernel space. Doing so, it doesn't do a lot of extra processing and does not miss any packets. But the downside of conntrack is the resource utilization: it uses a lot of memory as well as a lot of CPU time. And moreover, the biggest downside of conntrack was that you cannot roll it back. If something goes wrong on that particular mothership, the rollback is very difficult: you have to restart your service, and then basically restart the entire OS. You can't roll it back on a running Debian. Another option for us is ERSPAN. ERSPAN is one of the bright shining stars for us in terms of network collection. It runs on switches; it's a layer-2 level solution.
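For context on conntrack: once the module is loaded, the kernel exposes its flow table at /proc/net/nf_conntrack. A rough parser sketch is below, run on a hard-coded sample line, since reading the real file needs root and the exact field layout varies by kernel version.

```python
import re

# A representative /proc/net/nf_conntrack entry (hand-written sample).
SAMPLE = ("ipv4     2 tcp      6 431999 ESTABLISHED "
          "src=10.0.0.1 dst=10.0.0.2 sport=53211 dport=3306 "
          "src=10.0.0.2 dst=10.0.0.1 sport=3306 dport=53211 "
          "[ASSURED] mark=0 use=1")

def parse_flow(line):
    """Extract the original-direction socket pair from a conntrack entry.

    Each entry lists the flow twice (original and reply direction);
    the first src/dst/sport/dport group is the original direction.
    """
    m = re.search(r"src=(\S+) dst=(\S+) sport=(\d+) dport=(\d+)", line)
    if not m:
        return None
    src_ip, dst_ip, sport, dport = m.groups()
    return (src_ip, int(sport), dst_ip, int(dport))

print(parse_flow(SAMPLE))
# -> ('10.0.0.1', 53211, '10.0.0.2', 3306)
```

This yields exactly the source-socket/destination-socket records the ingestion service works with, without any user-space packet tap.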
So we run this on the switches in our data centers. What it does is very similar to what we're doing right now: it runs on a switch, collects the source and destination sockets, and publishes them to an aggregator. The aggregator is nothing but something very similar to an ingestion service: it sits there and collects all the data, and once the data is collected, you can do whatever you want with it; just pull it into your system and play with it. So ERSPAN is the next option. What's next after that? Additional overlays: as I said, the data backup overlay, and the system health overlay. We have alert services which collect both system-level alerts and application-level alerts; we'll use that alert information to enrich the UI, so that on the user interface, people are able to see whether the health of a service is bad or good or whatever. I think you get the point. Then, enable service orchestration, orchestration being nothing but the boot order of the services: if you want to boot up the entire service mesh in the alternate DC, how should it be orchestrated? And the next step is to make it open source. So these are the three takeaways: the challenges we faced in our own private cloud, with monitoring being one of them; build a service graph; and, once the service graph has been built, all the problems it can solve on its own. Thank you. Any questions?

Can you tell us a little bit about the OrientDB that you're using?

Sure. OrientDB is open source, and we use the community edition. It offers all three models: the relational model, the key-value store model, as well as the graph model. Why we're using OrientDB: we have to store a graph, and our entire data is in the form of a graph, nodes and edges, right?
And since the data is in the form of a graph, all the queries we have are also in terms of the graph: given this particular service, give me the third set of services downstream taking a dependency on it, and so on. Not just that. We have our own infra, our own cloud, and there are teams which keep requesting capacity every now and then. Using the service graph and the graph DB, you can, let's say, run a disconnected-components algorithm and see the islands in your graph. Once you see the islands, you get to see: oh, this service is not interacting with anybody, and the last thing it interacted with was one particular service, and here is how much in resources it is consuming. When you have all this information, from the disconnected-components algorithm run on the graph DB, which is very quick, you can recover those resources. You can ask the team: you're not using this particular service, can we reclaim its resources? So that way, it's very useful.

Thank you, that answers it.

Another question: have you evaluated Envoy? Setting it up as a proxy, it will capture HTTP traffic, and it can act as a MySQL proxy as well. It gives a lot more metrics, and it would really help compared to iftop and this stuff.

When you say metrics, what sort of metrics are you asking about?

Envoy inherently will give you a lot of metrics, since everything is going through it.

Sure. We actually explored Envoy; the problem was the system utilization. As I said, we don't want to intrude into the application ecosystem. Let's say you're a developer and you have requested five instances to run your application or your service. We don't want to use up a lot of resources of that particular cluster. A lot of people don't like that intrusion.
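The islands idea from the answer above, minus the graph DB, is just connected components over the undirected service graph. A small sketch with hypothetical service names:

```python
def islands(nodes, edges):
    """Return connected components of an undirected service graph.

    Single-node components are island services: they talk to nobody,
    so their capacity is a candidate for reclamation.
    """
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adj[node] - seen)
        components.append(component)
    return components

nodes = ["checkout", "payment", "legacy-reports"]
edges = [("checkout", "payment")]
# legacy-reports comes out as a single-node island
print(islands(nodes, edges))
```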
If I start running a daemon which consumes some amount of resources, even 1% of a CPU, developers may not like it: why are you using 1% of my CPU? Their set of metrics just goes off, because they are collecting metrics for their own system, and since I'm using 1% of their CPU, things don't add up. So we wanted to stay at the mothership level and do everything at the mothership with minimum resource utilization. That's the main reason we are still working with these lightweight utilities, and going further down to layer two, which should do what we're after.

Hi, I have a question. From what I can see here, you're building a service mesh, right? Are you doing tracing as well, or not?

That's the next step. Tracing is the next thing; a team is already working on it. There's a product called S-Trick which is in beta right now, and we're still exploring that. So yes, tracing is the next step.

And I saw that you tag every service that has 3306 open as MySQL, right? Logically you can group it as MySQL. But in case the two of us are using different MySQL clusters, his MySQL cluster is one thing and my MySQL cluster is something else, we'd need a different kind of nomenclature, right? You cannot just name both of them MySQL; we cannot differentiate which MySQL cluster pertains to which service.

So, just to understand your question: you're saying he's running MySQL on 3306, and you're also running MySQL on 3306, but his MySQL cluster is different from yours, you both have different MySQL data stores, and since both would be tagged MySQL, we cannot differentiate which cluster pertains to which service?
So are those two MySQL clusters interacting with each other? No, they're totally independent. Then we won't club them together. Say there's a master and two slaves, one of them, say, a backup. When those two slaves make replication calls to their master, we know it's one cluster. Since those two slaves will never make a replication call to his master, we know his is a separate cluster.

But how do you manage the names? The naming we manage: when developers create an application, once the app ID (the name of the app) has been given, we start generating names using that app name, like appname-mysql-1234. That's the nomenclature we use, and we give developers the freedom to go and change that name as well.

And one last question: to get the service mesh, did you really need to go through all the TCP packets? You could have a discovery architecture: before I connect to a particular service or application, I go to a service-discovery layer. I don't connect directly to your service; I connect to service discovery and it routes me there. As of now Flipkart doesn't have service discovery; we're in the process of building it. And I understand what you're saying, but even then, people run cron jobs on their instances, and those cron jobs hit multiple APIs. You have to capture that as well: who are the people running all these random cron jobs that hit those APIs? That's my point: we shouldn't need to talk to the service directly; everything should go through discovery.
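The replication-based grouping and the app-name nomenclature described above could be sketched like this (the app name `orders` and the instance names are hypothetical; the edges stand in for observed slave-to-master replication traffic):

```python
def mysql_clusters(replication_edges):
    """Union instances that are connected by observed replication traffic."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for slave, master in replication_edges:
        parent[find(slave)] = find(master)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Two independent clusters, even though every instance listens on 3306.
edges = [("db-s1", "db-m1"), ("db-s2", "db-m1"), ("db-s3", "db-m2")]
for i, cluster in enumerate(sorted(mysql_clusters(edges), key=min), start=1):
    print(f"orders-mysql-{i}:", sorted(cluster))
```

Because `db-s3` never replicates from `db-m1`, it lands in a second cluster and gets a distinct generated name, which developers could then rename.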
It should go through discovery, but it doesn't. I know what you're saying, and it's true: it should, but it doesn't. Because that's exactly the case for Envoy, right? I'm not even saying go through a sidecar; that would use resources, sure.

My question is about the automated dependency detection you mentioned. It makes sense that if an application is talking to, say, a data source, you can determine that the application depends on that data source. But if two different applications are interacting with each other, how do you automatically determine which application depends on which? So take a stateless application. All applications at Flipkart sit behind an ELB, a VIP. So it's an instance-to-VIP call: if I'm an instance of one service and I make a call to the VIP of a different service, that VM-to-ELB call gets tracked, and we know this service is calling that service.

You talked about capturing at the switch level. You have a bare metal host with different VMs on it; you must have a mechanism to know, at the switch, which bare metal a call is coming from. Yes, SPAN does that for us.

You mentioned clustering the data stores; that seems like a really crucial piece of the whole thing. Is that only for visualization on the graph, or do you also use that data when those instances are deployed somewhere in the data center?
Do you also make sure, say, that they're in the same availability zone, or do any sort of post-processing on the cluster data you've built? The first part I got; the last part, what kind of data processing do you mean? For example, you've made a logical cluster: you know these three DB instances interact with each other and belong to the same app ID. Is that only for visualization, or are there more use cases? Understood. Visualization is one, for sure. The other, as I said, is the most important use case: keeping the business running in case of a disaster. When we boot this set of instances in a different data center, we need to know what each instance is, what it logically represents, where it belongs. Once we've grouped them together, say as a Redis service, we know it's a Redis service and we know what its configuration was, so we can replicate that: the set of instances this particular service runs on has to come back up as a Redis service. By grouping, we're telling everyone: this is a Redis service, and you have to bring a Redis service back up.

Got it. And when you replicate: Redis could have replicas in different availability zones, right? Would you replicate as-is, or do you optimize how the cluster is deployed onto hardware instances? Or do you only replicate whatever was there to the other data center? There are multiple approaches for replication; deployment is one of them.
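The disaster-recovery use of the grouping described above could be sketched as follows (the group records, field names, and the target DC name `dc2` are all hypothetical; the point is that each logical group carries enough metadata to be brought back up elsewhere):

```python
def dr_plan(groups, target_dc):
    """Given logical service groups, list what must be brought up in target_dc."""
    plan = []
    for svc in groups:
        plan.append({
            "service": svc["name"],
            "kind": svc["kind"],            # e.g. "redis", "mysql"
            "instances": len(svc["members"]),
            "dc": target_dc,
            "config": svc["config"],        # replicate the recorded topology
        })
    return plan

# One hypothetical grouped service discovered from traffic.
groups = [
    {"name": "cart-redis-1", "kind": "redis",
     "members": ["i-1", "i-2", "i-3"], "config": {"topology": "sentinel"}},
]
plan = dr_plan(groups, "dc2")
print(plan[0]["service"], "->", plan[0]["instances"], "instances in", plan[0]["dc"])
```

The grouping is what turns "three anonymous VMs" into "a Redis service with this topology", which is what the BCP bring-up in the other data center needs.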
Take Redis: you can run it with a plain master-slave topology, or with Sentinel, where one Sentinel setup manages the masters and slaves. Those are two different topologies. In the first, master-slave topology, there's no Sentinel looking after your cluster: if the master goes down, you need manual intervention; somebody has to go and promote one slave to master. But with the Sentinel approach, Sentinel takes care of promoting one of the slaves to master, and no manual intervention is needed.

What about encrypted traffic, like HTTPS? Yes, it should work, because what we're capturing is down at layer two; we capture the packets that pass through regardless.

You want to construct the dependency graph; even if there's packet loss, what's the issue? There's no issue for the graph itself: eventually we converge to a state where we have the right dependencies. But as I mentioned, there are teams at Flipkart that want every packet captured. Say, for FDP, you have a Hadoop cluster and people are ingesting into it. You want to know who your clients are, which services are ingesting into your Hadoop cluster, and whether they're abiding by the contract between you, the service provider, and them: whether the amount of data they're ingesting is the right amount, and the bandwidth as well. When you take an instance of any type, there's a bandwidth limit on that instance, so you have to make sure your clients aren't pushing beyond that bandwidth.
And if they are, you tell them: you're using too much bandwidth, please stay within the limit. People also want to know how much VM-to-VM traffic is going on and which are the two chatty motherships. That's another use case you can solve with this: move a VM to a different mothership which is not as chatty. So it works both ways; it solves two problems, basically.
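The contract check described above (bytes ingested per client versus an agreed limit) could be sketched like this; the service names, the `hadoop` cluster label, and the limit are hypothetical, and the flow records stand in for per-window byte counts from the packet capture:

```python
from collections import Counter

def over_limit(flows, limit_bytes):
    """Sum bytes per (client, cluster) pair and flag contract breaches."""
    totals = Counter()
    for client, cluster, nbytes in flows:
        totals[(client, cluster)] += nbytes
    return {pair: total for pair, total in totals.items() if total > limit_bytes}

# Hypothetical per-window flow records into a Hadoop ingestion cluster.
flows = [
    ("svc-a", "hadoop", 4_000_000),
    ("svc-a", "hadoop", 3_000_000),
    ("svc-b", "hadoop", 1_000_000),
]
print(over_limit(flows, limit_bytes=5_000_000))
# {('svc-a', 'hadoop'): 7000000}
```

The same aggregation, keyed on (source VM, destination VM) instead, would surface the chattiest mothership pairs mentioned above.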