Hello everyone. My name is Ambar and I'm the engineering director for observability platforms at eBay. I'm here to present eBay's network monitoring platform, called NetVision. Here's what I thought we'd do today: I'll walk you through a quick background on why we needed this platform, talk about what we achieved with it and the architecture on which it's built, and then show you a quick demo of what we built. For those who are not familiar, eBay is one of the largest online marketplaces in the world, with a global footprint and a presence in multiple countries. As you can imagine, the scale at which we operate introduces some very interesting infrastructure challenges for us. One of the biggest is making sure that our site is always available for the customers who rely on it throughout the world. To assist with site availability, we have a centralized operations team called the Site Engineering Center. The charter of this team is to monitor the health of the eBay site and, if they detect any issues or problems, to quickly remediate them using the tools available to them. As you can see from the picture below, they sit in a room with screens and monitors on the wall, and all of these screens display essential, critical telemetry data in the form of metrics, or alerts when they pop up. This centralized operations team uses a monitoring platform built to ingest telemetry data from our entire infrastructure: from the data centers, to the networks, to the compute deployed on them, to the cloud, to the applications built on top of that cloud. So we get telemetry data for our entire stack. This monitoring platform is able to detect anomalies based on either historical data or rules that are configured within the platform.
It generates signals for this operations team. Once the operations team gets these signals, in the cases where we don't have auto-remediation, they can quickly look at them, determine the root cause of why these alerts are happening, and then remediate. As you can imagine, it is critical that this operations team is able to quickly pinpoint the problem when an alert shows up. The picture on the right is a sample alert dashboard available to this operations team; in this case you'll see load balancer connection stacking alerts showing up. Connection stacking is a kind of catch-all bucket of alerts for us: if for whatever reason the application is misbehaving, connections on the load balancer side will start stacking. There may be a variety of reasons why an application might be stacking connections on the load balancer, things like high GC, high CPU spikes, or high latency on database queries. Now, we had other signals available to this team to help them start triaging, but one critical thing that was missing was the ability to figure out whether a network problem was happening. They did not have a signal that could indicate whether there was a network problem and, if so, where in the network the problem existed: which top-of-rack switch, which bubble, which pod is impaired, and which applications deployed on that particular part of our network are being impacted. The goal was to make sure we could answer these questions for that operations team. So with that in mind, we built this new product called NetVision.
It's built on the concept of end-to-end probing to detect packet loss in the eBay network. For reference, Facebook released a paper on NetNORAD a few years ago, and NetVision is built using the concepts published in that paper. The idea is that it continuously detects packet loss by generating UDP traffic across our network. When it detects packet losses, the algorithms we built into it figure out whether the loss should be classified as an alert and, if yes, further classify it by severity: warning, minor, major, or critical. NetVision also has the concept of agents that are deployed across our network infrastructure to send these UDP packets out, and we have automated agent health monitoring and coverage checks as well, to ensure that we are covering our entire network. NetVision is present in two locations (eBay itself is deployed in three data centers), in an active-active architecture with the ability to auto-failover. We also have the ability to suppress and silence alerts within NetVision. This is particularly useful when there is, say, planned maintenance happening in the network: we know we're going to get alerts, but we don't want to spam the operations team, so we can quickly go in and silence them. So let's deep-dive a little bit into the architecture. What you see below the dotted line is the entire NetVision system, and what you see on top is our eBay network. NetVision has an agent-and-master architecture: the agents are deployed on the servers themselves, and the master, which we call Shogun, is the heart of the system.
So let's talk about the NetVision system first, and then we'll talk about how the whole thing works end to end. Going from left to right, we have a system called CMS, which stands for Configuration Management System. It holds the entire eBay topology: the data centers and, especially for the networks, what devices are deployed and what links exist between those devices. So this CMS system has the entire topology already built out within it. The NetVision data store is a configuration store that we use for configuring the agents. We have a cache built out on top of it to make sure we can query this data quickly; this cache is refreshed four times a day. On top of this cache we have the agent state manager, which is responsible for maintaining the state of the agents and making sure they are always up and running. If there is any problem, the agent state manager is able to remediate it within the agent itself. The health checker is the component that checks the health of these agents; if a problem is found, it talks to the agent state manager to make sure it gets remediated. Lightning is the component used to publish health probes to these agents, to figure out whether they're healthy or not, and it's also used to push config data out to each of these agents.
The event collector is the component within the system that collects the events coming back from the agents and passes them on to the algorithms that crunch the data, determine whether the packet losses constitute an alert, and assign a severity. Once alerts are detected, they are passed through the suppression handler and then to the publisher, from where they're published to a centralized monitoring platform from which the operations team consumes them. Above the dotted line you see a regular network: we have servers within racks, and on top of the racks we have the layer of top-of-rack switches. On top of that we have the bubbles, then the pods, and then the backbone network. Like I was saying earlier, eBay is deployed in three regions, and those regions are connected through this backbone network. So let's walk through an example of how this works. Like I was saying earlier, we have the NetVision data store that has the configuration for what each agent needs to do, and we also have the network CMS cache, which has a mapping of all the links and how these devices are connected to each other. The agent state manager crunches the data coming from the NetVision data store and builds out configs for each one of these agents, and then the Lightning system pushes those configs out to the agents. The agents are now configured to start sending: as a sender, an agent like N1 will have destinations configured, telling it to send UDP probes to multiple destinations.
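To make that concrete, here is a minimal sketch of what such a per-agent probe config might look like once the agent state manager has joined the NetVision data store with the CMS topology. The field names, addresses, and the `expected_events` helper are hypothetical, illustrating the idea rather than eBay's actual schema.

```python
# Hypothetical per-agent probe config, illustrating the kind of record the
# agent state manager could build from the NetVision data store plus the
# CMS topology cache. All field names are assumptions, not eBay's schema.
agent_config = {
    "agent_id": "n1",
    "rack": "phx-rack-1",
    "probe_interval_seconds": 60,  # later reduced to 30s in production
    "destinations": [
        {"agent_id": "n5", "host": "10.20.3.5", "port": 9000},
        {"agent_id": "n7", "host": "10.20.4.7", "port": 9000},
    ],
}


def expected_events(config):
    """One 'received' event is expected per destination per probe cycle."""
    return {(config["agent_id"], dest["agent_id"])
            for dest in config["destinations"]}
```

A config like this is all a sender needs: any destination that does not report a matching "received" event within the cycle becomes a candidate packet loss.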
And once it receives that configuration, the agent starts sending these UDP probes every minute. That's configurable; we can reduce it to, say, every 30 seconds, but let's just assume for now it's every minute. So the agent starts up and takes that configuration on. For example, let's say agent N1 is supposed, based on its configuration, to send UDP probes to N5, which is located right here. It starts sending the UDP probes out. Let's say the probe takes this path: it goes to ToR 1, then to bubble 1, then to pod 1, then up to the Phoenix backbone, then it jumps over to the other data center through the backbone, then it goes to pod 2, bubble 3, top-of-rack switch 3, and then it reaches server 5, the intended destination. Once that agent receives the UDP packet, it sends an event back to the NetVision system saying: I was supposed to receive this packet, and I received it. Now let's take another example where N1 has a configuration where it needs to send a UDP packet to N7, which again is right here. It takes the same path on the way out: ToR 1, bubble 1, pod 1, over the backbone to the other data center, then pod 2, then bubble 4, then ToR 7 (top-of-rack switch 7), and then down into the rack to server 7. It's supposed to take this path, but for whatever reason the packet doesn't arrive: server 7, for that one-minute cycle, did not report an event saying it received the packet it was supposed to receive from N1.
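The probe cycle above can be sketched in Python, which is the language the agents are written in (as mentioned later in the Q&A). This is a hypothetical, minimal illustration: the JSON payload format, port, and event schema are assumptions, and a real agent would loop on a timer, track sequence numbers per destination, and batch its events back to the event collector.

```python
import json
import socket
import time


def send_probe(sender_id, dest_host, dest_port, seq):
    """Send a single UDP probe; one goes out per destination per cycle."""
    payload = json.dumps({"sender": sender_id, "seq": seq, "ts": time.time()})
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode(), (dest_host, dest_port))


def receive_probe(sock, receiver_id):
    """Receive one probe and build the 'I received it' event that the
    receiving agent would report back to the event collector."""
    data, _addr = sock.recvfrom(2048)
    probe = json.loads(data.decode())
    return {
        "type": "received",
        "sender": probe["sender"],
        "receiver": receiver_id,
        "seq": probe["seq"],
    }
```

UDP is a deliberate fit here: there is no retransmission, so a probe that is dropped anywhere along the path simply never produces a "received" event, which is exactly the signal the loss-detection algorithms consume.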
Now, at the same time, in the same cycle, say N3, right here, is configured to send probes to N5. Again, it starts publishing UDP packets: the packet goes from N3 to ToR 3, then to bubble 2, then to pod 1, then onto the Phoenix-to-SLC backbone, then it's supposed to jump over to pod 2, then to bubble 3, then to ToR 3 on the destination side, and then it's supposed to reach here. That's the expected path. But for that one-minute cycle, again, server 5 (N5) did not report an event saying it received the UDP packet. That means the packet was lost somewhere in the network. These are just three examples, but as you can imagine we have thousands of these agents deployed on our network, and for every probe cycle they generate millions and millions of events. So even though this is a very small sample set, you can imagine what happens when there are enough of these going. We know the expected path between any pair of agents; for example, between N1 and N7 in this example, the probe can take a defined path, and we know what that path is based on the CMS data. So when these packets start being lost, we can quickly figure out the common layer at which they are getting lost. The confidence algorithms look at what the data should be, that is, which events should have arrived, and then they look at which packets are actually getting lost.
And then they start looking for the common layer at which these packets are getting lost. For example, in this case, if many packets are getting lost along this path or along this path, you'll see that the common layer through which all of these lost packets were supposed to travel is this pod 2. That's the common element, so most likely there is something wrong there, where the packets are starting to drop. And like I was saying, there are close to millions of packets that we publish out every 60 seconds, and we get those millions and millions of events back into the centralized system; that's how we're able to crunch the data within these algorithms and figure out what the commonalities are. For the most part, we are very confident about detecting the layer at which these packets get lost. Again, like I was saying, we are able to define the severity of these alerts based on how many packets get lost and how many agents are reporting packet losses. For example, if only one agent is reporting consistent packet loss, it may not be a major issue. But if many agents start reporting packet losses, commonly across a layer or across a device, then we definitely have a problem, and we start publishing alerts to the operations team. Like I was saying earlier, we also have a health system that sends health probes to make sure the agents are up. If, for whatever reason, these health probes come back indicating that an agent is unreachable, the agent state manager goes in and either replaces the agent or restarts it.
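The common-layer localization can be sketched as a counting exercise over expected paths. In this toy version, the `EXPECTED_PATHS` dict stands in for the CMS topology cache (device names echo the walkthrough but are illustrative), and the severity thresholds are invented for illustration; the real confidence algorithms also weigh how many probes succeeded through each device, since loss is usually partial rather than total.

```python
from collections import Counter

# Expected device path per sender->receiver pair, standing in for the CMS
# topology cache. Device names mirror the walkthrough but are illustrative.
EXPECTED_PATHS = {
    ("n1", "n5"): ["phx-tor1", "phx-bubble1", "phx-pod1", "backbone",
                   "slc-pod2", "slc-bubble3", "slc-tor3"],
    ("n1", "n7"): ["phx-tor1", "phx-bubble1", "phx-pod1", "backbone",
                   "slc-pod2", "slc-bubble4", "slc-tor7"],
    ("n3", "n5"): ["phx-tor3", "phx-bubble2", "phx-pod1", "backbone",
                   "slc-pod2", "slc-bubble3", "slc-tor3"],
}


def localize_loss(lost_pairs):
    """Rank devices by how many lost probes' expected paths cross them;
    devices shared by the most lost paths are the likely culprits."""
    counts = Counter()
    for pair in lost_pairs:
        counts.update(EXPECTED_PATHS[pair])
    return counts.most_common()


def severity(reporting_agents):
    """Toy severity thresholds based on how many distinct agents report
    loss against the same layer (invented numbers, not eBay's)."""
    n = len(reporting_agents)
    if n >= 8:
        return "critical"
    if n >= 4:
        return "major"
    if n >= 2:
        return "minor"
    return "warning"
```

Devices that many lost paths share rise to the top of the ranking; combining that ranking with counts of successful probes through each device narrows the candidate set further, which is the kind of cross-checking the confidence algorithms do at scale.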
Over multiple rounds of trial and error, we've figured out that to maintain coverage across the entire eBay network we need about two agents per rack. That's the sweet spot we found for making sure that we have coverage across our network infrastructure and that we get enough signals to detect these packet losses. So that's how the architecture works. This is how we display the alerts to the operations team. Like I was telling you earlier, we have the Site Engineering Center wall, and this is one of the alerting dashboards that shows up on it. On the left panel you'll see some point-of-presence locations and data centers; the red, orange, yellow, and blue indicate the severity of the alerts we are seeing in each one. On the right you'll see exactly which device is alerting, and the ratio of packet losses across these data centers. It'll also tell you which type of device it is: a bubble, an access switch, or a top-of-rack switch. The colors indicate the severity of the alert. As for the visualization, which I'll show you in the demo as well: reading the concentric circles from outside to inside, the outermost circle indicates the top-of-rack switches (there are the most of those), the next ring in indicates the bubbles, then the pods, and then the innermost circle indicates the backbone of the network.
So where you see light green dots, those indicate our top-of-rack switches, and you'll see fewer and fewer of those as you come inward, because the number of devices at the bubble layer is obviously smaller, and the number in the backbone is much smaller still. So that's one detection dashboard that we have. We also have a triage dashboard: once an operations engineer sees an alert, they can quickly come to this other dashboard and drill down into it. They can figure out the actual device and the severity, and get some more details: how many senders are reporting the problem, how many senders actually sent the probe, and how many receivers were supposed to receive it; we also display probe-loss standard deviations and percentages. When you expand this, you can see more detail for each of these alerts, along with the severity and a distribution of how many alerts are being generated. So I'll show you a quick demo of how this actually works on the wall; let me refresh the screen. Again, this is a demo, not the live product. As I was telling you earlier, this is how the dashboard looks on the wall: ToRs, bubbles, pods, and then the backbone. So imagine we detected packet loss; this is how the thing lights up on the SEC wall. You'll see that we've seen two alerts at this bubble level: two ToRs are reporting under this bubble, and one ToR is reporting under this other bubble, so we are seeing packet losses against these ToRs. At this point, we start seeing more packet losses reported across different ToRs.
So we're seeing more and more alerts with each refresh cycle, with each cycle of those probes. Now suddenly we see one more show up, so the counter here is increasing. We're seeing more and more, and we now raise the severity of these alerts, because we are potentially starting to see a problem in this region: we now have three of them reporting here. Then, when we start seeing a fourth one here, we elevate those alerts to the next level. Initially they were reporting at the top-of-rack-switch level; once we started detecting more and more packet losses, we elevated that to the bubble level, so now the alert is at the bubble level. In the other cases, as you can see here, where there was only one alert, the alert stayed at that level; here there are two of them, so the alert is elevated but not critical. Here we see more and more of these packet losses, which means there is something at the bubble level, so we raise the severity of the alert and we also bring it into focus for the operations team, to say: you need to start focusing on this, there's something going on here. The severity keeps going up, and you'll start seeing more and more of these show up in the critical bucket. So that's the end of how this whole thing works. Thank you for your time, and I'm happy to take questions. Thank you. All right, I do believe we have the phone bridge ready to go, so Ambar, you are ready to answer questions. That's good. All right, Ambar, are you able to see the Q&A with speakers in the panel? Yes, I'm answering questions there. Okay, great. Sorry, it was muted. Ambar, there's one that slipped into the chat as well, which is the other part of the Q&A; I just put it in our internal chat. Okay, I'll take a look.
If there are multiple equal-cost paths: yes, I just answered this in the chat. There were a couple of questions around latency. This is an enhancement we recently added, where we are now able to detect latency degradation as well. It is still being tested, so we have not deployed it in production; the production deployment only detects packet losses. Another question was about the average UDP packet sending rate. In production right now we send UDP packets every 30 seconds across all of the configured paths in our network infrastructure. A follow-up question: you mentioned that the agents test connectivity to other agents and talk to the central master; do the master and agent communications use the same path as the UDP probe packets, and if so, does that disturb monitoring? That's a great question. Yes, we do use the same path, and there is a possibility that it can disturb monitoring. That's why, as I was saying earlier, we deploy enough of these agents to make sure we have coverage across our infrastructure, and we do have two per rack to ensure redundancy. So unless there is a total network failure, we should have enough coverage. Someone was wondering what platform NetVision is running on. The agents are built in Python. The back end is built on an ELK stack right now, an Elasticsearch-Kibana stack, but we are moving that over to a different back end called ClickHouse, just because the scale is too much for Elasticsearch to handle at this point. And the visualization you saw is a custom-built visualization using WebGL. And Ambar, I'm sorry, but we are at time.
I did post the message for everybody in the attendee chat with the Slack channel. If you want to continue asking questions, Ambar can answer them through the Slack channel. Absolutely, I'm happy to answer any questions there. Okay. Okay, I'm going to go ahead and disconnect the phone bridge then. All right, thank you. Thank you.