So I don't have my co-presenter here, Sudeen Dharamurthy; unfortunately he couldn't travel, so I'll start myself. My name is Anand Palaniswamy, and I manage the cloud networking group for both eBay and PayPal — that includes SDN, Neutron, LBaaS, and DNS for both companies. I just want to go over — of course, you know our company very well — how critical networking is, whether we are talking about virtual networks or physical networks, because the serious business behind it is not just IT infrastructure.

This is our cloud today; it has probably been shared in multiple presentations already, but I want to go over it before we get into cloud networking and SDN. I'm talking only about this OpenStack cloud here — our overall infrastructure is much bigger than this, and it's a journey to migrate towards it. Currently we have around 8,500 hypervisors, 70,000 VMs, a lot of block storage, and thousands of internal cloud users. By mid-2015 it will be around 10,000 hypervisors.

Looking at the deployment architecture, I want to go through how we deployed it, then drill down visually into how we deployed SDN specifically and where it fits, so you get a perspective of where things are and how we are scaling out. We have multiple geographical regions, every region has multiple availability zones, and within every availability zone we deploy OpenStack. We tie all of these regions together using a global Keystone.
So it looks like a single cloud, but every region is independently orchestrated. Even if something happens to OpenStack in one particular region or availability zone, we don't impact the business — even if that region or availability zone goes down. It's not like some enterprise clouds where sometimes everything goes down together; we architected it so that even if you want to do maintenance in the core, we take only that particular availability zone out for maintenance. Of course we have redundancy everywhere — in the switches, routers, and cores — but sometimes you get into a critical infrastructure upgrade and need to take an availability zone out of the traffic.

We kept all of this in mind before deciding to deploy the cloud in production. It's not that we just deployed somewhere, it didn't work, and then we took a huge hit that impacted the business — that would be worse than not deploying OpenStack at all.

Every availability zone is a fault domain — that is key in the design. In the earlier days, an application team would say: my application needs this particular availability, so I need to distribute across multiple fault domains within the racks — I cannot have my VMs under one half rack or quarter rack; I need to distribute everywhere. No. Your application needs to scale across availability zones; even if an availability zone goes down, as I mentioned, it should not impact the business. And you might very well end up in the same half rack if we don't have enough capacity in other places — it depends on the flavor you are requesting from Nova.
You have to account for all of that. You cannot guarantee your application will always be spread across multiple racks, multiple switches, multiple fault domains at that layer — the fault domain is the availability zone, not the rack. And when I talk about an availability zone, it's thousands of hypervisors.

An availability zone maps to more than one physical network bubble — that's key as well. When we design bubbles, based on the network infrastructure gear you bought in the past, you might not be able to fit your availability-zone scale into a single bubble. There will be more than one bubble, but all those bubbles are tied together in a single availability zone. What does that mean? If you are doing maintenance at the bubble level, you have to be really, really careful, because your availability zone comprises multiple bubbles. The newer availability zones we are deploying might end up as one bubble per availability zone, but that has a lot of implications for how you deploy in existing data centers.

Then, to scale out within an availability zone, we have multiple Nova cells. An availability zone has several thousand hypervisors, which is not easy to manage with a single cell, so we had to go to multiple cells — I'll talk more about that in the next slides. So now I'm double-clicking on a single availability zone: we went from the region into the availability zone, then into the cells, and within every cell are the racks.
From the end-user perspective — eBay and PayPal customers — you come from the Internet, hit one of our firewalls and edge routers, and then I'm talking about the aggregation layer, where the cloud meets the public traffic, or internal traffic. Your cloud starts from there.

When I talk about SDN, a lot of people think only overlay is SDN, and some people don't consider this SDN at all. For me, SDN is an API to operate your network infrastructure. It could be overlay, it could be underlay, or it could touch the core, access, or aggregation layer. Neutron doesn't have all those APIs, and we need to build them when we talk about Neutron as network-as-a-service.

The aggregation layer here is a three-tier architecture, the access layer is nothing but your top-of-rack switch, and these are the racks. In every availability zone there are many racks sitting together. The last piece is very critical: it is also attached to the aggregation layer and also has a top-of-rack switch, but the SDN gateways, load balancers, and firewalls all sit there, while the rest of the infrastructure is identical. You could ask why we even have load balancers and firewalls in a specialized rack — because they are not cattle, they are puppies. They need a special rack.
And what happens if you don't have enough space and power to keep adding load balancers and firewalls there? That's where things get really interesting: why can't we move the firewalls and load balancers as virtual instances into the compute racks themselves, and keep only the gateways and similar pieces in a specialized rack? Of course, whether you can move a given advanced service depends on the throughput you need out of it in your data center. The question is always: you can scale out your racks, but you cannot fit all of your layer-4 to layer-7 advanced services — there isn't enough space and power, and you can't extend a data center within a day. So what do you do?

This is what it looks like inside an AZ, our availability zone. Within an availability zone we now have the entire OpenStack deployment: Keystone, Nova, Nova cells, Cinder, object storage — everything lives within that particular availability zone. The availability zone is tied to a region, and we have some regional services, which sit under the global Keystone — the target state across all our regions in multiple geographies. On top of that, when we talk about SDN, the important concept we introduced to meet our business needs is the virtual private cloud.
Some public cloud providers already have an API to create a virtual private cloud, and OpenStack doesn't have APIs for that yet, but we had a business need, so we introduced the concept with our own internal patches, and we have an API for it across the multiple components used in our solution.

A VPC comprises multiple resources that you group together — a business unit, or a business function within the enterprise. You have specific requirements: traffic cannot be shared with other business functions; you want to be self-contained in networking, storage, and compute. How do you isolate them? You don't want to put firewalls everywhere — that creates islands for you, and to move away from that you have to rewire your infrastructure. In a large infrastructure you file a ticket with site services, they have a two- or three-day SLA, and they move the racks, recable, reconfigure the top-of-rack, or put it in a different VLAN — think about how much time it takes to move capacity from one place to another. That's what SDN solves, and here it is very important for us: the VPC is very dynamic, and capacity needs to move from one business to another.
Sometimes the business buys $100 million worth of gear and then shrinks down, or it expands — you have to expand dynamically based on business growth. That's why agility is so important for the business, and scaling the network infrastructure is key to it.

A VPC is a security zone. As I said, it's self-contained and has its own security rules, and every VPC is firewalled off: you cannot reach some ports or some networks directly — multiple policies are enforced around the VPC. But within a VPC you can still do fine-grained control using security groups. Within a VPC you have multiple applications; you want to open some ports, close others, follow certain protocols — and you can do all of that using regular OpenStack security groups.

To introduce the VPC, we had to make changes to Keystone. A VPC has multiple projects, with one admin project that owns all the resources; underneath are other tenants that inherit those resource capabilities. For example, take a DNS zone: say there's a wallet application — mobile wallet — which is one portion of the application and lives within one particular virtual private cloud, and they all share the same DNS zone. But of course we want control over who manages that zone, so it is managed only by the admin tenant in that case. We had to make changes for that.
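To make the ownership model concrete, here is a rough sketch — not eBay's actual code, and with hypothetical names — of the idea described above: a VPC maps to an admin project that owns shared resources such as a DNS zone, and member tenants can see those resources but not manage them.

```python
# Hypothetical sketch of the VPC/Keystone ownership model described above.
# An admin project owns shared resources; member projects inherit
# visibility, but only the owner may manage the resource.

class Project:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # the VPC admin project, for member tenants
        self.owned_resources = {}     # resources this project manages

    def shared_resources(self):
        """Resources visible to this project: its own plus the VPC admin's."""
        merged = dict(self.owned_resources)
        if self.parent:
            merged.update(self.parent.owned_resources)
        return merged

    def can_manage(self, resource):
        # Only the owning (admin) project may modify a shared resource.
        return resource in self.owned_resources


vpc_admin = Project("wallet-vpc-admin")
vpc_admin.owned_resources["dns_zone"] = "wallet.example.internal"

app_tenant = Project("wallet-checkout", parent=vpc_admin)

print(app_tenant.shared_resources())      # the tenant sees the shared DNS zone
print(app_tenant.can_manage("dns_zone"))  # False: the zone is admin-managed
```

OpenStack later grew hierarchical multitenancy in Keystone; at the time of this talk, this behavior required the internal patches the speaker mentions.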
On the multi-tenancy model: within a particular VPC we wanted one big routing domain, because within a VPC you don't want any latency — I'll talk later about some of the issues we faced. So we have a big logical router spanning multiple racks, logically carved out, and we need to run that logical router across all those racks. If you add a network, you are automatically still within that virtual routing domain.

What I'm going to talk about next is stretching that virtual switch, that virtual routing domain, across multiple racks. If you want a /16 network for a particular VPC, can you have a virtual router that handles that much traffic? What are the scaling limits? And what restrictions do you have if you can't do it? That's what I'll cover in the next slides.

On the network design itself: as I mentioned, you have your VPC carved out. Today we run both overlay and bridged networks. We run overlay for some workloads — ones that are not latency sensitive, and that are fine living within a small footprint. For example, if your virtual routing fits within /24 or /23 networks, we can run a small router that scales for you.
But suppose you are talking about a /18 — then you cannot do this, because to reach the next network you have to go out through the gateway and come back, through other switches and routers, and that introduces latency. In those cases we don't use overlays; we go directly to the bridged network.

Security groups also had some limitations: when you introduce more and more security groups at the OVS layer, how many extra flows get introduced, and what bottleneck did we hit? I'm going to talk more about that as well. And what issues do you run into if you don't have 100% separation between your SDN controllers and your data path? If you do maintenance, or take an outage just from upgrading the controllers, and it affects your data path, what happens? We had some challenges there too, and I'll cover them. Then, what we want to do in the future — how we are going to solve some of these problems. The solutions are all under discussion, and we are seriously evaluating one of them internally.

This is what I touched on a bit in the previous slide: the implications for the network design if your logical router doesn't scale beyond a certain point. If you map the logical router onto the physical network, it is your core layer, and the logical switch is nothing but your top-of-rack switch — we have just come down into virtual networking now.
And the VMs are nothing but your bare-metal servers — if you map your physical infrastructure onto this, that's how it looks. But think about your logical router: you bought it for so much circuit capacity; now the capacity is not enough, so you build another one, and multiples of that. And what else do you need to buy to go with it? The aggregation layer below it — exactly the same challenge you have here.

The logical router for us is a /20 network, and each logical switch is a /24 because of some port limitations in our SDN solution (physically they are all /21s today). So every logical switch here is a /24 connected to the logical router, all running on overlay. If you want to go out and come back, you hit the gateway. Logical router to logical router there will of course be latency, so if a VPC spans multiple of these logical routers, you are introducing latency. That's one reason we don't run latency-sensitive applications on the overlay yet. But the ultimate goal is to run everything on overlay — that is what will solve the capacity problem I touched on a slide or two back. If you have to go through a second router, that's an extra hop, and in networking every extra hop costs extra latency.

Now let me talk about the gateway bottleneck itself. What happens in this particular case? You are running Hadoop, with a lot of east-west traffic.
Say you are a tenant with a 500-node cluster running in the cloud, and you don't have enough capacity within that particular logical router in terms of number of VMs. With a /20, you cannot pre-allocate the whole range to one particular tenant and hope the utilization comes later — pre-allocating a /20 for a particular tenant defeats the purpose of the cloud, because you are reserving capacity for someone.

Now look at the red VMs: you have two VMs on logical switch two, another one on switch one, and one more VM behind another logical router entirely — and that introduces extra latency. So one of the changes we made, to keep these workloads constrained within the same logical router, was to the Nova scheduler: if a tenant already has 5 or 10, or maybe 200, VMs in a particular logical-router domain, we place new VMs best-effort within that same logical router. But it is not the solution — as I said, if you don't have the capacity, you have to go somewhere else. This is the reality we live in today: the big-data use cases like Hadoop and H2O, and a lot of chatty applications, all run on the bridged network.

Now, security groups. This is an interesting piece that hit us very, very hard — it took maybe two or three days before we understood why the hell there were so many flows running on the hypervisors. What happened was the reflexive rules that get introduced for users.
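The best-effort placement tweak described above can be sketched as a scheduler weigher. This is a simplified illustration with hypothetical names — Nova's real filter/weigher API is different — showing the core idea: prefer hosts whose logical-router domain already holds VMs of the same tenant.

```python
# Sketch of the scheduler change described above (hypothetical names, not
# Nova's actual weigher API): rank candidate hosts so that logical-router
# domains already hosting the tenant's VMs are preferred, keeping chatty
# east-west traffic behind one logical router when capacity allows.

def weigh_hosts(hosts, tenant_id, vms_per_domain):
    """Return hosts sorted so domains already hosting the tenant come first.

    hosts: list of (host_name, logical_router_domain)
    vms_per_domain: {(tenant_id, domain): vm_count}
    """
    def score(host):
        _, domain = host
        # More existing VMs of this tenant in the domain -> higher preference.
        return vms_per_domain.get((tenant_id, domain), 0)

    # Best effort only: if no domain has the tenant yet, the order is
    # unchanged, and capacity limits can still force placement elsewhere.
    return sorted(hosts, key=score, reverse=True)


hosts = [("hv-01", "lr-a"), ("hv-02", "lr-b"), ("hv-03", "lr-b")]
existing = {("hadoop-tenant", "lr-b"): 200}
ranked = weigh_hosts(hosts, "hadoop-tenant", existing)
print(ranked[0])  # a host in lr-b, where the tenant already has 200 VMs
```

As the speaker notes, this only reduces the cross-router cases; once the domain runs out of capacity, placement spills over anyway.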
In our cloud, by default every tenant lives in a particular VPC, firewalled off from other security zones, and within that zone they are fine running without shutting down particular ports or protocols. But sometimes there are applications that want to shut down specific ports, or run only on specific ports. When you introduce all those security rules in a default security group, your whole flow table explodes: it introduces more and more flows because you cannot summarize them. Then your SDN controller starts managing millions and millions of flows, which puts a lot of load on it, and if your SDN controller is not capable of handling that many, you'll be in trouble.

This is one of the reasons the megaflow cache often cannot be used: your flow table cannot be summarized. When I say megaflows — I hope everyone knows what megaflows do, but just to touch on it: you have an application, and a security domain where all these endpoints can talk to each other. All those flows can be summarized into a single wildcarded flow. But if everyone has a different security group, you need a separate flow entry for every security rule you have, so the number of flows keeps increasing, and it increases the memory footprint in your controllers again, because the table size grows.
To simplify that, what we did was take some patches from the community: whenever the same tenant — or someone else — submits the same rule again, instead of letting the rules explode, you summarize them and hand the summary to your SDN controller. Ideally the SDN controller should do this job: it should summarize internally and push the summarized flows and security groups to the hypervisors. But here we took that job and put it into the Neutron plugin. That solved this particular problem — though if somebody creates different rules that cannot be summarized, you run into the same problem all over again.

Here is another pain we went through: there was no separation between the SDN control plane and the data plane. As I said, the state of millions and millions of flows is managed by your SDN controllers and pushed to the hypervisors. Whenever you create VMs here and, across the network, somewhere else, the controller has to plumb this: it manages the state of where this particular hypervisor lands and where the other end of the tunnel is. All of this is managed within the controller, and when you upgrade the controllers, that's where the fun starts: you are restarting your controller as part of the upgrade, and it needs to recalculate all those flows and send the newer version of the flows to the hypervisors.
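The deduplication idea described above — done in their Neutron plugin rather than in the controller — can be sketched in a few lines. This is an illustrative sketch, not the actual patch: canonicalize each security-group rule and collapse identical ones before handing them to the SDN controller, so N tenants submitting the same rule yield one flow spec instead of N.

```python
# Illustrative sketch (not the actual Neutron patch) of summarizing
# duplicate security-group rules before they reach the SDN controller.

def canonical(rule):
    """Normalize a rule dict into a hashable key for comparison."""
    return (rule["direction"], rule["protocol"],
            rule.get("port_min"), rule.get("port_max"),
            rule.get("remote_cidr", "0.0.0.0/0"))

def summarize(rules):
    """Collapse rules that are identical after canonicalization."""
    seen = {}
    for rule in rules:
        seen.setdefault(canonical(rule), rule)
    return list(seen.values())


rules = [
    {"direction": "ingress", "protocol": "tcp", "port_min": 443, "port_max": 443},
    {"direction": "ingress", "protocol": "tcp", "port_min": 443, "port_max": 443},
    {"direction": "ingress", "protocol": "tcp", "port_min": 22, "port_max": 22},
]
print(len(summarize(rules)))  # 2 -- the duplicate 443 rule is collapsed
```

As the talk notes, this only helps for rules that are actually identical; genuinely distinct, unsummarizable rules still explode the flow table.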
Now think about a scenario where you are connecting from VM A to VM B: one end of the tunnel has been calculated and pushed to the hypervisor, but the other end has not been calculated yet — you are pushing a partial flow to the hypervisor. And if you have a bug, you are not talking about just one flow; there are millions of flows like that, and if you start pushing only partial flows, you get into trouble. We took care of those kinds of issues — when you run at this scale, these things happen — and that was one of the problems we faced most often during upgrades.

The other big problem with having no clear separation: what happens if you lose your control cluster? The control cluster manages the entire flow state for the network. If it goes down, your data plane needs to stay as it is — you won't get new flows into the mesh, but the existing flows need to keep running. We didn't have that control/data separation, because the controllers periodically check flow health, and if a flow isn't confirmed, they take it out of the flow table. We ran into a situation where we took a big hit when the controllers went away — we run multiple controllers, two or three, each with 256 GB of RAM or so — and they all ran hot at some point because of the security-rule explosion I talked about earlier: several million flows. Handling that kind of table explosion in the data plane wasn't there when we started three years back; now all these features have been introduced in our SDN controllers and SDN gateways.
Now we are in a better state. One more thing: the painful upgrades. You are upgrading your SDN components, and now we are talking about several thousand soft switches. In the earlier days, an upgrade meant only the physical infrastructure — your physical switches and routers, and there aren't many of those, maybe hundreds. Now your switch has moved into your hypervisor. This is what I was telling a couple of my folks: you see a lot of pain because every upgrade takes a 5-to-10-hour window. It is live production, and you cannot just go and restart OVS on all the hypervisors. We have to batch it out and make sure the impact is minimal — and when I say impact, we are talking about only 10 or 12 seconds, 15 seconds max, for an OVS restart. We can't even afford that.

That puts a lot of pressure on the OVS community: when you reload a kernel module, how much downtime can you afford? Can you keep the data path as it is and swap the kernel module? Or can you restart within one or two seconds? To manage this, we have to clearly identify all the hypervisors and batch them — a maximum of 100 to 200 hypervisors per batch — and run the batches separately whenever we upgrade, because we are upgrading thousands and thousands of OVS virtual switches. That introduces a lot of firefighting, and sometimes there are compatibility issues between the hypervisors, the OVS version, and the kernel version you are running, because the infrastructure has grown over a period of time.
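The batching scheme described above can be sketched as a simple rolling-upgrade loop. This is a hedged sketch with hypothetical helper names (`upgrade_one`, `healthy`), not their actual orchestration tooling: touch at most 100-200 hypervisors at a time, and stop the rollout if a batch fails its health check.

```python
# Sketch of the batched rolling upgrade described above (hypothetical
# helper names): upgrade OVS on at most `size` hypervisors per batch,
# and abort the rollout if any batch comes back unhealthy.

def batches(hypervisors, size=200):
    for i in range(0, len(hypervisors), size):
        yield hypervisors[i:i + size]

def rolling_upgrade(hypervisors, upgrade_one, healthy, size=200):
    """Upgrade in batches; return the list of hypervisors done so far."""
    done = []
    for batch in batches(hypervisors, size):
        for hv in batch:
            upgrade_one(hv)          # e.g. restart OVS (a 10-15s data-path blip)
        if not all(healthy(hv) for hv in batch):
            return done              # stop the rollout and investigate
        done.extend(batch)
    return done


hvs = [f"hv-{i:04d}" for i in range(450)]
upgraded = rolling_upgrade(hvs, upgrade_one=lambda hv: None,
                           healthy=lambda hv: True, size=200)
print(len(upgraded))  # 450, processed in batches of 200, 200, 50
```

Capping the batch size bounds the blast radius of a bad OVS or kernel combination, which is what got the upgrade window down from 10-15 hours toward 5-6.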
You don't have the same version of hypervisor, the same version of OVS, and the same kernel running everywhere; there will be incompatibilities. And how many combinations can you test in your lab before you move to the real production infrastructure? These things were the real, real pain of running the overlay-based, OVS-based infrastructure we run today. It's not about buying controllers and making them work for 100 nodes; once you get into the real operating world, you are on the bleeding edge, and the upgrade cycle is very frequent now. It's not like your top-of-rack switch or core router, which you upgrade maybe twice a year — I'm talking about monthly, twice a month, sometimes even bi-weekly, depending on how many bugs there are and how much maintenance the infrastructure needs. For that, you have to have really reliable orchestration software for the upgrades. That solved a lot of problems for us later on: instead of 10-to-15-hour upgrade cycles, we got it down to around five to six hours, done effectively.

Now, gateway scale. A logical router resides on a single gateway. Suppose you are running a /20 network, which is our case — that many VMs are being handled by a single gateway, and you are pushing gigabytes and gigabytes of data towards it. Of course we have active/standby, so if something happens to the active...
...it fails over to the standby within 10 to 20 seconds, from what we have seen. But it introduces a lot of latency when many tenants are pushing more and more packets — thousands and thousands of packets per second — because a single gateway doesn't scale. You need to distribute across multiple gateways, but we don't have that capability yet. I'll talk later about how we plan to solve that with some of the technologies we are evaluating.

We also ran into a situation with port-security scans that constantly run across our infrastructure, always sending ARP packets. We had a bug in our gateway: it kept queueing up those ARP packets, and after some point it choked the gateway and the entire traffic went down for an hour or so. Now, how do you debug that? People will scan — that's a given. And it comes to an uncomfortable point: we have run our physical network infrastructure for years and years, but the SDN gateway couldn't even hold up to a simple scan. A simple scan should be handled like this: if you cannot process the packets after some time, queue them, but the queue size has to be very limited — maybe a handful of packets. Why queue up all the packets and exhaust your memory? Of course it's a bug, but the more important point is how much impact that bug had on the infrastructure. So we had to introduce some solutions for that.
To roll out the fix, we had to get the bug fixed, test it, and then push it to the infrastructure, which was taking production traffic the whole time. Until then we had to keep it running, so we introduced rate-limiting scripts in the gateway: identify these packets and rate-limit them, and if a source sends more than three or four packets, just stop it. That was the stopgap, and of course those bugs are fixed by now. We also identified who was sending this much traffic and from where: we identified the source IPs and source subnets and blocked them automatically. That became the solution while we debugged and root-caused the problem, until the actual fix was done.

But identifying the heavy hitters matters because the logical router runs on a single box. We talk about virtualizing, about moving away from physical switches and routers, and it all looks good. But if we cannot scale beyond a certain limit on x86, it is not good enough. There is a way to scale out and distribute that we don't have today: how do we distribute, across x86 boxes, the several thousands of flows and several thousands of IP addresses of a particular VPC? There are technologies for that, and that's what we are exploring today.

And there are other issues.
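The stopgap described here, per-source counting with an automatic block once a source exceeds a few packets, might look roughly like this. It's a sketch under stated assumptions (a fixed time window, an in-memory block set), not the actual gateway scripts; all names are illustrative:

```python
from collections import defaultdict

class SourceRateLimiter:
    """Allow at most `threshold` packets per source within `window` seconds.

    Sources that exceed the threshold are blocked outright, like the
    heavy-hitter blocking described in the talk.
    """

    def __init__(self, threshold: int = 4, window: float = 1.0):
        self.threshold = threshold
        self.window = window
        self.seen = defaultdict(list)    # source IP -> recent timestamps
        self.blocked = set()

    def allow(self, source: str, now: float) -> bool:
        if source in self.blocked:
            return False
        # Keep only timestamps still inside the window, then record this one.
        recent = [t for t in self.seen[source] if now - t < self.window]
        recent.append(now)
        self.seen[source] = recent
        if len(recent) > self.threshold:
            self.blocked.add(source)     # auto-block the heavy hitter
            return False
        return True
```

Usage: the first four packets from a source pass, the fifth trips the block, and everything after from that source is dropped without further bookkeeping.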
We talked about the single gateway, how much latency it introduces, and the impact of running those /20 networks on a single logical router. Now consider the availability-zone size itself, because a single SDN control plane manages an entire availability zone. That central control plane has to manage millions and millions of flows from many hypervisors, and there is a limit to how many hypervisors it can hold. So we have a scale limit there too.

Should we just run availability zones of only two or three thousand hypervisors? No, because that introduces a lot of cost in stamping out multiple availability zones: every availability zone needs a whole set of other infrastructure services deployed alongside it. Money is one thing and time is another, even with automation. And every availability zone becomes an island; to tie them together you have to go out to the core router and come back, which we don't want to do. So instead, we want to scale beyond the 2,500 hypervisors that our availability zone supports today. We want to run at least 10,000. To run 10,000 when a central control plane cannot handle more than 2,500 hypervisors, you need to scale out the SDN controllers themselves. Then where do you calculate all these flows: in the central controllers or in the hypervisors? There are multiple ways to do that, and of course it's a journey we started and we are sticking to it.
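One way to reason about that scale-out: if one control plane tops out around 2,500 hypervisors and the target is 10,000, you need at least four controller shards, plus a stable way to assign each hypervisor to a shard. A hypothetical back-of-envelope sketch; the hashing scheme here is illustrative, not what the talk says was actually deployed:

```python
import hashlib

CAP_PER_CONTROL_PLANE = 2500    # observed per-control-plane limit from the talk
TARGET_HYPERVISORS = 10_000     # desired availability-zone size

# Ceiling division: minimum number of controller shards needed.
num_shards = -(-TARGET_HYPERVISORS // CAP_PER_CONTROL_PLANE)

def shard_for(hypervisor_id: str, shards: int = num_shards) -> int:
    """Stable hash-based assignment of a hypervisor to a controller shard."""
    digest = hashlib.sha256(hypervisor_id.encode()).hexdigest()
    return int(digest, 16) % shards

print(num_shards)               # 4
print(shard_for("hv-0042"))     # a stable shard index in 0..3
```

Stable assignment matters because a hypervisor that flaps between controllers would force constant flow recomputation, which is exactly the central-control-plane cost the talk is trying to avoid.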
I'm sure we will get there very soon. And again, there are a lot of concurrency issues in Neutron that we ran into, and I want to talk about that. This infrastructure is an eventually consistent system. Say you are creating a hundred ports in Neutron. To scale out, you will be running a Neutron control cluster, your SDN controller will also run as a cluster, and in front of the cluster you will have a load balancer or a VIP. Based on the balancing algorithm, each request picks one of the controllers in the cluster, and it takes time for a state change made on one controller to propagate to the other nodes. So you go to node one, and then immediately, because of round-robin on the load balancer, your next request hits node two, but the state is not there yet. Node two will try to create another port, and you end up with duplicate ports for the same IP. We ran into these issues and filed a lot of defects on this. Everything worked until we had around 500 to 1,000 hypervisors, but when the scale started increasing, that's where we saw most of the issues: people creating lots and lots of ports and hitting lots of races. We had to fix how the controllers are load-balanced across multiple nodes.
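The duplicate-port race boils down to a non-atomic "check, then create" executed on two nodes that haven't converged yet. The standard remedy is to make the create idempotent behind a single point of truth, e.g. a unique constraint on (network, IP) in the database. A toy sketch of the idea, with a lock and a dict standing in for that constraint; everything here is illustrative, not Neutron code:

```python
import threading

ports: dict = {}            # (network_id, ip) -> port_id, stand-in for the DB
_lock = threading.Lock()    # stand-in for a unique constraint on (network, ip)

def create_port(network_id: str, ip: str, port_id: str) -> str:
    """Idempotent create: concurrent callers for the same (network, ip)
    all get the one winning port instead of creating duplicates."""
    with _lock:
        key = (network_id, ip)
        if key in ports:
            return ports[key]      # another caller won the race; reuse it
        ports[key] = port_id
        return port_id

# Two "round-robined" requests racing to create a port for the same IP:
threads = [
    threading.Thread(target=create_port, args=("net-1", "10.0.0.5", f"port-{i}"))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(ports))            # 1 — no duplicate port for the same IP
```

With an eventually consistent cluster, the check alone can never be trusted; the atomic insert is what guarantees one port per IP regardless of which node handles the request.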
As for future enhancements: as I said, the gateways run as a single gateway today, and we want to advertise all these routes using BGP or some other mechanism. Also, a VPC sits within a single data center today, and we want to extend that traffic across multiple data centers, so a VPC has a seamless connection from data center one to data center two. We are looking at EVPN and technologies like that. There are still many challenges we are fighting; if you are interested in being part of solving these problems, send me an email or connect with me on LinkedIn. Okay, we have five, maybe ten minutes for questions.

[Audience question.] Yes. When I talk about all these challenges, remember that we started this journey three and a half years back, and only a few technologies were available then. Now we are looking at options such as VXLAN and different alternatives. Yes, we are working on that with one of our partners; we are using Nicira controllers. This is a journey: we started three years back, and we are in a much better state than earlier.

[Audience question about Keystone.] The Keystone changes: when we started, back around Folsom, we didn't have domains and hierarchical tenants and things like that. Basically, the Keystone change is a VPC construct: an admin tenant that owns all the networks belonging to that particular VPC.
Those networks can all talk to each other, and they are firewalled together: they all belong to the same firewall domain. We manage that information in the Keystone metadata.

[Audience question.] No, we use Nicira. Any other questions? Okay, sure. If you have any questions or comments, send them to my email and let's catch up. Thank you.