Today, I have the pleasure of introducing Changhoon again. Chang is a student of Jennifer Rexford; he graduated from Princeton. Since then, he's been working on a lot of projects related to data centers and network virtualization. I'm sure many of you have read the VL2 paper in SIGCOMM — Chang did a lot of the work on VL2. He's going to talk about how Microsoft is managing its data centers and how they operate at scale. Thanks, Chang.

Thank you for the introduction, and thank you, Yanis, for organizing this. Should I be wearing a microphone or something, or is it fine? No, it's fine. OK, great. So I'm Chang, and today I'm going to talk about network virtualization for large data centers. A lot of these technologies and architectures are, of course, also useful for enterprise networks. I'm currently at Windows Azure; I was at Microsoft Research formerly.

So the other day, a friend of mine casually asked me where all the bytes come from. As a data center researcher, I pretty much knew that almost all the bytes are coming from data centers these days, but thanks to the "How Much Information?" project by UCSD, I now have good numbers. Basically, data centers produce the bytes, the Internet and CDNs disseminate the bytes, and the devices around us consume the bytes. That's the way it goes. And data centers are indeed running all sorts of applications today: search, email, social networking, cloud computing, data mining, high-performance computing, and even online games.

So it's no wonder that data centers are proliferating. This picture shows the data center build-out plans of a few of the largest online service companies, for this year alone. And it's not just these IT giants who are building data centers; most large enterprises have their own data centers as well. Worldwide, you have some hundreds of thousands of data centers today. A data center's size is usually determined by electrical power and economies of scale, and it's somewhere between 50,000 and 200,000 servers.

So data centers are the IT era's analog of factories, producing bytes rather than goods. Factories are expensive to build and run, and so are data centers. Suppose you have a small data center of maybe 50,000 servers; this would be your monthly bill — if only you could pay monthly. The truth is that only power is paid for as it's consumed; you have to pay for everything else upfront, costing some $100 million every three years. Hence this golden rule: maximize the amount of useful work per dollar spent. You have already spent the money, so make the best of it.

To abide by this golden rule, data center providers and tenants usually adopt two key operating principles. First, multi-tenancy — time and space sharing. To keep all these servers as busy as possible, the servers are divided up among a large number of tenants. Second, dynamic resizing: to avoid squandering servers and yet acquire them when needed, tenants usually adjust their server pool sizes dynamically, up and down. And to support these key operating principles, the most important technological capability in a data center is agility, which I define as the capability to assign any server to any tenant at any time. In fact, the systems research community has made huge progress toward agility; machine and storage virtualization are great examples.
Unfortunately, the network falls short in several aspects. Let's see why. This is a brief sketch of the conventional data center network architecture: a hierarchy of switches and routers reaching from the racks of servers at the bottom to the core routers connecting this data center to other data centers or the Internet. Ethernet technology is quite popular because it essentially allows self-configuration, but to enable self-configuration, Ethernet relies heavily on broadcasting and flooding, so it doesn't scale very well. To scale to hundreds of thousands of servers, data center providers usually build a large number of small subnets corresponding to Ethernet VLANs and then connect these VLANs using IP routing. And IP routing, as you may know very well, uses location-dependent addressing, meaning that if a server belongs to one subnet, it cannot use another subnet's address prefix. I'll soon tell you why this location-dependent addressing is very bad in a data center. A cluster is a server management unit, typically composed of one or two thousand servers, corresponding to a few Ethernet VLANs or subnets. This architecture has been around since the early 2000s, when data centers were predominantly hosting web services. Nowadays, web services alone are a story of the past, so we have some huge problems here.

One central problem of this architecture is that it depends on high-cost, mainframe-style network devices, like these at the upper layers. These devices are usually used by large ISPs to meet their huge capacity and reliability requirements, and they have loads of hardware and software features that are unneeded, especially in a data center. For example, the large packet buffers, which are needed to cope with the huge end-to-end latencies that occur only in the Internet — in data centers, latency is almost always less than one millisecond, so that kind of feature is almost never used. But in the absence of reliable alternatives, data center providers have just ended up buying a small number of those high-end devices and wound up creating hugely oversubscribed networks. This oversubscription causes a lot of problems when it meets today's traffic demand. Today's data center applications are highly distributed — massively distributed styles like MapReduce, distributed blob storage, distributed in-memory caching, et cetera. These kinds of applications drive a huge amount of intra-data-center traffic, also known as east-west traffic, and then you have some serious problems.

So what are those problems? The consequence is poor agility. Let me give you an example. Suppose you have two tenants, blue and orange, each given servers in clusters A and B respectively. Now the blue tenant needs a new server because one of its servers has just failed. The provider has some available servers in cluster B, so it wants to assign one of them to the blue tenant. But doing this causes a lot of problems. First, the limited cross-cluster bisection bandwidth will damage the performance of the blue service. This is going to be especially harmful if blue is running a distributed application such as MapReduce, because the other blue servers, with relatively more network capacity, will end up wasting their own CPU cycles and memory while waiting for the completion of the particularly slow networking jobs exercising this poor path. And the orange tenant will also suffer collateral performance damage.
Also, since this is a replacement server, the blue tenant would naturally want to use the IP address of the old server that this new server is replacing. Using the same address is particularly important because it minimizes service disruption by avoiding changes to application state or configuration. Unfortunately, this data center architecture does not support location-independent addressing, and hence the new server has to get a new IP address. So what happens is that the provider usually confines each tenant within a strict cluster boundary, or else it spreads a tenant's service widely across the data center and forces the tenant to painfully swallow huge performance variance and frequent address changes. Either way, waste of resources is inevitable, lowering data center utilization significantly.

Given all these limitations of this almost decade-old networking architecture, data center providers and tenants have been deploying a lot of retrofitted solutions. For example, they try to distribute virtual machines or jobs over an optimal set of servers chosen based on network performance or topology, so that they can maximize the amount of work done per unit time. Given an oversubscribed network, they also try to do traffic engineering, adjusting traffic forwarding paths for the dominant tenants; doing so will hopefully increase the effective network capacity available to those tenants. There is indeed merit to such approaches, which essentially try to outmaneuver the problems in the underlying layer using the limited information and features available at the layers above the network. Sometimes they work well, and sometimes they don't — suppose you came up with the optimal routing solution for a particular tenant's workload based on the current snapshot; by the time you have generated and deployed that solution, the network state might have already changed, making the solution no longer optimal. On the other hand, we really haven't spent a lot of time actually eliminating the underlying problem completely. So my research for the past few years has been focusing precisely on this drastically different approach: presenting fundamentally new network architectures that support agility at scale in the first place. I argue that network virtualization as an architectural principle achieves this ambitious goal, and I'll show you that turning those architectures into operational systems is quite possible as well.

Before moving into the solution space, let me articulate the goal. What do I mean by support for agility in the context of networking? Agility means that, as a tenant, your servers will be scattered widely across a data center, and in that situation what you want is a technology that lets you stop caring about the placement of your servers. I turned this high-level goal into three specific technical requirements. First, I want to assign any IP address to any server; with this, your servers can use the same persistent IP addresses even when they get replaced with new ones.

Yes? Isn't the persistent-IP-address problem — doesn't that indicate a problem with the way applications are designed? For example, my computer doesn't care which USB port I plug my keyboard into. I don't care which DRAM chip my program uses.
It seems that this is a problem with the way applications are designed: they have to have this low-level address, which carries this low-level knowledge of the network, and if their IP address changes, then they break.

Some applications — some new applications — are actually designed exactly that way, so that they can absorb the underlying changes. Stateless applications do that. If you have an army of well-trained software engineers, you can build your applications that way. Bing does that, Google does that, all the big application shops do that. But existing applications, using existing libraries and OSes — their owners just want to simply migrate their existing binaries from the enterprise to the cloud, and those things can easily fail.

I have a small question. My observation: I don't think applications really deal with IP addresses. Mostly they deal with URLs and that kind of thing. IP addresses are more of an operating-system issue than an application issue.

Well, even the TCP stack has the fundamental underlying assumption that the IP addresses on both ends will remain the same. If one changes, it just totally breaks the TCP state machine. So if your servers are actually virtual machines, with this feature, along with live migration, you can actually preserve ongoing connections and ensure service continuity. You can also assign the same address prefix to all your servers regardless of where they are, and that can make your application and service management extremely simple.

Second, I want to offer high networking performance between any pair of servers in the data center. With this, you get consistent performance. Why is this good? Because you don't need to worry about wasted CPU and memory — the other servers don't have to wait for a particularly slow networking job between two distant servers. And with this, you can make your job placement or VM placement algorithms very simple. Finally, I want to protect tenants from one another. With this, I can remain carefree even when my servers are located right next to my competitor's servers, or even malicious servers. So these are the three key goals, which I call abstraction, isolation, and efficiency.

This is the summary of my network virtualization projects. I began with VL2. VL2 is precisely about offering a huge virtual switch abstraction to all the tenants within the same data center. Forget about this highly oversubscribed, location-dependent network architecture; I just want to give them one huge layer-2 switch. Tenants can then move around — they can be placed on any servers, I don't care — because this switch offers flat addressing and uniform high capacity, just like a physical switch. I then extended this abstraction in a few other projects: VNet and Seawall. And EyeQ is another project that I've been doing with Balaji and Dave and Vimal here. So instead of having just one huge shared switch, let's give an individual virtual switch with seemingly unlimited capacity to every single tenant. Each tenant can have as many ports on its switch as it wishes, and tenants then have complete isolation in terms of addressing, in terms of reachability, and also in terms of performance. I'm going to talk about VL2 and VNet in this talk, and I'll be happy to talk about Seawall and EyeQ offline.
This is basically a single-slide summary of what my network virtualization architectures specifically do, and how. First, flat addressing. The reason conventional networks cannot offer flat addressing is that they assign just one IP address per server, and that address works both as a name and as a location identifier. I solved this problem simply by assigning two addresses — two identifiers — to each server: one working as an invariant name, and the other working as a variable location identifier. And I introduced a simple address translation and resolution mechanism working between these two types of identifiers. Second, uniform high capacity. Again, the reason conventional networks cannot ensure uniform high capacity is that they're oversubscribed. My data center measurement research indicates that optimizing routing on such an oversubscribed network might not lead to huge benefit — it may lead to some benefit in some cases, but not a great, game-changing benefit. So my approach is to just build a bottleneck-free network that ensures oversubscription-free capacity under any traffic pattern, and forget about optimization. That's my approach. And of course, the key question is whether this is feasible in practice. My core contribution is that yes, this is feasible — if you use a certain type of topology, if you use a routing mechanism called Valiant load balancing, and if you use TCP as the transport protocol. At the same time, I also wanted to achieve both of these goals even when tenants do not cooperate with or trust one another.

So let's move on to the flat addressing part. I'm going to first talk about the simple flat addressing design, and then I'll move on to the next design, where I assume that the tenants are mutually distrustful or uncooperative. To design the proper network architecture, we first need to understand the unique challenges and opportunities of the data center networking environment. Let me begin with some challenges. Especially in the context of flat addressing, the key challenge is simply the scale. We're talking about virtually millions of virtual machines and physical machines. Their arrival and departure rates are huge, and at the same time, their locations and addresses can keep changing due to agility. No existing routing protocol can handle this kind of workload; they're simply not designed for it.

So the problem you're addressing is routing-related issues, not switching? You're assuming that at the switch level, the layer-2 level, there's no issue, because the switch doesn't deal with IP addresses?

Right — the only thing we require is that the switches can somehow build their own switch-level topology and then deliver packets from one switch to another using either layer 2 or layer 3. They don't need to know about host movement at all.

But there are some great opportunities in the data center networking environment as well. First of all, these virtual machines and physical machines are, of course, not humans. They're not coming and going or moving around on their own. Instead, in every single cluster in a data center, there is a logically centralized cluster management system which knows everything about the servers and their state. And instead of learning that state, it actually generates that state.
If this cluster management system says that a VM is there, or is not there — even if the VM is actually running there, effectively it's not there, because it's not managed and it will just eventually be discarded. So there is an almost god-like system in this environment, and it would be natural to take advantage of these almost-free-of-charge, built-in capabilities. Another benefit of this cluster management system is that it can adjust the rate and extent of server state changes. Unlike people, who just move around at their own pace from this building to a new building, these VMs are exactly coordinated. And there is another key opportunity here: whenever these VMs' or servers' addresses or locations change, that information doesn't need to be known to all the other servers instantaneously or atomically, because networking protocols have retransmission mechanisms. They can tolerate a brief miss or loss of connectivity. So eventual consistency is fine. These two factors give us a lot of freedom. Although it might appear that this huge amount of server state and churn would be very challenging, in fact managing that information is not the worst thing you can imagine in the networking world.

So let me begin with the flat addressing design. The gist of this design is surprisingly similar to virtual memory technology, performing virtual-to-physical address translation. Here is a VL2 network. Each top-of-rack switch hosts a rack of servers, and I intentionally did not show the interconnection topology between these switches — I just left it blank, because it's totally irrelevant to the flat addressing design. Now, these servers have virtual addresses. Exposing these virtual addresses directly to the network is a very bad idea for a few reasons. First, there are so many of them. Second, they keep changing due to agility. And third, they're not even topologically aggregatable. So instead of exposing these virtual addresses to the network, I just let switches have their own location-dependent physical addresses. Switches have physical addresses, and those are aggregatable. Then I simply run any conventional IP or even Ethernet routing protocol that maintains only the switch-level topology. This reduces switch routing table sizes significantly and also reduces the amount of routing information exchanged, because even in the largest data center, the switch-level topology alone is fairly stable and compact.

Now, the key issue is that these switches know how to deliver packets from one ToR to another, but they know nothing about the servers, so they cannot deliver packets that use servers' virtual addresses as destinations. That's the problem. And I solved that problem by introducing translation at this demarcation line. But to translate between these two kinds of addresses, I need to know the mappings between virtual addresses and physical addresses. How do I get those? It's actually easy, because the cluster manager generates that information rather than learning it from somewhere. So all I need is a simple directory service which can offer this mapping information when and where needed. Suppose X needs to send packets to Y and Z. It first looks up the directory service and fetches Y's and Z's physical addresses. Then it performs encapsulation: it prepends another packet header using these physical addresses as the new destinations.
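To make this lookup-and-encapsulate step concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions — the class names, the `resolve` call, and the idle-timeout value are hypothetical, not the actual VL2 implementation:

```python
import time

class DirectoryService:
    """Stand-in for the VL2 directory: VA -> PA mappings that the
    cluster manager writes and the servers read."""
    def __init__(self):
        self.mappings = {}              # virtual address -> ToR physical address

    def resolve(self, va):
        return self.mappings[va]        # in reality, an RPC to a cache server

class ServerShim:
    """Per-server agent: translates virtual to physical addresses,
    caching results like a TLB."""
    IDLE_TIMEOUT = 30.0                 # seconds; hypothetical value

    def __init__(self, directory):
        self.directory = directory
        self.cache = {}                 # va -> (pa, last-used timestamp)

    def send(self, dst_va, payload):
        entry = self.cache.get(dst_va)
        now = time.time()
        if entry is None or now - entry[1] > self.IDLE_TIMEOUT:
            pa = self.directory.resolve(dst_va)   # "page table" lookup
        else:
            pa = entry[0]
        self.cache[dst_va] = (pa, now)
        # Encapsulate: prepend an outer header addressed to the destination ToR.
        return {"outer_dst": pa, "inner_dst": dst_va, "payload": payload}
```

Here `outer_dst` is all the switches ever route on; the inner virtual address is only consulted again at the destination side.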
Then it simply forwards the packet out, and since the switches know about physical addresses, they can deliver it correctly. By the way, in the analogy to virtual memory, this directory service is like the page table, and the small cache at each individual server is like a TLB, a translation lookaside buffer. And just like a translation lookaside buffer, if there is an entry that is not used for a short idle timeout, the server will evict that entry.

Now suppose the cluster manager somehow decides to shut down this rack — for power saving, for maintenance, or for whatever reason. The cluster manager has to create a replacement server for Z somewhere else, for example under ToR 3. While the cluster manager is performing that job and updating the directory service, X will forget about Z's information, because preparing a replacement server today takes at least some tens of seconds. Then, when X needs to talk to Z again, it will look up the directory service again, get the up-to-date information about Z, and everything will just work. This mechanism is realizable with low-end commodity switches: switches need only a small routing table, they don't need to exchange host information with one another at all, and the network is protected from server state churn.

So this is good, but cloud data centers need more than just flat addressing, and here is why. For enterprise customers, using a cloud data center is way more cost-efficient, and hence they want to migrate some of their existing computing infrastructure into the cloud. But they cannot move the entire infrastructure from the enterprise to the cloud on a flag day. That's why we need to think about partially cloud-based service deployment, also known as the hybrid cloud scenario. In this scenario, corporations have their own enterprise networks or enterprise data centers, and then they build a new corporate site in the cloud. They connect this new corporate site to their existing enterprise network using protected channels such as VPNs. Now they can move existing services from on-premise to the new site in the cloud, or they can simply create new services from scratch in the cloud.

In this situation, we face two critical challenges. First, bring your own address space. Since IPv4 addresses are not enough, everybody is using the same reserved private address space, which is 10/8, reserved by the IETF. So everybody is using 10/8 — GM is using it, other enterprises are using it, and even the cloud provider is using it. So if the cloud provider assigns its own choice of IP address out of this range to a customer's VM, that address can actually collide with the address of another machine in that customer's on-premise network. What this means is that customers should be able to bring their own choice of IP addresses into the cloud. And this leads to an interesting problem: again, since everybody uses 10/8, the customers' choices can overlap with one another, and they can also overlap with the cloud's own choices. But why is this a problem?

But you explained that you have a flat L2 address space and you also virtualize IP addresses. So having the virtual address spaces overlap shouldn't be a problem.

That's exactly what I'm trying to solve. In the previous case, I assumed that addresses can move around, but I didn't assume that the addresses can actually overlap.
In this case, everybody is sharing exactly the same address space. So it's not just that the addresses can move around freely — they can even overlap.

But you made the analogy with virtual memory, right? And in virtual memory, all my programs can be linked at the same virtual address. It's only a problem for the physical addresses.

You're right — the underlying solution is actually very similar. We're building on the previous architecture.

So, within the data center, if you used one VLAN or something — everybody on one VLAN, the whole data center — then everything would be done at the switch level and there's no routing involved. Wouldn't that solve the routing issue, operating it as one huge domain?

Are you saying that we could have solved this problem using something like a big VLAN, where everybody has its own VLAN?

No, no — not everybody with their own VLAN. The whole data center is one VLAN.

But because they want overlapping address spaces, every tenant needs at least one VLAN of its own, so that they can use their own addresses. And at some point, all these VLANs will merge at one, or a small number of, physical routers, because they have to go across the Internet. At that merging point, you'll have address overlapping problems.

I thought you were trying to solve the problem within the data center.

I see — it's not just within the data center, but across it as well. So, the next problem is reachability isolation. You could have the same addresses — the same customer addresses — everywhere, and yet a VM belonging to GM's virtual network should be able to talk only to the other VMs in the same virtual network, regardless of address overlap. We solved this problem using an architecture called VNet, and as I told you, it builds on the VL2 architecture. In the original VL2, the demarcation between virtual and physical addresses happened right under the top-of-rack switches. In the cloud environment, tenants always get VMs, not physical machines. And in each physical machine, we have a virtual machine switch — basically a software switch connecting all the VMs in that physical machine to the network. We moved the demarcation line down below the VM switch, because only the VMs need virtual addresses. The physical machines can have location-dependent physical addresses, because they're used only by the provider, not by any tenant. So switches and servers use physical addresses in this architecture.

Now you can have a red virtual network, a blue virtual network, and a green virtual network, and you can have multiple Xs and Ys and Zs. That's fine, because when red X needs to talk to red Y, the VM switch will intercept that packet and look up the directory service — and it looks up only the red table, not the other tables, because it knows this VM belongs to the red VNet. The same delivery mechanism using encapsulation delivers the packet to the destination, and the destination will additionally confirm: OK, this is indeed coming from a red tenant — check that, and then deliver it to Y. Now, if blue Z sends traffic to its own Y, the same mechanism applies.
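To show how the per-tenant lookup avoids collisions, here is a tiny sketch. The names are mine, but the point is the speaker's: mappings are keyed by virtual network, so identical 10/8 customer addresses never clash.

```python
class VnetDirectory:
    """Per-tenant mapping tables: (virtual network, customer address)
    resolves independently for each tenant."""
    def __init__(self):
        self.tables = {}        # vnet_id -> {customer_addr: physical_machine}

    def register(self, vnet_id, ca, pm):
        self.tables.setdefault(vnet_id, {})[ca] = pm

    def resolve(self, vnet_id, ca):
        # Only the caller's own table is consulted: reachability isolation.
        return self.tables.get(vnet_id, {}).get(ca)

d = VnetDirectory()
d.register("red",  "10.0.0.7", "PM2")   # red tenant's VM Y
d.register("blue", "10.0.0.7", "PM4")   # blue tenant reuses the same address
assert d.resolve("red", "10.0.0.7") == "PM2"
assert d.resolve("blue", "10.0.0.7") == "PM4"
assert d.resolve("green", "10.0.0.7") is None   # unknown VNet: unreachable
```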
Now, suppose the cluster manager decides to migrate this VM — blue VM Y — to a new location. What happens is this: the cluster manager performs the migration and updates the directory service, so the directory service now has the new information. But since VM migration can happen very quickly — within some 100 milliseconds, or up to a few seconds — this VM switch can still have stale information. That information is now wrong: Y is no longer under PM4. But that's fine; we can correct it with a simple mechanism. When the VM switch sends a packet to the wrong location, fortunately that PM is still alive, its VM switch is alive, and hence it can tell the sender: hey, you've got the wrong information — evict that entry and relearn. So the sender's VM switch relearns Y's new location and quickly redirects the traffic to PM3. This way, you can support VM migration.

Only if PM4 is actually alive — if it was down, then you'd have to —

Only if PM4 was down, yes. But if PM4 was physically down, there's no point in VM migration anyway; you've lost your ongoing connections, and VM migration won't help.

You might be recovering from a snapshot.

Yeah, if you had a snapshot.

But that means there are more gaps. When a system goes down and you recover it, you'd send to it and you wouldn't be able to detect the stale entry until you actually timed out.

For the sake of time, can we keep questions for the end? So, this design is great, but if there is one component in this design that makes me stay up late at night, it's the directory service. If the directory service fails, no new connections — everything almost stops. A huge correlated failure. But there is one nice thing we can take advantage of, which is eventual consistency; we don't need completely atomic updates. So we use a loosely coupled distributed systems idea, similar to Google's distributed lock service, Chubby. Basically, we have a two-tier design: a large number of read-optimized cache servers and a small number of write-optimized master servers, with a loose synchronization protocol between the two layers. These servers are implemented in software and run in regular VMs, so if I need to spin up more of them, I can easily do that on demand and meet the directory service's increasing workload. In production, we use five master servers and some 60 cache servers to cover about 100 million VMs.

Did you have a question?

Yes — we're talking about the directory service. Are you aware of any work being done toward enhancing DNS to incorporate some of the functions you built into your directory service?

Yeah, actually, this kind of indirection can happen at the DNS level. We're doing it at the IP and Ethernet layers — we're intercepting ARP — but you could do exactly the same thing between host names and IP addresses. The thing is that host names are not as universal as IP addresses; sometimes customers still hard-code their IP addresses. To be specific: to enable address resolution and translation, I had to modify the server networking stack. And encapsulation introduced problems in some cases, where servers might not be able to run at the full 10 Gbps line rate, so we had to modify the traffic-spreading mechanism that spreads 10 Gbps of traffic across multiple CPU cores.
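Before moving on, here is a minimal sketch of that two-tier directory idea. The pull-on-miss synchronization below is my simplification of the loose synchronization protocol he mentions; the real update and replication paths are more involved.

```python
import random

class MasterServer:
    """Write-optimized tier: the cluster manager applies updates here."""
    def __init__(self):
        self.store = {}                     # va -> pa

    def update(self, va, pa):
        self.store[va] = pa                 # authoritative copy

class CacheServer:
    """Read-optimized tier: servers send lookups here. Entries are
    refreshed lazily, so reads are eventually consistent."""
    def __init__(self, masters):
        self.masters = masters
        self.store = {}

    def lookup(self, va):
        if va not in self.store:            # miss: pull from some master
            self.store[va] = random.choice(self.masters).store.get(va)
        return self.store[va]               # may be stale until invalidated

# Five masters and 60 caches, as in the deployment he cites.
masters = [MasterServer() for _ in range(5)]
caches = [CacheServer(masters) for _ in range(60)]
for m in masters:                           # writes replicated to all masters
    m.update("VM-Y", "ToR-3")               # (simplified; real protocol differs)
print(caches[0].lookup("VM-Y"))             # -> "ToR-3"
```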
That whole capability is actually in beta release now, and thousands of enterprise customers are using it today. But let me skip ahead.

Now let's move on to the second part: predictable and uniform high capacity. Again, let me begin with some key challenges and opportunities. This is the challenge slide, and one of the key challenges is related to traffic patterns. A traffic pattern is basically a mathematical descriptor of how the servers in the network talk to one another. Why do I care about it? Because if I know the traffic patterns, I can optimize the routing topology and routing scheme to best serve those particular patterns — basically the biggest-bang-for-the-buck approach. To understand this, I instrumented a large data mining cluster composed of more than 1,500 servers and derived distinctive traffic patterns using some well-known machine learning techniques. And the observations were very surprising. First, even from the traffic measurements of one single day, I was able to identify more than 100 unique traffic patterns, and those patterns were quite different from one another. Second, those traffic patterns changed frequently, and when they changed, they did so in an unpredictable fashion. These findings immediately set off alarm bells in my mind, because they are so different from what networking researchers have seen in other types of big networks, such as ISP backbones. In an ISP backbone, if you do the same kind of analysis, in a single day you'd see less than a handful of distinctive traffic patterns, and they repeat every single day; there is a very strong diagonal pattern. In data centers, that kind of regularity is very, very weak. So what does this mean? It means that if you want to optimize routing to avoid hotspots, you'd better do it very frequently and rapidly; otherwise, the efficacy of that optimization won't last.

Fortunately, there are some great opportunities as well. These opportunities are related to the characteristics of traffic flows — TCP or UDP connections. Again, why do we care? Because to avoid hotspots, networking mechanisms always employ some traffic-spreading mechanism, and the unit of traffic spreading is always a connection; otherwise, you would end up causing out-of-order delivery. According to the same measurement study I just mentioned, for more than half of the time, each server communicates with at least 10 other servers, and sometimes up to 100 servers. That means there are very many concurrent flows in the data center. At the same time, more than 99% of all flows terminate within a second. What that means is that this large number of flows are actually quite small — even if you want to deliver a 100-megabyte chunk over a 10 Gbps network, it will definitely terminate within one second, because the network is just so fast. This also agrees with other researchers' recent measurement data published in some top-notch conferences. And there's another great opportunity. To avoid cache-miss problems, modern operating systems pin a connection to a particular CPU core, and one CPU core today usually cannot push or receive more than 3 to 4 Gbps — 2 to 3 is a little dated; it's 3 to 4 Gbps today. And modern CPUs scale by increasing the number of cores rather than by making individual cores faster.
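A quick sanity check of those numbers (my arithmetic, ignoring protocol overhead): the time to move a 100 MB chunk over a 10 Gbps path is

$$ t \;=\; \frac{100\ \text{MB} \times 8\ \text{bits/byte}}{10\ \text{Gbps}} \;=\; \frac{0.8\ \text{Gb}}{10\ \text{Gb/s}} \;=\; 0.08\ \text{s}, $$

well under a second, consistent with the claim that over 99% of flows terminate within a second. And a single core sourcing 3 to 4 Gbps can fill at most roughly 40% of one 10 Gbps link, so individual flows stay small relative to link capacity.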
So I would expect this kind of behavior to last for quite a while. At the same time, the network links between switches are much faster than this: at least 10 Gbps today, 40 Gbps maybe next year, and 100 Gbps in a couple of years. So what does this mean? Let me give you an analogy: the flows are balls, and the links are bins. When you have a large number of small balls and relatively large bins, just throwing the balls randomly into the bins might not fall far behind optimal spreading. In other words, simple probabilistic traffic spreading might work well enough in data centers, and we actually take advantage of this fact.

Now, with these observations in mind, let's move on to the solution space. Our charter is to build a network that doesn't have any bottleneck under any traffic pattern. But what do I mean by "any traffic pattern"? To define that mathematically, I borrowed the notion of the hose model. Since this is Stanford, I presume a lot of people are quite familiar with Valiant load balancing and the hose model. The hose model enforces just one simple constraint: the amount of traffic leaving all the senders toward a particular destination, summed up, is no larger than the receiver's network attachment capacity. Why do I need this rule? Because without it, in-network congestion is inevitable. Suppose everybody sends at the full rate of C to a single destination j — then you will have severe packet drops at that location, no matter what; in-network congestion is inevitable. By requiring only this single constraint, the hose model represents the most lenient traffic model that is still admissible. In plain English, it means that senders do not send more than receivers can drain from the network.

And here's another interesting observation: TCP actually enforces the hose model. Why is that? Let me use j as an example again. Suppose j's network attachment capacity is C. The whole volume arriving at j will not exceed C, because if all these connections are TCP, the TCP stacks will enforce that naturally. That admission control is done by the TCP stacks at the senders — not somewhere in the network, not in front of j, and not at j. That's precisely the hose model. But remember, the hose model is very lenient: if you don't have a good network topology and routing, you can still have in-network congestion. For example: suppose you have this topology and the traffic demands between these four nodes are as shown. If you're doing shortest-path routing, both flows will take this path, and if this link has less than 2C of capacity, you'll still have in-network congestion. What this means is that you need a good topology and a good routing scheme.

And what are a good topology and a good routing scheme? Here's an ideal pair: a full mesh topology plus Valiant load balancing. Valiant load balancing was proposed in the early 80s as a mechanism for parallel processor communication on a hypercube topology. Researchers have recently looked into this mechanism for on-chip communication and also for Internet wide-area communication — Professor McKeown here especially worked on the latter. And my research is the first to use this theory for data center networks.
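Stated compactly, in my own notation: let \( \lambda_{ij} \) be the offered traffic rate from node \( i \) to node \( j \), and let \( C_j \) be node \( j \)'s network attachment capacity. The hose model admits any traffic matrix satisfying

$$ \sum_{i \neq j} \lambda_{ij} \;\le\; C_j \quad \text{for every receiver } j, \qquad \sum_{j \neq i} \lambda_{ij} \;\le\; C_i \quad \text{for every sender } i. $$

The talk emphasizes the receiver-side constraint; the full hose model bounds the sender side symmetrically, and TCP's congestion control approximately enforces this at the senders.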
So again, VLB assumes we have a full mesh topology. Suppose server 1 needs to send some amount of traffic to server 6. Instead of sending it directly from 1 to 6, VLB first spreads the traffic evenly over all the other nodes, and each of these intermediate nodes then forwards the traffic down to the final destination. This indirection is the signature of Valiant load balancing. And if you keep doing this for every single connection, you get a great effect: VLB theory mathematically proves that once you do this, you don't have any in-network congestion at all inside this network. The proof is relatively straightforward, so I'll be happy to talk about it offline.

This VLB theory is really beautiful, especially in the context of data center networking, for many reasons. First of all, the indirection is very cheap in a data center: it introduces just a few microseconds of extra latency, which is negligible. Second, data center network topology is very regular, if not full mesh, so we can easily implement VLB, unlike in a wide area network. At the same time, the theory of VLB ensures that the network is not just free of bottlenecks, but that the required link capacity on this full mesh topology is also minimized, down to 2C/N. What this means is that if you increase the data center size by adding more servers — increasing N — your required per-link capacity actually goes down. That's another beautiful part. And finally, this also shows the power of randomized algorithms: treating every case, even the best and worst cases, just like the average case, ensuring predictable performance even when the input traffic is totally unpredictable — which is exactly the situation in a data center.

Now the question is, how can you realize this in practice? Here is the topology that we propose. Basically, it's an adaptation of the Clos network topology, which has been around for about half a century now; we call this a folded Clos topology. Instead of building a full mesh directly between the top-of-rack switches, I first reduce the number of nodes by introducing another layer, called aggregation: each aggregation switch hosts a number of ToRs without introducing any oversubscription at this layer. I just wanted to reduce the number of switches so that I could build something Clos-like easily. Now, if I wanted a plain full mesh topology, I'd have one problem: the number of aggregation switches would have to be identical to the number of ports on each aggregation switch. That's a bad requirement, because when you have a big network, you also need a correspondingly big switch, and there are only certain kinds of switches you can buy. To avoid that problem, we use this folded Clos topology. The signature is this full bipartite mesh, and it introduces two separate parameters, K and D: K is the number of aggregation switches, and D is the number of ports on each aggregation switch. The key idea is that these two can now differ, so we can build a big network using small-D devices. In fact, just for simplicity, suppose D and K are identical — these are the devices that are available right now.
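As a back-of-envelope illustration of the K-and-D sizing, here is a sketch using the parameterization from the VL2 paper — 20 servers per ToR and two ToR uplinks are that paper's assumptions, not necessarily the production numbers behind the figures he quotes next:

```python
def clos_server_count(k, d, servers_per_tor=20, uplinks_per_tor=2):
    """Rough folded-Clos sizing: k aggregation switches with d ports
    each, half the ports facing down to ToRs and half facing up to the
    intermediate layer; each ToR uses `uplinks_per_tor` uplinks."""
    down_ports = k * (d // 2)              # aggregation ports facing ToRs
    tors = down_ports // uplinks_per_tor   # each ToR consumes that many ports
    return tors * servers_per_tor

# Example: 144-port aggregation switches with k = d = 144.
print(clos_server_count(144, 144))         # -> 103680 servers, no oversubscription
```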
With that kind of device, you can build a fully non-oversubscribed network for up to 20,000 servers. And with the kind of device that will become available very soon, you can go all the way up to 200,000 servers without any oversubscription.

Now, finally, how do VLB and this new topology work together? Suppose there is some traffic demand from T2 to T5. Let me remind you of the key principle of VLB: spread traffic in a destination-independent fashion over all the possible paths. How do I achieve this without introducing any new hardware or some funky mechanism in the switches? Fortunately, I didn't have to introduce anything, because IP networking already has something called ECMP — equal-cost multipath forwarding. What it does is this: when a router has multiple equidistant paths to the same destination, it chooses a next hop for each packet in a deterministic fashion, and it chooses different next hops for different connections. The specific mechanism takes a hash of the five-tuple values, so that all the packets of a connection take one single path, but different connections get different next hops. That's very similar to ideal traffic spreading, if not identical. And thanks to this topology, if you pick any two top-of-rack switches, we naturally have a large number of equidistant paths between them, and those paths don't introduce long detours. So the traffic volume is naturally spread across all these paths.

Now, the key question is, of course: how close is this to the ideal case? There are some aspects in which we fall short of ideal VLB. First, we're doing random assignment, not round-robin. Second, flows are not identical in size. If we did byte-by-byte or packet-by-packet round-robin, it might be really close to VLB, but we're doing per-connection random spreading, so there will be some suboptimality — and I'll quantify that in the next slide. This mechanism harnesses the huge aggregate capacity of this network without introducing any esoteric traffic routing optimization or relying on new hardware functionality in the physical switches. It also ensures robustness to failures, because we have many active paths, and it works with switch mechanisms available today.

Finally, this is what I built about three years ago: some 80 servers interconnected through the folded Clos network topology — and as you can see, everything is perfectly unprofessionally wired. I'd really love to show you how neat and clean all these things are in a production network, but unfortunately, I can't share any pictures of our production data centers. But we really do have this now, and way bigger. This is now the standard architecture not just for Windows Azure but for all big divisions in Microsoft. Along with my colleagues, we've built these networks in all the data centers we have around the world. The smallest one has more than 60 terabits of capacity, the worst-case oversubscription ratio is 2 to 1, and each server has a 10 Gbps network attachment. With that, each server has a network I/O throughput actually surpassing its disk I/O throughput: each server can send, even in the worst case, more than 600 megabytes per second, which is larger than the theoretical maximum of SATA 2.0.
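Going back to the ECMP mechanism for a moment: the per-connection spreading he describes is just a deterministic hash over the five-tuple. Here is a toy version — a generic hash stands in for whatever the switch ASIC actually uses:

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of several equal-cost next hops by hashing the 5-tuple.
    Every packet of a connection hashes the same, so a connection sticks
    to one path; different connections spread across the paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    index = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return next_hops[index % len(next_hops)]

paths = ["intermediate-1", "intermediate-2", "intermediate-3", "intermediate-4"]
print(ecmp_next_hop("10.0.1.5", "10.0.7.9", "tcp", 49152, 80, paths))
```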
Now, we use basically just IP routing in the switches, and hence we can build this with any pizza-box switches — and we actually did that. As for the cost savings: given a target capacity of, say, 62 terabits, if you compare the cost of building that network using this design with that of the old design, we actually saved more than 96% of the cost.

Next, I wanted to validate how closely this network follows the theory of VLB. So I ran some all-to-all traffic shuffle tests using 75 servers, because that's the most challenging workload for a network. This chart shows the actual aggregate goodput over 400 seconds — the test ran for 400 seconds — where aggregate goodput is the sum of the bytes actually delivered to the applications across all 75 servers. To compare this with the ideal case, I also plotted the theoretical maximum of VLB. If everything worked perfectly, you would see the aggregate curve rise and then drop all at once at this moment. As you can see, our network gets quite close to the theoretical maximum — actually 94% of it.

But I want to focus on this 6% suboptimality. Why is it happening? First of all, as I told you, we're doing per-flow random spreading, and that can introduce some suboptimality. Also, to enforce the hose model, we're using TCP, and TCP is not perfect. It's good, but not perfect: it takes some round-trip times to do slow start, and it introduces some packet loss while finding the right rate. So there are a couple of causes here. I believe TCP is good enough, and I didn't want to further improve TCP; I just wanted to see how much the flow-level random spreading contributes to this suboptimality. So I went into the production system and looked at link utilization, because if there is any suboptimality, it will manifest itself as some links being utilized more than others. I collected link utilization statistics from the production network over a whole week, as per-minute link utilization, which ranged from about 50% to 80% depending on the workload over the week. And the standard deviation of link utilization, measured separately at every single minute, was less than 1.5 percentage points, meaning that link utilization is quite even. Hence, we can confirm that we approximate ideal VLB fairly well, even in production.

But on the previous slide, you made it look like some stragglers finished late, as opposed to everything finishing at the same time.

This one? This bump? That was incast — TCP incast. We're not using DCTCP or anything like that, so they all burst together, and some of them just get hurt a little bit. Great point, though.

That one-and-a-half-percent variation — is that across links, or within a minute?

Across all links — across all links.

I have a question. There's a lot of work that has been done to characterize elephant flows and mice flows in service provider networks. Is there data available to characterize that traffic in data centers as well? Because the nature of the traffic is also changing.

Yeah, some of that is actually available. I'll contact the owners of those data and get back to you. And yes, I think the nature of the traffic is changing, especially in data centers, because it's not generated by humans; it's generated entirely by code.
And the application programmers are fully aware that if they generate huge elephant flows, they'll cause serious issues in the data center.

Or a lot of mice flows could cause a lot of problems, too.

Yeah, that too.

And in your design, did you look at the size of the buffer that you need at the NICs? Because you are driven by the source, to manage your traffic?

Are you talking about the TCP socket buffer, or the NIC hardware buffer?

The NIC hardware buffer, for simplicity.

OK — can we take this question offline, so that I can proceed?

About the 1.5% deviation — that's for a one-minute interval?

Yes. Every single minute, I compared the link utilization across all the links.

But given that the majority of data center flows last less than a second — if you're trying to measure the balance and really quantify how much it affects true performance, you probably need to go down to the level of, say —

Good point; I definitely agree. It's just an artifact of not having finer-grained measurements to process.

[Inaudible question about packet drops.]

We also looked at the tail drops — how many packet drops happened on this particular network — and we enabled ECN on every single hop to see whether ECN-marked packets occur frequently. Those things look quite clean, so I have pretty strong confidence that transient oversubscription lasting less than a few milliseconds is not happening frequently at all in this network.

Have you analyzed whether MPTCP might improve the performance further?

Could you remind me what MPTCP is?

Multipath TCP. You could actually employ multipath TCP on top of this, because you indeed have a lot of multiple paths. You could embed that.

I would say it's a very viable approach here. But one thing I want to mention is that MPTCP wants to take advantage of resource pooling: if I can use multiple paths, I can get more bandwidth. In this case, that benefit might not be present, because your network capacity is essentially bounded by your NIC capacity. No matter how many TCP subflows you use, each taking different paths — ECMP is already doing that, not within a connection, but across multiple connections. So that benefit might not materialize in this architecture.

This is the last slide before the conclusion. Network virtualization is comparable to other key virtualization technologies, such as virtual memory, because it meets the key principles of any virtualization technology: abstraction, isolation, and efficiency. Let me go fast on this slide. What is the abstraction of network virtualization? Location-independent addressing and uniform high capacity — basically, physicality does not matter. Isolation: reachability isolation, address space isolation, and performance isolation. And efficiency: just as virtual memory hosts multiple processes sharing exactly the same virtual address space, we also ensure the concurrency of virtual networks, using a feature called bring-your-own-addresses. All of this improves agility in the data center significantly. So network virtualization enables agility, meaning that providers and tenants can use any available server at any time in the data center.
And network virtualization is comparable to virtual memory, machine virtualization, and storage virtualization. I have also shown you that turning these high-level architectures into operational systems is quite possible when your focus is on simple solutions and intuitive abstractions. The intuitive abstraction that we're proposing here is basically a huge virtual switch, and this virtual switch gives you flat addressing and predictably, uniformly high performance. And eventually, I also turned this into an individual huge virtual switch dedicated to each particular tenant. So with all that said, I'll be happy to take your questions. Thank you.

We have time for two questions; otherwise, you'll have to take it offline. OK? Any questions?

Service providers have implemented similar virtualization using layer-3 VPNs — very similar to this. Could this technology be applied in service provider networks? What are the big issues that you see?

So, at least in terms of capability, it's like something such as VRF — virtual routing and forwarding — which allows routing protocols to disseminate overlapping address ranges. It's similar, but those kinds of protocols do not handle address assignment, address resolution, and those mechanisms. For example, when you disseminate this information, you're talking about two or three orders of magnitude more routing information, because everything here is per host, not per aggregate or per prefix. And hence, when you disseminate it, you don't want to just blindly flood it using something like an iBGP full mesh or route reflectors — that's not going to scale at all.

So you use the directory service instead of that?

Yes.

Any other issues that you see?

Address allocation and address resolution are the other problems. The existing VPN technologies do not care about those parts; they only care about setting up multiple VRFs and then synchronizing those VRFs — not intercepting DHCP and ARP and all those things.

Can you give an estimate of how many additional switches you need with the folded Clos versus a traditional fat tree? And also, how much is the core traffic increased by VLB versus not using VLB?

Let me try to understand your question. Are you trying to compare between this architecture and a fat tree?

Yeah — I think the folded Clos looks like a fat tree with more aggregation. And also, VLB requires sort of two trips through the core, potentially.

Yeah, two trips through the fabric.

So my question, I guess, is: how much more hardware do you need with this versus a traditional data center network design?

Basically, as you go up each layer, you still need exactly the same amount of full bisection bandwidth. If you have n servers, each with capacity C, you have nC of capacity, and that has to be preserved at every layer — unless you are willing to introduce some small oversubscription. So the number of switches and links definitely increases, but I think that's just the natural cost of offering more bandwidth. Between this topology and the fat tree from the UCSD folks, I actually don't see much difference. They're also using a multi-rooted tree, and as you go up the levels, you have just as many mesh edges as well. So I don't think there's a fundamental difference between the two.

Let's thank Chang again. Thank you.