Hello everyone, welcome to the next seminar. Shall we start? Okay. So today's speaker is Vimalkumar. Vimal is a student here in the Computer Science department, working with Professors Mazières and Balaji Prabhakar. He has been working on different things in networking: network troubleshooting, performance, virtualization, and network emulation. Today's talk is on EyeQ, a work that he has done between Stanford and Microsoft.

Okay, so thanks Giannis. I'm not going to talk about what this picture is; maybe I'll come to it at the end of the talk. I'm going to talk about a system we built at Stanford called EyeQ. As Giannis mentioned, this is joint work between Stanford and Microsoft.

So, you know, things used to be really good sometime in the last decade. Companies and enterprises used to build their own data centers; some ran everything in house, and others bought their own hardware and put it in a co-location facility somewhere. The nice property of these data centers was that you bought your own network, so you knew exactly the performance lower bound: if you bought a gigabit of network, you got a gigabit of network; if you bought some CPU, you got exactly that CPU. Dedicated, no problem at all. But then sometime in the last decade came the cloud, and it basically promised to lower costs for many small and medium-scale enterprises, as long as everyone started sharing resources: we scale up whenever we want more, and we scale down whenever we don't.

So what happened to performance? Well, this is an example of what could happen. This is a graph that I took this morning, at about 6:30. It shows the performance of a memcached client on a shared memcache cluster hosted by Google App Engine; you can go to this website and check the live statistics. What the graph shows is that the mean and median latencies for memcached get requests are sort of okay, well within tolerable limits for an interactive application. But if you look at the y-axis, the 99th percentile latency just shoots up and down. It depends on the time of day, and probably on whom you are co-located with. The point is that the 99th percentile latency varies throughout the day. You can, in fact, check the historical trends on this website; it's very interesting.

Some of you might be wondering: why do I care about the 99th percentile latency, or even about latency at all? To explain why, let me give you a very simple example. Today a lot of applications cannot really fit on one machine, so they run in a distributed fashion. A typical web application has a front-end web server that gets requests from the World Wide Web, and in the back end you have some servers, such as a database, which is typically distributed across multiple machines; these could be caching machines, databases, or whatever. The front-end web server, depending on the request it gets, fans out a number of internal requests to generate the final web page that you're hoping to see. This architecture is a typical partition-aggregate workload, and it shows up in many online web services: Google, Microsoft, Facebook, Reddit, and so on.
The interesting thing about this is that the front-end web server issues a lot of requests to its workers, and the response latency, which is the time the user waits for the web page to be rendered, is determined by the slowest worker. In fact, you can work out the math, and it turns out that as the fan-out grows, even a modest increase in the high-percentile latency of individual workers means that the external web response becomes dominated by the tail of the latency distribution. This is why people care about the 99th percentile latency.

And, well, meanwhile at Facebook: I saw this graph at NSDI this year. It shows a data dependency graph, where every node is a piece of data that has to be fetched, possibly involving multiple rounds between servers within the data center. This is the dependency graph for, I don't know exactly what request, but they said it's a small one. So complex web applications do exist.

Question: in the web context, this is a simple example of head-of-line blocking, and there's a well-known solution to that, which is just to allow things to be returned out of order. There are web systems, like SPDY, that allow you to return things out of order, and there are also many excellent examples of web pages that load the framework first and then fill things in as they come back. Sure, you can do that, and in fact a lot of applications do exactly that, in the sense that they don't wait for every single back-end response to return. But this lowers the quality of the results, and with lower-quality results your revenue is probably going to fall. There are a lot of studies that show why latency really matters.

But anyway, this is not just endemic to one particular cloud provider. I showed Google App Engine in the first slide, but some of my friends here run a startup on Amazon EC2, and I asked them for pairwise ping latencies between their servers, collected over a period of three weeks. I just plotted them: the x-axis is time, the y-axis is the ping latency in microseconds. The mean latencies are sort of OK (ignore the 10, 20, 40 on the axis; it doesn't matter, it's just a time series), but look at the 99th percentile latency: it shoots up and down. So it's not just one cloud provider; the problem exists across many cloud providers.

There could be many reasons why the performance is so unpredictable. It could be the CPU; it could be the shared hardware or the hypervisor. But one thing that is guaranteed to cause this problem is network congestion. In fact, the sad state of affairs in how we manage network bandwidth in our data centers today can be summarized by this one picture of a huge traffic jam. The utilization of data centers is continuously increasing, and the reason is that operators and providers want to pack more and more services onto a single machine to decrease their operational expenditures. An increase in utilization causes more congestion, and more congestion leads to more packet drops and worse performance for everyone.

So how are people reacting to this congestion? Before I go into that: as I said, the key issue we have today is that today's transport protocols are not built to satisfy predictability.
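To make the fan-out point above concrete, here is a small illustrative calculation (the percentile and worker counts are just examples, not numbers from the talk): if a front end waits for all of its workers, the chance that at least one reply lands in the slow tail grows quickly with the fan-out.

    # Probability that at least one of n workers is slower than the given
    # per-worker percentile, under the illustrative assumption that the
    # workers' latencies are independent.
    def prob_hit_tail(n_workers, percentile=0.99):
        return 1 - percentile ** n_workers

    for n in (1, 10, 100):
        print(n, round(prob_hit_tail(n), 3))
    # 1 -> 0.01, 10 -> ~0.096, 100 -> ~0.634: with 100 workers, roughly 63%
    # of requests see at least one 99th-percentile-slow reply.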
I'll come back to this point in later slides. But let's see how people actually deal with these issues today. Just survey the blogosphere; there are a lot of industry comments. One that popped up was an analytics company that ran their entire startup in the cloud. They said, well, the performance we get in the cloud is just so bad that we gave up optimizing our applications and simply moved off the cloud. That's bad for providers, because they lose revenue as customers move out of the cloud, and it does not build confidence in shared infrastructure.

Some other organizations are technically very mighty; they love challenging the status quo and solving interesting problems, and that's what they do: they rewrite their entire application stack. In fact, many of you have probably been reading the Netflix blog; they've taken a lot of pride in saying they've developed a lot of interesting infrastructure and that they stress-test their applications against performance variability. Well, they can probably do that, but frankly I don't have the patience to rewrite my entire application stack.

So these are the two major approaches people have taken to deal with this performance variability. I'm going to argue that there's perhaps a saner approach. One way you can do away with this performance interference is if you start offering each virtual machine, or each service, some kind of rate guarantee. Providing rate guarantees is attractive because it gives you a form of isolation: the rate guarantee you give to a customer is no longer dictated by the interference from other customers.

Now, predictability can actually mean different things for different applications. If you don't know anything about the traffic patterns, perhaps rate guarantees are the best thing you can give to an application: do whatever you want, and I'm going to give you some amount of bandwidth. I'll delve into what this actually means in later slides. But if you're running a web server, maybe all you care about is, as I said, bounded response latencies for your web requests; and interestingly, if you work out the math, that comes down to the bandwidth you get. Similarly, for MapReduce jobs you could say: what I really care about is not rate guarantees for my MapReduce jobs, I want my jobs to complete within a certain amount of time. That can be a definition of predictability for some kinds of jobs. And similarly for network storage. But the underlying theme behind all of this is that there's a notion of a rate guarantee that you give to a VM, or a set of VMs, and that's going to help you design algorithms for all of these higher-level goals.

So, coming back to the status quo: what do we have today? Well, the state of the art in managing network bandwidth, and what has been widely deployed for years, is TCP. TCP has really served us well in managing congestion. But TCP is not enough here, because it doesn't satisfy our predictability requirement. Back in the days of download accelerators, you would just open more and more parallel TCP connections to get more bandwidth; it's easy to game per-flow fairness. And of course there are other ways to deal with this.
You might say: well, as a provider, I'm going to start rate limiting my customers, maybe at the granularity of VMs. So if a customer has 10 VMs, each VM just gets, say, a gigabit of transmit capacity, and that should give you some sort of isolation. But no, it's in fact very easy to game this as well, and it's very easy to see why. Say you have two tenants, red and green, distributed throughout the data center, and two VMs of different tenants are co-located on a single physical machine. Maybe red is given 2.5 gigabits per second and green is given 7.5. In this scenario it may work out all right, but as red keeps adding VMs, it's just going to grab more and more bandwidth at the receive side. So rate limiting on the transmit side alone is not sufficient; it doesn't solve the problem. Unfortunately, this is what providers do today; they don't do anything more than rate limiting on the transmit side.

Some of you might say: why don't you rate limit on the receive side? You can monitor the bandwidth, and if the traffic exceeds the limit you start dropping packets. But that's not really going to help, because the traffic has already done its damage in traversing the network, consuming resources all along the way, only to be dropped at the receiver. It doesn't solve the problem.

There are a number of other approaches as well. You can say: I'm going to use class-of-service queues and give each tenant its own queue, and that gives me some kind of isolation. Sure, yes, that gives you isolation. But if you look at a public cloud provider today, you have tens of thousands of tenants, each with maybe a couple of VMs. If you want to create a queue for every single tenant throughout the network, this becomes an operational nightmare: tenants come and go, they spawn more and more VMs, and you have to keep reconfiguring your network. And it doesn't really solve the problem either, because it might give you isolation on a single link, but if packets are getting dropped on that link, why are you even admitting that traffic into the network only to drop it? That wastes bandwidth.

And of course, if you have been reading the networking literature, people have been proposing: I'm going to build a full bisection bandwidth network, and if I have a lot of bandwidth, that will solve my problem. The answer is no. Let's say you build this ideal network: a fabric with, let's say, infinite capacity, and a number of physical machines attached to it. The access links cannot be infinite; maybe they're a gigabit per second, or 10 gigabits per second. In this scenario you have two tenants, red and blue, with some traffic. As red sends more and more traffic, there's going to be congestion on the last hop, on the access links. So no matter how much bandwidth you have within the core of your network, you still have to deal with this problem at the edge. And I'm going to show you evidence that real networks are pretty close to this ideal fabric; companies are building these networks today.

For this, we actually did a congestion study on Windows Azure's networks. Windows Azure builds its networks like this: there are a number of racks, with maybe 20 to 40 servers per rack.
These are connected to a layer of top-of-rack switches, which might have about 64 ports. The nice thing to note here is that the leaf layers are connected by a high-bandwidth spine layer: switches with high fan-out that interconnect all the racks in your network. The sole purpose of these switches is to provide enough bandwidth across your data center, uniformly and free from internal bottlenecks. You could build full bisection bandwidth networks today; they're not that expensive. And what we have seen in the wild is that people build networks at, or close to, full bisection bandwidth, with at most a small oversubscription. This is in fact being built today.

On this network we tried to understand where congestion really happens. We started monitoring a storage cluster, because it's in fact one of the hottest clusters they operate; a lot of tenants use the storage cluster. We looked at link utilization on two types of links: the core links, which are the blue links here, and the edge links, which are the links connecting the top-of-rack switches to the servers. If you look at link utilization collected over a period of two weeks, and you look at the 99th or 99.9th percentile of link utilization (even over two weeks, this percentile corresponds to minutes of link utilization, like tens of minutes), what we saw was that the core links were only about 30% utilized at the 99.9th percentile. Not on average, but at the 99.9th percentile. The edge links, on the other hand, were very highly utilized. This suggests that, indeed, in today's networks, congestion, using link utilization as a proxy for congestion, happens more at the edge than in the core. This could be due to other reasons as well; there could be a lot of rack-local traffic, which could explain the same behavior, but it still supports the observation that congestion happens more often at the edge than in the core.

While this gives you a macroscopic view of congestion, we also collected drop counters from all the links. These are cumulative, so they're aggregated over a period of weeks. We found that in the hottest clusters, for every packet that's dropped in the core of the network, there are over 1,000 packets dropped at the edge. And packet drops are a sure indication of congestion happening in the network. In fact, in over 16 of the other clusters that we looked at, there were no packet drops happening in the core at all.

In this cluster they were running a storage application that uses TCP. The nice property of TCP is that it doesn't send more traffic into the network than the receiver can actually absorb; it ensures that the traffic is admissible. If you have two senders sending to one receiver, TCP caps the sum of the rates of these two flows so they don't exceed the capacity available at the receiver. And of course these networks have a lot of capacity and they use ECMP, so the traffic spreads out across the core and the core stays uncongested.

So what I just told you is that these networks are not built in a random fashion (although there are papers that say you can wire your network randomly); they are in fact built to satisfy the needs of the applications running in the cloud. There are a number of research proposals for such fabrics, and the point I want to make here is that these proposals are real: they are deployed at multiple cloud providers today.
The other point, of course, is that multipath and traffic admissibility effectively push congestion toward the edge of the network. And if the edge is where the problem is occurring, then perhaps that's where we should solve the problem of traffic congestion. That is the idea behind our system, EyeQ.

What we built is a system of bandwidth guarantees. Remember how I said that rate guarantees tie into this notion of providing predictable bandwidth as far as the network is concerned. This is good news, because now customers can provision their VMs and their services just as they used to provision their dedicated servers: you get a VM, you specify some amount of CPU, and you specify some amount of network bandwidth, and the provider can place the VMs anywhere across the data center. And since no provider today offers this, it would be a competitive edge as well.

I'm going to show you how our system solves this problem of providing predictable bandwidth guarantees to each VM. Note that the unit of allocation is a VM: every VM gets a guaranteed transmit and a guaranteed receive capacity, and the system works, in a fashion I'll get to, to meet these guarantees as quickly as possible.

Where does this all fit in? I'm talking about one piece of a bigger picture, which is data center resource management. The way people think about data centers today is not that I have a collection of machines and I want to run some service on them, but rather the data center as a physical infrastructure that provides a resource pool (compute, storage, network) and a layer that manages this entire resource pool. A fair CPU scheduler gives you CPU guarantees or CPU isolation, and where EyeQ fits in is the network piece: it gives you rate guarantees.

A brief recap: I said network congestion predominantly happens at the edge of the network, and I gave you evidence supporting that claim. Now I'm going to show you how this leads to a very simple design for managing bandwidth at the edge. Let's start with the goals. Isolation is one. The other goal is work conservation: let's say there are VMs in your data center and some tenants are just sitting idle; if there's spare capacity, you should be able to redistribute that spare capacity to tenants who actually need it. I'll come back to this with a simple example.

Notice that I said EyeQ operates at the edge. By the edge, I mean a shim layer sitting inside the hypervisor, in the software virtual switch for example, which is able to intercept all traffic; this exists at the very edge of the network. I showed you that the network core is not predominantly the source of congestion (I'll get back to this assumption later), so for now I'm going to abstract the network away. Let's say there are two tenants, red and blue, and I give them two gigabits and eight gigabits per second respectively, and they share a common 10 gigabit per second pipe.
Initially, let's say there are two red flows sending to a receiver. There's a lot of capacity available, so each flow gets 5 gigabits per second; no problems. But let's say blue starts sending traffic; it needs some amount of bandwidth, and when blue starts sending, it's going to cause congestion at the last hop. Notice that this congestion is local: the server can immediately detect that it is happening just by looking at the link utilization, and it reacts. It uses this information (I'll get to the exact mechanism) to give blue 8 gigabits per second, because that's what it asked for and that's its guarantee, and the remaining capacity is split equally between the two red sources. Now let's say a third red flow starts, from the VM on the left to the VM on the right; it wants maybe 5 gigabits per second. This causes congestion at the sender, because the available capacity there is just 10 and the demand is 13, so say you split the sender's capacity and this flow gets its share. What I want you to notice is that the act of reducing bandwidth here, on a physical machine somewhere in your data center, frees up capacity at the first receiver. That can again be locally detected, and the spare capacity can in turn be redistributed, water-filling style, to the flows that still have demand. Of course, you cannot meet the rate guarantee for the VM on the right, because it's bottlenecked at its source. So wherever you have sufficient demand you get your rate guarantee, but if you're bottlenecked at the source I cannot make promises; this is what you would expect even on a dedicated physical network, where you cannot get rate guarantees everywhere either.

Question: why do you do a 50-50 split rather than a proportional allocation? You can do proportional allocation as well; this is just an example. Any questions so far? Okay.
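As a side note, here is a minimal sketch (my own illustration, not EyeQ's code) of the water-filling style allocation the example above walks through: capacity is handed out in proportion to the guarantees, capped at each tenant's demand, and whatever a satisfied tenant leaves on the table is redistributed to tenants that still want more.

    # Weighted max-min (water-filling) allocation over a shared link.
    def water_fill(capacity, demands, guarantees):
        alloc = {t: 0.0 for t in demands}
        active = {t for t, d in demands.items() if d > 0}
        remaining = capacity
        while active and remaining > 1e-9:
            total_g = sum(guarantees[t] for t in active)
            share = {t: remaining * guarantees[t] / total_g for t in active}
            for t in list(active):
                give = min(share[t], demands[t] - alloc[t])
                alloc[t] += give
                remaining -= give
                if demands[t] - alloc[t] < 1e-9:
                    active.discard(t)   # demand satisfied; leftover is redistributed
        return alloc

    # Blue idle: red's flows can use the whole 10 Gb/s pipe.
    print(water_fill(10, {"red": 10, "blue": 0}, {"red": 2, "blue": 8}))
    # Blue wants its full 8 Gb/s: red falls back to its 2 Gb/s guarantee.
    print(water_fill(10, {"red": 10, "blue": 8}, {"red": 2, "blue": 8}))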
Okay, I'm going to dig into how this actually works under the hood. Recall that I said this link utilization, as it changes, can be observed locally. The way we do this is we instantiate what are called congestion detectors. These congestion detectors are just byte counters that are created for every single VM on a receiving machine (there are two VMs here, so you get two of them), and these byte counters track the incoming utilization for that particular VM. In this case one is allocated a capacity of 2 gigabits per second and the other 8 gigabits per second, and each tracks its own traffic. The job of a congestion detector is very simple: ensure that the aggregate incoming rate matches the capacity allocated to it, in this case 2 gigabits per second. That is the job of a congestion control algorithm, and that's what we use: a congestion control algorithm developed here at Stanford called RCP. I'm going to call it RCP* because it's a slight modification of it. The way it works is that it monitors link utilization and sends explicit rate feedback to the sources, and the sources are then rate limited. Notice that this rate limiting happens in the trusted domain, inside the hypervisor; we don't expect tenants to obey the rate feedback themselves. The hypervisor at the sender creates rate limiters and rate limits the traffic, and these rate limiters are dynamic: you don't create fixed rate limiters operating at a fixed capacity. The rate at which a rate limiter drains is determined by what happens at the receiver.

Okay, so let me delve into this sub-problem in a little more detail. Question: how do you monitor the congestion, by looking at the queue size or at the number of packets? We basically monitor link utilization, at around 200 microsecond intervals; I'm going to talk about exactly how that's done.

So consider this sub-problem: you have a congestion detector that has been allocated some capacity, in this case say 3 gigabits per second, and you must make sure that the aggregate incoming rate matches that rate. The way this works is that the congestion detector tracks a number of parameters. The first is the measured utilization, which we'll call y; this is the counter that's updated every 200 microseconds. Then there's the capacity C that has been allocated to the detector, and there's an alpha parameter that I'll come to later. The detector's job is to determine just one rate, R. It measures y, computes R using the control equation, and then samples incoming packets: it takes one packet and sends feedback carrying the current value of R to the source of that packet. The nice thing about this is that you don't have to keep track of how many senders there are; it doesn't matter, you just sample incoming packets and send feedback.

Let's see how this works. Initially, say all the VMs start blasting at line rate; that's the initial condition. The aggregate utilization is 10 gigabits per second, but the allocated capacity is just 3 gigabits per second, so the detector says: you're sending too fast, try 3 gigabits per second. It's making a guess according to the equation. (Question: how did you set this 3 gigabits per second? I'll come to that later.) So it tells the sampled sender: send at 3 gigabits per second. In the next iteration, all senders send at 3 gigabits per second (again, notice you don't have to keep track of the number of senders), so in aggregate they send 9 gigabits per second. That's still too high, because the allocated capacity is 3, so the detector says slow down even more, maybe send at 0.5 gigabits per second. In the next iteration the aggregate comes down to 1.5 gigabits per second; now it's too low, so the detector increases its estimate of R according to the control equation and says: try 1 gigabit per second. And this is the nice part: now you've entered a fixed point where the measured utilization matches the allocated capacity, every sender sends at 1 gigabit per second, and the system stays there. It's a fixed-point iteration.

Question: you don't really communicate with all the sources, right, you randomly pick one? Yes, and the random sampling works because, with high probability, it picks a sender that is sending at a high rate. Question: this works well with a constant traffic rate, but how does it behave with bursty traffic? I'm going to come to an example that shows how this works with real workloads and experiments. Question: it's going to pick the one sending at a high rate, but if the flow size distribution is skewed across flows, can you sample differently? Yes, you can sample more aggressively if you like. Okay.
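Here is a minimal sketch of the kind of receiver-side congestion detector just described. The update rule is my assumption of the RCP-style form the talk implies, R scaled by (1 - alpha * (y - C) / C) each interval; the exact equation in EyeQ may differ, and the window length and numbers below are only illustrative.

    # Receiver-side congestion detector (illustrative sketch, not EyeQ's code).
    # In the real system, an occasional sampled packet would also trigger a
    # feedback message carrying the current advertised rate back to its source.
    class CongestionDetector:
        def __init__(self, capacity_gbps, alpha=0.5, initial_rate_gbps=0.001):
            self.capacity = capacity_gbps    # C: this VM's guaranteed receive rate
            self.alpha = alpha               # control gain
            self.rate = initial_rate_gbps    # R: rate advertised to sampled senders
            self.bytes_in_window = 0         # byte counter for the current window

        def on_packet(self, size_bytes):
            self.bytes_in_window += size_bytes

        def end_of_window(self, window_us=200):
            y = self.bytes_in_window * 8 / (window_us * 1e3)   # utilization in Gb/s
            # assumed RCP-style update: R <- R * (1 - alpha * (y - C) / C)
            self.rate *= 1 - self.alpha * (y - self.capacity) / self.capacity
            self.bytes_in_window = 0
            return self.rate

    # Fixed-point check: 10 obedient senders, 3 Gb/s allocated, 200 us windows.
    det = CongestionDetector(capacity_gbps=3.0)
    rate = det.rate
    for _ in range(40):
        for _ in range(10):   # each sender obeys the last advertised rate
            det.on_packet(int(rate * 1e9 / 8 * 200e-6))   # bytes per sender per window
        rate = det.end_of_window()
    print(round(rate, 4))     # settles near 0.3 Gb/s, i.e. capacity / number of senders

With alpha around one half the advertised rate settles within a few tens of 200 microsecond windows, which is consistent in spirit with the convergence numbers quoted next; larger alphas overshoot the fixed point and can oscillate.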
So let's say in this example one of the senders just goes away. Notice that this process happens continuously: one sender leaves, you compute a new R, you keep advertising it, and this happens every 200 microseconds.

Let me give you a sense of what this rate evolution looks like. Say you have 10 senders, and alpha is the parameter that determines how quickly you adjust the rates. Initially the advertised rate starts at 0.001, a very small value. On the x-axis you have the iteration number: the first iteration, the second iteration, and so on, up to about 50 iterations. On the y-axis you have the rate as a function of the number of iterations. What you converge to is 1 divided by 10, so 0.1, and for different values of the control parameter you converge in different patterns. For alpha equal to 0.5 you're being very conservative: if my utilization is smaller than my allocated capacity, I slowly increase my rate; that's why the blue curve increases slowly and then eventually converges to 0.1. For alpha equal to 1 it's quicker. But if you start overshooting the fixed point, if you have a very bad parameter, then you're probably never going to converge; that's the green line.

We've done an analysis of this equation. Again the x-axis is the number of iterations, about 40 iterations, and notice that we can recompute these rates every 200 microseconds, so 40 iterations is about 10 milliseconds. This is the parameter alpha that I was varying; we actually use alpha equal to one half, and we have shown that irrespective of what value of R you start with, you converge within 30 iterations, which corresponds to about 6 milliseconds. What this means is that if a new traffic pattern shows up, it gets its bandwidth within 6 milliseconds, and for the many applications whose traffic patterns are inherently bursty, you need bandwidth very quickly.

Why is this important? Today's data center switches don't have much buffering; the buffers are pretty shallow. So you need to react quickly enough that you don't overflow the top-of-rack switch buffers and cause packet drops, because packet drops are just going to hurt your application performance. The 6 milliseconds bodes well given that the amount of buffering available today is on the order of a megabyte.

To illustrate this we ran an experiment. We had about 14 UDP senders, all trying to blast at maximum rate, and there was also a TCP sender, located on a different physical machine, all sending traffic to a single receiver. Two of these are VMs co-located on a single physical server sharing a 10 gigabit link. We started the TCP flow first and then we started the UDP senders; no attempt was made to synchronize them, they all just started immediately. Here is what we saw.
From time t equals 0 to about 18 seconds, if you look at the utilization for each tenant, TCP gets about 10 gigabits per second: its mechanism is conservative when it starts, so it ramps up and eventually takes all of the capacity. Then, when the UDP tenant starts, the rate control converges very quickly; at least at the granularity of this graph, within a couple of seconds each tenant has settled at its guarantee. The nice property of having explicit feedback like RCP is that it tells the sender to transmit at a particular rate; it doesn't just say slow down or speed up.

In fact, we did try a couple of other control algorithms; notice that you can use any congestion control algorithm to limit the sources so that they respect the receive capacity. For the same experiment, if you use DCP (you don't need to know what DCP is; the key point is that it uses a single bit of feedback that the receiver sends to the sources, which just says slow down or speed up), the single-bit feedback does work, in the sense that each tenant eventually gets its bandwidth, but it takes a long time to converge, about 200 milliseconds. You could go beyond single-bit feedback and try another control algorithm that uses multi-bit feedback, say speed up by a factor of 2 or slow down by a factor of 4. That also works, but the convergence time is still somewhat longer: we found it takes about 15 milliseconds to reach the guarantees. The advantage of RCP is that it tells you, boom, send at exactly this rate, and it converges within a couple of milliseconds.

Putting it all together, the components we have on every single physical machine are these. On the transmit side there is, for every VM, a scheduler that ensures each VM gets the transmit bandwidth it is entitled to, so if there's contention on the transmit side each VM still gets its share, and there are rate limiters, created on a per-destination basis, that control how much traffic is sent to any single destination. On the receive side there is a congestion detector per VM; it determines the capacities for each of these rate limiters, and the job of the control algorithm is to ensure that senders do not send traffic in a way that violates what is happening at the receiver.

Everything I've said so far was under the assumption that the network core itself doesn't get congested. In practice there may not always be enough capacity, and flows can collide within the network, because ECMP is not perfect. In that case it's not that we do nothing; we have a fallback mechanism, and what this fallback guarantees is that no single tenant gets starved: every tenant keeps bandwidth in proportion to the number of receivers that tenant has. The way we handle this is using ECN. It's a mechanism by which switches can tell end hosts that there is congestion happening at some link in the network (it doesn't matter which link), so the senders start backing off. We incorporate this ECN feedback into our rate control mechanism; talk to me if you want more details about it. This ensures that you degrade gracefully under in-network congestion. And we have one more fallback, borrowed from TCP: we don't keep sending more traffic if we don't get rate feedback from the receiver.

Okay, so everything I've described today is real. We built EyeQ in software, both for Linux and for Windows; the Linux version is open source, and you can visit this website and download it. The nice thing about EyeQ is that you don't have to modify any of your software to take advantage of the rate guarantees it provides to endpoints. It works regardless of TCP or UDP, so you can now safely admit UDP traffic into the network; it just enforces rate guarantees at the edge, regardless of what runs on top.
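Here is a small sketch (again my own illustration, not the EyeQ source) of the transmit-side piece from the putting-it-together picture above: a per-destination token-bucket rate limiter whose drain rate is simply whatever the receiver's congestion detector last advertised.

    import time
    from collections import deque

    # Per-destination token-bucket rate limiter (illustrative sketch).
    class DestRateLimiter:
        def __init__(self, rate_bps, burst_bytes=15000):
            self.rate_bps = rate_bps     # drain rate, set by receiver feedback
            self.burst = burst_bytes     # bucket depth, roughly one small burst
            self.tokens = burst_bytes
            self.last = time.monotonic()
            self.queue = deque()         # packet sizes (bytes) awaiting tokens

        def set_rate(self, rate_bps):
            # called when an RCP*-style feedback packet arrives from the receiver
            self.rate_bps = rate_bps

        def enqueue(self, pkt_bytes):
            self.queue.append(pkt_bytes)

        def poll(self):
            # refill tokens for the elapsed time, then release what fits
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate_bps / 8)
            self.last = now
            sent = []
            while self.queue and self.queue[0] <= self.tokens:
                pkt = self.queue.popleft()
                self.tokens -= pkt
                sent.append(pkt)
            return sent   # bytes handed to the NIC on this poll

    # Usage: one limiter per (VM, destination); a scheduler above them arbitrates
    # between the VMs sharing the physical NIC.
    limiter = DestRateLimiter(rate_bps=1e9)   # 1 Gb/s, e.g. from receiver feedback
    limiter.enqueue(1500)
    print(limiter.poll())                     # [1500]: released immediately, bucket starts full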
There's also a fully functional Mininet version, so you can download it and play with it on a single machine to see that it actually works for your applications. Any questions so far?

Question: is the 6 millisecond convergence time a magic number, completely independent of network size, link bandwidth, number of tenants, amount of work? Yes, that is true, and the reason the 6 milliseconds is independent of the number of traffic sources and of the capacities, which are what would ordinarily determine the convergence time, is the structure of the control loop: it doesn't matter whether your capacity is one, or a thousand, or ten thousand; it will converge to within 0.01% of what the capacity should be in the same number of iterations.

Question (two related ones): first, would there be a benefit in keeping the transmit side of the RCP* dialogue in the hypervisor, in the vswitch, but putting the receive side on the TOR's server-facing port? If that last hop were congested, the control packets could still make it through, and you might use the RCP signal for other purposes. And the related question: do you know of any switches on the market, of the newer generation, or software-based ones, that would let you do that switch-to-switch and still get the same implementation? Certainly the mechanism I described would perform much better if you implemented it in the switch, because the switch sees congestion immediately, as opposed to the end hosts trying to infer it. I don't know of any switches that can send this kind of rate feedback; there are related feedback mechanisms, like priority flow control, whose bits you could maybe hack around and play with. But once something is in hardware, you cannot change it for a number of years.

Question: I was testing the marketing claims of the newer switches, the ones that say "I'm so programmable, you're going to love it." How true is that for your purpose? Many of these programmable switches offer a great deal of programmability, but in the control plane. The problem is that congestion is a data-plane phenomenon, and the way switches today let you see what's happening in the data plane is ECN, which was a standardization effort that took a number of years and only gives you this coarse feedback. I don't think we have RCP running on switches yet; I don't know why, maybe we should just talk later and ask why. But we do have ECN support, which you can use to make convergence faster and improve performance.

Question: ignoring UDP for a second, and this is more of an implementation question, TCP already has a way to rate limit from the receiver side: the receive window. Technically it's not used this way very often, because it was built for a different purpose, so that the receiver doesn't get overwhelmed by packets. But if you were to essentially hijack the TCP receive window and use it, would you expect it to work similarly? Of course it wouldn't deal with UDP, but say the traffic were TCP only. You could hack around with it; in principle the receive window can be used to do a lot of things, but there are issues you'd have to work through. First, you have to find out how many TCP connections there are, and you need to modify the receive window for every single one of them. Second, you need to figure out what the window should be, and that's a function of your bandwidth-delay product and things like that.
And third, in a cloud environment there are other issues as well: even the smallest receive window you can advertise still admits some traffic on every connection, so without queuing at the sender you cannot control the aggregate rate precisely, and doing all of this per connection can get quite heavy.

Question: let's say you wanted to prioritize fairness and convergence time over scalability and robustness, so you decided to throw away your nice distributed solution and go centralized; would that be practical? How would a centralized solution look? Every millisecond, every host sends to one location a single packet that details everything needed for a central entity, one that knows the current topology, to compute the schedule for everyone, and then that schedule is pushed back down. You would have to work really hard to make that work, and the reason is this: the feedback packets EyeQ generates are produced every couple of hundred microseconds, and even then it takes six milliseconds to converge. If you were to operate only every millisecond, and assume you have enough servers to spread the computation out and then push the results back to the hypervisors, it's going to take at least an order of magnitude more time to converge. So yes, certainly you could do something more interesting with a centralized solution where you have a global view of your network, but the distributed approach is probably good enough.

Question: I was actually specifically asking whether you know the complexity of the centralized solution, because it would guarantee convergence within one iteration. Not really, no. A centralized solution doesn't mean you would converge within one iteration, and the reason is that you really don't know how much each sender has to send in the first place. If you had accurate knowledge of the number of senders that are active toward a particular receiver, which is in fact a very hard thing to obtain, and if all those flows had infinite demand, meaning they would keep sending for at least, say, a second, which is again hard to guarantee in a data center environment and often not the case, then what you say might work within a single iteration. But most often that's not the case. I can get back to that.

Do you have many slides left? I'm almost done. So I'm going to walk through one experiment, in fact one of the largest experiments that we did, to test this mechanism against a realistic workload involving memcached. We had about 16 servers in a rack, and each of these servers had a 10 gigabit NIC. Twelve of these were designated as client servers, and each of them had a single VM running a memcached client and a UDP client; I'll get back to that. Four of them are memcached servers, and notice that UDP receivers are co-located on each of these as well. For memcached we generated a simple workload: an external open-loop load generation tool constantly generates 144,000 set requests per second, these set requests are distributed uniformly across all the clients, and each client picks a server uniformly at random to send the set request to. A set request is about 6 kilobytes, so this works out to about 2.3 gigabits per second per server. UDP, on the other hand, executes a more adversarial workload: each UDP client picks a server at random, sends at the maximum rate it can for a period of 0.5 seconds, and then sleeps for 0.5 seconds. This really stresses how quickly EyeQ can provide its rate guarantees.
Each client basically does the same thing. Notice that if you look at timescales of a second, or even two seconds, the utilization of the UDP tenant is 5 gigabits per second. So if in this example you give equal capacities, 5 gigabits per second each, to the memcached tenant and the UDP tenant, the UDP tenant's utilization is well within its limits; it's just that at short timescales it is very bursty.

Under this workload we measured performance with and without EyeQ, for four cases. The first case is just a sanity check, establishing the baseline performance of the cluster without our mechanism in the data path: a fresh cluster running only memcached. The 99.9th percentile latency of the set requests was about 666 microseconds. If we insert EyeQ into the data path, there is of course going to be some overhead, because this is done in software: the median latency jumps up a little bit, but the 99.9th percentile latency actually comes down. The reason is that the control mechanism operating in the data path, RCP, is much faster than TCP, and it helps avoid those fine-timescale incast effects that can happen inside your switch, which improves the latency at the 99.9th percentile. In each of these cases the cluster is well provisioned to handle the external workload.

If you now throw in UDP, doing its bursty workload across the entire cluster, we see that the median latency jumps up by an order of magnitude. The median jumps because there are excessive queueing delays happening at the source and within the network, and this impairs the performance of the co-located memcached servers. The 99.9th percentile latency just shoots up, because there are a lot of timeouts, and TCP, as we know, takes a long time to recover from timeouts. The throughput didn't drop, surprisingly, because there was enough capacity and TCP eventually recovers; so the difference is really in latency. And finally, with EyeQ in the data path, sending these feedback messages, and with equal rate guarantees given to both the UDP and the memcached clients and servers, you see that the median latency is still somewhat higher than the baseline, but the 99.9th percentile latency comes very close to your bare-metal performance. You cannot expect this latency to be exactly equal to the bare-metal performance, because there is more load in the first place, so there will be slightly higher latency. But the point is that without having to modify any of my network switches, I'm still able to achieve latencies that are very close to bare metal, just by operating at the edge and giving these rate guarantees.

Question: what about UDP? Good question, I've never been asked that, and I'm not sure. What happens to UDP is that it gets its 5 gigabits per second during the 0.5 seconds when it is sending, and over the next 0.5 seconds it has nothing to send, so there EyeQ cannot do anything. If you look at the utilization of the UDP tenant averaged over a second, it will be 2.5 gigabits per second. So that's it.

What I described today is EyeQ. You can think of EyeQ as an edge-based flow scheduler; it's conceptually similar to CPU schedulers and the like. It tries to allocate rates to flows in accordance with the VMs' bandwidth guarantees, and it operates in a completely distributed fashion: you don't have to change your network switches to get these benefits, and you can deploy it today.
In fact, the source code is available online, and I hope some of you will try it out. That's it, thank you.

Comment from the audience: the presentation was so clear that it really led us right from the problem into the possible solution; that was really great. One other thing I think you could say is that you really presented this as the problem that the full bisection bandwidth network doesn't actually extend to the VMs; it only really goes to the hardware. So it seems to me that, as an additional strawman to complement the dynamic proposal: as long as you're willing to give up work conservation, you could have a static configuration that simply extends the full bisection bandwidth through to the VMs. Because the key component, as you mentioned, is that clients have to request their bandwidth anyway, and they only have to do this once. So as long as you don't care about a work-conserving workload, you could use your same mechanism in a static fashion and just extend the full bisection bandwidth to the VMs; the only reason you need the dynamic version is if you want to be able to take advantage of that extra capacity. That's a good way of thinking about it: how much extra capacity do I need to get rid of this problem? And as I showed, it is static in that sense: if you want to provision your network for the worst case, and you have n VMs, you need n times the capacity given to each VM in order to completely avoid the problem, and n can obviously be large.

Another way of looking at it is that the problem is this unfair incast, where people can game the system by adding VMs; you could instead give the client a fixed amount of capacity and say: look, you have to share that among all your VMs, and if you want more capacity you have to pay. Yes, but that would lead to severe underutilization; I'll get back to you with a more concrete example of that.

Question: does this assume there is no flow control in the network? Walk me through a case: for instance, if you monitor the buffers in a switch and they start filling up, the switch may send a pause signal back to the previous hop, which is a congestion prevention mechanism in a way. How badly would that interact with your algorithm? Without many details, I would say that whenever you have two control loops interacting, you have to pay attention to the timescales at which each of them operates, so without more details I couldn't say. But one of the advantages of having this operate at the edge is that you can run your network at near-minimum queuing: if you don't allocate the full 10 gigabits per second to all your VMs, and you operate at, say, 9 gigabits per second of peak utilization, then you rarely trigger back-pressure packets from the switches, because those only happen when queues start building up; and if the queues don't build up in the first place, then maybe it's okay.

Question: this is related to what you said at the beginning of the talk. Do rate guarantees always translate into response time guarantees, into deadline guarantees? In fact, this was a final exam question in CS244.
Let's say you're given a link of some capacity and you have requests of some size, with this partition-aggregate workload where the front end sends requests to its workers and expects all the responses back. Because the total amount of data that you want to transmit from all the workers to the front end is fixed, the total amount of bandwidth that you have is going to determine the completion time. (It also depends on the schedule: yes, true, and it depends on what you're trying to do, whether you want to bound the maximum completion time or minimize the average completion time under some schedule.) I'm talking about the total completion time for the external request. It doesn't matter when each of the individual flows completes, because you have, let's say, 1 megabyte in aggregate to transmit, which is a couple of milliseconds at 10 gigabits per second; if you have that capacity available within that window, then you can say: if I have a 1 megabyte RPC, it's going to complete within this time. So rate guarantees can translate directly into completion time guarantees for requests, because you know the total load.

And to get back to your earlier question about how this relates to deadlines, maybe for MapReduce jobs: there's been a good amount of work where people say, I have this MapReduce cluster, I want to run these jobs, and I need tunable knobs on the amount of bandwidth a job gets, the CPU it gets, and the number of workers it has, in order to meet a deadline for the job. EyeQ gives you those tunable knobs, which are fairly low level, and then it's up to the provider to use these knobs to provide other kinds of guarantees.

Question: this technique works well applied at the edge, which you've taken to be a physical machine; can it be applied in a switch as well, for example for TCP incast problems, like in the last experiment? So in that experiment we had a very bursty workload generated by memcached, and we saw that with EyeQ in the data path it actually reduces the 99.9th percentile latency a little bit, and the reason was exactly those fine-timescale incast effects. So yes, certainly this could help with other kinds of problems like incast, but it is still applied at the server. The follow-up was whether I had thought about applying it at different points in the network: no, I haven't explored that. If you have more information about what's happening within your network, there are certainly other interesting things you can do, but I haven't explored it; in principle it could be implemented elsewhere, it's just a matter of complexity.

Okay, I just have one last question: can you talk a little bit more about what happens when congestion causes the signaling packets themselves to be lost? That's a good question. As in the slide where I talked about what happens if there are failures within your network: when there are collisions, or even when a feedback packet gets dropped, the source doesn't need feedback instantaneously, so it's fine if it takes a little while. But if you never get rate feedback for, say, hundreds of milliseconds, then you start multiplicatively decreasing your rate, because you know something bad is happening; so you back off.
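To make the completion-time arithmetic from a couple of questions back concrete, here is the back-of-the-envelope, using the same illustrative 1 megabyte aggregate and 10 gigabits per second figures mentioned above:

    # Pure transmission time for a 1 MB aggregate response at a guaranteed 10 Gb/s.
    aggregate_bytes = 1_000_000
    rate_bps = 10e9
    completion_ms = aggregate_bytes * 8 / rate_bps * 1e3
    print(completion_ms, "ms")   # 0.8 ms of wire time; with RTTs and protocol
                                 # overhead this lands in the "couple of
                                 # milliseconds" range mentioned in the talk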
Question: so what is the overhead from the feedback packets that you send? For the sampling that we use, one feedback packet for every 10 kilobytes of data received, the worst-case overhead, with minimum-size feedback packets and the receiver running at the full line rate, is 64 megabits per second, irrespective of the number of VMs. Question: so in the clusters that you see at Microsoft, what overhead would this put into the network? Given the number of VMs, it's 64 megabits per second in the worst case, for every server. How much? 64 megabits per second for every server, regardless of the number of pairs of servers communicating; and in principle this also lets you bring congestion control to things like UDP flows, which are otherwise uncontrolled. Okay, thanks.
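For reference, here is my back-of-the-envelope for that 64 Mb/s figure, assuming one minimum-size (64-byte) feedback frame per 10 KB received and a receiver running at a full 10 Gb/s; the frame size and line rate are my assumptions, not stated explicitly in the talk.

    # Feedback overhead estimate: one 64-byte feedback frame per 10 KB received.
    line_rate_bps  = 10e9
    sample_bytes   = 10_000
    feedback_bytes = 64
    feedbacks_per_s = line_rate_bps / 8 / sample_bytes   # 125,000 per second
    overhead_bps    = feedbacks_per_s * feedback_bytes * 8
    print(overhead_bps / 1e6, "Mb/s")                    # 64.0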