Let's start, welcome to the next seminar. Today's speaker is Ali Ghodsi. Ali comes from KTH in Sweden. He's an assistant professor there. And also for the last three to four years he's been a visiting researcher at UC Berkeley. And today he's going to talk about multi-resource scheduling and some very interesting work that got him the first prize at last SIGCOMM. Thanks. All right, so just a little bit of background. So for the past three, four years I've been working in this area of multi-resource scheduling in general, not just in networks, and I found it to be a rich area in, you know, VM scheduling, cloud computing, and so on. So I think there are lots of interesting things one could do with multi-resource fairness. So I'm hoping that this can also be applied in other contexts. But in this talk we're focusing on the network setting. And so this is joint work with Vyas Sekar, who's now an assistant professor at Stony Brook, Matei Zaharia, who's a grad student graduating this year from UC Berkeley, and Ion Stoica. So the background of this talk is that we're seeing that packet processing is becoming more and more complex and sophisticated. So some trends that we see, and I don't need to talk about this here, I guess, at Stanford, but SDN is used for, you know, access control, VPNs. We see a profusion of middleboxes that enterprises use to serve, you know, their own customers and so on. But we also see middleboxes used on external customers' data. There were two papers at SIGCOMM, one in which they were showing that ISPs are using middleboxes to filter customers' data, and another one that showed that cellular providers were using them for mobile data that's coming in. So these middleboxes are all over the place. We see the rise of software routers, like RouteBricks, and also more usage of specialized hardware, like hardware accelerators. We saw the SSLShader project from NSDI.
So all these trends have led to the data plane no longer just doing forwarding; it's doing a lot of other complicated things, such as fingerprinting for WAN optimization, or HTTP caching, or intrusion detection, and so on. So, given this, the motivation of this talk is that we think that network flows increasingly have heterogeneous resource consumption. So they're not just using one resource anymore, but they're using different resources. And I have three examples of that from the literature. The first example is from Vern Paxson, the Bro system, where they observed that when you're doing these fingerprint computations for intrusion detection, you can easily bottleneck on CPU. So CPU becomes a scarce resource, because there are a lot of complex computations you want to do. The other case is from RouteBricks, where they observed that when you have really small packets, you can actually bottleneck on the memory bandwidth of the system. And finally, we have the classical case that we focused on in networking for the past decade, which is that if you have packets and you're just doing mere forwarding, you can easily bottleneck on the link bandwidth. So these are our three cases. And in the paper we had, you can see we ran some traces, and we actually confirmed that this is the case: you can indeed bottleneck on different resources if you look at different middlebox functions. This is sort of a sign that there are architectural errors in the system. I mean, there's a reason we don't push all packets through main memory in a network switch, right? Because it's a bad idea, and it precisely overstresses the memory bus. And if you misdesign your system in general, scheduling's not going to help with a bad design. Sure, that's true. But, you know, some of the functions that you want to do on these middleboxes are really complex.
So you're going to end up bottlenecking on something if you have a balanced system. And they're trying to design these systems to be more and more balanced. You know, you could overprovision everything else but the bandwidth, but then you're wasting resources. And we've seen this in other contexts, too. In the cloud computing context we see it now, too: you can bottleneck on different resources. So, given that, we think that scheduling based on just a single resource is insufficient. And the problem we want to tackle here is how to schedule packets from different flows when their resource consumption is heterogeneous, so they're consuming different amounts of different resources. Another way to see it is: how can we generalize fair queuing to handle multiple resources? So that's the problem that the whole talk is about. And there's a lot of work on fair queuing, as you know, in the literature. So this next slide basically tries to put this work in the context of the related work. We're building on lots of different things. The way one can see it is that in the case of single-resource fairness, max-min fairness was suggested, you know, 50 years ago or so by the American philosopher John Rawls. And, you know, he was saying that in an economy, we should help the person who's worst off. Economists call this max-min fairness the lexicographic minimum. But it took some time, until 1990 or 1992, before in networking we generalized this to fair queuing so that we could do time multiplexing. So what you can see is that max-min fairness just gives you the allocation in space, the static allocation, what it should look like. But how to do it in a dynamic system, where packets are coming in and we need to schedule them and multiplex them, that happened in 1990 or 1992. And that was fair queuing. And the main concept that they introduced was this notion of virtual time that we'll come back to.
So, that was fair queuing in 1990. So, could you concisely state what the criterion for max-min fairness is? I could. So, in the single-resource case, it's simply: find the allocation. The way I would do it is I would say you sort a vector; you get a vector of the allocations of all the flows in the system. I'm not asking for the algorithm, I'm asking about the criterion, because you've got a really interesting analogy that said it was sort of like Pareto efficiency. And I wonder if you could give a clear definition of what the max-min fairness criterion actually is. Yeah, yeah, there are multiple of them. But one of them is... What's the one you're using for the purposes of this? Yeah, yeah. The one we're using is the one in which you sort the allocation vectors and pick the allocation whose sorted vector is lexicographically greatest. It gets a little bit technical. There are multiple different definitions. What does that mean? I could go through it, but I'll come back to it maybe in the second part of the talk where I focus more on this. Because the definitions are kind of technical, and it gets even slightly more technical in the multi-resource case. But I'll come back to it. But intuitively you can think of it as: you want to improve the allocation of the flow that is worst off. You want to maximize the one that's worst off. Once you've done that, you recursively want to improve the allocation of the next flow that is worst off. You just do this recursively. One way to characterize it is to say that when you have a max-min allocation, it is impossible to improve any flow's bandwidth without hurting some other flow that is already worse off. See, that's a better criterion. That's like Pareto optimality, which says that we can't reallocate resources to make some individual better off without making somebody else worse off. That's a good observation. So that's what they did there.
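To make the single-resource criterion concrete, here is a minimal sketch of max-min fairness as progressive filling (water-filling). The function name and structure are illustrative, not from the talk: remaining capacity is repeatedly split equally among unsatisfied flows, and satisfied flows drop out.

```python
# Sketch: single-resource max-min fairness via progressive filling.
# `demands` maps each flow to its desired rate; `capacity` is the shared link.
def max_min_allocate(demands, capacity):
    alloc = {f: 0.0 for f in demands}
    unsat = set(demands)                    # flows not yet at their demand
    while unsat and capacity > 1e-12:
        share = capacity / len(unsat)       # split remaining capacity equally
        for f in list(unsat):
            take = min(share, demands[f] - alloc[f])
            alloc[f] += take
            capacity -= take
            if alloc[f] >= demands[f] - 1e-12:
                unsat.remove(f)             # satisfied flows drop out
    return alloc

print(max_min_allocate({'a': 1, 'b': 4, 'c': 5}, 9))
```

With demands 1, 4, and 5 on a capacity of 9, the small flow is fully satisfied and the remaining capacity is split evenly, giving 1, 4, and 4: the worst-off flow cannot be improved without hurting one that is already worse off.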
More recently, we started looking at the multi-resource problem, and others have looked at it too. The one we're building on is this thing we introduced called DRF, which is dominant resource fairness. What it does is basically generalize max-min fairness to the multiple-resource case. But again, it's just static, in space: if you have lots of machines, what should the allocation look like? What percentage should we give to the different users when you have multiple resources and the users have different demands for them? So what we're doing in this talk is this box here, which is that we want to generalize these two. We want to generalize DRF so that we can do it in time, in a dynamic system, so that we can multiplex packets to achieve these DRF allocations over time. We also want to generalize fair queuing so that we can do fair queuing over multiple resources. So another way to see it is that we're basically taking this notion of virtual time that fair queuing introduced and generalizing it to multiple resources. So we want to have virtual time for multiple resources. So that's what we did. So the other thing I'm going to talk about is this box. I'm going to start by going through some of the natural policies that we investigated initially. We spent a lot of time actually looking around for different ways to do this, but they turned out not to satisfy some crucial properties that were identified in the past literature. After that I'm going to go through DRF; this is related work that we're building on. And thereafter I'll introduce this DRF queueing. And finally I'll talk a little bit about the implementation. So in the past literature, two properties have been identified as important in the multi-resource setting. The first one is called the share guarantee. And the share guarantee simply says, in this network setting, that each flow should get at least 1/n of at least one of the resources. So that's the share guarantee.
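As a rough sketch of what that property means operationally (the helper below is hypothetical, not from the talk): given each flow's fraction of each resource, every one of the n flows must hold at least a 1/n fraction of some resource.

```python
# Hypothetical helper: check the multi-resource share guarantee.
# alloc[flow][resource] is the fraction of that resource the flow received.
def satisfies_share_guarantee(alloc, n):
    # Each flow must get at least a 1/n fraction of at least one resource.
    return all(max(shares.values()) >= 1.0 / n - 1e-12
               for shares in alloc.values())

ok  = {'f1': {'cpu': 0.5, 'nic': 0.2}, 'f2': {'cpu': 0.5, 'nic': 0.8}}
bad = {'f1': {'cpu': 2/3, 'nic': 2/3}, 'f2': {'cpu': 1/3, 'nic': 1/3}}
print(satisfies_share_guarantee(ok, 2), satisfies_share_guarantee(bad, 2))
```

The second allocation is exactly the failure mode the talk shows shortly: with two flows, one of them ends up with only a third of every resource, below its 1/2 guarantee.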
And it's a straightforward generalization of the share guarantee that exists for a single resource, which is that you should be able to get 1/n of the single resource that you're sharing. We're just saying that you should get at least 1/n of at least one of the resources that you're sharing. And it's an isolation property. So it's a very important isolation property. It says that no matter what the other flows do, I will always be able to get at least this much. This is my minimum guarantee. Does it matter what the resource is? Yeah, so this gets kind of hairy. But what we will actually do, and it's something quite strong, is that we're going to give you 1/n of the resource you want the most of. That's how we do it. The second property is strategy-proofness, which is that a flow shouldn't be able to finish faster by artificially using extra resources that it does not need. So that's strategy-proofness. And in some sense, weighted fair queuing from the past literature is strategy-proof in this sense, because what it's saying is: I don't care how you packetize, how big your packets are, you're not going to be able to cheat the system by using bigger or smaller packets. So strategy-proofness is the same idea here, but we want it across resources, not just packet sizes. So let's look at a simple example of the most natural thing you would do, which is what many routers or middleboxes would do. We're applying just basic fair queuing, and we assume that we only have two resources. So we have CPU and the network, and they're being used serially. And by that I mean that when a packet comes in, some module starts processing it, and it uses CPU resources. Once it's done doing this, it sends it to the link, and then it's using link bandwidth. And in particular, we have two flows in this example, and the two flows have the following resource consumption.
So the first flow, every packet that it has first consumes two microseconds on the first resource, and then one microsecond on the second resource, which is the link. And the second flow has packets that always use exactly one microsecond on the first resource and one microsecond on the second resource. And assume that we just want to apply fair queuing, and we're only applying it to the NIC resource. So if we're only applying it to the NIC resource and ignoring the CPU, then what it's going to do is it's going to say: okay, these guys are actually using exactly the same amount of this resource; every packet they have uses the same amount. So we're just going to alternate, sending one packet from each of the flows. So let's see what would happen if you actually started doing that. The x-axis here shows time, and we have the two resources, CPU and NIC, and we have the two flows, the dark blue one and the white one, flow one and flow two. So first it would start by scheduling a packet from the first flow, and it would take two microseconds. After it's done that, it alternates to the other flow and sends a packet from that flow. Now in parallel with this, since the first resource is done with the first packet, the second resource can now start processing the first packet of the first flow. After that, again, it alternates. So since it's trying to alternate, it's now going to pick one packet from the first flow again, and that one again uses two microseconds. And in parallel again, the second-resource consumption of the first packet of the second flow goes on, and this just repeats itself. So it's just alternating like this, and we're going to get this pattern. And if we ignore this first time slot here, which is just a warm-up period, we're seeing that this pattern is repeating itself over and over and over again. So what's happening is that since we're using more of the first resource, there's more aggregate demand for the first resource.
And it's completely used, 100%, whereas the second resource is slightly underused. So if you look at the allocation you're getting here, you're seeing that the CPU is 100% used and the NIC is used 66% of the time. But in particular, if you focus on the second flow, it's only getting 33% of these two resources. Because we're ignoring how much it's getting on the first resource and just applying fair queuing to the second resource, it actually got less than its 1/n fair share, less than half, on both resources. So this actually violates the basic share-guarantee property. So we can't just use the... You decided to give two-thirds of the CPU to one flow and one-third to the other, and that's what you got. It seems to me that you scheduled according to one criterion, so it doesn't make sense to judge it by another. Yeah, so we ignored the first resource and hence it's being unfair; the first flow is using a lot of the first resource, and that's the bottleneck resource. So this is just a straw man saying that this wouldn't work very well if you're just doing fair queuing applied to one resource. Something smarter that's been suggested in the literature is called bottleneck fairness. They're saying that instead of doing it the way I just explained, you periodically determine what resource is currently the bottleneck; in the example I gave on the previous slide, it was the CPU. And then you just apply fair queuing to that resource. So let's see what happens with that. Here we have another example, with this bottleneck fairness. We have two resources, CPU and NIC, like the previous example, but now we have three flows, and their demands are (10, 1), (10, 14), and (10, 14) microseconds on the two resources respectively.
So if you look at these three flows, if we just take one packet from each and look at the aggregate demand for the first resource, we see that there's a demand of 30 for the first resource, but the demand for the second resource, 1 plus 14 plus 14, is 29. So we're going to bottleneck on the CPU, right? So the CPU is a clear bottleneck if we take one packet from each of these. We couldn't possibly split the second resource equally, because the first flow only wants one microsecond there, so we would completely bottleneck on the first resource. So this is what we get. Since the first resource is the bottleneck, we split it equally with fair queuing; then each of these three flows gets 33% of that first resource, and the second resource is split almost 50-50 between the two flows that use a lot of the second resource. The first flow is using a tiny bit at the bottom, and there's a little bit of slack in the middle, because the second resource is not the bottleneck. Okay? Now, see what happens if the first flow artificially increases its demand on the second resource by some means, so it's using seven instead of one. Then, if you do the same thing and add these up, you'll notice that 7 plus 14 plus 14 is 35, so the second resource is going to be the bottleneck. We're now going to bottleneck on the NIC instead. So now we would apply fair queuing to the second resource, split that evenly across the three flows, and there would be a little bit of slack on the first resource. What we can see now is that this flow, by increasing its demand to (10, 7), is actually getting more of both of these resources. Okay? So it's been able to game the system, basically. So this is not strategy-proof. Doing bottleneck fairness like this is not going to be strategy-proof, because you can actually benefit by wasting resources.
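The bottleneck-shifting trick above can be sketched in a few lines, using the talk's per-packet demands. The code is illustrative and assumes bottleneck fairness picks the resource with the larger aggregate demand:

```python
# Sketch: which resource does bottleneck fairness pick?
# Each flow's per-packet demand is (CPU microseconds, NIC microseconds).
def bottleneck(demands):
    cpu = sum(d[0] for d in demands)   # aggregate CPU demand per round
    nic = sum(d[1] for d in demands)   # aggregate NIC demand per round
    return 'CPU' if cpu >= nic else 'NIC'

honest   = [(10, 1), (10, 14), (10, 14)]   # 30 vs 29 -> CPU is the bottleneck
inflated = [(10, 7), (10, 14), (10, 14)]   # 30 vs 35 -> NIC is the bottleneck
print(bottleneck(honest), bottleneck(inflated))
```

By padding its NIC demand from 1 to 7, the first flow moves fair queuing onto the NIC, where it now receives a full third instead of a sliver, and so it gains on both resources.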
You can get more of all these resources. Okay? So bottleneck fairness violates strategy-proofness. Okay? So after this, we turned our attention to something that we actually wanted to try. Those first two were straw men, but this is what we actually wanted to do, and we spent a lot of time exploring this policy, which we thought was a good policy. What it does is that you have a buffer between each of the resources you're using. Okay? So after a packet uses a resource, you put it in the next buffer, and so on. And we apply fair queuing to each buffer independently. So we're just doing fair queuing independently on each of these buffers, and then we see what we get, whether it's good or bad. So it turns out that this per-resource fairness that we tried is also not strategy-proof. And the example is as follows. You have two resources, and two flows with demands (4, 1) and (1, 2). I'm not going to go through exactly how the interleavings look, but just trust me that if you run this, doing fair queuing on both of these queues, you're going to get an allocation that looks like this. So flow number one is getting 57% of the first resource and 14% of the second, and the rest goes to the other flow. Now, suppose the flow with (4, 1) artificially increases its demand to (4, 2). So in some sense, we now have symmetric demands: one flow wants (1, 2), the other wants (4, 2); if you ignore scaling, they're mirror images of each other. If you run the system with the first flow asking for (4, 2) instead, you're going to get the following symmetric allocation. Now each of them gets 66% of the resource it wants most of, and 33% of the other resource. So what we see happened here is that, again, by increasing its demand, the flow has been able to get more of both resources. It went from 57%-14% to 66%-33%.
So per-resource fairness also violates strategy-proofness. Okay. You might think that's not so bad. I mean, we continued playing with this PRF, but one of the problems is that, you know, if flows waste resources, they can actually benefit at the expense of the system. But more importantly, this PRF actually requires that you have a queue between each of the resources, and that was the main reason we abandoned it, to be honest. Because oftentimes many of these modules consume resources in parallel: you're using CPU and the network in parallel, not sequentially. So you can't force there to be buffers between each of them, in which case this method doesn't work. So this was the other reason why we abandoned it. Okay. So just a couple of words about strategy-proofness, because this usually comes up when I talk about it. People ask, you know, why should we really care about this? Does it really matter? And we think yes, because if you don't do this, you're actually encouraging wastage of resources. And wastage of resources essentially means that you're going to get lower throughput in the system. Okay. Then some people say, well, will anyone ever really game the system? Do people go through the trouble of actually gaming the system? And we think that, especially in the network setting, this is quite common. We've seen, you know, peer-to-peer applications that do everything to manipulate the network to get more resources. There are applications like BitTyrant and so on. And it would be pretty easy to just probe different packet sizes dynamically in an application to figure out whether you can get more bandwidth or not. So this is the second reason.
And finally, I think one of the reasons we get these questions in general is that single-resource fairness is inherently strategy-proof in the sense that we're talking about. You know, if you ask for more, you're not going to get more; it's applying max-min fairness, and you're already getting your fair share. So it's in the multi-resource setting that this issue appears. That's why it becomes relevant. So let's skip this, and let's look at the policy that we actually want to implement. We're building on this thing called DRF, and I'll give you a two-slide quick summary of how DRF works. So DRF was originally proposed in the cloud computing setting: we have machines, and we want to schedule tasks from different jobs on these machines. And it actually satisfies both of these properties, strategy-proofness and the share guarantee. And the way it does it is with two definitions. The first one is the notion of a dominant resource. The dominant resource of a user that wants to schedule jobs or tasks in a cluster is simply the resource that she's allocated most of. That's her dominant resource. And to go with that definition, there is the notion of a dominant share. The dominant share is simply the percentage of your dominant resource that you got, okay? So let's look at a simple example. Actually, okay, before that: what DRF does is it takes these dominant shares and it applies max-min fairness to them. So in some sense what it's trying to do is equalize the dominant shares of all the flows or users in the system. Okay? Applying max-min fairness to them, right. So here's a simple example. Let's say we had a cluster with 16 CPUs and 16 gigabytes of memory, and we had two users. The first user wants to run lots of tasks, each requiring three CPUs and one gigabyte of memory. The other user wants one CPU and four gigabytes of memory per task.
So if you actually look, as soon as we start allocating these, the first user wants much more CPU. Each of its tasks wants three sixteenths of the CPUs but only one sixteenth of the available memory. So its dominant resource is CPU. The second user is the opposite: it actually wants more memory, so its dominant resource is memory. These two users now have different dominant resources. And what DRF does is find the allocation that equalizes their dominant shares. So it would give 12 CPUs and four gigabytes of memory to the first user, respecting this three-to-one ratio, and it would give three CPUs and 12 gigabytes of memory to the second user. So what we see here is that it's equalized the amount that both of these users got of their dominant resource. So that's DRF in a nutshell. Any questions on this? Yeah, what happens with the idle memory? So yeah, this gets into how you do this scheduling in the cluster. Here, we're typically using containers. Jobs, or tasks, tell you in advance how much they want of the different resources, and then you isolate them based on that. But you could imagine a system where you would actually let idle resources be used. So, yeah, that's a good point. So, okay, so we want to apply this in our setting. DRF is doing this in space, a static allocation in space. But what we want to do now is do this in time, in the time domain. In particular, we want to multiplex packets to achieve these DRF allocations over time for different flows, let's say in a middlebox. Okay? So let's turn to how we can do this. Okay, so DRFQ, DRF queueing for the networking setting. The first thing you bump into: there are a couple of challenges that are new in this multi-resource setting that we didn't have with fair queueing before.
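The 16-CPU / 16-GB example above can be reproduced with a small progressive-filling sketch: repeatedly give one task to the user with the lowest dominant share until nothing more fits. This is an illustrative reconstruction, not the talk's code.

```python
# Sketch: DRF as progressive filling over two resources.
def drf(capacity, demands):
    n = len(demands)
    alloc = [[0.0, 0.0] for _ in range(n)]   # per-user allocation
    used = [0.0, 0.0]                        # cluster-wide usage
    dom = [0.0] * n                          # dominant shares
    active = set(range(n))
    while active:
        u = min(active, key=lambda i: dom[i])       # user furthest behind
        d = demands[u]
        if any(used[r] + d[r] > capacity[r] for r in (0, 1)):
            active.discard(u)                       # next task doesn't fit
            continue
        for r in (0, 1):
            alloc[u][r] += d[r]
            used[r] += d[r]
        dom[u] = max(alloc[u][r] / capacity[r] for r in (0, 1))
    return alloc

# User A: <3 CPU, 1 GB> per task; user B: <1 CPU, 4 GB> per task.
print(drf([16, 16], [(3, 1), (1, 4)]))   # [[12.0, 4.0], [3.0, 12.0]]
```

The result matches the slide: 12 CPUs plus 4 GB for the CPU-heavy user and 3 CPUs plus 12 GB for the memory-heavy user, both at a dominant share of 12/16.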
And I'm going to mention them. The first thing that you bump into is that when you were doing fair queueing in the past, you could always determine, a priori, the packet's link usage. So you would know how much bandwidth the packet would use as soon as it comes in. And the way you do it: you know the packet size, and you can just divide it by the throughput of the link. So you know how much it's going to use. In the multi-resource setting, we can't do that. A priori, it's unknown how much a packet will consume of the different resources. It depends on many things, but the simplest one is that we don't know which modules will process this packet when it comes in. There are different modules in the system, and depending on which ones the packet goes through, the resource consumption will be different. Okay, so this is the first challenge that you have to deal with. And for this reason, we decided to leverage start-time fair queueing, SFQ, and I'll describe briefly what it is. The reason we adopted start-time fair queueing is that it schedules packets based on their virtual start time. Okay, so it has something called a virtual start time. And the interesting thing is that the virtual start time of a packet is completely independent of the resource consumption of that particular packet. So you don't need to know in advance how many resources it's going to consume. So that's the reason we leveraged SFQ. Okay, so we want to use SFQ. But there are two basic requirements that we need to satisfy in this multi-resource setting. One of them is old and one of them is new. The old one, which I'll cover first, is the memoryless requirement. This is a lesson learned from the virtual clock system by Lixia Zhang. And the way virtual clock worked, this is the 1990s, is simply that it tries to simulate that each of the flows gets a dedicated 1/n link. That's what it's doing.
And the way it does it, actually many of the concepts we've seen in fair queueing appeared already in that paper, is that it attaches these start and finish tags to every packet. And it does it according to this simulated, dedicated 1/n link that it assumes everyone has. But then what it actually does when it's serving packets is serve the packet with the smallest finish tag. So it's using the full bandwidth, each time serving the packet with the smallest finish tag. So in some sense what virtual clock is doing is trying to be work-conserving, which means using all the bandwidth, while sort of simulating a system that is reserving resources. You know, a virtual circuit, or TDM, where you have reserved, dedicated channels. So it's trying to marry these two concepts of work conservation and reservation. The problem with virtual clock, which has been known for a long while, is that if the system has light load, so there are very few active flows at some time, you're going much faster than your dedicated 1/n simulated link. You're going way faster. So if new flows show up and start becoming active, what might happen to the flow that's been going on for a long while is that it gets punished. It experiences long delay or, you know, jitter. And you can see why it happens: if I have lots of backlogged packets, they get tags according to this dedicated link, but I'm going way faster. So at time 100, I might already be serving packets that have tags around 200, because I'm going way faster, right? So my next packets have tags 201, 202, and so on. A new flow shows up and becomes active, and it immediately gets tagged at the current time: at time 100 it gets tags 100, 102. So it's going to have the smallest tags for a long while, until it catches up with the long-active flow. So you can actually get arbitrarily delayed in virtual clock. There is no limit on how much you can get punished. So that's bad.
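The punishment effect is easy to reproduce in a toy simulation. This sketch assumes a link serving one packet per time unit and two always-backlogged flows with reserved rate 1/2 each (so finish tags advance by 2 per packet); it is illustrative, not the talk's exact setup:

```python
# Toy virtual clock: serve the smallest finish tag; tags advance at the
# flow's reserved 1/n rate (here 1/2, i.e. +2 per packet served).
def virtual_clock(arrivals, horizon):
    # arrivals[f] = time flow f becomes backlogged (it stays backlogged).
    tag = {}                      # next finish tag per flow
    served = {f: 0 for f in arrivals}
    for t in range(horizon):
        ready = [f for f, a in arrivals.items() if a <= t]
        for f in ready:
            if f not in tag:
                tag[f] = t + 2    # first tag starts from arrival time
        f = min(ready, key=lambda f: tag[f])
        served[f] += 1
        tag[f] += 2
    return served

# Flow 'a' runs alone for 100 units; then 'b' arrives with fresh, small tags
# and monopolizes the link while a's inflated tags catch up.
print(virtual_clock({'a': 0, 'b': 100}, 150))   # {'a': 100, 'b': 50}
```

After running alone for 100 units, flow a's next tag is around 200, so the newcomer gets every slot from time 100 to 150: exactly the unbounded punishment the talk describes.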
Well, it's not that bad. I mean, Linux CFS works exactly the same way, and the only real problem is, I mean, this is a problem, and it also allows you to game the system, but in practice it's not actually a problem, and that's why it works as the standard scheduler in Linux. Yeah, but in the networking setting, you know, if you're doing video streaming, and you've been doing TCP for 10 minutes and you've had the whole link, then for the next 10 minutes another flow will get precedence over you. So in this very time-sensitive setting, where jitter is really important, I think it can matter. So, I mean, what you're saying is that latency is actually a resource which should be taken into account. Yeah, so we want a system that gives some guarantee that you don't get penalized if you don't use more than your fair share. I mean, typically interactive apps on Linux sort of work that way. Yeah, that would be one way around it. Or have you played with shortening the window over which you're tracking that virtual time? It's sort of a bit like the global warming debate today. Some countries have done 200 years of polluting. Do you take that into account? These would be the other countries. And here, in your analogy of 100 versus 200, you'd only be tracking a trailing 10 or 20. Yeah, so this is very good. You can bound it that way. You could just say, we're only going to look at this window, and after that we're just going to reset. And we're actually going to use a similar idea later. So yeah, this is a good idea. But we want to go to the extreme here, and this is what a lot of the fair queueing papers do: we want it to be completely memoryless. We don't want to look at the past and say, you used this much, so you get less now. A flow's share of resources should be completely independent of how much it used in the past. Why do you think that's a good thing?
I sort of believe, in the global warming analogy, that there should be a penalty for all the pollution in the past. Well, in this setting, in some sense, we had extra resources, and these resources can't be saved up; you know, the power we spent right now is never coming back. That's true. You just have like 15 minutes left, so maybe defer questions to the end. Yeah. Okay. Yeah, I should probably speed up and skip some stuff. So, okay. So this is the first requirement we had, memoryless scheduling. The second one is more interesting. Actually, wait, let me quickly go through how they fixed this, how they achieved memorylessness. And I'm going to go quickly over this. Are you all familiar with virtual time? I don't know. Okay, I'll do a quick intro to virtual time. The way they did it is that they said: the system is not giving the same service all the time; one unit of time means different things depending on how many flows are active. Okay. So the main insight was: let's figure out how much service is being given for each unit of time. And they introduced a notion of virtual time, which doesn't progress at the same rate as real time. Instead, it progresses such that each unit of virtual time always corresponds to the same amount of service for all the flows. Okay. So with real time on the x-axis and virtual time on the y-axis, in a simple example with two flows: if at time one we only have one backlogged flow, in a system designed for two flows, that flow is actually receiving twice the service it would get if it only had its 1/n, its half of the resource. As soon as another flow becomes backlogged, the slope changes to one. Okay. So we're going to keep track of this with virtual time. Virtual time essentially lets us know where you would be in this dedicated system.
You know, it says that at time 20, you're actually where you would be at time 40 if you had your own dedicated half of the system. Okay. And with these virtual times, the fair queuing systems schedule packets accordingly. So when a new packet came in, they would set its start time and finish time so that it would match the actual service the flow would receive. So that eliminates this memory problem that we had. Okay. The second requirement, which is new in this setting, is what we call the dovetailing requirement. And this is a new problem that you will have in the multi-resource setting even if you're not doing DRF — however you want to solve this, it's a new requirement. The easiest way to understand it is that fair queuing said that flow size should determine service, not packet size. In some sense, all that fair queuing was doing over all these decades was trying to make sure that, regardless of what packet size you use, you shouldn't be able to get a different amount of service. In particular, ten one-kilobyte packets should get the same service as five two-kilobyte packets if the flows are backlogged and just sending. In the multi-resource setting we want the same thing: we want to use flow processing time rather than packet processing time. And the easiest way to understand it is to look at these two flows. One of them is alternating its resource consumption: one-two, two-one, one-two, two-one. The other one is just using three-three, three-three. We would like two packets from the first flow to receive the same service as one packet from the second flow. It shouldn't matter how you discretize the consumption over the packets. So this is the dovetailing requirement, and it says packet processing time should be independent of how resource consumption is distributed across packets. So these are the two properties we would like to satisfy. One is the old one.
The other is the new one. Unfortunately, it turns out that these two are directly at odds with each other, so you can't fully satisfy both of them 100%. Because dovetailing — this one-two, two-one thing — requires remembering what was going on in the past so that you can dovetail, but memorylessness says you shouldn't remember anything from the past. So there's a trade-off here. And the way we solve this is that we develop DRFQ in three steps. First, we develop a version called memoryless DRFQ, which is completely memoryless but doesn't do dovetailing. Then we have a version called dovetailing DRFQ, which actually does dovetailing but is no longer memoryless. And then we have a generalization of the two, which lets you trade off how much memorylessness and dovetailing you want. So that's how we do it, and it's also easier to understand when you go through it in this order. So let's start with memoryless DRFQ. Remember, what we want to do is equalize the dominant shares of the different flows — the share of the resource each flow uses the most is what we want to equalize. So what we do, just like all the related work in fair queuing, is attach a virtual start and a virtual finish time to every packet, and I'm going to tell you how we compute those. They're called S(p) and F(p) for packet p: start and finish. I'll start by telling you how you compute the finish time, because it's easier. The finish time of packet p is simply the start time of that very packet, plus the processing time that packet needs on the resource it uses the most. Remember, this is trying to emulate DRF. So proctime(p, i) is simply the processing time of that packet on resource i, and we take the maximum over i. That's how we compute the finish time of a packet. How do we compute the start time? The start time of a packet is simply the max of two things.
It's the maximum of the finish time of the previous packet of the flow that is currently buffered — so if there are any packets buffered from this flow right now, the first term is the finish time of the previous packet; if they're backlogged back to back, the start time of the next one will be the finish time of the previous one — and of C(t), where C(t) is the maximum start time of any packet that we're currently servicing. Not buffered, but actually being serviced, using some resource. And if there are multiple of those, we pick the maximum start time among them. So you have these two cases, and in the common case the first term is the max: the start time will simply be the finish time of the previous packet of that flow. It's very simple. And if this is the first packet of the flow that we're receiving, we just set its start time to the start time of the packet currently being serviced in the system. And if no packet is being serviced right now, we just set it to zero. And then we always service the packet with the minimum start time. So let's look at a simple example of how this works. These were the two rules for how to compute the tags. Say we have two flows and they both become backlogged at time zero. One is alternating this one-two, two-one pattern; the other one is using three-three. So what's going to happen is that when we get the first flow's first packet — well, it's the first one, so we just set the start time to zero. And then we look at the maximum resource usage, which is two, so the finish time is simply two. That's this. The other flow's first packet: again start time zero, but the maximum it uses across the resources is actually three, so its finish time is three.
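These two tagging rules can be sketched concretely (assumed names; simplified to the case from the example where both flows are backlogged from time zero, so the C(t) term never dominates):

```python
# Memoryless DRFQ tagging (sketch). Each packet is a tuple of per-resource
# processing times. F(p) = S(p) + max_i proctime(p, i); for a continuously
# backlogged flow, S(p) = F(previous packet).

def tag_flow_memoryless(packets):
    tags, prev_finish = [], 0
    for p in packets:
        s = prev_finish            # C(t) omitted: flow backlogged from t=0
        f = s + max(p)             # charge the dominant resource
        tags.append((s, f))
        prev_finish = f
    return tags

# The talk's example: flow A alternates <1,2>,<2,1>; flow B always uses <3,3>.
flow_a = [(1, 2), (2, 1), (1, 2), (2, 1)]
flow_b = [(3, 3), (3, 3)]
print(tag_flow_memoryless(flow_a))  # [(0, 2), (2, 4), (4, 6), (6, 8)]
print(tag_flow_memoryless(flow_b))  # [(0, 3), (3, 6)]
# Flow A's tags advance by 2 per packet, flow B's by 3 — so a pair of A's
# packets costs 4 virtual units versus 3 for one B packet: the alternating
# flow is punished, i.e. no dovetailing.
```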
Now the next packets coming from these flows — because they're backlogged — simply get the same start time as the finish time of the previous packet. So for the next packet here, the start time is simply the finish time of this one. That's what we get here. And the maximum is always going to be two, regardless of which of the two resources dominates. So you can see it's just incrementing: start time equals finish time of the previous packet, and you increment by two every time. Similarly here, but you're incrementing by three, because the maximum is three. Now the problem is — if we ignore these first packets at time zero — we said that two of these packets should receive the same service as one of those. But as you can see, the virtual time of this flow is now much higher than that one's. So they're not going to get the same service, because we always service the packet with the smallest start time. So this flow got punished. Dovetailing is not working here between the one-two and the two-one, because we're always charging the flow its maximum. So that's memoryless DRFQ. Dovetailing DRFQ tries to fix this, and the way it does that is by keeping track of the start and finish time for every resource separately. We now keep track of how much each packet is using each of the different resources, and when we're actually scheduling, we just use the maximum start time across the resources for that packet. So here's an example. Here's a packet. Now we have the start time for the first resource, the finish time for that resource, the start time for the second resource, the finish time for the second resource. But when we're scheduling, we always use the max of these. So here's exactly the same example again, but now we track the two resources separately.
So the start time is zero, but now the finish time is actually going to be one on this resource and two on the other, since the packet uses one of the first resource and two of the second. So the start and finish times differ per resource. With the next packet, we see that dovetailing is happening: now it's using two, so we add two to the finish time of the previous packet on one resource and one on the other. The red here is the maximum for each packet. And the other flow looks the same as before, because it's using the same amount of both resources. So if you now compare these two, what you can see is that these four packets get the same service as these two packets — the start time here is the same as the start time here. Is that clear, or did that go very fast? You get the flavor: basically we now have a virtual time for each resource instead of just one virtual time. So what, then, is the bounded version? It's simply that we bound the amount of dovetailing to Δ processing-time units. So we dovetail up to Δ processing-time units; beyond that Δ, we're just memoryless. So that goes back to what you said earlier — we use a window, sort of. So basically now we have Δ-bounded DRFQ, a generalization of the two extremes I showed before. If you set Δ to zero, you're always memoryless. If you set it to infinity, you're completely dovetailing. What we actually do is usually set it to a few packets' worth of processing time. It's a window parameter, and there might be smarter ways of picking it. But the reason we set it to a few packets is that that's the amount of time over which you have concurrency in the system, so the buffers actually allow you to achieve dovetailing.
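The per-resource tagging just described, with the Δ bound folded in, might look like this (the names and the exact clamping rule are my simplification of the idea, not the paper's precise formulas): each resource keeps its own finish tag, the scheduling key is the maximum start tag, and no resource's finish tag may lag the largest one by more than Δ.

```python
# Dovetailing DRFQ with a Δ bound (illustrative sketch). delta=inf gives
# full dovetailing; delta=0 collapses to the memoryless scheme, since every
# per-resource finish tag is pulled up to the dominant one.

def tag_flow(packets, delta=float("inf")):
    num_res = len(packets[0])
    prev_finish = [0] * num_res
    tags = []
    for p in packets:
        s = list(prev_finish)                        # per-resource starts
        raw = [s[i] + p[i] for i in range(num_res)]  # per-resource finishes
        floor = max(raw) - delta                     # bound the lag by Δ
        f = [max(r, floor) for r in raw]
        tags.append((max(s), tuple(f)))              # key = max start tag
        prev_finish = f
    return tags

flow_a = [(1, 2), (2, 1), (1, 2), (2, 1)]
flow_b = [(3, 3), (3, 3)]

# Full dovetailing: A's third packet (key 3) lines up with B's second (key 3).
print([key for key, _ in tag_flow(flow_a)])       # [0, 2, 3, 5]
print([key for key, _ in tag_flow(flow_b)])       # [0, 3]
# Δ = 0: A is charged its maximum again, as in memoryless DRFQ.
print([key for key, _ in tag_flow(flow_a, 0)])    # [0, 2, 4, 6]
```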
We don't want to set it to infinity because, you know, if one flow was using one-two, one-two, one-two forever for ten minutes and then switched to two-one, two-one, two-one, you shouldn't be able to benefit that way. Okay, so briefly about the implementation, and then I can take one question or so. We implemented this in Click and ran the M57 traces through it. In this first experiment, we just wanted to check what happens: do elephant flows actually affect mice flows? So we had two elephant flows — one doing basic forwarding, nothing else, and one doing IPsec, which actually bottlenecks on CPU. They were each sending 40,000 packets per second. For the particular one-gigabit link we had, and packet sizes of, I think, 1.4 KB per packet, that would completely saturate the resources on that machine. And at the same time we had two mice flows that would just send one packet a second, doing only basic forwarding. And we looked at the latency these flows would receive. The y-axis here is logarithmic, and we look over time. The two mice flows barely see any latency — they have very low latency — whereas the two elephants are backlogged, so each of their packets is sitting in the queue, which is the obvious outcome. And you can also see that the one doing IPsec receives slightly more latency. Okay. And maybe I can quickly mention this one too. We also simulated bottleneck fairness for the same kind of workload. Here we have an example where one flow's demands were roughly ⟨1, 6⟩ and the other's ⟨7, 1⟩.
And another bad thing can happen if you're using bottleneck fairness — which, if you remember, was the one where you try to determine which resource is the bottleneck and apply fair sharing to that. You get into these oscillations where it keeps jumping from one resource to the other, because here it's really unclear: there is no single bottleneck, both resources are bottlenecked with demands of this sort. You can't give 50% on one resource, because then you would need way more on the second resource. And we actually ran experiments with TCP here and saw that it actually hurts throughput. So this oscillation is very bad for TCP, and, like I mentioned earlier, also if you're doing audio or video traffic. So, in summary, we're seeing that packet processing is becoming much more sophisticated — flows are consuming multiple resources, especially in these middleboxes. The natural policies we tried out, such as per-resource fairness and bottleneck fairness, either fail strategy-proofness, or don't have sharing incentive, or are hard to implement because they require buffers between each resource. So we proposed DRFQ, which generalizes fair queuing — and the concept of virtual time — to multiple resources. It gives you a trade-off between memorylessness and dovetailing, and satisfies the two properties I mentioned. That was the last slide. Thank you. If you want to have... Oh, okay. Quick question. When you blended the two — the bounded memory with the memoryless — have you tried turning it into continuous time? That feels like a first-order differential equation, because one of them acts on the margin, on the derivative, and the other on the integral over the flow, over the packets in the flow. Does that have a closed form where you land back on your feet with the delta? So is this for analysis, or do you want to... To see what it's going to converge to — I mean, is there an easy... Yeah, I mean, we didn't try this.
This is an interesting thing we could try. Analyzing whether there is a closed-form formula for how this behaves over time would be interesting. We typically just set Δ to a very small value, and we saw that it's enough to match the actual resource consumption happening in the system. But yeah, that would definitely be an interesting analysis to do. And also figuring out this Δ, because right now the way we set it is kind of ad hoc. There might be better ways, because if you set it to some ad hoc value, maybe flows are able to figure that out and game the system using it. So that would be worth looking at. It would also be interesting to try DRFQ in general when you're doing VM scheduling. It's very similar: it's again time multiplexing, and the VMs are usually using heterogeneous resources. So that would be interesting. So one big difference between the network and CPU scheduling is that in the network the packet size is determined by the application, basically, or by the networking subsystem, whereas in CPU scheduling the time slice is arbitrary — although you could imagine it could also be specified by the application. So my question is, I guess, how does this change when you apply it? Is this relevant to CPU scheduling? You said it's relevant to VM scheduling. How does it change when, all of a sudden, you have the ability — which you didn't have before — to change the slice size? So I don't exactly know; I haven't done the research. First of all, there's start-time fair queuing, which I know has been applied to VM scheduling, so that turned out to work there. But I mean, we could run into trouble here if we're doing that. And the second thing is — so you're saying you could decide when to preempt. But it's also that, you know... You can't preempt a packet, but you can preempt a CPU. There's a cost, right? If you want to preempt, you have more overhead. Right, right.
So that gives you some flexibility. But on the other hand, it seems that sometimes a VM all of a sudden doesn't need a resource at all — it's just blocking on some other resource — and then all of a sudden it needs it again. So it seems it goes in both directions. I don't know. It should give you some more leeway to do better scheduling. Maybe you can then prove better bounds. I didn't go through the bounds here; in the paper, we have some proofs on the bounds you get with the system. I think you could probably improve those. I think the bounding idea — the bounding-slash-window idea — is a good one. And practically, without it, the burden is just pushed back to the applications: all applications have an equal chance to game the system if they want, and it's considered fair for that reason. But in some sense, that seems like a bit of a burden, forcing apps to do that. So having the window, and having it be a tunable window, might be a good thing to add. Yeah, definitely. The window idea is... And especially, as you mentioned, it's kind of central, actually. Most of your results are based on steady-state flows, not bursty flows. If you have bursty flows, it seems like the form of dovetailing you're doing benefits bursty flows, because they can basically accumulate some credit, then go through their burst and enjoy it — they enjoy a speed boost every time they come. And that should cause some oscillation. Have you looked at how that causes oscillation in the flows? Yeah. Let me see if I have... Because probably that is the maximum you want to set the Δ to be. Anything more can actually cause starvation of the other flow, because one flow can accumulate enough credit to basically burst through and starve the other one for a while.
So one thing that I didn't mention very clearly is that, first of all, when we run the M57 traces, those are bursty flows. So in the evaluation, we actually run real traffic through the system; the examples were stylized, with everything backlogged, just so you get the intuition for how it works. But you don't get any dovetailing if you're not backlogged. So the dovetailing credit cannot accumulate: as soon as you no longer have buffered packets, all dovetailing is gone. So you shouldn't be able to benefit that way, and I don't think the bursty flows actually benefit from this. That's how the dovetailing works. So the question is, how is this affected when your outgoing or incoming bandwidth is varying, perhaps drastically, over time? Say you're on a wireless link, or say you have a shared upstream link like from Stanford, and all of a sudden a bunch of people start running BitTorrent and then they switch it off, or whatever. How do you take that into account? So there, I mean, we don't really do anything special. But leveraging start-time fair queuing is what actually helps. One of the reasons SFQ was suggested was that many of the previous fair queuing algorithms assume that you have a fixed bandwidth, so they don't work very well when it's varying — they don't work well in the wireless setting. They also don't work well in the hierarchical setting, where some parent class in the hierarchy all of a sudden starts using more resources, so the amount available to a sub-part of the hierarchy becomes less. Start-time fair queuing — it's kind of subtle, but this C(t) term, the fact that it's used, means that you're always basically synchronizing your virtual time to what's actually happening in the system, to how much of the resources is available. So we get this benefit for free out of start-time fair queuing. Do you see what I'm saying? A single packet still goes out at one gigabit, but your upstream bandwidth may actually become the real bottleneck.
So in some sense, you don't have any way of measuring that upstream bandwidth, because you can't actually... You really have to look at flow completion time rather than packet completion time, but you don't know how long your flow is going to be. So it's a big problem, I think. Well, we just measure... So this assumes that we know, for each packet — and this gets into another thing that I didn't mention much of — how much does each packet consume? Accurately getting that is hard; it's actually a hard problem. I didn't say much about it; there might be a backup slide on it. But, you know, the way we did it — because that's actually, I think, a whole other paper by itself... Have you considered other resources, like power and cost? No, we didn't. But other people who apply DRF have used it for other things and applied it in those contexts. So this is the thing: one of the problems is that we really don't know how much a packet consumes, and if we used CPU counters or something like that, it would be way too costly — you can't really do that. So that's actually another challenge you have in this setting, and it's a separate problem, really. The way we solve it is that we actually do linear estimation of the processing time. What we do is, for each module M and each resource R — and for CPU and memory bandwidth a linear model turned out to be a really good fit, so linear regression works really well — we just assume that from the packet size we can tell how much the packet is going to use of this resource. But, you know, this is a way you can game the system: if the estimate is off... So accurately determining resource consumption is a difficult problem. So are there any thoughts of putting this into Linux, or some way that we can get it? No, no, I'm not doing that.
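The linear estimation just described can be illustrated with a tiny least-squares fit (the calibration numbers below are invented for illustration; the real system fits one model per module and resource from measured traces):

```python
# Per-module linear estimation of packet processing time (sketch with
# made-up calibration data). time ≈ a * packet_size + b, fit offline,
# then used at scheduling time instead of expensive per-packet measurement.

def fit_line(sizes, times):
    """Ordinary least squares for a one-variable affine model."""
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(times) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(sizes, times))
    var = sum((x - mx) ** 2 for x in sizes)
    a = cov / var
    return a, my - a * mx

# Hypothetical calibration measurements for one module on the CPU resource:
sizes = [64, 256, 512, 1024, 1500]     # packet size, bytes
times = [1.1, 2.6, 4.7, 8.8, 12.6]     # processing time, microseconds

a, b = fit_line(sizes, times)
predicted = a * 800 + b                # estimated cost of an 800-byte packet
```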
I mean, not because — I mean, it would be an interesting thing, but there are other applications of DRF that I'm investigating, like the hierarchical setting and so on. If anyone wants to do it, I'd love to work with them, but not right now. DRF is implemented in the latest Hadoop, so it's shipping in the Hadoop setting, but that's not for networking. But here I think one could follow up and do some smart things: occasional probing with CPU counters to get something more accurate than this very simple linear model that we use. And the linear model will break down for certain other resources, which hopefully you're not using that much. But yeah. Do we have one last question? Have you thought about compensating for this estimation error after the fact? Because you don't know the consumption in advance, but you do know it once the packet has been processed. So if you do that, you can actually track the different flows and compensate the affected flows afterwards, right? Of course that is going to cause some jitter, that's going to cause some delay. System-wise, maybe estimating this up front is just too hard, right? The idea is basically feeding the measurements back into the algorithm. We do this. We have a token-bucket scheme in which we actually look at the past and adjust the number of tokens we're granting to the different flows, because the model could be off. So we have a sort of way to deal with this, but I think it's far from optimal — very far — because the way we saw it, this is sort of orthogonal. It's important to get it working, but we wanted to understand the scheduling part by itself first. But yeah, these issues about resource consumption in the multi-resource setting come up in other contexts, and there have been some papers about this too. Thank you very much. Thank you.