Good afternoon everyone, I hope you enjoyed your afternoon tea on Australia Day; everyone's out at the beach, swanning around. Simon Horman is going to be giving us a talk on network bandwidth isolation. For those of you who've got questions, if you could raise your hand before you ask, this microphone will be handed to you by one of our floor walkers. That will allow AV to pick up your voice so it doesn't look funny on the film. Over to you, Simon.

Hello everyone, thank you for sparing your time for me today. I'm going to be talking about network bandwidth isolation, which basically means doing QoS on virtual hosts. As it says, my name is Simon Horman; I work for my own company in Tokyo, although I am actually from Sydney.

A quick overview of what we'll be talking about today: I'm going to talk a little bit about the scope of this talk, then about how we can identify packets that are coming from guests, and how we can schedule them. And then I'm going to talk about some rather interesting problems that I came up against when deploying this on Red Hat Enterprise Linux 5, which I believe people still use.

So what's the aim of this? Why do I care? What am I worried about? We want to ensure that all of the guests receive a fair share of bandwidth, and importantly we want the definition of "fair share" to be configurable by the administrator. It's quite common to have different levels of service in the silver, gold, bronze kind of system, and of course if you're a gold customer you would like to get what you paid for. Specifically, we're looking at an environment where the different guests have essentially been sold to different customers. They're not necessarily cooperative, and even if they are, they may not have upgraded their machines correctly and they may have some malicious software running there. So essentially we're talking about a stream of UDP packets which is going to cause a DoS on the machine, and we'd like to mitigate that somehow.

This work was all done with Xen. I'm actually doing quite a lot of work with KVM at the moment, so I understand the KVM side of things much better than I used to, but the assumptions made in this talk are Xen's. If you try to take this and use it on KVM, I can tell you with certainty that it will not work as you expect. We're talking about bridged networking; we're not routing the packets from the guests or anything like that. Everything is running Linux, and I'm only talking about the transmit path. Doing QoS on ingress traffic is quite a different topic, and I don't even attempt to tackle that.

So what are the resources that we need to worry about, the resources that I considered? There are three. There's the bandwidth of the NIC itself, which is I guess the obvious one. Less obvious but more important is the CPU of the host (dom0 is what Xen people call the host), and also, though it tended to be less of a problem, the memory of dom0. The guest running on the machine has limitations placed on it, but when it sends packets they have to be processed by the host, and that naturally consumes some resources on the host. If a guest is able to consume a lot of resources on the host, that may cause the host not to be able to service requests for other guests in the manner you might like.

So briefly, I'm just going to introduce the concept of packet scheduling.
It is what it sounds like: you prioritize packets based on the domain. This maps very directly onto bandwidth usage, and also onto CPU usage, because CPU usage is directly proportional to the number of packets being processed per second; those two kind of get grouped together. Memory usage relates to the number of packets that are queued in the system at any given time, which in turn relates to how fast they're going in versus how fast they're going out. But typically the queue lengths are fixed, so this memory is essentially managed.

So I want to talk real quickly about how flow control works in Xen, and this is not how it works in KVM. What have we got? We've got a guest over here, and this netfront is basically the guest driver; this is how it's sending packets. Over here we have the host, dom0, and in between we have this magic ring buffer thing. What happens here is we have just one packet; it's got a bit of a header there, which is labelled "packet" for some reason, but it's got a fragment (just one fragment; it could be more, but usually just one) and some metadata, and this is going to consume two slots. Pretty straightforward so far. The metadata gets processed straight away and goes straight back on the free list, so that slot can be reused. If this thing was completely full, we'd now have one free slot, which is not enough to send a new packet, but half enough. The fragment itself doesn't get freed straight away; it goes all the way through the network stack until eventually it gets transmitted by the physical NIC driver. Once that's happened it goes back on the free list, and, if nothing else has happened, now we have two slots free over here and we can send another packet. Obviously there are more than two slots on the ring buffer; I think the default is about 256.

Fundamentally, the idea is that if packets are coming in here much faster than they're going out here, this side will have to slow down. It can spin the CPU doing other things, but that's okay, because it's its own CPU. It can't send oodles and oodles of rubbish here which then has to be processed over here: it will be slowed down, and this is fundamental. This is flow control, or back pressure. Without it, none of the rest of what I will speak of today would work at all.

Okay, so that's the background, the foundations of where I'm coming from and the terminology for describing what I want to do. If we're going to do flow control on packets coming from guests, we need to be able to identify them somehow: to be able to say this packet came from guest zero and that packet came from guest one. So this is what a bridged network looks like in my brain. I guess you can draw this diagram differently, but essentially you have several guests, and the several guests are connected to a host. On the host you have the virtual network interfaces (the other half of each one is over here), there are several of them, and they all get bridged, and then they're sharing a single output interface. I'm primarily focused on packets that are coming out here. Of course they could also do this kind of thing; that's fine, that also fits in this model, I'm just not going to put it in my examples.

So how do we identify packets? There are several ways, two of which are interesting. One is iptables: iptables can look at packets that are going through any interface, and it can place a mark on them.
Another technology, which is a bit newer, and which I don't think is actually particularly useful for Xen, is net_cls. This is basically a cgroup for networking: you can place tasks inside the cgroup, and then any packet that originates from a task in a given cgroup will be stamped with the mark associated with that cgroup. I'm not sure; I actually added that quite recently to this slide. It would work quite well in KVM, because there you can quite readily associate a guest with processes. In Xen it's not clear to me that that's possible, although I haven't specifically investigated it. In any case, I will be using iptables for my examples today.

So this is the same diagram again, with some arrows drawn on, and we're looking at packets coming out here. Basically what we can do is look at this interface, or this one, or this one, work out which interface the packet came from, and stamp a mark on it accordingly. This is how we identify the packets. These are some very simple iptables rules which achieve that (there's a reconstruction of them below). For each of the three virtual interfaces, vif2.0, vif3.0 and vif5.0, we give each an individual mark: 100, 110, 120. It's not important what the numbers are; it's just important that they're unique between the three. Or not: if you wanted to group them, you could give two interfaces the same mark. And it's also important, of course, that when we use those numbers a bit later, we consistently use the same numbers in the same way.

Okay, so that's how we identify packets; it's not a particularly challenging part. We're now going to look at how we can schedule them: how we can slow them down, or not slow them down. There are a couple of different things we do when we're doing packet scheduling. Firstly we have to filter the packets, which is essentially the identifying part; we just have to hook into the mark that we've already set. Then we can prioritize the packets; we might choose to delay packets, queueing them up for a while; or we can drop them. These are basically the kinds of things we're doing when we're packet scheduling.

Just going back to the memory briefly: the amount of memory that a given guest is consuming for its packets in the dom0 kernel is limited. It relates directly to the number of ring buffer slots; we can't have more packets queued than there are ring buffer slots, and each packet has a finite size. So that's kind of nice. In terms of the speed of packets going through the system, which relates to the CPU usage and the NIC bandwidth usage, delaying packets should be sufficient for our needs. If we're getting too many packets from the guest, too many packets per second, it seems logical to just reduce the number that we're actually processing per second, and because of the flow control semantics I described a little earlier, that will slow the guest down. Dropping the packets: I initially thought that dropping packets might be a good idea. It's actually not a very good idea at all, because as soon as the host drops a packet, the ring buffer slot is freed and the guest can send another one. You tend to get into a very, very fast loop of packets coming in and being dropped, the CPU usage typically goes to around 100%, and that is precisely what we're trying to avoid.
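The slide with the marking rules isn't reproduced in this transcript, so here is a hedged reconstruction of what was described. The mangle POSTROUTING placement and the physdev match are my assumptions about the exact form; the vif names and mark values are the ones from the talk:

    # Mark packets according to the virtual interface they entered the
    # bridge on. The physdev match is what lets iptables see the bridge
    # port for bridged traffic (requires bridge netfilter to be enabled).
    iptables -t mangle -A POSTROUTING -m physdev --physdev-in vif2.0 -j MARK --set-mark 100
    iptables -t mangle -A POSTROUTING -m physdev --physdev-in vif3.0 -j MARK --set-mark 110
    iptables -t mangle -A POSTROUTING -m physdev --physdev-in vif5.0 -j MARK --set-mark 120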
Okay, so borrowing is a slightly different topic. The idea of borrowing is essentially this: let's say that you have two classes set up, or two guests set up, and they're each allowed 200 megabits a second, but you know you've got a gigabit NIC, and you know that absolutely nothing else is happening on the system, and you think, okay, that's fine, you can use the entire NIC if it's free. That's the concept of borrowing. So you have a rate, which is like a guaranteed amount of bandwidth, and then you have this ceiling, which is what you might get as long as nothing else is going on. It depends on your view of bandwidth and how you want to limit it, but if your bandwidth is free, maybe because it's all in the same data centre, this notion makes quite a lot of sense.

So here is the class hierarchy that I'm going to try to set up for my three guests, and I'm going to have borrowing. Each of these circles here represents a guest, so there's one for each of the three guests, and there's a fourth one which is the default, which essentially means traffic coming from the host itself. We have a parent, and this sets the global limits; this ties into the borrowing idea. Essentially we have a global limit of 900 megabits a second, and if only one guest is doing anything, that's the limit that's going to apply, because I've set the ceiling to 900 in each class as well. But if there is some contention on the network, then the per-class limits will come into play. I've given one guest 500 megabits a second and the rest of them 100 each. These numbers are fairly arbitrary; I don't think they add up to 900, but I think you get the idea.
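The slide itself isn't reproduced in this transcript, so here is a sketch of the hierarchy as described; the handle numbers are my assumptions, while the rates and ceilings are the ones from the talk:

    1:      root qdisc (HTB)
    └─ 1:1  inner class   rate 900mbit  ceil 900mbit   (global limit / borrowing pool)
       ├─ 1:10  guest 1   rate 500mbit  ceil 900mbit
       ├─ 1:20  guest 2   rate 100mbit  ceil 900mbit
       ├─ 1:30  guest 3   rate 100mbit  ceil 900mbit
       └─ 1:40  default   rate 100mbit  ceil 900mbit   (traffic from dom0 itself)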
So how do we configure this? What command do we run? It's the tc command, the traffic control command, and this is used to configure traffic control: the filters, the classification, and the qdiscs.

Firstly we do the root class, which was the one right at the top of that diagram with the green circles. It's an HTB class; HTB is just the name of the specific algorithm I've chosen to use. Then we also do the inner class, the one just below it, which is essentially a pool of bandwidth for the borrowing; here we've got a rate of 900 and a ceiling of 900. The ceiling is actually redundant, because it defaults to the rate; I just put it in there to be quite explicit about what's going on.

Now we add the leaf classes. Again these numbers correspond to the diagram: the first one has 500 megabits a second and the following three have 100 each, and all of the classes have a ceiling of 900 megabits a second. The parent number here refers to the class on the previous slide; again, it's just a number. And here we give each class its own name, so we can refer to it a little later.

Then finally we put a queue on the end. This is actually also not necessary, one is put there by default, but again, I think it's good to define these things explicitly. We have one qdisc for each of the classes on the previous slide; these numbers correspond to the handle names we gave on the previous slide, and they're fairly arbitrary because these qdiscs don't have any children. We're giving a limit of a thousand packets per queue, which again is pretty arbitrary; in practice I found that anything above about 16 was enough, and anything below that I'll discuss a little more when I get to the problems I found.

Okay, so the last thing we're going to do with tc is the filters. Basically, as the packets come through, we have to assign them to a class, and the way these filters work is that we hook into the firewall mark which was set using iptables a little earlier; that's these numbers here, 100, 110, 120. You'll note that the default flow is not actually being filtered, because traffic will go to the default class anyway. These handle numbers are fairly arbitrary; I just chose to use the same numbers again. So now we've set up all of the rules as per the diagram with the green circles.
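Pulling that walkthrough together, the whole sequence would look roughly like this. It is a reconstruction, not the actual slides: the device name peth0 is an assumption, and the handles match the sketch above; the rates, ceilings, queue limits and fw marks are the ones from the talk:

    # root qdisc; unclassified traffic goes to the default class 1:40
    tc qdisc add dev peth0 root handle 1: htb default 40
    # inner class: the global limit / borrowing pool
    tc class add dev peth0 parent 1: classid 1:1 htb rate 900mbit ceil 900mbit
    # leaf classes: one per guest, plus the default for dom0's own traffic
    tc class add dev peth0 parent 1:1 classid 1:10 htb rate 500mbit ceil 900mbit
    tc class add dev peth0 parent 1:1 classid 1:20 htb rate 100mbit ceil 900mbit
    tc class add dev peth0 parent 1:1 classid 1:30 htb rate 100mbit ceil 900mbit
    tc class add dev peth0 parent 1:1 classid 1:40 htb rate 100mbit ceil 900mbit
    # an explicit queue per leaf class (a pfifo would be there by default)
    tc qdisc add dev peth0 parent 1:10 handle 10: pfifo limit 1000
    tc qdisc add dev peth0 parent 1:20 handle 20: pfifo limit 1000
    tc qdisc add dev peth0 parent 1:30 handle 30: pfifo limit 1000
    tc qdisc add dev peth0 parent 1:40 handle 40: pfifo limit 1000
    # filters: hook the iptables fwmarks into the classes;
    # the default class needs no filter of its own
    tc filter add dev peth0 parent 1: protocol ip prio 1 handle 100 fw flowid 1:10
    tc filter add dev peth0 parent 1: protocol ip prio 1 handle 110 fw flowid 1:20
    tc filter add dev peth0 parent 1: protocol ip prio 1 handle 120 fw flowid 1:30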
Okay, so maybe I should stop now and ask if there are any questions. I'm not going to demo this; I tested it extensively, but there's not a lot to see: you push packets through and you see whether they go through at the right rate.

If you could put your hand up when you've got a question, we'll make sure a microphone comes to you. I think there's a question at the back.

Simon, you were talking about the ring buffer that's used to talk from the user domain into dom0. If the default buffer is 256 fragments, when you're using the tc filters, aren't you risking adding a lot of latency? You've got packets queued up before they're being transmitted; I guess I'm thinking around Jim Gettys' bufferbloat sort of idea.

That's a good question. I've recently started doing work on latency; this talk is obviously much more directed towards throughput. I don't think that 256 actually creates a big problem, because the last time I checked, when the kernel's networking core wants to send packets, it grabs a batch of, I think it was 100, packets from an interface and sends them, then sends another 100 from the next interface, and goes around in a loop like that. 256 is more or less in keeping with that, because it's only a couple of hundred, only a bit over two of those 100-packet batches, so I wouldn't expect packets to be queued up extensively in the ring buffer. But you are right: I am adding more and more queues, and yes, I think that does have latency implications, and I haven't really assessed that.

You mentioned early on that there are some implementation differences in how the bridging works between Xen and KVM. Have you done something similar with KVM in terms of producing this behaviour?

Right, so I have done some work on KVM quite recently. This mechanism I described with the flow control is the nub of the difference. I've described how the Xen side works; the KVM side I'm a little less familiar with, but essentially the packets end up in the QEMU process and then get pushed into the kernel through a socket. So you have a socket buffer, which is essentially your feedback mechanism for this kind of back pressure. Unfortunately there appear to be some bugs in that, and there is a school of thought that the socket buffer size should be reduced to zero, which would mean there would be no back pressure and no chance that this works at all. Ignoring that, the current situation is that unless you have a corner case in your configuration which causes your SKBs to be cloned as they go through the kernel, you will get back pressure and it will work more or less as I've described here. You'd probably want to use net_cls rather than iptables, but that's just details.

So have you been doing it with Enterprise Linux 5.5, or have you been using...?

This work here was done on 5.5; my KVM work has just been using kernel.org kernels. 2.6.37 is what was current before I came on this trip.

Especially with Red Hat Enterprise Linux 6 and the adoption of cgroups as part of the supported base, there's a whole load of work in that network stack. That is one area I haven't had a chance to play with myself yet; it's a little more complicated than a few of the other cgroup modules you can use. It would be interesting to see how much work you can do in terms of ingress and egress packet control.

On ingress: I actually did do some work on ingress and KVM recently. The main problem with ingress is that the policing engine can only drop packets, and I can easily create scenarios where the CPU is spinning quite hard, especially if, for instance, I'm doing ingress filtering and the traffic is going to another VM. Perhaps we should take it offline, because it's actually quite a...
If you want, I can hook you up with one of the guys in the Red Hat BT lab.

That would be great. But my personal opinion, my personal feeling about the KVM stuff right now, having worked on it for only a short time, is that there's been a lot of focus on performance and correctness, and not so much focus on making sure the QoS side of things works properly. I would like to change that.

I think you're going to see an awful lot of work in that area, because it's the only way we can compete with VMware in the virtualization space.

Yes. I mean, I go to my customers and they're mostly using Xen, and this QoS issue is very important to them. But maybe we should move on to the interesting problems I found, which all have fixes; so before you put your hands up and ask for the fix, I will get to that.

So, I was using Red Hat Enterprise Linux 5.4, RHEL 5.4 I believe, something like that. I think these problems are all still relevant, because I haven't filed them in the bug tracker yet, and I apologise for that. So what's the problem here? VLANs don't support scatter-gather. They do now; they didn't in the 2.6.18 kernel, which is what RHEL 5 is based on. Okay, so what's scatter-gather? It's basically the idea that if I have a bunch of fragments that comprise a packet, an SKB, I don't have to join them all up into a contiguous memory segment before doing the transmit. If they come in as multiple fragments, they can pass through the whole stack as multiple fragments, and the network card will magically combine them at the very last minute. It's a performance optimization; you want to use it regardless of the problem I'm trying to describe.

If you have a NIC that doesn't support scatter-gather, which is essentially no NICs (it just happens that VLAN interfaces were a corner case that hadn't been fixed yet), the kernel will linearize the SKB: it basically makes a copy and copies the data from the different fragments into one big fragment. This completely destroys the flow control I described in that diagram with lots of arrows on it, which means the ring buffer slots get freed a lot earlier than they should be, and the CPU basically gets into a feedback loop of the guest sending something and the host dropping it. The results are quite profound: you essentially lose any kind of interactivity on the host. It's a problem.

The workaround, if you're worried about this, and this will work on essentially any version of Xen on any kernel, is to rate limit the virtual NIC: the virtual NIC itself actually has a rate-limiting facility, and this is how you activate it (there's an example below). This still uses quite a bit of CPU, because there's dropping going on, but it turns out that it drops the packets early enough that the DoS kind of scenario doesn't manifest. The solution to the problem is to enable scatter-gather on VLAN interfaces, and that has been done, in newer kernels at least. The problem would still manifest if you happened to find another interface, whether physical or virtual, that didn't support scatter-gather, so there's a general problem there, but I rather doubt anyone's actually running into it other than in the VLAN case. Here are some details of the patches that fix it up; they're actually quite short. Problem solved.
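For reference, this is roughly what the rate-limiting workaround looks like; the syntax is from memory of xend's vif configuration, and the MAC address, bridge name and rate value are purely illustrative:

    # in the domU configuration file, e.g. /etc/xen/guest1
    # (a credit window can reportedly also be given, e.g. rate=100Mb/s@50ms)
    vif = [ 'mac=00:16:3e:00:00:01, bridge=xenbr0, rate=100Mb/s' ]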
Okay, problem two; there are only three problems, so I'm getting towards the end. Bonding. We observed a different problem when we were using bonding interfaces: essentially, we would try to set some classes up to use around about half a gigabit a second, and we'd be getting about 10% of that. Not really what we wanted; not as bad as the DoS problem, but still not real good.

So why does this problem happen? Basically, by default the bonding interface, which is a virtual interface of sorts, has a queue length of zero; the bonding interface is always going to be backed by some physical interface which has its own queue, so why add another queue? Okay, that's fine. But when you add a qdisc to an interface, its default queue length will be that of the interface it's attached to. So we end up with a qdisc with a queue length of zero, and the way the HTB algorithm works (I can't explain exactly why this is the case) is that it actually needs a little bit of buffer: if it only has one packet to play with, it can't do its job effectively, and that's the result you're seeing at the bottom of this slide.

Fortunately this problem is trivial to fix. I did actually ask the bonding maintainer whether this behaviour was intentional, and he said yes, so there are no patches; all you have to do is increase the queue length. A thousand seems to be the magic number that everyone uses; I actually found that anything from about 16 was okay. So, problem solved; it all works as expected.

Okay, the third and last problem. Sorry; oh, Herbert.

So yes, back on that bonding issue: wouldn't another possible solution be to actually apply your rate limiting on the physical interfaces rather than the bonding interface itself?

Could I apply the rate limiting on the physical interfaces rather than the bonding interface? Probably, yes, and that might actually be more logical. Actually, in this case the bond was being done for high availability rather than aggregation, so that would work just fine, I expect. Okay, so maybe this was a poor design on my part.

Okay, so the last one is a bug which has been fixed. TSO is a mechanism by which SKBs that are much larger than the MTU traverse the network stack until right at the last minute, when they're diced up, or segmented, into MTU-sized pieces. The reason is that it's much cheaper to process one very large packet than many small packets. Okay, that's fine. Unfortunately there's a problem with this in relation to the way that HTB accounts for the cost of a packet. It's really geared towards the idea that basically everyone is using 1500-byte packets, whereas in the TSO case you can quite easily see 64-kilobyte packets coming in from the guest, which is significantly larger than 1500 bytes. And the result is essentially that the rate limits get missed: it's basically the same as if none of the configuration I spoke of earlier had been set at all. So it's a bit of a problem.

I'll try to explain very briefly why this occurs. Essentially, the code calculates this value called cell_log, which is related to the log base 2 of the MTU, where the MTU, for some reason I don't understand, is 2047, which is a number close to 1500; but never mind, that's actually unimportant to this problem. The point is that this is quite a long way from 64,000. The answer to the calculation is 3. I actually modified this code; this slide's not about the math, it's about the fact that it's a log based on some number around 2000, and we get 3.
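As a sketch of that calculation (this is my paraphrase of the logic rather than the real code, which lives in iproute2's rate-table setup): the rate table has 256 slots, and cell_log is chosen so that the MTU, shifted right, fits in it:

    # cell_log is picked so that (mtu >> cell_log) fits in a 256-entry table
    mtu=2047
    cell_log=0
    while [ $(( mtu >> cell_log )) -gt 255 ]; do
        cell_log=$(( cell_log + 1 ))
    done
    echo $cell_log   # prints 3: each slot covers 8 bytes, topping out at 2047

With an MTU of 2047 the table tops out at 2047 bytes, so anything larger lands in the last slot; that is exactly the under-charging described next.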
Then what do we do? We basically set up a table based on this, and the table tells us, for each of a range of packet sizes, how much they should be charged. The table looks a little bit like this: little tiny packets get a value of 16, packets up to about this size get a value of around 2000, and, this is the critical one, any packet that's bigger than that also gets a value of around 2000. Okay, that's not so bad, but it means that a 2-kilobyte packet will be accounted as 2000, and a 64-kilobyte packet will also be accounted as 2000. That's not good: it's being undercharged by quite a margin.

I should have mentioned earlier: I was not allowed to modify the production kernel at all, which is why there are these workarounds. Workaround number one, which doesn't really work, is to turn off TSO in the guest. That's okay, except I said earlier that my guests are not necessarily very friendly, so they could just turn it back on. It's a problem, I guess; I suppose I could have refused it somehow, but I'm not sure.

Workaround number two is much more effective: we changed that magic 2000 value to another magic number, which is 40,000. We observed that 40,000 is admittedly also not 64,000, but it is much closer, and it turns out that, in terms of the way the table I showed earlier works, it is a good number: it's big enough that the really big packets get charged sufficiently highly, and we don't lose too much granularity at the lower end. And this basically works. So if you have this problem, this is what I would suggest doing.

The solution to the problem, however, is of course to fix the kernel. All the fix does is basically say: okay, you're a 64,000-byte packet; that's like 32 two-thousand-byte packets, so we'll just charge you 32 times the maximum table value, plus any change that's left over at the end. And that works: the algorithm now works as it was intended.
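Concretely, the two workarounds would look something like this; the device names and class numbers are assumptions carried over from the earlier sketch, and I am reconstructing the 40,000 change as tc's mtu parameter from the description rather than from the slide:

    # workaround 1, in the guest (an uncooperative guest can just undo it):
    ethtool -K eth0 tso off

    # workaround 2, on the host: build the HTB rate table for a larger
    # "mtu" so that 64k TSO packets are charged closer to their real cost
    tc class change dev peth0 parent 1:1 classid 1:10 htb rate 500mbit ceil 900mbit mtu 40000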
So, this is the end of my presentation today. What did I get out of this process? The existing infrastructure works quite well for doing bandwidth control. I expected to do a lot more development than I did; I mean, other than these bug fixes I did none, which was disappointing, because I'm a developer, you know. The key thing is that you need to be able to identify the packets. The example I gave was extremely simple, where we're just trying to break things down on a per-guest basis; you may quite easily want to do something more complicated. In fact, in the scenario this was going into, the host was actually using iSCSI, and we wanted to ensure that it got a fair amount of bandwidth, so it had its own class; the fact that HTB is hierarchical made that possible. So you need to identify what the problem is and design an appropriate class hierarchy. And perhaps the most important thing: there are subtle traps. tc in particular, but Linux networking in general, is very complicated; there are many, many knobs, and you may think you've turned the ones that do what you want. You need to test. I'm pretty sure there are holes in the rules that I wrote above, but they passed the tests for the traffic we were interested in, and I'm not sure there's much more you can do than that. Thank you everybody for your time, and I think we have time for more questions.

So, is there actually anything specific to a virtualized environment here? It seems to me that you'd have exactly the same problem with a bunch of real, physically different hosts behind a switch, as long as the things on one side of the switch have a greater aggregate bandwidth than what's on the outside, which would almost always be the case, I would think.

Yes. Many of these problems are generic at many levels, so if you looked at a physical network you would see similar problems; even if you looked at processes, forget virtualization, you could create similar scenarios. So this is very specific work looking at a very general problem.

All right, well, Simon, thank you very much indeed.

Well, thank you.