Hello. It's only four combinations, so eventually this works. Welcome to my talk. I'm Henning. I'm going to talk about the new queueing subsystem, which is not entirely committed right now, so a lot of what you're going to see here is actually not in the tree yet. And good news first: I'm not trying to fit 80 slides this time.

So when we're talking about queueing, we are obviously talking about queues. What queues are we talking about? Queues are really simple: they are temporary storage for packets, all over the stack. A queue is usually nothing more than a chain of mbufs, linked through the next-packet pointers in the mbuf structure; we'll come back to this in a little bit. And traditionally they have always been processed in a FIFO manner: the first packet that was queued is the first one to be dequeued at a later time.

A quick recap of how a packet travels through the stack. The network interface card receives a packet and raises an interrupt. The interrupt handler gets the packet off the NIC's DMA rings, puts it into the IP interrupt queue (we are ignoring IPv6 here), and then schedules a soft interrupt. That's basically just a flag that's checked before the kernel returns to userland, and the kernel then acts based on it. The soft interrupt subsystem calls ipintr() to dequeue those packets, which in turn hands them to ip_input(), which processes them one by one. In the case of a forwarded packet, which is what I usually care about, we end up in ip_forward() and eventually in ip_output(), and the packet gets put into the queue on the NIC for the outgoing connection. There's, of course, the completion interrupt and everything, but we don't care about that here.

For the typical and simple case, we are dealing with two queues. As I said, this is forwarding: no encapsulation, no IPsec, no tunnels, no IPv6, no bridges either. For this simple case, the typical forwarded packet, there are two queues: the aforementioned IP interrupt queue, and the queues on the NICs for the outbound direction.

And struct ifqueue is actually pretty damn simple. You have the head of the queue, just a pointer to the head mbuf. You have a pointer to the tail; that, of course, is an optimization, because a new packet has to be queued at the end, and walking the entire chain on every enqueue to find the last one would be expensive, so it's much cheaper to keep the extra pointer. You have a variable holding the current length, a variable holding the maximum allowed length, and a counter for how many packets have been dropped (there's a little sketch of this below).

And there's a congestion mechanism. Basically, when we notice we're congested (I'll spare you the details of how we notice), we set a flag, which makes certain subsystems skip expensive parts of their work; namely, PF stops doing ruleset evaluation and only handles established connections. When we're congested we're dropping packets anyway; this is just more selective dropping. But that is not the topic of this talk.

The basic queueing methods we care about: there's the aforementioned FIFO, which is what we have had forever. And there's priority queueing. Priority queueing is used to lower the latency for important packets. Of course, it also has an effect once you have to start dropping packets, because then you drop the lower-priority ones first. Priority queueing really just changes the order in which we take packets off the queue, nothing else.
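Going back to struct ifqueue for a moment, here is a minimal sketch of that structure and the FIFO enqueue it implies, reconstructed from the description above rather than copied from the kernel:

    /* the classic BSD packet queue, as just described */
    struct ifqueue {
            struct mbuf     *ifq_head;      /* first packet, next to be dequeued */
            struct mbuf     *ifq_tail;      /* last packet, so enqueue is O(1) */
            int              ifq_len;       /* current number of queued packets */
            int              ifq_maxlen;    /* drop once we would exceed this */
            int              ifq_drops;     /* counter of dropped packets */
    };

    /* FIFO enqueue: append at the tail via the next-packet pointer, or drop */
    void
    if_enqueue(struct ifqueue *ifq, struct mbuf *m)
    {
            if (ifq->ifq_len >= ifq->ifq_maxlen) {
                    ifq->ifq_drops++;       /* queue full: count and drop */
                    m_freem(m);
                    return;
            }
            m->m_nextpkt = NULL;
            if (ifq->ifq_tail == NULL)
                    ifq->ifq_head = m;
            else
                    ifq->ifq_tail->m_nextpkt = m;
            ifq->ifq_tail = m;
            ifq->ifq_len++;
    }

Dequeue is the mirror image: take ifq_head, advance it to the next packet, clear the tail pointer if the queue ran empty.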
And obviously, reordering really only makes a difference when the machine, or the link behind the machine, is overloaded or very close to being overloaded. If there's plenty of CPU power and plenty of bandwidth available, we end up with two or three or fifteen packets in the queue and can process them very fast, so reordering doesn't really make a difference there. There's one thing to keep in mind that's very important: you must never, ever reorder packets that belong to a single TCP stream. TCP doesn't like reordering, right?

The other queueing method we care about is bandwidth shaping. Say you have your nice gigabit link, your uplink is only a couple of megabits, and you're giving bandwidth to your neighbor or a customer or whatever; you don't want them to use up all your bandwidth, right? So you want some kind of limitation. Or you limit the bandwidth you use for downloading porn so your SSH sessions still work — actually, I'd use priority queueing for that. Bandwidth shaping is obviously more complicated than priority queueing. There we actually have to measure the bandwidth certain classes of packets take up, which also implies that we have to classify packets. And if we reach the bandwidth limit for a class of packets, we have to delay them, which in turn means leaving them in the queues a little longer.

The classification is the decision of how we queue a packet; the actual queueing is a separate step from the classification. Classification just means we mark the packet somehow, details later. What we really do is write the priority value or the queue ID (the QIDs are invisible to the user) into the mbuf packet header — roughly as sketched below. And this being OpenBSD, of course we use PF to classify. Not just because we love PF, but also because it just does not make sense to re-implement the same thing over and over again.

The actual prioritization or shaping obviously has to happen when we dequeue packets. We have to put the packets into the right queue or subqueue at enqueue time, and we have to process them in the right order — and, for bandwidth shaping, at the right timing — at dequeue time. And obviously this can only happen at the interface queues, because otherwise we'd just have a packet in flight and no way to store it anywhere, right? Prioritization is useful on any queue, no matter where it sits, because, as said, it just changes order and lowers latency. Bandwidth shaping, also pretty obviously, is only useful on the outbound queue. Doing bandwidth shaping on the IP interrupt queue is kind of useless: the packets get put into another queue afterwards anyway, whether they were delayed or sent on immediately. Oh yes, and at the IP interrupt queue we don't know the interface speed yet, so we have no idea whether that link is congested or not.

Priority queueing is something we pretty much always want. There are always certain types of packets you don't want to lose, even under severe load. You don't want to lose CARP announcements. If you lose CARP announcements, your backup host will think that the master is dead and take over, and suddenly you have a master-master situation, a split-brain situation, where both think they are the only master and have to process traffic, which leads to duplicated packets, which leads to all kinds of interesting problems. And obviously the same is true for the spanning tree announcements.
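As a rough illustration of that classify-then-queue split — the field names are from memory of OpenBSD's packet-header PF metadata at the time, so treat this as a sketch, not the exact kernel code — the classification step amounts to little more than this:

    /* classification: PF stamps its verdict into the mbuf packet header;
     * nothing is queued here -- the queues act on these fields later,
     * at enqueue/dequeue time */
    void
    classify(struct mbuf *m, u_int8_t prio, u_int16_t qid)
    {
            m->m_pkthdr.pf.prio = prio;     /* priority, 0..7, from "set prio" */
            m->m_pkthdr.pf.qid  = qid;      /* queue id, invisible to the user */
    }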
Another typical case for prioritization: somebody is, again, downloading way too much porn and you cannot SSH into your router to fix it. So you do want to prioritize your SSH management traffic over porn downloads.

Priority queueing happens to be everywhere. If you look around: the VLAN header has a priority field, and, surprise, it's eight different values, 0 to 7. Better switches — pretty much any switch these days, I think; I'm not familiar with the home switches, so don't ask me about those — priority queueing basically is everywhere. All the better ones have four or eight queues, typically eight, four on the older ones. They have built-in classifiers, but those are extremely simple and usually not good enough. And many of the high-end network interface cards these days have multiple send queues, so you could use the hardware to split the priorities apart and not have to do it yourself. I don't think we support that in any driver yet. The hardware makers usually sell this as a virtualization feature, so that each guest can have a separate queue, which doesn't make too much sense, but that's a different topic. If you can put the virtualization buzzword on it, it sells better, right?

So we already have bandwidth shaping and priority queueing, in the form of ALTQ. ALTQ is a research project by Kenjiro Cho; he did that around 2000. It was a research project because bandwidth shaping and priority queueing were kind of new back then, and his goal was to investigate different queueing methods — schedulers, as he called them — because, as I said, there was no experience with them. It was originally developed outside the OpenBSD tree; it was his thesis at the university. And since this was a research thing, used to figure out which scheduler works best, it's highly modular and pluggable, which also means there's considerable overhead and considerable complexity that we typically don't want to deal with for production. It's okay for research, but not for production.

How does it work? It replaced the struct ifqueue that I showed earlier with a struct ifaltq. It adds new enqueue and dequeue functions that do the actual bandwidth shaping or priority queueing. And our infamous if.h macros have to exist in two versions, one for ifqueues and one for ifaltq (roughly as in the sketch below). And this is the third one, by the way — but that's another topic. Whenever I work on if.h for too long, I have to ask for new brace keys for my laptop.

ALTQ has to be enabled explicitly per interface. As mentioned, the enqueue and dequeue functions are replaced. Traffic classification and configuration originally happened with ALTQ's own classifier and scheduler, which was barely usable — if you could read the documentation in Japanese. Otherwise, it was a mess. So Kenjiro and myself merged ALTQ and PF, to use PF as the classifier, in 2003, which also leads to the interesting situation that I'm replacing my own old work here. After the merge, the simplified ALTQ had one priority scheduler and two different bandwidth-shaping ones: we initially had CBQ, class-based queueing, and later added HFSC, Hierarchical Fair Service Curves, which I'm going to explain a little bit later.

Problems. Headache doesn't only come from Schnaps. Oh, you can't read that, damn projector — it's a Japanese label saying Schnaps. I keep being surprised which German words spread around the world. And we're going to have Schnaps later.
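To show what that duplication looked like in practice, this is roughly the shape of the dual-path dequeue macro ALTQ needed — modeled on the historical IFQ_DEQUEUE and simplified here:

    /* every dequeue site had to check which flavor of queue it had:
     * a plain struct ifqueue, or an ALTQ-enabled one */
    #define IFQ_DEQUEUE(ifq, m)                                     \
    do {                                                            \
            if (ALTQ_IS_ENABLED(ifq))                               \
                    ALTQ_DEQUEUE(ifq, m);   /* scheduler path */    \
            else                                                    \
                    IF_DEQUEUE(ifq, m);     /* classic FIFO path */ \
    } while (0)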
So, this separation between the struct ifqueue and the struct ifaltq has quite some drawbacks. It means that each and every network interface card driver had to be adjusted to be converted to ALTQ. It also means that lots of support functions and macros had to exist in both versions. And as a side effect, the queues that don't happen to live in the network card drivers never were ALTQ-capable, which makes them of questionable use.

Anyway, the configuration for ALTQ, as I mentioned before, was drastically simplified by putting the configuration part into PF, but it was still kind of complex. Most of you use PF, right? So, who's using ALTQ here? That's quite a few. Who's using HFSC? Wow, two! That's more than I expected. That's way more than I expected. Basically, even the simplified version is still too hard to use, partially because one of the decisions we — I — took in 2003 was wrong. So: very hard to use, it comes down to that. The other issue is that there's too much overhead. All the abstraction and pluggability and modularization brings a lot of code that has to run, and of course there is a cost to it. If you just enable ALTQ on an interface, without doing any classification or any actual queueing, you lose about 10% of performance, and I don't think that's acceptable. To give you an idea of ALTQ's size: by itself it's 9,000 lines of code, and the diff to remove it is more than 12,000.

So, the new simple priority queue. The idea is that it should be extremely simple, because, seriously, priority queueing is not rocket science, right? You just have the eight priority levels. It should have very, very, very low overhead, and because of that, we want it to be always on. As I mentioned before, we always want to prioritize certain classes of traffic, like the CARP announcements. To do that, I modified struct ifqueue and the enqueue and dequeue macros, plus about a handful of helper functions that we don't care about here. Instead of having one queue head in the struct, we simply have eight, and the priority value is in the packet header mbuf. So instead of one head and one tail, we have an array of eight of each. That's the entire change; there's a sketch of it below. The enqueue function takes the priority value from the packet header and uses it as an array index, which of course means we have to verify it is in bounds before queueing. And the dequeue function just loops over the array. So, that's simple, and that's cheap.

There is no configuration for this, and none is possible, because none is required. PF is still used for classification, in a very, very, very simple way: you just do "set prio" with a number between zero and seven on matching packets. The priority field in the packet header is inherited from the VLAN header if the packet comes in on a VLAN interface. And CARP and spanning tree announcements, the BPDUs, are prioritized by default; you don't have to do anything. And as mentioned, you cannot turn the priority queue off. We don't believe in buttons.

That being nice and in the tree, we still need a way to do bandwidth shaping. There is no need to have multiple schedulers — the fact that we had multiple ones was due to the research aspect — we only need one, and HFSC is the most flexible one. It was pretty much unusable because of the hard configuration, but the actual algorithm is the nicest of them all. You can express CBQ entirely in it, and it's also more precise than CBQ; the resolution is a bit better.
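Before we get to the shaper's configuration, here is the promised sketch of the eight-bucket priority queue. Names, and the handling of out-of-range priorities, are illustrative — this follows the description above, not the committed code:

    /* the eight-bucket queue: instead of one head/tail pair, an array */
    #define IFQ_NQUEUES     8

    struct priq {
            struct mbuf     *head[IFQ_NQUEUES];
            struct mbuf     *tail[IFQ_NQUEUES];
    };

    /* enqueue: the priority from the packet header is the array index;
     * it must be bounds-checked first (clamping is just a guess here;
     * length and drop accounting omitted) */
    void
    priq_enqueue(struct priq *q, struct mbuf *m)
    {
            unsigned int prio = m->m_pkthdr.pf.prio;

            if (prio >= IFQ_NQUEUES)
                    prio = IFQ_NQUEUES - 1;
            m->m_nextpkt = NULL;
            if (q->tail[prio] == NULL)
                    q->head[prio] = m;
            else
                    q->tail[prio]->m_nextpkt = m;
            q->tail[prio] = m;
    }

    /* dequeue: loop over the levels, highest priority first;
     * plain FIFO within each level, so no reordering inside a stream */
    struct mbuf *
    priq_dequeue(struct priq *q)
    {
            struct mbuf     *m;
            int              i;

            for (i = IFQ_NQUEUES - 1; i >= 0; i--) {
                    if ((m = q->head[i]) != NULL) {
                            q->head[i] = m->m_nextpkt;
                            if (q->head[i] == NULL)
                                    q->tail[i] = NULL;
                            m->m_nextpkt = NULL;
                            return (m);
                    }
            }
            return (NULL);
    }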
The configuration — I keep coming back to this point — getting the configuration straightforward and easy, that was the big challenge. Well, it turned out there were a few others, but that was the biggest one.

HFSC stands for Hierarchical Fair Service Curves — "Curses", it says here; there's a typo in my slides. A service curve consists of a bandwidth, a burst time, and a second bandwidth. For the first d milliseconds — the burst time — the packets matching that curve get the burst bandwidth, and afterwards they get the regular bandwidth. The burst time and burst bandwidth are of course optional; without them the curve just gives the regular bandwidth all the time. Each HFSC queue consists of three of these service curves. There's one controlling the minimum assigned bandwidth; this is often referred to as the real-time service curve. The second is the target bandwidth: if all the queues are overloaded for a long time, that's the bandwidth share this queue will get. And the third is the maximum this queue will ever get. In practice that means you get some bandwidth between min and max, and we try to give you the share defined by the middle one, the target bandwidth.

Just like the CBQ scheduler, HFSC has a tree of queues — it's hierarchical, as the name implies. Each queue has a parent, except for the root queue, of course. There's one root queue per interface; it's not a global one. And HFSC always operates with borrowing: each queue can borrow bandwidth from its parent if the parent has spare bandwidth, up until it hits its own maximum service curve, if it has one.

So, the plan — it's not a five-year plan this time. The plan was to use the existing core HFSC algorithm, including the code, with a few cleanups; to remove all the old ALTQ glue; and then there were three big parts to work on. First, pfctl, especially the parser — as said, getting the configuration right is the big challenge. Second, the PF parts, from the ioctls down to setting up the actual queues on the interfaces, and the classification, as in marking the packets. And third, of course, hooking the actual engine into the enqueue and dequeue macros and functions.

Hooking HFSC in isn't all that hard. We again modify our struct ifqueue; we're not introducing yet another ifqueue variant, we're modifying the one we have, by adding a pointer to the HFSC-specific stuff and a pointer to the token-bucket-regulator-specific stuff that I'll come back to a little later. If we're not operating HFSC on that interface, those are simply null pointers. The token bucket regulator, or TBR, controls how many packets get dequeued when — at which point in time. It does not do any bandwidth measurement itself; that's the job HFSC fulfills. And the dequeue functions simply look at that pointer: null pointer, call the classic dequeue functions; if there is a pointer, call the HFSC-specific functions.

Oops — I was here. Yes, the configuration again. As mentioned, the first attempt in 2003 failed. What we want: the common use cases should be as simple and straightforward as possible, and the more complex ones should still be possible, should still work, and should still be readable. As for working the syntax out: a bunch of developers sitting in their caves, that is not going to work out. For that, I totally believe in a whiteboard, or a piece of paper, with no computer at all. Beers optional — that helps. Oh, you can't read this again. This is an awesome piece from Japan.
It says: beer communication, a beer presentation for you. I love Janglish.

So, the queue setup and classification still happen in PF. There was a little debate whether the queue setup actually belongs in PF, because that's kind of iffy, but using something entirely separate is kind of strange too. So we opted for keeping that part in PF. The classification does not change at all; it remains what it is: you match packets by some criteria and "set queue" a queue name. As said, you can do this at any time, even on the inbound side. By the time the packet hits the outbound interface and the outbound queueing, if there is a queue by that name there, we'll use it; if it's not there, we'll just use the default queue.

So, that's the new syntax, worked out with a whiteboard and lots of beer. You define a root queue — a root queue is simply a queue that does not have a parent. It sits on a specific interface and it has some bandwidth. The children refer to their parent; that's exactly the opposite of the way ALTQ did it, where the parent listed the children, which you then had to specify afterwards. And why do I have the third one there? Because one of the queues has to be marked as the default queue: all packets that hit that interface and do not have a queue ID assigned to them, or have an invalid one — as in a queue that does not exist on that interface — go to the default queue. You can limit the length a queue can take, the maximum number of packets in that queue, just by giving it a queue limit. You can specify a burst time and burst bandwidth; this one here gets 100 megabits on average, but for 100 milliseconds it's allowed to use up 250 megabits. And you can apply minimum and maximum bandwidths: that one there has a maximum of 50 megabits, is allowed to burst to 100 megabits for 50 milliseconds at most, but on average it only gets 30. [Question from the audience.] Basically, bursting is allowed when the queue is idle. It's not exactly "queue length zero", but it's pretty much that; figuring out whether it's idle is not as simple as looking at the queue length. But basically that's it: once you go back to idle, you're allowed to burst again. I'm coming back to this in a little bit.

And as I said, the queue assignment remains simple: you just do "set queue foo". There's a nice trick we have for prioritizing empty TCP ACKs — that's ACKs without payload; that's important, without payload, because otherwise this could be abused — and packets marked with type-of-service low delay. You can specify two queues; in that case all normal packets go to the queue foo, and the empty ACKs and the low-delay packets go to the queue bar — see the sketch below. Daniel Hartmeier figured this out almost ten years ago. On a typical DSL line, where your uplink is much slower than your downlink, he was running a big download and tried to do interactive SSH work at the same time, and the SSH session would lag like mad, even though he prioritized the SSH packets. Why? Because the ACKs for his download were clogging up his tiny uplink bandwidth. So that, one, slows down the interactive traffic, and two, slows down the download. If you prioritize those ACKs, your download stays fast, and it basically does not interfere with your regular uplink use, because those packets are tiny. That's what this is for. And fortunately, SSH marks the packets for interactive work as low delay.
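Putting the pieces just described together, a configuration in the new syntax would look roughly like this. The interface, names and numbers are made up for illustration, and the keyword spelling is as I recall it — treat it as a sketch:

    # root queue: no parent, sits on an interface, has the link bandwidth
    queue rootq on em0 bandwidth 100M

    # children refer to their parent; one queue must be the default
    queue std  parent rootq bandwidth 40M default
    queue bulk parent rootq bandwidth 30M max 50M burst 100M for 50ms qlimit 100
    queue pri  parent rootq bandwidth 20M min 10M

    # classification is unchanged -- and with the two-queue form, empty
    # ACKs and ToS lowdelay packets go to the second queue
    match out on em0 proto tcp set queue (bulk, pri)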
And if you're doing bulk transfers, like SCP, it marks them as — not "normal", what's the other one? It's not "throughput"... well, there's another term internally; I keep forgetting it. Whatever. Basically meaning: this is a bulk download.

So, status and outlook. The simple priority queue is committed. It was even in 5.1, but the syntax wasn't final there; I changed it after 5.1. The syntax you find in 5.2, which is about to be released any day, is the final syntax. Final as in: for the current five-year plan, it's final. From 5.2 on, for five years, this is final; I'm not saying anything beyond that.

The bandwidth shaper basically works — on this laptop. So don't steal it. I have backups: I invented crowd computing. After losing a diff because I was being stupid, I started doing SMTP backups. Basically, I go to the directory I'm working in, cvs diff, and pipe that into mail to a random developer with the subject "you are backup". Works perfectly fine. That is crowd computing.

So, the status. The root queue is totally weird; it basically doesn't work, and neither Kenjiro, who did the research in that area, nor me really understand why yet. The children work fine. So, revisiting what ALTQ was doing: the root queue in ALTQ was hidden. When you enabled ALTQ on interface foo — well, that's bad — when you enabled ALTQ on em0, there always was a root_em0 queue, but it was invisible, and no traffic was ever assigned to it. So I kind of think we might go for the same idea, have an invisible implicit root queue, and be done with it. But, well, we'll try to figure this out a little more; that decision has not been taken yet. This is almost the showstopper right now. That's the outlook — for tonight, almost.

The ability to watch the queueing in action is written and works; that's pfctl -vvsq. Of course, there is no documentation but this presentation yet. There are several consistency checks missing, so you can easily have a 100 megabit interface with a 100 megabit root queue and children summing up to 10 gigabits; of course, the algorithm will not do what you expect in that case. So yeah, there needs to be some checking.

The performance should be much better than ALTQ. I have not actually measured this yet, but I know it is. One of the biggest costs, surprisingly, is the constant calls to microuptime(). Obviously, HFSC needs very good timing, high-resolution timing, and reading high-resolution timers is surprisingly expensive. I will not attack this before I commit; this is not a change from the current situation, and it's kind of a self-contained problem. We're looking for solutions to it, and it's probably going to be a solution that is more generic and not just for the queueing.

I still want to get this ready for 5.3, but since I ran into these problems with the root queue, ugh, it might get tough. So it might be 5.4. No surprise that the current five-year plan doesn't work out exactly.

For the transition from ALTQ to the new system, I want to leave both in, in parallel, for some time, so you don't have to do the transition to the new one immediately when you do the release upgrade. But unfortunately, some keywords clash, and I don't want to use really bad keywords for the new subsystem just to maintain compatibility with the old one. So we have to rename some of the keywords that ALTQ is using right now.
Foremost, that's the queue keyword. So if you want to continue using ALTQ for a release or two, you have to change your queue definitions to "oldqueue". Pardon? Oldqueue. "Are you trying to resurrect the switch-pf-to-German diff?" Shut up and hack, then you do it. The idea, of course, is to get rid of ALTQ eventually. Having those systems in there in parallel makes certain areas of the network stack very hard to follow — especially, once again, if.h. And once we get rid of ALTQ, there's so much cleanup possible in if.h. Then it might be understandable by more mortals, and I might be able to work in there without wearing out my brace keys all the time.

Any questions?

[Question.] Queueing without configuration is the question, basically. That probably comes down to some form of fair-sharing algorithm. And leave me alone with the bufferbloat — no, I'm not saying that here. I'm not into that; I don't see the point, to be honest. I do think you want to do priority queueing all the time, as said, to prioritize the important stuff. If you want to do any bandwidth shaping, you should do so explicitly. And last but not least, all this queueing stuff is only really effective if you're short on resources, be it CPU or be it bandwidth. As long as you have plenty of bandwidth and plenty of CPU or system resources available, it just doesn't matter, right? You're sending everything out pretty fast anyway; the delay is going to be in the network, not in your machine or the directly connected segment of the network.

[Comment from the audience.] The point there being: the idea is that if a packet has been queued for, let's say, 100 milliseconds, it's better to throw away some packets at the front of the queue — that's right, instead of the back of the queue — simply to get the signaling out: packet loss is a signal to TCP, and this way it arrives faster, a whole queue earlier in the round trip. So that's basically RED on steroids. Well, yes, and computationally it's much simpler than RED, you don't have to scan the queues and so on, yes. RED, by the way, is Random Early Detection: the queueing system tries to figure out early that it's about to run its queues full and starts semi-randomly dropping certain packets early, so that TCP scales back. I have commented out all the RED bits for now, because that diff is way too big already; I want to keep this in manageable pieces. I failed at that, but if I put RED in right now, it's going to be a total nightmare. So we might opt for not going for RED, but implementing that instead. I have not really spent time on it yet, so I don't know. "You should have a look at it, because it's..." Sounds interesting, yeah. And the logic is not at enqueue time but at dequeue time: the dequeuer towards the interface starts dropping packets if they have been in there too long. So you simply timestamp the packet the moment you put it into the queue — a rough sketch of that idea follows below. Oh, timestamping is expensive. "Well, you can use the TSC or something like that." You don't have to convince me anymore; you got me when you said it's simpler. That means less work for me. Awesome.

Peter keeps picking up my groundwork and polishing it to the point where it's really usable by the vast majority of users. And I really like that. Thank you, then.
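The head-drop idea from that last exchange, sketched out. The helpers here — fifo_dequeue(), enqueue_stamp(), cheap_timestamp_ms() — are hypothetical names, and none of this is committed code; it's just the shape of what was discussed:

    /* drop-from-the-front on dequeue: stamp packets at enqueue time and,
     * when dequeuing, throw away anything that sat in the queue too long,
     * so the loss signal reaches the TCP sender a full queue earlier */
    #define MAX_QDELAY_MS   100     /* threshold from the discussion */

    struct mbuf *
    dequeue_headdrop(struct ifqueue *ifq)
    {
            struct mbuf     *m;
            uint64_t         now = cheap_timestamp_ms();  /* e.g. TSC-based */

            while ((m = fifo_dequeue(ifq)) != NULL) {
                    if (now - enqueue_stamp(m) <= MAX_QDELAY_MS)
                            return (m);     /* fresh enough, send it */
                    ifq->ifq_drops++;       /* stale: drop at the head */
                    m_freem(m);
            }
            return (NULL);
    }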