I'll be ready. Okay. Do you have your pointer? Is the mic good? Okay. Hello, everyone. All right. Thank you for being here. This is the last talk of the day, pretty late on a pretty warm day in Austin. I personally would like to be out sitting with a cold beer, but I have to do this. I don't know what your excuse is. I'm just kidding. I'm very happy that you're all here. We're a minute behind, so let's just get going.

A brief introduction. My name is Vinay. I work with the Seattle Cloud Lab of Futurewei Technologies. I'm an architect there. My interests are on the Kubernetes, scale-out computing, and networking side of things. That's where my focus has been for the past few years. I'm also the latest, newest maintainer of the CNCF CNI-Genie project. We're hoping to bring some of the new ideas we'll be talking about in the Mizar project today into CNI-Genie as well. I'm also a ski instructor. I teach little kiddos how to ski, which essentially is bribing them with gummies and then asking them to do things, and they do it for you. It's a good currency. It works. If you're ever on a ski slope and you see a kiddo come racing down the slope and knock you over, it's probably my fault. I'm going to have Fu introduce himself.

Thanks, Vinay. My name is Fu. I'm a software engineer, also at Futurewei. I mainly work on Kubernetes and networking there. I have a miniature Australian shepherd who is about five years old and a ragdoll cat that's about three years old. I was actually told recently that the cat really misses me and has been sleeping on a shirt I left on the couch for the past week or so. We'll see what happens when I get home. With that, I'll pass it back to Vinay for our first agenda item.

Okay. Let's see if we can get this working. So, the agenda for today: we're going to look at what QoS means for network traffic. Fu is going to talk a little bit about eBPF and XDP at a high level. I think a lot of people already know a lot about this, but we want to do an intro of it to segue into how we use XDP in Mizar. Then we'll look at the QoS use case that we have, the specific thing around which we built our solution, talk about our design, dive a little deeper into the details, and look at how things work with our solution. We're going to try a demo, fingers crossed. Just before this talk started, things had crashed. Not our fault. It's the MacBook Pro's fault. It's a really nice beast, but it has some instabilities. But it's up and running now. Let's hope we don't have to use the video. Then we're going to look at some of the results, discuss some of the next steps we have planned for the coming year or so, and then open it up for Q&A.

QoS, well, it's a fact of life. We experience it around us every day. Take, for example, emergency response. The sheriff's car gets priority access to the streets, the traffic lanes, and the traffic signals when he's responding to an emergency, compared to the tow truck that's towing away somebody's old jalopy. That's because the sheriff's business is important business. Responding to an emergency is important business. Those fresh hot donuts aren't going to eat themselves. I'm pretty sure I'm going to get a jaywalking ticket tonight. Another example is HOV lanes. There are incentives for driving green and carpooling. Lastly, take for example how I got here. I took a flight from Seattle to Austin, and when I did that, I experienced some QoS back there in the cheap seats. I got an extra bag of peanuts, so it wasn't that bad.
Well, let's dive into what QoS can mean for network traffic. In a mapping similar to real life, not all network traffic is born equal. We have real-time applications like Zoom and FaceTime, applications we've become very familiar with over the last two years. They have certain characteristics and certain requirements from the network: low latency, low jitter, and low loss. They need some bandwidth guarantees. It wouldn't make for a great experience if they were losing a lot of packets. Then we have streaming video and audio traffic, YouTube, Netflix, also something we're very familiar with after the last two years. They're a little more tolerant of losses because they tend to buffer traffic at the receiver end before starting playback, but they do need some bandwidth guarantees for the duration of the time that the application is playing your video or audio. And lastly, we have data traffic: file backups, log backups, emails. These are important traffic, but it's okay if they don't get there now, now, now. They can tolerate a little bit of loss, and as long as they arrive within a reasonable time, it's all about capacity planning.

So now I'll hand it back to Fu for an overview of eBPF and XDP.

Sure. Thanks, Vinay. So, eBPF. There have actually been quite a few talks in the last couple of weeks about eBPF, but I'm just going to give a very brief intro. eBPF is a technology that allows you to safely run sandboxed programs in the kernel. The current use cases for it include networking, security, and observability. It's actually deployed in production today at many big companies, including Cloudflare, Netflix, and Facebook, and there's work in progress at Microsoft to bring eBPF to Windows. Some example use cases in production include packet inspection, performance analysis, security monitoring, and networking, like a firewall.

eBPF programs are event-driven. They can attach to kernel hook points and execute code when an event happens. Furthermore, instead of having to call numerous kernel functions, eBPF allows for easy access to the kernel via a fixed, stable API called the eBPF helpers. These programs also usually come in pairs: a program in user space can modify or push information into eBPF maps, and those maps can then be accessed by another program running in kernel space. The safety of eBPF programs is guaranteed by the verifier. It does this by disallowing certain things, such as unbounded loops, and by putting a limit on the number of instructions. In this picture right here, we have an example of an eBPF program monitoring the return of an exec syscall.

On to XDP. XDP is a kernel hook point on the RX path to which you can attach an eBPF program. This hook point happens before SKB allocation, and it allows for extremely efficient packet processing. It's not a kernel bypass, so we can still rely on the safety features provided by the kernel. The main thing we're using XDP for is to process packets, modify them, and then send them out a different interface or something like that.
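To make the user-space/kernel-space pairing concrete, here's roughly what that slide example, an eBPF program firing on the return of an exec syscall and publishing what it sees through a map, could look like. This is a minimal illustrative sketch, not the talk's actual slide code; the map name and layout are ours.

```c
// Minimal sketch: count exec() returns per process in a shared map.
// Build with: clang -O2 -g -target bpf -c exec_count.bpf.c -o exec_count.bpf.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);    /* PID */
    __type(value, __u64);  /* number of exec() returns observed */
} exec_count SEC(".maps");  /* user space reads this half of the pair */

SEC("tracepoint/syscalls/sys_exit_execve")
int on_exec_return(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 one = 1;
    __u64 *count = bpf_map_lookup_elem(&exec_count, &pid);

    if (count)
        __sync_fetch_and_add(count, 1); /* event-driven: runs on every exec return */
    else
        bpf_map_update_elem(&exec_count, &pid, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A user-space program would open the same map, for example via libbpf's `bpf_map_lookup_elem()` on the map's file descriptor, or by pinning it under /sys/fs/bpf, and poll it. That's the pairing Fu describes: user space pushes or reads, the kernel-side program acts on every event.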
We do this using the five actions that XDP provides. These actions include XDP_TX, which sends the packet right back out the same interface on which it was received; XDP_REDIRECT, which sends it out another interface; XDP_PASS, which passes it up to the network stack; XDP_DROP, which just drops the packet; and XDP_ABORTED, which also drops the packet, but is the action you normally use when there's an error in your program.

Now on to our project, Mizar. Mizar is a container network plugin for Kubernetes that utilizes eBPF and XDP. It provides multi-tenant networking via Arktos, which is a Kubernetes fork. We use Geneve as the overlay encapsulation protocol. It's similar to VXLAN and GRE, except that it has a variable-length option header. We utilize these option headers and BPF to implement multiple VPCs and subnets, a network fast path, and load balancers. This is a very basic diagram of the Mizar architecture; the node architecture, actually. The Mizar node agent, which we call the transit daemon, is a user-space program that's responsible for pushing information into the BPF maps from user space. We have two programs that run in kernel space: the transit XDP program, which runs on eth0, and the transit agent, which runs on the veth interface of the pair for the container. The main XDP program on eth0 is responsible for RX, and the one on the veth pair in the root namespace handles the pod's packets on TX. With that, I'm going to pass it back to Vinay to explain how we do QoS in Mizar.

Thank you for the intro. Next, let's take a closer look at the enforcement points of QoS. When you have your device and you connect to some kind of a service, maybe it's your Netflix app, what happens is that packets traverse back and forth between the server process and your device's process, through a series of network switches and routers. Each of these switches and routers, including the ingress and egress points, is an enforcement point for QoS. On the egress, we can use differentiated services, an internet standard that's been around for a while. It's a way of tagging packets and using those tags to classify packets into different priorities: real-time, high, medium, low, or best effort. The techniques used for doing this are priority queues, token buckets for rate limiting, and weighted fair queues. These are all facilities that have been available for some time in the Linux kernel as well as in network switches and routers; they understand these service classes. Then, when the packets reach the ingress, if the receiver application is not able to handle the rate at which the packets are coming, they can be rate-limited or dropped, or policed against a policy that defines what's allowed for them. It's typically not the best thing to drop packets at the ingress just because of some number, because the packets have already traveled all the way from the source to the destination and used up the network's resources. So unless there's no other option, we generally want to avoid that. For that reason, in Mizar, we haven't considered ingress QoS. We have focused on the egress side of things.
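As an aside, the "tagging" half of differentiated services is just a six-bit DSCP field in the IP header. Here's a minimal sketch of stamping it from an eBPF/XDP program, with the IP checksum fixed up incrementally per RFC 1624; this is illustrative only (Mizar sets the DSCP on the outer overlay header, as described later), and the DSCP value is just an example.

```c
// Minimal sketch: stamp a DSCP code point into an IPv4 header at XDP.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

static __always_inline void set_dscp(struct iphdr *ip, __u8 dscp)
{
    __u16 old = *(__u16 *)ip;          /* first 16-bit word: ver/ihl + tos */
    ip->tos = dscp << 2;               /* DSCP lives in the top 6 bits of tos */

    /* incremental checksum update (RFC 1624): HC' = ~(~HC + ~m + m') */
    __u32 csum = (__u16)~ip->check + (__u16)~old + *(__u16 *)ip;
    csum = (csum & 0xffff) + (csum >> 16);
    csum = (csum & 0xffff) + (csum >> 16);
    ip->check = ~csum;
}

SEC("xdp")
int mark_dscp(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    set_dscp(ip, 46);                  /* 46 = EF, expedited forwarding */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

Once the code point is in the header, every DiffServ-aware switch and router downstream can act on it, which is what makes the end-to-end story possible.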
Now let's go to the next slide and talk about the use case. So consider this. We call it a sender VM because it's hosting two Kubernetes pods that are sending packets out to different destinations. One of them is a payment processing pod that's handling, say, credit card swipes from the checkout counter at a store somewhere, and authorizing them. And then the other one is a backup pod that's backing up logs or files or something. Clearly, one of these pods has more important things to say, but they're competing for the same egress bandwidth. We consider the payment pod a high priority pod because it's bringing in money, but in the interest of plain fairness we might end up dropping its traffic all the same. What that does is manifest as a long checkout line, because people are waiting for their cards to go through and it's not happening, and somebody at the end of the line might get frustrated, drop their cart, and leave. That's money lost for the business. Not a good thing. But we can do better, and in the next few slides we're going to see how.

So now let's look at the implementation. This is the design we came up with for Mizar. It looks like there's a lot going on here, but in reality it's fairly simple. It's a two-step process, so let's break it down. We came up with three traffic classes: premium, expedited, and best effort. In that order, premium is the highest and expedited is the next highest traffic class, followed by best effort. Going back to that example where we have two pods, one doing backup and the other doing payment processing, what we want to see is how we can give the payment processing pod's traffic a higher priority, more privileged handling.

It's essentially a two-step process. The first step is classification, which is figuring out which pod has the higher class of traffic. We do that by looking at eBPF maps. The eBPF map contains a mapping from the source IP of the pod, the pod IP, to the class of traffic. This is written down by the transit daemon in Mizar. It looks up the pod's priority from Kubernetes, using annotations on the pod, determines what the priority of that pod is, and programs the pod's class into the eBPF map. When a packet from the sending pod arrives at the veth pair, there's an eBPF program, an XDP program really, running there, and it looks up the map by pod IP, so it knows what class of traffic this is. Now we know that this pod belongs to the premium class and is sending premium class traffic. That's great. And this other pod here is sending best effort traffic, let's say.

The next step is action. What action do you take? How do you route them? We have choices, right? We can use the XDP_REDIRECT action, which pretty much takes the packet and deposits it straight into the TX queue of eth0 here, and it gets sent out as quickly as possible. That's the fastest path we can have in this architecture. The other option is to use XDP_PASS and send the packet to the host network stack, where an sk_buff is allocated for it and we get priority queuing and scheduling. For best effort traffic, we also use rate limiting. Now we're able to classify the packets and take different actions based on the class of the packet. This essentially gives us different levels of service we can offer to the pod users: we classify the traffic, prioritize it, and offer QoS at the egress. There's one more thing we do once we've determined the classification of the packet: if it's egressing the node, the overlay packet header has an IP header, and there's a DiffServ field in it, and we set the DSCP code point there.
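A sketch of that two-step classify-then-act flow on the pod's veth might look like the following. The map layout, class values, and the eth0 ifindex are illustrative assumptions; Mizar's actual transit agent does considerably more.

```c
// Sketch of classify-then-act at the pod veth (illustrative, not Mizar code).
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

enum { CLASS_PREMIUM = 0, CLASS_EXPEDITED = 1, CLASS_BEST_EFFORT = 2 };

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);   /* pod source IP, programmed by the transit daemon */
    __type(value, __u32); /* traffic class */
} pod_class SEC(".maps");

SEC("xdp")
int classify_and_act(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;
    __u32 *cls;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Step 1: classification, keyed by the pod's source IP. */
    cls = bpf_map_lookup_elem(&pod_class, &ip->saddr);

    /* Step 2: action. Premium takes the fast path straight into
       eth0's TX queue (ifindex 2 is an illustrative placeholder). */
    if (cls && *cls == CLASS_PREMIUM)
        return bpf_redirect(2, 0);          /* XDP_REDIRECT */

    /* Everything else goes through the host stack, where priority
       queuing, scheduling, and (for best effort) rate limiting apply.
       The other verdicts, for completeness: XDP_TX, XDP_DROP, XDP_ABORTED. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

The DSCP marking on the outer header would happen along this same path before the packet leaves the node, in the style of the earlier marking sketch.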
Once we set that code point, the DiffServ code can be recognized by the switches and routers along the way, as I mentioned earlier. That allows us to bring end-to-end QoS to the pod traffic.

There's a dynamic rate limiting feature that we introduced for best effort traffic. Premium traffic generally behaves well; it has a certain rate at which it sends, so it doesn't really need to be rate limited. Your streaming application knows that it needs a certain data rate, and it sends at that rate. Best effort traffic tries to send as much as possible, so we need to implement some kind of rate limiting in the presence of premium or expedited traffic. We do that with a bandwidth monitor thread that lives in the Mizar transit daemon. What that bandwidth monitor thread does is, periodically, every second, it looks up the TX stats, which contain how many bytes were sent by, let's say, this best effort pod and how many bytes were sent by the premium pod over the last few seconds. Based on that, it averages and says, okay, there's increasing premium traffic; I need to clamp down on the rate allowed for best effort traffic, and it writes that rate limit to the eBPF map. The rate limit programmed there is picked up by another eBPF program that's hooked to eth0, the egress NIC, and there we enforce EDT-based rate limiting.

EDT, earliest departure time, is a simple algorithm for rate limiting packets. You can think of a train station: trains have departure times, and the closer together the departure times, the more trains get out and the more people get out of the station. In the same way, the closer together you timestamp the packets, the more packets get out. Since we've classified the traffic, we know which packets are best effort, and we can timestamp them closer together or farther apart, and that controls the rate at which the best effort traffic gets delivered to the network. So this is how we achieve rate limiting, and it's dynamic: the bandwidth monitor periodically measures how much high priority traffic there is and, based on that, adjusts the rate limit higher or lower. That allows us to give premium and expedited service to the higher priority pods and yet use the bandwidth when it's available.

So what did all this do? It gave us choices. In our case, we have a notion of traffic class and a notion of traffic priority. To put it in a table: we have premium, expedited, and best effort, which I mentioned earlier, and within each traffic class we can have high, medium, or low priorities. These are treated differently based on the DSCP code, the differentiated services code point, that is applied to those priority levels, and that's the QoS the traffic gets in the network. And not only in the network; at egress as well, the priority queuing takes care of prioritizing, say, expedited high above expedited medium, and so on (premium doesn't need it so much). So, as I mentioned earlier, we have enabled end-to-end QoS by using DiffServ.
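To make the EDT idea concrete, here's a minimal pacing sketch. One caveat up front: XDP has no sk_buff or timestamp to set, so a common way to implement EDT on mainline kernels is a TC egress program paired with the fq qdisc, which is what this sketch assumes. The map layout is our own illustration, with the rate slot standing in for what the bandwidth monitor thread would write.

```c
// Minimal EDT (earliest departure time) pacing sketch at TC egress.
// Assumes the fq qdisc is installed on the NIC to honor skb->tstamp.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define NS_PER_SEC 1000000000ULL

struct edt_state {
    __u64 rate_bps;    /* current limit, written by the bandwidth monitor */
    __u64 next_tstamp; /* departure time of the next best effort packet */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct edt_state);
} edt SEC(".maps");

SEC("tc")
int edt_pace(struct __sk_buff *skb)
{
    __u32 key = 0;
    struct edt_state *st = bpf_map_lookup_elem(&edt, &key);
    __u64 now, delay;

    if (!st || !st->rate_bps)           /* no limit set: let it through */
        return TC_ACT_OK;

    now = bpf_ktime_get_ns();
    /* how long this packet "costs" at the current rate */
    delay = (__u64)skb->len * 8 * NS_PER_SEC / st->rate_bps;

    if (st->next_tstamp < now)
        st->next_tstamp = now;
    skb->tstamp = st->next_tstamp;      /* fq releases the packet at this time */
    st->next_tstamp += delay;           /* single-CPU sketch; real code needs atomics */
    return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";
```

The dynamic part Vinay describes is then just the transit daemon's monitor thread rewriting `rate_bps` every second based on how much premium and expedited traffic it observed: higher when the link is idle, lower when premium traffic ramps up.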
Well, all this is great, but how does the user tell us what they want? We go back to the good old annotations that Kubernetes allows us to attach to pods. We created a couple of annotations, mizar.io/network-class and mizar.io/network-priority. In this case, what we're looking at is an expedited medium class pod. Fairly simple, isn't it? Well, this is great for our implementation; we understand it, but other implementations may not, and we want to see if we can have a more formal API in Kubernetes for this. We'll be starting up something for that, hopefully sometime soon.

With this, let's, well, cross our fingers and try to do the demo. Fu is going to do it.

All right, thanks, Vinay. Let me just go over the demo setup before we get started here. On a single VM we'll have three senders representing three different traffic classes. The first class is best effort; you can think of this as low priority traffic that can be completed at any time, things such as cloud backups or emails, as Vinay mentioned before. The expedited class is next, and this represents somewhat important traffic that can be annoying to the user if it's throttled; this is like video streaming, like when you're watching something on Netflix and it gets all blurry all of a sudden. And finally, the most important of all, the premium class: things that are really high priority, such as bank transactions, where it can be big trouble if something goes wrong. The demo itself will consist of two parts. In part one, I'll have the three senders send traffic simultaneously, and we'll see how the traffic is divvied up between them. In part two, we'll have the best effort traffic start first, then start the premium traffic, and see what happens from there.

All right, let's get started with the demo. We have three colors here: green represents premium traffic, yellow is expedited, and white is best effort. The receivers are all set and listening. I'm going to go ahead and run a little traffic test to make sure they can receive traffic, but they're still waiting for my signal to start. So everything is connected, everything works, the pings reply. Now I'm going to create a file as the signal to start the traffic. All right, I'm not sure if you can see the numbers here, but on the right side we can see the bit rate of the premium class is around 800 to 900 megabits per second on average, the yellow, expedited, is around 800 on average, and the best effort is around 400 on average, as expected. We can see the actual averages spit out by iperf right about now: premium is 857, expedited is around 790 on average, and best effort is around 442.

Okay, on to the second part of the demo. Here I'm just going to clear these screens and start the servers back up. The best effort I'm going to run for 60 seconds; we're just going to do the best effort and the premium, to show the stark difference between the two. So we see that the best effort is getting the full bandwidth right now, 1.3 gigabits or so. And now I'm going to start the premium, and let's see what happens. So the premium takes over, it gets the full bandwidth, 1.4 gigabits, and the best effort drops back down to 100 or so megabits per second. And that concludes the demo. I'll pass it back to Vinay to show us some graphs and pretty colors.

Thank you, Fu. So, holding the fingers crossed worked in this case. What you're looking at is essentially the best effort traffic being throttled by the presence of premium traffic.
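Going back to the annotation interface from before the demo: a pod spec carrying the two annotations might look like the sketch below. The exact key and value spellings here are our reconstruction from the talk, not a verified API surface.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: video-streamer
  annotations:
    mizar.io/network-class: "Expedited"    # Premium | Expedited | BestEffort (assumed spellings)
    mizar.io/network-priority: "Medium"    # High | Medium | Low (assumed spellings)
spec:
  containers:
  - name: app
    image: example/streamer:latest
```

The transit daemon would read these annotations off the scheduled pod and program the pod-IP-to-class entry into the eBPF map, as described earlier.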
And when the premium traffic is done, we're getting that bandwidth back for the best effort traffic, right here. That's the dynamic rate limiting at work.

So, well, that's good. Now you can crash. It actually crashed right before this, for some reason. Yeah, it did. That's the funny part. Back to the presentation. All right, you can open your eyes now.

Okay, we're going to present the results in the form of a table, which makes them a lot easier to understand. This was a previous experiment where we saw that, with QoS, the premium traffic got more bandwidth than the expedited, followed by best effort, which was clamped down to around 10 to 20%. And we ran this experiment with QoS turned off and got this, which is anybody's bet at that point. Best effort happened to get more in that case; it could just as easily have been premium. It's anyone's guess. The second part of the demo that we saw: what you see there in the red-orange line is the bandwidth usage of the best effort traffic. We were at 1.2 gigabits or so up until about 10 seconds in, where we kicked off the premium traffic. The rate limiting kicked in and pulled the rate limit all the way down to 10%, and the premium traffic got all the bandwidth it needed. This is a very crude experiment, we were using three UDP streams, which is not how it's supposed to behave, but it kind of gets the point across. And the best part about all of this: to achieve CNI networking with Mizar, with XDP, how many lines of kernel code did we have to change? Zero. That's the magic of eBPF.

So, what's next? This is probably the most important slide, to me at least. You could say it's a laundry list of some things we wish for and some things we're working on. A TX hook for XDP: this would be nice to have, because currently we're using an XDP program attached to the veths to do the TX processing, which isn't exactly performant. If we could minimize its footprint, have it do very little, and have most of the work done on the TX side in driver mode, or better yet in offload mode, it would really supercharge the data path for us and make the solution really compelling. Another thing that I briefly touched upon before is the API enhancement for Kubernetes. We're seeing an increasing demand for a better way for users to say "we need this much bandwidth for our pod's traffic," or in our case, "we need these QoS levels." So we're looking to make a proposal for a formal API change to Kubernetes; we'll see how that goes. CNI-Genie: as I mentioned, I took it over, and we're looking to bring some of the XDP-related innovation that we have in Mizar into it and make it an appealing solution for multi-homed networking. Today we have CNIs out there, but we don't know how well their control planes perform. For example, when a pod lands, how long does it take for the CNI to get your pod network-ready? Which really means: okay, you got an IP, that's great, but is that IP reachable? Can that IP reach all the services, the other pods, and the external network outside the cluster? Those are questions we need answers for, and we're doing some work related to that. And Mizar itself has some things in the works, like consistent hashing and distributed hash table implementations for true scaling and fault tolerance; these are things we want to do at some point. But the bottom line is, the way I see it, Thomas Graf and a posse of kernel hackers have done a great job, a tremendous job, bringing eBPF mainstream.
Now it's up to us to help carry the torch and realize the full potential of the eBPF technology. So with that, I'll conclude this demo and this presentation and open it up for Q&A. Thank you. We have some links about the projects we're working on shared in these slides; the slides are available online, so please look them up, and if you have any questions offline, please bring them to us. So, questions?

Yeah, hi, thanks, good demo. A quick question here. For now, I understand from your demo that you have a couple of daemons that run on the nodes, and these annotations need to be specified in the pod spec when you're creating it. Is there any kind of management plane or anything you're looking at to auto-manage this? Because if you have, say, hundreds of thousands of pods, it will be hard to tag each one of them; it would be easier if there were a way to centrally manage that. And second, have you looked into a service mesh like Istio? If I'm using Istio, how does this interact with it, or is this a parallel technology?

Okay, so let me take that one by one. The first question was whether there is a management plane. Yes, the control plane. There is a control plane, which we wrote specifically for Kubernetes. It's based on an operator that observes the pods and then talks to the daemon running on the individual nodes. We took this design, which isn't exactly based on the watch mechanism that's popular with Kubernetes, but it has allowed us to do one thing that a normal Kubernetes-compliant implementation wouldn't, which is implement multi-tenant networking for another project we have called Arktos, a scale-out architecture. There was a talk earlier this week by Wang Ying and Dr. Shuang, my boss.
That's where the motivation to start Mizar came from: there was no solution out there that could provide multi-tenant networking, so we started Mizar based on that. And regarding the second part, could you please repeat the question? I don't believe there's a lot of interaction, but I just want to make sure I got the question right.

No, I mean, with Istio we have the concept of a sidecar.

Yes. So, Istio, pardon me, but I don't know the full details of the Istio world. I know it has a sidecar model that allows you to connect across services, and we've been doing some work to bring alternatives to Istio that don't need the sidecar as much. If your question is whether it can manage QoS on the pods: perhaps it can, but the QoS tagging has to be done so that the operator sees it first, at least in our current design. Once the pod is scheduled and the eBPF map is programmed, we don't have a way to change the QoS. Does that answer your question? The current interface that we offer for QoS is via the pod annotations.

So again, I'm imagining there will probably be some way to automate the annotations, right? Because of scale, because mapping it on a per-pod basis is hard; the annotation is more of a deployment artifact.

Yes. Our take on that is that the user, the person deploying the pod, knows what kind of pod it is and what traffic characteristics it desires, so based on that they'll judiciously choose the class and the priority. If you make everything premium high, then they all get the same QoS; it's as if there were no QoS at all. So we trust the user to make the decision: hey, this pod is processing payment traffic, so it needs to be premium high, and this other pod is just doing backups, so it doesn't.

Sorry to add on to that: something like OPA or a mutating webhook might be a great way, operationally, for a cluster admin to take other metadata and apply those pod annotations. Just because it has to be on the pod doesn't mean developers have to put it there. Just a thought, to answer his question.

Yes. So the developer, the person deploying the pod: as you know, today you have standard Kubernetes resource specifications for CPU and memory. You know what your pod requires, the minimums, and the limits that you don't want it to exceed, and you specify that; nobody else would be able to tell. I mean, there are ways; one other project that's going on is to modify those and scale them up and down without killing the pod, but that's a different discussion.

Hi, thanks so much for the detailed diagrams, great demos as well. Could you help me understand what benefits or differences this eBPF/XDP redirect and queuing system has, what the tradeoff is, versus cgroups?

Pardon me, could you please repeat the last part?

So, there's a cgroup controller called net_prio. I've never thought much about this problem from a container scheduling perspective, but is there not a way to configure net_prio from a Kubernetes pod, or have you compared this against net_prio at all?
No, we have not done any comparison with net_prio. I don't know what net_prio is, to be honest; we'll have to look that up, but if it's something we can leverage, then certainly, yes. Cgroups, as far as I know, operate at a container level. All the containers in a pod currently share the same network namespace, and we operate at that network namespace level, so all containers in the pod belong to the same QoS class. Cgroups might be on a per-container basis, and there might be something in the design there that needs further consideration. I can't really answer your question, but I have a question: namespaces and cgroups go together, right?

Right, so the process namespace gives you your own PID space, and then you can use cgroups to limit the amount of CPU or memory usage. The one of those that applies to the network namespace would be net_prio. I don't remember all the details, I was just looking at the docs; it gives you a number for ordering, so compared to other net_prio cgroups on the same machine you get an ordered queue priority. But I think there are some other things you're implementing, like it's bandwidth-aware, like what you have in your maps. I don't know if net_prio does that part, but maybe some of these things might be complementary.

I'll have to take a closer look at net_prio to better answer that, but from what you described, it sounds like it might complement what we're currently doing. This is an experimental prototype, and when we route traffic over to the network stack and have to prioritize for levels of service, the first answer was, okay, TC and priority queuing. If the cgroup net_prio is there and we can configure it, we might go with that instead. So that's work for future versions.

I don't hear a lot of people talking about solving this problem at all, especially in a Kubernetes context, but thinking from a Linux context, it feels like there are some pieces that could be combined. This is really cool, thank you so much.

I have a follow-up question, if someone else doesn't.

We have two minutes. Okay, Justin is asking: can we run this alongside Cilium? Well, probably not, but I might be able to tell you in a year's time. The idea is to take this technology and see if we can bring it into CNI-Genie, which is a project that allows you to run multiple different CNI solutions, and we think there are some advantages to this with the multi-tenancy part of it. We're still in the very early stages of looking at the design of that; we've just proposed high-level ideas to the CNCF TOC in the annual review that we sent out. There are only so many things we can focus on at one time; we're short on resources. So I hope that answers the question. The short answer is that hopefully in the future we might be able to run this alongside Cilium, but as of today, no.

For your demo, were there four virtual machines?

Pardon me? Were there four virtual machines in the demo? Yes. We took a local setup; we've got the M1 MacBook, it's a beast, it can handle it, but it sometimes freezes up.

The machine that was hosting the three competing QoS clients, was it running Kubernetes, or was this just... Kubernetes or Arktos and Mizar, or just Kubernetes and Mizar?

It's Kubernetes running Mizar. Okay, so we're out of time, but we can finish the answer here. The QoS sender here is one VM, and it's also the Kubernetes master VM.
We're allowing it to schedule user pods. The three receiver VMs are not running any pods in this demo setup; they're running iperf natively on the host, so that we don't have to deal with XDP in generic mode on the receivers, which would otherwise be an overhead. And they're in separate VMs so that they're not competing for the same receiver bandwidth. That's why we picked this demo setup. Thank you. Thank you very much.