This talk is going to be on Open vSwitch, by Aaron Conole.

Hi, everyone. I'm Aaron. This talk is on OVS debugging. It's going to be very terminal-oriented, with a lot of text, so sorry; I guess read your email if that kind of stuff bores you.

This talk is about debugging networking with Open vSwitch. I don't mean debugging the C code of Open vSwitch, so we're not going to do anything with GDB, but we are going to use some fancy OVS commands. We will talk about tracing packets, and yes, that does mean we'll be using tcpdump a little bit, but tcpdump is not the only tool you need to reach for when working with OVS. Finally, I'm not going to touch netfilter, the routing table, or any of those things. We'll get to why in a bit. If you have a problem and you think Open vSwitch and netfilter aren't playing well together, we will cover that, but I'm not going to talk about netfilter itself.

I've geared this talk toward two types of people: people who are writing SDN orchestration tools, and people who are in a support role. The most common things that come up are: packets don't go out, packets go out the wrong port, and performance is bad. Those are the big ones. Most recently, when we enabled support for running under SELinux, there was also "OVS doesn't start", but we should have solved that. Those are real OVS bugs.

All right, so how does OVS work? It's primarily two daemons. We have the OVSDB (ovsdb-server), which is the configuration database, and the vswitchd (ovs-vswitchd), which makes the forwarding decisions and runs the flow pipeline. There are some important commands that go along with them. ovs-vsctl is one of the most common ones. That's how you add ports, add bridges, and dump database information from the OVSDB.
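As a rough illustration of the ovs-vsctl usage just described, here is a minimal Python sketch that builds a few ovs-vsctl command lines and only runs them when explicitly asked. The subcommands `add-br`, `add-port`, `show`, and `list` are real ovs-vsctl subcommands; the bridge name `br0`, the port name `veth0`, and the helper names are hypothetical and exist only for this example.

```python
import shlex
import subprocess


def vsctl(*args, run=False):
    """Build an ovs-vsctl command line; execute it only when run=True."""
    argv = ["ovs-vsctl", *args]
    if run:
        # Needs Open vSwitch installed and, usually, root privileges.
        subprocess.run(argv, check=True)
    return argv


def basic_bridge_commands(bridge, port):
    """Commands to create a bridge, attach a port, and dump the configuration."""
    return [
        vsctl("add-br", bridge),            # add a bridge
        vsctl("add-port", bridge, port),    # add a port to that bridge
        vsctl("show"),                      # overview of bridges and ports
        vsctl("list", "Interface", port),   # full OVSDB record for one interface
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in basic_bridge_commands("br0", "veth0"):
        print(shlex.join(argv))
```

Keeping the default as a dry run makes it safe to paste into a notebook or a support ticket before anything touches a live switch.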
Another one that's important for debugging and diagnostics is ovs-appctl, which lets you send commands to specific OVS applications. You can run ovs-appctl commands against the DB, and you can run ovs-appctl commands against the vswitchd. Each of the daemons that is running has its own set of commands, and ovs-appctl is how you access them.

The OVSDB contains just configuration information: ports, bridges, interfaces, mirror information, that kind of stuff. It doesn't hold copies of packets, and it's not involved in the actual forwarding at all. It just says, this is the configuration. You can dump that information with ovs-vsctl show, ovs-vsctl list, and so on.

Sometimes the DB can contain what some people refer to as stale information. What that means is that someone has added some port configuration, say, for a port that doesn't exist. The DB does not enforce that you have a correct configuration. It's just like a configuration file where you can throw in whatever interfaces you want. The DB will let you put anything in it. So, beware.

The vswitchd is the other side of OVS, the forwarding side. It pulls all the configuration out of the database and makes sure the running state of the system matches what's in the database. It periodically cleans up flows that have been installed in any of the datapaths, and it makes sure that new flows that are required are inserted. That's basically all it does. We'll get to some other minor things it does, but for the most part it's just making sure things match the configuration that's been requested.

There are two important datapaths that the vswitchd cares about: netdev and netlink. Netlink is what we sometimes call the kernel datapath.
It's important to note that OVS runs on Windows as well as Linux, Mac, FreeBSD, and whatever else. Some operating systems, notably Windows and Linux, have support for this netlink datapath, and in that case the vswitchd will generally prefer to use it. We'll get to why in a second. We usually call that the kernel datapath. The netdev datapath is done entirely in user space. That means packets come into the vswitchd, and the vswitchd processes them and pushes them out as well.

What happens in a datapath is kind of simple: packet in, packet out. There are two paths. There's the fast path, in the kernel or, as we'll get to, in netdev. And then there's the slow path, which is everything the fast path can't do. When the fast path can move a packet, it does. When it can't, it falls back to the slow path. That's what we call an upcall. I actually like to think of it as a downcall, but they think of it as going up to user space. When a packet doesn't match any rule in the kernel flow table, it gets pushed into user space, and user space has to figure out what's going on.

There's no netfilter processing. The OVS datapath only does what you ask it to do. So if you want iptables processing and you add some iptables rules selecting on packets that are in your OVS bridge, you'll notice that those rules don't do anything. That's because the packet comes in, is processed by the datapath, and is pushed right out. There's no chance for the netfilter hooks to operate.
You would need to somehow deliver it to the local host: push it to some kind of local interface that has those netfilter hooks, maybe a veth device, maybe a tun device, something like that. Otherwise, OVS isn't going to call those hooks, and it won't push things out to conntrack, for instance, without you telling it to. So really, OVS tries to do the simplest thing possible, packet in, packet out, and gives you the building blocks to build what you want.

This picture just illustrates what I've been talking about. A packet comes in, and that packet is matched against the flow key table. A flow key is whatever metadata is associated with that packet: for instance IP source and destination, Ethernet source and destination, ports, what port it came in on, those kinds of things. If there's no key in the flow key table that matches, the packet is sent down to the vswitchd, or rather, they like to flip the picture and say it's sent up to the vswitchd. The packet is processed by the vswitchd and then pushed out, and at the same time that flow gets installed into the flow key table to match future packets that come in.

The netdev datapath is a little bit different, because there's no need for an upcall as such. It can also do some other things: it can take advantage of packet batching if that's possible, and it uses a whole bunch of caches. Maybe if we have time, we can talk about some issues around the caches. This is an illustration of what happens: a batch of packets comes in, is pulled off of a port, and is run through what's called the EMC, the exact match cache. There's actually another cache called the SMC, but we'll just call that part of the EMC.
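Stepping back to the kernel datapath for a moment, the miss-then-install behaviour described above can be sketched as a toy model in Python. This is not OVS code: `FlowKey`, `ToyDatapath`, and `toy_slow_path` are hypothetical names, the key is reduced to three fields, and the real datapath also handles wildcarded matches and flow expiry, which are left out here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical, much-reduced flow key (the real key has many more fields)."""
    in_port: int
    eth_src: str
    eth_dst: str


@dataclass
class ToyDatapath:
    """Exact-match fast path backed by a slow-path function (the 'upcall')."""
    slow_path: Callable[[FlowKey], str]
    flow_table: Dict[FlowKey, str] = field(default_factory=dict)
    hits: int = 0
    upcalls: int = 0

    def receive(self, key: FlowKey) -> str:
        # Fast path: the key is already in the flow table.
        if key in self.flow_table:
            self.hits += 1
            return self.flow_table[key]
        # Miss: hand the packet to the slow path, then install the result
        # so later packets with the same key stay on the fast path.
        self.upcalls += 1
        actions = self.slow_path(key)
        self.flow_table[key] = actions
        return actions


def toy_slow_path(key: FlowKey) -> str:
    """Stand-in for the vswitchd decision: forward between ports 1 and 2."""
    return "output:2" if key.in_port == 1 else "output:1"


if __name__ == "__main__":
    dp = ToyDatapath(slow_path=toy_slow_path)
    key = FlowKey(in_port=1, eth_src="02:00:00:00:00:01", eth_dst="02:00:00:00:00:02")
    for _ in range(3):
        dp.receive(key)
    print(f"upcalls={dp.upcalls} hits={dp.hits}")
```

Sending the same key three times costs one upcall and two fast-path hits, which is the whole point of installing the flow on a miss.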
That EMC is very small, but the idea is that it's very fast. I've tried to illustrate the cost getting a little higher each time you have to go to the next cache. If the packets don't match in the EMC, they're pushed on to the datapath classifier, and if they don't match in the datapath classifier, they go through ofproto processing.

In OVS, or rather in OpenFlow, everything is match-action. Fields like packet type and IP header information are what you can match on. You can also match on some metadata: what port the packet came in on, what bridge is in use, that kind of information. The actions are what to do with the packet. Jump to other tables, output to ports, push it over to whatever conntrack implementation, modify parts of the packet, drop the packet: those are all actions.

All right, so when do things go wrong? Open vSwitch never takes action unless it's been told to. The netlink datapath is simple: it just forwards packets, maybe it goes out to conntrack, but that's it. Netdev is a bit more complex, because it has those caches and it has to be involved in polling for packets and pushing them out, but really it's still just forwarding packets. And it really is software-defined networking. What that means is that when you have a problem with a packet moving, just like when you have a problem with a program running, most likely you told OVS to do something that you didn't intend. You told it to take some action and it's taking that action, but the result is not what you expect. Usually that's not a fault of OVS: you've told it what to do, and it's carrying it out.

Orchestrators misconfigure things; we see this a lot. Things like adding ports and then forgetting to delete them because of race conditions internally.
Or adding improper flow rules that forward packets all over the system and create loops. Bad port parameters: setting queues up incorrectly, setting priorities incorrectly, or binding queues to specific CPUs incorrectly. Failing to restore flows after OVS restarts: some orchestrators don't detect that OVS has had a fault, crashed, and come back up, and then your system has no flows and isn't going to process anything anymore. And failure to observe faults in OVS.

It's important to remember that upstream is always available to help. Everyone in the OVS community really does want the OVS software suite to work. So go to openvswitch.org. Seriously, I'm not joking, go. Sign up on the discuss and dev lists. Right now. People already have their laptops out, and you can do it on your phone too. It's pretty simple. I'm not kidding, it's good to do. There's a lot of good information there, and people are very responsive.

For the remaining part of the talk, I'll try to do some examples. It's always good to have a real test environment. I like to use network namespaces and veth devices. veth devices work pretty well for both datapath types, and they're simple to set up. Network namespaces are simple to set up too. It's something like 11 commands to set up two network namespaces connected through veth devices so that you can ping from one to the other, back and forth. And by default, this will work; you can send packets back and forth. Another great environment, where you can work with a real orchestrator, is the Docker-in-Docker cluster hack script that OpenShift includes. That's really cool, because it sets up Open vSwitch, it adds flows, and it lets you start pods on your local machine so you can play around with it. I like that quite a bit.

All right. A lot of times, problems that get reported can be solved by just looking at the logs. The vswitchd logs a lot.
It is configurable, but the vswitchd definitely logs any errors, warnings, all that. And if you're using the netdev datapath with DPDK ports, all the DPDK log data is also in the ovs-vswitchd log. I don't know how many times we've gotten bugs reported where the log actually says this port is not available, for whatever reason. People complain to us, oh, we don't know what's going on, OVS isn't working. That's what they say. But the log actually tells you this port failed to add, and it tells you why: the IOMMU is misconfigured, or something else. You can see right there what went wrong and go fix it. A lot of people ignore this. It could have answered simple "why" questions. Really, the logs are quite good.

Sure, any time. Same one from the shirt you're wearing.

Now that I'm halfway through my rambling, they'll give me a mic. Is there any thought on making that better, or making it a little bit clearer? Especially because the errors from the different kinds of netdevs and DPDK look different from the ones you get with the regular, standard dpif.

Yeah, that's a good point. One thing that's nice, though, in defense of the logs: any time there's an error, you can just grep for that ERR or WARN string. I know what you're saying. I agree that sometimes it is difficult to understand the faults. I'll get to that. Not in the next slide, but in a couple of slides, there is some stuff I'll talk about.

Check your firmware, check your kernel. Make sure the firmware version is appropriate to the software you're using. We actually had instances where NICs were sending up duplicate packets, and it was being blamed on OVS. The team hadn't upgraded their firmware in two or three years, and it was mismatched with the driver. The driver thought it was programming something into the NIC, and instead it was telling the NIC to duplicate the packet and forward it up into a queue.
So it really is important to make sure the configurations are set right. Sometimes offloads do cause problems for certain network scenarios. I know Andy just did a talk where he said, oh, people always just disable offloads, and here I am on stage advocating, yeah, just disable offloads, but sometimes they don't make sense. And sometimes you have hardware that requires additional work to get the functionality you want. There may be additional kernel module parameters, additional BIOS setup, or other things that have to be done for that hardware to work optimally, or even at all.

To go back to your logging question a little bit: a lot of times when a port is misconfigured, it just shows up in ovs-vsctl show. If you run ovs-vsctl show and there's a port error, it usually shows up right there. In this case, these ports are set up correctly, but if you add a port that doesn't exist, for example, it will say that the port's not found, right there. You can just see it. So there's no need to grep the logs in that case, although it will show up in the logs too.

Now someone might ask, well, you know the port's not there, can't you just write a clean-up script? It's actually a little bit difficult. You have to know what kind of port you're dealing with. For instance, vhost-user ports won't show up in the kernel: if you do a netlink query to get all the interfaces on the system, you won't see vhost-user ports. So you have to know which ports to whitelist. Otherwise, if you do a simple, naive match, you might assume they're non-existent and remove a working config. So a clean-up script is really difficult. I like to say it's best for the orchestrator to clean up the ports it adds, because the orchestrator is supposed to know. OVS really can't.
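To make the whitelist problem concrete, here is a small Python sketch of the naive "is it in the kernel?" check with a skip list for interface types that never appear as kernel devices under their own name. It only reports candidates and deletes nothing. The function name, the sample port names, and the exact contents of the skip list are assumptions for illustration; the list is not exhaustive, and a real deployment would need its own.

```python
# OVS interface types that do not show up as kernel netdevs under the port's
# own name. Assumed, non-exhaustive list for illustration only.
NOT_KERNEL_VISIBLE_TYPES = frozenset({
    "dpdk", "dpdkvhostuser", "dpdkvhostuserclient",  # user-space DPDK ports
    "patch",                                         # exists only inside OVS
    "vxlan", "gre", "geneve",                        # tunnel ports
})


def stale_candidates(ovs_interfaces, kernel_ifnames,
                     skip_types=NOT_KERNEL_VISIBLE_TYPES):
    """Return names of OVSDB interfaces that look stale.

    ovs_interfaces: iterable of (name, type) pairs, e.g. taken from the
                    OVSDB Interface table.
    kernel_ifnames: iterable of interface names the kernel reports.
    """
    kernel = set(kernel_ifnames)
    stale = []
    for name, if_type in ovs_interfaces:
        if if_type in skip_types:
            continue  # cannot be judged by a kernel lookup; leave it alone
        if name not in kernel:
            stale.append(name)
    return stale


if __name__ == "__main__":
    import socket

    # Hypothetical OVSDB contents; on a real host these would come from OVSDB.
    ovs_ports = [("eth0", ""), ("vhost-user-1", "dpdkvhostuser"), ("ghost0", "")]
    kernel_names = [name for _, name in socket.if_nameindex()]
    print(stale_candidates(ovs_ports, kernel_names))
```

Even with the skip list, the output is only a list of suspects. As the talk says, the orchestrator that added a port is the component that actually knows whether it should still exist.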
And then for the netdev datapath, which really only applies to OpenStack deployments (I don't think OpenShift is using DPDK at all): DPDK ports require extra configuration to get optimal performance, or sometimes to get any performance at all. You need to check your hardware topology: make sure the hardware is correctly matched to your NUMA nodes and the VMs are spawned on the right NUMA node. Are your kernel parameters, or tuned parameters, set up correctly? Is isolcpus set? Did you turn off the RCU processing? Did you allocate enough huge pages? Are your VMs on the right node, and actually accessing those huge pages? Is that configuration right? There's a lot of additional stuff on top of Open vSwitch for that to work.

And finally, when you're debugging this stuff, you should know what the network topology is supposed to be. A lot of projects set up their network topologies differently. OpenShift wants to configure OVS differently than OpenStack, and probably differently than RHEV, and probably differently than some other project that's using SDN and controls Open vSwitch. So I say all bridges are not created equal. A lot of times developers make assumptions about how packets should flow when they add an OVS bridge. But if you read that blog, which I wrote, so it's a plug for me, it goes over how, in the OVS kernel datapath, a bridge is really kind of a fiction on top of a bunch of flow rules. It doesn't exist as a thing in the way of the packet. It's not even a bump in the wire. For OpenShift and OpenStack, you can read how they like to set up their networks at these two URLs. There's a lot of good information there.
You'll find out what br-int, br-ex, and all those different bridges do, and for OpenShift it's radically different. Does your system use the kernel IP stack, the kernel networking stack, in addition to Open vSwitch? For OpenShift it's true: they have a tun0 device, and they forward packets through that to provide iptables hooks. OVS doesn't directly use the routing table. I mean, it can, and in some cases it will, but generally speaking it doesn't. OVS doesn't use netfilter. It uses conntrack, and only if you've told it to. It's integrated with the kernel, but it doesn't use the parts of the kernel you haven't asked it to use.

Question?

How would you map these rules that you've set up? Is there a way for the user to use them?

The rules are different. If you're asking about topology, I would say use plotnetcfg, or I think there's another tool called Skydive. Both of those will detect what ports you're using and give you a graph that shows how the interfaces are interconnected. It won't show you the flow rules, though. Maybe Skydive will. But I will get to how to debug those flow rules in just a second.

And do the old things like BGP and all that stuff still exist in this world, or is that just a different world since you're not using routing tables?

OVS operates at a lower level. It just moves packets from one place to another based on matching these fields. All the routing decisions, BGP, OSPF, all of that, are done at a higher layer. Okay, we can follow up.

Sometimes when the setup is wrong, you can see how it was made wrong by using ovsdb-tool. We run ovsdb-tool show-log and point it at the database.
It will give you a rundown of the transactions that happened and which process executed them. So it's quite helpful if something got set up incorrectly.

You can also grab some stats. This is for the netdev datapath: you can see running statistics for how the forwarding engines are working. The kernel has other ways, and when the port is a non-DPDK port, a kernel port, you can pull interface statistics with your standard ip, ifconfig, and ethtool.

Sometimes a packet goes out of an interface and we have no idea why. So we do something like dump-flows. In this case, it's really simple. There's one flow, and it's the NORMAL action. Oh, okay, so it's behaving like a switch. A lot of times, if your flow rules are small, you can just watch the n_packets counters and see which flow's n_packets is increasing. That works great if you have a static setup, where there's no other data going through and you can push the packets yourself. It doesn't work well on heavily loaded systems, and a lot of times you'll be reading through reams of flows. You can do something crazy like I've done before, which is dump the flows and use diff to compare them. But that's the C and kernel programmer in me coming out. People don't really like to do that. I like to equate it to finding the Higgs boson: a whole bunch of stuff is blasted through and you're sifting through all this data to figure out what's going on.

What makes it worse is that the flows as they appear in the kernel datapath look completely different from the OpenFlow rules. As I said, the kernel datapath is just a flow key match: these specific things match, and this is what you have to do. There's no processing. Whereas on the user space side, it evaluates these rules.
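The "watch n_packets" and "dump and diff" approaches above can be combined in a few lines of Python. This is a minimal sketch, assuming the usual `key=value` layout of OpenFlow flow dumps (`duration=`, `n_packets=`, `n_bytes=`, and so on); the function names are made up for this example, and the input is two text snapshots taken before and after sending test traffic.

```python
import re

# Counter fields that change between dumps and must be ignored when deciding
# whether two lines describe the same flow.
_VOLATILE = re.compile(r"\b(duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*")
_N_PACKETS = re.compile(r"n_packets=(\d+)")


def parse_flows(dump_text):
    """Map each flow (with its counters stripped) to its n_packets value."""
    flows = {}
    for line in dump_text.splitlines():
        match = _N_PACKETS.search(line)
        if not match:
            continue  # header or blank lines carry no counters
        identity = _VOLATILE.sub("", line).strip()
        flows[identity] = int(match.group(1))
    return flows


def active_flows(before_text, after_text):
    """Return {flow: packet delta} for flows whose n_packets increased."""
    before = parse_flows(before_text)
    after = parse_flows(after_text)
    return {
        flow: count - before.get(flow, 0)
        for flow, count in after.items()
        if count > before.get(flow, 0)
    }
```

As with watching the counters by hand, this is only readable on a quiet system where the test traffic is the main thing moving; on a loaded switch nearly every flow shows a delta.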
So it's a bit more complicated. But maybe there's a better way. We'll take a quick detour. What's an SDN system? It's programmable. It has instructions and a pipeline. It's kind of like a processing chip, but specific to packets. And that means we do have some cool debugging tools.

The one I would reach for, to answer your question about tracing these flows, is ofproto/trace. You give it a description of a packet, or you can give it an actual packet dump, and it shows you how it evaluated the rules. I made a change to the flow rules from that demo example I showed. You can see here that an ARP ping works, but an ICMP ping does not. If I use ofproto/trace and say, okay, show me ARP, it shows that it matched a rule, ARP, in_port 1, at that priority, and the action is output to 2. But if we trace ICMP, we see that there were no rules matching. So clearly, somewhere in my flow rules, I accounted for ARP. I might have accounted for TCP. I might even have accounted for UDP or SCTP. But I forgot ICMP. So we can go through and debug.

How much? Okay.

As far as getting packet data goes, sometimes that's what people reach for. You could just reach for tcpdump. tcpdump works great if you have a kernel interface. It doesn't work at all for vhost-user, and it doesn't work for DPDK ports. But OVS includes ovs-tcpdump, which sets up a mirror, and that works inside OVS for all kinds of ports: kernel ports, vhost-user ports, all of that. And then there's this other cool gadget called ovs-tcpundump. Remember I said ofproto/trace can take packet bytes. You can pipe tcpdump into ovs-tcpundump and get those bytes out, and you can then feed them to ofproto/trace.

So in conclusion, and sorry for concluding so quickly: OVS debugging really shouldn't feel daunting. There's a ton of documentation.
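Going back to the ARP-versus-ICMP trace example for a moment, the idea behind that kind of trace can be shown with a toy match-action table in Python. This is an illustration of the concept only, not how ofproto/trace is implemented: the rule layout, the field names `dl_type` and `nw_proto` as used here, and every rule other than the ARP rule from the talk are assumptions made for the sketch.

```python
def trace(rules, packet):
    """Return the highest-priority rule whose match fields all equal the
    packet's fields, or None when nothing matches."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if all(packet.get(k) == v for k, v in rule["match"].items()):
            return rule
    return None


# Toy flow table: ARP and TCP from port 1 are handled, ICMP was forgotten.
RULES = [
    {"priority": 100, "match": {"in_port": 1, "dl_type": "arp"}, "actions": "output:2"},
    {"priority": 100, "match": {"in_port": 2, "dl_type": "arp"}, "actions": "output:1"},
    {"priority": 90,
     "match": {"in_port": 1, "dl_type": "ip", "nw_proto": "tcp"},
     "actions": "output:2"},
]

if __name__ == "__main__":
    arp = {"in_port": 1, "dl_type": "arp"}
    icmp = {"in_port": 1, "dl_type": "ip", "nw_proto": "icmp"}
    for name, pkt in (("arp", arp), ("icmp", icmp)):
        hit = trace(RULES, pkt)
        print(name, "->", hit["actions"] if hit else "no matching rule")
```

On a real system the same question is put to the switch itself, with something like `ovs-appctl ofproto/trace br0 in_port=1,icmp`, where `br0` is a hypothetical bridge name.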
I know I pushed a lot of URLs up there, but there's a ton of stuff on the web to read. The OVS documentation is really top-notch. You can go to openvswitch.org. You should already be there from signing up for the mailing list, so you can just click over and read through some of the docs. OVS is almost always doing exactly what it's asked to do. It's software, so sometimes it has bugs, but usually what you're seeing is not a bug in OVS; it's a bug in what you've programmed into OVS. Finally, those are some of my email addresses. I'll see you all on the mailing list. Questions?

Could you go back two slides for me? If you have a backward... With ovs-tcpundump, you can take the byte stream from that and pass it to the command from, I think, two or three slides previous? In that particular example, you had a really nice pretty-printed description, like in_port this, version, type ICMP, I think it should... So you can just pass it a raw stream?

Yes, you can pass the stream bytes. I forget the exact syntax, but it does take it.

That's cool, that's really cool.

I could have tried to cook up the xargs pipeline to make it happen, but I'm not that cool.

Just a simple question on the logging thing. On systemd systems, do you log to the journal by default? Do you log to the journal as well as, or instead of, the log file? That's where I look for logs.

Yeah, it's true, right now we aren't logging to the systemd journal. That's probably a good enhancement to make, because I think a lot of tools make use of that journal now. Propose it on the mailing list, maybe.

You said that when a packet comes down, or up, then as a result of that forwarding there's an update on the...

On the flow database. Yes.

So is that from the get-go, or is there a configuration for all the forwarding done ahead of time? Is it always learned, or is it...?

Yeah, it's always learned. If I go back to this post here.
See this blog post. If you go there, it talks about programming the kernel, the Open vSwitch module in the kernel. It does not contain flows by default. So by default, the packet comes in, it doesn't match anything, and then it's pushed to an upcall and processed. And then that table is updated. I think I'm out of time. I don't think this...

No, good talk. Thanks.

So to answer your question, I think Aaron mentioned it, but he was rushing through his slides. At the high level, I'm going to do a shameless plug. There are two other utilities. One is plotnetcfg, which gives a static X-ray of the system within the server, and another is a project called Skydive. That is exactly what Aaron mentioned, but I just wanted to mention it again. So those...

Correct. Yep, you can get a map.

We can talk offline also. But there are other utilities above and beyond this. This was at a lower level, but there are higher-level maps, a sort of network operating system view, all of that stuff available.

Yes. Yep. In addition, yeah.

Thank you, Aaron. I just wanted to highlight that there will be a party tonight at 7 p.m. at Ziskin Lounge. Please do collect your tickets at the registration desk if you haven't already. And in case you change your mind and don't want to attend, please do give the ticket back to us, because we have just 200 seats. But we do hope to see you all there. Thank you.