Okay, so I'm ready. Welcome. My name is Yoko Shiro, and I work for the networking services team, like the last presenter. Today I'd like to talk about — well, the title is quite nice, it says OVS-DPDK has to be further optimized, but if you read the description it's a bit more about finding a good baseline and then tuning the performance so we get steady results.

So what is Open vSwitch actually about? Open vSwitch is a general-purpose virtual switch that uses OpenFlow rules to make its switching decisions. What I want to emphasize here is the "general" part, because it means it should work in all scenarios; it's not optimized for one specific scenario to get the best performance there. So if you would like to increase the performance, a very easy way is to just add more CPU resources, and then you increase the performance for probably all the scenarios it's running.

So what is the performance of OVS-DPDK based on? The data path of OVS-DPDK is basically a dedicated thread, or a number of dedicated threads, on a dedicated set of CPUs that are only used for packet forwarding. Such a thread gets at most 32 packets from a specific port, reads those packets, and then tries to classify them against the flows that are configured. When all of the up-to-32 packets are classified, they are grouped by flow and then all the actions are executed. So if 32 packets come in for two different flows, we put them together, process all the actions for one group, and then process all the actions for the other. The idea behind this is that your cache stays nicely hot, because you do the same packet processing for all those packets. Once all the actions are done per group, the packets are sent out. That means packets can get reordered — not within a flow, but across the set of flows — so you might see packet reordering in that scenario.

So what really influences packet performance in general? One of the main things is the flow lookup; that takes most of the time. There is a series of steps taken when a packet comes in to figure out which flow it belongs to. On the DPDK data path, it first tries an exact match: it calculates a hash over the five-tuple of the packet and looks it up in the exact match cache (EMC). This cache is limited, I think to about 8k entries by default. If it hits the cache, the packet is forwarded directly and we're good to go. If it misses, there is a secondary lookup, the signature match cache (SMC). This was newly introduced in OVS and is still in an experimental phase, so it's disabled by default, but it adds another layer of caching. Then you have the datapath classifier as the next level. If it finds the flow there, it goes back up the tree — it programs the EMC, or the SMC if you have that enabled — and the packet gets forwarded. If you even miss the datapath classifier, you go to the OpenFlow classifier. I marked that one red because, if you look at the kernel data path, the first three items — the blue ones — are the ones that happen inside the kernel. So that's the data path part.
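By the way, you can see how your own traffic behaves across these lookup layers on a running system: the PMD statistics show per-thread counters for EMC hits, megaflow hits, and slow-path upcalls. A minimal example, assuming a reasonably recent OVS-DPDK (the exact counter names vary a bit between versions):

    # Per-PMD-thread statistics: EMC/SMC hits, datapath classifier
    # (megaflow) hits, slow-path upcalls, and cycles spent per packet.
    ovs-appctl dpif-netdev/pmd-stats-show

    # Clear the counters first when you want a fresh measurement.
    ovs-appctl dpif-netdev/pmd-stats-clear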
In the kernel there are different caches, but it's the same principle. And if you hit the red item, that basically means you go to user space: the kernel module calls up to user space to do the lookup, and that's the part the handler threads take care of. So if you see a lot of CPU utilization on the handler threads with the kernel data path, that's probably because a lot of packets are going to the slow path, because there is no flow in the data path.

What else influences the performance of OVS-DPDK? One thing is the mutexes and locks that are in the code in several places. There are very limited places in the data path that take locks, but there are some — bonding, for example, has a known one. The other thing, which I think is probably one of the biggest, is the syscalls. The DPDK data path is made as syscall-free as possible, but there are still syscalls being made, due to the infrastructure shared with the legacy kernel data path of OVS. The thing that I think generates the most syscalls — maybe not the most CPU, but the most syscalls — is the locking, and the latches you see here. The latches are used a lot for RCU, because OVS uses its own RCU library, and that wakes up threads by issuing a write() call. You see quite a lot of those; I have a nice example later on. And then there's cross-NUMA communication: if you have two virtual machines, each on a different NUMA node, their memory transfers have to cross the interconnect between the two CPUs.

So what can we do as developers? There are things you can change inside the code that might make it faster for a specific use case. One thing that was recently added is partial hardware offload — there's a nice post about it on our blog page as well. What it does is take the packets that come in, match up the flow, and then program a marker in the hardware that says: if you see this flow again, tell me what the flow identifier is. Then you don't have to go up and do all the lookups and the hashing; you directly know which flow to pick out of your table. And of course there's full hardware offload, which is currently being worked on upstream; that will take away all, or at least most, of the CPU load for those packets.
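If you want to experiment with the partial offload on your own setup — assuming a NIC and driver that support it — it's a single knob, and it needs a restart of ovs-vswitchd:

    # Enable hardware offload of the flow lookup (partial offload uses a
    # flow mark programmed into the NIC); requires restarting ovs-vswitchd.
    ovs-vsctl set Open_vSwitch . other_config:hw-offload=true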
You can disable the EMC. If you have a lot of traffic streams flowing at the same time, you will be thrashing your EMC, and then it doesn't really make sense to have it enabled: you still pay the overhead of calculating the hash and trying to find the entry, so you can get rid of that part. Maybe you can also come up with better hashing algorithms to use; the SMC is one example of something that was recently added there.

Another thing: you have multiple threads that poll your ports, and some interfaces might be very busy while a couple of others are, you know, relaxed, with only a few packets coming in. You can rebalance the queues across the cores. If you have a very busy core handling three queues and another core handling only one, you might be able to shift some of the workload to the other core, so the processing is more balanced.

You can try to remove locks from the system. That's going to be tough, because a lot of the code is shared between the kernel data path and the DPDK data path — and maybe additional data paths coming in the future — so it might be hard to change, because it's embedded in the entire infrastructure.

And then we can remove syscalls. One idea I was looking at: maybe we can offload some of the syscalls that take a lot of time to a separate thread in the system. It probably requires another dedicated core, but there are other features that are also moving towards having an additional core in the system, so maybe we can move stuff out there. An example: if you have configured an external controller and an unknown packet needs to be sent to that controller, currently the transmission happens inside the PMD thread. So you're basically issuing a write inside your PMD thread saying, I would like to send out this packet. That's quite costly on your worker thread, because even if you spend only a couple of million nanoseconds not processing packets on ingress, you're probably going to drop some. So that could be a nice enhancement.
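To make the cache and queue knobs I mentioned a moment ago concrete, these are the standard configuration options, assuming a recent OVS-DPDK; dpdk0 is just an example port name, and the SMC option is experimental, so check the documentation for your version:

    # Effectively disable EMC insertion (the value is an inverse
    # probability 1/N; 0 means never insert), for workloads with so many
    # parallel flows that the EMC just thrashes.
    ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

    # Enable the experimental signature match cache.
    ovs-vsctl set Open_vSwitch . other_config:smc-enable=true

    # Show which rx queues are polled by which PMD thread, and how busy
    # each thread is.
    ovs-appctl dpif-netdev/pmd-rxq-show

    # Rebalance by pinning rx queues of a port to specific cores
    # (here: queue 0 to core 4, queue 1 to core 6).
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:4,1:6"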
So even if we do all that work, the main question is: will we increase the performance? I think the answer is that it all depends on your environment. Like I've seen in the past, everybody has a different environment, and people test different environments as well. People doing upstream development on OVS tend to optimize a specific use case — their own specific use case — and they don't test all the other ways people use OVS. Because OVS is general, the test case for someone else sometimes gets missed, so you don't really get the performance increase; you might even get a decrease in your specific scenario, while other people get an increase in theirs.

So what are the dependencies in your environment? I think the number of PMD threads you have is one reason why you might get better or worse performance. For example, if you have 15 virtual machines, each with one queue, you need to poll those 15 virtual machine queues constantly. If you have one thread polling all those 15 queues and someone makes a performance optimization in the vhost-user driver, you might get a performance increase, because you're polling 15 vhost-user queues. But if part of what you poll is a hardware driver — if someone makes vhost-user optimizations and you only have hardware drivers — it doesn't help you in that scenario. So it's pretty important what kind of drivers you have and how many queues are assigned to them: you might be polling 15 queues on one hardware device driver, and then it's an optimization in the hardware driver that helps you.

Something that's very important is the OpenFlow rules you program, and how often you change them. If your data path rules are programmed and your OpenFlow rules are in place, and then you decide to remove all your OpenFlow rules and insert a new set, that basically means you flush your caches, and every packet that comes in for a flow needs to be relearned. So if you change your OpenFlow rules every second, you may get a lot of flushes and a lot of packets going through the learning phase — all the way up the cache diagram I showed earlier.

Then, of course, there are the data path rules that are active in your system. Say you have a million flows going into your system; in theory, you could have a million data path flows installed. But the system times them out: after three seconds of no traffic, your flow gets removed, and if someone opens a web browser and visits a new website, that flow needs to be added again. So whether you have short-lived or long-lived sessions also influences the performance of your system, because you either need to refresh your cache or not, do the additional lookups or not. So it's not only the volume and type of the traffic, it's also how many sessions you are setting up and tearing down per second. Do you have long-lived sessions, short-lived sessions? Is it only TCP, so you have one complete flow set? Or is it a completely different protocol, or a custom protocol, that you're using?
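If you want to check this behavior on your own system, you can dump the active rules and tune the idle timeout; br0 is just an example bridge name, and the 3000 ms below simply matches the three-second timeout I mentioned:

    # Dump the datapath flows currently installed (the cached rules).
    ovs-appctl dpctl/dump-flows

    # Dump the OpenFlow rules and port statistics on a bridge.
    ovs-ofctl dump-flows br0
    ovs-ofctl dump-ports br0

    # Datapath flows idle out after max-idle milliseconds of no traffic.
    ovs-vsctl set Open_vSwitch . other_config:max-idle=3000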
So I think, from an upstream perspective, what we really need is a bunch of reference architectures, where we can say: in the field, one set of users runs setup X, another set runs setup Y. Maybe some people use only a couple of virtual machines and that's it; other people might be running 100 virtual machines but with low-latency traffic; and others might run with high-latency traffic, or a million flows, or only 10k flows. That way, if someone makes a change upstream, you can run your baseline tests — the five or six baseline setups you have, whatever it is — and see for which specific use case the performance increases, and for which it maybe decreases. Then it's easier to make a decision: are we going to apply the patch, or are we going to NAK it and say, you know, this is at best a bit more optimization for one specific use case.

So how do we get that data? And here I'm looking at you guys: if there are people willing to share some of their data, that would at least be a good start for creating the use cases. What I currently do is test my scenario with what we call a PVP test — I have a link later on if you want to take a look — but basically you send a bunch of traffic in, loop it back in the virtual machine, and then send it back out, and you measure based on the number of flows. (Please close the door.)

So, what would be nice to get — I put up a list here with some of the things: dumps of the flows and dumps of the ports, so we have an idea what is actually used in the system, and we can probably come up with reference architectures from that for testing. I'll quickly go over it; it's in the slides that are online, so if you're willing to share your data, you can look it up.

The other thing I have is the syscalls. I'm able to capture whatever syscalls I want in my setup, but it's not clear which syscall to optimize for, because in some scenarios — at least in all my scenarios, the common scenarios — I don't get better performance if I get rid of the syscalls. But it would be nice to see what additional syscalls actually show up in your customer or test environments, and then we can probably optimize for those specifics. To get an idea, I run perf for a couple of minutes, capturing all the syscalls that the PMD threads generate, and then I have a small Python script that tells me which syscalls are called by which PMD thread. That's just the overview screen, so it doesn't tell you much, only that you get a lot of them. But the script also analyzes the call stacks for each syscall and collapses them, so in this case you see that you have — what was it — 56k occurrences of this specific code path. That way we can try to see which one would be the best to optimize, if we ever get the time to optimize it.
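For reference, a capture along those lines can be done with standard perf; the event selection and the 60-second duration here are just an example:

    # Record all syscall entries made by ovs-vswitchd (the PMD threads
    # run inside it), including call stacks, for 60 seconds.
    perf record -e 'raw_syscalls:sys_enter' -g -p $(pidof ovs-vswitchd) -- sleep 60

    # Convert the capture to text so a script can group syscalls per
    # thread and collapse the call stacks.
    perf script > pmd-syscalls.txt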
So what's next? If you're willing to share data, you can send me an email, and my commitment is that I'll try to collect it all, make some general use cases out of it, and share them upstream. The ultimate goal for me would then be to add them to the link you see here — the bottom link... sorry, the top link. That is the script for running the PVP tests, and if I could incorporate some of those test cases, it would be easy for me to run them against whatever set of data or patches is available. The other link is for the Python script, if you would like to do some analysis of the syscalls yourself. So that's the data I have. Hopefully some people are willing to share other data, so maybe we get better test coverage upstream. That's it. Any questions?

[Audience] The OPNFV testing — that's the VSPERF one, right?

Yeah. Have you ever tried to set it up? ...Okay, I think that's the answer. So, yeah, it's possible to use that as well; it has similar features to what I have. But it's really hard to set up, because it requires a defined setup and then it does everything for you: it configures everything, downloads the virtual machines. In a development environment that's not really friendly, because you have your own machine and you just want to run a test — with my script I run one command line, I get the graph or whatever I want, and it's done. But yes, it does the performance testing for you as well; it has the same features as this.

[Audience] But if everyone is doing their own thing on their own side...

Yeah, I mean, that would be... yeah. But the thing is, with that test suite it only does one thing: send wire-speed traffic, see what the performance is, do an RFC 2544 test, and that's it. And there is no environment out there that mimics that environment, right? Tell me one customer that has an RFC 2544-type environment. And that's the problem: you can optimize to your liking for that specific test scenario, but the performance in a real-life scenario might be really bad.

Yeah, I think the automatic optimization is outside of the vswitch; it's out of scope. Oh, sorry — so the question was: is there any work in progress to optimize the flow set? So if you configure the rules, to optimize them — that's the question, right? From an OVS OpenFlow perspective, for rule optimization we rely on the OpenFlow controller to come up with the best set of OpenFlow rules, and then the data path does its own optimization. If you look a couple of slides back, you see the SMC; those layers try to do some optimization on the given flows to make the lookup faster. So there is some optimization there.

No more questions.