Hi everybody, my name is Bruce Richardson, this is Harry van Haaren, and like the last speakers we're both from Intel, based in Shannon in the west of Ireland. Our talk today is about turning networking code up to 11, and I hope you all get the Spinal Tap reference there, 11 being obviously better than 10 in this case; we'll just go with that as the explanation. This talk is all about vectorization, so the first thing is: what are we talking about when we talk about vectorization? Vectors here are a little different to, say, vectors in VPP terms. What we're referring to is, if you remember your computer science classes, SIMD, or single instruction multiple data. So our vectors here are literally just arrays of numbers designed for use in SIMD registers. With single instruction multiple data, the idea is that instead of having an add instruction take two individual numbers and add them together, it takes two vectors, two arrays of numbers, and adds them together in parallel. So with one instruction you can actually do four, eight, or sixteen effective operations. SIMD is now very widespread: on Arm you've got NEON, you've got AltiVec on PowerPC, but for obvious reasons this talk is mostly focused on Intel Streaming SIMD Extensions, better known as SSE to you and me, and Intel Advanced Vector Extensions, AVX, all the way up to AVX-512. That's what these examples mostly use. Now if you go on the internet and do your Google searches and try reading up on vectorization and SSE and so on, you'll find a lot of examples doing vector additions and matrix multiplies and things like that. There aren't a lot of matrix multiplies done in packet processing normally. There is a lot of packet header processing, though, which is considerably different. So why would we want to vectorize, why would we want to use these SIMD instructions? There are a number of reasons. One, it gives us the opportunity to work with multiple packets in parallel. For example, we'll see in one of the cases here how we can process the flags from four packets at a time, or, if we're doing lookups, we might do a lookup of four hashes at a time. Alternatively, instead of multiple packets, we can work with larger blocks of data in these larger registers. So instead of working with eight bytes of packet data at a time, maybe we can load and work on all 64 bytes of packet headers in one register. It's not multiple packets, it's just more data from one packet, and in that case we could potentially save eight load instructions and do one load instead. Or for comparisons and other operations we can work on three or four packet header fields simultaneously rather than processing each field one at a time. Lastly, and perhaps most interestingly, it also opens up new possibilities, new ways of doing things, taking advantage of the novel instructions these vector instruction sets provide. One you'll see used a lot as we work through these examples is byte shuffling, where we change the order of bytes within a register.
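Just to make that basic idea concrete before we go further — one instruction operating on several values at once — a SIMD add with SSE intrinsics looks something like this. It's a minimal sketch with made-up values, not code from any of the projects we'll be talking about.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* Four 32-bit values in each 128-bit SSE register. */
        __m128i a = _mm_setr_epi32(10, 20, 30, 40);
        __m128i b = _mm_setr_epi32( 1,  2,  3,  4);

        /* One add instruction operates on all four lanes in parallel. */
        __m128i sum = _mm_add_epi32(a, b);

        int out[4];
        _mm_storeu_si128((__m128i *)out, sum);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }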
Coming back to byte shuffling: that's something you can't really do in scalar code except manually through a sequence of instructions, but with vector code you can do a lot of shuffling really quickly. Then also, especially in AVX-512, you've got this new thing called k-masks, which lets you do masking operations so you can work with partial registers. Harry's a big fan of these instructions; he's going to touch on those afterwards. But these are, again, new capabilities offered by these instruction sets. The overall upshot is that these vector instructions, SSE and AVX, all allow you to do more work per instruction. You get more bang for your buck from your CPU. So here's a brief outline of the rest of the talk. Initially I'll hand over to Harry to talk about how packet parsing is done in OVS and how we can accelerate the miniflow extract function using vectorization. Then, because you're not getting rid of me that easily, I'll be back on stage to talk about some of our DPDK poll-mode drivers and how we use vectorization there to speed them up an awful lot. And finally Harry will come back up and talk about another part of OVS, the classifier, and how that's been vectorized. So that's the rough outline of what we'll talk about. So, Harry.

So folks, as Bruce mentioned, I'm talking about packet parsing, particularly in the context of SIMD. There are multiple ways we can use these instructions: Bruce mentioned we can work on multiple packets, or we can look at one big block of data, and in this case we're going to focus on the one-big-block-of-data use case. What we're really trying to do here is this: when OVS receives a packet from any backend, DPDK or any of the others, it basically gets the packet data that came in over the wire, and now we need to interpret it. We need to do something with it, we need to match it against some rules — that's what OVS ultimately does — perform some actions, and then TX the packet again, either to a guest or to the network. A very big part of that cycle cost is actually parsing the packet, understanding what that packet is. When OVS parses a packet into its own internal data structure, it parses it into something called a miniflow. For those not familiar with the miniflow data structure — I wasn't a couple of years ago, so I'll presume not everyone here is — it's worth talking about it for a little while, because it's a really nice concept for compressing data in a very cache-friendly way, and it trades off compute against memory accesses. So it's an interesting data structure, and I'll talk about it in the next slides and then later again. I'll spend a couple of minutes just explaining how it works. It's composed of two parts, and I draw most of my things graphically, so over on the right-hand side we have a graphical representation of the miniflow. The miniflow has bits and blocks. The bits, on your left, are essentially a bit mask of what has been parsed into this miniflow. So if you receive a packet that has an Ethernet header and then an IP header, there will be an Ethernet bit, there will be an IPv4 bit, and there's a separate IPv6 bit somewhere else. Those bits are individual identifiers for each particular thing the miniflow is aware of.
So IP, VLANs, VXLAN, all of those kinds of things: are they present in this miniflow? If something is present, its bit is set; if it's not, the bit is clear. So it's just an individual bit to identify, for example, is this an IPv4 packet? If it is, then we move to the blocks. The data contained in the blocks changes based on which bits are set, so in that way it's a dynamic data structure, if you like. The bits identify what each block represents, and the blocks contain the actual values. For IPv4 we have a source and a destination address; the actual values, 127.0.0.1 as an example, will be stored in a block, but the fact that it's an IPv4 packet is stored as an individual bit in the miniflow bits. That's how the bits and blocks work together. The number of bits set is also the number of blocks that are filled with valid data afterwards, so there's always a relationship between the blocks and the bits. In total the bits part of the structure is 128 bits, so two uint64s in terms of scalar code, a uint64 being the widest integer most modern CPUs support. The blocks, essentially, scale out depending on how many things this OVS instance is aware of, so they largely overprovision for most use cases — I think it's about 580 bytes or so, so the miniflow is a very large data structure. But because of the way it works, we always compress the data that we actually use in towards the bits — we left-pack all the data, if you like — and that gives us very good cache locality with this data structure. And that's the reason it's actually really nice: with the bits and the blocks together you can interpret any type of packet and represent a very wide range of packets in a very cache-dense way. The last thing — I referred to it earlier — is that the count of the blocks, the number of blocks, is equal to a popcount of the bits, the number of bits set. That becomes more important later on in the presentation.

So if we look at the packet parsing itself, bringing us back to a more SIMD context: at the top right I've graphically represented a packet, an Ethernet/IPv4/UDP packet, and we want to parse that into the miniflow — the miniflow here on the bottom right that I described earlier. We want to build up the miniflow that's specific to this type of packet. We receive this packet from an Ethernet device, so we know that the outer frame is going to be an Ethernet header, but what we don't know is what the EtherType is — what are the contents of this Ethernet frame? To build up the miniflow, we're going to load the MAC addresses, the actual Ethernet source and destination, and copy those into the miniflow data structure down below. You can see one block gets consumed in this case — in reality it's actually two, the slides are slightly simplified just to keep them manageable and easy to follow. So we pull these MAC addresses, we store them in a miniflow block, and we set a bit in the bits array saying, hey, there's an Ethernet header present in this miniflow. The next thing a scalar miniflow extract is going to do is load the EtherType and then perform a branch on it — for those familiar with assembly, we're going to jump somewhere else in the instruction stream and perform some type of operation based on the value.
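Just to pin that bits-and-blocks idea down in code before we carry on with the parsing, a rough C sketch of the concept might look like this — it's a simplification for illustration, not the real OVS miniflow definition.

    #include <stdint.h>

    /* Simplified sketch of the miniflow idea (not the actual OVS struct):
     * 128 presence bits, followed by one 8-byte block per bit that is set,
     * packed tightly so the used data stays cache-dense. */
    struct miniflow_sketch {
        uint64_t bits[2];     /* which fields are present in this packet */
        uint64_t blocks[];    /* values for those fields, in bit order   */
    };

    /* The number of valid blocks is just the popcount of the bits. */
    static inline unsigned
    miniflow_n_blocks(const struct miniflow_sketch *mf)
    {
        return __builtin_popcountll(mf->bits[0]) +
               __builtin_popcountll(mf->bits[1]);
    }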
So, back to that EtherType branch: if it's equal to 0x0800, that means the protocol contained inside is IPv4, so we're going to parse the next chunk of data as an IPv4 header. We load that IPv4 info, we store it to a miniflow block, and we set a bit in the miniflow saying, hey, there's an IPv4 header contained in this miniflow as well. From the IP header we now load the protocol field. Again, we load it and we branch on it, because we don't know what's inside; it could be a whole range of things. In this particular example it's UDP, so now we go and interpret the next piece of data. You can see it's all very scalar: we load a small piece of data, we store it into the miniflow, we set a bit, and then we branch based on whatever that particular protocol field contained — the EtherType or the IP protocol. It's a very slow, stepwise process from a CPU performance point of view: we need to know what the EtherType is before we can interpret the next piece of data, and we need to know what the IP protocol is before we can store the piece after that. So it's a very scalar, step-by-step way of building up a miniflow. So if we try to vectorize this, what could we do? How could we use SIMD instructions to do better? As Bruce said earlier, we can slurp up a huge amount of data into a register: SSE being 128 bits or 16 bytes, AVX2 extending those registers to 32 bytes, and AVX-512 giving us a full 512 bits, 64 bytes — a full cache line on most architectures — in a single register. So in this case we could actually load the Ethernet, the IPv4, the UDP and a little chunk of payload into a single register, and now we can operate on all of that data in parallel; one instruction can touch every byte of that packet header. That gives us a lot of flexibility to start thinking about compute and how we build this miniflow data structure. If we're aware of the types of traffic that run through our data center — there were some great presentations earlier about exactly this — then the specific types of traffic and the encapsulations are something you'll know about, and we can specialize for them. So I have it here on the right-hand side: we can load the whole packet data, do a bitwise AND with a mask, and compare it to a known protocol pattern. The idea is that the packet headers we receive most commonly — 80% or 90% of our traffic — will probably have a specific layout, a specific set of protocols, and we can create, essentially, a register of mask and compare values for that layout. So in this case we load the data, we apply the bit mask, we compare, and that tells us whether all three headers were present in that specific order. So for Ethernet, IPv4 and UDP, all the checks that we did step by step earlier can now be done in parallel, because we have a very wide register and all these comparisons happen in a single instruction with multiple data. And if we do hit on this protocol pattern, then we can apply something called a shuffle — that's represented on the bottom left here. A shuffle is ultimately a way to rearrange data inside a register. Bruce is going to go into detail in the next couple of slides on how they work, so I'll gloss over it here.
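As a rough illustration with AVX-512 intrinsics, that load-mask-compare probe could look something like this — just a sketch with made-up names, and it assumes at least 64 readable bytes at the start of the packet buffer; it isn't the actual OVS implementation.

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Does this packet look like Ethernet/IPv4/UDP?  'mask' keeps only the
     * bytes we care about (EtherType, IP version and protocol, ...) and
     * 'match' holds the values those bytes must have; both would be built
     * once per known traffic profile. */
    static bool
    probe_ether_ipv4_udp(const uint8_t *pkt, __m512i mask, __m512i match)
    {
        __m512i data     = _mm512_loadu_si512((const void *)pkt); /* 64B of headers  */
        __m512i relevant = _mm512_and_si512(data, mask);          /* drop don't-cares */

        /* Compare all 64 bytes at once: an all-ones result mask means a hit. */
        __mmask64 hit = _mm512_cmpeq_epi8_mask(relevant, match);
        return hit == UINT64_MAX;
    }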
But let's hypothetically say that magic happens: the shuffle transforms that packet layout, whatever it was, into the miniflow layout there below. Now we can use one store instruction to take that whole packet and create a miniflow with a single store — we store 64 bytes of data into our memory subsystem, into our L1 cache. And in the space of about five or ten instructions, and most likely a similar number of cycles, we've done that whole miniflow packet parsing. Whereas previously we had two branches and a whole bunch of smaller loads and stores, we've reduced it to a handful of instructions. So that's a vectorized miniflow extract. We'll build on this miniflow and more vectorization in a later part, but first I'll hand you over to Bruce to do some packet I/O.

So I'm going to talk a little bit about some of our drivers. As you can see, I'm not as good at the pretty pictures as Harry is, so unfortunately I've put code on these slides — a good thing or a bad thing depending on your interpretation. Inside DPDK we do a lot of work taking packets from the network card and handing them over to the application to process. To do that, every poll-mode driver needs to take the metadata supplied by the network card and transform it into the metadata structure expected by the application, which is the mbuf that you heard about in the last talk as well. So on this slide, on the right-hand side, I've got a list of the descriptor fields from one of our poll-mode drivers — I've taken the ixgbe driver because it's one of the simpler ones to talk through. And on the left I've got a snapshot of part of the mbuf, the mbuf data fields that correspond to the metadata we're getting from the NIC. Roughly speaking, we have a one-to-one correspondence for a lot of these fields, something like this: we've got packet length and data length coming from the descriptor's length field, VLAN information comes across, and we've got a packet type field that needs to be mapped to a packet type field. Then we've got some unused fields, and some that don't go directly into the mbuf at all — the status field, which we'll process later, gives us information about whether the descriptor is valid or not. One additional wrinkle in this particular case is that the packet type field, even though it has the same name on both sides, needs some additional processing, so it's not a straight copy. But in short, we essentially have these four field arrows here: fields in our NIC descriptor that need to be moved about and then written into the mbuf in a different order. That's a very, very common operation inside our poll-mode drivers, and to do it we use the shuffle operation that Harry referred to earlier. It's the exact same kind of operation needed for that miniflow: you've got a set of fields inside a register and you need to store them somewhere else in a different order. That can be slow to do in scalar code, with a lot of loads and stores moving one field at a time, but we can do it a lot faster using vector operations and shuffles. For simplicity's sake I'm just using the SSE version here. If you look in DPDK at our i40e poll-mode driver, we have an AVX2 version there which uses shuffles in pretty much exactly the same way, so this is applicable to AVX2 and AVX-512 too; it's just that with the larger registers it wouldn't fit on the slides.
So I'm limiting us to 16 bytes here for simplicity's sake. If we start off looking at this: at the top we have the actual source, the descriptor with the original data, and at the bottom we have the mbuf fields that we need to fill in. So how do we map from here to here using a shuffle operation? I'll fill in the actual shuffle mask as we go — the mask that tells the instruction where to pull each byte from — and that way I'll show you how you actually use the shuffle instruction. Again, everything is unfortunately right to left, because when we fill in the fields of the register the parameters go from right to left, so the lowest byte goes on the very end. Everything is consistently right to left — it's drawn that way deliberately, even though it may initially look backwards; there is method in the madness. So the first field is packet type. Unfortunately, as I said earlier, that needs some additional processing, so we're not ready to fill it in; what we want to do for now is zero out that field. In the shuffle operation, setting the high bit of a mask byte means set that output byte to zero. Setting just the high bit I find a bit confusing, so in our driver we actually set the whole byte to FF, which is just a big flashing sign saying: this is not a regular field position, this byte is going to be zeroed out. So we skip over that, it's zeroed. Now we start actually moving data about. In the shuffle mask, the value you fill in is the original byte position you want to put into the new place. So we're now four bytes in, and in byte four we want the byte that was in position 12, because our packet length is at bytes 12 and 13 in our source; so 12 and 13 get filled into our packet length field. However — and I don't think anybody would have noticed — if you look back at the mbuf definition, packet length is actually a four-byte-wide field, and you can see that from the diagram here as well. So we need to pad in some zeros too, and that's done by sticking in even more FFs. So now we've got our first two fields filled in: a set of four zeros, then bytes 12 and 13, then two more zeros. Another cool thing with shuffles is that as well as moving data around, you can duplicate it just by putting the same position in twice. Our next field is data length, and that also comes from bytes 12 and 13, so we fill those in again; it's a 16-bit field, so no need to put in any zeros. And the rest of the data follows the exact same pattern: our VLAN is at bytes 14 and 15, so we stick that in next, and our RSS field comes from bytes 4 through 7. While this can initially look quite complex and daunting, it's actually a relatively mechanical process when you're building it up: you just fill into the shuffle mask the position you want to take each byte from. And this is what it actually looks like in our final code, line numbers included in case you want to go to the DPDK source code — it's the same data, just written out. And here is how it's used. We had four different fields that in scalar code would take four loads and four stores; we can get that down, per descriptor, to one load, one shuffle, one store. In the main loop we actually work on four descriptors at a time.
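As a sketch of that load/shuffle/store pattern for a single descriptor — the byte positions follow the walkthrough above, but the mask and output layout are illustrative rather than the exact ixgbe code:

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative shuffle mask, written in memory order (byte 0 first).
     * 0xFF means "zero this output byte"; other values are source byte
     * positions in the descriptor. */
    static const uint8_t shuf_mask[16] = {
        0xFF, 0xFF, 0xFF, 0xFF,   /* packet_type: zeroed, filled in separately */
        12, 13, 0xFF, 0xFF,       /* pkt_len: bytes 12-13 plus zero padding    */
        12, 13,                   /* data_len: same source bytes duplicated    */
        14, 15,                   /* vlan_tci                                  */
        4, 5, 6, 7                /* rss hash                                  */
    };

    /* One load, one shuffle, one store per descriptor. */
    static void
    desc_to_mbuf_fields(const void *rx_desc, void *mbuf_rx_fields)
    {
        __m128i mask = _mm_loadu_si128((const __m128i *)shuf_mask);
        __m128i desc = _mm_loadu_si128((const __m128i *)rx_desc);
        __m128i rearranged = _mm_shuffle_epi8(desc, mask);
        _mm_storeu_si128((__m128i *)mbuf_rx_fields, rearranged);
    }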
On the slide I'm showing two descriptors again for simplicity. So we load two descriptors — we load descriptor one and then descriptor zero, and that ordering is just to prevent compiler reordering. Then we do two shuffles, one on each descriptor, to get the data into the right mbuf order, and two stores. So one load, one shuffle and one store per packet gets us all those fields into the right position, and we're saving a huge number of instructions. So that's one way we can move data about. And just briefly, before I hand back over to Harry, I'll talk about another kind of data movement we do in our drivers, which is where we set things up for parallelism. I talked a little about the status bits and how they don't get transferred directly into the mbuf. Instead, they're used to determine, for the four packets we're processing in a loop, whether each one is valid or not. We don't want to do that individually per packet, as that can be wasteful; instead we want to take all that status and error information from the four packets, combine it into a single register, and use SIMD as it was intended, to do things in parallel. And we can do that using another set of instructions, the unpack instructions. Here's a very brief diagram, just simple boxes rather than anything fancy, of how you do that. We use an unpack-high to take the high halves of each pair of registers and merge them together, as shown in the middle row, and then an unpack-low to stick those results together. So with three instructions we've gone from having the data spread across four registers to having all the 32-bit fields we care about inside a single register, and we can now operate on them in parallel using our SIMD instructions. And again, here's the actual code in all its glory. We mask off the bits we care about, and that then allows us to shrink it down even further, from 128 bits to 64, at which point we can do the rest of the work in scalar code because we've got a 64-bit value. All we're actually doing here is a popcount, which counts the number of one bits set. We've masked off everything but the valid bit, so by doing a bit count we can work out how many of the four packets we've read are valid, update our stats, and determine whether we need to break out of our loop because we've got no more packets to process. So this is showing the advantage of using shuffles to move data about very quickly, and then how we can use unpacks to take disparate data read from different descriptors for different packets and combine it into a single register that we can operate on in parallel with SIMD instructions for further processing.
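Roughly, that gather-and-count step looks something like this with SSE intrinsics — a sketch only, where the position of the status word and of the valid bit are assumptions rather than the exact ixgbe layout.

    #include <immintrin.h>
    #include <stdint.h>

    /* Assume each 16-byte descriptor has its 32-bit status word in lane 2
     * and the "descriptor done" (valid) flag in bit 0 of that word. */
    static unsigned
    count_valid_pkts(__m128i d0, __m128i d1, __m128i d2, __m128i d3)
    {
        /* Three unpacks gather the four status words into one register. */
        __m128i hi01   = _mm_unpackhi_epi32(d0, d1);
        __m128i hi23   = _mm_unpackhi_epi32(d2, d3);
        __m128i status = _mm_unpacklo_epi64(hi01, hi23);

        /* Keep only the valid bit of each status word... */
        __m128i valid = _mm_and_si128(status, _mm_set1_epi32(1));
        /* ...narrow 4 x 32 bits down to 4 x 16 bits (128 -> 64 bits)... */
        __m128i narrow = _mm_packs_epi32(valid, _mm_setzero_si128());
        /* ...and finish in scalar code with a popcount of the 64-bit value. */
        uint64_t bits = (uint64_t)_mm_cvtsi128_si64(narrow);
        return (unsigned)__builtin_popcountll(bits);
    }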
So, as promised, I'll talk a little more about OVS, and matching rules in particular, keeping in mind that miniflow data structure we were talking about earlier. If we look at what OVS spends most of its time on today in the actual matching code: a packet comes in, we've parsed it, we've made a miniflow, and the rest of the OVS pipeline, the data path, operates on that miniflow. The packet data isn't referred to anymore; all the metadata that OVS can match on, everything it could care about, is cached in this miniflow data structure. So when we try to match a packet's miniflow against the rule set that OVS received from an SDN controller somewhere, what it's really doing is comparing miniflows. It doesn't refer to the packet anymore, nor directly to the OpenFlow rules — everything becomes a miniflow, which is why it's a bit of an overloaded term. There are miniflows for rules, for masks, for packets, et cetera. If you've read the code base, you might have scratched your head at how often this data structure turns up, but in concept it's actually quite simple. What you have is a miniflow for a packet, represented at the top here, and a table or subtable, which is essentially one tuple that OVS can match on. So when you program an OpenFlow rule into OVS and you match on, let's say, an IP field and a UDP field, as this example shows, then OVS will create a subtable with those two properties, IP and UDP, and wildcard everything else, because OVS has this capability of doing wildcarded rule matching. Then when a packet comes in, it will try to compare it against this table or subtable. This is all happening inside the datapath classifier of OVS — I should specify that OVS has multiple ways of matching packets, and this is all the datapath classifier, the wildcard matching engine. So from a scalar point of view, what is it trying to do? For everything that's defined in the table — in this case IP and UDP — it's trying to find the relevant block in the packet's miniflow. I mentioned earlier that this is a dynamic data structure, so that block is not always in the same place: depending on what properties the packet had — if there were VLAN tags, or if it was, say, an IPv4 packet encapsulated in IPv6 — those offsets are going to be different. So OVS can't just do a straight lookup into the miniflow at a fixed offset, like we usually would with a data structure. It has to iterate through the bits and try to find the block it's looking for. That's really the compute workload going on here: loop across the packet miniflow and find the block we care about. So we loop until we find the IP block, and then we have another loop to find the UDP block. Now, OVS usually doesn't know at compile time what the table properties are — in this case there are two blocks, but there could be three — so we end up with nested loops: one over the number of blocks, one to find the actual block index in the packet's miniflow, and then a branch to see whether this block actually compared equal. It's kind of your worst-case scenario for branch prediction: nested for loops with branches on the inside. You don't want that from a performance point of view. So let's look at the vectorized version, or rather work towards a vectorized version, because there were a lot of branches and for loops in this scalar implementation, and branches and for loops can be difficult to do SIMD or vectorized processing with. If we look at it from a more compute-oriented point of view: is there a way that we can not loop-and-find, but actually calculate an offset into this packet miniflow? If we generate a bit mask when we create the table, and then use the popcount instruction, which tells us how many one bits are set in an integer, we can actually calculate the block index, as opposed to looping through the data structure trying to find it. And that happens to be a lot easier to make a vector version of. So with this approach, the compute approach to the problem, we just get a lot more performance.
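A rough sketch of that compute-an-index idea — the helper name is made up and, for simplicity, it only looks at one 64-bit word of presence bits, so it's not the actual OVS code:

    #include <stdint.h>

    /* Given the packet miniflow's presence bits and the single bit a
     * subtable wants (say the IPv4 bit), the matching block's index is
     * just a popcount of the presence bits below it.  Assumes 'want_bit'
     * has exactly one bit set, and that bit is set in 'pkt_bits'. */
    static inline uint64_t
    block_for_bit(const uint64_t *blocks, uint64_t pkt_bits, uint64_t want_bit)
    {
        uint64_t below = pkt_bits & (want_bit - 1);   /* presence bits below ours    */
        unsigned idx   = __builtin_popcountll(below); /* = how many blocks precede it */
        return blocks[idx];
    }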
One more thing to touch on: we still loop over the number of blocks we have. In the previous example we had IP and UDP, so two blocks, and this loop would execute twice for that particular subtable. What's even nicer is if we implement an AVX-512 version of this same scalar code: the same compute that was happening on the previous slide, we can now essentially loop-unroll into the SIMD register. Each iteration that was previously a scalar loop iteration, we can lay out side by side in our SIMD register and compute all of them in parallel — there are no loop-carried dependencies here. We can just roll eight different blocks out into a single register and do all that compute with one instruction, multiple data, each time. That gives you much more compute per cycle and increases your performance by, well, up to 8x if you have eight lanes active. So, earlier we mentioned this k-mask feature. In the previous example we only had two particular blocks we cared about, so we need to be able to disable the lanes we don't care about, to switch off some of the compute somewhere. That's where AVX-512 k-masks come in — as Bruce mentioned, my favourite feature in AVX-512. They allow you to switch off lanes. What that really means is you can do compute on a full register, but in the locations where you don't want the compute to happen, you set a bit in a mask, pass that mask to the instruction, and it won't perform the compute on those particular lanes. That's really nice, because that flexibility would usually cost you extra instructions in SSE and AVX2; with AVX-512 k-masks we get it built in — the k-mask is represented here on the top right — and it gives you a huge amount of flexibility, a very orthogonal way of writing your code. What that really means is that the effective width of the register you use is something you can now define yourself, and that's really, really useful: you don't have explicit blends in your instruction stream anymore, which would have been the previous way to solve this type of problem in SSE code, and it's much easier to manage per-lane operations. Going back to Bruce's packet descriptor example: sometimes we want to subtract the CRC length from a packet's length field before it goes towards our statistics byte counts. That's one way we could do it: add four to, or subtract four from, an entire register, but only enable the one lane that we want the action to take place on. So that's a nice use of k-masks as well.
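A tiny sketch of that masked, single-lane operation with AVX-512 intrinsics — the lane number and the four-byte CRC are illustrative assumptions, not taken from any driver:

    #include <immintrin.h>

    /* Subtract 4 (the CRC length) from only one 32-bit lane of a 512-bit
     * register, leaving every other lane untouched. */
    static __m512i
    strip_crc_in_one_lane(__m512i words)
    {
        __mmask16 lane = 1u << 2;                 /* enable only lane 2 (assumed) */
        __m512i   four = _mm512_set1_epi32(4);
        /* Where the mask bit is 0, the value from 'words' passes through unchanged. */
        return _mm512_mask_sub_epi32(words, lane, words, four);
    }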
In summary, there are a number of benefits to vectorization. I hope this talk has been useful in showing how we can take some common packet processing operations and rework them to take advantage of vectorization in the ways shown here. Like being able to do larger loads and stores, so you've got fewer instructions to get data in and out of the core. Or increasing the amount of compute you're doing per instruction, whether that's working on bigger blocks of data, or working on multiple blocks of data from separate packets in parallel. And then we've also touched on some of the novel instructions you have in the AVX instruction sets for doing shuffling and masking, which you can again use to get more work done per cycle. Because it really all does boil down to having fewer instructions for the same amount of work. Whenever we take a piece of code and vectorize it, this is inevitably what we find, and sometimes the reductions in instruction count can be spectacular. That is what gives us our performance benefit from vectorization, above all else. I see from the signs that we have just five minutes left, so we have a couple of minutes for questions on this.

How much of this improvement was the actual vectorization, doing multiple things at once, and how much of it was getting rid of branches, which you could also do with the cmov instruction? Let me repeat the question: how much was actually gained from using SIMD instructions, having multiple lanes of data operated on in parallel, as opposed to reductions in branch count — could we have done this with the cmov instruction, for example? It depends on the type of compute; ultimately a lot of these engineering questions come down to "it depends". In practice the SIMD gives us a huge, huge speed-up. I'll let Bruce comment on the descriptor processing side of things, but I know in the OVS case, just the fact that we can unroll loops literally gives us that many-times reduction in instructions — you saw the compute: if you unroll it, you're not losing anything anywhere, it's just doing more in parallel. So there are certain things like that which just make a lot of sense. For the miniflow extract in the first part of the presentation, because we take a somewhat novel approach and can use so much data in one register, that gives us huge speed-ups: I've seen it run in about five cycles to probe a specific pattern and produce a specific miniflow for that packet, whereas I've measured it at somewhere around 60 to 70 cycles in scalar code. So there's an order of magnitude to be had there in certain situations — and the answer is still, it depends. And another good example of the benefits of vectorization: for our i40e poll-mode driver we have an SSE version and an AVX2 version, and there is a performance benefit between them. I don't remember the exact numbers, but as far as I remember it was 15 to 20% faster going from SSE to AVX2. There was no branch removal or anything like that; it was just going to bigger vectors, being able to do more work per instruction, and we got additional benefit from that. So yes, going branch-free can give benefits, and going branch-free and then vectorizing on top of that gives you additional benefits — so please do both. Question over here. Okay, let me summarize the question: does using AVX give us a performance benefit overall? Is that too simple? Yes, we see improvements in the overall performance of OVS, absolutely. Other questions in the room? Yes, yourself? Yes, so in most low-level packet processing libraries that are well optimized, we try to cache-align things; we are very aware of where cache line boundaries are.
In the case of OVS, we try to do as much of that as we can. However, it's not always possible to cache-align everything, purely because something like the miniflow might be embedded inside a larger data structure, and over time that could change, or it could just be unaligned right from the start. In practice, loads of cache-local data — even a load into an AVX-512 register that isn't cache-line aligned, what's referred to as a split load — have very little performance impact in my experience, compared to the amount of compute benefit you're actually getting. If your loads are your bottleneck, then there's something very strange going on in your code. So from that point of view, good question, but it's not a concern, or at least not something I've run into myself. All right, we're informed that time's up. Thank you very much.