OK, I think we'll get started with the next speaker. I'd like to introduce Sergio and Patrick from Intel, who are going to give an introduction and overview of VPP. Here, you have the mic.

Hello, everyone. My name is Sergio Gonzalez Monroy. I'm a network software engineer at Intel, and I work on the VPP and DPDK projects. In the first half of this presentation I will be talking about VPP: I'll give an overview and an example of how you can modify it and create new nodes. For the second half, Patrick is going to talk about practical performance analysis of VPP on Intel architecture and give some recommendations.

So this is the agenda. Like I said, I'll give a high-level overview of how VPP works and of the functionality and features it provides. I'll give an example of modifying the graph and adding graph nodes; in this case it will be the work I did integrating DPDK cryptodev into VPP. It is done as a plugin, so we'll see all the things you can do with plugins. I'll also talk about performance and how VPP scales, basically almost linearly, and finish with a summary.

So VPP stands for Vector Packet Processor. As you probably all know, it's a project inside the FD.io Foundation, and there are other projects inside FD.io. It's been in development since 2002, open source since 2016, and it runs on commodity hardware. It's fast, scalable, and consistent: it can do up to 14 million packets per second per core with no packet loss, and it has a scalable, hierarchical FIB, which provides great lookup and update performance. It's also optimized for commodity hardware, so it can take advantage of DPDK for IO, and now also ODP. It can take advantage of vector instructions depending on the platform you're running on, from SSE, AVX, and AVX2 to NEON on ARM. And it's also optimized for instructions per cycle. At the end of the day, it's all about performance.

It runs on multicore platforms, and the main concepts are batching packets and being cache and memory efficient. It uses a run-to-completion model, where every core runs a copy of the VPP graph. We avoid context switches; it's all running in user space, so there is no mode switching, and there is no blocking. Everything is lockless on the data path.

It's an extensible and flexible design. The whole thing is implemented as a directed graph of nodes, which, as you can imagine, allows you to create new nodes and reconfigure the whole graph as you need. It also provides an easy plugin mechanism, so you don't need to directly modify the VPP sources: you get a way to add your own nodes and insert them into the overall VPP graph. It provides a control-plane CLI, and it's very developer-friendly; for those developing applications at this level, I found that invaluable. It provides packet tracing, so you can see a packet going through all the nodes at any time, and every node can report its own information for that packet. It provides error counters and cycle counters, so you can see the average number of cycles spent per packet in every node. That makes it very easy to identify whether one of the nodes, or the node you are implementing, has a very high cost in the whole path for that traffic workload. Like I mentioned before, it runs in user space, and it can be built with common developer toolchains, just GCC or Clang. It's a full-featured implementation: we have full L2 and L3 implementations, and now also L4 with TCP and UDP.
It provides a CLI and binary APIs, and it's integrated in different ways. As I mentioned in the previous presentation, there is a NETCONF agent for ODL. We have a Python API used for Kubernetes Flannel, and different language bindings like Java or Python. I think there is now also a REST agent for Neutron on OpenStack. And we have OS packaging for different distributions such as Ubuntu and Red Hat.

So how does this work? There are two most relevant types of nodes: input nodes and internal nodes. The input nodes are the nodes that inject packets into the graph, and usually they are your IO. In the case of dpdk-input, it is direct IO from the interfaces; it could be af-packet input, or it could be any other offload engine, FPGA, or NIC that you may have. Once we get a bunch of packets from one of those input nodes, we move them through the graph in vectors. A vector is basically just a collection of packets, a bunch of packets that we are reading at any given time. Once we get them from the input nodes, depending on the packets we have, they get classified and go to different nodes. If we have Ethernet packets that are not IP, they go to the ethernet-input node; but if the input node has a flag identifying a packet as an IP packet, it goes directly to the IP input node, so we save cycles that way. So VPP can take advantage of the functionality your NIC has and leverage those offload capabilities.

We work on a vector until it is flushed out. If we get, let's say, 200 packets, we work on those 200 packets until they are all out of the graph, either because we dropped them or because we did TX on one of the output interfaces. It's optimized to be cache and memory efficient, as I mentioned before. The idea is that each node fits in the instruction cache. When you get a bunch of packets, the functionality in that node is specific, so it stays in the instruction cache and you don't take instruction cache misses; you then use prefetching to get the packets, execute that node's functionality, and move on to the next node or nodes.

I mentioned the input nodes and the internal nodes before. Here, the green ones are the input nodes and the blue ones are the internal nodes. About the way you configure each node: each node needs to know what its possible next nodes are. You don't know which node packets come from, but you need to know where the packets are going to.

The plugin mechanism is very easy, and plugins are first-class citizens. Even as a plugin, you can add APIs, you can add CLI commands, you can add new nodes, and you can rearrange the graph in a different way. So you don't need to modify VPP itself; you can do it as a plugin. DPDK itself is implemented as a plugin, so we can create input nodes or other internal nodes as a plugin.

So now I'll go through the example. This is the work I did for integrating the DPDK cryptodev API, which is basically a crypto API, into the IPsec implementation that VPP already had. The idea is to modify the DPDK plugin itself: we create new nodes, and we rearrange the graph to take advantage of those nodes.
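To make the next-nodes idea concrete, here is a minimal sketch, in the style of VPP's VLIB_REGISTER_NODE macro, of how a graph node (in-tree or in a plugin) declares where its packets may go. The node name, next nodes, and function body are illustrative, not the actual cryptodev code:

```c
#include <vlib/vlib.h>

/* Illustrative next-node indices: packets leaving this node can only
 * be handed to the nodes declared in next_nodes below. */
typedef enum
{
  MY_NEXT_IP4_LOOKUP,
  MY_NEXT_DROP,
  MY_N_NEXT,
} my_next_t;

static uword
my_node_fn (vlib_main_t * vm, vlib_node_runtime_t * node,
            vlib_frame_t * frame)
{
  /* ... walk the vector of buffer indices in 'frame', choose a next
   * index per packet, and enqueue the packets to the next nodes ... */
  return frame->n_vectors;      /* number of packets processed */
}

/* Registration: a plugin uses exactly this same macro, which is why
 * plugins are first-class citizens in the graph. */
VLIB_REGISTER_NODE (my_node) = {
  .function = my_node_fn,
  .name = "my-node",
  .vector_size = sizeof (u32),  /* vectors carry buffer indices */
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_next_nodes = MY_N_NEXT,
  .next_nodes = {
    [MY_NEXT_IP4_LOOKUP] = "ip4-lookup",
    [MY_NEXT_DROP] = "error-drop",
  },
};
```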
So initially, this is the default path that a normal IP packet that is going to get encrypted takes, without any other encapsulation. We get a vector of packets from the dpdk-input node. Next, because the NIC most likely recognized that these were IP packets, they go directly into the ip4-input node. In that node there are basically some IPv4 header sanity checks and the TTL decrement. Then we go to the FIB. The FIB says: OK, this packet needs to be encrypted. So we go to a virtual IPsec interface, which is the IPsec interface output node. Once we are there, we basically say: OK, we need to encapsulate this packet with ESP and encrypt it. By default in VPP, that happens with OpenSSL. And once we have the new tunnel IP header, we go back to the FIB to find out what the output interface is and what the new L2 rewrite is, and the packet goes out through TX.

So we modified the DPDK plugin to provide new nodes. Those new nodes are dpdk-crypto-input, which is another input node that polls the crypto devices DPDK exposes; dpdk-esp-encrypt, because we do the encapsulation slightly differently and then enqueue those packets down to DPDK; and dpdk-esp-encrypt-post, which basically just fixes up some metadata. At the moment we create those nodes and the plugin gets registered, as you can see, there is no connection from the IPsec interface output to the new nodes. The nodes are registered, they are there, but no packets will be routed towards them. That decision happens later, at runtime. At runtime there are APIs that let us check: do we have crypto devices? Yes? Then we can reconfigure the graph at runtime to take advantage of them. If VPP doesn't find any crypto devices, it can use the default OpenSSL implementation it already provides. So as you can see, the plugin mechanism is very powerful, and it allows you to do basically whatever you need and adapt to the different platforms and functionality you may have.
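As a rough illustration of that runtime decision, here is a hedged sketch: rte_cryptodev_count(), vlib_get_node_by_name(), and vlib_node_add_next() are real DPDK/VPP calls, but the init function and the "ipsec-if-output" node name are assumptions for illustration, not the actual plugin code:

```c
#include <vlib/vlib.h>
#include <rte_cryptodev.h>

/* Hypothetical init fragment: only wire the crypto nodes into the
 * graph if DPDK actually found crypto devices at runtime. */
static clib_error_t *
crypto_graph_init (vlib_main_t * vm)
{
  if (rte_cryptodev_count () == 0)
    return 0;                   /* no devices: keep the OpenSSL path */

  /* "dpdk-esp-encrypt" matches the plugin; the IPsec output node
   * name below is illustrative. */
  vlib_node_t *output = vlib_get_node_by_name (vm, (u8 *) "ipsec-if-output");
  vlib_node_t *encrypt = vlib_get_node_by_name (vm, (u8 *) "dpdk-esp-encrypt");
  if (output == 0 || encrypt == 0)
    return 0;

  /* Add a new arc in the graph: the IPsec output node may now steer
   * packets to the DPDK ESP encrypt node. The returned slot is the
   * next-index used when dispatching packets down that arc. */
  u32 next_slot = vlib_node_add_next (vm, output->index, encrypt->index);
  (void) next_slot;             /* the real code would store this */
  return 0;
}
```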
So this is VPP performance data. It's from the CSIT project, which is another FD.io project that does continuous system integration and testing, and it basically shows the scaling of VPP. On the left side we have IPv4 with 1 million FIB entries, and on the right side IPv6 with half a million entries. As you can see, the scaling across interfaces and cores is almost linear, and the limit here is the bandwidth of the NICs and the PCI, not VPP or the cores. This is running on a dual-socket system, Intel Xeon E5 processors with 256 gigabytes of memory and six Intel XL710s, which are dual-port 40 GbE NICs.

So as a summary, I think the main things to remember are that VPP is fast, it's scalable, and it's very flexible. It's developer-friendly. It has full-featured L2 and L3 implementations, and now also an L4 implementation with TCP and UDP. It's easy to integrate and modify for your own needs: you can have your own node, or you can work upstream, and it has a great, very involved community. So that's it from me. If there are any questions, I'll take them later; I'll let Patrick go first with the second half of the presentation. Thank you.

Hello, everyone. Thanks for coming here today. My name is Patrick Lu, and I'm from Intel's Network Platform Group, based in Arizona. Just a very brief background about our team and myself: we are a performance analysis team that has been focused on IO performance analysis for the past 15 years, and I've been on the team myself for almost seven years now. We have a very deep understanding of the PCIe architecture; we use protocol analyzers to look at every packet. Over time we have also developed various methodologies, using hardware counters or software techniques, to gather this kind of IO measurement insight.

So today I'm going to share with you a couple of practical tools and techniques for identifying bottlenecks on the IO side as well as on the CPU side, on the core: if anything goes wrong, how do we spot it? And then a couple of tips on how to improve performance. The presentation breaks into two parts: uncore and core. I know by now every one of you is familiar with this terminology, and I'll have a graphical representation on the next slide. Basically, I want to show you how to measure PCIe bandwidth and memory bandwidth using a set of hardware counters, with an open source tool; there is nothing proprietary here. Also, on the CPU side, there is IPC, instructions per cycle, a common metric, but there are a couple of things that people new to Intel architecture may not be aware of, and also some other ways to dive deeper into IPC.

So this is a diagram of a modern Intel Xeon server CPU. I break it up into two parts. The yellow part is the core side of the CPU, which comprises the execution units and the L1 and L2 caches, and there can be multiple cores that run independently and simultaneously. The blue part is the uncore; everything that is not core, we call uncore. In a modern CPU, perhaps surprisingly, the cores are actually the smaller part of it. The main real estate is the last-level cache, which can be up to 55 megabytes on the fourth-generation E5, plus the memory controller, the IO, and all those kinds of peripherals. And the peripherals, especially in our domain of network processing, are really the key for interacting with the world. So we need good tools to get more visibility into the uncore space.

The purpose of this slide is not to go through all the bullet points, although you're certainly welcome to revisit them, because they show the very complex interaction between CPU and IO across the life cycle of a packet going into a general-purpose CPU. The real takeaway is that our traditional performance methodology is very core-centric: we look at IPC, we look at branch prediction rates, we look at TLB hits and misses, those kinds of things, very much from the CPU perspective. But our workloads are very IO-centric, so we need a new methodology to look at IO-centric workloads.

So let's jump into the practical part. The main tool I'm going to mention today is the Processor Counter Monitor, PCM. It's a tool I co-developed with other Intel software developers, and it's now open source on GitHub. It's a set of lightweight user-space tools that program our hardware performance counters but display them at a high level of abstraction; it has the formulas to do the conversion. So, for example, at a glance you can see memory bandwidth or the cache miss ratios at various levels. You don't have to do all that conversion yourself.
It also has a C++ API if you want to do the next level of integration into your workload, to get finer, region-by-region performance insight; the API allows you to do that. And it's BSD licensed. I understand that not every deployment environment can install or deploy custom tools, but Linux perf is pretty ubiquitous, so in the backup slides I provide simple command lines, for up-to-date kernels, plus a set of wrappers, so you can accomplish similar things with the Linux perf utility.

Just a brief word about PCIe. Ever since our first-generation E5 processors, we have had a technology called Data Direct I/O, DDIO. Maybe many of you already have first-hand experience with it. Basically, whenever you have IO DMA read or write transactions, your network receive and transmit, instead of going to memory directly, they now go straight into the last-level cache. The advantage is twofold. First, if the CPU can consume the packet directly, your latency is much better, at least twice the saving compared to going out to memory. Second, if that is indeed the case, your memory controller is kept practically idle, and an idle memory controller can power down or run at a lower frequency, so you save platform-level power. But there are conditions under which the packets written into the last-level cache still eventually get evicted to memory, and I'll show you how to catch that and the ways to avoid it.

So, measuring PCIe bandwidth is pretty simple to get started with: you just download the tool, run the make commands, and invoke pcm-pcie. There are some cryptic names here, but I have condensed it into a four-point takeaway. Oh, by the way, because of the time constraint I probably can't go into the lowest-level detail today, but we have a white paper coming out that explains all of this in much greater detail, and we are also available after the talk, so if there's a deep question I didn't cover, we can certainly grab a whiteboard afterwards.

So, the four-point takeaway. First, look at the write events to measure your inbound writes: this is your network receive, whether it is a full cache-line write or a partial cache line, less than a full 64-byte line, which shows up as read-for-ownership (RFO). Then there is the transmit path: when you send a packet out of the network interface onto the wire, that triggers the PCIeRdCur events. And MMIO is basically your outbound operations: the CPU either reads status from the PCIe device or notifies the device to initiate an asynchronous DMA operation for a receive or transmit.

Once you know what you can measure, what you want to do is match your expected IO throughput against these reliable hardware counters. As you get more experience developing packet processing on general-purpose processors, you may not need to do this as often, but initially, we have seen a couple of cases from customers where they forgot to account for a device: they were using a storage device, a PCIe NVMe drive, to store data coming from the wire, or they were using a hardware crypto accelerator, and they didn't consider that those are also using PCIe bandwidth, or the equivalent of it. So they were basically hitting the PCIe bandwidth limit: really 1.5x or sometimes 2x the wire bandwidth for every packet coming into the system. So it's a good idea to just calibrate the IO bandwidth of your workload and make sure there are no surprises.
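As a back-of-the-envelope illustration of that calibration, here is a small standalone sketch; the packet rate and size are assumptions (64-byte packets at 10 GbE line rate), not measurements from the talk:

```c
#include <stdio.h>

/* Compare the inbound-write bandwidth a traffic profile implies with
 * what pcm-pcie reports; a large gap suggests another PCIe device
 * (NVMe, crypto accelerator) is also consuming bandwidth. */
int
main (void)
{
  double pps = 14.88e6;         /* assumed: 64B line rate at 10 GbE */
  int pkt_bytes = 64;           /* assumed packet size */
  int lines_per_pkt = (pkt_bytes + 63) / 64;    /* DMA is per cache line */

  double rx_write_bytes_per_s = pps * lines_per_pkt * 64.0;
  printf ("expected PCIe inbound write: %.2f GB/s\n",
          rx_write_bytes_per_s / 1e9);
  return 0;
}
```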
And when it comes to MMIO writes, in DPDK I think this concept is well understood: batch your MMIO write operations, because you want to save the available PCIe bandwidth for the receive and transmit functions, not for the doorbell-signaling steps. The orange part is what you want to avoid. Unfortunately I don't have time to go into great detail, but this one I absolutely want to call out: MMIO reads. I don't know about other architectures, but on Intel architecture an MMIO read is a very long latency operation; it can easily take up to 200 cycles for a single read. And there is very little you can do to hide that kind of long latency; pipelining and out-of-order execution don't work with MMIO reads. So avoid MMIO reads at all costs.

OK, so measuring memory bandwidth is as easy as pcm-pcie: just compile and run pcm-memory. What I want to highlight here is a typical fully loaded two-socket configuration. There are three main components: the CPU, the memory, and the IO devices. All three of them have to be aligned on the same socket. If not, they introduce extra memory traffic, and not only that, they introduce latency. So when you see memory traffic in the system, if you know why, that's best; but when you do see it, these are the three common reasons. First, if you have wrongly allocated NUMA for any of the three components, try to fix it in the code or isolate it. Second, DDIO misses: every packet comes into the last-level cache, but if the packets don't get consumed by the code fast enough, the new ones coming in will evict the old ones out to memory. That means: try to process or move those packets as fast as you can. And here I'll highlight, although as I discussed with Sergio it's not universally applicable, one of the tuning techniques: sizing your descriptor ring appropriately, as in the sketch below. Typically it can be 1K for RX descriptors, or similar, up to 4K. The bigger you size your descriptor ring, the more packets you can absorb, and you probably avoid packet loss. But on the other hand, a larger descriptor ring means more packet buffers in flight in the system, and more packet buffers in the system makes cache evictions more likely. So size that descriptor ring appropriately; don't just say 4K is good for every test case. It may be OK for 20 gig or 40 gig, but if you scale to 100 gig, 140, 160 gig, you may have to look at the system level and reduce the descriptor ring size per device accordingly.
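Here is a minimal sketch of that tuning point using the standard DPDK queue-setup call; the ring size and helper function are assumptions to illustrate the trade-off, not recommended values:

```c
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Assumed ring size: big enough to absorb bursts, small enough not to
 * keep so many buffers in flight that DDIO starts evicting to memory.
 * Tune per port and per system-level packet rate. */
#define RX_RING_SIZE 1024

static int
setup_rx_queue (uint16_t port, struct rte_mempool *pool)
{
  /* Keep the device, its descriptors, and the polling core on the
   * same NUMA node to avoid cross-socket memory traffic. */
  int socket = rte_eth_dev_socket_id (port);
  if (socket < 0)
    socket = rte_socket_id ();  /* fall back to the caller's socket */

  return rte_eth_rx_queue_setup (port, 0 /* queue id */, RX_RING_SIZE,
                                 socket, NULL /* default rx conf */, pool);
}
```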
Finally, actually the most common place you see memory traffic is where you have a hash or lookup table; it depends on its size and how well it can be cached in your last-level cache. If it's very big, there's a high likelihood you will be fetching it from memory. If the accesses are streaming, meaning they all come in in a pipelined fashion, you might be OK; but if not, they can incur latency and impact your performance. So check for latency.

So the takeaways from the uncore side. The first, and also the easiest, is to calibrate your IO bandwidth against your expected performance using the hardware counters available to you. Also check for unexpected cross-NUMA traffic: if you have a single-socket system you will never run into this issue, but if you are scaling, now running the same workload on the second CPU socket, and you don't use the NUMA APIs to allocate things on the right node, you will suddenly see a performance drop without knowing where it comes from; just look at the memory bandwidth. And avoid MMIO reads, as I said, a long-latency operation; batch MMIO writes; and hopefully we can see each other another time to talk about how to optimize software to make the best use of DDIO, which is a fairly advanced topic.

OK, so moving to the core side. I prepared an interesting puzzle with very big numbers here for the audience to take a quick look at. This is VPP 17.04 running in two different configurations. The first, measured with perf, gives me an IPC of 2.63; the second gives me 2.12. So which one do you think has the higher throughput per core? With IPC, generally the higher the better, right? So you would say higher IPC means higher throughput. The reality is this: A is what we had prior to 17.01, before we had cryptodev, the tuned VPP IPsec using OpenSSL, whereas B is using the AES-NI cryptodev acceleration. B's IPC comes down slightly, but IPC is really just the ratio of instructions divided by cycles. What was surprising to me is that the traditional path has a lot of layers of processing to accomplish the same functionality: it takes an average of more than 15,000 instructions per packet for the IPsec encryption to do its job. So even though the IPC is well tuned, with a well-designed dual loop, prefetching, and those kinds of techniques to keep the IPC smooth, simply the amount of code that has to run for every IPsec packet causes a huge number of cycles per packet. Whereas with cryptodev we introduced a more direct path to accomplish the same IPsec processing; we only execute about 1,800 instructions per packet on average, and that in turn reduces our cycles per packet by six times, and that's your speed-up in performance.

So the takeaway is that IPC is really just a ratio. Unless it is really poor, like less than one, which can indicate a bigger issue, once you are close to two or above, look at instructions per packet or cycles per packet; those are better core metrics for comparing performance. And cycles per packet is also nice because if you change to a different SKU with a different frequency, you can still scale it appropriately.
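To make the arithmetic behind that comparison explicit, here is a tiny sketch; the instruction and IPC figures are just the ones quoted in the talk:

```c
#include <stdio.h>

/* cycles = instructions / IPC, so
 * cycles/packet = (instructions/packet) / IPC,
 * which is the metric that tracks throughput per core. */
int
main (void)
{
  double ins_per_pkt_a = 15000.0, ipc_a = 2.63; /* OpenSSL path */
  double ins_per_pkt_b = 1800.0, ipc_b = 2.12;  /* cryptodev path */

  double cpp_a = ins_per_pkt_a / ipc_a;         /* ~5700 cycles/packet */
  double cpp_b = ins_per_pkt_b / ipc_b;         /* ~850 cycles/packet */

  printf ("A: %.0f cycles/pkt, B: %.0f cycles/pkt -> %.1fx speed-up\n",
          cpp_a, cpp_b, cpp_a / cpp_b);
  return 0;
}
```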
So beyond IPC, how can we zoom in further to understand where the CPU bottleneck is? This is where I want to spend a few minutes talking about Processor Trace. It's new hardware, there since the fourth-generation processors, and it's also available in our Denverton Atom-based microservers, which is pretty good: you can use it on different classes of machines. You can think of it basically as a hardware branch tracer. Without any conditions, if statements, for loops, or anything, the CPU would just fetch the next instruction and execute it, forever. But whenever you have an if, the condition depends on runtime state: your packet size, your packet header, the content decides where you go. So this branch tracer can reveal the exact execution path of how your program was executed. It's a dynamic tracing technique; it's like having an analyzer in your CPU, for free.

OK, just a couple of words about the integration with the ubiquitous Linux perf tools. At the bottom here you can see a flame graph of the various transitions between the different functions doing the packet processing. The real takeaway is that Linux perf with a modern kernel, 4.4 or newer, like Ubuntu 16.04 or newer, has out-of-the-box support for this hardware. It can give you the finest detail of how your code was executed, but you need to do some post-processing to reduce the tracing data, because even one millisecond of trace gives you maybe more than 20 million instructions; that's maybe thousands of loop iterations of your packet processing. So reducing that is a data analytics subject in itself.

Here is another interesting demonstration of processor trace applied to VPP, and how we can get more insight for performance optimization. You can think of this chart as a VPP trace, where you see a packet coming in on receive, going through the various nodes' processing, and being transmitted out, but at even finer granularity. I have abstracted it here to show the key components, but you can really trace at instruction granularity what happened to your packets. The first thing to call out is that we have a vector with 256 packets, and for people who know DPDK, each receive burst is 32 packets, so to gather that many packets into the vector we have to call it 8 times: 8 times 32 is 256. So you can see whether your CPU is fully loaded, whether the traffic is coming in bursty or very scattered. Similarly on transmit: we received 256, we want to send 256 out; TX is bursting at 16, so you call it 16 times to get through that vector.
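For the DPDK-minded, here is a rough sketch of that accumulation; the vector size of 256 matches VPP's frame size, while the function and burst constant are illustrative:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define VECTOR_SIZE 256         /* VPP's frame size */
#define RX_BURST 32             /* per-call burst seen in the trace */

/* Gather up to one full vector from a port: at line rate this is
 * 8 calls of 32 packets each; under light load it returns early,
 * which is exactly the bursty/scattered pattern the trace shows. */
static uint16_t
gather_vector (uint16_t port, uint16_t queue, struct rte_mbuf **vec)
{
  uint16_t n = 0;
  while (n < VECTOR_SIZE)
    {
      uint16_t got = rte_eth_rx_burst (port, queue, vec + n, RX_BURST);
      if (got == 0)
        break;                  /* nothing more waiting on the NIC */
      n += got;
    }
  return n;                     /* up to 256 packets for one vector */
}
```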
And here you can also see which are the densest functions in your code. VPP gives you cycles per node, but this basically shows you how many instructions per node you are executing. Instructions are what the programmer writes, the C code that generates the assembly, and cycles per packet is the amount of time it takes to execute that work. With this level of detail you can see where you need to spend your effort for the next level of optimization; or, if hardware offload comes into the picture, this can also tell you where you may need to consider it, once you have really exhausted all the software options.

And one final thing to call out; this was actually unexpected, but while preparing this presentation we found out that the typical way of launching VPP did not use the most optimal, or actually I should say the most minimal, receive path, because the initialization parameter passed from VPP to DPDK was 9K for the max packet size, so DPDK just chose to use the scattered receive path. This kind of thing is good to call out: you may not notice it at compilation time or development time, but when you're really doing the benchmarking, this shows you which functions are called at runtime.

OK, so to summarize everything we quickly went through today: use the performance monitoring hardware available to you to baseline your workload performance; there is a rich set of open source tools for you, PCM or Linux perf. And then improving performance really comes down to two main factors. First, whether you can reduce the instruction count per packet by creating a more direct VPP path for processing packets; the IPC slide was a great example of how a very structured approach can take a very large number of instructions per packet, but if you understand what your end goal is, maybe you can take more of a shortcut, like implementing a plugin to process it directly. Also consider hardware offload once you have exhausted the optimizations on the software side. Second, reduce cycles per packet; the meaning is clear, and to do it you can go through the uncore performance takeaways to minimize your latency as much as possible. And one thing we did not cover today, but which is good to keep in mind, is measuring L1 miss latency: with that we can get a residency view, basically whether your data structures are mostly cached in L2, in the LLC, or in memory, and that gives us an indication of which hash table we have to optimize. And then there are a lot of references for you to download the tools and documentation. Thank you.

Audience: Hi, thank you for the presentation. My question is about virtualization, using VMs and containers, leveraging what you have presented. Do you have any performance benchmarking that you have run using containers and VMs, leveraging the user-space flow graphs as well as your optimized host configuration? Thank you.

Sergio: I think the data that we presented, at least the CSIT graph on scaling VPP, was all run on the host, so there was no virtualization and no containers. If you look into the CSIT project, there might be some use cases for containers. Usually with containers, at the moment, we are not doing the SR-IOV path, so it would be with AF_PACKET or vhost/virtio, but I think there may be some performance information there; I don't know off the top of my head.

OK, I think that's it then. Thank you.