Hi, everyone. My name is Florin Coras. I'm a Cisco technical lead and an FD.io VPP project committer, and in today's talk I'd like to give you a high-level overview of the benefits of using VPP as Envoy's network stack. My background is in networking; in particular, I'm one of the co-creators of VPP's host stack, so I typically talk about transport protocols and socket layer implementations. Today, however, I'll mainly focus on how Envoy can leverage user space networking and some of the benefits it offers.

Before we dive in, and in the interest of those of you who are not familiar with VPP, a very quick introduction. VPP is an L2-L7 networking stack which at its core leverages two important ideas: vectorized packet processing and the modeling of forwarding as a directed graph of nodes. When done correctly, this ensures really efficient use of the CPU's cache hierarchy and, consequently, minimal per-packet overhead when doing software forwarding. Another really important aspect of this approach is composability; that is, starting from these simple ideas, one can implement all types of network functions, from device drivers to L4 features, and then tie them together to build a really efficient network processing pipeline.

Looking at this from a less abstract standpoint, it's worth noting that VPP is typically used together with DPDK, so it supports a large set of network interfaces, but it also has a smaller set of really efficient native drivers. It supports L2 switching and bridging, IP forwarding, and virtual routing and forwarding, that is VRF, so it has the right constructs for IP-layer multi-tenancy. In addition to these basic L2 and L3 functions, it also supports a multitude of additional features. Just to name a few: a very efficient IPsec implementation, ACLs, NAT, MPLS, segment routing, and several flavors of tunneling protocols, things like VXLAN and LISP.

Now, on top of the networking stack, VPP also implements a custom host stack, built and optimized in a similar fashion. As one might expect, it supports commonly used transports like TCP and UDP, but also TLS and QUIC. The session, or socket, layer provides a number of features, but perhaps the most important for the context of this talk is the shared memory infra that can be used to exchange IO and control events with external applications using per-worker message queues. And finally, to simplify interoperability with applications, VPP provides the VPP Comms Library, or VCL, which exposes POSIX-like APIs.

I guess that by this point some of you may be asking the inescapable question: why yet another host stack? And you'd be right to ask, because from a functional perspective, Linux is obviously the go-to stack. However, because Linux's networking stack was designed around a single-packet, run-to-completion model, per-packet performance is limited. This is especially noticeable when hardware acceleration cannot be leveraged. Furthermore, in addition to the performance benefit, the fact that the stack runs in user space can be used to optimize interactions and perhaps minimize data copies. Also, because the whole protocol stack is packaged with the application, it can potentially be customized or extended in certain situations; one can certainly imagine scenarios where the sockets provide more context data to the underlying layers with the aim of improving network utilization by the apps. And note that all of this does not preclude Kubernetes integration; in fact, VPP can be used as a data plane by CNIs like Calico.
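To make the "POSIX-like" claim a bit more concrete, here is a rough sketch of what a blocking TCP client could look like when written directly against VCL. This is not code from the talk; it is an illustrative sketch based on my reading of the vppcom_* calls exposed by VPP's vcl/vppcom.h, and exact names, struct layouts, and signatures may differ between VPP releases. The point is simply how closely the flow mirrors the familiar socket()/connect()/write()/read() sequence.

```c
/* Illustrative sketch only: a minimal blocking TCP client on top of VCL.
 * Based on the vppcom_* API from VPP's vcl/vppcom.h; exact signatures and
 * struct layouts may vary across VPP releases. Requires a running VPP
 * instance with the session layer enabled. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <vcl/vppcom.h>

int main (void)
{
  struct in_addr dst;
  inet_pton (AF_INET, "10.0.0.2", &dst);

  /* Attach this process to VPP's session layer. */
  if (vppcom_app_create ("vcl-demo-client") < 0)
    return -1;

  /* socket() analogue: allocate a blocking TCP session. */
  int sh = vppcom_session_create (VPPCOM_PROTO_TCP, 0 /* is_nonblocking */);

  /* connect() analogue: the endpoint carries address family, IP and port. */
  vppcom_endpt_t ep = { 0 };
  ep.is_ip4 = 1;
  ep.ip = (uint8_t *) &dst;
  ep.port = htons (80);
  if (vppcom_session_connect (sh, &ep) < 0)
    return -1;

  /* write()/read() analogues operate on a session handle, not a kernel fd. */
  const char req[] = "GET / HTTP/1.0\r\n\r\n";
  vppcom_session_write (sh, (void *) req, sizeof (req) - 1);

  char rsp[2048];
  int n = vppcom_session_read (sh, rsp, sizeof (rsp));
  if (n > 0)
    fwrite (rsp, 1, n, stdout);

  vppcom_session_close (sh);
  vppcom_app_destroy ();
  return 0;
}
```

For applications that cannot be modified at all, VCL also ships an LD_PRELOAD shim that intercepts the standard socket calls, though the Envoy integration discussed next takes the explicit route through a dedicated socket interface.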
So how exactly does Envoy integrate with VCL, and what sort of changes were needed? Rather intuitively, the first step was to make sure that Envoy components do not make any assumptions about the underlying socket layer and consequently always use generic socket interfaces, such that they can interoperate with custom socket layer implementations once those become available. Obviously, this is not exactly glamorous work, as the changes are not so much features as they are API refactoring. Still, out of the set of changes that have gone in, perhaps the most notable are the following. As a general rule, we now avoid using raw file descriptors anywhere in the code; IO handles still expose the FDs, but last time I checked we had managed to clean things up to the point where they were only used in, I believe, a couple of places. We added support for pluggable IO handle factories, that is, support for multiple types of sockets. Another interesting consequence of the first point is that file event creation is now delegated to the IO handle implementations, so as a desired side effect, the socket layer that provides the IO handle is now the one that decides how events are created; in other words, socket events are no longer tightly coupled to libevent. And finally, an interesting scenario that might serve as an example going forward was TLS, which, mainly for convenience reasons, relied on BIOs that needed explicit access to the FD. It eventually turned out that writing a custom BIO that uses the IO handle instead of the FD is relatively straightforward, so we actually switched to that.

Now, all of these changes are enough to allow the implementation of a VCL-specific socket interface, but they still leave one more problem to be solved, namely that both libevent and VCL want to handle async polling and the dispatching of the IO handles, and only one of them can be the main dispatcher. The solution to this problem is to leave control to libevent and to register with libevent the eventfd associated with a VCL worker's message queue. If you recall, the MQs are used by VPP to convey IO and control events to VCL, and the eventfd is used to signal MQ transitions from the empty to the non-empty state. This ultimately means that VPP-generated events force libevent to hand over control to the VCL interface, which, for each Envoy worker, uses a locally maintained epoll FD to poll events from VCL and subsequently dispatch them.

Now, these are just the stepping stones for the Envoy-VCL integration, and as first next steps, the plan is to further optimize performance. The lowest hanging fruit here are the read operations, as VCL could pass pointers to socket data in the shape of buffer fragments instead of doing a full copy. The groundwork for this is already done; what's left is the actual integration. And speaking of performance, to evaluate the potential benefits of this integration, I built the following topology, wherein wrk connects through VCL to Envoy, which performs HTTP routing to a back-end nginx. Now, this type of scenario might not be relevant in practice, and in fact I'd be delighted to learn if that's the case, and also what types of scenarios would be interesting for those who actively deploy Envoy. Nonetheless, for the purpose of this experiment it is ideal, because it gives us an idea of how many VPP workers are needed to load Envoy and an upper bound on performance.
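Before looking at the numbers, here is a rough sketch of the dispatch hand-off just described: libevent stays the main dispatcher, the only thing it watches on VCL's behalf is the message queue's eventfd, and the callback then lets the VCL side drain and dispatch its own session events. This is not Envoy's actual integration code; the vcl_* helpers below are hypothetical stand-ins (the real integration naturally goes through Envoy's IO handle abstractions rather than plain C callbacks), and only the libevent calls are the library's real API.

```c
/* Illustrative sketch only -- not Envoy's actual integration code.
 * Shows the hand-off pattern described above: libevent remains the main
 * dispatcher, and the only fd it watches on VCL's behalf is the eventfd
 * that signals a worker's VPP message queue going from empty to non-empty. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <event2/event.h>

/* Stand-in for the eventfd VPP attaches to a VCL worker's message queue.
 * In the real integration this fd comes from VCL, not from eventfd(2). */
static int
vcl_worker_mq_eventfd (void)
{
  return eventfd (0, EFD_NONBLOCK);
}

/* Stand-in for "poll the worker-local epoll FD and dispatch the pending
 * VCL session events" (read-ready, write-ready, accepted, connected, ...). */
static void
vcl_worker_poll_and_dispatch (void)
{
  printf ("VCL side: drain message queue, dispatch session events\n");
}

static void
on_vcl_mq_signal (evutil_socket_t fd, short events, void *arg)
{
  (void) events; (void) arg;

  /* Clear the eventfd so the next empty->non-empty MQ transition signals again. */
  uint64_t ticks;
  (void) read (fd, &ticks, sizeof (ticks));

  /* libevent hands control over to the VCL interface for this worker. */
  vcl_worker_poll_and_dispatch ();
}

int
main (void)
{
  struct event_base *base = event_base_new ();

  /* The single coupling point between the two event loops: register the MQ
   * eventfd with libevent and let VPP-generated events wake the callback. */
  struct event *mq_ev = event_new (base, vcl_worker_mq_eventfd (),
                                   EV_READ | EV_PERSIST, on_vcl_mq_signal, NULL);
  event_add (mq_ev, NULL);

  /* Regular kernel-backed sockets keep using libevent exactly as before. */
  return event_base_dispatch (base);
}
```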
Now, at a glance, these results show us that, for an equal number of cores, one VPP worker is actually enough to outperform the kernel by a significant margin; that is, performance seems to be a good 20% to 40% better and to scale pretty well. However, after a certain point, around 4 to 5 workers, performance no longer scales linearly, and it behaves somewhat worse for larger payloads, although it should be noted that TSO was not enabled for VPP in this scenario. So the results are really encouraging, but there are still some things that need further investigation for a better understanding.

With that, should you be interested in further exploring the Envoy-VPP integration, please give the code a try. For more in-depth conversations, you should be able to grab me on one of Envoy's Slack channels. Before I conclude, I'd like to quickly say thank you to Matt and the whole community, Lizan, Antonio, Draggyan, just to name a few, for the constant support and openness towards the refactoring effort. And with that, thank you very much for your attention, and I look forward to your questions.