Hi everyone, I'm Ciara Loftus, this is Bruce Richardson, and we're both from Intel, as you might have guessed. Today we're here to discuss some prototyping work which we and a couple of others from Intel have been working away at for the last while. The work itself has been driven by some challenges and trends with Cloud Native, particularly in the area of switching container to container, or application to application, or east-west as it's often referred to.

So as regards an agenda, we'll elaborate a bit more on that Cloud Native problem statement. We'll discuss how we switch east-west today using current approaches to v-switching. We'll put forth our proposed scalable v-switching solution and discuss its benefits, including raw performance as well as better utilization of resources and automatic scaling. We'll discuss some next steps and hopefully have some time for Q&A at the end.

So in terms of the problem statement that we've been looking at: given that this is an SDN room, I expect most people are familiar to some degree with SDN, and in the comms world with NFV, network function virtualization, which is very much a move away from discrete appliances towards more virtualized infrastructure, where you've got VMs deployed on COTS hardware in place of custom boxes. Looking beyond that, we then see further trends towards containerization in this Cloud Native style of deployment, where you take your monolithic VM, which may be using, let's say, four cores as in the diagram here, or possibly eight or ten cores, whatever it happens to be, and subdivide it further into individual containers or microservices, formed into service chains or intercommunicating containers of some degree or other, to give you additional granularity of deployment and scalability.

However, if we look at this from a networking point of view, we see challenges in terms of the amount of network bandwidth required. If you look at your monolithic VM as shown here, you may have, let's say, a 25 gig connection coming into it. If you break that up into four service containers, you now have a whole bunch of additional traffic running between them that you have to manage as well, so your 25 gig network may suddenly shoot up to needing 100 gig or more because of this east-west traffic, as we refer to it. If you've got a real network out there and you're trying to send all of this across it, your network infrastructure is going to have to scale up hugely. But if you co-locate these containers on a single system, we then need to look at how we handle switching between them on a virtualized switch within that platform. How to handle these east-west connections on a single platform, efficiently and with high performance for networking and communications, is really the high-level problem that we set out to look at.

Okay, so how do we switch east-west today? Current approaches to v-switching typically use a centralized model, whereby a dedicated number of cores are nailed up and essentially reserved for the sole purpose of virtual switching. For example, in this diagram here we've got two of those nailed-up v-switch cores servicing four network functions. A typical flow through one of those v-switch cores would be: we receive a packet from network function 0, we parse it and see that maybe it's got a VLAN tag of 200, we consult our lookup table, see that there's a rule there to send those types of packets to network function 1, and we proceed to execute that action.
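To make that flow concrete, here is a minimal sketch in DPDK-style C of the loop one of those nailed-up v-switch cores runs; the vlan_lookup() helper and the port arrangement are hypothetical stand-ins for illustration, not the actual OVS-DPDK datapath code.

```c
/* Minimal sketch of a centralized v-switch core (illustrative only, not the
 * real OVS-DPDK datapath). The dedicated core polls each network function's
 * virtual interface, classifies on the VLAN tag, and forwards accordingly. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_byteorder.h>

#define BURST 32

/* Hypothetical lookup: VLAN ID -> destination port (e.g. 200 -> NF 1). */
extern uint16_t vlan_lookup(uint16_t vlan_id);

static void
vswitch_core_loop(const uint16_t *nf_ports, unsigned int nb_ports)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        for (unsigned int p = 0; p < nb_ports; p++) {
            uint16_t nb_rx = rte_eth_rx_burst(nf_ports[p], 0, pkts, BURST);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* Parse: assume an 802.1Q tag follows the Ethernet header. */
                struct rte_ether_hdr *eth =
                    rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
                struct rte_vlan_hdr *vlan = (struct rte_vlan_hdr *)(eth + 1);
                uint16_t vid = rte_be_to_cpu_16(vlan->vlan_tci) & 0x0fff;

                /* Classify, then execute the action: send to the matching NF. */
                uint16_t dst = vlan_lookup(vid);
                rte_eth_tx_burst(dst, 0, &pkts[i], 1);
            }
        }
    }
}
```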
Those types of operations, parsing and classification, are pretty heavy duty, and as such your v-switch cores are typically going to be pretty busy, and with this extra east-west traffic they're even busier. Another challenge that Cloud Native brings is that we've got a dynamic environment, with workloads spinning up and down quite frequently, and the challenge there is that the underlying infrastructure, the v-switch in this case, needs to be able to respond to that increase or decrease in demand pretty much instantly, otherwise you run the risk of dropping packets or providing a poor service to your network functions or your tenants. So there's a conflict here between the static v-switch configuration, with these nailed-up cores, and the dynamic environment with workloads spinning up and down, and that leaves the operator deploying the v-switch with two options really. Either they try to scale v-switch resources up and down as the workloads scale up and down on the platform, but that's pretty heavy handed, it requires manual intervention, and nobody really wants that. The alternative then is to over-provision and nail up a ton of v-switch cores at the beginning, so that over the lifetime, as workloads spin up and down, you'll always deliver that service, but that's pretty poor utilization of your resources; it's not very efficient. So there really does exist a need for a better solution without this manual intervention or over-provisioning, because neither of those options is really Cloud Native friendly.

So in order to get scalability in a solution, we were looking at how we can distribute our v-switch and co-locate the v-switching requirements alongside the workload. The idea being that, yes, you may have some traffic coming in and out of the platform, but if you've got communicating entities on that platform, each of them is going to require a certain amount of v-switching logic. So can we move away from having four cores, or three cores, or eight cores, or whatever it happens to be, dedicated on the platform to your v-switch, be it OVS-DPDK or whatever it happens to be, and instead put a little bit of v-switching alongside each container on each core, as shown in the diagram on the right there. In that way we can scale more easily. If you have a workload that's using eight cores and the traffic rate on your network goes up and you now need to spawn an additional four or a further eight instances, you're able to scale up the v-switching logic alongside that by co-locating them.

If we manage to achieve this, there are some other benefits we can get in terms of lower-level efficiencies for moving packets from one container or one application to another. If we look at how we do this right now, if we want to send a packet from container one on core one to container two on core two, we actually have to take two hops with that one packet: you have to send the packet from core one to your v-switching core, and then from your v-switching core to core two. So there are two moves of data and possibly two packet copies involved there. Obviously this is just an inefficiency. Any CPU cycles you spend copying the data, or trying to read it from remote caches, or any of that, are all just pure overhead; they're not actually doing the real work you want to do on the packet. Whereas ideally we'd much rather have a situation like this, where we can send the packet straight from core one to core two.
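As a rough picture of where the switching cycles end up under the distributed model, here is a short sketch of a worker core's loop; workload_process() and dvs_tx() are hypothetical names for the network function's own processing and a co-located switching entry point (the mechanism behind dvs_tx() is described next).

```c
/* Sketch of a worker core under the distributed model: the same core that
 * runs the network function also spends some of its cycles switching its own
 * egress traffic, instead of handing it to a dedicated v-switch core. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

extern void     workload_process(struct rte_mbuf **pkts, uint16_t n); /* e.g. crypto */
extern uint16_t dvs_tx(struct rte_mbuf **pkts, uint16_t n);           /* co-located switching */

static void
worker_core_loop(uint16_t my_port)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(my_port, 0, pkts, BURST);
        if (nb_rx == 0)
            continue;

        /* Cycles spent here are the real work the tenant cares about... */
        workload_process(pkts, nb_rx);

        /* ...and cycles spent here are the v-switching that previously lived
         * on a nailed-up core; it now scales with the workload instances. */
        dvs_tx(pkts, nb_rx);
    }
}
```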
The challenge here is how do you do that while still maintaining some degree of security, container isolation, all that good stuff that we want to have. We have restricted resources like our lookup tables or the interfaces themselves, the interface rings for virtio or whatever your interface type happens to be. Those all need to be protected. If you put those inside the container, then we're giving an untrusted application access to restricted resources, so that's a no-go. If you put them outside the container, then how do they get accessed on the core? How can we do that?

So even though I come from a DPDK background, as some of you may know, and work in user space all the time, I think in this case, for containers especially, the answer has got to be in the kernel itself. Because in the container world, the kernel in the container is actually the host kernel. It's shared across all the various containers on the system, and since it's on the host, it's also the trusted environment where we can put things like lookup tables and virtio ports, or the rings of whatever connection type we're using. So we can switch on the core from user space to kernel space to actually do our packet lookups and packet transfers.

What we looked to do was build a prototype of this setup, to see just how it would perform compared to the centralized v-switch model. We implemented this using two DPDK applications that need to talk to each other on two remote cores. So how did we set this up? Well, as in most cases, whenever you need to set up your network interface you need to register some packet buffers. If we want to do a direct core-one-to-core-two transfer, we need to make sure those buffers are accessible from core one in the kernel. So what we do, whenever the second, receiving container starts up, starts configuring its interface and registers its buffers, is duplicate those buffer mappings down into kernel space. This means that any core, when running in kernel space, can write to those buffers to copy packets in, but they're still isolated from the container, because the container is in user space and still doesn't have access to them.

Then, after that setup is done, we can start transferring packets pretty easily. We can have our transmitting container make some sort of system call; in our initial prototype we just used an ioctl for ease of use. That system call switches us on that core into kernel mode, and one particular benefit we have here is that the packet we're sending is already in all our local caches, because we haven't actually switched core, we've just moved from user to kernel on that one core. So we get a lot of benefit from cache locality with a scheme like this.
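The prototype's exact interface isn't spelled out beyond "we just used an ioctl", so the sketch below is a hypothetical user-space view of it: the device node, ioctl numbers and struct layouts are invented for illustration, but they show the two steps just described, registering the receiver's buffers so the kernel can map them, and then handing bursts of packets to the kernel on the sender's own core.

```c
/* Hypothetical user-space interface to the kernel-assisted transfer described
 * above (device name, ioctl codes and structs are illustrative, not the
 * prototype's actual API). Setup exposes the receiver's buffers to the kernel;
 * transmit hands over a burst of packets with a single system call. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define DVS_DEV            "/dev/dvswitch"   /* hypothetical char device, opened at init */
#define DVS_REGISTER_MEM   _IOW('d', 1, struct dvs_mem_region)
#define DVS_TX_BURST       _IOW('d', 2, struct dvs_burst)

struct dvs_mem_region {      /* receiver's buffer pool, duplicated into the kernel */
    uint64_t addr;
    uint64_t len;
};

struct dvs_burst {           /* burst handed over in one system call */
    uint16_t nb_pkts;
    uint64_t pkt_addrs[32];  /* sender's packet addresses, already cache-hot */
};

/* Receiver side, at interface setup time: expose its packet buffers so any
 * core running in kernel space can copy packets into them. */
static int dvs_register(int fd, void *base, uint64_t len)
{
    struct dvs_mem_region reg = { .addr = (uint64_t)(uintptr_t)base, .len = len };
    return ioctl(fd, DVS_REGISTER_MEM, &reg);
}

/* Transmitter side, per burst: one system call switches this core into the
 * kernel, where the flow lookup and the single copy into the destination
 * buffer happen; the syscall cost is amortized over up to 32 packets. */
static int dvs_tx_burst(int fd, const uint64_t *pkt_addrs, uint16_t n)
{
    struct dvs_burst b = { .nb_pkts = n };
    for (uint16_t i = 0; i < n && i < 32; i++)
        b.pkt_addrs[i] = pkt_addrs[i];
    return ioctl(fd, DVS_TX_BURST, &b);
}
```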
Anyway, once we're in kernel space, we can look to do our lookups. We can have lookup tables in the kernel, configured by some external entity, whatever that happens to be, and we can do those table lookups there. In our prototype we were actually working off the OVS code base, so we transferred the dpcls table lookup routines from OVS into kernel space to have some real v-switching logic in there. Then we can do a single copy from source core to destination core. We don't need to do two copies; our initial prototype actually did two, but we managed to get it down to one using the memory mapping scheme described here. In that case we can copy straight into the destination buffer, which has the advantage that it potentially allows container 2 to receive that packet without having to make a system call itself, because system calls are not the cheapest thing in the world. But one thing we do know from DPDK and other high-performance packet processing environments is that if you have expensive operations, you can have schemes to get around them using amortization, and that's exactly what we're doing here: if you make one system call to transfer 32 packets from core 1 to core 2, it's not actually so bad, especially when you compare it to, let's say, a vhost-virtio combination; the cost of one system call every 32 packets, in terms of cycles, is not that significant.

Okay, so how does our solution perform? For our benchmarks we took a fixed resource pool of 12 cores and a fixed workload, and at a very high level we try to run as many of these workloads on our 12 cores as we can. Looking at the workload in a little more detail, we've got two applications, each using one core to encrypt and forward packets between pairs of physical and virtual interfaces, and the part that we're particularly interested in is that we switch packets east-west between our two virtual interfaces using one of two different v-switch implementations. The first implementation is a well-known DPDK-based centralized v-switch, and that of course uses an extra core to perform the switching. The second is a prototype of the solution we just described, which does not require that extra core because the switching functionality is up on the worker cores.

We chose the crypto workload because it's cycle intensive, and that way, when we move our v-switching up onto the same core as the workload, you can really see the penalty that you have to pay: you take cycles away from your workload, and with the crypto workload the cycles you take away are critical to application performance, as opposed to, say, something like DPDK's testpmd or some simple IO forwarding app, where the cycles you take away wouldn't be as critical and you wouldn't take as big a hit. So it's just to make the benchmark a bit more transparent. Anyway, to generate our data we sent iMix packets at 40 gig line rate into both ends of the pipeline, and we measure what we receive back and compare for both implementations.

Okay, so on the left here is our centralized v-switch configuration, and you can see we're using 8 cores to run 4 of the 2-core workloads, and the remaining 4 cores, in orange here, are our dedicated v-switch cores. To rephrase that, we're running 4 workloads and we've got 1 nailed-up core per workload, and that uses our full 12-core budget. Then for the distributed v-switch, you'll see that to run 4 of the workloads we only need to use 8 cores because, of course,
we're not nailing up dedicated v-switch cores; the v-switch is running on the same core. That of course comes at a price: we're taking precious cycles away from our workload, we're not doing as much work, and in this case the hit is 10%. So we're down 10% in the distributed v-switch case, but we've got a 33% saving in cores, using 8 cores versus 12, so the 10% hit doesn't seem as big when you consider the core utilization. An interesting thing to do then is to take those 4 idle cores and do something useful with them, like run 2 more workloads. In that case we can compare at a system level, 12 cores versus 12 cores for both solutions, and here the distributed v-switch outperforms the centralized v-switch by 31% when you use the full resource budget. So the key takeaway from these benchmarks would be that you do suffer a hit at the workload level, and your workload won't be as performant, but when you take a step back and look at the bigger picture, at your full resource budget and utilization and all of that, the distributed v-switch does appear to be the more compelling option of the two.

So having heard all that, just some initial next steps that we've been looking at. We've done an initial prototype to show that there is potential benefit here, that we can gain from putting the v-switching logic alongside the workload, and it gives us some interesting benefits in terms of how it dynamically balances how you spend your cycles: if you need to do a lot of I/O, you spend more time inside the kernel doing switching; if you don't, you've got more time in user space for your actual workload. Other things we're thinking about doing: the packet copies for large packets are still showing up as a bit of a hotspot, so we're looking to see if we can do copy acceleration using the Intel QuickData technology on Intel Xeon systems, and there are some links about that there. Also, some of the memory mapping and switching into kernel space is kind of familiar, well, it is kind of like what's being done in AF_XDP, so we're looking to see whether we can use the AF_XDP infrastructure, the rings and the methods of using poll to get in and out of kernel space and back to user space, and put this distributed v-switching under that AF_XDP interface type. We've already had some discussions with Magnus and Bjorn about doing that, so that's on our radar as well, to kind of standardize it. And that's it from us. Any questions for folks in here? I obviously didn't go on long enough.

So for the distributed flow table, do you need to have the full flow table for each individual distributed switch instance, or are you also doing some cleverness where the distributed switch for each container or each workload only has flow rules relevant to that workload? Will I take this?
Well, we're basing this right now off OVS, off OVS-DPDK, which already has its lookup tables per port, and that essentially means that our port is now basically our container instance, or container interface. So inside the kernel you'll have all the tables, but each individual core will only be looking up the tables that are needed for the traffic coming from that core itself. In that regard they'll be distributed, and you get nice caching benefits: rather than all the flow tables trying to fit into the caches of your OVS cores, you now have just one flow table for each individual container, sitting in the cache of the core that that container is running on. And time's up.
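For what that per-port split could look like structurally, here is a small sketch with hypothetical types (not the actual OVS dpcls structures): each container interface owns its own classifier, so the lookups a given core performs only touch the table for its own traffic and can stay resident in that core's cache.

```c
/* Sketch of per-interface flow tables in the kernel side of the scheme
 * (illustrative types only). Every container interface gets its own
 * classifier; a core looking up its own egress traffic never walks
 * the tables belonging to other interfaces. */
#include <stdint.h>

struct flow_key   { uint16_t vlan_id; /* plus whatever else the rules match on */ };
struct flow_entry { struct flow_key key; uint16_t dst_port; };

struct port_classifier {
    struct flow_entry *entries;   /* only the rules relevant to this interface */
    uint32_t           n_entries;
};

/* One classifier per container interface, indexed by source port. */
static struct port_classifier port_tables[64];

static int
dvs_lookup(uint16_t src_port, const struct flow_key *key, uint16_t *dst_port)
{
    const struct port_classifier *cls = &port_tables[src_port];

    for (uint32_t i = 0; i < cls->n_entries; i++) {
        if (cls->entries[i].key.vlan_id == key->vlan_id) {
            *dst_port = cls->entries[i].dst_port;
            return 0;       /* hit: forward to dst_port */
        }
    }
    return -1;              /* miss: hand off to the slow path */
}
```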