It's not perfect, okay. We're at the tail end of the day, two more presentations to go, and the first of those is by Jens Freimann, on virtio 1.1 and what's new in the next version.

Okay, everyone, can you hear me? Okay. So yes, my name is Jens Freimann, I work for Red Hat, and this talk is about the work we've been doing on extending the virtio specification. The goal is to extend virtio in a way that improves its performance and also makes it work better with workloads that it currently doesn't support very well. I'll give a very brief explanation of virtio and put it into context with NFV, before talking about the main new topic in virtio 1.1, which is a feature called packed virtqueues, and eventually, if time allows, I'll talk about a few other developments in the virtio space that I think are worth mentioning to this crowd.

But first, let's start with what virtio actually is. Virtio is an abstraction layer for devices in hypervisors, and the idea is to have a common framework for I/O virtualization. At its core it defines virtqueues, a way to configure devices, and mechanisms for backward compatibility; virtio is quite good at that. There are many different device types: net, block, console and so on, and to the guest these devices appear as normal PCI devices or, if you're on a mainframe, as channel devices.

Good. To put this into context with NFV, here is a possible NFV setup where virtio is used, and I really just picked a random example: let's say you're running a DPDK-based application in your VM. Then you have a virtio front end in the guest and a virtio back end in the host; in this case the front end is part of the DPDK library, and in the host it's part of your virtual switch, in this case also based on DPDK. By the way, I guess most people are familiar with DPDK, but it's a library that provides user-space drivers and libraries for building fast packet processing applications.

So, briefly, on the virtio architecture: at the top you see the device-specific components, everything that's specific to virtio-net, virtio-blk and so on; at the bottom you have the transport-specific parts. As I said, devices can be PCI devices, and they can also be CCW devices, which is for mainframes. But I want to talk about the virtqueue part today.

So we want to add a new kind of virtqueue, and why do we want to do this? The virtqueues as defined up to virtio 1.0, which I'll call split virtqueues from now on as opposed to packed virtqueues, are not really fit for a straightforward implementation in hardware. I've been expecting hardware support for virtio for quite a while, actually, and it never happened, and I guess it's just because the virtqueues are too complicated to implement. I know that one company once tried it and stopped at some point, and now there are several companies making an effort to push forward in this space and actually implement virtio in hardware and build accelerators for it. But not only hardware: pure software implementations should benefit from virtio 1.1 and the packed ring layout as well.

So what's the problem with the old kind of virtqueues, with split virtqueues? This is what the data structures look like. First, each request is stored in a buffer in memory shared between the driver and the device. The buffer address and length are stored in a 16-byte descriptor, and the descriptors are stored in a descriptor table. If the buffer is not contiguous, multiple descriptors can be chained together by setting a next field in the descriptor.
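As a rough C sketch of what these split-virtqueue structures look like, modeled on the virtio 1.0 spec and the definitions in Linux's virtio_ring.h (all fields are little-endian on the wire; the available and used rings are described just below):

    #include <stdint.h>

    struct vring_desc {            /* 16-byte descriptor in the descriptor table */
        uint64_t addr;             /* guest-physical address of the buffer */
        uint32_t len;              /* length of the buffer */
        uint16_t flags;            /* e.g. NEXT = 0x1 to chain, WRITE = 0x2 */
        uint16_t next;             /* index of the next descriptor in the chain */
    };

    struct vring_avail {           /* driver -> device: "these chains are ready" */
        uint16_t flags;
        uint16_t idx;              /* available index, incremented by the driver */
        uint16_t ring[];           /* head indices of available descriptor chains */
    };

    struct vring_used_elem {
        uint32_t id;               /* head index of a completed chain */
        uint32_t len;              /* bytes written by the device */
    };

    struct vring_used {            /* device -> driver: "these chains are done" */
        uint16_t flags;
        uint16_t idx;              /* used index, incremented by the device */
        struct vring_used_elem ring[];
    };

The descriptor table, the available ring with its index, and the used ring with its index are the five separately touched pieces of shared memory that the cache-miss counting below refers to.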
Where in the table to put each descriptor is up to the driver. A request is identified by a head index, which is a 16-bit index of the first descriptor of the chain in the table. Head indices are stored in the available ring, and to signal to the device which indices are valid, the driver also maintains an available index in shared memory. After executing the request, the device stores the head index in the used ring, which is also in shared memory, together with the length of the used part of the buffer, and it then increments the used index in shared memory to signal to the driver that this descriptor has been used.

So fundamentally, what we see is that host and guest communicate by passing messages between CPUs through shared memory. How do CPUs do this? At a very low level, when a CPU accesses memory that is also modified by another CPU, the CPUs have to synchronize their caches, and they do this by exchanging cache coherence protocol messages, so basically they talk to each other. On some architectures, when a CPU accesses memory that was previously accessed by another CPU, this will also cause a cache miss, and the number of these cache coherence messages and cache misses directly impacts the communication latency.

Now let's count cache misses with virtio 1.0. The queue consists of five parts (descriptor table, available ring, available index, used ring, used index), and as a buffer moves between host and guest, each of these parts is read and written at least once. This implies at least five cache misses per buffer if no batching is done; if we batch, it gets a little better, but not much.

So what can we do to reduce the overhead? The issue is that the information for a request is spread over multiple data structures, and this causes multiple cache misses. To reduce the overhead we should pack as much information as possible into a single data structure, and what follows is a proposal to do exactly that. It is currently being discussed on the virtio mailing list, and there's a prototype for DPDK on the DPDK mailing list; work on QEMU and Linux is happening as well, but the DPDK prototype was there first, we started with DPDK.

So what we do is get rid of the available ring, the used ring and the index structures completely, and instead we make the guest write descriptors in ring order into a new descriptor ring. Within each descriptor we add three new fields or flags: an available bit, a used bit and a descriptor ID. To make buffers available to the host, the guest writes them into descriptors in the new ring and then flips the available bit; this says that it's now okay for the host to consume the descriptor, and the host can process descriptors in any order. Each descriptor also has an ID field, as I mentioned, and after processing, the host writes the processed descriptor's ID back into the ring and flips the used bit, telling the guest that this entry has been used and can be made available again.

Device and driver are also expected to each maintain, internally, a single-bit wrap counter, starting at one and changing value each time the last descriptor in the ring is accessed: it has to be flipped when the last one is accessed. As I just mentioned, these counters are written into the descriptors as the available and used bits.
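As a rough sketch of what such a packed descriptor looks like, and of how the wrap counters explained just below are used to test whether a descriptor has been made available: the flag-bit positions shown here (AVAIL in bit 7, USED in bit 15) match the Linux/DPDK implementations, but treat them as an assumption for this sketch.

    #include <stdbool.h>
    #include <stdint.h>

    #define DESC_F_AVAIL (1u << 7)   /* assumed bit position of the available bit */
    #define DESC_F_USED  (1u << 15)  /* assumed bit position of the used bit */

    /* Packed-ring descriptor: everything about one buffer lives in this
     * single 16-byte structure, instead of being spread over five parts. */
    struct packed_desc {
        uint64_t addr;    /* guest-physical buffer address */
        uint32_t len;     /* buffer length (or bytes used, on completion) */
        uint16_t id;      /* descriptor/buffer ID, written back by the device */
        uint16_t flags;   /* carries the available and used bits, among others */
    };

    /* A descriptor is available to the device when its available bit matches
     * the device's wrap counter and its used bit does not. */
    static bool desc_is_avail(const struct packed_desc *d, bool wrap_counter)
    {
        bool avail = d->flags & DESC_F_AVAIL;
        bool used  = d->flags & DESC_F_USED;
        return avail == wrap_counter && used != wrap_counter;
    }

In practice a write memory barrier is needed between filling in addr/len/id and flipping the flags, so the other side never sees a half-written descriptor; the actual patches take care of that.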
So, assuming the ring is zero-initialized: on the first iteration over the ring, the available bit is set to one to mark a descriptor as available, then on the next iteration it is set to zero, and so on. Similarly, to mark descriptors as used, the used bit is set by the device to one on the first iteration, zero on the second one, and so on. What do we get from these counters? Flipping the counter values allows host and guest to detect newly available descriptors even after the ring has wrapped around, because it could be that some descriptors were not touched in one pass of the ring because we skipped them; this way we still know whether a descriptor has been made available or has already been used.

As I said, we have an implementation, and I measured performance according to RFC 2544 on standard Intel-based servers; host and guest on the device under test were running RHEL 7.4. At the time, the patches were based on DPDK version 17.11; the patches are now available on GitHub and hopefully soon also upstream. Apart from micro-benchmarks, this is the only real test run I managed to do before this talk: with 64-byte packets we measured a boost from 18.8 to 22.6 Mpps. This is using slow NICs, so larger packets simply hit wire speed on this setup, and I'm working on benchmarking with larger packets; if someone here is familiar with TRex in combination with Intel XL710 NICs, then please talk to me, I have some questions.

So I think we can conclude that virtqueues as defined in virtio 1.0 are not an optimal data structure for host-to-guest communication, and also not well suited for an easy hardware implementation, and we're trying to make this better with 1.1.

By the way, if you want to participate in the work on the specification, or you have a new feature or a new device that you want to add, this is actually quite easy and also fun to do. You basically just download the specification source from GitHub and edit it; it's LaTeX, so you have to build it and look at the output to make sure it looks right. Once you've done this, you subscribe to the virtio-comment mailing list (you have to reply to confirm your subscription), and then you can send your actual patch, sometimes just a one-liner to reserve a feature bit, to the mailing list. If everyone's okay with it, you send another mail and ask for a ballot; the ballot runs for a week, and after two weeks, if all goes well, your feature is in the specification.

Good, so that was the part of my talk about the virtio specification. Now I have two other things that I wanted to mention that I think are interesting. One is a project initiated by Intel called vDPA, a virtio data path accelerator. Virtio, as a paravirtualized device, decouples your VM from your physical devices, and that's nice in cloud environments because you can easily live migrate and so on, but in terms of north-south traffic it obviously doesn't compare to, for example, an SR-IOV device. Now, this picture here, which is probably not easy to read, shows how it looks without an accelerator device, and Intel is now working on a framework to make it easier to use a virtio accelerator device. Essentially this will allow you to use hardware that can read and write the virtio vring format, and it will get you basically SR-IOV-like performance on the data path, and the way they did this will also make it easier to switch from a setup that just uses normal virtio to one that actually uses an accelerator device.
This picture includes the accelerator device, and as you can see the control path is still emulated, but that's not a bad thing because it's not performance critical; the data path, however, is now handled by the hardware device, which can write directly into the vring. With traditional virtio you would usually do packet I/O via shared memory, interrupts via irqfds and kicks/doorbells via ioeventfds, and the back end would usually be in the host kernel or in host user space. With vDPA that's a bit different: you still have the doorbell kick, which depending on your guest is either port-I/O based or, with a newer guest, MMIO mapped; interrupt notification here is done via VFIO interrupts, like with pass-through devices; and, as I said, enqueue and dequeue on the vring is done by the accelerator device. It also uses VFIO for address space access into the VM, and the control path is set up by vhost-user protocol messages, basically. Also worth mentioning: it's currently not using a virtual IOMMU, but that's something we're looking into, that should be improved.

Now, you might ask: if you have a virtio-capable device, why don't you just pass the entire device through to the guest? Well, there are several reasons why you might not want to do this. One is that the spec is still growing and things evolve in steps, and a hardware implementation would have to keep up with the spec all the time, which is kind of unlikely. Also, physical functions have a lot more features and metadata and so on that a virtual function doesn't really need to care about. You would inherit all the pass-through properties, which also means it's harder to live migrate without vendor-specific solutions. For example, with virtio 0.95 you couldn't really pass a device through to the guest because all the register access was port-I/O based. And eventually the major reason is that the nice thing about virtio is that front end and back end are decoupled and you can combine things: the front end mostly stays the same, while in the back end it's common to have a choice of the actual back end implementation.

Okay, this is another project that's currently in the works; it was also started by Intel, and this one is more about efficient east-west communication between virtual machines. Imagine you have a set of virtual network functions, I don't know, firewall, intrusion detection system, router, whatever, and they need to forward packets to each other. You would usually connect them to a virtual switch in the host, and packets gather there and are transferred one by one to the target VM. The downside is that you have a long code path and it doesn't always scale very well. But what if instead you could send packets directly to the target VM, without the data path going through the host? The code path would be shorter and it would also be easier to scale. The way this works is that they have introduced a new set of vhost-pci devices that you have in the VMs, and also a vhost-pci server as a central component in one of the VMs; they are connected via Unix sockets, and they speak the existing vhost-user protocol over them to set things up. There's a presentation from KVM Forum about this, and there's also a recording on YouTube if you're more interested in it.
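Both the vDPA control path and vhost-pci lean on that same vhost-user protocol, which is just a stream of small framed messages over a Unix socket. As a sketch from memory of the framing described in QEMU's vhost-user documentation (check docs/interop/vhost-user for the authoritative layout; the request codes below are an illustrative subset):

    #include <stdint.h>

    /* Each vhost-user message is a small fixed header, optionally followed by
     * a payload; file descriptors ride alongside as SCM_RIGHTS ancillary data. */
    struct vhost_user_msg_hdr {
        uint32_t request;   /* which operation, e.g. get/set features, set mem table */
        uint32_t flags;     /* low bits: protocol version; one bit marks replies */
        uint32_t size;      /* size of the payload that follows this header */
    };

    enum {
        VHOST_USER_GET_FEATURES  = 1,
        VHOST_USER_SET_FEATURES  = 2,
        VHOST_USER_SET_MEM_TABLE = 5,   /* shares guest memory regions with the back end */
    };

Because the memory regions and the kick/call eventfds are handed over as file descriptors during this setup phase, the data path afterwards runs purely over shared memory and eventfds, with no further protocol traffic.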
The current status is that there are patches on the mailing list and they are being discussed; there are still some design discussions. The thing was that the way they wanted to do this would also introduce a lot of code duplication, especially in the vhost-user part, so Stefan Hajnoczi suggested that they could use something similar to the following.

Okay, so where does your virtio device back end usually live? Originally you had your driver in the guest and an emulated device in QEMU, a host user-space process. Then, as things evolved, vhost came along, which allowed you to implement part of the virtio back end in the host kernel, and the next step was vhost-user, which allows you to implement this back end in a user-space process in the host. Virtio-vhost-user now takes this one step further by moving the vhost device back end into a guest, and it works by tunneling the vhost-user protocol over a new virtio device type which is called virtio-vhost-user.

The final diagram shows this. In this diagram, VM2 sees a regular virtio-net device; VM2's QEMU uses the existing vhost-user feature as if it were talking to a host user-space vhost-user back end. VM1's QEMU then tunnels the vhost-user protocol messages to the new virtio-vhost-user device, so that guest software in VM1 can act as the vhost-user back end. This makes it possible to reuse existing vhost-user back end software, since it's the same protocol. A driver is required for the virtio-vhost-user PCI device, so that code has to be written as well. The vhost device back end vrings are accessed through shared memory and do not require vhost-user messages to be exchanged in the data path, so no VM exits are taken there; with poll mode of course none at all, but even when interrupts are used, QEMU is not involved, because the lightweight ioeventfd VM exits can be used. The driver can be implemented in a guest user-space process using Linux vfio-pci, but a guest kernel driver implementation would also be possible. And, as I said, the back end vrings are accessed just through shared memory. It's also worth pointing out that this works for net devices, block devices, SCSI and so on.

Another thing... I think I might skip this; how much time do we have left? Okay, then I'll just cover it in one sentence. This is a new idea for doing transparent bonding in the guest without complex configuration by the user. The idea is to have a virtio device with a special feature flag, and if this feature flag is seen in the guest, we can go and look for an SR-IOV device that we can bond with. Then we can always use the fast data path over the SR-IOV device, and for a migration we just take the link of the SR-IOV device down, switch over to the virtio device for the migration, and on the target system see whether there is another capable SR-IOV device that we can use for the data path; if not, we just keep using the virtio device.

So yeah, conclusion of my talk: virtio 1.1 will be a rather big release. There will be many changes, mostly the packed virtqueues, but also some other things required for hardware implementations, so it's worth taking a look. The DPDK implementation is available; I didn't put the link in here, but it was on a previous slide.
You can look it up there. Also, if you're interested in this and in participating, we have a monthly call where we discuss these things, virtio 1.1 but also further virtio features; you're welcome to join, just contact me. Quite a few companies are already involved, and new ones are always welcome. So that's the end of my talk, basically. Are there any questions? There's one. So the question was: the implementation of virtio 1.1 will require changes to QEMU, and what's the timeline for that? Right now the next target is to get things upstream in DPDK, and we have already started to work on the QEMU implementation, so say a few months. Any other questions? Okay, thank you. Okay, thank you.

Yes, so the last talk of the day is coming up now, in just a few minutes, with Pablo Camarillo, I hope I pronounced that right, on segment routing v6 with VPP. Nice to meet you. So we are actually two speakers; it's just a few minutes to set up and then I'm going to explain it to you. Okay, and when you plug it in, just give it a lot of time. Just for those leaving, one last reminder that we have a meet-up a little bit later for the SDN/NFV dev room, in the Manneken Pis café, which is on Rue Grand-something I think, around 7:30, so you're welcome to join us. Thank you. We won't take questions until the end of the talk.

Okay, so good afternoon everyone. Last talk of the day, and then you can go for beers. This is my colleague Ahmed, he's a PhD student from Italy, and I'm a software engineer at Cisco working in the segment routing architecture team, and today I'm here to talk about segment routing, which is something quite different from all the talks we've seen today, because all of them were focusing on different platforms and this one focuses on the actual protocols in the network, actually on service provider networks. What I'll try to do is give you a brief overview of segment routing, two deployment use cases, and then talk a bit about VPP, Linux and SERA, which you will see later what it actually is.

So what is segment routing? The idea behind segment routing is that we leverage the paradigm of source routing. What does this mean? Instead of programming all the routers in the network, what we actually do is, on the head end, add the list of segments that a packet has to traverse through the network. So if I want to go from Madrid to Amsterdam via Brussels, what I just need to do when my packet leaves Madrid is add a little segment list that says Brussels and then Amsterdam, and the packet will follow the shortest path to Brussels and then it will go on to Amsterdam. That simple. And this is actually really scalable, because it means you can implement any traffic engineering policy that you want, and you can put this together with any NFV deployment that you want; one of the main benefits is that we can have policies end to end, starting from the data center and traversing the entire network through the metro and WAN.

So we have two data plane instantiations: one of them is MPLS and the other one is IPv6. In MPLS, one segment is simply one MPLS label, and that's it. The second instantiation is IPv6: here we use an IPv6 routing extension header, which was defined 15 years ago.
And here, one segment is one IPv6 address.
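To make that concrete, the routing extension header used for SRv6 carries the segment list directly in the packet. Here is a sketch of its layout in C, modeled on Linux's include/uapi/linux/seg6.h (routing type 4 per RFC 8754); the Madrid-to-Amsterdam-via-Brussels example in the comments is an assumed illustration, and note that the segment list is stored in reverse order, with segments_left pointing at the next segment to visit.

    #include <netinet/in.h>
    #include <stdint.h>

    /* IPv6 Segment Routing Header (SRH), an IPv6 routing extension header. */
    struct ipv6_sr_hdr {
        uint8_t  nexthdr;        /* next header after the SRH */
        uint8_t  hdrlen;         /* length in 8-byte units, not counting the first 8 bytes */
        uint8_t  type;           /* routing type: 4 for segment routing */
        uint8_t  segments_left;  /* index of the next segment to visit */
        uint8_t  first_segment;  /* index of the last entry, i.e. the first segment of the path */
        uint8_t  flags;
        uint16_t tag;
        struct in6_addr segments[]; /* segment list, stored in reverse order, e.g.
                                       segments[0] = Amsterdam (final), segments[1] = Brussels */
    };

Linux's seg6 lightweight-tunnel support and VPP's SR plugin both build this header for you when you configure an SR policy at the head end.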