Hi everyone, I'm João Martins and I'm here to talk about ClickOS, which was built by me and my colleagues at NEC, as well as some other people at University Politehnica of Bucharest. So, the view of the network as taught in schools is something like this: end systems, with a network in between containing switches and routers. Pretty simple. But as we all know, the reality is quite different. Middleboxes are commonplace in today's networks, existing in numbers comparable to switches and routers. They are useful for many reasons: security, such as firewalls and intrusion detection systems; monitoring; traffic shaping; dealing with address exhaustion issues; performance optimization; and others of a more dubious nature, such as advertisement insertion boxes. Obviously, these middleboxes are very useful, but they come with a number of problems. The clearest one is the price: middleboxes are extremely expensive, both in the price of the box and in the cost of managing it. It's difficult to add new features, as we get vendor locked in; if it's a Cisco box, for example, we get stuck with them for upgrades, firmware and whatnot. They are difficult to manage, often requiring specialized personnel to deal with each kind of middlebox. Further, hardware middleboxes cannot be scaled on demand, nor can they be shared among different tenants. Finally, it's hard for new players to enter this hardware market. So clearly, moving these middleboxes into software and into the cloud would solve most of these issues. The question that still remains unanswered is whether we can build these software middleboxes on commodity hardware while still achieving the high performance of hardware middleboxes. We believe so, and that's what we propose with ClickOS. In one sentence, ClickOS is a tiny, MiniOS-based virtual machine that runs the Click modular router software on top. So let me talk about Click.
The Click modular router is a network processing framework initially designed for building routers. Its main concept is the processing element. Users create a configuration specifying how these elements are connected. The elements can have parameters, which are then exposed as variables via a /proc-like file system or via sockets. For ClickOS, we managed to compile 262 out of the roughly 300 elements available in Click. In the listing on the side you can see the classes of these elements, and you can observe the variety of packet processing we can do with them. Furthermore, programmers can write new elements, which lets them extend the runtime. In the figure in the middle, you can see a configuration that just receives packets from the eth1 interface, decrements the TTL and forwards the packet back out. And in this slide, you can see an example of a very simple firewall in Click. We receive packets from the FromNetfront element, going into an IPFilter that allows only UDP packets with a given source and destination address. Packets matching the criteria are forwarded to the ToNetfront element; otherwise they are dropped.

So what is ClickOS, then? As we all know, a paravirtualized virtual machine has a slightly modified guest kernel with applications running on top. In the case of ClickOS, we use MiniOS as the guest operating system. It has a single address space, which means no system calls, and a cooperative scheduler. On top of MiniOS we run Click, and we call the whole thing ClickOS. So the work we did, among other contributions: a generic build system to build these ClickOS virtual machines, which are 5 megabytes in size, and which we also use for other purposes. We also had to implement the Click control plane for MiniOS: Click normally runs as a process or as a kernel module, so the way you communicate with it had to be changed for it to run as a VM.
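As a sketch, the simple firewall just described might look something like this in Click's configuration language. This is illustrative only: the addresses are made up (the exact ones from the slide aren't recoverable), and the FromNetfront/ToNetfront device arguments are assumptions; IPFilter is a standard Click element.

```click
// Hypothetical sketch of the firewall configuration described above.
// Allow only UDP between two illustrative addresses; drop everything else.
FromNetfront(0)
    -> IPFilter(allow udp && src host 10.0.0.1 && dst host 10.0.0.2,
                drop all)
    -> ToNetfront(0);
```

Swapping IPFilter for other elements (DecIPTTL, Counter, and so on) is how the other configurations in the talk are built.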
We also did optimizations to reduce boot times: we started with something like one second and went down to 30 milliseconds. And finally, optimizations to the data plane to achieve 10 gigabit line rate for almost all packet sizes. I will focus on this last item, which we think is the bigger contribution for the Xen community.

So we started doing performance analysis of the Xen I/O subsystem. Let me just give you a quick overview of how it's organized. First, we have a NIC; in our case, we use Intel ixgbe 10 gigabit NICs. Then we have dom0, which hosts the network driver for it, whether that's dom0 itself or a driver domain. This driver is attached to a software switch, say the Linux bridge, or Open vSwitch in more recent versions. To the switch you attach a virtual interface called a vif, which is managed by the netback driver. The netback driver connects to a netfront driver in the guest over the Xen ring; event channels are used for notifications and the Xen bus for control and initialization. Further, we added two new elements, FromNetfront and ToNetfront, so that Click can interact with the netfront driver.

Our goal is to achieve 10 gigabit line rate for all packet sizes, meaning 813,000 packets per second for maximum-sized packets, and all the way up to 14.88 million packets per second for minimum-sized packets, as the table shows. The problem is, when we put all of this together, there were several issues. For instance, Open vSwitch could only forward 300,000 packets per second. When we plugged in netback and the vif, it only forwarded 350,000 packets per second. And when we put the whole pipe together with the MiniOS netfront driver, we got only 225,000 packets per second, which, as you can see, is still about one fourth of line rate even for maximum-sized packets. So the main issues with this pipe are the following. The backend switches, Linux bridge or Open vSwitch, are not prepared to handle really high packet rates.
They are still able to achieve 10 gigabits by relying on NIC offload features like TSO or LRO, and while that is, for instance, good enough for end hosts, it is far from ideal for middleboxes, where the switch needs to handle millions of packets per second. Regarding the backend/frontend communication, packets always need to be copied between the guest and the backend domain. These copies between domains are normally done in batches of packets, but they are still one of the main causes of low packet rates. In addition to this copy, netback works with packet metadata structures such as skbs (or mbufs in FreeBSD), whose allocation and manipulation are really expensive. Last but not least, the MiniOS netfront is not as fast and featureful as the Linux netfront: where you saw 225,000 packets per second with MiniOS, in Linux we get 430,000 packets per second. And the receive path on MiniOS was even worse; we were getting 8,000 packets per second.

After careful analysis, we started optimizing things. First we started with transparent optimizations, without modifying either the guest or the Xen ring I/O protocol. For this we began by changing the backend switch to use VALE. I'll go into a bit more detail later, but basically the NIC enters a special mode called netmap mode, where the host stack is disconnected and packets are sent via the netmap API. For this modification, netback only required slight changes, mostly to remove the packet metadata manipulation, which we don't need with the VALE switch. We also applied some previous work on supporting multi-page I/O rings: extending the ring to span multiple pages lets us batch roughly 1,000 packets in the ring. We also had to change the frontend for this, but compatibility is still kept if the frontend doesn't support multi-page rings.
This optimization is already known by the Xen community, and I think it is already available in the block drivers. Applying all of these changes, the results were actually pretty good: we got a 2.7x improvement for maximum-sized packets and 4.2x for minimum-sized packets.

Before going forward, let me give you a bit of an overview of netmap and VALE. Netmap, which I mentioned earlier along with the VALE switch, is a fast packet I/O framework that is able to send 14.88 million packets per second with one core running at 900 megahertz. That roughly corresponds to 60 to 65 cycles per packet. It's currently available in FreeBSD 9 and also on Linux. Netmap requires minimal device driver modifications to work, but what makes it special, in my opinion, is that critical resources such as NIC registers, physical addresses, and the packet descriptors on the NIC are not exposed in any way to the user-space application. The netmap ring shown in the figure is a copy of the NIC ring, whose contents are validated by the kernel before pushing packets out. The NIC works under a special mode, disconnected from the host stack, but this mode change is done at runtime when the application registers an interface. So netmap is built around a shared memory region that is mapped into user space, which contains pre-allocated packet buffers and a software ring. Packets are sent in large batches per system call, instead of one per packet as normally happens.

Now, VALE was built as an extension to netmap, and it's built around the same concepts. The graph on top shows you a comparison of packet forwarding between two 10 gigabit NICs with the FreeBSD bridge and Open vSwitch, whereas in VALE's case it's between two virtual ports. Access to these virtual ports is done with the netmap API, and each virtual port uses a separate memory region.
In VALE, the switching fabric is decoupled from the lookup logic, which means that other kernel modules can extend the switch to implement their own lookup functions. The default provided by VALE is a layer-2 learning bridge.

So, regarding the previous optimizations: we changed the underlying switch, removed packet metadata manipulation, and were able to batch more packets with the Xen I/O ring. Performance was better, but it was still far from ideal. For example, for minimum-sized packets it was 1.2 million packets per second, which is still more than 10 times lower than what we want. So the next optimization we did was to replace the Xen I/O rings with the netmap rings of the VALE port. We basically ended up with a much thinner netback that involves much less processing, delegating most of the packet processing operations to VALE without having to deal with skbs. We map all the memory of the VALE port all the way into the guest. The netmap buffers are also mapped into the guest, so that we don't need the extra grant copy between domains for packets to arrive at the backend. With the netmap ring, the data structures are not overwritten by a request/response model as we normally see in the Xen ring; this means that when the guest issues an I/O request, it doesn't need to parse an I/O response on the way back. The notification mechanisms in our frontend and backend are somewhat similar to before, and event channels are also used for notifications: there are two event channels, one for transmit and one for receive. The backend acts mainly as a proxy for netmap operations between the guest and VALE.

In the end, we got really major improvements with this netback and frontend, but we also broke support for other guests. To show that our optimizations do apply to other guests, we also implemented a Linux netfront compatible with this new backend. So let me explain a little how the backend and frontend work.
So when we attach a network interface to the VM, what happens is the following. The backend registers a new VALE port, calculates which pages belong to the ring, and grants them to the guest. The grant references are then shared with the guest via the Xen store. The netmap buffers are granted as well; they live in contiguous pages, and a page fits two buffers, so after granting the ring we walk through these buffers and grant them too. There are too many of these grants to share via the Xen store, though, so they are included in the ring slots instead. The guest, on the other side, reads the ring grant references and maps them, and after that looks at each slot on the ring and maps the grant reference of that slot's buffer.

The downside of this is that we have higher memory requirements with this backend and frontend. We let the administrator choose the ring size depending on the VM workload. The table on the side shows how much memory and how many grants are exchanged between backend and frontend: we have a minimum of 135 kilobytes per ring with 64 slots, and for both rings we require 66 grants and 270 kilobytes at the minimum ring size. But let me say that the default ring size in netmap is 1024, which is actually the one that offers the best performance, at the cost of a bigger chunk of shared memory: roughly four megabytes for both rings, RX and TX.

Let me also talk a little about the synchronization between our backend and frontend. Our backend is based on netmap, and netmap was made for a user process and the kernel, not for VMs. In a netmap application, operation is done in the sender's thread context, which means that when the process wants to send packets it switches into the kernel, so the process and the kernel never access the ring simultaneously and change its indexes.
Well, in our case, backend and VM ideally run simultaneously, in different contexts and on different CPUs, both virtual and physical. So when we want to send packets, what happens in our case is that we fill the ring, we update the cursor on the ring to point at the next packet to be filled, along with the number of slots available in the ring, and we notify the backend. Meanwhile, the backend processes the packets, and when it finishes it notifies the guest, meaning that it has finished its work and is idle. While the backend is processing packets, what the guest does is grab a copy of the ring indexes and update this copy instead of the shared memory region. When the notification comes back, it updates the shared memory region accordingly and notifies the backend again if there is more work to be done. This way, we avoid the frontend and backend simultaneously changing the same variables on the ring. As an alternative, the guest could simply wait until the backend notifies it that the work is done, but this turned out to be inefficient; in our case, we only wait when there are no more slots to be used.

Now that I've explained our optimizations, let's proceed to the evaluation. We first evaluated the ClickOS VM's performance, starting with TX and RX measurements on one of our low-end machines, which has four cores at 3.2 gigahertz. We assign one core to the VM and the remaining cores to dom0. The Y axis shows rates in million packets per second, with lines in gigabits per second for better readability; below, on the X axis, are the packet sizes, and each bar represents a different ring size. We can see that on RX we are able to achieve 9 million packets per second, and on TX up to 14.2 million for minimum-sized packets, which is 95% of line rate. That's actually a really good result. And we see line rate for all the larger packet sizes.
Also, for rings bigger than 512 slots, we are able to achieve line rate for most of the packet sizes. Next, after doing experiments with just 10 gigabits, we looked at how we scale with multiple NICs and VMs. The bars in the graph represent VMs just transmitting packets, and the lines represent VMs forwarding traffic. For this experiment, we manually pinned event channel and NIC interrupt affinities to avoid starvation in the hardware and the VMs; the setup was actually quite sensitive, as easily one of the NICs would end up not receiving any packets. Three cores are reserved for dom0 and three other cores for the VMs, assigned in a round-robin fashion. What we can observe is that for smaller packet sizes it doesn't scale that well, but for bigger packet sizes we are able to get up to 40 gigabits with VMs just transmitting, and up to 30 gigabits with VMs forwarding traffic. Just as a side note regarding the forwarding rates: the lines in the graph correspond to the rates measured at the receiver machine, so these are not duplex rates, just to facilitate understanding. I should add that the transmitter in the forwarding experiment is doing line rate in all cases, so the duplex rate actually corresponds to double.

Now, this graph shows you Linux guest performance. The bars represent KVM rates measured with an in-kernel packet generator, with the virtio backend and vhost enabled and all the vCPUs pinned; Xen, measured with the in-kernel packet generator as well; and our optimized version of Xen with a user-space netmap packet generator. Let me point out that while the results are really good in our case, this experiment is a bit unfair, because we are not using the host stack in the optimized version, thus bypassing any skb processing and all of that.
What the graph does show is how fast we can go by using our backend and frontend with a netmap application. Next, we actually implemented a few middleboxes and checked their performance. Each VM uses one core, for the different middleboxes. The first middlebox is the simplest one, to get a baseline measurement: it just receives packets and transmits them. Then a mirror, which simply flips the destination and source MAC addresses. We have a standards-compliant IPv4 router; a firewall loaded with 10 rules; an almost standards-compliant carrier-grade NAT, where each packet is assigned a randomized source and destination port, just to stress it; a software BRAS, which checks for a PPPoE session, strips PPPoE headers, performs IPv4 lookups and rewrites MAC addresses; a load balancer based on the five-tuple, assigning in a round-robin fashion and tagging packets with a source MAC address so that the backend switch can split packets based on that field; a flow monitor that keeps per-flow packet counts and rates; and an intrusion detection system based on regular expression matching, containing a single rule that matches the string "abc" at the beginning of the packet. As you can see, most of these middleboxes are just proofs of concept, but the more featureful ones, like the carrier-grade NAT and the software BRAS, with all that amount of processing, still achieve really good performance in ClickOS.

Next, we measured delay, comparing against the other existing approaches. We used in ClickOS a simple ICMP responder Click configuration; in all the tests the VMs were idle, just to get a baseline reading, and a neighboring box sent pings and measured the RTT.
We can see that ClickOS almost matches the delay of dom0; we actually decreased by half the delay of a standard domU, and we perform better than KVM as well. So, as the main conclusions: I presented you ClickOS, a tiny virtual machine tailored for network processing. It is very small, 5 megabytes in size, runs with a minimum of 6 megabytes of RAM, can be booted on demand in 30 milliseconds, can achieve 10 gigabit throughput using just a single core for almost all packet sizes, and can run a wide range of middleboxes with high throughput. As future work, we are starting to explore performance on NUMA systems; at the moment the performance there is not that good, but with the latest changes in Xen for vNUMA support and NUMA affinity that are being pushed to mainline, the performance will change for sure. Also, high consolidation of ClickOS VMs on a single server, up to running, for example, 1000 VMs on a single server, and doing service chaining with these VMs, meaning traffic being forwarded through each of these VMs until it is sent out to the external world. And so, that's it for my talk; I hope you liked it. Thanks.

Q: With all guests using the same netmap interface, can each guest see the packets destined to the other guests? Don't you lose some security here, some isolation?
A: That's why you can implement your own lookup. For example, we implemented a static lookup, so a guest can only send packets to a given interface.
Q: That's for sending, but what about reading? Because all of the physical NIC buffers are mapped to the guest.
A: No, no, not the physical NICs. The physical NICs are tied to the switch; the guest only sees a virtual port. So the NIC buffers are not mapped to the guest.
Q: Okay, so there is an additional copy to...
A: Yes, there is a copy within the switch, but not between the guest and the backend. Other questions?
Q: Is it open source?
A: We will be open sourcing it, either this quarter or, at worst, in the first part of next year. We basically took the decision last...
Q: How much memory are you giving to your MiniOS-based domain?
A: For a 1K ring size, we give 16 megabytes. For a 64-slot ring size, for example, we use 6 megabytes of memory.
Q: During the definition of this project, was the MirageOS approach of an appliance considered, or was it compared against at some point?
A: So, our focus was... Mirage is in OCaml, which is focused on type safety, and we thought that could be an overhead, given that the packet processing needs to be really fast. As well, ClickOS started two and a half years ago, so Mirage was not at that stage at that point; but even so, it wouldn't have fit.
Host: Any other questions? Let's give him a hand.