Hello, everyone, and welcome to the networking seminar. Today's speaker comes from Italy: Luigi Rizzo, from the University of Pisa, who is also spending six months at Google. Luigi has done quite a lot of work in networking and systems — if you have ever used dummynet on OS X or FreeBSD, you have Luigi to thank for it. He has also worked on packet scheduling and congestion control, and today he is going to talk about some recent work on a framework for fast packet I/O called netmap. So: Luigi.

Okay, thanks for the introduction, and feel free to ask questions during the talk; I will either answer right away or postpone the answer if I cover the topic later in the slides. The goal of today's talk is to show how we can improve the performance of network processing in a standard OS, in ways that probably do not match the original design of the network stacks in today's systems. The reason I chose this picture is that in the end, when you see the solution I used, you will say: oh, was it that easy? And yes, in the end it was that easy. But I have been working in this area for a long time, fifteen years, and I felt rather stupid when I designed netmap, because all the techniques I used could have been applied ten years ago. It was easy — but once you have the solution, everything looks easy. This is joint work with some of my former students and colleagues, and it is funded by a European project; every now and then there is a good outcome from European projects. And if you are bored by my face you can look at the slides online.

So the question is the following: suppose you have to build some packet-processing application, what are your options? What are the possibilities you have with mainstream operating systems? One option is to use sockets. The socket API is very flexible — you can do almost anything, including sending and receiving raw packets — and it has been available for a long time in every operating system. It has only one problem: it is very slow, and it is slow for a number of reasons that we will see. Another option is to use one of the raw packet access mechanisms that exist: for instance Linux has PF_PACKET, or a variant called PF_RING; there are other systems that expose the hardware to user space so you can program the registers of the card directly. The problem is the following: if you use a solution such as PF_PACKET or PF_RING, the underlying mechanism that moves packets between the network card and the operating system is still based on sk_buffs or mbufs, depending on the operating system, and that is a very slow mechanism. It is like having a very fast car with a very fast engine, but driving it on a very slow road: in the end you cannot exploit the performance of the system. The other option is to run your code directly in the kernel, which is great — except when you make mistakes, touch the wrong memory locations, and crash the entire system.
Of course there are also manufacturers who sell you hardware and software together. For instance Solarflare has a library called OpenOnload, an OS-bypass library that lets you use sockets from a user-space application while controlling the hardware directly, bypassing the operating system; Intel has DPDK, which is a similar library. These are tied to the vendor's hardware, so you don't really know what is going on inside the system, and unfortunately if you move to a different target, or the vendor goes out of business, you are in trouble, because many of the extensions the vendor supplied are probably not available anymore.

So I was looking at this problem and trying to figure out: is there any way to implement direct packet I/O from a user-space application straight to the NIC, in a way that is fast, safe — free from the crash risks you would otherwise have — and hardware independent? Yes, there is a way, and we eventually built a prototype which is quite flexible. It is also quite small in terms of code size, which is good, and it is maintainable: I did not have to rewrite the device drivers from scratch, which would otherwise have been a major piece of work. The results we achieved are shown in this slide. Say you have an application that wants to send raw packets to a network interface. If you use sockets you get this kind of performance — I am plotting the graph against variable clock speeds, because our system is so fast that it saturates the link at the top clock speed, so to see how fast it really is we need to underclock the CPU and find the exact saturation point. Anyway, a socket-based application gets roughly one million packets per second at the top clock speed, decreasing linearly with the clock speed. There are better solutions: for instance Linux has a kernel module called pktgen, which some of you might have used, and it peaks at about 4 million packets per second. With our solution, with netmap, we saturate the link even with the clock at around 900 MHz, so there is more than an order of magnitude of margin.

Q: I guess one other way of measuring it would be to change the packet size and use minimum-size packets?
A: This is all with minimum-size packets. In many papers you will find performance reported in megabits or gigabits per second; here I measure in millions of packets per second — and 14.88 million packets per second times 64-byte frames equals 10 gigabits per second once you include the framing overhead, so 14.88 Mpps is the magic number for 10 gigabit Ethernet with 64-byte frames.

So, as I was saying, netmap has this performance. It uses its own API, which might be a problem if it forced you to rewrite all of your software; fortunately the API that netmap exposes is very simple and compatible with libpcap — you get a file descriptor, so you can select on it — and in fact we have a very small libpcap emulation library that lets you run an unmodified application on top of netmap, even without recompiling. The rest of the talk is about how we managed to do this, and hopefully some of the ideas we used can be applied in other contexts, maybe in some of your own projects. The first thing we did — actually it was not the first thing; after we built the system we went back to figure out why it is so fast — was to profile the standard stack and measure where the time goes along the send path.
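As an aside, the 14.88 Mpps figure is just standard Ethernet framing arithmetic — a minimum-size frame carries 64 bytes plus 8 bytes of preamble and 12 bytes of inter-frame gap on the wire (this is the usual textbook calculation, not anything netmap-specific):

    \[
    \frac{10^{10}\ \text{bit/s}}{(64 + 8 + 12)\ \text{bytes} \times 8\ \text{bit/byte}}
    = \frac{10^{10}}{672} \approx 14.88\ \text{Mpps},
    \qquad
    \frac{1}{14.88\ \text{Mpps}} \approx 67\ \text{ns per packet}
    \]

That reciprocal is also where the per-packet time budget quoted later in the talk comes from.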
If you look at the code, the network stack is heavily layered, so data from the socket down to the network card passes through several layers, and in principle you might think: maybe there is one place where most of the time is spent, and I'm fine. Reality, unfortunately, is not like that. To see this, we ran a set of measurements. What we want to measure is the time spent at every layer of the network stack, and those times can be very small — tens of nanoseconds or even less — so small that it becomes really hard to measure them even using the TSC timer on the CPU. So we did it a different way: we instrumented the code so that at various stages in the path from the socket to the device driver we insert a return statement (freeing resources where necessary), and then we measure the packet rate while aborting the processing at different stages. The reciprocal of that rate is the time spent processing a packet up to that level, and by difference we can compute how much time is spent in each layer. Of course, to do this we needed a fast traffic generator in user space, to make sure that was not the bottleneck, and then we moved down the stack measuring the time spent at every level. The measurements were done on a relatively cheap system; note that you don't even need an actual network card until you reach the very bottom of the stack. These numbers are for FreeBSD; Linux is similar, maybe slightly more efficient in some places because it uses different mechanisms, but for instance the device driver is about as expensive in Linux as it is in FreeBSD.

The numbers look like this. The way to read the table is: in the second column you see the list of functions called, from the sendto down to putting the packet on the wire, and this column tells you how long it takes to send a packet if you abort processing at that level. If you abort right before the sendto, it takes just a few nanoseconds for the user-space generator to create a packet; if you abort right before ip_output it takes 330 nanoseconds, and so on; and if you go all the way down to the wire, on FreeBSD the total adds up to nearly a microsecond per packet. Now, looking at this table you see that, apart from there being ten or more different layers, there is no single place where all the time is spent: there are many places — highlighted in this column — where a significant amount of time goes. For instance, just the device driver takes about 120 nanoseconds; consider that at the top packet rate on a 10 gigabit interface your budget is 67 nanoseconds per packet, so the driver alone already blows the budget. Then ether_output, the MAC-layer processing of the packet, spends another 160 nanoseconds; ip_output, which builds the IP header, almost another 100; and the system call at the beginning takes 96 nanoseconds — and that is basically something you cannot avoid: if you want to move from a protected ring to ring zero, you have to pay that cost.

Q: Can you explain again how you measured this?
A: Okay — I modify the kernel sources. For instance, at some point the code calls ip_output; right before that call, I return instead of calling the function. That is one measurement point. Then I can move the return right before the next layer down, and so on. I don't need to collect timestamps on every packet: I just run the measurement for one second, see how many packets I can push through in that second, and then compute the per-layer times by difference.
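To make the technique concrete, here is a minimal sketch of the kind of early-return instrumentation described; the abort_level knob and the function shown are illustrative (the actual patches simply hard-code a return before the call into the next layer):

    /* Sketch of the early-return instrumentation used for the per-layer profile.
     * `abort_level` is a hypothetical knob; the real patches hard-code the return. */
    int
    udp_send_instrumented(struct mbuf *m)
    {
            /* ... normal UDP-layer processing up to this point ... */

            if (abort_level == ABORT_BEFORE_IP_OUTPUT) {
                    m_freem(m);     /* release the buffer, as the real path eventually would */
                    return (0);     /* pretend the send completed */
            }
            return (ip_output(m, NULL, NULL, 0, NULL, NULL));
    }

    /* Run the user-space generator for one second at each abort level, then:
     *   time(layer) = 1/rate(abort after layer) - 1/rate(abort before layer)  */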
And the result is essentially the picture I showed before, with all the stages. Now, how do we fix this? How do we modify the code so that we can send and receive packets within the roughly 67 nanoseconds per packet we have available? As I said, the system call is the one cost that cannot be avoided: you need to pay it if you want protection, and you do want protection, because I want my system to be safe. The way to reduce the cost of the system call is, of course, to amortize it over a batch of packets. Other costs can be saved: a lot of the time in the device driver goes into programming the device to send one packet at a time, and again, if you move batches of packets from the application to the network card you can amortize those costs too. Other costs can be removed entirely, for instance those related to data copies and to buffer allocation for the internal packet representation in the kernel: it turns out current systems allocate and free a buffer essentially every time you send or receive a packet, which is a very expensive operation; for many applications you don't need that, or you can do the allocation in advance and recycle the buffers as you produce or consume traffic. There are many, many tricks you can use in different places, but mostly it is about removing unnecessary operations and amortizing the costs that remain.

One particularly bad thing is that the internal representation of packets in the kernel is very complex: the mbuf, or the sk_buff, can be a list of chunks of memory where the packet is stored. That was very convenient thirty years ago, because memory was expensive and packets were built piece by piece: you start with the payload, put the TCP header in front of it, then the IP header, then the MAC header, and so on. You did not want to copy data at every step, so the most natural way to build a packet was a linked list of buffers, with small chunks representing the various layers. More recently, people found packet processing expensive and started to offload features to the network card: when IP and TCP checksums turned out to be expensive, cards came out with hardware checksumming; when VLANs came out, cards implemented VLAN tag insertion and removal in hardware; and in order to tell the hardware whether or not to use these features, the packet representation was extended again. In the end it is like playing a Tetris game where the pieces have very different shapes and the network card has to deal with all of them. It would be a lot simpler with a Tetris game where all the blocks are the same size, very uniform, so you don't have to think when you process them — and this is exactly one of the things we did in netmap: fixed-size, uniform buffers.

Now, the measurements above were for sockets. When you do raw packet I/O you don't need many of the layers that the socket path implements — you don't need to add IP headers, you don't need to add MAC headers, the packets are handed over exactly as they are — so we started from that simpler problem; some of the ideas we used could also be applied to speed up the regular stack, but that is something we have not done yet. One reason to work on raw packet I/O is that it is useful to determine the maximum performance you can get out of the system. It is also poorly supported by the current APIs, because whether you use BPF or raw sockets, you still have to spend a lot of time processing each packet.
So there is a lot to gain by working on this problem, and as I said, the guidelines were the following. First of all, I did not want to rely on special hardware features, because that would tie me to a given vendor, or make me incompatible with other devices. If there are costs that cannot be avoided, I want to amortize them; if there are costs that can be avoided, remove them; and in general, reduce run-time decisions — every time you make a decision in the code you take a branch, and possibly a miss in the instruction cache or in the branch predictor, and these things are expensive and should be avoided. The other thing I was a little concerned about is that an operating system has many device drivers, so if you do something that requires large modifications to the drivers, you are going to do a lot of work to make your system run on all the supported hardware. So driver modifications were permitted, but only as long as the code remained maintainable — as long as the modifications were small and did not require a lot of work to port to new hardware.

Now, the basic idea of netmap — and also the origin of the name, which is perhaps a bit misleading — is to use a shared memory region that both the kernel and the user-space application can access, and this is shown in this figure. This shared memory region contains three types of objects. The first is the packet buffers, which is where the data goes; they have a fixed size, large enough to store a full-size packet, so 2 KB, or larger if you use jumbo frames. The second object is called the netmap ring. The netmap ring is basically a circular buffer, and it mimics the ring managed by the network hardware; in fact both the netmap ring and the NIC ring point to the same buffers in the shared region, so when the hardware has to send packets it fetches the data from these buffers, and when it receives packets it stores the data into them, and those same buffers are also reachable through the netmap ring, which is a circular queue accessed both by the kernel and by user space. There is a third data structure, called the netmap_if, and the reason we have it is that we support multiple rings: one of the features introduced by NIC vendors to improve performance is multiple queues, so you can have multiple threads sending or receiving data on different queues without needing to synchronize with each other. We need to support that, and this further structure holds pointers to all the rings supported by the hardware.

In the netmap ring we keep a small amount of information, essentially the insertion point and the tail of the queue; I particularly like this representation: you have a current pointer to the read or write position, and another field which is simply the count of available slots. Then, for every slot, you have a very compact, hardware-independent representation: eight bytes containing the buffer length (how many bytes you want to send, or how many bytes were received by the hardware), some flags, and something that translates to a pointer to the buffer. I put an index there, rather than a pointer, because this memory region is shared between user space and the kernel — two different address spaces — and it can be mapped at different addresses in each of them.
If I had put a pointer there, the pointer would not have the same meaning in user space and in the kernel; instead there is just an index, and it is very easy to translate the index into the proper pointer both in user space and in the kernel. So now you have an interface, and if you want to drive it with netmap, this is the data structure that lets you manage it. The NIC ring, which I show here, is a data structure that also lives in main memory but is not accessible from user space: it is managed by the NIC and by the device driver, and it points to the same buffers; the netmap ring and the NIC ring exchange information through the kernel.

About protection: I said I want the system to be safe — I want mistakes in the application not to crash the kernel. For protection we use the standard OS mechanisms, and we make sure that the region exposed to the user-space program cannot be used in a way that crashes the kernel. For instance, the NIC ring contains the physical addresses of these buffers, and the user-space program is never allowed to see or write physical addresses: the transfer of information between the netmap ring and the NIC ring, in both directions, is done by the operating system — by the modified device driver — and while doing that the kernel validates the information that user space provides.

Q: Who decides the size of the shared region?
A: The kernel defines it when you open the device, and the region is allocated only once.

Q: Is there a way to prevent the data from changing between the time you validate it and the time you use it?
A: For the payload there is no point in protecting it: this is an API that lets you send raw packets and makes no restriction on the content of the buffers. For the netmap ring, the kernel reads each field only once, so once it has been read, even if user space changes it afterwards, there is no effect.

The other thing we have to be careful about is that memory is shared between user space and the kernel, so how do we protect against races on these data structures? The contract is the following: whenever the user-space program is running — that is, not blocked in a system call — it has full access to the shared region, which is the only area available to it; whenever the application is blocked in a system call related to that ring, the kernel assumes nobody else is touching this information. And, as I said before, every field is read only once and validated, so even if user space misbehaves it cannot crash the system — the worst that can happen is that you send garbage data, which only plays against yourself. Another important point: the operating system has interrupt service routines, which run asynchronously with respect to user space, but they never touch this shared region, so there is no chance of races there.
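To make the layout concrete, here is a simplified sketch of the shared structures described above. Field names roughly follow the public netmap headers of that era (a ring with cur/avail and fixed-size slots holding a buffer index), but this is an illustration rather than the exact header — later versions of the API, for instance, replaced avail with head/tail pointers:

    /* One slot: a compact, hardware-independent packet descriptor (8 bytes). */
    struct netmap_slot {
            uint32_t buf_idx;       /* index of the packet buffer, not a pointer */
            uint16_t len;           /* bytes to send, or bytes received by the NIC */
            uint16_t flags;
    };

    /* One netmap ring, shared read/write between kernel and user space. */
    struct netmap_ring {
            uint32_t cur;           /* current read/write position */
            uint32_t avail;         /* slots available to the application */
            uint32_t num_slots;     /* ring size */
            /* ... buffer-region offset, timestamps, etc. omitted ... */
            struct netmap_slot slot[0];     /* the slots follow the header */
    };

    /* The same buffer index resolves to a valid pointer in either address
     * space, because each side adds its own mapping base. */
    static inline char *
    buf_ptr(char *buf_base, uint32_t buf_size, uint32_t idx)
    {
            return buf_base + (size_t)idx * buf_size;
    }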
To access a device in netmap mode, the procedure is very similar to any other device: you get a handle that you use to manipulate the device, and then you issue system calls to configure it. The handle is obtained with an open() on a special device, /dev/netmap; then there is an ioctl which tells the system which interface you want to access in netmap mode, and if the card has multiple rings you can also specify whether that file descriptor should be attached to a single ring, to all the rings of the card, or to the host stack — there are several options you can specify. Once you have issued the ioctl, the network card is disconnected from the host stack and you have full control over the packets: you can send and receive using the data structures I described. Of course you first have to map the shared region with mmap(); the size of the region is returned in the argument of the ioctl.

For transmission, the first thing you do is fill the buffers with data and set the length information in the slots; the current position and the available count in the ring tell the system how many buffers you are going to transmit. Once you have done this, you issue an ioctl to tell the operating system: now take these packets, queue them on the device and start the transmission — and this is a non-blocking call. For receive there is a dual ioctl which reports newly received packets: you ask the system whether new packets have arrived, it fills in the lengths, and the current position and available count are updated to reflect them.

Of course you also need blocking synchronization — you don't want to spin on those ioctls — and netmap supports the standard mechanisms: when you poll() on a netmap file descriptor you can specify POLLIN and POLLOUT to decide whether you block on reads or on writes, and the poll returns whenever there is data available for reception or buffers available for transmission. Whenever the poll blocks, it is woken up by an interrupt from the card, so if the card supports interrupt mitigation or moderation, those delays propagate to the user-space process — which is very convenient, because you can tune the amount of work you do per wakeup. Every system call costs a lot — we have seen about 100 nanoseconds, plus whatever the kernel needs to implement the machinery for poll and select — so you want to reduce the number of system calls you make per second, assuming there is a lot of traffic coming in; interrupt moderation was designed exactly to limit the amount of work the system does by amortizing the cost of interrupts, and here it amortizes the cost of the system calls as well.

Q: Is there kqueue support?
A: Not yet — kqueue is the FreeBSD mechanism similar to epoll in Linux — support for that will come later. kqueue and epoll matter mostly when you have many interfaces, many file descriptors to handle within a single thread.

Q: What about multi-core support?
A: Yes, we have multi-queue and multi-core support. The netmap_if points to the multiple rings, and a single file descriptor can be attached to all of the rings — in which case poll wakes up when any of them has work — or you can attach a specific ring to a file descriptor. So, for instance, if you want to spread the load over multiple cores, you open multiple file descriptors, each attached to one of the rings; the configuration is done with the ioctl. The next problem is how to bind the processes or threads that manage each file descriptor to a specific core, and there are standard system calls for that — pthread_setaffinity and similar — so there is no need to introduce any new mechanism; you just use whatever the operating system provides. You bind the process to a core and have it handle a specific file descriptor, so you decide how to map the load onto your system.
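Putting these steps together, a transmit loop looks roughly like the sketch below. It follows the sequence just described — open /dev/netmap, the registration ioctl, mmap, fill slots, then poll to push a batch; the names (NIOCREGIF, NETMAP_IF, NETMAP_TXRING, NETMAP_BUF, struct nmreq) are from the netmap headers of that period, so treat this as an illustrative sketch rather than code guaranteed to build against any particular version:

    #include <fcntl.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    static void
    send_loop(const char *ifname)
    {
            struct nmreq req;
            struct pollfd pfd;
            int fd = open("/dev/netmap", O_RDWR);           /* get a handle */

            memset(&req, 0, sizeof(req));
            strncpy(req.nr_name, ifname, sizeof(req.nr_name));
            ioctl(fd, NIOCREGIF, &req);                     /* put the NIC in netmap mode */

            /* map the shared region; its size comes back in the ioctl argument */
            char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
            struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
            struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);     /* first TX ring */

            pfd.fd = fd;
            pfd.events = POLLOUT;
            for (;;) {
                    poll(&pfd, 1, -1);                      /* block until slots are free */
                    while (ring->avail > 0) {
                            struct netmap_slot *slot = &ring->slot[ring->cur];
                            char *buf = NETMAP_BUF(ring, slot->buf_idx);
                            memset(buf, 0xff, 60);          /* dummy frame; a real app builds one here */
                            slot->len = 60;
                            ring->cur = (ring->cur + 1) % ring->num_slots;
                            ring->avail--;
                    }
                    /* the next poll() (or a NIOCTXSYNC ioctl) hands the whole batch to the NIC */
            }
    }

The receive side is symmetric: poll with POLLIN (or the NIOCRXSYNC ioctl), then consume ring->avail slots from the RX ring.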
Q: In this kind of system, how do you partition traffic — how are packets classified into rings?
A: That is another of the things netmap does not do: netmap is only concerned with moving packets up and down; any classification is done by external mechanisms. For instance, some NICs have RSS or similar features that hash the headers and spread traffic across the various queues, and you just reuse that — the hardware already supports it. Again, I did not want to reinvent the wheel: if something exists and people are familiar with it, just reuse it.

How do you talk to the host stack? When you start the system, the operating system believes the card is attached: it has an address, and there may be processes trying to send packets out through this card, or expecting packets to come in from it. When you put the interface in netmap mode you grab the packets coming from the card and you take exclusive control of the transmit queues, but the operating system still believes the card is there. The way netmap handles this is by creating another pair of netmap queues connected to the host stack: you can open another netmap file descriptor attached to the host stack, and move packets between the card and the host stack using the same rings I showed before. From the operating system's point of view the card is still there, able to send and receive: whenever the operating system wants to send packets, they end up in this extra queue, the netmap application can read them from there and, if appropriate, forward them to the card; in the other direction, when packets arrive from the card and you want to hand them to the host stack, you inject them into this queue and they go up the stack. So if you want to implement a very fast firewall, something that protects you from attacks, or a traffic shaper, you can use this interface without changing anything in your applications.

When you have multiple interfaces, the typical thing you want to do is forward traffic between them. Do you need to copy the packets? The answer is no: the payload lives in these shared buffers, and each interface can reach them, so all you do is swap the buffer indexes between the transmit queue of one interface and the receive queue of the other. By swapping the indexes you achieve two results: first, you get zero-copy forwarding; second, the buffer that was attached to the transmit queue becomes available for receiving new packets. So with the buffer swap you transmit a packet and you make a buffer available for reception without allocating anything.
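A minimal sketch of that swap, using the same simplified ring and slot structures as before (the real code also deals with ring wrap-around macros and with multiple rings; NS_BUF_CHANGED is the flag netmap uses to tell the kernel that a slot now points to a different buffer):

    /* Move packets from an RX ring to a TX ring without copying: swap the
     * buffer indexes so the TX ring takes ownership of the full buffer and
     * the RX ring gets an empty one back for future receives. */
    static void
    zero_copy_forward(struct netmap_ring *rx, struct netmap_ring *tx)
    {
            uint32_t n = rx->avail < tx->avail ? rx->avail : tx->avail;

            while (n-- > 0) {
                    struct netmap_slot *rs = &rx->slot[rx->cur];
                    struct netmap_slot *ts = &tx->slot[tx->cur];
                    uint32_t idx = ts->buf_idx;

                    ts->buf_idx = rs->buf_idx;      /* full buffer goes to TX */
                    rs->buf_idx = idx;              /* spare buffer goes back to RX */
                    ts->len = rs->len;
                    ts->flags |= NS_BUF_CHANGED;    /* tell the kernel the index changed */
                    rs->flags |= NS_BUF_CHANGED;

                    rx->cur = (rx->cur + 1) % rx->num_slots;
                    tx->cur = (tx->cur + 1) % tx->num_slots;
                    rx->avail--;
                    tx->avail--;
            }
    }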
In terms of code size — not that we care that much — the core is really small, less than 2,000 lines of code, and the device driver modifications are in the range of a few hundred lines each; consider that a typical device driver is between 4,000 and 10,000 lines of code, so the modifications are really small. And there is essentially no user-space library, because the API is really simple: there is just a tiny header, half of which is probably the copyright notice, plus a few macros to access the rings and the buffers.

Now, in terms of performance: as I was saying before, when you do packet processing most of the cost is per packet, because the system calls and the bookkeeping operate on metadata, not on the data itself, so the relevant metric is packets per second. You find a lot of papers measuring performance in gigabits per second, but when you look inside, those tests are run with 1500-byte packets, or with jumbograms of almost 10 kilobytes. The numbers here are measured with minimum-size packets, which in terms of packet rate is roughly 20 times more demanding than 1500-byte packets. Also, at least in these first measurements, we only measure the packet I/O: if your application does intensive processing on the data it will of course be slower, but the part we address with netmap is the path between the application and the network card, in both directions, and that is what is measured here. The other thing to note is that many cards cannot actually do line rate, 14.88 Mpps: if you buy a 10 gigabit card, not every one of them can do line rate in every configuration — most can do it with large packets, but as you move to short packets they cannot keep up, sometimes because the internal buses of the card lack capacity, sometimes because the card runs microcode and the microcontroller inside is slow, and so on. This is the graph we saw before: netmap can send minimum-size packets at line rate even with the CPU heavily underclocked.

Q: I guess one reason performance is usually reported in bits per second is that packets per second forces you to assume some packet size, so bits per second looks like a more absolute metric, whereas packets per second depends on the packet-size distribution — it depends on the application.
A: I think netmap, and OS-bypass schemes in general, are mostly useful when you have to implement firewalls, traffic monitors and the like, and there you have to deal with worst-case assumptions, which means minimum-size packets; that is the reason. There are also cases — I won't have time for them — where you need to show performance versus packet size, because other bottlenecks depend on the size. And for comparison purposes: if you look at routing hardware, there are RFCs that say you should measure performance at 64 bytes, 128 bytes, and so on up to full size.

A lot of the performance of netmap comes from batching: if you send one packet at a time you already pay the roughly 100 nanoseconds of the system call per packet, and you cannot reach line rate; the only way to get decent performance is to send packets in batches. So how much is performance affected by the batch size? We measured the throughput for different batch sizes, and we can reach line rate with a batch of about 8 packets at 64 bytes. With 1500-byte packets there is no problem — even standard Linux or FreeBSD can do that — but with short packets you need batching; fortunately the batch size required for line rate is not huge. On the receive side you see some artifacts: for some packet sizes you can get line rate, for others you cannot, and the reason is that if the packet size is a multiple of the cache line size all the bus transfers are clean, whereas if it is not a multiple of 64 you need extra transactions on the bus, and those exceed the bandwidth available on PCI Express. Most cards use 4 lanes for each 10 gigabit port; 4 lanes have a raw capacity of 16 gigabit per second, but you are carrying 10 gigabit per second of data plus the descriptors, so if you need two transactions instead of one for each packet you exceed the available bandwidth.
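The 16 Gbit/s figure is consistent with a PCIe Gen2 x4 link, assuming 5 GT/s per lane and 8b/10b encoding (the Gen2 x4 configuration is an assumption about the specific card, but the arithmetic is the standard back-of-the-envelope):

    \[
    4\ \text{lanes} \times 5\ \text{GT/s} \times \tfrac{8}{10} = 16\ \text{Gbit/s per direction}
    \]

which leaves only about 6 Gbit/s of headroom over the 10 Gbit/s of packet data for descriptors, completions and partially filled cache-line transfers.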
For instance, the Intel card we use hits this limit at some packet sizes. You have to keep that in mind when you run your measurements: if you know the card is limited by the bus, you don't waste months trying to tune performance for something that is not under your control.

In terms of forwarding performance, we compared packet forwarding between two interfaces against other solutions. FreeBSD bridging, or Linux bridging, are all in the range of 600,000 to 900,000 packets per second. If we take two interfaces, run them in netmap mode and put a simple forwarding loop between them, with the zero-copy trick I showed before, we can move packets at line rate — assuming the packet size is one of the friendly ones — with one core at 1.7 GHz; of course, if you do complex processing in the middle, performance drops accordingly. If you use the libpcap emulation library there is some overhead — an extra copy, and so on — and you get about 7.5 million packets per second; these are somewhat old numbers, I think we are slightly faster now. Then we took some existing applications and adapted them to netmap. For Open vSwitch we took the user-space forwarding plane and moved it onto netmap, just on top of the libpcap emulation — so not the most efficient solution we could build — and reached about 3 million packets per second, a lot faster than the in-kernel forwarding plane. We took Click, which some of you have probably used: Click has a kernel implementation which is quite fast, and a user-space implementation which was not so fast because it was based on libpcap; just by adapting it to our libpcap emulation library we reached about 4 million packets per second, so higher speed than the in-kernel version plus the convenience of working in user space.

The code is available: it is part of FreeBSD since the beginning of 2012, there is a Linux version for most kernels from 2.6.32 on, and on my website there are even bootable images, so you can carry the system around without having to install anything. The bootable images turned out to be very important, because to get this kind of performance the system must be configured properly — the right device driver and so on. A lot of the feedback I get when people try our patches is: "I applied the patches, I cannot get any performance, I cannot access the device — how come the bootable images work?" The documentation was evidently not good enough, but at least the images give people confidence that the numbers are real. Supported hardware: as I said, I need device driver modifications; I have changes for the Intel 10 gigabit and 1 gigabit cards, and some pieces for Realtek and NVIDIA; for other cards I really need programming documentation — without the manuals I cannot do it.

Open issues: how else can we use netmap, apart from accelerating packet-processing applications such as Click? For instance, one could try to use the netmap API inside the host stack, simplifying the format of packets and speeding up some of the layers. Another thing we should do is change the model of operation: right now, when you open an interface in netmap mode, the card is owned by the user-space application and is not available to the rest of the system except by forwarding packets through the host rings; I should extend this in a BPF-like way, so that a client can open the card in netmap mode while the rest of the traffic still flows to the host stack.
Performance, in many cases, is limited by the cost of the system calls, especially when you interact with the scheduler to go to sleep and wake up: those operations are very expensive — several microseconds for simple things like scheduling a new process, or sleeping and waking up — and they matter, for instance, when people care about latency. We have only addressed throughput, which is great, but latency is not improved by netmap: it is the same latency you get with the normal system calls, and fixing that is mostly work to be done inside the operating system, not in the driver.

As for current work: we have used netmap to implement a software switch, called VALE, which behaves as a virtual local Ethernet and can be used to interconnect virtual machines — we have a QEMU/KVM backend that talks to it — or to interconnect applications. For instance, if you have a packet-processing tool and you want to test its performance, without this switch you would need real hardware, very fast network interfaces and so on; with it, you attach your packet-processing application to one port of the software switch, attach a packet generator to another port, and test the performance of your system. This is also part of the standard netmap distribution; you can create many instances of switches dynamically, and each switch can have dynamically created ports. On top of this, for example, we have built and tested a user-space version of ipfw and the dummynet packet scheduler and network emulator, which does several million packets per second per core — five to ten times faster than the in-kernel version.

I think I am almost out of time; if you don't mind I will show a few slides on the VALE switch, otherwise I am happy to take questions. Basically, we tried to reuse the same ideas from netmap to implement this switch. The abstraction we create is a virtual Ethernet switch, and whenever an application wants to use it, it issues the same calls it would use to access a physical interface — the open on /dev/netmap, then the ioctl — except that instead of a physical interface name you specify a special name of the form valeX:Y, where X is the switch name and Y is the port name within that switch; both the port and the switch are created dynamically, and the switch implements a learning bridge.
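So attaching to a VALE port is literally the same sequence as before, with only the name changed; a tiny sketch (same includes and structures as the earlier example, and "vale0:1" is just an example name):

    struct nmreq req;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&req, 0, sizeof(req));
    /* "vale<switch>:<port>": both are created on demand */
    strncpy(req.nr_name, "vale0:1", sizeof(req.nr_name));
    ioctl(fd, NIOCREGIF, &req);
    /* from here on: mmap, rings, poll — exactly as with a physical NIC */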
The kind of performance you can achieve is about 18 million packets per second with short frames, and — since this is one of those cases where throughput does depend on the packet size, because you run up against the memory bus — about 70 gigabit per second with large frames. Again, it is part of the standard FreeBSD distribution and available as a module for Linux; you don't need any special hardware to use this switch, and we also have QEMU and KVM backends.

In terms of the model of operation — this is also important — all the work is charged to the sender of a packet: when you send a packet it is processed immediately and delivered to one or more destinations. If you implement the switch naively, you take one packet at a time, compute its set of destinations, and copy it to each destination; this is somewhat inefficient — even operating this way you would still reach a few million packets per second — but we can do a lot better, almost 20 million packets per second, and the way to get there is to amortize the locking overhead and the cache misses of the naive scheme.

Whenever you transmit, the switch effectively receives a full ring with a set of packets for various destinations. The naive approach would be: take a packet, compute its destination, lock the destination interface, copy the packet, unlock, and move to the next packet. Instead, we first collect a number of packets into a batch of bounded maximum size; then, for each packet in the batch, while we look up the addresses we also prefetch the payload, so we amortize the cost of potential cache misses, and we compute the set of destinations — but we do not forward immediately, we just fill an array of destination bitmaps, where one bit set means the packet goes to the first port, all bits set means it goes to all ports, and so on. When it is time to forward, we take the entire batch and scan the list of interfaces instead of the list of packets: we pick the first interface, lock it once, forward all the packets marked for it, unlock, move to the next interface, and so on. There are further optimizations — we skip interfaces that have no packets directed to them, and we can start scanning not from the first packet of the batch but from the first packet actually destined to that interface — and by operating this way, as I said, we reach almost four times the speed we had before.
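In rough C, the two-pass loop just described looks like the sketch below. This is a simplified illustration of the algorithm, not the actual VALE source: the forwarding-table entry, the 64-port bitmap and the helper functions (lookup_dst_ports, port_lock, port_enqueue_copy, ...) are all assumptions made for the example:

    #include <stdint.h>

    #define MAX_PORTS 64                    /* one bit per destination port */

    struct port;                            /* hypothetical per-port state (lock + queue) */
    uint64_t lookup_dst_ports(const void *buf, uint16_t len);   /* learning-bridge lookup */
    void     port_lock(struct port *p);
    void     port_unlock(struct port *p);
    void     port_enqueue_copy(struct port *p, const void *buf, uint16_t len);

    struct fwd_entry {
            const void *buf;                /* packet payload in shared memory */
            uint16_t    len;
            uint64_t    dst;                /* bitmap of destination ports */
    };

    static void
    vale_forward_batch(struct fwd_entry *ft, int n, struct port **ports, int nports)
    {
            /* Pass 1: compute destinations, prefetching payloads to hide cache misses. */
            for (int i = 0; i < n; i++) {
                    __builtin_prefetch(ft[i].buf);
                    ft[i].dst = lookup_dst_ports(ft[i].buf, ft[i].len);
            }

            /* Pass 2: scan ports, not packets; lock each output port only once
             * per batch and copy every packet whose bit is set for that port. */
            for (int p = 0; p < nports; p++) {
                    uint64_t mask = 1ULL << p;
                    port_lock(ports[p]);
                    for (int i = 0; i < n; i++) {
                            if (ft[i].dst & mask)
                                    port_enqueue_copy(ports[p], ft[i].buf, ft[i].len);
                    }
                    port_unlock(ports[p]);
            }
    }

The skip-empty-ports and start-from-the-first-relevant-packet optimizations mentioned above would sit around the inner loop.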
Now a few graphs about performance. The cost of this system is proportional to the number of copies you make. Here we compare our VALE switch with other solutions for delivering packets to multiple destinations. Some of those are based on the NIC: many NICs have multiple queues that can be attached to virtual machines, and the NIC can copy from one queue to another — essentially a memory-to-memory copy done by the NIC. It is a rather silly way to do it, because the NIC has to move the data across the PCI Express bus and back into memory, so everything goes through a 16 gigabit per second bottleneck, much smaller than the memory bandwidth, and the CPU is simply a lot faster. You can see it here: with the VALE switch, depending on the number of ports, this is the throughput you get; if you forward through the PCI Express bus you get this much lower throughput; and the Linux bridging code is down here — no surprise, it pays the cost of the system calls and of the sk_buff allocations. Throughput versus packet size: of course the throughput of these bridging solutions depends on the packet size, since you do a memory-to-memory copy, either with the CPU — the case for the VALE switch and for Linux bridging — or across the PCI Express bus, and the different slopes reflect the bandwidth available on the path used for the copy.

Once again a lot of the performance comes from batching: move one packet at a time and you are slow; move many at a time and the throughput goes up, but the delay also grows, roughly in proportion to the batch size. So you want some control over the batch size — to be able to say, I don't want batches larger than 10, or 20, or 1,000 packets, because I want to bound my delay — and there are parameters in the system that let the administrator control it. I have graphs showing performance as a function of the batch size, both the batch size used by the transmitter and the one used inside the switch; you can look at the paper for a full discussion of why the slopes differ in the various cases. In terms of latency, which is the other thing that matters: you send a request and wait for the response; depending on the packet size, the latency through the Linux bridge is between 7 and 8 microseconds, and through the VALE switch around 5 microseconds, but most of this cost is operating-system cost in both cases. VALE is slightly more efficient because it avoids the allocations, but the bigger chunk is the sleep/wakeup machinery, so we need to fix the operating system to improve this further.

Q: Should we wrap it up, to leave room for more questions?
A: Sure — one final slide, just to give an idea of the performance you can get using the VALE switch with virtual machines. With virtio and KVM, which are the most efficient solutions for network I/O from a virtual machine, you get between 300,000 and 400,000 packets per second; with a faster backend, the VALE switch, you go up to over a million packets per second, and there you are still limited by the speed at which the application inside the virtual machine can generate packets — with a much faster generator in the guest you could go even higher, I think. You can find more information, the code and the papers on my website.

Q: You had the slide at the beginning with the breakdown of where time is spent in the traditional FreeBSD stack, and from what you said it seems you only replace the lower part of the stack.
A: Yes — in the applications I have been describing, the upper part is simply not there: you keep the initial part, the system call, and the bottom, and everything in between goes away.
Q: And with this architecture in mind, would you do things differently in the higher layers?
A: The chain of costs reflects how the layering works — each layer processes its own piece. One thing I would do in a standard stack, for instance, is avoid adding headers one piece at a time. The first time you send a packet to a destination you may not know the options, the addresses and so on, but a lot of that work could be cached and its results recycled. For instance, when you send data, the typical thing done today is to copy it into the socket buffer and do the segmentation later, when it is time to transmit, and segmentation is expensive because you touch the data again; once you know the MTU, the addresses and so on, you could just as well segment up front. Similarly, on the receive side you have the socket buffer: you could allocate that memory once, in advance, and avoid the sk_buff allocation and deallocation on every received packet. In terms of savings: basically all the cost in the device driver is gone — the per-packet cost there is around 20 to 25 nanoseconds now, because with batches I only program the interface once per batch, so that cost is amortized; the cost higher up is partly amortized and partly simply not there anymore — the memory copy remains, but the rest of the logic is gone.

Q: Most applications use TCP, which more or less requires the in-kernel stack. Do you have any comparison of end-to-end TCP goodput, or does this not really affect it — is it pretty much what you showed, that it cuts out the packet-delivery cost?
A: It cuts the packet-delivery cost, with most of the rest staying in the OS. For 1500-byte or 9 KB buffers you don't really care, because you are already saturating the link — or the memory bus, if the machines are communicating locally. So there may still be a saving in CPU cycles, which is always good, but if you just measure throughput you are probably not going to see a big difference.

Q: I work on Mininet, which is a network emulator, and we use Open vSwitch. First, I want to congratulate you: yours is one of the few really good things I have read in CACM in years. I would like to switch Open vSwitch to using netmap, but netmap is not in mainline Linux. So, two questions: why isn't it in Linux, and when will it be?
A: I don't really know why it is not in Linux. I submitted it to the Linux mailing lists, but it was not really considered — admittedly I am not playing by the rules, I just send the code and say: if you want it, feel free to use it, or not, and so on. There is one reason why it might not be accepted: netmap is basically written for FreeBSD, it uses the internal FreeBSD APIs and conventions, which is almost another language, so people reading Linux code are not familiar with the FreeBSD constructs. I have a small compatibility layer that maps the FreeBSD calls onto the Linux ones, and that is very convenient for me: for instance, when I wrote VALE I developed it on FreeBSD, and once it built on FreeBSD I compiled it on Linux without any changes. That was great — with two separate versions I would have wasted a lot of time keeping them in sync — so I have a strong incentive not to maintain two versions, and the Linux people probably don't like that style. The good part is that VALE does not require any changes to the rest of the system, since it does not need a device driver, so being outside mainline Linux really only matters for the device drivers, and for those the patches are small and tend to apply across multiple kernel versions. One of the things I want to do is extend VALE so that the learning bridge can be replaced by other forwarding logic — I plan to open up that part of the VALE code. So, thank you very much.

[applause]