MTU. The TCP, IP and Ethernet headers, in this way, are created only once. This provides an advantage because the computational cost is almost independent of the payload length; it depends strongly on the number of stack traversals. On the right, instead, we can see the behavior when TSO is not present: the TCP layer segments the data, and each packet must cross every level of the network stack.

TSO helps the sender. RSC, Receive Side Coalescing, also known as hardware LRO, helps the receiver instead. RSC allows a NIC to combine, in hardware, incoming TCP/IP packets from the same connection into one large received segment before passing it to the operating system. It reduces CPU usage because the TCP/IP stack is executed only once for a whole set of received Ethernet frames. Basically, RSC is the dual of TSO. This diagram shows on the left how RSC works: RSC aggregates the packets, and the aggregate is then passed to the operating system, so only the aggregated segment crosses the network stack.

Why do we need a software implementation? There are several reasons. For example, if we have an old NIC that supports only IP version 4, a hardware upgrade is really difficult. Or if the card has a malfunction in its offloading engine, the offloading cannot be used. Or if there is no hardware involved in the communication, for example between virtual machines, we need a software mechanism to do the segmentation. In all these cases it is useful to have a software solution that gives us the same benefit. Finally, software is easier to extend than hardware, to support new protocols or to fix bugs.

OK, since FreeBSD 7.1 we have a software implementation of RSC called LRO. This software mechanism requires changes to each device driver to obtain the same advantages as RSC. On the left there is hardware RSC, and on the right there is software LRO. In the device driver, in the same interrupt context, the packets from the same connection are aggregated into one segment, which is then passed to the network stack. From this point on, the behavior is the same as with RSC: only one big segment crosses the stack.

Now I will introduce a software mechanism that we have developed for the sender side. It is called GSO, Generic Segmentation Offload. Basically, it is the software implementation of TSO, but it supports not only TCP but also UDP, on both IP version 4 and version 6. GSO is available for FreeBSD-CURRENT, 10-STABLE and 9-STABLE.

To avoid changing every device driver, we perform the segmentation just before the device driver, as we can see in the next slide. On the left there is the GSO scheme, and on the right the scheme without GSO support. Much of the advantage of hardware TSO comes from crossing the network stack only once. GSO does the same by doing the segmentation in software as late as possible. Ideally, this could be done within the device driver, but that would require modifications to the drivers. A more convenient and similarly effective approach is to do the segmentation just before the packet is passed to the driver.

This diagram shows an example of TCP data flow. Our modifications to add GSO support are confined to the network stack. In particular, we changed tcp_output, ip_output and ether_output, and we added a gso_dispatch function to do the segmentation just before the driver.

tcp_output checks if GSO is enabled; there are some sysctls to enable GSO or to change some parameters, which we will discuss later. After that, it checks if the packet length exceeds the MTU. In this case, tcp_output sets the GSO flags in the mbuf packet header field called csum_flags. We use 4 bits in this field, both to indicate that GSO is required and to specify the type of segmentation. In this way, gso_dispatch will avoid inspecting the headers to understand the type of segmentation.
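To make this marking step concrete, here is a minimal sketch of what it could look like in tcp_output. The CSUM_GSO_* flag values and the gso_enabled variable are illustrative placeholders of this sketch, not necessarily the identifiers used in our patch:

    /*
     * Hedged sketch of the GSO marking step in tcp_output().
     * The CSUM_GSO_* flag values and gso_enabled are placeholders.
     */
    #define CSUM_GSO_TCP_IPV4  0x01000000  /* TCP segmentation over IPv4 */
    #define CSUM_GSO_TCP_IPV6  0x02000000  /* TCP segmentation over IPv6 */
    #define CSUM_GSO_UDP_IPV4  0x04000000  /* UDP fragmentation over IPv4 */
    #define CSUM_GSO_UDP_IPV6  0x08000000  /* UDP fragmentation over IPv6 */

    /* In tcp_output(), once the mbuf chain m has been built: */
    if (gso_enabled && m->m_pkthdr.len > ifp->if_mtu) {
        /*
         * Encode both "GSO required" and the segmentation type in
         * csum_flags, so that gso_dispatch() never has to re-parse
         * the packet headers to find out what kind of split is needed.
         */
        m->m_pkthdr.csum_flags |= isipv6 ?
            CSUM_GSO_TCP_IPV6 : CSUM_GSO_TCP_IPV4;
    }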
Then the packet is passed to ip_output. If GSO is enabled and required, ip_output skips the checksum calculation, both TCP and IP, because the checksums will be calculated after segmentation on each new packet, and it also skips IP fragmentation. Finally, in ether_output, after building the Ethernet header, if GSO is required we call gso_dispatch instead of if_transmit.

In gso_dispatch, using the information contained in the csum_flags field, the appropriate function is invoked to perform the segmentation; we use a simple array of functions to do this. For example, in the case of TCP on IP version 4, the function is gso_ip4_tcp. This function, like the others, for example for UDP or for TCP on IP version 6, performs three main operations. One, it calls the m_seg function, which returns an mbuf queue containing the new segments. Two, it fixes the TCP and IP headers in each new segment, because these headers are simply copied from the original packet into every new segment. Three, it sends the new segments to the device driver.

This is the m_seg function, which builds the new segments from the original packet mbuf. m_seg takes a few parameters: m0 is the original packet; hdr_len is the number of initial bytes of m0 that are copied into each new segment; MSS is the maximum segment size.

After the segmentation, we need to fix the TCP/IP headers, in this case TCP on IP version 4. The red fields must be changed. In more detail, for the IP header: the total length will contain the new packet size, the identification field is incremented for each segment, and we recalculate the checksum of the IP header. For the TCP header: the sequence number must refer to the data contained in each segment; some flags are set only in the first segment, for example CWR, and other flags only in the last segment, PSH and FIN; and we recalculate the TCP checksum if the hardware is not capable of doing it. For TCP on IP version 6, the changes in the TCP header are the same; in the IP header, only the payload length field, which contains the length of the segment, must be set.

As we said, GSO also supports UDP traffic. In this case we need to perform IP fragmentation, but we delay it until just before calling the device driver, as we can see in the diagram on the left. On the right, when GSO is not enabled, the fragmentation is done at the IP level, and each fragment must cross the Ethernet level, which adds the same header to all fragments. The steps performed for IP fragmentation are the same ones we already described: m_seg, fix the headers, and send the packets. The only difference is in the IP header changes, because in this case we must do real IP fragmentation and modify the appropriate fields, like the fragment offset and the more-fragments flag, and we recalculate the IP header checksum. For IP version 6, to do IP fragmentation we need to add an additional header, the fragmentation header (in red), that contains the fragment information.
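Going back to the TCP-over-IPv4 case, here is a minimal C sketch of the header-fixing step described above, assuming the segments have already been produced by an m_seg()-style split and are linked through m_nextpkt. The function name, the queue layout and the use of in_cksum_hdr() are assumptions of this sketch, not the exact code of the patch:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>
    #include <machine/in_cksum.h>

    /*
     * Hedged sketch: fix the copied IP and TCP headers of each segment
     * produced by an m_seg()-style split (segments linked via m_nextpkt,
     * each starting with a copy of the original Ethernet/IP/TCP headers).
     * eth_hlen is the length of the Ethernet header.
     */
    static void
    gso_fix_segments(struct mbuf *mq, int eth_hlen)
    {
        struct mbuf *m;
        struct ip *ip;
        struct tcphdr *th;
        uint32_t seq;
        uint16_t id;

        /* Starting values come from the (copied) original headers. */
        ip = (struct ip *)(mtod(mq, char *) + eth_hlen);
        th = (struct tcphdr *)((char *)ip + (ip->ip_hl << 2));
        seq = ntohl(th->th_seq);
        id = ntohs(ip->ip_id);

        for (m = mq; m != NULL; m = m->m_nextpkt) {
            ip = (struct ip *)(mtod(m, char *) + eth_hlen);
            th = (struct tcphdr *)((char *)ip + (ip->ip_hl << 2));

            /* IP header: new total length, increasing id, fresh checksum. */
            ip->ip_len = htons(m->m_pkthdr.len - eth_hlen);
            ip->ip_id = htons(id++);
            ip->ip_sum = 0;
            ip->ip_sum = in_cksum_hdr(ip);

            /* TCP header: the sequence number must refer to the data
             * actually carried by this segment. */
            th->th_seq = htonl(seq);
            seq += ntohs(ip->ip_len) - (ip->ip_hl << 2) - (th->th_off << 2);

            if (m != mq)
                th->th_flags &= ~TH_CWR;              /* first segment only */
            if (m->m_nextpkt != NULL)
                th->th_flags &= ~(TH_PUSH | TH_FIN);  /* last segment only */

            /* The TCP checksum is recomputed here, or left to the NIC if
             * it can offload checksumming (omitted in this sketch). */
        }
    }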
To manage the GSO parameters, we added some sysctls. There are two sysctls to enable or disable GSO for all TCP or UDP communication; for example, net.inet.tcp.gso enables or disables GSO on all TCP communication. And there are two more for each interface, to limit the GSO burst and to enable or disable GSO on the individual interface.
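As an illustration, such knobs can be declared with the standard FreeBSD SYSCTL macros. The variable names and description strings below are assumptions of this sketch, not necessarily those of the patch:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /* Hedged sketch of the two global GSO knobs (names illustrative);
     * the _net_inet_tcp and _net_inet_udp sysctl nodes already exist
     * in the FreeBSD network stack. */
    static int gso_enable_tcp = 0;
    SYSCTL_INT(_net_inet_tcp, OID_AUTO, gso, CTLFLAG_RW,
        &gso_enable_tcp, 0, "Enable Generic Segmentation Offload for TCP");

    static int gso_enable_udp = 0;
    SYSCTL_INT(_net_inet_udp, OID_AUTO, gso, CTLFLAG_RW,
        &gso_enable_udp, 0, "Enable Generic Segmentation Offload for UDP");

From userland, they would then be toggled with something like sysctl net.inet.tcp.gso=1.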
Our code is available in these two repositories. The first contains the patches, the utilities, the PicoBSD images and a brief description of GSO. The second one contains the FreeBSD source code with GSO support. To compile the kernel with GSO support, you just need to add "options GSO" to the kernel config. In this slide, we can see the changes needed to add GSO support to FreeBSD-CURRENT: we added two files, gso.c and gso.h, in sys/net, and a very small piece of code in the network stack files, like IP, TCP and UDP.

At the end, I show you the results that we obtained in the experiments. This is our testing environment: we use netperf as a benchmark tool and 10 gigabit links between the machines. To see the effect of GSO or TSO, the receiver must have LRO (in software) or RSC (in hardware) enabled.

In this slide, we see the results with TCP/IPv4 traffic. On the horizontal axis we have the CPU frequency, and on the vertical axis the throughput. The red curve is obtained with hardware TSO, the blue curve with software GSO, and the green curve without any segmentation offloading. We performed the experiment by scaling the frequency of the CPU because, as you can see from the graph, at high frequency the 10 gigabit link becomes the bottleneck. From the graph we can see that at 2 GHz GSO saturates the link, and from the table that at 1.5 GHz GSO is almost twice as fast as without offloading.

In this other slide, we can see the results with TCP on IP version 6. The layout is the same: the red curve is TSO, the blue curve is GSO and the green curve is without segmentation offloading. Also in this case, GSO gives us about double the throughput compared to no offloading. With UDP traffic we have a speedup of about 20%. This is because in this case GSO only avoids the repeated crossings of ether_output; udp_output and ip_output are crossed only once whether GSO is enabled or not. For UDP on IP version 6 the result is the same.

As future work, we will try to do more performance measurements, for example by using multiple concurrent streams; we will try to optimize the critical path; and we will add support for new protocols, for example SCTP. Thank you. Do you have any questions?

Q: Is it possible to make the GSO option enabled by default, automatically, I mean? So, depending on whether segmentation offloading is supported by the hardware or not, to have it enabled automatically or not?
A: If the NIC has TSO, the network stack uses hardware TSO. Or you can disable TSO with ifconfig, for example, and enable GSO with the sysctl.
Q: No, I mean, can you make it so that the system selects whether to enable GSO or not, depending on the hardware support for segmentation offloading?
A: Oh, yes: if you enable GSO, for example, both on the NIC and in the stack, and the NIC does not have TSO, it is automatic.
Q: Aha, okay. Okay, thank you.

Q: Have you tested this feature together with packet filters, PF for example, to see how network address translation behaves?
A: No, we didn't test that.

Q: Hello. The question is: how much extra load does it add to the CPU?
A: We don't have this result, but we think the load on the CPU is not very high.
Q: But you said currently it's single threaded.

Q: Hi, I was just wondering why GSO is so much faster than no offloading at lower frequencies than at higher ones: at 1.5 GHz it was twice as fast, but at 2 GHz it wasn't that fast. Do you have any idea?
A: Because at 2 GHz, for example, the bottleneck is the link, the 10 gigabit link. The link is saturated.
Q: Yeah, okay, thanks.

Q: In these tests, what MTU was used on the network?
A: 1500.

Q: Okay. Any other questions? Let's thank our speaker then.
A: Thank you.