First, I want to thank the Linux Foundation for providing the travel grant that made this trip possible. I'm a PhD student from the University of Hong Kong, and today I'm going to introduce my recent work on paravirtualizing TCP for congestion control in virtual machines.

Here is my outline. First I will discuss why this research is needed; next I will try to give a concrete understanding of the problem; and then I will introduce my solution, a paravirtualized TCP, which we name PVTCP.

A virtualized data center is very different from a physical data center, because the competing entities are virtual machines instead of physical machines. In physical data centers, network delay consists mainly of the propagation delays in the network cables and the physical hardware switches. In virtualized data centers, virtualization adds extra overhead to network RTTs. Basically, there are two types of delay that virtualization introduces. The first is I/O virtualization overhead, whether PV or HVM: the guest VMs have no direct control of the hardware, so all I/O traffic must be proxied via a privileged domain, called the driver domain or domain 0. I should mention that passthrough I/O for each VM can perhaps avoid this. The second type of delay is caused by VM scheduling, because in many scenarios multiple VMs share one physical core, and once there is a scheduling queue, there are scheduling delays.

On our platform, we find that the delay caused by I/O virtualization in PV guests is sub-millisecond, but VM scheduling delays can be tens of milliseconds. For example, when we run two VMs sharing one core, the peak latency can be as high as 30 milliseconds; with three VMs sharing one core, the peak latency can be 60 milliseconds. This is because Xen's Credit scheduler uses a 30-millisecond time slice to schedule the virtual CPUs. So this type of delay can dominate the network RTT. And this picture shows the reality in public clouds: in both Amazon EC2 and Windows Azure, network RTTs vary widely, with no apparent pattern that would let you predict the delays.

Another problem is incast congestion. This problem is already hard in physical data centers, typically in large distributed data-processing applications such as MapReduce and web search. It occurs when multiple senders simultaneously transmit data to one TCP receiver, and it features a highly synchronized workload, meaning the next request will not be issued until the current request has been satisfied. The main symptom of this problem is that the application-level throughput perceived by the receiver, which we call goodput, can be very low, much lower than the link capacity. For example, in this figure, when the number of senders exceeds 20, the goodput drops below 100 Mbps on a 1 Gbps link, so a lot of network capacity is actually wasted. Prior work mainly focused on physical clusters, and many papers show that none of the listed solutions can fully eliminate this problem. The dominant factor is, once a packet is lost, whether the sender can learn of it as soon as possible, so that it can arrange retransmissions of the lost segments and keep the network link saturated.
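To make these RTT spikes concrete, here is a minimal user-space sketch of the kind of probe one could run between two co-located VMs. The address, port, and the assumption that the peer simply echoes each byte back are illustrative, not details from the talk.

```c
/* Minimal RTT probe: send one byte, wait for the one-byte echo, time the
 * round trip. Assumes a simple echo service on the peer VM (hypothetical
 * setup, not from the talk). Spikes of ~30-60 ms between co-located VMs
 * indicate vCPU scheduling delay rather than switch congestion. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define ECHO_PORT 5001   /* assumption: peer echoes bytes on this port */

int main(int argc, char **argv)
{
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port = htons(ECHO_PORT) };
    inet_pton(AF_INET, argc > 1 ? argv[1] : "192.168.0.2", &peer.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }
    char byte = 'x';
    for (int i = 0; i < 1000; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, &byte, 1) != 1 || read(fd, &byte, 1) != 1)
            break;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%4d  %8.3f ms\n", i,
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        usleep(10 * 1000);   /* 10 ms between probes */
    }
    close(fd);
    return 0;
}
```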
In particular, when tail loss happens, I mean loss at the tail of the TCP window, the sender can only count on timeout-based recovery to retransmit the lost segments. On the other hand, one method has been shown to be safe and effective: significantly reducing RTOmin. RTOmin is a parameter in the TCP source code, and RTO means retransmission timeout; I will say more about it shortly. Even with ECN support in the hardware switches, as in the Data Center TCP (DCTCP) solution from Microsoft, a small RTOmin still shows a lot of benefit in reducing query delays.

However, how does this solution perform in a virtual cluster, a cluster of virtual machines? To my knowledge, there is no prior work. In this figure, we run three VMs per CPU core on one host and three on another, with the two machines connected directly. We can see that even when there is no network congestion, the RTT still spikes. Conventionally, RTT is an indicator of the network delay through the hardware switches; in a virtualized environment, however, virtualization adds extra delays to the network RTT. So this kind of congestion signal is not real congestion, and we call it pseudo-congestion.

The blue points are the calculated RTO values and the red points are the measured RTTs. TCP's low-pass filter adopts RTOmin as a lower bound to protect TCP's retransmission timer, and we can see that without a large RTOmin to protect the timer, there will be frequent spurious RTOs. The next figure shows our experimental results on incast congestion, using as many as 20 physical machines and sender VMs. With a small RTOmin, many spurious RTOs degrade network performance; on the other hand, with a big RTOmin, throughput collapse happens under real network congestion. So it is very difficult, perhaps impossible, to find one suitable RTOmin that fits the whole range. This is exactly what Allman and Paxson noted: RTOmin embodies a trade-off between timely response and premature timeouts, and virtualized data centers are clearly a new example of this phenomenon.

VM scheduling delays can happen on both the sender side and the receiver side. When the sender VM is descheduled, the ACK from the TCP receiver cannot be received until the sender VM wakes up; when the receiver VM is descheduled, it cannot return the ACK to the sender until it gets CPU cycles again. But I find the nature of these two problems is very different. This experiment shows that when the sender VM suffers a scheduling delay, the RTO happens only once, after the VM wakes up. However, when the receiver VM suffers a scheduling delay, RTOs can happen multiple times before the receiver VM wakes up, which we call successive RTOs.

In this slide I want to give you a microscopic view of the problem with tcpdump. When the sender VM is preempted, we can see that even though the receiver VM returned the ACK before the sender VM woke up, RTOs do happen. From TCP's perspective this kind of RTO should not happen, because the arrival time of the ACK was not delayed; it is the time at which the guest actually receives it that is too late, and that lateness is caused by the hypervisor, not by anything TCP can see. But when the receiver VM is preempted, the ACKs cannot be returned until the receiver VM gets CPU cycles again, so in that case the RTO must happen on the sender side.
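To make the low-pass filter and the RTOmin floor concrete, here is a small sketch of the standard RTO estimator (RFC 6298, which Linux's estimator follows in spirit). The sample trace, with one 60 ms scheduling-delay spike, is illustrative.

```c
/* Sketch of the classic RTO estimator (RFC 6298). A single delayed sample,
 * e.g. a 60 ms vCPU scheduling delay, inflates SRTT/RTTVAR and hence RTO.
 * With the 200 ms floor (Linux's default RTOmin) the spike is absorbed, but
 * every real loss then waits at least 200 ms; with a 1 ms floor the spike
 * itself inflates the RTO for a while. All values are in milliseconds. */
#include <math.h>
#include <stdio.h>

#define ALPHA 0.125   /* gain for SRTT   (1/8) */
#define BETA  0.25    /* gain for RTTVAR (1/4) */

static double srtt, rttvar;
static int first_sample = 1;

static double rto_update(double r, double rto_min)
{
    if (first_sample) {
        srtt = r;
        rttvar = r / 2;
        first_sample = 0;
    } else {
        rttvar = (1 - BETA) * rttvar + BETA * fabs(srtt - r);
        srtt   = (1 - ALPHA) * srtt + ALPHA * r;
    }
    double rto = srtt + 4 * rttvar;
    return rto > rto_min ? rto : rto_min;   /* RTOmin is the lower bound */
}

int main(void)
{
    double mins[] = { 200.0, 1.0 };   /* default floor vs. a tiny floor */
    double samples[] = { 0.2, 0.2, 0.2, 60.0, 0.2, 0.2 };  /* one spike */
    for (int m = 0; m < 2; m++) {
        first_sample = 1;
        printf("RTOmin = %.0f ms:\n", mins[m]);
        for (int i = 0; i < 6; i++)
            printf("  RTT=%5.1f ms -> RTO=%6.1f ms\n",
                   samples[i], rto_update(samples[i], mins[m]));
    }
    return 0;
}
```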
Well, this is then much like a traditional networking problem. In this slide I want to give you a schematic explanation of what happens inside the guest OS when the sender VM suffers a scheduling delay. When the VM is running, both the TCP sender and the TCP receiver progress normally. When the TCP sender is preempted, the ACK from the TCP receiver has to be stored in the buffer of the driver domain, and it cannot be delivered until the TCP sender wakes up. After the TCP sender wakes up, both the network interrupt and the timer interrupt are pending, but the RTO fires just before the ACK enters the sender VM. This is due to common OS design: the timer interrupt is always handled before other interrupts, because kernel timers provide many other services for the kernel, for example processor scheduling and disk scheduling, while network processing is deferred and executed a little later in the bottom half.

So you may ask: are spurious RTOs really a problem? After all, we have detection algorithms for such spurious cases, and the sender can simply restore its congestion window after detecting one. There are two well-known detection algorithms, F-RTO and Eifel. Eifel performs much worse than F-RTO in some situations, so it is not implemented in Linux, and we adopt F-RTO in all experiments. In both scenarios we find that delayed ACK plays a significant role in F-RTO's detection rate, and reducing the delayed-ACK timeout value does not help at all. The phenomenon is worse when the receiver VM suffers a scheduling delay: the detection rate can be lower than 10%. Disabling delayed ACK, however, does seem to help.

The main purpose of delayed ACK is to reduce the receiver's workload: it allows the receiver to return one ACK for multiple segments instead of one ACK for each segment. This is much like interrupt coalescing, so you could call it ACK coalescing. So we were interested to see what happens to CPU overhead after disabling delayed ACK. In this figure we can see that disabling delayed ACK brings a lot of CPU overhead to both the sender VM and the receiver VM, because each ACK triggers a network operation and a notification interrupt on both the sender side and the receiver side. Looking into the details, we investigated how many ACKs are involved during the data transmission, and we found that with delayed ACK disabled, more than 10 times as many ACKs are sent. That is why it causes so much CPU overhead.

To this end, we propose a paravirtualized TCP, PVTCP, to address this problem. The main idea is that these suspicious RTOs only happen when a VM has just suffered a scheduling delay. So if we can detect that moment and make the guest OS aware of the delay, we have a chance to handle the problem. This idea is also partially inspired by Allman and Paxson: the more information the transport protocol has about the current network, the more efficiently it can use the network resources.
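Since F-RTO is the detector relied on in all the experiments above, here is a much-simplified sketch of its idea (after RFC 5682): after a timeout, retransmit only the first unacknowledged segment, then send new data and watch whether subsequent ACKs cover data that was never retransmitted. SACK handling and many edge cases are omitted, and the names are mine, not from any implementation.

```c
/* Much-simplified F-RTO logic (after RFC 5682): after an RTO, the sender
 * retransmits the first unacked segment and then, instead of blindly
 * retransmitting the rest, sends NEW data; if the following ACKs advance
 * snd_una past data that was never retransmitted, the timeout was spurious. */
#include <stdio.h>

enum frto_state { FRTO_IDLE, FRTO_SENT_RETRANS, FRTO_SENT_NEW };

struct sender {
    unsigned snd_una;       /* oldest unacknowledged byte */
    unsigned retrans_end;   /* end of the segment retransmitted on RTO */
    enum frto_state state;
};

static void on_rto(struct sender *s, unsigned seg_len)
{
    /* Step 1: retransmit only the first unacked segment. */
    s->retrans_end = s->snd_una + seg_len;
    s->state = FRTO_SENT_RETRANS;
}

/* Returns 1 if the RTO is judged spurious, 0 if genuine, -1 if undecided. */
static int on_ack(struct sender *s, unsigned ack)
{
    switch (s->state) {
    case FRTO_SENT_RETRANS:
        if (ack > s->snd_una) {          /* ACK advances the window */
            s->snd_una = ack;
            s->state = FRTO_SENT_NEW;    /* step 2: send two NEW segments */
            return -1;
        }
        s->state = FRTO_IDLE;            /* duplicate ACK: real loss */
        return 0;
    case FRTO_SENT_NEW:
        s->state = FRTO_IDLE;
        if (ack > s->retrans_end) {      /* acks data never retransmitted: */
            s->snd_una = ack;            /* the originals had arrived, so  */
            return 1;                    /* the timeout was spurious       */
        }
        return 0;                        /* otherwise treat as genuine loss */
    default:
        return -1;
    }
}

int main(void)
{
    struct sender s = { .snd_una = 1000, .state = FRTO_IDLE };
    on_rto(&s, 1448);
    on_ack(&s, 2448);                  /* first ACK advances the window */
    int spurious = on_ack(&s, 20000);  /* covers unretransmitted data */
    printf("RTO was %s\n", spurious == 1 ? "spurious" : "genuine");
    return 0;
}
```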
So how do we detect the VM's wake-up moment? Suppose we have three VMs per physical core, and the tick rate in the guest OS is 1000, so one jiffy is one millisecond. When a VM is running, it periodically registers clock events with the hypervisor, and the hypervisor uses a one-shot timer to deliver the virtual timer interrupt so that the guest OS can increment its system clock, called jiffies. However, when the VM suffers a scheduling delay, the pre-registered events can only be delivered after it wakes up. In this scenario, since the maximum scheduling delay is 60 milliseconds, this causes a sudden jump of 60 in jiffies. So if we detect that the system clock has not increased continuously, we can say with some confidence that the VM has just suffered a scheduling delay and has just woken up. I want to mention that this technique only applies to a UP VM, I mean a VM with a single vCPU, because in an SMP VM multiple vCPUs can update jiffies, and you would need to keep per-vCPU local jiffies to detect the wake-up moment.

As we have seen, when the sender VM suffers a scheduling delay, the spurious RTOs should not happen at all: they can be avoided, so we do not need to detect them. How? For TCP, when the sender VM is descheduled, the retransmission timer's expiry is effectively pushed back to the moment the VM wakes up again, and because the network interrupt is executed a little later than the timer interrupt, the timer fires before the ACK is processed. In our solution, we slightly extend the retransmission timer's expiry time, by only one millisecond. In this way the ACK enters the VM first, and once it does, it re-arms the retransmission timer with a fresh timeout value, so the spurious RTO simply does not happen.

There is another problem: once that ACK enters the VM, it will be used to calculate the network RTT, but the measured RTT now includes the VM scheduling delay, which can be tens of milliseconds and will dominate the network RTT. Fed through TCP's low-pass filter, this causes serious inflation of the smoothed RTT, the RTT variance, and finally the computed RTO value. So in our solution we adopt a conservative approach: at that moment we use the current smoothed RTT in place of the measured RTT.

For the receiver-side problem, the spurious RTOs must happen, so we have to let the sender detect them. The core task of the detection algorithms is to eliminate retransmission ambiguity; that is, the sender must distinguish the ACKs for the original transmission from the ACKs for the retransmission. But delayed ACK causes retransmission ambiguity, because it allows the receiver to return one ACK for multiple segments instead of one for each segment; whether to use delayed ACK is really a trade-off between retransmission ambiguity and CPU overhead. Here we consider the two detection algorithms again: Eifel requires the first ACK from the receiver after the RTO, and F-RTO relies on the next two ACKs from the receiver after the RTO happens. So in our approach, the receiver returns just three undelayed ACKs right after it wakes up, which helps both Eifel and F-RTO detect the spurious cases.
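As a rough user-space rendering of the guest-side logic just described (the function names, the two-tick threshold, and the simulation values are my own illustration, not the actual patch):

```c
/* Sketch of PVTCP's guest-side logic for a UP VM with HZ=1000: detect a
 * wake-up from a jiffies discontinuity, then (a) push a pending
 * retransmission timer back by ~1 ms so the buffered ACK is processed
 * first, and (b) keep the delay-inflated RTT sample out of the estimator. */
#include <stdio.h>

#define WAKEUP_JUMP 2   /* more than one missing tick implies descheduling */

static unsigned long last_seen_jiffies;

/* Called from the timer path: did jiffies jump instead of ticking by 1? */
static int vm_just_woke_up(unsigned long jiffies_now)
{
    long delta = (long)(jiffies_now - last_seen_jiffies);
    last_seen_jiffies = jiffies_now;
    return delta >= WAKEUP_JUMP;   /* e.g. +60 after a 60 ms descheduling */
}

/* Sender side, on wake-up: extend the RTO deadline by one tick so the ACK
 * sitting in the driver domain's buffer enters the stack before the timer
 * fires and can re-arm it with a fresh value. */
static void pvtcp_on_wakeup(unsigned long *rto_expires, unsigned long now)
{
    if (*rto_expires <= now)
        *rto_expires = now + 1;    /* +1 jiffy = +1 ms */
}

/* RTT sampling: a sample taken right after wake-up includes the scheduling
 * delay, so conservatively fall back to the current smoothed RTT instead. */
static double pvtcp_rtt_sample(double measured, double srtt, int just_woke)
{
    return just_woke ? srtt : measured;
}

int main(void)
{
    last_seen_jiffies = 1000;
    printf("woke up? %d\n", vm_just_woke_up(1060));  /* 60 ms jump -> 1 */
    unsigned long rto = 1055;
    pvtcp_on_wakeup(&rto, 1060);
    printf("RTO timer moved to t=%lu\n", rto);       /* 1061 */
    printf("RTT sample used: %.1f ms\n", pvtcp_rtt_sample(60.2, 0.3, 1));
    return 0;
}
```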
Now, our evaluation. In our experiments we create an extreme case: on each physical machine we use only one CPU core, and on that core we run three VMs, with as many as 20 senders and one receiver. As we have seen, for TCP it is difficult to find one suitable RTOmin for the whole range, because TCP cannot distinguish pseudo-congestion from real congestion. But with PVTCP, even with a 1-millisecond RTOmin, we achieve the highest goodput over the whole range. As for CPU overhead: we saw before that when there is no network congestion, TCP needs a very large RTOmin to ride out the RTT spikes; with PVTCP, even using a 1-millisecond RTOmin, we achieve almost the same performance as TCP with a large RTOmin.

In more detail, here we show the total number of ACKs involved. When the sender VM is preempted, since the spurious RTOs are avoided entirely, no additional ACKs are introduced. When the receiver VM is preempted, since delayed ACK has to be temporarily disabled, PVTCP introduces 7.4% more ACKs; but compared with totally disabling delayed ACK, I think this solution is very lightweight.

Before I conclude, I'd like to raise one concern. We know that while a VM is descheduled, all its incoming packets have to be stored in a buffer, so the buffer size matters. The default value is only 32, which means it can store at most 32 packets at a time. But in data center networks, on 1 Gbps and 10 Gbps links, TCP's window size can be several hundred or even several thousand packets, so a lot of packets are actually dropped due to this small buffer. Maybe this parameter needs to be set bigger.
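As a back-of-the-envelope illustration (my arithmetic, not a number from the talk) of how far 32 slots falls short during one worst-case descheduling:

```c
/* Back-of-the-envelope: how many MTU-sized packets can arrive while a VM
 * is descheduled? (Illustrative arithmetic, not a figure from the talk.) */
#include <stdio.h>

int main(void)
{
    double link_bps  = 1e9;     /* 1 Gbps link */
    double desched_s = 0.060;   /* worst-case delay with 3 VMs per core */
    double mtu_bytes = 1500.0;

    double bytes   = link_bps / 8 * desched_s;   /* 7.5 MB */
    double packets = bytes / mtu_bytes;          /* ~5000 packets */
    printf("%.0f packets can arrive vs. a 32-slot vif queue\n", packets);
    return 0;
}
```

So on the order of thousands of packets can arrive at line rate during one full descheduling, which is also where the buffering concerns raised in the discussion below come in.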
And let me summarize my work. The main problem is that VM scheduling delays can cause pseudo-congestion for TCP, and the sender-side problem and the receiver-side problem are very different in nature. To this end, we provide PVTCP, built on a method to detect a VM's wake-up moment: for the sender-side problem, the spurious RTOs can be avoided; for the receiver-side problem, the spurious RTOs can be detected. As for future work, well, I need your input, and that's why I came here today. So, am I ready to take questions?

George, please. Let me organize this so that I don't have to drag the wire around too much. OK.

So thank you, this has been a very informative talk. A lot of what you said comes back to the very long scheduling quantum, and we have actually been thinking about changing that default to be much lower in XenServer, sorry, in Xen. So have you tried it with a shorter quantum, like 10 milliseconds instead of 30, or 5 milliseconds?

Good question. I used the default, 30 milliseconds. But I think there is no free lunch: if you reduce the scheduling time-slice length, there may be two problems. The first is a reduced CPU cache effect; the second is increased VM context switching. So there's no free lunch. In many scenarios VM scheduling must happen in order to allocate CPU cycles properly, so this kind of delay cannot be fully eliminated. In my research I want to think about how we can live with this kind of latency rather than avoid it or hide it.

Okay, yeah, you're right: we are going to have some delays no matter what, and given that, having really bad delays helps you test the limits of your approach. Also, have you tried this with Credit2?

Credit2, no.

Okay, because Credit2 was designed in part to address exactly this kind of issue with delayed ACKs and the effects they have, so I'd be interested to see what kind of results you get.

I will try it later on.

Thanks. Okay, thanks. We have a lot more questions. In the meantime, presumably there is a link somewhere to your full paper, right?

My full paper? Yeah. Part of this work has been published at the SMP conference, and part has been submitted to another conference and is under review.

Okay. But I did see you submitted a paper on the Xen Project site, under Research.

Yeah, yeah. If you go to the Xen Project site, under Research, I think you will find it.

Okay, I'll do that. Please raise your hand in turn, so I don't have to walk too much. Have you done any testing with jumbo frames?

Jumbo frames? Actually, in my experiments I varied the block size, but I didn't investigate jumbo frames much. The block size may be chunked into multiple frames by TCP, so I didn't look too closely at those details.

You might look at that, because it changes things radically when you go up to large block sizes, especially for storage.

Thanks. Thanks.

This is all very interesting, well done. I just had one comment. At the end there, you mentioned that maybe the buffer size is too small. You are probably aware of this already, but there are difficulties with just increasing buffer sizes in circumstances where the buffer isn't going to be serviced soon: you can get very, very long RTTs as a result, and that also makes TCP perform very poorly. This problem of excessive buffering has started to be recognized.

You mean the bufferbloat problem, right?

Bufferbloat, yes.

Yeah, I'm aware of that problem. I think for some latency-sensitive applications bufferbloat is really a big concern, but for some other applications, for example bulk transfer, we do not care about latency, we just care about throughput.

I'm afraid that's not the case. To give an extreme example, if I use my mobile internet from my laptop on the train, I sometimes experience RTTs measured in hundreds of thousands of milliseconds, you know, minutes sometimes. And of course, under those circumstances the proportion of the buffered data that is actually useful is very small, because it consists mostly of retransmissions of the same stuff.

Sure, sure.

And you also end up in a situation where TCP tries to estimate the RTT, and if you have too large a buffer the RTT varies too much, and you end up with a lot of spurious retransmissions all being buffered up. So really the only sensible way to do this, I think, is either not to have very large buffers, or to somehow cap the time that a particular frame may spend in any buffer to a sensibly low value. If the buffer is being serviced quickly, that's fine; obviously, if you want a big fat pipe, you need a big buffer. But if the VM is descheduled and suddenly the host has a great CPU crisis, or the VM isn't able to deal with the incoming data, then you need to stop buffering and start throwing packets away instead.

Thanks for your comment. Yes, this parameter needs to be chosen very carefully; maybe it is a new research problem. That's why I used the word "perhaps". But in my experiments, I think a larger value matters.

Do we have any more questions? I saw a lot more hands raised earlier.

Did you also test with various TCP versions, for instance TCP Westwood? I seem to recall that some TCP versions are less sensitive to delay variations.

You mean TCP Vegas? TCP Westwood?

I don't really remember the exact version. What I meant is, did you test with various TCP versions, and which TCP version did you use?

TCP NewReno.

NewReno.
Yeah, yeah. Oh, I see. But as far as I know, the default TCP version in Linux is TCP CUBIC.

Yes, it is the default, but you can change it at runtime: CUBIC, BIC, Westwood, Scalable, and many others. And my question is, have you also observed similar results with TCP CUBIC or other versions?

I think this kind of delay is not specific to one congestion control algorithm; it is generic to all TCP algorithms. All TCP algorithms rely, more or less, on RTT measurement, so I believe this problem also affects TCP CUBIC.

Okay, nice. Any more questions?

Just to follow up on this buffer thing: the reason why vifs have such a small txqueuelen is to reduce the amount of memory that can be used up inside the driver domain, or dom0, by guests not handling their packets. If you have hundreds of vifs in a driver domain and each one could queue up 10,000 packets... I mean, that's the reason why it's 32 and not 10,000, in addition to the bufferbloat issue.

But I think the aggregate buffer size of all guest domains should have something to do with the network link capacity, right? If it is only a 1 Gbps link, you do not need to maintain a very large buffer per guest, even with multiple guest domains. That's my viewpoint.

Actually, I have one question regarding this buffer size.

Buffer size?

You mean the buffer in the backend, right? If the buffer is full, does the frontend still try to send packets, or does the frontend just wait?

If the buffer is full at the backend side, the driver domain just drops the packets.

Ah, it drops them.

Yeah, it drops them. That's why it causes intensive packet loss. For example, if the TCP window is 100 and you have a buffer of only 10, the remaining 90 packets will be dropped, which causes serious disruption at the TCP level.

And the drop happens in the backend domain, in the backend driver?

Sorry?

Does the drop happen at the backend?

In the backend, yeah.

More questions? I have one, then. Is the code already available somewhere?

Actually, the implementation is very, very simple; you could implement the solution on your own, because it requires no modification to the hypervisor. Of course, I can release my own source code, maybe next year.

Okay. In that case, if we have no more questions, we can get ready to go for lunch.

Yeah, do find me during lunch if you are interested in this topic, because I really need your input. So please be back at 1:45.