Okay, welcome to the very last session of the day. Thank you for sticking around. What I want to do today is tell you a little bit about some work I've been doing to implement a new transport protocol and make it available through GRPC. That transport protocol is called HOMA. Just for background, as I'm sure you all know, GRPC is based primarily on TCP. It sends HTTP requests over TCP sockets in order to transport messages from the client to the server and back again. By and large, that works fine, particularly for long-haul traffic where the latencies are already high. But unfortunately, TCP has pretty poor performance in data centers; I'm gonna come back and explain that. Especially for really small messages, the performance can be very, very poor. So at Stanford, we've developed a new transport protocol called HOMA. It's a clean-slate design, built just for data centers, and it is dramatically faster than TCP: one to two orders of magnitude better tail latency for short messages, particularly when you're running under high load and there's contention, with short traffic and long traffic all mixed together. HOMA is just way, way, way better than TCP. The problem, though, is that HOMA does not have TCP's API. It's not API compatible with TCP; I'll explain why in a second. And that makes it kind of hard for people to deploy HOMA. So this brought me to the idea of integrating HOMA with GRPC, which is kind of a win-win situation. The first thing is that HOMA provides a considerable performance boost to GRPC, particularly for short messages. But the second thing is that GRPC hides the socket-level APIs. People using GRPC don't program to the TCP socket APIs, they program to the GRPC APIs. And that means that once HOMA has been integrated into GRPC, any GRPC application can easily switch to using HOMA. Basically, it's a one-line change right now. GRPC masks all the API differences.
So it feels like a situation that's good for HOMA and good for GRPC. And one of the reasons for talking today is to see if there's interest out there in people actually trying to use HOMA with GRPC. So here's what I'm gonna do in the rest of this talk; it has four sections. First, I'm gonna spend a little bit of time talking about TCP and why TCP actually isn't a great protocol for data centers. And I'll give a very, very quick overview of HOMA, probably just enough to confuse you about HOMA, but it'll give you a little bit of the flavor. Then I'll talk about how HOMA and GRPC have been integrated: how do you use it (that's pretty easy), and a little bit about how it's structured. And then I wanna talk a little bit about some lessons learned, some observations about GRPC from this project, particularly the complexity of the GRPC code base and some fairly serious performance issues with GRPC. So that's what you'll hear about over the next 15 or 20 minutes. Let's start with TCP. Before I talk about problems with TCP, I just wanna say TCP is really a mind-blowing engineering achievement. It was designed 45 years ago; I may be the only person in the room that was alive when TCP was designed. And if you think about the network world back then, there were on the order of 100 hosts, and the total aggregate bandwidth of all links in the internet was something like 100 megabits or less. To be able to design a protocol for that world and have it, almost 50 years later, still widely used and surviving technology change after technology change is really amazing. So kudos to the TCP designers. But data centers didn't exist when TCP was designed, so there was no way they could have designed it for data centers. And if you look at the design of TCP, literally every major aspect of the TCP architecture is wrong for data centers. I'll show you in a slide or two.
And the result of that is you get bad performance; short messages in particular get bad performance. If the workloads are heavy, that makes it even worse. And if you care about tail latency, not average latency but say 99th-percentile tail latency, then it's particularly problematic. So the bottom line is that today we have amazing data center networks; the performance our data center networks can provide is really mind-blowing. But TCP makes it impossible for applications to harness that performance; they get only a tiny fraction of it. I'm gonna talk a little bit about this. I've written an article with more details; there's a link on the slide, and you can go look at it if you wanna see the full details. So let's just take one particular aspect of TCP's design, which is the data model it provides for applications. The data model is a byte stream. You open a connection, you push bytes in one end, and the bytes come out the other end. There is no structure to those bytes, no boundaries or anything; it's just bytes. Well, the problem with that is that applications typically care about messages. Certainly GRPC is inherently message-based: the communication between a client and server is a series of messages sent in both directions. And so when GRPC or another application uses TCP for its messages, it pushes messages into the stream, but the message boundaries are lost; TCP has no knowledge of those boundaries. So when the bytes come out the other end, again, there's no knowledge of the boundaries. Typically a receiver will provide a buffer: give me the next 4K or 64K bytes from this stream. That could give you all of a message, it could give you just part of a message, or it could give you bits and pieces of several messages. In this particular example, for instance, the green message is split across one, two, three, four receive operations on the TCP stream.
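To deal with this, applications have to layer their own framing on top of the byte stream. Here's a minimal C++ sketch of length-prefix framing; this is illustrative only, not gRPC's actual framing code:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Prepend a 4-byte length header so the receiver can find the message
// boundaries that TCP itself discards.
std::vector<uint8_t> frame(const std::string& msg) {
    uint32_t len = static_cast<uint32_t>(msg.size());
    std::vector<uint8_t> out(4 + msg.size());
    std::memcpy(out.data(), &len, 4);
    std::memcpy(out.data() + 4, msg.data(), msg.size());
    return out;
}

// Reassemble complete messages from whatever run of stream bytes the last
// read returned; leftover bytes stay in `buf` until the rest arrives.
std::vector<std::string> unframe(std::vector<uint8_t>& buf) {
    std::vector<std::string> msgs;
    size_t pos = 0;
    while (buf.size() - pos >= 4) {
        uint32_t len;
        std::memcpy(&len, buf.data() + pos, 4);
        if (buf.size() - pos - 4 < len) break;  // partial message: wait for more bytes
        msgs.emplace_back(reinterpret_cast<const char*>(buf.data() + pos + 4), len);
        pos += 4 + len;
    }
    buf.erase(buf.begin(), buf.begin() + pos);
    return msgs;
}
```

Note that the receiver has to keep a buffer of leftover bytes across reads; that stateful buffer is exactly the extra complexity being described here, and it's also why multiple threads can't safely read the same stream concurrently.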
So the first problem is that you need extra complexity in your application to re-impose the message boundaries: you have to output length information into the stream, and when you receive things, reassemble full messages. That's an inconvenience, but not an insurmountable problem. A much bigger problem is load balancing. This lack of message boundaries basically destroys load balancing, because you can't share one TCP stream across multiple threads. What you'd like to do, if you're building a high-performance server, is have a whole bunch of threads servicing incoming requests. Those requests arrive over one TCP connection, or maybe many TCP connections, and you'd like to load-balance the requests and spread them out across all of those threads. But if you have several threads that try to read from the same TCP stream, the problem is that chunks of messages can get spread across all of those threads. In this particular case, the chunks from the green message are spread across all three threads. And in fact, in this case there's no way for thread three and thread one to even tell which of those chunks was actually first in the message. So basically you can't share a TCP stream across multiple threads reading from it concurrently; it just doesn't work. That leaves you with some unpleasant alternatives. If you want to do load balancing, you have two choices. The first one is to introduce a dispatcher thread: one thread that manages all of the incoming connections for your application, reads from those connections, reassembles messages, and then dispatches the messages out to a collection of worker threads. You can do that, and it gives you load balancing, and it deals with the problem that you can't receive from one connection with multiple threads, but it has a couple of problems. First, every message now has to pass through an extra thread.
It has to go into the dispatcher thread and then take a thread context switch to a different thread to get processed, and that's a pretty significant latency hit. The second thing is that the throughput of your application is limited by the throughput of that one dispatcher thread. Fundamentally, you can only handle messages as fast as that dispatcher thread can dispatch them, and that can be a severe limit for applications. The other alternative is to divide up the connections: get rid of the dispatcher thread and just statically allocate certain connections to each of your worker threads, so each of them has a subset of the connections. That gets rid of the problems associated with the dispatcher thread, but now the load balancing is static: you assign connections to worker threads, and not all connections are equally active all the time. You could end up in a situation where a few connections are very hot and other connections are idle, and then you get poor load balancing across your threads. By the way, this second approach is what memcached uses, for example, and it does have load balancing issues. So the whole notion of using streams is just a bad idea; you really want a transport mechanism that's based on messages. Okay, that's just one of several things. We had been having issues with network transport while building high-performance data center applications in my research group, and so five or ten years ago we decided to see what would happen if we just stepped back and did a clean-slate redesign of network transport. If we wanted the perfect transport for a data center, what would it look like? And over the next year or two, HOMA emerged from that. What's interesting is that HOMA is different from TCP in every major aspect of its design. I said earlier that all of those aspects of TCP were wrong, so I'm just gonna go through these very quickly.
Yeah, this is probably gonna be too quick to make sense, but just to give you a flavor. As I said, TCP is based on streams; HOMA is based on messages. You send and receive messages. Actually, it's really based on RPCs: the notion that you send a request message and you get a response message back. Second thing is that TCP is connection oriented; HOMA has no connections, no concept of a connection. There are a whole bunch of problems with connections. Actually, it's been interesting, because I've heard about them in various other talks today. For example, connections get dropped, and then you have to detect that and reopen them. If you're connectionless, that's not an issue; there's nothing to drop. Connections have state, and the state can be problematic; you can easily have thousands of connections open, each with its own state. If you have no connections, that state just goes away. So HOMA is connectionless. People seem to think that you have to have connections to do anything good in a network, that you can't have nice behavior without connections. It's actually not true: you can still have, for example, reliable, flow-controlled delivery without connections. Third thing is fair scheduling. In TCP, when there's a whole bunch of incoming traffic, the receiver tries to split its bandwidth across all of the TCP streams that are open at the time. That's well known to produce the worst possible result: if you have a whole bunch of large messages, it means nobody finishes until the very end. Everybody waits. We know from scheduling theory that you're much better off letting somebody finish; there's no reason everybody has to wait a long time. And so HOMA uses run-to-completion: it will typically pick one message and finish it, get it completely delivered, so at least somebody can make progress. And in fact, HOMA prioritizes short messages.
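As a toy illustration of run-to-completion (this is just a sketch of the scheduling idea, not HOMA's actual code): rather than splitting bandwidth fairly across everybody, you grant bandwidth to the message with the fewest remaining bytes, the shortest-remaining-processing-time (SRPT) policy:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of run-to-completion scheduling: among the messages with
// bytes still to deliver, serve the one with the fewest remaining bytes
// (SRPT), instead of splitting bandwidth fairly the way TCP does.
struct Message {
    std::string id;
    size_t remainingBytes;
};

// Returns the index of the message to serve next, or -1 if none is pending.
int pickNext(const std::vector<Message>& pending) {
    int best = -1;
    for (size_t i = 0; i < pending.size(); i++) {
        if (pending[i].remainingBytes == 0) continue;  // already delivered
        if (best < 0 ||
            pending[i].remainingBytes < pending[static_cast<size_t>(best)].remainingBytes) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Under this policy short messages finish quickly instead of waiting behind everything else, which is exactly why the tail-latency numbers later in the talk favor short messages so heavily.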
It's even better, because now you don't have short messages stuck behind long messages, getting queued up. The next aspect is how you do congestion control when you have a whole bunch of senders sending to one receiver. This is one of TCP's biggest problems. If you follow the literature, there have been dozens of papers written over the last 20 years on how to improve TCP's congestion control; it has a whole bunch of bad properties, and it's very difficult to make TCP work well. It turns out that in TCP, congestion control is driven from the sender. The sender tries to detect that there are problems, typically because packets get dropped or it gets back notifications saying things are getting congested. That makes it very difficult to do congestion control well. HOMA takes the opposite approach: it does congestion control from the receiver. It turns out that's a good place to do it, because congestion typically happens at the receiver, at its downlink; that's the primary congestion point. And the receiver is the one place that happens to know about all of the traffic on that downlink. So HOMA can do much more effective congestion control. Next, TCP assumes that when it issues a stream of packets, they will arrive at the destination in the same order they were transmitted; there are a bunch of assumptions about that baked into the TCP ecosystem. That's a problem. It restricts you in a whole bunch of different ways, and in particular it affects load balancing: you can't load-balance across your data center fabric as effectively, because all of the packets of one flow have to follow the exact same path through the fabric so they don't get reordered. HOMA eliminates that; there are no ordering requirements. So again, you can do much more effective load balancing, both in the fabric and also among the cores processing incoming packets on the receiver. And then last, modern network switches have priority queues.
For each egress port, there are typically eight or 16 priority queues. You can specify a priority in your packets, and they get sorted among those queues, and the highest-priority ones get delivered first. TCP was designed before such things existed, so it doesn't take advantage of them. HOMA can take advantage of them to implement its run-to-completion behavior and favor short messages, and again provide some really nice performance advantages. So that was a very quick once-over, just to give you the flavor of HOMA. There are various papers on it available if you want to learn more; I'll have a link to the HOMA Wiki at the end of the talk that you can go to to find out about all that stuff. The result is that HOMA way outperforms TCP. In particular, its latency is way lower, and you see that even at low load: in an unloaded network, HOMA's basic round-trip times for short messages are just a little bit more than half those of TCP. But it's under load that HOMA really outperforms TCP. On the right side of the slide, I've shown the results of one measurement taken in a 40-node cluster, using a workload that was derived from a Facebook Hadoop cluster a few years ago. So it's a sort of data center workload, and it has messages of a whole bunch of different lengths being sent. The X axis shows message length, and the Y axis shows slowdown, which is basically a measure of latency: how long it took messages to get through. Notice that the Y axis is a logarithmic scale. There are three curves here: HOMA is the blue curve, TCP is the green curve, and the brown curve is DCTCP, which is an improved version of TCP. The interesting thing is that the slowdown for HOMA is better at every message size; HOMA provides better performance than TCP. And it's not a little, it's a lot: more than an order of magnitude better performance.
In fact, across all of the experiments we did, the latency benefit from HOMA varied from a factor of seven to a factor of 83, depending on the particular application and message length. So it's a huge, huge difference in performance. Okay, this work was originally done as part of the PhD dissertation of one of my graduate students, Behnam Montazeri, who by the way now works at Google on the networking team. And honestly, the results were way better than my wildest dreams. If it had been a factor of two or three better than TCP, I would have felt pretty good, but a factor of 50, maybe 100x in some cases, is pretty amazing. So I decided to make it my life mission to see: is it possible to actually bring HOMA into widespread usage in the data center? And I enter that sober-minded; if you laughed at me as being a fool, I wouldn't blame you, because there is probably no more entrenched standard in the history of the world than TCP. So the idea that you could displace it, maybe that's a fool's errand, but I decided I'm going to try it until either it happens or I realize why it actually can't ever happen. So I've been doing a bunch of things over the last few years. First, for those of you that know me, you know that I may be a professor, but first and foremost I'm a coder. I love programming; if I write less than 5 or 10 thousand lines of code a year, I feel disappointed in my year. So I personally have built a Linux kernel driver for HOMA to make it easy for people to use. You can download it from GitHub and install it; it doesn't require any modifications to the Linux kernel, it's just an installable driver. So now people can use HOMA, but this brings us back to the API problem. As I said earlier, HOMA's API is not compatible with TCP, and it needs to be incompatible. That wasn't just a frivolous decision; it's not that we should have simply made it compatible.
The thing is, a lot of HOMA's benefit comes from having this message-based API, and so if we wanna fix all of TCP's problems, we have to change the API. But then what do you do? Nobody's gonna go through and modify all the existing TCP applications. I put thousands on the slide; there are probably millions of TCP applications, and they're not all gonna get changed to use HOMA. Probably many of them don't need to be changed anyhow; TCP is probably adequate for many applications. So what do we do? This is where I got the idea of integrating HOMA with RPC frameworks, because the frameworks tend to mask the underlying interfaces; you see the framework interface instead. If I could integrate HOMA with GRPC, and there's a relatively small number of frameworks out there, Thrift and a few others, then hopefully it could become really, really easy to use HOMA with applications. I decided to do GRPC as the first one: I think it's the nicest in terms of the features it provides, and it seemed like it had the potential to be the most widely used. So that leads us to the GRPC HOMA project. GRPC HOMA is a library that allows HOMA to hook into GRPC and be used with it. There's a GitHub repo; it's all open source, you can grab it. This is still work in progress: it supports C++ applications, Java support is under development but not there yet, I've not done anything with Go yet, and it doesn't support secure connections, so it's currently only for insecure connections. And I should mention: HOMA is only for data centers. HOMA doesn't really make sense for long-haul networks; the properties that make it so effective within a data center would cause problems if you tried to use it over a long-haul network. So it's for apps running inside data centers. Okay, so now the question of how you use HOMA with GRPC. It's actually pretty straightforward.
First you compile and install the HOMA kernel module on your Linux system. Then you compile the GRPC HOMA library and link your applications with it. And then it takes a one-line change on the client and a one-line change on the server. Right now the way HOMA integrates is through the channel-credentials mechanism of GRPC. On client machines using C++, you typically have a call someplace where you call grpc::CreateChannel, and you would normally pass in channel credentials you got from GRPC. Instead, you call a HOMA method to get channel credentials that work for HOMA. And that's it on the client side; all existing client code will work just fine if you do that. Similarly on the server side: when you create a listening port, you pass in credentials. Instead of getting GRPC credentials, you call a different HOMA method to get HOMA credentials, and you're done on the server side. So the integration should be pretty easy. Ideally, I'd like to make it so you don't even have to do that, so that effectively you just pass in a different server name or internet address as a command-line parameter to your application and it would work with HOMA. I haven't figured out how to do that yet with GRPC, but from some discussions we were having beforehand, there may be a way to make this even simpler. So how is HOMA integrated with GRPC? I'm not gonna go over the details, but just to give you a flavor, there's basically a whole bunch of classes and mechanisms you need to use in order to create a new transport. The first one I already mentioned, which is the credentials mechanism: HOMA generates credentials which get passed to the application, and then the application passes them down into GRPC. Once that happens, GRPC will use those credentials to communicate back with the transport. There's a collection of classes, channel, subchannel, connector, and a couple of others, that are used to open connections.
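To make the usage concrete, the client-side and server-side one-line changes described a moment ago might look roughly like this. The class and method names below are my best reading of the grpc_homa library's API and may not be exact; check the repo's README for the real names:

```cpp
// Hypothetical sketch of the two one-line changes; requires gRPC and the
// grpc_homa library, so this is not a standalone program.
#include <grpcpp/grpcpp.h>
#include "homa_client.h"     // assumed grpc_homa header names
#include "homa_listener.h"

// Client side: the only change is where the channel credentials come from.
std::shared_ptr<grpc::Channel> makeHomaChannel() {
    return grpc::CreateChannel(
        "server-node:4000",
        HomaClient::insecureChannelCredentials());  // instead of grpc::InsecureChannelCredentials()
    // ... create stubs and issue RPCs exactly as before ...
}

// Server side: likewise, only the listening-port credentials change.
void addHomaPort(grpc::ServerBuilder& builder) {
    builder.AddListeningPort(
        "homa:4000",
        HomaListener::insecureCredentials());       // instead of grpc::InsecureServerCredentials()
}
```

Everything else in the application, stubs, service implementations, and the rest, stays untouched, which is the point of routing the integration through the credentials mechanism.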
Once that's done, there are two major interfaces between the main core of GRPC and a transport: the GRPC transport mechanism and the GRPC stream mechanism. I put those in red because those are the main ones; those are pretty well-defined APIs for communication. And then finally, to build a new transport you have to use a bunch of other APIs that are a little bit less well known. For example, HOMA uses the GRPC notification mechanism to find out when packets arrive, when a HOMA socket becomes readable; it needs to use a GRPC mechanism for that. There's also a bunch of utility classes used to pass information back and forth between transports and GRPC: the slice mechanism, the metadata-batch mechanism, and so on. I'm not gonna go through all of the details of those, but that's a very quick synopsis of what it looks like. So now I'd like to talk about two challenges that I encountered in working with GRPC. The first one is the complexity of the GRPC code base. I've been programming for a little bit more than 50 years now, I'm embarrassed to say, and I think the GRPC code base is probably the most challenging one I've ever worked in, in my career. It's very complex, with a very large number of classes, many layers, and very deep call chains, as we were discussing before the talk. At one point I was trying to figure out what was going on, so I set a breakpoint at the lowest level, in the socket-open system call, just to see how GRPC found its way from an application down to opening a socket. And when I typed "where" in GDB, there were 50 levels of method call between the application call and the socket-open call. Another problem is that the GRPC implementation is based very heavily on closures. What this means is that when one method A wants to call a method B, it doesn't actually call B. Instead it creates a closure which packages up the desire to someday call B, and it hands that closure off to somebody else, who hands it to somebody else.
That gets handed to somebody else, and then eventually somebody invokes the closure and B gets called. The problem with this is that if you put a breakpoint in B, you have no idea who A actually was; that's long gone, and all you have is this closure. In fact, there are closures that invoke closures that invoke closures, so you can be many levels deep in closures. The other problem is that when you invoke a closure, you have no idea where it's going to go either. So this makes it pretty difficult to see the structure of the code. And then one particular nit is that the metadata mechanism is pretty complex. It took quite a while to understand; once I understood it, it wasn't that hard to use, but trying to figure out exactly which three lines of code I had to write took several days of work. So it's a complicated code base, and there are a lot of dependencies that are non-obvious. To integrate a transport, you have to do a lot more than just implement the stream mechanism; there are a bunch of other things. For example, I had to figure out how to get GRPC to talk to my transport in the first place. It took weeks to figure out that, oh, you have to go through the credentials, or maybe there's a better mechanism through resolvers that I haven't actually discovered yet. So that was hard. And there are things like: when a transport calls back into GRPC to establish a channel, it has to add additional arguments to the list that was passed in. Again, the only way to figure that out is to look at existing transports and see that they do it. I tried the obvious way, without the arguments, and of course channels didn't work, so I had to figure out what I needed to add. There are a lot of these internal dependencies that you only figure out through difficult experience. And then unfortunately, there's almost no internal documentation in the GRPC code base, which was very sad.
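To make the closure indirection concrete, here's a simplified, stdlib-only model of the pattern. The real gRPC code uses its own grpc_closure type with combiner and exec-ctx machinery, not std::function, so treat this as an analogy:

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Simplified model of closure-style control flow: instead of A calling B
// directly, A enqueues "call B later," and some unrelated code eventually
// drains the queue. A stack trace taken inside B no longer shows A.
std::queue<std::function<void()>> pendingClosures;
std::vector<std::string> trace;

void methodB() { trace.push_back("B runs (caller unknown)"); }

void methodA() {
    trace.push_back("A schedules B");
    pendingClosures.push([] { methodB(); });  // not a direct call to B
}

// Run by some other component, at some later time.
void drain() {
    while (!pendingClosures.empty()) {
        pendingClosures.front()();
        pendingClosures.pop();
    }
}
```

When you break inside methodB here, the call stack shows drain, not methodA; with several layers of closures invoking closures, reconstructing who actually requested the work becomes very difficult.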
I do want to say that there's external documentation for the API, which is pretty good. I had no trouble learning how to write GRPC applications and servers; there was great documentation for that. But internally, not much. There was one webpage that described the GRPC transport and stream mechanisms. VJPi, wherever you are, there's a place in heaven reserved with your name on it; thank you very much, that was super helpful. But looking at the code, there are essentially zero comments, and that was problematic because it made it hard to figure out all of these dependencies you have to resolve. I ended up having to just reverse-engineer the existing chttp2 transport to figure this out, and that was difficult because it was hard to tell what's fundamental, what all transports have to do, versus what's specific to chttp2. I'm gonna skip this slide since we're running late on time, and talk about my second challenge, which is performance. I've done at least some basic performance measurements here. This slide shows the performance for a really, really simple RPC with a very short request message and a short response message. You can see it takes 116 microseconds with TCP and 73 microseconds with HOMA. And I've broken that down; this shows the lifetime of the message. Starting on the client side, when the client makes the call to GRPC, this first green box is the code in GRPC and the transport until it actually issues a kernel call to send something with TCP. The brown arrow shows time in the kernel and across the network. And then once the select or epoll returns on the server side, the blue box shows how long it takes the server to process the incoming message and hand it off to the application.
The application does essentially nothing except pass a response back in, and then it goes through GRPC, across the network, and on the client side the message is received and handed back to the application. So you can see that all of those boxes are time spent in GRPC. The first thing you can see from this, the good news, is that HOMA is quite a bit faster than TCP, about 40% faster. In fact, every one of those boxes along the way is smaller: not only is the HOMA protocol faster, but the HOMA transport implementation in GRPC is quite a bit faster than chttp2. That's the good news. The bad news is that the GRPC overheads are very, very high, unfortunately. Google has invented the term "data center tax" to talk about the overhead imposed by the data center software that helps us write applications. Well, the GRPC tax is fairly breathtaking. If you take just the time in the kernel and the network, if I just programmed to the raw TCP socket interfaces, a round trip would take about 25 microseconds. But using GRPC, it takes 116; that's a 4.6x tax. And HOMA through GRPC, versus raw HOMA, is more than 5x slower. So that's a pretty serious penalty. What's even more concerning is that it's getting worse. I had been working on GRPC 1.43, and a few weeks ago I decided I should really upgrade before coming to give a talk at this conference. So I upgraded to 1.57, and things got way worse. I've shown here the time before and the time after for HOMA, and the time before and the time after for TCP. And I've separated out the base time, the time in the kernel and the network, for the HOMA stack, and the same thing for TCP. The lighter boxes are all time spent in GRPC. That went up by 44% for HOMA and by about 40% for the TCP stack. So it's really concerning: not only are the overheads high, but they're getting worse.
And furthermore, I'm a little worried that it's going to be really hard to fix this, because the things that make it slow are the same things that make it complicated: this huge number of layers and classes. It's not like there's one place you can go and suddenly speed it up by a factor of five; it would take thousands of fixes to do that. So it's very challenging to figure out how to solve this. And I'm worried, actually, that performance could be an existential threat to GRPC in the data center. Think about somebody trying to build a high-performance application: if they took Redis and put it on top of GRPC, replacing its homegrown transport protocol, they would need four times as many Redis servers as they need right now. So people building high-performance applications, I'm worried, are going to see this performance and think they have to find some other solution, which is not good; I want people using GRPC. So I think it's a challenge for the GRPC team to see: is there something you can do to, first, prevent the escalation of the problem, and second, somehow buy back some of that performance to reduce the tax? Okay, so just to wrap up very quickly. The good news is that it is possible to add new transports to GRPC; GRPC HOMA provides a data point that you can do that. And it is now possible to use HOMA, at least for C++ GRPC applications, and I'm working on the other languages as well, and you get significantly better performance by doing that. As I mentioned, there are these two challenges with GRPC, the difficulty of the code base and the high overhead, and I hope there can be solutions for those in the years ahead. If you're interested in any of this, I would love to have you try it out, and I will be happy to do whatever I can to help you. If you run into any problems, I will fix bugs for you and help you resolve any issues that come up.
I consider myself one of the most promiscuous computer scientists in the world right now: I really wanna encourage people to use HOMA, and I'm happy to do anything. If you'd like me to wash your car so you have a little more time to work on this, just let me know where you live; I'll be over there with my brush and bucket. So try it out. And if you're really ambitious, I would love to have help on this; there's still quite a bit of work to be done. For example, I haven't even started on Go support; I'd love to see somebody do that. And I've not addressed security issues; that's gonna be another big chunk of work. If there's somebody who understands how security works in GRPC and would like to work with me on HOMA, I would love that. So again, if any of this is interesting, there's a HOMA Wiki that brings together all of the known information on HOMA in one place, where you can go to get more information. And sorry, I've run a few minutes past my time, but thank you for your attention, and I'll be happy to take questions. [Question] The performance slide is very impressive. I wonder if you have tested performance in a mixed environment, say part of the traffic is TCP and the rest is HOMA; what's gonna happen to both types? [Answer] Yeah, that's a really good question. I've actually been asked that several times; it's on my list of things to do, but I have not done it yet. Any deployment of HOMA is probably gonna involve a mixture of TCP and HOMA traffic initially, so I need to understand that. The current measurements are with everything running HOMA or everything running TCP, so that's something I'm gonna have to learn. [Question] What about running HOMA over a long-haul path? What's gonna happen; is anything gonna break, or have you tested that? [Answer] I think HOMA just won't work. It'll just get continuous timeouts; yeah, I would not expect it to work at all.
It assumes that the latency of communication between the two endpoints is low enough that it can do a really rapid exchange of information in order to manage connections. If that latency becomes milliseconds, it won't work very well; it'll be some combination of very, very slow or just not functioning. A sort of follow-up to the previous question: when you say long haul, what kind of distance are we talking about? Is the next availability zone already long haul, or is it a different region? Yeah, probably even availability zones — if the data centers are a few tens of miles apart, you probably wouldn't want to use HOMA for that; you're probably better off with TCP. Within a football-field-sized box is kind of the sweet spot for HOMA. Do I understand correctly that the main benefit of using HOMA is latency? But how does it affect TCP usage — sorry, CPU usage? If the bottleneck in our case is CPU utilization and not latency, would you still advise us to use HOMA? Yes, yes, in fact, because the improved latency comes from two things: part of it is avoiding congestion in the network, but part of it is just using less CPU time than TCP. In particular, if you go back to the earlier slide and compare the numbers there: all of this difference between TCP and HOMA, about 116 microseconds, almost all of that is reduced CPU time. Actually, about 11 microseconds of it is in the network, although some of that's in the kernel; the rest is all reduced CPU time. So this will actually affect throughput as well, because HOMA uses less CPU time. Okay, thank you. What's the length of the HOMA header, and how small is too small in terms of message length? The length of the HOMA header is on the order of 50 bytes, I think.
I think it's a little larger than the TCP header, but less than twice the length of the TCP header. And how small is too small in terms of message length — the payload on HOMA? You can have a one-byte payload; you're still gonna pay 50 bytes of header. So, you know. I come from the ATM world, when ATM link-layer protocols existed. You had 53-byte cells, with 5-byte headers and 48 bytes of payload, and when the service providers adopted it, they shut it down because the ratio of payload to header was very cost-prohibitive for them to scale up. Yeah, so I think this depends on how big the messages are that people are sending. My sense is that it's pretty common to see messages in data centers that are less than a kilobyte; that's very common. Messages that are less than 50 or 100 bytes are probably less common. But the header overheads will be more expensive for those short messages. So using HOMA requires kernel support. Would it have been possible to do HOMA at user level? I debated about that. The problem is there's really no good way to do that that is widely usable, as far as I know. The only option I know of is DPDK, but that's not very generalizable: typically when you open a DPDK connection, you grab the entire NIC and own it for one application, so if you have multiple applications running on the same machine, you effectively need multiple NICs for them to run on. So I thought about that, and you're right, putting something in the kernel creates inertia that will make it more difficult for some people to install, but it seemed like the only way to really get the performance and generality. Thank you for a very nice presentation. We'll definitely follow up regarding these performance measurements that you have done; we want to understand, replicate, and hopefully benefit from some of this work.
Regarding HOMA itself, you mentioned that both the congestion control and the flow control are new. What kind of scale has it been tested at? Because previously, when we have tried other congestion-control protocols, in InfiniBand, etc., they have not scaled well beyond about a thousand-node data center. Yeah, so by Google standards, the scale at which HOMA has been tested, you'd probably label it pathetic. The most nodes I've been able to get at once is 40. And I think you're probably right that as it scales higher than that, some new problem will come out that I was not able to observe at 40 nodes but does happen at higher scales. The only way to find out is to get some company with the resources interested enough in HOMA to try the experiment. One thing I would love to do — if Google or any other company would give me access to a cluster with thousands of nodes, I would love to run those experiments, see how it scales, and fix the inevitable bugs that will probably come up at that scale. Sure. As Abashik said, we would like to follow up on the performance issues here, especially the GRPC tax. But a quick question: does the RPC use pre-serialized messages, or is it using protobufs? So, HOMA is using serialized protobufs. It gets all the data through the same channels in GRPC that the chttp2 transport does. So I believe it's a protobuf that has been serialized by somebody above and passed down in serialized form to HOMA. I think so. The tax probably includes serialization and deserialization as well. Yes. Actually, when I started, I thought — well, we've used protobufs before, and we always thought protobufs were slow. And they actually are slow, but protobufs are basically in the noise here: maybe a microsecond or two. I don't have measurements here.
I did separate measurements because I was wondering how much of this was protobuf serialization and deserialization. It was on the order of one to two microseconds for the round trip, out of something like 100 microseconds of total overhead. Very interesting. What was the size of the message? The size of the message here was small — 10 to 20 bytes or something, really small.